
Warning when using sparse categorical values #6383

Open
mjalse opened this issue Mar 26, 2024 · 1 comment
mjalse commented Mar 26, 2024

I have a question about a warning message emitted when training a LightGBM model with lgbm.train. I get the following warning:

[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero

The reason is that I have a column, specified as categorical, that contains the following integers:

[1015, 1033, 1128, 1398, 1541, 1673, 1677]

In the documentation it says:

"All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero."

My values are not particularly large. The "consider using consecutive integers starting from zero" reads like a suggestion. What happens if the values are not consecutive? How does the sparseness affect LightGBM's performance? Another categorical column in my dataset has the three values

[1, 3, 4]

and this column does not cause the same warning.
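For reference, the renumbering the warning suggests can be done outside of LightGBM before building the Dataset. A minimal sketch in plain Python (using the example column from this issue; `pandas.factorize` or `sklearn.preprocessing.OrdinalEncoder` would achieve the same thing):

```python
# Remap sparse categorical codes to consecutive integers starting from zero.
# The values below are the example column from this issue.
values = [1015, 1033, 1128, 1398, 1541, 1673, 1677]

# Build a stable mapping: each distinct code -> its index in sorted order.
mapping = {v: i for i, v in enumerate(sorted(set(values)))}
remapped = [mapping[v] for v in values]

print(remapped)  # [0, 1, 2, 3, 4, 5, 6]
```

Keeping `mapping` around lets you apply the same encoding consistently at prediction time.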

@mjalse mjalse changed the title Warning when using sparse categorical values with LightGBM Warning when using sparse categorical values Mar 26, 2024
YingJie-Zhao commented

I guess this is the reason:

Optimal Split for Categorical Features:
... LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.

So if we use a categorical feature with many distinct, widely spread values, a large histogram is generated, which can be memory-consuming.
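To illustrate the quoted idea, here is a simplified sketch of a sorted-histogram categorical split search: sort categories by their accumulated value (sum_gradient / sum_hessian), then scan the sorted order for the best left/right partition. This is not LightGBM's actual implementation, and the per-category statistics below are invented for illustration:

```python
# Simplified sketch of the "optimal split for categorical features" idea:
# sort categories by sum_gradient / sum_hessian, then scan the sorted
# order for the best partition. Statistics are made up for illustration.
categories = {
    1015: (12.0, 4.0),   # category -> (sum_gradient, sum_hessian)
    1033: (-3.0, 2.0),
    1128: (8.0, 5.0),
    1398: (-6.0, 3.0),
}

# Sort categories by their accumulated value sum_gradient / sum_hessian.
order = sorted(categories, key=lambda c: categories[c][0] / categories[c][1])

def leaf_score(grad, hess, lam=1.0):
    # Standard gradient-boosting leaf score: G^2 / (H + lambda).
    return grad * grad / (hess + lam)

total_g = sum(g for g, _ in categories.values())
total_h = sum(h for _, h in categories.values())

best_gain, best_left = 0.0, []
g_left = h_left = 0.0
for i, c in enumerate(order[:-1]):  # each prefix of the sorted order
    g, h = categories[c]
    g_left += g
    h_left += h
    split_gain = (leaf_score(g_left, h_left)
                  + leaf_score(total_g - g_left, total_h - h_left)
                  - leaf_score(total_g, total_h))
    if split_gain > best_gain:
        best_gain, best_left = split_gain, order[:i + 1]

print(best_left)  # the set of categories sent to the left child
```

Note that the histogram size here scales with the number of distinct categories, not with how large the raw codes are, so how sparseness translates into memory cost in LightGBM's internals is not obvious from this sketch alone.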

Pardon me if I am wrong.
