
Warning when using sparse categorical values #6383

Open
mjalse opened this issue Mar 26, 2024 · 1 comment
mjalse commented Mar 26, 2024

I have a question about a warning message emitted when training a LightGBM model with lgbm.train. I get the following warning:

[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero

The reason is that I have a column, specified as categorical, that contains the following integers:

[1015, 1033, 1128, 1398, 1541, 1673, 1677]

In the documentation it says:

"All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero."

My values are not particularly large. The "consider using consecutive integers starting from zero" reads like a suggestion. What happens if the values are not consecutive? How does the sparseness affect LightGBM's performance? Another categorical column in my dataset has the three values

[1, 3, 4]

and this column does not cause the same warning.
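For reference, the renumbering the warning suggests can be done outside of LightGBM before building the Dataset. A minimal sketch in plain Python (using the example column from this issue; `pandas.factorize` or `sklearn.preprocessing.OrdinalEncoder` would achieve the same thing):

```python
# Remap sparse categorical codes to consecutive integers starting from zero.
# The values below are the example column from this issue.
values = [1015, 1033, 1128, 1398, 1541, 1673, 1677]

# Build a stable mapping: each distinct code -> its index in sorted order.
mapping = {v: i for i, v in enumerate(sorted(set(values)))}
remapped = [mapping[v] for v in values]

print(remapped)  # [0, 1, 2, 3, 4, 5, 6]
```

Keeping `mapping` around lets you apply the same encoding consistently at prediction time.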

@mjalse mjalse changed the title Warning when using sparse categorical values with LightGBM Warning when using sparse categorical values Mar 26, 2024
YingJie-Zhao commented

I guess this is the reason:

Optimal Split for Categorical Features:
... LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.

So if we use a categorical feature with many distinct, widely spread values, a large histogram is generated, which can be memory-consuming.
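To illustrate the quoted idea, here is a simplified sketch of a sorted-histogram categorical split search: sort categories by their accumulated value (sum_gradient / sum_hessian), then scan the sorted order for the best left/right partition. This is not LightGBM's actual implementation, and the per-category statistics below are invented for illustration:

```python
# Simplified sketch of the "optimal split for categorical features" idea:
# sort categories by sum_gradient / sum_hessian, then scan the sorted
# order for the best partition. Statistics are made up for illustration.
categories = {
    1015: (12.0, 4.0),   # category -> (sum_gradient, sum_hessian)
    1033: (-3.0, 2.0),
    1128: (8.0, 5.0),
    1398: (-6.0, 3.0),
}

# Sort categories by their accumulated value sum_gradient / sum_hessian.
order = sorted(categories, key=lambda c: categories[c][0] / categories[c][1])

def leaf_score(grad, hess, lam=1.0):
    # Standard gradient-boosting leaf score: G^2 / (H + lambda).
    return grad * grad / (hess + lam)

total_g = sum(g for g, _ in categories.values())
total_h = sum(h for _, h in categories.values())

best_gain, best_left = 0.0, []
g_left = h_left = 0.0
for i, c in enumerate(order[:-1]):  # each prefix of the sorted order
    g, h = categories[c]
    g_left += g
    h_left += h
    split_gain = (leaf_score(g_left, h_left)
                  + leaf_score(total_g - g_left, total_h - h_left)
                  - leaf_score(total_g, total_h))
    if split_gain > best_gain:
        best_gain, best_left = split_gain, order[:i + 1]

print(best_left)  # the set of categories sent to the left child
```

Note that the histogram size here scales with the number of distinct categories, not with how large the raw codes are, so how sparseness translates into memory cost in LightGBM's internals is not obvious from this sketch alone.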

Pardon me if I am wrong.
