First, I would like to express my appreciation for the CatBoost library; it has been a fantastic tool for numerous machine learning tasks. However, I've encountered an encoding anomaly with categorical features that I cannot explain.
Reproduction Steps:
I created a simplified dataset for a binary classification task with a single categorical feature having three unique values. Two of these values correspond to conversions in the dataset, while the third value has zero conversions. Using CatBoost "out of the box," the model fails to differentiate between the categories; i.e., it outputs the same prediction across all feature values during testing.
What I've Tried:
I consulted the documentation and searched Google for insights.
I saved the model in Python format and reverse-engineered the code. I ran into issues such as the feature hash not being calculated, so execution takes the `if bucket is None:` branch in the `calc_ctr()` method and falls back to `ctr.calc(0, 0)`.
Changing `simple_ctr` from `Borders` to `Buckets`, or increasing `CtrBorderCount`, appears to differentiate the classes correctly.
Attachments:
I am attaching a Jupyter notebook with the example for your reference: catboost_debug_encoding.ipynb.zip
Could you please help me understand why the default settings fail to distinguish between these categories, and suggest any possible steps to resolve this?
Thank you for your assistance and for developing such a powerful tool.

Hello!
It seems that in your case (there are very few distinct values in your categorical feature), the best option is to use one-hot-encoded features (set the option `one_hot_max_size` to 100). We will check why this is not the default behaviour in your case.