Support the use of SHAP values to get feature importances in ProbeFeatureSelection #723
Comments
I think it's possible to use something more robust than the RF internal feature importance (which is just a feature usage counter) and something quicker than SHAP. The problem lies in the research factor: I don't think we know exactly what will give the best result here with the minimum number of caveats.

P.S. I think there is one way of mitigating this unwanted behaviour: using binarization before fitting the model. This caps the number of unique values, which should help, just like GBDTs do.
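The binning idea above can be sketched without any new dependencies. This is a hypothetical illustration (my own toy data, not part of this issue's code): discretize continuous features into a capped number of ordinal buckets before fitting, mimicking the histogram binning GBDT libraries do internally.

```python
# Hypothetical sketch: cap feature cardinality via binning before fitting,
# similar to the histogram binning GBDT libraries apply internally.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))      # continuous, high-cardinality features
y = (X[:, 0] > 0).astype(int)      # only feature 0 is informative

# Bin every feature into at most 32 ordinal buckets before fitting,
# so no feature can offer more than ~31 candidate split points.
binner = KBinsDiscretizer(n_bins=32, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_binned, y)
print(rf.feature_importances_)
```

With the cardinality of all columns equalized, impurity-based importances can no longer be inflated simply by a feature having many unique values.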
I also see that RF feature importance has its limitations, i.e., correlated features will show half the importance they would if used in isolation, and hence they might be lost to the probes. sklearn uses impurity gain as the measure of importance, not just counts; feature counts are used by other implementations though, like xgb and lightGBM.

SHAP values also have their limitations. They approximate importance with a function that is not really related to how RF works, so at the end of the day it's just another approximation. Plus, adding dependencies makes the library harder to maintain; I am already struggling with pandas's and sklearn's constant new releases.

We could try adding importance derived from single-feature models, like the functionality we have in the single feature selector: https://feature-engine.trainindata.com/en/latest/user_guide/selection/SelectBySingleFeaturePerformance.html

Thoughts?
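The single-feature-model idea can be sketched in plain sklearn. This is an illustration in the spirit of SelectBySingleFeaturePerformance, not feature-engine's actual implementation: fit one model per feature and rank features by their standalone cross-validated score.

```python
# Sketch (not feature-engine's implementation): derive an importance score
# per feature by training a model on that feature alone and measuring its
# cross-validated performance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)  # feature 0 drives y

scores = {}
for i in range(X.shape[1]):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    # roc_auc of a model trained on this single feature only
    scores[f"feature_{i}"] = cross_val_score(
        model, X[:, [i]], y, cv=3, scoring="roc_auc"
    ).mean()
print(scores)
```

Because each feature is evaluated in isolation against held-out folds, this score does not reward cardinality, though it also cannot see feature interactions.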
I think we shouldn't add more dependencies.
First of all, thanks for this package, I've been using it for some time to do feature engineering and it's awesome.
Is your feature request related to a problem? Please describe.
I think I found a problem with the ProbeFeatureSelection algorithm. This algorithm uses the `feature_importances_` of the sklearn estimator to select the features that have greater importance than the probe features. If you choose a RandomForestClassifier as the estimator and you are trying to perform binary classification, the `feature_importances_` will tend to prefer high-cardinality features (https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html).

I found this issue while I was testing this algorithm with toy data:
In this example, `feature1` is the same as `y` (correlation of 1.0), so the algorithm should choose `feature1` as an important feature, right? If we just run one iteration of the algorithm, it will choose `feature1` and `feature2` with this setting:

But if we run PROBE two more times, we will be left with an empty df (no features with greater importance than the random uniform probe feature).
Describe the solution you'd like
I think a possible solution would be to add the option of using SHAP values instead of sklearn's `feature_importances_` to select the features with greater importance than the probes.