Support the use of SHAP values to get feature importances in ProbeFeatureSelection #723

Open
sfgarcia opened this issue Feb 16, 2024 · 3 comments

@sfgarcia

First of all, thanks for this package, I've been using it for some time to do feature engineering and it's awesome.

Is your feature request related to a problem? Please describe.
I think I found a problem with the ProbeFeatureSelection algorithm. This algorithm uses the feature_importances_ of the scikit-learn estimator to select the features that have greater importance than the probe features.

If you choose a RandomForestClassifier as the estimator and you are trying to perform binary classification, the feature_importances_ will tend to prefer high-cardinality features (see https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html).

I found this issue while I was testing this algorithm with toy data:

    import pandas as pd

    X = pd.DataFrame({
        "feature1": [0, 1, 0, 1, 0],
        "feature2": [6, 7, 8, 9, 10],
        "feature3": [11, 12, 13, 14, 15],
        "feature4": [16, 17, 18, 19, 20],
        "feature5": [21, 22, 23, 24, 25],
    })
    y = pd.Series([0, 1, 0, 1, 0])

In this example, feature1 is identical to y (correlation of 1.0), so the algorithm should choose feature1 as an important feature, right? If we run just one iteration of the algorithm, it chooses feature1 and feature2 with this setting:

    from feature_engine.selection import ProbeFeatureSelection
    from sklearn.ensemble import RandomForestClassifier

    X, y = sample_X_y  # fixture that returns the toy X and y defined above
    selector = ProbeFeatureSelection(
        estimator=RandomForestClassifier(max_depth=2, random_state=150),
        n_probes=1,
        distribution="uniform",
        random_state=150,
        confirm_variables=False,
        cv=2,
    )
    result = probe_feature_selection(selector, X, y)

But if we run PROBE two more times, we are left with an empty DataFrame (no features with greater importance than the random uniform probe feature). The probe_feature_selection helper used above is:

    import logging

    import pandas as pd
    from feature_engine.selection import ProbeFeatureSelection


    def probe_feature_selection(selector: ProbeFeatureSelection, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
        """Perform PROBE feature selection using the given selector on the input data.

        Args:
            selector (ProbeFeatureSelection): The feature selection selector.
            X (pd.DataFrame): The input data.
            y (pd.Series): The target variable.

        Returns:
            pd.DataFrame: The transformed input data after feature selection.
        """
        feature_decrease = True
        iterations = 1

        # Refit and transform until no more features are dropped or none remain.
        while feature_decrease and len(X.columns) > 0:
            n_initial_features = len(X.columns)
            selector.fit(X, y)
            X = selector.transform(X)
            n_final_features = len(X.columns)
            feature_decrease = n_initial_features > n_final_features
            logging.info(f"Iteration {iterations}: {n_initial_features} -> {n_final_features}")
            iterations += 1

        return X

Describe the solution you'd like
I think a possible solution would be to add the option of using SHAP values, instead of scikit-learn's feature_importances_, to select the features with greater importance than the probes.
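
For illustration, something along these lines could produce one importance value per feature from SHAP and be compared against the probes (a rough sketch, not feature-engine API; it assumes the shap package is installed, a fitted tree-based classifier, and the helper name is made up):

    import numpy as np
    import pandas as pd
    import shap


    def mean_abs_shap_importance(model, X: pd.DataFrame) -> pd.Series:
        """One importance value per feature: mean(|SHAP value|) over the rows of X."""
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X)
        # For binary classifiers, shap returns either a list of per-class arrays
        # (older versions) or a single (n_samples, n_features, n_classes) array.
        if isinstance(shap_values, list):
            shap_values = shap_values[1]
        elif shap_values.ndim == 3:
            shap_values = shap_values[:, :, 1]
        return pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)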

@glevv
Contributor

glevv commented Mar 13, 2024

I think it's possible to use something more robust than the RF's internal feature importance (which is just a feature usage counter) and something quicker than SHAP.

The problem lies in the research factor - I don't think we know exactly what will give the best result here with the minimum number of caveats.

P.S. I think there is one way of mitigating this unwanted behaviour - discretizing (binning) the features before fitting the model. This caps the number of unique values, which should help, just like GBDTs do.
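
A rough sketch of the idea, assuming sklearn's KBinsDiscretizer (the bin count and strategy are just placeholders):

    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer

    # Cap the number of unique values per feature before running the selector.
    binner = KBinsDiscretizer(n_bins=16, encode="ordinal", strategy="quantile")
    X_binned = pd.DataFrame(binner.fit_transform(X), columns=X.columns, index=X.index)
    # Then fit ProbeFeatureSelection on X_binned instead of X.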

@solegalli
Collaborator

I also see that RF feature importance has its limitations, e.g., correlated features will show half the importance they would have if used in isolation, and hence they might be lost to the probes.

sklearn uses gain as the measure of importance, not just counts. Feature counts are used by other implementations though, like XGBoost and LightGBM.

SHAP values also have their limitations: they approximate importance with a function that is not really related to how the RF works, so at the end of the day it's just another approximation. Plus, adding dependencies makes the library harder to maintain; I am already struggling with pandas's and sklearn's constant new releases.

We could try adding importance derived from single-feature models, like the functionality we have in SelectBySingleFeaturePerformance: https://feature-engine.trainindata.com/en/latest/user_guide/selection/SelectBySingleFeaturePerformance.html
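
Roughly something like this (just a sketch to illustrate the idea; the helper name is made up, and in practice we would reuse the selector's estimator, cv and scorer):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score


    def single_feature_importance(X: pd.DataFrame, y: pd.Series, cv: int = 3) -> pd.Series:
        """Mean cross-validated score of a model trained on each feature alone."""
        scores = {
            column: cross_val_score(
                RandomForestClassifier(max_depth=2, random_state=150), X[[column]], y, cv=cv
            ).mean()
            for column in X.columns
        }
        return pd.Series(scores)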

Thoughts?

@MetroCat69

I think we shouldn't add more dependencies.
