Support the use of SHAP values to get feature importances in ProbeFeatureSelection #723

Open
sfgarcia opened this issue Feb 16, 2024 · 3 comments

@sfgarcia

First of all, thanks for this package, I've been using it for some time to do feature engineering and it's awesome.

Is your feature request related to a problem? Please describe.
I think I found a problem with the ProbeFeatureSelection algorithm. This algorithm uses the feature_importances_ of the scikit-learn estimator to select the features that have greater importance than the probe features.

If you choose a RandomForestClassifier as the estimator and you are trying to perform binary classification, the feature_importances_ will tend to prefer high-cardinality features (see https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html).

I found this issue while I was testing this algorithm with toy data:

    import pandas as pd

    X = pd.DataFrame({
        "feature1": [0, 1, 0, 1, 0],
        "feature2": [6, 7, 8, 9, 10],
        "feature3": [11, 12, 13, 14, 15],
        "feature4": [16, 17, 18, 19, 20],
        "feature5": [21, 22, 23, 24, 25],
    })
    y = pd.Series([0, 1, 0, 1, 0])

In this example, feature1 is identical to y (correlation of 1.0), so the algorithm should choose feature1 as an important feature, right? If we run just one iteration of the algorithm, it chooses feature1 and feature2 with this setting:

    from feature_engine.selection import ProbeFeatureSelection
    from sklearn.ensemble import RandomForestClassifier

    X, y = sample_X_y  # fixture that returns the toy X and y defined above
    selector = ProbeFeatureSelection(
        estimator=RandomForestClassifier(max_depth=2, random_state=150),
        n_probes=1,
        distribution="uniform",
        random_state=150,
        confirm_variables=False,
        cv=2,
    )
    result = probe_feature_selection(selector, X, y)

But if we run PROBE two more times, we are left with an empty DataFrame (no features with greater importance than the random uniform probe feature). The probe_feature_selection helper used above is:

    import logging

    import pandas as pd
    from feature_engine.selection import ProbeFeatureSelection


    def probe_feature_selection(selector: ProbeFeatureSelection, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
        """Perform PROBE feature selection using the given selector on the input data.

        Args:
            selector (ProbeFeatureSelection): The feature selection selector.
            X (pd.DataFrame): The input data.
            y (pd.Series): The target variable.

        Returns:
            pd.DataFrame: The transformed input data after feature selection.
        """
        feature_decrease = True
        iterations = 1

        # Refit and transform until no more features are dropped or none remain.
        while feature_decrease and len(X.columns) > 0:
            n_initial_features = len(X.columns)
            selector.fit(X, y)
            X = selector.transform(X)
            n_final_features = len(X.columns)
            feature_decrease = n_initial_features > n_final_features
            logging.info(f"Iteration {iterations}: {n_initial_features} -> {n_final_features}")
            iterations += 1

        return X

Describe the solution you'd like
I think a possible solution would be to add the option of using SHAP values, instead of scikit-learn's feature_importances_, to select the features with greater importance than the probes.
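
For illustration, something along these lines could produce one importance value per feature from SHAP and be compared against the probes (a rough sketch, not feature-engine API; it assumes the shap package is installed, a fitted tree-based classifier, and the helper name is made up):

    import numpy as np
    import pandas as pd
    import shap


    def mean_abs_shap_importance(model, X: pd.DataFrame) -> pd.Series:
        """One importance value per feature: mean(|SHAP value|) over the rows of X."""
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X)
        # For binary classifiers, shap returns either a list of per-class arrays
        # (older versions) or a single (n_samples, n_features, n_classes) array.
        if isinstance(shap_values, list):
            shap_values = shap_values[1]
        elif shap_values.ndim == 3:
            shap_values = shap_values[:, :, 1]
        return pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)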

@glevv
Contributor

glevv commented Mar 13, 2024

I think it's possible to use something more robust than the RF's internal feature importance (which is just a feature usage counter) and something quicker than SHAP.

The problem lies in the research factor - I don't think we know exactly what will give the best result here with the minimum number of caveats.

P.S. I think there is one way of mitigating this unwanted behaviour - discretizing (binning) the features before fitting the model. This caps the number of unique values, which should help, just like GBDTs do.
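
A rough sketch of the idea, assuming sklearn's KBinsDiscretizer (the bin count and strategy are just placeholders):

    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer

    # Cap the number of unique values per feature before running the selector.
    binner = KBinsDiscretizer(n_bins=16, encode="ordinal", strategy="quantile")
    X_binned = pd.DataFrame(binner.fit_transform(X), columns=X.columns, index=X.index)
    # Then fit ProbeFeatureSelection on X_binned instead of X.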

@solegalli
Collaborator

I also see that RF feature importance has its limitations, e.g., correlated features will show half the importance they would have if used in isolation, and hence they might be lost to the probes.

sklearn uses gain as the measure of importance, not just counts. Feature counts are used by other implementations though, like XGBoost and LightGBM.

SHAP values also have their limitations: they approximate importance with a function that is not really related to how the RF works, so at the end of the day it's just another approximation. Plus, adding dependencies makes the library harder to maintain; I am already struggling with pandas's and sklearn's constant new releases.

We could try adding importance derived from single-feature models, like the functionality we have in SelectBySingleFeaturePerformance: https://feature-engine.trainindata.com/en/latest/user_guide/selection/SelectBySingleFeaturePerformance.html
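
Roughly something like this (just a sketch to illustrate the idea; the helper name is made up, and in practice we would reuse the selector's estimator, cv and scorer):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score


    def single_feature_importance(X: pd.DataFrame, y: pd.Series, cv: int = 3) -> pd.Series:
        """Mean cross-validated score of a model trained on each feature alone."""
        scores = {
            column: cross_val_score(
                RandomForestClassifier(max_depth=2, random_state=150), X[[column]], y, cv=cv
            ).mean()
            for column in X.columns
        }
        return pd.Series(scores)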

Thoughts?

@MetroCat69

I think we shouldn't add more dependencies.
