
How to map the features at the end of the pipeline back to the initial features #1328

mayawz opened this issue Nov 2, 2023 · 1 comment

@mayawz

mayawz commented Nov 2, 2023

The initial number of features is 581, but the feature importances of the final pipeline have 587 entries.
It looks like at each of the first two steps of the pipeline, the number of features increased: 581 -> 584 -> 587.

Is there a way to map the 587 features at the end of the pipeline back to the original 581 features?

from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBClassifier

exported_pipeline = make_pipeline(
    StackingEstimator(estimator=XGBClassifier(learning_rate=0.01, max_depth=4, min_child_weight=6, n_estimators=100, n_jobs=1, subsample=0.15000000000000002, verbosity=0)),
    StackingEstimator(estimator=GaussianNB()),
    XGBClassifier(learning_rate=0.5, max_depth=2, min_child_weight=20, n_estimators=100, n_jobs=1, subsample=0.9000000000000001, verbosity=0)
)

exported_pipeline.fit(x_v, y_v)

# Apply the first two pipeline steps (the StackingEstimators) one at a time
trans_x_t = exported_pipeline[0].transform(x_t)
trans_x_t1 = exported_pipeline[1].transform(trans_x_t)

print(x_t.shape)         # (677279, 581)
print(trans_x_t.shape)   # (677279, 584)
print(trans_x_t1.shape)  # (677279, 587)

exported_pipeline[-1].feature_importances_.shape  # (587,)

@perib
Contributor

perib commented Nov 6, 2023

The stacking estimator is defined here: https://github.com/EpistasisLab/tpot/blob/master/tpot/builtins/stacking_estimator.py

Effectively, what it does is take the predictions of the model and prepend them to the left of the input data X. If it is a classifier with predict_proba, all class probabilities are also included. If you have binary classes, that means there are two additional probability columns, one for each class.

So in your case trans_x_t is [model 1 predicted labels, model 1 probability for class 0, model 1 probability for class 1, <x_t>].

Similarly, trans_x_t1 would be [model 2 predicted labels, model 2 probability for class 0, model 2 probability for class 1, <trans_x_t>].
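
So with binary classes, each StackingEstimator prepends 3 synthetic columns (1 predicted label + 2 class probabilities), and of the 587 final columns the first 6 are synthetic while the last 581 are your original features, in their original order. Here is a minimal sketch of the mapping, assuming binary classification and a numeric x_t; feature_names is a hypothetical list of your 581 original column names, not something from your snippet:

import numpy as np

n_synthetic = 2 * 3  # 2 StackingEstimators x (1 label + 2 probabilities)

importances = exported_pipeline[-1].feature_importances_  # shape (587,)

# Columns 0-2 come from model 2 (GaussianNB), columns 3-5 from model 1
# (XGBClassifier), and columns 6-586 are the original 581 features.
stacked_importances = importances[:n_synthetic]
original_importances = importances[n_synthetic:]  # one entry per column of x_t

# Sanity check: the trailing 581 columns of trans_x_t1 are exactly x_t.
assert np.allclose(np.asarray(x_t, dtype=float), trans_x_t1[:, n_synthetic:])

# Top 10 original features by importance in the final XGBClassifier.
for i in np.argsort(original_importances)[::-1][:10]:
    print(feature_names[i], original_importances[i])

The assert just confirms that original_importances lines up one-to-one with the columns of x_t.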
