ADD permutation feature importance function, testing and documentatio… #33
base: dev
Conversation
…n to permutation_feature_importance branch
I've marked a few small (mostly syntactic) issues that we need to address before merging it.
This tutorial walked through using Individual Conditional Expectation and
Partial Dependence to explain influence of features on predictions of a model.
Permutation Feature Importance (PFI) tells us by how much does the model's
predictive error changes as we randomly permute each feature in the dataset.
- "changes" -> "change", I believe.
- "each feature" -> "selected features" -- I think we should allow the user to specify whether PFI is computed for all of the features or just their selected subset. (I've left a separate comment in the source code file to request this feature.)
I would suggest leaving the functionality of selecting the features as an added feature in a future release, if this involves modifications in plenty of places, assuming that the most common scenario would be to analyse all the features and sort them.
as_regressor: Optional[bool] = None
    A boolean variable used to signify that the ``model``
    is a regression model. Used to inform the ``scoring_metric``.
scoring_metric: Optional[str]=None
Suggested change:
- scoring_metric: Optional[str]=None
+ scoring_metric : string, optional (default=None)
I would suggest making the default value for scoring_metric explicit (instead of None), so that the documentation is clear about the default behaviour.
I have added some comments to the previous ones. Please let me know if there is anything unclear.
def permutation_feature_importance(dataset: np.ndarray,
                                   model: object,
                                   target: np.ndarray,
                                   as_regressor: Optional[bool] = None,
                                   scoring_metric: Optional[str] = None,
                                   repeat_number: Optional[int] = None):
Agree, I assume an array of integers indicating the feature columns.
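For reference, a minimal sketch of what such a function could look like, assuming a classifier scored by accuracy. The function name, defaults, and seed handling here are illustrative, not the implementation under review:

```python
import numpy as np

def pfi_sketch(dataset, model, target, repeat_number=3, seed=42):
    """Illustrative PFI: mean drop in accuracy after permuting each column."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(model.predict(dataset) == target)
    drops = np.zeros((dataset.shape[1], repeat_number))
    for column in range(dataset.shape[1]):
        for repeat in range(repeat_number):
            permuted = dataset.copy()  # never mutate the caller's array
            rng.shuffle(permuted[:, column])
            drops[column, repeat] = baseline - np.mean(
                model.predict(permuted) == target)
    return drops.mean(axis=1)
```

A large mean drop means the model relied heavily on that feature; permuting an irrelevant (e.g. constant) feature leaves the score unchanged.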
repeat_number: Optional[int] = None):
'''
Calculates the Permutation Feature Importance (PFI)
of each feature in a dataset.
If we decide to add the list of features, this line needs to be changed from "of each feature in a dataset" to "of the selected features in a dataset".
as_regressor: Optional[bool] = None,
scoring_metric: Optional[str] = None,
repeat_number: Optional[int] = None):
About the mutable parameters: shouldn't it be the function that takes care of not modifying a mutable object, if that is not the intended behaviour of the function?
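One way to guarantee that (a sketch assuming a NumPy array input; not the code under review) is a defensive copy inside the function:

```python
import numpy as np

def permute_column(dataset, column, seed=0):
    """Return a copy of ``dataset`` with one column shuffled; the
    caller's array is left untouched."""
    permuted = dataset.copy()  # defensive copy: no side effects on the input
    np.random.default_rng(seed).shuffle(permuted[:, column])
    return permuted
```

The copy costs one extra allocation per call, but it keeps the function free of surprising side effects regardless of how the caller uses it.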
Permutation Feature Importance (PFI).
PFI works by
permuting the values of each feature and measuring
the change in prediction error compared to the original
If we decide to change "prediction error" to "predictive performance metric" this needs to be changed here accordingly.
if SKLEARN_MISSING is True:
    if as_regressor:
        predictions = model.predict(dataset)  # type: ignore
        score = -np.max(np.abs(target - predictions))
It is not clear to me if "max" should be replaced by "mean". Maybe you have a better understanding of why it is more appropriate to use "max" here.
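For what it's worth, the practical difference (with illustrative numbers only) is that ``max`` scores a model by its single worst prediction, while ``mean`` averages over all of them, so the two can rank the same pair of models differently:

```python
import numpy as np

target = np.array([1.0, 2.0, 3.0, 4.0])
uniform = target + 0.4                            # every prediction off by 0.4
spiky = target + np.array([0.0, 0.0, 0.0, 0.5])   # one bad prediction

def neg_max_abs(y, pred):
    return -np.max(np.abs(y - pred))   # driven entirely by the worst case

def neg_mean_abs(y, pred):
    return -np.mean(np.abs(y - pred))  # negated MAE, averages all errors
```

Under ``max`` the uniformly-off model scores better; under ``mean`` the spiky one does, since three of its four predictions are exact.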
for regressors. If the ``as_regressor`` parameter is unspecified,
the model will be treated as a classifier and ``accuracy``
will be used to generate scores.
The current method of deciding the ``scoring_metric`` sounds very convoluted. I wonder if we should simplify it in some way, e.g. expect a function that takes ground_truth and scores, or check whether the model has a scoring method.
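A rough sketch of that suggestion (the function and parameter names are hypothetical, not part of the code under review):

```python
import numpy as np

def resolve_scorer(model, scoring_metric=None):
    """Return a (dataset, target) -> score callable.

    Preference order: an explicit metric callable, then the model's own
    ``score`` method, then plain accuracy as a fallback.
    """
    if callable(scoring_metric):
        # user-supplied metric over ground truth and predictions
        return lambda data, target: scoring_metric(target, model.predict(data))
    if hasattr(model, 'score'):  # e.g. scikit-learn estimators
        return lambda data, target: model.score(data, target)
    return lambda data, target: float(np.mean(model.predict(data) == target))
```

This collapses the classifier/regressor branching into one resolution step, and a user who wants a custom metric just passes a callable.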
name: Pull Request for Permutation Feature Importance
about: Template for pull requests
Description
Reason behind implementation
New PFI functionality as agreed with Kacper
Are there any other branches related to this work?
No
Example code
Checklist
Please ensure you have done every task in this checklist.