
ADD permutation feature importance function, testing and documentation to permutation_feature_importance branch #33

Open · TortySivill wants to merge 5 commits into dev

Conversation

TortySivill commented



Description

  • Added Permutation Feature Importance (PFI) to existing fatf/transparency/models/feature_influence
  • Added Testing for Permutation Feature Importance to fatf/transparency/models/tests/test_feature_influence
  • Added documentation for Permutation Feature Importance to existing:
    • doc/api
    • doc/getting_started/structure
    • doc/tutorials/model-explainability
  • Added PFI example to examples/transparency/xmpl_transparency_pfi

Reason behind implementation

New PFI functionality as agreed with Kacper

Are there any other branches related to this work?

No

Example code

import numpy as np
import fatf.transparency.models.feature_influence as ftmfi
from sklearn.linear_model import LogisticRegression

# Toy dataset: the first feature perfectly separates the two classes.
X = np.asarray([[1, 9, 9], [1, 9, 9], [1, 9, 9],
                [0, 9, 9], [0, 9, 9], [0, 9, 9]])
y = np.asarray([1, 1, 1, 0, 0, 0])
clf = LogisticRegression().fit(X, y)

pfi_scores = ftmfi.permutation_feature_importance(X, clf, y, repeat_number=10)
print(np.mean(pfi_scores, axis=0))
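Assuming pfi_scores comes back with one row per repeat and one column per feature, the printed means are each feature's average importance over the 10 repeats; the first feature, which separates the classes perfectly, should dominate.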

Checklist

Please ensure you have done every task in this checklist.

  • [x] Created any additional unit tests required.
  • [x] All tests pass.
  • [x] Code style is consistent with the project.
  • [x] No additional Python packages except NumPy and SciPy are required.
  • [x] The code is documented.
  • [x] Appropriate API documentation, tutorials, how-to guides and user guide entries were created.

@So-Cool (Member) left a comment:

I've marked a few small (mostly syntactic) issues that we need to address before merging it.

doc/tutorials/model-explainability.rst (outdated, resolved):
This tutorial walked through using Individual Conditional Expectation and
Partial Dependence to explain influence of features on predictions of a model.
Permutation Feature Importance (PFI) tells us by how much does the model's
predictive error changes as we randomly permute each feature in the dataset.
@So-Cool (Member):
  • "changes" -> "change", I believe.
  • "each feature" -> "selected features" -- I think we should allow the user to specify whether PFI is computed for all of the features or just their selected subset. (I've left a separate comment in the source code file to request this feature.)

@perellonieto (reply):

I would suggest leaving feature selection as an added feature in a future release if it involves modifications in plenty of places, assuming that the most common scenario is to analyse all the features and sort them.
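As a side note, the "analyse all the features and sort them" scenario is a one-liner on top of the scores the example above already produces; a minimal sketch, assuming pfi_scores has one column per feature:

import numpy as np

# Rank features from most to least important by mean score over repeats.
mean_importance = np.mean(pfi_scores, axis=0)
ranking = np.argsort(mean_importance)[::-1]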

doc/tutorials/model-explainability.rst (3 further outdated threads, resolved)
fatf/transparency/models/feature_influence.py (4 outdated threads, resolved)
as_regressor: Optional[bool] = None
A boolean variable used to signify that the ``model``
is a regression model. Used to inform the ``scoring_metric``.
scoring_metric: Optional[str]=None
@So-Cool (Member):

Suggested change:
- scoring_metric: Optional[str]=None
+ scoring_metric : string, optional (default=None)

@perellonieto (reply):

I would suggest making the default value for scoring_metric explicit (instead of None), so that the documentation is clear about the default behaviour.
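To illustrate, the explicit-default version of the signature and docstring entry might read as follows; 'accuracy' is an assumption based on the classifier default described later in this review:

import numpy as np
from typing import Optional

def permutation_feature_importance(dataset: np.ndarray,
                                   model: object,
                                   target: np.ndarray,
                                   as_regressor: bool = False,
                                   scoring_metric: str = 'accuracy',
                                   repeat_number: Optional[int] = None):
    """
    Parameters
    ----------
    scoring_metric : string, optional (default='accuracy')
        Metric used to score the model; ``'accuracy'`` matches the
        documented behaviour for classifiers.
    """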

@perellonieto left a comment:

I have added some comments to the previous ones. Please let me know if there is anything unclear.

fatf/transparency/models/feature_influence.py (outdated, resolved):
Comment on lines 849 to 854
def permutation_feature_importance(dataset: np.ndarray,
                                   model: object,
                                   target: np.ndarray,
                                   as_regressor: Optional[bool] = None,
                                   scoring_metric: Optional[str] = None,
                                   repeat_number: Optional[int] = None):

@perellonieto:

Agree, I assume an array of integers indicating the feature columns.
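A minimal sketch of that idea, with a hypothetical features argument defaulting to all columns (the function name, features parameter and defaults are illustrative, not part of this PR):

import numpy as np
from typing import List, Optional

def _pfi_for_features(dataset: np.ndarray,
                      model: object,
                      target: np.ndarray,
                      features: Optional[List[int]] = None,
                      repeat_number: int = 10) -> np.ndarray:
    """Hypothetical sketch: PFI restricted to selected feature columns."""
    if features is None:  # default keeps the current all-features behaviour
        features = list(range(dataset.shape[1]))
    base_score = model.score(dataset, target)  # scikit-learn-style scorer
    scores = np.zeros((repeat_number, len(features)))
    for repeat in range(repeat_number):
        for i, column in enumerate(features):
            permuted = dataset.copy()  # do not mutate the caller's array
            permuted[:, column] = np.random.permutation(permuted[:, column])
            scores[repeat, i] = base_score - model.score(permuted, target)
    return scores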

                                   repeat_number: Optional[int] = None):
    '''
    Calculates the Permutation Feature Importance (PFI)
    of each feature in a dataset.

@perellonieto:

If we decide to add the list of features, this line needs to be changed from "of each feature in a dataset" to "of the selected features in a dataset".


Comment on lines 852 to 854
                                   as_regressor: Optional[bool] = None,
                                   scoring_metric: Optional[str] = None,
                                   repeat_number: Optional[int] = None):

@perellonieto:

About the mutable parameters: shouldn't it be the function that takes care of not modifying a mutable object, if mutation is not its intended behaviour?
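One way for the function to guarantee that, sketched as a hypothetical helper that shuffles a copy of a single column and leaves the caller's array untouched:

import numpy as np

def _permute_column(dataset: np.ndarray, column_index: int) -> np.ndarray:
    """Return a copy of ``dataset`` with one column shuffled."""
    permuted = dataset.copy()  # the caller's array is never modified
    permuted[:, column_index] = np.random.permutation(
        permuted[:, column_index])
    return permuted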

Permutation Feature Importance (PFI).
PFI works by
permuting the values of each feature and measuring
the change in prediction error compared to the original

@perellonieto:

If we decide to change "prediction error" to "predictive performance metric" this needs to be changed here accordingly.

if SKLEARN_MISSING is True:
    if as_regressor:
        predictions = model.predict(dataset)  # type: ignore
        score = -np.max(np.abs(target - predictions))

@perellonieto:

It is not clear to me if "max" should be replaced by "mean". Maybe you have a better understanding of why "max" is more appropriate here.
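For comparison, the mean-based variant would score the regressor by negative mean absolute error instead of the worst-case residual; a sketch using the same variables:

import numpy as np

def _regressor_score(target: np.ndarray, predictions: np.ndarray) -> float:
    # Negative mean absolute error: higher (closer to zero) is better,
    # matching the "higher is better" convention of accuracy.
    return -float(np.mean(np.abs(target - predictions)))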

for regressors. If the ``as_regressor`` parameter is unspecified,
the model will be treated as a classifier and ``accuracy``
will be used to generate scores.

@perellonieto:

The current method of deciding the scoring_metric sounds very convoluted. I wonder if we should simplify this in some way, e.g. expect a function that takes ground_truth and scores, or check whether the model has a scoring method.
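A sketch of that simplification, where scoring_metric is any callable taking ground truth and predictions, with the model's own scoring method as the fallback (all names hypothetical):

import numpy as np
from typing import Callable, Optional

MetricFn = Callable[[np.ndarray, np.ndarray], float]

def _score_model(model: object,
                 dataset: np.ndarray,
                 target: np.ndarray,
                 scoring_metric: Optional[MetricFn] = None) -> float:
    """Score with a user-supplied metric, else the model's own scorer."""
    if scoring_metric is not None:
        return scoring_metric(target, model.predict(dataset))
    return model.score(dataset, target)  # e.g. scikit-learn estimators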
