[python-package] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279) #6310

Open. nicklamiller wants to merge 8 commits into master.

@nicklamiller nicklamiller commented Feb 12, 2024

Collaborator

@jameslamb jameslamb left a comment

Thanks for this!

But please add some unit tests in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_sklearn.py covering the following:

  • what happens when you try to access that attribute on an unfitted estimator
  • that that attribute returns the exact expected values in the following situations:
    • trained with feature names (in each of the ways feature names can be provided, e.g. do you get them automatically using pandas as input?)
    • trained without feature names


@property
def feature_names_in_(self) -> List[str]:
    """:obj:`list` of shape = [n_features]: Sklearn-style property for feature names."""
Collaborator

Please update this with the following:

  • remove "sklearn-style property for" and instead just say what it is, something like "names for features"
  • this should only be available in a fitted model, right? If so, please guard it like this:

if not self.__sklearn_is_fitted__():
    raise LGBMNotFittedError('No feature_names_in_ found. Need to call fit beforehand.')

  • explain in the docs what will happen when accessing this attribute if you never provided feature names (e.g. just passed raw numpy arrays as training data)
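
A minimal sketch of the guard requested above, using a toy class rather than LightGBM's real code. `ToyEstimator`, `NotFittedError`, and the `_feature_names` field are hypothetical stand-ins, and the fallback naming for unnamed input is an assumption made only for this illustration:

```python
class NotFittedError(ValueError):
    """Hypothetical stand-in for LGBMNotFittedError."""


class ToyEstimator:
    """Toy estimator (not LightGBM code) with a fitted-only property."""

    def __init__(self):
        self._feature_names = None  # populated by fit()

    def __sklearn_is_fitted__(self):
        return self._feature_names is not None

    def fit(self, X, feature_names=None):
        n_features = len(X[0])
        if feature_names is None:
            # Assumed fallback for input that carries no feature names.
            feature_names = [f"Column_{i}" for i in range(n_features)]
        self._feature_names = list(feature_names)
        return self

    @property
    def feature_names_in_(self):
        """Names for features; only available after fit()."""
        if not self.__sklearn_is_fitted__():
            raise NotFittedError(
                "No feature_names_in_ found. Need to call fit beforehand."
            )
        return self._feature_names
```

Accessing `ToyEstimator().feature_names_in_` before `fit()` raises the error, while after `fit([[1.0, 2.0]])` the property returns the generated names.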

@jameslamb jameslamb changed the title from "Expose feature_name_ via sklearn consistent attribute feature_names_in_" to "[python] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279)" on Feb 12, 2024
@jameslamb
Collaborator

In scikit-learn/scikit-learn#28337 (comment), I noticed someone said

this feature comes for free if you inherit from BaseEstimator

lightgbm's scikit-learn estimators do inherit from BaseEstimator

class LGBMRegressor(_LGBMRegressorBase, LGBMModel):

class LGBMModel(_LGBMModelBase):

from .compat import (SKLEARN_INSTALLED, LGBMNotFittedError, _LGBMAssertAllFinite, _LGBMCheckArray,
                     _LGBMCheckClassificationTargets, _LGBMCheckSampleWeight, _LGBMCheckXY, _LGBMClassifierBase,
                     _LGBMComputeSampleWeight, _LGBMCpuCount, _LGBMLabelEncoder, _LGBMModelBase, _LGBMRegressorBase,

_LGBMModelBase = BaseEstimator

from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin

If you get into this and find that lightgbm is actually getting that attribute via inheriting from BaseEstimator, don't give up on the PR! Those tests I mentioned would still be very valuable to catch changes to that support in the future and to be sure that lightgbm's integration with it has the expected behavior.

@nicklamiller
Author

nicklamiller commented Feb 20, 2024

@jameslamb Thank you for the great feedback! I'm working on adding these suggestions in.

Is there a way you recommend recreating the development environment? I couldn't find info on this in CONTRIBUTING.md, so I started to mimic the logic in .ci/test.sh, but having to set the various global variables the script expects prevents this from being a quick way to set up the environment. Just want to make sure I'm not missing a quicker way.

Thanks in advance!

@jameslamb
Collaborator

jameslamb commented Feb 20, 2024

Thanks! There isn't a well-documented way to set up a local development environment for the Python package today; it's something I'd like to add soon.

Here's how I develop on LightGBM:

  1. Create a conda environment (I use miniforge, to prefer conda-forge)
conda create \
    --name lgb-dev \
    cloudpickle \
    dask \
    distributed \
    joblib \
    matplotlib \
    numpy \
    python-graphviz \
    pytest \
    pytest-cov \
    python=3.11 \
    scikit-learn \
    scipy
  2. build the C++ library one time (assuming you're making Python-only changes)
rm -rf ./build
mkdir ./build
cd ./build
cmake ..
make -j4 _lightgbm
  3. make changes to the Python code
  4. install the Python package in the conda environment
source activate lgb-dev
sh build-python.sh install --precompile
  5. run the tests
pytest tests/python_package_test
  6. repeat steps 3-5 until you're confident in your changes
  7. run the auto-formatting and some of the linting stuff (this is a work in progress, see [RFC] [python-package] use black for formatting Python code? #6304)
pre-commit run --all-files

@nicklamiller nicklamiller force-pushed the add-sklearn-feature-attributes branch from c481290 to 10d5301 on March 28, 2024 19:04
@nicklamiller
Author

nicklamiller commented Mar 28, 2024

If you get into this and find that lightgbm is actually getting that attribute via inheriting from BaseEstimator, don't give up on the PR!

It turns out sklearn only adds the feature_names_in_ attribute if the input data has feature names, while LightGBM will add column names of the format "Column_{i}" if the input data doesn't have column names. I've added a comment to a test to highlight this difference with sklearn.
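
The difference can be sketched with two toy helpers. Both function names are hypothetical and neither is code from scikit-learn or LightGBM. Returning `None` stands in for scikit-learn not setting the attribute at all:

```python
from typing import List, Optional, Sequence


def names_sklearn_style(columns: Optional[Sequence[str]]) -> Optional[List[str]]:
    """Keep names only if the input carried them; otherwise record nothing.

    None here models "attribute not set" (an illustration-only convention).
    """
    return list(columns) if columns is not None else None


def names_lightgbm_style(columns: Optional[Sequence[str]], n_features: int) -> List[str]:
    """Keep names if present; otherwise generate Column_0 ... Column_{n-1}."""
    if columns is not None:
        return list(columns)
    return [f"Column_{i}" for i in range(n_features)]
```

For unnamed input with three features, the first helper yields nothing while the second yields `Column_0`, `Column_1`, `Column_2`. For named input both return the given names unchanged.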

@nicklamiller
Author

@microsoft-github-policy-service agree

Collaborator

@jameslamb jameslamb left a comment

Thanks for this!

But this does not look like it's meeting the expectations described in https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html.

I re-read that tonight, and saw the following

Input Feature Names

The input feature names are stored in a fitted estimator in a feature_names_in_ attribute, and are taken from the given input data, for instance a pandas data frame.
This attribute will be None if the input provides no feature names. The feature_names_in_ attribute is a 1d NumPy array with object dtype and all elements in the array are strings.

Output Feature Names
A fitted estimator exposes the output feature names through the get_feature_names_out method. The output of get_feature_names_out is a 1d NumPy array with object dtype and all elements in the array are strings. Here we discuss more in detail how these feature names are generated. Since for most estimators there are multiple ways to generate feature names, this SLEP does not intend to define how exactly feature names are generated for all of them. It is instead a guideline on how they could generally be generated.

So I think the following needs to be done:

  • feature_names_in_ should return a 1D numpy array, not a list
  • get_feature_names_out() function should be implemented (right? or is that only for estimators that define .transform()?)
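
As a rough sketch of the first point, a list of names could be converted to the shape the SLEP describes as follows. The helper name `as_feature_names_in` and the sample names are hypothetical; only `numpy.asarray` is a real API here:

```python
import numpy as np


def as_feature_names_in(names):
    """Return names as a 1D NumPy array with object dtype, all elements str."""
    arr = np.asarray([str(name) for name in names], dtype=object)
    if arr.ndim != 1:
        raise ValueError("feature names must be one-dimensional")
    return arr


# Hypothetical feature names, for illustration only.
names_arr = as_feature_names_in(["age", "income", "Column_2"])
```

Using `dtype=object` keeps each element a plain Python `str` instead of a fixed-width NumPy unicode type, which matches the "object dtype" wording quoted above.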

There is also still something that's really bothering me about this in general, that I think we need to get a clear answer on before going further.

This comment claims that you get these things for free if you inherit from BaseEstimator: scikit-learn/scikit-learn#28337 (comment)

But lightgbm.sklearn.LGBMModel and everything inheriting from it do inherit from BaseEstimator. I've asked about this here: scikit-learn/scikit-learn#28337 (comment).

Up to you if you'd like to wait for scikit-learn maintainers to respond there before working on the other things I've requested here.

def test_getting_feature_names_in_pd_input():
    # as_frame=True means input has column names and these should propagate to fitted model
    X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
    est = lgb.LGBMModel(n_estimators=5, objective="binary")
Collaborator

Can you please extend these tests to cover all 4 estimators (LGBMModel, LGBMClassifier, LGBMRegressor, LGBMRanker)? I know that those last 3 inherit from LGBMModel, but if someone were to change how this attribute works for, say, LGBMClassifier only, in a way that breaks this behavior, we'd want a failing test to alert us to that.

Follow the same pattern used in the existing test right above these, test_check_is_fitted(), using the same data for all of the estimators.


.. note::

    If input does not contain feature names, they will be added during fitting in the format ``Column_0``, ``Column_1``, ..., ``Column_N``.
Collaborator

Thanks for this note! I think it's helpful. But could you please instead move it to feature_name_, and then just change the docstring here to say something like "identical to .feature_name_, just defined here for compatibility with scikit-learn"?

This note is relevant for feature_name_ too.

Collaborator

@jameslamb jameslamb Mar 29, 2024

Actually, let's wait on this until we get more clarity on scikit-learn/scikit-learn#28337 (comment). Seems that .feature_names_in_ might not be identical to .feature_name_.

@jameslamb jameslamb changed the title from "[python] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279)" to "[python-package] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279)" on Mar 29, 2024
@nicklamiller
Author

nicklamiller commented Apr 11, 2024

@jameslamb given that _validate_data needs to be called in order to get these attributes for free from BaseEstimator, would it make sense to call this method within the LGBM estimators' fit methods (like many other sklearn estimators, one example: scikit-learn/scikit-learn#27907 (comment))?

One different behavior between LGBM and sklearn is that LGBM assigns artificial names to features if the features are unnamed, whereas sklearn doesn't create artificial names, and also doesn't create the feature_names_in_ attribute. So for numpy arrays, even calling _validate_data within fit wouldn't make this attribute accessible.

I wanted to confirm that we want to add _validate_data, but to also keep the behavior of setting names when they're not present.

@nicklamiller
Author

nicklamiller commented Apr 11, 2024

feature_names_in_ should return a 1D numpy array, not a list

Sounds good, will fix.

get_feature_names_out() function should be implemented (right? or is that only for estimators that define .transform()?)

I have less of an opinion on this one, but based on the SLEP, it does look like it should be specifically for estimators with the transform method:

Scope

The API for input and output feature names includes a feature_names_in_ attribute for all estimators, and a get_feature_names_out method for any estimator with a transform method, i.e. they expose the generated feature names via the get_feature_names_out method.

@jameslamb
Collaborator

I wanted to confirm that we want to add _validate_data

That method being prefixed with a _ suggests to me that it's an internal implementation detail of scikit-learn that could be changed in a future release of that library.

Can you find me some authoritative source saying that projects implementing their own estimators are encouraged to call that method? The comment you linked above is a specific recommendation from a scikit-learn maintainer about what to do for 2 estimators within scikit-learn... I don't interpret that as encouragement that other projects should call it.

xgboost does not: https://github.com/search?q=repo%3Admlc%2Fxgboost%20%22_validate_data%22&type=code

but catboost does: https://github.com/catboost/catboost/blob/19b60a20b2b1733c528b40c6c9ebe2f3d1f5dbde/contrib/python/scikit-learn/py3/sklearn/base.py#L537

Let's please pause on this work until some scikit-learn maintainer gives an authoritative answer on scikit-learn/scikit-learn#28337.

Successfully merging this pull request may close these issues.

[python-package] Support feature_names_in_ attribute via sklearn API
2 participants