[python-package] Add `feature_names_in_` attribute for scikit-learn estimators (fixes #6279) #6310

nicklamiller · 2024-02-12T06:06:30Z

Related: scikit-learn/scikit-learn#28337

jameslamb

Thanks for this!

But please add some unit tests in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_sklearn.py covering the following:

- what happens when you try to access that attribute on an unfitted estimator
- that the attribute returns the exact expected values in the following situations:
  - trained with feature names (in each of the ways feature names can be provided, e.g. do you get them automatically using pandas as input?)
  - trained without feature names

jameslamb · 2024-02-12T14:06:15Z

python-package/lightgbm/sklearn.py

+
+ @property
+ def feature_names_in_(self) -> List[str]:
+ """:obj:`list` of shape = [n_features]: Sklearn-style property for feature names."""


Please update this with the following:

remove "sklearn-style property for" and instead just say what it is, something like "names for features"

this should only be available in a fitted model, right? If so, please guard it like this:

LightGBM/python-package/lightgbm/sklearn.py

Lines 993 to 994 in cc733f8

if not self.__sklearn_is_fitted__():
    raise LGBMNotFittedError('No best_score found. Need to call fit beforehand.')

explain in the docs what will happen when accessing this attribute if you never provided feature names (e.g. just passed raw numpy arrays as training data)
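The guard requested above could look roughly like the following. This is a toy, dependency-free sketch rather than LightGBM's actual code: `ToyEstimator`, `ToyNotFittedError`, and the `feature_names` argument to `fit()` are hypothetical stand-ins, and the `Column_{i}` fallback is an assumption about what happens when no names are supplied.

```python
from typing import List, Optional, Sequence


class ToyNotFittedError(ValueError):
    """Hypothetical stand-in for LGBMNotFittedError."""


class ToyEstimator:
    """Hypothetical estimator used only to illustrate the guard pattern."""

    def __init__(self) -> None:
        self._feature_names: Optional[List[str]] = None

    def fit(
        self,
        X: Sequence[Sequence[float]],
        feature_names: Optional[Sequence[str]] = None,
    ) -> "ToyEstimator":
        n_features = len(X[0])
        if feature_names is None:
            # Assumption: fall back to generated names when none are provided.
            feature_names = [f"Column_{i}" for i in range(n_features)]
        self._feature_names = list(feature_names)
        return self

    @property
    def feature_names_in_(self) -> List[str]:
        """Names of the features seen during fit."""
        if self._feature_names is None:
            raise ToyNotFittedError(
                "No feature_names_in_ found. Need to call fit beforehand."
            )
        return self._feature_names


est = ToyEstimator()
try:
    est.feature_names_in_
except ToyNotFittedError as err:
    print(f"unfitted: {err}")

est.fit([[1.0, 2.0], [3.0, 4.0]])
print(est.feature_names_in_)
```

Keeping the fitted check inside the property means every access path hits the same error message, which is also what a unit test for the unfitted case would assert on.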

jameslamb · 2024-02-13T02:02:04Z

In scikit-learn/scikit-learn#28337 (comment), I noticed someone said

this feature comes for free if you inherit from BaseEstimator

lightgbm's scikit-learn estimators do inherit from BaseEstimator

LightGBM/python-package/lightgbm/sklearn.py

Line 1072 in cc733f8

class LGBMRegressor(_LGBMRegressorBase, LGBMModel):

LightGBM/python-package/lightgbm/sklearn.py

Line 430 in cc733f8

class LGBMModel(_LGBMModelBase):

LightGBM/python-package/lightgbm/sklearn.py

Lines 15 to 17 in cc733f8

from .compat import (SKLEARN_INSTALLED, LGBMNotFittedError, _LGBMAssertAllFinite, _LGBMCheckArray,
                     _LGBMCheckClassificationTargets, _LGBMCheckSampleWeight, _LGBMCheckXY, _LGBMClassifierBase,
                     _LGBMComputeSampleWeight, _LGBMCpuCount, _LGBMLabelEncoder, _LGBMModelBase, _LGBMRegressorBase,

LightGBM/python-package/lightgbm/compat.py

Line 106 in cc733f8

_LGBMModelBase = BaseEstimator

LightGBM/python-package/lightgbm/compat.py

Line 83 in cc733f8

from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin

If you get into this and find that lightgbm is actually getting that attribute via inheriting from BaseEstimator, don't give up on the PR! Those tests I mentioned would still be very valuable to catch changes to that support in the future and to be sure that lightgbm's integration with it has the expected behavior.

nicklamiller · 2024-02-20T04:48:31Z

@jameslamb Thank you for the great feedback! I'm working on adding these suggestions in.

Is there a way you recommend recreating the development environment? I couldn't find info on this in CONTRIBUTING.md, so I started to mimic the logic in .ci/test.sh, but having to set the various global variables the script expects keeps this from being a quick way to set up the environment. Just want to make sure I'm not missing a quicker way.

Thanks in advance!

jameslamb · 2024-02-20T04:55:50Z

Thanks! There isn't a well-documented way to set up a local development environment for the Python package today; it's something I'd like to add soon.

Here's how I develop on LightGBM:

1. Create a conda environment (I use miniforge, to prefer conda-forge)

conda create \
    --name lgb-dev \
    cloudpickle \
    dask \
    distributed \
    joblib \
    matplotlib \
    numpy \
    python-graphviz \
    pytest \
    pytest-cov \
    python=3.11 \
    scikit-learn \
    scipy

2. build the C++ library one time (assuming you're making Python-only changes)

rm -rf ./build
mkdir ./build
cd ./build
cmake ..
make -j4 _lightgbm

3. make changes to the Python code
4. install the Python package in the conda environment

source activate lgb-dev
sh build-python.sh install --precompile

5. run the tests

pytest tests/python_package_test

6. repeat steps 3-5 until you're confident in your changes
7. run the auto-formatting and some of the linting stuff (this is a work in progress, see [RFC] [python-package] use black for formatting Python code? #6304)

pre-commit run --all-files

nicklamiller · 2024-03-28T19:04:30Z

If you get into this and find that lightgbm is actually getting that attribute via inheriting from BaseEstimator, don't give up on the PR!

It turns out sklearn only adds the feature_names_in_ attribute if the input data has feature names, while LightGBM will add column names of the format "Column_{i}" if the input data doesn't have column names. I've added a comment to a test to highlight this difference with sklearn.

nicklamiller · 2024-03-28T23:42:55Z

@microsoft-github-policy-service agree

jameslamb

Thanks for this!

But this does not look like it's meeting the expectations described in https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html.

I re-read that tonight, and saw the following

Input Feature Names

The input feature names are stored in a fitted estimator in a feature_names_in_ attribute, and are taken from the given input data, for instance a pandas data frame.
This attribute will be None if the input provides no feature names. The feature_names_in_ attribute is a 1d NumPy array with object dtype and all elements in the array are strings.

Output Feature Names
A fitted estimator exposes the output feature names through the get_feature_names_out method. The output of get_feature_names_out is a 1d NumPy array with object dtype and all elements in the array are strings. Here we discuss more in detail how these feature names are generated. Since for most estimators there are multiple ways to generate feature names, this SLEP does not intend to define how exactly feature names are generated for all of them. It is instead a guideline on how they could generally be generated.

So I think the following needs to be done:

feature_names_in_ should return a 1D numpy array, not a list
get_feature_names_out() function should be implemented (right? or is that only for estimators that define .transform()?)
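For the first item, converting a list of names into the array form that SLEP007 describes is short. A minimal sketch, assuming numpy is available; `as_feature_names_array` is a hypothetical helper name and not part of LightGBM or scikit-learn.

```python
from typing import Sequence

import numpy as np


def as_feature_names_array(names: Sequence[str]) -> np.ndarray:
    """Return names as a 1D NumPy array with object dtype, as SLEP007 describes."""
    return np.asarray(list(names), dtype=object)


arr = as_feature_names_array(["Column_0", "Column_1", "Column_2"])
print(arr.ndim, arr.dtype, arr.shape)
```

Passing `dtype=object` explicitly matters here: without it, numpy would pick a fixed-width unicode dtype for a list of strings, which does not match the SLEP's wording.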

There is also still something that's really bothering me about this in general, that I think we need to get a clear answer on before going further.

This comment claims that you get these things for free if you inherit from BaseEstimator: scikit-learn/scikit-learn#28337 (comment)

But lightgbm.sklearn.LGBMModel and everything inheriting from it do inherit from BaseEstimator. I've asked about this here: scikit-learn/scikit-learn#28337 (comment).

Up to you if you'd like to wait for scikit-learn maintainers to respond there before working on the other things I've requested here.

jameslamb · 2024-03-29T03:52:43Z

tests/python_package_test/test_sklearn.py

+def test_getting_feature_names_in_pd_input():
+ # as_frame=True means input has column names and these should propagate to fitted model
+ X, y = load_digits(n_class=2, return_X_y=True, as_frame=True)
+ est = lgb.LGBMModel(n_estimators=5, objective="binary")


Can you please extend these tests to cover all 4 estimators (LGBMModel, LGBMClassifier, LGBMRegressor, LGBMRanker)? I know that those last 3 inherit from LGBMModel, but if someone were to change how this attribute works for, say, LGBMClassifier only, and that change broke this behavior, we'd want a failing test to alert us to that.

Follow the same pattern used in the existing test right above these, test_check_is_fitted(), using the same data for all of the estimators.

jameslamb · 2024-03-29T03:54:56Z

python-package/lightgbm/sklearn.py

+
+ .. note::
+
+ If input does not contain feature names, they will be added during fitting in the format ``Column_0``, ``Column_1``, ..., ``Column_N``.


Thanks for this note! I think it's helpful. But could you please instead move it to feature_name_, and then just change the docstring here to say something like "identical to .feature_name_, just defined here for compatibility with scikit-learn"?

This note is relevant for feature_name_ too.

Actually, let's wait on this until we get more clarity on scikit-learn/scikit-learn#28337 (comment). Seems that .feature_names_in_ might not be identical to .feature_name_.

nicklamiller · 2024-04-11T19:27:04Z

@jameslamb given that _validate_data needs to be called in order to get these attributes for free from BaseEstimator, would it make sense to call this method within the LGBM estimators' fit methods (like many other sklearn estimators, one example: scikit-learn/scikit-learn#27907 (comment))?

One different behavior between LGBM and sklearn is that LGBM assigns artificial names to features if the features are unnamed, whereas sklearn doesn't create artificial names, and also doesn't create the feature_names_in_ attribute. So for numpy arrays, even calling _validate_data within fit wouldn't make this attribute accessible.

I wanted to confirm that we want to add _validate_data, but to also keep the behavior of setting names when they're not present.
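The point about inheritance can be illustrated without touching `_validate_data` itself. In this sketch, which assumes scikit-learn and pandas are installed, `PassThroughEstimator` is a hypothetical class whose `fit()` does no input validation, so inheriting from `BaseEstimator` alone leaves `feature_names_in_` unset even for a DataFrame with column names.

```python
import pandas as pd
from sklearn.base import BaseEstimator


class PassThroughEstimator(BaseEstimator):
    """Hypothetical estimator whose fit() skips scikit-learn's input validation."""

    def fit(self, X, y=None):
        # No validation helper is called here, so nothing records the column names.
        self.is_fitted_ = True
        return self


X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
est = PassThroughEstimator().fit(X)

print(hasattr(est, "feature_names_in_"))
print(hasattr(est, "n_features_in_"))
```

Both checks come back false for this class, which is consistent with the observation that the attributes only appear when `fit()` runs scikit-learn's validation step.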

nicklamiller · 2024-04-11T19:38:53Z

feature_names_in_ should return a 1D numpy array, not a list

Sounds good, will fix.

get_feature_names_out() function should be implemented (right? or is that only for estimators that define .transform()?)

I have less of an opinion on this one, but based on the SLEP, it does look like it should be specifically for estimators with the transform method:

Scope

The API for input and output feature names includes a feature_names_in_ attribute for all estimators, and a get_feature_names_out method for any estimator with a transform method, i.e. they expose the generated feature names via the get_feature_names_out method.

jameslamb · 2024-04-15T02:13:55Z

I wanted to confirm that we want to add _validate_data

That method being prefixed with a _ suggests to me that it's an internal implementation detail of scikit-learn that could be changed in a future release of that library.

Can you find me some authoritative source saying that projects implementing their own estimators are encouraged to call that method? The comment you linked above is a specific recommendation from a scikit-learn maintainer about what to do for 2 estimators within scikit-learn... I don't interpret that as encouragement that other projects should call it.

xgboost does not: https://github.com/search?q=repo%3Admlc%2Fxgboost%20%22_validate_data%22&type=code

but catboost does: https://github.com/catboost/catboost/blob/19b60a20b2b1733c528b40c6c9ebe2f3d1f5dbde/contrib/python/scikit-learn/py3/sklearn/base.py#L537

Let's please pause on this work until some scikit-learn maintainer gives an authoritative answer on scikit-learn/scikit-learn#28337.

nicklamiller requested review from guolinke, jameslamb, shiyu1994, jmoralez and borchero as code owners February 12, 2024 06:06

jameslamb added the feature label Feb 12, 2024

jameslamb requested changes Feb 12, 2024

jameslamb added the in progress label Feb 12, 2024

jameslamb changed the title ~~Expose feature_name_ via sklearn consistent attribute feature_names_in_~~ [python] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279) Feb 12, 2024

nicklamiller mentioned this pull request Mar 3, 2024

[python-package] Documentation on setting up development environment #6350

Open

jameslamb mentioned this pull request Mar 19, 2024

[ci] [python-package] Python tests leave files behind #6361

Open

8 tasks

nicklamiller added 8 commits March 28, 2024 11:21

expose feature_name_ via sklearn consistent attribute feature_names_in_

2f013eb

fix docstring

30b542b

raise error if estimator not fitted

480e49d

ensure exact feature match for feature_names_in_ attribute

f64ceda

add test for numpy input

0a4d62b

add test for pandas input with feature names

8b13b2c

add documentation for when input data has no feature names

4c1d9b0

pre-commit fixes

10d5301

nicklamiller force-pushed the add-sklearn-feature-attributes branch from c481290 to 10d5301 Compare March 28, 2024 19:04

nicklamiller requested a review from jameslamb March 28, 2024 19:43

nicklamiller mentioned this pull request Mar 28, 2024

Enforce feature_names_in_ and n_features_in_ in check_estimator post SLEP007 implementation scikit-learn/scikit-learn#28337

Open

jameslamb requested changes Mar 29, 2024

jameslamb changed the title ~~[python] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279)~~ [python-package] Add feature_names_in_ attribute for scikit-learn estimators (fixes #6279) Mar 29, 2024
