Support Polars dataframes across the library #769

Open
9 of 12 tasks
Vincent-Maladiere opened this issue Sep 29, 2023 · 4 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed meta-issue Lists a bunch of tasks

Comments

@Vincent-Maladiere
Member

Vincent-Maladiere commented Sep 29, 2023

Currently, we only partially support Polars dataframes, in most cases thanks to skrub._utils.check_input, which converts dataframes into numpy arrays via sklearn.utils.validation.check_array.
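As a rough illustration of why this support is only partial, the sketch below passes a dataframe through sklearn.utils.validation.check_array and shows that a plain numpy array comes back, so column names and the dataframe type are lost. The helper name to_validated_array is hypothetical and is not skrub's check_input. A pandas dataframe is used so the snippet runs without Polars installed, assuming a numeric Polars dataframe goes through the same call.

```python
import numpy as np
import pandas as pd
from sklearn.utils.validation import check_array


def to_validated_array(X):
    """Hypothetical helper mimicking the conversion step described above."""
    # dtype=None keeps the input dtype instead of forcing a numeric cast.
    return check_array(X, dtype=None, ensure_2d=True)


df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
arr = to_validated_array(df)

# The result is a bare numpy array: no column names, no dataframe methods.
print(type(arr).__name__, arr.shape)
```

Any method that needs column names or dataframe operations after this step cannot rely on the converted array alone, which is the gap the tasks below are meant to close.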

Moreover, #733 introduced Pandas-specific and Polars-specific implementations of operations like aggregation and join. Note that this duplicated logic will be replaced in the mid-term by the dataframe consortium standard, as discussed in #719.

The following methods need to be fixed to enable Polars dataframes:

  • TableVectorizer.get_feature_names_out()
  • fuzzy_join()

The following tests need to at least cover Polars dataframe inputs:

  • test_deduplicate.py
  • test_fuzzy_join.py
  • test_minhash_encoder.py
  • test_gap_encoder.py
  • test_similarity_encoder.py
  • test_table_vectorizer.py
  • test_datetime_encoder.py
  • test_fast_hash.py
  • test_joiner.py

We also need to enable Polars output from our TableVectorizer, so that the following works:

from skrub import TableVectorizer

tv = TableVectorizer()
# Desired behavior: request Polars dataframes as output
tv.set_output(transform="polars")
# X and X_transformed are Polars dataframes
X_transformed = tv.fit_transform(X)

Having Polars output in ColumnTransformer is currently under discussion at scikit-learn/scikit-learn#25896. When made available in ColumnTransformer, this feature will also be available in TableVectorizer directly.

In the meantime, we could create a minimalistic workaround to enable Polars outputs.

To accomplish this, I suggest the following:

  • Override set_output in TableVectorizer. The method is originally defined in _SetOutputMixin, the parent class of TransformerMixin:
    • For Pandas output, nothing changes: we only call super().set_output(transform="pandas").
    • For Polars output, we only set a private flag.
  • During fit, if the flag is set, we call self.column_transformer.set_output(transform="pandas"), then check the flag again after self.column_transformer.fit_transform(X) to convert the output to a Polars dataframe.
  • In transform, we also check the flag and apply the same logic.
@TheooJ
Contributor

TheooJ commented Oct 12, 2023

I'm working on testing for Polars inputs in:

test_deduplicate.py
test_fuzzy_join.py
test_minhash_encoder.py
test_gap_encoder.py
test_similarity_encoder.py
test_table_vectorizer.py
test_datetime_encoder.py
test_fast_hash.py
test_joiner.py

@jeromedockes
Contributor

I wonder if instead of creating separate tests to compare polars to pandas, we should parametrize the existing tests to run them once on pandas dataframes and once on polars dataframes?

@jeromedockes
Contributor

As is done in this test for the agg joiner, for example.
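A parametrized test of this kind might look like the sketch below. The names DATAFRAME_MODULES, px, and column_names are hypothetical and chosen for illustration; this is not the agg joiner test referred to above. Polars is treated as optional, so the test runs on pandas alone when it is not installed.

```python
import pandas as pd
import pytest

# Dataframe modules to run each test against; Polars is optional.
DATAFRAME_MODULES = [pd]
try:
    import polars as pl

    DATAFRAME_MODULES.append(pl)
except ImportError:
    pass


def column_names(df):
    """Hypothetical function under test; works on pandas and Polars."""
    return list(df.columns)


@pytest.mark.parametrize("px", DATAFRAME_MODULES)
def test_column_names(px):
    # px is either the pandas or the polars module; both expose DataFrame.
    df = px.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    assert column_names(df) == ["a", "b"]
```

The same test body then runs once per dataframe library, so the pandas and Polars code paths are checked against identical expectations without duplicating tests.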

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 13, 2023 via email
