Feat: Use ibis as single backend #1552
Comments
Hi @NickCrews, Moreover, this task involves reliance on a library that is less established compared to both pandas and PySpark. Historically, adopting less established third-party packages has presented difficulties in maintaining ydata-profiling alongside updates to Python versions. We will keep this feature request open, considering it for potential future integration, should there be significant interest or demand from the community.
Thanks @fabclmnt, those concerns really make sense from a maintainership point of view. I think this path forward makes sense. In that issue I linked, the ibis maintainers expressed some interest in helping sister projects be more compatible with ibis, so possibly you could offload some of this work to them if you ever wanted to move forward. Just curious, which functionalities use native pandas and pyspark APIs that ibis doesn't/can't handle? I may write my own simple version of this lib for ibis, and would love to avoid implementing 3/4 of it before hitting some insurmountable brick wall.
@fabclmnt Completely agree with @NickCrews; those concerns make sense. For Pandera, we've aligned on an approach of contributing an Ibis backend (to support a lot of the database backends Ibis natively supports) in addition to keeping the existing backends for pandas, Polars, and Spark. Rather than refactoring the existing pandas and Spark code in ydata-profiling, would you be open to the contribution of an Ibis backend in the core repo? We could do so in a fork initially.
With respect to this, Ibis is quite relaxed around how it defines dependencies, and furthermore all of the backend dependencies (e.g. for profiling on Postgres) would be treated as extras (i.e.
Missing functionality
I use ibis. I would love to be able to profile Ibis Tables, as I brought up in their issue tracker.
Proposed feature
Since ibis can already handle pandas and spark dataframes, the logical way to support it would be to re-implement all of your core logic in ibis. That would guarantee consistency between the current pandas and spark implementations (there would be only one implementation!), plus you would get the benefit of supporting all the backends that ibis supports, like sqlite, polars, bigquery, athena, dask, etc.
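To make the idea concrete, here is a rough sketch of what "one implementation that runs inside the backend" could look like. It deliberately does not use ibis or ydata-profiling: the standard library's `sqlite3` stands in for an ibis backend, and `profile_column` is a hypothetical helper invented for this illustration. The point is only that the summary statistics are computed by the engine as aggregates, so just one small result row comes back to Python.

```python
import sqlite3


def profile_column(conn, table, column):
    """Compute basic summary statistics for one column inside the database.

    Hypothetical helper for illustration only; it is not part of
    ydata-profiling or ibis.
    """
    # Identifiers cannot be bound as SQL parameters, so quote them defensively.
    t = '"' + table.replace('"', '""') + '"'
    c = '"' + column.replace('"', '""') + '"'
    query = f"""
        SELECT
            COUNT(*)              AS n_rows,
            COUNT(*) - COUNT({c}) AS n_missing,
            COUNT(DISTINCT {c})   AS n_distinct,
            MIN({c})              AS minimum,
            MAX({c})              AS maximum,
            AVG({c})              AS mean
        FROM {t}
    """
    # The aggregation runs in the engine; only a single row is fetched.
    row = conn.execute(query).fetchone()
    keys = ("n_rows", "n_missing", "n_distinct", "minimum", "maximum", "mean")
    return dict(zip(keys, row))


# Tiny in-memory table standing in for a (possibly much larger) backend table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?)",
    [(1.0,), (2.0,), (2.0,), (None,), (5.0,)],
)

stats = profile_column(conn, "trips", "distance")
print(stats)
```

With ibis, the same aggregates would presumably be written once as table expressions and compiled for whichever backend holds the data, rather than as a hand-written SQL string per engine.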
Alternatives considered
Convert all these other dataframe formats to pandas/pyspark, and then use this library. This is hard for larger-than-memory tables.
Additional context
I only very briefly browsed through your codebase, so I'm not sure how big of a task this would be.