Feat: Use ibis as single backend #1552

Open
NickCrews opened this issue Feb 21, 2024 · 3 comments
Labels
feature request 💬 Requests for new features

Comments

@NickCrews

Missing functionality

I use ibis. I would love to be able to profile Ibis Tables, as I brought up in their issue tracker.

Proposed feature

If you were to support ibis, the logical approach would be to re-implement all of your core logic in ibis, since ibis can already handle pandas and Spark dataframes. That would guarantee consistency between the current pandas and Spark implementations (there would only be one implementation!), and you would get the benefit of supporting all the backends that ibis supports, such as SQLite, Polars, BigQuery, Athena, Dask, and others.

Alternatives considered

Convert all these other dataframe formats to pandas/PySpark, and then use this library. This is hard for larger-than-memory tables.
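
To make the larger-than-memory point concrete, here is a minimal sketch of the alternative: computing the summary statistics inside the engine so that only one small result row is pulled into Python. It uses the standard-library `sqlite3` module purely as a stand-in for whatever backend ibis would target; `profile_column`, the `trips` table, and the chosen statistics are hypothetical and are not part of ydata-profiling or ibis.

```python
import sqlite3


def profile_column(conn, table, column):
    """Compute basic summary statistics for one column inside the database."""
    # Identifiers are interpolated directly: acceptable for a sketch,
    # not for untrusted input.
    query = f"""
        SELECT
            COUNT(*)                     AS n_rows,
            COUNT(*) - COUNT("{column}") AS n_missing,
            COUNT(DISTINCT "{column}")   AS n_distinct,
            MIN("{column}")              AS minimum,
            MAX("{column}")              AS maximum,
            AVG("{column}")              AS mean
        FROM "{table}"
    """
    cur = conn.execute(query)
    names = [d[0] for d in cur.description]
    # Only this single aggregated row leaves the database.
    return dict(zip(names, cur.fetchone()))


# Hypothetical sample data standing in for a large table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?)", [(1.0,), (2.0,), (None,), (2.0,)]
)

stats = profile_column(conn, "trips", "distance")
print(stats)
```

The table itself never has to fit in local memory, which is the property the convert-to-pandas route gives up.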

Additional context

I only very briefly browsed through your codebase, so I'm not sure how big of a task this would be.

@fabclmnt
Contributor

Hi @NickCrews,
this is quite a big task, especially considering that several methods native to both pandas and Spark are used.

Moreover, this task involves reliance on a library that is less established compared to both pandas and PySpark. Historically, adopting less established third-party packages has presented difficulties in maintaining ydata-profiling alongside updates to Python versions.

We will keep this feature request open, considering it for potential future integration, should there be significant interest or demand from the community.

fabclmnt added the feature request 💬 (Requests for new features) label and removed the needs-triage label on Feb 22, 2024
@NickCrews
Author

Thanks @fabclmnt, those concerns really make sense from a maintainership point of view. I think this path forward makes sense. In that issue I linked, the ibis maintainers expressed some interest in helping sister projects be more compatible with ibis, so possibly you could offload some of this work to them if you ever wanted to move forward.

Just curious, what are the functionalities that use native pandas and pyspark APIs that ibis doesn't/can't handle? I may write my own simple version of this lib for ibis, and would love to avoid implementing 3/4 of it before I hit some insurmountable brick wall.

@deepyaman
Contributor

@fabclmnt Completely agree with @NickCrews; those concerns make sense.

For Pandera, we've aligned on an approach of contributing an Ibis backend (to support a lot of the database backends Ibis natively supports) in addition to having the existing backends for pandas, Polars, Spark. Rather than a refactoring of existing code to support pandas and Spark in ydata-profiling, would you be open to the contribution of an Ibis backend in the core repo? We could do so in a fork initially.
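
As a rough illustration of what an additional backend alongside the existing ones could look like, the sketch below dispatches on the input type with the standard library's `functools.singledispatch`. It is not how ydata-profiling or Pandera actually register backends; `PandasLikeFrame`, `IbisLikeTable`, and `describe_table` are hypothetical stand-ins so the example runs without either library installed.

```python
from functools import singledispatch


class PandasLikeFrame:
    """Hypothetical stand-in for an in-memory dataframe."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values


class IbisLikeTable:
    """Hypothetical stand-in for a lazy table expression."""

    def __init__(self, schema):
        self.schema = schema  # dict: column name -> type name


@singledispatch
def describe_table(data):
    # Fallback for input types with no registered backend.
    raise NotImplementedError(
        f"no profiling backend for {type(data).__name__}"
    )


@describe_table.register
def _(data: PandasLikeFrame):
    # Existing backend: untouched by the new contribution.
    return {"backend": "pandas", "columns": sorted(data.columns)}


@describe_table.register
def _(data: IbisLikeTable):
    # New backend: added as one more registration.
    return {"backend": "ibis", "columns": sorted(data.schema)}


print(describe_table(PandasLikeFrame({"b": [1, 2], "a": [3, 4]})))
print(describe_table(IbisLikeTable({"id": "int64", "name": "string"})))
```

The point of this shape is that the new backend is purely additive: the existing pandas and Spark code paths stay as they are.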

> Historically, adopting less established third-party packages has presented difficulties in maintaining ydata-profiling alongside updates to Python versions.

With respect to this, Ibis is quite relaxed about how it defines dependencies, and furthermore all of the backend dependencies (e.g. for profiling on Postgres) would be treated as extras (i.e. ydata-profiling[postgres] could depend on ibis-framework[postgres]). Not sure if that alleviates some of the concerns, but happy to hear what thoughts you have!

Development

No branches or pull requests

4 participants