This library is amazing. Is there any way to use the library for PySpark and SQL instead of Pandas? #1

Open
nithinreddyy opened this issue Jan 17, 2023 · 3 comments


nithinreddyy commented Jan 17, 2023

Hello Team,

Thanks for creating this amazing library. Is there any way to use the library for PySpark and SQL instead of Pandas?

nithinreddyy changed the title from "This library is amazing. Is there any way to use the library for PySpark instead of Pandas?" to "This library is amazing. Is there any way to use the library for PySpark and SQL instead of Pandas?" on Jan 17, 2023
@bluecoconut
Contributor

Hey!

Right now this relies on being able to run aggregations quickly to summarize the data to add to the prompt, so it only really works if the data is in memory.

For things like Dask, PySpark, Modin, etc. (remote data, likely "big" data), this would require updating the aggregation code. This is theoretically possible: the datasketch aggregations this is intended to build on are O(N), parallelizable, and mergeable. That said, it isn't supported right now.
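To illustrate what "mergeable" means here, below is a minimal sketch in plain Python. It is not this library's aggregation code and does not use datasketch; `ColumnSummary` and `summarize` are hypothetical names made up for the example. Each partition is summarized independently in one O(N) pass, and the partial summaries are then combined without revisiting the rows.

```python
import functools
import math
from dataclasses import dataclass


@dataclass
class ColumnSummary:
    """Hypothetical per-column summary that can be merged across partitions."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        # Single-pass update: O(1) work per value, O(N) per partition.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "ColumnSummary") -> "ColumnSummary":
        # Combining two summaries never needs the underlying rows.
        return ColumnSummary(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


def summarize(values) -> ColumnSummary:
    """Summarize one partition; could run on a separate worker."""
    summary = ColumnSummary()
    for value in values:
        summary.update(value)
    return summary


# Three partitions standing in for chunks of a distributed column.
partitions = [[3.0, 1.0, 4.0], [1.0, 5.0], [9.0, 2.0, 6.0]]
partials = [summarize(part) for part in partitions]
merged = functools.reduce(ColumnSummary.merge, partials, ColumnSummary())

print(merged)
```

Real sketches (for distinct counts, quantiles, and so on) follow the same update/merge shape, which is why they map onto engines that process data in partitions.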

For SQL systems (e.g. remote databases such as Snowflake, ClickHouse, Postgres, SQLite), this cannot be used directly right now without downloading the table first (e.g. with pd.read_sql).
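A rough sketch of that workaround, assuming the table fits in memory. It uses an in-memory SQLite database so it runs anywhere; the `trips` table and its columns are made up for illustration, and the library-specific step is left out since it isn't shown in this thread.

```python
import sqlite3

import pandas as pd

# Stand-in for a remote database: an in-memory SQLite table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Oslo", 3.2), ("Lima", 7.5), ("Oslo", 1.1)],
)
conn.commit()

# Download the table into a regular in-memory pandas DataFrame.
df = pd.read_sql("SELECT * FROM trips", conn)
conn.close()

print(df.shape)
```

From here `df` is an ordinary DataFrame. For large tables, selecting only the needed columns or adding a `LIMIT` to the query keeps the download manageable.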

@andrewluetgers

Are these sketches something we could add to a remote SQL DB as a UDF and index for faster usage?

@andrewluetgers

Thinking specifically of BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/sketches


3 participants