This library is amazing. Is there any way to use the library for PySpark and SQL instead of Pandas? #1
Comments
Hey! Right now this relies on being able to run aggregations quickly to summarize the data added to the prompt, so it only really works when the data is in memory.

For engines like Dask, PySpark, or Modin (remote, likely "big" data), this would require updating the aggregation code. That is theoretically possible: the datasketch-style aggregations this is built on are O(N), parallelizable, and mergeable. That said, it isn't supported right now.

For SQL systems (remote databases such as Snowflake, ClickHouse, Postgres, or SQLite), this can't be used directly right now without downloading the table first (e.g. with pd.read_sql).
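A minimal sketch of that pd.read_sql workaround, using an in-memory SQLite database as a stand-in for a remote warehouse (the table and column names here are made up for illustration, and the library itself isn't named in this thread, so the last step is left as a comment):

```python
import sqlite3
import pandas as pd

# Stand-in for a remote database connection (Snowflake, Postgres, ...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 9.5), (2, 3.0), (1, 7.25)],
)

# Download the table into memory; after this, the DataFrame can be
# summarized and aggregated like any other in-memory pandas data.
df = pd.read_sql("SELECT * FROM events", conn)
print(df.shape)  # (3, 2)
```

For large tables this obviously pulls everything over the wire, which is exactly the limitation being discussed.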
Are these sketches something we could add to a remote SQL db as a UDF and index for faster usage?
Thinking specifically of BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/sketches
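For what it's worth, BigQuery's built-in sketch aggregates already cover part of this idea without a custom UDF: per-partition sketches can be precomputed once and merged later, instead of rescanning raw rows. A hedged sketch of what those queries might look like (the dataset, table, and column names are purely illustrative):

```python
# Step 1 (hypothetical): build one HLL sketch per shard and store it.
build_sketches = """
SELECT shard, HLL_COUNT.INIT(user_id) AS user_sketch
FROM my_dataset.events
GROUP BY shard
"""

# Step 2 (hypothetical): merge stored sketches into one approximate
# distinct count, without touching the raw events table again.
merge_sketches = """
SELECT HLL_COUNT.MERGE(user_sketch) AS approx_distinct_users
FROM my_dataset.user_sketches
"""
```

Since HLL sketches are mergeable, this matches the O(N)/parallelizable property mentioned above; the open question would be exposing the library's own sketch formats the same way.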