This library is amazing. Is there any way to use the library for PySpark and SQL instead of Pandas? #1
Comments
Hey! Right now this relies on being able to run aggregations quickly to summarize the data added to the prompt, so it only really works when the data is in memory.

For engines like Dask, PySpark, or Modin (remote, likely "big" data), this would require updating the aggregation code. That is theoretically possible: the datasketch-style aggregations this is built on are O(N), parallelizable, and mergeable. That said, it isn't supported right now.

For SQL systems (remote databases such as Snowflake, ClickHouse, Postgres, or SQLite), this can't be used directly right now without downloading the table first (e.g. with pd.read_sql).
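A minimal sketch of that pd.read_sql workaround, using an in-memory SQLite database as a stand-in for a remote warehouse (the table and column names here are made up for illustration, and the library itself isn't named in this thread, so the last step is left as a comment):

```python
import sqlite3
import pandas as pd

# Stand-in for a remote database connection (Snowflake, Postgres, ...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 9.5), (2, 3.0), (1, 7.25)],
)

# Download the table into memory; after this, the DataFrame can be
# summarized and aggregated like any other in-memory pandas data.
df = pd.read_sql("SELECT * FROM events", conn)
print(df.shape)  # (3, 2)
```

For large tables this obviously pulls everything over the wire, which is exactly the limitation being discussed.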
Are these sketches something we could add to a remote SQL db as a UDF and index for faster usage?
Thinking specifically of BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/sketches
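For what it's worth, BigQuery's built-in sketch aggregates already cover part of this idea without a custom UDF: per-partition sketches can be precomputed once and merged later, instead of rescanning raw rows. A hedged sketch of what those queries might look like (the dataset, table, and column names are purely illustrative):

```python
# Step 1 (hypothetical): build one HLL sketch per shard and store it.
build_sketches = """
SELECT shard, HLL_COUNT.INIT(user_id) AS user_sketch
FROM my_dataset.events
GROUP BY shard
"""

# Step 2 (hypothetical): merge stored sketches into one approximate
# distinct count, without touching the raw events table again.
merge_sketches = """
SELECT HLL_COUNT.MERGE(user_sketch) AS approx_distinct_users
FROM my_dataset.user_sketches
"""
```

Since HLL sketches are mergeable, this matches the O(N)/parallelizable property mentioned above; the open question would be exposing the library's own sketch formats the same way.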