Cloud Embedded Storage layer #369

Open · 3 tasks
andresgutgon opened this issue May 3, 2024 · 5 comments

andresgutgon commented May 3, 2024

What is this?

We want to offer users the possibility of replicating data from their databases (sources) into Parquet files stored in an S3 bucket. All queries referencing that data are then run with a DuckDB client.

The benefit is that users' databases receive less load, and Parquet/DuckDB queries are very fast.

How it works.

Users define a query as part of the embedded layer by setting some parameters (TO BE DEFINED) in the query's config. This configuration tells our embedding layer:

  1. That this query has to be stored in S3 as a Parquet file
  2. How often (every how many days/hours) the data is fetched from the original database

Once this config is in place and the query is stored in the embedded storage layer, users can reference this query just as they would if it ran against their database. Under the hood, the Latitude app fetches the data from S3.
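Purely as an illustration (the actual parameter names are still TO BE DEFINED), the config the embedding layer reads could look something like this:

// Hypothetical shape of the embedded-layer config; field names are placeholders, not a decided API.
type EmbeddedQueryConfig = {
  materialize: boolean   // store this query's result in S3 as a Parquet file
  refreshEvery: string   // how often to re-fetch from the original database, e.g. '6h' or '2d'
}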

Considerations

Using the embedded layer comes with some considerations.

a) All queries referencing this query have to use DuckDB SQL syntax, not the user's DB SQL syntax.
b) We need to expose metadata in the queries saying when this query was last refreshed from the original source.
c) How do we check for periodic updates? I think we need some kind of cron job that walks

workspaces -> apps -> queries-in-those-apps

and checks which queries need to be stored in Parquet and which need to be refreshed (see the sketch below). I think this system has to be a separate piece from the current Latitude server that runs the queries. If we want to go down this path, that service needs to access that queries<-from-app<--from-workspace relationship somehow. We should start storing the apps reference.
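A minimal sketch of what that cron-driven check could look like, assuming hypothetical helpers for listing embedded queries and materializing them (nothing here is existing Latitude code):

import { CronJob } from 'cron'

type EmbeddedQuery = { id: string; refreshedAt: Date; refreshIntervalMs: number }

// Placeholder: would walk workspaces -> apps -> queries in Latitude's metadata store.
declare function findEmbeddedQueries(): Promise<EmbeddedQuery[]>
// Placeholder: would run the query against the original source and write the result to S3 as Parquet.
declare function materializeToParquet(query: EmbeddedQuery): Promise<void>

async function refreshStaleQueries() {
  for (const query of await findEmbeddedQueries()) {
    const isStale = Date.now() - query.refreshedAt.getTime() > query.refreshIntervalMs
    if (isStale) await materializeToParquet(query)
  }
}

// Check once an hour which embedded queries are due for a refresh.
new CronJob('0 * * * *', refreshStaleQueries).start()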

TODO

  • Decide where the infra responsible for fetching the queries and storing them as Parquet files should live.
  • We probably need a way of running cron jobs. Maybe this initiative can be done together with moving deploys to jobs.
  • More things?...
@andresgutgon

For Parquet creation/reading, this one looks like the best-maintained option:
https://github.com/LibertyDSNP/parquetjs

Comparison with others
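A rough sketch of its write/read API (the repo is published as @dsnp/parquetjs; the exact import shape may differ), with a made-up schema and file name:

import parquet from '@dsnp/parquetjs'

// Define a schema and write a couple of rows to a local Parquet file.
const schema = new parquet.ParquetSchema({
  id: { type: 'INT64' },
  name: { type: 'UTF8' },
  created_at: { type: 'TIMESTAMP_MILLIS' },
})
const writer = await parquet.ParquetWriter.openFile(schema, 'users.parquet')
await writer.appendRow({ id: 1, name: 'Ada', created_at: new Date() })
await writer.close()

// Read the rows back.
const reader = await parquet.ParquetReader.openFile('users.parquet')
const cursor = reader.getCursor()
let row
while ((row = await cursor.next())) console.log(row)
await reader.close()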


andresgutgon commented May 6, 2024

For DuckDB, this is the official TypeScript wrapper around the Node.js client:
https://www.npmjs.com/package/duckdb-async

I see DuckDB supports writing to Parquet directly:
https://duckdb.org/docs/data/parquet/overview#writing-to-parquet-files
Not sure if we can use this somehow and avoid a parquetjs package.
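For instance, something along these lines with duckdb-async would let DuckDB write the Parquet file itself (untested sketch; table and file names are made up):

import { Database } from 'duckdb-async'

// Load some rows into an in-memory DuckDB and let DuckDB write the Parquet file directly.
const db = await Database.create(':memory:')
await db.run(`CREATE TABLE users AS SELECT 1 AS id, 'Ada' AS name`)
await db.run(`COPY users TO 'users.parquet' (FORMAT PARQUET)`)
await db.close()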

@andresgutgon

I'm reading about reading the source DB (PostgreSQL/MySQL) and writing directly to Parquet:
https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html

-- connect to the Postgres instance with the given parameters in read-only mode
ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE POSTGRES, READ_ONLY);
COPY ⟨table_name⟩ TO 's3://bucket/file.parquet';
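Writing straight to s3:// also needs the httpfs extension and S3 credentials configured. A hedged end-to-end sketch with duckdb-async (connection string, bucket, keys and table name are all placeholders):

import { Database } from 'duckdb-async'

// Attach the source Postgres in read-only mode, then copy a table straight to Parquet on S3.
const db = await Database.create(':memory:')
await db.exec(`
  INSTALL postgres; LOAD postgres;
  INSTALL httpfs; LOAD httpfs;
  SET s3_region = 'eu-west-1';
  SET s3_access_key_id = '<ACCESS_KEY>';
  SET s3_secret_access_key = '<SECRET_KEY>';
  ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS src (TYPE POSTGRES, READ_ONLY);
  COPY src.public.users TO 's3://my-bucket/users.parquet' (FORMAT PARQUET);
`)
await db.close()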

The problem with this approach is that DuckDB doesn't have all the connectors we support
Example:
https://www.reddit.com/r/DuckDB/comments/1bp1977/connect_duckdb_with_snowflake_via_adbc/


geclos commented May 6, 2024

Questions:

  • Once we've synced to Parquet files, how do we tell customers to target this Parquet file from their queries?
  • How are we gonna show the metadata regarding syncs? Does this mean we are gonna have to build a UI for a product no one is using yet?
  • Is this gonna be available in the OSS product? If so, how? We need background jobs, which means OSS Latitude becomes a multi-process application that would probably need a Docker Compose workflow in order to work out of the box locally.
  • Could we maybe go for an all-frontend approach, where users copy their data into a local DuckDB file through some function that we provide, and target their queries there?

@andresgutgon

This project could be interesting for connecting to sources that support ODBC connections, like Snowflake: https://github.com/rupurt/odbc-scanner-duckdb-extension
