Cloud Embedded Storage layer #369

Open · 3 tasks
andresgutgon opened this issue May 3, 2024 · 5 comments

andresgutgon commented May 3, 2024

What is this?

We want to offer users the possibility of replicating data from their databases (sources) into Parquet files stored in an S3 bucket. All queries referencing that data are then run with a DuckDB client.

The benefit is that users' databases receive less load, and Parquet/DuckDB queries are very fast.

How it works.

Users define a query as part of the embedded layer by setting some parameters (TO BE DEFINED) in the query's config. This configuration tells our embedding layer:

  1. That this query has to be stored in S3 as a Parquet file
  2. How often (every how many days/hours) the data is fetched from the original database

Once this config is in place and the query is stored in the embedded storage layer, users can reference this query just as they would if it ran against their database. Under the hood, the Latitude app fetches the data from S3.
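Purely as an illustration (the actual parameter names are still TO BE DEFINED), the config the embedding layer reads could look something like this:

// Hypothetical shape of the embedded-layer config; field names are placeholders, not a decided API.
type EmbeddedQueryConfig = {
  materialize: boolean   // store this query's result in S3 as a Parquet file
  refreshEvery: string   // how often to re-fetch from the original database, e.g. '6h' or '2d'
}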

Considerations

Using the embedded layer comes with some considerations.

a) All queries referencing this query have to use DuckDB SQL syntax, not the user's DB SQL syntax.
b) We need to expose metadata in the queries saying when this query was last refreshed from the original source.
c) How do we check for periodic updates? I think we need some kind of cron job that walks

workspaces -> apps -> queries-in-those-apps

and checks which queries need to be stored in Parquet and which need to be refreshed (see the sketch below). I think this system has to be a separate piece from the current Latitude server that runs the queries. If we want to go down this path, that service needs to access that queries<-from-app<--from-workspace relationship somehow. We should start storing the apps reference.
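A minimal sketch of what that cron-driven check could look like, assuming hypothetical helpers for listing embedded queries and materializing them (nothing here is existing Latitude code):

import { CronJob } from 'cron'

type EmbeddedQuery = { id: string; refreshedAt: Date; refreshIntervalMs: number }

// Placeholder: would walk workspaces -> apps -> queries in Latitude's metadata store.
declare function findEmbeddedQueries(): Promise<EmbeddedQuery[]>
// Placeholder: would run the query against the original source and write the result to S3 as Parquet.
declare function materializeToParquet(query: EmbeddedQuery): Promise<void>

async function refreshStaleQueries() {
  for (const query of await findEmbeddedQueries()) {
    const isStale = Date.now() - query.refreshedAt.getTime() > query.refreshIntervalMs
    if (isStale) await materializeToParquet(query)
  }
}

// Check once an hour which embedded queries are due for a refresh.
new CronJob('0 * * * *', refreshStaleQueries).start()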

TODO

  • Decide where the infra responsible for fetching the queries and storing them as Parquet files should live.
  • We probably need a way of running cron jobs. Maybe this initiative can be done together with moving deploys to jobs.
  • More things?...
@andresgutgon

For Parquet creation/reading, this one looks like the best-maintained option:
https://github.com/LibertyDSNP/parquetjs

Comparison with others
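A rough sketch of its write/read API (the repo is published as @dsnp/parquetjs; the exact import shape may differ), with a made-up schema and file name:

import parquet from '@dsnp/parquetjs'

// Define a schema and write a couple of rows to a local Parquet file.
const schema = new parquet.ParquetSchema({
  id: { type: 'INT64' },
  name: { type: 'UTF8' },
  created_at: { type: 'TIMESTAMP_MILLIS' },
})
const writer = await parquet.ParquetWriter.openFile(schema, 'users.parquet')
await writer.appendRow({ id: 1, name: 'Ada', created_at: new Date() })
await writer.close()

// Read the rows back.
const reader = await parquet.ParquetReader.openFile('users.parquet')
const cursor = reader.getCursor()
let row
while ((row = await cursor.next())) console.log(row)
await reader.close()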


andresgutgon commented May 6, 2024

For DuckDB, this is the official TypeScript wrapper around the Node.js client:
https://www.npmjs.com/package/duckdb-async

I see DuckDB supports writing to Parquet directly:
https://duckdb.org/docs/data/parquet/overview#writing-to-parquet-files
Not sure if we can use this somehow and avoid a parquetjs package.
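For instance, something along these lines with duckdb-async would let DuckDB write the Parquet file itself (untested sketch; table and file names are made up):

import { Database } from 'duckdb-async'

// Load some rows into an in-memory DuckDB and let DuckDB write the Parquet file directly.
const db = await Database.create(':memory:')
await db.run(`CREATE TABLE users AS SELECT 1 AS id, 'Ada' AS name`)
await db.run(`COPY users TO 'users.parquet' (FORMAT PARQUET)`)
await db.close()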

@andresgutgon

I'm reading about reading the source DB (PostgreSQL/MySQL) and writing directly to Parquet:
https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html

-- connect to the Postgres instance with the given parameters in read-only mode
ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE POSTGRES, READ_ONLY);
COPY ⟨table_name⟩ TO 's3://bucket/file.parquet';
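Writing straight to s3:// also needs the httpfs extension and S3 credentials configured. A hedged end-to-end sketch with duckdb-async (connection string, bucket, keys and table name are all placeholders):

import { Database } from 'duckdb-async'

// Attach the source Postgres in read-only mode, then copy a table straight to Parquet on S3.
const db = await Database.create(':memory:')
await db.exec(`
  INSTALL postgres; LOAD postgres;
  INSTALL httpfs; LOAD httpfs;
  SET s3_region = 'eu-west-1';
  SET s3_access_key_id = '<ACCESS_KEY>';
  SET s3_secret_access_key = '<SECRET_KEY>';
  ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS src (TYPE POSTGRES, READ_ONLY);
  COPY src.public.users TO 's3://my-bucket/users.parquet' (FORMAT PARQUET);
`)
await db.close()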

The problem with this approach is that DuckDB doesn't have all the connectors we support
Example:
https://www.reddit.com/r/DuckDB/comments/1bp1977/connect_duckdb_with_snowflake_via_adbc/


geclos commented May 6, 2024

Questions:

  • Once we've synced to Parquet files, how do we tell customers to target this Parquet file from their queries?
  • How are we gonna show the metadata regarding syncs? Does this mean we are gonna have to build a UI for a product no one is using yet?
  • Is this gonna be available in the OSS product? If so, how? We need background jobs, which means OSS Latitude becomes a multi-process application that would probably need a Docker Compose workflow in order to work out of the box locally.
  • Could we maybe go for an all-frontend approach, where users copy their data into a local DuckDB file through some function that we provide, and target their queries there?

@andresgutgon

This project could be interesting for connecting to sources that support ODBC connections, like Snowflake: https://github.com/rupurt/odbc-scanner-duckdb-extension
