Oracle data loader example #1346

Open · Tracked by #1304
mbostock opened this issue May 14, 2024 · 10 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@mbostock (Member)

No description provided.

@mbostock mentioned this issue May 14, 2024
@mbostock added the documentation label May 14, 2024
@rinie commented May 29, 2024

The loader examples seem messy. I looked at the Snowflake/Postgres/Databricks examples, and edit.csv.ts is a combination of a simple JavaScript template (process.stdout.write(csvFormat(...))) and a SQL query. Why not factor out the query, so you have a standard connector with just the query as the variant?
Browsing DuckDB solutions I came across https://evidence.dev/blog/why-we-built-usql/ and https://github.com/rinie/evidence-connector-oracledb...
Your markdown usage and delta data updates seem easier; 'their' SQL-in-development mode and connectors seem easier to use.

If both solutions use a data loader/source to get data into Parquet or Arrow and use DuckDB-Wasm on the frontend, why not enable their connectors in Framework?

How is the process.stdout.write output processed into Parquet/DuckDB?
Would it be possible to bypass that and generate Parquet directly, or is that slower?

I would like to play around with Oracle and Framework by just providing a connection and some queries, not code...

@mbostock (Member Author)

> Why not factor out the query, so you have a standard connector with just the query as the variant?

There are lots of answers to this, but for a start, so you can:

  • use any off-the-shelf connector or SDK (rather than being limited to what Framework directly integrates)
  • write a data loader in any programming language (rather than being limited to SQL)
  • interpolate dynamic variables into queries (rather than being limited to static queries)
  • post-process the results in code (rather than forcing all logic to be in SQL)
  • save the results in whatever format you like (rather than being limited to Parquet or whatever)

Data loaders as SQL files would eliminate a bit of boilerplate, but you’d give up a whole ton of functionality in exchange.
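
For the Oracle case this issue is about, a data loader along those lines might look roughly like the sketch below. This is a hedged illustration rather than the eventual example: it assumes the oracledb and d3-dsv npm packages, and the table, columns, and environment variable names are made up.

```js
// orders.csv.js: hypothetical Oracle data loader; writes CSV to stdout for Framework to cache.
import oracledb from "oracledb";
import {csvFormat} from "d3-dsv";

// A dynamic variable interpolated into the query as a bind parameter (one of the points above).
const start = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);

const connection = await oracledb.getConnection({
  user: process.env.ORACLE_USER,
  password: process.env.ORACLE_PASSWORD,
  connectString: process.env.ORACLE_CONNECT_STRING
});

const {rows} = await connection.execute(
  `SELECT order_id, customer_id, amount, order_date
     FROM orders
    WHERE order_date >= :start`,
  {start},
  {outFormat: oracledb.OUT_FORMAT_OBJECT}
);

await connection.close();

// Post-process the rows in code here if needed, then emit CSV on stdout.
process.stdout.write(csvFormat(rows));
```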

Also, you could totally implement a SQL interpreter if you want to write data loaders this way. Use the interpreters config option to register .sql as a data loader extension, and point it at a program which reads the file and runs the query. (We could conceivably register a SQL interpreter automatically if you configure a database, but I don’t consider it a priority because we want to retain the flexibility to talk to multiple databases, support multiple languages, and allow post-processing of the query results in code. Maybe we’d do it if the SQL interpreter approach proved popular.)
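
To make that concrete, here is a minimal sketch of what such a setup could look like. Neither file is something Framework ships: the .sql registration, the run-sql.js runner, and the Oracle connection details are all hypothetical.

```js
// observablehq.config.js: register .sql files as data loaders via the interpreters option.
export default {
  interpreters: {
    ".sql": ["node", "./run-sql.js"] // hypothetical runner; the .sql path is passed as the last argument
  }
};
```

```js
// run-sql.js: read the .sql file, run it against Oracle, write CSV to stdout.
import {readFile} from "node:fs/promises";
import oracledb from "oracledb";
import {csvFormat} from "d3-dsv";

const sql = await readFile(process.argv.at(-1), "utf-8");

const connection = await oracledb.getConnection({
  user: process.env.ORACLE_USER,
  password: process.env.ORACLE_PASSWORD,
  connectString: process.env.ORACLE_CONNECT_STRING
});

const {rows} = await connection.execute(sql, [], {outFormat: oracledb.OUT_FORMAT_OBJECT});
await connection.close();

process.stdout.write(csvFormat(rows));
```

With something like that in place, a file such as src/data/orders.csv.sql containing only a query would behave like any other data loader, under the assumptions above.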

As to your other questions, maybe these examples will help?

https://observablehq.observablehq.cloud/framework-example-loader-parquet/
https://observablehq.observablehq.cloud/framework-example-loader-duckdb/

If you have more open-ended questions — rather than comments specific to the Oracle data loader — please open a discussion rather than commenting here. Thank you!

@rinie commented May 30, 2024

Thanks. Would you be OK if I added an evidence.dev loader, so I could reuse my simple Oracle code? That way, reusing their SQL files separately from the 'connector' approach would fit my Oracle needs.

I am attracted to Framework because of the markdown/SQL/Parquet/Arrow use on the frontend.
For me, defaulting to that on the backend would also make sense.
A static website that can refresh data for the current day every 5 minutes, while using static data for older periods, is what I am trying to achieve.

The examples I see write csvFormat output to standard out; does that end up in the cache as Parquet/Arrow?
CSV/JSON/TSV seem appropriate only for smaller datasets.

@Fil (Contributor) commented May 30, 2024

Here's an example that outputs an Arrow table:
https://github.com/observablehq/framework/blob/main/examples/loader-arrow/src/data/samples.arrow.js

This one outputs a Parquet file (from JavaScript):
https://observablehq.observablehq.cloud/framework-example-loader-parquet/

This one outputs a Parquet file from shell/DuckDB:
https://observablehq.observablehq.cloud/framework-example-loader-duckdb/
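
The core of an Arrow loader along those lines is quite short. Here is a hedged sketch using the apache-arrow package, with made-up columns rather than the exact code from the linked example:

```js
// samples.arrow.js: generate an Arrow table and write it as an IPC stream to stdout.
import {tableFromArrays, tableToIPC} from "apache-arrow";

const n = 1000;
const table = tableFromArrays({
  x: Float64Array.from({length: n}, () => Math.random()),
  y: Float64Array.from({length: n}, () => Math.random())
});

// Framework caches whatever bytes the loader emits; here that is an Arrow IPC stream.
process.stdout.write(tableToIPC(table));
```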

@rinie commented May 31, 2024

Thanks for the examples, but:

```js
// Output the Apache Arrow table as an IPC stream to stdout.
process.stdout.write(Arrow.tableToIPC(table));
```

and then `npm run build`, and then ending up in the cache as parquet, seems like a big detour?

```
├─ src              # source root
│  ├─ .observablehq
│  │  ├─ cache      # data loader cache
```

How about a delta update to the cache,
or multiple Parquet files for one data source?

@mbostock (Member Author)

> ending up in the cache as parquet seems like a big detour

I think you’re mixing your examples. The first one Fil linked generates an Arrow IPC file. The other ones generate a Parquet file. What ends up in the cache is determined by what the data loader generates. Framework doesn’t convert between file formats.

> How about a delta update to the cache

Yes, you can do this in a data loader; read from the cache in your data loader.
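
As a rough illustration of that idea (the cache path below is an assumption about the current layout, not a documented API, and loadRowsSince stands in for a real database query):

```js
// events.csv.js: refresh only today's rows, reusing older rows from the previous cached output.
import {readFile} from "node:fs/promises";
import {csvFormat, csvParse} from "d3-dsv";

// Hypothetical: replace with a real query that fetches rows on or after the given date.
async function loadRowsSince(date) {
  return []; // placeholder
}

const cachePath = "src/.observablehq/cache/data/events.csv"; // assumed location of this loader's cached output

let previous = [];
try {
  previous = csvParse(await readFile(cachePath, "utf-8"));
} catch {
  // No cached output yet; previous stays empty and everything is loaded below.
}

const today = new Date().toISOString().slice(0, 10);
const kept = previous.filter((d) => d.date < today); // keep older, already-static rows
const fresh = await loadRowsSince(previous.length ? today : "1970-01-01");

process.stdout.write(csvFormat([...kept, ...fresh]));
```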

> or multiple Parquet files for one data source

See the archives section of the data loader documentation.
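
For reference, the archive approach boils down to a single loader that emits several files. Here is a hedged sketch using JSZip; the file names and the loadDaily/loadHourly helpers are made up, and the members are CSV to keep it short, though they could just as well be Parquet files produced by another tool.

```js
// measurements.zip.js: one data loader that emits multiple files as a ZIP archive.
import JSZip from "jszip";
import {csvFormat} from "d3-dsv";

// Hypothetical query helpers; replace with real database calls.
async function loadDaily() { return []; }
async function loadHourly() { return []; }

const zip = new JSZip();
zip.file("daily.csv", csvFormat(await loadDaily()));
zip.file("hourly.csv", csvFormat(await loadHourly()));

// Pages can then reference the individual members, e.g. FileAttachment("measurements/daily.csv").
zip.generateNodeStream().pipe(process.stdout);
```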

@rinie commented May 31, 2024

OK, my misunderstanding, but isn't writing to stdout and then to the cache slower than a direct write?
And won't Databricks or Postgres doing process.stdout.write(csvFormat(...)) instead of Arrow/Parquet be very slow for large tables?

So CSV/JSON/TSV end up in the cache in that format, not processed by DuckDB?
So SQL to Parquet/Arrow would be a smarter default?

@mbostock (Member Author)

> isn't writing to stdout and then to the cache slower than a direct write

No. Also, data loaders only run at build time, so even a slow data loader won’t affect the performance of your site; the only thing that affects page performance is how big the files are (and how many files and libraries etc. you load on the page).

> So CSV/JSON/TSV end up in the cache in that format, not processed by DuckDB?

Data loaders generate files. That’s it. Whether or not you use DuckDB (either in a data loader, or in the client using DuckDB-Wasm) is up to you. We’ve pointed you at the DuckDB data loader example, and you can find the DuckDB-Wasm examples in our docs.

> So SQL to Parquet/Arrow would be a smarter default?

The Parquet file format is a good choice for efficient representation of tabular data, yes. CSV is also a good choice if you have smaller data sets.

@rinie commented May 31, 2024

Thanks for clarifying this. I mistakenly thought large datasets would be automatically served as Parquet:

> Framework includes built-in support for client-side SQL powered by DuckDB. You can use SQL to query data from CSV, TSV, JSON, Apache Arrow, Apache Parquet, and DuckDB database files, which can either be static or generated by data loaders.

instead of DuckDB processing the generated CSV/TSV/JSON.
I will try to make an Oracle-to-Parquet example, or server-side post-processing of CSV/TSV/JSON to Parquet using DuckDB.

@mbostock (Member Author)

Right, Framework’s client-side SQL code blocks require data files to be in one of several formats supported by DuckDB-Wasm. But you don’t have to use client-side SQL; you can read files in JavaScript in whatever format you want. For example, if you just want to render a chart, you can load a FileAttachment and pass the result directly to Plot.plot.
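
For instance, in a page’s JavaScript code block (the file and column names here are made up):

```js
// FileAttachment, Plot, and display are provided by Framework in page code blocks.
// Load a CSV produced by a data loader and chart it directly with Observable Plot.
const orders = await FileAttachment("data/orders.csv").csv({typed: true});

display(Plot.plot({
  marks: [Plot.lineY(orders, {x: "order_date", y: "amount"})]
}));
```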
