Add extract (like Tableau) #28348
-
Thinking about the needs here. First, you'd need some sort of extraction service that would run queries against something and materialize the results somewhere else, at a specified interval. To build something decently efficient you'd get into dependency management (waiting for the source tables to all have refreshed) and incremental loads (only loading new chunks / recent data). To do this well, it feels like you need the kind of features/guarantees that something like Apache Airflow provides.

In any case, I see how it'd be nice to have a "SCHEDULE THIS SQL" button in SQL Lab, but it's also dangerous. Users can very easily say "run this massively expensive / inefficient query every hour", without code review, without a good mechanism to enforce incremental loads or limit cost, and without a mechanism to make sure the schedule gets removed if unused. I've been close to people building such services in large environments in the past, and we had to put a bunch of safeguards in place to prevent misuse.

As to where to materialize the extract, there are probably a few reasonable options: a materialized view in the source database, importing into a dedicated database, or standalone files (e.g. DuckDB).
To me it just seems like an intuitively interesting idea, but then it makes me think "oh well, this is why we have service specialization": different services are good at different things. As for scheduled ETL workloads, in the era of big data and high cloud costs, I think it's important to have the friction and rigor that code review brings, otherwise it's going to be the wild west. If you bypass the rigor of code review for scheduled workloads, I feel like you need to build another immune system around it to mitigate the risks that come with it.
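To make the trade-offs above concrete, here is a minimal sketch of what such an extract job could look like, assuming a SQLAlchemy-reachable source and a DuckDB file as the extract store. The function, URIs, paths, and the watermark-based incremental strategy are all illustrative assumptions, not existing Superset APIs:

```python
# Illustrative sketch only: refresh_extract, the URIs, and the watermark
# strategy are hypothetical, not part of Superset.
from typing import Optional

import duckdb
import pandas as pd
from sqlalchemy import create_engine, text

def refresh_extract(
    source_uri: str,       # e.g. "postgresql://user:pass@host/db" (made up)
    query: str,            # the SQL Lab query being "scheduled"
    extract_path: str,     # e.g. "/extracts/sales.duckdb" (made up)
    table: str = "extract",
    watermark_col: Optional[str] = None,  # set to enable incremental loads
) -> None:
    engine = create_engine(source_uri)
    con = duckdb.connect(extract_path)
    if watermark_col is None:
        # Full refresh: re-run the whole query and replace the extract.
        df = pd.read_sql(text(query), engine)
        con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM df")
    else:
        # Incremental load: only pull rows newer than the stored watermark.
        # Assumes a prior full refresh created the table, and that the
        # source is append-only on watermark_col.
        (high,) = con.execute(f"SELECT max({watermark_col}) FROM {table}").fetchone()
        df = pd.read_sql(
            text(f"SELECT * FROM ({query}) q WHERE {watermark_col} > :high"),
            engine,
            params={"high": high},
        )
        con.execute(f"INSERT INTO {table} SELECT * FROM df")
    con.close()
```

Even this toy version shows where the sharp edges are: the incremental path silently assumes an append-only source, which is exactly the kind of thing code review (or the safeguards mentioned above) would have to catch.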
-
Hello @mistercrunch, and thanks for taking the time to write this detailed answer. :)
I would say these are steps 2 and 3 of the implementation. From my experience with Tableau, 95% of extracts are "full" extracts run at a specified hour, without dependency triggering, sometimes with a WHERE clause. This may sound outrageous to you (and I totally agree it's not the best), but it's still better/cheaper than querying live for these users.
Don't you think the risk can be reduced by permission management and process? The way I work, before any delivery to a production environment, we run a review against a checklist. And we can even imagine a setting that prohibits extracts for the whole platform, so there are no regressions.
On the materialized view option: easy but pretty useless. If a source is slow, generally because it was built for OLTP rather than OLAP workloads, a materialized view will have limited effect. And it won't work with file sources (Excel, etc.).
On importing into a dedicated database: I'm not worried about the import part, but it means database management, monitoring, queue management, access grants, service accounts, backups and all that, which is why I favored the third solution.
I think you can: https://duckdb.org/docs/configuration/overview.html
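For example (illustrative values; `memory_limit` and `threads` are settings documented on the configuration page linked above):

```python
import duckdb

con = duckdb.connect("/extracts/sales.duckdb")  # made-up path
con.execute("SET memory_limit = '2GB'")  # cap RAM used by this connection
con.execute("SET threads = 4")           # cap CPU parallelism
```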
That's clearly not the idea; we're more in the GB to 10 GB order of magnitude ;)
> As for scheduled ETL workloads, in the era of big data and high cloud costs, I think it's important to have the friction and rigor that code review brings, otherwise it's going to be the wild west. If you bypass the rigor of code review for scheduled workloads, I feel like you need to build another immune system around it to mitigate the risks that come with it.

I totally understand your caution; it may look like opening Pandora's box. But as I explained before, and because of my experience with Tableau, I'm way more optimistic: a good delivery process (and by good, I mean effective), user training, and permissions can reduce the risk to almost nothing, without having to build an over-engineered contraption.

Best regards,
Simon
-
Hello all,
This follows the discussion in #3564 and, as agreed with @mistercrunch, I'm opening a fresh new issue, hoping I'm in the right section and that this wasn't meant to be a SIP (I'm not a Superset developer, so I would have a much harder time handling the technical aspects).
The idea is to implement the extract functionality of Tableau (it also exists in other solutions like Power BI under other names, but "extract" is probably the most accurate term).
If we go to the Tableau page about extracts (https://help.tableau.com/current/pro/desktop/en-us/extracting_data.htm), the feature basically consists of copying the data from a data source (like Excel, SQL Server, or Hive) to the dataviz machine (server or desktop) in a format that is more usable (Tableau used TDE, now Hyper, which is pretty fast). The extract is of course not stored in the repository/metadata database of the dataviz platform but in separate files.
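As a rough sketch of what "copy the data into separate files in a more usable format" could look like, using DuckDB (suggested further down) in the Hyper role; the paths and table name are made up:

```python
import duckdb

# Build the "extract" as a standalone file, separate from the platform's
# repository/metadata database.
con = duckdb.connect("/extracts/sales.duckdb")  # made-up path
con.execute(
    "CREATE OR REPLACE TABLE sales AS "
    "SELECT * FROM read_csv_auto('/data/sales.csv')"  # DuckDB ingests the CSV directly
)
con.close()
```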
The main benefit is performance compared to the initial data source (imagine querying a million-row CSV... or even 20 million rows on Apache Hive), but it also means we can do some operations on it that wouldn't be possible with the original format. The drawbacks are mainly having to deal with extract refresh tasks (including their monitoring) and the consumption of dataviz platform resources.
(Note: I'm simplifying for clarity... I have a full slide on all the benefits and drawbacks; this is a huge topic and it has to be implemented cleverly by the customer, as not everything should be extracted.)
Switching between extract and live query should be really easy to do, like just a checkbox.
This is obviously a big change for Apache Superset, but I think it's pretty necessary when competitors all have this feature.
Also, times have changed. There are now some backend databases for extracts that are fully functional (I suggest DuckDB, https://duckdb.org/: fast, open source, embeddable, etc.; pretty sure its team would be honored to help you).
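For what it's worth, a DuckDB file can already be queried through SQLAlchemy via the community duckdb-engine dialect, so the read side of an extract might look something like this (the URI pattern follows that dialect's documentation; the path and table are made up):

```python
# Requires: pip install duckdb-engine
from sqlalchemy import create_engine, text

engine = create_engine("duckdb:///extracts/sales.duckdb")  # made-up path
with engine.connect() as conn:
    (n,) = conn.execute(text("SELECT count(*) FROM sales")).fetchone()
    print(f"{n} rows in the extract")
```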
I also wonder if you can leverage existing solutions for the task/job management.
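One candidate worth noting: Superset already bundles Celery for alerts and reports, and Celery beat can express the "full extract at a specified hour" pattern mentioned earlier in this thread. A hypothetical sketch (the task names, module, and arguments are all made up, and refresh_extract is the function sketched near the top of this thread):

```python
# Hypothetical wiring, not Superset code: schedule a nightly full refresh
# with Celery beat.
from celery import Celery
from celery.schedules import crontab

from extracts_lib import refresh_extract  # hypothetical module holding the earlier sketch

app = Celery("extracts", broker="redis://localhost:6379/0")  # made-up broker

app.conf.beat_schedule = {
    "refresh-sales-extract": {
        "task": "extracts.refresh_sales",
        "schedule": crontab(hour=2, minute=0),  # the "specified hour"
    },
}

@app.task(name="extracts.refresh_sales")
def refresh_sales() -> None:
    refresh_extract(
        source_uri="postgresql://user:pass@host/db",      # made up
        query="SELECT * FROM sales WHERE region = 'EU'",  # the optional WHERE clause
        extract_path="/extracts/sales.duckdb",            # made up
        table="sales",
    )
```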
Best regards,
Simon