Add extract (like Tableau) #28348
-
Thinking about the needs here. First, you'd need some sort of extraction service that would run queries against something and materialize the results somewhere else, at a specified interval. To build something decently efficient you'd get into dependency management (waiting for the source tables to all have refreshed) and incremental loads (only loading new chunks / recent data). To do this well, it feels like you need the kind of features/guarantees that something like Apache Airflow provides.

In any case, I see how it'd be nice to have a "SCHEDULE THIS SQL" button in SQL Lab, but it's also dangerous. Users can very easily say "run this massively expensive / inefficient query every hour", without code review, without a good mechanism to enforce incremental loads or limit cost, and without a mechanism to make sure the schedule gets removed if unused. I've been close to people building such services in large environments in the past, and we had to put a bunch of safeguards in place to prevent misuse.

As to where to materialize the extract, there are probably a few reasonable options: a materialized view in the source database, importing into a dedicated database, or standalone files (e.g. DuckDB).
To me it just seems like an intuitively interesting idea, but then it makes me think "oh well, this is why we have service specialization": different services are good at different things. As for scheduled ETL workloads, in the era of big data and high cloud costs, I think it's important to have the friction and rigor that code review brings, otherwise it's going to be the wild west. If you bypass the rigor of code review for scheduled workloads, I feel like you need to build another immune system around it to mitigate the risks that come with it.
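To make the trade-offs above concrete, here is a minimal sketch of what such an extract job could look like, assuming a SQLAlchemy-reachable source and a DuckDB file as the extract store. The function, URIs, paths, and the watermark-based incremental strategy are all illustrative assumptions, not existing Superset APIs:

```python
# Illustrative sketch only: refresh_extract, the URIs, and the watermark
# strategy are hypothetical, not part of Superset.
from typing import Optional

import duckdb
import pandas as pd
from sqlalchemy import create_engine, text

def refresh_extract(
    source_uri: str,       # e.g. "postgresql://user:pass@host/db" (made up)
    query: str,            # the SQL Lab query being "scheduled"
    extract_path: str,     # e.g. "/extracts/sales.duckdb" (made up)
    table: str = "extract",
    watermark_col: Optional[str] = None,  # set to enable incremental loads
) -> None:
    engine = create_engine(source_uri)
    con = duckdb.connect(extract_path)
    if watermark_col is None:
        # Full refresh: re-run the whole query and replace the extract.
        df = pd.read_sql(text(query), engine)
        con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM df")
    else:
        # Incremental load: only pull rows newer than the stored watermark.
        # Assumes a prior full refresh created the table, and that the
        # source is append-only on watermark_col.
        (high,) = con.execute(f"SELECT max({watermark_col}) FROM {table}").fetchone()
        df = pd.read_sql(
            text(f"SELECT * FROM ({query}) q WHERE {watermark_col} > :high"),
            engine,
            params={"high": high},
        )
        con.execute(f"INSERT INTO {table} SELECT * FROM df")
    con.close()
```

Even this toy version shows where the sharp edges are: the incremental path silently assumes an append-only source, which is exactly the kind of thing code review (or the safeguards mentioned above) would have to catch.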
-
Hello @mistercrunch, and thanks for taking the time to write this detailed answer. :)
I would say these are steps 2 and 3 of the implementation. From my experience with Tableau, 95% of extracts are "full" extracts run at a specified hour, without dependency triggering, sometimes with a WHERE clause. This may sound outrageous to you (and I totally agree it's not the best), but it's still better/cheaper than querying live for these users.
Don't you think the risk can be reduced by permission management and process? The way I work, before any delivery to a production environment, we run a review against a checklist. And we can even imagine a setting that prohibits extracts for the whole platform, so there are no regressions.
On the materialized view option: easy but pretty useless. If a source is slow, generally because it was built for OLTP rather than OLAP workloads, a materialized view will have limited effect. And it won't work with file sources (Excel, etc.).
On importing into a dedicated database: I'm not worried about the import part, but it means database management, monitoring, queue management, access grants, service accounts, backups and all that, which is why I favored the third solution.
I think you can: https://duckdb.org/docs/configuration/overview.html
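For example (illustrative values; `memory_limit` and `threads` are settings documented on the configuration page linked above):

```python
import duckdb

con = duckdb.connect("/extracts/sales.duckdb")  # made-up path
con.execute("SET memory_limit = '2GB'")  # cap RAM used by this connection
con.execute("SET threads = 4")           # cap CPU parallelism
```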
That's clearly not the idea; we're more in the GB to 10 GB order of magnitude ;)
> As for scheduled ETL workloads, in the era of big data and high cloud costs, I think it's important to have the friction and rigor that code review brings, otherwise it's going to be the wild west. If you bypass the rigor of code review for scheduled workloads, I feel like you need to build another immune system around it to mitigate the risks that come with it.

I totally understand your caution; it may look like opening Pandora's box. But as I explained before, and because of my experience with Tableau, I'm way more optimistic: a good delivery process (and by good, I mean effective), user training, and permissions can reduce the risk to almost nothing, without having to build an over-engineered contraption.

Best regards,
Simon
-
Hello all,
This follows the discussion in #3564 and, as agreed with @mistercrunch, I'm opening a fresh new issue, hoping I'm in the right section and that this wasn't meant to be a SIP (I'm not a Superset developer, so I would have a much harder time handling the technical aspects).
The idea is to implement the extract functionality of Tableau (it also exists in other solutions like Power BI under other names, but "extract" is probably the most accurate term).
If we go to the Tableau page about extracts (https://help.tableau.com/current/pro/desktop/en-us/extracting_data.htm), the feature basically consists of copying the data from a data source (like Excel, SQL Server, or Hive) to the dataviz machine (server or desktop) in a format that is more usable (Tableau used TDE, now Hyper, which is pretty fast). The extract is of course not stored in the repository/metadata database of the dataviz platform but in separate files.
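As a rough sketch of what "copy the data into separate files in a more usable format" could look like, using DuckDB (suggested further down) in the Hyper role; the paths and table name are made up:

```python
import duckdb

# Build the "extract" as a standalone file, separate from the platform's
# repository/metadata database.
con = duckdb.connect("/extracts/sales.duckdb")  # made-up path
con.execute(
    "CREATE OR REPLACE TABLE sales AS "
    "SELECT * FROM read_csv_auto('/data/sales.csv')"  # DuckDB ingests the CSV directly
)
con.close()
```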
The main benefit is performance compared to the initial data source (imagine querying a million-row CSV... or even 20 million rows on Apache Hive), but it also means we can do some operations on it that wouldn't be possible with the original format. The drawbacks are mainly having to deal with extract refresh tasks (including their monitoring) and the consumption of dataviz platform resources.
(Note: I'm simplifying for clarity... I have a full slide on all the benefits and drawbacks; this is a huge topic and it has to be implemented cleverly by the customer, as not everything should be extracted.)
Switching between extract and live query should be really easy to do, like just a checkbox.
This is obviously a big change for Apache Superset, but I think it's pretty necessary when competitors all have this feature.
Also, times have changed. There are now some backend databases for extracts that are fully functional (I suggest DuckDB, https://duckdb.org/: fast, open source, embeddable, etc.; pretty sure its team would be honored to help you).
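For what it's worth, a DuckDB file can already be queried through SQLAlchemy via the community duckdb-engine dialect, so the read side of an extract might look something like this (the URI pattern follows that dialect's documentation; the path and table are made up):

```python
# Requires: pip install duckdb-engine
from sqlalchemy import create_engine, text

engine = create_engine("duckdb:///extracts/sales.duckdb")  # made-up path
with engine.connect() as conn:
    (n,) = conn.execute(text("SELECT count(*) FROM sales")).fetchone()
    print(f"{n} rows in the extract")
```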
I also wonder if you can leverage existing solutions for the task/job management.
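One candidate worth noting: Superset already bundles Celery for alerts and reports, and Celery beat can express the "full extract at a specified hour" pattern mentioned earlier in this thread. A hypothetical sketch (the task names, module, and arguments are all made up, and refresh_extract is the function sketched near the top of this thread):

```python
# Hypothetical wiring, not Superset code: schedule a nightly full refresh
# with Celery beat.
from celery import Celery
from celery.schedules import crontab

from extracts_lib import refresh_extract  # hypothetical module holding the earlier sketch

app = Celery("extracts", broker="redis://localhost:6379/0")  # made-up broker

app.conf.beat_schedule = {
    "refresh-sales-extract": {
        "task": "extracts.refresh_sales",
        "schedule": crontab(hour=2, minute=0),  # the "specified hour"
    },
}

@app.task(name="extracts.refresh_sales")
def refresh_sales() -> None:
    refresh_extract(
        source_uri="postgresql://user:pass@host/db",      # made up
        query="SELECT * FROM sales WHERE region = 'EU'",  # the optional WHERE clause
        extract_path="/extracts/sales.duckdb",            # made up
        table="sales",
    )
```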
Best regards,
Simon