
Frequently Asked Questions

Please ask new questions as a Q&A in discussions.

What is PuffinDB?

PuffinDB is a serverless HTAP cloud data platform powered by Arrow × DuckDB × Iceberg.

Why should I use PuffinDB?

If you are using DuckDB client-side with any client application, adding the PuffinDB extension will let you:

Why develop yet another distributed SQL engine?

Many excellent distributed SQL engines are available today. Why do we need yet another one?

What is Cloud Data?

Cloud Data is what comes after Big Data.

What is Edge-Driven Data Integration?

Edge-Driven Data Integration (EDDI) is an inversion of control proposed by the PuffinDB project. Its main idea is that data integration should be driven at the edge by dynamic user-driven integration scenarios, rather than on the cloud with static data integration pipelines, yet without sacrificing solid architecture design and proper data governance (read more).

Why should I run DuckDB on the cloud instead of my personal computer?

While running DuckDB on your personal computer will work great in many instances, running it on the cloud can bring several benefits:

  • No need to download large datasets from the cloud to your local computer.
  • Ability to work with larger datasets by taking advantage of fleets of AWS Lambda functions and/or large Amazon EC2 instances.
  • Enforcement of column- and/or row-level access control policies.

Why should I run DuckDB both client-side and cloud-side?

  • Low-latency analytics on whatever amount of data you are willing to cache on your client.
  • Ability to handle datasets that are too large for your client.
  • Concurrent users editing the same cloud-side table.

Can I use PuffinDB without a Lakehouse?

Yes, you can use PuffinDB with just an Object Store like Amazon S3, as sketched below. But you should still take a look at Iceberg, for the reasons outlined in the next answer.
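For illustration, here is a minimal sketch of what Object-Store-only querying looks like with plain DuckDB and its httpfs extension (shown with the Python duckdb package; the bucket and object paths are placeholders):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # gives DuckDB direct S3 access
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")
# S3 credentials must also be configured (e.g. via DuckDB's S3 settings or secrets)

# 's3://my-bucket/events/*.parquet' is a placeholder path
rows = con.execute("""
    SELECT event_type, COUNT(*) AS n
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY event_type
""").fetchall()
```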

Why should I use a Lakehouse like Iceberg instead of just an Object Store like S3?

A Lakehouse such as Apache Iceberg, Apache Hudi, or Delta Lake offers many critical features:

Credits: feature descriptions courtesy of Apache Iceberg.

Won't DuckDB support Iceberg soon?

Yes, it will (or so we've read). Nevertheless, this support is likely to be limited to DuckDB running on a single host (client, function, or VM). PuffinDB is designed to support DuckDB running on multiple functions in parallel, or multiple functions and a VM coordinating their work. Consequently, while DuckDB can directly request a table scan through Iceberg's Java API when running in single-host mode, this scan must be performed outside of DuckDB when running in multi-host mode. Furthermore, PuffinDB supports the optional execution of some read queries on Amazon Athena, whenever such delegated execution would be faster or more cost-effective.

Why not just use Amazon Athena?

Amazon Athena can be used to read and write any Iceberg table, but many read queries will be faster and more cost-effective with DuckDB.

Do I need Amazon Athena to query Iceberg tables with PuffinDB?

If you just make read queries on Iceberg tables, you do not. But you will need Amazon Athena for write queries on these tables.

Alternatively, you will be able to use Icecap once it becomes available.

Will I get billed for Amazon Athena when executing read queries?

No. You will only be billed for AWS Lambda functions, because the Iceberg Java API used for table scans is packaged as a standalone AWS Lambda function running without Apache Spark. This ensures the lowest possible cost and a marginal table-scan latency under 500 ms. You will be billed for Amazon Athena only when making write queries on Iceberg tables. Finally, write queries made directly on Object Store objects do not use Apache Iceberg, hence do not require Amazon Athena.
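As a rough illustration of that packaging, a client could request a table scan from such a standalone Lambda function with a plain invocation. The function name, payload, and response shape below are hypothetical, not PuffinDB's actual interface:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# "iceberg-table-scan" and the payload/response fields are hypothetical names
# used for illustration only.
response = lambda_client.invoke(
    FunctionName="iceberg-table-scan",
    Payload=json.dumps({
        "table": "warehouse.events",
        "filter": "event_date >= DATE '2023-01-01'",
    }),
)
scan = json.loads(response["Payload"].read())
partition_files = scan["files"]   # Parquet object URIs that DuckDB will read
```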

Which cloud platforms will be supported?

Initially, AWS. Support for Microsoft Azure and Google Cloud will be added in future releases.

Which Lakehouse platforms will be supported?

Apache Iceberg, Apache Hudi, and Delta Lake.

Why support so many deployment options?

So that you can pick the one that will work best for you:

Why not support private cloud deployment?

Modern public clouds like Amazon Web Services, Microsoft Azure, and Google Cloud are true marvels of engineering, and have reached a level of sophistication and scale that is far beyond the reach of any corporate IT organization. As time goes by, the very concept of a private cloud makes less and less sense. We have no interest in spending resources to deliver a sub-par user experience that will appeal to fewer and fewer organizations.

Be on the right side of history...

Can PuffinDB be deployed on EC2 instances or AWS Fargate?

PuffinDB is deployed on AWS Lambda functions and one Amazon EC2 instance. Support for AWS Fargate will be added later on.

How is PuffinDB serverless if it is deployed on EC2 instances?

Fair question. Whenever PuffinDB is deployed on Amazon EC2 instances, it still uses fleets of AWS Lambda functions for loading objects from the Object Store and for processing some parts of distributed queries, while using the EC2 instances for reduction purposes. The goal is to be as serverless as possible while remaining pragmatic (AWS Fargate tasks remain much less powerful than the largest EC2 instances). Eventually, we expect Fargate to become powerful enough that we will no longer need to rely on EC2 instances for the vast majority of applications and workloads.
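To make the division of labor concrete, here is a toy sketch of the pattern described above, with the Lambda side and the EC2 side collapsed into two local Python functions. The actual engine does considerably more; this only illustrates the partial-aggregate/merge split:

```python
import duckdb

def lambda_partial_count(partition_uri: str) -> list[tuple]:
    """What each AWS Lambda worker would do: aggregate its own partition."""
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    return con.execute(
        f"SELECT event_type, COUNT(*) AS n "
        f"FROM read_parquet('{partition_uri}') GROUP BY event_type"
    ).fetchall()

def ec2_reduce(partials: list[list[tuple]]) -> list[tuple]:
    """What the EC2 instance would do: merge the partial aggregates."""
    con = duckdb.connect()
    con.execute("CREATE TABLE partials (event_type VARCHAR, n BIGINT)")
    for rows in partials:
        con.executemany("INSERT INTO partials VALUES (?, ?)", rows)
    return con.execute(
        "SELECT event_type, SUM(n) FROM partials GROUP BY event_type"
    ).fetchall()
```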

Do I need a specific client to use PuffinDB?

No. PuffinDB has a clientless architecture and can be used from any application embedding the DuckDB engine, using a simple extension.
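This relies on DuckDB's standard extension mechanism. The snippet below shows that mechanism with a stock extension; the "puffindb" name in the comments is hypothetical and only stands in for whatever the actual extension will be called:

```python
import duckdb

con = duckdb.connect()

# The generic mechanism available to any application embedding DuckDB:
con.install_extension("httpfs")   # a stock extension, used here as a stand-in
con.load_extension("httpfs")

# A PuffinDB extension would be installed and loaded the same way, e.g.:
# con.install_extension("puffindb")   # hypothetical extension name
# con.load_extension("puffindb")
```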

Why embed PRQL and Malloy?

PuffinDB embeds a PRQL to SQL translator and a Malloy to SQL translator. Many such translators are available today, for many different applications and syntaxes, but PuffinDB selected these two because they address complementary needs: PRQL is great for data preparation, while Malloy is unparalleled for data analytics. Both have been embedded to stress the platform in different ways, and to provide ready-to-use "applications" that convincingly showcase the platform's capabilities. Nevertheless, they are packaged as optional modules that can be omitted if so desired. Later on, they will be dynamically injected using the Query Proxy.
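As a flavor of what the PRQL translator does, here is a PRQL query a user might write and a hand-written SQL approximation of what a PRQL-to-SQL translator would emit (not the embedded translator's actual output), executed directly with DuckDB:

```python
import duckdb

# PRQL a user might write:
#   from trips
#   filter passenger_count > 2
#   aggregate { avg_fare = average fare_amount }
#
# Approximately equivalent SQL:
sql = """
    SELECT AVG(fare_amount) AS avg_fare
    FROM trips
    WHERE passenger_count > 2
"""

con = duckdb.connect()
con.execute(
    "CREATE TABLE trips AS "
    "SELECT * FROM (VALUES (3, 12.5), (1, 7.0), (4, 20.0)) t(passenger_count, fare_amount)"
)
print(con.execute(sql).fetchall())   # [(16.25,)]
```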

How are read queries on Lakehouse tables executed?

  1. Parse SQL query using native DuckDB parser and extract table filter predicates.
  2. Scan table(s) using low-latency Lakehouse Catalog API running on AWS Lambda function.
  3. Replace reference(s) to table(s) with references to partitions returned by table scan(s).
  4. Stringify SQL query using native DuckDB stringifier.
  5. Execute query using DuckDB running on AWS Lambda function.
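A highly simplified sketch of steps 2-5: the partition list below stands in for what the table-scan Lambda would return, and the rewrite is done with naive string substitution rather than DuckDB's parser and stringifier:

```python
import duckdb

# Placeholder URIs standing in for the table scan's result (step 2)
partition_files = [
    "s3://my-bucket/warehouse/events/part-000.parquet",
    "s3://my-bucket/warehouse/events/part-001.parquet",
]

# Step 3: replace the table reference with the scanned partitions
original_query = "SELECT event_type, COUNT(*) FROM events GROUP BY event_type"
file_list = ", ".join(f"'{f}'" for f in partition_files)
rewritten = original_query.replace("FROM events", f"FROM read_parquet([{file_list}])")

# Step 5: execute the rewritten query with DuckDB (on an AWS Lambda function)
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
rows = con.execute(rewritten).fetchall()
```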

How does partition caching work?

Whenever you execute a query on a table (be it backed by objects on the Object Store or a table in the Lakehouse) and are planning to make multiple queries on the same subset of the table, its partitions can be cached within the memory of the AWS Lambda function(s) used to execute the query. During subsequent queries on the same partitions, the function has a 99% probability of enjoying a hot start, giving it instant access to previously-cached partitions. This helps reduce latency and cost, since downloading partitions from the Object Store to the function is usually the longest step in the end-to-end query process (at least for relatively simple queries). The only drawback of such an optimization is that it slows the initial query down, because it requires an extra copy of the data in memory, but this is a small price to pay if many queries are to be made on the same table partitions.
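The mechanism relies on the fact that module-level state in a Lambda function survives across invocations when the container gets a hot start. The sketch below is an illustrative assumption about how such a cache could be laid out, not PuffinDB's actual implementation:

```python
import duckdb

# Module-level state: reused across invocations on a hot start
_con = duckdb.connect()
_con.execute("INSTALL httpfs")
_con.execute("LOAD httpfs")
_cached = {}   # partition URI -> name of the in-memory DuckDB table holding it

def handler(event, context):
    uri = event["partition_uri"]          # hypothetical event shape
    if uri not in _cached:                # cold path: download and materialize once
        table = f"part_{len(_cached)}"
        _con.execute(f"CREATE TABLE {table} AS SELECT * FROM read_parquet('{uri}')")
        _cached[uri] = table
    # hot path: subsequent invocations reuse the in-memory copy
    return _con.execute(f"SELECT COUNT(*) FROM {_cached[uri]}").fetchall()
```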

How does query result caching work?

Whenever you execute a read query, its result can be cached on the Object Store (Amazon S3), and this cache is automatically referenced in the query logs table. Recent entries in this table are cached in the main AWS Lambda function for fast lookup. Whenever the same query is requested again, its cached result can be returned instead of re-executing the query. This helps reduce latency and cost, while freeing limited resources (most AWS accounts are limited to 3,000 concurrent Lambda functions) for other queries. Cached query results can be served through a Content Delivery Network (Amazon CloudFront) using Object Store Origins to further reduce latency and remove any potential bottleneck on the Object Store.
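A minimal sketch of the lookup logic, keying the cache on a hash of the normalized query text; the bucket name, key scheme, and serialization are assumptions for illustration:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
CACHE_BUCKET = "my-query-cache"   # placeholder bucket name

def cache_key(sql: str) -> str:
    normalized = " ".join(sql.split()).lower()   # illustrative normalization
    return "query-cache/" + hashlib.sha256(normalized.encode()).hexdigest()

def lookup_or_execute(sql: str, execute_fn) -> bytes:
    key = cache_key(sql)
    try:
        return s3.get_object(Bucket=CACHE_BUCKET, Key=key)["Body"].read()   # cache hit
    except ClientError:
        result = execute_fn(sql)                               # cache miss: run the query
        s3.put_object(Bucket=CACHE_BUCKET, Key=key, Body=result)
        return result
```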

Where can I learn more about SQL query engines?

These reference materials are a solid starting point.

Which languages is PuffinDB written in?

PuffinDB uses the following languages:

Why is STOIC initiating and funding this open source project?

STOIC is in the business of developing and selling a progressive data platform allowing any data citizen to interact with datasets of any size, from kilobytes to petabytes. In that context, we need to perform read/write SQL queries on large datasets, with low latency (2 s or less) and low cost (one or two orders of magnitude lower than conventional solutions). And we need the required query engine to properly support the top-three cloud platforms and the top-three table formats. Furthermore, we want it powered by Arrow and DuckDB, because there are no better technologies available today. We could not find such an engine distributed under a liberal open source license, so we decided to build one. And because this component is a means to an end for us, yet could benefit countless other projects and organizations, we decided to develop it as an open source project.

What is STOIC?

STOIC is a progressive data platform allowing any data citizen to interact with datasets of any size, through an intuitive spreadsheet-like user experience. STOIC is built on top of PuffinDB, with a modern cloud-native architecture and a unique hierarchical caching stack. Its user interface is designed for a very wide range of personas, including business analysts, data architects, and software engineers. It combines a table editor, sheet editor, and notebook editor into a fully integrated and coherent user experience. It supports data pipelines, data editing, workflow automation, and insight-driven data visualization. For more information, please check this presentation.

Will STOIC offer hosting services for PuffinDB?

Yes.

Why should my organization become a sponsor?

Vendors and users alike should consider sponsoring the PuffinDB project, for several reasons outlined on our sponsors page.

When will PuffinDB be released?

A first version with a naive distributed query engine (filter pushdown to serverless functions) will be released in Q2 (cf. roadmap).

How may I contribute?

Please send an email to info at puffindb dot io.