Create some pre-canned standard datasets like the h2o groupby dataset #256

MrPowers · 2024-03-07T14:49:56Z

Expected Behavior

It'd be nice to make some "standard" datasets easily accessible, so users don't have to figure out how to create them from scratch. For example, the datasets used in the h2o benchmarks here.

Here are three rows from the h2o groupby dataset:

It would be nice if I could generate this dataset as follows:

DataGenerator.h2o_groupby(spark, rows=1_000_000_000, partitions=10)
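To make the request concrete, here is a minimal pure-Python sketch of the kind of rows such a helper might produce. The column layout (`id1` to `id6`, `v1` to `v3`), the value ranges, and the `k` parameter are assumptions based on my recollection of the h2o benchmark generator and should be checked against the benchmark scripts. `h2o_groupby_rows` is a hypothetical name, not part of dbldatagen, and a real implementation would build a Spark DataFrame rather than Python dicts.

```python
import random


def h2o_groupby_rows(rows, k=100, seed=0):
    """Yield dicts shaped like the h2o groupby table (assumed layout)."""
    rng = random.Random(seed)
    # Assumed: high-cardinality keys range over rows / k distinct values.
    n_over_k = max(rows // k, 1)
    for _ in range(rows):
        yield {
            "id1": f"id{rng.randint(1, k):03d}",          # low-cardinality string key
            "id2": f"id{rng.randint(1, k):03d}",          # low-cardinality string key
            "id3": f"id{rng.randint(1, n_over_k):010d}",  # high-cardinality string key
            "id4": rng.randint(1, k),                     # low-cardinality integer key
            "id5": rng.randint(1, k),                     # low-cardinality integer key
            "id6": rng.randint(1, n_over_k),              # high-cardinality integer key
            "v1": rng.randint(1, 5),                      # small integer measure
            "v2": rng.randint(1, 15),                     # small integer measure
            "v3": round(rng.uniform(0, 100), 6),          # floating-point measure
        }


# Small sample; the call requested above would use a billion rows on Spark.
sample = list(h2o_groupby_rows(rows=1000, k=10))
```

A fixed seed keeps the sample reproducible, which matters when the dataset is used for benchmarking.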

Current Behavior

I am guessing that there is some way to generate this dataset with the current API, but it might take me a little while to figure it out.

ronanstokes-db · 2024-03-21T19:05:14Z

This is a great idea.

Here's what I would propose:

The API would look something like the following:

import dbldatagen as dg

# using hierarchical naming
df = dg.Datasets("basic/iot_like").get(table="primary", rows=100000, numPartitions=4).build()
# or simply use
df = dg.Datasets("basic/iot_like").get().build()

We could also expose documentation of the available datasets through the API itself, so that we would not have to revise the docs every time we add a new dataset, along the lines of how dbutils is self-describing.

import dbldatagen as dg

dg.Datasets.list("<pattern>") # get summary details
dg.Datasets.describe("basic/iot_like") # get detailed description of dataset

# describe should indicate the tables available, defaults, what the data looks like etc
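As a rough illustration of the self-describing idea, here is a small standalone sketch of a dataset registry with list and describe operations. Everything in it is hypothetical: `DatasetInfo`, `register`, `list_datasets`, `describe`, the summaries, the default values, and the `benchmark/h2o_groupby` name are made up for illustration and are not the dbldatagen API.

```python
import fnmatch
from dataclasses import dataclass, field


@dataclass
class DatasetInfo:
    """Metadata a dataset provider would publish about itself (hypothetical)."""
    name: str
    summary: str
    tables: tuple = ("primary",)
    defaults: dict = field(default_factory=dict)


_REGISTRY = {}


def register(info):
    """Add a dataset to the registry so list/describe can find it."""
    _REGISTRY[info.name] = info
    return info


def list_datasets(pattern="*"):
    """Return (name, summary) pairs whose name matches a glob pattern."""
    return sorted(
        (name, info.summary)
        for name, info in _REGISTRY.items()
        if fnmatch.fnmatchcase(name, pattern)
    )


def describe(name):
    """Return a detailed, human-readable description of one dataset."""
    info = _REGISTRY[name]
    lines = [
        f"Dataset: {info.name}",
        f"Summary: {info.summary}",
        f"Tables: {', '.join(info.tables)}",
    ]
    lines += [f"Default {key}: {value}" for key, value in info.defaults.items()]
    return "\n".join(lines)


# Illustrative entries only; names, summaries, and defaults are assumptions.
register(DatasetInfo("basic/iot_like", "IoT-style device readings",
                     defaults={"rows": 100000, "numPartitions": 4}))
register(DatasetInfo("benchmark/h2o_groupby", "h2o groupby benchmark table"))

print(list_datasets("basic/*"))
print(describe("basic/iot_like"))
```

Because each dataset carries its own metadata, adding a new one only requires a `register` call, and the listing and description output stay current without touching the docs.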

Initial datasets would be (a) the datasets described in the documentation as examples, and (b) a curated set of datasets such as the h2o one you reference.

To others reading this, feel free to suggest datasets

MrPowers · 2024-03-21T20:52:34Z

This looks like a great proposal! Thanks!

ronanstokes-db self-assigned this Mar 21, 2024

ronanstokes-db added the enhancement New feature or request label Mar 21, 2024

ronanstokes-db added this to the v0.3.7 milestone Mar 21, 2024
