Revise arrow slides. (#138)
jonthegeek committed May 10, 2024
1 parent d04e56e commit d4be6ce
Showing 3 changed files with 82 additions and 146 deletions.
22-arrow.Rmd (225 changes: 79 additions & 146 deletions)

**Learning objectives:**

- Use {arrow} to load large data files into R efficiently
- Partition large data files into parquet files for quicker access, less memory usage, and quicker wrangling
- Wrangle {arrow} data using existing {dplyr} operations

## Why learn {arrow}? {-}

- CSV files = very common for ease of access and use
- Big/messy CSVs = slow
- {arrow} 📦 reads large datasets quickly & uses {dplyr} syntax

## Packages used {-}

- If needed, install {arrow} once with `install.packages("arrow")`

```{r arrow-library, warning=F, message=F}
library(arrow)
library(dbplyr)
library(curl)
library(duckdb)
library(tidyverse)
```

## Download data {-}

- Case study: [item checkouts dataset from Seattle libraries](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6)
- **DON'T `download.file()`!!!** (41,389,465 rows of data)
- `curl::multi_download()` handles giant files, shows a progress bar, and can resume interrupted downloads:

```{r, eval=F}
dir.create("data", showWarnings = FALSE)
curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)
#> # A tibble: 1 × 10
#> success status_code resumefrom url destfile error
#> <lgl> <int> <dbl> <chr> <chr> <chr>
#> 1 TRUE 200 0 https://r4ds.s3.us-we… data/seattle-l… <NA>
#> # ℹ 4 more variables: type <chr>, modified <dttm>, time <dbl>,
#> # headers <list>
```
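
Not from the book, but a quick sanity-check sketch that the download finished, plus a look at the CSV's size on disk (in GB):

```{r check-download, eval=F}
# Rough check: does the file exist, and how many GB does it occupy on disk?
file.exists("data/seattle-library-checkouts.csv")
file.size("data/seattle-library-checkouts.csv") / 1024^3
```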

## Open the data {-}

- Size in memory ≈ 2 × size on disk (a ~9 GB CSV would need ~18 GB of RAM to read outright)
- ~~`read_csv()`~~ ➡️ `arrow::open_dataset()`
- Scans a few thousand rows to determine dataset structure
- `ISBN` is empty for the first ~80k rows, so we specify its type explicitly
- Does NOT load entire dataset into memory

```{r, eval=F}
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  col_types = schema(ISBN = string()),
  format = "csv"
)
#> FileSystemDataset with 1 csv file
#> 41,389,465 rows x 12 columns
#> $ UsageClass <string> "Physical", "Physical", "Digital", "Physical", "Ph…
#> $ CheckoutType <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Hor…
#> $ MaterialType <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOO…
#> $ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
#> $ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…
#> $ Title <string> "Super rich : a guide to having it all / Russell S…
#> $ ISBN <string> "", "", "", "", "", "", "", "", "", "", "", "", ""…
#> $ Creator <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim …
#> $ Subjects <string> "Self realization, Conduct of life, Attitude Psych…
#> $ Publisher <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Di…
#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…
```

## Glimpse the data {-}

- `glimpse()` shows the dimensions, column types, and first few values

```{r, eval=F}
seattle_csv |> glimpse()
#> FileSystemDataset with 1 csv file
#> 41,389,465 rows x 12 columns
#> ...
#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…
```

## Manipulate the data {-}

- Can use {dplyr} functions on the data
- Example: total number of checkouts per year

```{r, eval=F}
seattle_csv |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
#> ...
#> # ℹ 12 more rows
```

## Parquet > CSV {-}

- Slow: Manipulating large CSV datasets with {readr}
- Faster: Manipulating large CSV datasets with {arrow}
- Much faster: Manipulating large `parquet` datasets with {arrow}
- Data subdivided into multiple files

## Benefits of parquet {-}

- Smaller files than CSV (efficient encodings + compression)
- Stores datatypes (vs CSV storing everything as text and readers guessing)
- "Column-oriented" ("thinks" like a dataframe: column by column rather than row by row)
- Splits data into chunks you can (often) skip (faster)

**But:**

- Not human-readable: open a parquet file directly and you only get metadata meant for the computer
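
A minimal sketch of the size and column-selection benefits, using a small made-up data frame rather than the Seattle data ({arrow} and {readr} are loaded in the packages chunk above):

```{r parquet-benefits-sketch, eval=F}
# Toy data frame (hypothetical, not the Seattle checkouts)
df <- data.frame(
  id   = 1:1e6,
  year = sample(2005:2022, 1e6, replace = TRUE),
  n    = rpois(1e6, 2)
)

write_csv(df, "demo.csv")
write_parquet(df, "demo.parquet")

# Compression: the parquet file is typically several times smaller
file.size("demo.csv") / file.size("demo.parquet")

# Column-oriented: read back only the columns you need
read_parquet("demo.parquet", col_select = c(year, n)) |> head()
```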

## Partitioning {-}

- Split data across files so analyses can skip unused data
- Experiment to find the best partitioning for your data
- Recommendations:
  - 20 MB < file size < 2 GB
  - <= 10,000 files
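
One way to experiment (a sketch, not the book's code): write the same data with a different partitioning and compare file counts and sizes against the guidelines above. The year-and-month split and the `alt_path` location are hypothetical choices.

```{r partition-experiment, eval=F}
alt_path <- "data/seattle-checkouts-year-month"

seattle_csv |>
  write_dataset(
    path = alt_path,
    format = "parquet",
    partitioning = c("CheckoutYear", "CheckoutMonth")
  )

alt_files <- list.files(alt_path, recursive = TRUE)
length(alt_files)                                            # aim for <= 10,000 files
summary(file.size(file.path(alt_path, alt_files)) / 1024^2)  # aim for roughly 20 MB to 2 GB each
```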

## Seattle library CSV to parquet {-}

- `dplyr::group_by()` to define partitions
- `arrow::write_dataset()` to save as parquet

```{r, eval=F}
# may take long, but makes future work faster
pq_path <- "data/seattle-library-checkouts"
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path, format = "parquet")
```

## Seattle library parquet files {-}

- [Apache Hive](https://hive.apache.org/)-style "self-describing" directory/file names (here, one directory per `CheckoutYear`)

```{r, eval=F}
tibble(
  files = list.files(pq_path, recursive = TRUE),
  size_MB = file.size(file.path(pq_path, files)) / 1024^2
)
#> ...
#> # ℹ 12 more rows
```

## parquet + {arrow} + {dplyr} {-}

- Read the parquet files back in with `open_dataset()` (recall `pq_path = "data/seattle-library-checkouts"`)
- Example: count how many books were checked out each month over the last five years
```{r, eval=F}
seattle_pq <- open_dataset(pq_path)
query <- seattle_pq |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear, CheckoutMonth)
```

## Results (uncollected) {-}

```{r, eval=F}
query
#> FileSystemDataset (query)
#> ...
#> See $.data for the source Arrow object
```

## Results (collected) {-}

```{r, eval=F}
query |> collect()
#> # A tibble: 58 × 3
#> ...
#> # ℹ 52 more rows
```

## Available verbs {-}

- [`?arrow-dplyr`](https://arrow.apache.org/docs/r/reference/acero.html) for supported functions
- (book uses `?acero` but that's way harder to remember)
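
When a function isn't on that list, the usual pattern (sketched below, not from the book) is to run the supported verbs lazily, `collect()`, and finish in regular {dplyr}; the `cor()` step here is purely illustrative.

```{r acero-fallback-sketch, eval=F}
seattle_pq |>
  filter(MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>  # handled by Arrow
  collect() |>                                   # now a regular tibble
  ungroup() |>
  summarize(cor_month_checkouts = cor(CheckoutMonth, TotalCheckouts))  # plain R from here on
```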

## Performance {-}

- How long does it take to count the books checked out per month in 2021?
- First, reading the whole CSV:
```{r, eval=F}
seattle_csv |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |>
  system.time()
#> user system elapsed
#> 11.951 1.297 11.387
```

- Then, using the parquet files instead:
```{r, eval=F}
seattle_pq |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |>
  system.time()
#> user system elapsed
#> 0.263 0.058 0.063
```

- CSV: 11.951s
- Parquet: **0.263s**
- Why so much faster? Partitioning plus column-oriented binary storage: arrow only touches the `CheckoutYear = 2021` partition and only reads the columns used in the query.

## Using {duckdb} with {arrow} {-}

- `arrow::to_duckdb()` turns the {arrow} dataset into a {duckdb} database (as seen in Ch 21)
- No memory copying needed, so the transition between formats is easy:
```{r, eval=F}
seattle_pq |>
  to_duckdb() |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutYear)) |>
  collect()
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 5 × 2
#> CheckoutYear TotalCheckouts
#> <int> <dbl>
#> 1 2022 2431502
#> 2 2021 2266438
#> 3 2020 1241999
#> 4 2019 3931688
#> 5 2018 3987569
```
- The work still happens outside of R's memory
- Discussion: Why switch engines?
  - More tidyverse verbs?
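
For example, a sketch (not from the book) of a window-style verb going through {dbplyr}'s SQL translation after the hand-off; support for any particular verb in Arrow's own engine varies by version, so treat this as illustrative rather than a definitive comparison.

```{r duckdb-verbs-sketch, eval=F}
seattle_pq |>
  to_duckdb() |>
  filter(MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>  # result stays grouped by CheckoutYear
  slice_max(TotalCheckouts, n = 1) |>            # busiest month per year, via SQL window functions
  arrange(CheckoutYear) |>
  collect()
```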

## Meeting Videos

DESCRIPTION (1 change: 1 addition & 0 deletions)

Imports:
    arrow,
    babynames,
    bookdown,
    curl,
    DBI,
    dbplyr,
    details,
data/.gitignore (2 changes: 2 additions & 0 deletions)
seattle-library-checkouts
seattle-library-checkouts.csv
