Revise arrow slides. (#138)
jonthegeek committed May 10, 2024
1 parent d04e56e commit d4be6ce
Showing 3 changed files with 82 additions and 146 deletions.
22-arrow.Rmd (225 changes: 79 additions & 146 deletions)

**Learning objectives:**

- Use {arrow} to load large data files into R efficiently
- Partition large data files into parquet files for quicker access, less memory usage, and quicker wrangling
- Wrangle {arrow} data using existing {dplyr} operations

## Why learn {arrow}? {-}

- CSV files = very common for ease of access and use
- Big/messy CSVs = slow
- {arrow} 📦 reads large datasets quickly & uses {dplyr} syntax

## Packages used {-}

- If needed, install {arrow} once with `install.packages("arrow")`

```{r arrow-library, warning=F, message=F}
library(arrow)
library(dbplyr)
library(curl)
library(duckdb)
library(tidyverse)
```

## Download data {-}

- Case study: [item checkouts dataset from Seattle libraries](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6)
- **DON'T `download.file()`!!!** (41,389,465 rows of data)
- `curl::multi_download()` handles giant files, shows a progress bar, and can resume interrupted downloads:

```{r, eval=F}
dir.create("data", showWarnings = FALSE)
curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)
#> # A tibble: 1 × 10
#> success status_code resumefrom url destfile error
#> <lgl> <int> <dbl> <chr> <chr> <chr>
#> 1 TRUE 200 0 https://r4ds.s3.us-we… data/seattle-l… <NA>
#> # ℹ 4 more variables: type <chr>, modified <dttm>, time <dbl>,
#> # headers <list>
```
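
Not from the book, but a quick sanity-check sketch that the download finished, plus a look at the CSV's size on disk (in GB):

```{r check-download, eval=F}
# Rough check: does the file exist, and how many GB does it occupy on disk?
file.exists("data/seattle-library-checkouts.csv")
file.size("data/seattle-library-checkouts.csv") / 1024^3
```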

## Open the data {-}

- Size in memory ≈ 2 × size on disk (a ~9 GB CSV would need ~18 GB of RAM to read outright)
- ~~`read_csv()`~~ ➡️ `arrow::open_dataset()`
- Scans a few thousand rows to determine dataset structure
- `ISBN` is empty for the first ~80k rows, so we specify its type explicitly
- Does NOT load entire dataset into memory

```{r, eval=F}
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  col_types = schema(ISBN = string()),
  format = "csv"
)
#> FileSystemDataset with 1 csv file
#> 41,389,465 rows x 12 columns
#> $ UsageClass <string> "Physical", "Physical", "Digital", "Physical", "Ph…
#> $ CheckoutType <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Hor…
#> $ MaterialType <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOO…
#> $ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
#> $ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…
#> $ Title <string> "Super rich : a guide to having it all / Russell S…
#> $ ISBN <string> "", "", "", "", "", "", "", "", "", "", "", "", ""…
#> $ Creator <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim …
#> $ Subjects <string> "Self realization, Conduct of life, Attitude Psych…
#> $ Publisher <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Di…
#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…
```

## Glimpse the data {-}

- `glimpse()` shows the dimensions, column types, and first few values

```{r, eval=F}
seattle_csv |> glimpse()
#> FileSystemDataset with 1 csv file
#> 41,389,465 rows x 12 columns
#> ...
#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…
```

## Manipulate the data {-}

- Can use {dplyr} functions on the data
- Example: total number of checkouts per year

```{r, eval=F}
seattle_csv |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
#> ...
#> # ℹ 12 more rows
```

## Parquet > CSV {-}

- Slow: Manipulating large CSV datasets with {readr}
- Faster: Manipulating large CSV datasets with {arrow}
- Much faster: Manipulating large `parquet` datasets with {arrow}
- Data subdivided into multiple files

## Benefits of parquet {-}

- Smaller files than CSV (efficient encodings + compression)
- Stores datatypes (vs CSV storing everything as text and readers guessing)
- "Column-oriented" ("thinks" like a dataframe: column by column rather than row by row)
- Splits data into chunks you can (often) skip (faster)

**But:**

- Not human-readable: open a parquet file directly and you only get metadata meant for the computer
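
A minimal sketch of the size and column-selection benefits, using a small made-up data frame rather than the Seattle data ({arrow} and {readr} are loaded in the packages chunk above):

```{r parquet-benefits-sketch, eval=F}
# Toy data frame (hypothetical, not the Seattle checkouts)
df <- data.frame(
  id   = 1:1e6,
  year = sample(2005:2022, 1e6, replace = TRUE),
  n    = rpois(1e6, 2)
)

write_csv(df, "demo.csv")
write_parquet(df, "demo.parquet")

# Compression: the parquet file is typically several times smaller
file.size("demo.csv") / file.size("demo.parquet")

# Column-oriented: read back only the columns you need
read_parquet("demo.parquet", col_select = c(year, n)) |> head()
```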

## Partitioning {-}

- Split data across files so analyses can skip unused data
- Experiment to find the best partitioning for your data
- Recommendations:
  - 20 MB < file size < 2 GB
  - <= 10,000 files
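
One way to experiment (a sketch, not the book's code): write the same data with a different partitioning and compare file counts and sizes against the guidelines above. The year-and-month split and the `alt_path` location are hypothetical choices.

```{r partition-experiment, eval=F}
alt_path <- "data/seattle-checkouts-year-month"

seattle_csv |>
  write_dataset(
    path = alt_path,
    format = "parquet",
    partitioning = c("CheckoutYear", "CheckoutMonth")
  )

alt_files <- list.files(alt_path, recursive = TRUE)
length(alt_files)                                            # aim for <= 10,000 files
summary(file.size(file.path(alt_path, alt_files)) / 1024^2)  # aim for roughly 20 MB to 2 GB each
```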

## Seattle library CSV to parquet {-}

- `dplyr::group_by()` to define partitions
- `arrow::write_dataset()` to save as parquet

```{r, eval=F}
# may take long, but makes future work faster
pq_path <- "data/seattle-library-checkouts"
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path, format = "parquet")
```

## Seattle library parquet files {-}

- [Apache Hive](https://hive.apache.org/)-style "self-describing" directory/file names (here, one directory per `CheckoutYear`)

```{r, eval=F}
tibble(
  files = list.files(pq_path, recursive = TRUE),
  size_MB = file.size(file.path(pq_path, files)) / 1024^2
)
#> ...
#> # ℹ 12 more rows
```

## parquet + {arrow} + {dplyr} {-}

- Read the parquet files back in with `open_dataset()` (recall `pq_path = "data/seattle-library-checkouts"`)
- Example: count how many books were checked out each month over the last five years
```{r, eval=F}
seattle_pq <- open_dataset(pq_path)
query <- seattle_pq |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(CheckoutYear, CheckoutMonth)
```

## Results (uncollected) {-}

```{r, eval=F}
query
#> FileSystemDataset (query)
#> ...
#> See $.data for the source Arrow object
```

## Results (collected) {-}

```{r, eval=F}
query |> collect()
#> # A tibble: 58 × 3
#> ...
#> # ℹ 52 more rows
```

## Available verbs {-}

- [`?arrow-dplyr`](https://arrow.apache.org/docs/r/reference/acero.html) for supported functions
- (book uses `?acero` but that's way harder to remember)
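
When a function isn't on that list, the usual pattern (sketched below, not from the book) is to run the supported verbs lazily, `collect()`, and finish in regular {dplyr}; the `cor()` step here is purely illustrative.

```{r acero-fallback-sketch, eval=F}
seattle_pq |>
  filter(MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>  # handled by Arrow
  collect() |>                                   # now a regular tibble
  ungroup() |>
  summarize(cor_month_checkouts = cor(CheckoutMonth, TotalCheckouts))  # plain R from here on
```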

## Performance {-}

- How long does it take to count the books checked out per month in 2021?
- First, reading the whole CSV:
```{r, eval=F}
seattle_csv |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |>
  system.time()
#> user system elapsed
#> 11.951 1.297 11.387
```

- Then, using the parquet files instead:
```{r, eval=F}
seattle_pq |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |>
  system.time()
#> user system elapsed
#> 0.263 0.058 0.063
```

- CSV: 11.951s
- Parquet: **0.263s**
- Why so much faster? Partitioning plus column-oriented binary storage: arrow only touches the `CheckoutYear = 2021` partition and only reads the columns used in the query.

## Using {duckdb} with {arrow} {-}

- `arrow::to_duckdb()` turns the {arrow} dataset into a {duckdb} database (as seen in Ch 21)
- No memory copying needed, so the transition between formats is easy:
```{r, eval=F}
seattle_pq |>
  to_duckdb() |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutYear)) |>
  collect()
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # A tibble: 5 × 2
#> CheckoutYear TotalCheckouts
#> <int> <dbl>
#> 1 2022 2431502
#> 2 2021 2266438
#> 3 2020 1241999
#> 4 2019 3931688
#> 5 2018 3987569
```
- The work still happens outside of R's memory
- Discussion: Why switch engines?
  - More tidyverse verbs?
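
For example, a sketch (not from the book) of a window-style verb going through {dbplyr}'s SQL translation after the hand-off; support for any particular verb in Arrow's own engine varies by version, so treat this as illustrative rather than a definitive comparison.

```{r duckdb-verbs-sketch, eval=F}
seattle_pq |>
  to_duckdb() |>
  filter(MaterialType == "BOOK") |>
  group_by(CheckoutYear, CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>  # result stays grouped by CheckoutYear
  slice_max(TotalCheckouts, n = 1) |>            # busiest month per year, via SQL window functions
  arrange(CheckoutYear) |>
  collect()
```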

## Meeting Videos

DESCRIPTION (1 change: 1 addition & 0 deletions)

Imports:
    arrow,
    babynames,
    bookdown,
    curl,
    DBI,
    dbplyr,
    details,
data/.gitignore (2 changes: 2 additions & 0 deletions)
seattle-library-checkouts
seattle-library-checkouts.csv
