Update iteration for 2e (#143)
jonthegeek committed Jun 11, 2024
1 parent 868b34b commit 051ed8d
Showing 1 changed file with 134 additions and 108 deletions.
242 changes: 134 additions & 108 deletions 26-iteration.Rmd
@@ -2,152 +2,178 @@

**Learning objectives:**

- **Reduce duplication** in code by **iterating** over a pattern using a **`for` loop.**
- **Modify an object** using a **`for` loop.**
- Recognize the **three basic ways** to **loop over a vector** using a **`for` loop.**
- Handle **unknown output lengths** by using a **more complex result** and then **combining** the result after the **`for` loop.**
- Handle **unknown sequence lengths** with a **`while` loop.**
- **Reduce duplication** in code by passing a function as an argument to a function **(functional programming).**
- Make iteration code **easier to read** with the **`map` family** of functions from **`{purrr}`.**
- Make `map` code **faster to type** with **shortcuts.**
- **Compare** the **`map` family** of functions to the base R **`apply` family.**
- **Deal with errors** and other output using the **`{purrr}` adverbs: `safely`, `possibly`, and `quietly`.**
- Map over **multiple arguments** with the **`map2` family** and the **`pmap` family** from `{purrr}`.
- Map over **multiple functions** with the **`invoke_map` family** from `{purrr}` and its replacements (see `?purrr::invoke_map`, "Life cycle").
- Call a function for its **side effects** using the **`walk`** family of functions from `{purrr}`.
- Recognize the **other iteration functions** from `{purrr}`.
- Modify multiple columns using the same patterns.
- Filter based on the contents of multiple columns.
- Process multiple files.
- Write multiple files.

## Intro

**Good iteration - Pre-allocate the shape/structure and then fill in the data**.

**Imperative programming** - `for` loops and `while` loops make iteration very explicit.

**Functional programming** - common `for` loop patterns get their own functions, so you can streamline your code even more.

Packages: `library(tidyverse)`, `library(purrr)`

## For loops

Each loop has three components:

- **output** - Allocate enough space for the output before the loop starts. An output that grows at each iteration will be very slow.

- **sequence** - What to loop over.

- `seq_along()` is a safe version of the familiar `1:length(x)`: for a zero-length vector it returns an empty sequence, so the loop body never runs, whereas `1:length(x)` gives `c(1, 0)`.

- **body** - This is the code that does the work.
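
A minimal sketch of the three components, assuming a small made-up data frame called `scores` (not part of the original notes):

```r
# Hypothetical example data: three numeric columns.
scores <- data.frame(a = c(1, 5, 3), b = c(2, 2, 8), c = c(9, 4, 6))

# 1. Output: pre-allocate a double vector with one slot per column.
output <- vector("double", ncol(scores))

# 2. Sequence: seq_along() gives one index per column
#    (and an empty sequence if there are no columns).
for (i in seq_along(scores)) {
  # 3. Body: compute one median per column and store it in its slot.
  output[[i]] <- median(scores[[i]])
}

output
```

Because `output` already has its final length, each iteration only fills a slot instead of copying a growing vector.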

## For loop variations

There are four variations on the basic theme of the `for` loop:

1. Modifying an existing object, instead of creating a new object.

2. Looping over names or values, instead of indices.

- There are three basic ways to loop over a vector.

- `for (i in seq_along(xs))` - loops over indices

- `for (x in xs)` - good for **side-effects**

- `for (nm in names(xs))` - creates a name to access each value with `xs[[nm]]`

3. Handling outputs of unknown length.

- `unlist()` flattens a list of vectors into a single vector,

- `purrr::flatten_dbl()` is stricter --- it will throw an error if the input isn't a list of doubles.

4. Handling sequences of unknown length.

- Use a `while` loop.
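
Variations 3 and 4 might look roughly like this. The data, the seed, and the helper `flip()` are made up for illustration:

```r
set.seed(1) # arbitrary seed, only so the sketch is reproducible

# Unknown output length: save each result in a pre-allocated list,
# then combine once after the loop.
means <- c(0, 1, 2)
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(5, 1) # each iteration yields a different number of values
  out[[i]] <- rnorm(n, means[[i]])
}
combined <- unlist(out)

# Unknown sequence length: flip a coin until three heads in a row.
flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
```

Combining once at the end avoids growing a vector inside the loop, and the `while` loop stops on a condition rather than after a fixed number of iterations.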

## For loops vs. functionals

`For` loops are not as important in R as they are in other languages because R is a functional programming language.
```{r iteration-packages_used, message=FALSE, warning=FALSE}
library(tidyverse)
```

This means that it's possible to put `for` loops in a function, and call that function instead of using the `for` loop directly.
## Intro to iteration {-}

Base R and the `purrr` package have functions for many common loops.
**Iteration** = repeatedly performing the same action on different objects

## The map functions
R is full of *hidden* iteration!

- `map(.x, .f, ...)` makes a list.
- `ggplot2::facet_wrap()` / `ggplot2::facet_grid()`
- `dplyr::group_by()` + `dplyr::summarize()`
- `tidyr::unnest_wider()` / `tidyr::unnest_longer()`
- Anything with a vector!
- `1:10 + 1` requires loops in other languages!

- `map_lgl(.x, .f, ...)` makes a logical vector.
## Summarize w/ `across()`: setup {-}

- `map_int(.x, .f, ...)` makes an integer vector.
```{r iteration-summarize-setup}
df <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10))
glimpse(df)
```

- `map_dbl(.x, .f, ...)` makes a double vector.
## Summarize w/ `across()`: motivation {-}

- `map_chr(.x, .f, ...)` makes a character vector.
```{r iteration-summarize-motivation}
messy <- df |> summarize(
  n = n(),
  a = median(a),
  b = median(b),
  c = median(c)
)
```

**Shortcuts with `purrr`:**
## Summarize w/ `across()`: cleaner {-}

- a one-sided formula creates an anonymous function
```{r iteration-summarize-across}
clean <- df |> summarize(
  n = n(),
  dplyr::across(a:c, median)
)
identical(clean, messy)
```

- use a string to extract named components
## Selecting columns {-}

- Use an integer to select elements by position
- `everything()` for all non-grouping columns
- `where()` to select based on a condition
  - `where(is.numeric)` = all numeric columns
  - `where(is.character)` = all character columns
  - `starts_with("a") & !where(is.numeric)` = all columns that start with "a" and are not numeric
  - `where(\(x) any(stringr::str_detect(x, "name")))` = all columns that contain the word "name" in at least one value
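
A small sketch of these selectors in use, assuming a made-up tibble `people` (the `is.character()` guard is an addition so that `str_detect()` only ever sees character columns):

```r
library(dplyr)

# Hypothetical example data.
people <- tibble(
  name_first = c("Ada", "Grace"),
  age = c(36, 45),
  alias = c("no name given", "Amazing Grace")
)

# All numeric columns: here only `age`.
numeric_means <- people |>
  summarize(across(where(is.numeric), mean))

# Columns that start with "a" and are not numeric: here only `alias`.
a_non_numeric <- people |>
  select(starts_with("a") & !where(is.numeric))

# Character columns with "name" in at least one *value* (not in the column name).
has_name <- people |>
  select(where(\(x) is.character(x) && any(stringr::str_detect(x, "name"))))
```

Note that `name_first` is not selected by the last call: the predicate looks at the values in each column, not at the column names.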

**Base R:**
## Passing functions {-}

- `lapply(X, FUN, ...)` makes a list (`map` is more consistent)
Pass actual function to `across()`, ***not*** a call to the function!

- `sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)` wrapper around `lapply`
- Good: `across(a:c, mean)`
- Bad: `across(a:c, mean())` (this calls `mean()` immediately, with no input)
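
The difference can be checked directly. This sketch assumes a made-up tibble `df_demo` and uses `try()` only to capture the failure:

```r
library(dplyr)

# Hypothetical example data.
df_demo <- tibble(a = c(1, 2, 3), b = c(4, 5, 6))

# Passing the function itself: across() calls it once per column.
ok <- df_demo |> summarize(across(a:b, mean))

# Passing a call: mean() runs right away without any data, so it errors.
bad <- try(df_demo |> summarize(across(a:b, mean())), silent = TRUE)
inherits(bad, "try-error")
```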

- `vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)` safe alternative to `sapply`, can make a matrix
## Multiple functions {-}

## Dealing with failure
```{r iteration-multiple_functions}
df |> summarize(
  across(a:c, list(mean, median))
)
```

`safely()` - designed to work with `map`
## Multiple functions with names {-}

- `result` is the original result. If there was an error, this will be `NULL`.
```{r iteration-multiple_functions-names}
df |> summarize(across(a:c,
  list(mean = mean, median = median)
))
```

- `error` is an error object. If the operation was successful, this will be `NULL`.
## Multiple functions with names & args {-}

`possibly()` - you give it a default value to return when there is an error.
```{r iteration-multiple_functions-args}
df |> summarize(across(a:c,
  list(
    mean = \(x) mean(x, na.rm = TRUE),
    median = \(x) median(x, na.rm = TRUE)
  )
))
```

`quietly()` - instead of capturing errors, it captures printed output, messages, and warnings.
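
A short sketch of the three adverbs, assuming a made-up input list in which one element makes `log()` fail:

```r
library(purrr)

# Hypothetical inputs: log("a") throws an error.
inputs <- list(1, 10, "a")

# safely(): every call returns a list with `result` and `error`.
safe_log <- safely(log)
res <- map(inputs, safe_log)
res[[3]]$result # NULL, because the call failed
res[[3]]$error  # the captured error object

# possibly(): substitute a default value when there is an error.
poss_log <- possibly(log, otherwise = NA_real_)
vals <- map_dbl(inputs, poss_log)

# quietly(): capture printed output, messages, and warnings instead of errors.
quiet_sqrt <- quietly(sqrt)
q <- quiet_sqrt(-1)
q$warnings # the warning text that sqrt(-1) would normally emit
```

`possibly()` is often the most convenient of the three inside `map_*()` calls, because the output keeps a simple type.
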
## Fancier naming {-}

## Mapping over multiple arguments
```{r iteration-across-glue_names}
df |> summarize(across(a:c,
  list(
    mean = \(x) mean(x, na.rm = TRUE),
    median = \(x) median(x, na.rm = TRUE)
  ),
  .names = "{.fn}_of_{.col}"
))
```

`map2()` and `pmap()` functions are for multiple related inputs that you need to iterate along in parallel.
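
As a rough sketch, with made-up parameter vectors `mu` and `sigma`:

```r
library(purrr)

# Hypothetical parameters, one pair per simulation.
mu <- c(5, 10, -3)
sigma <- c(1, 5, 10)

set.seed(42) # arbitrary seed

# map2(): walk along two vectors in parallel.
draws2 <- map2(mu, sigma, \(m, s) rnorm(3, mean = m, sd = s))

# pmap(): walk along any number of inputs held in a list.
# The list names are matched to the argument names of rnorm().
params <- list(n = c(1, 2, 3), mean = mu, sd = sigma)
draws_p <- pmap(params, rnorm)

lengths(draws_p)
```

Naming the list elements after the function's arguments keeps `pmap()` calls readable and avoids relying on position.
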
## Filtering with if_any() and if_all() {-}

**Invoking different functions**
```{r iteration-if_any}
df2 <- tibble(x = 1:3, y = c(1, 2, NA), z = c(NA, 2, 3))
df2 |> filter(if_any(everything(), is.na))
df2 |> filter(if_all(everything(), \(x) !is.na(x)))
```

*Read the help in purrr on these functions. If you need to do this, read Advanced R. `invoke_map` is retired.*
## across() in functions: setup {-}

```{r iteration-across-in-functions}
summarize_datatypes <- function(df) {
  df |> summarize(
    across(
      where(is.numeric),
      list(mean = \(x) mean(x, na.rm = TRUE))
    ),
    across(
      where(is.factor) | where(is.character),
      list(n_distinct = n_distinct)
    )
  ) |>
    glimpse()
}
```

## Walk
## across() in functions: mpg {-}

`walk()` is an alternative to `map()` that you use when you want to call a function for its side effects, rather than for its return value.
```{r iteration-across_in_functions-mpg}
mpg |> summarize_datatypes()
```

`walk2` and `pwalk` - more useful than `walk`.
## across() in functions: diamonds {-}

## Other patterns of for loops
```{r iteration-across_in_functions-diamonds}
diamonds |> summarize_datatypes()
```

**Predicate functions -** return `TRUE` or `FALSE`
## Iterate over files {-}

- `keep` and `discard` - `keep` retains the elements where the predicate is `TRUE`; `discard` retains those where it is `FALSE`.
```{r iteration-map_files, eval = FALSE}
list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) |>
  set_names(basename) |>
  purrr::map(readxl::read_excel) |>
  map(\(df) "Fix something that might be weird in each df") |>
  map(\(df) "Fix a different thing") |>
  purrr::list_rbind(names_to = "filename")
```

- `some` and `every` - determine whether the predicate is `TRUE` for any (some) or all (every) elements
## One vs everything {-}

- `detect` and `detect_index` - detect finds first element where `TRUE`, and detect_index returns the position
> We recommend this approach [perform each step on each file instead of in a function] because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.
- `head_while()` and `tail_while()` - while `TRUE` take elements from start (head_while) or end (tail_while)
Discuss!

**Reduce and accumulate**
- Jon's preference: Do 1-2 files first, iterate on iteration
- Book: Do everything on everything

- `reduce()` takes a "binary" function, applies it repeatedly to a list until there is only a single element left.
## Walk vs map {-}

- `accumulate()` keeps all the interim results.
- Use `purrr::walk()` to do things without keeping result
- Book example: Saving things
- `purrr::map2()` & `purrr::walk2()`: 2 inputs
- `purrr::pmap()` & `purrr::pwalk()`: list of inputs (largely replaced by `across()`)
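
A minimal sketch of saving several files with `walk2()`. The output folder, the file names, and the use of `mtcars` are assumptions for illustration, not the book's exact example:

```r
library(purrr)

# Hypothetical output location inside the session's temp directory.
out_dir <- file.path(tempdir(), "walk-demo")
dir.create(out_dir, showWarnings = FALSE)

# One data frame per cylinder count, and one path per data frame.
pieces <- split(mtcars, mtcars$cyl)
paths <- file.path(out_dir, paste0("cyl-", names(pieces), ".csv"))

# walk2() calls write.csv() for its side effect and returns its input invisibly.
walk2(pieces, paths, \(df, path) write.csv(df, path, row.names = FALSE))

file.exists(paths)
```

Using `map2()` here would also write the files, but it would return a list of `NULL`s that nobody needs.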

## Meeting Videos
## Meeting Videos {-}

### Cohort 5
### Cohort 5 {-}

`r knitr::include_url("https://www.youtube.com/embed/0rsV1jlxhws")`

@@ -201,7 +227,7 @@ mtcars %>%
```
</details>

### Cohort 6
### Cohort 6 {-}

`r knitr::include_url("https://www.youtube.com/embed/NVUHFpYUmA4")`

@@ -229,7 +255,7 @@ mtcars %>%
</details>


### Cohort 7
### Cohort 7 {-}

`r knitr::include_url("https://www.youtube.com/embed/vPEgWgs0q7s")`

@@ -247,6 +273,6 @@ mtcars %>%
</details>


### Cohort 8
### Cohort 8 {-}

`r knitr::include_url("https://www.youtube.com/embed/TQabUIBbJKs")`
