Feature request: Pipe-supportive pass-through of unrecoded elements using case_match()? #6962

Open
jmobrien opened this issue Nov 9, 2023 · 0 comments


case_match() is great to have as a more pipe-able alternative to case_when() that follows modern tidyverse approaches. However, it lacks an equivalent to one nice feature of its predecessor recode(): letting non-recoded elements pass through into the output. While the help examples provide a workaround for this missing feature, I'm realizing it may not generalize well across use cases.

An example similar to what I encountered:

# Some filenames:
filenames <- 
  c("juniors_2022_OU.csv", "roster_2023_juniors_notre_dame.csv",  "2022_SOU_juniors_roster.csv")

# Begin building tibble to (eventually) import & organize the content from these files:
tibble::tibble(

  # Import filenames (in practice, would be a call to list.files() ):
  file_names = filenames,
  
  # Get key info like year and school out of filenames: 
  
  # Year is straightforward:
  id_year = file_names |> stringr::str_extract("202[23]"),
  
  # School, however, needs initialisms expanded to minimize ambiguity:
  
  # recode() can do this inside a single set of piped instructions.
  # Items that are not recoded keep their original value (as long as
  # it is the same type as the recoded output):

  id_school_recode = 
    file_names |> 
    # Remove non-school-name content:
      stringr::str_remove("\\.csv$") |> 
      stringr::str_remove("_?202[23]_?") |> 
      stringr::str_remove("_?juniors_?") |> 
      stringr::str_remove("_?roster_?") |>
    # Recode initialisms:
      dplyr::recode(
        "OU" = "oklahoma",
        "SOU" = "southern_oregon"
      ),
  
  # Using case_match(), though, non-recoded elements become NA.
  id_school_casematch = 
    file_names |> 
      stringr::str_remove("\\.csv$") |> 
      stringr::str_remove("_?202[23]_?") |> 
      stringr::str_remove("_?juniors_?") |> 
      stringr::str_remove("_?roster_?") |> 
      dplyr::case_match(
        "OU" ~ "oklahoma",
        "SOU" ~ "southern_oregon"
      )
)
#> # A tibble: 3 × 4
#>   file_names                        id_year id_school_recode id_school_casematch
#>   <chr>                             <chr>   <chr>            <chr>              
#> 1 juniors_2022_OU.csv               2022    oklahoma         oklahoma           
#> 2 roster_2023_juniors_notre_dame.c… 2023    notre_dame       <NA>               
#> 3 2022_SOU_juniors_roster.csv       2022    southern_oregon  southern_oregon
 

The help docs for case_match() have an example that uses the argument .default = <<varname>> (e.g. species) to fill the original data back in. This would work in mutate(), but not here in tibble(): the approach requires multiple arguments specifying the same column/variable, which tibble() forbids:

tibble::tibble(

  # Locate and remove non-name content:
  id_school_casematch = 
    filenames |> 
    stringr::str_remove("\\.csv$") |> 
    stringr::str_remove("_?202[23]_?") |> 
    stringr::str_remove("_?juniors_?") |> 
    stringr::str_remove("_?roster_?"),

  # Recode names:
  id_school_casematch = 
    id_school_casematch |> 
    dplyr::case_match(
      "OU" ~ "oklahoma",
      "SOU" ~ "southern_oregon",
      .default = id_school_casematch
    )
)
#> Error in `tibble::tibble()`:
#> ! Column name `id_school_casematch` must not be duplicated.
#> Use `.name_repair` to specify repair.
#> Caused by error in `repaired_names()`:
#> ! Names must be unique.
#> ✖ These names are duplicated:
#>   * "id_school_casematch" at locations 1 and 2.

(Also, even when using mutate() or standard variable creation [<-], specifying any new column/variable this way requires 2 separate arguments/calls. This is admittedly my personal opinion, but that seems less concise/readable to me, and it doesn't fully capitalize on the pipe-ability that seems to be case_match()'s key offering.)
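
For comparison, here is a minimal sketch of the two-step mutate() version mentioned above, which does run because mutate() allows a column to be reassigned within one call. The object name `schools` and the column name `id_school` are just illustrative, and this assumes dplyr >= 1.1.0 (where case_match() was introduced):

```r
filenames <-
  c("juniors_2022_OU.csv", "roster_2023_juniors_notre_dame.csv", "2022_SOU_juniors_roster.csv")

schools <-
  tibble::tibble(file_names = filenames) |>
  dplyr::mutate(
    # Step 1: strip the non-school-name content.
    id_school =
      file_names |>
      stringr::str_remove("\\.csv$") |>
      stringr::str_remove("_?202[23]_?") |>
      stringr::str_remove("_?juniors_?") |>
      stringr::str_remove("_?roster_?"),
    # Step 2: recode initialisms, naming the column again as `.default`
    # so that unmatched values pass through instead of becoming NA.
    id_school =
      dplyr::case_match(
        id_school,
        "OU" ~ "oklahoma",
        "SOU" ~ "southern_oregon",
        .default = id_school
      )
  )
```

The column has to be named three times (twice as an argument, once as `.default`), which is the repetition a built-in pass-through would avoid.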

Could case_match() perhaps have an option/default added, e.g. case_match(.x, ..., .default = .x), that would mirror recode()'s capabilities? I recognize you'd still need equivalents to recode()'s checks ensuring that new & pass-through content have the same type--but wouldn't that be manageable, given the vctrs underpinnings of this function?
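
To illustrate the requested behavior, here is a rough sketch of a user-side wrapper. `case_match_keep()` is a hypothetical name, not part of dplyr; it simply forwards the formulas to case_match() and passes the input back in as `.default` (again assuming dplyr >= 1.1.0):

```r
# Hypothetical helper (not a dplyr function): case_match() where
# unmatched elements keep their original value.
case_match_keep <- function(.x, ...) {
  dplyr::case_match(.x, ..., .default = .x)
}

# Works at the end of a pipe without naming the input twice:
recoded <-
  c("OU", "notre_dame", "SOU") |>
  case_match_keep(
    "OU" ~ "oklahoma",
    "SOU" ~ "southern_oregon"
  )
```

Because `.default` is combined with the right-hand sides when the output type is determined, I'd expect an input whose type is incompatible with the replacements to raise an error rather than silently produce NA, which seems like a reasonable counterpart to recode()'s type checks.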
