Performance slowdown using across within mutate #6985

nirguk · 2024-01-18T12:40:19Z

I believe this is an unexplored performance issue, seemingly related to dplyr::expand_across.

Benchmarked over 1000 repetitions of processing the ames data, there is a marked difference between direct mutation and indirect mutation facilitated by across(), seemingly both when using where() selection and when using explicit all_of(c(...)) style selection. The speed degradation in the latter case (direct listing through all_of(c(...))) suggests, I think, that the issue is not related to checking column properties in the way that where() does.

I think the performance issue is significant, with direct mutation approximately 3x faster than mutation mediated by across.

# A tibble: 4 × 9
  expression                                             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 acrosswhere_func(ames_narrow)                       3.69ms   4.26ms      219.    1.75MB     7.70   966    34      4.42s
2 across_all_of_func(ames_narrow)                      3.3ms   3.83ms      256.   64.73KB     8.20   969    31      3.78s
3 direct_mutate_func(ames_narrow)                      1.1ms   1.26ms      766.   48.59KB     8.52   989    11      1.29s
4 direct_mutate_with_class_detect_func(ames_narrow)   1.22ms   1.36ms      722.   71.12KB     8.77   988    12      1.37s
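
As a quick check of the "approximately 3x" figure, the ratios can be computed from the itr/sec column of the table above. This is only arithmetic on the reported numbers; the vector names are illustrative labels, not part of the benchmark output.

```r
# itr/sec values copied from the benchmark table above
itr_per_sec <- c(
  across_where        = 219,
  across_all_of       = 256,
  direct              = 766,
  direct_class_detect = 722
)

# How many times faster direct mutation is than each variant
speedup <- itr_per_sec[["direct"]] / itr_per_sec
round(speedup, 1)
```

On these numbers, direct mutation comes out roughly 3.5x faster than the where() variant and roughly 3.0x faster than the all_of() variant.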

I came across #6897 and considered whether this was related; however, I believe it is something else.
Here, when using across, I use the anonymous function syntax as advised.

First a reprex, and then my session info:

library(bench)
library(tidyverse)
library(modeldata)
options("lifecycle_verbosity"="error")


(ames_narrow <- ames |> select(1:5))

num_op <- mean
char_op <- identity

acrosswhere_func <- function(a){
  mutate(a,
         across(where(is.numeric),\(x){num_op(x)}),
         across(where(is.character)|where(is.factor),\(x){char_op(x)}))
}

across_all_of_func <- function(a){
  mutate(a,
         across(all_of(c("Lot_Frontage","Lot_Area")),\(x){num_op(x)}),
         across(all_of(c("MS_SubClass","MS_Zoning","Street")),\(x){char_op(x)}))
}


direct_mutate_func <- function(a){
  mutate(a,
         Lot_Frontage = num_op(Lot_Frontage),
         Lot_Area = num_op(Lot_Area),
         MS_SubClass =  char_op(MS_SubClass),
         MS_Zoning =  char_op(MS_Zoning),
         Street = char_op(Street))
}

direct_mutate_with_class_detect_func <- function(a){
  
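  # Detect numeric and categorical columns by class. The resulting names are
  # not used by the mutate() call below; they are computed here only so that
  # the cost of class detection is included in the timing.
  l <- map_lgl(a,\(x)is.numeric(x))
  numnames <- names(l[l])
  l <- map_lgl(a,\(x){is.character(x)|is.factor(x)})
  catnames <- names(l[l])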
  
  mutate(a,
         Lot_Frontage = num_op(Lot_Frontage),
         Lot_Area = num_op(Lot_Area),
         MS_SubClass =  char_op(MS_SubClass),
         MS_Zoning =  char_op(MS_Zoning),
         Street = char_op(Street))
}

b1 <- mark(acrosswhere_func(ames_narrow),
           across_all_of_func(ames_narrow),
           direct_mutate_func(ames_narrow),
           direct_mutate_with_class_detect_func(ames_narrow),iterations = 1000L)

select(b1,1:9)

session info

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] modeldata_1.2.0 lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5    
 [8] tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0 bench_1.1.3    

loaded via a namespace (and not attached):
 [1] rstudioapi_0.15.0 magrittr_2.0.3    hms_1.1.3         tidyselect_1.2.0  munsell_0.5.0     timechange_0.2.0 
 [7] colorspace_2.1-0  R6_2.5.1          rlang_1.1.3       fansi_1.0.4       tools_4.2.2       grid_4.2.2       
[13] gtable_0.3.4      utf8_1.2.3        cli_3.6.2         withr_2.5.0       lifecycle_1.0.3   tzdb_0.4.0       
[19] vctrs_0.6.5       glue_1.6.2        stringi_1.7.8     compiler_4.2.2    pillar_1.9.0      generics_0.1.3   
[25] scales_1.2.1      profmem_0.6.0     pkgconfig_2.0.3

etiennebacher · 2024-01-18T16:11:48Z

Hi, I'm not a dplyr dev (or a tidyverse dev at all), but I'm not sure what you expect here. across() simply has to do more operations since it must evaluate the tidy selection passed in .cols and there are probably other checks and steps that need to be done. Note that the across() call with where() is the slowest because it must evaluate the condition on all columns and retain only those where this condition is true.
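
To illustrate the selection step mentioned above, here is a minimal sketch using tidyselect::eval_select() directly. It is not a description of dplyr's internals, only a way to see that both selection styles have to be resolved to column positions before any function is applied; the toy data frame and its column names are made up for the example.

```r
library(tidyselect)
library(rlang)

# Toy data (hypothetical): two numeric columns and one character column
d <- data.frame(a = 1:3, b = letters[1:3], c = c(0.5, 1.5, 2.5))

# where() has to call the predicate on every column to decide what to keep
pos_where <- eval_select(expr(where(is.numeric)), d)

# all_of() only has to match the supplied names against the column names
pos_all_of <- eval_select(expr(all_of(c("a", "c"))), d)

pos_where
pos_all_of
```

Both calls return the positions of columns a and c, but the where() version evaluates is.numeric() once per column, which is why its cost would be expected to grow with the number of columns.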

Moreover, this timing difference barely scales with the number of rows and columns in the data (except for where(), whose cost increases with the number of columns). On my machine, the difference is always 3-4ms. I don't think this overhead is important, but if it is in your case, maybe you should consider alternative packages like data.table that are built for performance.
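
One way to check the scaling claim on your own machine is to repeat the comparison at several row counts and see whether the gap stays roughly constant. The sketch below assumes synthetic data; make_df, with_across and direct are hypothetical helpers written for this example, and the row counts are arbitrary.

```r
library(dplyr)
library(bench)

# Hypothetical helper: n rows, two numeric columns and one character column
make_df <- function(n) {
  tibble(
    x = runif(n),
    y = runif(n),
    g = sample(letters, n, replace = TRUE)
  )
}

# The same transformation written with across() and written out directly
with_across <- function(d) mutate(d, across(all_of(c("x", "y")), \(v) mean(v)))
direct      <- function(d) mutate(d, x = mean(x), y = mean(y))

sizes <- c(1e3, 1e5, 1e6)

results <- lapply(sizes, function(n) {
  d <- make_df(n)
  b <- mark(with_across(d), direct(d), iterations = 100)
  # mark() returns one row per expression, in the order they were supplied
  tibble(
    rows      = n,
    variant   = c("across", "direct"),
    median_ms = as.numeric(b$median) * 1000
  )
})

timings <- bind_rows(results)
timings
```

If the overhead is fixed, the difference between the two median_ms values should be similar at every row count, while the ratio between them shrinks as the data grows.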
