Performance slowdown using across within mutate #6985

Open
nirguk opened this issue Jan 18, 2024 · 1 comment

nirguk commented Jan 18, 2024

I believe this is an unexplored performance issue, seemingly related to dplyr::expand_across.

Benchmarked over 1,000 iterations of processing the ames data, there is a marked difference between direct mutation and indirect mutation facilitated by across(), both when using where() selection and when using explicit all_of(c(...)) selection. The slowdown in the latter case (columns listed directly through all_of(c(...))) suggests that the issue is not caused by checking column properties, as where() does.

I think the performance issue is significant, with direct mutation approximately 3x faster than mutation mediated by across().

# A tibble: 4 × 9
  expression                                             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 acrosswhere_func(ames_narrow)                       3.69ms   4.26ms      219.    1.75MB     7.70   966    34      4.42s
2 across_all_of_func(ames_narrow)                      3.3ms   3.83ms      256.   64.73KB     8.20   969    31      3.78s
3 direct_mutate_func(ames_narrow)                      1.1ms   1.26ms      766.   48.59KB     8.52   989    11      1.29s
4 direct_mutate_with_class_detect_func(ames_narrow)   1.22ms   1.36ms      722.   71.12KB     8.77   988    12      1.37s

I came across #6897 and considered whether this was related; however, I believe it is something else.
Here, when using across(), I use the anonymous function syntax as advised.

First a reprex, and then my session info:

library(bench)
library(tidyverse)
library(modeldata)
options("lifecycle_verbosity"="error")


(ames_narrow <- ames |> select(1:5))

num_op <- mean
char_op <- identity

acrosswhere_func <- function(a){
  mutate(a,
         across(where(is.numeric),\(x){num_op(x)}),
         across(where(is.character)|where(is.factor),\(x){char_op(x)}))
}

across_all_of_func <- function(a){
  mutate(a,
         across(all_of(c("Lot_Frontage","Lot_Area")),\(x){num_op(x)}),
         across(all_of(c("MS_SubClass","MS_Zoning","Street")),\(x){char_op(x)}))
}


direct_mutate_func <- function(a){
  mutate(a,
         Lot_Frontage = num_op(Lot_Frontage),
         Lot_Area = num_op(Lot_Area),
         MS_SubClass =  char_op(MS_SubClass),
         MS_Zoning =  char_op(MS_Zoning),
         Street = char_op(Street))
}

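direct_mutate_with_class_detect_func <- function(a){
  
  # Detect numeric and categorical columns by class.
  # The detected names are not used below; this only adds the cost of
  # class detection on top of the direct mutate.
  l <- map_lgl(a,\(x)is.numeric(x))
  numnames <- names(l[l])
  l <- map_lgl(a,\(x){is.character(x)|is.factor(x)})
  catnames <- names(l[l])
  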
  mutate(a,
         Lot_Frontage = num_op(Lot_Frontage),
         Lot_Area = num_op(Lot_Area),
         MS_SubClass =  char_op(MS_SubClass),
         MS_Zoning =  char_op(MS_Zoning),
         Street = char_op(Street))
}

b1 <- mark(acrosswhere_func(ames_narrow),
           across_all_of_func(ames_narrow),
           direct_mutate_func(ames_narrow),
           direct_mutate_with_class_detect_func(ames_narrow),iterations = 1000L)

select(b1,1:9)

session info

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] modeldata_1.2.0 lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5    
 [8] tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0 bench_1.1.3    

loaded via a namespace (and not attached):
 [1] rstudioapi_0.15.0 magrittr_2.0.3    hms_1.1.3         tidyselect_1.2.0  munsell_0.5.0     timechange_0.2.0 
 [7] colorspace_2.1-0  R6_2.5.1          rlang_1.1.3       fansi_1.0.4       tools_4.2.2       grid_4.2.2       
[13] gtable_0.3.4      utf8_1.2.3        cli_3.6.2         withr_2.5.0       lifecycle_1.0.3   tzdb_0.4.0       
[19] vctrs_0.6.5       glue_1.6.2        stringi_1.7.8     compiler_4.2.2    pillar_1.9.0      generics_0.1.3   
[25] scales_1.2.1      profmem_0.6.0     pkgconfig_2.0.3 
@etiennebacher

Hi, I'm not a dplyr dev (or a tidyverse dev at all), but I'm not sure what you expect here. across() simply has to do more work, since it must evaluate the tidy selection passed in .cols, and there are probably other checks and steps that need to be done. Note that the across() call with where() is the slowest because it must evaluate the condition on all columns and retain only those where the condition is true.
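
To make the selection step concrete, here is a minimal sketch of the work that where() implies, written with a small made-up tibble `d` (the data and the names `is_num`, `num_cols`, and `via_select` are illustrative assumptions, not part of the reprex). It is not how dplyr implements across() internally; it only shows that the predicate has to be called on every column before any column is transformed.

```r
library(dplyr)

# Hypothetical toy data: two numeric columns and one character column.
d <- tibble(a = 1:3, b = c(0.5, 1.5, 2.5), s = c("x", "y", "z"))

# What where(is.numeric) has to do: call the predicate on every column
# and keep only the names where it returns TRUE.
is_num <- vapply(d, is.numeric, logical(1))
num_cols <- names(d)[is_num]

# The same columns, resolved through tidy selection.
via_select <- names(select(d, where(is.numeric)))

# Both routes pick the same columns.
stopifnot(identical(num_cols, via_select))

# Resolving the names once and reusing them gives the same result as
# letting across() evaluate where() itself.
res_where <- mutate(d, across(where(is.numeric), \(x) mean(x)))
res_names <- mutate(d, across(all_of(num_cols), \(x) mean(x)))
stopifnot(isTRUE(all.equal(res_where, res_names)))
```

The predicate cost grows with the number of columns, which fits the observation that the where() variant is the slowest of the four.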

Moreover, this timing difference barely scales with the number of rows and columns in the data (except for where(), whose cost increases with the number of columns). On my machine, the difference is always 3-4ms. I don't think this overhead is important, but if it is in your case, maybe you should consider alternative packages like data.table that are built for performance.
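
One way to check the scaling claim on your own machine is sketched below. It uses synthetic data rather than ames, and the helpers `make_data`, `with_across`, and `direct` are hypothetical names introduced for this example. The absolute timings will differ between machines, so no expected numbers are shown.

```r
library(dplyr)
library(bench)

# Hypothetical helper: synthetic data with two numeric columns and one
# character column, for a given number of rows.
make_data <- function(n_rows) {
  tibble(
    x = runif(n_rows),
    y = runif(n_rows),
    g = sample(letters, n_rows, replace = TRUE)
  )
}

# The same transformation written with across() and written out directly.
with_across <- function(d) mutate(d, across(all_of(c("x", "y")), \(v) mean(v)))
direct <- function(d) mutate(d, x = mean(x), y = mean(y))

# Benchmark both variants at two data sizes and collect median times.
timings <- lapply(c(1e3, 1e5), function(n) {
  d <- make_data(n)
  res <- mark(with_across(d), direct(d), iterations = 50)
  tibble(
    n_rows = n,
    method = c("across", "direct"),       # same order as passed to mark()
    median_sec = as.numeric(res$median)   # bench_time converted to seconds
  )
})
timings <- bind_rows(timings)
timings
```

If the gap between the two methods stays roughly constant as `n_rows` grows, the overhead is a fixed per-call cost rather than something proportional to the data size.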
