Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grouped operations are slow #119

Open
nathaneastwood opened this issue Dec 19, 2022 · 10 comments
Open

Grouped operations are slow #119

nathaneastwood opened this issue Dec 19, 2022 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@nathaneastwood
Copy link
Owner

d <- data.frame(
  g1 = sample(LETTERS), 40000, TRUE),
  g2 = sample(LETTERS), 40000, TRUE),
  g3 = sample(LETTERS), 40000, TRUE),
  x1 = runif(40000),
  x2 = runif(40000),
  x3 = runif(40000)
)

d %>% group_by(g1, g2, g3) %>% summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))
@nathaneastwood nathaneastwood added the bug Something isn't working label Dec 19, 2022
@nathaneastwood nathaneastwood self-assigned this Dec 19, 2022
nathaneastwood added a commit that referenced this issue Dec 29, 2022
This goes towards making overall grouped calculations quicker. See #119
@nathaneastwood
Copy link
Owner Author

As a result of 3cc0a99:

r$> size <- 10000                                                                                                                                                     

r$> t1 <- system.time({ 
      d <- data.frame( 
        g1 = sample(LETTERS, size, TRUE), 
        g2 = sample(LETTERS, size, TRUE), 
        g3 = sample(LETTERS, size, TRUE), 
        x1 = runif(size), 
        x2 = runif(size), 
        x3 = runif(size) 
      ) 
      d %>% group_by(g1, g2, g3) %>% summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3)) 
    })                                                                                                                                                                
`summarise()` has grouped output by 'g1', 'g2'. You can override using the `.groups` argument.

r$> devtools::load_all()                                                                                                                                              
ℹ Loading poorman

  I'd seen my father. He was a poor man, and I watched him do astonishing things.
    - Sidney Poitier

r$> t2 <- system.time({ 
      d <- data.frame( 
        g1 = sample(LETTERS, size, TRUE), 
        g2 = sample(LETTERS, size, TRUE), 
        g3 = sample(LETTERS, size, TRUE), 
        x1 = runif(size), 
        x2 = runif(size), 
        x3 = runif(size) 
      ) 
      d %>% group_by(g1, g2, g3) %>% summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3)) 
    })                                                                                                                                                                
`summarise()` has grouped output by 'g1', 'g2'. You can override using the `.groups` argument.

r$> t3 <- system.time({ 
      d <- data.frame( 
        g1 = sample(LETTERS, size, TRUE), 
        g2 = sample(LETTERS, size, TRUE), 
        g3 = sample(LETTERS, size, TRUE), 
        x1 = runif(size), 
        x2 = runif(size), 
        x3 = runif(size) 
      ) 
      d %>% dplyr::group_by(g1, g2, g3) %>% dplyr::summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3)) 
    })                                                                                                                                                                
`summarise()` has grouped output by 'g1', 'g2'. You can override using the `.groups` argument.

r$> t1                                                                                                                                                                
   user  system elapsed 
110.639   9.898 122.772 

r$> t2                                                                                                                                                                
   user  system elapsed 
 19.394   1.020  22.994 

r$> t3                                                                                                                                                                
   user  system elapsed 
  0.492   0.014   0.562

Improvements can still be made.

@nathaneastwood
Copy link
Owner Author

The following produces the results in ~9 seconds:

calculate_groups <- function(data, groups, drop = group_by_drop_default(data)) {
  data <- ungroup(data)

  unknown <- setdiff(groups, colnames(data))
  if (length(unknown) > 0L) {
    stop(sprintf("`groups` missing from `data`: %s.", paste0(groups, collapse = ", ")))
  }

  data[, ".rows"] <- seq_len(nrow(data))
  form <- as.formula(paste(".rows", paste0(groups, collapse = " + "), sep = " ~ "))
  res <- aggregate(form, data, function(x) x, simplify = FALSE, na.action = na.pass)

  unique_groups <- unique(data[, groups, drop = FALSE])
  is_factor <- do.call(c, lapply(unique_groups, function(x) is.factor(x)))
  if (!isTRUE(drop) && any(is_factor)) {
    na_lvls <- do.call(
      expand.grid,
      lapply(unique_groups, function(x) if (is.factor(x)) levels(x)[!(levels(x) %in% x)] else NA)
    )
    if (nrow(na_lvls) > 0L) {
      rows <- res$.rows
      res <- bind_rows(res, na_lvls)
      res$.rows <- c(rows, rep(list(integer(0)), nrow(na_lvls)))
    }
  }

  res <- res[do.call(order, lapply(groups, function(x) res[, x])), , drop = FALSE]
  rownames(res) <- NULL
  attr(res, ".drop") <- drop
  res
}

So still a lot slower than dplyr. It also suffers from the problem that aggregate() doesn't keep NA values so the following wouldn't work since the NA group level would be removed.

data.frame(x = c("apple", NA, "banana"), y = 1:3, stringsAsFactors = FALSE) %>%
  group_by(x) %>%
  group_data()

@etiennebacher
Copy link
Contributor

etiennebacher commented Jan 5, 2023

It also suffers from the problem that aggregate() doesn't keep NA values so the following wouldn't work since the NA group level would be removed.

Related to this, in the current version NA are a group, as in dplyr, but their group is not taken into account in summarize():

set.seed(2023)

d <- data.frame(
  g1 = sample(c(LETTERS[1:3], NA), 10, replace = TRUE),
  g2 = sample(c(LETTERS[1:3], NA), 10, replace = TRUE),
  v1 = sample(1:10)
)
d
#>      g1   g2 v1
#> 1     A    A  4
#> 2  <NA>    A  5
#> 3     C    A  8
#> 4     A    A  1
#> 5  <NA>    A  7
#> 6     C    A  2
#> 7     B <NA> 10
#> 8  <NA>    C  9
#> 9     B    B  3
#> 10    A    C  6

d |> 
  poorman::group_by(g1) |> 
  poorman::group_data()
#>     g1    .rows
#> 1    A 1, 4, 10
#> 2    B     7, 9
#> 3    C     3, 6
#> 4 <NA>  2, 5, 8

d |>
  poorman::group_by(g1) |>
  poorman::summarize(mean = mean(v1))
#>   g1     mean
#> 1  A 3.666667
#> 2  B 6.500000
#> 3  C 5.000000

d |>
  dplyr::group_by(g1) |>
  dplyr::summarize(mean = mean(v1))
#> # A tibble: 4 × 2
#>   g1     mean
#>   <chr> <dbl>
#> 1 A      3.67
#> 2 B      6.5 
#> 3 C      5   
#> 4 <NA>   7

Created on 2023-01-05 with reprex v2.0.2

@nathaneastwood
Copy link
Owner Author

Good catch. I think at some point I will need to pin poorman's ambition to a specific version of dplyr. It's entirely possible that at the time of developing poorman, this wasn't a feature of dplyr, but then it's also possible it was and I just didn't capture this.

@etiennebacher
Copy link
Contributor

Right, it would make a lot of sense given how much dplyr evolves (e.g the future .by argument)

@nathaneastwood
Copy link
Owner Author

That's the main incentive, yeah. It would require a lot of work to keep up and probably a lot of refactoring. My suggestion would therefore be dplyr 1.0.0 since I believe .by was implemented in 1.1.0?

@etiennebacher
Copy link
Contributor

Yes it's in the upcoming 1.1.0

@etiennebacher
Copy link
Contributor

etiennebacher commented Jan 5, 2023

Coming back to the slowness issue, I can gain ~30% speed by replacing:

  for (i in seq_len(n_comb)) {
    rows[[i]] <- which(data_groups %in% interaction(unique_groups[i, groups]))
  }

by

  pasted_groups <- do.call(paste, c(unique_groups[, groups, drop = FALSE], sep = "."))
  pasted_groups[is.na(unique_groups)] <- NA
  for (i in seq_len(n_comb)) {
    rows[[i]] <- which(data_groups %in% pasted_groups[i])
  }

Basically, at each iteration, interaction() pastes all values of a given row together, which can be done outside the loop and quicker than interaction().

This change passes all the tests (which are not really complete given the NA issue).


Benchmark setup:

d <- data.frame(
  g1 = sample(LETTERS, 4000, TRUE),
  g2 = sample(LETTERS, 4000, TRUE),
  g3 = sample(LETTERS, 4000, TRUE),
  x1 = runif(4000),
  x2 = runif(4000),
  x3 = runif(4000)
)

# return a list of results so that both functions return the same output (without
# all the class problem, tibble vs data.frame)
poor = function() {
  foo <- d %>% 
    group_by(g1, g2, g3) |> 
    summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))

  list(foo$x1, foo$x2, foo$x3)
}

dpl = function() {
  foo <- d %>% 
    dplyr::group_by(g1, g2, g3) |> 
    dplyr::summarise(x1 = mean(x1), x2 = max(x2), x3 = min(x3))

  list(foo$x1, foo$x2, foo$x3)
}

bench::mark(
  poor(),
  dpl()
)

@nathaneastwood
Copy link
Owner Author

Oh that's really nice. Did you want to submit a PR? I probably won't have time to implement it myself until Sunday.

@etiennebacher
Copy link
Contributor

I can make a PR but I think it shouldn't be implemented before the behavior of NA in groups is fixed (or clarified). Currently there's only one test in test_group_by.R that deals with NA, and it only checks when grouping by one variable, so I'm not super confident this change won't break existing code.

There should probably be some tests for more complex grouping with NA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants