Improve performance (for big datasets) #673

Open
hms1 opened this issue Dec 20, 2021 · 11 comments

hms1 commented Dec 20, 2021

First of all thanks for making such a useful package!

However, I have been trying to skim a dataset with lots of columns (thousands) and noticed very poor performance. The full dataset, with >20,000 columns, ran overnight without completing, so performance appears to degrade much faster than linearly in the number of columns. I found similar issues being discussed in #370 and elsewhere.

I had a look to see where the bottleneck was: the skim_by_type methods and the build_results function they call are both very slow when there are lots of columns. This looks to me like an issue with dplyr::across when running summarize with lots of column/function pairs. In any case, I refactored skim_by_type and improved performance by a factor of ~25 for a 100,000 x 1,500 dataset. The larger dataset I am working with (which previously did not complete overnight) now runs in ~1 minute. I may be missing something here, so apologies in advance if so.

I am happy to make a branch to demonstrate this properly and open it up for improvement, but for now see below for a reproducible example showing the relative performance of the refactored skim_by_type function, which should be able to replace all three of the current methods:

library(tidyverse)
library(skimr)
library(stringi)
library(microbenchmark)

#### Make some wide test data
number_rows <- 100000
coltypes <- c(rep("numeric",500), rep("date",500), rep("char", 500))

large_test_df <- bind_cols(
	map(seq_along(coltypes), function(col){
		if(coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0, 1, 2, 3), 1))
		if(coltypes[col] == "date")    content <- sample(seq(as.Date('1950-01-01'), Sys.Date(), by = "day"), number_rows, replace = TRUE)
		if(coltypes[col] == "char")    content <- sample(stri_rand_strings(500, sample(c(1, 3, 5), 1)), number_rows, replace = TRUE)
		tibble(!!sym(str_c("col", col)) := content)
	})
)

#### save the original function and define refactored function
skim_by_type_original <- skimr:::skim_by_type

skim_by_type_refactor <- function(mangled_skimmers, variable_names, data){
	group_columns <- groups(data)
	data <- as_tibble(data)
	# Summarise one variable at a time: pivot it into long form, group by the
	# variable name (plus any existing groups), then apply the mangled skimmers.
	map_df(variable_names, function(skim_var){
		pivot_longer(select(data, !!skim_var, !!!group_columns), cols = all_of(skim_var),
			names_to = "skim_variable", values_to = "values") %>%
			group_by(skim_variable, !!!group_columns) %>%
			summarize(across(values, mangled_skimmers$funs), .groups = "drop")
	}) %>%
		# Drop the "values_" prefix and skimr's internal name-mangling marker from
		# the stat columns, then restore the original column order.
		rename_with(~str_remove(.x, "values_~!@#\\$%\\^\\&\\*\\(\\)-\\+"), everything()) %>%
		arrange(match(skim_variable, variable_names))
}

#### run test
microbenchmark(
	{
		assignInNamespace("skim_by_type", skim_by_type_original, ns="skimr")
		full_version <- skim(large_test_df)
	},
	{
		assignInNamespace("skim_by_type", skim_by_type_refactor, ns="skimr")
		full_version <- skim(large_test_df)
	},
	times = 5
)
Unit: seconds
                                                                                                                                   expr
 {     assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr")     full_version <- skimr::skim(large_test_df) }
 {     assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr")     full_version <- skimr::skim(large_test_df) }
      min        lq      mean    median        uq       max neval
 251.1866 256.76540 256.68652 256.94793 258.26105 260.27161     5
  10.1157  10.18126  10.30293  10.28105  10.35256  10.58411     5

elinw (Collaborator) commented Dec 21, 2021

Wow, yes, we have really been trying to think about the bottlenecks; we have known that the processing was very slow, so this is really helpful! @michaelquinn32, thoughts?
In the original code we have skim_by_type.grouped_df() and skim_by_type.data.frame() (among others), but I don't think your code is using groups? Are you suggesting changes to several functions, or that we drop the differentiation?

hms1 (Author) commented Dec 21, 2021

Hey @elinw, yes, I saw that in the original code skim_by_type is split into three (?) methods to handle grouped_df, data.table and plain data.frame objects. The single function above should handle each of these classes without issue, but of course it could be split into different methods to handle each class separately; it just seemed easier to demonstrate it this way. If you are interested, I'd be happy to formalise things in a new branch when I find the time, but I don't have permissions to do so at the moment.

elinw (Collaborator) commented Dec 21, 2021

Well, to create a new branch you would fork the project and create the branch in your own repo.
We use the S3 approach because, among other things, it lets people develop skim methods for different kinds of objects. Now that you have raised this, I think skimr is really at the point of maturity where working on performance (and, relatedly, pushing work into a database when appropriate) is the biggest thing we can do.

@michaelquinn32 (Collaborator)

That's awesome Henry. Thanks!

We would love a pull request. I would be happy to review your code.

As @elinw pointed out, create a branch with your change and we can go from there.

  • If you haven't yet, take a look at our contributor's guide
  • Code should pass a run of lintr with the default settings
  • And you should run styler::style_pkg() before we merge

I'm surprised that dplyr::across() performs so much worse than pivoting the data first, but I'm thrilled to have discovered this opportunity for improving the user experience.

Thanks!

hms1 (Author) commented Dec 23, 2021

Hey @michaelquinn32 ,

I had a chance to look at this properly rather than just rewriting the function that was running slowly.

First off I should clarify that it is not dplyr::across that is causing issues. I am afraid to even think what the punishment is for people who bring dplyr into disrepute.

Amazingly, it is the call to as.data.frame buried inside skimr:::reshape_skimmed that causes the majority of the slowness (see here). My explanation is that as.data.frame gets slower as the object being converted gets larger; combined with the fact that reshape_skimmed is called once per variable, this produces worse-than-linear performance degradation as you add columns.
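
To illustrate the kind of cost involved, here is a small standalone sketch (not skimr's actual internals; wide_df and n_cols are made up for the example) comparing coercion of a wide tibble with as.data.frame() versus as_tibble():

library(tibble)
library(microbenchmark)

# A wide, short tibble: many columns, few rows
n_cols <- 2000
wide_df <- as_tibble(setNames(replicate(n_cols, runif(100), simplify = FALSE),
                              paste0("col", seq_len(n_cols))))

# Per the explanation above, the gap between the two coercions should widen
# as n_cols grows
microbenchmark(
  base_df = as.data.frame(wide_df),
  tibble  = as_tibble(wide_df),
  times = 20
)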

Anyway, I have made an initial PR that simply changes as.data.frame to as_tibble. What I suggested above in the thread is faster still and quite a lot cleaner (in my opinion), so I can make a subsequent PR to rewrite the methods; this initial change is just very low-hanging fruit.

elinw (Collaborator) commented Dec 24, 2021

Let's look at the timing for both grouped and ungrouped data with the as_tibble change. I'm wondering what we were thinking when we used as.data.frame() right there -- was there some kind of edge case? It's right in the middle of a bunch of tidyverse code.
This is the commit that essentially added the as.data.frame(): 81b7ddd.

Should we be coercing to a tibble after we check that the input to skim is a data frame or coercible to a data frame?
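
Something along those lines, as a hypothetical sketch (coerce_input is a made-up name, not an existing skimr function): validate first, then coerce to a tibble exactly once so that no per-variable step pays a coercion cost.

coerce_input <- function(data) {
  # Fail early if the object cannot be treated as a data frame
  if (!is.data.frame(data)) {
    data <- as.data.frame(data)
  }
  # Convert once, up front, instead of once per variable later on
  tibble::as_tibble(data)
}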

elinw (Collaborator) commented Jan 1, 2022

Even after we merge that change, let's keep this issue open for the general topic of making skimr faster in different situations.

elinw (Collaborator) commented Jan 1, 2022

Also, let's consider performance without histograms as the baseline, since we know they have a big impact.
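
For example, a no-histogram baseline can be set up with skimr's documented customization functions skim_with() and sfl(); large_test_df below refers to the example data earlier in this thread:

library(skimr)

# A skim() variant whose numeric skimmers have the inline histogram removed
skim_no_hist <- skim_with(numeric = sfl(hist = NULL))

system.time(skim(large_test_df))          # with histograms
system.time(skim_no_hist(large_test_df))  # without histograms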

@astraetech

Hi, I love skimr and use it a lot. Are there any updates on the performance issue? My datasets are 50-300+ variables and 8-60 million observations, and skimr takes a long time to produce summaries. Thank you!

@michaelquinn32 (Collaborator)

Thanks for using skimr!

We do support summaries of data.table objects. Have you tried that? Is it faster?

Otherwise, scaling up to that size is a big challenge for the design of skimr, and I'm not exactly sure what we could do. We've both been very stretched for time recently, so a big change (like supporting parallel backends, etc.) might still be a ways away.
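
For reference, the data.table route would look roughly like this (big_df stands in for one of the datasets described above; whether it is actually faster needs to be measured):

library(data.table)
library(skimr)

big_dt <- as.data.table(big_df)  # big_df: placeholder for your own data frame
system.time(skim(big_dt))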

jxu commented Oct 23, 2023

skim takes about ten minutes for my dataset, a tibble with 150k rows and 1,000 columns. This is a lot of columns, but not unreasonable for raw medical data. For comparison, base summary() does not do as much, but finishes in 30 seconds.
