groupby and summarize is extremely slow for a large number of groups #19
Comments
Thanks for the example. Yes, this is slow, and it is related to part 1 of this issue. pandas is faster because it avoids creating the group dataframes, which is expensive. I plan to look at this problem after the next release, v0.4.0.
Appreciate your commitment. Currently, how is the groupby done? Would using pandas' own groupby structure help? It would mean translating pd.groupby.agg to [...]. I would be happy to contribute.
It is the straightforward (and naive) way: a loop over the group dataframes. For each group, calculate the summary statistic, then concatenate the individual summaries into the final result. So far this is the same process for all calculations that involve grouped data, i.e. mutate, create, do, arrange and summarise. I do not know how easy it would be to translate summarise(c = "mean(a+b)", d = "mean(np.sin(a+b))"). I do not think that is possible with [...]. The real problem is that completing a pandas groupby operation to obtain the group dataframes is very slow. I think part of the solution will be to recognise the simple uses of [...]
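To make the cost described above concrete, here is a minimal sketch in plain pandas. It is not the library's actual code. `summarise_loop` imitates the loop-and-concatenate approach, and `summarise_precomputed` is one hypothetical way to handle expressions such as `mean(a+b)`: evaluate the inner expression once on the whole dataframe, then let pandas aggregate the temporary columns. The function names, the `_c`/`_d` temporary columns, and the test data are all assumptions made for illustration.

```python
import time

import numpy as np
import pandas as pd


def summarise_loop(df, by):
    """Naive approach: build one small summary dataframe per group, then concatenate."""
    pieces = []
    for key, group in df.groupby(by):
        s = group["a"] + group["b"]
        # A separate one-row DataFrame is created for every group.
        pieces.append(
            pd.DataFrame({by: [key], "c": [s.mean()], "d": [np.sin(s).mean()]})
        )
    return pd.concat(pieces, ignore_index=True)


def summarise_precomputed(df, by):
    """Hypothetical alternative: evaluate the inner expressions once, then aggregate."""
    tmp = df.assign(_c=df["a"] + df["b"], _d=np.sin(df["a"] + df["b"]))
    # Named aggregation: output column = (input column, aggregation function).
    return tmp.groupby(by).agg(c=("_c", "mean"), d=("_d", "mean")).reset_index()


# Synthetic data with many small groups (illustrative sizes only).
rng = np.random.default_rng(0)
n_rows, n_groups = 20_000, 2_000
df = pd.DataFrame({
    "g": rng.integers(0, n_groups, size=n_rows),
    "a": rng.normal(size=n_rows),
    "b": rng.normal(size=n_rows),
})

start = time.perf_counter()
slow = summarise_loop(df, "g")
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = summarise_precomputed(df, "g")
agg_seconds = time.perf_counter() - start

print(f"loop over groups: {loop_seconds:.3f}s, precompute + agg: {agg_seconds:.3f}s")
```

Both functions return one row per group with columns `g`, `c` and `d`. The second only works when the aggregation wraps the whole expression, so it would cover the simple cases but not arbitrary summaries.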
Why is groupby + summarize so slow compared to plain pandas? Please find the reproducible example.