new feature making clustermq "pipeable" #318

Open
wds15 opened this issue Nov 1, 2023 · 4 comments

wds15 commented Nov 1, 2023

Hi!

First, clustermq is really great - it powers a lot of what I do. Today I wrote a small utility function that makes the "Q" functions compatible with the pipe syntax, which is used a lot in R workflows. Could this function be implemented in clustermq directly?

library(brms)
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = poisson())

## adding predictions to the original data set can be done with a pipe approach
epilepsy |> tidybayes::add_predicted_rvars(fit1)

## which does not work with Q_rows as Q_rows sends the individual
## columns as arguments to the function. Thus the function below does
## nest things in a way so that clustermq can be applied directly
## here:

Q_rows_nested <- function(data, fun, arg, ...) {
    data |>
        dplyr::mutate(.row=1:dplyr::n()) |>
        tidyr::nest(data=-.row) |>
        dplyr::select("{{arg}}" := data) |>
        clustermq::Q_rows(fun=fun, ...) |>
        dplyr::bind_rows()
}


## now we can run the predictions in parallel over clustermq
epilepsy |> Q_rows_nested(tidybayes::add_predicted_rvars, newdata, const=list(object=fit1))

The above makes more sense for huge simulations and fits. It would be nice to add chunking so that the "data" is split into bigger pieces, which should be easy to do.

This is just a feature suggestion as I think this could be useful for many others as well.


wds15 commented Nov 1, 2023

Here is an improved version. It defaults to the name of the function's first argument when "arg" is not given, and it does chunking, which can speed things up a lot:

library(brms)
library(tidybayes)
library(dplyr)
library(tidyr)

fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = poisson())

## adding predictions to the original data set can be done with a pipe approach
epilepsy |> tidybayes::add_predicted_rvars(fit1)

## which does not work with Q_rows as Q_rows sends the individual
## columns as arguments to the function. Thus the function below does
## nest things in a way so that clustermq can be applied directly
## here:

Q_rows_nested <- function(data, fun, arg, chunk_size=1, ...) {
    if(missing(arg)) {
        arg <- rlang::sym(names(formals(fun))[1])
    }
    data |>
        dplyr::mutate(.chunk=sort(rep(seq_len(ceiling(dplyr::n()/chunk_size)), length.out=dplyr::n()))) |>
        tidyr::nest(data=-.chunk) |>
        dplyr::select("{{arg}}" := data) |>
        clustermq::Q_rows(fun=fun, ...) |>
        dplyr::bind_rows()
}


## now we can run the predictions in parallel over clustermq
epilepsy |> Q_rows_nested(tidybayes::add_predicted_rvars, const=list(object=fit1), pkgs="tidybayes", n_jobs=6)
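To make the chunking step easier to follow, here is a minimal base R sketch of the index expression used in the `.chunk` column above. The helper name `chunk_index` is hypothetical and only used for illustration; it is not part of clustermq or dplyr.

```r
# Hypothetical helper: mirrors the expression used for `.chunk` above.
# It assigns each of n rows to one of ceiling(n / chunk_size) chunks.
chunk_index <- function(n, chunk_size) {
  n_chunks <- ceiling(n / chunk_size)
  # Recycle the chunk ids over all rows, then sort so that rows in the
  # same chunk are contiguous.
  sort(rep(seq_len(n_chunks), length.out = n))
}

idx <- chunk_index(10, 3)
print(idx)         # chunk id per row
print(table(idx))  # number of rows per chunk
```

Note that the chunks come out roughly balanced rather than filled to `chunk_size` first: 10 rows with `chunk_size=3` give four chunks of 3, 3, 2 and 2 rows, so `chunk_size` acts as an upper bound on the rows per chunk.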

mschubert (Owner) commented

Thanks for the idea and great to hear that the package is working well for you!

The way I understand it, you want to pass a row or a number of rows of a data frame as one combined argument to a function.

Instead of nesting the data, I would go about it like this:

with_rvars = clustermq::Q(
    tidybayes::add_predicted_rvars,
    newdata = split(epilepsy, seq_len(nrow(epilepsy))),
    const = list(object=fit1),
    n_jobs = 6
) |> bind_rows()

That looks fairly straightforward to me. clustermq will chunk bigger data for you, which you could add manually if calling tidybayes::add_predicted_rvars once per row adds too much overhead.
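As a rough sketch of what such manual chunking could look like, the base R example below splits a data frame into blocks of several rows instead of single rows. The toy data frame, the function `add_double`, and the `chunk_size` value are assumptions made up for illustration, and `lapply` is only a local stand-in for the parallel call; with clustermq, the `chunks` list would presumably be passed as `newdata` in the `clustermq::Q` call shown above.

```r
# Toy data standing in for `epilepsy` (hypothetical values).
df <- data.frame(id = 1:10, x = seq(0.5, 5, by = 0.5))

# Assign consecutive rows to chunks of at most `chunk_size` rows.
chunk_size <- 4
chunk_id <- ceiling(seq_len(nrow(df)) / chunk_size)
chunks <- split(df, chunk_id)

# Hypothetical per-chunk function: takes a block of rows as `newdata`
# and returns it with one added column.
add_double <- function(newdata, factor) {
  newdata$y <- newdata$x * factor
  newdata
}

# Local stand-in for the parallel call: one function call per chunk.
res <- lapply(chunks, add_double, factor = 2)

# Combine the per-chunk results back into one data frame.
out <- do.call(rbind, res)
rownames(out) <- NULL
print(out)
```

The only change compared with the per-row version is the grouping vector handed to `split`, so the function is called once per block of rows rather than once per row.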

I'm not sure if adding a new concept like Q_rows_nested will make the package easier to use overall. Rather, I'd prefer to only add new functionality if a task can't be (easily) done with the existing API.

What do you think?


wds15 commented Nov 27, 2023

Nice alternative version. However, it is not "pipeable": the user cannot pipe a data frame into a Q-powered call.

The other day I had the thought that one should probably refine this towards a "Q_mutate" function. That would avoid the need for the user to define intermediate functions, which are otherwise needed to operate on multiple columns at once.

I totally agree with not bloating a package with unnecessary code, for sure. How about we leave this issue open for a while so that we can collect better ideas for the above function, and finally include it in some form in the documentation? An example, a section on the pkgdown homepage, or something similar?

mschubert (Owner) commented

Happy to leave this open for a while and see what we come up with!
