new feature making clustermq "pipeable" #318

wds15 · 2023-11-01T13:19:53Z

Hi!

First, clustermq is really great; it powers a lot of what I do. Today I wrote a small utility function that makes the "Q" functions compatible with the pipe syntax, which is widely used in R workflows. Could this function be implemented in clustermq directly?

library(brms)
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = poisson())

## adding predictions to the original data set can be done with a pipe approach
epilepsy |> tidybayes::add_predicted_rvars(fit1)

## This does not work with Q_rows, because Q_rows passes the individual
## columns as separate arguments to the function. The function below
## therefore nests the data so that clustermq can be applied directly:

Q_rows_nested <- function(data, fun, arg, ...) {
    data |>
        dplyr::mutate(.row=1:dplyr::n()) |>
        tidyr::nest(data=-.row) |>
        dplyr::select("{{arg}}" := data) |>
        clustermq::Q_rows(fun=fun, ...) |>
        dplyr::bind_rows()
}


## now we can run the predictions in parallel over clustermq
epilepsy |> Q_rows_nested(tidybayes::add_predicted_rvars, newdata, const=list(object=fit1))

The above makes more sense for huge simulations and fits. A nice addition would be chunking, so that the "data" is split into bigger pieces rather than single rows, which should be easy to add.

This is just a feature suggestion as I think this could be useful for many others as well.

wds15 · 2023-11-01T15:50:50Z

Here is an improved version that is a bit smarter about the first argument name (it defaults to the first formal argument of the function) and does chunking, which can speed things up a lot:

library(brms)
library(tidybayes)
library(dplyr)
library(tidyr)

fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
            data = epilepsy, family = poisson())

## adding predictions to the original data set can be done with a pipe approach
epilepsy |> tidybayes::add_predicted_rvars(fit1)

## This does not work with Q_rows, because Q_rows passes the individual
## columns as separate arguments to the function. The function below
## therefore nests the data so that clustermq can be applied directly:

Q_rows_nested <- function(data, fun, arg, chunk_size=1, ...) {
    if(missing(arg)) {
        arg <- rlang::sym(names(formals(fun))[1])
    }
    data |>
        dplyr::mutate(.chunk=sort(rep(seq_len(ceiling(dplyr::n()/chunk_size)), length.out=dplyr::n()))) |>
        tidyr::nest(data=-.chunk) |>
        dplyr::select("{{arg}}" := data) |>
        clustermq::Q_rows(fun=fun, ...) |>
        dplyr::bind_rows()
}


## now we can run the predictions in parallel over clustermq
epilepsy |> Q_rows_nested(tidybayes::add_predicted_rvars, const=list(object=fit1), pkgs="tidybayes", n_jobs=6)

mschubert · 2023-11-27T13:32:16Z

Thanks for the idea and great to hear that the package is working well for you!

The way I understand it, you want to pass a row or a number of rows of a data frame as one combined argument to a function.

Instead of nesting the data, I would go about it like this:

with_rvars = clustermq::Q(
    tidybayes::add_predicted_rvars,
    newdata = split(epilepsy, seq_len(nrow(epilepsy))),
    const = list(object=fit1),
    n_jobs = 6
) |> bind_rows()

That looks fairly straightforward to me. clustermq will chunk bigger data for you, which you could add manually if calling tidybayes::add_predicted_rvars once per row adds too much overhead.
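A minimal sketch of the manual chunking mentioned above, assuming a hypothetical helper `split_rows` (not part of clustermq) that cuts a data frame into pieces of at most `chunk_size` rows. To keep it runnable without a scheduler or a fitted model, `lapply()` stands in for `clustermq::Q()` and the built-in `mtcars` data stands in for `epilepsy`:

```r
# Hypothetical helper: split a data frame into chunks of at most
# `chunk_size` consecutive rows.
split_rows <- function(data, chunk_size = 1) {
    chunk_id <- ceiling(seq_len(nrow(data)) / chunk_size)
    split(data, chunk_id)
}

chunks <- split_rows(mtcars, chunk_size = 10)

# lapply() is only a sequential stand-in for clustermq::Q() here;
# each call receives one chunk as its `newdata` argument.
pieces <- lapply(chunks, function(newdata) {
    newdata$chunk_rows <- nrow(newdata)
    newdata
})

# Combine the per-chunk results back into a single data frame.
result <- do.call(rbind, unname(pieces))
```

With clustermq, the same list of chunks would presumably be passed as the iterated argument, e.g. `newdata = split_rows(epilepsy, 10)` in the call above, so that each job handles ten rows instead of one.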

I'm not sure if adding a new concept like Q_rows_nested will make the package easier to use overall. Rather, I'd prefer to only add new functionality if a task can't be (easily) done with the existing API.

What do you think?

wds15 · 2023-11-27T14:26:29Z

Nice alternative version. However, it is not "pipeable" - so the user cannot pipe into a Q boosted thing.
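As a rough sketch of how the split-based call could still be made pipeable, one could wrap it in a function that takes the data frame first. Everything here is an assumption for illustration: `Q_split`, its `arg`, `chunk_size` and `runner` parameters are hypothetical names, and `local_runner` is a sequential stand-in that only mimics the `fun`, iterated argument and `const` parts of the `clustermq::Q()` interface so the example runs without a scheduler:

```r
# Hypothetical wrapper (not part of clustermq): the data frame comes
# first so it can be piped in; chunks are passed as one iterated argument.
Q_split <- function(data, fun, arg = names(formals(fun))[1],
                    chunk_size = 1, runner = clustermq::Q, ...) {
    chunks <- split(data, ceiling(seq_len(nrow(data)) / chunk_size))
    iter <- stats::setNames(list(unname(chunks)), arg)
    results <- do.call(runner, c(list(fun), iter, list(...)))
    do.call(rbind, results)
}

# Sequential stand-in for clustermq::Q(), used for local testing only.
local_runner <- function(fun, ..., const = list()) {
    Map(function(...) do.call(fun, c(list(...), const)), ...)
}

# Toy function standing in for tidybayes::add_predicted_rvars().
add_scaled_count <- function(newdata, k) {
    newdata$n_scaled <- nrow(newdata) * k
    newdata
}

out <- mtcars |>
    Q_split(add_scaled_count, chunk_size = 8,
            runner = local_runner, const = list(k = 2))
```

Leaving `runner` at its default would hand the chunks to `clustermq::Q()`, with extra arguments such as `const` or `n_jobs` forwarded through `...`.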

The other day I had the thought that this should probably be refined towards a "Q_mutate" function, which would even avoid the need for the user to define intermediate functions; those are currently needed when operating on multiple columns at once.

I totally agree with not bloating a package with unnecessary code, for sure. How about we leave this issue open for a while so that we can collect better ideas for the above function, and finally include it in some form in the documentation? An example, a section on the pkgdown homepage, or something similar?

mschubert · 2023-11-27T14:39:49Z

Happy to leave this open for a while and see what we come up with!

mschubert added the enhancement label Nov 27, 2023
