Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AR-correlated random effects; issue #1307 #1409

Open
wants to merge 30 commits into
base: master
Choose a base branch
from

Conversation

mpdylan
Copy link

@mpdylan mpdylan commented Sep 15, 2022

This PR adds a new parameter latent to ar(), ma(), and arma(), with a default value of FALSE. The function of this parameter, when set to TRUE, is to explicitly parameterize the autocorrelated effects using latent variables, even if the model family has natural residuals and does not require such parameterization. The Stan code generated has a structure borrowed from the latent variable parameterization currently used for model families that do not admit natural residuals, such as Poisson, but with different names for variables to avoid confusion.

This PR also modifies prepare_predictions_ac() and other relevant functions to ensure that the autocorrelated random effects are incorporated when generating predictions using posterior_epred(). If new data is provided for prediction, it checks to see whether grouping levels for the new data match those in the original data used to fit the model. If no match is found for new grouping levels, it samples a new multivariate normal vector in the same way that the existing latent variable functionality does. If a match is found, it attempts to sample the autocorrelated effect conditional on the observed values, by taking the draws of the standardized latent variables for those data points for which they exist, filling in newly generated standard normal random variates for new data points, and multiplying by the lower Cholesky factor of the covariance matrix to apply the covariance structure. This process needs some work -- currently, it will only work correctly if the new data contains all of the original data, and the new data points for prediction appear contiguously in the data frame following the original data points. In particular, this means that we can't fill in unobserved time points in the middle of a time series.

In order to make the above work, there are some ancillary changes, such as modifying make_standata() to include grouping levels for autocorrelated effects when the latent flag is set to TRUE.

Posterior predictions using predict() appear to work at first glance, but I haven't tested them carefully yet.

@paul-buerkner
Copy link
Owner

paul-buerkner commented Sep 15, 2022 via email

@mpdylan
Copy link
Author

mpdylan commented Sep 16, 2022

Sure, I think that would be simple to revise. I'd also like to put in some more time improving the handling of the time variable in two main ways:

  • correctly match times in new data and original data so that fitted and predict can handle new time points in the middle of a series, not only at the end
  • allow multiple observations for the same time point/group

@paul-buerkner
Copy link
Owner

Nice, that all sounds good!

@mpdylan
Copy link
Author

mpdylan commented Nov 30, 2022

The most recent commit addresses the two issues from my last comment: if an explicit time variable is specified with the time parameter in ar() etc., then latent parameters are assigned to each unique time point instead of each row in the data. This allows multiple observations per time point, as well as time series that do not consist of a simple interval. Predictions can be drawn for arbitrary time points, including interpolating gaps in the original time series, or extrapolation backward in time.

A couple of notes:

  • If the new functionality is not needed, it's recommended to not set the time variable, as the flexible predictions are noticeably slower.
  • It's no longer necessary to set both latent = T and cov = T in the formula; if latent is true then cov will be set to true automatically.
  • At predict time, the time points to condition on are determined by checking which time points in the new data frame have non-NA values in the response variable. This means that if the new data frame does not contain observations for time points that were observed in the original data, the fitted latent errors for those time points will not be used in prediction (and new ones may be drawn instead). I think of this as a feature -- it means that you can fit on a full time series, and later predict as if you had only observed part of it -- but it's something to be aware of when constructing new data frames for predicts as it may produce unexpected results.

@paul-buerkner
Copy link
Owner

Thank you! Is this PR ready for another round of my review? If so, can you provide me with some test cases that I could run and debug through the code to see what happens and where things may need some more changes?

@mpdylan
Copy link
Author

mpdylan commented Dec 1, 2022

I thought it was ready, but I discovered a bug while preparing some test cases to send you, so I'll follow up on this soon.

I have a couple more things I would like to work on, but they may make more sense as a separate PR; e.g., implementing covariance/latent parameterization for some higher-order structures such as AR(2).

@paul-buerkner
Copy link
Owner

paul-buerkner commented Dec 2, 2022 via email

@mpdylan
Copy link
Author

mpdylan commented Jan 9, 2023

After fixing some bugs I think this is ready to be looked at. I've prepared a handful of examples at mpdylan/brms_test_cases demonstrating the latent variable functionality on time series with gaps or multiple observations on the same time point. This includes one example using synthetic data and two using publicly available data sets; models with and without linear predictors and grouping factors; and, Gaussian, Poisson, and binomial likelihoods.

One new piece of functionality and one additional comment:

  • I added a function ac_latent to extract the fitted latent residuals, in analogy to the ranef function for extracting conventional random effects.
  • In order to predict properly, this requires save_pars(all = TRUE) to ensure the latent residuals are saved. It wasn't obvious to me if it would be easy to make this option set by default when using the latent AR effects -- maybe it would.

Copy link
Owner

@paul-buerkner paul-buerkner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR looks more and more ready. Thank you for your hard work!

I have added a bunch of questions for discussion. some of them, I may also address myself later but beforehand I would like to understand some of your coding decision better to make sure I don't break stuff.

@@ -391,6 +391,83 @@ VarCorr.brmsfit <- function(x, sigma = 1, summary = TRUE, robust = FALSE,
lapply(tmp, .VarCorr)
}

#' Extract Latent Autocorrelated Effects
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has very limited application for an exported function. Why do we need it? I would prefer not to have such a function being exported?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly added this in response to a question from a colleague who was experimenting with my fork -- they wanted a way to extract the latent variables analogous to calling ranef on a fitted model to extract group-level parameters. I'd be happy to do this in a different way if that would fit better.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to not have this function as part of this PR as it is too specific. I would like to think through separately if there is a good and general enough function and use-case that would contain this here as a special case. Perhaps we can discuss in a separate issue after this PR has been merged (without the function for now).

if (!is.null(prep$ac$sderr)) {
sigma2 <- as.numeric(prep$ac$sderr)^2
}
if (!is.null(prep$ac$sdacranef)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "sdacranef" still needed? It seems to come from an old version of the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I think you're right; that was from the earliest version. will remove this.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

R/data-helpers.R Outdated
@@ -240,7 +240,10 @@ order_data <- function(data, bterms) {
tv <- seq_rows(data)
}
if (any(duplicated(data.frame(gv, tv)))) {
stop2("Time points within groups must be unique.")
if (!parameterize_ac_effects(bterms)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to distinquish has_ac_latent_residuals and parameterize_ac_effects? Naively I would think that the former should be enough but I may be mistaken of course.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't. I think my intent here was to avoid "overwriting" existing latent variable functionality with the new code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have the latent feature tightly integrated with the rest of the autocorrelation features and code base. Otherwise, we have multiple things that do basically the same and the branching logic becomes confusing. I am sorry that I am so particular about these things. You can let me know when there are changes that I suggest that you would prefer me to do in this PR. I know it is a bit tedious sometimes.

@@ -668,6 +669,66 @@ data_ac <- function(bterms, data, data2, basis = NULL, ...) {
c(if (N_tg > 1L) begin_tg[2:N_tg], N + 1) - begin_tg
))
out$end_tg <- with(out, begin_tg + nobs_tg - 1)
if (parameterize_ac_effects(acef)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this adds a lot of new data variables to standata. Is it possible to reuse more of the variables that we already have for ARMA correlation structure? It seems they would also be applicable to the latent version.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main purpose here is to include the information necessary to match elements of the data with the corresponding latent variables when the mapping is not one-to-one (because of multiple observations at the same time point). The number of observations in a group, the difference between the minimum and maximum time points, and the number of unique time points may all be different, which means the bookkeeping is trickier. However, it's definitely possible I over-complicated this as I built this up, so I'll take a look back at it and see if I can simplify it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I appreciate you looking into it!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon reviewing this, I find that several of the existing variables (begin_tg, end_tg, and nobs_tg) are not used in the new formulation, so instead of defining new variables we could just change the definition of these. Furthermore, I think ac_time is no longer used anywhere, so I'll just remove it.

I do think N_latent_err and latent_err_idx are necessary to define the correct number of latent parameters and match them with rows in the data frame, and max_time_span is needed to avoid indexing the covariance matrix beyond its range.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the definition of old variables is possible but perhaps not ideal. Why would we need to change the meaning of them for use in the new structure?

I agree that the three variables you mentioned at the end are likely needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old variables contain the start and end indices of groups in the vector of observations; the new ones contain indices in the vector of latent variables, which will be different indices if there are any time points observed more than once. I think one could extract the same information from the original definition of these variables combined with latent_err_idx, eliminating the need to change them, at the cost of adding a little more code to the transformed data block in the generated Stan program, and possibly something to prepare_predictions (though I would have to think about that as there's a lot there already that can probably be simplified).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Perhaps changing the definitions would be worth a try.

For context: Right now, there are kind of two new features here. One is the latent error terms, the other one is the flexible grouping of errors. Both should be theoretically employable independently of each other. In terms of new coding effort, I think the flexible grouping terms are the ones requiring much more effort right now, which makes sense. But we shall keep in mind what implications this has. So basically, we should be able to turn the flexible grouping on an off (depending on whether the related variable is supplied). And depending on that, the meaning of the old variables could indeed change.

@@ -107,9 +107,13 @@ prepare_predictions.brmsterms <- function(x, draws, sdata, data, ...) {
for (dp in valid_dpars) {
dp_regex <- paste0("^", dp, resp, "$")
if (is.btl(x$dpars[[dp]]) || is.btnl(x$dpars[[dp]])) {
old_acdata <- NULL
if (parameterize_ac_effects(x$dpars[[dp]])) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to rerun full make_standata here again? in prepare predictions, we already compute standata and pass is as variable sdata.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this is that we need some information on the old grouping levels (which isn't present in sdata since that is computed using the newdata) to make predictions conditional on the original data. However, perhaps it's better to split out a helper function similar to prepare_predictions_ranef that packages only the necessary metadata and posterior draws for the predictions requested.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. brms has it's own process in place for this purpose. See standata_basis in make_standata.R

@@ -764,6 +773,145 @@ prepare_predictions_ac <- function(bterms, draws, sdata, oos = NULL,
}
}
}
if (parameterize_ac_effects(bterms)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure everything is in order with the code below, I would need to discuss this code in a meeting I think.

inst/chunks/fun_scale_time_err_t.stan Show resolved Hide resolved
@mpdylan
Copy link
Author

mpdylan commented Feb 14, 2023

Latest commit makes the following changes:

  • Remove some of the unnecessary variables from the standata
  • Add variables to the basis section of standata and remove the call to make_standata in prepare_predictions
  • Simplify predictor_ac (which I think I had just made more complex than necessary, upon reviewing)
  • Simplify prepare_predictions_ac by removing some redundant or unnecessary vectors of indices
  • Fixed an error that could occur in predictions if newdata was supplied that reproduced some, but not all, of the original training data for a group level
  • Remove the outdated sdacranef code and the parameterize_ac_effects, replacing the latter with has_ac_latent_residuals
  • Remove the exported ac_latent function
  • Produce a warning when generating predictions if the latent parameters were not saved at fit time

Copy link
Owner

@paul-buerkner paul-buerkner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR looks quite good already. Except for the outstanding live code reviews via a call, my comments are minor I think.

@@ -255,7 +255,9 @@ get_cov_matrix_ac <- function(prep, obs = NULL, latent = FALSE) {
}
# prepare residual standard deviations
if (latent) {
sigma2 <- as.numeric(prep$ac$sderr)^2
if (!is.null(prep$ac$sderr)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should prep$ac$sderr ever be NULL in this case? Then sigma would not be defined at all?

))
out$end_tg <- with(out, begin_tg + nobs_tg - 1)
# if using latent variables, add grouping levels
if (has_ac_latent_residuals(bterms)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me at first glance, but I would like to review this in a call with you.

R/formula-ac.R Outdated
@@ -580,10 +587,21 @@ use_ac_cov_time <- function(x) {

# does the model need latent residuals for autocor structures?
has_ac_latent_residuals <- function(bterms) {
!has_natural_residuals(bterms) &&
(!has_natural_residuals(bterms) ||
has_ac_subset(bterms, latent = T)) &&
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always write out TRUE and FALSE. Don't use T or F

R/formula-ac.R Outdated
(use_ac_cov(bterms) || has_ac_class(bterms, "arma"))
}

# use explicitly parameterized autocor effects?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember to remove this before we merge if we continue to not need this function.

@@ -290,6 +290,11 @@ standata_basis_ac <- function(x, data, ...) {
out$locations <- NA
}
}
if (has_ac_class(x, "arma")) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are overwriting the complete "out" variable instead of adding to it here. Please add the new variables to the existing list.

R/priors.R Outdated
if (has_ac_latent_residuals(bterms)) {
prior <- prior +
brmsprior(def_scale_prior, class = "sderr", ls = px, lb = "0")
}
# if (parameterize_ac_effects(bterms)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember to remove this commented out code

" vector[N{resp}] zerr{p}; // unscaled residuals\n"
)
#changed ac effects tag
if (has_explicit_time &&
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put if statements in one line if possible

" vector[N{resp}] err{p}; // actual residuals\n"
)
# changed ac effects tag
if (has_explicit_time &&
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

str_add(out$data) <- glue(
" int<lower=1> N_latent_err{p};\n",
" int<lower=1> max_time_span{p};\n",
# " int<lower=1> begin_err_gr{p}[N_tg{p}];\n",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember to remove the commented out code

if (has_explicit_time) {
str_add(out$tpar_comp) <- glue(
" // compute correlated time-series residuals\n",
" err_tp{p} = scale_time_err_t(",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we give "scale_time_err_t" a more informative suffix than "_t"?

@paul-buerkner
Copy link
Owner

@mpdylan Is this PR still something you are working on?

@mpdylan
Copy link
Author

mpdylan commented Jan 31, 2024

Hi @paul-buerkner -- I was pulled away from this for some time due to another project, but I'd like to resume working on it. Are there any specific changes to the current code that I should be aware of before getting back to it?

@paul-buerkner
Copy link
Owner

Thank you for coming back to this issue! As far as I can remember, in our last meeting, I recommended changing the code structure to some degree to align with other autocorrelation features I had implemented after you started working on this feature. I still feel sorry for putting extra work on you because of this. Did you already change your PR in this direction? I just want to make sure that we can align different features and structures as much as possible to reduce the overall maintenance burden on brms.

@mpdylan
Copy link
Author

mpdylan commented Feb 6, 2024

Yes, I remember that as well. I was beginning to work on that when I had to pause, but I'll review my notes and the current state of my fork this week and see how much is left to be done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants