Permit the user to pass a function to ArviZ to compute log likelihood on demand for memory-intensive models #2197
I am not sure it is possible as of now; in general we use a slightly different approach for this, which is using Dask to handle the chunking and computation graph organization. It integrates very well with xarray and in general is more efficient than looping over each sample, as it works block-wise (loads blocks into memory) and parallelizes computations on a few blocks, making it possible to operate on data that doesn't fit in RAM. I have used ess and rhat on data that doesn't fit in memory, but I am not sure loo has been updated to allow this. Is this something you think you could help out with? Or test from a PR? I think nobody on the team is currently using this, so it has kind of been on the back burner for a bit.
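To make the block-wise idea concrete, here is a minimal sketch using `dask.array` (the array and chunk sizes are made up for illustration; in practice the data would come from your posterior):

```python
import dask.array as da

# A (chain, draw, obs) array split into 10 blocks along the observation
# dimension. Each block is loaded and processed independently, so the
# full array never has to sit in memory at once.
log_lik = da.random.normal(size=(4, 1000, 100_000), chunks=(4, 1000, 10_000))

# Reductions like mean() build a task graph and execute block by block,
# optionally in parallel across blocks.
result = log_lik.mean().compute()
```

Nothing is computed until `.compute()` is called; before that, `log_lik` is just a lazy task graph plus chunk metadata.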
Yes, exactly -- the idea is that in these cases you refrain from computing the pointwise log likelihood in Stan. You compute it later in Python by passing ArviZ a function that gets called for each point. I don't know enough about ArviZ to say whether this should be an option specifically for `loo`. I'd be happy to help with testing, but I don't know anything about xarray or Dask.
I'll try to use a linear regression to illustrate the steps. Say you have 20_000_000 observations and you run a regression on them with multiple predictors, so you generate 4000 posterior samples of the 5 slopes, the intercept and the sigma. There are no generated quantities, so no pointwise log likelihood is stored in the Stan output. You can then use Python + xarray + Dask to compute the pointwise log likelihood values without loading them all into memory. We start from the constant_data + observed_data and the posterior groups. We will assume the posterior has 3 variables: two with (chain, draw) dimensions only (the intercept and sigma) and one with an extra dimension for the 5 slopes. Our pointwise log likelihood will be of shape (chain, draw, observation). That would be something like:
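(The original snippet was not preserved in this thread; below is a stand-in sketch with plain numpy/scipy, toy sizes, and hypothetical variable names, just to show the shapes involved.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy sizes standing in for 4000 draws x 20_000_000 observations
n_chains, n_draws, n_pred, n_obs = 4, 10, 5, 100

# Hypothetical posterior draws and data (in practice loaded from the
# CmdStan output and the constant_data/observed_data groups)
beta = rng.normal(size=(n_chains, n_draws, n_pred))          # slopes
alpha = rng.normal(size=(n_chains, n_draws))                 # intercept
sigma = np.abs(rng.normal(size=(n_chains, n_draws))) + 0.1   # scale
X = rng.normal(size=(n_obs, n_pred))
y = rng.normal(size=n_obs)

# Linear predictor for every draw and observation: shape (chain, draw, obs)
mu = np.einsum("cdp,op->cdo", beta, X) + alpha[..., None]

# Pointwise log likelihood, shape (chain, draw, obs)
log_lik = stats.norm.logpdf(y, loc=mu, scale=sigma[..., None])
```

With the real data sizes you would keep these as chunked dask-backed xarray objects instead of numpy arrays, so the (chain, draw, obs) result is evaluated block by block rather than materialized at once.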
If possible, `dask="allowed"` is preferred, but it doesn't always work, in which case you might need to fall back to `dask="parallelized"`. If you are using a normal likelihood, you can also write down the operations for the log likelihood with Python operators straight away, but I shared the "extra" step of using the einstats wrappers of scipy, which allows computing that for any distribution, even for dask arrays. Eventually, you'd need to call `az.loo` on the result.
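For reference, this is roughly where the `dask=` argument comes in: it is a parameter of `xr.apply_ufunc`. A small sketch (dimension names and sizes are made up; with dask-backed inputs you would pass `dask="allowed"` or `dask="parallelized"` as described above):

```python
import numpy as np
import xarray as xr
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical posterior-derived quantities and data as xarray objects
mu = xr.DataArray(rng.normal(size=(4, 10, 100)),
                  dims=("chain", "draw", "obs_id"))
sigma = xr.DataArray(np.full((4, 10), 2.0), dims=("chain", "draw"))
y = xr.DataArray(rng.normal(size=100), dims="obs_id")

# apply_ufunc broadcasts by dimension name; with dask-backed inputs,
# add dask="allowed" (or dask="parallelized" as a fallback) so the
# computation stays lazy and block-wise.
log_lik = xr.apply_ufunc(stats.norm.logpdf, y, mu, sigma)
# log_lik carries all three dims: chain, draw, obs_id
```

The einstats wrappers mentioned above package this same `apply_ufunc` plumbing for scipy distributions so you don't have to write it by hand.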
When my Stan model computes and saves the log likelihood of my data, the resulting files are huge. When I try to read these files in with `arviz.from_cmdstan()`, I run out of memory.

The R implementation of LOO-PSIS seems to have a provision for such cases. The documentation says that, instead of passing the log likelihood as an array or matrix, you can pass an R function to `loo()` to calculate and return the log likelihood of each data point separately, based on the data and the draws from the posterior: https://mc-stan.org/loo/reference/loo.html#methods-by-class-

There is a vignette which explains that in these cases, you should not use the generated quantities block of your Stan program to compute the log likelihood. Instead, you write an R function to calculate it that `loo()` will call repeatedly: https://mc-stan.org/loo/articles/loo2-large-data.html

This way, the size of your Stan output does not explode, and you can just calculate the log likelihood for each point as `loo()` requires it, without running out of memory.

I have looked for equivalent functionality in ArviZ, but cannot find it. Does it exist? It would be great not to have to switch to R for this.
Thanks!