HDIs give unrealistic values #2207

Open
hadjipantelis opened this issue Feb 11, 2023 · 4 comments

@hadjipantelis

Describe the bug
HDI smoothed intervals can result in impossible values. Furthermore, the smoothness parameter can have an unexpectedly large effect.

To Reproduce

import pymc as pm
import numpy as np
import pandas as pd
import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt

print(pm.__version__)
print(az.__version__)
print(bmb.__version__)
# 5.0.2
# 0.14.0
# 0.10.0

Bambi code (somewhat irrelevant for the issue)

_url = "https://raw.githubusercontent.com/CamDavidsonPilon/" + \
        "Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter2_MorePyMC/data/challenger_data.csv"
# Read the data in
challenger_data = np.genfromtxt(_url, skip_header=1, usecols=[1, 2],
                                missing_values="NA", delimiter=",")

# Drop the NA values
challenger_data = challenger_data[~np.isnan(challenger_data[:, 1])]

# Make the pandas df
df_challenger_data = pd.DataFrame(challenger_data,  columns=["temp", "inci"]) 

# Define a standard model for logistic regression
model = bmb.Model("inci ~ temp", df_challenger_data, family="bernoulli")

# Fit the model.
results = model.fit(draws=5000, chains=2)

# Basic look at the model
# az.summary(results) 
# az.plot_trace(results) 

# Make new predictions
new_data = pd.DataFrame({"inci": np.NaN, "temp": np.arange(31, 100,  step=0.5)})
new_inci_data = model.predict(results, data=new_data, inplace=False)
mean_inci = new_inci_data.posterior["inci_mean"].values

Plotting the bands now. Default behaviour in red.

fig, ax = plt.subplots(1,2, figsize=(15,5)) 
az.plot_hdi(new_data['temp'], mean_inci, ax=ax[0], hdi_prob=0.5, color="red") 
az.plot_hdi(new_data['temp'], mean_inci, ax=ax[0], hdi_prob=0.5, smooth=False, color="green") 
ax[0].set_ylim(0.5, 1.05)
ax[0].grid()
ax[0].set_title("Setting smooth=False, i.e. Raw")

az.plot_hdi(new_data['temp'], mean_inci, ax=ax[1], hdi_prob=0.5, color="red") 
az.plot_hdi(new_data['temp'], mean_inci, ax=ax[1], hdi_prob=0.5, smooth=True,  smooth_kwargs= {"polyorder":1}, color="blue") 
ax[1].set_ylim(0.5, 1.05)
ax[1].grid()
ax[1].set_title("Lower order polynomial ({\"polyorder\":1})")

[figure: the two panels produced by the code above - left: default smoothed HDI band (red) vs smooth=False (green); right: default (red) vs polyorder=1 (blue). The smoothed upper bound rises above 1.0.]

Expected behavior
The HDI upper limit should not go above 1.0 when smoothing is used. Maybe do not smooth by default?
I appreciate that the HDIs should probably be computed in the log-odds domain, but here we are. This is likely going to be the case for many "back-transformed" estimates. Would it be reasonable to add a "link" or a "family" argument, so that the HDI calculations are done in the relevant domain?
(Secondary point: I was a bit surprised that smoothing with a lower-order polynomial results in such visibly different HDIs - maybe add a comment about it in the docs?)
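
For illustration, a minimal sketch of the proposed link-domain computation (this is not an existing arviz option): compute the HDI of `mean_inci` on the log-odds scale with `az.hdi`, back-transform the bounds with the inverse link, and draw the band manually. `scipy.special.logit`/`expit` stand in for the Bernoulli link pair, and `mean_inci` is assumed to have shape (chain, draw, n_x) as returned by the prediction code above.

from scipy.special import expit, logit

# Clip to avoid logit(0) / logit(1); the predicted means should already be strictly inside (0, 1).
eps = 1e-12
log_odds = logit(np.clip(mean_inci, eps, 1 - eps))

# az.hdi treats the first two dimensions of an ndarray as (chain, draw) and returns shape (n_x, 2).
hdi_log_odds = az.hdi(log_odds, hdi_prob=0.5)

# Back-transforming the bounds keeps them inside (0, 1) by construction.
lower, upper = expit(hdi_log_odds[:, 0]), expit(hdi_log_odds[:, 1])

fig, ax = plt.subplots()
ax.fill_between(new_data["temp"], lower, upper, alpha=0.4)
ax.set_ylim(0, 1.05)

Note that, as discussed in the next comment, the HDI is not invariant under the transformation, so these are the back-transformed log-odds HDI bounds rather than the probability-scale HDI.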

Issue is loosely related to #2168.

@OriolAbril
Member

I think the solution to this is not doing any interpolation along the "x" dimension. If we do any smoothing, it should happen when computing the HDI, in a similar way to how the quantile computation does it (this is the discussion in the other issue you linked; the main con is that it is not clear how to go about this for the HDI, at least to me).

I would support not smoothing by default. I am also always in favour of adding better docs. The whole smoothing procedure, along with any potential warnings, should ideally be documented in the Notes section. See https://numpydoc.readthedocs.io/en/latest/format.html#notes for reference on the documentation style we aim to follow. Very happy to provide support if you can help with that.

I am not sure it is possible to compute things in the log-odds domain and then transform them back. Quantiles should be the same before and after a monotonic transformation, but even for monotonic transformations the mean of the transformed samples is not the same as the transformation of the mean, and the same happens for the HDI.
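
A quick numerical sketch of that point (illustrative only; a skewed Beta sample is used so the effect is visible): quantiles commute with the monotonic logit transform, but the HDI does not, because the density changes shape under the transformation.

import numpy as np
import arviz as az
from scipy.special import expit, logit

rng = np.random.default_rng(0)
p = rng.beta(2, 8, size=(4, 1000))  # skewed draws on (0, 1), shape (chain, draw)

# Quantile interval: the same whether computed on p or on logit(p) and back-transformed.
print(np.quantile(p, [0.03, 0.97]))
print(expit(np.quantile(logit(p), [0.03, 0.97])))

# HDI: generally different when computed on logit(p) and back-transformed.
print(az.hdi(p, hdi_prob=0.94))
print(expit(az.hdi(logit(p), hdi_prob=0.94)))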

@hadjipantelis
Author

(Sorry @OriolAbril, I missed this when you replied originally.)

  1. I agree that not doing the interpolation as a default is probably the easiest way forward.
  2. Sure, I would be happy to help with writing some notes if you want, but are those for arviz or bambi?
  3. I "partially" agree with the point regarding log-odds. While what you say is obviously correct regarding monotonic transformations, what messes up the plot_hdi is that the _hdi computes the interval_width in the probability domain. If that interval was calculated in the log-odds domain, and we got the bounds in that domain and then transformed them to the probability they should be correct.

@OriolAbril
Member

OriolAbril commented Mar 30, 2023

Here is an example to try and illustrate the different interpolations:

rng = np.random.default_rng(42)
a = rng.normal([0, 0, 0, 1, 1, 1, 1, 1, 1], size=(10, 9))[:, [0, 6, 7]]
# a has now shape (10, 3)

We take the first dimension to be the draw/sample dimension and the second to be the x dimension, with x taking the values 0, 0.3, 0.6.

When we use numpy's quantile function, we get values that are not exactly any of the values in our array: it interpolates between the two closest values and returns a value in between. It is therefore not possible for quantile to return values larger than the maximum in the array.

np.quantile(a, 0.9, axis=0)
# array([0.61544513, 1.62616018, 1.49426994])

Here are the two largest values in each column for comparison:

np.sort(a, axis=0)[-2:, :]
# array([[0.58622233, 1.62559039, 1.48074666],
#        [0.8784503 , 1.63128823, 1.61597942]])

HDI on the other hand returns exactly a value from the array:

az.hdi(a[None, ...])[:, 1]
# array([0.8784503 , 1.63128823, 1.61597942])

This makes HDI less "stable" and leads to the "wigglier" plot shown in #2168 (comment). To try and fix this a bit, plot_hdi does some interpolation too.

However, this interpolation is radically different from the one in quantile. Here, we take the values returned from HDI and their corresponding x and find a polynomial function that goes through those points. We have 3 points, so we can do an exact quadratic interpolation (it is not the method used by plot_hdi, but it is illustrative). It looks like this:

[figure: exact quadratic interpolation through the three HDI upper bounds, rising above the maximum values present in the array]
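
Not the original figure, but a minimal sketch that reproduces the illustration (an exact quadratic through the three HDI upper bounds, which is not what plot_hdi actually does), reusing the `a` array and the imports from above:

x = np.array([0.0, 0.3, 0.6])
upper = az.hdi(a[None, ...])[:, 1]  # the three HDI upper bounds from above

# Exact quadratic through the three (x, upper) points.
coefs = np.polyfit(x, upper, deg=2)
x_fine = np.linspace(0, 0.6, 200)
interp = np.polyval(coefs, x_fine)

print(interp.max() > a.max())  # True: the interpolant overshoots the largest value in the array

plt.plot(x_fine, interp)
plt.scatter(x, upper)
plt.plot(x, np.sort(a, axis=0)[-1, :], "k--")  # column-wise maxima for reference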

It can clearly be seen that it goes over the maximum values present in the array, which is something I think we don't want to happen, and it can break model constraints.

Referencing #2168 (comment) again, the 2nd plot looks much smoother, but there is no interpolation along the x axis. If we compute the quantiles column-wise we get the exact same values, and they always preserve the constraints imposed by the samples. This is why I was advocating for a similar interpolation scheme in hdi, so we can get smoother-looking results without the need for interpolation/smoothing along the x dimension.

@hadjipantelis
Author

Yes, I see your point. Just a clarification: in #2168 I question whether using the standard t-distribution approximation is preferable to using percentile CIs. Which of the two would you consider more appropriate? (Or neither?)
