Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change density default bw = "nrd0" to bw = "sj" #5854

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

teunbrand
Copy link
Collaborator

This PR aims to fix #3825.

Briefly, ?density recommends the "sj" method over the "nrd0" default. This change propagates that recommendation to ggplot2's density calculations.

Some details:

  • Understandably, plenty of snapshots have changed.
  • I realised that calc_bw() and precompute_bw() were doing the exact same task, so I merged these.
  • When var(x) == 0, most of the stats::bw.*() functions throw an error, so we keep bw.nrd0() for these degenerate cases because it will not throw an error.

@teunbrand teunbrand added the breaking change ☠️ API change likely to affect existing code label Apr 22, 2024
@thomasp85
Copy link
Member

I'm not sure I think this is worth the breaking change to be honest... I probably miss a compelling example of when it would improve a visualisation

@teunbrand
Copy link
Collaborator Author

I don't know enough about these bandwidth estimators to make a qualified judgement.

However, this post shares some thoughts.
Consider having multimodal data.
The current default bw = "nrd0" smooths over everything.

library(ggplot2)
set.seed(0)

df <- data.frame(x = c(
  runif(20, 9, 11),
  runif(10, 19, 21),
  runif(20, 29, 31),
  runif(30, 39, 41)
))

ggplot(df, aes(x)) +
  geom_density(bw = "nrd0") +
  geom_rug() +
  ggtitle("bw = 'nrd0'")

Whereas bw = "sj" preserves the multimodality.

ggplot(df, aes(x)) +
  geom_density(bw = "sj") +
  geom_rug() +
  ggtitle("bw = 'sj'")

Counter-argument might be that if you have some exponential data, bw = "nrd0" gives a nice smooth density.

df <- data.frame(x = rexp(1000))

ggplot(df, aes(x)) +
  geom_density(bw = "nrd0", bounds = c(0, Inf)) +
  geom_rug() +
  ggtitle("bw = 'nrd0'")

Whereas bw = "sj" gives a more 'wobbly' curve.

ggplot(df, aes(x)) +
  geom_density(bw = "sj", bounds = c(0, Inf)) +
  geom_rug() +
  ggtitle("bw = 'sj'")

Created on 2024-05-20 with reprex v2.1.0

It is probably a trade-off that I cannot motivate very well.
Maybe we can bat-signal @mjskay to see if we can get him to share his opinion.

@teunbrand teunbrand closed this May 21, 2024
@mjskay
Copy link
Contributor

mjskay commented May 22, 2024

I could perhaps dig up some examples if there is interest. I changed the ggdist default to the equivalent of bw.SJ(..., method = "dpi") around the time I added automatic bounds detection, because I found it tended to do better with some bounded distributions (e.g. log normal, where nrd0 with the reflection method doesn't always have a small enough bandwidth to get the mode right). I started a more extensive evaluation of bandwidth estimators at some point but haven't had a chance to finish it.

@mjskay
Copy link
Contributor

mjskay commented May 22, 2024

For example

set.seed(20240522)

data.frame(x = rlnorm(10000)) |>
    ggplot() +
    geom_density(aes(x, color = "sj"), bw = "sj", bounds = c(0, Inf), n = 1000) +
    geom_density(aes(x, color = "nrd0"), bw = "nrd0", bounds = c(0, Inf), n = 1000) +
    geom_function(fun = dlnorm, n = 1000, linetype = "22") +
    coord_cartesian(xlim = c(0,5))

image

It is worth pointing out that nrd0 is an easier-to-use default because the other bandwidth estimators may fail in some cases, particularly data with many duplicates. e.g.:

density(rep(1, 10), bw = "SJ")
## Error in bw.SJ(x, method = "ste") : sample is too sparse to find TD

ggdist handles this by falling back to nrd0 with a warning that the user probably shouldn't be using a kde on discrete data anyway (because, well, they shouldn't be...).

@teunbrand
Copy link
Collaborator Author

Thanks @mjskay, I appreciate your thoughts on the matter. I tend to agree with you and the authors of ?density that it probably is a better default than nrd0. Maybe this PR is worth reviving, but instead of changing the default bandwidth estimator, which might cause too much of a kerfuffle with users, ggplot2 could set a global option for it so it is easier to changes across density and violin layers simultaneously.

@teunbrand teunbrand reopened this May 22, 2024
@teunbrand teunbrand marked this pull request as draft May 22, 2024 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change ☠️ API change likely to affect existing code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change default for bw arg of stat_density() and stat_ydensity()
3 participants