
Bayesian power analysis #2507

Open · wants to merge 34 commits into base: main
Conversation

@lukebrawleysmith (Contributor) commented May 10, 2024

This pull request extends power analysis to our Bayesian engine.

This PR permits Bayesian power analysis by updating the front-end Power Calculation code.
A customer can now specify a prior just as they would during an experiment.
Power analysis defaults to the org's default stats engine, and frequentist vs. Bayesian power analysis can now be toggled under Analysis Settings:

[Screenshot 2024-05-30 at 10 14 45 AM]

If the engine is switched from frequentist to Bayesian, power is recalculated using the org defaults for a metric. If desired, the customer may select Edit and update Prior Mean and Prior Standard Deviation under the Metrics modal:

[Screenshot 2024-05-30 at 10 09 30 AM]

github-actions bot commented May 10, 2024

Your preview environment pr-2507-bttf has been deployed.

Preview environment endpoints are available at:

@lukesonnet lukesonnet changed the base branch from main to node-sdk May 16, 2024 18:24
@lukesonnet lukesonnet changed the base branch from node-sdk to main May 16, 2024 18:24
@lukesonnet (Contributor) left a comment

Some smaller comments and one bigger comment.

Broad question: is the unit test here set up to mirror the Python notebook, and should we take the Python notebook as the source of truth for power validity?

packages/front-end/components/PowerCalculation/types.ts (outdated; resolved)
packages/back-end/types/stats.d.ts (outdated; resolved)
packages/front-end/components/PowerCalculation/stats.ts (outdated; resolved)
true
);
if (maxPower < power) {
console.log(`failing at iteration j: %d`, j);
Contributor:

These will need to be logged later.

How frequently do we end up in this case? Is this the case where the UI will show N/A or "-" or something?

Contributor Author:

So far, this case has never happened. I had it in for testing purposes; happy to remove it if you want.

I haven't rigorously resolved the upper bound issue yet, so I kept this in for now.

Contributor:

If it could possibly happen, then we should have proper error handling for it; if it is impossible, we should remove it.
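If the case does turn out to be reachable, one option is a typed result that the UI can map to N/A instead of a console.log side effect. This is an illustrative sketch, not code from the PR; all names are hypothetical.

```ts
// Hypothetical result type: lets callers surface "N/A" in the UI rather
// than relying on a console.log when the search fails to converge.
type PowerResult =
  | { status: "ok"; power: number }
  | { status: "not-converged"; iteration: number };

function boundedPower(
  power: number,
  maxPower: number,
  iteration: number
): PowerResult {
  // Mirrors the `maxPower < power` check in the snippet above.
  return maxPower < power
    ? { status: "not-converged", iteration }
    : { status: "ok", power };
}
```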

return mdeResults;
}

export function powerMetricWeeksBayesian(
Contributor:

I see why you asked about copying. There's a lot of complex logic here that would benefit from being written only once. Take a look at this slight rewrite: https://github.com/growthbook/growthbook/compare/ls/power-structure?expand=1

I've done two key things:

  • Make some of the code a bit more typescript friendly (e.g. don't unpack a bunch of stuff, use .forEach)
  • Ask you to change powerEst and powerEstBayesian to instead take an object with your parameters specified. Then build those parameters and specify which function to use in one single if-else block.

My code is pseudo-code, since I didn't create those other types; it can just serve as a reference.

Contributor Author:

I incorporated your pseudo-code. However, I didn't define new types as inputs for powerEst and powerEstBayesian, as the linter was fussing at me for having different types as inputs for a powerFunction function; also, the way I have it is simple. If you have a more elegant approach, I'm open to incorporating it. I also have statsEngineSettings as an attribute of powerSettings, rather than as a standalone object, because we always want the frequentist statsEngineSettings to be part of PowerCalculationParams and the Bayesian statsEngineSettings to be associated with PowerCalculationParamsBayesian. Again, if you have a better suggestion, I'm open.
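For readers following along, here is a sketch of the shape this discussion lands on, combining the single if-else dispatch suggested above with statsEngineSettings nested inside the params. These types are illustrative, inferred from the thread rather than copied from types.ts.

```ts
// Illustrative types only; the real definitions live in
// packages/front-end/components/PowerCalculation/types.ts.
interface StatsEngineSettingsFrequentist {
  type: "frequentist";
  sequentialTesting: false | number;
}
interface StatsEngineSettingsBayesian {
  type: "bayesian";
  priorMean: number;
  priorStandardDeviation: number;
}

// Each params type carries its matching engine settings, so a frequentist
// calculation can never be paired with Bayesian prior fields.
interface PowerCalculationParams {
  statsEngineSettings: StatsEngineSettingsFrequentist;
  // ...shared fields (metrics, weeks, users per week) omitted
}
interface PowerCalculationParamsBayesian {
  statsEngineSettings: StatsEngineSettingsBayesian;
  // ...shared fields omitted
}

declare function powerEstFrequentist(p: PowerCalculationParams): number;
declare function powerEstBayesian(p: PowerCalculationParamsBayesian): number;

function estimatePower(
  p: PowerCalculationParams | PowerCalculationParamsBayesian
): number {
  // The casts are needed because TypeScript does not narrow the outer union
  // from a nested discriminant check, which is likely the linter friction
  // mentioned above.
  if (p.statsEngineSettings.type === "frequentist") {
    return powerEstFrequentist(p as PowerCalculationParams);
  }
  return powerEstBayesian(p as PowerCalculationParamsBayesian);
}
```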

github-actions bot commented May 16, 2024

Deploy preview for docs ready!

✅ Preview
https://docs-eiplg9yes-growthbook.vercel.app

Built with commit af9d684.
This pull request is being automatically deployed with vercel-action

@lukebrawleysmith (Contributor Author) quoted:

> Some smaller comments and one bigger comment.
>
> Broad question: is the unit test here set up to mirror the Python notebook, and should we take the Python notebook as the source of truth for power validity?

n: number,
nVariations: number,
alpha: number = 0.05,
twoTailed: boolean = true,
sequentialTuningParameter = 0
sequentialTesting: false | number
Contributor:

In general I think having something typed as either false | number isn't the best approach if we can avoid it. It's a bit confusing, but no need to change it right now.

Contributor Author:

I agree. I can talk to Romain about this; he suggested this input when only frequentist power analysis was available. Now that we are implementing Bayesian, I can see if we can simplify.
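One possible simplification, sketched under the assumption that the number is a sequential-testing tuning parameter; the names here are illustrative, not from the PR.

```ts
// A discriminated object makes the "off" state explicit instead of
// overloading false into a numeric field.
type SequentialTestingSetting =
  | { enabled: false }
  | { enabled: true; tuningParameter: number };

// Call sites then read unambiguously:
const off: SequentialTestingSetting = { enabled: false };
const on: SequentialTestingSetting = { enabled: true, tuningParameter: 5000 };
```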

mde: -999,
};
if (powerSettings.statsEngineSettings.type === "frequentist") {
thisPower = powerEst(
Contributor:

Should we call these methods powerEstFrequentist and findMdeFrequentist for consistency?

Contributor Author:

done

Contributor Author:

Not sure why there isn't an option to reply to the comment below ("If it could possibly happen, then we should have proper error handling for it; if it is impossible, we should remove it."), but I have revamped the Bayesian MDE code, and this case no longer exists.

true
);
if (maxPower < power) {
console.log(`failing at iteration j: %d`, j);
Contributor:

If it could possibly happen, then we should have proper error handling for it; if it is impossible, we should remove it.

docs/docs/statistics/power.mdx (resolved)
@@ -107,6 +107,14 @@ In clinical trials, the standard is 80%. This means that if you were to run your

That being said, running an experiment with less than 80% power can still help your business. The purpose of an experiment is to learn about your business, not simply to roll out features that achieve statistically significant improvement. The biggest cost to running low-powered experiments is that your results will be noisy. This usually leads to ambiguity in the rollout decision.

## How do I run Bayesian power analysis?

For Bayesian power analysis, you specify the prior distribution of the treatment effect (see here [to be written] for guidance regarding prior selection). We then estimate Bayesian power, which is the probability that the $(1 - \alpha)$ credible interval after the test does not contain 0.
Contributor:

This isn't the flow most people will use, since they will largely accept the default priors for that metric. Maybe we need to hold off on this section until we have more details on what users will actually do; in any case, I think users can't modify priors in V1 of Bayesian power, so this section is misleading.
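Setting aside who can modify priors in V1, the definition quoted above has a closed form under a conjugate normal model. The sketch below illustrates that definition only; it is not GrowthBook's implementation, and every name in it is hypothetical. With a prior $\Delta \sim N(\mu_0, \sigma_0^2)$ on the effect and an estimate $\hat{\Delta} \sim N(\Delta, \hat{\sigma}^2)$, the posterior is normal, and the $(1 - \alpha)$ credible interval excludes 0 exactly when $\hat{\Delta}$ falls outside an interval that can be computed directly.

```ts
// Illustrative sketch of the Bayesian power definition above under a
// conjugate normal model; hypothetical names, not GrowthBook's code.

// Standard normal CDF via the Abramowitz & Stegun 7.1.26 erf approximation.
function normCdf(x: number): number {
  const t = 1 / (1 + (0.3275911 * Math.abs(x)) / Math.SQRT2);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t -
      0.284496736) *
      t +
      0.254829592) *
    t;
  const erf = 1 - poly * Math.exp(-(x * x) / 2);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

function bayesianPower(
  effect: number,    // hypothesized true effect (Delta)
  seHat: number,     // standard error of the estimated effect
  priorMean: number, // mu_0
  priorSd: number,   // sigma_0
  zStar = 1.959964   // z_{1 - alpha/2}; 1.96 for alpha = 0.05
): number {
  // Posterior precision and standard deviation for the conjugate model.
  const postPrec = 1 / priorSd ** 2 + 1 / seHat ** 2;
  const postSd = Math.sqrt(1 / postPrec);
  // Posterior mean is linear in the observed estimate: m = w * est + b.
  const w = 1 / seHat ** 2 / postPrec;
  const b = priorMean / priorSd ** 2 / postPrec;
  // The credible interval excludes 0 iff |m| > zStar * postSd; solve for the
  // estimate thresholds and integrate over its sampling distribution.
  const upper = (zStar * postSd - b) / w;
  const lower = (-zStar * postSd - b) / w;
  return (
    1 -
    normCdf((upper - effect) / seHat) +
    normCdf((lower - effect) / seHat)
  );
}
```

As a sanity check, a very diffuse prior (large priorSd) drives the weight w toward 1 and b toward 0, and the expression collapses to the usual frequentist power formula.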

romain-growthbook and others added 3 commits May 23, 2024 14:38
* Almost done wiring.

* Almost there.

* Default to org setting for engine.

* Keep values in sync.

* Split metric params.
@lukebrawleysmith (Contributor Author) commented May 30, 2024

Adding some context:

This PR permits Bayesian power analysis by updating the front-end Power Calculation code.
A customer can now specify a prior just as they would during an experiment.
Power analysis defaults to the org's default stats engine, and frequentist vs. Bayesian power analysis can now be toggled under Analysis Settings:

[Screenshot 2024-05-30 at 10 14 45 AM]

If the engine is switched from frequentist to Bayesian, power is recalculated using the org defaults for a metric. If desired, the customer may select Edit and update Prior Mean and Prior Standard Deviation under the Metrics modal:

[Screenshot 2024-05-30 at 10 09 30 AM]

I updated the testing files to match output from the Python notebook, where simulations confirmed that Bayesian power is accurate.

We still need to

  • resolve the case where the effect size precedes the metric arguments on the metric modal (currently challenging to reproduce)

@lukebrawleysmith lukebrawleysmith changed the title Lukesmith bayesian power Bayesian power analysis May 31, 2024
@lukesonnet (Contributor) left a comment

Left some main doc comments for @lukebrawleysmith and a few important FE comments for @romain-growthbook re: priors.

Specifically:

  • We need a much slimmer way for users to input priors. It's overwhelming right now. IMO they should not change this every time they run a power calculation, so those prior fields should either live in the StatsEngine modal or be greatly de-emphasized (for example, a small text link "Override metric priors" that hides all the prior input fields until you click it to open them).
  • We need to inherit metric-specific priors that override the org defaults. I've left comments about this in the files.


const form = useForm<PartialPowerCalculationParams>({
defaultValues: params,
});

const metrics = form.watch("metrics");
const defaultValues = Object.keys(config).reduce(
Contributor:

@romain-growthbook: Metrics can override organization defaults. We should apply those metric-level settings instead of the org settings if they exist.

),
disabled: true,
Contributor:

@romain-growthbook :

  1. Can we hide the Enable Sequential Testing div (the one on L89 below, styled as className="row mt-2") unless "Frequentist" is the selected stats engine? You can see similar logic in the AnalysisForm. A minimal sketch follows this list.
  2. Can you use a resolver to pick the right stats engine default? You can see how we use scoped settings in AnalysisForm and StatsEngineSelect to pick the right stats engine (organization -> potential project-level override).
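A minimal sketch of the conditional rendering suggested in item 1. The component and prop names here are hypothetical; only the className comes from the snippet under review.

```tsx
import React from "react";

// Hypothetical component: render the sequential-testing controls only when
// the frequentist engine is selected, mirroring the logic in AnalysisForm.
function SequentialTestingSection({
  engineType,
}: {
  engineType: "frequentist" | "bayesian";
}) {
  if (engineType !== "frequentist") return null;
  return (
    <div className="row mt-2">
      {/* Enable Sequential Testing toggle and tuning parameter input */}
    </div>
  );
}
```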

Contributor:

design comment...

Furthermore, I think we should remove this entire span:

<span>
                      Enable Sequential Testing to look at your experiment
                      results as many times as you like while preserving the
                      false positive rate.{" "}
                      <DocLink docSection="statisticsSequential">
                        Learn More <FaExternalLinkAlt />
                      </DocLink>
                    </span>

And just put a Learn More link to the docs next to the Sequential Testing header.

@@ -56,6 +56,9 @@ def create_storage_arrays(self):
self.lower_limit = np.empty(array_shape)
self.upper_limit = np.empty(array_shape)
self.results = np.empty(array_shape, dtype=object)
self.stat_a = np.empty(array_shape, dtype=object)
self.stat_b = np.empty(array_shape, dtype=object)
self.estimand = np.empty(array_shape)
Contributor:

@lukebrawleysmith, we need to store these in run_iteration().

Contributor Author:

I meant to delete this, sorry. It should be part of its own PR (I think it would be helpful for debugging to store these).

metricType: "all",
type: "percent",
tooltip: "Prior mean for the relative effect size.",
defaultSettingsValue: (s) => s.metricDefaults?.priorSettings?.mean,
Contributor:

@romain-growthbook Related to my other comment about metric overrides, maybe here or somewhere else is where you need to go from organization to metric override (if override == true).

You can see this logic here: https://github.com/growthbook/growthbook/blob/main/packages/shared/src/experiments.ts#L187-L243
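A hypothetical sketch of that fallback. The field names follow the priorSettings fields visible in this diff and thread (mean, override, proper, and the standard deviation); the real resolution logic is in the linked experiments.ts, not this helper.

```ts
// Hypothetical helper; not the packages/shared/src/experiments.ts code.
interface PriorSettings {
  override: boolean; // metric opts in to overriding the org defaults
  proper: boolean;
  mean: number;
  stddev: number;
}

function resolvePriorSettings(
  orgDefaults: PriorSettings,
  metricSettings?: PriorSettings
): PriorSettings {
  // Apply metric-level settings only when the metric explicitly opts in.
  return metricSettings?.override ? metricSettings : orgDefaults;
}
```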

title: "Use proper prior",
metricType: "all",
type: "boolean",
defaultSettingsValue: (s) => s.metricDefaults?.priorSettings?.override,
Contributor:

In addition to checking override to see whether the metric settings should override the org settings, this needs to use the proper field, not override, as the default.

Contributor:

Delete this file?

Contributor Author:

done

title: Power Analysis Technical Details
description: Technical details of power analysis
sidebar_label: Power Technical
slug: powerTechnical
Contributor:

Follow the naming convention of all of the other docs. This should be power-technical.mdx and power-technical should be the slug.

Contributor Author:

done

where

$\tilde{\sigma} = \hat{\sigma}*\sqrt{N}\sqrt{\frac{2(N\rho^2 + 1)}{N^2\rho^2}\log\left(\frac{\sqrt{N\rho^2 + 1}}{\alpha}\right)}\frac{1}{Z_{1-\alpha/2}}$.
Bayesian MDE
Contributor:

Can we format this either bold or something? Right now it just looks like random text in the middle of the documentation.

Contributor Author:

Thanks for catching the typo; fixed.


We make the following assumptions:

1. equal sample sizes across control and treatment variations;
1. equal sample sizes across control and treatment variations. If unequal sample sizes are used in the experiment, use the smaller of the two sample sizes. This will produce conservative power estimates.
Contributor:

Nice clarification, thanks.

Contributor Author:

thx

## GrowthBook implementation
## GrowthBook implementation (frequentist)

Below we describe high-level technical details of our implementation. All technical details can be found [here](/statistics/powerTechnical).
Contributor:

Thanks for the updates here. I still think we need a bit more signposting and clear language.

Can we say something like

## GrowthBook implementation

For both Bayesian and frequentist engines, we produce two key statistics:

1. Power - After running the experiment for a given number of weeks and for a hypothesized effect size, what is the probability of a statistically significant result?
2. Minimum Detectable Effect (MDE) - After running an experiment for a given number of weeks, what is the smallest effect that we can detect as statistically significant 80% of the time?

Each engine arrives at these values differently...tech details link... etc...

### Frequentist engine

...clear signposting for **Power** and **MDE**...

### Bayesian engine

...clear signposting for **Power** and **MDE**...

Contributor Author:

Thanks, I agree it's helpful for the key equations to really pop. Changes implemented.
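For symmetry with the Bayesian sketch earlier in this thread (and under the same caveats: illustrative only, hypothetical names), the frequentist power computation under the same normal approximation reduces to two tail probabilities of the shifted test statistic. normCdf is the helper defined in that earlier sketch.

```ts
// Frequentist counterpart to the Bayesian sketch above; reuses normCdf from
// that sketch. Illustrative only, not GrowthBook's implementation.
function frequentistPower(
  effect: number,  // hypothesized true effect
  seHat: number,   // standard error of the estimated effect
  zStar = 1.959964 // z_{1 - alpha/2}; 1.96 for alpha = 0.05, two-tailed
): number {
  const shift = effect / seHat;
  // Probability the test statistic lands outside [-zStar, zStar].
  return 1 - normCdf(zStar - shift) + normCdf(-zStar - shift);
}
```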
