












The Bonferroni correction is a statistical method used to reduce the risk of Type I errors (false positives) when you run multiple hypothesis tests. Every time you test a hypothesis, there’s a chance you’ll incorrectly reject a true null hypothesis. When you run many tests, those chances add up. The Bonferroni method reins this in by making the significance criterion stricter for each individual test.
It works by testing each hypothesis at a significance level of α/m, where α is your desired overall error rate (often 0.05) and m is the number of tests.
🎯 Why You Need It
When you run multiple tests, the family-wise error rate (FWER)—the probability of getting at least one false positive—rises quickly: for m independent tests each run at level α, FWER = 1 − (1 − α)^m. Statology emphasizes that even if each test individually has a 5% false-positive rate, running many tests inflates the chance of a false discovery. The Bonferroni correction keeps the overall error rate under control.
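That growth is easy to verify numerically. A minimal sketch, assuming the m tests are independent:

```python
# FWER for m independent tests, each run at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05

for m in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:2d}  ->  FWER = {fwer:.3f}")
```

Already at 10 tests, the chance of at least one false positive is roughly 40%.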
🧠 How It Works (Simple Example)
Suppose you want to maintain an overall α = 0.05 and you’re running 10 tests.
Then each test must meet p < α/m = 0.05/10 = 0.005.
Only results with p < 0.005 are considered statistically significant.
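The same thresholding in Python (the p-values below are made up purely for illustration):

```python
alpha = 0.05
m = 10
threshold = alpha / m  # 0.005

# Hypothetical p-values from 10 tests (illustrative only)
p_values = [0.001, 0.004, 0.012, 0.03, 0.2, 0.45, 0.006, 0.0001, 0.08, 0.5]

significant = [p for p in p_values if p < threshold]
print(f"per-test threshold: {threshold}")
print(f"significant p-values: {significant}")  # [0.001, 0.004, 0.0001]
```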
📌 When to Use It
- When you care about avoiding false positives more than missing true effects.
- When the number of tests is moderate.
- Common in ANOVA post-hoc comparisons, as noted by StatTrek.
🧩 Intuition
Think of it like this:
If you roll a die once, the chance of rolling a 6 is small.
If you roll it 20 times, the chance of at least one 6 is much higher.
The Bonferroni correction says: “If you’re rolling many dice, be stricter about what counts as surprising.”
Examples of the Bonferroni Correction
🧪 Example 1: Simple p‑value adjustment
Suppose you run 5 independent hypothesis tests and want to keep your overall significance level at α = 0.05.
Bonferroni says: α/m = 0.05/5 = 0.01.
So each test must have p < 0.01 to be considered significant.
This matches the general rule described in Wikipedia: each hypothesis is tested at level α/m.
📊 Example 2: ANOVA with multiple pairwise comparisons
You run a one‑way ANOVA with 4 groups.
Number of pairwise comparisons: C(4, 2) = (4 × 3)/2 = 6.
If your overall α = 0.05, then α/m = 0.05/6 ≈ 0.0083.
So each pairwise comparison must meet p < 0.0083.
This aligns with the explanation that multiple tests inflate the family‑wise error rate.
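The comparison count and per-comparison threshold can be computed with the standard library; a quick sketch:

```python
from math import comb

groups = 4
alpha = 0.05

m = comb(groups, 2)    # number of pairwise comparisons: C(4, 2) = 6
threshold = alpha / m  # Bonferroni per-comparison threshold
print(f"{m} comparisons, per-comparison threshold = {threshold:.4f}")
```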
🧬 Example 3: Gene expression study (many tests)
A biologist tests 100 genes to see which ones are differentially expressed.
Desired overall error rate: α = 0.05.
Bonferroni threshold: α/m = 0.05/100 = 0.0005.
Only genes with p < 0.0005 are considered significant.
This illustrates how the method becomes very conservative when m is large, as noted in multiple sources.
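To see how conservative this is in practice, consider a single hypothetical p-value of 0.001, which would pass easily as a standalone test but fails after correcting for 100 tests:

```python
alpha = 0.05
m = 100
threshold = alpha / m  # 0.0005

p = 0.001  # hypothetical gene-level p-value
print(f"significant alone (p < {alpha})? {p < alpha}")                     # True
print(f"significant after Bonferroni (p < {threshold})? {p < threshold}")  # False
```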
🧪 Example 4: A/B testing with multiple variants
A product team tests 4 different button colors against a control.
That’s 4 comparisons.
If they want to keep the overall false‑positive rate at 5%: α/m = 0.05/4 = 0.0125.
So each variant must show p < 0.0125 to be considered a real improvement.
This matches the A/B testing context described by Amplitude.
🧠 Example 5: Adjusting p‑values instead of α
Instead of adjusting α, you can multiply each p‑value by the number of tests, capping the result at 1, and then compare the adjusted p‑values to the original α:
If you have 3 tests with p‑values:
- 0.01
- 0.03
- 0.20
Bonferroni‑adjusted p‑values (p × 3):
- 0.01 × 3 = 0.03
- 0.03 × 3 = 0.09
- 0.20 × 3 = 0.60
Only the first remains below 0.05 → only the first test is significant.
This approach is consistent with the general definition of the correction.
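A minimal sketch of this adjustment in Python (multiplying by m and capping at 1 so adjusted values remain valid probabilities):

```python
m = 3
p_values = [0.01, 0.03, 0.20]

# Bonferroni-adjusted p-values: min(p * m, 1.0)
adjusted = [min(p * m, 1.0) for p in p_values]
print([round(p, 4) for p in adjusted])       # [0.03, 0.09, 0.6]

# Compare adjusted p-values against the original alpha = 0.05
print([p_adj < 0.05 for p_adj in adjusted])  # [True, False, False]
```

The same adjustment is available off the shelf in statsmodels via `multipletests(pvals, method='bonferroni')`.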