The Multiple Comparisons Problem: A Reader's Guide

How running many statistical tests inflates false positives, why so many papers get it wrong, and a practical checklist to spot it when you read research.

If you read enough papers, you will eventually meet a study that tested for everything and found something. A trial reports no effect on its main outcome, then announces a "significant" benefit in a subgroup of left-handed patients over sixty. A nutrition study measures fifteen biomarkers and headlines the two that moved. None of this is necessarily fraud. Often it is something subtler and more common: the multiple comparisons problem, one of the most widespread and least understood sources of false findings in published research.

This guide explains what the problem is, why it produces convincing-looking results out of pure noise, how researchers are supposed to handle it, and — most importantly — how you can spot it for yourself when you read a paper. None of it requires advanced statistics. It requires knowing where to look.

The core idea: testing many things guarantees false alarms

Start with what a significance threshold actually promises. When researchers set significance at p < 0.05, they are accepting a 5% chance of a false positive (declaring an effect real when it is not) on any single test where no real effect exists. Five percent feels small. The trouble is that it applies per test, and modern studies run many tests.

Suppose a study performs 20 independent statistical tests on data where nothing is truly going on. The chance that each individual test correctly shows "no effect" is 95%. But the chance that all 20 come back clean is 0.95 raised to the power of 20, which is about 0.36. In other words, there is a 64% probability that at least one test lights up as "significant" purely by chance. Run 40 tests and the probability of at least one false alarm climbs above 87%.
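The arithmetic above is easy to check for yourself. Here is a minimal Python sketch, assuming the tests are independent and that no real effect exists anywhere; the function name `family_wise_error_rate` is made up for this illustration.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests,
    assuming every null hypothesis is true (an assumption of this sketch)."""
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 5, 20, 40):
    rate = family_wise_error_rate(m)
    print(f"{m:>3} tests: {rate:.1%} chance of at least one false alarm")
```

Real tests within a study are often correlated, which changes the exact figure, so treat these numbers as a guide to the scale of the problem rather than a precise rate.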

This is the engine of the problem. The more comparisons you make, the more nearly certain it becomes that something, somewhere, will cross the significance threshold for no reason at all. A researcher who then reports only that finding has, knowingly or not, manufactured a result from noise.

A worked example you can feel

Imagine you are handed data on a useless supplement that does absolutely nothing. You want to find a benefit anyway, so you measure outcomes across many slices: men, women, under-40s, over-40s, smokers, non-smokers, three different symptom scores, measured at week 2, week 4, and week 8.

Multiply those out and you are quickly running dozens of comparisons. With 40 or 50 tests on a substance with zero real effect, finding two or three "significant" results is not surprising — it is expected. You could write an enthusiastic abstract about the supplement's benefit for non-smoking women under 40 at week 4, complete with a p-value under 0.05, and every individual number would be technically correct. The conclusion would still be false. The effect exists only because you looked in enough places.
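To see this happen, the sketch below simulates the useless supplement. It crosses the six subgroups with three symptom scores and three time points (54 tests) and compares two arms drawn from the same distribution, so every "significant" result is a false alarm by construction. The slice labels, the sample size of 50 per arm, and the simple z-test are assumptions chosen for illustration; each slice is also simulated as a fresh independent sample, whereas real subgroups overlap.

```python
import itertools
import random
from statistics import NormalDist, mean

# Slices from the example above; the score labels are placeholders.
SUBGROUPS = ["men", "women", "under-40s", "over-40s", "smokers", "non-smokers"]
SCORES = ["symptom score A", "symptom score B", "symptom score C"]
WEEKS = [2, 4, 8]


def null_p_value(rng: random.Random, n_per_arm: int = 50) -> float:
    """Two-sided z-test p-value for two arms drawn from the SAME
    standard normal distribution, so any difference is pure chance."""
    treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    z = (mean(treated) - mean(placebo)) / (2.0 / n_per_arm) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_false_positives(rng: random.Random, alpha: float = 0.05) -> int:
    """Run one test per (subgroup, score, week) slice and count the hits."""
    slices = itertools.product(SUBGROUPS, SCORES, WEEKS)
    return sum(null_p_value(rng) < alpha for _ in slices)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
num_slices = len(SUBGROUPS) * len(SCORES) * len(WEEKS)
hits = [count_false_positives(rng) for _ in range(200)]  # 200 simulated studies

print(f"{num_slices} tests per simulated study")
print(f"average 'significant' results per study: {mean(hits):.2f}")
print(f"studies with at least one: {sum(h > 0 for h in hits) / len(hits):.0%}")
```

Overlapping subgroups would make the tests correlated, but the expected number of false positives per study stays at 5% of the tests either way.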

This is why a single isolated significant result, buried inside a study that measured many things, carries far less weight than it appears to. The significance was almost guaranteed by the volume of testing, not earned by the strength of the effect.

Where the problem hides in real papers

The textbook version involves obvious lists of tests. In practice, multiple comparisons sneak in through channels that look perfectly innocent:

Secondary and exploratory outcomes. A trial has one pre-specified primary outcome and then a dozen secondary ones. When the primary fails, attention drifts to whichever secondary outcome reached significance. Each secondary test added another roll of the dice.

Subgroup analyses. Splitting the sample into subgroups (by age, sex, severity, genotype) multiplies the number of comparisons fast. Subgroup findings are among the most common homes for noise dressed up as discovery, especially when the overall result was null.

Multiple time points. Measuring the same outcome at weeks 2, 4, 8, and 12 is four tests, not one. A benefit that appears only at week 4 and vanishes by week 8 deserves deep suspicion.

Multiple instruments or scales. Depression alone can be measured with the BDI, HDRS, MADRS, and others. Run all of them and report the one that moved, and you have quietly performed multiple comparisons.

The garden of forking paths. Even without running tests in parallel, researchers make many small analytic choices — which covariates to include, how to define a responder, which outliers to exclude. Each defensible choice is a hidden fork. Across all the paths a researcher could have taken, some lead to significance by chance. This is the harder, more insidious cousin of multiple comparisons, and it cannot be fixed after the fact by a correction.

How researchers are supposed to handle it

The standard defense is a correction that makes the significance threshold stricter as the number of tests grows. There are several, and the choice matters.

Bonferroni correction is the simplest and most conservative. Divide your significance threshold by the number of tests. Running 10 tests? A result now needs p < 0.005 instead of p < 0.05 to count. Bonferroni controls the family-wise error rate — the probability of even one false positive across the whole set. It is easy to apply and easy to check, but it can be overly strict when tests are numerous or correlated, increasing the risk of missing real effects.
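As a rough illustration of the rule just described, the sketch below applies a Bonferroni threshold to ten p-values. The p-values are invented for the example and the function name `bonferroni_significant` is hypothetical.

```python
def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag each p-value that clears the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Invented p-values standing in for a study that ran 10 tests.
p_values = [0.001, 0.004, 0.012, 0.03, 0.04, 0.2, 0.35, 0.5, 0.7, 0.9]
flags = bonferroni_significant(p_values)

print(f"per-test threshold: {0.05 / len(p_values):.3f}")
print(f"uncorrected hits (p < 0.05): {sum(p < 0.05 for p in p_values)}")
print(f"hits after Bonferroni: {sum(flags)}")
```

With these made-up numbers, five results look significant before correction and two survive it, which is the kind of gap worth looking for when a paper reports uncorrected p-values.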

Holm's correction (the Holm–Bonferroni method) achieves the same family-wise control but in a stepwise way that is uniformly more powerful than plain Bonferroni. If a paper used Holm, that is a sign of statistical care.
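A minimal sketch of the stepwise idea, under the usual textbook formulation: sort the p-values, compare the smallest against alpha / m, the next against alpha / (m - 1), and so on, stopping at the first one that fails. The function name and the p-values are illustrative assumptions.

```python
def holm_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down procedure: the k-th smallest p-value (k = 0, 1, ...)
    is compared against alpha / (m - k); stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break  # this p-value and every larger one are not rejected
        rejected[idx] = True
    return rejected


# Invented p-values for a study with 5 tests.
p_values = [0.001, 0.006, 0.012, 0.2, 0.6]
bonferroni_hits = sum(p <= 0.05 / len(p_values) for p in p_values)
holm_hits = sum(holm_significant(p_values))

print(f"Bonferroni keeps {bonferroni_hits}, Holm keeps {holm_hits}")
```

In this example Holm keeps the result at p = 0.012 that plain Bonferroni discards, because by the third step the threshold has relaxed to 0.05 / 3.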

False discovery rate (FDR) methods, most famously Benjamini–Hochberg, take a different stance. Instead of trying to prevent any false positive, they control the expected proportion of declared discoveries that are false: for example, allowing about 5% of your "significant" findings to be wrong on average. FDR is the appropriate tool for genuinely exploratory, high-dimensional work like genomics, where you might test thousands of genes and expect to follow up the hits. It is more permissive than Bonferroni by design, which is correct in that setting and would be a loophole in a confirmatory trial.
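For comparison, here is a sketch of the Benjamini–Hochberg procedure in its standard step-up form: rank the p-values, find the largest rank k whose p-value is at most (k / m) times the target rate q, and declare the k smallest p-values discoveries. The function name and the example p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up: find the largest 1-based rank k with
    p_(k) <= (k / m) * q and mark the k smallest p-values as discoveries."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep going: a later rank may still pass
    discoveries = [False] * m
    for idx in order[:cutoff_rank]:
        discoveries[idx] = True
    return discoveries


# Invented p-values for an exploratory screen with 10 tests.
p_values = [0.001, 0.008, 0.017, 0.019, 0.024, 0.2, 0.45, 0.6, 0.8, 0.95]
bh_hits = sum(benjamini_hochberg(p_values))
bonferroni_hits = sum(p <= 0.05 / len(p_values) for p in p_values)

print(f"Bonferroni keeps {bonferroni_hits}, Benjamini-Hochberg keeps {bh_hits}")
```

Here Bonferroni keeps one result and Benjamini–Hochberg keeps five. That permissiveness is the point in a screen whose hits will be followed up, and it is why the same method would be too lenient for a confirmatory claim.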

The key insight for a reader: the right correction depends on the study's purpose. A confirmatory clinical trial should protect the family-wise error rate. A discovery-stage screen should control the false discovery rate. A correction that is too lax for a confirmatory claim, or absent entirely, is a red flag.

When the absence of a correction is actually fine

The exceptions matter, because it is easy to overcorrect in your suspicion and dismiss good studies. Not every paper needs a Bonferroni adjustment, and demanding one everywhere is its own kind of error.

A correction is genuinely unnecessary when:

There is a single, pre-specified primary outcome. A well-designed trial declares in advance — ideally in a public registration — exactly one primary endpoint and tests it once. That single test needs no correction. The discipline of committing beforehand is what protects it.

The study uses a hierarchical (gatekeeping) testing strategy. Some trials pre-specify an ordered sequence: test outcome A first, and only proceed to test B if A is significant. This structure controls error without a blanket penalty, and it is a marker of sophistication.

Secondary outcomes are honestly framed as exploratory and hypothesis-generating. There is nothing wrong with reporting an interesting secondary signal — as long as the paper labels it as exploratory, does not headline it as a confirmed finding, and calls for replication. The failure is not the analysis; it is the overstatement.
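The hierarchical strategy in the list above can be sketched in its simplest form, a fixed-sequence test: outcomes are tested in their pre-specified order at the full threshold, and the first failure closes the gate on everything after it. Real gatekeeping designs can be more elaborate; the function name, outcome labels, and p-values below are hypothetical.

```python
def fixed_sequence_test(
    ordered_results: list[tuple[str, float]], alpha: float = 0.05
) -> list[str]:
    """Test outcomes in their pre-specified order at the full alpha.
    Stop at the first non-significant outcome and claim nothing after it."""
    confirmed = []
    for name, p_value in ordered_results:
        if p_value >= alpha:
            break  # the gate closes; later outcomes cannot be claimed
        confirmed.append(name)
    return confirmed


# Hypothetical pre-registered order with invented p-values.
plan = [("outcome A", 0.01), ("outcome B", 0.20), ("outcome C", 0.003)]
print(fixed_sequence_test(plan))
```

Outcome C has the smallest p-value in this example, yet it cannot be claimed as confirmed because outcome B failed before it. The order was fixed in advance, and that commitment is what lets each test use the full threshold.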

The line, then, is not "did they correct or not." It is whether the strength of the claim matches the strength of the evidence, given how many ways the data were examined.

A practical checklist for reading any paper

You do not need to recompute anything. You need to ask a short sequence of questions while you read.

1. Was the primary outcome pre-specified, and what happened to it? Look for a registration number (ClinicalTrials.gov, a registry ID) and check whether the outcome the paper celebrates is the one it originally promised to test. A switch from a failed primary to a triumphant secondary is the classic move.

2. How many things were tested? Count, roughly. Tally outcomes, subgroups, time points, and scales, remembering that they multiply when every combination is tested. If the number is large and the paper highlights one or two isolated hits, raise your guard.

3. Is there any correction, and is it the right kind? Search the methods for "Bonferroni," "Holm," "false discovery rate," "Benjamini," or "adjusted for multiple comparisons." Then ask whether the correction fits the study type — strict family-wise control for confirmatory claims, FDR for exploratory screens.

4. Are subgroup or secondary findings labeled honestly? A trustworthy paper calls an unplanned subgroup result exploratory and asks for replication. A weaker one writes it into the abstract as if it were the point of the study.

5. Does the conclusion overreach the design? This is the summary question. A single significant result among forty tests, with no correction and no pre-registration, supporting a confident causal headline, is the signature of the multiple comparisons problem in action.

If a paper passes these five checks, its positive findings are far more believable. If it fails several, treat the conclusions as a hypothesis, not a result — no matter how impressive the p-value looks in isolation.
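Nothing in the checklist requires computation, but if you want a rough number to go with question 2, the sketch below estimates how likely a given count of "significant" results would be if nothing real were going on. It assumes independent tests that are all truly null, which real studies only approximate; the function name is made up for this example.

```python
from math import comb


def prob_at_least_k_hits(num_tests: int, k: int, alpha: float = 0.05) -> float:
    """P(at least k 'significant' results) when every test is independent
    and no real effect exists (binomial tail probability)."""
    p_fewer = sum(
        comb(num_tests, i) * alpha**i * (1 - alpha) ** (num_tests - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


for k in (1, 2, 3):
    prob = prob_at_least_k_hits(40, k)
    print(f"40 null tests, at least {k} hit(s): {prob:.0%}")
```

If the paper's count of uncorrected hits is about what chance alone would produce, those hits are weak evidence on their own.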

Why this is so persistent

Two forces keep the multiple comparisons problem alive. The first is incentive: journals and careers reward positive, novel findings, and testing many things is the path of least resistance to a publishable result. The second is genuine difficulty. The garden of forking paths means that even careful, honest researchers can drift into inflated false-positive rates without ever running an obviously suspicious analysis. Pre-registration — committing to the primary outcome and analysis plan before seeing the data — is the strongest structural defense, which is why its presence or absence is worth checking first.

For you as a reader, the takeaway is liberating rather than cynical. You are not expected to trust or distrust a paper wholesale. You are expected to weight its claims by how many chances it gave itself to find something. A result that survived a strict, pre-specified, single test deserves real confidence. A result fished out of a sea of comparisons deserves a raised eyebrow and a wait-for-replication. Knowing the difference is most of what separates a careful reader from a credulous one.

