June 1, 2026

5 Statistical Flaws AI Peer Review Catches First

Most reviewers miss Bonferroni corrections, interaction effects, and undisclosed COIs. Here are the five patterns AI flags before you finish reading the abstract.

Every week, thousands of papers pass peer review carrying statistical errors that will quietly mislead clinicians and researchers for years. The problem is not that reviewers are careless — it is that the human review pipeline is under-resourced and overloaded. A single statistical detail buried in the methods section can invalidate the conclusion, but no one has time to re-run every paper's analysis from scratch.

This article covers the five patterns that consistently slip through and that AI peer review surfaces automatically.

1. Missing Multiple-Comparison Corrections

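When a study tests six secondary outcomes and reports only the two that reached p < 0.05, chance alone is a live explanation. With six independent tests at α = 0.05 and every null hypothesis true, the expected number of false positives is 0.30, and the probability of at least one is about 26%. Without a Bonferroni correction (or Benjamini-Hochberg for exploratory analyses), an individual p-value just under 0.05 cannot be taken at face value.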

What to look for: Count the number of statistical tests across the full methods section — primary, secondary, subgroup, and sensitivity analyses combined. If the number is ≥ 4 and no correction is mentioned, that is a red flag.

Typical AI finding:

"Study reports 6 secondary outcomes. No multiple-comparison adjustment stated. At α = 0.05 with 6 tests, expected false positives under H₀ = 0.30. Results for secondary endpoints should be interpreted as exploratory."

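The arithmetic behind a finding like this can be sketched in a few lines of Python. The helper names and the six p-values are illustrative assumptions, not taken from any real paper, and the error-rate figures assume independent tests with every null hypothesis true.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for six secondary outcomes; two fall below 0.05.
p_values = [0.012, 0.041, 0.18, 0.33, 0.52, 0.74]

print(round(expected_false_positives(6), 2))   # 0.3
print(round(familywise_error_rate(6), 3))      # 0.265
print([round(p, 3) for p in bonferroni(p_values)])
print([round(p, 3) for p in benjamini_hochberg(p_values)])
```

With these made-up values, the smallest adjusted p-value is 0.072 under both methods, so neither nominally significant outcome survives correction.
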
2. No Session × Group Interaction Test in Repeated-Measures RCTs

In a parallel-group RCT that measures outcomes at multiple timepoints, the correct test is the Session × Group interaction (i.e., does the treatment group change differently over time than the control group?). Many papers report only within-group changes (p < 0.05 in the treatment arm, p = 0.30 in the placebo arm) without ever testing whether the change over time differs between groups. A significant change in one arm and a non-significant change in the other does not show that the two arms differ from each other.

This is one of the most common errors in nutrition and psychiatry RCTs and one of the most reliably missed by reviewers who focus on the abstract's conclusions.

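What to look for: In the statistical analysis section, check whether the study specifies a mixed-model repeated measures (MMRM) analysis or a two-way ANOVA with an interaction term. If neither is reported, and there is no equivalent between-group comparison such as ANCOVA on the follow-up score adjusted for baseline, the claimed treatment effect is not supported by the analysis as reported.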
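
For the simplest design, with only a baseline and one follow-up, the Session × Group interaction amounts to comparing change scores between the two arms. The sketch below uses only the Python standard library. The function names and the twelve participants' scores are invented for illustration, and it reports t statistics rather than p-values (a statistics package would supply those). Designs with more timepoints would need a mixed model instead.

```python
from math import sqrt
from statistics import mean, variance


def paired_t(pre, post):
    """Paired t statistic for the within-group change (post minus pre)."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / sqrt(variance(diffs) / len(diffs))


def change_score_welch_t(pre_treat, post_treat, pre_ctrl, post_ctrl):
    """Welch t statistic and degrees of freedom for the between-group
    difference in change scores (post minus pre)."""
    d_treat = [b - a for a, b in zip(pre_treat, post_treat)]
    d_ctrl = [b - a for a, b in zip(pre_ctrl, post_ctrl)]
    n1, n2 = len(d_treat), len(d_ctrl)
    se1, se2 = variance(d_treat) / n1, variance(d_ctrl) / n2
    t = (mean(d_treat) - mean(d_ctrl)) / sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical scores, six participants per arm.
pre_treat, post_treat = [10, 12, 11, 13, 9, 14], [12, 13, 14, 14, 12, 15]
pre_ctrl, post_ctrl = [11, 10, 13, 12, 9, 14], [12, 9, 15, 12, 11, 13]

print(round(paired_t(pre_treat, post_treat), 2))   # 4.57 (within treatment arm)
print(round(paired_t(pre_ctrl, post_ctrl), 2))     # 0.89 (within control arm)
t, df = change_score_welch_t(pre_treat, post_treat, pre_ctrl, post_ctrl)
print(round(t, 2), round(df, 1))                   # 1.93 9.0 (between arms)
```

In this toy data the treatment arm changes "significantly" (the two-sided 5% critical value for 5 degrees of freedom is about 2.57) and the control arm does not. The between-group comparison still falls short of its critical value of roughly 2.26, which is exactly the pattern that within-group reporting hides.
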

3. Undisclosed Conflicts of Interest in Abstracts

Disclosure statements are buried in the paper's back matter. An author employed by a supplement manufacturer who runs a trial on that supplement is not always visible from the abstract — yet that single data point changes how the entire paper should be weighted.

What to look for: Cross-reference the "Funding" and "Declarations" sections with author affiliations. Meta-research comparing industry-sponsored and independently funded trials has repeatedly found that sponsored trials report more favorable results and conclusions, so the funding source is relevant context for any reported effect size.

AI advantage: An automated review reads the full paper and flags affiliation/funding disclosures alongside the statistical results, making the conflict of interest visible before you accept the conclusion.
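
As a deliberately naive sketch of that cross-reference, the snippet below checks whether any author's affiliation is named in the funding or declarations text. The function names, author names, and the company "Acme Nutrition Ltd." are all hypothetical. A real tool would need proper entity matching (subsidiaries, abbreviations, parent companies), which this does not attempt.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so names compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def flag_possible_coi(author_affiliations, disclosure_text):
    """Return (author, affiliation) pairs whose affiliation is named in the
    funding or declarations text."""
    haystack = f" {normalize(disclosure_text)} "
    flags = []
    for author, affiliation in author_affiliations.items():
        if f" {normalize(affiliation)} " in haystack:
            flags.append((author, affiliation))
    return flags


# Entirely hypothetical paper metadata.
affiliations = {
    "A. Author": "Example University",
    "B. Author": "Acme Nutrition Ltd.",
}
disclosures = (
    "Funding: This trial was funded by Acme Nutrition Ltd, "
    "which supplied the study product."
)

print(flag_possible_coi(affiliations, disclosures))
# [('B. Author', 'Acme Nutrition Ltd.')]
```
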

4. Underpowered Designs Reported as Negative

A study with 30 participants per arm testing a clinical intervention has just under 50% power to detect a medium effect size (Cohen's d = 0.5) with a two-sided test at α = 0.05. When such a study reports "no significant difference," that finding is nearly uninformative: the confidence interval likely spans clinically meaningful effects on both sides.

What to look for: In papers reporting null results, ask:

Was a power calculation pre-specified?
What was the minimum detectable effect size given the actual sample size?
Is the null result "no effect" or "insufficient evidence to detect an effect"?

These are different claims, and conflating them is one of the field's most persistent problems.
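
The second question can be answered from the sample size alone. The sketch below uses a normal approximation to the two-sample t-test with equal arm sizes, so it runs slightly optimistic for small samples. The function names are illustrative, and 80% is assumed as the conventional power target.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()


def power_two_sample(d, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means
    (normal approximation to the t-test, equal arm sizes)."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_arm / 2)
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)


def minimum_detectable_effect(n_per_arm, power=0.80, alpha=0.05):
    """Smallest standardized effect (Cohen's d) detectable with the given power."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return (z_crit + z_power) * sqrt(2 / n_per_arm)


print(round(power_two_sample(0.5, 30), 2))        # 0.49
print(round(minimum_detectable_effect(30), 2))    # 0.72
```

Under these assumptions, a trial with 30 participants per arm can reliably detect only effects of about d = 0.72 or larger, so a null result says little about a true effect of d = 0.5.
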

5. Composite Endpoints That Obscure Signal

Composite endpoints (e.g., "MACE = cardiovascular death + MI + stroke") combine outcomes of vastly different clinical severity. A drug that reduces non-fatal MI by 20% but has no effect on mortality can still show a statistically significant composite benefit, because the more frequent non-fatal events dominate the composite. Authors sometimes present this as evidence of broad efficacy without explicitly noting that the mortality component was null.

What to look for: When a study uses a composite primary endpoint, find the component-level breakdown. If only the composite is significant, the study has not demonstrated efficacy on any individual clinically meaningful outcome.
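
A component-level breakdown can be sketched as follows. The event counts are invented, with 5,000 patients per arm, and the sketch assumes each patient is counted once under their first event so that the components sum to the composite. The helper name is illustrative. The interval is a standard Wald interval on the log risk ratio.

```python
from math import exp, log, sqrt


def risk_ratio_ci(events_treat, n_treat, events_ctrl, n_ctrl, z=1.96):
    """Risk ratio with a Wald confidence interval on the log scale."""
    rr = (events_treat / n_treat) / (events_ctrl / n_ctrl)
    se = sqrt(1 / events_treat - 1 / n_treat + 1 / events_ctrl - 1 / n_ctrl)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


N = 5000  # hypothetical patients per arm

# Hypothetical (treatment events, control events) for each outcome.
outcomes = {
    "non-fatal MI": (320, 400),
    "cardiovascular death": (150, 150),
    "stroke": (100, 100),
    "composite (MACE)": (570, 650),
}

for name, (treat, ctrl) in outcomes.items():
    rr, low, high = risk_ratio_ci(treat, N, ctrl, N)
    verdict = "excludes 1" if high < 1 or low > 1 else "includes 1"
    print(f"{name}: RR {rr:.2f} (95% CI {low:.2f} to {high:.2f}), {verdict}")
```

In this made-up trial the composite interval excludes 1, but only because of the non-fatal MI component. The risk ratios for cardiovascular death and stroke are exactly 1.
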

Why These Patterns Repeat

All five errors share a common structure: the problematic detail is in the methods or the back matter, not in the abstract or the discussion. Traditional peer review relies on reviewers who have limited time and who often read abstracts first. The statistical appendix gets less scrutiny.

Automated review reads every section with the same attention. It cannot replace domain expertise — a statistician or clinician brings contextual judgment that no model replicates — but it can ensure that the technical checklist gets completed on every paper, every time.

If you found this useful, SinaPilot's Review feature runs this exact checklist on any uploaded paper and surfaces the findings in a structured report you can share with your team.
