- LLM disagreement affects 67% of 1,000 real-world fact-check claims tested across five frontier models.
- LLM disagreement isn’t just minor quibbling — on 34% of claims, models land on opposite ends of the truth scale.
- Claude Opus 4.7 and Gemini 3 Pro agree only 53% of the time, the lowest pair alignment in the study.
- Even unanimous model verdicts can be wrong — shared blind spots mean consensus isn’t the same as correctness.
LLM Disagreement Is Far Worse Than the Industry Admits
LLM disagreement at scale is not something AI companies tend to put front and centre in their marketing decks. But new research from lenz.io makes it very hard to look away. Across 1,000 real-world fact-check claims, all submitted by users to an active fact-checking platform and none older than February 2026, five of the most capable frontier models available today failed to reach a unanimous verdict on 67% of them. That’s not a rounding error. That’s a structural problem.
The five models tested were Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, Sonar Pro, and one additional frontier system. Each was asked to classify claims using a four-bucket rubric: True, Mostly True, Misleading, or False. The researchers then measured how often the panel converged — and how often it fell apart.
On 672 out of 1,000 claims, at least one model broke from the majority verdict, or no majority formed at all. Krippendorff’s alpha for the panel, a standard inter-rater reliability metric on which 1 means perfect agreement and 0 means agreement no better than chance, came in at 0.639. That’s not random noise, but it’s a long way from the kind of consistency you’d want if you were, say, deploying one of these models as an automated fact-checker for a news organisation or a social platform.
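To make those two numbers concrete, here is a minimal Python sketch of how a disagreement rate and a Krippendorff's alpha could be computed from a panel of verdicts. It is not the researchers' code. The function names and the toy panel are hypothetical, and the alpha shown is the nominal variant, which treats the four buckets as unordered categories. The study does not say which variant it used, and an ordinal variant would give a different figure.

```python
from collections import Counter

# The four-bucket rubric described in the article.
LABELS = ("True", "Mostly True", "Misleading", "False")


def panel_outcome(verdicts):
    """Classify one claim's verdicts as unanimous, majority with dissent, or no majority."""
    top_votes = Counter(verdicts).most_common(1)[0][1]
    if top_votes == len(verdicts):
        return "unanimous"
    if top_votes > len(verdicts) / 2:
        return "majority_with_dissent"
    return "no_majority"


def disagreement_rate(panel):
    """Share of claims where a model broke from the majority or no majority formed."""
    disputed = sum(1 for verdicts in panel if panel_outcome(verdicts) != "unanimous")
    return disputed / len(panel)


def krippendorff_alpha_nominal(panel):
    """Nominal Krippendorff's alpha for a panel with no missing verdicts (assumption)."""
    # Coincidence matrix: every ordered pair of verdicts from two different
    # raters on the same claim contributes 1 / (raters - 1).
    coincidences = Counter()
    for verdicts in panel:
        m = len(verdicts)
        if m < 2:
            continue  # a claim with a single verdict carries no agreement information
        for i, a in enumerate(verdicts):
            for j, b in enumerate(verdicts):
                if i != j:
                    coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and the overall number of pairable verdicts.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label ever used: alpha is undefined
    return 1 - observed / expected


# Hypothetical toy panel: four claims, five verdicts each (not data from the study).
toy_panel = [
    ["True", "True", "True", "True", "True"],
    ["True", "True", "True", "Mostly True", "False"],
    ["True", "True", "Misleading", "Misleading", "False"],
    ["False", "False", "False", "False", "False"],
]
toy_rate = disagreement_rate(toy_panel)
toy_alpha = krippendorff_alpha_nominal(toy_panel)
```

On the toy panel, two of the four claims are disputed, so `toy_rate` is 0.5. Under this definition a claim counts as disputed whenever the panel is not unanimous, which matches the article's description of a model breaking from the majority or no majority forming.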

