LLM Benchmark Bias: One Model Swung 47 Points Based on the Judge

May 20, 2026

LLM benchmark bias can swing a single model’s score by 47 points depending solely on which AI judge grades it.
LLM benchmark bias is worst for qualitative tasks — binary, verifiable criteria produce stable scores across all judges tested.
Claude’s Opus model scored its own outputs 4.6 points above the average from the two rival judges, a pattern consistent with self-preference.
Running multiple judges and averaging results is the most practical fix teams can apply right now.

The Benchmark You Trust Might Be Grading on a Curve

LLM benchmark bias isn’t a theoretical problem. It’s already in your eval pipeline, silently inflating or deflating scores depending on which model happens to be holding the red pen. That’s the uncomfortable finding from a new round of testing by the team at Tessl, who re-ran a six-model, eleven-skill agent benchmark three separate times — using Claude Sonnet, GPT-5.5, and Claude Opus-4-7 as independent judges — and found that scores shifted, rankings moved, and one model swung a full 47 percentage points on a single skill just because the grader changed.

The work started internally. Maria from Tessl’s AI Research team flagged a concern after seeing the original numbers: an LLM judge is likely to favour outputs from its own model family. It’s an obvious hypothesis in retrospect, but one that almost nobody is stress-testing before publishing eval results. Tessl stress-tested it, and the hypothesis held up. LLM benchmark bias, it turns out, is hiding in plain sight across the industry.

What the Numbers Actually Show

Across all six models and eleven skills, Sonnet was the most generous judge. GPT-5.5 was the strictest, sitting an average of 6.9 points below Sonnet across every model and skill combination tested. That 6.9-point gap might sound modest until you consider that it’s larger than the margin separating several models in the averaged leaderboard.

If your team is running evals with Sonnet as the default judge — which is a common setup — your scores are probably 5 to 7 points higher than a stricter grader would give you. That’s not a rounding error. That’s a different story about your model’s capability. LLM benchmark bias of this magnitude is enough to completely reshape how teams interpret their own results.
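One rough way to check how generous a default judge is: grade the same outputs with a second judge and take the mean paired difference. The sketch below assumes per-sample scores from two judges are already available. The function name and all numbers are illustrative placeholders, not Tessl’s data.

```python
from statistics import mean

def judge_offset(scores_a, scores_b):
    """Mean paired difference (judge A minus judge B) over the same outputs."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    return mean(a - b for a, b in zip(scores_a, scores_b))

# Hypothetical scores for the same five outputs, graded by two judges.
lenient = [90.0, 85.0, 78.0, 92.0, 70.0]
strict = [84.0, 77.0, 72.0, 85.0, 62.0]

offset = judge_offset(lenient, strict)
print(f"lenient judge scores {offset:.1f} points higher on average")
```

A positive offset means the first judge is the more generous of the two. It says nothing about which judge is closer to the truth.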

The per-judge leaderboards make the problem visceral. GPT-5.3 sits in third place under Sonnet and drops to fifth under both GPT-5.5 and Opus. GPT-5.5 the model sits fifth under Sonnet and climbs to second under the other two judges. A Sonnet-only leaderboard paints a flattering portrait of GPT-5.3 and an unfairly dim one of GPT-5.5. Those aren’t small aesthetic differences. They’re the kind of discrepancies that drive real product decisions.
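Rank movement of this kind is easy to surface once scores are grouped by judge. The sketch below builds one leaderboard per judge and reports each model’s best and worst rank. The judge names, model names and scores are hypothetical stand-ins, since the article does not publish the full score table.

```python
def leaderboard(scores):
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def rank_ranges(per_judge):
    """Map each model to its (best, worst) rank across all judges."""
    ranks = {}
    for scores in per_judge.values():
        for position, model in enumerate(leaderboard(scores), start=1):
            ranks.setdefault(model, []).append(position)
    return {model: (min(r), max(r)) for model, r in ranks.items()}

# Hypothetical data: judge -> model -> mean score.
per_judge = {
    "judge_a": {"model_x": 95.0, "model_y": 88.0, "model_z": 86.0},
    "judge_b": {"model_x": 93.0, "model_y": 80.0, "model_z": 84.0},
}

ranges = rank_ranges(per_judge)
for model, (best, worst) in ranges.items():
    status = "stable" if best == worst else "unstable"
    print(f"{model}: rank {best} to {worst} ({status})")
```

Models whose best and worst ranks differ are the ones whose position depends on the grader.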

LLM Benchmark Bias and the Self-Preference Problem

The self-preference finding is where LLM benchmark bias gets genuinely uncomfortable. Opus-4-7 graded its own outputs at 96.5. Sonnet graded those same outputs at 94.5. GPT-5.5 graded them at 89.2. The 7.3-point gap between Opus-as-judge and GPT-5.5-as-judge is entirely a grading artefact, because the model outputs were identical. Measured against the average of the other two judges’ scores (roughly 91.9), Opus gave itself a 4.6-point boost.
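The arithmetic behind those figures can be reproduced from the three scores quoted above. The sketch below treats self-preference as the self-awarded score minus the mean of the other judges’ scores. That definition and the helper name are assumptions made for illustration, not necessarily how Tessl computed it.

```python
from statistics import mean

# Scores for Opus-4-7's outputs as quoted in the article, keyed by judge.
scores_for_opus = {"opus": 96.5, "sonnet": 94.5, "gpt-5.5": 89.2}

def self_preference(scores_by_judge, self_judge):
    """Self-awarded score minus the mean score from all other judges."""
    others = [s for judge, s in scores_by_judge.items() if judge != self_judge]
    return scores_by_judge[self_judge] - mean(others)

boost = self_preference(scores_for_opus, "opus")
spread = max(scores_for_opus.values()) - min(scores_for_opus.values())

print(f"self-preference boost: {boost:.2f} points")
print(f"gap between most and least generous judge: {spread:.1f} points")
```

From the rounded inputs the boost comes out at about 4.65, in line with the 4.6 reported, and the spread matches the 7.3-point gap.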

That said, the bias isn’t symmetric and it isn’t a clean story about every model favouring itself. GPT-5.5 actually scores its own outputs lower than the other two judges do. Opus gives GPT-5.5 its highest score at 92.3. So self-preference exists, but it varies by model and judge pairing in ways that aren’t fully predictable.

The practical implication for Claude users is specific and quantifiable: if you’re using Claude models to grade Claude outputs, build in an expectation of a 4 to 5 point systematic upward bias. It won’t show up in your data as a flag. It’ll just look like good scores. This dimension of LLM benchmark bias is particularly dangerous because it’s invisible without a multi-judge comparison.

Why Qualitative Tasks Are Especially Vulnerable

Not all skills suffer equally from LLM benchmark bias. Tessl’s data draws a clear line between tasks with binary, verifiable outcomes and tasks requiring qualitative judgment. Did the agent delete the file? Either it did or it didn’t. Scores on those kinds of criteria stayed stable across all three judges. But ask a judge to assess whether an agent’s output reflects genuine compliance versus a close approximation, and suddenly you’re measuring the judge’s interpretive preferences as much as the model’s actual performance.

The lift scores, meaning the gap between a model’s baseline performance and its with-skill performance, illustrate this sharply. The Opus judge gave GPT-5.3 a skill lift of 22.9 points. Sonnet and GPT-5.5 gave it 16 points for the same runs against the same rubric. The rubric didn’t change. The model outputs didn’t change. What changed was whether the judge decided GPT-5.3’s output cleared the bar or merely approached it. That’s a disagreement of nearly 7 points about compliance that no single judge can resolve alone. When LLM benchmark bias is concentrated in qualitative rubric items like these, it becomes almost impossible to detect without deliberate multi-judge testing.
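As a small worked example, lift is the with-skill score minus the baseline, and judge disagreement is the gap between the largest and smallest lift. The per-judge lifts below are the figures quoted in the article. The baseline and with-skill scores passed to `skill_lift` are hypothetical, because the article does not give them.

```python
def skill_lift(baseline, with_skill):
    """Lift is the with-skill score minus the baseline score, in points."""
    return with_skill - baseline

# Hypothetical scores, only to show how a lift of 16 points could arise.
example_lift = skill_lift(baseline=60.0, with_skill=76.0)

# Lift for GPT-5.3 as quoted per judge in the article.
lift_by_judge = {"opus": 22.9, "sonnet": 16.0, "gpt-5.5": 16.0}
disagreement = max(lift_by_judge.values()) - min(lift_by_judge.values())

print(f"example lift: {example_lift:.1f} points")
print(f"largest gap between judges: {disagreement:.1f} points")
```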

This is a known challenge in the broader field of LLM-as-judge evaluation research, where positional bias, verbosity bias, and self-enhancement bias have all been documented in the academic literature. Tessl’s contribution is making the problem concrete and measurable with real agent tasks rather than abstract benchmarks.

The One Stable Finding — and What It Tells Us

Opus-4-7 held first place under every judge tested. That’s the one clean, unambiguous result in the whole dataset. It’s also a useful data point in itself: when a model is genuinely far ahead of the field, LLM benchmark bias can’t hide it. The instability lives in the middle of the pack, where margins are tight and the judge’s interpretive tendencies start to dominate.

GPT-5.3 is the case study in this regard. It showed the largest average skill lift at 18.4 points, which sounds impressive until you notice it also had the weakest baseline of any non-codex model by nearly ten points. It benefits most from skill context precisely because it starts furthest behind without it. That’s a very different profile from a model with a strong baseline that also improves with skill context, but a single-judge leaderboard can make them look equivalent or even reverse their relative positions.

How to Fix Your Eval Pipeline

Tessl’s recommended approach is straightforward: run multiple judges and average the results. If you know which model you’ll primarily use in development, favour that model’s family as one of your judges — the bias will at least be consistent with your production environment. And wherever the task allows it, design rubrics around binary, verifiable criteria. Yes or no. File exists or doesn’t. Flag enabled or not. The moment you ask a judge to make a qualitative call, you’re introducing variance that only multiple judges can contain. Treating LLM benchmark bias as a first-class engineering concern — not an afterthought — is the mindset shift that matters most.
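A minimal sketch of both recommendations: average each item’s score across judges while flagging wide disagreement, and score binary rubrics as the share of checks that passed. The function names, the 5-point review threshold and the check names are illustrative assumptions, not part of Tessl’s tooling.

```python
from statistics import mean

def aggregate(scores_by_judge, max_spread=5.0):
    """Average one item's scores across judges and flag wide disagreement.

    max_spread is an arbitrary illustrative threshold, in points.
    """
    values = list(scores_by_judge.values())
    spread = max(values) - min(values)
    return {
        "mean": mean(values),
        "spread": spread,
        "needs_review": spread > max_spread,
    }

def binary_score(checks):
    """Score a rubric of yes/no checks as the percentage that passed."""
    return 100.0 * sum(checks.values()) / len(checks)

# Scores for Opus-4-7's outputs as quoted in the article, keyed by judge.
result = aggregate({"opus": 96.5, "sonnet": 94.5, "gpt-5.5": 89.2})
print(f"mean {result['mean']:.1f}, spread {result['spread']:.1f}, "
      f"needs review: {result['needs_review']}")

# Hypothetical binary rubric: each check is verifiable without a judge's opinion.
rubric = {"file_deleted": True, "flag_enabled": True, "config_exists": False}
rubric_score = binary_score(rubric)
print(f"binary rubric score: {rubric_score:.1f}")
```

Reporting the spread alongside the mean keeps judge disagreement visible instead of averaging it away.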

For teams using Tessl’s tooling, the platform already supports multi-judge runs through the --scorer-agent flag, letting you specify different grading models independently of the model being evaluated. The averaged scores in Tessl’s main benchmark already reflect this three-judge approach; the original post predates the averaging.

The wider lesson reaches beyond any single platform. The AI industry has developed a near-religious faith in benchmark numbers, and that faith is largely misplaced when those numbers come from a single LLM judge with undisclosed model-family preferences. Leaderboards built this way aren’t measuring capability. They’re measuring the overlap between a model’s output style and its judge’s aesthetic preferences. Until multi-judge averaging becomes standard practice — ideally with disclosed judge identities and rubric transparency — the eval numbers circulating across research blogs, product pages, and investor decks deserve far more scepticism. LLM benchmark bias isn’t coming for your pipeline someday. It’s already there.

Source: https://dev.to/tessl-io/your-benchmarks-are-lying-to-you-and-your-judge-is-to-blame-2k20
