- AI code security benchmarks show Claude and Gemini miss the same critical hardening steps across all four tested domains.
- 63% of AI-generated functions shipped at least one vulnerability when scored against ESLint security plugins mapped to CWEs.
- Both models avoided classic SQL-style injection traps but consistently failed to strip sensitive fields like password hashes from query results.
- The real finding isn’t which model wins — it’s that your code review process almost certainly isn’t catching what the AI missed.
The Number That Should Make You Check Your Own Repo
AI code security is routinely framed as a horse race — Claude versus Gemini, OpenAI versus Anthropic, benchmark after benchmark ranking models on a leaderboard nobody quite trusts. A recent deep-dive by developer Ofri Peretz, published on Dev.to, cuts through that noise with a finding that’s more uncomfortable than any ranking: across 700 AI-generated functions scored against purpose-built ESLint security plugins, 63% shipped with at least one vulnerability. The winner of the head-to-head? Barely relevant.
The actual scoreboard across four AI code security domains — NestJS service generation, JWT middleware, MongoDB search, and a general API implementation — came out to one Gemini win, two ties, and one split. A statistical dead heat. That result tells you something interesting about where frontier models are converging. But the 63% figure tells you something urgent about the code already running in production.
How the Test Was Actually Run
Peretz used what he calls feature-only prompts: no instruction to “make it secure,” no safety nudge, no security-aware system prompt. That’s a deliberate and honest choice. It mirrors how most developers actually use AI coding tools day-to-day. You ask for an auth middleware. You ask for a user search endpoint. You don’t preface every prompt with a security checklist, because that defeats the purpose of the tool.
Each prompt was run once through Gemini 2.5 Flash via the Gemini CLI, and once through Claude Sonnet 4.6 via the Claude CLI — the comparable mid-tier, default-price options each vendor surfaces. The outputs were then linted with domain-specific plugins Peretz wrote himself, each rule mapped to a MITRE CWE identifier. He’s transparent about the sample size: n=1 per domain, with a rerun on the JWT section that produced identical results both times. These are directional findings about AI code security, not a statistically controlled trial. But the failure modes are stable, and that matters more than the margin.
Where AI Code Security Actually Breaks Down
JWT: Both Models Stop One Step Short
On the JWT middleware test, both models produced clean baseline implementations. No jwt.decode shortcut instead of jwt.verify. No alg: none footgun. No hardcoded secrets. For the catastrophic, well-documented JWT mistakes, both passed with flying colors.
Then they both stopped at exactly the same place. Neither model passed an audience option to jwt.verify(). Without it, a token minted for a completely different service will pass verification, provided it was signed with a key this service also trusts. A reviewer scanning the code sees jwt.verify() and ships it, because it looks right. The question nobody asks is: verifies for whom? That specific blind spot is what Peretz’s require-audience-validation rule is designed to catch, and it caught both models equally. This is a recurring theme: the code looks right, but a critical option is simply absent. The round ended 5–5.
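To make the audience gap concrete, here is a minimal sketch of an HS256 verifier built only on Node’s crypto module. It is not the code either model produced, and signToken, verifyToken, and the service names are hypothetical. With the jsonwebtoken library, the equivalent hardening is passing an audience value in the options argument of jwt.verify(). Expiry and issuer checks are left out to keep the sketch short.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Only the claims this sketch needs; real tokens carry more (exp, iss, ...).
type Claims = { sub: string; aud?: string | string[] };

const b64url = (text: string): string =>
  Buffer.from(text, "utf8").toString("base64url");

// Hypothetical helper that mints an HS256 token, for illustration only.
function signToken(claims: Claims, secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const payload = b64url(JSON.stringify(claims));
  const signature = createHmac("sha256", secret)
    .update(`${header}.${payload}`)
    .digest("base64url");
  return `${header}.${payload}.${signature}`;
}

// Verifies the signature and, when expectedAudience is given, the aud claim.
function verifyToken(
  token: string,
  secret: string,
  expectedAudience?: string,
): Claims {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("malformed token");
  const [header, payload, signature] = parts as [string, string, string];

  // Pin the algorithm instead of trusting whatever the header claims.
  const { alg } = JSON.parse(Buffer.from(header, "base64url").toString("utf8"));
  if (alg !== "HS256") throw new Error("unexpected algorithm");

  const expected = createHmac("sha256", secret)
    .update(`${header}.${payload}`)
    .digest();
  const actual = Buffer.from(signature, "base64url");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) {
    throw new Error("invalid signature");
  }

  const claims = JSON.parse(
    Buffer.from(payload, "base64url").toString("utf8"),
  ) as Claims;

  // The step both models skipped: was this token meant for this service?
  if (expectedAudience !== undefined) {
    const audiences: string[] =
      claims.aud === undefined
        ? []
        : Array.isArray(claims.aud)
          ? claims.aud
          : [claims.aud];
    if (!audiences.includes(expectedAudience)) {
      throw new Error("audience mismatch");
    }
  }
  return claims;
}
```

Called without expectedAudience, verifyToken accepts a token minted for a hypothetical billing-service as long as the secret matches. Called with "orders-service", it rejects the same token.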
MongoDB Search: The Password Hash Problem
This is the finding that should prompt an immediate audit of your own codebase. Both Claude and Gemini wrote MongoDB search functions that returned complete user documents, password hashes included, straight to the caller. The fix is a single chained projection on the query, .select('-passwordHash'), here paired with .lean() so the query returns plain objects. Neither model wrote it.
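The query-level projection needs a live Mongoose model, so the sketch below shows a complementary, response-level safeguard instead: an allowlist mapper that copies only the fields a caller may see. This is a swapped-in illustration, not Peretz’s rule or either model’s output, and the UserDoc shape, UserModel, and toPublicUser are hypothetical names.

```typescript
// Query-level fix from the article, shown as a comment because it needs a
// live Mongoose model (UserModel and filter are hypothetical names):
//   const users = await UserModel.find(filter).select('-passwordHash').lean();

// Assumed document shape, for illustration only.
type UserDoc = {
  _id: string;
  email: string;
  displayName: string;
  passwordHash: string;
};

// Only these fields may leave the service.
type PublicUser = Pick<UserDoc, "_id" | "email" | "displayName">;

// Allowlist mapping: copy known-safe fields rather than deleting known-bad
// ones, so a sensitive field added to the schema later stays private by default.
function toPublicUser(doc: UserDoc): PublicUser {
  return { _id: doc._id, email: doc.email, displayName: doc.displayName };
}

function toPublicUsers(docs: UserDoc[]): PublicUser[] {
  return docs.map(toPublicUser);
}
```

An allowlist at the response boundary does not replace the projection, which also keeps the hash from being loaded out of the database at all. It is a second layer for the case where someone forgets the projection.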
What makes this sting a little more is the contrast. On operator injection (handing a user-supplied search object directly into a Mongoose query, a classic attack vector) both models actually performed well. Zero unsafe-query violations on either side. The frontier has clearly internalized “don’t interpolate untrusted input.” It just hasn’t internalized “don’t return the password field.” One of those failure modes is discussed in every security tutorial. The other apparently isn’t discussed enough.
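For readers who have not seen the operator-injection pattern both models avoided, here is a minimal sketch of one common defense: accept only a string search term and build the filter object yourself. The buildNameFilter helper and the displayName field are hypothetical, and the generated code in the benchmark may have taken a different approach.

```typescript
// Escape regex metacharacters so the term is matched literally.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

type NameFilter = { displayName: { $regex: string; $options: string } };

// Build the filter from a validated string instead of passing the caller's
// object through. A payload such as { "$ne": null } is rejected outright.
function buildNameFilter(term: unknown): NameFilter {
  if (typeof term !== "string") {
    throw new TypeError("search term must be a string");
  }
  return { displayName: { $regex: escapeRegex(term.trim()), $options: "i" } };
}
```

The type check is what matters here: operators like $ne or $where can only reach the query if an object supplied by the caller is used as the filter, or as a value inside it.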
NestJS: Framework Idiom as a Security Proxy
The one clean Gemini win came down to something more subtle than raw security knowledge. Asked to generate a NestJS users service, Gemini’s CLI defaulted to idiomatic NestJS patterns — class-level @UseGuards decorators, @Exclude() on the password field, class-validator on every DTO. Claude wrote functionally equivalent code without any of that scaffolding, drawing six findings to Gemini’s two.
The lesson here isn’t that Gemini knows more about security. It’s that Gemini defaulted to the secure idiom of the specific framework, and in an opinionated framework like NestJS, those idioms encode a lot of hardening for free. That’s a genuinely useful property, and it’s worth thinking about when you’re choosing which model to point at a specific tech stack.
The Split Round: When More Code Means More Surface
The most thought-provoking result came from the general API domain — JSON/XML import, dynamic search, password reset flow. Peretz’s secure-coding plugin flagged 9 issues in Gemini’s output and 13 in Claude’s. At first glance, Claude loses the round. But look at what’s behind the numbers.
Claude’s extra findings came from Claude doing more. It explicitly rejected XML DOCTYPE and ENTITY declarations — hardening against XXE attacks. It allowlisted the search field. It actually implemented token verification for the password reset flow. Gemini issued a token and stopped. Fewer lines, fewer findings, less functionality.
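The two hardening steps credited to Claude above can be sketched roughly as follows. This is an illustration under assumptions, not Claude’s actual output: the regex pre-check, the function names, and the field list are all hypothetical, and a production system would also configure its XML parser to disable DTD processing rather than rely on a pre-check alone.

```typescript
// Reject XML that declares a DOCTYPE or ENTITY before it reaches a parser.
const FORBIDDEN_XML_DECLARATION = /<!\s*(DOCTYPE|ENTITY)\b/i;

function assertNoDtd(xml: string): void {
  if (FORBIDDEN_XML_DECLARATION.test(xml)) {
    throw new Error("XML containing DOCTYPE or ENTITY declarations is rejected");
  }
}

// Hypothetical allowlist of fields a caller may search on.
const SEARCHABLE_FIELDS = ["displayName", "email", "createdAt"] as const;
type SearchField = (typeof SEARCHABLE_FIELDS)[number];

// Return the field only if it is on the allowlist; otherwise refuse.
function parseSearchField(field: string): SearchField {
  const match = SEARCHABLE_FIELDS.find((allowed) => allowed === field);
  if (match === undefined) {
    throw new Error(`field "${field}" is not searchable`);
  }
  return match;
}
```

Both checks add lines, and therefore lint surface, which is the trade-off the split round illustrates.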
The most clear-cut vulnerability in the benchmark surfaced here. Claude’s password reset flow compared tokens with a direct === operator, five times, instead of a timing-safe comparison. That’s CWE-208, and it’s a real problem: === can return as soon as it finds a mismatch, so response times can leak information about the token to a patient attacker. The correct approach is normalizing both values to a fixed-length hash first, then using timingSafeEqual from Node’s crypto module, which requires inputs of equal length. Claude built the verification surface. Claude also got it wrong. Gemini never built the surface at all, so it had nothing to get wrong.
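A minimal sketch of that hash-then-compare approach, using only Node’s crypto module. The function name tokensMatch is hypothetical, and SHA-256 is one reasonable choice of fixed-length digest rather than something the source specifies.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both values to 32-byte digests first. timingSafeEqual throws when its
// inputs differ in length, and hashing also avoids revealing the token length.
function tokensMatch(provided: string, expected: string): boolean {
  const providedDigest = createHash("sha256").update(provided, "utf8").digest();
  const expectedDigest = createHash("sha256").update(expected, "utf8").digest();
  return timingSafeEqual(providedDigest, expectedDigest);
}
```

In a reset handler this replaces a check like providedToken === storedToken. The comparison then no longer stops early at the first differing byte, which is the behavior a timing attack relies on.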
What AI Code Security Gaps Mean for Your Review Process
There’s a pattern across all four domains that’s more alarming than any individual finding. The vulnerabilities that both models miss are consistently the ones that look correct on a quick read. jwt.verify() passes review because it says verify. A Mongoose find() passes review because it doesn’t say $where. A password comparison passes review because it uses the right variable names.
The gaps aren’t in the obvious places. They’re one layer deeper — the option nobody added, the projection nobody specified, the comparison function nobody swapped out. And if 63% of AI-generated functions ship a vulnerability, the uncomfortable implication is that code review — human code review, conducted by people who are presumably security-conscious enough to be reviewing auth middleware — isn’t reliably catching these either.
That shifts the framing considerably. The question isn’t really “which AI writes more secure code?” It’s “what static analysis tooling should be mandatory in any pipeline that ships AI-generated code?” Addressing the problem at the tooling level, through ESLint plugins, Semgrep, Snyk, or Sonar with security rulesets tuned to your stack, is far more reliable than eyeball review alone, whether the code came from a human or a model. Peretz’s ESLint plugins are one answer, built specifically around the failure modes he’s documented.
The Claude-versus-Gemini leaderboard will keep getting clicks, and vendors will keep publishing benchmarks that flatter their own models. But the more durable story here is that the AI code security gap between frontier models is narrowing fast, while the gap between what those models produce and what’s genuinely hardened remains stubbornly wide. Closing that gap is an infrastructure problem — tooling, linting rules, mandatory review gates — not a model selection problem. The sooner engineering teams treat it that way, the better.



