AI Code Security: The Surprising 63% Failure Rate Claude and Gemini Sh

June 1, 2026

AI code security benchmarks show Claude and Gemini miss the same critical hardening steps across all four tested domains.
63% of AI-generated functions shipped at least one vulnerability when scored against ESLint security plugins mapped to CWEs.
Both models avoided classic SQL-style injection traps but consistently failed to strip sensitive fields like password hashes from query results.
The real finding isn’t which model wins — it’s that your code review process almost certainly isn’t catching what the AI missed.

The Number That Should Make You Check Your Own Repo

AI code security is routinely framed as a horse race — Claude versus Gemini, OpenAI versus Anthropic, benchmark after benchmark ranking models on a leaderboard nobody quite trusts. A recent deep-dive by developer Ofri Peretz, published on Dev.to, cuts through that noise with a finding that’s more uncomfortable than any ranking: across 700 AI-generated functions scored against purpose-built ESLint security plugins, 63% shipped with at least one vulnerability. The winner of the head-to-head? Barely relevant.

The actual scoreboard across four AI code security domains — NestJS service generation, JWT middleware, MongoDB search, and a general API implementation — came out to one Gemini win, two ties, and one split. A statistical dead heat. That result tells you something interesting about where frontier models are converging. But the 63% figure tells you something urgent about the code already running in production.

How the Test Was Actually Run

Peretz used what he calls feature-only prompts: no instruction to “make it secure,” no safety nudge, no security-focused system prompt. That’s a deliberate and honest choice. It mirrors how most developers actually use AI coding tools day-to-day. You ask for an auth middleware. You ask for a user search endpoint. You don’t preface every prompt with a security checklist, because that defeats the purpose of the tool.

Each prompt was run once through Gemini 2.5 Flash via the Gemini CLI, and once through Claude Sonnet 4.6 via the Claude CLI — the comparable mid-tier, default-price options each vendor surfaces. The outputs were then linted with domain-specific plugins Peretz wrote himself, each rule mapped to a MITRE CWE identifier. He’s transparent about the sample size: n=1 per domain, with a rerun on the JWT section that produced identical results both times. These are directional findings about AI code security, not a statistically controlled trial. But the failure modes are stable, and that matters more than the margin.

Where AI Code Security Actually Breaks Down

JWT: Both Models Stop One Step Short

On the JWT middleware test, both models produced clean baseline implementations. No jwt.decode shortcut instead of jwt.verify. No alg: none footgun. No hardcoded secrets. For the catastrophic, well-documented JWT mistakes, both passed with flying colors.

Then they both stopped at exactly the same place. Neither model included an audience validation option. Without it, a token minted by a completely different service will pass verification. A reviewer scanning the code sees jwt.verify() and ships it — because it looks right. The question nobody asks is: verifies for whom? That specific blind spot is what Peretz’s require-audience-validation rule is designed to catch, and it caught both models equally. This is a recurring theme in AI code security: the code looks right, but a critical option is simply absent. Round ended 5–5.
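To make the gap concrete, here is a minimal sketch of what an audience check adds on top of a signature check. It hand-rolls HS256 with Node's crypto module purely so the example is self-contained; it is not Peretz's rule or either model's output, and real code would normally use a JWT library instead (in the jsonwebtoken package, for instance, the equivalent is the audience option passed to verify). The secret and the service names "billing-api" and "search-api" are hypothetical.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical demo secret; never hardcode a real one.
const SECRET = "demo-secret-not-for-production";

type Claims = { sub: string; aud: string };

const encode = (value: unknown): string =>
  Buffer.from(JSON.stringify(value), "utf8").toString("base64url");

// Mint an HS256 token (demo only).
function sign(claims: Claims, secret: string): string {
  const signingInput = `${encode({ alg: "HS256", typ: "JWT" })}.${encode(claims)}`;
  const signature = createHmac("sha256", secret).update(signingInput).digest("base64url");
  return `${signingInput}.${signature}`;
}

// Verify the signature, then require a matching audience if one is given.
function verify(token: string, secret: string, expectedAud?: string): Claims {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("malformed token");
  const [header, payload, signature] = parts;

  const parsedHeader = JSON.parse(Buffer.from(header, "base64url").toString("utf8"));
  if (parsedHeader.alg !== "HS256") throw new Error("unexpected algorithm");

  const expected = createHmac("sha256", secret).update(`${header}.${payload}`).digest();
  const actual = Buffer.from(signature, "base64url");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) {
    throw new Error("bad signature");
  }

  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8")) as Claims;
  // The step both models skipped: without it, any validly signed token passes.
  if (expectedAud !== undefined && claims.aud !== expectedAud) {
    throw new Error("audience mismatch");
  }
  return claims;
}

// A token minted for a different service that shares the same signing key.
const foreignToken = sign({ sub: "user-1", aud: "billing-api" }, SECRET);
```

Calling verify(foreignToken, SECRET) succeeds because the signature is valid, while verify(foreignToken, SECRET, "search-api") throws "audience mismatch". That difference is the whole finding: the signature check alone says nothing about which service the token was meant for.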

MongoDB Search: The Password Hash Problem

This is the AI code security finding that should prompt an immediate audit of your own codebase. Both Claude and Gemini wrote MongoDB search functions that returned complete user documents, password hashes included, straight to the caller. The fix is one chained call on the query: .select('-passwordHash').lean(), where select() drops the hash from the projection and lean() returns plain objects. Neither model wrote it.
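The query-level projection is the primary fix. As a second layer, a response mapper can allowlist what leaves the service, so a forgotten projection does not leak the hash. The sketch below illustrates that idea only; the UserDoc shape, its field names, and the User model mentioned in the comment are assumptions, not code from the benchmark.

```typescript
// Hypothetical document shape; the field names are assumptions.
interface UserDoc {
  _id: string;
  email: string;
  displayName: string;
  passwordHash: string;
  resetToken?: string;
}

// The only fields a caller is ever allowed to see.
type PublicUser = Pick<UserDoc, "_id" | "email" | "displayName">;

// Allowlist mapping: copy known-safe fields and drop everything else.
// Query-level counterpart in Mongoose (User is a hypothetical model):
//   User.find(filter).select("-passwordHash").lean()
function toPublicUser(doc: UserDoc): PublicUser {
  return {
    _id: doc._id,
    email: doc.email,
    displayName: doc.displayName,
  };
}
```

An allowlist is used here rather than deleting passwordHash because it fails safe: a sensitive field added to the schema later stays hidden until someone explicitly exposes it.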

What makes this sting a little more is the contrast. On operator injection (handing a user-supplied search object directly into a Mongoose query, a classic attack vector) both models actually performed well. Zero unsafe-query violations on either side. The frontier has clearly internalized “don’t interpolate untrusted input.” It just hasn’t internalized “don’t return the password field.” One of those failure modes is discussed in every security tutorial. The other apparently isn’t discussed enough.
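For readers unfamiliar with operator injection, the following sketch shows one common way to guard against it: accept only a plain string from the request and build the filter yourself. The request shape, the name field, and the helper names are hypothetical, and this is an illustration of the pattern rather than what either model generated.

```typescript
// Accept only a plain string for a user-supplied search term. An object such
// as { $ne: null } is rejected before it can reach the query.
function requireString(value: unknown, field: string): string {
  if (typeof value !== "string") {
    throw new TypeError(`${field} must be a string`);
  }
  return value;
}

// Escape regex metacharacters so the term is matched literally.
function escapeRegex(term: string): string {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

type NameFilter = { name: { $regex: string; $options: string } };

// Build the filter server-side instead of passing the request body through.
function buildNameFilter(body: { name?: unknown }): NameFilter {
  const term = requireString(body.name, "name");
  return { name: { $regex: escapeRegex(term), $options: "i" } };
}
```

With this in place, a body of { name: "ann" } yields a case-insensitive match on the literal text, while { name: { $ne: null } } throws a TypeError instead of matching every user.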

NestJS: Framework Idiom as a Security Proxy

The one clean Gemini win came down to something more subtle than raw security knowledge. Asked to generate a NestJS users service, Gemini’s CLI defaulted to idiomatic NestJS patterns — class-level @UseGuards decorators, @Exclude() on the password field, class-validator on every DTO. Claude wrote functionally equivalent code without any of that scaffolding, drawing six findings to Gemini’s two.

The lesson here isn’t that Gemini knows more about AI code security. It’s that Gemini defaulted to the secure idiom of the specific framework — and in an opinionated framework like NestJS, those idioms encode a lot of hardening for free. That’s a genuinely useful property, and it’s worth thinking about when you’re choosing which model to point at a specific tech stack.

The Split Round: When More Code Means More Surface

The most thought-provoking result came from the general API domain — JSON/XML import, dynamic search, password reset flow. Peretz’s secure-coding plugin flagged 9 issues in Gemini’s output and 13 in Claude’s. At first glance, Claude loses the round. But look at what’s behind the numbers.

Claude’s extra findings came from Claude doing more. It explicitly rejected XML DOCTYPE and ENTITY declarations — hardening against XXE attacks. It allowlisted the search field. It actually implemented token verification for the password reset flow. Gemini issued a token and stopped. Fewer lines, fewer findings, less functionality.
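The two hardening steps named above can be sketched roughly as follows. This is an assumed illustration, not Claude's actual output: the function names and the set of searchable fields are hypothetical, and the string pre-check is deliberately coarse. It complements, rather than replaces, disabling DTD processing in whatever XML parser is used.

```typescript
// Reject XML that declares a DOCTYPE or ENTITY before it reaches any parser.
// Case-insensitive on purpose: stricter than the XML grammar requires.
function assertNoDtd(xml: string): string {
  if (/<!\s*(DOCTYPE|ENTITY)/i.test(xml)) {
    throw new Error("DOCTYPE and ENTITY declarations are not allowed");
  }
  return xml;
}

// Hypothetical set of searchable fields; anything else is refused.
const SEARCHABLE_FIELDS: ReadonlySet<string> = new Set(["name", "email"]);

// Allowlist the field a caller asks to search on.
function assertSearchField(field: string): string {
  if (!SEARCHABLE_FIELDS.has(field)) {
    throw new Error(`field not searchable: ${field}`);
  }
  return field;
}
```

Both helpers return their input on success so they can wrap a value inline, for example assertSearchField(requestedField) at the point where the query is built.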

The one genuine vulnerability in the entire AI code security benchmark surfaced here. Claude’s password reset comparison used a direct === operator, five times, instead of a timing-safe comparison. That’s CWE-208 (observable timing discrepancy), and it’s a real problem. Because a plain string comparison can return as soon as it finds a difference, its running time can leak information about token length and content to a patient attacker. The correct approach is normalizing both values to a fixed-length hash first, then using timingSafeEqual from Node’s crypto module. Claude built the verification surface. Claude also got it wrong. Gemini never built the surface at all, so it had nothing to get wrong.
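A minimal sketch of that hash-then-compare approach, using only Node's crypto module, might look like this. The function name safeTokenEqual is hypothetical, and SHA-256 is one reasonable choice of fixed-length digest rather than the only one.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both values to a fixed 32-byte digest first. timingSafeEqual throws
// when its inputs differ in length, so comparing digests avoids that case
// and keeps the token length from influencing the comparison.
function safeTokenEqual(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided, "utf8").digest();
  const b = createHash("sha256").update(expected, "utf8").digest();
  return timingSafeEqual(a, b);
}
```

In a reset flow this replaces a check such as providedToken === storedToken with safeTokenEqual(providedToken, storedToken); the result is the same boolean, without the early exit.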

What AI Code Security Gaps Mean for Your Review Process

There’s a pattern across all four domains that’s more alarming than any individual finding. The AI code security vulnerabilities that both models miss are consistently the ones that look correct on a quick read. jwt.verify() passes review because it says verify. A Mongoose find() passes review because it doesn’t say $where. A password comparison passes review because it uses the right variable names.

The gaps aren’t in the obvious places. They’re one layer deeper — the option nobody added, the projection nobody specified, the comparison function nobody swapped out. And if 63% of AI-generated functions ship a vulnerability, the uncomfortable implication is that code review — human code review, conducted by people who are presumably security-conscious enough to be reviewing auth middleware — isn’t reliably catching these either.

That shifts the framing considerably. The question isn’t really “which AI writes more secure code?” It’s “what static analysis tooling should be mandatory in any pipeline that ships AI-generated code?” Addressing AI code security at the tooling level, through ESLint plugins, Semgrep, Snyk, or Sonar with security rulesets tuned to your stack, is far more reliable than eyeball review alone. Peretz’s ESLint plugins are one answer, built specifically around the failure modes he’s documented. Whether the code came from a human or a model, a reviewer’s read-through is not sufficient on its own.

The Claude-versus-Gemini leaderboard will keep getting clicks, and vendors will keep publishing benchmarks that flatter their own models. But the more durable story here is that the AI code security gap between frontier models is narrowing fast, while the gap between what those models produce and what’s genuinely hardened remains stubbornly wide. Closing that gap is an infrastructure problem — tooling, linting rules, mandatory review gates — not a model selection problem. The sooner engineering teams treat it that way, the better.
