Six Months of LLM Progress: The Biggest Surprises Revealed

  • Six months of LLM progress has delivered coding agents that crossed from ‘often works’ to reliable daily-driver tools.
  • LLM progress on local hardware is stunning — a 20.9GB Qwen model now beats frontier results from just months ago.
  • OpenClaw went from obscure project to sell-out Mac Mini phenomenon in under three months.
  • The benchmark arms race is showing cracks — a pelican on a bicycle has already outlived its usefulness as a test.

LLM Progress at a Glance: Six Months That Actually Mattered

If you want to understand where LLM progress stands right now, the most honest summary is this: coding agents went from promising to genuinely useful, and local models went from impressive party tricks to tools that can embarrass frontier systems from six months ago. Those two things sound modest. They’re not. Speaking at PyCon US 2026, developer and writer Simon Willison walked through the highlights of this period using a deceptively simple benchmark — generating an SVG of a pelican riding a bicycle — and what he found says quite a lot about how quickly the landscape is shifting.

The Model Leaderboard Has Been a Revolving Door

At the start of November 2025, the consensus best model was Claude Sonnet 4.5, released just weeks earlier on September 29th. It held that informal crown briefly before being overtaken in quick succession by GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, and then Anthropic clawed back the top spot with Claude Opus 4.5. That’s four leadership changes in a matter of weeks.

Willison’s pelican test, which asks models to draw the bird astride a bicycle in SVG format, is a clever proxy because labs have had little reason to train specifically for such a niche task. That makes it a more honest signal than standard benchmarks, which labs can optimize for directly. By his assessment, Gemini 3 produced the best pelican of that November cohort, though he’s careful to note pelicans aren’t everything. Most practitioners settled on Opus 4.5 as the overall crown-holder for the two months that followed.

The speed of model turnover here isn’t just a curiosity. It reflects just how compressed the competitive cycle has become. OpenAI, Anthropic, and Google DeepMind are all shipping at a cadence that would have seemed unrealistic two years ago, and none of them can afford to sit still. Anthropic’s published research makes clear that the investment going into reinforcement learning techniques alone is enormous — and the results are now visible in the products people actually use.
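To make the pelican test described above more concrete, here is a minimal sketch of the mechanical part of such a check: pulling the SVG out of a model’s reply and confirming it is at least well-formed XML. This is not Willison’s tooling. The prompt wording, the function names, and the sample reply are all assumptions for illustration, and judging whether the drawing actually looks like a pelican on a bicycle still takes a human eye.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Assumed prompt wording, based on the description of the test in this article.
PROMPT = "Generate an SVG of a pelican riding a bicycle"


def extract_svg(response_text: str) -> Optional[str]:
    """Return the first <svg>...</svg> span in a model reply, or None."""
    match = re.search(r"<svg\b.*?</svg>", response_text, re.DOTALL | re.IGNORECASE)
    return match.group(0) if match else None


def summarize_svg(svg_text: str) -> Optional[Dict[str, int]]:
    """Count elements by tag name, or return None if the SVG is not well-formed XML."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return None
    counts: Dict[str, int] = {}
    for element in root.iter():
        # Strip the "{namespace}" prefix that ElementTree adds to tag names.
        tag = element.tag.rsplit("}", 1)[-1]
        counts[tag] = counts.get(tag, 0) + 1
    return counts


# Hypothetical model reply, invented for this example.
sample_response = (
    "Here is your pelican:\n"
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">\n'
    '  <circle cx="30" cy="70" r="15"/>\n'
    '  <circle cx="70" cy="70" r="15"/>\n'
    '  <path d="M30 70 L50 40 L70 70"/>\n'
    "</svg>\n"
    "Let me know if you want changes."
)

svg = extract_svg(sample_response)
print(summarize_svg(svg) if svg else "no SVG found")
# prints {'svg': 1, 'circle': 2, 'path': 1}
```

A well-formedness check like this only filters out broken output. It says nothing about quality, which is part of why the test works as an informal benchmark rather than an automated one.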

Coding Agents Finally Crossed the Line

The bigger story from November wasn’t which model drew the best pelican. It was that coding agents — the kind that actually write, run, and debug code autonomously — crossed a quality threshold that practitioners had been waiting years for. Both OpenAI and Anthropic had spent much of 2025 applying Reinforcement Learning from Verifiable Rewards to their coding stacks, particularly in combination with their respective agent harnesses: Codex and Claude Code.

The payoff landed in November. Willison’s framing is precise and worth taking seriously: coding agents went from “often work” to “mostly work.” That gap is everything. A tool that often works is a research toy. A tool that mostly works is a daily driver. Crossing that line means developers can now offload real tasks, not just toy problems, without spending half their time cleaning up after the model.

This is a shift that’s easy to understate. For years, the honest answer to “can I use an AI coding agent in production?” was “sort of, with a lot of hand-holding.” That answer has changed. Not to “yes, unconditionally” — but to “yes, for a meaningful and growing category of work.” The implications for developer productivity, team sizing, and the kinds of projects that become financially viable are significant and still unfolding.

The Holiday Madness — and What It Revealed

December and January gave a lot of developers unstructured time, and many of them spent it probing the limits of the new models. Willison includes himself in this, with characteristic self-deprecation. He describes spinning up “wildly ambitious projects” during what he calls a short-lived bout of LLM psychosis — including a JavaScript interpreter written in Python, running inside Pyodide, running in WebAssembly, running in a browser. The demo works. It’s technically interesting. Nobody needed it.

That anecdote is more instructive than it might seem. When capable tools become available, smart people go a little feral with them. The holiday period produced a wave of creative-but-half-baked projects across the developer community, most of which have since been quietly shelved. What it demonstrated, though, is that the ceiling for what these tools can help you build has risen substantially. The projects that don’t survive aren’t failures — they’re evidence of genuine exploration at the edges of capability.

OpenClaw: The Digital Pet That Sold Out Mac Minis

One of the stranger subplots of the past six months is the rise of OpenClaw. The project went through several name changes between December and January before finding its final identity in February — and then it blew up. Within three months of its initial release, it had generated a level of attention remarkable for any open-source AI project, let alone a young one.

The physical manifestation of this was oddly specific: Mac Minis started selling out across Silicon Valley, reportedly because people were buying them as dedicated hardware to run their Claws locally. Drew Breunig’s observation — that Claws have become the new digital pets and a Mac Mini is the perfect aquarium for one — captures the cultural moment better than any press release could.

Willison’s own preferred metaphor is Alfred Molina’s Doc Ock from Spider-Man 2: AI-powered claws that are perfectly safe right up until something damages the inhibitor chip, at which point they turn and take over. It’s a joke, but it’s also a pretty honest description of how a lot of people relate to increasingly autonomous AI agents — useful, impressive, and with a non-zero chance of doing something you didn’t ask for.

Open Weight Models: The LLM Progress Nobody Expected

Perhaps the most surprising dimension of recent LLM progress isn’t what’s happening at the frontier — it’s what’s happening on your laptop. Qwen’s April releases, specifically Qwen3.6-35B-A3B, are a case in point. This is a model that weighs in at 20.9GB, runs locally on consumer hardware, and — by Willison’s pelican benchmark, at least — outdraws Claude Opus 4.7.

Let that sit for a moment. A model you can download and run on a laptop, without a cloud subscription, without API costs, without sending your data anywhere, is now outperforming, on specific tasks, what was the state-of-the-art frontier model just months ago. Willison is honest that this result probably says as much about the pelican benchmark hitting its ceiling as it does about Qwen’s raw capabilities. But it still points to something real: the gap between local and frontier models is closing faster than most people anticipated.

For enterprises worried about data privacy, for developers in markets with API access issues, and for anyone who simply wants AI tools that work offline, this trajectory matters enormously. Open weight models from Chinese labs — Qwen from Alibaba, along with others — have been consistently underestimated by Western observers, and the April 2026 results suggest that pattern isn’t going to change anytime soon.
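The 20.9GB figure is easier to appreciate with some quick arithmetic. The sketch below assumes that the “35B” in the model name is the total parameter count, that 1 GB means 10^9 bytes, and that the download is almost entirely weights. None of those details are stated in the source, so treat the result as a rough estimate and not a published specification.

```python
def bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Average bits stored per parameter, taking 1 GB = 10**9 bytes."""
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (params_billions * 1e9)


def size_at_precision_gb(params_billions: float, bits: float) -> float:
    """Approximate weight size in GB if every parameter used `bits` bits."""
    return params_billions * 1e9 * bits / 8 / 1e9


# Assumed inputs: 20.9 GB download, 35 billion total parameters.
print(f"{bits_per_weight(20.9, 35):.2f} bits per weight")   # prints 4.78 bits per weight
print(f"{size_at_precision_gb(35, 16):.1f} GB at 16 bits")  # prints 70.0 GB at 16 bits
```

Under those assumptions the download averages under 5 bits per weight, which is consistent with a quantized build. The same parameter count at 16 bits per weight would need roughly 70GB, well beyond the memory of most laptops.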

What the Next Six Months Might Look Like

The two themes Willison identifies — coding agents that work and local models that punch above their weight — are likely to compound rather than plateau. Coding agents that are already capable enough for daily use will become more reliable, handle longer context windows, and integrate more deeply into existing developer workflows. The question isn’t whether they’ll improve; it’s how fast practitioners will restructure their work around that improvement.

On the local model side, the trajectory suggests that within another six months, running a genuinely capable model on consumer hardware won’t be a niche developer hobby — it’ll be an increasingly normal option for a wide range of professional use cases. That shifts the competitive dynamics of the AI industry in ways that are still hard to fully map. Cloud providers have built enormous businesses on API access to frontier models. If capable models run locally for free, some of that business model faces real pressure.

And Google’s Jeff Dean apparently tweeted a video of an animated pelican riding a bicycle, alongside a frog on a penny-farthing, a giraffe driving a tiny car, an ostrich on roller skates, a turtle kickflipping a skateboard, and a dachshund driving a stretch limousine. Maybe the AI labs have been watching the benchmarks more closely than anyone thought.

Source: https://simonwillison.net/2026/May/19/5-minute-llms/

Zara