Old Xeon CPU Runs Gemma 4 AI: Surprising Results

  • Running a large language model on a Xeon CPU from 2016 is possible with the right low-level inference optimisations.
  • It works because LLM token generation is bound by memory bandwidth, not by raw compute.
  • Speculative decoding and expert-routing tricks let slow DDR3 RAM keep up with modern model demands.
  • Off-the-shelf tools like Ollama lack the tuning knobs needed to make this kind of setup viable.

AI on an Old Xeon CPU: The Setup Nobody Expected to Work

Xeon silicon from nearly a decade ago, no GPU in sight, and DDR3 RAM that modern benchmarks treat as a historical footnote: that’s not a recipe most engineers would put on a whiteboard. Yet that’s exactly what one developer pulled off, getting Google’s Gemma 4 26B language model running on a recycled server built around an Intel Xeon E5-2620 v4, a chip Intel launched in 2016 that is roughly five times slower than a current consumer laptop CPU.

The machine in question has 128 GB of RAM, which sounds generous until you learn it’s DDR3, with memory bandwidth five to six times lower than what you’d find in a decent modern laptop. There’s no GPU, no integrated graphics fallback, and no AVX-512 or AVX-VNNI instruction set support to paper over the chip’s age. On paper, it has no business running a current-generation AI model. In practice, it works, and the reasoning behind why is more interesting than the stunt itself.

Why Memory Bandwidth Is the Real Bottleneck for LLM Inference

There’s a widely misunderstood idea floating around that running large language models requires the most powerful compute you can find. It’s half true. During the generation phase — when a model streams tokens to your screen one at a time, the way ChatGPT does — raw CPU or GPU processing power isn’t actually the limiting factor. The bottleneck is memory bandwidth.

Every single token the model generates requires the processor to pull the model’s weights, gigabytes of learned parameters, from RAM into its local cache, execute the necessary matrix calculations almost instantly, then wait for the next batch of weights to arrive from memory. The compute cores sit idle while the memory bus plays catch-up. This is what engineers call a memory-bound workload, sometimes described as the “memory wall”, and the constraint applies whether you’re running on an old Xeon or a data-centre Nvidia H100.

The implication is that a slow CPU with lots of RAM isn’t as catastrophically outgunned as you might assume. The memory bus is the race, and everyone’s running it. An old Xeon is simply running on a slower track, but it’s still in the same race.
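The memory-bound argument can be turned into a rough ceiling on decode speed: if every active weight has to be read from RAM once per token, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below is a back-of-envelope illustration only. The two bandwidth figures and the 4-bit (0.5 bytes per parameter) quantisation are hypothetical assumptions, not measurements from the developer’s machine; the 3.8 billion active parameters is the figure quoted later in this article.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                active_params_billion: float,
                                bytes_per_param: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical bandwidth figures, chosen only to show the scaling.
for label, bandwidth in [("slower memory bus", 40.0), ("faster memory bus", 200.0)]:
    ceiling = decode_ceiling_tokens_per_s(
        bandwidth, active_params_billion=3.8, bytes_per_param=0.5
    )
    print(f"{label}: at most ~{ceiling:.1f} tokens/s")
```

The ceiling scales linearly with bandwidth and ignores compute, cache hits, and KV-cache traffic, so real throughput will differ. The point is only that the slower machine is bounded by the same formula as the faster one.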

Why Standard Tools Like Ollama Fall Short Here

Most people experimenting with local AI reach for Ollama or the standard llama.cpp CLI. They’re excellent for getting something running quickly on normal hardware. On a machine like this, though, they become obstacles. As the developer notes, Ollama doesn’t expose enough configuration to meaningfully tune inference, and even the standard llama-cli is optimised around GPU use cases that simply don’t apply here. Beyond that, both tools lack the state-of-the-art optimisations that production inference systems have been adopting over the past year or two.

Anyone attempting this with Ollama will quickly hit a ceiling, not because the hardware can’t cope, but because the tooling doesn’t expose the controls needed to get the most from it. The solution was ik_llama.cpp, a fork of the standard llama.cpp project that exposes far more low-level knobs and, crucially, implements optimisations the upstream project hasn’t caught up to yet. On this hardware, those knobs aren’t optional extras. They’re the difference between a model that generates text at a useful speed and one that crawls.

Speculative Decoding: The Smartest Workaround to the Memory Wall

The most impactful optimisation the developer applied is speculative decoding — and it’s worth understanding why it matters so much more on CPU than on GPU.

The basic idea: pair a large, accurate “verifier” model (in this case the full Gemma 4 26B) with a tiny, fast “drafter” model that generates candidate tokens ahead of the verifier. The verifier then checks those drafts in a single pass and accepts the valid ones. When it works well, you get multiple tokens confirmed per decoding cycle instead of one, effectively multiplying throughput without changing the hardware.
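The draft-then-verify cycle described above can be sketched with two toy “models”. This is a minimal illustration assuming greedy decoding; the names `speculative_step`, `toy_verifier`, and `toy_drafter` are hypothetical and are not taken from ik_llama.cpp. A real engine scores all drafted positions in one batched forward pass, whereas this toy calls the verifier once per position for clarity.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token.
Model = Callable[[List[int]], int]


def speculative_step(verifier: Model, drafter: Model,
                     context: List[int], draft_max: int) -> List[int]:
    """Return the tokens confirmed by one draft-then-verify cycle."""
    # 1. The drafter proposes draft_max tokens, one after another.
    draft: List[int] = []
    for _ in range(draft_max):
        draft.append(drafter(context + draft))

    # 2. The verifier checks each proposal against its own choice.
    accepted: List[int] = []
    for proposed in draft:
        expected = verifier(context + accepted)
        if proposed != expected:
            accepted.append(expected)  # the verifier's correction ends the cycle
            return accepted
        accepted.append(proposed)

    # 3. Every draft matched, so the verifier also contributes one extra token.
    accepted.append(verifier(context + accepted))
    return accepted


def toy_verifier(context: List[int]) -> int:
    """Toy 'large' model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_drafter(context: List[int]) -> int:
    """Toy 'small' model: agrees with the verifier except after a 3 or a 7."""
    return 0 if context[-1] % 4 == 3 else (context[-1] + 1) % 10


print(speculative_step(toy_verifier, toy_drafter, [0], draft_max=3))
print(speculative_step(toy_verifier, toy_drafter, [2], draft_max=3))
```

Whatever the drafter proposes, the confirmed tokens are always the ones the verifier would have produced by itself, so a poor drafter costs speed but not output quality under greedy decoding.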

On a GPU with high-bandwidth memory, the gain is real but modest. On a CPU, the economics are different. CPU compute cycles are cheap relative to the cost of streaming the verifier’s weights through cache. A small drafter model, one whose active layers actually fit inside the processor’s L3 cache, can generate draft tokens at almost negligible marginal cost, because it’s not triggering the expensive memory-streaming that makes the verifier slow. This dynamic is especially pronounced on an old Xeon platform like this one, where DDR3 bandwidth is the primary constraint.

The configuration used here, --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune, allows up to three speculative tokens per step, sets the minimum draft probability to zero so that no draft is cut short for low drafter confidence, and uses autotune to dynamically adjust chain length depending on what the model is actually doing. That last part matters because Gemma 4 is a reasoning model. Even when its internal chain-of-thought is hidden from the end user, the hardware still has to process every thinking token individually during the decoding pass. Speculative decoding cuts that cost significantly.
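To get a feel for how much a draft length of three can help, a common simplification is to assume each drafted token is accepted independently with the same probability. Under that assumption, the expected number of tokens confirmed per verifier pass has a closed form. The acceptance rates in the loop below are hypothetical; the source does not report measured acceptance rates.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_max: int) -> float:
    """Expected tokens confirmed per verifier pass.

    Assumes every drafted token is accepted independently with probability
    acceptance_rate, which is a simplification of real model behaviour.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        return float(draft_max + 1)  # all drafts plus the verifier's extra token
    # Geometric series: 1 + a + a^2 + ... + a^draft_max
    return (1.0 - acceptance_rate ** (draft_max + 1)) / (1.0 - acceptance_rate)


# Hypothetical acceptance rates, for illustration only.
for rate in (0.3, 0.6, 0.9):
    print(f"acceptance {rate:.0%}: {expected_tokens_per_pass(rate, 3):.2f} tokens per pass")
```

Because each verifier pass streams the large model’s weights once, tokens per pass is roughly the factor by which memory traffic per token shrinks, before accounting for the drafter’s own cost.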

Cache-Aware MoE Routing and Fused Matrix Operations

Gemma 4’s 26B parameter count is somewhat misleading. It’s a Mixture-of-Experts (MoE) architecture, meaning only a subset of its 128 expert sub-networks activate for any given token: roughly 8 experts, covering about 3.8 billion active parameters at any one time. That’s one reason it’s feasible to run the model on hardware this old at all.
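The expert selection itself is usually a top-k choice over router scores. The following is a generic sketch of that idea, not Gemma 4’s actual router: the function name `route_top_k` is hypothetical, the logits are random, and normalising the weights over only the selected experts is an assumption.

```python
import math
import random
from typing import List, Tuple


def route_top_k(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# 128 experts with 8 active per token, matching the figures quoted above.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(128)]
for expert_id, weight in route_top_k(logits, k=8):
    print(f"expert {expert_id:3d} weight {weight:.3f}")
```

Only the chosen experts’ weights need to be read for that token, which is why the active parameter count, rather than the headline 26B, drives the memory traffic per token.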

But MoE on CPU introduces its own problem: cache thrashing. If the expert-routing logic picks experts in an arbitrary order, the CPU constantly has to dump what’s in its L1, L2, and L3 caches and reload fresh weights from main RAM, which on this machine is DDR3. The flag --cpu-moe tells the routing layer to be smarter about sequencing expert access, keeping weights resident in cache for as long as possible rather than bouncing between remote memory addresses.
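Why access order matters can be shown with a toy cache model. The sketch below counts how often an expert’s weights have to be fetched from RAM when the cache is modelled as a small LRU holding a fixed number of experts. It illustrates the thrashing idea only: the LRU model, the slot count, and the access pattern are all assumptions, and this is not how ik_llama.cpp implements its scheduling.

```python
from collections import OrderedDict
from typing import Iterable


def count_weight_loads(expert_accesses: Iterable[int], cache_slots: int) -> int:
    """Count RAM fetches for a sequence of expert accesses under an LRU cache."""
    cache: "OrderedDict[int, None]" = OrderedDict()
    loads = 0
    for expert in expert_accesses:
        if expert in cache:
            cache.move_to_end(expert)      # cache hit: mark as most recently used
            continue
        loads += 1                         # cache miss: weights come from RAM
        cache[expert] = None
        if len(cache) > cache_slots:
            cache.popitem(last=False)      # evict the least recently used expert
    return loads


# Hypothetical pattern: three experts needed repeatedly, room for only two.
interleaved = [0, 1, 2, 0, 1, 2, 0, 1, 2]
grouped = sorted(interleaved)              # same work, reordered by expert

print("interleaved loads:", count_weight_loads(interleaved, cache_slots=2))
print("grouped loads:    ", count_weight_loads(grouped, cache_slots=2))
```

The two sequences contain exactly the same accesses, but grouping them by expert means each expert’s weights are fetched once instead of on every use.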

On top of that, --merge-up-gate-experts fuses two separate per-expert projection operations, the “up” and “gate” projections, into a single matrix multiplication. The logs confirm this with a fused_up_gate = 1 entry. Normally these operations would run sequentially, writing intermediate results to memory between steps. Fusing them eliminates that round-trip entirely, keeping the data in registers and reducing the number of times the system has to touch RAM to get a result.
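The fusion is mathematically just a concatenation: stacking the gate and up weight matrices lets one matrix-vector product produce both results in a single pass over the input. The pure-Python sketch below shows that the fused and separate paths give the same output. The tiny matrices are made up, the function names are hypothetical, and SiLU is used as a stand-in activation (an assumption; the model’s real activation may differ).

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(weights: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def silu(v: float) -> float:
    """Stand-in gating activation (assumption, chosen for brevity)."""
    return v / (1.0 + math.exp(-v))


def hidden_separate(w_gate: Matrix, w_up: Matrix, x: List[float]) -> List[float]:
    """Two passes over x: one for the gate projection, one for the up projection."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]


def hidden_fused(w_gate_up: Matrix, x: List[float]) -> List[float]:
    """One pass over x using the row-wise concatenation of gate and up weights."""
    both = matvec(w_gate_up, x)
    half = len(both) // 2
    return [silu(g) * u for g, u in zip(both[:half], both[half:])]


# Made-up 2x2 weights for illustration.
w_gate = [[1.0, 0.5], [-0.5, 2.0]]
w_up = [[2.0, 0.0], [0.25, 1.0]]
x = [1.0, -1.0]

print(hidden_separate(w_gate, w_up, x))
print(hidden_fused(w_gate + w_up, x))  # list concatenation stacks the rows
```

In a real kernel the benefit comes from reading the input activations once and avoiding an intermediate write, not from any change in the arithmetic.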

Together with the thread count and parallel batch settings (-t 8 --parallel 8, matching the chip’s 8 physical cores), these flags transform a naively slow inference run into something genuinely usable.

What This Actually Means for AI Hardware Assumptions

The experiment is a useful corrective to some of the orthodoxy that has built up around AI hardware. The conventional wisdom, that running serious AI models requires modern GPUs, fast memory, and current-generation silicon, is true in the aggregate but misleading at the margins. The memory-bound nature of LLM decoding means that an old Xeon with sufficient RAM can compete more credibly than raw specs suggest, especially with the right inference stack.

That has real implications. Enterprise IT departments are sitting on fleets of recycled servers with large RAM configurations and no obvious future. Hobbyists and researchers in cost-constrained environments don’t necessarily need to wait for GPU availability or cloud credits to experiment with AI; an old Xeon system can be enough. And the gap between what consumer-grade inference tools expose and what’s actually achievable with the right low-level tuning is wider than most users realise.

Ollama and its contemporaries have made local AI accessible, which is genuinely valuable. But as models like Gemma 4 get more architecturally complex — deeper reasoning chains, MoE routing, multi-token prediction drafters — the tools built for casual use will increasingly leave performance on the table. The developers who go deeper into the stack, as this experiment shows, can recover a lot of that headroom even on hardware that the industry has already written off.

Source: https://point.free/blog/gemma-4-on-a-2016-xeon/

Wasiq Tariq