
Deep Learning Performance: The Essential 3-Bottleneck Framework

  • Deep learning performance breaks down into three bottlenecks: compute, memory bandwidth, and overhead — knowing which one you’re in changes everything.
  • Chasing deep learning performance gains without a framework is pure guesswork — first-principles thinking eliminates entire categories of useless optimisations.
  • Non-matrix-multiply operations make up just 0.2% of FLOPs in models like BERT, yet they can consume disproportionate time because of memory bandwidth costs.
  • GPU compute capacity is growing far faster than memory bandwidth, making bandwidth the harder long-term constraint for deep learning engineers to manage.

Why Deep Learning Performance Feels Like Witchcraft

Ask ten ML engineers how they approach deep learning performance problems and you’ll get ten different answers — most of them superstitious. Use in-place operations. Set your gradients to None instead of zero. Install PyTorch 1.10.0 but, for reasons nobody can fully articulate, definitely not 1.10.1. It’s a folklore-driven discipline, and honestly, it’s not hard to see why. Modern accelerators are extraordinarily complex systems, and the gap between what a GPU is theoretically capable of and what your training loop actually extracts from it can be humbling.

But the randomness is somewhat illusory. Just as overfitting and underfitting in model training can be diagnosed systematically — if your training loss is miles below your validation loss, you know exactly what regime you’re in — deep learning performance can be approached with the same diagnostic logic. The trick is knowing which of three fundamental bottlenecks is actually throttling your system: compute, memory bandwidth, or overhead. Get that diagnosis right, and entire classes of optimisation either become obvious or get ruled out entirely.

The Factory Analogy That Actually Holds Up

One of the cleaner mental models for thinking about GPU execution treats the chip like a factory. Overhead is the instruction pipeline — the managers dispatching work orders to the factory floor. Memory bandwidth is the logistics operation — the trucks moving raw materials in and finished goods out. Compute is the factory itself, the actual production happening on the floor. You want the factory running flat out as much as possible. That means keeping the supply chain (memory bandwidth) fast enough to keep the floor fed, and the management layer (overhead) lean enough not to introduce delays.

This framing matters because it immediately tells you something actionable: if your factory is idle waiting for materials, buying a faster factory doesn’t help. Conversely, if your factory is genuinely compute-bound — your matrix multiplications are maxing out your tensor cores — then spending engineering time rewriting model logic in C++ to cut overhead is a distraction. The bottleneck is elsewhere. Applying this factory model to real deep learning performance diagnosis is one of the most clarifying habits an ML engineer can develop.
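The diagnostic logic above can be sketched as a back-of-the-envelope classifier. This is a minimal illustration, not a profiler: the function name, the 1.5 TB/s bandwidth default, and the 3x overhead threshold are all assumptions made for the example, and real diagnosis needs measured timings.

```python
def classify_bottleneck(flops, bytes_moved, measured_s,
                        peak_flops=312e12,        # headline A100 Tensor Core figure
                        peak_bytes_per_s=1.5e12,  # assumed DRAM bandwidth (not from the article)
                        overhead_factor=3.0):     # arbitrary threshold, an assumption
    """Guess which of the three bottlenecks dominates one operation."""
    compute_s = flops / peak_flops               # time if only arithmetic mattered
    memory_s = bytes_moved / peak_bytes_per_s    # time if only data movement mattered
    ideal_s = max(compute_s, memory_s)

    # If the measured time is far above both ideals, the factory floor and the
    # trucks are mostly idle, so the delay is attributed to overhead.
    if measured_s > overhead_factor * ideal_s:
        return "overhead"
    return "compute" if compute_s >= memory_s else "memory bandwidth"


# A large half-precision matmul (4096 x 4096 x 4096) that took 0.5 ms.
big_matmul = classify_bottleneck(
    flops=2 * 4096**3, bytes_moved=3 * 4096**2 * 2, measured_s=5e-4)

# A pointwise op over 2**24 float32 values: one read and one write per element.
pointwise = classify_bottleneck(
    flops=2**24, bytes_moved=2 * 4 * 2**24, measured_s=1e-4, peak_flops=19.5e12)

# A tiny op that still took 10 microseconds end to end.
tiny = classify_bottleneck(flops=1_000, bytes_moved=8_000, measured_s=1e-5)
```

Under these assumed numbers the three calls land in the compute, memory bandwidth, and overhead regimes respectively, which is the point: the same GPU can be in any of the three depending on the operation.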

Behind the bitter lesson is a legion of engineers keeping GPUs running efficiently. Image from Gwern — horace.io

Compute: The Ceiling You’re Probably Not Hitting

The goal of any serious deep learning performance effort is to maximise time in the compute-bound regime. A high-end Nvidia A100, for instance, is rated at 312 teraflops for half-precision matrix multiplication on its Tensor Cores. That’s what you paid for. In practice, most training jobs never get close to that figure for sustained periods.
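One way to see how far a job is from that rated figure is to compare achieved FLOPS against the peak. The sketch below assumes you already have a wall-clock time for a single matrix multiply; the helper names and the 1 ms timing are hypothetical, chosen only to illustrate the arithmetic.

```python
A100_TENSOR_CORE_PEAK = 312e12  # FLOPS, the rated figure quoted above


def matmul_flops(m, k, n):
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * k * n


def utilisation(flops, seconds, peak=A100_TENSOR_CORE_PEAK):
    """Fraction of peak throughput actually achieved."""
    return flops / (seconds * peak)


# Hypothetical measurement: a 4096 x 4096 x 4096 matmul that took 1 ms.
flops = matmul_flops(4096, 4096, 4096)
achieved = utilisation(flops, seconds=1e-3)
print(f"{achieved:.1%} of peak")
```

With that assumed timing the matmul reaches roughly 44% of peak, which is the kind of gap the rest of this framework is meant to explain.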

Why does compute get special status? Because it’s the one component you fundamentally can’t cheat. You can reduce overhead with better batching and kernel fusion. You can reduce memory bandwidth costs with smarter data layouts and operator fusion. But you can’t reduce the raw computation a model requires without changing the model itself — fewer parameters, different architecture, lower precision arithmetic. Everything else is about feeding the compute units efficiently enough to extract what they’re rated for.

There’s also a structural trend working against engineers here. Peak compute throughput, measured in FLOPS, has historically grown faster than memory bandwidth. Nvidia’s Tensor Cores have pushed matrix-multiply throughput dramatically upward, but DRAM bandwidth improvements have lagged. The factory keeps getting faster; the supply trucks don’t keep pace. That gap means the compute-bound ideal gets progressively harder to achieve over time, which is arguably part of why ML systems engineering has become such a specialised and valued discipline.
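A rough way to quantify that gap is the number of FLOPs a kernel must perform per byte of DRAM traffic just to keep the compute units busy. The figures below are approximate public datasheet numbers for two Nvidia generations and are an assumption of this sketch, not data from the article.

```python
# (peak Tensor Core FLOPS, DRAM bytes per second): approximate datasheet values,
# included as assumptions for illustration.
GPUS = {
    "V100": (125e12, 900e9),
    "A100": (312e12, 1555e9),
}


def flops_per_byte(peak_flops, bytes_per_s):
    """FLOPs needed per byte moved before compute, not bandwidth, is the limit."""
    return peak_flops / bytes_per_s


break_even = {name: flops_per_byte(*spec) for name, spec in GPUS.items()}
for name, ratio in break_even.items():
    print(f"{name}: about {ratio:.0f} FLOPs per byte to stay compute-bound")
```

On these numbers the break-even point rises from about 139 to about 201 FLOPs per byte in one generation, so a kernel that was comfortably compute-bound on the older chip can become bandwidth-bound on the newer one.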

The Matrix Multiply Monoculture — and What It Means

Modern ML accelerators are, at their core, matrix multiplication engines with general-purpose capabilities bolted on. Nvidia’s Tensor Cores are the obvious example: they deliver that headline 312 teraflops figure. But run any operation that isn’t a matrix multiply and you drop to around 19.5 teraflops on the same hardware. That’s a 16x throughput reduction for anything outside the happy path.

The immediate reaction is to worry about all the non-matmul operations a typical network runs: layer normalisation, activation functions, dropout, softmax. But analysis of real workloads makes this concern largely evaporate. Research examining FLOP distributions across transformer layers in BERT found that tensor contractions — essentially matrix multiplications — account for roughly 99.8% of all floating-point operations. Layer norms, pointwise activations, and similar ops? A collective rounding error in FLOP terms.
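A quick calculation shows why the slower non-matmul path barely matters if arithmetic were the only cost. The sketch uses the two throughput figures and the 0.2% FLOP share quoted above, and assumes, purely for illustration, that every operation runs at its peak rate with no memory or overhead cost.

```python
MATMUL_RATE = 312e12   # FLOPS on the Tensor Core path
OTHER_RATE = 19.5e12   # FLOPS for everything else


def non_matmul_time_share(non_matmul_flop_share):
    """Share of runtime spent outside matmuls in a purely compute-bound world."""
    matmul_time = (1.0 - non_matmul_flop_share) / MATMUL_RATE
    other_time = non_matmul_flop_share / OTHER_RATE
    return other_time / (matmul_time + other_time)


share = non_matmul_time_share(0.002)  # 0.2% of FLOPs are non-matmul
print(f"{share:.1%} of runtime")
```

Even running at the slower rate, 0.2% of the FLOPs would account for only about 3% of the runtime. When these operations take noticeably more than that in a real profile, arithmetic is not the explanation.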

So if non-matmul ops barely register in FLOP counts, why do they sometimes cause disproportionate slowdowns? This is where the factory analogy earns its keep again. It’s not the production cost of these operations that hurts — it’s the logistics. The data has to move from DRAM to the compute units and back again for every single kernel launch. That transit cost, measured in memory bandwidth, can dwarf the actual arithmetic involved. A layer norm might perform trivially few floating-point operations but still stall the pipeline because it’s thrashing memory bandwidth with small, poorly fused kernels. Recognising this pattern is a core skill in deep learning performance engineering.
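The logistics cost can be estimated by counting bytes rather than FLOPs. The sketch below models a chain of pointwise operations over float32 data, first as separate kernels and then as a single fused kernel. The 1.5 TB/s bandwidth and the one-FLOP-per-element cost are assumptions for illustration, and the function names are hypothetical.

```python
BYTES_PER_ELEMENT = 4       # float32
DRAM_BANDWIDTH = 1.5e12     # bytes per second, an assumed figure
NON_MATMUL_RATE = 19.5e12   # FLOPS, the non-matmul figure quoted earlier


def memory_seconds(num_elements, num_ops, fused):
    """Time to move data for a chain of pointwise ops.

    Each unfused kernel reads its input from DRAM and writes its output back;
    a fused kernel does one read and one write for the whole chain.
    """
    kernels = 1 if fused else num_ops
    bytes_moved = kernels * 2 * num_elements * BYTES_PER_ELEMENT
    return bytes_moved / DRAM_BANDWIDTH


def arithmetic_seconds(num_elements, num_ops):
    """Time for the arithmetic alone, assuming one FLOP per element per op."""
    return num_ops * num_elements / NON_MATMUL_RATE


n = 2**24  # about 16.8 million elements
unfused = memory_seconds(n, num_ops=3, fused=False)
fused = memory_seconds(n, num_ops=3, fused=True)
math_only = arithmetic_seconds(n, num_ops=3)
print(f"unfused {unfused * 1e6:.0f} us, fused {fused * 1e6:.0f} us, "
      f"arithmetic {math_only * 1e6:.1f} us")
```

Under these assumptions the data movement costs far more than the arithmetic in both cases, and fusing three operations cuts the traffic to a third. That is the sense in which a layer norm can be cheap in FLOPs and expensive in time.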

Memory Bandwidth: The Constraint You Can Actually Fight

Understanding memory bandwidth requires understanding where data lives during GPU execution. The compute units — the actual arithmetic hardware — work from fast, small on-chip memory (SRAM). This is where the calculations happen. But SRAM is expensive in silicon terms, so it’s tiny. The bulk of your model weights, activations, and gradients live in your GPU’s DRAM — the several gigabytes of memory that shows up in nvidia-smi and is responsible for those beloved

Source: https://horace.io/brrr_intro.html

Muhammad Zayn Emad