- Tiny-vLLM is a free, open-source LLM inference engine built in C++ and CUDA, designed as a learning resource.
- The LLM inference engine implements FlashAttention-like softmax, PagedAttention, and both static and continuous batching.
- The project loads a real Llama 3.2 1B Instruct model from Safetensors and runs a full forward pass on GPU.
- Creator Jędrzej Maczan plans future courses on ML compilers and alternative attention mechanisms if interest grows.
An LLM Inference Engine You Actually Build Yourself
Most people interacting with large language models never think about what’s happening underneath — the layers of systems software that sit between a model’s weights file and a coherent response. A new open-source project called tiny-vLLM wants to change that. It’s a fully functional LLM inference engine written in C++ and CUDA, and it comes paired with a detailed course walking you through every decision, every kernel, and every line of math from the ground up.
The project comes from developer Jędrzej Maczan, who describes tiny-vLLM as a “younger and smaller sibling” of vLLM, the widely used open-source inference framework backed by a16z and used in production at companies across the AI stack. Where vLLM is a production-grade system with a large contributor base, tiny-vLLM is deliberately small: built for understanding, not deployment. The idea is that by implementing an LLM inference engine yourself, you internalize what all those abstractions in PyTorch and Hugging Face are actually doing.
Why C++ and CUDA? The Case for Going Low-Level
Python dominates AI research and development for good reasons — it’s fast to write, easy to read, and the ecosystem is enormous. But Python is not the language you reach for when you need to squeeze every last millisecond out of a GPU. Maczan makes the case clearly: if you want an LLM inference engine that’s fast and can handle concurrent requests efficiently, you want C++ for the host logic and CUDA for the GPU compute.
That’s not a controversial opinion. It’s why vLLM itself offloads its most performance-sensitive operations to custom CUDA kernels, and it’s why companies like Groq, Anyscale, and even OpenAI have invested heavily in custom low-level inference stacks. The gap between a naive Python implementation and an optimized CUDA implementation of something like matrix multiplication isn’t marginal — it can be orders of magnitude.
Maczan’s framing is worth quoting directly: LLMs are “mostly about multiplying the matrices, which boils down to computing dot products of two vectors, for many numbers and for many vectors.” This is both mathematically accurate and a useful mental model. Transformer-based models spend the vast majority of their compute budget on matrix multiplications — in the attention mechanism, in the feed-forward layers, in the projection matrices. If you can make those fast, everything else follows.
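As a rough illustration of that mental model, here is a minimal C++ sketch (not code from tiny-vLLM) that computes a matrix product as a grid of independent dot products. The function names `dot_row_col` and `matmul_naive` and the flat row-major layout are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of row i of A (M x K) with column j of B (K x N).
// Both matrices are stored row-major in flat vectors.
float dot_row_col(const std::vector<float>& a, const std::vector<float>& b,
                  std::size_t i, std::size_t j,
                  std::size_t k_dim, std::size_t n_dim) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < k_dim; ++k) {
        acc += a[i * k_dim + k] * b[k * n_dim + j];
    }
    return acc;
}

// C = A * B, computed as M * N independent dot products.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t m, std::size_t k_dim, std::size_t n) {
    assert(a.size() == m * k_dim && b.size() == k_dim * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            c[i * n + j] = dot_row_col(a, b, i, j, k_dim, n);
        }
    }
    return c;
}
```

Every output element is independent of the others, which is why this workload maps so well onto thousands of GPU threads.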
What the LLM Inference Engine Actually Implements
Tiny-vLLM isn’t a toy. It loads a real model — Meta’s Llama 3.2 1B Instruct — from the Safetensors format, which has become the de facto standard for distributing model weights safely (no arbitrary code execution, unlike older pickle-based formats). From there, it runs a complete forward pass, handling both the prefill phase (processing the entire input prompt in parallel) and the decode phase (generating tokens one at a time).
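For a sense of why Safetensors is easy to load from C++, the file begins with an 8-byte little-endian unsigned integer giving the length of a JSON header, followed by the header itself and then the raw tensor bytes. The sketch below reads that header from an in-memory buffer. The helper names are hypothetical, and a real loader would go on to parse the JSON for each tensor’s dtype, shape, and data offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reads the 8-byte little-endian header length stored at the start of the file.
std::uint64_t read_header_length(const std::vector<std::uint8_t>& file_bytes) {
    assert(file_bytes.size() >= 8);
    std::uint64_t n = 0;
    for (int i = 0; i < 8; ++i) {
        n |= static_cast<std::uint64_t>(file_bytes[i]) << (8 * i);
    }
    return n;
}

// Returns the JSON header as a string.
// The raw tensor data starts at byte offset 8 + header length.
std::string read_header_json(const std::vector<std::uint8_t>& file_bytes) {
    const std::uint64_t n = read_header_length(file_bytes);
    assert(file_bytes.size() - 8 >= n);
    const auto first = file_bytes.begin() + 8;
    const auto last = first + static_cast<std::ptrdiff_t>(n);
    return std::string(first, last);
}
```

Because the payload is only a described byte buffer, loading weights amounts to copying the right byte ranges to the GPU, with no code in the file to execute.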
The list of implemented features reads like a survey of modern inference optimization techniques:
- KV cache: stores previously computed key and value tensors so the model doesn’t recompute them on every generation step
- Static and continuous batching: static batching fixes the set of requests in a batch up front and runs them together until the longest one finishes; continuous batching, popularized by Orca and later vLLM, lets new requests join a running batch as soon as others finish, which substantially improves GPU utilization
- Online softmax and FlashAttention-like attention: avoids materializing the full attention matrix in memory, a critical optimization for long contexts
- PagedAttention: vLLM’s signature contribution, which manages the KV cache in fixed-size blocks (pages) that need not be contiguous in memory, sharply reducing fragmentation and enabling much higher throughput
- Grouped Query Attention (GQA): the attention variant used in Llama 3, which shrinks the KV cache and the memory traffic needed to read it by sharing key/value heads across multiple query heads
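To make the online softmax item concrete, here is a minimal single-query sketch in C++. It is an illustration of the general technique, not tiny-vLLM’s kernel: it keeps a running maximum and a running sum, and rescales the accumulated output whenever the maximum changes, so the normalized probabilities are never stored. The name `attend_online` and the flat row-major `values` layout are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One-pass attention output for a single query: softmax(scores) . values.
// scores[t] is the already-scaled q.k_t score; values is T x head_dim, row-major.
std::vector<float> attend_online(const std::vector<float>& scores,
                                 const std::vector<float>& values,
                                 std::size_t head_dim) {
    assert(!scores.empty());
    assert(values.size() == scores.size() * head_dim);
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;
    std::vector<float> acc(head_dim, 0.0f);
    for (std::size_t t = 0; t < scores.size(); ++t) {
        const float new_max = std::max(running_max, scores[t]);
        // Rescale everything accumulated so far to the new maximum.
        const float rescale = std::exp(running_max - new_max);
        const float weight = std::exp(scores[t] - new_max);
        running_sum = running_sum * rescale + weight;
        for (std::size_t d = 0; d < head_dim; ++d) {
            acc[d] = acc[d] * rescale + weight * values[t * head_dim + d];
        }
        running_max = new_max;
    }
    // Normalize once at the end.
    for (std::size_t d = 0; d < head_dim; ++d) {
        acc[d] /= running_sum;
    }
    return acc;
}
```

A FlashAttention-style kernel applies the same rescaling per tile of keys rather than per key, which is what lets it keep the working set in fast on-chip memory.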
The course also covers RMSNorm with parallel reduction in CUDA, Rotary Position Embeddings (RoPE), SiLU activations, causal masking, and the use of cuBLAS’s cublasGemmEx for high-performance matrix multiplication. There’s even a dedicated section on the column-major to row-major transposition trick, a subtle but important detail when bridging the column-major layout that cuBLAS assumes with the row-major layout that most ML frameworks use.
That last point is the kind of thing that only shows up when you’re working at this level. High-level frameworks hide it entirely. Building an LLM inference engine from scratch means you have to confront it directly.
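The layout trick can be sketched on the CPU. A row-major M×N buffer, read as column-major, is the N×M transpose, so asking a column-major GEMM for Bᵀ·Aᵀ writes exactly the bytes of a row-major A·B. In the sketch below, `gemm_col_major` is a hypothetical stand-in for a column-major BLAS routine (it is not the cuBLAS API), and `matmul_row_major` shows the operand swap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a column-major GEMM without transposes:
// C (m x n) = A (m x k) * B (k x n), all column-major with leading dimensions.
void gemm_col_major(std::size_t m, std::size_t n, std::size_t k,
                    const float* a, std::size_t lda,
                    const float* b, std::size_t ldb,
                    float* c, std::size_t ldc) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

// Row-major C (M x N) = A (M x K) * B (K x N), using only the column-major GEMM.
// Swapping the operands computes C^T = B^T * A^T in column-major terms.
std::vector<float> matmul_row_major(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    gemm_col_major(n, m, k, b.data(), n, a.data(), k, c.data(), n);
    return c;
}
```

No data is actually transposed or copied: only the argument order and the leading dimensions change.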
The Pedagogical Approach: Learn by Deriving
What sets tiny-vLLM apart from most “build X from scratch” projects is the explicit commitment to deriving ideas from first principles rather than just implementing them. Maczan describes it as JIT learning — just-in-time, filling in the gaps as they become relevant rather than front-loading theory before touching code. For engineers who learn best by doing, this is a genuinely effective approach.
The course opens with an explanation of what a model actually is at the physical level: a file containing floating-point numbers that represent learned weights. It then walks through the full lifecycle — design, implementation, training, and finally serving. This framing matters because it contextualizes inference correctly: you’re not running a model, you’re running a program that implements the model’s architecture and loads its weights at startup. The distinction sounds subtle but it clarifies why an LLM inference engine exists as a separate engineering problem from training.
There’s also an honest acknowledgment that mistakes will be made along the way. That’s not a disclaimer — it’s a pedagogical choice. Debugging a broken CUDA kernel teaches you more about memory coalescing and thread divergence than reading a textbook chapter on either topic.
The Bigger Picture: Attention’s Complexity Problem
Maczan touches on something that deserves more attention (no pun intended). The standard attention mechanism has quadratic complexity: O(n²·d) for a sequence of n tokens and head dimension d. For short prompts this is fine. For long-context models handling tens or hundreds of thousands of tokens, it becomes a serious bottleneck, both in compute time and memory.
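A quick back-of-the-envelope calculation shows how fast the memory side grows if the full score matrix is materialized. The helper below is hypothetical, and the 32 heads and 2-byte (FP16 or BF16) scores used in the note after it are assumed values chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold one layer's full attention score matrices:
// one seq_len x seq_len matrix per head.
std::uint64_t score_matrix_bytes(std::uint64_t seq_len,
                                 std::uint64_t num_heads,
                                 std::uint64_t bytes_per_score) {
    assert(seq_len > 0 && num_heads > 0 && bytes_per_score > 0);
    return seq_len * seq_len * num_heads * bytes_per_score;
}
```

With those assumptions, `score_matrix_bytes(8192, 32, 2)` comes to 4 GiB of scores per layer, and doubling the context to 16,384 tokens quadruples that to 16 GiB.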
FlashAttention, developed by Tri Dao and colleagues at Stanford and now maintained under the Dao-AILab organization, addressed the memory side by making the attention computation IO-aware: it still computes exact attention with quadratic FLOPs, but it processes the scores in tiles that stay in fast on-chip memory instead of writing the full n×n matrix out to GPU memory. PagedAttention, introduced by the vLLM team at UC Berkeley, tackled KV cache memory management. Sub-quadratic alternatives such as linear attention, RetNet, and Mamba-style state space models go further, attempting to break the quadratic barrier entirely.
Tiny-vLLM implements the FlashAttention-like approach and PagedAttention, which means it’s already incorporating two of the most practically important advances in inference efficiency from the past few years. For a learning project, that’s ambitious — and it means the concepts you internalize here are directly applicable to understanding how a production LLM inference engine like vLLM, TensorRT-LLM, or SGLang actually works.
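The core bookkeeping behind PagedAttention can be sketched in a few lines of host-side C++. This is an assumption-laden illustration rather than tiny-vLLM’s or vLLM’s actual data structures: `BlockAllocator`, `Sequence`, `append_token`, `physical_slot`, and the block size of 4 tokens are all hypothetical. Each sequence keeps a block table mapping logical blocks to physical ones, and a new block is claimed only when the previous one is full.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockSize = 4;  // tokens per KV block (assumed value)

// Hands out physical block ids from a free list, in no particular order.
struct BlockAllocator {
    std::vector<int> free_blocks;
    int allocate() {
        assert(!free_blocks.empty());
        const int id = free_blocks.back();
        free_blocks.pop_back();
        return id;
    }
    void release(int id) { free_blocks.push_back(id); }
};

// Per-sequence state: logical block i lives in physical block block_table[i].
struct Sequence {
    std::vector<int> block_table;
    std::size_t num_tokens = 0;
};

// Reserves a cache slot for one more token, allocating a new block
// only when the last one is full.
void append_token(Sequence& seq, BlockAllocator& alloc) {
    if (seq.num_tokens % kBlockSize == 0) {
        seq.block_table.push_back(alloc.allocate());
    }
    ++seq.num_tokens;
}

// Maps a logical token position to a slot index in the physical KV pool.
std::size_t physical_slot(const Sequence& seq, std::size_t token_idx) {
    assert(token_idx < seq.num_tokens);
    const std::size_t block =
        static_cast<std::size_t>(seq.block_table[token_idx / kBlockSize]);
    return block * kBlockSize + token_idx % kBlockSize;
}
```

Because a sequence’s blocks can sit anywhere in the pool, at most one partially filled block per sequence is wasted, and finished sequences can hand their blocks straight back to the free list.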
Maczan has floated the idea of follow-up courses on ML compilers and alternative attention mechanisms if the community responds well. With every API provider competing on latency and cost, interest in faster and cheaper inference is high, so there’s a real audience for that kind of deep technical education. Building or studying an LLM inference engine at this level of detail gives practitioners a meaningful edge as the tooling and the theory continue to move fast. Projects like tiny-vLLM fill a gap that most educational content, university curricula, and corporate training programs haven’t closed yet.


