- Local LLM guardrails in Forge push an 8B model from 53% to 99% accuracy on complex multi-step agent tasks.
- Forge applies local LLM guardrails as composable middleware, a proxy server, or a full workflow runner — no lock-in.
- The top-performing config runs Ministral-3 8B on llama-server, scoring 86.5% across 26 real agentic scenarios.
- Forge’s proxy mode transparently improves any OpenAI-compatible client without changing a single line of client code.
- Local LLM guardrails in Forge push an 8B model from 53% to 99% accuracy on complex multi-step agent tasks.
- Forge applies local LLM guardrails as composable middleware, a proxy server, or a full workflow runner — no lock-in.
- The top-performing config runs Ministral-3 8B on llama-server, scoring 86.5% across 26 real agentic scenarios.
- Forge’s proxy mode transparently improves any OpenAI-compatible client without changing a single line of client code.
The Problem With Running Small Models on Real Work
Local LLM guardrails might sound like a niche engineering concern — until you’ve watched a self-hosted 8-billion-parameter model confidently call the wrong tool, return malformed JSON, and then spiral into a loop it can’t escape. That’s not a hypothetical. It’s the day-to-day reality for anyone trying to run capable agentic workflows on consumer or prosumer hardware. The gap between a well-prompted 8B model and a genuinely reliable one turns out to be enormous, and most orchestration frameworks quietly paper over it. Without local LLM guardrails in place, even a capable small model will fail unpredictably on multi-step tasks.
That’s the problem Antoine Zambelli’s open-source project Forge is designed to solve. The headline number is striking: Forge lifts an 8B local model from 53% to 99% task completion on agentic benchmarks. That’s not a marginal improvement — it’s the difference between a toy and a tool you’d actually trust in production.
What Local LLM Guardrails Actually Do
The word “guardrails” gets thrown around loosely in AI circles, often meaning little more than a content filter or a system prompt reminder. Forge uses it to mean something more mechanical and more useful. Its local LLM guardrails operate at the inference layer: they intercept model outputs, validate them against expected structure, rescue malformed tool calls through retry logic, and enforce that required workflow steps actually happen before a task is allowed to complete.
There are three concrete components doing the heavy lifting. Rescue parsing catches broken or incomplete tool-call JSON before it propagates downstream — small models frequently produce near-valid output that’s just a bracket or a quote away from working. Retry nudges re-prompt the model with targeted corrections rather than starting over from scratch, preserving context while steering the model back on track. And step enforcement acts as a workflow contract: if a required step hasn’t been executed, the loop won’t terminate, no matter what the model decides to output. Together, these local LLM guardrails address the most common failure modes that make small models unreliable in production.
Alongside the guardrail stack, Forge manages context with an awareness that most lightweight frameworks ignore entirely. It operates VRAM-aware token budgets — so it won’t silently overflow your GPU’s memory — and uses tiered compaction strategies to keep the most relevant message history in play as conversations extend. For anyone who’s run long agentic sessions on a 12GB or 24GB card and hit mysterious failures mid-task, this matters enormously.
Three Ways to Drop Forge Into Your Stack
One of Forge’s smartest design choices is that it doesn’t force you into a single integration pattern. There are three distinct modes, each targeting a different kind of developer. All three deliver local LLM guardrails, just at different levels of the stack.
The WorkflowRunner is the full-featured path. You define your tools using Pydantic schemas, pick a backend, and hand the entire agent lifecycle to Forge — system prompt management, tool execution, context compaction, guardrails, the lot. There’s also a SlotWorker extension that adds priority-queued access to a shared GPU inference slot with auto-preemption, which is the right move when you’re running multiple specialist agents on the same machine and need them to share hardware without stepping on each other.
The guardrails middleware mode is for teams who already have an orchestration loop they like and don’t want to rip it out. You keep control of the loop; Forge’s local LLM guardrails slot in as composable middleware, validating responses and rescuing bad tool calls without taking over the architecture.
The proxy server is arguably the most elegant option for pragmatic adopters. Run python -m forge.proxy and you get an OpenAI-compatible endpoint that sits between your existing client — opencode, Continue, aider, whatever — and your local model server. The client has no idea Forge is there. It just thinks it’s talking to a more capable model. Forge supports Ollama, llama-server (llama.cpp), Llamafile, and even Anthropic’s API as backends. This means local LLM guardrails can be applied to virtually any existing toolchain with zero client-side changes.
There’s a clever trick buried in the proxy’s implementation that’s worth understanding. When tool calls are present in a request, Forge automatically injects a synthetic respond tool. Instead of choosing between producing plain text and making a tool call — a decision small models handle poorly — the model always stays in tool-calling mode, calling respond(message="...") when it wants to produce text. The proxy strips this synthetic call before returning the response to the client, which sees a normal finish_reason: stop completion. The Forge documentation references ADR-013 for the full reasoning, and it’s a genuinely smart solution to one of the more insidious failure modes in small-model deployment.
How Forge’s Benchmark Numbers Hold Up
The 53%-to-99% headline is based on Forge’s own 26-scenario evaluation suite, and it’s worth being clear about what that measures. These are multi-step tool-calling workflows — the kind where a model has to chain several decisions correctly in sequence, not just answer a single question. The suite is split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end differentiation.
The best self-hosted configuration Forge has tested is Ministral-3 8B Instruct Q8_0 running on llama-server (the inference backend from llama.cpp). That combination scores 86.5% overall and 76% on the hardest advanced-reasoning scenarios — numbers that would have seemed implausible for an 8B model on local hardware even twelve months ago. All top-ten configurations in Forge’s leaderboard run on llama-server rather than Ollama, which is consistent with what developers have observed anecdotally: llama-server’s lower-level control over sampling and prompt formatting gives it an edge on demanding workloads. It’s also worth noting that every one of those top configurations benefits from the full local LLM guardrails stack being active.
The honest caveat is that these are internal benchmarks, not independent third-party evaluations. The scenarios are designed by the same team that built the guardrail system. That’s not a disqualifier — internal evals are standard practice, and 26 scenarios is a reasonable coverage set — but it means the numbers should be treated as directional rather than absolute. The methodology is open and reproducible, which helps: anyone can clone the repo and run python -m tests.eval.eval_runner against their own hardware.
Why This Approach Matters Beyond One Project
Forge is a focused engineering tool, not a platform play. But what it demonstrates is important for the broader self-hosted AI ecosystem. The conventional wisdom has been that small local models are simply too unreliable for serious agentic use — that you need GPT-4-class API models to get consistent tool-calling behavior. Forge challenges that assumption directly, and the mechanism it uses is instructive.
The reliability gains don’t come from a better model. They come from better infrastructure around the model. Local LLM guardrails — rescue parsing, retry nudges, step enforcement — are engineering solutions to what looked like model capability problems. That’s a meaningful reframe. It suggests that a significant chunk of the “small models can’t do agentic tasks” problem is actually a “small models need better scaffolding” problem.
As more developers push workloads onto local hardware — for privacy, cost, latency, or regulatory reasons — the demand for exactly this kind of reliability middleware is only going to grow. Forge isn’t the only project working in this space, but it’s one of the more practically minded ones, and its proxy mode in particular lowers the barrier to adoption to almost nothing. The question isn’t whether local agentic AI needs local LLM guardrails. At this point, that’s settled. The question is which implementation earns developers’ trust — and Forge is making a credible early case.

