- MCP server optimization can cut a single rich Jira ticket from about 67,000 tokens to under 4,000, a 94% drop.
- Most MCP servers waste tokens through four common anti-patterns that MCP server optimization directly addresses.
- The ultra-mcp-toolkit provides trim registries, consolidated dispatchers, and a CLI bridge to cut overhead fast.
- Tool listing bloat alone (for example, 80 endpoints exposed as 80 separate tools) costs around 10,000 tokens per conversation before the user types a single word.
The Hidden Cost Nobody Talks About
MCP server optimization isn’t the most glamorous corner of AI development, but it might be the most expensive one to ignore. As more engineering teams pipe AI agents into enterprise tools — Jira, Confluence, GitHub, Linear, Salesforce — the token bills are quietly ballooning. And in most cases, it’s not the AI doing useful work that’s burning the budget. It’s noise.
Scott Lepp, a developer who’s been building production MCP integrations, put it bluntly in a recent write-up: first-generation MCP servers were essentially REST API wrappers with a thin layer of AI gloss on top. That architecture made sense as a starting point. It was fast to build, easy to reason about, and got agents talking to real systems quickly. The problem is that REST APIs were never designed to feed language models. They were designed for applications that parse JSON deterministically. An LLM doesn’t need icon URLs, nested self-referencing links, or three different representations of the same status field. It just needs the information.
The numbers Lepp published from a live Jira Cloud instance are hard to argue with. A single rich Jira ticket returned 270 KB of raw JSON, roughly 67,000 tokens. After applying trim projections, the same ticket came back as 15.5 KB, or around 3,900 tokens: the same ticket, 94% fewer tokens. The full payload is still written to disk, and the agent can fetch it on demand if it genuinely needs the detail. Most of the time, it doesn’t.
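The headline percentage is easy to verify from the published figures. This short sketch (the helper name is our own) computes the reduction from both the token counts and the byte sizes; the two agree at roughly 94%.

```typescript
// Percentage reduction from a raw size to a trimmed size (works for tokens or bytes).
function reductionPercent(before: number, after: number): number {
  if (before <= 0) throw new RangeError("before must be positive");
  return (1 - after / before) * 100;
}

// Figures quoted above for one rich Jira ticket.
const tokenCut = reductionPercent(67000, 3900); // tokens
const byteCut = reductionPercent(270, 15.5); // kilobytes

console.log(`tokens: ${tokenCut.toFixed(1)}% fewer, bytes: ${byteCut.toFixed(1)}% fewer`);
// tokens: 94.2% fewer, bytes: 94.3% fewer
```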
Four Anti-Patterns Draining Your Context Window
Before getting to solutions, it’s worth understanding why MCP server optimization has become necessary in the first place. Lepp identifies four patterns that show up across almost every enterprise MCP integration he’s seen.
The first is returning raw API JSON. Production APIs carry a staggering amount of metadata that agents will never use — schema hints, expand flags, icon assets, self-referential URL structures. Passing all of that straight to the model isn’t just wasteful, it actively increases the chance of hallucinations. The more irrelevant context a model has to wade through, the more likely it is to latch onto something it shouldn’t. Effective MCP server optimization starts here, at the point of response shaping.
The second problem is one-tool-per-endpoint design. A typical CRM might expose 80 endpoints. Wire each one up as its own MCP tool and you’ve just added roughly 10,000 tokens to every single conversation — before the user has typed a word. That’s not a theoretical concern. That’s money leaving your account on every session initialization.
Third: asking the LLM to handle filtering and pagination. Models can’t reliably step through large paginated structures, and the logic required to even attempt it costs additional tokens. Filtering belongs on the server, full stop. This is another area where deliberate MCP server optimization pays immediate dividends.
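A minimal sketch of server-side filtering and pagination, assuming a hypothetical upstream that returns pages with a `nextStart` cursor (the `Page` shape and `collectFiltered` name are illustrative, not the toolkit's API). The server walks the pages, applies the filter, and stops at a limit, so the model receives one small, already-filtered list instead of raw pages it would have to step through itself.

```typescript
// A page of results as a paginated upstream API might return it (shape is hypothetical).
interface Page<T> {
  items: T[];
  nextStart: number | null; // null when there are no more pages
}

// Walks pages on the server, keeps items that match, and stops once `limit` matches
// are collected. The model only ever sees the returned slice.
function collectFiltered<T>(
  fetchPage: (start: number) => Page<T>,
  predicate: (item: T) => boolean,
  limit: number,
): T[] {
  const matches: T[] = [];
  let start: number | null = 0;
  while (start !== null && matches.length < limit) {
    const page: Page<T> = fetchPage(start);
    for (const item of page.items) {
      if (predicate(item)) {
        matches.push(item);
        if (matches.length === limit) break;
      }
    }
    start = page.nextStart;
  }
  return matches;
}

// Usage with a fake in-memory upstream of 25 issues served 10 per page.
type FakeIssue = { key: string; open: boolean };
const upstream: FakeIssue[] = Array.from({ length: 25 }, (_, i) => ({
  key: `PROJ-${i + 1}`,
  open: i % 3 === 0,
}));
const fakeFetch = (start: number): Page<FakeIssue> => {
  const next = start + 10;
  return { items: upstream.slice(start, next), nextStart: next < upstream.length ? next : null };
};

const openIssues = collectFiltered(fakeFetch, (issue) => issue.open, 5);
console.log(openIssues.map((issue) => issue.key).join(", "));
// PROJ-1, PROJ-4, PROJ-7, PROJ-10, PROJ-13
```

The limit check lives on the server too, so the loop stops fetching as soon as it has enough results rather than paging through the entire dataset.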
Fourth, and perhaps most insidious: denylist-based trimming. Deleting specific fields like iconUrl feels tidy until the upstream API adds a new noisy field next quarter and your trim logic silently lets it through. Allowlists are the only stable contract here — you define exactly what the model sees, and anything new defaults to dropped.
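The difference between the two strategies shows up as soon as the upstream payload changes. In this sketch (function names and the `avatarUrls` field are illustrative), a denylist written against last quarter's payload removes `iconUrl` but lets a newly added field straight through, while an allowlist keeps only what it names.

```typescript
type Json = Record<string, unknown>;

// Denylist: delete known-noisy fields. Anything the API adds later passes through.
function trimByDenylist(raw: Json, deny: readonly string[]): Json {
  const out: Json = { ...raw };
  for (const field of deny) delete out[field];
  return out;
}

// Allowlist: copy only named, own fields. Anything new is dropped by default.
function trimByAllowlist(raw: Json, allow: readonly string[]): Json {
  const out: Json = {};
  for (const field of allow) {
    if (Object.prototype.hasOwnProperty.call(raw, field)) out[field] = raw[field];
  }
  return out;
}

// This quarter the API also started sending avatarUrls (hypothetical payload).
const issueV2: Json = {
  key: "PROJ-1",
  summary: "Login fails on Safari",
  iconUrl: "https://example.com/bug.svg",
  avatarUrls: { "48x48": "https://example.com/a.png" },
};

const viaDeny = trimByDenylist(issueV2, ["iconUrl"]);
const viaAllow = trimByAllowlist(issueV2, ["key", "summary"]);
console.log(Object.keys(viaDeny).join(",")); // key,summary,avatarUrls
console.log(Object.keys(viaAllow).join(",")); // key,summary
```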
MCP Server Optimization in Practice: The ultra-mcp-toolkit Approach
Lepp built the ultra-mcp-toolkit to codify these patterns into something reusable. The core idea is simple: instead of shipping raw API responses to the model, you register a trim function once, and every response routes through it automatically. This is MCP server optimization reduced to its most practical form.
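The "register once, route everything through it" idea can be sketched as a small registry keyed by response type. `TrimRegistry`, `register`, and `apply` are hypothetical names, not the toolkit's actual API. One design choice worth noting: this version fails closed, throwing when no trim exists, so an unregistered response type can never leak raw JSON to the model.

```typescript
type Trim = (raw: unknown) => unknown;

// Hypothetical registry: one trim function per response type, registered once at boot.
class TrimRegistry {
  private readonly trims = new Map<string, Trim>();

  register(kind: string, trim: Trim): void {
    if (this.trims.has(kind)) throw new Error(`trim already registered for ${kind}`);
    this.trims.set(kind, trim);
  }

  // Every handler returns through here, so nothing reaches the model untrimmed.
  apply(kind: string, raw: unknown): unknown {
    const trim = this.trims.get(kind);
    if (!trim) throw new Error(`no trim registered for ${kind}`);
    return trim(raw);
  }
}

const registry = new TrimRegistry();
registry.register("issue", (raw) => {
  const r = raw as { key: string; fields: { summary: string } };
  return { key: r.key, summary: r.fields.summary };
});

// A tool handler never shapes its own output; it hands the raw payload to the registry.
function handleGetIssue(raw: unknown): string {
  return JSON.stringify(registry.apply("issue", raw));
}

console.log(handleGetIssue({ key: "PROJ-7", fields: { summary: "Fix SSO", iconUrl: "x" }, expand: "names" }));
// {"key":"PROJ-7","summary":"Fix SSO"}
```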
The toolkit uses allowlist projections via a pick utility. For a Jira issue, that means the model sees a key, a summary, a status, a priority, and an assignee. That’s it. Everything else lives on disk as a content-addressed reference the agent can dereference if it actually needs it. The concept borrows from how good cache architectures work — separate the hot path from the cold path, and don’t load what you don’t need.
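Here is a sketch of the hot/cold split under stated assumptions. Rather than reproduce the toolkit's `pick` helper, whose exact signature isn't given here, it spells out the five-field allowlist by hand. The full payload goes into a content-addressed store keyed by its SHA-256, with a `Map` standing in for the disk. `RawIssue`, `storeFull`, `dereference`, and `projectIssue` are hypothetical names, and the issue shape is a simplified Jira-like structure.

```typescript
import { createHash } from "node:crypto";

// Simplified Jira-like issue; only the fields this sketch reads are typed precisely.
interface RawIssue {
  key: string;
  fields: {
    summary: string;
    status: { name: string };
    priority: { name: string };
    assignee: { displayName: string } | null;
    [noise: string]: unknown;
  };
  [noise: string]: unknown;
}

// Cold path: the full payload stored under the hash of its bytes (Map stands in for disk).
const coldStore = new Map<string, string>();

function storeFull(payload: unknown): string {
  const body = JSON.stringify(payload);
  const ref = `sha256:${createHash("sha256").update(body).digest("hex")}`;
  coldStore.set(ref, body); // identical payloads map to the same ref
  return ref;
}

// The agent calls this only when it actually needs the detail.
function dereference(ref: string): unknown {
  const body = coldStore.get(ref);
  if (body === undefined) throw new Error(`unknown ref: ${ref}`);
  return JSON.parse(body);
}

// Hot path: an explicit allowlist of five fields, plus a ref to the cold copy.
function projectIssue(raw: RawIssue) {
  return {
    key: raw.key,
    summary: raw.fields.summary,
    status: raw.fields.status.name,
    priority: raw.fields.priority.name,
    assignee: raw.fields.assignee?.displayName ?? null,
    ref: storeFull(raw),
  };
}

const rawIssue: RawIssue = {
  key: "PROJ-42",
  self: "https://example.atlassian.net/rest/api/3/issue/10042",
  expand: "renderedFields,names,schema",
  fields: {
    summary: "Checkout times out under load",
    status: { name: "In Progress" },
    priority: { name: "High" },
    assignee: { displayName: "Ada Lovelace" },
    issuetype: { iconUrl: "https://example.atlassian.net/icons/bug.svg" },
  },
};

const hot = projectIssue(rawIssue);
console.log(hot.key, hot.status, hot.priority, hot.assignee);
```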
The consolidated dispatcher pattern addresses the tool-listing problem. Instead of 80 individual tools, you expose around 15, each accepting an action argument that routes to the correct operation. The tool listing drops from roughly 10,000 tokens to about 100. Per conversation. Every conversation. That’s not a rounding error; at scale, it’s the difference between a manageable API bill and a CFO asking uncomfortable questions.
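A minimal dispatcher sketch: one tool entry per resource, shaped like an MCP tool listing entry (`name`, `description`, `inputSchema`), with an `action` enum that routes to handlers. `makeDispatcher`, the `jira_issue` tool name, and the stub handlers are hypothetical. The point is that adding an action extends one enum instead of adding another full tool description to every session.

```typescript
type Args = Record<string, unknown>;
type Handler = (args: Args) => unknown;

// Builds one consolidated tool: a single listing entry plus a router keyed by `action`.
function makeDispatcher(tool: string, actions: Record<string, Handler>) {
  const names = Object.keys(actions);
  return {
    // What appears in the tool listing: one entry instead of one per endpoint.
    descriptor: {
      name: tool,
      description: `${tool} operations. action: ${names.join(" | ")}`,
      inputSchema: {
        type: "object",
        properties: { action: { type: "string", enum: names } },
        required: ["action"],
      },
    },
    call(args: Args): unknown {
      const action = args["action"];
      const handler =
        typeof action === "string" && Object.prototype.hasOwnProperty.call(actions, action)
          ? actions[action]
          : undefined;
      if (handler === undefined) {
        throw new Error(`unknown action ${String(action)}; expected one of ${names.join(", ")}`);
      }
      return handler(args);
    },
  };
}

// Usage: three stub operations behind one tool.
const issues = makeDispatcher("jira_issue", {
  get: (a) => ({ key: a["key"], summary: "stub" }),
  search: (a) => ({ jql: a["jql"], results: [] }),
  transition: (a) => ({ key: a["key"], to: a["to"] }),
});

console.log(issues.descriptor.description); // jira_issue operations. action: get | search | transition
console.log(JSON.stringify(issues.call({ action: "get", key: "PROJ-3" }))); // {"key":"PROJ-3","summary":"stub"}
```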
For shell-capable agents like Claude Code or Cursor, there’s an even more aggressive option: a CLI bridge that hands the agent a single MCP tool pointing to a bundled command-line interface. The agent drives the entire API from its shell. The tool listing stays at exactly one tool, regardless of how many operations the underlying API supports. It’s an elegant inversion — instead of the MCP layer growing with the API surface, it stays flat forever.
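The bridge's exact interface isn't described here, so the following is only a hypothetical sketch of the idea. The MCP side exposes a single tool that returns usage text, and a bundled CLI maps `<resource> <action> --flag value` onto an operation table. `operations`, `runCli`, `bridgeTools`, and the `cli` command name are all illustrative. Growing the API means growing the table, while the tool listing stays at one entry.

```typescript
type Flags = Record<string, string>;
type Operation = (flags: Flags) => unknown;

// Hypothetical operation table inside the bundled CLI.
const operations: Record<string, Operation> = {
  "issue get": (f) => ({ key: f["key"] }),
  "issue search": (f) => ({ jql: f["jql"], results: [] }),
  "pr diff": (f) => ({ repo: f["repo"], id: f["id"] }),
};

// Parses `<resource> <action> [--flag value ...]` and runs the matching operation.
function runCli(argv: readonly string[], ops: Record<string, Operation>): unknown {
  const [resource, action, ...rest] = argv;
  if (resource === undefined || action === undefined) {
    throw new Error("usage: <resource> <action> [--flag value ...]");
  }
  const opKey = `${resource} ${action}`;
  const op = Object.prototype.hasOwnProperty.call(ops, opKey) ? ops[opKey] : undefined;
  if (op === undefined) throw new Error(`unknown operation: ${opKey}`);

  const flags: Flags = {};
  for (let i = 0; i < rest.length; i += 2) {
    const flag = rest[i];
    const value = rest[i + 1];
    if (!flag.startsWith("--") || value === undefined) throw new Error(`bad flag near ${flag}`);
    flags[flag.slice(2)] = value;
  }
  return op(flags);
}

// The only MCP tool: it tells the agent how to drive the CLI from its own shell.
const bridgeTools = [
  {
    name: "api_cli_usage",
    description: "Returns how to drive the bundled CLI from your shell.",
    handler: (): string =>
      Object.keys(operations).map((k) => `cli ${k} [--flag value ...]`).join("\n"),
  },
];

console.log(bridgeTools.length); // 1
console.log(JSON.stringify(runCli(["issue", "get", "--key", "PROJ-9"], operations))); // {"key":"PROJ-9"}
```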
Why This Matters Beyond Jira
It would be easy to read this as a Jira-specific optimization story. It isn’t. The same MCP server optimization principles apply to any enterprise API you’re wrapping for an AI agent. Salesforce, ServiceNow, Notion, GitHub — they all have the same problem. They were built for human-readable or application-consumable JSON, not for language models that bill by the token and have finite context windows.
The broader shift happening here is a move from access-first to efficiency-first MCP design. The first wave of MCP servers proved the concept: yes, you can give agents access to your tools. The second wave has to prove the economics. As organizations start running agents at scale — hundreds of concurrent sessions, thousands of API calls per day — the token efficiency of the underlying MCP layer becomes a real operational cost, not just a technical footnote.
OpenAI, Anthropic, and Google all price their most capable models at rates where a 94% reduction in token consumption genuinely moves the needle on infrastructure costs. Because the raw Jira ticket above is roughly 17 times larger than its trimmed projection, a team running GPT-4o or Claude Opus against a poorly optimized Jira MCP at scale could easily be spending more than ten times what’s necessary on context alone. That makes MCP server optimization a direct line item in any serious AI infrastructure budget.
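To see how the per-ticket numbers translate into spend, this sketch multiplies token counts by a price per million input tokens. The price and daily volume are placeholders, not quoted rates or measured traffic; substitute your own model's published price and your own workload.

```typescript
// Input-token cost in dollars for a given price per million tokens.
function inputCostUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1e6) * usdPerMillionTokens;
}

// Placeholders only: substitute your model's rate and your real volume.
const USD_PER_MILLION_INPUT = 3;
const TICKET_READS_PER_DAY = 10000;

const rawDaily = inputCostUsd(67000 * TICKET_READS_PER_DAY, USD_PER_MILLION_INPUT);
const trimmedDaily = inputCostUsd(3900 * TICKET_READS_PER_DAY, USD_PER_MILLION_INPUT);

console.log(
  `raw $${rawDaily.toFixed(2)}/day, trimmed $${trimmedDaily.toFixed(2)}/day, ${(rawDaily / trimmedDaily).toFixed(1)}x`,
);
// raw $2010.00/day, trimmed $117.00/day, 17.2x
```

The absolute dollar figures move with the placeholders, but the roughly 17x ratio does not: it depends only on the 67,000 versus 3,900 token counts.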
Getting Started With the Toolkit
The practical entry point is straightforward. Install the package via npm, then run the built-in skill installer. For teams using Claude Code, the skill auto-loads during development and walks through manifest design, trim configuration, dispatcher wiring, and server boot. For everyone else — Cursor, Codex CLI, Aider, Zed — the same patterns are documented in the project’s AGENTS.md and can be fed directly to whatever agent you’re working with.
The toolkit ships with an operation manifest system that acts as a single source of truth for all endpoints, a page cache for stable versioned resources like PR diffs and Confluence page versions, pooled HTTP transport with proper 429-handling, and atomic streaming downloads with SHA-256 verification. These aren’t nice-to-haves for production systems. They’re table stakes for any serious MCP server optimization effort.
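Of those features, the atomic, verified download is the easiest to get subtly wrong. Here is a minimal sketch of the general technique, not the toolkit's implementation. It hashes chunks as they are written to a temp file beside the target, checks the SHA-256, and only then renames the file into place, so readers never see a partial or corrupted file. It uses synchronous Node `fs` calls over an in-memory iterable of chunks for brevity; a real streaming download would feed network chunks through the same hash, write, and rename sequence. `writeVerified` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";
import { closeSync, fsyncSync, mkdtempSync, openSync, readFileSync, readdirSync, renameSync, rmSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, dirname, join } from "node:path";

function writeVerified(target: string, chunks: Iterable<Uint8Array>, expectedSha256: string): void {
  // Temp file in the same directory, so the final rename stays on one filesystem.
  const tmp = join(dirname(target), `.${basename(target)}.${process.pid}.partial`);
  const hash = createHash("sha256");
  const fd = openSync(tmp, "w");
  let complete = false;
  try {
    for (const chunk of chunks) {
      hash.update(chunk); // hash each chunk alongside the write
      let offset = 0;
      while (offset < chunk.length) {
        offset += writeSync(fd, chunk, offset, chunk.length - offset);
      }
    }
    fsyncSync(fd); // flush before the file becomes visible under its real name
    complete = true;
  } finally {
    closeSync(fd);
    if (!complete) rmSync(tmp, { force: true });
  }
  const actual = hash.digest("hex");
  if (actual !== expectedSha256) {
    rmSync(tmp, { force: true });
    throw new Error(`SHA-256 mismatch: expected ${expectedSha256}, got ${actual}`);
  }
  renameSync(tmp, target); // atomic replace on POSIX filesystems
}

// Usage: a good checksum lands the file; a bad one leaves nothing behind.
const workDir = mkdtempSync(join(tmpdir(), "verified-"));
const parts = [Buffer.from("hello "), Buffer.from("world")];
const goodSum = createHash("sha256").update("hello world").digest("hex");
writeVerified(join(workDir, "diff.patch"), parts, goodSum);

let rejected = false;
try {
  writeVerified(join(workDir, "bad.patch"), parts, "0".repeat(64));
} catch {
  rejected = true;
}
const landed = readdirSync(workDir).sort();
const content = readFileSync(join(workDir, "diff.patch"), "utf8");
rmSync(workDir, { recursive: true, force: true });
console.log(landed, rejected);
```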
Lepp’s real-world benchmarks come from the same ultra-jira-mcp and ultra-bitbucket-mcp servers he’s actively maintaining — meaning the numbers aren’t synthetic. Every byte in the benchmark is a byte a production agent would actually receive.
The Road Ahead for MCP Design
MCP server optimization is going to become a competitive differentiator faster than most teams expect. Right now, it’s still possible to build a bloated MCP server, ship it, and not feel the pain immediately. But as agent usage compounds (more concurrent users, longer sessions, richer task chains), the inefficiencies stack. What costs a few dollars today can cost hundreds next quarter.
The teams that treat their MCP layer as a first-class engineering concern — with the same discipline they’d apply to a high-traffic API or a database query planner — are the ones that will be able to scale agent workflows without hitting a wall. Trimmed payloads, consolidated tools, server-side filtering, and stable allowlists aren’t just performance tricks. They’re the foundation of AI integrations that can actually survive production.
Source: https://dev.to/scottlepp/build-mcp-servers-that-dont-sucktokens-im2

