- Council is an AI model debate tool that runs three LLMs in parallel and delivers one verdict with a confidence score.
- The AI model debate includes a deliberation round where jurors read each other’s arguments and can change their position.
- Hermes acts as the judge and foreman, synthesising results and remembering which model to trust for which question type.
- The full setup costs nothing to run: one juror runs locally through Ollama and two run on OpenRouter’s free-tier models, so there are no API bills.
The Problem With Trusting a Single AI Answer
Anyone who has leaned on an LLM for a real decision knows how dangerously convincing a single model can be. Developer Arqam Waheed learned this the hard way. He asked one model whether to use a particular database. It answered with total confidence. He shipped. It cost him a full weekend and a painful migration he says he’s still not over. The culprit wasn’t bad AI. It was single-model overconfidence: the quiet tendency of large language models to present one polished, authoritative answer while all the uncertainty underneath stays invisible to you.
This is a real and underappreciated problem in how developers are using AI tools right now. When you ask ChatGPT or Claude a technical question and get a clean, confident response, you have no idea whether a different model would have said the opposite — or flagged a serious risk. You never see the dissent, because you only asked one voice. An AI model debate approach exists precisely to surface that hidden disagreement.
Waheed’s response was to stop trusting a single model entirely. Instead, he convened a jury.
How the AI Model Debate Actually Works
The project is called Council, and the concept is straightforward even if the implementation isn’t. You submit any judgment call — “Postgres or Mongo?”, “is this PR safe to merge?”, “is this clause legally risky?” — and Council fans it out simultaneously to three different AI models. Two are hosted models from different families via OpenRouter, and the third runs locally on your machine via Ollama. All three are treated as jurors in an AI model debate, each independently taking a position with supporting reasoning.
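The fan-out step described above can be sketched in a few lines of Python. This is a minimal illustration, not Council's actual code: the `Opinion` type, the `convene` function, and the stub jurors are hypothetical names, and the stubs return canned answers instead of calling OpenRouter or Ollama.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Opinion:
    juror: str      # label for the model that answered
    position: str   # the juror's short answer, e.g. "postgres"
    reasoning: str  # free-text justification


# A juror is any callable that maps a question to an Opinion.
# In a real setup each one would wrap a model call; here they are stubs.
Juror = Callable[[str], Opinion]


def make_stub_juror(name: str, position: str) -> Juror:
    """Return a hypothetical juror that always answers with `position`."""
    def ask(question: str) -> Opinion:
        return Opinion(name, position, f"{name} prefers {position} for: {question}")
    return ask


def convene(question: str, jurors: List[Juror]) -> List[Opinion]:
    """Send the question to every juror concurrently and collect the answers."""
    with ThreadPoolExecutor(max_workers=len(jurors)) as pool:
        # pool.map returns results in the same order as `jurors`.
        return list(pool.map(lambda juror: juror(question), jurors))


jurors = [
    make_stub_juror("hosted-a", "postgres"),
    make_stub_juror("hosted-b", "postgres"),
    make_stub_juror("local", "mongo"),
]
opinions = convene("Postgres or Mongo?", jurors)
for opinion in opinions:
    print(opinion.juror, "->", opinion.position)
```

Because each juror is just a callable, swapping a stub for a real model client would not change `convene` at all, which mirrors the article's point about keeping jury composition flexible.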
That’s round one. If the three models agree, you get a high-confidence verdict quickly. But if they split — and they often do — the system runs a second deliberation round. Each juror is shown the other two opinions and given the chance to hold their position or change it based on the arguments. Models that flip are marked with a “⇄ changed” badge in the UI. If a 2-1 split gets argued into consensus during deliberation, the confidence score actually climbs to reflect that. This isn’t just voting — it’s structured adversarial reasoning.
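A rough sketch of the deliberation round follows. The article does not say how Council computes its confidence score, so this example assumes the simplest possible rule: confidence is the share of jurors backing the majority position. The `tally`, `deliberate`, `hold`, and `follow_if_outnumbered` names are all hypothetical, and the revision functions are stand-ins for a model re-reading the other jurors' arguments.

```python
from collections import Counter
from typing import Callable, Dict, Set, Tuple

Positions = Dict[str, str]  # juror name -> current position
# A reviser sees its own position plus the other jurors' positions
# and returns the position it holds after deliberation.
Reviser = Callable[[str, Dict[str, str]], str]


def tally(positions: Positions) -> Tuple[str, float]:
    """Return the majority position and the share of jurors backing it (assumed confidence rule)."""
    winner, votes = Counter(positions.values()).most_common(1)[0]
    return winner, votes / len(positions)


def deliberate(positions: Positions, revisers: Dict[str, Reviser]) -> Tuple[Positions, Set[str]]:
    """Run one deliberation round; return the new positions and the jurors that changed."""
    revised = {}
    for name, own in positions.items():
        others = {n: p for n, p in positions.items() if n != name}
        revised[name] = revisers[name](own, others)
    changed = {name for name in positions if revised[name] != positions[name]}
    return revised, changed


def hold(own: str, others: Dict[str, str]) -> str:
    """Stub juror that never changes its mind."""
    return own


def follow_if_outnumbered(own: str, others: Dict[str, str]) -> str:
    """Stub juror that flips only when all other jurors agree on a different position."""
    other_positions = set(others.values())
    if len(other_positions) == 1 and own not in other_positions:
        return next(iter(other_positions))
    return own


round_one = {"hosted-a": "postgres", "hosted-b": "postgres", "local": "mongo"}
revisers = {"hosted-a": hold, "hosted-b": hold, "local": follow_if_outnumbered}

before = tally(round_one)
round_two, changed = deliberate(round_one, revisers)
after = tally(round_two)
print(before, after, changed)
```

Here the 2-1 split becomes unanimous after one round, so the assumed confidence rises from two thirds to 1.0, and `changed` holds the jurors that would earn the "⇄ changed" badge.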
After deliberation, Hermes steps in as the foreman. It synthesises the debated opinions into a single verdict, attaches a confidence score, and produces a breakdown of exactly why the jurors disagreed. The dissent panel is collapsed by default, but you open it precisely when the confidence number makes you nervous — which is the point. The disagreement that would normally be hidden behind one smooth answer is now the headline feature.
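To make the shape of that output concrete, here is a minimal sketch of a verdict record with a dissent list. In Council the synthesis is done by Hermes, an LLM agent; the plain majority count below is only a stand-in for it, and the `Opinion`, `Verdict`, and `synthesise` names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Opinion:
    juror: str
    position: str
    reasoning: str


@dataclass
class Verdict:
    position: str           # the position the foreman settles on
    confidence: float       # assumed here: share of jurors agreeing with it
    dissent: List[Opinion]  # opinions that disagree, kept for the dissent panel


def synthesise(opinions: List[Opinion]) -> Verdict:
    """Stand-in foreman: pick the majority position and keep the dissenting opinions."""
    winner, votes = Counter(o.position for o in opinions).most_common(1)[0]
    dissent = [o for o in opinions if o.position != winner]
    return Verdict(winner, votes / len(opinions), dissent)


verdict = synthesise([
    Opinion("hosted-a", "postgres", "Relational data, strong consistency needs."),
    Opinion("hosted-b", "postgres", "Team already knows SQL."),
    Opinion("local", "mongo", "Schema is still changing weekly."),
])
print(verdict.position, round(verdict.confidence, 2))
for opinion in verdict.dissent:
    print("dissent:", opinion.juror, "-", opinion.reasoning)
```

Keeping the dissenting opinions on the verdict itself, rather than discarding them after the count, is what lets a UI show them on demand when the confidence number looks low.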
Why an AI Model Debate Beats a Single LLM for High-Stakes Calls
It’s worth stepping back to think about what Council is actually solving. The industry conversation around AI reliability has mostly focused on hallucinations — factual errors that are easy to spot in hindsight. But the harder problem is confident plausibility: answers that sound right, read well, and are subtly wrong in ways that only surface after you’ve acted on them. That’s where an AI model debate structure has a genuine edge.
When three models from different training backgrounds, different architectures, and different providers all converge on the same answer, you have something meaningfully more trustworthy than one model’s opinion. When they split, that split is itself information — a signal that the question is genuinely ambiguous or context-dependent, and that you should think harder before committing.
Think about how this maps to real developer decisions. “Should a 3-person startup use microservices?” is exactly the kind of question that has no universally correct answer. A single model will usually give you the answer that sounds most defensible in the abstract. An AI model debate — with deliberation — will surface the actual tensions: operational overhead vs. future scalability, team skill set vs. industry convention, speed-to-market vs. technical debt. That’s a much more useful output.
Hermes as Judge, Memory, and Learning System
The whole system is built on top of Hermes, an agent framework that Waheed describes as the only piece that makes “different models cheap.” Every juror call and every judge call goes through the same hermes -z interface — a single command that lets you point at any provider and swap models with a flag, no code change required. One juror runs locally via Ollama while two run on OpenRouter, and they all answer through identical plumbing. That model-agnosticism is what makes the jury composition flexible without becoming a maintenance burden.
But Hermes does more than orchestrate. Every verdict is written into Hermes’ own memory, and a dedicated council skill gradually learns which juror to trust for which category of question. Legal clause review? Technical architecture decisions? Database choices? Over time, the system builds a weighting profile — and crucially, it does this through a human-in-the-loop approval flow. Hermes proposes trust adjustments; you approve or dismiss them. Approved rules persist client-side and ride along with the next convene call. The judging is explicitly a function of that accumulated memory: no memory, no weights, no verdict. This is what separates a genuine AI model debate system from a simple polling wrapper.
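The approval flow can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `TrustStore` and `TrustProposal` types, the additive weight adjustment, and the default weight of 1.0 for unseen jurors are not taken from Council or Hermes. The point is only that proposed adjustments sit in a pending queue and affect verdicts once a human approves them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TrustProposal:
    category: str  # question type, e.g. "database"
    juror: str     # which juror the adjustment applies to
    delta: float   # proposed change to that juror's weight


@dataclass
class TrustStore:
    weights: Dict[Tuple[str, str], float] = field(default_factory=dict)
    pending: List[TrustProposal] = field(default_factory=list)

    def weight(self, category: str, juror: str) -> float:
        # Assumption: jurors with no approved rule get a neutral weight of 1.0.
        return self.weights.get((category, juror), 1.0)

    def propose(self, proposal: TrustProposal) -> None:
        """Queue an adjustment; it has no effect until approved."""
        self.pending.append(proposal)

    def approve(self, proposal: TrustProposal) -> None:
        """Human approval: apply the adjustment and clear it from the queue."""
        self.pending.remove(proposal)
        key = (proposal.category, proposal.juror)
        self.weights[key] = max(0.0, self.weight(*key) + proposal.delta)

    def dismiss(self, proposal: TrustProposal) -> None:
        """Human rejection: drop the adjustment without applying it."""
        self.pending.remove(proposal)


def weighted_verdict(store: TrustStore, category: str, positions: Dict[str, str]) -> Tuple[str, float]:
    """Sum juror weights per position; return the top position and its share of total weight."""
    scores: Dict[str, float] = {}
    for juror, position in positions.items():
        scores[position] = scores.get(position, 0.0) + store.weight(category, juror)
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / sum(scores.values())


store = TrustStore()
positions = {"hosted-a": "postgres", "hosted-b": "mongo", "local": "mongo"}
before = weighted_verdict(store, "database", positions)

boost = TrustProposal("database", "hosted-a", 1.5)
penalty = TrustProposal("database", "local", -1.0)
store.propose(boost)
store.propose(penalty)
store.approve(boost)    # the human accepts this rule
store.dismiss(penalty)  # and rejects this one

after = weighted_verdict(store, "database", positions)
print(before, after)
```

With neutral weights the two "mongo" jurors win; after the approved boost, the single trusted juror outweighs them, while the dismissed proposal leaves no trace in the weights.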
This is a smart design choice. Fully automated trust adjustment would be opaque and potentially self-reinforcing in bad ways. Keeping humans in the approval loop means the system’s learned preferences stay auditable and correctable.
What It Costs to Run — and Why That Matters
One of Council’s most striking claims is the price tag: zero dollars. The local setup runs entirely free using Ollama for the on-device model and OpenRouter’s free-tier models for the hosted jurors. For developers who want to experiment with multi-model reasoning without signing up for another expensive API bill, that’s a genuine differentiator. Running a full AI model debate locally costs nothing beyond electricity.
There’s a hosted demo at council-jet-kappa.vercel.app that runs the same UI through OpenRouter and a mock layer. Because Hermes can’t run on serverless infrastructure, the full local experience requires cloning the repo. Setup is short, though: clone, run the setup script, start the server. The GitHub repo is public at github.com/ArqamWaheed/council.
Running it locally is also when you get the real Hermes integration — the memory recall, the skill weighting, the actual subagent transcripts. Waheed has included proof-of-work documentation in the repo’s docs/hermes-proof/ directory showing genuine Hermes runs, skill diffs, and memory recall outputs, not just mocked UI demos.
The Bigger Picture for AI-Assisted Decision Making
Council is a developer side project built for a hackathon challenge, and it’s honest about that. The UX is minimal, the architecture is scrappy in places, and the local-only constraint for full Hermes functionality is a real limitation. But the underlying idea is pointing at something important.
As AI tools get embedded deeper into engineering workflows (code review, architecture decisions, contract analysis, security audits), the single-model confidence problem is going to cause more expensive mistakes, not fewer. The instinct to build an AI model debate structure rather than just trusting one chatbot is probably the right direction for any high-stakes application. AI labs like Anthropic are already talking about multi-agent systems, alongside training-time techniques such as constitutional AI, as ways to build more reliable outputs; Council is an accessible, open-source demonstration of what the multi-model side of that philosophy looks like in practice.
Whether the specific implementation — Hermes as the orchestration layer, client-side persistence, three jurors as the magic number — turns out to be the optimal design is a separate question. But the core insight stands: the most useful thing an AI tool can show you sometimes isn’t the answer. It’s the disagreement.
Source: https://dev.to/arqamwd/i-made-my-ai-models-argue-then-let-hermes-be-the-judge-5e6c


