- Running Claude Code on shared compute lets whole teams trigger AI-powered diagnostics without waiting on a single engineer.
- cfn-investigator runs Claude Code headlessly inside AWS CodeBuild to analyze failed CloudFormation stacks automatically.
- Rather than inventing confident answers, the system ranks hypotheses — an honest shortlist beats a polished wrong guess.
- The project ships as open-source YAML and bash, prioritizing maintainability over architectural perfection.
The Real Cost of a Single Point of Failure
Running Claude Code on shared compute sounds like a DevOps luxury — something you get to when the real fires are out. But for Danielle Hebert, a solo DevOps engineer on a mixed-skill team, it turned into a practical necessity. The trigger was something every small engineering team knows too well: a CloudFormation deploy fails at 2pm, a Slack alert fires, and within minutes someone’s pinging the one person who understands what CREATE_FAILED actually means.
The pings weren’t the problem in isolation. The problem was structural. A broken deploy that requires one specific human to interpret it isn’t a process; it’s a dependency. And dependencies break at the worst possible times. Hebert’s response wasn’t to write better runbooks or schedule training sessions. It was to build something that removes her from the critical path entirely.
The result is cfn-investigator, and the public version — stripped back to the core pattern and rebuilt from scratch — lives on GitHub under the name headless-claude-on-aws.
What Claude Code on Shared Compute Actually Looks Like
The architecture is deliberately small. At its center is an AWS CodeBuild project that runs Claude Code headlessly, so the work happens on shared compute rather than on one engineer’s laptop. You hand it a failing CloudFormation stack name, optionally alongside the commit you suspect introduced the break, and it does the rest. Using the AWS MCP server with a read-only IAM role, it inspects the stack state, works out the most likely cause, and produces a short written analysis. In Hebert’s production setup, that analysis posts directly into the Slack thread where the alert originally fired. The developer sees the failure message and the diagnosis sitting right next to each other.
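As a rough illustration of the input described above, here is a minimal Python sketch that turns a stack name and an optional suspect commit into a non-interactive Claude Code invocation. The function names and the prompt wording are assumptions made for this example, not taken from Hebert's repo (which the article describes as YAML and bash). The only CLI detail relied on is Claude Code's `-p` print mode.

```python
from typing import List, Optional


def build_prompt(stack_name: str, suspect_commit: Optional[str] = None) -> str:
    """Compose the investigation request handed to the agent (hypothetical wording)."""
    lines = [
        f"Investigate why the CloudFormation stack '{stack_name}' failed.",
        "Use read-only inspection only, and do not modify any resources.",
        "Reply with a short written analysis of the most likely cause.",
    ]
    if suspect_commit:
        # The commit is optional context, so it is only mentioned when supplied.
        lines.append(f"The change suspected of introducing the break is {suspect_commit}.")
    return "\n".join(lines)


def build_command(stack_name: str, suspect_commit: Optional[str] = None) -> List[str]:
    """Return the argv for a headless Claude Code run."""
    # `-p` is Claude Code's print mode: run one prompt non-interactively, then exit.
    return ["claude", "-p", build_prompt(stack_name, suspect_commit)]
```

Inside a build step, the returned list could be passed to `subprocess.run`. Wiring in the MCP server and the Slack post is left out here because the article does not spell out how the original does it.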
That last detail matters more than it might seem. Scattered context is one of the biggest friction points in incident response: you get the alert in one place, the logs in another, the stack events in a third. Bringing the diagnosis to where the conversation is already happening shortens the gap between “something broke” and “I know where to start.”
Why CodeBuild, and Why Not Bedrock
Hebert is upfront that her architectural choices weren’t textbook. CodeBuild over Lambda or Fargate. A direct Anthropic API key instead of routing through Amazon Bedrock. Neither decision would survive an AWS Well-Architected review unquestioned. But both were deliberate.
CodeBuild fit the job description almost perfectly: clone some source, run a script, post the result somewhere. That’s exactly what cfn-investigator does. Lambda would have introduced cold-start complexity and a 15-minute execution cap, while Fargate would have added container orchestration overhead. CodeBuild was already designed for “run this thing, then stop.”
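To make the "run this thing, then stop" shape concrete, the sketch below builds the arguments for a one-off CodeBuild run using the `start_build` call from boto3's CodeBuild client. The environment variable names `STACK_NAME` and `SUSPECT_COMMIT` are hypothetical. The article does not say how the original passes its inputs to the build.

```python
from typing import Any, Dict, Optional


def start_build_request(
    project_name: str,
    stack_name: str,
    suspect_commit: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's codebuild.start_build call."""
    # Per-build inputs travel as environment variable overrides (names are assumed).
    env = [{"name": "STACK_NAME", "value": stack_name, "type": "PLAINTEXT"}]
    if suspect_commit:
        env.append(
            {"name": "SUSPECT_COMMIT", "value": suspect_commit, "type": "PLAINTEXT"}
        )
    return {"projectName": project_name, "environmentVariablesOverride": env}
```

With AWS credentials available, the call would look like `boto3.client("codebuild").start_build(**start_build_request("cfn-investigator", "orders-api"))`. Keeping the request construction in a pure function means it can be checked without touching AWS.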
Skipping Bedrock is the more interesting call. Bedrock is AWS’s managed gateway to foundation models, including Anthropic’s Claude. It handles IAM-native auth, keeps traffic inside the AWS network, and adds compliance logging that enterprise security teams tend to require. Hebert’s README explains the reasoning in more detail, but the short version is speed to a working prototype. Sometimes the managed, “correct” path takes three times as long to ship, and the problem it’s solving is happening right now.
This is an engineering trade-off that gets talked about less than it should be. The pull toward pristine architecture is real — especially in an industry where reference architectures and best-practice diagrams are everywhere. But a working system with known rough edges beats a theoretically clean system that doesn’t exist yet.
The Confidence Problem in AI Diagnostics
One design choice stands out as particularly thoughtful: how the system handles uncertainty. The system prompt explicitly instructs Claude to be honest when it isn’t sure, including an “unsure” classification that lets it rank hypotheses rather than force a single definitive answer.
This is quietly important. The failure mode for AI-assisted diagnostics isn’t usually “no answer” — it’s “confident wrong answer.” A developer who gets a ranked shortlist of probable causes with honest confidence signals is in a much better position than one who gets a clean, authoritative explanation that happens to be pointing at the wrong thing. Hebert’s framing — “a ranked shortlist beats a confident wrong guess” — should probably be on a poster somewhere.
It also reflects something broader about how AI tools get deployed in operational contexts. The tendency is to design for confidence, because that’s what feels useful. But in high-stakes environments like production infrastructure, calibrated uncertainty is often more valuable than false precision. A system that says “probably this, possibly that, unlikely the other thing” gives a developer agency. A system that says “definitely this” and is wrong wastes time and erodes trust. Running Claude Code on shared compute means that trust — or the erosion of it — affects every developer on the team, not just the one who built the tool.
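The ranked-shortlist idea can be sketched as a small data structure. This is an illustration under stated assumptions, not the format cfn-investigator emits: the labels "likely" and "possible" are invented for the example, and only "unsure" comes from the article's description of the system prompt.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical confidence labels, ordered from most to least confident.
CONFIDENCE_ORDER = {"likely": 0, "possible": 1, "unsure": 2}


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    confidence: str  # must be one of the CONFIDENCE_ORDER keys


def rank(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Order hypotheses by confidence; sorted() is stable, so ties keep input order."""
    return sorted(hypotheses, key=lambda h: CONFIDENCE_ORDER[h.confidence])


def format_shortlist(hypotheses: List[Hypothesis]) -> str:
    """Render a numbered shortlist that keeps each confidence label visible."""
    ranked = rank(hypotheses)
    return "\n".join(
        f"{i}. [{h.confidence}] {h.cause}" for i, h in enumerate(ranked, start=1)
    )
```

Keeping the label in the rendered text is the point of the design: the reader sees how much weight each candidate cause deserves instead of a single answer stripped of its uncertainty.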
What the Open-Source Version Reveals
The public headless-claude-on-aws repo is explicitly described as a learning artifact, not a production blueprint. The IAM permissions use the AWS managed ReadOnlyAccess policy, which Hebert acknowledges is broader than it should be. Dependencies get installed fresh on every CodeBuild run rather than baked into a custom image, which is slower and less reliable than it could be. The two-role split scopes what the MCP server can touch within AWS, but it doesn’t constrain Claude itself.
These aren’t oversights. They’re honest concessions to the reality of maintaining infrastructure solo. Perfectly scoped IAM policies take time to write and test. Custom build images take time to maintain. When you’re the only person responsible for keeping something running, “boring enough to maintain alone” is a genuine design goal, not a cop-out. That’s especially true when the tool runs on shared compute, where any misconfiguration affects every team member who relies on it.
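For a sense of what tighter scoping than ReadOnlyAccess might involve, here is a hedged sketch that assembles a narrower read-only policy document. The action list is an assumption chosen for illustration (a real investigator would probably need more), and the prefix check is a crude guard, not a substitute for reviewing each action.

```python
from typing import Any, Dict, Sequence

# Illustrative action list only; not taken from the headless-claude-on-aws repo.
READ_ACTIONS: Sequence[str] = (
    "cloudformation:DescribeStacks",
    "cloudformation:DescribeStackEvents",
    "cloudformation:DescribeStackResources",
    "cloudformation:GetTemplate",
    "ecs:DescribeTaskDefinition",
    "logs:FilterLogEvents",
)

# Verb prefixes treated as read-only here. Some Get* actions expose sensitive
# data, so this filter catches obvious mistakes but proves nothing about safety.
READ_PREFIXES = ("Describe", "Get", "List", "Filter")


def investigator_policy(actions: Sequence[str] = READ_ACTIONS) -> Dict[str, Any]:
    """Return an IAM policy document allowing only the given read actions."""
    for action in actions:
        verb = action.split(":", 1)[1]
        if not verb.startswith(READ_PREFIXES):
            raise ValueError(f"not a read-style action: {action}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }
```

Serializing the result with `json.dumps` gives a document that could be attached to a role. The time needed to find and test the right action list is exactly the cost the article says a solo maintainer has to weigh.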
The proof of concept already has a real win behind it. A Fargate task definition was missing an environment variable. The investigator caught it, surfaced it in Slack, and the developer fixed and redeployed without involving Hebert at all. That’s the whole point.
Where This Fits in the Broader AI Automation Wave
Hebert’s project arrives at a moment when the tooling around agentic AI in DevOps is evolving fast. She flags three options she’d evaluate before writing more YAML next time: Claude on AWS, Claude Managed Agents, and the Claude Agent SDK. Each of those abstracts away significant chunks of the plumbing she built by hand — the orchestration, the tool-calling setup, the session management.
The honest caveat is that she hasn’t used them in production yet. That matters. Managed services for agentic workloads are still maturing, and the gap between “works in a demo” and “works reliably under operational load” is real. The pattern Hebert built — Claude Code on shared compute, wired into existing CI/CD infrastructure, posting results where teams already communicate — is exactly the kind of architecture that eventually becomes a managed service. We’ve seen it happen with container orchestration, with serverless, with CI/CD itself. Someone builds the messy version, it works, and then AWS or a startup productizes the pattern.
For teams sitting somewhere between “we want AI in our DevOps tooling” and “we have the budget and runway for an enterprise platform,” running Claude Code headlessly inside an existing CodeBuild setup may be the most pragmatic on-ramp available today. The infrastructure is already there. The pattern has a real win behind it. And unlike a managed platform, you can read every line of it.
Source: https://dev.to/aws-heroes/getting-claude-code-off-my-laptop-and-onto-shared-compute-4cjc

