- Terraform spaghetti — a single 47,000-line codebase managing everything — can make terraform plan take 14 brutal minutes.
- Breaking Terraform spaghetti into domain-bounded state files cut planning time from 14 minutes down to 45 seconds.
- Four near-identical environment files with silent drift created a ticking operational time bomb nobody noticed until it mattered.
- Replacing a direct-apply CI pipeline with a plan-artifact-and-approval workflow stopped the cycle of ‘fix the fix’ commits.
When You Inherit Someone Else’s Terraform Spaghetti
The message was almost comically bleak: “Hey, the previous platform team left. Here’s the repo. Good luck.” What followed was a Git repository containing 47,000 lines of Terraform spaghetti — one monolithic state file, zero modules, and variable names like x, temp2, and the immortal DO_NOT_TOUCH_ask_raj. Raj had left the company two years prior. This is the infrastructure horror story that Sanjay Sundarmurthy documented on Dev.to, and it’s the kind of thing that separates theoretical DevOps knowledge from actual survival instincts.
Anyone who’s spent serious time as a platform or DevOps engineer has opened a main.tf that made them reconsider their career. The specifics vary — maybe it’s 12,000 lines instead of 47,000, maybe it’s Azure instead of AWS — but the shape of the problem is always the same: infrastructure that started as a quick script and grew, organically and uncontrolled, until it became genuinely dangerous to touch. Terraform spaghetti isn’t a niche edge case. It’s arguably the default state of infrastructure at companies that moved fast without investing in platform engineering discipline.
The Real Cost of a Single Monolithic State File
The obvious problem with Terraform spaghetti is aesthetic — it looks terrible, it’s impossible to read. But the real damage is operational. Sundarmurthy’s main.tf clocked in at 8,400 lines on its own, managing networking, compute, databases, DNS, IAM, monitoring, and — brilliantly — a CloudFront distribution for a marketing site decommissioned in 2023. A single terraform apply touched all of it simultaneously.
What does that mean in practice? A terraform plan took 14 minutes to complete. That’s not just annoying, it’s a behavioral change. Developers stop running it. They start making changes blind, or skipping plan reviews entirely, which is exactly how a typo in a security group rule turns into a plan with 847 resources to evaluate and Terraform deciding that your RDS instance needs replacing. Because everything lived in one state file, state locking meant only one person could run Terraform at any given time. The blast radius of any mistake was the entire production infrastructure. New team members were, quite reasonably, too scared to touch anything.
This is the compounding tragedy of Terraform spaghetti: the mess itself creates incentives that make the mess worse. Slow feedback loops breed workarounds. Workarounds become undocumented conventions. Undocumented conventions become DO_NOT_TOUCH_ask_raj.
Step One: State Surgery Without Downtime
The instinct when staring at 47,000 lines of Terraform spaghetti is to burn it down and start over. That instinct is wrong, and Sundarmurthy learned it the hard way after a “big bang” refactor attempt consumed three full sprints and broke staging for a week. The right approach is more like surgery than demolition.
The first move was visualisation. Running terraform graph | dot -Tsvg pipes Terraform’s dependency graph, emitted in DOT format, through Graphviz and renders it as an SVG map of every resource in the configuration (redirect the output to a file to keep it). The command has been part of Terraform’s CLI since the early days but gets criminally underused. That map becomes the blueprint for splitting the monolith into logical domains.
The actual splitting happens via terraform state mv, which moves resources between state files without destroying and recreating them. Sundarmurthy carved the monolith into five distinct state files, organised by change frequency and blast radius: foundation networking (changes rarely, everything depends on it), data stores, compute and services, security and IAM, and a catch-all for monitoring and DNS. The logic is sound — you want your most volatile layers isolated from your most critical ones. An IAM policy change shouldn’t be anywhere near a database resource block.
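For teams on newer Terraform releases, the same hand-off can also be written declaratively. This is a swapped-in alternative to terraform state mv, not the method described in the original post. The sketch below assumes Terraform 1.7 or later (removed blocks; import blocks need 1.5 or later) and an AWS VPC. The resource address aws_vpc.main, the VPC ID, and the CIDR range are all placeholders invented for illustration.

```hcl
# --- Part 1: goes in the monolith's configuration (the state being split) ---
# Delete the original resource "aws_vpc" "main" block there and add this
# instead. On the next apply, Terraform forgets the VPC without destroying it.
removed {
  from = aws_vpc.main # hypothetical resource address

  lifecycle {
    destroy = false
  }
}

# --- Part 2: goes in the new networking configuration (a separate root) ---
# On the next apply there, Terraform adopts the existing VPC into that state.
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0" # placeholder ID, not from the original post
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # assumed; must match the real VPC to avoid a diff
}
```

The two parts belong to different root modules and are applied separately. Running terraform plan in each one before applying should show a resource being forgotten on one side and imported on the other, with nothing created or destroyed.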
Once the state files were separated, terraform_remote_state data sources stitched them back together without tight coupling. Compute configs could reference networking outputs — subnet IDs, VPC identifiers — without owning the networking state. Each domain team could now run terraform plan against just their slice of infrastructure.
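A minimal sketch of that stitching, assuming an S3 backend and AWS resources. The bucket name, state key, region, AMI ID, and the output name private_subnet_ids are hypothetical and do not come from the original post.

```hcl
# --- In the networking configuration: publish what other domains may use ---
# Assumes a resource "aws_subnet" "private" declared with count exists there.
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

# --- In the compute configuration: read those outputs, own none of the state ---
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"              # placeholder bucket
    key    = "foundation-network/terraform.tfstate" # placeholder key
    region = "us-east-1"                            # placeholder region
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "t3.medium"

  # Only root-level outputs of the networking state are visible here.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}
```

Because only declared outputs cross the boundary, the networking configuration can be refactored internally without breaking compute, as long as the output names stay stable.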
The result: plan time dropped from 14 minutes to 45 seconds. According to Sundarmurthy, team velocity tripled. The 2 AM pages about state lock contention stopped entirely.
The Environment Drift Problem Nobody Wants to Admit
Splitting the state was only half the battle. The second major problem was four near-identical environment directories — dev, staging, prod, and DR — each containing roughly 1,200 lines of Terraform that had diverged in ways nobody could fully account for. Staging had a security group rule that prod didn’t. DR was missing three services entirely. Nobody knew which differences were intentional and which were accidents.
This is one of the most insidious forms of Terraform spaghetti because it looks organised. Four tidy folders, each named after an environment. But the commit history tells the real story: “fix: revert the revert of the fix,” “fix: ok THIS one fixes it,” “revert: revert everything from today.” Someone patched a bug in dev and never propagated it. Someone added a hotfix to prod directly. Over months, you accumulate 47 of those hotfixes. You can’t diff your way out because the divergence is both intentional (prod runs m5.xlarge instances, dev runs t3.medium) and accidental (a security rule that should exist in both environments exists in only one).
The fix here is the Strangler Fig pattern, borrowed from application refactoring and applied to infrastructure modules. Rather than trying to unify everything at once, you extract one component at a time into a shared module that accepts environment-specific configuration as explicit variables. The critical discipline is that every difference between environments must be a documented, intentional decision expressed in code — not a silent drift buried in line 847 of a config file. This approach works equally well whether you inherited classic Terraform spaghetti or created your own through years of well-intentioned hotfixes.
Sundarmurthy’s module approach encodes the why alongside the what: dev gets WAF disabled and single-AZ deployments not because someone forgot to enable them, but because the configuration explicitly says so. Prod gets 35-day backup retention. DR gets flagged the moment its config diverges from the prod baseline in ways that aren’t deliberate. That’s a completely different operational posture from hoping nobody accidentally deleted a resource block.
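One way such a module interface might look is sketched below. The module path, variable names, and the dev backup retention value are assumptions made for illustration. Only the WAF, single-AZ, instance size, and 35-day retention choices come from the description above.

```hcl
# --- modules/service/variables.tf (hypothetical shared module) ---
# No defaults: every environment has to state each choice explicitly.
variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod", "dr"], var.environment)
    error_message = "environment must be one of: dev, staging, prod, dr."
  }
}

variable "enable_waf" {
  type = bool
}

variable "multi_az" {
  type = bool
}

variable "instance_type" {
  type = string
}

variable "backup_retention_days" {
  type = number
}

# --- envs/dev/main.tf ---
module "service" {
  source = "../../modules/service"

  environment           = "dev"
  enable_waf            = false       # intentional: no public traffic in dev
  multi_az              = false       # intentional: single-AZ to save cost
  instance_type         = "t3.medium"
  backup_retention_days = 7           # assumed value, not from the post
}

# --- envs/prod/main.tf (a separate root, so the module name can repeat) ---
module "service" {
  source = "../../modules/service"

  environment           = "prod"
  enable_waf            = true
  multi_az              = true
  instance_type         = "m5.xlarge"
  backup_retention_days = 35
}
```

Leaving out defaults is a deliberate choice in this sketch. A new environment cannot silently inherit a value, so every difference between dev and prod shows up in a code review as an explicit line.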
The CI Pipeline That Was One Push Away From Catastrophe
The third layer of this particular infrastructure disaster was a CI/CD pipeline that would run terraform apply -auto-approve directly on a push to main. No plan artifact saved for review. No approval gate. No diff visible to anyone before changes hit production. Just: commit, push, apply. This kind of pipeline is itself a form of Terraform spaghetti — process debt layered on top of code debt.
This is more common than the DevOps community wants to acknowledge. Speed pressure and the desire to eliminate friction lead teams to strip out guardrails that feel like overhead — until the day they aren’t overhead anymore.
The replacement workflow follows a pattern that’s become something of a standard in mature platform engineering teams. Pull requests trigger a terraform plan that saves its output as an artifact and posts a summary directly as a PR comment. Reviewers can see exactly what will change before approving anything. The apply step, when it runs on merge to main, uses that saved plan artifact rather than re-planning, so what gets applied is exactly what was reviewed. If the state has changed between review and apply, Terraform rejects the saved plan as stale instead of applying changes nobody approved. It’s not complicated. It’s just disciplined.
What This Really Tells Us About Platform Engineering Debt
Sundarmurthy’s story isn’t really about Terraform — it’s about what happens when infrastructure tooling outpaces an organisation’s capacity to govern it. Terraform hit 1.0 in 2021, but teams have been running production infrastructure on it since 2014. That’s a decade of accumulated patterns, and a significant portion of those patterns were written by people who’ve since left, under pressure, without documentation, in codebases that were never designed to survive organisational churn. The resulting Terraform spaghetti is less a failure of individual engineers and more a failure of the systems and cultures surrounding them.
The tooling has matured considerably — OpenTofu’s emergence as an open-source Terraform fork, the growth of tools like Terragrunt and Atlantis for workflow management, and HashiCorp’s own Terraform Cloud all offer better guardrails than existed when most of this Terraform spaghetti was written. But tooling alone doesn’t fix a culture that treats infrastructure code as a second-class citizen relative to application code.
The engineers who’ll be opening these repositories in 2027 are already being set up to fail by decisions being made right now. The question isn’t whether your infrastructure will accumulate debt — it’s whether you’re building systems that make that debt visible and manageable before someone has to do state surgery at 2 AM with a VP watching over their shoulder.


