- Post-mortems and RCAs are distinct tools — confusing them produces documents that describe failures without actually understanding them.
- Many teams skip post-mortems and RCAs after incidents, which means the same failures quietly repeat months later.
- A blameless culture is the non-negotiable foundation — without it, engineers self-censor and the real causes stay hidden.
- Pairing post-mortem action items with automated tests turns one-time lessons into permanent institutional memory.
Surviving an Incident Is the Easy Part
Post-mortems and RCAs might be the most consistently skipped practice in software engineering — which is ironic, because they’re also among the highest-value things a team can do. The acute phase of an incident is, paradoxically, the part most teams handle reasonably well. Something breaks, engineers pile into a Slack war room, dashboards eventually go green, and everyone disperses exhausted but relieved. The backlog is waiting. The next deployment is already queued. Nobody wants to spend another two hours on something that’s technically over.
That instinct to move on is completely understandable. It’s also where organizations quietly throw away the most valuable information they’ll generate all quarter.
The lesson doesn’t live in the incident itself. It lives in what you do immediately after — and that means committing to post-mortems and RCAs before the details fade.
Post-Mortems and RCAs Are Not the Same Thing
The two terms get used interchangeably in engineering circles, and that’s a problem — not because terminology pedantry matters, but because conflating them leads to doing both badly.
A post-mortem is the artifact: the document and the structured conversation that follow an incident. It captures what happened, when it happened, who was involved, what the team tried, what the impact was, and — critically — what the team commits to changing. Think of it as a learning document. Its job is to encode experience into something durable before memory fades and the people involved move on to other things.
Root Cause Analysis — RCA — is a technique, not a document. The most common form is the five whys, a method originally developed within Toyota’s manufacturing system, which asks teams to keep asking “why?” until they’ve pushed past surface symptoms to the underlying conditions that allowed a failure to occur. RCA is the tool you use to fill in the most important section of the post-mortem.
You can run post-mortems and RCAs independently — and plenty of teams do, to their detriment. RCAs that live in Slack threads disappear within weeks. Post-mortems written without genuine root cause analysis tend to produce conclusions like “the database went down” without ever interrogating why. The combination is what produces actual learning. Neither works nearly as well in isolation.
Why Blame Kills the Whole Exercise
The single most important property of any useful post-mortem isn’t its format or its length. It’s that it’s genuinely blameless.
Not “blameless except for the engineer who shipped on a Friday.” Not “blameless but we’ll mention who was on call.” Blameless, as a structural property.
The reason is mechanical, not philosophical. The moment engineers suspect that a post-mortem might be used against them — even subtly, even informally — the information dries up. People stop volunteering what they were actually thinking when they made a decision. Operators stop admitting they didn’t fully understand the runbook. The document becomes a carefully constructed piece of self-protective fiction, and the real causes go uninvestigated. This is why blameless post-mortems and RCAs must be treated as a cultural commitment, not just a process checkbox.
The framing that actually works was articulated by John Allspaw in 2012, during his time at Etsy: assume that everyone involved acted reasonably given the information available to them at the time. If their actions contributed to an incident, the interesting question is what made those actions look reasonable. That question is almost always answered by something structural: a missing alert, confusing ownership, tooling that made the wrong thing easy to do, or a testing gap that let a bad assumption survive to production. Those are things you can fix. Human fallibility is not.
“Be more careful next time” is not an action item. It’s a feeling, written down.
What Good Post-Mortems and RCAs Actually Produce
The output of a useful post-mortem is a short list of concrete, owned, time-bound action items. The difference between a good one and a bad one is specificity.
Bad: “Improve our monitoring.”
Good: “Add alerting on queue depth above 10,000, owned by the platform team, targeting this sprint.”
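One way to make "concrete, owned, time-bound" checkable is to record action items as structured data and flag the ones that fall short. The sketch below is only an illustration: the `ActionItem` class, its `problems` method, the digit-based "measurable" heuristic, and the placeholder date are all assumptions, not part of any particular incident tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One post-mortem action item (hypothetical structure for illustration)."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None

    def problems(self) -> List[str]:
        """Return the reasons this item is not yet concrete, owned, and time-bound."""
        issues = []
        if not self.owner:
            issues.append("no owner")
        if self.due is None:
            issues.append("no target date")
        # Crude heuristic (an assumption): a concrete item usually names a number.
        if not any(ch.isdigit() for ch in self.description):
            issues.append("no measurable threshold")
        return issues


vague = ActionItem("Improve our monitoring.")
specific = ActionItem(
    "Add alerting on queue depth above 10,000",
    owner="platform team",
    due=date(2030, 1, 31),  # placeholder date standing in for "this sprint"
)

print(vague.problems())     # every check fails for the vague item
print(specific.problems())  # empty list: nothing left to tighten
```

A check like this could run when the post-mortem document is finalized, so that vague items are caught while the people who understand the incident are still in the room.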
The better teams push one step further and ask whether a given action item addresses a specific incident or a class of problem. A misconfigured retry policy is a specific bug. A deployment environment where retry policies are easy to misconfigure and nearly impossible to catch in testing is a class of problem — and that’s where the real value sits. Fix the class and you prevent not just this incident but the next dozen variants you haven’t had yet.
That distinction — specific fix versus systemic fix — is one of the most underused lenses in post-mortem practice. It’s also the one that compounds over time. Teams that use post-mortems and RCAs to target systemic issues consistently outpace those that only patch the immediate symptom.
Turning Lessons Into Tests That Don’t Forget
Here’s where post-mortems and RCAs connect to something engineering teams already understand: Test-Driven Development.
TDD’s core loop is simple. Write a failing test that describes the behavior you want. Write the code that makes it pass. Refactor while the test keeps you honest. The test becomes a specification, a safety net, and a regression check simultaneously: one artifact doing three jobs. The same loop maps almost perfectly onto incident response.
The behavior you want after an incident is “this specific failure mode never happens again.” So you write a test that either reproduces the failure directly or exercises the condition that allowed it. You watch it fail against the unfixed codebase. You implement the fix. You watch it pass. Then you merge it — and it lives in your CI pipeline from that point forward, running on every commit, regardless of who made the change or whether they’ve ever heard of the incident that produced it.
This practice is sometimes called bug-fix-by-test; the test it leaves behind is a regression test, and its value compounds quietly over years. The next engineer who accidentally reintroduces the problem, perhaps during an unrelated refactor six months from now, gets stopped by a failing test with a descriptive name and, ideally, a link back to the post-mortem that created it. Without that test, the lesson lives entirely in the heads of whoever was on the call. Those people take jobs elsewhere. The lesson goes with them. Eighteen months later, the same incident happens and the new team genuinely doesn’t know why nobody saw it coming.
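As a rough sketch of what such a test might look like, take the misconfigured retry policy mentioned earlier. Everything named here is hypothetical: `build_retry_policy`, the `MAX_RETRIES_LIMIT` ceiling, and the `PM-EXAMPLE` identifier are stand-ins for whatever your own incident and codebase involve. The tests are written as plain functions so that a runner such as pytest would collect them by name.

```python
MAX_RETRIES_LIMIT = 5  # hypothetical ceiling agreed on in the post-mortem


def build_retry_policy(max_retries: int, backoff_seconds: float) -> dict:
    """Validate and return a retry policy (hypothetical helper, written as the fix)."""
    if not 0 <= max_retries <= MAX_RETRIES_LIMIT:
        raise ValueError(f"max_retries must be between 0 and {MAX_RETRIES_LIMIT}")
    if backoff_seconds <= 0:
        raise ValueError("backoff_seconds must be positive")
    return {"max_retries": max_retries, "backoff_seconds": backoff_seconds}


def test_retry_policy_rejects_unbounded_retries():
    """Regression test for post-mortem PM-EXAMPLE (placeholder ID).

    Before the fix, a policy with an effectively unbounded retry count
    was accepted; this test fails if that ever becomes possible again.
    """
    try:
        build_retry_policy(max_retries=1000, backoff_seconds=1.0)
    except ValueError:
        return  # the misconfiguration was rejected, as intended
    raise AssertionError("unbounded retry policy was accepted")


def test_retry_policy_accepts_bounded_retries():
    """Guard against over-correcting: sane policies must still be accepted."""
    policy = build_retry_policy(max_retries=3, backoff_seconds=0.5)
    assert policy["max_retries"] == 3
```

The descriptive name and the docstring do the work of the "link back": whoever trips the test later can see which incident it came from without having been there.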
Post-Mortems and RCAs as Institutional Memory
Zoom out far enough and this practice reframes what a CI/CD pipeline actually is. Most teams think of it as a verification system for new code — a gate that checks whether today’s changes are safe to ship. That’s true, but it’s incomplete.
A pipeline that accumulates post-mortem tests over years becomes something more than that. It becomes the organization’s institutional memory, encoded in a form that doesn’t depend on anyone remembering anything. Every incident that gets a proper post-mortem and a matching test makes the pipeline marginally smarter. A junior engineer hired next year inherits the accumulated scars of every incident that came before them, not because anyone sat them down and told them the stories, but because those scars are encoded as tests that fail when they should.
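One possible way to make that memory browsable is to tag each test with the post-mortem it came from. The sketch below assumes a hand-rolled `from_postmortem` decorator and a `POSTMORTEM_TESTS` registry; both names, the `queue_depth_alert_fires` helper, and the `PM-EXAMPLE-1` ID are invented for illustration, and a test framework's own tagging mechanism could serve the same purpose.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Maps a post-mortem ID to the names of the tests it produced (hypothetical registry).
POSTMORTEM_TESTS: Dict[str, List[str]] = defaultdict(list)


def from_postmortem(postmortem_id: str) -> Callable:
    """Decorator that records which post-mortem a test originated from."""
    def decorator(test_fn: Callable) -> Callable:
        POSTMORTEM_TESTS[postmortem_id].append(test_fn.__name__)
        return test_fn  # the test itself is returned unchanged
    return decorator


def queue_depth_alert_fires(depth: int, threshold: int = 10_000) -> bool:
    """Hypothetical alert rule: fire only when depth exceeds the threshold."""
    return depth > threshold


@from_postmortem("PM-EXAMPLE-1")  # placeholder ID
def test_alert_fires_above_threshold():
    assert queue_depth_alert_fires(10_001)


@from_postmortem("PM-EXAMPLE-1")
def test_alert_silent_at_threshold():
    assert not queue_depth_alert_fires(10_000)


# List which incidents are covered by at least one test.
for pm_id, test_names in sorted(POSTMORTEM_TESTS.items()):
    print(pm_id, test_names)
```

An index like this also works in the other direction: a post-mortem with no entry in the registry is one whose lesson still lives only in a document.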
That’s how engineering organizations actually improve over long timescales, as opposed to just claiming to. The post-mortem without the test is a story you tell yourself about being a learning culture. The post-mortem with the test is evidence of one. Consistently running post-mortems and RCAs, and encoding their findings as automated tests, is what separates teams that genuinely learn from those that only intend to.
Starting Small — the Minimum Viable Loop
None of this requires standing up a formal SRE program or buying a new incident management platform. The minimum viable version of post-mortems and RCAs is genuinely small:
- After every incident worth the name, hold a meeting — not a long one.
- Write a document while the details are fresh.
- Run a five whys or equivalent RCA to push past the symptom.
- Identify two or three concrete action items with named owners and target dates.
- For at least one of them, write a test that would have caught the problem earlier.
- Merge the test. Move on.
That’s the full loop. It’s not conceptually complex. The hard part is doing it consistently — especially after the incidents that leave everyone exhausted and slightly relieved, when the next alert is already firing and the last thing anyone wants is another meeting about something that’s already been fixed.
But that’s precisely when it matters most. The teams that build durable reliability are the ones that treat the hour after an incident as part of the incident, not as optional cleanup to defer until next sprint. As engineering systems scale in complexity and team turnover accelerates, that discipline is what separates organizations that learn from their infrastructure from those that are simply at its mercy.
Source: https://dev.to/tacoda/post-mortems-and-rcas-why-you-should-be-doing-them-1i53

