Root cause analysis fails because it gives you one answer and stops your thinking there. The real failure lives in the relationship between contributing factors, not in any single component.
Root cause analysis doesn't fail because people do it badly. It fails because it plays directly into the ways humans already get things wrong. The phrase "root cause" implies a single point to fix - somewhere you can fire the mythical silver bullet and solve every problem at once. The trouble isn't that RCA gives you wrong answers; it's that it gives you an answer, and that's worse. Because once you've got an answer, you stop looking for any more.
Techniques like 5 Whys perform a depth-first search that stops at the first leaf node. You get a cause. You miss the contributory factors sitting in every other branch you didn't walk down. The depth feels like rigour, but it's tunnel vision justified with a methodology. Meanwhile, false dichotomies proliferate in software: "it's a people problem" versus "it's a process problem." Capable people in a bad process look incompetent, and a great process run by misaligned people generates beautifully efficient but wrong outputs. The failure lives in the fit between the two, not in either one. Root cause thinking doesn't just oversimplify, it polarises. People end up arguing about which cause is the cause rather than mapping how the causes feed each other.
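The search analogy can be made concrete. Below is a minimal sketch with a hypothetical causal tree for an invented incident (all factor names are made up): 5 Whys walks one branch to its first leaf, while a full traversal surfaces every leaf-level factor.

```python
# A toy causal tree for a hypothetical incident. Every name here is illustrative.
causes = {
    "outage": ["deploy failed", "alert missed"],
    "deploy failed": ["flaky migration"],
    "alert missed": ["pager misrouted", "runbook ambiguous"],
    "flaky migration": [],
    "pager misrouted": [],
    "runbook ambiguous": [],
}

def five_whys(tree, node):
    """Depth-first: follow the first 'why' answered at each step, stop at the first leaf."""
    path = [node]
    while tree[node]:
        node = tree[node][0]  # only ever explores the first answer given
        path.append(node)
    return path

def all_factors(tree, node):
    """Walk every branch and collect every leaf-level contributing factor."""
    leaves, queue = [], [node]
    while queue:
        n = queue.pop(0)
        children = tree[n]
        if children:
            queue.extend(children)
        else:
            leaves.append(n)
    return leaves

print(five_whys(causes, "outage"))   # ['outage', 'deploy failed', 'flaky migration']
print(all_factors(causes, "outage")) # the branches 5 Whys never visited show up too
```

Same tree, same starting question - but the first search hands back one tidy chain and never mentions the misrouted pager or the ambiguous runbook.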
The result is almost always the same: a concrete action item (usually another step bolted onto a process) with no consideration of the system as a whole. You get the satisfying feeling of having fixed something. Whether it's the right something is another question entirely. The alternative is to think in contributing factors, not root causes. Draw causal loop diagrams instead of causal chains. Tech debt causes slowness. Slowness creates pressure. Pressure creates more tech debt. If your diagram has no loops, you probably haven't looked hard enough. Ask "what conditions made this likely?" rather than "what caused this?" That shifts the inquiry towards the system instead of towards a silver bullet.
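A causal loop diagram is just a directed graph where a cycle marks a reinforcing loop. As a sketch, here is the tech-debt loop from above encoded as edges (plus one invented dead-end factor), with a standard DFS back-edge check standing in for "does my diagram have any loops?":

```python
# A causal loop diagram as a directed graph; an edge means "contributes to".
# Factor names come from the loop described in the text; "skipped reviews" is invented.
edges = {
    "tech debt": ["slowness"],
    "slowness": ["pressure"],
    "pressure": ["tech debt", "skipped reviews"],
    "skipped reviews": [],
}

def has_loop(graph):
    """DFS with a recursion stack: a back edge means a reinforcing loop exists."""
    visiting, done = set(), set()

    def dfs(node):
        visiting.add(node)
        for nxt in graph.get(node, []):
            if nxt in visiting:  # back edge: we've walked into our own ancestry
                return True
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(n) for n in graph if n not in done)

print(has_loop(edges))  # True - tech debt feeds slowness feeds pressure feeds tech debt
```

A pure causal chain (the output of most RCA exercises) is acyclic, so this check returns False on it - which, per the argument above, is a hint you stopped mapping too early.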
James Reason's Swiss cheese model is worth internalising: failures happen when multiple gaps align simultaneously, not when one thing goes wrong. Every system has holes - incomplete tests, ambiguous runbooks, technical debt. Individually, none of them is "the cause." The incident happens when enough of them line up at the same time. Patching one hole feels productive. Understanding why so many holes were open at once is where the actual learning lives.
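The arithmetic behind the model is worth seeing once. In this minimal simulation (layer names and failure probabilities are entirely made up), each defensive layer leaks fairly often on its own, yet an incident - all holes aligning on the same attempt - stays rare:

```python
import random

# Swiss cheese sketch: each layer independently has a "hole" with some
# probability on a given attempt. The numbers below are invented for illustration.
layers = {"tests": 0.10, "review": 0.15, "alerting": 0.05, "runbook": 0.20}

def incident(rng):
    """True only when the holes in every layer line up on the same attempt."""
    return all(rng.random() < p for p in layers.values())

rng = random.Random(42)
trials = 100_000
rate = sum(incident(rng) for _ in range(trials)) / trials

# Independence makes the expected rate the product of the hole probabilities:
# 0.10 * 0.15 * 0.05 * 0.20 = 0.00015, i.e. ~15 incidents in 100,000 attempts.
print(f"incident rate ≈ {rate:.5f}")
```

Patching one hole shrinks one factor in that product; it doesn't explain why four layers were leaky in the first place, which is the question the paragraph above says actually matters.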
Check out the full stdlib collection for more frameworks, templates, and guides to accelerate your technical leadership journey.