Not all Root Causes are Created Equal

An article I read recently made my blood boil by declaring root cause analysis a fallacy and counterproductive. Hate for root cause analysis isn’t even that new. Here is a presentation from 2007 castigating root cause analysis. So how could this be? How could I have been so wrong for so long? And then it hit me…

Will the Real Root Cause Please Stand Up

We’re not talking about the same thing when we talk about root cause. The Wikipedia definition of root cause (a factor whose removal makes the problem go away) leads to a terrible way of fixing problems. I call this method “poke it with a stick debugging.”

Poke. Is the problem fixed now? No? Poke it somewhere else. Fixed now? No? Try again.

As the links above correctly observe, large outages and dramatic failures are usually the result of many interacting things. It is necessarily so: modern systems are specifically designed to prevent single failures from causing full-scale outages, and most single-cause failures are rooted out early in the life of a system. No wonder, then, that using Five Whys or some other technique designed to find a single root cause is counterproductive. I agree with all the points linked to above, and certainly once root cause analysis becomes about assigning blame, you’ll never get to the bottom of things.

When I say “root cause”, I mean something different, very specific, and technical. Let’s take a look at a simple example to understand what I mean.

Preconditions, Postconditions, Invariants, Oh My!

Software is like a house of cards. Each layer works as long as it abides by the contract (express or implied) needed by the next. Consider the simple program below. This program returns the smallest element in a list (using an inefficient approach, but bear with me).

function smallest(data: number[]): number {
   const sorted = sort(data);
   return sorted[0];
}

So why does this function work, and what guarantees does it make to callers? The answers to these questions are the contracts that must be fulfilled for correct operation. In a perfect world, we’d specify these formally using Hoare Logic, loop invariants, automata theory, or one of many other formalisms. In practice, we sometimes get API docs, or an RFC. Often we don’t even have that.

We understand the obvious contracts here: `smallest` should return the smallest number in `data`, `sort` should sort from least to greatest, etc. But what should happen if `data` is empty? What does `sorted[0]` do if `sorted` is empty? Correct operation depends on an agreement on what each contract should be from the top-level design, all the way down to the hardware.
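One way to settle the open questions above is to write the contract into the code. The sketch below is only an illustration, not the one right answer: it assumes that an empty `data` is a caller error, and it supplies a hypothetical `sort` that returns an ascending copy of its input. Without the check, `sorted[0]` on an empty array evaluates to `undefined` at runtime, even though the signature promises a `number`.

```typescript
// Hypothetical stand-in for the article's `sort`.
// Contract: returns a new array ordered from least to greatest.
function sort(data: number[]): number[] {
  return [...data].sort((a, b) => a - b);
}

function smallest(data: number[]): number {
  // Precondition (an assumption; the article leaves this open):
  // callers must pass at least one element.
  if (data.length === 0) {
    throw new RangeError("smallest: data must not be empty");
  }
  const sorted = sort(data);
  // Relies on the postcondition of `sort`: the first element is the least.
  return sorted[0];
}
```

Throwing is one choice among several. Returning `undefined` or a sentinel would also be a valid contract, as long as callers and callee agree on it.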

My Root Cause (and Perhaps Yours Too)

When I discuss root cause analysis for a failure, what I mean is: which portions of the software system violated the implied or explicit contracts. The most specific place where a contract was violated is the root cause (or root causes, if contracts were violated in multiple places).

In the example above, suppose we trace a failure to `smallest` not returning the smallest element for some non-empty array. Further suppose we discover that `sort` is returning elements from greatest to least, instead of least to greatest as needed by `smallest`.

Now, either `sort` is broken and the root cause of the failure is somewhere in its implementation, or `smallest` is broken and the root cause is the incorrect assumption that `sort` will return an array from least to greatest. Of course, if we decide `sort` is the problem, we need to apply this analysis recursively until we are at a level of detail where we can fix the contract violation.
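A runtime check on the boundary between the two functions can show where the contract breaks, although deciding which side is wrong remains a judgment about what the contract should be. The following is a minimal sketch under stated assumptions: `isAscending`, `checkedSort`, and `sortDescending` are hypothetical names introduced for illustration, and `smallest` takes the sort as a parameter only so the example is self-contained. The check covers ordering only, not that the result is a permutation of the input.

```typescript
type Sorter = (data: number[]) => number[];

// The postcondition that `smallest` expects of `sort`: ascending order.
function isAscending(xs: number[]): boolean {
  for (let i = 1; i < xs.length; i++) {
    if (xs[i - 1] > xs[i]) return false;
  }
  return true;
}

// Hypothetical sort that orders from greatest to least.
function sortDescending(data: number[]): number[] {
  return [...data].sort((a, b) => b - a);
}

// Runs a sort and fails at the exact point where the expected contract breaks.
function checkedSort(sort: Sorter, data: number[]): number[] {
  const result = sort(data);
  if (!isAscending(result)) {
    throw new Error("sort contract violated: result is not ascending");
  }
  return result;
}

function smallest(data: number[], sort: Sorter): number {
  return checkedSort(sort, data)[0];
}
```

Calling `smallest([3, 1, 2], sortDescending)` now throws at the boundary instead of quietly returning the largest element, which narrows the search to either the implementation of the sort or the assumption made by `smallest`.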

In practice, you can end up in a situation where it isn’t even clear what the contract is supposed to be. Worse, you can discover that everyone is obeying the required contracts, yet the system still does not produce the desired behavior. In this case, the design is the root cause, and it can be very painful to fix correctly.

Cascading Failures and Root Causes that Matter

We have to look at all contract violations involved in a failure. For example, suppose you have a distributed system that should allow any one server to crash without affecting the service provided by the system. Suppose the system becomes completely unavailable, despite only a single server crashing. Now there are at least two things to find a root cause for: first, why did the server crash in the first place? And, more importantly, why did the entire system go down as a result?

In practice, it can be hard to identify all the root causes, and it may not be practical to track them all down. Make sure to focus on the ones that matter most. Mechanisms that allow a problem to spread are generally more important than the thing that caused the original problem in the first place. In a large system, you’ll always have these original “root causes”; the most important thing is to contain the problem and degrade gracefully if necessary.

Finally, don’t forget to ask if the design of the system is suitable for the reliability demanded. Large distributed systems will have individual failures, data corruptions, network partitions, slow response times, and other issues. But is the overall system designed to handle them? How do you know? How do you test it? Do you intentionally induce failures in production to ensure robustness, or do you hope for the best?
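The “any one server may crash” contract from the example above can be written down as a test that injects the failure on purpose. This is a toy, in-memory sketch rather than a real fault-injection setup: `FakeServer`, `serve`, and `survivesAnySingleCrash` are hypothetical names, and a real system would also have to deal with timeouts, partial failures, and network partitions.

```typescript
// Hypothetical in-memory stand-in for a server; `up` lets a test crash it.
class FakeServer {
  name: string;
  up = true;
  constructor(name: string) {
    this.name = name;
  }
  handle(request: string): string {
    if (!this.up) throw new Error(`${this.name} is down`);
    return `${this.name} handled ${request}`;
  }
}

// Contract under test: the service answers as long as one server is up.
function serve(servers: FakeServer[], request: string): string {
  for (const server of servers) {
    try {
      return server.handle(request);
    } catch {
      // Contain the failure: try the next server instead of propagating.
    }
  }
  throw new Error("service unavailable: every server failed");
}

// Fault injection: crash each server in turn and check the contract holds.
function survivesAnySingleCrash(servers: FakeServer[]): boolean {
  for (const victim of servers) {
    victim.up = false;
    try {
      serve(servers, "ping");
    } catch {
      return false; // one crash took the whole service down
    } finally {
      victim.up = true; // restore before the next experiment
    }
  }
  return true;
}
```

With three servers the check passes; with a single server it fails, because there is nowhere for the failure to be contained.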

Don’t Play the Blame Game

Very often, major problems result from the interaction of components. And these system components are often implemented by different people and different teams. Finding the technical root cause, as described above, can become an exercise in finger pointing – trying to argue and interpret the implicit contracts between parts of the system so as to place the root cause(s) in some other team’s court.

Anyone with experience in production failures has seen very smart people jump to conclusions, torture logic, and twist the obvious into knots trying to justify blaming other people’s code. Suffice it to say that this is a hard issue to resolve, and it is beyond the scope of this article. Just be aware that you can easily slide from arguing about the technical root cause (what the expectations and invariants in your system are) to arguing about who is at fault instead of what code is at fault. You may not even realize that it has happened, because it happened entirely in the minds of other people on the project.

Bonus Exercise

Write and test a bubble sort algorithm for an array of numbers. Now, make a reasonably formal argument as to why it works. Most good developers can slam out bubble sort pretty quickly. Many will struggle to explain why it works. This fact alone explains why finding the contract violations I describe above can be so vexing.
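For comparison after you have tried it yourself, here is one possible sketch, with the argument for correctness written as loop invariants in the comments. It is one way to structure the reasoning, not the only one, and `bubbleSort` returning a sorted copy rather than sorting in place is an arbitrary choice.

```typescript
function bubbleSort(input: number[]): number[] {
  const a = [...input]; // work on a copy; swaps never add or drop elements
  for (let end = a.length - 1; end > 0; end--) {
    // Outer invariant: a[end+1 ..] is sorted, and each of its elements
    // is >= every element of a[0 .. end].
    for (let j = 0; j < end; j++) {
      // Inner invariant: a[j] is the largest element of a[0 .. j].
      if (a[j] > a[j + 1]) {
        [a[j], a[j + 1]] = [a[j + 1], a[j]];
      }
      // Now a[j+1] is the largest element of a[0 .. j+1].
    }
    // a[end] is the largest of a[0 .. end] and <= a[end+1],
    // so the outer invariant holds for end - 1.
  }
  // At end = 0: a[1 ..] is sorted and >= a[0], so all of `a` is sorted.
  return a;
}
```

The argument has three parts: each invariant holds on entry, each iteration preserves it, and the outer loop terminates because `end` strictly decreases. Getting any one of these slightly wrong is exactly the kind of contract violation described above.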
