SpaceX, Fault Trees, and Root Cause

My previous six Fault Tree Friday posts were tutorials highlighting some of the stumbling blocks I’ve noticed in teaching classes on safety analysis. This one deals with fault trees in recent press.

I ran into a question on Quora asking how SpaceX so accurately determines the causes of failures in their rockets. The questioner said he heard Elon Musk mention a fault tree analysis and didn’t understand how it could be used to determine the cause of rocket failures.

In my response (essentially what follows below) I suggested that his confusion regarding the suitability of fault tree analysis to explaining incidents like the September 2016 SpaceX explosion (“anomaly” in SpaceX lingo) is well-founded. The term “fault tree analysis” (FTA) in aerospace has not historically been applied to work done during an accident/incident investigation (though it is in less technical fields, where it means something less rigorous).

FTA was first used during the design of the Minuteman system in the 1960s. FTA was initially used to validate a design against conceivable system-level (usually catastrophic) failures (often called hazards in the FTA context) by modeling all combinations of failures and faults that can jointly produce each modeled hazard. It was subsequently used earlier in the design process (i.e., earlier than the stage of design validation by reliability engineering) when we realized that FTA or (something very similar) is the only rational means of allocating redundancy in complex redundant systems. This allocation of redundancy ensures that systems effectively have no probabilistic strong or weak links – similar to the way stress analysis ensures that mechanical systems have no structural strong or weak links, yielding a “balanced” system.

During the design of a complex system, hazards are modeled by a so-called top-down (in FTA jargon) process. By “top-down” we mean that the process of building a fault tree (which is typically represented by a diagram looking somewhat like an org chart) uses functional decomposition of the hazardous state (the “top event” in a fault tree) by envisioning what equipment states could singly or in combination produce the top event. Each such state is an intermediate event in FTA parlance. Once such equipment states are identified, we use a similar analytical approach (i.e., similar thinking) to identify the equipment states necessary to jointly or singly produce the intermediate event in question. Thus the process continues from higher level events down to more basic events. This logical drill-down usually stops at a level of granularity (bottom events) sufficient to determine from observed frequencies in historical data (or from expert opinion about similar equipment) a probability of each bottom event. At this point a fault tree diagram looks like an org chart where each intermediate event is associated with a Boolean logic gate (e.g., AND or OR). This fault tree model can then be solved, by Boolean algebra (Boolean absorption and reduction), usually using dedicated software tools.

The solution to a fault tree consists of a collection of sets of bottom-level events, where each set is individually sufficient to produce the top event, and where all events in a set are individually necessary to that set. There may be thousands or millions of such sets (“cut sets”) in the solution to a fault tree for a complex system. Each set would include one or more bottom-level events. A set having only one bottom-level event would indicate that a single failure could produce the top event, i.e., cause a catastrophe. Civil aviation guidelines (e.g. CFR 25.1309) require that no single event and no combinations of events more probable than a specified threshold should produce a catastrophic condition.

If probability values are assigned to each bottom-level event, the solution to a fault tree will include calculated probability values for each intermediate event and for the top event. The top event probability is equal to the sum of the probabilities for each cut set in the solution’s collection of cut sets, each of which is an independent combination of bottom events jointly sufficient to produce the top event. A fault tree can still be useful without including probability calculations, however, since the cut set list, typically ordered by increasing event count, provides information about the importance of the associated equipment in avoiding the hazard (top event). The list also gives guidance for seeking common causes of events believed to be (modeled as being) independent.

A fault tree is only as good as the logic that went into its construction. I.e., FTA requires that the events within each cut set are absolutely independent and not causally correlated. This means that wherever an AND gate occurs in a tree, all events associated with that AND gate must be truly independent and have no common cause. So another value of a completed fault tree is to challenge beliefs about independence of events based on isolation, physical separation, vulnerability to environmental effects, duplicated maintenance errors, etc.

Now in the case of SpaceX, the press referred to conducting a fault tree analysis as part of the investigation. E.g., Bloomberg reported on Sep. 9 2016 that a “group will be conducting a thorough ‘fault tree analysis’ to try to determine what went wrong.” This usage is not consistent with the way the term is typically used in aerospace. As stated above, FTA relates to the design of a system and addresses all possible combinations of failures that can be catastrophic.

By contrast, accident investigation would be concerned, in the SpaceX case, with examining failed components and debris from the explosion. Such an investigation would likely include fractography, simulations, models, and hypotheses, which would draw on a group of fault trees that presumably would have existed since the design phase of the rocket and its planned operations.

It is unclear whether SpaceX meant that they were constructing fault trees as part of their accident investigation. They said they were “conducting tests” and that the “investigation team worked systematically through an extensive fault tree analysis.” It seems inconceivable that NASA, stakeholders, and customers would have allowed development of a rocket and use of public funds and facilities without the prior existence of fault trees. It’s possible, even if a SpaceX public relations representative said they conducting a fault tree analysis, that the PR person was miscommunicating information received from engineers. If no fault trees existed at the time of the explosion, then shame on SpaceX and NASA; but I doubt that is the case. NASA has greatly increased emphasis on FTA and related techniques since the shuttle disasters.

For reasons stated above the relationship between accident investigation and fault tree analysis in aerospace is limited. It would be unproductive to analyze all possible causes of a catastrophic system-state for the purpose of accident investigation when physical evidence supports certain hypotheses and excludes others. Note that each cut set in the solution to a fault tree is effectively a hypothesis in the sense that it is a plausible explanation for the catastrophe; but a fault tree does not provide justification for the nodes of a causation chain.

Many people ask about the relationship between aerospace accident investigation and root cause analysis. While aerospace engineers, the NTSB, and the FAA sometimes use the term “root cause” they usually do so with a meaning somewhat different than its usage in popular techniques of root cause analysis. The NTSB seeks explanations of accidents in order to know what to change to prevent their recurrence. They realize that all causal chains are infinitely long and typically very wide. They use the term “root cause” to describe the relevant aspects of an accident that should be changed – specifically, equipment design, vehicle operating procedures or maintenance practices. They are not seeking root causes such as realignment of incentives for corporate officers, restructuring the education system for risk culture, or identifying the ethical breaches that led to poor purchasing decisions.

Perhaps surprisingly, aerospace accident analyses are rather skeptical of “why” questions (central to many root-cause techniques) as a means of providing explanation. From the perspective of theory of scientific explanation (a topic in philosophy of science), fault tree analysis is also skeptical of causality in the sense that it favors the “covering law” model of explanation. In practice, this means that both FTA and accident investigation seek facts of the case (occurrences of errors and hardware/software failures) that confer nomic expectability on the accident (the thing being explained). That is, the facts of the case, when combined with the laws of nature (physics) and the laws of math (Boolean algebra), require the accident (or top event) to happen. In this sense accident investigation identifies a set of facts (conditions or equipment states) that were jointly sufficient to produce the accident, given the laws of nature. I could have said “cause the accident” rather than “produce the accident” in my previous sentence; but as phrased, it emphasizes logical relationships rather than causal relationships. It attempts to steer clear of biases and errors in inference common to efforts pursuing “why” questions. Thus, techniques like “the 5 whys” have no place in aerospace accident analyses.

Another problem with “why”-based analyses is that “why” questions are almost always veiled choices that entail false dichotomies or contain an implicit – but likely incorrect – scope. I.e., as an example of false dichotomy, “why did x happen” too often is understood to mean “why x and not y.” The classic example of the why-scoping problem is attributed to Willie Sutton. When asked why he robbed banks, Sutton is reported to have replied, “because that’s where the money is.” In this account Sutton understood “why” to mean “why banks instead of churches” rather than ”why steal instead of having a job.”

Some root-cause frameworks attempt to avoid the problems with “why” questions by focusing on “what” questions (i.e., what happened and what facts and conditions are relevant). This is a better approach, but in absence of a previously existing fault tree, there may be a nearly infinite number of potentially relevant facts and conditions. Narrowing down the set of what questions exposes the investigators to preferring certain hypotheses too early. It can also be extremely difficult to answer a “what happened” question without being biased by a hunch established early in the investigation. This is akin to the problem of theory-laden observations in science. The NTSB and aerospace accident investigators seem to do an excellent job of not being led by hunches, partly by resisting premature inferences about evidence they collect.

I’d be interested in hearing from anyone having more information about the use of fault trees at SpaceX.

[Edit: a bit more on the topic in a subsequent post]

  – – – – – – – –

Are you in the San Francisco Bay area?
If so, consider joining the Risk Management meetup group.

Risk management has evolved separately in  various industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase current trends, case studies, and best practices in our profession with a focus on practical application and advancing the state of the art.

1 thought on “SpaceX, Fault Trees, and Root Cause

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s