
Covering-Law Models, Statistical Relevance, and the Challenger Disaster

Space shuttle Challenger disintegrated about one minute after liftoff, killing its seven crew members. Many presentations on root cause analysis (RCA) cite the accident to show how an RCA framework would have pointed directly to the root cause. But no one agrees exactly on that cause, not even the members of the Rogers Commission. And showing that previously reached conclusions and evidence can be fed into an RCA framework is much different from showing that the framework would have produced that conclusion.

A minor reference to the Covering-Law model of scientific explanation in my last post on fault trees and RCA drew some interest. I’ll look at the Challenger accident with regard to explanatory and causal models below.

A reader challenged my reference to covering-law models because of their association with a school of thought (“logical positivism”) abandoned fifty years ago. The positivists, like most people, saw a close connection between explanation and causality; but unlike most of us, they worried that chasing causality relied too heavily on induction and too often led to metaphysics and hollow semantics. That birds fly south because of instinct is no explanation at all. Nevertheless, we have a natural tendency to explain things by describing our perceptions of their causes.

WV Quine and Thomas Kuhn extinguished any remaining embers of logical positivism in the 1960s. It’s true that covering-law models of explanation have major failures – i.e., counterexamples to sufficiency are common (see odds and ends at the end of this post). But so does every other model of explanation, including statistical-relevance models, counterfactual models, unification models, and causal models.

In my post I suggested that, while most engineers probably never explicitly think about the pitfalls of causal explanations, they lean toward explanations that combine observations (facts of the case) with general laws (physics) such that the condition being modeled is inevitable.  That is, they lean toward covering-law explanations, also called nomological-deductive models since given deterministic laws and the facts of the case, pure deduction leads to the inevitability (nomic expectancy) of the observed outcome.

By leaning toward covering-law models of explanation, engineers avoid some of the problems with causal models of explanation. These problems were exactly why the logical positivists (e.g., Popper, Hempel, Nagel) pursued covering-law models of explanation. They sought explanations that were free of causation’s reliance on induction.

In the 1700s David Hume expressed similar concern with causation in his treatise. Some might argue that this is just another case of philosophers trying to see who can doubt the most obvious truth. But the logical positivists’ concerns went beyond Hume’s. Positivist Carl Hempel was concerned that we too easily infer causation from correlation in a very practical sense – one that applies to things like real-world drug trials, accident investigation, and root cause analysis.

Hempel worried about causal overdetermination, the possibility that two conditions, each sufficient to produce an effect (to confer nomic expectability on it), might both exist. In this case, assigning one of the conditions as “the” cause is messy; and the two conditions can’t reasonably be called “contributing causes” because each is individually sufficient. Counterfactual models offer no help in this case since removal of either condition alone would not change the outcome; thus neither could be logically deemed a “cause.” Counterfactual notions of explanation, common in root cause frameworks, maintain that a condition is causally relevant if its removal would have prevented the outcome being investigated.

Another commonly cited problem with causal models is failure of transitivity. One might expect causation to be transitive: if a causes b and b causes c, a should cause (and therefore explain) c. Famous examples show this not to be the case (see example at end of this post).

We need not descend into philosophy to find examples where overdetermination creates problems for root-cause analysis. The Challenger disaster suffices. But before delving into the Challenger example we should cover one other model of explanation – statistical relevance. The positivists added statistical models of explanation for cases where outcomes were non-deterministic. More recent scholars have refined statistical-relevance models of explanation.

Fault tree initiators (basic events such as microprocessor or valve failure, for example) are treated probabilistically. Fault trees model such failures as irreducibly indeterministic (like the decay of uranium atoms), regardless of the fact that a detailed analysis of the failed part might reveal a difference between it and one of the same age that did not fail. The design of a complex system expects failures of such components at a prescribed statistical rate. In this sense fault trees use inductive logic (i.e., the future will resemble the past) despite their reliance on deductive logic for calculating cut sets. NASA did not use fault tree analysis in design of the shuttle.

Statistical information about historical – or sometimes estimated – failure rates often plays the role of evidence in scientific explanations. Statistical laws might, in conjunction with facts of a case, confer a high probability on the outcome being explained. Statistical explanations are common in medicine; the drug is considered successful if it relieves some fraction of patients, usually a large percentage.

Statistical explanations seem less satisfying for rare events, particularly when the impact (severity) is high. Most of us are more willing to accept that probability explains losing in a single bet on a single slot in the roulette wheel. But what about winning that bet? Can the same explanation account for both winning and losing?

Sidestepping this psychological hurdle of statistical and probabilistic explanations, we’re left with defending such explanations, in general, for cases where a portion of an event’s explanation is assigned to statistical phenomena. In these cases, a factor is deemed relevant not if it makes the probability P of the outcome high, but if it increases that probability. That is, given a population A, a condition C will be statistically relevant to condition B just in case the probability of B conditional on A and C is greater than the probability of B conditional on A alone. In symbolic form, C is relevant if P(B|A.C) > P(B|A).
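The relevance test P(B|A.C) > P(B|A) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and record layout are hypothetical, and the counts are invented to mirror the grossly oversimplified Vitamin C example in the odds and ends at the end of this post, where 90% of patients recover in seven days either way.

```python
from fractions import Fraction


def conditional_probability(records, outcome, given):
    """Estimate P(outcome | given) as a relative frequency over the records."""
    matching = [r for r in records if all(r[k] == v for k, v in given.items())]
    if not matching:
        raise ValueError("no records satisfy the conditioning facts")
    hits = sum(1 for r in matching if r[outcome])
    return Fraction(hits, len(matching))


def is_statistically_relevant(records, outcome, population, condition):
    """Return (relevant, P(B|A.C), P(B|A)) for population A and condition C."""
    p_with = conditional_probability(records, outcome, {**population, **condition})
    p_without = conditional_probability(records, outcome, population)
    return p_with > p_without, p_with, p_without


def make_records(took_c, recovered, count):
    """Build `count` identical, invented patient records (illustration only)."""
    return [{"has_cold": True, "vitamin_c": took_c, "recovered": recovered}] * count


# Invented counts: 9 of 10 recover with Vitamin C, 9 of 10 recover without it.
records = (
    make_records(True, True, 9) + make_records(True, False, 1)
    + make_records(False, True, 9) + make_records(False, False, 1)
)

relevant, p_with, p_without = is_statistically_relevant(
    records, "recovered", {"has_cold": True}, {"vitamin_c": True}
)
print(relevant, p_with, p_without)
```

With these counts both probabilities are 9/10, so the condition is not statistically relevant even though the probability of recovery is high. A high probability alone does not make a factor explanatory.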

Now consider the Challenger disaster and the disagreement between the findings of the Rogers Commission as a group and Richard Feynman as a dissenting participant. The commission (Ch. V: The Cause of the Accident) concluded with this explanation (i.e. cause) of the disaster:

“In view of the findings, the Commission concluded that the cause of the Challenger accident was the failure of the pressure seal in the aft field joint of the right Solid Rocket Motor. The failure was due to a faulty design unacceptably sensitive to a number of factors. These factors were the effects of temperature, physical dimensions, the character of materials, the effects of reusability, processing, and the reaction of the joint to dynamic loading.”

The commission does not use the phrase, “root cause.” Feynman (Appendix F: Personal observations on the reliability of the Shuttle) doesn’t explicitly state a cause. He addresses the “enormous disparity” between the subjective probabilities assigned to certain failure modes by NASA management and by NASA and Thiokol engineers. Elsewhere Feynman writes that he thought the effects of cold weather on the o-rings in the SRM joint were sufficient to cause the explosion. Further, he found poor management, specifically in Flight Readiness Reviews, to be responsible for the decision to launch against significant opposition from engineering experts, and to be the factor deserving attention (what most root-cause analyses seek) for preventing future disasters:

“Official management, on the other hand, claims to believe the probability of failure is a thousand times less. One reason for this may be an attempt to assure the government of NASA perfection and success in order to ensure the supply of funds. The other may be that they sincerely believed it to be true, demonstrating an almost incredible lack of communication between themselves and their working engineers.”

Feynman’s minority report further makes the case that the safety margins for the shuttle’s main engines (not the solid rockets that exploded) were sufficiently compromised to make their independent catastrophic failure also probable.

The probability of o-ring failure played a central role in analysis of the disaster. NASA’s Larry Mulloy apparently thought that the lack of correlation between temperature and combustion gas blow-by (indicating o-ring leakage) in past flights meant that low temperature was not a cause of leakage resulting in explosion. Feynman thought that the observed rate of blow-by and the probability of its resulting in explosion were sufficient to count as statistical explanation of the disastrous outcome. He judged that probability, using the estimates of engineers, to be much greater than the 1E-5 value given by NASA management – somewhere in the range of one in 100 (1E-2). With citizens’ lives at stake, 1E-2 is sufficiently close to 1.0 to deem the unwanted outcome likely. Despite some errors in Feynman’s conception of reliability engineering, his argument is plausible.

Feynman also judged the probability of main engine turbine blade failure to be in the range of 1E-2 per mission, contrasting NASA management’s (Jud Lovingood’s) estimate of 1E-5.

From the perspective of engineers, scientists and philosophers, the facts of the shuttle design and environmental conditions combined with the laws of nature (physics) together confer some combination of nomic and statistical expectancy on the disastrous outcome. The expected outcome both explains and predicts (retrodicts) the explosion.

The presence of low temperatures significantly increased (at least in Feynman’s view) the probability of a catastrophic failure mode; therefore cold weather at launch was both statistically and explanatorily relevant. But even without cold weather, the flawed design of the solid rocket field joint already made catastrophe probable. Further, main engine loading combined with turbine-blade crack propagation rates, in Feynman’s view, independently made catastrophic failure probable.

Thus, using some versions of statistical-explanation models, the Challenger disaster can be seen to be overdetermined, at least in a probabilistic sense. The disaster was made likely by several combinations of design facts and natural laws. From the perspective of a counterfactual model of explanation, cold weather could be seen not to have been causally (or explanatorily) relevant since its absence would not have prevented the outcome. Neither Feynman nor others on the Rogers Commission expressed their disagreement in the language of philosophy, epistemology and logic; but unstated, possibly unrealized, differences in interpretations of statistics, probability, and scientific explanation may account for their disagreement on cause of the accident.


Odds and ends

“A thing is safe if its attendant risks are judged to be acceptable.” – William Lowrance, Of Acceptable Risk: Science and the Determination of Safety

Failure of a covering-law model of explanation (from Wesley Salmon, Statistical Explanation, 1971):

  • Facts: Mr. Jones takes birth control pills. Mr. Jones has not become pregnant.
  • Law: Birth control pills prevent pregnancy.
  • Explanation deduced from facts plus covering law: Jones’ use of birth control pills explains his non-pregnancy.

Causal transitivity problems:
Mr. Jones contracts a bacterial disease that is deadly but easily cured with a dose of the right medicine. Jones is admitted to the hospital and Dr. One administers the drug, which has the curious property of being an effective cure if given once, but itself being fatal if given twice. Jones remains in the hospital to be monitored. The next day Dr. Two responsibly checks Jones’ chart, sees that Dr. One previously administered the treatment, and takes no further action. Jones survives and is released. Did Jones’ having contracted the disease cause Dr. Two to not administer a treatment? Causal transitivity would suggest it did.

A dog bites Jones’ right hand. Jones has a bomb, which he intends to detonate. He is right-handed. Because of the dog bite, Jones uses his left hand to trigger the bomb. Did the dog bite cause the bomb to explode? Examples lifted from Joseph Y. Halpern of the Cornell University Computer Science Dept.

Overdetermination and counterfactuals:
Two sharpshooters, both with live ammo, are charged with executing the condemned Mr. Jones. Each shooter’s accurately fired shot is sufficient to kill him. Thus the shot from each shooter fully explains the death of the condemned, and might therefore constitute a cause or explanation of Jones’ death. However, in a counterfactual model, neither shooter caused his death because, had either one not fired, Jones would still be dead. The counterfactual model follows our sense that something that has no impact on the outcome cannot explain it or be its cause. Against this model, we might ask how a fired shot, which has sufficient causal power by itself, can be disempowered by being duplicated.
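The tension in the sharpshooter example can be made concrete with a small sketch. The names here (jones_dies, is_sufficient, is_counterfactual_cause) are hypothetical, and the "any accurate shot is fatal" rule is simply the assumption stated in the example, not a general model of causation.

```python
def jones_dies(shots):
    """Rule assumed from the example: any one accurately fired shot is fatal."""
    return any(shots.values())


def is_sufficient(shooter, shooters):
    """Sufficiency test: does this shooter firing alone produce the death?"""
    alone = {name: (name == shooter) for name in shooters}
    return jones_dies(alone)


def is_counterfactual_cause(shooter, actual):
    """But-for test: remove only this shooter's shot and see if the outcome changes."""
    altered = dict(actual)
    altered[shooter] = False
    return jones_dies(actual) and not jones_dies(altered)


# What actually happened: both sharpshooters fired.
actual = {"shooter_1": True, "shooter_2": True}

results = {
    name: (is_sufficient(name, actual), is_counterfactual_cause(name, actual))
    for name in actual
}
for name, (sufficient, but_for) in results.items():
    print(f"{name}: sufficient={sufficient}, counterfactual cause={but_for}")
```

Each shot passes the sufficiency test and fails the but-for test, which is exactly the disagreement between the two models described above.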

Statistical explanation troubles:
Linus Pauling, despite winning two Nobel prizes, seems to have been a quack on some issues. Grossly oversimplified, 90% of patients who take large doses of Vitamin C recover from colds in seven days. I took Vitamin C and I recovered in seven days. Therefore Vitamin C cured my cold. But 90% of patients who do not take Vitamin C also recover from colds in seven days.

The Big Bang is the root cause of this post.

SpaceX, Fault Trees, and Root Cause

My previous six Fault Tree Friday posts were tutorials highlighting some of the stumbling blocks I’ve noticed in teaching classes on safety analysis. This one deals with fault trees in recent press.

I ran into a question on Quora asking how SpaceX so accurately determines the causes of failures in their rockets. The questioner said he heard Elon Musk mention a fault tree analysis and didn’t understand how it could be used to determine the cause of rocket failures.

In my response (essentially what follows below) I suggested that his confusion regarding the suitability of fault tree analysis to explaining incidents like the September 2016 SpaceX explosion (“anomaly” in SpaceX lingo) is well-founded. The term “fault tree analysis” (FTA) in aerospace has not historically been applied to work done during an accident/incident investigation (though it is in less technical fields, where it means something less rigorous).

FTA was first used during the design of the Minuteman system in the 1960s. FTA was initially used to validate a design against conceivable system-level (usually catastrophic) failures (often called hazards in the FTA context) by modeling all combinations of failures and faults that can jointly produce each modeled hazard. It was subsequently used earlier in the design process (i.e., earlier than the stage of design validation by reliability engineering) when we realized that FTA (or something very similar) is the only rational means of allocating redundancy in complex redundant systems. This allocation of redundancy ensures that systems effectively have no probabilistic strong or weak links – similar to the way stress analysis ensures that mechanical systems have no structural strong or weak links, yielding a “balanced” system.

During the design of a complex system, hazards are modeled by a so-called top-down (in FTA jargon) process. By “top-down” we mean that the process of building a fault tree (which is typically represented by a diagram looking somewhat like an org chart) uses functional decomposition of the hazardous state (the “top event” in a fault tree) by envisioning what equipment states could singly or in combination produce the top event. Each such state is an intermediate event in FTA parlance. Once such equipment states are identified, we use a similar analytical approach (i.e., similar thinking) to identify the equipment states necessary to jointly or singly produce the intermediate event in question. Thus the process continues from higher level events down to more basic events. This logical drill-down usually stops at a level of granularity (bottom events) sufficient to determine from observed frequencies in historical data (or from expert opinion about similar equipment) a probability of each bottom event. At this point a fault tree diagram looks like an org chart where each intermediate event is associated with a Boolean logic gate (e.g., AND or OR). This fault tree model can then be solved, by Boolean algebra (Boolean absorption and reduction), usually using dedicated software tools.

The solution to a fault tree consists of a collection of sets of bottom-level events, where each set is individually sufficient to produce the top event, and where all events in a set are individually necessary to that set. There may be thousands or millions of such sets (“cut sets”) in the solution to a fault tree for a complex system. Each set would include one or more bottom-level events. A set having only one bottom-level event would indicate that a single failure could produce the top event, i.e., cause a catastrophe. Civil aviation guidelines (e.g. 14 CFR 25.1309) require that no single event and no combinations of events more probable than a specified threshold should produce a catastrophic condition.
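As a rough illustration of what "solving" a fault tree means, the sketch below expands a tiny tree into its minimal cut sets using Boolean absorption. It is a toy under stated assumptions, not how dedicated FTA tools are implemented: the tuple-based tree encoding, the function names, and the pump and power-bus events are all hypothetical.

```python
from itertools import product


def minimize(sets):
    """Boolean absorption: drop any set that strictly contains another set."""
    return {s for s in sets if not any(other < s for other in sets)}


def cut_sets(node):
    """Return the minimal cut sets of a fault tree node as a set of frozensets.

    A node is either a string (a bottom-level event) or a tuple
    (gate, children) where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any child's cut set is enough to produce this event.
        combined = set().union(*child_sets)
    elif gate == "AND":
        # One cut set from every child must occur together.
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimize(combined)


# Hypothetical top event: loss of cooling.
tree = ("OR", [
    ("AND", ["pump_a_fails", "pump_b_fails"]),
    "power_bus_fails",
    ("AND", ["pump_a_fails", "power_bus_fails"]),  # absorbed by the single event
])

solution = cut_sets(tree)
for cs in sorted(solution, key=lambda s: (len(s), sorted(s))):
    print(sorted(cs))
```

The single-event cut set (the power bus in this made-up tree) sorts to the top of the list, which is the kind of single-failure path the paragraph above says must not be able to produce a catastrophic condition.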

If probability values are assigned to each bottom-level event, the solution to a fault tree will include calculated probability values for each intermediate event and for the top event. The top event probability is approximately equal (closely so when bottom-event probabilities are small) to the sum of the probabilities for each cut set in the solution’s collection of cut sets, each of which is an independent combination of bottom events jointly sufficient to produce the top event. A fault tree can still be useful without including probability calculations, however, since the cut set list, typically ordered by increasing event count, provides information about the importance of the associated equipment in avoiding the hazard (top event). The list also gives guidance for seeking common causes of events believed to be (modeled as being) independent.
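A short sketch of the probability step, assuming independent bottom events. Summing cut set probabilities is the usual rare-event approximation; the brute-force "exact" function is included only to show how close the sum is for small probabilities. The event names and probability values are invented for illustration, as are the function names.

```python
from itertools import product
from math import prod


def cut_set_probability(cut_set, p):
    """Probability that every event in the cut set occurs (independence assumed)."""
    return prod(p[event] for event in cut_set)


def top_event_sum(cut_sets, p):
    """Sum of cut set probabilities (rare-event approximation, an upper bound)."""
    return sum(cut_set_probability(cs, p) for cs in cut_sets)


def top_event_exact(cut_sets, p):
    """Exact probability by enumerating every combination of event states.

    Only practical for toy trees; the number of states doubles with each event.
    """
    events = sorted(set().union(*cut_sets))
    total = 0.0
    for states in product([False, True], repeat=len(events)):
        failed = {e for e, state in zip(events, states) if state}
        if any(cs <= failed for cs in cut_sets):
            total += prod(p[e] if e in failed else 1.0 - p[e] for e in events)
    return total


# Hypothetical cut sets and per-mission probabilities (illustration only).
cut_sets = [
    frozenset({"power_bus_fails"}),
    frozenset({"pump_a_fails", "pump_b_fails"}),
]
p = {"pump_a_fails": 1e-3, "pump_b_fails": 1e-3, "power_bus_fails": 1e-6}

approx = top_event_sum(cut_sets, p)
exact = top_event_exact(cut_sets, p)
print(f"sum of cut sets: {approx:.6e}")
print(f"exact:           {exact:.6e}")
```

In this made-up case the pair of redundant pumps and the single power bus contribute equally to the top event, which is the sort of balance the redundancy allocation described earlier aims for.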

A fault tree is only as good as the logic that went into its construction. I.e., FTA requires that the events within each cut set are absolutely independent and not causally correlated. This means that wherever an AND gate occurs in a tree, all events associated with that AND gate must be truly independent and have no common cause. So another value of a completed fault tree is to challenge beliefs about independence of events based on isolation, physical separation, vulnerability to environmental effects, duplicated maintenance errors, etc.

Now in the case of SpaceX, the press referred to conducting a fault tree analysis as part of the investigation. E.g., Bloomberg reported on Sep. 9 2016 that a “group will be conducting a thorough ‘fault tree analysis’ to try to determine what went wrong.” This usage is not consistent with the way the term is typically used in aerospace. As stated above, FTA relates to the design of a system and addresses all possible combinations of failures that can be catastrophic.

By contrast, accident investigation would be concerned, in the SpaceX case, with examining failed components and debris from the explosion. Such an investigation would likely include fractography, simulations, models, and hypotheses, which would draw on a group of fault trees that presumably would have existed since the design phase of the rocket and its planned operations.

It is unclear whether SpaceX meant that they were constructing fault trees as part of their accident investigation. They said they were “conducting tests” and that the “investigation team worked systematically through an extensive fault tree analysis.” It seems inconceivable that NASA, stakeholders, and customers would have allowed development of a rocket and use of public funds and facilities without the prior existence of fault trees. It’s possible, even if a SpaceX public relations representative said they were conducting a fault tree analysis, that the PR person was miscommunicating information received from engineers. If no fault trees existed at the time of the explosion, then shame on SpaceX and NASA; but I doubt that is the case. NASA has greatly increased emphasis on FTA and related techniques since the shuttle disasters.

For reasons stated above the relationship between accident investigation and fault tree analysis in aerospace is limited. It would be unproductive to analyze all possible causes of a catastrophic system-state for the purpose of accident investigation when physical evidence supports certain hypotheses and excludes others. Note that each cut set in the solution to a fault tree is effectively a hypothesis in the sense that it is a plausible explanation for the catastrophe; but a fault tree does not provide justification for the nodes of a causation chain.

Many people ask about the relationship between aerospace accident investigation and root cause analysis. While aerospace engineers, the NTSB, and the FAA sometimes use the term “root cause” they usually do so with a meaning somewhat different than its usage in popular techniques of root cause analysis. The NTSB seeks explanations of accidents in order to know what to change to prevent their recurrence. They realize that all causal chains are infinitely long and typically very wide. They use the term “root cause” to describe the relevant aspects of an accident that should be changed – specifically, equipment design, vehicle operating procedures or maintenance practices. They are not seeking root causes such as realignment of incentives for corporate officers, restructuring the education system for risk culture, or identifying the ethical breaches that led to poor purchasing decisions.

Perhaps surprisingly, aerospace accident analyses are rather skeptical of “why” questions (central to many root-cause techniques) as a means of providing explanation. From the perspective of theory of scientific explanation (a topic in philosophy of science), fault tree analysis is also skeptical of causality in the sense that it favors the “covering law” model of explanation. In practice, this means that both FTA and accident investigation seek facts of the case (occurrences of errors and hardware/software failures) that confer nomic expectability on the accident (the thing being explained). That is, the facts of the case, when combined with the laws of nature (physics) and the laws of math (Boolean algebra), require the accident (or top event) to happen. In this sense accident investigation identifies a set of facts (conditions or equipment states) that were jointly sufficient to produce the accident, given the laws of nature. I could have said “cause the accident” rather than “produce the accident” in my previous sentence; but as phrased, it emphasizes logical relationships rather than causal relationships. It attempts to steer clear of biases and errors in inference common to efforts pursuing “why” questions. Thus, techniques like “the 5 whys” have no place in aerospace accident analyses.

Another problem with “why”-based analyses is that “why” questions are almost always veiled choices that entail false dichotomies or contain an implicit – but likely incorrect – scope. I.e., as an example of false dichotomy, “why did x happen” too often is understood to mean “why x and not y.” The classic example of the why-scoping problem is attributed to Willie Sutton. When asked why he robbed banks, Sutton is reported to have replied, “because that’s where the money is.” In this account Sutton understood “why” to mean “why banks instead of churches” rather than “why steal instead of having a job.”

Some root-cause frameworks attempt to avoid the problems with “why” questions by focusing on “what” questions (i.e., what happened and what facts and conditions are relevant). This is a better approach, but in the absence of a previously existing fault tree, there may be a nearly infinite number of potentially relevant facts and conditions. Narrowing down the set of “what” questions exposes the investigators to preferring certain hypotheses too early. It can also be extremely difficult to answer a “what happened” question without being biased by a hunch established early in the investigation. This is akin to the problem of theory-laden observations in science. The NTSB and aerospace accident investigators seem to do an excellent job of not being led by hunches, partly by resisting premature inferences about evidence they collect.

I’d be interested in hearing from anyone having more information about the use of fault trees at SpaceX.

[Edit: a bit more on the topic in a subsequent post]
