Oroville Dam Risk Mismanagement

The Oroville Dam crisis exemplifies bad risk assessment, fouled by an unsound determination of hazard probability and severity.

It seems likely that Governor Brown and others in government believed that California’s drought was permanent, that they did so on irrational grounds, and that they concluded the risk of dam problems was low because the probability of flood was low. By doing this they greatly increased the cost of managing the risk, and increased the likelihood of loss of lives.

In other words, Governor Brown and others indulged in a belief that they found ideologically satisfying at the expense of sound risk analysis and management. Predictions of permanent drought in California were entertained by Governor Brown, the New York Times (California Braces for Unending Drought), Wired magazine (Drought Probably Forever) and other outlets last year when El Niño conditions failed to fill reservoirs. Peter Gleick of the Pacific Institute explained to KQED why the last drought would be unlike all others.

One would have to have immense confidence in the improbability of future floods to neglect ten-year-old warnings from several agencies of the need for dam repair. Apparently, many had such confidence. It was doubly unwarranted, given that with or without anthropogenic warming, the standard bimodal precipitation regime of a desert should be expected. That is, on the theory that man-made climate change exists, we should expect big rain years; and on rejection of that theory we should expect big rain years. Eternal-drought predictions were perhaps politically handy for raising awareness or for vote pandering, but they didn’t serve risk management.

Letting ideology and beliefs interfere with measuring risk by assessing the likelihood of each severity range of each hazard isn’t new. A decade ago many believed certain defunct financial institutions to be too big to fail, long before that phrase was understood to mean too big for the government to allow to fail. No evidence supported this belief, and plenty of counter-evidence existed.

This isn’t even the first time our government has indulged in irrational beliefs about weather. In the second half of the 1800s, many Americans, apparently including President Lincoln, believed, without necessarily stating it explicitly, that populating the wild west would cause precipitation to increase. The government enticed settlers to move west with land grants. There was a shred of scientific basis: plowing raises dust into the air, increasing the seeding of clouds. Coincidentally, there was a dramatic greening of the west from 1850 to 1880; but it was due to weather, not the desired climate change.

When the rains suddenly stopped in 1880, homesteaders faced decades of normal drought. Looking back, one wonders how farmers, investors and politicians could so deeply indulge in an irrational belief that led to very poor risk analysis.


– – –

Tom Hight is my name, an old bachelor I am,
You’ll find me out West in the country of fame,
You’ll find me out West on an elegant plain,
And starving to death on my government claim.

Hurrah for Greer County!
The land of the free,
The land of the bed-bug,
Grass-hopper and flea;
I’ll sing of its praises
And tell of its fame,
While starving to death
On my government claim.

Opening lyrics to a folk song by Daniel Kelley from the late 1800s chronicling the pain of settling on a government land grant after the end of a multi-decade wet spell.

The State of Risk Management

Norman Marks recently posted some thoughtful comments on the state of risk management after reading the latest Ponemon survey, “The Imperative to Raise Enterprise Risk Intelligence.”

The survey showed some expected results like the centrality of reputation and cyber risk concerns. It also found little recent progress in bridging silos between legal, IT and finance, which is needed for operational risk management to be effective. Sadly, half of the polled organizations lack a formal budget for enterprise risk management.

The Ponemon report differentiates ERM from enterprise risk intelligence by characterizing ERM as the application of rigorous and systematic analyses of organizational risks and enterprise risk intelligence as the insight needed to drive business decisions related to governance, risk and compliance.

Noting that only 43 percent of respondents said risk intelligence integrates well with the way business leaders make decisions, Marks astutely observes that we should not be surprised that ERM lacks budget. If the CEO and board don’t think risk management works, then why fund it?

Marks writes often on the need for an overhaul of ERM doctrine. I share this view. In his post on the Ponemon report, he offers eight observations, each implying a recommendation for fixing ERM. I strongly agree with six and a half of them, and would like to discuss those where I see it differently.

His points 4 and 5 are:

4. Risk practitioners too often are focused on managing risks instead of achieving business objectives. There’s a huge difference.

5. Risk practitioners don’t connect with business executives because they talk technobabble instead of the language of the business. A discussion of risk appetite or a risk appetite framework is not something that any executive focused on results will want to attend.

My interviews in recent months with boards and CEOs indicated those leaders thought almost the exact opposite. They suggested that risk managers should support business decisions by doing a better job of

  • identifying risks – more accurately, identifying unwanted outcomes (hazards, in my terminology)
  • characterizing hazards and investigating their causes, effects (and therefore severity) and relevant systems
  • quantifying risks by assessing the likelihoods for each severity range of each hazard
  • enumerating reasonable responses, actions and mitigations for those risks

Note that this list is rather consistent, at least in spirit, with Basel II and some of the less lofty writings on risk management.

My understanding of the desires of business leaders is that they want risk management to be deeper and better, not broader in scope. Sure, silos must be bridged, but risk management must demonstrate much more rigor in its “rigorous and systematic analysis” before ERM will be allowed to become Enterprise Decision Management.

It is clear, from ISO 31000’s definition of risk and the whole positive-risk fetish, that ERM aspires to be in the decision analysis and management business, but the board is not buying it. “Show us core competencies first,” says the board.

Thus I disagree with Norman on point 4. On point 5, I almost agree. Point 5 is not a fact, but a fact with an interpretation. Risk practitioners don’t connect with business executives. Norman suggests the reason is that risk managers talk technobabble. I suggest they too often talk gibberish. This may include technobabble if you take technobabble to mean nonsense and platitudes expressed through the misuse of technical language. CEOs aren’t mystified by heat maps; they’re embarrassed by them.

Norman seems to find risk-appetite frameworks similarly facile, so I think we agree. But concerning the “techno” in technobabble, I think boards want better and real technical info, not less technical info.

Since most of this post addresses where we differ, I’ll end by adding that Marks, along with Tim Leech and a few others, deserves praise for a tireless fight against a seemingly unstoppable but ineffectual model of enterprise risk management. Pomp, structure, and a compliance mindset do not constitute rigor; and boards and CEOs have keen detectors for baloney.


Teaching Kids about Financial Risk

Last year a Chicago teacher’s response to a survey on financial literacy included a beautifully stated observation:

“This [financial literacy] is a topic that should be taught as early as possible in order to curtail the mindset of fast money earned on the streets and gambling being the only way to improve one’s financial circumstances in life.”

The statement touches on the relationship between risk and reward and on the notion of delayed gratification. It also suggests that kids can grasp the fundamentals of financial risk.

The need for financial literacy is clear. Many adults don’t grasp the concept of compound interest, let alone the ability to calculate it. We’re similarly weak on the basics of risk. Combine the two weaknesses and you get the kind of investors that made Charles Ponzi famous.
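Compound interest itself is one line of arithmetic. A minimal sketch, with arbitrary illustrative numbers for the principal, rate, and horizon:

```python
def compound(principal, annual_rate, years, periods_per_year=12):
    """Future value of principal at annual_rate, compounded each period."""
    return principal * (1 + annual_rate / periods_per_year) ** (periods_per_year * years)

# $1,000 at 5% for 30 years, compounded monthly, grows to roughly $4,468.
print(round(compound(1000, 0.05, 30), 2))
```

That a modest sum more than quadruples over a working life, with no risk taken at all, is exactly the kind of intuition financial literacy is meant to build.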

I browsed the web for material on financial risk education for kids and found very little, much of it misguided. As I mentioned earlier, many advice-giving sites confuse risk taking with confidence building.

On one site I found this curious claim about the relationship between risk and reward:

What is risk versus return? In finance, risk versus return is the idea that the amount of potential return is proportional to the amount of risk taken in a financial investment.

In fact, history shows that reward correlates rather weakly with risk. As the Chicago teacher quoted above notes, financial security – along with good health and other benefits – stems from knowing that some risks have much downside and little upside. Or, more accurately, it means knowing that some risks have vastly higher expected costs than expected profits.

This holds where expected cost means the sum, over each possible adverse outcome of taking the risk, of its dollar cost times its probability, and where expected profit is the sum of each possible beneficial outcome times its probability. Here expected value would be the latter minus the former. This is simple economic risk analysis (expected value analysis). Many people get it intuitively. Others – some of whom are investment managers – pretend that some particular detail causes it not to apply to the case at hand. Or they might deny the existence of certain high-cost outcomes.
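A minimal sketch of that calculation, with entirely made-up outcomes and probabilities for a hypothetical risk:

```python
# Each possible outcome of taking a hypothetical risk: (dollar value, probability).
# Negative values are costs; positive values are profits. Probabilities sum to 1.
outcomes = [(-50_000, 0.01),  # rare large loss
            (-1_000, 0.10),   # occasional small loss
            (2_000, 0.89)]    # usual modest gain

expected_cost = sum(-value * prob for value, prob in outcomes if value < 0)
expected_profit = sum(value * prob for value, prob in outcomes if value > 0)
expected_value = expected_profit - expected_cost

print(round(expected_cost), round(expected_profit), round(expected_value))
```

With these numbers the expected profit (1780) comfortably exceeds the expected cost (600), so the risk is worth taking for a risk-neutral actor; flip the probabilities on the large loss and the modest gain and the same arithmetic says walk away.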

I was once honored to give Susan Beacham a bit of input as she was developing her Money Savvy Kids® curriculum. Nearly twenty years later the program has helped over a million kids to develop money smarts. Analyses show the program to be effective in shaping financial attitudes and kids’ understanding of spending, saving and investing money.

Beacham’s results show that young kids, teens, and young adults can learn how money works, a topic that otherwise slips through gaps between subjects in standard schooling. Maybe we can do the same with financial risk and risk in general.

– – –


A web site aimed at teaching kids about investment risk proposes an educational game where kids win candies by betting on the outcome of the roll of a six-sided die. Purportedly this game shows how return is proportional to risk taken. Before rolling the die the player states the number of guesses he or she will make on its outcome. The outcome is concealed until all guesses are made or a correct guess is made. Since the cost of any number of guesses is the same, I take the authors’ stated proportionality between reward and risk to mean that five guesses is less risky than two, for example, and therefore will have a lower yield. The authors provide the first two columns of the table below, showing the candies won vs. number of guesses. I added the Expected Value column, calculated as the expected profit minus the actual cost (one candy per play).

I think the authors missed an opportunity to point out that, as constructed, the game makes the five-guess option a money pump. They also constructed the game so that reward is not proportional to risk (using my guess at their understanding of risk). They also missed an opportunity to explore the psychological consequences of the one- and two-guess options. Both have the same expected value, but very different winning amounts. I’ve discussed risk-neutrality with ten-year-olds who seem to get the nuances better than some risk managers. It may be one of those natural proficiencies that are unlearned in school.

Overall, the game is constructed to teach that, while not proportional, reward does increase with risk, and, except for the timid who buy the five-guess option, high “risk” has no downside. This seems exactly the opposite of what I want kids to learn about risk.

No. of Guesses   Candies Won   Expected Value
      5               1        5/6 * 1 - 1  = -0.17
      4               2        4/6 * 2 - 1  = 0.33
      3               5        3/6 * 5 - 1  = 1.5
      2              10        2/6 * 10 - 1 = 2.33
      1              20        1/6 * 20 - 1 = 2.33
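For anyone who wants to check the arithmetic, the Expected Value column can be reproduced in a few lines of Python (assuming, as I did above, a fixed cost of one candy per play and a fair six-sided die):

```python
from fractions import Fraction

payouts = {5: 1, 4: 2, 3: 5, 2: 10, 1: 20}  # number of guesses -> candies won

for guesses, candies in payouts.items():
    win_prob = Fraction(guesses, 6)   # k guesses cover k of the 6 faces
    ev = win_prob * candies - 1       # expected profit minus the 1-candy cost
    print(guesses, candies, round(float(ev), 2))
```

Using exact fractions rather than floats makes it obvious that the two- and one-guess options have identical expected values (7/3 candies), which is the risk-neutrality discussion the game's authors skipped.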


Teaching Kids about Risk

Most people understand risk to mean some combination of the likelihood and severity of an unwanted outcome resulting from an encounter with a potentially hazardous situation or condition. In kid terms this means for any given activity:

  1. what can go wrong?
  2. how bad is it?
  3. how likely is it?

It’s also helpful to discuss risk in terms of trading risk for reward or benefit.

For example, the rewards of riding a bike include fun (the activity is intrinsically rewarding) and convenient transportation.

In adult risk-analysis terms, the above three aspects of risk equate to:

  1. hazard description
  2. hazard severity
  3. hazard probability

Some hazards associated with riding a bicycle include collision with a moving car, falling down, getting lost, having a flat tire, and being hit by an asteroid. A key point falls out of the above wording. To a risk analyst, bicycle riding is not a risk. Many risks are associated with bike riding, because many distinct hazards are connected to bike riding. Each hazard has an associated risk, depending on the severity and probability of the hazard.

Each hazard associated with bicycle riding differs in likelihood and severity. Getting hit by an asteroid is extremely unlikely and extremely harmful; but it is so unlikely that we can ignore the risk altogether. Colliding with a car can be very severe. Its likelihood depends greatly on bike riding practices.

Talking to kids about how to decrease the likelihood of colliding with a moving car helps teach kids about the concept of risk. Talking with them about the relative likelihood of outcomes such as asteroid strikes can also help.

Even young kids can understand the difference between chance and risk. The flip of a coin involves chance but not risk, unless you’ve placed a bet on the outcome. The same applies for predicting the outcome of contests.

I see a lot of articles aimed at teaching kids to take risks. Most of these really address helping kids build confidence and dispel irrational fears. Irrational fear often means errors in perception of either the severity (how bad is it) or the probability (how likely is it) of a hazard. Explicit identification of specific hazards will help conquer irrational fears and will help kids become more comprehensive in identifying the hazards (what can go wrong) associated with an activity.

This is good for kids, since they’re quick to visualize the rewards and benefits of an activity and take action before asking what can go wrong at all, let alone what are all the things that can go wrong.

Teach your kids about risk as a concept, not just about specific risks and hazards. They’ll make better CEOs and board members. We’ll all benefit.


Covering-Law Models, Statistical Relevance, and the Challenger Disaster

Space shuttle Challenger disintegrated one minute after takeoff, killing its seven crew members. Many presentations on root cause analysis (RCA) cite the accident to show how an RCA framework would have pointed directly to the root cause. But no one agrees exactly on that cause, not even the members of the Rogers Commission. And showing that previously reached conclusions and evidence can be fed into an RCA framework is very different from showing that the framework would have produced that conclusion.

A minor reference to the Covering-Law model of scientific explanation in my last post on fault trees and RCA drew some interest. I’ll look at the Challenger accident with regard to explanatory and causal models below.

A reader challenged my reference to covering-law models because of their association with a school of thought (“logical positivism”) abandoned fifty years ago. The positivists, like most people, saw a close connection between explanation and causality; but unlike most of us, they worried that chasing causality relied too heavily on induction and too often led to metaphysics and hollow semantics. That birds fly south because of instinct is no explanation at all. Nevertheless, we have a natural tendency to explain things by describing our perceptions of their causes.

WV Quine and Thomas Kuhn extinguished any remaining embers of logical positivism in the 1960s. It’s true that covering-law models of explanation have major failures – i.e., counterexamples to sufficiency are common (see odds and ends at the end of this post). But so does every other model of explanation, including statistical-relevance models, counterfactual models, unification models, and causal models.

In my post I suggested that, while most engineers probably never explicitly think about the pitfalls of causal explanations, they lean toward explanations that combine observations (facts of the case) with general laws (physics) such that the condition being modeled is inevitable. That is, they lean toward covering-law explanations, also called deductive-nomological models, since given deterministic laws and the facts of the case, pure deduction leads to the inevitability (nomic expectancy) of the observed outcome.

By leaning toward covering-law models of explanation, engineers avoid some of the problems with causal models of explanation. These problems were exactly why the logical positivists (e.g., Popper, Hempel, Nagel) pursued covering-law models of explanation. They sought explanations that were free of causation’s reliance on induction.

In the 1700’s David Hume expressed similar concern with causation in his treatise. Some might argue that this is just another case of philosophers trying to see who can’t doubt the most obvious truth. But the logical positivists’ concerns went beyond Hume’s. Positivist Carl Hempel was concerned that we too easily infer causation from correlation in a very practical sense – one that applies to things like real-world drug trials, accident investigation, and root cause analysis.

Hempel worried about causal overdetermination, the possibility that two conditions, both sufficient to produce an effect (to confer nomic expectability on it) both might exist. In this case, assigning one of the conditions as “the” cause is messy; and the two conditions can’t reasonably be called “contributing causes” because each is individually sufficient. Counterfactual models offer no help in this case since removal of neither condition would change the outcome; thus neither could be logically deemed a “cause.” Counterfactual notions of explanation, common in root cause frameworks, maintain that a condition is causally relevant if its removal would have prevented the outcome being investigated.

Another commonly cited problem with causal models is failure of transitivity. One might expect causation to be transitive: if a causes b and b causes c, then a should cause (and therefore explain) c. Famous examples show this not to be the case (see examples at the end of this post).

We need not descend into philosophy to find examples where overdetermination creates problems for root-cause analysis. The Challenger disaster suffices. But before delving into the Challenger example we should cover one other model of explanation – statistical relevance. The positivists added statistical models of explanation for cases where outcomes were non-deterministic. More recent scholars have refined statistical-relevance models of explanation.

Fault tree initiators (basic events such as microprocessor or valve failure, for example) are treated probabilistically. Fault trees model such failures as irreducibly indeterministic (like the decay of uranium atoms), regardless of the fact that a detailed analysis of the failed part might reveal a difference between it and one of the same age that did not fail. The design of a complex system expects failures of such components at a prescribed statistical rate. In this sense fault trees use inductive logic (i.e., the future will resemble the past) despite their reliance on deductive logic for calculating cut sets. NASA did not use fault tree analysis in the design of the shuttle.

Statistical information about historical – or sometimes estimated – failure rates often plays the role of evidence in scientific explanations. Statistical laws might, in conjunction with facts of a case, confer a high probability on the outcome being explained. Statistical explanations are common in medicine; the drug is considered successful if it relieves some fraction of patients, usually a large percentage.

Statistical explanations seem less satisfying for rare events, particularly when the impact (severity) is high. Most of us are willing to accept that probability explains losing a single bet on a single slot of the roulette wheel. But what about winning that bet? Can the same explanation account for both winning and losing?

Sidestepping this psychological hurdle of statistical and probabilistic explanations, we’re left with defending such explanations, in general, for cases where a portion of an event’s explanation is assigned to statistical phenomena. In these cases, a factor is deemed relevant not because it makes the probability P of the outcome high, but because it increases that probability. That is, given a population A, a condition C will be statistically relevant to condition B just in case the probability of B conditional on A and C is greater than the probability of B conditional on A alone. In symbolic form, C is relevant if P(B|A.C) > P(B|A).
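In code, the statistical-relevance test is just a comparison of two conditional frequencies. A toy sketch with invented counts (a population of bike rides as A, falls as the outcome B, and wet pavement as the condition C):

```python
# Invented observation counts, purely for illustration.
rides = 10_000        # size of population A (all rides)
falls = 200           # rides ending in a fall (B)
wet_rides = 1_000     # rides on wet pavement (A and C)
wet_falls = 60        # falls that occurred on wet pavement (B and C)

p_b_given_a = falls / rides            # P(B|A)   = 0.02
p_b_given_ac = wet_falls / wet_rides   # P(B|A.C) = 0.06

# C is statistically relevant to B just in case it raises B's probability.
print(p_b_given_ac > p_b_given_a)  # wet pavement is relevant here
```

Note that relevance requires only that the probability rise (here from 2% to 6%), not that the outcome become likely in absolute terms.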

Now consider the Challenger disaster and the disagreement between the findings of the Rogers Commission as a group and Richard Feynman as a dissenting participant. The commission (Ch. V: The Cause of the Accident) concluded with this explanation (i.e. cause) of the disaster:

In view of the findings, the Commission concluded that the cause of the Challenger accident was the failure of the pressure seal in the aft field joint of the right Solid Rocket Motor. The failure was due to a faulty design unacceptably sensitive to a number of factors. These factors were the effects of temperature, physical dimensions, the character of materials, the effects of reusability, processing, and the reaction of the joint to dynamic loading.”

The commission does not use the phrase, “root cause.” Feynman (Appendix F: Personal observations on the reliability of the Shuttle) doesn’t explicitly state a cause. He addresses the “enormous disparity” between the subjective probabilities assigned to certain failure modes by NASA management and by NASA and Thiokol engineers. Elsewhere Feynman writes that he thought the effects of cold weather on the o-rings in the SRM joint were sufficient to cause the explosion. Further, he found poor management, specifically in the Flight Readiness Reviews that produced the launch decision over significant opposition from engineering experts, to be the factor deserving attention (what most root-cause analyses seek) for preventing future disasters:

“Official management, on the other hand, claims to believe the probability of failure is a thousand times less. One reason for this may be an attempt to assure the government of NASA perfection and success in order to ensure the supply of funds. The other may be that they sincerely believed it to be true, demonstrating an almost incredible lack of communication between themselves and their working engineers.”

Feynman’s minority report further makes the case that the safety margins for the shuttle’s main engines (not the solid rockets that exploded) were sufficiently compromised to make their independent catastrophic failure also probable.

The probability of o-ring failure played a central role in analysis of the disaster. NASA’s Larry Mulloy apparently thought that the lack of correlation between combustion gas blow-by (indicating o-ring leakage) and temperature in past flights meant that low temperature was not a cause of leakage resulting in explosion. Feynman thought that the observed rate of blow-by and the probability of its resulting in explosion were sufficient to count as a statistical explanation of the disastrous outcome. He judged that probability, using the estimates of engineers, to be much greater than the 1E-5 value given by NASA management – somewhere in the range of one in 100 (1E-2). With citizens’ lives at stake, 1E-2 is sufficiently close to 1.0 to deem the unwanted outcome likely. Despite some errors in Feynman’s conception of reliability engineering, his argument is plausible.
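The practical gulf between those two estimates shows up when they are compounded over a flight program. A back-of-envelope sketch (the 100-flight horizon is my own illustration, not a figure from the report):

```python
def p_any_failure(p_per_flight, flights):
    """Probability of at least one failure across independent flights."""
    return 1 - (1 - p_per_flight) ** flights

print(round(p_any_failure(1e-2, 100), 3))  # engineers' estimate: ~0.634
print(round(p_any_failure(1e-5, 100), 5))  # management's estimate: ~0.001
```

At 1E-2 per flight, a hundred-flight program is more likely than not to lose a vehicle; at 1E-5, a loss remains a thousand-to-one long shot. The two estimates imply entirely different programs.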

Feynman also judged the probability of main engine turbine blade failure to be in the range of 1E-2 per mission, contrasting NASA management’s (Jud Lovingood‘s) estimate of 1E-5.

From the perspective of engineers, scientists and philosophers, the facts of the shuttle design and environmental conditions combined with the laws of nature (physics) together confer some combination of nomic and statistical expectancy on the disastrous outcome. The expected outcome both explains and predicts (retrodicts) the explosion.

The presence of low temperatures significantly increased (at least in Feynman’s view) the probability of a catastrophic failure mode; therefore cold weather at launch was both statistically and explanatorily relevant. But even without cold weather, the flawed design of the solid rocket field joint already made catastrophe probable. Further, main engine loading combined with turbine-blade crack propagation rates, in Feynman’s view, independently made catastrophic failure probable.

Thus, using some versions of statistical-explanation models, the Challenger disaster can be seen to be overdetermined, at least in a probabilistic sense. The disaster was likely by several combinations of design facts and natural laws. From the perspective of a counterfactual model of explanation, cold weather could be seen not to have been causally (or explanatorily) relevant since its absence would not have prevented the outcome. Neither Feynman nor others on the Rogers Commission expressed their disagreement in the language of philosophy, epistemology and logic; but unstated, possibly unrealized, differences in interpretations of statistics, probability, and scientific explanation may account for their disagreement on the cause of the accident.


Odds and ends

“A thing is safe if its attendant risks are judged to be acceptable.” – William Lowrance, Of Acceptable Risk: Science and the Determination of Safety

Failure of a covering-law model of explanation (from Wesley Salmon, Statistical Explanation, 1971):

  • Facts: Mr. Jones takes birth control pills. Mr. Jones has not become pregnant.
  • Law: Birth control pills prevent pregnancy.
  • Explanation deduced from facts plus covering law: Jones’ use of birth control pills explains his non-pregnancy.

Causal transitivity problems:
Mr. Jones contracts a bacterial disease that is deadly but easily cured with a dose of the right medicine. Jones is admitted to the hospital and Dr. One administers the drug, which has the curious property of being an effective cure if given once, but itself being fatal if given twice. Jones remains in the hospital to be monitored. The next day Dr. Two responsibly checks Jones’ chart, seeing that Dr. One previously administered the treatment and takes no further action. Jones survives and is released. Did Jones’ having contracted the disease cause Dr. Two to not administer a treatment? Causal transitivity would suggest it does.

A dog bites Jones’ right hand. Jones has a bomb, which he intends to detonate. He is right-handed. Because of the dog bite, Jones uses his left hand to trigger the bomb. Did the dog bite cause the bomb to explode? Examples lifted from Joseph Y. Halpern of the Cornell University Computer Science Dept.

Overdetermination and counterfactuals:
Two sharpshooters, both with live ammo, are charged with executing the condemned Mr. Jones. Each shooter’s accurately fired shot is sufficient to kill him. Thus the shot from each shooter fully explains the death of the condemned, and might therefore constitute a cause or explanation of Jones’ death. However, in a counterfactual model, neither shooter caused his death because, had each not fired, Jones would still be dead. The counterfactual model follows our sense that something that has no impact on the outcome cannot explain it or be its cause. Against this model, we might ask how a fired shot, which has sufficient causal power by itself, can be disempowered by being duplicated.

Statistical explanation troubles:
Linus Pauling, despite winning two Nobel prizes, seems to have been a quack on some issues. Grossly oversimplified: 90% of patients who take large doses of Vitamin C recover from a cold in seven days. I took Vitamin C and I recovered in seven days. Therefore Vitamin C cured my cold. But 90% of patients who do not take Vitamin C also recover from colds in seven days.

The Big Bang is the root cause of this post.

SpaceX, Fault Trees, and Root Cause

My previous six Fault Tree Friday posts were tutorials highlighting some of the stumbling blocks I’ve noticed in teaching classes on safety analysis. This one deals with fault trees in recent press.

I ran into a question on Quora asking how SpaceX so accurately determines the causes of failures in their rockets. The questioner said he heard Elon Musk mention a fault tree analysis and didn’t understand how it could be used to determine the cause of rocket failures.

In my response (essentially what follows below) I suggested that his confusion regarding the suitability of fault tree analysis for explaining incidents like the September 2016 SpaceX explosion (“anomaly” in SpaceX lingo) is well-founded. The term “fault tree analysis” (FTA) in aerospace has not historically been applied to work done during an accident/incident investigation (though it is in less technical fields, where it means something less rigorous).

FTA was first used during the design of the Minuteman system in the 1960s. FTA was initially used to validate a design against conceivable system-level (usually catastrophic) failures (often called hazards in the FTA context) by modeling all combinations of failures and faults that can jointly produce each modeled hazard. It was subsequently used earlier in the design process (i.e., earlier than the stage of design validation by reliability engineering) when we realized that FTA (or something very similar) is the only rational means of allocating redundancy in complex redundant systems. This allocation of redundancy ensures that systems effectively have no probabilistic strong or weak links – similar to the way stress analysis ensures that mechanical systems have no structural strong or weak links, yielding a “balanced” system.

During the design of a complex system, hazards are modeled by a so-called top-down (in FTA jargon) process. By “top-down” we mean that the process of building a fault tree (which is typically represented by a diagram looking somewhat like an org chart) uses functional decomposition of the hazardous state (the “top event” in a fault tree) by envisioning what equipment states could singly or in combination produce the top event. Each such state is an intermediate event in FTA parlance. Once such equipment states are identified, we use a similar analytical approach (i.e., similar thinking) to identify the equipment states necessary to jointly or singly produce the intermediate event in question. Thus the process continues from higher level events down to more basic events. This logical drill-down usually stops at a level of granularity (bottom events) sufficient to determine from observed frequencies in historical data (or from expert opinion about similar equipment) a probability of each bottom event. At this point a fault tree diagram looks like an org chart where each intermediate event is associated with a Boolean logic gate (e.g., AND or OR). This fault tree model can then be solved by Boolean algebra (absorption and reduction), usually with dedicated software tools.

The solution to a fault tree consists of a collection of sets of bottom-level events, where each set is individually sufficient to produce the top event, and where every event in a set is individually necessary to that set. There may be thousands or millions of such sets (“cut sets”) in the solution to a fault tree for a complex system. Each set includes one or more bottom-level events. A set having only one bottom-level event indicates that a single failure could produce the top event, i.e., cause a catastrophe. Civil aviation guidelines (e.g., 14 CFR 25.1309) require that no single event, and no combination of events more probable than a specified threshold, may produce a catastrophic condition.
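The reduction from tree to minimal cut sets can be sketched in a few lines: OR gates union their children's cut sets, AND gates take cross products, and Boolean absorption discards redundant supersets. The pump and controller events below are hypothetical; real trees and tools are of course far larger.

```python
from itertools import product

def cut_sets(node):
    """Return the minimal cut sets of a tree given as ("AND"|"OR", children);
    a bare string is a bottom event and is its own single-event cut set."""
    if isinstance(node, str):
        return [frozenset([node])]
    kind, children = node
    child_sets = [cut_sets(c) for c in children]
    if kind == "OR":                    # any child's cut set suffices
        sets = [s for group in child_sets for s in group]
    else:                               # AND: one cut set from each child
        sets = [frozenset().union(*combo) for combo in product(*child_sets)]
    # Boolean absorption: a strict superset of another cut set is redundant.
    minimal = [s for s in sets if not any(t < s for t in sets)]
    return sorted(set(minimal), key=sorted)

tree = ("OR", [
    ("AND", ["pump_A_fails", "pump_B_fails"]),
    "controller_fails",
    ("AND", ["pump_A_fails", "controller_fails"]),   # absorbed in the solution
])

solution = cut_sets(tree)
# Any one-event cut set flags a single failure that produces the top event.
single_point_failures = [s for s in solution if len(s) == 1]
print(solution, single_point_failures)
```

Here the controller is a single-point failure, exactly the kind of cut set that guidelines like 14 CFR 25.1309 force designers to eliminate.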

If probability values are assigned to each bottom-level event, the solution to a fault tree will include calculated probability values for each intermediate event and for the top event. The top event probability is, to a close approximation for rare events, the sum of the probabilities of the cut sets in the solution, each cut set being an independent combination of bottom events jointly sufficient to produce the top event. A fault tree can still be useful without probability calculations, however, since the cut set list, typically ordered by increasing event count, indicates the importance of the associated equipment in avoiding the hazard (top event). The list also gives guidance for seeking common causes among events believed to be (modeled as being) independent.
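With probabilities attached to the bottom events, the cut-set arithmetic is easy to illustrate. The numbers below are invented for the sketch; the sum is the rare-event approximation commonly used when bottom-event probabilities are small.

```python
from math import prod

# Hypothetical per-demand probabilities for the bottom events.
p = {"pump_A_fails": 1e-3, "pump_B_fails": 1e-3, "controller_fails": 1e-5}

# Two cut sets from a solved tree: a two-event AND and a single failure.
cut_sets = [{"pump_A_fails", "pump_B_fails"}, {"controller_fails"}]

# Each cut set's probability is the product of its (assumed independent)
# bottom-event probabilities; the top event is approximated by their sum.
cut_probs = [prod(p[e] for e in s) for s in cut_sets]
p_top = sum(cut_probs)
print(p_top)   # about 1.1e-05: the single failure dominates
```

Even this toy case shows how the quantified solution directs attention: the single-event cut set contributes roughly ten times the probability of the redundant pair.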

A fault tree is only as good as the logic that went into its construction. That is, FTA requires that the events within each cut set be truly independent and not causally correlated. This means that wherever an AND gate occurs in a tree, all events associated with that AND gate must be genuinely independent and have no common cause. So another value of a completed fault tree is to challenge beliefs about the independence of events based on isolation, physical separation, vulnerability to environmental effects, duplicated maintenance errors, etc.

Now in the case of SpaceX, the press referred to conducting a fault tree analysis as part of the investigation. E.g., Bloomberg reported on Sep. 9 2016 that a “group will be conducting a thorough ‘fault tree analysis’ to try to determine what went wrong.” This usage is not consistent with the way the term is typically used in aerospace. As stated above, FTA relates to the design of a system and addresses all possible combinations of failures that can be catastrophic.

By contrast, accident investigation would be concerned, in the SpaceX case, with examining failed components and debris from the explosion. Such an investigation would likely include fractography, simulations, models, and hypotheses, which would draw on a group of fault trees that presumably would have existed since the design phase of the rocket and its planned operations.

It is unclear whether SpaceX meant that they were constructing fault trees as part of their accident investigation. They said they were “conducting tests” and that the “investigation team worked systematically through an extensive fault tree analysis.” It seems inconceivable that NASA, stakeholders, and customers would have allowed development of a rocket and use of public funds and facilities without the prior existence of fault trees. It is possible, even if a SpaceX public relations representative said they were conducting a fault tree analysis, that the PR person was miscommunicating information received from engineers. If no fault trees existed at the time of the explosion, then shame on SpaceX and NASA; but I doubt that is the case. NASA has greatly increased its emphasis on FTA and related techniques since the shuttle disasters.

For reasons stated above the relationship between accident investigation and fault tree analysis in aerospace is limited. It would be unproductive to analyze all possible causes of a catastrophic system-state for the purpose of accident investigation when physical evidence supports certain hypotheses and excludes others. Note that each cut set in the solution to a fault tree is effectively a hypothesis in the sense that it is a plausible explanation for the catastrophe; but a fault tree does not provide justification for the nodes of a causation chain.

Many people ask about the relationship between aerospace accident investigation and root cause analysis. While aerospace engineers, the NTSB, and the FAA sometimes use the term “root cause,” they usually do so with a meaning somewhat different from its usage in popular techniques of root cause analysis. The NTSB seeks explanations of accidents in order to know what to change to prevent their recurrence. They realize that all causal chains are infinitely long and typically very wide. They use the term “root cause” to describe the relevant aspects of an accident that should be changed – specifically, equipment design, vehicle operating procedures, or maintenance practices. They are not seeking root causes such as realignment of incentives for corporate officers, restructuring the education system for risk culture, or identifying the ethical breaches that led to poor purchasing decisions.

Perhaps surprisingly, aerospace accident analyses are rather skeptical of “why” questions (central to many root-cause techniques) as a means of providing explanation. From the perspective of theory of scientific explanation (a topic in philosophy of science), fault tree analysis is also skeptical of causality in the sense that it favors the “covering law” model of explanation. In practice, this means that both FTA and accident investigation seek facts of the case (occurrences of errors and hardware/software failures) that confer nomic expectability on the accident (the thing being explained). That is, the facts of the case, when combined with the laws of nature (physics) and the laws of math (Boolean algebra), require the accident (or top event) to happen. In this sense accident investigation identifies a set of facts (conditions or equipment states) that were jointly sufficient to produce the accident, given the laws of nature. I could have said “cause the accident” rather than “produce the accident” in my previous sentence; but as phrased, it emphasizes logical relationships rather than causal relationships. It attempts to steer clear of biases and errors in inference common to efforts pursuing “why” questions. Thus, techniques like “the 5 whys” have no place in aerospace accident analyses.

Another problem with “why”-based analyses is that “why” questions are almost always veiled choices that entail false dichotomies or contain an implicit – but likely incorrect – scope. I.e., as an example of false dichotomy, “why did x happen” too often is understood to mean “why x and not y.” The classic example of the why-scoping problem is attributed to Willie Sutton. When asked why he robbed banks, Sutton is reported to have replied, “because that’s where the money is.” In this account Sutton understood “why” to mean “why banks instead of churches” rather than “why steal instead of having a job.”

Some root-cause frameworks attempt to avoid the problems with “why” questions by focusing on “what” questions (i.e., what happened and what facts and conditions are relevant). This is a better approach, but in absence of a previously existing fault tree, there may be a nearly infinite number of potentially relevant facts and conditions. Narrowing down the set of what questions exposes the investigators to preferring certain hypotheses too early. It can also be extremely difficult to answer a “what happened” question without being biased by a hunch established early in the investigation. This is akin to the problem of theory-laden observations in science. The NTSB and aerospace accident investigators seem to do an excellent job of not being led by hunches, partly by resisting premature inferences about evidence they collect.

I’d be interested in hearing from anyone having more information about the use of fault trees at SpaceX.

[Edit: a bit more on the topic in a subsequent post]

  – – – – – – – –

Are you in the San Francisco Bay area?
If so, consider joining the Risk Management meetup group.

Risk management has evolved separately in various industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase current trends, case studies, and best practices in our profession with a focus on practical application and advancing the state of the art.


Fault Tree Inhibitor and Combination Gates

So far in this Fault Tree Friday series (1 2 3 4 5) I’ve only dealt with AND and OR logic gates. Over the years engineers have occasionally seen a need for other logical relationships, such as NOR, NAND (NOT AND), Exclusive OR, etc. While these are essential for designing circuits, I have never had occasion to use them in fault trees, and I sense that few others do. I recently spoke with a Systems Safety engineer at Boeing who suggested that if you need them, you’ve structured your fault tree wrong.

Two additional gate types that do appear fairly often are the Inhibit gate and the Combination gate. The Inhibit gate (not to be confused with an initiator, i.e., a basic event), usually diagrammed as a point-down hexagon, is logically identical to an AND gate with two children. It is used to signify clearly that one of the children is not a fault per se, but some sort of qualifying condition that must be satisfied for the gate’s other child event to produce the condition named by the Inhibit gate. The non-fault event beneath an Inhibit gate is shown with a special symbol (often an oval) and is called a Conditioning Event.

Some regulations and standards include provisions that no single failure, regardless of probability, shall have catastrophic results (e.g., AC 25.1309-1). Use of Inhibit gates in fault trees under such constraints will usually be scrutinized, so the conditioning event must represent an uncommon state. My favorite example is a rejected takeoff. It’s not a fault. It’s uncommon but not extremely rare. And to be of consequence it has to occur at high speed. I’ve seen rates in the range of 1E-4 or 1E-5 per takeoff used in the past. Whether an Inhibit gate is justified for low-temperature launch in modeling O-ring failure in the Challenger situation is debatable.
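Numerically, an Inhibit gate behaves exactly like a two-input AND: the gate output probability is the fault probability times the conditioning-event probability, assuming independence. The braking-loss figure below is invented for illustration; the conditioning-event rate echoes the 1E-5-per-takeoff range mentioned above.

```python
def inhibit(p_fault: float, p_condition: float) -> float:
    """Output probability of an Inhibit gate: the fault AND the
    qualifying (conditioning) event, assumed independent."""
    return p_fault * p_condition

p_braking_loss = 1e-6       # hypothetical failure probability per takeoff
p_high_speed_rto = 1e-5     # conditioning event: high-speed rejected takeoff
print(inhibit(p_braking_loss, p_high_speed_rto))   # about 1e-11
```

The conditioning event buys roughly five orders of magnitude here, which is precisely why reviewers scrutinize whether the condition really is that uncommon.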

A standard graphical representation of the rejected takeoff scenario, in which an independent loss of braking or reverse thrust capability might be of interest, appears below.

[Figure: inhibit gate]

The Combination Gate, usually represented by a flat-bottom hexagon, is a modeling device that eliminates repetition of similar branches in situations where, for example, two of four redundant paths must be failed to produce the condition named by the Combination gate.

Note that true redundancy may not exist in all such situations. In the aircraft braking example above, all brakes are applied together, but some fraction of them (half, for example) may be sufficient to avoid a hazardous state if they are functional. For a six-wheeled aircraft, the Combination gate lets us specify the number of failures within a group of logical branches that will result in the relevant condition. In the case of three-of-six, the Combination gate is shorthand for an OR gate with twenty AND gates beneath it, one for each combination of three, e.g., {1,2,3; 1,2,4; 1,2,5; 1,2,6; 1,3,4; 1,3,5; 1,3,6; etc.}.
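The expansion is mechanical enough to script. The sketch below enumerates the AND gates that a three-of-six Combination gate stands for, using hypothetical brake event names:

```python
from itertools import combinations

# Expand an "m of n" Combination gate into its equivalent OR-of-ANDs form
# for the six-wheel braking example: three of six brakes failed.
brakes = [f"brake_{i}_fails" for i in range(1, 7)]
and_gates = [set(combo) for combo in combinations(brakes, 3)]

print(len(and_gates))         # C(6,3) = 20 AND gates under a single OR gate
print(sorted(and_gates[0]))   # first combination: brakes 1, 2, and 3
```

Each of the twenty sets becomes one AND gate; the Combination gate simply spares us drawing them.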

The two shorthand (stripped-down to show only the logic and event types) fault trees above and below the green line in the image below are logically equivalent. Each models the case where two of four initiator events must exist to produce the top event.

[Figure: combination gate]

The Combination gate is not merely handy but is in practice a necessity for cases where the number of combinations is large. While two-of-four has only six possible combinations, the ten-of-twenty case, which might occur in a fault tree for the Airbus A380, would require two million events and gates to diagram without the Combination gate.

If you’re curious about how to enumerate the possible results of a Combination gate, the relevant area of mathematics is called combinatorics. The specific rules here are that order is not important (thus this is literally a combination, not a permutation) and repetition is not allowed. This means that {a,b} and {b,a} represent the same real-world situation, so we’re concerned with only one of them. Likewise, {a,a} is not allowed, since event “a” can occur only once. Reliability engineers often call this an “r of n” situation. Since “r” is often used to represent failure rate in the fault-tree world, users of fault trees sometimes call it an “m of n” scenario. The formula for the number of combinations is:

c = n! / ( m! * (n - m)! )

For the two-of-four case:

c = 4! / ( 2! * 2! ) = 24 / 4 = 6
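The same arithmetic in code, with a cross-check against the standard library (`math.comb` computes exactly this quantity):

```python
from math import comb, factorial

def m_of_n(m: int, n: int) -> int:
    """Number of cut sets produced by an m-of-n Combination gate."""
    return factorial(n) // (factorial(m) * factorial(n - m))

print(m_of_n(2, 4))    # 6, the two-of-four case worked above
print(m_of_n(3, 6))    # 20, three-of-six for the six-wheeled aircraft
assert m_of_n(10, 20) == comb(20, 10)   # cross-check: 184,756
```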

For any value of n, the maximum value of c occurs when m equals n/2, which is often the very case being modeled, as in the ten-of-twenty example described above:

c = 2,432,902,008,176,640,000 / ( 3,628,800 * 3,628,800 ) = 184,756

In the above example there are 184,756 cut sets, each composed of ten independent initiator events. While an AND of ten events likely has a very low probability, the combination of all such AND gates will be about five orders of magnitude more probable than any one of them.
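The amplification is easy to quantify. Assuming a hypothetical probability of 1e-4 per event and full independence, the Combination gate's output is the number of cut sets times the probability of any one ten-event AND:

```python
from math import comb

p_event = 1e-4                       # hypothetical per-event probability
p_one_and = p_event ** 10            # one specific set of ten failures
p_gate = comb(20, 10) * p_one_and    # rare-event sum over all 184,756 cut sets

print(p_gate / p_one_and)            # 184756: roughly five orders of magnitude
```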

This possibly unintuitive result is yet another reason that fault tree analysis belongs in the preliminary design of systems.

 –  –  –

Though a good deal is too strange to be believed, nothing is too strange to have happened. – Thomas Hardy