
Covering-Law Models, Statistical Relevance, and the Challenger Disaster

Space shuttle Challenger disintegrated about 73 seconds after liftoff, killing its seven crew members. Many presentations on root cause analysis (RCA) cite the accident to show how an RCA framework would have pointed directly to the root cause. But no one agrees exactly on that cause, not even the members of the Rogers Commission. And showing that previously reached conclusions and evidence can be fed into an RCA framework is very different from showing that the framework would have produced that conclusion.

A minor reference to the Covering-Law model of scientific explanation in my last post on fault trees and RCA drew some interest. I’ll look at the Challenger accident with regard to explanatory and causal models below.

A reader challenged my reference to covering-law models because of their association with a school of thought (“logical positivism”) abandoned fifty years ago. The positivists, like most people, saw a close connection between explanation and causality; but unlike most of us, they worried that chasing causality relied too heavily on induction and too often led to metaphysics and hollow semantics. That birds fly south because of instinct is no explanation at all. Nevertheless, we have a natural tendency to explain things by describing our perceptions of their causes.

WV Quine and Thomas Kuhn extinguished any remaining embers of logical positivism in the 1960s. It’s true that covering-law models of explanation have major failures – i.e., counterexamples to sufficiency are common (see odds and ends at the end of this post). But so does every other model of explanation, including statistical-relevance models, counterfactual models, unification models, and causal models.

In my post I suggested that, while most engineers probably never explicitly think about the pitfalls of causal explanations, they lean toward explanations that combine observations (facts of the case) with general laws (physics) such that the condition being modeled is inevitable. That is, they lean toward covering-law explanations, also called deductive-nomological models, since given deterministic laws and the facts of the case, pure deduction leads to the inevitability (nomic expectancy) of the observed outcome.

By leaning toward covering-law models of explanation, engineers avoid some of the problems with causal models of explanation. These problems were exactly why the logical positivists (e.g., Popper, Hempel, Nagel) pursued covering-law models of explanation. They sought explanations that were free of causation’s reliance on induction.

In the 1700s David Hume expressed similar concern with causation in his treatise. Some might argue that this is just another case of philosophers trying to see who can doubt the most obvious truth. But the logical positivists’ concerns went beyond Hume’s. Positivist Carl Hempel was concerned that we too easily infer causation from correlation in a very practical sense – one that applies to things like real-world drug trials, accident investigation, and root cause analysis.

Hempel worried about causal overdetermination, the possibility that two conditions, each sufficient to produce an effect (to confer nomic expectability on it), might both be present. In this case, assigning one of the conditions as “the” cause is messy; and the two conditions can’t reasonably be called “contributing causes” because each is individually sufficient. Counterfactual notions of explanation, common in root cause frameworks, maintain that a condition is causally relevant if its removal would have prevented the outcome being investigated. Counterfactual models offer no help in this case, since removing either condition alone would not change the outcome; thus neither could be logically deemed a “cause.”

Another commonly cited problem with causal models is failure of transitivity. One might expect causation to be transitive: if a causes b and b causes c, a should cause (and therefore explain) c. Famous examples show this not to be the case (see example at end of this post).

We need not descend into philosophy to find examples where overdetermination creates problems for root-cause analysis. The Challenger disaster suffices. But before delving into the Challenger example we should cover one other model of explanation – statistical relevance. The positivists added statistical models of explanation for cases where outcomes were non-deterministic. More recent scholars have refined statistical-relevance models of explanation.

Fault tree initiators (basic events such as microprocessor or valve failure, for example) are treated probabilistically. Fault trees model such failures as irreducibly indeterministic (like the decay of uranium atoms), regardless of the fact that a detailed analysis of the failed part might reveal a difference between it and one of the same age that did not fail. The design of complex systems expects failures of such components at a prescribed statistical rate. In this sense fault trees use inductive logic (i.e., the future will resemble the past) despite their reliance on deductive logic for calculating cut sets. NASA did not use fault tree analysis in design of the shuttle.
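To make that split between inductive inputs and deductive calculation concrete, here is a minimal sketch in Python. The basic events, their probabilities, and the cut sets are all made up for illustration, and the basic events are assumed to be independent. The failure probabilities stand in for the inductive part (historical rates); everything computed from them is plain deduction.

```python
from itertools import product
from math import prod


def top_event_probability(cut_sets, p):
    """Exact probability that at least one minimal cut set is fully failed.

    Assumes independent basic events. Enumerates every failed/working state,
    so it is only practical for small illustrative trees.
    """
    events = sorted(p)
    total = 0.0
    for state in product([False, True], repeat=len(events)):
        failed = {e for e, is_failed in zip(events, state) if is_failed}
        if any(cut_set <= failed for cut_set in cut_sets):
            total += prod(p[e] if e in failed else 1 - p[e] for e in events)
    return total


def rare_event_approximation(cut_sets, p):
    """Sum of cut-set probabilities; an upper bound that is close when all p are small."""
    return sum(prod(p[e] for e in cut_set) for cut_set in cut_sets)


# Hypothetical failure probabilities per mission (not real component data).
p_fail = {"valve": 1e-3, "cpu_a": 1e-2, "cpu_b": 1e-2}

# Hypothetical minimal cut sets: the valve alone, or both redundant processors.
cut_sets = [frozenset({"valve"}), frozenset({"cpu_a", "cpu_b"})]

exact = top_event_probability(cut_sets, p_fail)
approx = rare_event_approximation(cut_sets, p_fail)
print(f"exact: {exact:.7f}  rare-event approximation: {approx:.7f}")
```

The redundant processor pair contributes far less than the single valve, which is the kind of result a cut-set calculation is meant to expose.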

Statistical information about historical – or sometimes estimated – failure rates often plays the role of evidence in scientific explanations. Statistical laws might, in conjunction with facts of a case, confer a high probability on the outcome being explained. Statistical explanations are common in medicine; the drug is considered successful if it relieves some fraction of patients, usually a large percentage.

Statistical explanations seem less satisfying for rare events, particularly when the impact (severity) is high. Most of us are willing to accept that probability explains losing a single bet on a single slot in the roulette wheel. But what about winning that bet? Can the same explanation account for both winning and losing?

Sidestepping this psychological hurdle of statistical and probabilistic explanations, we’re left with defending such explanations, in general, for cases where a portion of an event’s explanation is assigned to statistical phenomena. In these cases, a factor is deemed relevant not because it makes the probability P of the outcome high, but because it increases that probability. That is, given a population A, a condition C will be statistically relevant to condition B just in case the probability of B conditional on A and C is greater than the probability of B conditional on A alone. In symbolic form, C is relevant if P(B|A.C) > P(B|A).
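Here is a minimal sketch in Python of that relevance test, with probabilities estimated as relative frequencies. The records and the field names "A", "B", and "C" are invented for illustration; they are not shuttle data.

```python
from fractions import Fraction


def conditional_probability(records, outcome, given):
    """Estimate P(outcome | given) as a relative frequency over the records."""
    matching = [r for r in records if all(r[k] == v for k, v in given.items())]
    if not matching:
        raise ValueError("no records satisfy the conditioning facts")
    hits = sum(1 for r in matching if r[outcome])
    return Fraction(hits, len(matching))


def is_statistically_relevant(records, outcome, population, condition):
    """True if P(B | A.C) > P(B | A), the relevance criterion stated above."""
    p_a = conditional_probability(records, outcome, population)
    p_ac = conditional_probability(records, outcome, {**population, **condition})
    return p_ac > p_a


def make_records(count, has_condition, outcomes):
    """Build `count` made-up records in population A, `outcomes` of which show B."""
    return [{"A": True, "C": has_condition, "B": i < outcomes} for i in range(count)]


# Hypothetical data: 10 cases with condition C (3 show B), 90 without (2 show B).
records = make_records(10, True, 3) + make_records(90, False, 2)

p_b_given_a = conditional_probability(records, "B", {"A": True})
p_b_given_ac = conditional_probability(records, "B", {"A": True, "C": True})
relevant = is_statistically_relevant(records, "B", {"A": True}, {"C": True})
print(p_b_given_a, p_b_given_ac, relevant)
```

In this made-up data C raises the probability of B from 1/20 to 3/10. C counts as relevant even though 3/10 is nowhere near certainty, which is the distinction the criterion draws.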

Now consider the Challenger disaster and the disagreement between the findings of the Rogers Commission as a group and Richard Feynman as a dissenting participant. The commission (Ch. V: The Cause of the Accident) concluded with this explanation (i.e. cause) of the disaster:

“In view of the findings, the Commission concluded that the cause of the Challenger accident was the failure of the pressure seal in the aft field joint of the right Solid Rocket Motor. The failure was due to a faulty design unacceptably sensitive to a number of factors. These factors were the effects of temperature, physical dimensions, the character of materials, the effects of reusability, processing, and the reaction of the joint to dynamic loading.”

The commission does not use the phrase, “root cause.” Feynman (Appendix F: Personal observations on the reliability of the Shuttle) doesn’t explicitly state a cause. He addresses the “enormous disparity” between the subjective probabilities assigned to certain failure modes by NASA management and by NASA and Thiokol engineers. Elsewhere Feynman writes that he thought the effects of cold weather on the o-rings in the SRM joint were sufficient to cause the explosion. Further, he found poor management, specifically in Flight Readiness Reviews, to be responsible for the decision to launch against significant opposition from engineering experts, and he judged that management failure to be the factor deserving attention (what most root-cause analyses seek) for preventing future disasters:

“Official management, on the other hand, claims to believe the probability of failure is a thousand times less. One reason for this may be an attempt to assure the government of NASA perfection and success in order to ensure the supply of funds. The other may be that they sincerely believed it to be true, demonstrating an almost incredible lack of communication between themselves and their working engineers.”

Feynman’s minority report further makes the case that the safety margins for the shuttle’s main engines (not the solid rockets that exploded) were sufficiently compromised to make their independent catastrophic failure also probable.

The probability of o-ring failure played a central role in analysis of the disaster. NASA’s Larry Mulloy apparently thought that the lack of correlation between combustion gas blow-by (indicating o-ring leakage) and launch temperature in past flights meant that low temperature was not a cause of leakage resulting in explosion. Feynman thought that the observed rate of blow-by and the probability of its resulting in explosion were sufficient to count as statistical explanation of the disastrous outcome. He judged that probability, using the estimates of engineers, to be much greater than the 1E-5 value given by NASA management – somewhere in the range of one in 100 (1E-2). With citizens’ lives at stake, 1E-2 is sufficiently close to 1.0 to deem the unwanted outcome likely. Despite some errors in Feynman’s conception of reliability engineering, his argument is plausible.

Feynman also judged the probability of main engine turbine blade failure to be in the range of 1E-2 per mission, contrasting NASA management’s (Jud Lovingood’s) estimate of 1E-5.

From the perspective of engineers, scientists and philosophers, the facts of the shuttle design and environmental conditions combined with the laws of nature (physics) together confer some combination of nomic and statistical expectancy on the disastrous outcome. The expected outcome both explains and predicts (retrodicts) the explosion.

The presence of low temperatures significantly increased (at least in Feynman’s view) the probability of a catastrophic failure mode; therefore cold weather at launch was both statistically and explanatorily relevant. But even without cold weather, the flawed design of the solid rocket field joint already made catastrophe probable. Further, main engine loading combined with turbine-blade crack propagation rates, in Feynman’s view, independently made catastrophic failure probable.

Thus, using some versions of statistical-explanation models, the Challenger disaster can be seen to be overdetermined, at least in a probabilistic sense. The disaster was made likely by several combinations of design facts and natural laws. From the perspective of a counterfactual model of explanation, cold weather could be seen not to have been causally (or explanatorily) relevant since its absence would not have prevented the outcome. Neither Feynman nor others on the Rogers Commission expressed their disagreement in the language of philosophy, epistemology and logic; but unstated, possibly unrealized, differences in interpretations of statistics, probability, and scientific explanation may account for their disagreement on cause of the accident.
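A rough sketch of that probabilistic overdetermination, in Python. The two per-mission figures are the order-of-magnitude values attributed to Feynman above; treating the two failure modes as independent, and the names used for them, are my assumptions for illustration only.

```python
from math import prod


def p_any(contributors):
    """Probability that at least one contributor produces the outcome (independence assumed)."""
    return 1 - prod(1 - p for p in contributors.values())


def p_without(contributors, removed):
    """Counterfactual: the same probability with one contributor taken away."""
    remaining = {name: p for name, p in contributors.items() if name != removed}
    return p_any(remaining)


# Per-mission catastrophic failure estimates (order of magnitude, per the text above).
contributors = {
    "srm_field_joint": 1e-2,
    "main_engine_turbine_blade": 1e-2,
}

overall = p_any(contributors)
without_joint = p_without(contributors, "srm_field_joint")
print(f"overall: {overall:.4f}  without the field joint: {without_joint:.4f}")
```

Removing either contributor still leaves a per-mission probability near 1E-2, three orders of magnitude above the 1E-5 management figure. In that sense no single removal changes the verdict that the outcome was probable.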


Odds and ends

“A thing is safe if its attendant risks are judged to be acceptable.” – William Lowrance, Of Acceptable Risk: Science and the Determination of Safety

Failure of a covering-law model of explanation (from Wesley Salmon, Statistical Explanation, 1971):

  • Facts: Mr. Jones takes birth control pills. Mr. Jones has not become pregnant.
  • Law: Birth control pills prevent pregnancy.
  • Explanation deduced from facts plus covering law: Jones’ use of birth control pills explains his non-pregnancy.

Causal transitivity problems:
Mr. Jones contracts a bacterial disease that is deadly but easily cured with a dose of the right medicine. Jones is admitted to the hospital and Dr. One administers the drug, which has the curious property of being an effective cure if given once, but itself being fatal if given twice. Jones remains in the hospital to be monitored. The next day Dr. Two responsibly checks Jones’ chart, sees that Dr. One previously administered the treatment, and takes no further action. Jones survives and is released. Did Jones’ having contracted the disease cause Dr. Two to not administer a treatment? Causal transitivity would suggest it did.

A dog bites Jones’ right hand. Jones has a bomb, which he intends to detonate. He is right-handed. Because of the dog bite, Jones uses his left hand to trigger the bomb. Did the dog bite cause the bomb to explode? Examples lifted from Joseph Y. Halpern of the Cornell University Computer Science Dept.

Overdetermination and counterfactuals:
Two sharpshooters, both with live ammo, are charged with executing the condemned Mr. Jones. Each shooter’s accurately fired shot is sufficient to kill him. Thus the shot from each shooter fully explains the death of the condemned, and might therefore constitute a cause or explanation of Jones’ death. However, in a counterfactual model, neither shooter caused his death because, had that shooter alone not fired, Jones would still be dead. The counterfactual model follows our sense that something that has no impact on the outcome cannot explain it or be its cause. Against this model, we might ask how a fired shot, which has sufficient causal power by itself, can be disempowered by being duplicated.
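The sharpshooter case can be sketched in a few lines of Python. The function names and the dictionary encoding of who fired are hypothetical, chosen only to contrast a but-for (counterfactual) test with a sufficiency test.

```python
def jones_dies(shots):
    """Outcome model: any accurately fired shot is fatal."""
    return any(shots.values())


def is_but_for_cause(model, actual, factor):
    """Counterfactual test: does removing this one factor prevent the outcome?"""
    counterfactual = {**actual, factor: False}
    return model(actual) and not model(counterfactual)


def is_sufficient(model, actual, factor):
    """Sufficiency test: does this factor alone produce the outcome?"""
    alone = {name: (name == factor) for name in actual}
    return model(alone)


actual = {"shooter_1": True, "shooter_2": True}

for shooter in actual:
    print(
        shooter,
        "but-for cause:", is_but_for_cause(jones_dies, actual, shooter),
        "sufficient:", is_sufficient(jones_dies, actual, shooter),
    )
```

Each shot passes the sufficiency test and fails the but-for test, which is exactly the tension described above.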

Statistical explanation troubles:
Linus Pauling, despite winning two Nobel prizes, seems to have been a quack on some issues. Grossly oversimplified, 90% of patients who take large doses of Vitamin C recover from a cold in seven days. I took Vitamin C and I recovered in seven days. Therefore Vitamin C cured my cold. But 90% of patients who do not take Vitamin C also recover from colds in seven days.

The Big Bang is the root cause of this post.

Functional Hazard Assessment Basics

Outside of aerospace and military projects, I see a lot of frustration about failures of failure mode effects analyses (FMEA) to predict system-level or enterprise-level problems. This often stems from the fact that FMEAs see a system or process as a bunch of parts or steps. The FMEA works by asking what happens if I break this part right here or that process step over there.

For this approach to predict high-level (system-sized) failures, many FMEAs would need to be performed to exhaustive detail. They would also need to predict propagation of failures (or errors) through a system, identifying consequences on the function of the system containing them. Since FMEAs focus on single-event system failure initiators, examining combinations of failures is unwieldy. In redundant equipment or processes, this need can be far beyond the limits of our cognitive capability. In an aircraft brake system, for example, there may be hundreds of thousands of combinations of failures that lead to hazardous loss of braking. Also, by focusing on single-event conditions, only external and environmental contributors to system problems that directly cause component failures get analyzed. Finally, FMEAs often fail to catch human error that doesn’t directly result in equipment failure.

Consequently, we often call on FMEAs to do a job better suited for a functional hazard assessment (FHA). FHAs identify system-level functional failures, hazards, or other unwanted system-level consequences of major malfunction or non-function. With that, we can build a plan to eliminate, mitigate or transfer the associated risks by changing the design, deciding against the project, adding controls and monitors, adding maintenance requirements, defining operational windows, buying insurance, narrowing contract language, or making regulatory appeals. While some people call the FMEA a bottom-up approach, the FHA might be called a top-down approach. More accurately, it is a top-only approach, in the sense that it identifies top-level hazards to a system, process, or enterprise, independent of the system’s specific implementation details. It is also top-only in the sense that FHAs produce the top events of fault trees.

Terminology: In some domains – many ERM frameworks for example – the term “hazard” is restricted to risks stemming from environmental effects, war, and terrorism. This results in differentiating “hazard risks” from market, credit, or human capital risks, and much unproductive taxonomic/ontological hairsplitting. In some fields, business impact analysis (BIA) serves much the same purpose as FHA. While BIA is often correctly differentiated from risk analysis (and understood to inform it), its implementation details vary greatly. Particularly in tech circles, BIA impact is defined for levels lower than actual impact on business. That is, its meaning drifts into something like a specialized FMEA. For these reasons, I’ll use only the term “FHA,” where “hazard” means any critical unwanted outcome.

To be most useful, functional hazards should be defined precisely, so they can contribute to quantification of risk. That is, in the aircraft example above, loss of braking itself is not a useful hazard definition. Brake failure at the gate isn’t dangerous. A useful hazard definition would be something like reduction in aircraft braking resulting in loss of aircraft or loss of life. That constraint would allow engineers to model the system failure condition as something like loss of braking resulting in departing the end of a runway at speeds in excess of 50 miles per hour. Here, 50 mph might be a conservative engineering judgement of a runway departure speed that would cause a fatality or irreparable fuselage damage.

Hazards for other fields can take a similar form, e.g.:

  • Reputation damage resulting in revenue loss exceeding $5B in a fiscal year
  • Seizure of diver in closed-circuit scuba operations due to oxygen toxicity from excessive oxygen partial pressure
  • Unexpected coupon redemption in a campaign, resulting in > $1M excess promotion cost
  • Loss of chemical batch (value $1M) by thermal runaway
  • Electronic health record data breach resulting in disclosure of > 100 patient identifiers with medical history
  • Uncontained oil spill of > 10,000 gallons within 10 miles of LA shore

Note that these hazards tend to be stated in terms of a top-level or system-level function, an unwanted event related to that function, and specific, quantified effects with some sort of cost, be it dollars, lives, or gallons. Often, the numerical details are somewhat arbitrary, reflecting the values of the entity affected by the hazard. In other cases, as with aviation, guidelines on hazard classification come from a regulatory body. The FAA defines catastrophic, for example, as hazards “expected to result in multiple fatalities of the occupants, or incapacitation or fatal injury to a flight crewmember, normally with the loss of the airplane.”
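One way to keep hazard definitions in that function / unwanted event / quantified effect form is a small record type. The sketch below is Python; the class and field names are my own invention rather than any standard FHA schema, and the example reuses the 50 mph braking case from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalHazard:
    """A hazard stated as function, unwanted event, and a quantified effect."""

    function: str        # top-level or system-level function
    unwanted_event: str  # what goes wrong with that function
    effect_metric: str   # the quantity the effect is measured in
    threshold: float     # level beyond which the effect counts as the hazard
    unit: str

    def is_realized(self, observed: float) -> bool:
        """True when the observed effect exceeds the stated threshold."""
        return observed > self.threshold

    def statement(self) -> str:
        """Render the hazard in the sentence form used in the list above."""
        return (
            f"{self.unwanted_event} resulting in {self.effect_metric} "
            f"> {self.threshold:g} {self.unit}"
        )


braking = FunctionalHazard(
    function="aircraft braking",
    unwanted_event="loss of braking",
    effect_metric="runway departure speed",
    threshold=50,
    unit="mph",
)

print(braking.statement())
print(braking.is_realized(0))   # brake failure at the gate
print(braking.is_realized(60))  # overrun at 60 mph
```

Stating the threshold explicitly is what lets brake failure at the gate fall outside the hazard while a high-speed overrun falls inside it.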

Organizations vary considerably in the ways they view the FHA process; but their objectives are remarkably consistent. Diagrams for processes as envisioned by NAVSEA and the FAA (and EASA in Europe) appear below.

[Figure: Functional Hazard Analysis process diagrams, NAVSEA and FAA/EASA]

The same for enterprise risk might look like:

  • Identify main functions or components of business opportunity, system, or process
  • Examine each function for effects of interruption, non-availability or major change
  • Define hazards in terms of above effects
  • Determine criticality of hazards
  • Check for cross-function impact
  • Plan for avoidance, mitigation or transfer of risk

In all the above cases, the end purpose, as stated earlier, is to inform trade studies (help decide between futures) or to eliminate, mitigate or transfer risk. Typical FHA outputs might include:

  • A plan for follow-on action – analyses, tests, training
  • Identification of subsystem requirements
  • Input to strategic decision-making
  • Input to design trade studies
  • The top events of fault trees
  • Maintenance and inspection frequencies
  • System operating requirements and limits
  • Prioritization of hazards, prioritization of risks*

The asterisk on prioritization of risks means that, in many cases, this isn’t really possible at the time of an FHA, at least in its first round. A useful definition of risk involves a hazard, its severity, and its probability. The latter, in any nontrivial system or operation, cannot be quantified – often even roughly – at the time of FHA. Thus the FHA identifies the hazards needing probabilistic quantification. The FAA and NAVSEA (examples used above) do not quantify risk as an arithmetic product of probability and severity (contrary to the beliefs of many who cite these domains as exemplars); but the two-dimensional (vector) risk values they use still require quantification of probability.
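To illustrate the difference between a scalar product and a two-dimensional risk value, here is a small Python sketch. The severity classes, cost figures, and probability limits are hypothetical placeholders I chose for the example; they are not the FAA’s or NAVSEA’s actual criteria.

```python
from dataclasses import dataclass
from math import isclose

# Hypothetical maximum allowed probability per severity class (illustration only).
MAX_ALLOWED_PROBABILITY = {"catastrophic": 1e-9, "major": 1e-5, "minor": 1e-3}


@dataclass(frozen=True)
class Risk:
    """Risk kept as a (severity, probability) pair rather than a single number."""

    hazard: str
    severity: str
    probability: float  # per mission or per hour; units assumed consistent

    def is_acceptable(self) -> bool:
        """Compare probability against the limit for this severity class."""
        return self.probability <= MAX_ALLOWED_PROBABILITY[self.severity]


def scalar_risk(cost: float, probability: float) -> float:
    """The arithmetic-product view of risk: expected cost."""
    return cost * probability


# Two made-up hazards whose expected costs are the same.
loss_of_aircraft = Risk("loss of aircraft", "catastrophic", 1e-6)
delayed_departure = Risk("delayed departure", "minor", 1e-3)

product_a = scalar_risk(1e6, loss_of_aircraft.probability)
product_b = scalar_risk(1e3, delayed_departure.probability)

print(isclose(product_a, product_b))       # the products cannot tell them apart
print(loss_of_aircraft.is_acceptable())    # fails its severity-specific limit
print(delayed_departure.is_acceptable())   # meets its severity-specific limit
```

The product collapses the two hazards into the same number, while the paired form keeps them distinct. Either way, a probability estimate is still needed before the comparison can be made.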

I’ll drill into details of that issue and discuss real-world use of FHAs in future posts. If you’d like a copy of my FHA slides from a recent presentation at Stone Aerospace, email me or contact me via the About link on this site. I keep slide detail fairly low, so be sure to read the speaker’s notes.

–  –  –

In the San Francisco Bay area?

If so, consider joining us in a newly formed Risk Management meetup group.

Risk assessment, risk analysis, and risk management have evolved nearly independently in a number of industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk, etc.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase leading-edge trends, case studies, and best practices in our profession, with a focus on practical application and advancing the state of the art.

If you are in the bay area, please join us, and let us know your preferences for meeting times.