
Covering-Law Models, Statistical Relevance, and the Challenger Disaster

Space shuttle Challenger disintegrated 73 seconds after liftoff, killing its seven crew members. Many presentations on root cause analysis (RCA) cite the accident to show how an RCA framework would have pointed directly to the root cause. But no one agrees exactly on that cause, not even the members of the Rogers Commission. And showing that previously reached conclusions and evidence can be fed into an RCA framework is very different from showing that the framework would have produced that conclusion.

A minor reference to the covering-law model of scientific explanation in my last post on fault trees and RCA drew some interest. Below, I'll look at the Challenger accident with regard to explanatory and causal models.

A reader challenged my reference to covering-law models because of their association with a school of thought (“logical positivism”) abandoned fifty years ago. The positivists, like most people, saw a close connection between explanation and causality; but unlike most of us, they worried that chasing causality relied too heavily on induction and too often led to metaphysics and hollow semantics. That birds fly south because of instinct is no explanation at all. Nevertheless, we have a natural tendency to explain things by describing our perceptions of their causes.

W.V. Quine and Thomas Kuhn extinguished any remaining embers of logical positivism in the 1960s. It's true that covering-law models of explanation have major failures – counterexamples to sufficiency are common (see the odds and ends at the end of this post). But so does every other model of explanation, including statistical-relevance models, counterfactual models, unification models, and causal models.

In my post I suggested that, while most engineers probably never explicitly think about the pitfalls of causal explanations, they lean toward explanations that combine observations (facts of the case) with general laws (physics) such that the condition being modeled is inevitable. That is, they lean toward covering-law explanations, also called deductive-nomological models, since, given deterministic laws and the facts of the case, pure deduction leads to the inevitability (nomic expectancy) of the observed outcome.

By leaning toward covering-law models of explanation, engineers avoid some of the problems with causal models of explanation. These problems were exactly why the logical positivists (e.g., Popper, Hempel, Nagel) pursued covering-law models of explanation. They sought explanations that were free of causation’s reliance on induction.

In the 1700s, David Hume expressed similar concerns about causation in his Treatise. Some might argue that this is just another case of philosophers trying to see who can doubt the most obvious truth. But the logical positivists' concerns went beyond Hume's. Positivist Carl Hempel worried that we too easily infer causation from correlation in a very practical sense – one that applies to things like real-world drug trials, accident investigation, and root cause analysis.

Hempel worried about causal overdetermination: the possibility that two conditions, each sufficient to produce an effect (to confer nomic expectability on it), might both exist. In this case, assigning one of the conditions as “the” cause is messy; and the two conditions can't reasonably be called “contributing causes” because each is individually sufficient. Counterfactual notions of explanation, common in root cause frameworks, maintain that a condition is causally relevant if its removal would have prevented the outcome being investigated. Counterfactual models offer no help with overdetermination, since removing either condition would leave the outcome unchanged; thus neither could logically be deemed a “cause.”

Another commonly cited problem with causal models is failure of transitivity. One might expect causation to be transitive: if a causes b and b causes c, a should cause (and therefore explain) c. Famous examples show this not to be the case (see examples at the end of this post).

We need not descend into philosophy to find examples where overdetermination creates problems for root-cause analysis. The Challenger disaster suffices. But before delving into the Challenger example we should cover one other model of explanation – statistical relevance. The positivists added statistical models of explanation for cases where outcomes were non-deterministic. More recent scholars have refined statistical-relevance models of explanation.

Fault tree initiators (basic events such as microprocessor or valve failures, for example) are treated probabilistically. Fault trees model such failures as irreducibly indeterministic (like the decay of uranium atoms), regardless of the fact that a detailed analysis of the failed part might reveal a difference between it and one of the same age that did not fail. The design of a complex system expects failures of such components at a prescribed statistical rate. In this sense fault trees use inductive logic (i.e., the future will resemble the past) despite their reliance on deductive logic for calculating cut sets. NASA did not use fault tree analysis in the design of the shuttle.
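
To make the cut-set arithmetic concrete, here is a minimal sketch (in Python, with invented event names and failure rates, not shuttle data) of how basic-event probabilities roll up deductively into a top-event probability, assuming independent events and the usual rare-event approximation.

```python
# Minimal sketch of fault-tree arithmetic: basic events are treated
# probabilistically, cut sets are combined deductively.
# Event names and failure rates are hypothetical, not shuttle data.

basic_events = {
    "valve_fails": 1e-4,        # per-mission probability (illustrative)
    "controller_fails": 5e-5,
    "seal_leaks": 2e-3,
}

# Minimal cut sets: each set of basic events is jointly sufficient
# to produce the top (undesired) event.
cut_sets = [
    {"seal_leaks"},                       # single-point failure
    {"valve_fails", "controller_fails"},  # both must fail together
]

def cut_set_probability(cut_set):
    """Probability that every event in one cut set occurs, assuming independence."""
    p = 1.0
    for event in cut_set:
        p *= basic_events[event]
    return p

# Rare-event approximation: P(top event) is roughly the sum of cut-set probabilities.
p_top = sum(cut_set_probability(cs) for cs in cut_sets)
print(f"Approximate top-event probability per mission: {p_top:.2e}")
```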

Statistical information about historical – or sometimes estimated – failure rates often plays the role of evidence in scientific explanations. Statistical laws might, in conjunction with the facts of a case, confer a high probability on the outcome being explained. Statistical explanations are common in medicine; a drug is considered effective if it relieves some fraction of patients, usually a large one.

Statistical explanations seem less satisfying for rare events, particularly when the impact (severity) is high. Most of us are willing to accept that probability explains losing a single bet on a single slot of the roulette wheel. But what about winning that bet? Can the same explanation account for both winning and losing?

Sidestepping this psychological hurdle, we're left with defending statistical and probabilistic explanations in general, for cases where a portion of an event's explanation is assigned to statistical phenomena. In these cases, a factor is deemed relevant not because it makes the probability of the outcome high, but because it increases that probability. That is, given a population A, a condition C is statistically relevant to condition B just in case the probability of B conditional on A and C is greater than the probability of B conditional on A alone. In symbolic form, C is relevant if P(B | A.C) > P(B | A).
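
As a toy illustration of that criterion (the numbers below are invented, not actual flight statistics), the following sketch checks whether a hypothetical condition C – cold weather at launch – is statistically relevant to an outcome B – joint failure – within a reference population A of launches.

```python
# Toy check of statistical relevance: C is relevant to B (within population A)
# just in case P(B | A and C) > P(B | A). Numbers are invented for illustration.

p_B_given_A = 0.02         # P(joint failure | a launch), hypothetical
p_B_given_A_and_C = 0.15   # P(joint failure | a launch in cold weather), hypothetical

def statistically_relevant(p_b_given_a_and_c: float, p_b_given_a: float) -> bool:
    """C is statistically relevant to B iff conditioning on C raises B's probability."""
    return p_b_given_a_and_c > p_b_given_a

print(statistically_relevant(p_B_given_A_and_C, p_B_given_A))  # True: C raises the probability
```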

Now consider the Challenger disaster and the disagreement between the findings of the Rogers Commission as a group and Richard Feynman as a dissenting participant. The commission (Ch. V: The Cause of the Accident) concluded with this explanation (i.e. cause) of the disaster:

“In view of the findings, the Commission concluded that the cause of the Challenger accident was the failure of the pressure seal in the aft field joint of the right Solid Rocket Motor. The failure was due to a faulty design unacceptably sensitive to a number of factors. These factors were the effects of temperature, physical dimensions, the character of materials, the effects of reusability, processing, and the reaction of the joint to dynamic loading.”

The commission does not use the phrase “root cause.” Feynman (Appendix F: Personal observations on the reliability of the Shuttle) doesn't explicitly state a cause. He addresses the “enormous disparity” between the subjective probabilities assigned to certain failure modes by NASA management and by NASA and Thiokol engineers. Elsewhere Feynman writes that he thought the effects of cold weather on the o-rings in the SRM joint were sufficient to cause the explosion. Further, he found that poor management, specifically in Flight Readiness Reviews, was responsible for the decision to launch over significant opposition from engineering experts, and that this was the factor deserving attention (what most root-cause analyses seek) for preventing future disasters:

“Official management, on the other hand, claims to believe the probability of failure is a thousand times less. One reason for this may be an attempt to assure the government of NASA perfection and success in order to ensure the supply of funds. The other may be that they sincerely believed it to be true, demonstrating an almost incredible lack of communication between themselves and their working engineers.”

Feynman's minority report further makes the case that the safety margins for the shuttle's main engines (not the solid rockets that exploded) were sufficiently compromised to make their independent catastrophic failure also probable.

The probability of o-ring failure played a central role in analysis of the disaster. NASA's Larry Mulloy apparently thought that the lack of correlation between combustion gas blow-by (indicating o-ring leakage) and temperature in past flights meant that low temperature was not a cause of leakage resulting in explosion. Feynman thought that the observed rate of blow-by and the probability of its resulting in explosion were sufficient to count as a statistical explanation of the disastrous outcome. He judged that probability, using the estimates of engineers, to be much greater than the 1E-5 value given by NASA management – somewhere in the range of one in 100 (1E-2). With citizens' lives at stake, 1E-2 is sufficiently close to 1.0 to deem the unwanted outcome likely. Despite some errors in Feynman's conception of reliability engineering, his argument is plausible.
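
To see why that disparity matters, it helps to look at how a per-mission failure probability compounds over a flight program. The sketch below is purely illustrative – it assumes independent missions and an arbitrary 100-flight program, neither of which comes from the Commission's report.

```python
# Why the gap between 1E-2 and 1E-5 per mission matters: compound the
# per-mission probability over a program of flights. Illustrative only;
# assumes independent missions and an arbitrary 100-flight program.

def prob_at_least_one_failure(p_per_mission: float, n_missions: int) -> float:
    """Probability of at least one catastrophic failure in n independent missions."""
    return 1.0 - (1.0 - p_per_mission) ** n_missions

n = 100
for p in (1e-2, 1e-5):
    print(f"p = {p:.0e} per mission -> "
          f"P(at least one failure in {n} flights) = {prob_at_least_one_failure(p, n):.3f}")
# p = 1e-2 -> about 0.634; p = 1e-5 -> about 0.001
```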

Feynman also judged the probability of main engine turbine blade failure to be in the range of 1E-2 per mission, in contrast to NASA management's (Jud Lovingood's) estimate of 1E-5.

From the perspective of engineers, scientists and philosophers, the facts of the shuttle design and environmental conditions combined with the laws of nature (physics) together confer some combination of nomic and statistical expectancy on the disastrous outcome. The expected outcome both explains and predicts (retrodicts) the explosion.

The presence of low temperatures significantly increased (at least in Feynman's view) the probability of a catastrophic failure mode; therefore cold weather at launch was both statistically and explanatorily relevant. But even without cold weather, the flawed design of the solid rocket field joint already made catastrophe probable. Further, main engine loading combined with turbine-blade crack propagation rates, in Feynman's view, independently made catastrophic failure probable.

Thus, using some versions of statistical-explanation models, the Challenger disaster can be seen as overdetermined, at least in a probabilistic sense. The disaster was made likely by several independent combinations of design facts and natural laws. From the perspective of a counterfactual model of explanation, cold weather could be seen as not causally (or explanatorily) relevant, since its absence would not have prevented the outcome. Neither Feynman nor others on the Rogers Commission expressed their disagreement in the language of philosophy, epistemology, and logic; but unstated, possibly unrealized, differences in interpretations of statistics, probability, and scientific explanation may account for their disagreement on the cause of the accident.

END.

Odds and ends

“A thing is safe if its attendant risks are judged to be acceptable.” – William Lowrance, Of Acceptable Risk: Science and the Determination of Safety

Failure of a covering-law model of explanation (from Wesley Salmon, Statistical Explanation, 1971):

  • Facts: Mr. Jones takes birth control pills. Mr. Jones has not become pregnant.
  • Law: Birth control pills prevent pregnancy.
  • Explanation deduced from facts plus covering law: Jones’ use of birth control pills explains his non-pregnancy.

Causal transitivity problems:
Mr. Jones contracts a bacterial disease that is deadly but easily cured with a dose of the right medicine. Jones is admitted to the hospital, and Dr. One administers the drug, which has the curious property of being an effective cure if given once but fatal if given twice. Jones remains in the hospital to be monitored. The next day Dr. Two responsibly checks Jones' chart, sees that Dr. One has already administered the treatment, and takes no further action. Jones survives and is released. Did Jones' having contracted the disease cause Dr. Two to not administer the treatment? Causal transitivity would suggest it did.

A dog bites Jones’ right hand. Jones has a bomb, which he intends to detonate. He is right-handed. Because of the dog bite, Jones uses his left hand to trigger the bomb. Did the dog bite cause the bomb to explode? Examples lifted from Joseph Y. Halpern of the Cornell University Computer Science Dept.

Overdetermination and counterfactuals:
Two sharpshooters, both with live ammo, are charged with executing the condemned Mr. Jones. Each shooter's accurately fired shot is sufficient to kill him. Thus the shot from each shooter fully explains the death of the condemned, and might therefore constitute a cause or explanation of Jones' death. However, in a counterfactual model, neither shooter caused his death because, had either one not fired, Jones would still be dead. The counterfactual model follows our sense that something that has no impact on the outcome cannot explain it or be its cause. Against this model, we might ask how a fired shot, which has sufficient causal power by itself, can be disempowered by being duplicated.

Statistical explanation troubles:
Linus Pauling, despite winning two Nobel prizes, seems to have been a quack on some issues. Grossly oversimplified: 90% of patients who take large doses of Vitamin C recover from a cold in seven days. I took Vitamin C and I recovered in seven days. Therefore Vitamin C cured my cold. But 90% of patients who do not take Vitamin C also recover from colds in seven days.

The Big Bang is the root cause of this post.

The Onagawa Reactor Non-Meltdown

On March 11, 2011, the strongest earthquake in Japanese recorded history hit the Tohoku region, leaving about 15,000 dead. The closest nuclear reactor to the quake's epicenter was the Onagawa Nuclear Power Station, operated by Tohoku Electric Power Company. Despite the earthquake and the subsequent tsunami that destroyed the town of Onagawa, the Onagawa nuclear facility remained intact and shut itself down safely, without incident. The Onagawa nuclear facility was the vicinity's only safe evacuation destination. Residents of Onagawa left homeless by the natural disasters sought refuge in the facility, where its workers provided food.

The more famous Fukushima nuclear facility was about twice as far from the earthquake's epicenter. The tsunami at Fukushima was slightly less severe. Fukushima experienced three core meltdowns, resulting in the evacuation of 300,000 people. The findings of the Fukushima Nuclear Accident Independent Investigation Commission have been widely published. They conclude that Fukushima failed to meet the most basic safety requirements, had conducted no valid probabilistic risk assessment, had no provisions for containing damage, and that its regulators operated in a network of corruption, collusion, and nepotism. Kiyoshi Kurokawa, chairman of the commission, stated:

THE EARTHQUAKE AND TSUNAMI of March 11, 2011 were natural disasters of a magnitude that shocked the entire world. Although triggered by these cataclysmic events, the subsequent accident at the Fukushima Daiichi Nuclear Power Plant cannot be regarded as a natural disaster. It was a profoundly manmade disaster – that could and should have been foreseen and prevented.

Only by grasping [the mindset of Japanese bureaucracy] can one understand how Japan’s nuclear industry managed to avoid absorbing the critical lessons learned from Three Mile Island and Chernobyl. It was this mindset that led to the disaster at the Fukushima Daiichi Nuclear Plant.

The consequences of negligence at Fukushima stand out as catastrophic, but the mindset that supported it can be found across Japan.

Despite these findings, the world's response to Fukushima has been much more focused on opposition to nuclear power than on opposition to corrupt government regulatory bodies and the cultures that foster them.

Two scholars from USC, Airi Ryu and Najmedin Meshkati, recently published “Why You Haven’t Heard About Onagawa Nuclear Power Station after the Earthquake and Tsunami of March 11, 2011,” their examination of the contrasting safety mindsets of TEPCO, the firm operating the Fukushima nuclear plant, and Tohoku Electric Power, the firm operating Onagawa.

Ryu and Meshkati reported vast differences in personal accountability, leadership values, work environments, and approaches to decision-making. Interestingly, they found even Tohoku Electric to be weak in setting up an environment where concerns could be raised and where an attitude of questioning authority was encouraged. Nevertheless, TEPCO was far inferior to Tohoku Electric in all other safety culture traits.

Their report is worth a read for anyone interested in the value of creating a culture of risk management and the need for regulatory bodies to develop non-adversarial relationships with the industries they oversee, something I discussed in a recent post on risk management.