Category Archives: Risk assessment

Oroville Dam Risk Mismanagement

The Oroville Dam crisis exemplifies bad risk assessment, fouled by an unsound determination of hazard probability and severity.

It seems likely that Governor Brown and others in government believed that California’s drought was permanent, that they did so on irrational grounds, and that they concluded the risk of dam problems was low because the probability of flood was low. By doing this they greatly increased the cost of managing the risk, and increased the likelihood of loss of lives.

In other words, Governor Brown and others indulged in a belief that they found ideologically satisfying at the expense of sound risk analysis and management. Predictions of permanent drought in California were entertained by Governor Brown, the New York Times (California Braces for Unending Drought), Wired magazine (Drought Probably Forever) and other outlets last year when El Niño conditions failed to fill reservoirs. Peter Gleick of the Pacific Institute explained to KQED why the last drought would be unlike all others.

One would have to have immense confidence in the improbability of future floods to neglect ten-year-old warnings from several agencies of the need for dam repair. Apparently, many had such confidence. It was doubly unwarranted, given that with or without anthropogenic warming, the standard bimodal precipitation regime of a desert should be expected. That is, on the theory that man-made climate change exists, we should expect big rain years; and on rejection of that theory we should expect big rain years. Eternal-drought predictions were perhaps politically handy for raising awareness or for vote pandering, but they didn’t serve risk management.

Letting ideology and beliefs interfere with measuring risk by assessing the likelihood of each severity range of each hazard isn’t new. A decade ago many believed certain defunct financial institutions to be too big to fail, long before that phrase was understood to mean too big for the government to allow to fail. No evidence supported this belief, and plenty of counter-evidence existed.

This isn’t even the first time our government has indulged in irrational beliefs about weather. In the second half of the 1800s, many Americans, apparently including President Lincoln, believed, without necessarily stating it explicitly, that populating the wild west would cause precipitation to increase. The government enticed settlers to move west with land grants. There was a shred of scientific basis: plowing raises dust into the air, increasing the seeding of clouds. Coincidentally, there was a dramatic greening of the west from 1850 to 1880; but it was due to weather, not the desired climate change.

When the rains suddenly stopped in 1880, homesteaders faced decades of normal drought. Looking back, one wonders how farmers, investors and politicians could so deeply indulge in an irrational belief that led to very poor risk analysis.


– – –

Tom Hight is my name, an old bachelor I am,
You’ll find me out West in the country of fame,
You’ll find me out West on an elegant plain,
And starving to death on my government claim.

Hurrah for Greer County!
The land of the free,
The land of the bed-bug,
Grass-hopper and flea;
I’ll sing of its praises
And tell of its fame,
While starving to death
On my government claim.

Opening lyrics to a folk song by Daniel Kelley in the late 1800’s chronicling the pain of settling on a government land grant after the end of a multi-decade wet spell.

Teaching Kids about Financial Risk

Last year a Chicago teacher’s response to a survey on financial literacy included a beautifully stated observation:

“This [financial literacy] is a topic that should be taught as early as possible in order to curtail the mindset of fast money earned on the streets and gambling being the only way to improve one’s financial circumstances in life.”

The statement touches on the relationship between risk and reward and on the notion of delayed gratification. It also suggests that kids can grasp the fundamentals of financial risk.

The need for financial literacy is clear. Many adults don’t grasp the concept of compound interest, let alone the ability to calculate it. We’re similarly weak on the basics of risk. Combine the two weaknesses and you get the kind of investors that made Charles Ponzi famous.

I browsed the web for material on financial risk education for kids and found very little, much of it misguided. As I mentioned earlier, many advice-giving sites confuse risk taking with confidence building.

On one site I found this curious claim about the relationship between risk and reward:

What is risk versus return? In finance, risk versus return is the idea that the amount of potential return is proportional to the amount of risk taken in a financial investment.

In fact, history shows that reward correlates rather weakly with risk. As the Chicago teacher quoted above notes, financial security – along with good health and other benefits – stems from knowing that some risks have much downside and little upside. Or, more accurately, it stems from knowing that some risks have vastly higher expected costs than expected profits.

This holds where expected cost means the sum, over every possible adverse outcome of taking the risk, of that outcome’s dollar value times its probability, and where expected profit is the sum of the dollar value of each possible beneficial outcome times its probability. Here expected value would be the latter minus the former. This is simple economic risk analysis (expected value analysis). Many people get it intuitively. Others – some of whom are investment managers – pretend that some particular detail causes it not to apply to the case at hand. Or they might deny the existence of certain high-cost outcomes.
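As a rough illustration of the expected-value arithmetic just described, here is a minimal Python sketch. The investment, its dollar amounts, and its probabilities are all invented for illustration; none of them come from real data. The second calculation shows how denying a rare, high-cost outcome can flip the sign of the result.

```python
def expected_amount(outcomes):
    """Sum of (dollar amount * probability) over a list of outcomes."""
    return sum(amount * probability for amount, probability in outcomes)


# HYPOTHETICAL investment: every figure below is made up for illustration.
beneficial = [(1_000, 0.90)]                 # a likely, modest profit
adverse = [(200, 0.09), (100_000, 0.01)]     # a small loss and a rare, ruinous one

expected_profit = expected_amount(beneficial)
expected_cost = expected_amount(adverse)
net = expected_profit - expected_cost        # expected value of taking the risk

# The same analysis after "denying" the rare high-cost outcome.
adverse_denied = [(amt, p) for amt, p in adverse if amt < 100_000]
net_denied = expected_profit - expected_amount(adverse_denied)

print(f"Expected value, all outcomes counted:   {net:,.0f}")
print(f"Expected value, rare outcome ignored:   {net_denied:,.0f}")
```

With these made-up numbers the honest analysis comes out negative and the selective one comes out comfortably positive, which is the pattern the paragraph above warns about.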

I was once honored to give Susan Beacham a bit of input as she was developing her Money Savvy Kids® curriculum. Nearly twenty years later the program has helped over a million kids to develop money smarts. Analysts show the program to be effective in shaping financial attitudes and kids’ understanding of spending, saving and investing money.

Beacham’s results show that young kids, teens, and young adults can learn how money works, a topic that otherwise slips through gaps between subjects in standard schooling. Maybe we can do the same with financial risk and risk in general.

– – –


A web site aimed at teaching kids about investment risk proposes an educational game where kids win candies by betting on the outcome of the roll of a six-sided die. Purportedly this game shows how return is proportional to risk taken. Before rolling the die the player states the number of guesses he/she will make on its outcome. The outcome is concealed until all guesses are made or a correct guess is made. Since the cost of any number of guesses is the same, I assume the authors’ stated proportionality between reward and risk to mean that five guesses is less risky than two, for example, and therefore will have a lower yield. The authors provide the first two columns of the table below, showing the candies won vs. number of guesses. I added the Expected Value column, calculated as the expected profit minus the expected (actual) cost.

I think the authors missed an opportunity to point out that, as constructed, the game makes the five-guess option a money pump. They also constructed the game so that reward is not proportional to risk (using my guess of their understanding of risk). They also missed an opportunity to explore the psychological consequences of the one- and two-guess options. Both have the same expected value, but have very different winning amounts. I’ve discussed risk-neutrality with ten-year-olds who seem to get the nuances better than some risk managers. It may be one of those natural proficiencies that are unlearned in school.

Overall, the game is constructed to teach that, while not proportional, reward does increase with risk, and, except for the timid who buy the five-guess option, high “risk” has no downside. This seems exactly the opposite of what I want kids to learn about risk.

No. of Guesses    Candies Won    Expected Value
5                 1              5/6*1 – 1 = -0.17
4                 2              4/6*2 – 1 = 0.33
3                 5              3/6*5 – 1 = 1.5
2                 10             2/6*10 – 1 = 2.33
1                 20             1/6*20 – 1 = 2.33
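The Expected Value column above can be reproduced with a few lines of Python. This is only a sketch: the payouts are the ones listed in the table, and the cost of one candy per play is my reading of the “– 1” term in each row. Exact fractions are used so that the tie between the one- and two-guess options is not blurred by rounding.

```python
from fractions import Fraction

# Payouts from the table: number of guesses -> candies won on a correct guess.
PAYOUTS = {5: 1, 4: 2, 3: 5, 2: 10, 1: 20}
COST_TO_PLAY = 1   # assumed: one candy per play, per the "- 1" in each row
DIE_SIDES = 6


def game_expected_value(guesses, candies, cost=COST_TO_PLAY, sides=DIE_SIDES):
    """Expected candies gained: P(correct guess) * payout, minus the cost to play."""
    p_win = Fraction(guesses, sides)
    return p_win * candies - cost


table = {g: game_expected_value(g, c) for g, c in PAYOUTS.items()}

for guesses, ev in table.items():
    print(f"{guesses} guesses, {PAYOUTS[guesses]:>2} candies: EV = {float(ev):+.2f}")
```

Only the five-guess option comes out negative, and the one- and two-guess options tie exactly at 7/3 candies even though one pays 20 and the other 10.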


Teaching Kids about Risk

Most people understand risk to mean some combination of the likelihood and severity of an unwanted outcome resulting from an encounter with a potentially hazardous situation or condition. In kid terms this means for any given activity:

  1. what can go wrong?
  2. how bad is it?
  3. how likely is it?

It’s also helpful to discuss risk in terms of trading risk for reward or benefit.

For example, the rewards of riding a bike include fun (the activity is intrinsically rewarding) and convenient transportation.

In adult risk-analysis terms, the above three aspects of risk equate to:

  1. hazard description
  2. hazard severity
  3. hazard probability

Some hazards associated with riding a bicycle include collision with a moving car, falling down, getting lost, having a flat tire, and being hit by an asteroid. A key point falls out of the above wording. To a risk analyst, bicycle riding is not a risk. Many risks are associated with bike riding, because many distinct hazards are connected to bike riding. Each hazard has an associated risk, depending on the severity and probability of the hazard.

Each hazard associated with bicycle riding differs in likelihood and severity. Getting hit by an asteroid is extremely unlikely and extremely harmful; but it is so unlikely that we can ignore the risk altogether. Colliding with a car can be very severe. Its likelihood depends greatly on bike riding practices.

Talking to kids about how to decrease the likelihood of colliding with a moving car helps teach kids about the concept of risk. Talking with them about the relative likelihood of outcomes such as asteroid strikes can also help.

Even young kids can understand the difference between chance and risk. The flip of a coin involves chance but not risk, unless you’ve placed a bet on the outcome. The same applies for predicting the outcome of contests.

I see a lot of articles aimed at teaching kids to take risks. Most of these really address helping kids build confidence and dispel irrational fears. Irrational fear often means errors in perception of either the severity (how bad is it) or the probability (how likely is it) of a hazard. Explicit identification of specific hazards will help conquer irrational fears and will help kids become more comprehensive in identifying the hazards (what can go wrong) associated with an activity.

This is good for kids, since they’re quick to visualize the rewards and benefits of an activity and take action before asking what can go wrong at all, let alone what are all the things that can go wrong.

Teach your kids about risk as a concept, not just about specific risks and hazards. They’ll make better CEOs and board members. We’ll all benefit.


Why Pharmaceutical Risk Management Is in Deep Trouble

The ICH Q9 guidelines finalized in 2005 called for pharmaceutical firms to use a risk-based approach to the specification, design, and verification of manufacturing systems having the potential to affect product quality and patient safety. In 2008, ICH Q10 added that the design of the pharmaceutical quality system should incorporate risk management and risk-based audits.

Pharmaceutical firms had little background in the relevant areas of risk-management. Early troubles the industry faced in applying risk tools developed in other industries are well documented. Potential benefits of proactive risk management include reduction in regulatory oversight and associated costs, reduced cost from discrepant materials, reduced batch-failure rates, and a safer product. Because risk management, in theory, is present in the earliest stages of product and process design,  it can, in theory, raise profitability while improving patient safety.

Such theoretical benefits of good risk management are in fact realized by firms in other industries. In commercial aviation, probabilistic risk analysis is the means by which redundancy is allocated to systems to achieve a balanced  design – the minimum weight and cost consistent with performance requirements. In this sense, good risk analysis is a competitive edge.

From 2010 to 2015, Class 1 to 3 FDA recall events ranged from 8000 to 9500 per year, with an average of 17 injunctions per year. FDA warnings rose steadily from 673 in 2010 to 17,232 in 2015. FDA warning letters specifically identifying missing or faulty risk assessments have also steadily increased, with 53 in 2015, and 83 so far this year, based on my count from the FDA databases.

[Chart: FDA warnings, 2010-2015]

It is not merely foreign CMOs that receive warnings identifying defective risk assessments. Abbott, Baxter, Johnson & Johnson, Merck, Sanofi and Teva are in the list.

The high rate of out-of-spec and contamination recalls seen in the FDA data clearly points to low-hanging fruit for risk assessments. These issues are cited in ICH Q9 and Q10 as examples of areas where proactive risk management serves all stakeholders by preventing costly recalls. Given the occurrence rate in 2015, it’s obvious that a decade of risk management in pharma can’t be declared a major success. In fact, we seem to be losing ground. So what’s going on here, and why hasn’t pharma seen results like those of commercial aviation?

One likely reason stems from the evolution of the FDA itself. The FDA predates most of drug manufacture. For decades it has regulated manufacturing, marketing, distribution, safety and efficacy of drugs and medical devices (among other things) down to the level of raw materials, including inspection of facilities. With obvious benefits to consumers, this role has had the detrimental side effect of maturation of an entire industry where risk management and safety are equated with regulatory compliance by drug makers. That is, there’s a tendency to view risk management as something that is imposed by regulators, from the outside, rather than being an integral value-add.

The FAA, by contrast, was born (1958) into an already huge aircraft industry. At that time a precedent for delegating authority to private persons had already been established by the Civil Aeronautics Act. Knowing it lacked the resources to regulate manufacturing to a level of detail like that in the FDA, the FAA sought to foster a risk culture in aircraft builders, and succeeded in doing so through a variety of means including expansion of industry participation in certifying aircraft. This included a designated-engineering-rep program in which aircraft engineers are duty-bound delegates of the FAA.

Further, except for the most basic, high-level standards, engineering design and safety requirements are developed by manufacturers and professional organizations, not the FAA. The FAA’s mandate to builders for risk management was basically to come up with the requirements and show the FAA how they intended to meet them. Risk management is therefore integrated into design, not just QA and certification. The contrasting risk cultures of the aviation and pharmaceutical industries are the subject of my current research in history of science and technology at UC Berkeley. More on that topic in future posts.

Changing culture takes time and likely needs an enterprise-level effort. But a much more immediate opportunity for the benefits envisioned in ICH Q9 exists directly at the level of the actual practice of risk assessment.

My perspective is shaped by three decades of risk analysis in aviation, chemical refinement, nuclear power, mountaineering and firefighting equipment, ERM, and project risk. From this perspective, and from direct experience in pharma combined with material found in the FDA databases, I find the quality of pharmaceutical risk assessment and training to be simply appalling.

While ICH Q9 mentions, just as examples, using PHA/FHA (functional hazard analysis), hazard operability analysis, HACCP, FMEA (failure mode effects analysis), probabilistic safety analysis and fault trees at appropriate levels and project phases, one rarely sees anything but FMEAs performed in a mechanistic manner with the mindset that a required document (“the FMEA form”) is being completed.

Setting aside, for now, the points that FMEA was not intended by its originators to be a risk analysis tool and is not used as such in aerospace (for reasons discussed here, including inability to capture human error and external contributors), I sense that the job of completing FMEAs is often relegated to project managers who are ill-equipped for it and lack access to subject matter experts. Further injury is done here by the dreadfully poor conception of FMEA seen in the Project Management Institute’s (PMI) training materials inflicted on pharma Project Managers. But other training available to pharma employees in risk assessment is similarly weak.

Some examples might be useful. In the last two months, I’ve attended a seminar and two webinars I found on LinkedIn, all explicitly targeting pharma. In them I learned, for example, that the disadvantage to using FMEAs is that they require complex mathematics. I have no clue what the speaker meant by this. Maybe it was a reference to RPN calculation, an approach strongly opposed by aviation, nuclear, INCOSE and NAVAIR – for reasons I’ll cover later – which requires multiplying three numbers together?

I learned that FMEAs are also known as fault trees (can anyone claiming this have any relevant experience in the field?), and that bow tie (Ishikawa) “analysis” is heavily used in aerospace. Ishikawa is a brainstorming method, not risk analysis, as noted by Vesely 40+ years ago, and it is never (ever) used as a risk tool in aerospace. I learned that another disadvantage of FMEAs is that you can waste a lot of time working on risks with low probabilities. The speaker seemed unaware that low-probability, high-cost hazards are really what risk analysis is about; you’re not wasting your time there! If the “risks” are high-probability events, like convenience-store theft, we call them expenses, not risks. I learned in this training that heat maps represent sound reasoning. These last two points were made by an instructor billed as a strategic management consultant and head of a pharmaceutical risk-management software firm.

None of these presentations mentioned functional hazard analysis, business impact analysis, or any related tool. FHA (my previous post) is a gaping hole in pharmaceutical risk assessment, missing in safety, market, reputation, and every other kind of risk a pharma firm faces.

Most annoying to me personally is the fact that the above seminars, like every one I’ve attended in the past, served up aerospace risk assessment as an exemplar. Pharma should learn mature risk analysis techniques and culture from aviation, not just show photos of aircraft on presentation slides. In no other industry but commercial aviation has something so inherently dangerous been made so safe, by any definition of safety. Aviation achieved this (1000-fold reduction in fatality rate) not through component quality, but by integrating risk into the core of the business. Aviation risk managers’ jaws hit the floor when I show them “risk assessments” (i.e., FMEAs) from pharma projects.

One thing obviously lacking here is simple analytic rigor. That is, if we’re going to do risk assessment, let’s try to do it right. The pharmaceutical industry obviously has some of the best scientific minds around, so one would expect it to understand the value of knowledge, diligence, and their correct application. So perhaps the root of its defective execution of risk management is in fact the underdeveloped risk culture mentioned above.

The opportunity here is immense. By cleaning up their risk act, pharmaceutical firms could reap the rewards intended by ICH Q9 a decade ago and cut our ballooning regulatory expenses. Leave a comment or reach me via the About link above to discuss this further.

 –  –  –

In the San Francisco Bay area?

If so, consider joining us in a newly formed Risk Management meetup group.

Risk assessment, risk analysis, and risk management have evolved nearly independently in a number of industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk, etc.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase leading-edge trends, case studies, and best practices in our profession, with a focus on practical application and advancing the state of the art.

If you are in the bay area, please join us, and let us know your preferences for meeting times.

Cato Institute on Immigration Terrorism Risk

Alex Nowrasteh, the immigration policy analyst at the Cato Institute, recently authored a paper entitled “Terrorism and Immigration: A Risk Analysis.” It is not an analysis of risk in the traditional sense; it has little interest in causes and none in mitigation strategy. A risk-reward study, it argues from observed frequencies of terrorism incidents in the US that restricting immigration is a poor means of reducing terrorism and that the huge economic benefits of immigration outweigh the small costs of terrorism.

That may be true – even if we adjust for the gross logical errors and abuse of statistics in the paper.

Nowrasteh admits that in the developing world, heavy refugee flows are correlated with increased terrorism. He also observes that, since 2001, “only three years were marred by successful foreign-born attacks.” Given his focus on what he calls the chance of being murdered in a terrorist attack (based solely on historical frequencies since 1975), the fact that successful terrorism occurred in only three years seems oddly polemical. What follows? By his lights, the probability of terrorist death stems only from historical frequencies. While honest people disagree about Bayesian probability theory, surely we owe the Bayesians more than blinding ourselves to all but brute averages over a 40-year interval. That is, having noted that heavy refugee flows correlate with terrorism elsewhere, he doesn’t update his prior at all. Further, unsuccessful terrorist attempts have no influence on his estimates.

Nowrasteh writes, “government officials frequently remind the public that we live in a post-9/11 world where the risk of terrorism is so extraordinarily high that it justifies enormous security expenditures.” I don’t know his mindset, but writing this in a “risk analysis” seems poorly motivated at best. He seems to be saying that given the low rate of successful terrorism, security efforts are a big waste. The social justice warriors repeating this quote from his analysis clearly think so:

“The chance that an American would be killed in a terrorist attack committed by a refugee was 1 in 3.64 billion a year.” [emphasis in original]

Nowrasteh develops a line of argument around the cost of a disrupted economy resulting from terrorism events using the 1993 and 2001 WTC attacks and the Boston Marathon bombing, finding that cost to be relatively small. He doesn’t address the possibility that a cluster of related successful attacks might have disproportionately disruptive economic effects.

He makes much of the distinctions between various visa categories (e.g., tourist, refugee, student) – way too much given that the rate of terrorism in each is tiny to start with, and they vary only by an order of magnitude or so.

These are trifles. Two aspects of the analysis are shocking. First, Nowrasteh repeatedly reports the calculated probabilities of being killed by foreigners of various visa categories – emphasizing their extreme smallness – with no consideration of base rates. That is, the probability of being murdered at all is already tiny. Many of us might be more interested in a conditional probability: if you were murdered, what is the probability that the murderer would be an immigrant terrorist? Or perhaps, if you were murdered by an immigrant terrorist, how likely is it that the terrorist arrived on a refugee visa?
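To make the base-rate point concrete, here is a small Python sketch of the conditional probability described above. Both input probabilities are hypothetical placeholders chosen for easy arithmetic; they are not taken from Nowrasteh’s paper or from any crime statistics, and the variable names are mine.

```python
from fractions import Fraction


def conditional_probability(p_joint, p_condition):
    """Return P(A | B) = P(A and B) / P(B)."""
    if not 0 < p_condition <= 1:
        raise ValueError("P(B) must be in (0, 1]")
    if not 0 <= p_joint <= p_condition:
        raise ValueError("P(A and B) must lie between 0 and P(B)")
    return p_joint / p_condition


# HYPOTHETICAL annual probabilities, for illustration only.
P_MURDERED = Fraction(1, 20_000)               # base rate: murdered by anyone
P_MURDERED_BY_GROUP = Fraction(1, 4_000_000)   # murdered by a member of some group

share = conditional_probability(P_MURDERED_BY_GROUP, P_MURDERED)

print(f"Unconditional probability:          {float(P_MURDERED_BY_GROUP):.7%}")
print(f"P(killer in group | murdered):      {float(share):.3%}")
```

The unconditional figure looks vanishingly small mostly because murder itself is rare. Dividing by the base rate gives the number that actually compares this hazard with other ways of being murdered.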

Finally, Nowrasteh makes this dazzling claim:

“The attacks (9/11) were a horrendous crime, but they were also a dramatic outlier.” 

Dramatic outlier? An outlier is a datum that lies far outside a known distribution. Does Nowrasteh know of a distribution of terrorist events that nature supplies? What could possibly motivate such an observation? We call a measurement an outlier when its accuracy is in doubt because of prior knowledge about its population. Outliers cannot exist in sparse data. Claiming otherwise is absurd. Utterly.

“I wouldn’t believe it even if Cato told me so.” That is how, we are told, an ancient Roman senator would express incredulity, since Marcus Porcius Cato was the archetype of truthfulness. Well, Cato has said it, and I’m bewildered.




On the Use and Abuse of FMEAs

– William Storage, VP LiveSky, Inc.; Visiting Scholar, UC Berkeley History of Science

Analyzing about 80 deaths associated with the drug heparin in 2009, the FDA found that over-sulphated chondroitin with toxic effects had been intentionally substituted for a legitimate ingredient for economic reasons. That is, an unscrupulous supplier sold a counterfeit drug material costing 1% as much as the real thing; and it killed people.

This wasn’t unprecedented. Something similar happened with gentamicin in the late 1980s, with cefaclor in 1996, and again with DEG sold as glycerin in 2006.

Adulteration and toxic excipients are obvious failure modes of supply chains and operations for drug manufacturers. Presumably, the firms buying the adulterated raw material had conducted failure mode effects analyses at several levels. An early-stage FMEA would have seen the failure mode and assessed its effects, thereby triggering the creation of controls to prevent the failure. So what went wrong?

The FDA’s reports on the heparin incident did not make public any analyses done by the drug makers. But based on the “best practices” specified by standards bodies, consulting firms, and many risk managers, we can make a good guess. Their risk assessments were likely misguided, poorly executed, and impotent.

Promoters of FMEAs – and of risk analysis in general – regularly cite aerospace as the basis of their product or initiative, and as the model for how to do things in matters of risk, as any conference attendee on the topic can attest. Commercial aviation – as opposed to aerospace in general – should be the exemplar of risk management. In no other endeavor has mankind made such an inherently dangerous activity so safe as commercial jet flight.

While promoters of risk management of all sorts extol aviation, they tend to stray far from its methods, mindset, and values. This is certainly the case with the FMEA, a tool poorly understood, misapplied, poorly executed, and then blamed for failing to prevent catastrophe.

In the case of heparin, a properly performed FMEA exercise would certainly have identified the failure mode. But FMEA wasn’t really the right tool for identifying that hazard in the first place. A functional hazard analysis (FHA) or Business Impact Analysis (BIA) would have highlighted chemical contamination leading to death of patients, supply disruption, and reputation damage as a top hazard in minutes. I know this for a fact, because I use drug manufacture as an example when teaching classes on FHA. Day-one students identify that hazard without being coached.

FHAs can be done very early in the conceptual phase of a project or system design. They need no implementation details. They’re typically short and sweet, yielding concerns to address with high priority as a plan is taking form. Early writers on the topic of FMEA explicitly identified it as being directly opposed to FHA, the former being “bottom-up,” the latter “top-down.” NASA’s response to the USGS on the suitability of FMEAs for their needs, for example, stressed this point. FMEAs rely on at least preliminary implementation details to be useful. And they produce a lot of essential but lower-value content (essential because FMEAs help confirm which failure modes can be de-prioritized) at the time of design or process conception.

So a common failure mode of risk management is using FMEAs for purposes other than those for which they were designed. More generally, equating FMEA with risk analysis and risk management is a failure mode of management.

Assuming we stop misusing FMEAs, we then face the hurdle of doing them well. This is a challenge, as the quality of training, guidance, and facilitation of FMEAs has degraded markedly over the past twenty years. FMEAs, as promoted by the PMI, ISO 31000, and APM PRAM, to name a few, bear little resemblance to those in aviation. I know this from three decades of risk work in diverse industries, half of it in aerospace. You can see the differences by studying sample FMEAs on the web. I’ll give some specifics.

The inventors of the FMEA themselves acknowledged that FMEAs would need to be tailored for different domains. This was spelled out in the first version of MIL-P-1629 in 1949. But math, psychology, behavioral economics, and philosophy can all point out major flaws in the approach to FMEAs as commonly taught in most fields today. That is, the excuse that nuances of a specific industry turn bad analysis into good will not fly. The same laws of physics and economics apply to all industries.

I’m not sure how  FMEAs went so far astray. Some blame the explosion of enterprise risk management suppliers in the 1990s. ERM, partly rooted in the sound discipline of actuarial science, unfortunately took on many aspects of management fads of the period. It was up-sold by consultancies to their existing corporate clients, who assumed those consultancies actually had background in risk science, which they did not.  Studies a decade later by Protiviti and the EIU failed to show any impact on profit or other benefit of ERM initiatives, except for positive self-assessments by executives of the firms.

But bad FMEAs predated the ERM era. Adopted by the automotive industry in the 1970s, FMEAs seem to have been used to justify optimistic warranty claims estimates for accounting purposes. Few suspect that automotive engineers conspired to misrepresent reliability; but their rosy FMEAs indirectly supported bullish board presentations in struggling auto firms wracked by double-digit claims rates. While Toyota was implementing statistical process control to precisely predict the warranty cost of adverse tolerance accumulation, Detroit was pretending that multiplying ordinal scales of probability, severity, and detectability was mathematically or scientifically valid.

Citing inability to quantify failure rates of basic components and assemblies (an odd claim given the abundance of warranty and repair data), auto firms began to assign scores or ranks to failure modes rather than giving probability values between zero and one. This first appears in automotive conference proceedings around 1971. Lacking hard failure rates – if in fact they did – reliability workers could have estimated numeric probability values based on subjective experience or derived them from reliability handbooks then available. Instead they began to assign ranks or scores on a 1 to 10 scale.

In principle there is no difference between guessing a probability of 0.001 (a numerical probability value) and guessing a value of “1” on a 1-to-10 scale (either an ordinal number or a probability value mapped to a limited-range score); but in practice there is a big difference. I see this while doing risk assessments for clients. One difference is that those assigning probability scores in facilitated FMEA sessions usually use grossly different mental mapping processes to get from labels such as “extremely likely” or “moderately unlikely” to numerical probabilities. A physicist might take “likely” for a failure mode to mean more than once per million; a drug trial manager interprets it to mean more than 5%. Neither is wrong; the terms have different meanings in different domains. But if those two specialists aren’t alert to the issue, on jointly calling a failure likely, there will be an illusion of communication and agreement where none exists.

Further, FMEA participants don’t agree – and often don’t know they don’t agree – on the mapping of numerical probability values into 1-10 scores, unless they use an explicit mapping table to translate probabilities into probability scores. But if you have such a table, why use scores at all?
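
To make the point concrete, here is a minimal Python sketch of such a mapping table. The decade boundaries and the function name are hypothetical, invented for illustration and not taken from any FMEA standard. It shows what the table throws away: two probabilities that differ by a factor of about eight land on the same score.

```python
from bisect import bisect_left

# Hypothetical upper bounds (probability per operating hour) for scores 1..9.
# Anything above the last bound gets a score of 10. These boundaries are an
# assumption made for illustration only.
UPPER_BOUNDS = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]


def probability_to_score(p):
    """Translate a numeric probability into a 1-10 score via the table."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # bisect_left counts how many bounds are strictly below p.
    return bisect_left(UPPER_BOUNDS, p) + 1


low = 1.1e-4
high = 9e-4
print(probability_to_score(low), probability_to_score(high))
```

Both inputs come back with the same score, so anything computed from the score downstream can no longer tell them apart, while the original numbers could.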

There’s a reason, and it’s a poor one. Probability scores (or sometimes worse, ranks) between 1 and 10 are needed to generate the Risk Priority Numbers (RPN, alluded to above) made popular by the American automotive industry. You won’t find RPN or anything like it in aviation FMEAs – no arithmetic product of any measures of probability, severity and/or detectability. Probability values for given failure modes in specific operational modes are, however, calculated on the basis of observed failure frequency distributions and exposure rates. RPN attempts to move in this direction, but fails miserably.

RPNs are defined as the arithmetic product of a probability score, a severity score, and a detection (more precisely, the inverse of detectability) score. The explicit thinking here is that risks can be prioritized on the basis of the product of three numbers, each ranging from 1 to 10.

The implicit – but critical, though never addressed by users of RPN – thinking here is that all engineers, businesses, regulators and consumers are risk-neutral. Risk neutrality, as conceived in portfolio choice theory, would in this context mean that everyone would be indifferent between two risks of the same RPN, even if they comprise very different probability and severity values. That is, a failure mode with scores {2,8,4} would dictate the same risk response as failure modes with scores {8,4,2} and {4,4,4}, since the RPN values (the products of the scores) are equal. In the real world this is never true. Often it is very far from true. Most people and businesses are not risk-neutral; they’re risk-averse. That changes things. As a trivial example, banks might have valid reasons for caring more about a single $100M loss than one hundred $1M losses.
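
The arithmetic behind that example can be sketched in a few lines of Python. The score triples are the ones quoted above, read as (probability, severity, detection); the labels A, B and C and the function name are mine, added only for illustration.

```python
from math import prod

# (probability score, severity score, detection score), each on a 1-10 scale.
# The triples come from the text; the labels are hypothetical.
failure_modes = {
    "A": (2, 8, 4),
    "B": (8, 4, 2),
    "C": (4, 4, 4),
}


def rpn(scores):
    """Risk Priority Number: the arithmetic product of the three scores."""
    return prod(scores)


rpns = {name: rpn(scores) for name, scores in failure_modes.items()}
print(rpns)  # {'A': 64, 'B': 64, 'C': 64}
```

A ranking by RPN treats all three as ties, even though A is a rare, severe failure and B a frequent, milder one.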

Beyond the implicit assumption of risk-neutrality, RPN has other problems. As mentioned above, there are cognitive and group-dynamics issues when FMEA teams attempt to model probabilities as ranks or scores. Similar difficulties arise with scoring the cost of a loss, i.e., the severity component of RPN. Again there is the question: if you know the cost of a failure (in dollars, lives lost, or patients not cured), why convert a valid measurement into a subjective score (granting, for sake of argument, that risk-neutrality is justified)? Again the answer is to enter that score into the RPN calculation.

Still more problematic is the detectability value used in RPNs. In a non-trivial system or process, detectability and probability are not independent variables. And there is vagueness around the meaning of detectability. Is it the means by which you know the failure mode has happened, after the fact? Or is it an indication that the failure is about to happen, such that something can be observed and the failure thereby prevented? If the former, detection is irrelevant to the risk of failure; if the latter, detection should be operationalized in the model of the system. That is, if a monitor (e.g., brake fluid low) is in a system, the monitor is a component with its own failure modes and exposure times, which affect its probability of failure. This is how aviation risk analysis models such things.

A simple summary of the problems with scoring, ranking and RPN is that adding ambiguity to a calculation does not eliminate uncertainty about its parameters; it merely adds errors and reduces precision.

Another wrong turn has been the notion that a primary function of FMEAs is to establish cause of failures. Aviation found FMEAs to be ineffective for this purpose long ago. Reasoning back from observation to cause (a-posteriori logic) is tricky business and is often beyond the reach of facilitated FMEA sessions. This was one reason why supplier-FMEAs came to be. In defense of the “Cause” column on FMEA templates used in the automotive world, in relatively simple systems and components, causes are often entailed in failure modes (leakage caused by corrosion as opposed to leakage caused by stress fracture). In such cases cause may not be out of reach. But in the general case, more so in complex cases such as manufacturing processes or military operations, seeking causes in FMEAs encourages leaping to regrettable conclusions. I’ll dig deeper into the problem of causes in FMEAs in a future post.

I’ve identified several major differences between the approach to FMEAs used in aviation and the approach of those who claim to use methods based on aerospace. In addition to the reasons given above why I side with aviation on FMEA method, I’ll also note that we know the approach to risk used in aviation has reduced risk – by a factor of roughly a thousand, based on fatal accident rates since aviation risk methods were developed. I don’t think any other industry or domain can show similar success.

A partial summary of failure modes of common FMEA processes includes the following, based on the above discussion:

  • Confusing FMEA with Hazard Analysis
  • Equating FMEA with risk assessment
  • Viewing the FMEA as a Quality (QC) function
  • Insufficient rigor in establishing probability and severity values
  • Unwarranted (and implicit) assumption of risk-neutrality
  • Unsound quantification of risk (RPN)
  • Confusion about the role of detection
  • Using the FMEA as a root-cause analysis

The corrective action for most of these should be obvious: steer clear of RPN, operationalize detection methods, and use numeric (non-ordinal) probability and cost values (even if estimated) instead of masking ignorance and uncertainty with ranking and scoring. I’ll add more in a future post.

 – – –

Text and photos © 2016 by William Storage. All rights reserved.

Common-Mode Failure Driven Home

In a recent post I mentioned that probabilistic failure models are highly vulnerable to wrong assumptions of independence of failures, especially in redundant system designs. Common-mode failures in multiple channels defeat the purpose of redundancy in fault-tolerant designs. Likewise, if probability of non-function is modeled (roughly) as historical rate of a specific component failure times the length of time we’re exposed to the failure, we need to establish that exposure time with great care. If only one channel is in control at a time, failure of the other channel can go undetected. Monitoring systems can detect such latent failures. But then failures of the monitoring system tend to be latent.
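
A rough Python sketch shows how much a wrong independence assumption can matter in a two-channel design. It uses the beta-factor model, a common simplification in reliability work that I'm swapping in here for illustration; it is not a method described in this post. The failure rate, exposure time and beta value are all hypothetical.

```python
def channel_failure_probability(rate_per_hour, exposure_hours):
    """Rough probability of non-function: failure rate times exposure time.

    Only a fair approximation when the product is much smaller than 1.
    """
    return rate_per_hour * exposure_hours


def both_fail_independent(p):
    """Both channels fail, assuming their failures are independent."""
    return p * p


def both_fail_with_common_mode(p, beta):
    """Beta-factor sketch: a fraction beta of each channel's failures
    comes from a shared cause that takes out both channels at once."""
    independent_part = ((1.0 - beta) * p) ** 2
    common_part = beta * p
    return independent_part + common_part


RATE_PER_HOUR = 1e-5    # hypothetical component failure rate
EXPOSURE_HOURS = 10.0   # hypothetical exposure time
BETA = 0.01             # hypothetical: 1% of failures are common-mode

p = channel_failure_probability(RATE_PER_HOUR, EXPOSURE_HOURS)
independent = both_fail_independent(p)
with_common_mode = both_fail_with_common_mode(p, BETA)
print(f"independent: {independent:.2e}, with common mode: {with_common_mode:.2e}")
```

Under these assumed numbers, a common-mode share of only one percent makes loss of both channels roughly a hundred times more likely than the independent model predicts.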

For example, your car’s dashboard has an engine oil warning light. That light ties to a monitor that detects oil leaks from worn gaskets or loose connections before the oil level drops enough to cause engine damage. Without that dashboard warning light, the exposure time to an undetected slow leak is months – the time between oil changes. The oil warning light alerts you to the condition, giving you time to deal with it before your engine seizes.

But what if the light is burned out? This failure mode is why the warning lights flash on for a short time when you start your car. In theory, you’d notice a burnt-out warning light during the startup monitor test. If you don’t notice it, the exposure time for an oil leak becomes the exposure time for failure of the warning light. Assuming you change your engine oil every 9 months, loss of the monitor potentially increases the exposure time from minutes to months, multiplying the probability of an engine problem by several orders of magnitude. Aircraft and nuclear reactors contain many such monitoring systems. They need periodic maintenance to ensure they’re able to detect failures. The monitoring systems rarely show problems in the check-ups; and this fact often lures operations managers, perceiving that inspections aren’t productive, into increasing maintenance intervals. Oops. Those maintenance intervals were actually part of the system design, derived from some quantified level of acceptable risk.
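
The effect of a lost monitor on exposure time can be put in numbers with a short Python sketch. The leak rate, the ten-minute reaction time with a working light, and the 30-day month are assumptions chosen for illustration; only the 9-month oil-change interval comes from the example above.

```python
import math

HOURS_PER_MONTH = 30 * 24  # rough calendar approximation (assumption)

LEAK_RATE_PER_HOUR = 1e-6                           # hypothetical leak rate
EXPOSURE_WITH_LIGHT_HOURS = 10 / 60                 # assumed: ten minutes to react to the light
EXPOSURE_WITHOUT_LIGHT_HOURS = 9 * HOURS_PER_MONTH  # 9 months between oil changes


def probability_of_undetected_leak(rate_per_hour, exposure_hours):
    """Rate times exposure time; a fair approximation only when the product is small."""
    return rate_per_hour * exposure_hours


p_with = probability_of_undetected_leak(LEAK_RATE_PER_HOUR, EXPOSURE_WITH_LIGHT_HOURS)
p_without = probability_of_undetected_leak(LEAK_RATE_PER_HOUR, EXPOSURE_WITHOUT_LIGHT_HOURS)

ratio = p_without / p_with
orders_of_magnitude = math.log10(ratio)
print(f"ratio: {ratio:.0f}, orders of magnitude: {orders_of_magnitude:.1f}")
```

The leak rate cancels out of the ratio, so the result depends only on the two exposure times. With these assumptions the burnt-out light raises the probability by between four and five orders of magnitude.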

Common-mode failures get a lot of press when they’re dramatic. They’re often used by risk managers as evidence that quantitative risk analysis of all types doesn’t work. Fukushima is the current poster child of bad quantitative risk analysis. Despite everyone’s agreement that any frequencies or probabilities used in Fukushima analyses prior to the tsunami were complete garbage, the result for many was to conclude that probability theory failed us. Opponents of risk analysis also regularly cite the Tacoma Narrows Bridge collapse, the Chicago DC-10 engine-loss disaster, and the Mount Osutaka 747 crash as examples. But none of the affected systems in these disasters had been justified by probabilistic risk modeling. Finally, common-mode failure is often cited in cases where it isn’t the whole story, as with the Sioux City DC-10 crash. More on Sioux City later.

On the lighter side, I’d like to relate two incidents – one personal experience, one from a neighbor – that exemplify common-mode failure and erroneous assumptions of exposure time in everyday life, to drive the point home with no mathematical rigor.

I often ride my bicycle through affluent Marin County. Last year I stopped at the Molly Stone grocery in Sausalito, a popular biker stop, to grab some junk food. I locked my bike to the bike rack, entered the store, grabbed a bag of chips and checked out through the fast lane with no waiting. Ninety seconds at most. I emerged to find no bike, no lock and no thief.

I suspect that, as a risk man, I unconsciously model all risk as the combination of some numerical rate (occurrence per hour) times some exposure time. In this mental model, the exposure time to bike theft was 90 seconds. I likely judged the rate to be more than zero but still pretty low, given broad daylight, the busy location with lots of witnesses, and the affluent community. Not that I built such a mental model explicitly of course, but I must have used some unconscious process of that sort. Thinking like a crook would have served me better.

If you were planning to steal an expensive bike, where would you go to do it? Probably a place with a lot of expensive bikes. You might go there and sit in your pickup truck with a friend waiting for a good opportunity. You’d bring a 3-foot long set of chain link cutters to make quick work of the 10 mm diameter stem of a bike lock. Your friend might follow the victim into the store to ensure you were done cutting the lock and throwing the bike into the bed of your pickup to speed away before the victim bought his snacks.

After the fact, I had very different thoughts about this specific failure rate. More important, what is the exposure time when the thief is already there waiting for me, or when I’m being stalked?

My neighbor just experienced a nerve-racking common-mode failure. He lives in a San Francisco high-rise and drives a Range Rover. His wife drives a Mercedes. He takes the Range Rover to work, using the same valet parking-lot service every day. He’s known the attendant for years. He takes his house key from the ring of vehicle keys, leaving the rest on the visor for the attendant. He waves to the attendant as he leaves the lot on his way to the office.

One day last year he erred in thinking the attendant had seen him. Someone else, now quite familiar with his arrival time and habits, got to his Range Rover while the attendant was moving another car. The thief drove out of the lot without the attendant noticing. Neither my neighbor nor the attendant had reason for concern. This gave the enterprising thief plenty of time. He explored the glove box, finding the registration, which includes my neighbor’s address. He also noticed the electronic keys for the Mercedes.

The thief enlisted a trusted colleague, and drove the stolen car to my neighbor’s home, where they used the electronic garage entry key tucked neatly into its slot in the visor to open the gate. They methodically spiraled through the garage, periodically clicking the button on the Mercedes key. Eventually they saw the car lights flash and they split up, each driving one vehicle out of the garage using the provided electronic key fobs. My neighbor lost two cars through common-mode failures. Fortunately, the whole thing was on tape and the law men were effective; no vehicle damage.

Should I hide my vehicle registration, or move to Michigan?


In theory, there’s no difference between theory and practice. In practice, there is.