Category Archives: Risk management

Oroville Dam Risk Mismanagement

The Oroville Dam crisis exemplifies bad risk assessment, fouled by an unsound determination of hazard probability and severity.

It seems likely that Governor Brown and others in government believed that California’s drought was permanent, that they did so on irrational grounds, and that they concluded the risk of dam problems was low because the probability of flood was low. By doing this they greatly increased the cost of managing the risk, and increased the likelihood of loss of lives.

In other words, Governor Brown and others indulged in a belief that they found ideologically satisfying at the expense of sound risk analysis and management. Predictions of permanent drought in California were entertained by Governor Brown, the New York Times (California Braces for Unending Drought), Wired magazine (Drought Probably Forever) and other outlets last year when El Niño conditions failed to fill reservoirs. Peter Gleick of the Pacific Institute explained to KQED why the last drought would be unlike all others.

One would have to have immense confidence in the improbability of future floods to neglect ten-year-old warnings from several agencies of the need for dam repair. Apparently, many had such confidence. It was doubly unwarranted, given that with or without anthropogenic warming, the standard bimodal precipitation regime of a desert should be expected. That is, on the theory that man-made climate change exists, we should expect big rain years; and on rejection of that theory we should expect big rain years. Eternal-drought predictions were perhaps politically handy for raising awareness or for vote pandering, but they didn’t serve risk management.

Letting ideology and beliefs interfere with measuring risk by assessing the likelihood of each severity range of each hazard isn’t new. A decade ago many believed certain defunct financial institutions to be too big to fail, long before that phrase was understood to mean too big for the government to allow to fail. No evidence supported this belief, and plenty of counter-evidence existed.

This isn’t even the first time our government has indulged in irrational beliefs about weather. In the second half of the 1800s, many Americans, apparently including President Lincoln, believed, without necessarily stating it explicitly, that populating the wild west would cause precipitation to increase. The government enticed settlers to move west with land grants. There was a shred of scientific basis: plowing raises dust into the air, increasing the seeding of clouds. Coincidentally, there was a dramatic greening of the west from 1850 to 1880; but it was due to weather, not the desired climate change.

When the rains suddenly stopped in 1880, homesteaders faced decades of normal drought. Looking back, one wonders how farmers, investors and politicians could so deeply indulge in an irrational belief that led to very poor risk analysis.

 

– – –

Tom Hight is my name, an old bachelor I am,
You’ll find me out West in the country of fame,
You’ll find me out West on an elegant plain,
And starving to death on my government claim.

Hurrah for Greer County!
The land of the free,
The land of the bed-bug,
Grass-hopper and flea;
I’ll sing of its praises
And tell of its fame,
While starving to death
On my government claim.

Opening lyrics of a folk song by Daniel Kelley from the late 1800s, chronicling the pain of settling on a government land grant after the end of a multi-decade wet spell.

Teaching Kids about Financial Risk

Last year a Chicago teacher’s response to a survey on financial literacy included a beautifully stated observation:

“This [financial literacy] is a topic that should be taught as early as possible in order to curtail the mindset of fast money earned on the streets and gambling being the only way to improve one’s financial circumstances in life.”

The statement touches on the relationship between risk and reward and on the notion of delayed gratification. It also suggests that kids can grasp the fundamentals of financial risk.

The need for financial literacy is clear. Many adults don’t grasp the concept of compound interest, let alone how to calculate it. We’re similarly weak on the basics of risk. Combine the two weaknesses and you get the kind of investors who made Charles Ponzi famous.

I browsed the web for material on financial risk education for kids and found very little, and much of it misguided. As I mentioned earlier, many advice-giving sites confuse risk taking with confidence building.

On one site I found this curious claim about the relationship between risk and reward:

What is risk versus return? In finance, risk versus return is the idea that the amount of potential return is proportional to the amount of risk taken in a financial investment.

In fact, history shows that reward correlates rather weakly with risk. As the Chicago teacher quoted above notes, financial security – along with good health and other benefits – stems from knowing that some risks have much downside and little upside. Or, more accurately, it means knowing that some risks have vastly higher expected costs than expected profits.

Here expected cost means the sum, over every possible harmful outcome of taking the risk, of that outcome’s dollar cost times its probability; expected profit is the sum of each possible beneficial outcome times its probability. Expected value is the latter minus the former. This is simple economic risk analysis (expected value analysis). Many people get it intuitively. Others – some of whom are investment managers – pretend that some particular detail causes it not to apply to the case at hand. Or they might deny the existence of certain high-cost outcomes.
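For concreteness, here’s a minimal sketch of that arithmetic in Python. The outcomes and probabilities are invented for illustration only; a real analysis would draw them from frequency data or expert estimates.

```python
# Minimal sketch of the expected-value arithmetic described above.
# The outcomes and probabilities are invented for illustration only.

loss_outcomes = [   # (dollar cost, probability) for each harmful outcome
    (10_000, 0.05),
    (100_000, 0.01),
]
gain_outcomes = [   # (dollar profit, probability) for each beneficial outcome
    (5_000, 0.50),
    (20_000, 0.10),
]

expected_cost = sum(cost * p for cost, p in loss_outcomes)    # 500 + 1,000 = 1,500
expected_profit = sum(gain * p for gain, p in gain_outcomes)  # 2,500 + 2,000 = 4,500
expected_value = expected_profit - expected_cost              # 3,000

print(expected_cost, expected_profit, expected_value)
```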

I was once honored to give Susan Beacham a bit of input as she was developing her Money Savvy Kids® curriculum. Nearly twenty years later the program has helped over a million kids to develop money smarts. Analyses show the program to be effective in shaping financial attitudes and kids’ understanding of spending, saving and investing money.

Beacham’s results show that young kids, teens, and young adults can learn how money works, a topic that otherwise slips through gaps between subjects in standard schooling. Maybe we can do the same with financial risk and risk in general.

– – –

Postscript

A web site aimed at teaching kids about investment risk proposes an educational game where kids win candies by betting on the outcome of the roll of a six-sided die. Purportedly this game shows how return is proportional to risk taken. Before rolling the die the player states the number of guesses he or she will make on its outcome. The outcome is concealed until all guesses are made or a correct guess is made. Since the cost of any number of guesses is the same, I take the authors’ stated proportionality between reward and risk to mean that five guesses is less risky than two, for example, and therefore will have a lower yield. The authors provide the first two columns of the table below, showing the candies won vs. number of guesses. I added the Expected Value column, calculated as the expected profit minus the expected (actual) cost; a short calculation sketch follows the table.

I think the authors missed an opportunity to point out that, as constructed, the game makes the five-guess option a money pump. They also constructed the game so that reward is not proportional to risk (using my guess of their understanding of risk). They also missed an opportunity to explore the psychological consequences of the one- and two-guess options. Both have the same expected value, but much different winnings amounts. I’ve discussed risk-neutrality with ten-year-olds who seem to get the nuances better than some risk managers. It may be one of those natural proficiencies that are unlearned in school.

Overall, the game is constructed to teach that, while not proportional, reward does increase with risk, and, except for the timid who buy the five-guess option, high “risk” has no downside. This seems exactly the opposite of what I want kids to learn about risk.

No. of Guesses | Candies Won | Expected Value
5              | 1           | 5/6*1 - 1 = -0.17
4              | 2           | 4/6*2 - 1 = 0.33
3              | 5           | 3/6*5 - 1 = 1.50
2              | 10          | 2/6*10 - 1 = 2.33
1              | 20          | 1/6*20 - 1 = 2.33
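For anyone who wants to check the Expected Value column, this short sketch reproduces it, assuming (as I read the game) that each play costs one candy regardless of the number of guesses and that the guesses are distinct.

```python
# Reproduces the Expected Value column above, assuming each play costs one candy
# regardless of the number of guesses, and that the guesses are distinct.

payouts = {5: 1, 4: 2, 3: 5, 2: 10, 1: 20}   # number of guesses -> candies won

for guesses, candies in payouts.items():
    p_win = guesses / 6              # chance one of the distinct guesses matches the die
    ev = p_win * candies - 1         # expected winnings minus the one-candy cost
    print(f"{guesses} guesses: P(win) = {p_win:.2f}, EV = {ev:+.2f} candies")

# The five-guess option is the only one with negative expected value --
# the "money pump" noted above.
```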

 

Correcting McKinsey’s Fogged Vision of Risk

McKinsey’s recent promotional piece, Risk: Seeing around the corners, is a perfect example of why enterprise risk management is so ineffective (answering a question posed by Norman Marks). Citing a handful of well-worn cases of supply chain and distribution-channel failures, its advice for seeing around corners might be better expressed as driving while gazing into the rear-view mirror.

The article opens with the claim that risk-assessment processes expose only the most direct threats and neglect indirect ones. It finds “indirect” hazards (one step removed from harmful business impact) to be elusive. The hazards it cites, however, would immediately flow from a proper engineering-style hazard assessment; they are far from indirect. For example, failing to include in a risk assessment environment-caused damage to a facility with subsequent supply-chain interruption is a greenhorn move at best.


McKinsey has cultivated this strain of risk-management hype for a decade, periodically fertilized, as is the case here, with the implication that no means of real analysis exists. Presumably, the desired yield is customers concluding that McKinsey’s risk practice can nevertheless lead them through the uncharted terrain of risk. The blurry advice of this article, while perhaps raising risk awareness, does the disservice of further mystifying risk management.

McKinsey cites environmental impact on a supply chain as an example of a particularly covert risk, as if vendor failure from environmental hazards were somehow unforeseeable:

“At first glance, for instance, a thunderstorm in a distant place wouldn’t seem like cause for alarm. Yet in 2000, when a lightning strike from such a storm set off a fire at a microchip plant in New Mexico, it damaged millions of chips slated for use in mobile phones from a number of manufacturers.”

In fact, the Business Continuity Institute’s data shows tier-1 supplier problems due to weather and environment to be the second largest source of high-impact supply chain interruptions in 2015.

McKinsey includes a type of infographic it uses liberally. It has concentric circles and lots of arrows, and seems intent on fogging rather than clarifying (portion shown below for commentary and criticism purposes). More importantly, it reveals a fundamental problem with ERM’s conception of risk modeling – that enterprise risk should be modeled bottom-up – that is, from causes to effects. The text of the article implies the same, for example, the distant thunderstorm in the above quote.


Trying to list – as a risk-analysis starting point – all the possible root causes propagating up to impact on a business’s cost structure, financing, productivity, and product performance is indeed very difficult. And it is a task for which McKinsey can have no privileged insight.

This is a bottom-up (cause-first) approach. It is equivalent to examining the failure modes of every component of an aircraft and every conceivable pilot error to determine which can cause a catastrophic accident. There are billions of combinations of component failures and an infinite number of pilot errors to consider. This is not a productive route for modeling high-impact problems.

Deriving the relevant low-level causes of harmful business impacts through a systematic top-down process is more productive. This is the role of business impact analysis (BIA) in the form of Functional Hazard Assessment (FHA) and Fault Tree Analysis (FTA). None of these terms, according to Google, ever appear in McKinsey’s published materials. But they are how we, in my firm, do risk analyses – an approach validated by half a century of incontestable success in aviation and other high-risk areas.

An FHA view of the problem with which McKinsey fumbles would first identify the primary functions necessary for success of the business operation. Depending on specifics of the business these might include things like:

  • Manufacturing complex widgets
  • Distributing widgets
  • Marketing widgets
  • Selling product in the competitive widget space
  • Complying with environmental regulation
  • Issuing stock in compliance with SEC regulations

A functional hazard assessment would then look at each primary function and quantify some level of non-function the firm would consider catastrophic, and a level it would consider survivable but dangerous. It might name three or four such levels, knowing that the boundaries between them are somewhat arbitrary; the analysis accommodates this.

For example, an inability to manufacture product at 50% of the target production rate of one million pieces per month for a period exceeding two months might reasonably be judged to result in bankruptcy. Another level of production interruption might be deemed hazardous but survivable.

An FHA would include similar definitions of hazard classes (note I’m using the term “hazard” to mean any unwanted outcome, not just those involving unwanted energy transfers like explosions and lightning) for all primary functions of the business.

Once we have a list of top-level functional hazards – not the same thing as risk registers in popular risk frameworks – we can then determine, given implementation details of the business functions, what specific failures, errors, and external events could give rise to failure of each function.

For example, some things should quickly come to mind when asked what might cause manufacturing output to fall. They would include labor problems, supply chain disruption, regulatory action, loss of electrical power and floods. Some factors impacting production are simple (though not necessarily easy) to model. Floods, for example, have only a few possible sources. Others might need to be modeled systematically, involving many combinations of contributory events using tools like a qualitative or quantitative fault tree.

Looking specifically at the causes of loss of manufacturing capability due to supply chain interruption, we naturally ask ourselves what proximate causes exist there. Subject matter experts or a literature search would quickly list failures like:

  • IT/communications downtime
  • Cyber attack
  • Fire, earthquake, lightning, flood
  • Flu epidemic
  • Credit problem
  • Supplier labor dispute
  • Transportation labor dispute
  • Utility failure
  • Terrorism
  • Supplier ethics event
  • Regulatory change or citation

We would then assess the probability of these events as they contribute to the above top-level hazards, for which severity values have been assigned. At that point we have a risk assessment with some intellectual heft and actionable content.

Note that in that last step we are assigning probability values to the failures, either by using observed frequencies, in the case of floods, lightning and power outages, or with periodically updated estimates of subject matter experts, in the case of terrorism and labor disputes. In no case are we assigning ranks or scores to the probability of failures, as many risk frameworks dictate. Probability ranking of this sort (ranks of 1 through 5 or high, medium, low) has been the fatal flaw of many past risk-analysis failures. In reality, all important failure modes have low probability, especially when one-per-thousand and one-per-billion are both counted as low, as is often the case. I’ve discussed the problem of subjective probability assignment in earlier posts.
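To make that roll-up concrete, here is a minimal sketch with invented probabilities. It treats the listed contributors as independent inputs to an OR gate, fault-tree style; a real analysis would also condition on whether each event actually produces the defined level of production loss.

```python
# Minimal sketch: rolling contributor probabilities up to a top-level hazard,
# treating contributors as independent OR-gate inputs (fault-tree style).
# All probabilities are invented placeholders, not real estimates.

annual_prob = {                               # P(event occurs in a year)
    "flood at supplier site": 0.002,          # from observed frequency data
    "utility failure": 0.010,
    "supplier labor dispute": 0.030,          # subject-matter-expert estimate
    "cyber attack causing downtime": 0.020,
}

# P(at least one contributor occurs) = 1 - P(none occurs), assuming independence
p_none = 1.0
for p in annual_prob.values():
    p_none *= (1.0 - p)
p_interruption = 1.0 - p_none

severity = "50% production shortfall for more than two months"  # from the FHA above
print(f"P(supply-chain interruption per year) = {p_interruption:.3f}; hazard severity: {severity}")
```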

McKinsey’s article confuses uncertainty about event frequency with unforeseeability, implying that McKinsey holds special arcane knowledge about the future.

Further, as with many ERM writings, it advances a vague hierarchy of risk triggers and types of risk, including “hazard risk,” insurable risk, performance risk, cyber risk, environmental risk, etc. These complex taxonomies of risk reveal ontological flaws in their conception of risk. Positing kinds of risk leads to bewilderment and stasis. The need to do this dissolves if you embrace causality in your risk models. Things happen for reasons, and when bad things might happen, we call it risk. Unlike risk frameworks, we model risk by tracing effects back to causes systematically. And this is profoundly different from trying to pull causes from thin air as a starting point, and viewing different causes as different kinds of risk.

The approach I’m advocating here isn’t rocket science, nor is it even jet science. It is nothing new, but seems unknown within ERM. It is exactly the approach we used in my 1977 college co-op job to evaluate economic viability, along with safety, environmental, and project risk for a potential expansion of Goodyear Atomic’s uranium enrichment facility. That was back when CAPM and the efficient-market hypothesis were thought to be good financial models, when McKinsey was an accounting firm, and before ERM was a thing.

McKinsey concludes by stating that unknown and unforeseeable risks will always be with us, but that “thinking about your risk cascades is a concrete approach” to gaining needed insights. We view this as sloppy thinking – not concrete, but vague. Technically speaking, risks do not cascade; events and causes do. A concrete approach uses functional hazard assessments and related systematic, analytic tools.

The purpose of risk management is not to contemplate and ponder. It is to model risk by anticipating future unwanted events, to assess their likelihood and severity, and to make good decisions about their avoidance, mitigation, transfer or retention.

Leave a comment or reach me via the About link above to discuss how this can work in your specific risk arena.

–  –  –


In the San Francisco Bay area?

If so, consider joining us in a newly formed Risk Management meetup group.

Risk assessment, risk analysis, and risk management have evolved nearly independently in a number of industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk, etc.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase leading-edge trends, case studies, and best practices in our profession, with a focus on practical application and advancing the state of the art.

If you are in the bay area, please join us, and let us know your preferences for meeting times.

https://www.meetup.com/San-Francisco-Risk-Managers/

Functional Hazard Assessment Basics

Outside of aerospace and military projects, I see a lot of frustration about failures of failure mode effects analyses (FMEA) to predict system-level or enterprise-level problems. This often stems from the fact that FMEAs see a system or process as a bunch of parts or steps. The FMEA works by asking what happens if I break this part right here or that process step over there.

For this approach to predict high-level (system-sized) failures, many FMEAs would need to be performed to exhaustive detail. They would also need to predict propagation of failures (or errors) through a system, identifying consequences on the function of the system containing them. Since FMEAs focus on single-event system failure initiators, examining combinations of failures is unwieldy. In redundant equipment or processes, this need can be far beyond the limits of our cognitive capability. In an aircraft brake system, for example, there may be hundreds of thousands of combinations of failures that lead to hazardous loss of braking. Also, by focusing on single-event conditions, only external and environmental contributors to system problems that directly cause component failures get analyzed. Finally, FMEAs often fail to catch human error that doesn’t directly result in equipment failure.
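A quick count shows the scale of the problem for a hypothetical system with 1,000 distinct component failure modes:

```python
# Quick count of failure combinations for a hypothetical system with 1,000
# distinct component failure modes -- the reason exhaustive bottom-up
# enumeration of multi-failure conditions is impractical.
from math import comb

n = 1_000
print(comb(n, 2))   # 499,500 pairs of failures
print(comb(n, 3))   # 166,167,000 triples
```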

Consequently, we often call on FMEAs to do a job better suited for a functional hazard assessment (FHA). FHAs identify system-level functional failures, hazards, or other unwanted system-level consequences of major malfunction or non-function. With that, we can build a plan to eliminate, mitigate or transfer the associated risks by changing the design, deciding against the project, adding controls and monitors, adding maintenance requirements, defining operational windows, buying insurance, narrowing contract language, or making regulatory appeals. While some people call the FMEA a bottom-up approach, the FHA might be called a top-down approach. More accurately, it is a top-only approach, in the sense that it identifies top-level hazards to a system, process, or enterprise, independent of the system’s specific implementation details. It is also top-only in the sense that FHAs produce the top events of fault trees.

Terminology: In some domains – many ERM frameworks for example – the term “hazard” is restricted to risks stemming from environmental effects, war, and terrorism. This results in differentiating “hazard risks” from market, credit, or human capital risks, and much unproductive taxonomic/ontological hairsplitting. In some fields, business impact analysis (BIA) serves much the same purpose as FHA. While often correctly differentiated from risk analysis (understood to inform risk analyses), implementations of BIA vary greatly. Particularly in tech circles, BIA impact is defined at levels lower than actual impact on the business. That is, its meaning drifts into something like a specialized FMEA. For these reasons, I’ll use only the term “FHA,” where “hazard” means any critical unwanted outcome.

To be most useful, functional hazards should be defined precisely, so they can contribute to quantification of risk. That is, in the aircraft example above, loss of braking itself is not a useful hazard definition. Brake failure at the gate isn’t dangerous. A useful hazard definition would be something like reduction in aircraft braking resulting in loss of aircraft or loss of life. That constraint would allow engineers to model the system failure condition as something like loss of braking resulting in departing the end of a runway at speeds in excess of 50 miles per hour. Here, 50 mph might be a conservative engineering judgment of a runway departure speed that would cause a fatality or irreparable fuselage damage.

Hazards for other fields can take a similar form, e.g.:

  • Reputation damage resulting in revenue loss exceeding $5B in a fiscal year
  • Seizure of diver in closed-circuit scuba operations due to oxygen toxicity from excessive oxygen partial pressure
  • Unexpected coupon redemption in a campaign, resulting in > $1M excess promotion cost
  • Loss of chemical batch (value $1M) by thermal runaway
  • Electronic health record data breach resulting in disclosure of > 100 patient identifiers with medical history
  • Uncontained oil spill of > 10,000 gallons within 10 miles of LA shore

Note that these hazards tend to be stated in terms of a top-level or system-level function, an unwanted event related to that function, and specific, quantified effects with some sort of cost, be it dollars, lives, or gallons. Often, the numerical details are somewhat arbitrary, reflecting the values of the entity affected by the hazard. In other cases, as with aviation, guidelines on hazard classification come from a regulatory body. The FAA defines catastrophic, for example, as hazards “expected to result in multiple fatalities of the occupants, or incapacitation or fatal injury to a flight crewmember, normally with the loss of the airplane.”
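One way to keep such hazard statements disciplined is to record each as a structured entry: the function, the unwanted event, a quantified threshold, and a severity class. Here is a minimal sketch; the field names and entries are illustrative only, not drawn from any standard.

```python
# Minimal sketch of structured functional-hazard records. Field names and the
# entries are illustrative only, not drawn from any particular standard.
from dataclasses import dataclass

@dataclass
class FunctionalHazard:
    function: str        # top-level business or system function
    unwanted_event: str  # what goes wrong with that function
    threshold: str       # quantified effect that makes it a hazard
    severity: str        # e.g., catastrophic, hazardous, major

hazards = [
    FunctionalHazard("aircraft braking", "reduction in braking capability",
                     "runway departure above 50 mph", "catastrophic"),
    FunctionalHazard("brand and reputation", "reputation damage",
                     "revenue loss exceeding $5B in a fiscal year", "catastrophic"),
    FunctionalHazard("coupon campaign", "unexpected redemption rate",
                     "more than $1M excess promotion cost", "major"),
]

for h in hazards:
    print(f"[{h.severity}] {h.function}: {h.unwanted_event} ({h.threshold})")
```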

Organizations vary considerably in the ways they view the FHA process; but their objectives are remarkably consistent. Diagrams for processes as envisioned by NAVSEA and the FAA (and EASA in Europe) appear below.

Functional Hazard Analysis

The same for enterprise risk might look like:

  • Identify main functions or components of business opportunity, system, or process
  • Examine each function for effects of interruption, non-availability or major change
  • Define hazards in terms of above effects
  • Determine criticality of hazards
  • Check for cross-function impact
  • Plan for avoidance, mitigation or transfer of risk

In all the above cases, the end purpose, as stated earlier, is to inform trade studies (help decide between futures) or to eliminate, mitigate or transfer risk. Typical FHA outputs might include:

  • A plan for follow-on action – analyses, tests, training
  • Identification of subsystem requirements
  • Input to strategic decision-making
  • Input to design trade studies
  • The top events of fault trees
  • Maintenance and inspection frequencies
  • System operating requirements and limits
  • Prioritization of hazards, prioritization of risks*

The asterisk on prioritization of risks means that, in many cases, this isn’t really possible at the time of an FHA, at least in its first round. A useful definition of risk involves a hazard, its severity, and its probability. The latter, in any nontrivial system or operation, cannot be quantified – often even roughly – at the time of FHA. Thus the FHA identifies the hazards needing probabilistic quantification. The FAA and NAVSEA (examples used above) do not quantify risk as an arithmetic product of probability and severity (contrary to the beliefs of many who cite these domains as exemplars); but the two-dimensional (vector) risk values they use still require quantification of probability.
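A small sketch of why the vector view matters: the two invented risks below have identical probability-severity products yet would call for very different treatment.

```python
# Two invented risks with identical probability-times-severity products,
# illustrating why a scalar product hides what a (severity, probability)
# pair preserves.

risks = [
    # (name, probability per year, severity in dollars)
    ("frequent small outage", 1e-2, 1e5),   # product = 1,000
    ("rare loss of the plant", 1e-6, 1e9),  # product = 1,000
]

for name, p, severity in risks:
    print(f"{name:22}  P = {p:.0e}  severity = ${severity:,.0f}  product = {p * severity:,.0f}")

# Same expected annual loss, but the second risk is ruinous if it occurs
# and would normally call for a different treatment entirely.
```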

I’ll drill into details of that issue and discuss real-world use of FHAs in future posts. If you’d like a copy of my FHA slides from a recent presentation at Stone Aerospace, email me or contact me via the About link on this site. I keep slide detail fairly low, so be sure to read the speaker’s notes.


Medical Device Risk – ISO 14971 Gets It Right

William Storage
VP, LiveSky, Inc.,  Visiting Scholar, UC Berkeley History of Science

The novel alliance between security research firm MedSec and short-seller Carson Block of Muddy Waters LLC brought medical device risk into the news again this summer. The competing needs of healthcare cost-control for an aging population, a shift toward population-level outcomes, med-tech entrepreneurialism, changing risk-reward attitudes, and aggressive product liability lawsuits demand a rational approach to medical-device risk management. Forty-six Class-3 medical device recalls have been posted this year.

Medical device design and manufacture deserves our best efforts to analyze and manage risks. ISO 14971 (including EU variants) is a detailed standard providing guidance for applying risk management to medical devices. For several years I’ve been comparing different industries’ conceptions of risk and their approaches to risk management in my work with UC Berkeley’s Center for Science, Technology, Medicine and Society. In comparison to most sectors’ approach to risk, ISO 14971 is stellar.

My reasons for this opinion are many. To start with, its language and statement of purpose are ultra-clear. It’s free of jargon and of ambiguous terms such as risk scores and risk factors – the latter a potentially useful term that has incompatible meanings in different sectors. Miscommunication between different but interacting domains is wasteful, and could even increase risk. Precision in language is a small thing, but it sets a tone of discipline that many specs and frameworks lack. For example, the standard includes the following definitions:

  • Risk – combination of the probability of occurrence of harm and the severity of that harm
  • Hazard – potential source of harm
  • Severity – measure of the possible consequences of a hazard

Obvious as those may seem, defining risk in terms of hazards is surprisingly uncommon; leaving severity out of its definition is far too common; and many who include it define risk as an arithmetic product of probability and severity, which often results in nonsense.

ISO 14971 calls for a risk-analysis approach that is top-down. I.e., its risk analysis emphasizes functional hazard analysis first (ISO 14971 doesn’t use the acronym “FHA”, but its discussion of hazard analysis is function-oriented). Hazard analyses attempt to identify all significant harms or unwanted situations – often independent of any specific implementation of the function a product serves – that can arise from its use. Risk analyses based on FHA start with the hypothetical harms and work their way down through the combinations of errors and failures that can lead to that harm.

Despite similarity of the information categories between FHA and Failure Mode Effects Analysis (FMEA), their usage is (should be) profoundly different. As several authors have pointed out recently, FMEA was not invented for risk analysis, and is not up to the task. FMEAs simply cannot determine criticality of failures of any but the simplest components.

Further, FHA can reasonably accommodate harmful equipment states not resulting from failure modes, e.g., misuse, mismatched operational phase and operating mode, and other errors. Also, FHAs force us to specify criticality of situations (harm to the device user) rather than trying to tie criticality to individual failure modes. Again, this is sensible for complex and redundant equipment, while doing no harm for simple devices. While the standard doesn’t mention fault trees outright, it’s clear that in many cases the only rational defense of a residual risk of high severity in a complex device would be fault trees to demonstrate sufficiently low probability of hazards.

ISO 14971 also deserves praise for having an engineering perspective, rather than that of insurers or lawyers. I mean no offense to lawyers, but successful products and patient safety should not start with avoidance of failure-to-warn lawsuits, nor should it start with risk-transfer mechanisms.

The standard is pragmatic, allowing for a risk/reward calculus in which patients choose to accept some level of risk for a desired benefit. In the real world, risk-free products and activities do not exist, contrary to the creative visions of litigators. Almost everyone in healthcare agrees that risk/reward considerations make sense, but that calculus often fails to make its way into regulations and standards.

14971 identifies a proper hierarchy of risk-control options that provide guidance from conceptual design through release of medical devices. The options closely parallel those used in design of life-critical systems in aerospace and nuclear circles:

  1. inherent safety by design
  2. protective measures
  3. information for safety

As such, the standard effectively disallows claiming credit for warnings in device instructions as a risk-reduction measure without detailed analysis of such claims.

A very uncommon feature of risk programs is calling for regression analysis of potential new risks introduced by control measures. Requiring such regression analysis forces hazard analysis reports to be living documents and the resulting risk evaluations to be dynamic. A rough diagram of the risk management process of ISO 14971, based on one that appears in the standard, with minor clarifications (at least for my taste), appears below.

ISO 14971 risk management process

This standard also avoids the common pitfalls and fuzzy thinking around “detection” (though some professionals seem determined to introduce it in upcoming reviews). Presumably, its authors recognized that if monitors and operating instructions call for function-checks then detection is addressed in FHAs and FMEAs, and is not some vague factor to be stirred into risk calculus (as we see in RPN usage).

What’s not to like? Minor quibbles only. Disagreements between US and EU standards bodies address some valid, often subtle points. Terminology issues such as differentiating “as low as reasonably practicable” vs “as far as possible” bring to mind the learning curve that went with the FAA AC 25.1309 amendments in commercial aviation. This haggling is a good thing; it brings clarity to the standard.

Another nit – while the standard is otherwise free of risk-neutrality logic flaws, Annex D does give an example of a “risk chart” plotting severity against probability. However, to its credit, the standard says this is for visualization and does not imply that any conclusions be drawn from the relative positions of plotted risks.

Also, while severity values are quantified concretely (e.g., Significant = death, Moderate = reversible or minor injury, etc.), Annex D.3.4 needlessly uses arbitrary and qualitative probability ranges, e.g., “High” = “likely,” etc.

These are small or easy-to-fix concerns with a very comprehensive, systematic, and internally consistent standard. Its authors should be proud.

Comments on the COSO ERM Public Exposure 2016

In June, COSO, the Committee of Sponsoring Organizations of the Treadway Commission, requested comments on a new draft of its framework.  I discovered this two days before the due date for comments, and rushed to respond.  My comments are below. The document is available for public review here.

Most of my comments about this draft address Section 2, which deals with terminology. I’d like to stress that this concern stems not from a desire for semantic purity but from observations of miscommunication and misunderstandings between ERM professionals and those of various business units as well as a lack of conceptual clarity about risk within ERM.

Before diving into that topic in detail, I’ll offer two general comments based on observations from work in industry. I think we all agree that risk management must be a process, not a business unit. Despite this, many executives still equate risk management with regulatory compliance or risk transfer through insurance. That thinking was apparent in the Protiviti and EIU surveys of the 2000s, and, despite the optimism of Deloitte’s 2013 survey, is readily apparent if one reads between its lines. As with information technology, risk management is far too often viewed as a department down the hall, rather than an integral process. Sadly, part of this problem seems to stem from ERM’s self-image; ERM is often called “an area of management” in ERM literature. Risk management can no more be limited to an area of management than can engineering or supply-chain management; i.e., they require management, not just Management.

My second general comment is that the framework expends considerable energy on risk management but comparatively little on risk assessment. It is hard to imagine how risks can be managed without first being assessed, i.e., managed risks must be first identified and measured.

Nearly all risk management presentations lean on imagery and examples from aerospace and other human endeavors where inherently dangerous activities have been made safe through disciplined risk analysis and management. Many ERM practitioners believe that their best practices and frameworks draw heavily on the body of knowledge developed in aviation over the last 70 years. This belief is not totally justified. ERM educators and practitioners often use aerospace metaphors (or nuclear, mountaineering, scuba, etc.) but are unaware that the discipline of aerospace risk assessment and management categorically rejects certain axioms of ERM – particularly those tied to the relationships between the fundamental concepts of risk, likelihood or probability, severity and uncertainty. I’d like to offer here that moving a bit closer to conceptual and terminological alignment with the aerospace mindset would better serve the objectives of COSO.

At first glance ERM differs greatly in objective from aircraft safety, and has a broader scope. This difference in scope might be cited as a valid basis for the difference in approaches and mindsets I observe between the two domains. I’ll suggest that the perception of material differences is mostly an illusion stemming from our fear of flying and from minimal technical interchange between the two domains. Even assuming, for sake of argument, that aerospace risk analysis is validation-focused rather than a component of business decision-making and strategy, certain fundamental concepts would still be shared. I.e., in both cases we systematically identify risks, measure their impact, modify designs and processes to mitigate them, and apply the analysis of those risks to strategic decisions where we seek gain. This common thread running through all risk management would seem to warrant commonality in perspective, ideology, method, and terminology. Yet fundamental conceptual differences exist, which, in my view, prevent ERM from reaching its potential.

Before discussing how ERM might benefit from closer adherence to mindsets fostered by aerospace risk practice (and I use aerospace here as a placeholder – nuclear power, weapons systems, mountaineering and scuba would also apply) I’d like to stress that both probabilistic and qualitative risk analysis of many forms profoundly impact strategic decisions of aircraft makers. At McDonnell Douglas (now Boeing) three decades ago I piloted an initiative to use probabilistic risk analysis in the conceptual-design phase of aircraft models considered for emerging markets (as opposed to merely in the realm of reliability assessment and regulatory compliance). Since risk analysis is the only rational means for allocating redundancy within complex systems, the tools of safety analysis entered the same calculus as those evaluating time-to-market, financing, credit, and competitive risk.

In the proposed framework, I have significant concerns about the definitions given in Section 2 (“Understanding the Terms”). While terminology can be expected to vary across disciplines, I submit that these definitions do not serve COSO’s needs, and that they hamper effective communication between organizations. I’ll offer suggested revisions below.

P22 begins:

“There is risk in not knowing how an entity’s strategy and business objectives may be affected by potential events. The risk of an event occurring (or not), creates uncertainty.”

It then defines risk, given the context of uncertainty specified above:

Risk: “The possibility that events will occur and affect the achievement of strategy and business objectives.”

The relationship between risk and uncertainty expressed here seems to be either backward or circular. That is, uncertainty always exists in setting an entity’s strategy and business objectives. That uncertainty exists independent of whether a party has a stake in the success of the entity. Uncertainty – the state of not being definitely known, being undecided, or having doubt – entails risk, as “risk” is commonly used in society, most of business, science, and academics, only for those with a stake in the outcome.

I am aware that in many ERM frameworks, risk is explicitly defined as uncertainty about an outcome that can be either beneficial or undesirable. Such usage of the term has two significant problems. First, it causes misunderstandings in communications between ERM insiders and those affected by their decisions. Second, even within ERM, practitioners drift between that ERM-specific meaning and the meaning used by the rest of the world. This is apparent in the frequent use of expressions such as “risk mitigation” and “risk avoidance” within ERM literature. Use of these phrases clearly indicates a scope of “risk” limited to unwanted events, not to desired outcomes. Logically, no one would seek to mitigate benefit.

While the above definition of risk doesn’t explicitly connect beneficial outcomes with risk, the implicit connection is obvious in the relationships between risk and the other defined terms. If risk is “the possibility that events will occur” and those events can be beneficial or undesirable, then, as defined, the term risk covers both beneficial and undesirable events. Risk then communicates nothing beyond uncertainty about those events. As such, risk becomes synonymous with uncertainty.

Equating risk with uncertainty is unproductive; and expressing uncertainty as a consequence of risk (as stated at the beginning of P22) puts the cart before the horse. The general concept in risk studies is that risk is a consequence of uncertainty, not the cause of uncertainty. Decisions would be easy – virtually automatic – if uncertainty were removed from the picture.

Uncertainty about potential outcomes, some of which are harmful, is a necessary but insufficient feature of risk. The insufficiency of uncertainty alone in expressing risk is apparent if one considers, again, that no risk exists without a potential loss. Uncertainty exists at the roulette wheel regardless of your participation. You have risk only if you wager. The risk rises as your wager rises. Further, for a given wager, your risk is higher in American than in European roulette because American roulette’s additional possible outcome – the double-zero (not present elsewhere) – reduces your probability – i.e., increases your uncertainty – of winning. Thus rational management of risk entails recognition of two independent components of risk – uncertainty and loss. Below I suggest a revision of the definition of risk to accommodate this idea.
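A quick calculation makes the roulette point concrete. I assume a one-unit wager on a single number at the standard 35-to-1 payout; the example above doesn’t specify a bet type, so the payout is my assumption.

```python
# Expected return per one-unit single-number wager, 35-to-1 payout assumed,
# comparing European (single zero, 37 pockets) and American (double zero, 38).

def expected_return(pockets: int, payout: int = 35) -> float:
    p_win = 1 / pockets
    return p_win * payout - (1 - p_win)   # win the payout, or lose the unit wagered

print(f"European roulette: {expected_return(37):+.4f} per unit wagered")  # about -0.027
print(f"American roulette: {expected_return(38):+.4f} per unit wagered")  # about -0.053
```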

Understanding risk to involve both uncertainty and potential loss provides consistency with usage of the term in the realms of nuclear, aerospace, medicine, manufacturing statistical-process control, and math and science in general.

When considering uncertainty’s role in risk (and that they have profoundly different meanings), we can consider several interpretations of uncertainty. In math, philosophy, and logic, uncertainty usually refers to quantities that can be expressed as a probability – a value between zero and one – whether we can state that probability with confidence or not. We measure our uncertainty about the outcome of rolling a die by, assuming a fair die, examining the sample space. Given six possible outcomes of presumed equal likelihood, we assign a probability of 1/6 to each possible outcome. That is a measurement of our uncertainty about the outcome. Rolling a die thousands of times gives experimental confirmation of our uncertainty measurement. We express uncertainty about Manhattan being destroyed this week by an asteroid through a much different process. We have no historical (frequency) data from which to draw. But by measuring the distribution, age, and size of asteroid craters on the moon we can estimate the rate of large asteroid strikes on the earth. This too gives a measure of our uncertainty about Manhattan’s fate. We’re uncertain, but we’re not in a state of complete ignorance.

But we are ignorant of truly unforeseeable events – what Rumsfeld famously called unknown unknowns. Not even knowing what a list of such events would contain could also be called uncertainty; but it is a grave error to mix that conception of uncertainty (perhaps better termed ignorance) with uncertainty about the likelihood of known possible events. Much of ERM literature suffers from failing to make this distinction.

An important component of risk-management is risk-analysis in which we diligently and systematically aim to enumerate all possible events, thereby minimizing our ignorance – moving possible outcomes from the realm of ignorance to the realm of uncertainty, which can be measured, though sometimes only by crude estimates. It’s crucial to differentiate ignorance and uncertainty in risk management, since the former demands thoroughness in identifying unwanted events (often called hazards, though ERM restricts that term to a subset of unwanted events), while the latter is a component of a specific, already-identified risk.

Beyond facilitating communications between ERM practitioners and those outside it, a more disciplined use of language – using these separate concepts of risk, uncertainty and ignorance –  will promote conceptual clarity in managing risk.

A more useful definition of risk should include both uncertainty and loss and might take the form:

Risk:  “The possibility that unwanted events will occur and negatively impact the achievement of strategy and business objectives.”

To address the possible objection that risk may have a “positive” (desirable) element, note that risk management exists to inform business decisions; i.e., making good decisions involves more than risk management alone; it is tied to values and data external to risks. Nothing is lost by restricting risk to the unwanted consequences of unwanted events. The choice to accept a risk for the purpose of achieving a desirable outcome (gain, reward) is informed by thorough assessment of the risk. Again, without uncertainty, we’d have no risk; without risk, decisions would be easy. The possibility that by accepting a managed risk we may experience unforeseen benefits (beyond those for which the decision to accept the risk was made) is not excluded by the above proposed definition of risk. Finally, my above proposed definition is consistent with the common conception of risk-reward calculus.

One final clarification: I am not proposing that risk should in any way be an arithmetic product of quantified uncertainty and quantified cost of the potential loss. While occasionally useful, that approach requires a judgment of risk-neutrality that can rarely be justified, and is at odds with most people’s sense of risk tolerance. For example, we have no basis for assuming that a bank would consider one loss of a million dollars to be an equivalent risk to 10,000 losses of $100 each, despite both having the same mathematical expectation (expected value or combined cost of the loss).
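To illustrate the non-neutrality point with a standard textbook device rather than the bank example above: the two exposures below have the same expected loss, but a concave utility, here a purely hypothetical log-utility over remaining capital, ranks them differently.

```python
# Illustration of risk non-neutrality: two exposures with the same expected
# loss ($1,000) that a concave (risk-averse) utility ranks differently.
# The capital base and the log-utility are hypothetical choices.
import math

capital = 2_000_000

def certainty_equivalent(outcomes):
    """outcomes: list of (loss, probability). Returns the certain loss judged
    equivalent under expected log-utility of remaining capital."""
    eu = sum(p * math.log(capital - loss) for loss, p in outcomes)
    return capital - math.exp(eu)

certain_small = [(1_000, 1.0)]                     # a certain $1,000 loss
rare_large    = [(1_000_000, 0.001), (0, 0.999)]   # 1-in-1,000 chance of a $1M loss

print(round(certainty_equivalent(certain_small)))  # 1,000
print(round(certainty_equivalent(rare_large)))     # roughly 1,386 -- same expectation, more risk
```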

As an example of the implicit notion of a positive component of risk (as opposed to a positive component of informed decision-making), P25 states:

“Organizations commonly focus on those risks that may result in a negative outcome, such as damage from a fire, losing a key customer, or a new competitor emerging. However, events can also have positive outcomes, and these must also be considered.“

A clearer expression of the relationship between risk (always negative) and reward would recognize that positive outcomes result from deciding to accept managed and understood risks (risks that have been analyzed). With this understanding of risk, common to other risk-focused disciplines, positive outcomes result from good decisions that manage risks, not from the risks themselves.

This is not a mere semantic distinction, but a conceptual one. If we could achieve the desired benefit (“positive outcome”) without accepting the risk, we would certainly do so. This point logically ties benefits to decisions (based on risk analysis), not to risks themselves. A rewording of P25 should, in my view, explain that:

  • events (not risks) may result in beneficial or harmful outcomes
  • risk management involves assessment of the likelihood and cost of unwanted outcomes
  • risks are undertaken or planned-for as part of management decisions
  • those informed decisions are made to seek gains or rewards

This distinction clarifies the needs of risk management and emphasizes its role in good corporate decision-making.

Returning to the concept of uncertainty, I suggest that the distinction between ignorance (not knowing what events might happen) and uncertainty (not knowing the likelihood of an identified event) is important for effective analysis and management of risk. Therefore, in the context of events, the matter of “how” should be replaced with shades of “whether.” The revised definition I propose below reflects this.

The term severity is common in expressing the cost of the loss component of risk.  The definition of severity accompanying P25 states:

Severity: A measurement of considerations such as the likelihood and impacts of events or the time it takes to recover from events.

Since the definition of risk (both the draft original and my proposed revision) entails likelihood (possibility or probability), likelihood should be excluded from a definition of severity; they are independent variables. Severity is a measure of how bad the consequences of the loss can be. I.e., it is simply the cost of the hypothetical loss, if the loss were to occur. Severity can be expressed in dollars or lost lives. Reputation damage, competitive disadvantage, missed market opportunities, and disaster recovery all ultimately can be expressed in dollars. While we may only be able to estimate the cost of a loss, the probability of that loss is independent of its severity.

Recommended definitions for Section 2:

Event: An anticipated or unforeseen occurrence, situation, or phenomenon of any magnitude, having beneficial, harmful or unknown consequences

Uncertainty: The state of not knowing or being undecided about the likelihood of an event.

Severity: A measurement of the undesirability or cost of a loss

Risk:  The possibility that unwanted events will negatively impact the achievement of strategy and business objectives.

 

Historical perspective on the divergent concepts of risk, uncertainty, and probability

Despite humanity’s having mastered geometry and the quadratic formula in ancient times, the study of probability and uncertainty dates only to the mid-17th century, when Blaise Pascal was paid by a client to develop mathematics to gain an advantage in gambling. This was the start of the frequentist interpretation of probability, based on the idea that, under deterministic mechanisms, we can predict the frequencies of outcomes of various trials given a large enough data set. Pierre-Simon Laplace later formalized the subjectivist (Bayesian) interpretation of probability, in which probability refers to one’s degree of belief in a possible outcome. Both these interpretations of probability are expressed as a number between zero and one. That is, they are both quantifications of uncertainty about one or more explicitly identified potential outcomes.

The inability to identify a possible outcome, regardless of probability, stems from ignorance of the system in question. Such ignorance is in some cases inevitable. An action may have unforeseeable outcomes; flipping our light switch may cause a black hole to emerge and swallow the earth. But scientific knowledge combined with our understanding of the wiring of a house gives us license to eliminate that as a risk. Whether truly unforeseeable events exist depends on the situation; but we can say with confidence that many events called black swans, such as the Challenger explosion, Hurricane Katrina and the 2009 mortgage crisis were foreseeable and foreseen – though ignored. The distinction between uncertainty about the likelihood of an event and ignorance of the extent of the list of events is extremely important.

Yet confusing the inability to anticipate all possible unwanted events with a failure to measure or estimate the probability of identified risks is common in some risk circles. A possible source of this confusion was Frank Knight’s 1921 Risk, Uncertainty and Profit. Knight’s contributions to economic and entrepreneurial theory are laudable, but his understanding of set theory and probability was poor. Despite this, Knight’s definitions linger in business writing. Specifically, Knight defined “risk” as “measurable uncertainty” and “uncertainty” as “unmeasurable uncertainty.” Semantic incoherence aside, Knight’s terminology was inconsistent with all prior use of the terms uncertainty, risk, and probability in mathematical economics and science. (See chapters 2 and 10 of Stigler’s The History of Statistics: The Measurement of Uncertainty before 1900 for details.)

The understanding and rational management of risk requires that we develop and maintain clarity around the related but distinct concepts of uncertainty, probability, severity and risk, regardless of terminology. Clearly, we can navigate through some level of ambiguous language in risk management, but the current lack of conceptual clarity about risk in ERM has not well served its primary objective. Hopefully, renewed interest in making ERM integral to strategic decisions will allow a reformulation of the fundamental concepts of risk.

On the Use and Abuse of FMEAs

– William Storage, VP LiveSky, Inc.; Visiting Scholar, UC Berkeley History of Science

Analyzing about 80 deaths associated with the drug heparin in 2009, the FDA found that over-sulphated chondroitin sulfate with toxic effects had been intentionally substituted for a legitimate ingredient for economic reasons. That is, an unscrupulous supplier sold a counterfeit drug material costing 1% as much as the real thing; and it killed people.

This wasn’t unprecedented. Something similar happened with gentamicin in the late 1980s, with cefaclor in 1996, and again with DEG sold as glycerin in 2006.

Adulteration and toxic excipients are obvious failure modes of supply chains and operations for drug manufacturers. Presumably, the firms buying the adulterated raw material had conducted failure mode effects analyses at several levels. An early-stage FMEA would have seen the failure mode and assessed its effects, thereby triggering the creation of controls to prevent the failure. So what went wrong?

The FDA’s reports on the heparin incident did not make public any analyses done by the drug makers. But based on the “best practices” specified by standards bodies, consulting firms, and many risk managers, we can make a good guess. Their risk assessments were likely misguided, poorly executed, and impotent.

Promoters of FMEAs – and of risk analysis in general, as any conference attendee on the topic can attest – regularly cite aerospace as the basis of their product or initiative and as the model for how to do things in matters of risk. Commercial aviation – as opposed to aerospace in general – should be the exemplar of risk management. In no other endeavor has mankind made such an inherently dangerous activity so safe as commercial jet flight.

While promoters of risk management of all sorts extol aviation, they tend to stray far from its methods, mindset, and values. This is certainly the case with the FMEA, a tool poorly understood, misapplied, poorly executed, and then blamed for failing to prevent catastrophe.

In the case of heparin, a properly performed FMEA exercise would certainly have identified the failure mode. But FMEA wasn’t really the right tool for identifying that hazard in the first place. A functional hazard analysis (FHA) or Business Impact Analysis (BIA) would have highlighted chemical contamination leading to death of patients, supply disruption, and reputation damage as a top hazard in minutes. I know this for a fact, because I use drug manufacture as an example when teaching classes on FHA. Day-one students identify that hazard without being coached.

FHAs can be done very early in the conceptual phase of a project or system design. They need no implementation details. They’re typically short and sweet, yielding concerns to address with high priority as a plan is taking form. Early writers on the topic of FMEA explicitly identified it as directly opposed to FHA in direction, the former being “bottom-up,” the latter “top-down.” NASA’s response to the USGS on the suitability of FMEAs for their needs, for example, stressed this point. FMEAs rely on at least preliminary implementation details to be useful. And they produce a lot of essential but lower-value content (essential because FMEAs help confirm which failure modes can be de-prioritized) – content of little use at the time of design or process conception.

So a common failure mode of risk management is using FMEAs for purposes other than those for which they were designed. More generally, equating FMEA with risk analysis and risk management is a failure mode of management.

Assuming we stop misusing FMEAs, we then face the hurdle of doing them well. This is a challenge, as the quality of training, guidance, and facilitation of FMEAs has degraded markedly over the past twenty years. FMEAs as promoted by the PMI, ISO 31000, and APM PRAM, to name a few, bear little resemblance to those in aviation. I know this from three decades of risk work in diverse industries, half of it in aerospace. You can see the differences by studying sample FMEAs on the web. I’ll give some specifics.

The inventors of the FMEA themselves acknowledged that FMEAs would need to be tailored for different domains. This was spelled out in the first version of MIL-P-1629 in 1949. But math, psychology, behavioral economics, and philosophy can all point out major flaws in the approach to FMEAs as commonly taught in most fields today. That is, the excuse that nuances of a specific industry turn bad analysis into good will not fly. The same laws of physics and economics apply to all industries.

I’m not sure how FMEAs went so far astray. Some blame the explosion of enterprise risk management suppliers in the 1990s. ERM, partly rooted in the sound discipline of actuarial science, unfortunately took on many aspects of the management fads of the period. It was up-sold by consultancies to their existing corporate clients, who assumed those consultancies actually had a background in risk science, which they did not. Studies a decade later by Protiviti and the EIU failed to show any impact on profit or other benefit of ERM initiatives, except for positive self-assessments by executives of the firms.

But bad FMEAs predated the ERM era. Adopted by the automotive industry in the 1970s, FMEAs seem to have been used to justify optimistic warranty claims estimates for accounting purposes. Few suspect that automotive engineers conspired to misrepresent reliability; but their rosy FMEAs indirectly supported bullish board presentations in struggling auto firms wracked by double-digit claims rates. While Toyota was implementing statistical process control to precisely predict the warranty cost of adverse tolerance accumulation, Detroit was pretending that multiplying ordinal scales of probability, severity, and detectability was mathematically or scientifically valid.

Citing inability to quantify failure rates of basic components and assemblies (an odd claim given the abundance of warranty and repair data), auto firms began to assign scores or ranks to failure modes rather than giving probability values between zero and one. The practice first appears in automotive conference proceedings around 1971. Lacking hard failure rates – if in fact they did lack them – reliability workers could have estimated numeric probability values from subjective experience or derived them from the reliability handbooks then available. Instead they began to assign ranks or scores on a 1-to-10 scale.

In principle there is no difference between guessing a probability of 0.001 (a numerical probability value) and guessing a value of “1” on a 10-point scale (either an ordinal number or a probability value mapped to a limited-range score); but in practice there is a big difference. I see this while doing risk assessments for clients. One difference is that those assigning probability scores in facilitated FMEA sessions usually use grossly different mental mapping processes to get from labels such as “extremely likely” or “moderately unlikely” to numerical probabilities. A physicist takes “likely” for a failure mode to mean more than once per million; a drug trial manager interprets it to mean more than 5%. Neither is wrong; the terms have different meanings in different domains. But if those two specialists aren’t alert to the issue, then when they jointly call a failure mode likely there will be an illusion of communication and agreement where none exists.

Further, FMEA participants don’t agree – and often don’t know they don’t agree – on the mapping of numerical probability values into 1-10 scores. Unless, of course, they use an explicit mapping table to translate probabilities into probability scores. But if you have such a table, why use scores at all?
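Before answering that question, here is a minimal sketch, in Python, of what such a mapping table amounts to; the probability bands and scores are invented for illustration and not taken from any published standard. Note that once a numeric probability estimate is in hand, the scoring step only throws away resolution:

```python
# Hypothetical probability-to-score mapping table of the kind sometimes used
# to feed RPN calculations. The bands and scores are invented for illustration.
PROB_SCORE_BANDS = [
    (1e-6, 1),   # p < 1e-6  -> score 1  ("extremely unlikely")
    (1e-4, 3),   # p < 1e-4  -> score 3
    (1e-2, 6),   # p < 1e-2  -> score 6
    (1.0,  10),  # otherwise -> score 10 ("extremely likely")
]

def probability_to_score(p: float) -> int:
    """Collapse a numeric probability into a 1-10 score."""
    for upper_bound, score in PROB_SCORE_BANDS:
        if p < upper_bound:
            return score
    return 10

# The scoring step only discards information: these two estimates differ by
# more than a factor of four, yet receive identical scores.
print(probability_to_score(2e-5), probability_to_score(9e-5))   # -> 3 3
```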

There’s a reason, and it’s a poor one. Probability scores (or, sometimes worse, ranks) between 1 and 10 are needed to generate the Risk Priority Numbers (RPN) alluded to above, made popular by the American automotive industry. You won’t find RPN or anything like it in aviation FMEAs – no arithmetic product of any measures of probability, severity and/or detectability. Probability values for given failure modes in specific operational modes are, however, calculated on the basis of observed failure frequency distributions and exposure rates. RPN attempts to move in this direction, but fails miserably.

RPNs are defined as the arithmetic product of a probability score, a severity score, and a detection (more precisely, the inverse of detectability) score. The explicit thinking here is that risks can be prioritized on the basis of the product of three numbers, each ranging from 1 to 10.

The implicit thinking here – critical, though never addressed by users of RPN – is that all engineers, businesses, regulators and consumers are risk-neutral. Risk neutrality, as conceived in portfolio choice theory, would in this context mean that everyone is indifferent between two risks with the same RPN, even if they comprise very different probability and severity values. That is, an RPN formed from the scores {2,8,4} would dictate the same risk response as failure modes scored {8,4,2} and {4,4,4}, since the RPN values (the products of the scores) are all equal. In the real world this is never true. Often it is very far from true. Most people and businesses are not risk-neutral; they’re risk-averse. That changes things. As a trivial example, banks might have valid reasons for caring more about a single $100M loss than about one hundred $1M losses.
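To illustrate, here is a minimal sketch in Python; the score triples are those from the example above, while the probabilities and dollar losses attached to them are hypothetical:

```python
# Three hypothetical failure modes with identical RPNs but very different
# underlying probabilities and dollar losses. All numbers are invented;
# detection is set aside for simplicity.
failure_modes = {
    "single catastrophic loss": {"scores": (2, 8, 4), "p": 1e-5, "loss_usd": 50_000_000},
    "frequent nuisance loss":   {"scores": (8, 4, 2), "p": 5e-2, "loss_usd": 20_000},
    "middling loss":            {"scores": (4, 4, 4), "p": 1e-3, "loss_usd": 2_000_000},
}

for name, fm in failure_modes.items():
    s_prob, s_sev, s_det = fm["scores"]
    rpn = s_prob * s_sev * s_det                 # 64 in every case
    expected_loss = fm["p"] * fm["loss_usd"]
    print(f"{name:26s} RPN = {rpn}   expected loss = ${expected_loss:,.0f}")

# Every RPN is 64, yet the expected losses are $500, $1,000, and $2,000; and a
# risk-averse firm would likely worry most about the $50M single-event exposure
# regardless of expected loss.
```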

Beyond the implicit assumption of risk-neutrality, RPN has other problems. As mentioned above, there are cognitive and group-dynamics issues when FMEA teams attempt to model probabilities as ranks or scores. Similar difficulties arise with scoring the cost of a loss, i.e., the severity component of RPN. Again there is the question of why, if you know the cost of a failure (in dollars, lives lost, or patients not cured), you would convert a valid measurement into a subjective score (granting, for the sake of argument, that risk-neutrality is justified). Again the answer is to enter that score into the RPN calculation.

Still more problematic is the detectability value used in RPNs. In a non-trivial system or process, detectability and probability are not independent variables. And there is vagueness around the meaning of detectability. Is it the means by which you know the failure mode has happened, after the fact? Or is it an indication that the failure is about to happen, such that something can be observed in time to prevent the failure? If the former, detection is irrelevant to the risk of failure; if the latter, the detection should be operationalized in the model of the system. That is, if a monitor (e.g., a brake-fluid-low warning) is in a system, the monitor is a component with its own failure modes and exposure times, which affect its probability of failure. This is how aviation risk analysis models such things.
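To make that concrete, here is a rough sketch in Python of the aviation-style treatment; the failure rates and exposure times are invented, and the two failures are treated as independent for simplicity:

```python
import math

def prob_failure(failure_rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of at least one failure during an exposure interval,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-failure_rate_per_hour * exposure_hours)

# Invented numbers: a pump, and the low-pressure monitor meant to reveal the
# pump's failure before it matters.
p_pump_fails    = prob_failure(1e-5, exposure_hours=5.0)     # one 5-hour flight
p_monitor_fails = prob_failure(1e-6, exposure_hours=500.0)   # checked every 500 h

# The undetected (unannunciated) failure condition requires both failures, so
# the monitor's own failure rate and exposure time enter the arithmetic
# directly; no separate "detectability" score is involved.
p_undetected = p_pump_fails * p_monitor_fails
print(p_pump_fails, p_monitor_fails, p_undetected)
```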

A simple summary of the problems with scoring, ranking and RPN is that adding ambiguity to a calculation does not eliminate uncertainty about its parameters; it merely adds errors and reduces precision.

Another wrong turn has been the notion that a primary function of FMEAs is to establish the cause of failures. Aviation found FMEAs to be ineffective for this purpose long ago. Reasoning back from observation to cause (a posteriori logic) is tricky business and is often beyond the reach of facilitated FMEA sessions. This was one reason why supplier FMEAs came to be. In defense of the “Cause” column on FMEA templates used in the automotive world: in relatively simple systems and components, causes are often entailed in failure modes (leakage caused by corrosion as opposed to leakage caused by stress fracture). In such cases cause may not be out of reach. But in the general case, and more so in complex cases such as manufacturing processes or military operations, seeking causes in FMEAs encourages leaping to regrettable conclusions. I’ll dig deeper into the problem of causes in FMEAs in a future post.

I’ve identified several major differences between the approach to FMEAs used in aviation and that of those who claim to use methods based on aerospace. In addition to the reasons given above for siding with aviation on FMEA method, I’ll also note that we know the approach to risk used in aviation has reduced risk – by a factor of roughly a thousand, based on fatal accident rates since aviation risk methods were developed. I don’t think any other industry or domain can show similar success.

A partial summary of failure modes of common FMEA processes includes the following, based on the above discussion:

  • Confusing FMEA with Hazard Analysis
  • Equating FMEA with risk assessment
  • Viewing the FMEA as a Quality (QC) function
  • Insufficient rigor in establishing probability and severity values
  • Unwarranted (and implicit) assumption of risk-neutrality
  • Unsound quantification of risk (RPN)
  • Confusion about the role of detection
  • Using the FMEA as a root-cause analysis

Corrective actions for most of these should be obvious: steer clear of RPN, operationalize detection methods, and use numeric (non-ordinal) probability and cost values (even if estimated) instead of masking ignorance and uncertainty with ranking and scoring. A rough sketch of that last point appears below. I’ll add more in a future post.
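As a rough sketch of that numeric alternative, in Python, with hypothetical failure modes, probabilities, and costs (and expected loss as just one possible ranking criterion):

```python
# Hypothetical failure modes with estimated numeric probabilities (per year)
# and estimated costs, in place of 1-10 scores. Names and numbers are invented.
failure_modes = [
    {"name": "contaminated raw material", "p_per_year": 2e-3, "cost_usd": 80_000_000},
    {"name": "mislabeled packaging",      "p_per_year": 5e-2, "cost_usd": 3_000_000},
    {"name": "cold-chain excursion",      "p_per_year": 1e-1, "cost_usd": 400_000},
]

# Rank by expected annual loss. A risk-averse organization might further weight
# the low-probability, high-severity items more heavily than this ranking does.
ranked = sorted(failure_modes, key=lambda f: f["p_per_year"] * f["cost_usd"], reverse=True)
for fm in ranked:
    loss = fm["p_per_year"] * fm["cost_usd"]
    print(f'{fm["name"]:26s} expected annual loss = ${loss:,.0f}')
```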

 – – –

Text and photos © 2016 by William Storage. All rights reserved.