Outside of aerospace and military projects, I see a lot of frustration about failures of failure mode effects analyses (FMEA) to predict system-level or enterprise-level problems. This often stems from the fact that FMEAs see a system or process as a bunch of parts or steps. The FMEA works by asking what happens if I break this part right here or that process step over there.
For this approach to predict high-level (system-sized) failures, many FMEAs would need to be performed to exhaustive detail. They would also need to predict propagation of failures (or errors) through a system, identifying consequences on the function of the system containing them. Since FMEAs focus on single-event system failure initiators, examining combinations of failures is unwieldy. In redundant equipment or processes, this need can be far beyond the limits of our cognitive capability. In an aircraft brake system, for example, there may be hundreds of thousands of combinations of failures that lead to hazardous loss of braking. Also, by focusing on single-event conditions, only external and environmental contributors to system problems that directly cause component failures get analyzed. Finally, FMEAs often fail to catch human error that doesn’t directly result in equipment failure.
Consequently, we often call on FMEAs to do a job better suited for a functional hazard assessments (FHA). FHAs identify system-level functional failures, hazards, or other unwanted system-level consequences of major malfunction or non-function. With that, we can build a plan to eliminate, mitigate or transfer the associated risks by changing the design, deciding against the project, adding controls and monitors, adding maintenance requirements, defining operational windows, buying insurance, narrowing contract language, or making regulatory appeals. While some people call the FMEA a bottom-up approach, the FHA might be called a top-down approach. More accurately, it is a top-only approach, in the sense that it identifies top-level hazards to a system, process, or enterprise, independent of the system’s specific implementation details. It is also top-only in the sense that FHAs produce the top events of fault trees.
Terminology: In some domains – many ERM frameworks for example – the term “hazard” is restricted to risks stemming from environmental effects, war, and terrorism. This results in differentiating “hazard risks” from market, credit, or human capital risks, and much unproductive taxonomic/ontological hairsplitting. In some fields, business impact analysis (BIA) serves much the same purpose as FHA. While often correctly differentiated from risk analysis (understood to inform risk analyses), implementation details of BIA varies greatly. Particularly in tech circles, BIA impact is defined for levels lower than actual impact on business. That is, its meaning drifts into something like a specialized FMEA. For these reasons, I’ll use only the term “FHA,” where “hazard” means any critical unwanted outcome.
To be most useful, functional hazards should be defined precisely, so they can contribute to quantification of risk. That is, in the aircraft example above, loss of braking itself is not a useful hazard definition. Brake failure at the gate isn’t dangerous. Useful hazard definition would be something like reduction in aircraft braking resulting in loss of aircraft or loss of life. That constraint would allow engineers to model the system failure condition as something like loss of braking resulting in departing the end of a runway at speeds in excess of 50 miles per hour. Here, 50 mph might be a conservative engineering judgement of a runway departure speed that would cause a fatality or irreparable fuselage damage.
Hazards for other fields can take a similar form, e.g.:
- Reputation damage resulting in revenue loss exceeding $5B in a fiscal year
- Seizure of diver in closed-circuit scuba operations due to oxygen toxicity from excessive oxygen partial pressure
- Unexpected coupon redemption in a campaign, resulting in > 1M$ excess promotion cost
- Loss of chemical batch (value $1M) by thermal runaway
- Electronic health record data breach resulting in disclosure of > 100 patient identifiers with medical history
- Uncontained oil spill of > 10,000 gallons within 10 miles of LA shore
Note that these hazards tend to be stated in terms of a top-level or system-level function, an unwanted event related to that function, and specific, quantified effects with some sort of cost, be it dollars, lives, or gallons. Often, the numerical details are somewhat arbitrary, reflecting the values of the entity affected by the hazard. In other cases, as with aviation, guidelines on hazard classification comes from a regulatory body. The FAA defines catastrophic, for example, as hazards “expected to result in multiple fatalities of the occupants, or incapacitation or fatal injury to a flight crewmember, normally with the loss of the airplane.”
Organizations vary considerably in the ways they view the FHA process; but their objectives are remarkably consistent. Diagrams for processes as envisioned by NAVSEA and the FAA (and EASA in Europe) appear below.
The same for enterprise risk might look like:
- Identify main functions or components of business opportunity, system, or process
- Examine each function for effects of interruption, non-availability or major change
- Define hazards in terms of above effects
- Determine criticality of hazards
- Check for cross-function impact
- Plan for avoidance, mitigation or transfer of risk
In all the above cases, the end purpose, as stated earlier, is to inform trade studies (help decide between futures) or to eliminate, mitigate or transfer risk. Typical FHA outputs might include:
- A plan for follow-on action – analyses, tests, training
- Identification of subsystem requirements
- Input to strategic decision-making
- Input to design trade studies
- The top events of fault trees
- Maintenance and inspection frequencies
- System operating requirements and limits
- Prioritization of hazards, prioritization of risks*
The asterisk on prioritization of risks means that, in many cases, this isn’t really possible at the time of an FHA, at least in its first round. A useful definition of risk involves a hazard, its severity, and its probability. The latter, in any nontrivial system or operation, cannot be quantified – often even roughly – at the time of FHA. Thus the FHA identifies the hazards needing probabilistic quantification. The FAA and NAVSEA (examples used above) do not quantify risk as an arithmetic product of probability and severity (contrary to the beliefs of many who cite these domains as exemplars); but the two-dimensional (vector) risk values they use still require quantification of probability.
I’ll drill into details of that issue and discuss real-world use of FHAs in future posts. If you’d like a copy of my FHA slides from a recent presentation at Stone Aerospace, email me or contact me via the About link on this site. I keep slide detail fairly low, so be sure to read the speaker’s notes.
– – –
In the San Francisco Bay area?
If so, consider joining us in a newly formed Risk Management meetup group.
Risk assessment, risk analysis, and risk management have evolved nearly independently in a number of industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk, etc.
This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase leading-edge trends, case studies, and best practices in our profession, with a focus on practical application and advancing the state of the art.
If you are in the bay area, please join us, and let us know your preferences for meeting times.