# Fault Tree Construction Basics

Last week I introduced fault trees, giving a hint at what they’re good for, and showing some fault-tree diagram basics. Today I’ll focus on the mechanics of building one.

The image below, showing two different diagrams for the exact same logical fault tree, serves as a quick review. The top event, “A,” has two 2nd-level events, “B” and “C,” having two and three child events respectively. An OR gate is associated with the top event and event B, while an AND gate is associated with event C. Events 1 through 5 are basic events, which are initiators. They are literally “bottom events” though we don’t usually use that term.

The difference between the top and bottom renderings below is just a matter of formatting. The bottom rendering looks much nicer in diagrams when we replace “A,” “B,” “2,” etc. with descriptive text. Here I’ll use whichever convention fits best on the page. It’s important to realize that in the bottom style, the logic gate goes with (sticks to) the event above it.

Last time I mentioned that building a tree is a top-down process in the sense that you start with the top event of the tree. Since fault trees can have only one top event but a large number of bottom events (initiators), the analogy with living trees is weak. An organizational chart might be a closer analogy, but even that isn’t accurate, since fault trees can contain the same initiator events in multiple places. I’ll show how this can happen later.

We typically get the top event of a tree from a hazard assessment where an unwanted outcome has been deemed critical – something like an aircraft hitting the end of a runway at speeds over 50 mph due to a brake system fault.

From the top event, we identify its high-level contributors. In the case of the aircraft brakes example, the redundancy designed into the system may make descriptions of these conditions a bit wordy. For example, consider a dual-hydraulic-brake design with two systems, each feeding half the brakes, with eight brakes total. In that system, one equipment state causing the top event (there are several) would be complete loss of hydraulic system number 1 plus sufficient additional failures to render braking inadequate. Those additional failures could include, for example, the complete failure of hydraulic system number 2 OR mechanical failures of one or more of the system 2 brakes OR loss of the command signal to the system 2 brakes, and a few others.

That is an example of one of the possible causes (i.e., one of the 2nd level events) of faults that could cause the loss of braking specified in the top event. There may be five or ten others, any of which could produce the hazardous state. The word “any” in the previous sentence tells you the relationship between the top event and this collection of second-level contributors. It is an OR relationship, since any one of them would be sufficient to cause the hazard. An OR gate is therefore tied directly to the top event.

For the first 2nd-level intermediate event of the aircraft brake system example above, we would carefully come up with a name for this fault; we can refine it later. Something like “Loss of hydraulic system #1 plus additional failures” would be sufficient. Note that a good name for this event gives a clue about the logic gate associated with it. The word “plus” suggests that the gate for this event will be an AND gate.

A more accurate description of the event – perhaps too wordy – would be “Loss of hydraulic system #1 plus any of several different additional failures or combinations of failures.” This tells us that this event, as we’re modeling it, will have two children, which we can label:

1. Loss of hydraulic system 1 hydraulic power to brakes
2. Additional failures leading to loss of braking (meaning loss of braking sufficient to result in the top event when combined with loss of system 1 hydraulic power to brakes)

Without knowing anything else about the system and its operation at this point (details we would get from system schematics and operating manuals) we can’t really specify the gate associated with the first of the two events listed above. If hydraulic system #1 is itself redundant, it might be an AND gate, otherwise an OR. We can infer that event #2 above has an OR gate beneath it, since several combinations of additional failures might be sufficient to render the whole system ineffective.
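The structure modeled so far can be sketched as data. This is purely illustrative; the event names are shortened from the text, and the representation isn’t from any particular FTA tool:

```python
# Illustrative representation of the partial aircraft-brake fault tree.
# "OR": any child event suffices; "AND": all child events are required.

def gate(kind, *children):
    return {"gate": kind, "children": list(children)}

tree = gate(
    "OR",  # any one of the 2nd-level contributors causes the top event
    gate(
        "AND",  # the word "plus" in the event name signals an AND gate
        "Loss of hydraulic system 1 power to brakes",
        gate(
            "OR",  # several additional-failure combinations would each suffice
            "Complete failure of hydraulic system 2",
            "Mechanical failure of system 2 brakes",
            "Loss of command signal to system 2 brakes",
        ),
    ),
    # the five or ten other 2nd-level contributors would be added here
)
```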

Here’s a diagram for what we’ve modeled so far on the brake system failure. This diagram, of the bottom style in the above image, also includes small tags below each event description (above its gate) containing event IDs used by fault tree software. Ignore these for now.

To help understand the style of thinking involved in modeling systems, whether physical, like a brake system, or procedural, like complex surgery, compare what we’ve done so far, for aircraft brakes, with the brake system of your car. For modeling purposes, let’s ignore the parking brake for now.

Your car’s brake system also contains some redundancy, but it’s limited. If it’s a simple hydraulic brake system (most cars are more complex), it has two cylinders, one powering the front two brakes and one powering the rear. Both of these have to be in a failed state for you to be without brakes for hydraulic reasons.

Notice I said “have to be in a failed state” and not “have to fail.” The likelihood of both failing independently during a trip is much lower than the probability that one failed some time ago without detection and one failed during the trip. Fault trees deal with both these cases, the latter involving a latent failure and a monitor with an indicator to report the otherwise latent failure. Of course, the monitor or the indicator might be in a failed state, again without your knowing it. Redundancy and monitors complicate things. We’ll get into that later.
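A back-of-envelope comparison shows why the latent case dominates. All numbers here are invented for illustration; they are not automotive failure data:

```python
from math import exp

# Assumed numbers, for illustration only: each brake cylinder fails at
# rate lam per hour; a trip lasts t hours; the latent (unmonitored)
# exposure T is the time since the system was last checked.
lam = 1e-6   # failures per hour (assumed)
t = 1.0      # a one-hour trip
T = 1000.0   # a long time since last inspection (assumed)

p_trip = 1 - exp(-lam * t)    # one cylinder fails during this trip
p_latent = 1 - exp(-lam * T)  # one cylinder already failed, undetected

both_fail_this_trip = p_trip * p_trip     # both fail independently in one trip
latent_plus_trip = 2 * p_latent * p_trip  # one latent, one in-trip (either order)
```

With these assumptions the latent-plus-trip case is orders of magnitude more probable than two independent in-trip failures, which is why fault trees must account for latency and monitoring.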

Since both of your car’s brake cylinders must be failed for you to be without hydraulic power for braking, you might think that a fault tree for total loss of braking in your car would start with a top event having an AND gate. But your brake pedal probably isn’t redundant. If it falls off, you’re without brakes. Historical data on a large population of cars shows this event to have a very low probability. But we’ll include it for thoroughness. If we diagram the model we’ve developed for car brakes so far, we have this:
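Numerically, the model so far is pedal failure OR (front circuit failed AND rear circuit failed). A minimal sketch, with all probabilities invented for illustration:

```python
# Sketch of the car-brake model: total loss of braking occurs if the
# pedal fails OR both hydraulic circuits are in a failed state.
# All probability values are assumptions, not real data.

def p_or(*ps):
    # probability that at least one of several independent events occurs
    q = 1.0
    for p in ps:
        q *= (1 - p)
    return 1 - q

def p_and(*ps):
    # probability that all of several independent events occur
    q = 1.0
    for p in ps:
        q *= p
    return q

p_pedal = 1e-8  # assumed: pedal detachment (single-point failure)
p_front = 1e-4  # assumed: front hydraulic circuit failed
p_rear = 1e-4   # assumed: rear hydraulic circuit failed

p_top = p_or(p_pedal, p_and(p_front, p_rear))
```

With these invented numbers the non-redundant pedal contributes as much to the top event as the entire redundant hydraulic pair, which is the point of the single-point-failure discussion.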

One thing immediately apparent in these two barely-begun fault trees, one for aircraft brakes, and one for car brakes, is that car brakes, as modeled, have a single-point failure leading to the top event, albeit an improbable one. The FAA guidance for design of aircraft systems specifies that no single failure, regardless of probability, should have catastrophic consequences. If we imagine a similar requirement for car design, we would have to add a brake subsystem with an independent actuator, like the separate hand or foot-controlled parking brake actuator in most cars. Have you tested yours lately?

Next time we’ll explore the bottom events of fault trees, the initiator events, and the topic of exposure times and monitors. Working with these involves examination of failure probabilities and the times during which you’re exposed to failures, some of which – as with your parking brake – may be very long, dramatically increasing the probability that a latent failure has occurred, resulting in a loss of perceived redundancy.

– – –

In the San Francisco Bay area?

If you live near San Francisco, consider joining our newly formed Risk Management meetup group.

Risk management has evolved separately in various industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk.

This meetup seeks to build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase current trends, case studies, and best practices in our profession with a focus on practical application and advancing the state of the art.

https://www.meetup.com/San-Francisco-Risk-Managers/

# Driverless Cars, Accountability and Insurance

Today at the Algorithms in Culture conference at the Berkeley Institute of Data Science, Helen Nissenbaum of NYU gave a fabulous keynote, Values in Algorithms: Then and Now, in which she examined bias and discrimination in, or resulting from, algorithms in credit scoring, IoT, predictive analytics, and targeted advertising. A central theme of her talk was accountability, bias, and governability of emerging technologies – along with other “newfangled societal quandaries.” She touched on insurance, aircraft autopilot systems and driverless cars, all topics dear to the risk analyst.

Citing massive data breaches that neither broke laws nor resulted in civil litigation, Helen suggested that a reduction in accountability has accompanied automation. She fears that future delegation of function will not necessitate delegation of accountability.

This made me wonder whether in the case of driverless cars, the reverse might actually be true. With human-controlled cars, when they crash into each other, we indeed have a strong sense of accountability in most US states. But this hinges on the decidability of fault and blame, which is not a trivial detail. Accountability isn’t simply a moral matter; it is a legal one involving highly nuanced legal code, judges, juries, and evidence collection. Deciding accountability in car crashes is both flawed and somewhat arbitrary.

I asked Helen if she had considered the possibility that driverless cars could in fact increase real accountability in the sense that when driverless cars crash, the blame will fall directly on a car maker. If driverless-car technology is feasible outside of the bay area sandbox, and if it is as reliable as auto-flight systems, determining fault, even in the case of two-car collisions, will be much less arbitrary. In that sense, the algorithms might effectively reduce bias. She seemed to think the answer to that would depend on the reliability of the cars and how their accident rates will compare to that of human drivers. I’m less convinced that reliability need enter the equation, but I haven’t really explored the topic in any detail.

Regardless of the bias issue, I’d think driverless cars will force big changes on insurance premiums and the insurance world in general. Perhaps more interesting, driverless-car algorithms will have to embody risk analysis and risk-reward calculus in a major way. The trolley problem, in all its incarnations (a favorite of philosophers who force us to decide which life to save), may have to be encoded; and software developers might need to minor in Mill, Bentham, and rule-utilitarianism.

# ECRI’s Top 10 Health Technology Hazards

William Storage – Nov 30, 2016
VP, LiveSky, Inc.
Visiting Scholar, UC Berkeley Center for Science, Technology, Medicine & Society

ECRI recently published their list of top ten health technology hazards for 2017. ECRI has released such a list each year since at least 2008.

ECRI’s top ten for 2017 (requires registration) as they label them:

1. Infusion Errors Can Be Deadly If Simple Safety Steps Are Overlooked
3. Missed Ventilator Alarms Can Lead to Patient Harm
4. Undetected Opioid-Induced Respiratory Depression
5. Infection Risks with Heater-Cooler Devices Used in Cardiothoracic Surgery
6. Software Management Gaps Put Patients, and Patient Data, at Risk
7. Occupational Radiation Hazards in Hybrid ORs
8. Automated Dispensing Cabinet Setup and Use Errors May Cause Medication Mishaps
9. Surgical Stapler Misuse and Malfunctions
10. Device Failures Caused by Cleaning Products and Practices

ECRI is no doubt aiming their publication at a broad audience. The wording of several of these, from the standpoint of hazard assessment, could be refined a bit to better inform mitigation plans. For example, the first item in the list (infusion errors) doesn’t really name an actual hazard (unwanted outcome). I take a crack at it below, along with a few comments on some of the other hazards.

ECRI lists their criteria for inclusion in the list. They include – in system-safety terminology – severity, frequency, scope (ECRI: “breadth,” “insidiousness”), detectability (“insidiousness”), profile, and preventability.

That seems a good set of criteria, though profile might better point to an opportunity for public education rather than be a good criterion for ranking risks. We’d hope that subject matter experts would heavily discount public concern for imaginary hazards.

Infusion failures/errors resulting in wrong dose, rate, duration or contamination

I’m guessing there’s a long list of possible failures and errors that could lead to or contribute to this hazard. Some that come to mind:

• Software bugs
• Human-computer interaction (HCI) errors (wrong value entered due to extra keystroke)
• Unit-of-measure confusion
• Unclear instructions and cues
• Unclear warnings
• Unheard warnings (speaker volume low)
• Monitor failures (false positive, failure to alert when alert condition is met)
• Undetected physical damage (material fatigue-cracks allowing water penetration)
• Unannunciated battery failure
• Electrical power failure

Ventilator alarms

This issue includes two unrelated problems, one simple and infrequent, the other common and often called “preventable human error.” Human error may be the immediate cause, but systems having a large number of critical, preventable errors are flawed systems. That means some combination of flawed hardware design and flawed operating procedures. The first problem, latent failure of a ventilator alarm resulting in an undetected breathing problem, has caused several deaths in the past ten years. Failure of caregivers to respond to an alarm reporting a critical breathing condition is much more serious, and has been near the top of ECRI’s list for the past five years.

Undetected Opioid-Induced Respiratory Depression

In 2006 an Anesthesia Patient Safety Foundation conference set a vision that “no patient shall be harmed by opioid-induced respiratory depression” and considered various changes to patient monitoring. In 2011, lack of progress toward that goal led to another conference that looked at details of patient monitoring during anesthesia. Alert fatigue was again a major factor. Inclusion in ECRI’s 2016 list suggests HCI issues related to oximetry and ventilation-monitoring still warrant attention.

Occupational Radiation Hazards in Hybrid ORs

Software Management Gaps

Yes, Judy, EHR vendors’ versioning practices from the 80s do impact patient care. So do sluggish IT departments. ECRI cites delayed implementation of software updates with safety ramifications and data inaccessibility as consequences.

Mitigations

Isn’t there some pretty low-hanging fruit for mitigation on this hazard list? Radiation badges, exhaust-fan filters on heater-cooler systems to catch aerosolized contaminants, and formal procedures for equipment cleaning that specify exactly what cleaning agents to use would seem to knock three items from the list with acceptable cost.

Correcting issues with software deployment and version management may take years, given the inertia of vendors and IT organizations, and will require culture changes involving hospital C-suites.

Despite decades of psychology studies showing that frequent and repetitive alarms (and excessive communication channels) negatively impact our ability to recall “known” information, cause us to forget which process step we’re performing, and cause us to randomly shed tasks from a mental list, computer and hardware interface design still struggles with information chaos. Fixing this requires the sort of multidisciplinary/interdisciplinary analysis for which current educational and organizational silos aren’t prepared. We have work to do.

ECRI deserves praise not only for researching and publishing this list, but for focusing primarily on hazards and secondarily on risk. From the perspective of system safety, risk management must start with hazard assessment. This point, obvious to those with a system safety background, is missed in many analyses and frameworks.


# It’s Fault Tree Friday

Fault trees are back in style. This is good!

I’ve taught classes on fault tree analysis (FTA) for just over 25 years now, usually limited to commercial and military aviation engineers, and niche, critical life-support areas like redundant, closed-circuit scuba devices. In recent years, pharmaceutical manufacture, oil refining, and the increased affordability of complex devices and processes have made fault trees much more generally relevant. Fault trees are a good fit for certain ERM scenarios, and have deservedly caught the attention of some of the top thinkers in that domain. Covering FTA for a general audience won’t fit in a single blog post, so I’ll try doing a post each Friday on the topic. Happy Friday.

A fault tree is a way to model a single, specific, unwanted state of a system using Boolean logic. Boolean logic is a type of algebra in which all equations reduce to the values true or false. The term “fault” stems from aerospace engineering, where fault refers to an unwanted state or failure of a system at a system level. Failure and fault are typically differentiated in engineering; many failures result in faults, but not all faults result from failures. A human may be in the loop.

Wikipedia correctly states the practical uses for fault trees, including:

• Understanding how the top event (system fault) can occur
• Showing compliance with system reliability requirements
• Identifying critical equipment or process steps
• Minimizing and optimizing resources
• Assisting in designing a system

The last two bullets above are certainly correct, but don’t really drive the point home. Fault tree analysis during preliminary design of complex systems is really the sole means to rationally allocate redundancy in a system and to identify the locations and characteristics of monitors essential to redundant systems.

Optimizing the weight of structural components in a system is a fairly straightforward process. For example, if all links in a chain don’t have the same tensile strength, the design wastes weight and cost. Similarly, but less obviously, if the relationships between the reliabilities of redundant system components are arbitrary, the design wastes weight and cost. Balancing the reliability of redundant-system components greatly exceeds the cognizing capacity of any human without using FTA or something similar. And the same goes for knowing where to put monitors (gauges, indicators, alarms, check-lists, batch process QC steps, etc.).

Essential Concepts

A fault tree has exactly one top event and a number of bottom events, usually called basic events or initiators. I’ll use the term initiator to avoid an ambiguity that will surface later. Initiator events typically come from failure mode effect analyses (FMEA). To calculate a probability value for the top event, all initiators must have an associated probability value. That probability value often comes from a known failure rate (a frequency) and a known exposure time, the time duration in which the failure could occur. Fault trees are useful even without probability calculations for reasons described below.
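For initiators modeled with a constant failure rate, the standard exponential model gives the probability over an exposure time. A sketch, with an illustrative rate and exposure:

```python
from math import exp

# Constant-failure-rate model for an initiator: with failure rate lam
# (per hour) and exposure time T (hours), the probability that the
# failure occurs within the exposure interval is 1 - exp(-lam*T).
# For lam*T << 1 this is approximately lam*T.

def initiator_probability(lam, T):
    return 1.0 - exp(-lam * T)

p = initiator_probability(1e-5, 10.0)  # e.g., a rate of 1e-5/hr over a 10-hour exposure
```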

Fault trees are constructed – in terms of human logic – from top to bottom. This involves many intermediate fault states, often called intermediate events. The top event and all intermediate events have an associated logic gate, a symbolic representation of the logical relationship between the event and those leading upward into it. We often mix metaphors, calling the events leading up into an intermediate event and its associated logic gate child events, as would be the case in a family tree.

The top event is a specific system fault (unwanted state). It often comes directly from a Functional Hazard Assessment (FHA), and usually quantifies severity. Examples of faults that might be modeled with fault trees (taken from my recent FHA post) include:

• Seizure of diver in closed-circuit scuba operations due to oxygen toxicity from excessive oxygen partial pressure
• Loss of chemical batch (value > \$1M) by thermal runaway
• Reputation damage resulting in revenue loss exceeding \$5B in a fiscal year
• Aircraft departs end of a runway at speeds in excess of 50 miles per hour due to brake system fault

The solution to a fault tree is a set of combinations of individual failures, errors or events, each of which is logically sufficient to produce the top event. Each such combination is known as a cut set. Boolean logic reduces the tree into one or more cut sets, each having one or more initiator (bottom level) events. The collection of cut sets immediately identifies single-point critical failures and tells much about system vulnerabilities. If the initiator events have quantified probabilities, then the cut sets do too, and we can know the probability of the top event.

For an example of cut sets, consider the following oversimplified fault tree. As diagrammed below, it has three cut sets, two having three initiator events, and one having only one initiator (“uncontained horizontal gene transfer”). If the reason for this cut-set solution isn’t obvious, don’t worry, we’ll get to the relevant Boolean algebra later.
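The mechanics of that Boolean reduction can be sketched with a toy tree. This is a generic illustration, not the diagrammed example, and real FTA software handles far larger trees:

```python
from itertools import product

# A tree node is either an initiator name (a string) or a tuple
# ("AND"/"OR", [children]). Expansion: an OR gate unions its children's
# cut sets; an AND gate takes cross-products and merges each combination.

def cut_sets(node):
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":
        sets = [cs for group in child_sets for cs in group]
    else:  # "AND"
        sets = [frozenset().union(*combo) for combo in product(*child_sets)]
    # keep only minimal cut sets: drop any set that contains another as a subset
    minimal = [s for s in sets if not any(t < s for t in sets)]
    return list(dict.fromkeys(minimal))  # dedupe, preserving order

# Toy tree: X alone suffices, or C combined with either A or B.
tree = ("OR", ["X", ("AND", [("OR", ["A", "B"]), "C"])])
sets = cut_sets(tree)  # three cut sets: {X}, {A, C}, {B, C}
```

The single-element cut set {X} is exactly how a single-point critical failure shows up in a cut-set solution.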

Not a Diagram

A common rendering of a fault tree is through a fault tree diagram looking something like the one above. In the old days fault trees and fault tree diagrams could be considered the same thing. Diagrams were useful for seeing the relationships between branches and the way they contributed to the top event. Fault tree diagrams for complex systems can exceed the space available even on poster-sized paper, having hundreds or thousands of events and millions of cut sets. Breaking fault tree diagrams into smaller chunks reduces their value in seeing important relationships. Fault trees must be logically coherent. We now rely on software, rather than visual inspection of a fault tree diagram, to validate the logical structure of a fault tree. Software also allows us to visually navigate around a tree more easily than flipping through printed pages, each showing a segment of a tree.

Fault tree diagrams represent logical relationships between child and parent events with symbols (logic gates) indicating the relevant Boolean function. These are usually AND or OR relationships, but can include other logical relationships (more on which later). Note that symbols for logic gates other than AND and OR vary across industries. Also note that we typically give initiators symbols as well (diamonds and circles above), purely as visual cues. They serve no Boolean function but show that the event has no children and is therefore an initiator, one of several varieties.

Relation to PRA

Fault tree analysis is a form of probabilistic risk analysis. If you understand “probabilistic” to require numerical probability values, then FTA is only a form of probabilistic risk analysis if all the fault tree initiators’ probabilities are known or are calculable (a quantitative rather than qualitative tree). To avoid confusion, note that in many circles, the term “probabilistic risk analysis” and the acronym PRA are used only to describe methods of inference from subjective probability judgments and usually Bayesian belief systems, as promoted by George Apostolakis of UCLA in the 80s and 90s. This is not fault tree analysis, but there is some overlap. For example, NASA’s guide Bayesian Inference for NASA Probabilistic Risk and Reliability Analysis shows how to use PRA (in the Bayesian-inference sense) to populate a fault tree’s initiator events with progressively updated subjective probability values.
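As a sketch of that Bayesian-updating idea, and only a sketch: a conjugate Beta-binomial update is one textbook way to refresh an initiator’s probability as evidence accumulates. The prior and the observed counts below are invented:

```python
# Beta(alpha, beta) prior over an initiator's per-demand failure
# probability, updated with binomial evidence (failures in trials).

def beta_update(alpha, beta, failures, trials):
    return alpha + failures, beta + trials - failures

a, b = 1.0, 99.0                  # assumed prior: roughly 1 failure per 100 demands
a, b = beta_update(a, b, 2, 500)  # assumed evidence: 2 failures in 500 demands
posterior_mean = a / (a + b)      # updated probability estimate for the initiator
```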

Deductive Failure Analysis?

Wikipedia’s entry on fault tree analysis starts with the claim that it is a “deductive failure analysis.” Setting aside the above-mentioned difference between faults and failures, there’s the matter of what makes an analysis deductive. I’m pretty sure this claim (and the claim that FMEAs are inductive) originated with Bill Vesely at NASA in the 1980s. Bill’s a very sharp guy who greatly advanced the state of fault tree art; but fault trees are not deductive in any way that a mathematician or philosopher uses that term.

Deduction is reaching a conclusion about an instance based on an assumption, premise, or postulate about the population containing that instance. Example: if all men are mortal and Socrates is a man, then Socrates is mortal. This is deduction. It uses a statement in universal-conditional form (“if all A is B”) and a fact about the world (“observed X is an A”).

Induction, technically speaking, is the belief or assumption that unobserved cases will resemble observed cases. This is the basis for science, where we expect, all other things being equal, that the future will resemble the past.

Fault tree analysis relies equally on both these forms of reasoning, assuming, for the sake of argument, that induction is reasoning, a matter famously contested by David Hume. The reduction of a fault tree into its cut sets includes deductions using the postulates of Boolean algebra (more on which soon). The rest of the analysis relies on the assumption that future failures (initiator events) will resemble past ones in their frequency (a probability distribution) and that these frequencies can be measured as facts about the world. Initiator probabilities derived from Bayesian inferences involve yet another form of reasoning – diachronic probabilistic coherence (more on which maybe someday). In any case, past students of mine have gotten hung up on whether fault trees and FMEAs are deductive and inductive. This pursuit, which either decays into semantics or falls down a philosophical pit, adds nothing to understanding fault trees or their solutions.

One final piece of philosophical baggage to discard stems from the matter of the explanatory power of fault trees. Fault trees’ concern with causes is limited. They don’t care much about “why” in the deep sense of the term. Despite reliance on logic, they are empiricist in nature. Fault trees explain by showing “how” the top event can happen, not “why” it will or did happen.


# Correcting McKinsey’s Fogged Vision of Risk

McKinsey’s recent promotional piece, Risk: Seeing around the corners, is a perfect example of why enterprise risk management is so ineffective (answering a question posed by Norman Marks). Citing a handful of well-worn cases of supply chain and distribution channel failures, its advice for seeing around corners might be better expressed as driving while gazing into the rear-view mirror.

The article opens with the claim that risk-assessment processes expose only the most direct threats and neglect indirect ones. It finds “indirect” hazards (one step removed from harmful business impact) to be elusive. The hazards they cite, however, would immediately flow from a proper engineering-style hazard assessment; they are far from indirect. For example, omitting environment-caused damage to a facility, with subsequent supply-chain interruption, from a risk assessment is a greenhorn move at best.

McKinsey has cultivated this strain of risk-management hype for a decade, periodically fertilized, as is the case here, with the implication that no means of real analysis exists. Presumably, their desired yield is customers’ conclusions that McKinsey’s risk practice can nevertheless lead us through the uncharted terrain of risk. The blurry advice of this article, while perhaps raising risk awareness, does the disservice of further mystifying risk management.

McKinsey cites environmental impact on a supply chain as an example of a particularly covert risk, as if vendor failure from environmental hazards were somehow unforeseeable:

“At first glance, for instance, a thunderstorm in a distant place wouldn’t seem like cause for alarm. Yet in 2000, when a lightning strike from such a storm set off a fire at a microchip plant in New Mexico, it damaged millions of chips slated for use in mobile phones from a number of manufacturers.”

In fact, the Business Continuity Institute‘s data shows tier-1 supplier problems due to weather and environment to be the second largest source of high-impact supply chain interruptions in 2015.

McKinsey includes a type of infographic it uses liberally. It has concentric circles and lots of arrows, and seems intent on fogging rather than clarifying (portion shown below for commentary and criticism purposes). More importantly, it reveals a fundamental problem with ERM’s conception of risk modeling – that enterprise risk should be modeled bottom-up – that is, from causes to effects. The text of the article implies the same, for example, the distant thunderstorm in the above quote.

Trying to list – as a risk-analysis starting point – all the possible root causes propagating up to impact on a business’s cost structure, financing, productivity, and product performance is indeed very difficult. And it is a task for which McKinsey can have no privileged insight.

This is a bottom-up (cause first) approach. It is equivalent to examining the failure modes of every component of an aircraft and every conceivable pilot error to determine which can cause a catastrophic accident. There are billions of combinations of component failures and an infinite number of pilot errors to consider. This is not a productive route for modeling high-impact problems.

Deriving the relevant low-level causes of harmful business impacts through a systematic top-down process is more productive. This is the role of business impact analysis (BIA) in the form of Functional Hazard Assessment (FHA) and Fault Tree Analysis (FTA). None of these terms, according to Google, ever appear in McKinsey’s published materials. But they are how we, in my firm, do risk analyses – an approach validated by half a century of incontestable success in aviation and other high-risk areas.

An FHA view of the problem with which McKinsey fumbles would first identify the primary functions necessary for success of the business operation. Depending on specifics of the business these might include things like:

• Manufacturing complex widgets
• Distributing widgets
• Marketing widgets
• Selling product in the competitive widget space
• Complying with environmental regulation
• Issuing stock in compliance with SEC regulations

A functional hazard assessment would then look at each primary function and quantify some level of non-function the firm would consider catastrophic, and a level it would consider survivable but dangerous. It might name three or four such levels, knowing that the boundaries between them are somewhat arbitrary; the analysis accommodates this.

For example, an inability to manufacture product at 50% of the target production rate of one million pieces per month for a period exceeding two months might reasonably be judged to result in bankruptcy. Another level of production interruption might be deemed hazardous but survivable.
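Once such levels are defined, they are easy to encode. The sketch below is illustrative only – the function name, class names, and boundary values are my own assumptions, standing in for the levels a real FHA would define:

```python
# Illustrative sketch only: hazard-class names and thresholds are
# assumptions, standing in for levels a real FHA would define.
def classify_production_hazard(rate_fraction: float, months: float) -> str:
    """rate_fraction: output achieved as a fraction of the target rate."""
    if rate_fraction < 0.5 and months > 2:
        return "catastrophic"   # e.g., judged to result in bankruptcy
    if rate_fraction < 0.8 and months > 1:
        return "hazardous"      # survivable but dangerous
    if rate_fraction < 1.0:
        return "minor"
    return "none"

print(classify_production_hazard(0.4, 3))  # catastrophic
```

The boundaries are somewhat arbitrary, as noted above; what matters is that they are explicit and quantified.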

An FHA would include similar definitions of hazard classes (note I’m using the term “hazard” to mean any unwanted outcome, not just those involving unwanted energy transfers like explosions and lightning) for all primary functions of the business.

Once we have a list of top-level functional hazards – not the same thing as risk registers in popular risk frameworks – we can then determine, given implementation details of the business functions, what specific failures, errors, and external events could give rise to failure of each function.

For example, some things should quickly come to mind when asked what might cause manufacturing output to fall. They would include labor problems, supply chain disruption, regulatory action, loss of electrical power and floods. Some factors impacting production are simple (though not necessarily easy) to model. Floods, for example, have only a few possible sources. Others might need to be modeled systematically, involving many combinations of contributory events using tools like a qualitative or quantitative fault tree.

Looking specifically at the causes of loss of manufacturing capability due to supply chain interruption, we naturally ask ourselves what proximate causes exist there. Subject matter experts or a literature search would quickly list failures like:

• IT/communications downtime
• Cyber attack
• Fire, earthquake, lightning, flood
• Flu epidemic
• Credit problem
• Supplier labor dispute
• Transportation labor dispute
• Utility failure
• Terrorism
• Supplier ethics event
• Regulatory change or citation

We would then assess the probability of these events as they contribute to the above top-level hazards, for which severity values have been assigned. At that point we have a risk assessment with some intellectual heft and actionable content.

Note that in that last step we are assigning probability values to the failures, either by using observed frequencies, in the case of floods, lightning, and power outages, or with periodically updated estimates of subject matter experts, in the case of terrorism and labor disputes. In no case are we assigning ranks or scores to the probability of failures, as many risk frameworks dictate. Probability ranking of this sort (ranks of 1 through 5, or high, medium, low) has been the fatal flaw behind many past risk-analysis failures. In reality, all important failure modes have low probability, especially when one-per-thousand and one-per-billion are both counted as low, as is often the case. I’ve discussed the problem of subjective probability assignment in earlier posts.
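A minimal sketch of what assigning real probability values buys you, assuming independent contributors and made-up annual probabilities: the events combine through an OR gate into a defensible top-level number, something ranks of 1 to 5 can never do.

```python
# Assumed annual probabilities for three independent contributors;
# the point is the method, not the numbers.
contributor_probs = {
    "flood": 1e-3,                   # from observed frequency data
    "power outage": 5e-2,            # from observed frequency data
    "supplier labor dispute": 2e-2,  # periodically updated expert estimate
}

# OR gate over independent events: P(any) = 1 - prod(1 - p_i)
p_none = 1.0
for p in contributor_probs.values():
    p_none *= 1.0 - p
p_any = 1.0 - p_none

print(f"P(at least one contributor in a year) = {p_any:.4f}")  # 0.0699
```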

McKinsey’s article confuses uncertainty about event frequency with unforeseeability, implying that McKinsey holds special arcane knowledge about the future.

Further, as with many ERM writings, it advances a vague hierarchy of risk triggers and types of risk, including “hazard risk,” insurable risk, performance risk, cyber risk, environmental risk, etc. These complex taxonomies of risk reveal ontological flaws in their conception of risk. Positing kinds of risk leads to bewilderment and stasis. The need to do this dissolves if you embrace causality in your risk models. Things happen for reasons, and when bad things might happen, we call it risk. Unlike risk frameworks, we model risk by tracing effects back to causes systematically. And this is profoundly different from trying to pull causes from thin air as a starting point, and viewing different causes as different kinds of risk.

The approach I’m advocating here isn’t rocket science, nor is it even jet science. It is nothing new, but seems unknown within ERM. It is exactly the approach we used in my 1977 college co-op job to evaluate economic viability, along with safety, environmental, and project risk for a potential expansion of Goodyear Atomic’s uranium enrichment facility. That was back when CAPM and Efficient-Market were thought to be good financial models, when McKinsey was an accounting firm, and before ERM was a thing.

McKinsey concludes by stating that unknown and unforeseeable risks will always be with us, but that “thinking about your risk cascades is a concrete approach” to gaining needed insights. We view this as sloppy thinking – not concrete, but vague. Technically speaking, risks do not cascade; events and causes do. A concrete approach uses functional hazard assessments and related systematic, analytic tools.

The purpose of risk management is not to contemplate and ponder. It is to model risk by anticipating future unwanted events, to assess their likelihood and severity, and to make good decisions about their avoidance, mitigation, transfer or retention.

–  –  –

In the San Francisco Bay area?

If so, consider joining us in a newly formed Risk Management meetup group.

Risk assessment, risk analysis, and risk management have evolved nearly independently in a number of industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk, etc.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase leading-edge trends, case studies, and best practices in our profession, with a focus on practical application and advancing the state of the art.

https://www.meetup.com/San-Francisco-Risk-Managers/

# Why Pharmaceutical Risk Management Is in Deep Trouble

The ICH Q9 guidelines finalized in 2005 called for pharmaceutical firms to use a risk-based approach to the specification, design, and verification of manufacturing systems having the potential to affect product quality and patient safety. In 2008, ICH Q10 added that the design of the pharmaceutical quality system should incorporate risk management and risk-based audits.

Pharmaceutical firms had little background in the relevant areas of risk management. The early troubles the industry faced in applying risk tools developed in other industries are well documented. Potential benefits of proactive risk management include reduced regulatory oversight and associated costs, reduced cost from discrepant materials, reduced batch-failure rates, and a safer product. Because risk management is, in theory, present in the earliest stages of product and process design, it can, in theory, raise profitability while improving patient safety.

Such theoretical benefits of good risk management are in fact realized by firms in other industries. In commercial aviation, probabilistic risk analysis is the means by which redundancy is allocated to systems to achieve a balanced design – the minimum weight and cost consistent with performance requirements. In this sense, good risk analysis is a competitive edge.

From 2010 to 2015, Class 1 to 3 FDA recall events ranged from 8000 to 9500 per year, with an average of 17 injunctions per year. FDA warnings rose steadily from 673 in 2010 to 17,232 in 2015. FDA warning letters specifically identifying missing or faulty risk assessments have also steadily increased, with 53 in 2015, and 83 so far this year, based on my count from the FDA databases.

It is not merely foreign CMOs that receive warnings identifying defective risk assessments. Abbott, Baxter, Johnson & Johnson, Merck, Sanofi and Teva are in the list.

The high rate of out-of-spec and contamination recalls seen in the FDA data clearly points to low-hanging fruit for risk assessments. These issues are cited in ICH Q9 and Q10 as examples of areas where proactive risk management serves all stakeholders by preventing costly recalls. Given the occurrence rate in 2015, it’s obvious that a decade of risk management in pharma can’t be declared a major success. In fact, we seem to be losing ground. So what’s going on here, and why hasn’t pharma seen results like those of commercial aviation?

One likely reason stems from the evolution of the FDA itself. The FDA predates most modern drug manufacture. For decades it has regulated the manufacturing, marketing, distribution, safety, and efficacy of drugs and medical devices (among other things) down to the level of raw materials, including inspection of facilities. For all its obvious benefits to consumers, this role has had a detrimental side effect: an entire industry matured with risk management and safety equated with regulatory compliance. That is, there’s a tendency for drug makers to view risk management as something imposed by regulators, from the outside, rather than as an integral value-add.

The FAA, by contrast, was born (1958) into an already huge aircraft industry. By that time a precedent for delegating authority to private persons had already been established by the Civil Aeronautics Act. Knowing it lacked the resources to regulate manufacturing to the FDA’s level of detail, the FAA sought to foster a culture of risk in aircraft builders, and succeeded through a variety of means, including expanding industry participation in certifying aircraft. This included a designated-engineering-representative program in which aircraft engineers are duty-bound delegates of the FAA.

Further, except for the most basic, high-level standards, engineering design and safety requirements are developed by manufacturers and professional organizations, not the FAA. The FAA’s mandate to builders for risk management was basically to come up with the requirements and show the FAA how they intended to meet them. Risk management is therefore integrated into design, not just QA and certification. The contrasting risk cultures of the aviation and pharmaceutical industries are the subject of my current research in history of science and technology at UC Berkeley. More on that topic in future posts.

Changing culture takes time and likely needs an enterprise-level effort. But a much more immediate opportunity for the benefits envisioned in ICH Q9 exists directly at the level of the actual practice of risk assessment.

My perspective is shaped by three decades of risk analysis in aviation, chemical refinement, nuclear power, mountaineering and firefighting equipment, ERM, and project risk. From this perspective, and evidence from direct experience in pharma combined with material found in the FDA databases, I find the quality of pharmaceutical risk assessment and training to be simply appalling.

While ICH Q9 mentions, just as examples, using PHA/FHA (functional hazard analysis), hazard operability analysis, HACCP, FMEA (failure mode effects analysis), probabilistic safety analysis and fault trees at appropriate levels and project phases, one rarely sees anything but FMEAs performed in a mechanistic manner with the mindset that a required document (“the FMEA form”) is being completed.

Setting aside, for now, the points that FMEA was not intended by its originators to be a risk analysis tool and is not used as such in aerospace (for reasons discussed here, including inability to capture human error and external contributors), I sense that the job of completing FMEAs is often relegated to project managers who are ill-equipped for it and lack access to subject matter experts. Further injury is done here by the dreadfully poor conception of FMEA seen in the Project Management Institute’s (PMI) training materials inflicted on pharma Project Managers. But other training available to pharma employees in risk assessment is similarly weak.

Some examples might be useful. In the last two months, I’ve attended a seminar and two webinars I found on LinkedIn, all explicitly targeting pharma. In them I learned, for example, that the disadvantage to using FMEAs is that they require complex mathematics. I have no clue what the speaker meant by this. Maybe it was a reference to RPN calculation, an approach strongly opposed by aviation, nuclear, INCOSE and NAVAIR – for reasons I’ll cover later – which requires multiplying three numbers together?
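For reference, here is the entirety of that “complex mathematics” – and, incidentally, a demonstration of the objection those aviation and nuclear bodies raise: very different failure modes can share the same rank product. (The rank values below are arbitrary illustrations.)

```python
# RPN (risk priority number) as prescribed in many FMEA templates:
# the product of ordinal 1-10 ranks for severity, occurrence, and
# detection. The ranks used below are arbitrary illustrations.
def rpn(severity: int, occurrence: int, detection: int) -> int:
    return severity * occurrence * detection

# The objection: two very different failure modes, identical RPN.
print(rpn(10, 2, 5))  # 100 - catastrophic, rare, moderately detectable
print(rpn(2, 10, 5))  # 100 - trivial, frequent, moderately detectable
```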

I learned that FMEAs are also known as fault trees (can anyone claiming this have any relevant experience in the field?), and that bow tie (Ishikawa) “analysis” is heavily used in aerospace. Ishikawa is a brainstorming method, not risk analysis, as noted by Vesely 40+ years ago, and it is never (ever) used as a risk tool in aerospace. I learned that another disadvantage of FMEAs is that you can waste a lot of time working on risks with low probabilities. The speaker seemed unaware that low-probability, high-cost hazards are really what risk analysis is about; you’re not wasting your time there! If the “risks” are high-probability events, like convenience-store theft, we call them expenses, not risks. I learned in this training that heat maps represent sound reasoning. These last two points were made by an instructor billed as a strategic management consultant and head of a pharmaceutical risk-management software firm.

None of these presentations mentioned functional hazard analysis, business impact analysis, or any related tool. FHA (my previous post) is a gaping hole in pharmaceutical risk assessment, missing in safety, market, reputation, and every other kind of risk a pharma firm faces.

Most annoying to me personally is the fact that the above seminars, like every one I’ve attended in the past, served up aerospace risk assessment as an exemplar. Pharma should learn mature risk analysis techniques and culture from aviation, not just show photos of aircraft on presentation slides. In no other industry but commercial aviation has something so inherently dangerous been made so safe, by any definition of safety. Aviation achieved this (1000-fold reduction in fatality rate) not through component quality, but by integrating risk into the core of the business. Aviation risk managers’ jaws hit the floor when I show them “risk assessments” (i.e., FMEAs) from pharma projects.

One thing obviously lacking here is simple analytic rigor. That is, if we’re going to do risk assessment, let’s try to do it right. The pharmaceutical industry obviously has some of the best scientific minds around, so one would expect it to understand the value of knowledge, diligence, and their correct application. So perhaps the root of its defective execution of risk management is in fact the underdeveloped risk culture mentioned above.

The opportunity here is immense. By cleaning up their risk act, pharmaceutical firms could reap the rewards intended by ICH Q9 a decade ago and cut their ballooning regulatory expenses. Leave a comment or reach me via the About link above to discuss this further.


# Functional Hazard Assessment Basics

Outside of aerospace and military projects, I see a lot of frustration about failures of failure mode effects analyses (FMEA) to predict system-level or enterprise-level problems. This often stems from the fact that FMEAs see a system or process as a bunch of parts or steps. The FMEA works by asking what happens if I break this part right here or that process step over there.

For this approach to predict high-level (system-sized) failures, many FMEAs would need to be performed to exhaustive detail. They would also need to predict propagation of failures (or errors) through a system, identifying consequences on the function of the system containing them. Since FMEAs focus on single-event system failure initiators, examining combinations of failures is unwieldy. In redundant equipment or processes, this task can be far beyond the limits of our cognitive capability. In an aircraft brake system, for example, there may be hundreds of thousands of combinations of failures that lead to hazardous loss of braking. Also, by focusing on single-event conditions, only external and environmental contributors to system problems that directly cause component failures get analyzed. Finally, FMEAs often fail to catch human error that doesn’t directly result in equipment failure.

Consequently, we often call on FMEAs to do a job better suited to a functional hazard assessment (FHA). FHAs identify system-level functional failures, hazards, or other unwanted system-level consequences of major malfunction or non-function. With that, we can build a plan to eliminate, mitigate, or transfer the associated risks by changing the design, deciding against the project, adding controls and monitors, adding maintenance requirements, defining operational windows, buying insurance, narrowing contract language, or making regulatory appeals. While some people call the FMEA a bottom-up approach, the FHA might be called a top-down approach. More accurately, it is a top-only approach, in the sense that it identifies top-level hazards to a system, process, or enterprise, independent of the system’s specific implementation details. It is also top-only in the sense that FHAs produce the top events of fault trees.

Terminology: In some domains – many ERM frameworks, for example – the term “hazard” is restricted to risks stemming from environmental effects, war, and terrorism. This results in differentiating “hazard risks” from market, credit, or human-capital risks, and much unproductive taxonomic/ontological hairsplitting. In some fields, business impact analysis (BIA) serves much the same purpose as FHA. While BIA is often correctly differentiated from risk analysis (it is understood to inform risk analyses), its implementation details vary greatly. Particularly in tech circles, BIA impact is defined for levels lower than actual impact on the business. That is, its meaning drifts into something like a specialized FMEA. For these reasons, I’ll use only the term “FHA,” where “hazard” means any critical unwanted outcome.

To be most useful, functional hazards should be defined precisely, so they can contribute to quantification of risk. That is, in the aircraft example above, loss of braking by itself is not a useful hazard definition; brake failure at the gate isn’t dangerous. A useful hazard definition would be something like reduction in aircraft braking resulting in loss of aircraft or loss of life. That constraint would allow engineers to model the system failure condition as something like loss of braking resulting in departing the end of a runway at speeds in excess of 50 miles per hour. Here, 50 mph might be a conservative engineering judgment of a runway departure speed that would cause a fatality or irreparable fuselage damage.

Hazards for other fields can take a similar form, e.g.:

• Reputation damage resulting in revenue loss exceeding \$5B in a fiscal year
• Seizure of diver in closed-circuit scuba operations due to oxygen toxicity from excessive oxygen partial pressure
• Unexpected coupon redemption in a campaign, resulting in > \$1M excess promotion cost
• Loss of chemical batch (value \$1M) by thermal runaway
• Electronic health record data breach resulting in disclosure of > 100 patient identifiers with medical history
• Uncontained oil spill of > 10,000 gallons within 10 miles of LA shore

Note that these hazards tend to be stated in terms of a top-level or system-level function, an unwanted event related to that function, and specific, quantified effects with some sort of cost, be it dollars, lives, or gallons. Often, the numerical details are somewhat arbitrary, reflecting the values of the entity affected by the hazard. In other cases, as with aviation, guidelines on hazard classification come from a regulatory body. The FAA defines catastrophic, for example, as hazards “expected to result in multiple fatalities of the occupants, or incapacitation or fatal injury to a flight crewmember, normally with the loss of the airplane.”
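That three-part pattern – function, unwanted event, quantified effect – can be captured in a simple record. The field names below are my own, for illustration; the braking example paraphrases the hazard defined earlier.

```python
from dataclasses import dataclass

# Field names are assumptions for illustration; a real FHA worksheet
# would add IDs, references, and follow-on actions.
@dataclass
class FunctionalHazard:
    function: str      # top-level system or business function
    event: str         # unwanted outcome related to that function
    effect: str        # specific, quantified consequence
    hazard_class: str  # severity class per the entity or regulator

brake_hazard = FunctionalHazard(
    function="Decelerate aircraft during landing rollout",
    event="Reduction in aircraft braking",
    effect="Runway departure at speeds over 50 mph; loss of aircraft or life",
    hazard_class="catastrophic",
)
print(brake_hazard.hazard_class)  # catastrophic
```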

Organizations vary considerably in the ways they view the FHA process; but their objectives are remarkably consistent. Diagrams for processes as envisioned by NAVSEA and the FAA (and EASA in Europe) appear below.

The same for enterprise risk might look like:

• Identify main functions or components of business opportunity, system, or process
• Examine each function for effects of interruption, non-availability or major change
• Define hazards in terms of above effects
• Determine criticality of hazards
• Check for cross-function impact
• Plan for avoidance, mitigation or transfer of risk

In all the above cases, the end purpose, as stated earlier, is to inform trade studies (help decide between futures) or to eliminate, mitigate or transfer risk. Typical FHA outputs might include:

• A plan for follow-on action – analyses, tests, training
• Identification of subsystem requirements
• Input to strategic decision-making
• Input to design trade studies
• The top events of fault trees
• Maintenance and inspection frequencies
• System operating requirements and limits
• Prioritization of hazards, prioritization of risks*

The asterisk on prioritization of risks means that, in many cases, this isn’t really possible at the time of an FHA, at least in its first round. A useful definition of risk involves a hazard, its severity, and its probability. The latter, in any nontrivial system or operation, cannot be quantified – often even roughly – at the time of FHA. Thus the FHA identifies the hazards needing probabilistic quantification. The FAA and NAVSEA (examples used above) do not quantify risk as an arithmetic product of probability and severity (contrary to the beliefs of many who cite these domains as exemplars); but the two-dimensional (vector) risk values they use still require quantification of probability.
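A sketch of that distinction, with an assumed per-class probability-limit table (illustrative only, not the FAA’s or NAVSEA’s actual criteria): the scalar product collapses two very different hazards into one number, while the vector form keeps severity class and probability separate.

```python
import math

# Scalar risk collapses probability and severity into one number.
def scalar_risk(prob: float, severity: float) -> float:
    return prob * severity

# Vector-style acceptance: probability judged against a limit tied to
# the hazard's severity class. Limits below are assumptions for this
# sketch, not the FAA's or NAVSEA's actual criteria.
CLASS_LIMITS = {"catastrophic": 1e-9, "hazardous": 1e-7, "major": 1e-5}

def vector_risk_acceptable(prob: float, hazard_class: str) -> bool:
    return prob <= CLASS_LIMITS.get(hazard_class, 1.0)

# Two wildly different risks, indistinguishable as scalar products:
print(math.isclose(scalar_risk(1e-9, 1000.0), scalar_risk(1e-3, 1e-3)))  # True

# The vector form keeps them apart:
print(vector_risk_acceptable(1e-8, "catastrophic"))  # False
print(vector_risk_acceptable(1e-8, "hazardous"))     # True
```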

I’ll drill into details of that issue and discuss real-world use of FHAs in future posts. If you’d like a copy of my FHA slides from a recent presentation at Stone Aerospace, email me or contact me via the About link on this site. I keep slide detail fairly low, so be sure to read the speaker’s notes.
