Fault trees are back in style. This is good!
I’ve taught classes on fault tree analysis (FTA) for just over 25 years now, usually for commercial and military aviation engineers and for niche, critical life-support fields like redundant, closed-circuit scuba devices. In recent years, pharmaceutical manufacture, oil refinement, and the increased affordability of complex devices and processes have made fault trees much more generally relevant. Fault trees are a good fit for certain enterprise risk management (ERM) scenarios, and have deservedly caught the attention of some of the top thinkers in that domain. Covering FTA for a general audience won’t fit in a single blog post, so I’ll try doing a post each Friday on the topic. Happy Friday.
A fault tree is a way to model a single, specific, unwanted state of a system using Boolean logic. Boolean logic is a type of algebra in which all equations reduce to the values true or false. The term “fault” stems from aerospace engineering, where fault refers to an unwanted state or failure of a system at a system level. Failure and fault are typically differentiated in engineering; many failures result in faults, but not all faults result from failures. A human may be in the loop.
Wikipedia correctly states the practical uses for fault trees, including:
- Understanding how the top event (system fault) can occur
- Showing compliance with system reliability requirements
- Identifying critical equipment or process steps
- Minimizing and optimizing resources
- Assisting in designing a system
The last two bullets above are certainly correct, but don’t really drive the point home. Fault tree analysis during preliminary design of complex systems is really the sole means to rationally allocate redundancy in a system and to identify the locations and characteristics of monitors essential to redundant systems.
Optimizing the weight of structural components in a system is a fairly straightforward process. For example, if all links in a chain don’t have the same tensile strength, the design wastes weight and cost. Similarly, but less obviously, if the relationships between the reliabilities of redundant system components are arbitrary, the design wastes weight and cost. Balancing the reliability of redundant-system components is far beyond the unaided cognitive capacity of any human; it takes FTA or something similar. And the same goes for knowing where to put monitors (gauges, indicators, alarms, checklists, batch process QC steps, etc.).
A fault tree has exactly one top event and a number of bottom events, usually called basic events or initiators. I’ll use the term initiator to avoid an ambiguity that will surface later. Initiator events typically come from failure mode and effects analyses (FMEA). To calculate a probability value for the top event, all initiators must have an associated probability value. That probability value often comes from a known failure rate (a frequency) and a known exposure time, the time duration in which the failure could occur. Fault trees are useful even without probability calculations for reasons described below.
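As a rough illustration of turning a failure rate and an exposure time into an initiator probability, here is a minimal Python sketch. It assumes a constant failure rate (an exponential model), which is a common simplification but an assumption on my part here; the function name and the numbers are hypothetical.

```python
import math

def failure_probability(rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of at least one failure during the exposure time,
    assuming a constant failure rate (exponential model)."""
    if rate_per_hour < 0 or exposure_hours < 0:
        raise ValueError("rate and exposure time must be non-negative")
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny
    return -math.expm1(-rate_per_hour * exposure_hours)

# Hypothetical initiator: 1e-5 failures per hour over a 10-hour exposure
p_exact = failure_probability(1e-5, 10.0)
p_shortcut = 1e-5 * 10.0  # rate * time, a close approximation when the product is small
print(p_exact, p_shortcut)
```

For small values of rate times exposure time, the two numbers agree to several significant figures, which is why the simple product is often used directly.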
Fault trees are constructed – in terms of human logic – from top to bottom. This involves many intermediate fault states, often called intermediate events. The top event and all intermediate events have an associated logic gate, a symbolic representation of the logical relationship between the event and those leading upward into it. We often mix metaphors, calling the events leading up into an intermediate event and its associated logic gate child events, as would be the case in a family tree.
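One way to picture this structure is as a small data type in which every non-initiator event carries a gate and a list of child events. The Python sketch below is only an illustration under that assumption: the `Event` class, its field names, and the brake-system events are all made up, and only AND and OR gates are handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Event:
    """One node in a fault tree. Initiators have no gate and no children."""
    name: str
    gate: Optional[str] = None                 # "AND", "OR", or None for an initiator
    children: List["Event"] = field(default_factory=list)

    def occurs(self, occurred: Set[str]) -> bool:
        """Evaluate this event given the names of initiators that occurred."""
        if self.gate is None:
            return self.name in occurred
        states = [child.occurs(occurred) for child in self.children]
        if self.gate == "AND":
            return all(states)
        if self.gate == "OR":
            return any(states)
        raise ValueError(f"unsupported gate: {self.gate}")

# Hypothetical tree: braking is lost only if both brake channels are lost
top = Event("Loss of all wheel braking", "AND", [
    Event("Loss of normal braking", "OR", [
        Event("Pump A fails"), Event("Normal brake valve jams")]),
    Event("Loss of alternate braking", "OR", [
        Event("Pump B fails"), Event("Alternate brake valve jams")]),
])

print(top.occurs({"Pump A fails"}))
print(top.occurs({"Pump A fails", "Alternate brake valve jams"}))
```

The first call reports that a single pump failure does not produce the top event; the second shows one failure in each channel doing so.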
The top event is a specific system fault (unwanted state). It often comes directly from a Functional Hazard Assessment (FHA), and usually quantifies severity. Examples of faults that might be modeled with fault trees (taken from my recent FHA post) include:
- Seizure of diver in closed-circuit scuba operations due to oxygen toxicity from excessive oxygen partial pressure
- Loss of chemical batch (value > $1M) by thermal runaway
- Reputation damage resulting in revenue loss exceeding $5B in a fiscal year
- Aircraft departs end of a runway at speeds in excess of 50 miles per hour due to brake system fault
The solution to a fault tree is a set of combinations of individual failures, errors or events, each of which is logically sufficient to produce the top event. Each such combination is known as a cut set. Boolean logic reduces the tree into one or more cut sets, each having one or more initiator (bottom level) events. The collection of cut sets immediately identifies single-point critical failures and tells much about system vulnerabilities. If the initiator events have quantified probabilities, then the cut sets do too, and we can then know the probability of the top event.
For an example of cut sets, consider the following oversimplified fault tree. As diagrammed below, it has three cut sets, two having three initiator events, and one having only one initiator (“uncontained horizontal gene transfer”). If the reason for this cut-set solution isn’t obvious, don’t worry, we’ll get to the relevant Boolean algebra later.
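To make cut sets concrete without the algebra, here is a minimal Python sketch that expands a tree of AND and OR gates into its minimal cut sets. The tree is hypothetical, not the diagrammed one; it is merely built to give a solution of the same shape (one single-initiator cut set and two three-initiator cut sets). The event names, the probabilities, the independence of initiators, and the rare-event approximation (summing cut-set probabilities, which gives an upper bound) are all assumptions made for illustration.

```python
import math
from itertools import product

# A node is an initiator name (str) or a (gate, [children]) tuple.
def cut_sets(node):
    """Return the minimal cut sets of the tree rooted at node."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":      # any one child's cut set is sufficient
        combined = [cs for sets in child_sets for cs in sets]
    elif gate == "AND":   # need one cut set from every child at once
        combined = [frozenset().union(*combo) for combo in product(*child_sets)]
    else:
        raise ValueError(f"unsupported gate: {gate}")
    return minimize(combined)

def minimize(sets):
    """Drop duplicates and any cut set that contains a smaller cut set."""
    unique = set(sets)
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))

# Hypothetical tree: TOP = X OR (A AND B AND (C OR D))
tree = ("OR", ["X", ("AND", ["A", "B", ("OR", ["C", "D"])])])
result = cut_sets(tree)
print([sorted(cs) for cs in result])  # [['X'], ['A', 'B', 'C'], ['A', 'B', 'D']]

single_points = [cs for cs in result if len(cs) == 1]

# Made-up initiator probabilities, assumed independent
probs = {"X": 1e-7, "A": 1e-3, "B": 1e-3, "C": 1e-2, "D": 1e-2}
cut_set_probs = [math.prod(probs[e] for e in cs) for cs in result]
top_estimate = sum(cut_set_probs)  # rare-event approximation
print(top_estimate)
```

Even in this toy case the single-initiator cut set dominates the estimate, which is the kind of vulnerability a cut-set listing exposes at a glance.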
Not a Diagram
A common rendering of a fault tree is through a fault tree diagram looking something like the one above. In the old days fault trees and fault tree diagrams could be considered the same thing. Diagrams were useful for seeing the relationships between branches and the way they contributed to the top event. Fault tree diagrams for complex systems can exceed the space available even on poster-sized paper, having hundreds or thousands of events and millions of cut sets. Breaking fault tree diagrams into smaller chunks reduces their value in seeing important relationships. Fault trees must be logically coherent. We now rely on software, rather than on visual inspection of a fault tree diagram, to validate the logical structure of a fault tree. Software also allows us to visually navigate around a tree more easily than flipping through printed pages, each showing a segment of a tree.
Fault tree diagrams represent logical relationships between child and parent events with symbols (logic gates) indicating the relevant Boolean function. These are usually AND or OR relationships, but can include other logical relationships (more on which later). Note that symbols for logic gates other than AND and OR vary across industries. Also note that we typically supply initiators with symbols also (diamonds and circles above), just as visual clues. They serve no Boolean function but show that the event has no children and is therefore an initiator, one of several varieties.
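For a feel of what the two common gates do numerically, the Python sketch below computes the probability of a parent event from its children's probabilities. It assumes the child events are independent, and the helper names and values are invented. Propagating numbers up a tree this way is only valid when no initiator appears under more than one branch, which is one reason real solutions work from cut sets instead.

```python
import math

def p_and(child_probs):
    """Parent occurs only if every child occurs (independent children)."""
    return math.prod(child_probs)

def p_or(child_probs):
    """Parent occurs if at least one child occurs (independent children)."""
    return 1.0 - math.prod(1.0 - p for p in child_probs)

# Hypothetical child probabilities
children = [0.1, 0.2]
print(p_and(children))  # both children occur
print(p_or(children))   # at least one child occurs
```

With these values the AND gate gives about 0.02 and the OR gate about 0.28, showing how redundancy (AND) drives a probability down while alternative paths (OR) push it up.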
Relation to PRA
Fault tree analysis is a form of probabilistic risk analysis. If you understand “probabilistic” to require numerical probability values, then FTA is only a form of probabilistic risk analysis if all the fault tree initiators’ probabilities are known or are calculable (a quantitative rather than qualitative tree). To avoid confusion, note that in many circles, the term “probabilistic risk analysis” and the acronym PRA are used only to describe methods of inference from subjective probability judgments and usually Bayesian belief systems, as promoted by George Apostolakis of UCLA in the 80s and 90s. This is not fault tree analysis, but there is some overlap. For example, NASA’s guide Bayesian Inference for NASA Probabilistic Risk and Reliability Analysis shows how to use PRA (in the Bayesian-inference sense) to populate a fault tree’s initiator events with progressively updated subjective probability values.
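As a loose illustration of what progressively updating an initiator's value can look like, the Python sketch below applies a standard conjugate update: a gamma prior on a constant failure rate combined with a Poisson count of observed failures. This is not a reproduction of the NASA guide's procedure; the function name, the prior parameters, and the observation data are all hypothetical.

```python
def update_gamma(alpha, beta, failures, exposure_hours):
    """Posterior gamma parameters (shape, rate) for a failure rate after
    observing `failures` events in `exposure_hours` of operation."""
    return alpha + failures, beta + exposure_hours

# Hypothetical prior belief: mean rate alpha / beta = 5e-5 per hour
alpha, beta = 0.5, 1.0e4

# Made-up evidence arriving in two batches: (failures, hours observed)
for failures, hours in [(0, 2.0e4), (1, 3.0e4)]:
    alpha, beta = update_gamma(alpha, beta, failures, hours)
    print(alpha / beta)  # posterior mean failure rate after this batch
```

Each batch of operating experience shifts the estimated failure rate, and that updated rate can then feed the initiator's probability in the tree.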
Deductive Failure Analysis?
Wikipedia’s entry on fault tree analysis starts with the claim that it is a “deductive failure analysis.” Setting aside the above-mentioned difference between faults and failures, there’s the matter of what makes an analysis deductive. I’m pretty sure this claim (and the claim that FMEAs are inductive) originated with Bill Vesely at NASA in the 1980s. Bill’s a very sharp guy who greatly advanced the state of fault tree art; but fault trees are not deductive in any way that a mathematician or philosopher uses that term.
Deduction is reaching a conclusion about an instance based on an assumption, premise, or postulate about the population containing that instance. Example: if all men are mortal and Socrates is a man, then Socrates is mortal. This is deduction. It uses a statement in universal-conditional form (“if all A is B”) and a fact about the world (“observed X is an A”).
Induction, technically speaking, is the belief or assumption that unobserved cases will resemble observed cases. This is the basis for science, where we expect, all other things being equal, that the future will resemble the past.
Fault tree analysis relies equally on both these forms of reasoning, assuming, for the sake of argument, that induction is reasoning, a matter famously contested by David Hume. The reduction of a fault tree into its cut sets includes deductions using the postulates of Boolean algebra (more on which soon). The rest of the analysis relies on the assumption that future failures (initiator events) will resemble past ones in their frequency (a probability distribution) and that these frequencies can be measured as facts about the world. Initiator probabilities derived from Bayesian inferences involve yet another form of reasoning – diachronic probabilistic coherence (more on which maybe someday). In any case, past students of mine have gotten hung up on whether fault trees and FMEAs are deductive and inductive. This pursuit, which either decays into semantics or falls down a philosophical pit, adds nothing to understanding fault trees or their solutions.
One final piece of philosophical baggage to discard stems from the matter of the explanatory power of fault trees. Fault trees’ concern with causes is limited. They don’t care much about “why” in the deep sense of the term. Despite reliance on logic, they are empiricist in nature. Fault trees explain by showing “how” the top event can happen, not “why” it will or did happen.