Category Archives: Fault Trees

SpaceX, Fault Trees, and Root Cause

My previous six Fault Tree Friday posts were tutorials highlighting some of the stumbling blocks I’ve noticed in teaching classes on safety analysis. This one deals with fault trees in recent press.

I ran into a question on Quora asking how SpaceX so accurately determines the causes of failures in their rockets. The questioner said he heard Elon Musk mention a fault tree analysis and didn’t understand how it could be used to determine the cause of rocket failures.

In my response (essentially what follows below) I suggested that his confusion regarding the suitability of fault tree analysis to explaining incidents like the September 2016 SpaceX explosion (“anomaly” in SpaceX lingo) is well-founded. The term “fault tree analysis” (FTA) in aerospace has not historically been applied to work done during an accident/incident investigation (though it is in less technical fields, where it means something less rigorous).

FTA was first used during the design of the Minuteman missile system in the 1960s. It was initially used to validate a design against conceivable system-level (usually catastrophic) failures – often called hazards in the FTA context – by modeling all combinations of failures and faults that can jointly produce each modeled hazard. It was subsequently moved earlier in the design process (i.e., earlier than the stage of design validation by reliability engineering) when we realized that FTA (or something very similar) is the only rational means of allocating redundancy in complex redundant systems. This allocation of redundancy ensures that systems effectively have no probabilistic strong or weak links – similar to the way stress analysis ensures that mechanical systems have no structural strong or weak links – yielding a “balanced” system.

During the design of a complex system, hazards are modeled by a so-called top-down (in FTA jargon) process. By “top-down” we mean that the process of building a fault tree (which is typically represented by a diagram looking somewhat like an org chart) uses functional decomposition of the hazardous state (the “top event” in a fault tree) by envisioning what equipment states could singly or in combination produce the top event. Each such state is an intermediate event in FTA parlance. Once such equipment states are identified, we use a similar analytical approach (i.e., similar thinking) to identify the equipment states necessary to jointly or singly produce the intermediate event in question. Thus the process continues from higher level events down to more basic events. This logical drill-down usually stops at a level of granularity (bottom events) sufficient to determine from observed frequencies in historical data (or from expert opinion about similar equipment) a probability of each bottom event. At this point a fault tree diagram looks like an org chart where each intermediate event is associated with a Boolean logic gate (e.g., AND or OR). This fault tree model can then be solved, by Boolean algebra (Boolean absorption and reduction), usually using dedicated software tools.

The solution to a fault tree consists of a collection of sets of bottom-level events, where each set is individually sufficient to produce the top event, and where every event in a set is individually necessary to that set. There may be thousands or millions of such sets (“cut sets”) in the solution to a fault tree for a complex system. Each set includes one or more bottom-level events. A set having only one bottom-level event would indicate that a single failure could produce the top event, i.e., cause a catastrophe. Civil aviation guidelines (e.g., 14 CFR 25.1309) require that no single event, and no combination of events more probable than a specified threshold, should produce a catastrophic condition.

If probability values are assigned to each bottom-level event, the solution to a fault tree will include calculated probability values for each intermediate event and for the top event. The top event probability is equal to the sum of the probabilities for each cut set in the solution’s collection of cut sets, each of which is an independent combination of bottom events jointly sufficient to produce the top event. A fault tree can still be useful without including probability calculations, however, since the cut set list, typically ordered by increasing event count, provides information about the importance of the associated equipment in avoiding the hazard (top event). The list also gives guidance for seeking common causes of events believed to be (modeled as being) independent.
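To make the mechanics concrete, here is a minimal sketch of cut set generation for a toy tree. The tree encoding and event names are invented for illustration, and real solvers additionally apply Boolean reduction (absorption) to the result:

```python
from itertools import product

def cut_sets(node):
    """Cut sets of a fault tree node.

    A node is either a string (a bottom-level event) or a tuple
    (gate, children) where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":
        # Any child's cut sets are individually sufficient
        return set().union(*child_sets)
    # AND: one cut set from each child must occur jointly
    return {frozenset().union(*combo) for combo in product(*child_sets)}

# Toy tree: top event occurs if E1 fails, or E2 fails,
# or E3, E4, and E5 all fail together.
tree = ("OR", ["E1", "E2", ("AND", ["E3", "E4", "E5"])])
for cs in cut_sets(tree):
    print(sorted(cs))
```

With per-event probabilities attached, summing the product of probabilities within each resulting set gives the rough top event probability described above.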

A fault tree is only as good as the logic that went into its construction. I.e., FTA requires that the events within each cut set are absolutely independent and not causally correlated. This means that wherever an AND gate occurs in a tree, all events associated with that AND gate must be truly independent and have no common cause. So another value of a completed fault tree is to challenge beliefs about independence of events based on isolation, physical separation, vulnerability to environmental effects, duplicated maintenance errors, etc.

Now in the case of SpaceX, the press referred to conducting a fault tree analysis as part of the investigation. E.g., Bloomberg reported on Sep. 9 2016 that a “group will be conducting a thorough ‘fault tree analysis’ to try to determine what went wrong.” This usage is not consistent with the way the term is typically used in aerospace. As stated above, FTA relates to the design of a system and addresses all possible combinations of failures that can be catastrophic.

By contrast, accident investigation would be concerned, in the SpaceX case, with examining failed components and debris from the explosion. Such an investigation would likely include fractography, simulations, models, and hypotheses, which would draw on a group of fault trees that presumably would have existed since the design phase of the rocket and its planned operations.

It is unclear whether SpaceX meant that they were constructing fault trees as part of their accident investigation. They said they were “conducting tests” and that the “investigation team worked systematically through an extensive fault tree analysis.” It seems inconceivable that NASA, stakeholders, and customers would have allowed development of a rocket and use of public funds and facilities without the prior existence of fault trees. It’s possible, even if a SpaceX public relations representative said they were conducting a fault tree analysis, that the PR person was miscommunicating information received from engineers. If no fault trees existed at the time of the explosion, then shame on SpaceX and NASA; but I doubt that is the case. NASA has greatly increased emphasis on FTA and related techniques since the shuttle disasters.

For reasons stated above the relationship between accident investigation and fault tree analysis in aerospace is limited. It would be unproductive to analyze all possible causes of a catastrophic system-state for the purpose of accident investigation when physical evidence supports certain hypotheses and excludes others. Note that each cut set in the solution to a fault tree is effectively a hypothesis in the sense that it is a plausible explanation for the catastrophe; but a fault tree does not provide justification for the nodes of a causation chain.

Many people ask about the relationship between aerospace accident investigation and root cause analysis. While aerospace engineers, the NTSB, and the FAA sometimes use the term “root cause” they usually do so with a meaning somewhat different than its usage in popular techniques of root cause analysis. The NTSB seeks explanations of accidents in order to know what to change to prevent their recurrence. They realize that all causal chains are infinitely long and typically very wide. They use the term “root cause” to describe the relevant aspects of an accident that should be changed – specifically, equipment design, vehicle operating procedures or maintenance practices. They are not seeking root causes such as realignment of incentives for corporate officers, restructuring the education system for risk culture, or identifying the ethical breaches that led to poor purchasing decisions.

Perhaps surprisingly, aerospace accident analyses are rather skeptical of “why” questions (central to many root-cause techniques) as a means of providing explanation. From the perspective of theory of scientific explanation (a topic in philosophy of science), fault tree analysis is also skeptical of causality in the sense that it favors the “covering law” model of explanation. In practice, this means that both FTA and accident investigation seek facts of the case (occurrences of errors and hardware/software failures) that confer nomic expectability on the accident (the thing being explained). That is, the facts of the case, when combined with the laws of nature (physics) and the laws of math (Boolean algebra), require the accident (or top event) to happen. In this sense accident investigation identifies a set of facts (conditions or equipment states) that were jointly sufficient to produce the accident, given the laws of nature. I could have said “cause the accident” rather than “produce the accident” in my previous sentence; but as phrased, it emphasizes logical relationships rather than causal relationships. It attempts to steer clear of biases and errors in inference common to efforts pursuing “why” questions. Thus, techniques like “the 5 whys” have no place in aerospace accident analyses.

Another problem with “why”-based analyses is that “why” questions are almost always veiled choices that entail false dichotomies or contain an implicit – but likely incorrect – scope. I.e., as an example of false dichotomy, “why did x happen” too often is understood to mean “why x and not y.” The classic example of the why-scoping problem is attributed to Willie Sutton. When asked why he robbed banks, Sutton is reported to have replied, “because that’s where the money is.” In this account Sutton understood “why” to mean “why banks instead of churches” rather than “why steal instead of having a job.”

Some root-cause frameworks attempt to avoid the problems with “why” questions by focusing on “what” questions (i.e., what happened and what facts and conditions are relevant). This is a better approach, but in absence of a previously existing fault tree, there may be a nearly infinite number of potentially relevant facts and conditions. Narrowing down the set of what questions exposes the investigators to preferring certain hypotheses too early. It can also be extremely difficult to answer a “what happened” question without being biased by a hunch established early in the investigation. This is akin to the problem of theory-laden observations in science. The NTSB and aerospace accident investigators seem to do an excellent job of not being led by hunches, partly by resisting premature inferences about evidence they collect.

I’d be interested in hearing from anyone having more information about the use of fault trees at SpaceX.

[Edit: a bit more on the topic in a subsequent post]

  – – – – – – – –

Are you in the San Francisco Bay area?
If so, consider joining the Risk Management meetup group.

Risk management has evolved separately in various industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase current trends, case studies, and best practices in our profession with a focus on practical application and advancing the state of the art.

Fault Tree Inhibitor and Combination Gates

So far in this Fault Tree Friday series ( 1 2 3 4 5 ) I’ve only dealt with AND and OR logic gates. Over the years engineers have occasionally seen a need to use other logical relationships, such as NOR, NAND (NOT AND), Exclusive OR, etc. While these are essential for designing circuits, I have never had occasion to use them in fault trees, and I sense that few others do. I recently spoke with a Systems Safety engineer at Boeing who suggested that if you need them, you’ve structured your fault tree wrong.

Two additional gate types that do appear fairly often are the Inhibit gate and the Combination gate. The inhibitor (Inhibit gate, not to be confused with initiator, a basic event), usually diagrammed as a point-down hexagon, is logically identical to an AND gate with two children. It is used to clearly signify that one of the children is not a fault, per se, but some sort of qualifying condition that must be satisfied for the inhibitor gate’s other child event to produce the condition named by the inhibitor gate. The non-fault event beneath an Inhibit gate is shown with a special symbol (often an oval) and is called a Conditioning Event.

Some regulations or standards include provisions that no single failure regardless of probability shall have catastrophic results (e.g., AC 25.1309-1). Use of Inhibit gates in fault trees under such constraints will usually be scrutinized, so the conditioning event must represent an uncommon state. My favorite example is a rejected take-off. It’s not a fault. It’s uncommon but not extremely rare. And to be of consequence it has to occur at high speed. I’ve seen rates in the range of 1E-4 or 1E-5 per takeoff used in the past. Whether the Inhibit gate is justified for low-temperature launch in modeling O-ring failure in the Challenger situation is debatable.
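Arithmetically, an Inhibit gate behaves like a two-child AND: the conditioning event’s probability multiplies the fault’s. A toy sketch with assumed numbers (the 1E-4 per-takeoff rate echoes the range mentioned above; the braking-loss probability is invented):

```python
# Inhibit gate = AND of a conditioning event and a fault event.
# Both probabilities below are assumed values, for illustration only.
p_high_speed_rto = 1e-4  # conditioning event: high-speed rejected takeoff, per takeoff
p_brake_loss = 1e-6      # fault event: loss of braking capability (invented value)

# Probability of the inhibited condition on a given takeoff,
# valid only if the two events are truly independent.
p_hazard = p_high_speed_rto * p_brake_loss
print(p_hazard)  # ~1e-10
```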

A standard graphical representation of the rejected takeoff scenario, in which an independent loss of braking or reverse thrust capability might be of interest, appears below.

inhibit gate

The Combination Gate, usually represented by a flat-bottom hexagon, is a modeling device that eliminates repetition of similar branches in situations where, for example, two of four redundant paths must be failed to produce the condition named by the Combination gate.

Note that true redundancy may not exist in all such situations. In the above aircraft braking example, all brakes are applied together, but some fraction of them – half, for example – may be sufficient to avoid a hazardous state if they are functional. For a six-wheeled aircraft, the Combination gate lets us specify the number of failures within a group of logical branches that will result in the relevant condition. In the case of three-of-six, the Combination gate is shorthand for an OR gate with twenty AND gates beneath it, one for each combination of three, e.g., {1,2,3; 1,2,4; 1,2,5; 1,2,6; 1,3,4; 1,3,5; 1,3,6; etc.}.
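The expansion a three-of-six Combination gate stands in for can be enumerated directly; here is a sketch using Python’s itertools (branch labels 1–6 are just placeholders):

```python
from itertools import combinations

# Each combination of three failed branches out of six is one AND gate
# beneath the equivalent OR gate.
branches = range(1, 7)
and_gates = list(combinations(branches, 3))
print(len(and_gates))  # 20 AND gates
print(and_gates[:4])   # (1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 2, 6)
```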

The two shorthand (stripped-down to show only the logic and event types) fault trees above and below the green line in the image below are logically equivalent. Each models the case where two of four initiator events must exist to produce the top event.

combination gate

The Combination gate is not merely handy but is in practice a necessity for cases where the number of combinations is large. While two-of-four has only six possible combinations, the ten-of-twenty case, which might occur in a fault tree for the Airbus A380, would require two million events and gates to diagram without the Combination gate.

If you’re curious about how to enumerate the possible results of a Combination gate, the relevant area of mathematics is called combinatorics. The specific rules that apply here are that order is not important (thus, this is literally a combination and not a permutation) and repetition is not allowed. This means that {a,b} and {b,a} represent the same real-world situation, so we’re only concerned with one of them. Likewise, {a,a} is not allowed, since event “a” can only occur once. Reliability engineers often call this an “r of n” situation. Since “r” is often used to represent failure rate in the fault-tree world, users of fault trees sometimes call this an “m of n” scenario. The formula for the number of combinations is:

c = n! / ( m! * (n – m)! )

For the two-of-four case:

c = 4! / ( 2! * (4 – 2)! ) = 24 / ( 2 * 2 ) = 6

For any value of n, the maximum value of c occurs when m equals n/2, which happens to be the case being modeled fairly often, as with the ten-of-twenty example I described above:

c = 2,432,902,008,176,640,000 / ( 3,628,800 * 3,628,800 ) = 184,756

In the above example there are 184,756 cut sets, each composed of ten independent initiator events. While an AND of ten events likely has a very low probability, the combination of all such AND gates will be more than five orders of magnitude more probable than any single one of them.
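These figures are easy to check with Python’s math.comb. The 1E-5 initiator probability below is an assumed round number, used only to show the scaling:

```python
import math

print(math.comb(4, 2))    # 6 ways to choose two-of-four
print(math.comb(20, 10))  # 184756 ways to choose ten-of-twenty

# Assume every initiator has probability 1e-5 (an illustrative value).
p = 1e-5
p_one_cut_set = p ** 10                    # a single AND of ten independent events
p_top = math.comb(20, 10) * p_one_cut_set  # rare-event sum over all cut sets
print(p_top / p_one_cut_set)               # ~1.8e5: more than five orders of magnitude
```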

This possibly unintuitive result is yet another reason that fault tree analysis belongs in the preliminary design of systems.

 –  –  –

Though a good deal is too strange to be believed, nothing is too strange to have happened. – Thomas Hardy

Boolean Logic and Cut Sets

Fault Tree Friday post 5 ( 1 2 3 4 )

In the first post on fault trees I used the term cut sets to refer to any combination of fault tree initiators that can produce the fault tree’s top event. There may be many – sometimes very many – cut sets in the complete collection of cut sets for a tree. The probability of the top event is, roughly speaking and in most cases, the sum of the probabilities of each cut set. The probability of each cut set is, in most cases, the product of the probabilities of each initiator in that set. The previous two sentences describe how things are most of the time. The exceptional cases, when they exist, are important.

Before dealing with the exceptional cases though, let’s look at the cut sets of some simple fault trees. In the second post of this series I showed two logically equivalent trees (repeated below) noting that, in real fault tree analysis, we use the lower rendering. The top one is useful for educational purposes, since it emphasizes gate logic. In this example, there are three cut sets:

  • Set 1: Event 1
  • Set 2: Event 2
  • Set 3: Events 3, 4, and 5, together

Fault tree

If the initiator probability of all events were 0.5 (an unlikely value for any real world initiator event, chosen here just to make a point), the probability values of each cut set would be the product of the probability values in each set (two of which have only one event).

  • Set 1: P = 0.5
  • Set 2: P = 0.5
  • Set 3: P = 0.125

Earlier I said that the top event probability in most, but not all, fault trees roughly equals the sum of the probabilities of all the cut sets. In this case that sum would be 1.125. We know that can’t be right since a probability cannot exceed 1.0.

The problem in the example above stems from using a shortcut form of the calculation for the probability of the union of sets – in this case the union of all cut sets of a fault tree. The accurate form of the solution for the probability of a union of the three cut sets (1, 2, and 3) above would be:

P(1,2,3) = P(1) + P(2) + P(3) – P(1) * P(2) – P(1) * P(3) – P(2) * P(3) + P(1) * P(2) * P(3)

Or in set notation:

P(C1 ∪ C2 ∪ C3) = P(C1) + P(C2) + P(C3) – P(C1 ∩ C2) – P(C1 ∩ C3) – P(C2 ∩ C3) + P(C1 ∩ C2 ∩ C3)

The generalized form of the equation for the top event probability requires more math than we need to cover here. The Wikipedia entry on the inclusion-exclusion principle is a good reference for those needing details. A rough summary is that we subtract the probabilities of each combination of even numbers of cut sets and add the probabilities of each combination of odd numbers of cut sets. The resulting equation for the top event probability of a tree modeling faults in a complex system can have billions of terms. No problem for modern computers.

Inclusion exclusion Venn diagram
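For this three-cut-set example, the inclusion-exclusion sum can be checked by brute force over all 32 combinations of initiator states. This sketch again uses 0.5 for every initiator probability, purely for illustration:

```python
from itertools import product, combinations
from math import prod

# Cut sets from the example: {1}, {2}, {3,4,5}; every initiator has p = 0.5.
cut_sets = [frozenset({1}), frozenset({2}), frozenset({3, 4, 5})]
p = {e: 0.5 for e in range(1, 6)}

# Inclusion-exclusion over cut sets. The probability of an intersection of
# cut sets is the product over the UNION of their events (events independent).
top = 0.0
for k in range(1, len(cut_sets) + 1):
    for group in combinations(cut_sets, k):
        events = frozenset().union(*group)
        top += (-1) ** (k + 1) * prod(p[e] for e in events)

# Brute force: enumerate all 2^5 initiator states and sum the probability
# of every state in which at least one cut set is fully present.
brute = 0.0
for state in product([False, True], repeat=5):
    on = {e for e, flag in zip(range(1, 6), state) if flag}
    if any(cs <= on for cs in cut_sets):
        brute += prod(p[e] if e in on else 1 - p[e] for e in range(1, 6))

print(top, brute)  # both 0.78125
```

Note that the naive sum of 1.125 shrinks to a legitimate probability of 0.78125 once the correction terms are applied.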

In one sense the “solution” to a fault tree is simply a collection of combinations of initiator events. This means any fault tree can be reorganized into a tree of only three levels. In such a reorganized tree, the top event is associated with an OR gate, and all children of that OR gate are associated with AND gates. That is, you can say that any fault tree can be reduced to an OR of many ANDs. At least that would be the case if no single failure were allowed, e.g., by design, to produce the top event. If single initiator events can lead to the top event, such fault trees can be reduced to an OR of ANDs (i.e., multiple-event cut sets) and single-event cut sets (allowing that a set can have one element). That is the case for the tree in the above example, which can be rearranged to look like this:

fault tree logic

Perhaps a better example of rearrangement of a fault tree into an OR of many ANDs is the tree below. Note that the AND gate in the rendering below the green line has six child events. The single black vertical line leading from the bottom of the gate joins six branches. Using this drawing technique prevents the clutter that would exist if we attempted to draw six separate parallel vertical lines into the bottom of the AND gate.

equivalent fault trees

The trees above and below the green line are logically equivalent. Presumably, the design of the system being modeled would lead an analyst to draw a tree of the top form in a structured top-down analysis. The bottom tree shows how we get cut sets, of which there are six for this tree:

  • 1, 3, 4
  • 1, 3, 5
  • 1, 3, 6
  • 2, 3, 4
  • 2, 3, 5
  • 2, 3, 6

We can now calculate the top event probability as follows:

P = P(1,3,4) + P(1,3,5) + P(1,3,6) … – P(1,3,4 and 1,3,5) – P(1,3,4 and 1,3,6) …
+ P(1,3,4 and 1,3,5 and 1,3,6) + …, etc.

where P(1,3,4) equals P(1) * P(3) * P(4), because initiator events are required to be truly independent of each other for the fault tree to be valid. Note that the probability of two cut sets occurring jointly is the product over the union of their events – e.g., P(1,3,4 and 1,3,5) = P(1) * P(3) * P(4) * P(5), since the two sets share events 1 and 3. Fault tree software handles the tedious and error-prone job of calculating the top event probability and the probabilities of all intermediate events.
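A sketch of the difference between the rare-event approximation (the plain sum of cut set probabilities) and the exact union probability for the six cut sets listed above, using an assumed initiator probability of 0.1 for every event:

```python
from itertools import product
from math import prod

# Cut sets of the rearranged tree; note that event 3 appears in every set.
cut_sets = [frozenset(s) for s in ((1, 3, 4), (1, 3, 5), (1, 3, 6),
                                   (2, 3, 4), (2, 3, 5), (2, 3, 6))]
p = {e: 0.1 for e in range(1, 7)}  # assumed probability, for illustration

# Rare-event approximation: plain sum of cut set probabilities.
naive = sum(prod(p[e] for e in cs) for cs in cut_sets)

# Exact: enumerate all 2^6 initiator states.
exact = 0.0
for state in product([False, True], repeat=6):
    on = {e for e, flag in zip(range(1, 7), state) if flag}
    if any(cs <= on for cs in cut_sets):
        exact += prod(p[e] if e in on else 1 - p[e] for e in range(1, 7))

print(naive)  # ~0.006
print(exact)  # ~0.005149 - slightly smaller, because the cut sets overlap
```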

While the rules of fault tree construction require all initiator events to be truly independent of each other, nothing says the same initiator event cannot appear in multiple places in a tree. In fact, for redundant systems, this happens often.

We need to give this situation special attention. Its ramifications for system design are important. Consider this simple fault tree on the left below, noting that the same basic event ( event A) appears in two branches.

At first glance, one might conclude that the top event probability for this tree would be 0.1 * 0.1 = 0.01, since the top event is an AND gate, and AND nominally indicates multiplication. But the tree is modeling a real-world phenomenon in which, if event A happens, it happens regardless of our symbolic representation. Or, from an analytic standpoint, we can refer to the rules of Boolean algebra: A AND A is simply A (from a rule known as idempotence of conjunction).

So the collection of cut sets for this simple tree contains only one cut set; and that cut set consists of a single event, A. Since the probability of A is 0.1, the probability of T, the top event, is also 0.1. Both common sense and Boolean algebra reach the same conclusion. Complex fault trees can vastly exceed the grasp of our common sense, and “A AND A” cases can be concealed in complex trees. Software that applies the rules of Boolean algebra saves the day.

Idempotency of disjunction (image on right, above) similarly leads us to conclude that replacing the above tree’s AND gate with an OR gate (fault tree on right, above) yields the same cut set collection – a single set having a single event, A.

In addition to idempotency effects, non-trivial real-world fault trees will also likely have a cut set collection where one cut set consists of a subset of the events of another cut set. This has the sometimes unintuitive consequence of eliminating the cut set with the larger number of terms. Consider this tree:

fault tree disjunctive absorption

One view of its solution is that two cut sets exist:

1.)  A
2.)  A, B

But A OR (A AND B) equals simply A. That is, disjunctive absorption removes cut set 2 from the cut set collection. In a real-world fault tree, complexity may make cases of disjunctive absorption much less obvious; and they often point to areas of ineffective application of design redundancy.

fault tree conjunctive absorption

Conjunctive absorption has the same effect in the above tree. A preliminary account of its cut sets would be:

1.)  A, A
2.)  A, B

Idempotency reduces cut set 1 to A alone, and disjunctive absorption eliminates cut set 2. Thus, conjunctive absorption can be derived from disjunctive absorption plus idempotency. In short form, A AND (A OR B) equals simply A.

The simple fault trees above would never occur in the real world. But logically equivalent conditions do appear in real-world trees. They may not be obvious from inspecting the fault tree diagram; but they become apparent on viewing the cut set collection. A cut set collection from which all supersets have been eliminated (i.e., absorption has been applied) is often called a minimal cut set collection.
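Minimization by absorption is easy to express in code; here is a sketch (not any particular tool’s algorithm):

```python
def minimize(cut_sets):
    """Apply Boolean absorption: drop any cut set that is a proper
    superset of another. Idempotent duplicates (A AND A) collapse
    automatically because sets cannot contain repeated elements."""
    sets = {frozenset(cs) for cs in cut_sets}
    return {cs for cs in sets
            if not any(other < cs for other in sets)}

# Disjunctive absorption: A OR (A AND B) reduces to A.
print(minimize([{"A"}, {"A", "B"}]))  # {frozenset({'A'})}
```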

Back in the dark ages when computers struggled with fault tree calculations, analysts would use a so-called gate-by-gate (i.e. bottom-up) method to calculate a top event probability by hand. The danger of doing this, if a tree conceals cases where idempotency and absorption are relevant, is immense. Given that real-world initiator probabilities are usually small numbers, a grossly wrong result can stem from effectively squaring – without being justified in doing so – an initiator probability. I mention this only because some textbooks published this century (e.g. one by the Center for Chemical Process Safety) still describe this manual approach – a risky way of dealing with risk.

A more important aspect of the concepts covered here is the problem stemming from a fault tree that does not model reality well. Consider, for example, a fault tree with 10,000 cut sets. Imagine the 1000th most probable cut set, based on minimal cut set analysis, has a probability of one per trillion (1E-12) and contains events A, B, and C.

Imagine further that events A and C each have probabilities of 1E-5. If A and C turn out to be actually the same real-world event – or they result from the same failure, or are in some other way causally correlated – that cut set then reduces to events A and B, having a cut set probability of 1E-7. This probability may be greater than all others in the collection, moving it to the top of an ordered list of cut sets, and possibly into the range of an unacceptably likely top event.
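The arithmetic behind that jump, using the probabilities stated above (event B’s probability is implied by the cut set’s 1E-12 value):

```python
p_a = p_c = 1e-5
p_cut = 1e-12
p_b = p_cut / (p_a * p_c)  # implied: 1e-2

# If A and C are really the same event (a common cause), the cut set
# collapses to {A, B}:
p_collapsed = p_a * p_b
print(p_collapsed)  # ~1e-7: five orders of magnitude more probable
```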

In other words, fault trees that miss common-mode failures are dangerous. Classic cases of this include:

  • Redundant check-valves all installed backwards after maintenance
  • Uncontained engine failure drains all three aircraft hydraulic systems
  • Building fire destroys RAID array and on-premises backup drives
  • Earthquake knocks out electrical power, tsunami destroys backup generators

Environmental factors like flood, fire, lightning, temperature, epidemic, terrorism, sabotage, war, electromagnetic pulse, collision, corrosion, and collusion must enter almost any risk analysis. They can do so through functional hazard analysis (FHA), failure mode effects analysis (FMEA), zonal analysis (ZSA) and other routes. Inspection of fault trees to challenge independence of initiator events requires subject matter expertise. This is yet another reason that fault trees belong to system and process design as much as they belong to post-design reliability analysis.

 –  –  –  –


The Magic of Redundancy

Fault Tree Friday – Week 4           ( 1  2  3 )

In post 1 of this series I stressed that fault trees were useful in the preliminary design of systems, when alternative designs are being compared (design trade studies). I argued that fault trees are the only rational means of allocating redundancy in complex systems. In that post I used the example of a crude brake system for a car. It consists of a brake pedal, a brake valve with two cylinders each supplying pressure to two brakes, and the hydraulic lines and brake hardware (drums, calipers, etc.). I’ll use that simplified design (it has no reservoir or other essentials yet) again here.

car brakes

We’ll assume that pressure sent to either the front wheels alone or the rear wheels alone delivers enough braking force that the car stops normally. Note that this also means that, with brakes of this design, the driver would not know that only two of the four brakes were operational.

For the sake of simplicity we’ll model the brake system as having only two fault states: front brakes unable to brake, and rear brakes similarly incapacitated. While we would normally not model the collection of all failure modes resulting in loss of braking capability to the front brakes as a basic event (initiator), we’ll do so here to emphasize some aspects of redundant system design.

If we imagine a driving time of one hour and assign a failure rate of one per thousand hours (1E-3/hr) to each of the composite front and rear brake system failures, our fault tree might look like this:

fault tree

If we modeled the total loss of braking using the above fault tree, we’d be dead wrong. Since the system, as designed here, is fully redundant (in terms of braking power), many failures of either front or rear brakes could go unnoticed almost indefinitely. To make a point, let’s say the maintenance events that would eventually detect this condition occur every two years.

To correct the fault tree (leaving the system design alone for now), we’d start with something like what appears below. Note that with the specified failure rates, the probability of either the front or rear brakes being in a failed state at the beginning of a drive is roughly equal to one, using the standard calculus of probability given an exposure time and a failure rate based on an exponential distribution. This means that, as modeled, the redundancy is essentially useless. We call the failure that goes unnoticed for a long time period a latent failure.

fault tree
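The “roughly equal to one” figure follows from the exponential failure law, P = 1 − e^(−λt). A quick check, taking the two-year exposure as calendar hours (an assumption made only for this illustration):

```python
import math

def p_fail(rate_per_hour, exposure_hours):
    """Probability of being in a failed state by the end of the exposure
    time, assuming an exponential time-to-failure distribution."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

print(p_fail(1e-3, 1))         # ~0.001: chance of failing during a one-hour drive
print(p_fail(1e-3, 2 * 8760))  # ~1.0: a two-year latent exposure defeats the redundancy
```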

That fault tree isn’t quite correct either. It models the rear brakes as having failed first – a latent failure that silently awaits failure of the front brakes. But in an equally likely scenario, the front brakes could fail first. That means either of two similar scenarios could occur. In one (as shown above) the loss of braking in the rear brakes goes unnoticed for a long time period, combining with a suddenly apparent failure of the front brakes. In the other scenario, the front brake failure is latent. A corrected version appears below. Note that the effect of this correction is to double the top event probability, making our design look even worse and making redundancy seem not so magic.

The fault tree tells us that our brake system design isn’t very good. We need to reduce the probability of the latent failures, thereby getting some value from the redundancy. We can do this by adding a pressure sensor to detect loss of pressure to front or rear brakes, along with an indicator on the dashboard to report that this failure was detected. To keep this example simple, assume we can use the same sensor for both systems.

The astute designer will likely see where this is heading. Failure of the pressure sensor to detect low pressure is now a latent failure, and so is failure of the indicator in the dashboard. Again for simplicity, we’ll model these as a single unit; but failure of that unit to tell the driver that either the front or rear brakes are incapacitated is a latent failure that could go undetected for years. The resulting fault tree would look like this:

fault tree

Though their failures are latent, monitoring subsystems are usually more reliable than the things they monitor. I’ve shown that in this example by using a failure rate of one per 10,000 hrs for the monitor equipment. With an exposure time of two years, the probability that the monitor is in a failed state during a drive is 0.16. Consequently, adding all the monitoring equipment only changed our top event probability from 1E-3 to 3.2E-4. Better, but not impressively so. And we added cost to the car and more things that can fail.

It’s important to realize that, at the bottom of the above fault tree segment, the exposure time to loss of braking in both rear brakes cannot be shorter than the exposure time to failure of its monitor. That is almost always the case for monitored components in any redundant design.

We could reduce the top event’s likelihood by shortening the system’s maintenance intervals. Checking the monitoring system every two months, instead of every two years, would buy us another factor of twelve.
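The factor of twelve follows from the near-linearity of P when R * T is small. A quick check, using the post's monitor rate of one per 10,000 hrs and assuming (my assumption) that the two-year exposure is counted in driving hours, about 1,752:

```python
import math

def p_failed(R: float, T: float) -> float:
    """Exponential failure distribution: P = 1 - e^(-R*T)."""
    return 1.0 - math.exp(-R * T)

R = 1.0e-4                        # monitor failure rate, per hour
p_2yr = p_failed(R, 1752.0)       # ~0.16, matching the post
p_2mo = p_failed(R, 1752.0 / 12)  # same usage, 2-month check interval
print(p_2yr / p_2mo)              # a bit under 12, since 1 - e^(-RT)
                                  # is concave in T
```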

But such an inspection need not verify that each element of the monitoring subsystem is separately functional. Logic tells us we need only verify that the subsystem, as a group, is or is not capable of reporting a failure. For now, if it has failed, we aren’t really concerned with what went wrong in particular. We can test the monitor by inducing it to test for pressure when none exists. If the monitor reports the condition as a pressure failure, it is good.

We don’t need an auto mechanic for this. The driver could run the test on startup. We’ll design the monitor to test for no pressure as the car starts, when no brakes are applied. We call on the driver to verify that the warning indicator illuminates.

Two failure modes of the monitored brake system – which now includes an operator – should now be apparent. An unlikely failure mode is that the indicator somehow illuminates (“fails high”) during the test sequence, indicating no pressure at startup, but later fails to illuminate when brake pressure is actually lost.

A far more likely fault state is operator error. This condition exists when the operator fails to notice that the warning indicator never illuminated during the startup test sequence. This fault does not involve exposure time at all; the operator either remembers or forgets. For an untrained operator, the error probability in such a situation may be 100%. A trained operator will make this error somewhere between one and ten percent of the time, especially when distracted. Two operators who monitor each other will do better. Two operators with a checklist will do better still.

With the startup-check procedures in place, our system no longer has any latent failures. It has high-probability error states, but they have the advantage of only contributing to the top event in conjunction with another failure. The resulting fault tree makes us feel better about driving:


While the system we’ve modeled here is not representative of modern brake systems (and my example has other shortcuts that would be deemed foul in a real analysis), this example shows how fault trees can be used in preliminary system design to make better system-design choices. It really only begins to make that point. In a quad-redundant flight control system, such analysis has far more impact. This sort of modeling can also reveal weak spots in driverless cars, chemical batch processes, uranium refinement, complex surgery procedures, cyber-security, synthetic biology, and operations where human checks and balances (a form of redundancy) are important.

In the above fault trees, two similar branches appear, one each for the front and rear brakes. The real-world components and their failures represented in each branch are physically distinct. In the final version, immediately above, indicator failure and operator error occur in both branches. Unlike the other initiator events, these events logically belong to both branches, but each represents the same real-world event. The operator’s failure to verify the function of the pressure warning indicator, for example, if it occurred, would occur in both branches simultaneously. The event IDs (e.g., “OE1” above) remind us that this is the case. Appearance of the same event in multiple branches of a tree can profoundly impact the top event probability in a way that isn’t obvious unless you’re familiar with Boolean algebra. We’ll go there next time.


Fault Trees – View from the Bottom

Fault Tree Friday #3. See also #1 #2

Last week I showed how to begin building a fault tree from the top down, explained a tree’s structure, and looked at intermediate events and their associated logic gates. Now we can briefly cover initiators, the events at the bottom of the tree.

In most fault trees, most initiators get their probability values from the failure rate of a component’s specific failure mode, and from the time window during which the mission or operation is exposed to that failure. For example, resistors fail open more often than they fail short, and the consequences of those two failure modes are often very different.

Some missions are exposed to certain failures for a fixed number of times regardless of the length of the mission. Aircraft are exposed to aborted takeoffs commanded by the control tower at some historical rate. This has nothing to do with the duration of a flight. Initial Public Offerings occur once in a firm’s lifetime, and occasionally fail miserably at a low historical rate, often with no chance of recovery. If modeling these events, you might infer their probabilities from the known cases in the total population.

Most hardware failures and many human errors are modeled as occurring at a fixed rate over the duration of a mission or the life of a project. This assumes, in the case of mechanical or electrical equipment, that infant-mortality cases have been removed from the population by a burn-in process, as is often applied to 100% of semiconductors in critical applications. Likewise, wear-out failures are prevented in critical applications by maintenance, non-destructive testing, and replacement of finite-life structural components.

Between the extremes of infant mortality and wear-out – during the “useful-life” period – most equipment fails at a roughly constant rate. During that normal-life period, the probability of failure during a mission (e.g., a flight, reactor-time in a chemical batch process, or the time between scheduled maintenance in power-generation) is a simple function of the exposure time and a historical failure rate.

The model for this is commonly called an exponential failure distribution. The probability P of a failure in time interval T for a component having a failure rate R (where R equals the reciprocal of mean time between failures, MTBF) is given by:

P = 1 − e^(−R*T)

where “e” is Euler’s Number, the number having a natural log of 1.

As a historical note, when the product R * T is small, you can approximate P as P = R * T and leave your slide rule in the desk. Otherwise, fault tree software will let you supply values for R and T and will calculate P for each initiator event where you didn’t assign P directly. As another historical note, events for which P is supplied directly are sometimes called “undeveloped events” and those given R and T values are often called “basic events.” “Undeveloped” partly stems from an old practice of breaking trees into chunks to ease computation, the top of one tree supplying the probability to an “undeveloped” (i.e., developed elsewhere) event. Try to avoid this; it risks grave computational errors, for reasons we’ll cover later. I’ll call any event that doesn’t have children an initiator.
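A short sketch of the formula and its slide-rule approximation, with arbitrary illustrative values for R and T:

```python
import math

def p_exact(R: float, T: float) -> float:
    """P = 1 - e^(-R*T), the exponential failure distribution."""
    return 1.0 - math.exp(-R * T)

def p_approx(R: float, T: float) -> float:
    """The small-R*T approximation: P ~= R*T."""
    return R * T

# When R*T is small, the two agree to several digits...
print(p_exact(1e-6, 10.0), p_approx(1e-6, 10.0))

# ...but as R*T grows, the approximation overstates P,
# and can even exceed 1, which no probability may do.
print(p_exact(1e-4, 17520.0), p_approx(1e-4, 17520.0))
```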

fault tree

You might be wondering where failure rates come from. Good question. Sources include GIDEP, IEEE 500, Backblaze hard drive data, USAF Rome Laboratory, MIL-HDBK-217F, and RIAC. And those who write procurement specs should require vendors to supply detailed data of this sort.

Above I used the example of different failure rates for the different failure modes of a resistor. Since the effort of building fault trees is usually only justified for catastrophic fault states (hazards), you’re unlikely to see a resistor failure appear as an initiator. The top-down development of a tree need only descend to a point dictated by logic and the availability of historical failure-rate data. So initiators might specify a failure rate for electronics at the “box” (component) level, or perhaps the circuit-board level when a box contains redundant boards, as would be the case for an auto-land controller in my aircraft braking example.

The Human Factor

For fault-tree purposes, human errors are modeled as faults. Error is generally modeled as the probability of a mistake or omission per relevant action. This typically enters fault trees as events involving maintenance errors and primary operator errors – like pilots for aircraft and chemists for batch processes. Fault trees point out the needs for operator redundancy and for monitors.

Monitors might take the form of more humans, i.e., inspectors or copilots. Or monitors might take the form of machines that watch the output of human operators or machines that check the output of other machines.

One thing history has taught us – in general, humans make poor monitors, particularly when called upon to ensure that a machine is working properly. Picture Homer Simpson, eyes glazed over, while the needle is in the red.

Bored humans do poor work (reminder: blog post on human capital risk), and scared humans are even worse. This is particularly relevant for critical operations in degraded systems where skill is required – think fighter pilots and deep divers.

While diodes, motors, pumps, valves, and surprisingly many other things fail at a rate of about one per million hours (1E-6/hr) – and RAM is better still – humans err far more often. For example:

  • Omission of step in batch process by skilled operator:   0.003 to 0.005
  • Arithmetic error in single simple calculation:   0.03
  • Human monitor (inspector) doesn’t catch error:  0.03
  • Omission of step during stressful emergency procedure:   0.1
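Using two of the rates above, and assuming (often optimistically) that operator and inspector errors are independent, we can estimate the benefit of adding a human monitor:

```python
p_operator_error = 0.03   # arithmetic error in a single simple calculation
p_inspector_miss = 0.03   # human monitor doesn't catch the error

# The error survives only if the operator errs AND the inspector
# misses it. Independence is assumed here -- shared distractions
# and groupthink often violate it in practice.
p_uncaught = p_operator_error * p_inspector_miss
print(p_uncaught)  # roughly 9e-4, a ~30x improvement over the lone operator
```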

Commercial aircraft operation uses redundant pilots and redundancy in critical procedures to deal with errors of omission. Knowledge of error criticality does little to prevent critical errors. For example, failure to deploy flaps for takeoff – about as critical an error as can be imagined – has resulted in several crashes, including Delta 1141 in 1988, despite redundancy and checklists. Non-human monitors of humans’ configuration of control surfaces are a better approach.

Fault trees are great for helping us allocate redundancy in system design. To get this right, we need to take a close look at the failure rates and exposure times supplied to initiator events in redundant designs. I intended to cover that today, but this post is already pushing the limits. I’ll get to it next time.

In the meantime, consider two other aspects of redundancy, failure rates, exposure time, and probability. First, imagine two components in parallel, each having P = .01. This arrangement may cost much less than one component having P = .0001. Then again, two components in parallel will weigh more than twice the weight of one, and will take up at least twice the space.

Second, consider the fault-tree ramifications of choosing the two-in-parallel design in the above example. Are there any possible common-mode (or common-cause) failures of both the components? What happens if one explodes and takes out the other? What happens if the repair crew installs both of them backwards after scheduled maintenance?
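Those questions can be given rough numbers. The sketch below first verifies the two-in-parallel arithmetic from the previous paragraph, then applies a beta-factor style common-cause model; the beta value is my hypothetical assumption, not a figure from the post:

```python
p_single = 0.01      # each of two parallel components

# Fully independent redundancy: the function is lost only if BOTH fail.
p_independent = p_single * p_single
print(p_independent)           # ~1e-4, matching the single P = .0001 part

# Hypothetical common-mode fraction: suppose 5% of failures
# (an explosion, a repair crew installing both backwards)
# take out both components at once.
beta = 0.05
p_with_common_mode = (1 - beta) ** 2 * p_single ** 2 + beta * p_single
print(p_with_common_mode)      # dominated by the common-mode term
```

Even a small common-mode fraction swamps the independent-failure term, which is why fault trees must model common-cause events explicitly rather than assume independence.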


Fault Tree Construction Basics

Last week I introduced fault trees, giving a hint at what they’re good for, and showing some fault-tree diagram basics. Today I’ll focus on the mechanics of building one.

The image below, showing two different diagrams for the exact same logical fault tree, serves as a quick review. The top event, “A,” has two 2nd-level events, “B” and “C,” having two and three child events respectively. An OR gate is associated with the top event and event B, while an AND gate is associated with event C. Events 1 through 5 are basic events, which are initiators. They are literally “bottom events” though we don’t usually use that term.

The difference between the top and bottom renderings below is just a matter of formatting. The bottom rendering looks much nicer in diagrams when we replace “A,” “B,” “2,” etc. with descriptive text. Here I’ll use whichever convention fits best on the page. It’s important to realize that in the bottom style, the logic gate goes with (sticks to) the event above it.

Fault tree analysis

Last time I mentioned that building a tree is a top-down process in the sense that you start with the top event of the tree. Since fault trees can have only one top event but a large number of bottom events (initiators), the analogy with living trees is weak. An organizational chart might be a closer analogy, but even that isn’t accurate, since fault trees can contain the same initiator events in multiple places. I’ll show how this can happen later.

We typically get the top event of a tree from a hazard assessment where an unwanted outcome has been deemed critical – something like an aircraft hitting the end of a runway at speeds over 50 mph due to a brake system fault.

From the top event, we identify its high-level contributors. In the case of the aircraft brakes example, the redundancy designed into the system may make descriptions of these conditions a bit wordy. For example, consider a dual-hydraulic-brake design with two systems, each feeding half the brakes, with eight brakes total. In that system, one equipment state causing the top event (there are several) would be complete loss of hydraulic system number 1 plus sufficient additional failures to render braking inadequate. Those additional failures could include, for example, the complete failure of hydraulic system number 2 OR mechanical failures of one or more of the system 2 brakes OR loss of the command signal to the system 2 brakes, and a few others.

That is an example of one of the possible causes (i.e., one of the 2nd level events) of faults that could cause the loss of braking specified in the top event. There may be five or ten others, any of which could produce the hazardous state. The word “any” in the previous sentence tells you the relationship between the top event and this collection of second-level contributors. It is an OR relationship, since any one of them would be sufficient to cause the hazard. An OR gate is therefore tied directly to the top event.

For the first 2nd-level intermediate event of the aircraft brake system example above, we would carefully come up with a name for this fault; we can refine it later. Something like “Loss of hydraulic system #1 plus additional failures” would be sufficient. Note that a good name for this event gives a clue about the logic gate associated with it. The word “plus” suggests that the gate for this event will be an AND gate.

A more accurate description of the event – perhaps too wordy – would be “Loss of hydraulic system #1 plus any of several different additional failures or combinations of failures.” This tells us that this event, as we’re modeling it, will have two children, which we can label:

  1. Loss of hydraulic system 1 power to the brakes
  2. Additional failures leading to loss of braking (meaning loss of braking sufficient to result in the top event when combined with loss of system 1 hydraulic power to brakes)

Without knowing anything else about the system and its operation at this point (details we would get from system schematics and operating manuals) we can’t really specify the gate associated with the first of the two events listed above. If hydraulic system #1 is itself redundant, it might be an AND gate, otherwise an OR. We can infer that event #2 above has an OR gate beneath it, since several combinations of additional failures might be sufficient to render the whole system ineffective.

Here’s a diagram for what we’ve modeled so far on the brake system failure. This diagram, of the bottom style in the above image, also includes small tags below each event description (above its gate) containing event IDs used by fault tree software. Ignore these for now.

fault tree analysis

To help understand the style of thinking involved in modeling systems, whether physical, like a brake system, or procedural, like complex surgery, compare what we’ve done so far, for aircraft brakes, with the brake system of your car. For modeling purposes, let’s ignore the parking brake for now.

Your car’s brake system also contains some redundancy, but it’s limited. If it’s a simple hydraulic brake system (most cars are more complex), it has two cylinders, one powering the front two brakes and one powering the rear. Both of these have to be in a failed state for you to be without brakes for hydraulic reasons.

Notice I said “have to be in a failed state” and not “have to fail.” The likelihood of both failing independently during a trip is much lower than the probability that one failed some time ago without detection and one failed during the trip. Fault trees deal with both these cases, the latter involving a latent failure and a monitor with an indicator to report the otherwise latent failure. Of course, the monitor or the indicator might be in a failed state, again without your knowing it. Redundancy and monitors complicate things. We’ll get into that later.

Since both of your car’s brake cylinders must be failed for you to be without hydraulic power for braking, you might think that a fault tree for total loss of braking in your car would start with a top event having an AND gate. But your brake pedal probably isn’t redundant. If it falls off, you’re without brakes. Historical data on a large population of cars shows this event to have a very low probability. But we’ll include it for thoroughness. If we diagram the model we’ve developed for car brakes so far, we have this:

fault tree analysis

One thing immediately apparent in these two barely-begun fault trees, one for aircraft brakes, and one for car brakes, is that car brakes, as modeled, have a single-point failure leading to the top event, albeit an improbable one. The FAA guidance for design of aircraft systems specifies that no single failure regardless of probability should have catastrophic consequences. If we imagine a similar requirement for car design, we would have to add a brake subsystem with an independent actuator, like the separate hand or foot-controlled parking brake actuator in most cars. Have you tested yours lately?

Next time we’ll explore the bottom events of fault trees, the initiator events, and the topic of exposure times and monitors. Working with these involves examination of failure probabilities and the time during which you’re exposed to failures, some of which – as with your parking brake – may be very long, dramatically increasing the probability that a latent failure has occurred, resulting in a loss of perceived redundancy.

  – – –

In the San Francisco Bay area?

If you live near San Francisco, consider joining our newly formed Risk Management meetup group.

Risk management has evolved separately in various industries. This group aims to cross-pollinate, compare, and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data, and reputation risk.

This meetup seeks to build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase current trends, case studies, and best practices in our profession with a focus on practical application and advancing the state of the art.

It’s Fault Tree Friday

Fault trees are back in style. This is good!

I’ve taught classes on fault tree analysis (FTA) for just over 25 years now, usually limited to commercial and military aviation engineers and niche, critical life-support areas like redundant, closed-circuit scuba devices. In recent years, pharmaceutical manufacture, oil refinement, and the increased affordability of complex devices and processes have made fault trees much more generally relevant. Fault trees are a good fit for certain ERM scenarios, and have deservedly caught the attention of some of the top thinkers in that domain. Covering FTA for a general audience won’t fit in a single blog post, so I’ll try doing a post each Friday on the topic. Happy Friday.

A fault tree is a way to model a single, specific, unwanted state of a system using Boolean logic. Boolean logic is a type of algebra in which all equations reduce to the values true or false. The term “fault” stems from aerospace engineering, where fault refers to an unwanted state or failure of a system at the system level. Failure and fault are typically differentiated in engineering; many failures result in faults, but not all faults result from failures. A human may be in the loop.

Wikipedia correctly states the practical uses for fault trees, including:

  • Understanding how the top event (system fault) can occur
  • Showing compliance with system reliability requirements
  • Identifying critical equipment or process steps
  • Minimizing and optimizing resources
  • Assisting in designing a system

The last two bullets above are certainly correct, but don’t really drive the point home. Fault tree analysis during preliminary design of complex systems is really the sole means to rationally allocate redundancy in a system and to identify the locations and characteristics of monitors essential to redundant systems.

Optimizing the weight of structural components in a system is a fairly straightforward process. For example, if all links in a chain don’t have the same tensile strength, the design wastes weight and cost. Similarly, but less obviously, if the relationships between the reliabilities of redundant system components are arbitrary, the design wastes weight and cost. Balancing the reliability of redundant-system components greatly exceeds the cognitive capacity of any human without FTA or something similar. And the same goes for knowing where to put monitors (gauges, indicators, alarms, check-lists, batch process QC steps, etc.).

Essential Concepts

A fault tree has exactly one top event and a number of bottom events, usually called basic events or initiators. I’ll use the term initiator to avoid an ambiguity that will surface later. Initiator events typically come from failure mode and effects analyses (FMEA). To calculate a probability value for the top event, all initiators must have an associated probability value. That probability value often comes from a known failure rate (a frequency) and a known exposure time, the time duration in which the failure could occur. Fault trees are useful even without probability calculations, for reasons described below.

Fault trees are constructed – in terms of human logic – from top to bottom. This involves many intermediate fault states, often called intermediate events. The top event and all intermediate events have an associated logic gate, a symbolic representation of the logical relationship between the event and those leading upward into it. We often mix metaphors, calling the events leading up into an intermediate event and its associated logic gate child events, as would be the case in a family tree.

The top event is a specific system fault (unwanted state). It often comes directly from a Functional Hazard Assessment (FHA), and usually quantifies severity. Examples of faults that might be modeled with fault trees (taken from my recent FHA post) include:

  • Seizure of diver in closed-circuit scuba operations due to oxygen toxicity from excessive oxygen partial pressure
  • Loss of chemical batch (value > $1M) by thermal runaway
  • Reputation damage resulting in revenue loss exceeding $5B in a fiscal year
  • Aircraft departs end of a runway at speeds in excess of 50 miles per hour due to brake system fault

The solution to a fault tree is a set of combinations of individual failures, errors, or events, each of which is logically sufficient to produce the top event. Each such combination is known as a cut set. Boolean logic reduces the tree into one or more cut sets, each having one or more initiator (bottom-level) events. The collection of cut sets immediately identifies single-point critical failures and tells much about system vulnerabilities. If the initiator events have quantified probabilities, then the cut sets do too, and we can then know the probability of the top event.

For an example of cut sets, consider the following oversimplified fault tree. As diagrammed below, it has three cut sets, two having three initiator events, and one having only one initiator (“uncontained horizontal gene transfer”). If the reason for this cut-set solution isn’t obvious, don’t worry; we’ll get to the relevant Boolean algebra later.

fault tree analysis
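For readers who want to poke at cut sets before we get to the Boolean algebra, here is a toy solver. The tree it solves is hypothetical (A = B OR C; B = 1 OR 2; C = 3 AND 4 AND 5, the same shape as the two-rendering review example earlier on this page), and the code is a sketch, not production FTA software:

```python
def cut_sets(node):
    """Return the minimal cut sets of a node as a list of frozensets."""
    if isinstance(node, str):                 # an initiator (bottom event)
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":                          # any child's cut set suffices
        sets = [s for cs in child_sets for s in cs]
    else:                                     # "AND": one cut set per child
        sets = [frozenset()]
        for cs in child_sets:
            sets = [a | b for a in sets for b in cs]
    # Boolean absorption: drop any cut set that contains a smaller one.
    return [s for s in sets if not any(t < s for t in sets)]

# Hypothetical tree: A = B OR C;  B = 1 OR 2;  C = 3 AND 4 AND 5
tree = ("OR", [("OR", ["1", "2"]),
               ("AND", ["3", "4", "5"])])

for cs in sorted(cut_sets(tree), key=len):
    print(sorted(cs))   # ['1'], ['2'], then ['3', '4', '5']
```

Cut sets of length one are the single-point failures; with initiator probabilities in hand, the top event probability is then, to first order, the sum over cut sets of the product of probabilities within each.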

Not a Diagram

A common rendering of a fault tree is a fault tree diagram, looking something like the one above. In the old days fault trees and fault tree diagrams could be considered the same thing. Diagrams were useful for seeing the relationships between branches and the way they contributed to the top event. Fault tree diagrams for complex systems can exceed the space available even on poster-sized paper, having hundreds or thousands of events and millions of cut sets. Breaking fault tree diagrams into smaller chunks reduces their value in seeing important relationships. Fault trees must be logically coherent. We now rely on software to validate the logical structure of a fault tree, rather than on visual inspection of a fault tree diagram. Software also allows us to navigate around a tree visually, more easily than flipping through printed pages, each showing a segment of a tree.

Fault tree diagrams represent logical relationships between child and parent events with symbols (logic gates) indicating the relevant Boolean function. These are usually AND or OR relationships, but can include other logical relationships (more on which later). Note that symbols for logic gates other than AND and OR vary across industries. Also note that we typically give initiators symbols as well (diamonds and circles above), just as visual cues. They serve no Boolean function but show that the event has no children and is therefore an initiator, of one of several varieties.

Relation to PRA

Fault tree analysis is a form of probabilistic risk analysis. If you understand “probabilistic” to require numerical probability values, then FTA is only a form of probabilistic risk analysis if all the fault tree initiators’ probabilities are known or are calculable (a quantitative rather than qualitative tree). To avoid confusion, note that in many circles, the term “probabilistic risk analysis” and the acronym PRA are used only to describe methods of inference from subjective probability judgments and usually Bayesian belief systems, as promoted by George Apostolakis of UCLA in the 80s and 90s. This is not fault tree analysis, but there is some overlap. For example, NASA’s guide Bayesian Inference for NASA Probabilistic Risk and Reliability Analysis shows how to use PRA (in the Bayesian-inference sense) to populate a fault tree’s initiator events with progressively updated subjective probability values.

Deductive Failure Analysis?

Wikipedia’s entry on fault tree analysis starts with the claim that it is a “deductive failure analysis.” Setting aside the above-mentioned difference between faults and failures, there’s the matter of what makes an analysis deductive. I’m pretty sure this claim (and the claim that FMEAs are inductive) originated with Bill Vesely at NASA in the 1980s. Bill’s a very sharp guy who greatly advanced the state of fault tree art; but fault trees are not deductive in any way that a mathematician or philosopher uses that term.

Deduction is reaching a conclusion about an instance based on an assumption, premise, or postulate about the population containing that instance. Example: if all men are mortal and Socrates is a man, then Socrates is mortal. This is deduction. It uses a statement in universal-conditional form (“if all A is B”) and a fact about the world (“observed X is an A”).

Induction, technically speaking, is the belief or assumption that unobserved cases will resemble observed cases. This is the basis for science, where we expect, all other things being equal, that the future will resemble the past.

Fault tree analysis relies equally on both these forms of reasoning, assuming, for the sake of argument, that induction is reasoning, a matter famously contested by David Hume. The reduction of a fault tree into its cut sets includes deductions using the postulates of Boolean algebra (more on which soon). The rest of the analysis relies on the assumption that future failures (initiator events) will resemble past ones in their frequency (a probability distribution) and that these frequencies can be measured as facts about the world. Initiator probabilities derived from Bayesian inferences involve yet another form of reasoning – diachronic probabilistic coherence (more on which maybe someday). In any case, past students of mine have gotten hung up on whether fault trees and FMEAs are deductive and inductive. This pursuit, which either decays into semantics or falls down a philosophical pit, adds nothing to understanding fault trees or their solutions.

One final piece of philosophical baggage to discard stems from the matter of the explanatory power of fault trees. Fault trees’ concern with causes is limited. They don’t care much about “why” in the deep sense of the term. Despite reliance on logic, they are empiricist in nature. Fault trees explain by showing “how” the top event can happen, not “why” it will or did happen.

