*Fault Tree Friday post 5* *( 1 2 3 4 )*

In the first post on fault trees I used the term cut sets to refer to any combination of fault tree initiators that can produce the fault tree’s top event. There may be many – sometimes very many – cut sets in the complete collection of cut sets for a tree. The probability of the top event is, roughly speaking in most cases, the sum of the probabilities of each cut set. The probability of each cut set is, in most cases, the product of the probabilities of each initiator in that set. The previous two sentences describe how things are most of the time. The exceptional cases – when they exist, are important.

Before dealing with the exceptional cases though, let’s look at the cut sets of some simple fault trees. In the second post of this series I showed two logically equivalent trees (repeated below) noting that, in real fault tree analysis, we use the lower rendering. The top one is useful for educational purposes, since it emphasizes gate logic. In this example, there are three cut sets:

- Set 1: Event 1
- Set 2: Event 2
- Set 3: Events 3, 4, and 5, together

If the initiator probability of all events were 0.5 (an unlikely value for any real world initiator event, chosen here just to make a point), the probability values of each cut set would be the product of the probability values in each set (two of which have only one event).

- Set 1: P = 0.5
- Set 2: P = 0.5
- Set 3: P = 0.125

Earlier I said that the top event probability in most, but not all, fault trees roughly equals the sum of the probabilities of all the cut sets. In this case that sum would be 1.125. We know that can’t be right since a probability cannot exceed 1.0.

The problem in the example above stems from using a shortcut form of the calculation for the probability of the union of sets – in this case the union of all cut sets of a fault tree. The accurate form of the solution for the probability of a union of the three cut sets (1, 2, and 3) above would be:

P(1,2,3) = P(1) + P(2) + P(3) – P(1) * P(2) – P(1) * P(3) – P(2 * P(3) + P(1) * P(2) * P(3).

Or in set notation:

The generalized form of the equation for the top event probability requires more math than we need to cover here. The Wikipedia entry on inclusion-exclusion principle covers is a good reference for those needing details. A rough summary is that we subtract the probabilities of each combination of even numbers of cut sets and add the probabilities of each combination of odd numbers of cut sets. The resulting equation for the top event probability of a tree modeling faults in a complex system can have billions of terms. No problem for modern computers.

In one sense the “solution” to a fault tree is simply a set of combinations of initiator events. This means any fault tree can be reorganized into a tree of only three levels. In such a reorganized tree, all top events are associated with an OR gate, and all children on that OR gate are associated with AND gates. That is, you can say that any fault tree can be reduced to an OR of many ANDs. At least that would be the case if no single failure were allowed, e.g., by design, to produce the top event. If single initiator events can lead to the top event, we could say such fault trees could be reduced to an OR of either ANDs (i.e. multiple-event cut sets) or single-event cut sets (allowing that a set can have one element). That is the case for the tree in the above example, which can be rearranged to look like this:

Perhaps a better example of rearrangement of a fault tree into and OR of many ANDs is the tree below. Note that the AND gate in the rendering below the green line has six child events. The single black vertical line leading from the bottom of the gate joins six branches. Using this drawing technique prevents the clutter that would exist if we attempted to draw six separate parallel vertical lines into the bottom of the AND gate.

The trees above and below the green line are logically equivalent. Presumably, the design of the system being modeled would lead an analyst to draw a tree of the top form in a structured top-down analysis. The bottom tree shows how we get cut sets, of which there are six for this tree:

- 1, 3, 4
- 1, 3, 5
- 1, 3, 6
- 2, 3, 4
- 2, 3, 5
- 2, 3, 6

We can now calculate the top event probability as follows:

P = P(1,3,4) + P(1,3,5) + P(1,3,6) … – P(1,3,4) * P(1,3,5) – P(1,3,4) * P(1,3,6) …

+ P(1,3,4) * P(1,3,5) * P(1,3,6) + P(1,3,4) * P(1,3,5) * P(2,3,5) …, etc.

where P(1,2,3) equals P(1) * P(2) * P(3), because initiator events 1, 2, and 3 are required to be truly independent of each other for the fault tree to be valid. Fault tree software handles the tedious and error-prone job of calculating the top event probability and the probabilities of all intermediate events.

While the rules of fault tree construction require all initiator events to be truly independent of each other, nothing says the same initiator event cannot appear in multiple places in a tree. In fact, for redundant systems, this happens often.

We need to give this situation special attention. Its ramifications for system design are important. Consider this simple fault tree on the left below, noting that the same basic event ( event A) appears in two branches.

On first glance, one might conclude that the top event probability for this tree would be 0.1 * 0.1 = 0.01, since the top event is an AND gate, and AND nominally indicates multiplication. But the tree is modeling a real-world phenomenon in which if event A happens, it happens regardless of our symbolic representation. Or, from an analytic standpoint, we can refer to the rules of Boolean algebra. A AND A is simply A (from a rule known as *idempotence of conjunction*).

So the collection of cut sets for this simple tree contains only one cut set; and that cut set consists of a single event, A. Since the probability of A is 0.1, the probability of T, the top event, is also 0.1. Both common sense and Boolean algebra reach the same conclusion. Complex fault trees can vastly exceed the grasp of our common sense, and “A AND A” cases can be concealed in complex trees. Software that applies the rules of Boolean algebra saves the day.

Idempotency of disjunction (image on right, above) similarly leads us to conclude that replacing the above tree’s AND gate with an OR gate (fault tree on right, above) yields the same cut set collection – a single set having a single event, A.

In addition to idempotency effects, non-trivial real-world fault trees will also likely have a cut set collection where one cut set consists of a subset of the events of another cut set. This has the sometimes unintuitive consequence of eliminating the cut set with the larger number of terms. Consider this tree:

One view of its solution is that two cut sets exist:

1.) A

2.) A, B

But A OR (A AND B) equals simply A. That is, disjunctive absorption removes cut set 2 from the cut set collection. In a real-world fault tree, complexity may make cases of disjunctive absorption much less obvious; and they often point to areas of ineffective application of design redundancy.

Conjunctive absorption has the same effect in the above tree. A preliminary account of its cut sets would be:

1.) A, A

2.) A, B

Idempotency reduces cut set 1 to A alone, and disjunctive absorption eliminates cut set 2. Thus, conjunctive absorption can be derived from disjunctive absorption plus idempotency. In short form, A AND (A OR B) equals simply A.

The simple fault trees above would never occur in the real world. But logically equivalent conditions do appear in real-world trees. They may not be obvious from inspecting the fault tree diagram; but they become apparent on viewing the cut set collection. A cut set collection from which all supersets have been eliminated (i.e., absorption has been applied) is often called a minimal cut set collection.

Back in the dark ages when computers struggled with fault tree calculations, analysts would use a so-called gate-by-gate (i.e. bottom-up) method to calculate a top event probability by hand. The danger of doing this, if a tree conceals cases where idempotency and absorption are relevant, is immense. Given that real-world initiator probabilities are usually small numbers, a grossly wrong result can stem from effectively squaring – without being justified in doing so – an initiator probability. I mention this only because some textbooks published this century (e.g. one by the Center for Chemical Process Safety) still describe this manual approach – a risky way of dealing with risk.

A more important aspect of the concepts covered here is the problem stemming from a fault tree that does not model reality well. Consider, for example, a fault tree with 10,000 cut sets. Imagine the 1000th most probable cut set, based on minimal cut set analysis, has a probability of one per trillion (1E-12) and contains events A, B, and C.

Imagine further that events A and C each have probabilities of 1E-5. If A and C turn out to be actually the same real-world event – or they result from the same failure, or are in some other way causally correlated – that cut set then reduces to events A and B, having a cut set probability of 1E-7. This probability may be greater than all others in the collection, moving it to the top of an ordered list of cut sets, and possibly into the range of an unacceptably likely top event.

In other words, fault trees that miss common-mode failures are dangerous. Classic cases of this include:

- Redundant check-valves all installed backwards after maintenance
- Uncontained engine failure drains all three aircraft hydraulic systems
- Building fire destroys RAID array and on-premise backup drives
- Earthquake knocks out electrical power, tsunami destroys backup generators

Environmental factors like flood, fire, lightning, temperature, epidemic, terrorism, sabotage, war, electromagnetic pulse, collision, corrosion, and collusion must enter almost any risk analysis. They can do so through functional hazard analysis (FHA), failure mode effects analysis (FMEA), zonal analysis (ZSA) and other routes. Inspection of fault trees to challenge independence of initiator events requires subject matter expertise. This is yet another reason that fault trees belong to system and process design as much as they belong to post-design reliability analysis.

– – – –

Are you in the San Francisco Bay area?

If so, consider joining the *Risk Management meetup group*.

Risk management has evolved separately in various industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase current trends, case studies, and best practices in our profession with a focus on practical application and advancing the state of the art.

FYI … if you are interested in other work, please contact.

Woody Epstein, formerly of PLG

LikeLike