# The Magic of Redundancy

Fault Tree Friday – Week 4

In post 1 of this series I stressed that fault trees were useful in the preliminary design of systems, when alternative designs are being compared (design trade studies). I argued that fault trees are the only rational means of allocating redundancy in complex systems. In that post I used the example of a crude brake system for a car. It consists of a brake pedal, a brake valve with two cylinders each supplying pressure to two brakes, and the hydraulic lines and brake hardware (drums, calipers, etc.). I’ll use that simplified design (it has no reservoir or other essentials yet) again here.

We’ll assume that pressure sent to either the front wheels alone or the rear wheels alone delivers enough braking force that the car stops normally. Note that this also means the driver would, using brakes of this design, not know that only two of four brakes were operational.

For the sake of simplicity we’ll model the brake system as having only two fault states: front brakes unable to brake, and rear brakes similarly incapacitated. While we would normally not model the collection of all failure modes resulting in loss of braking capability to the front brakes as a basic event (initiator), we’ll do so here to emphasize some aspects of redundant system design.

If we imagine a driving time of one hour, and assign a failure rate of one per thousand hours (1E-3/hr) each to the composite front and rear brake system failures, our fault tree might look like this:
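As a rough check on the numbers, here is a minimal sketch of that naive calculation. It assumes the tree is a single AND gate over two independent basic events and uses the exponential model, P = 1 - exp(-rate × time). The function and variable names are illustrative only and do not come from any fault tree tool.

```python
import math

def failure_probability(rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of at least one failure during the exposure time (exponential model)."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

BRAKE_FAILURE_RATE = 1e-3  # per hour, from the text
DRIVE_HOURS = 1.0          # one-hour drive, from the text

p_front = failure_probability(BRAKE_FAILURE_RATE, DRIVE_HOURS)
p_rear = failure_probability(BRAKE_FAILURE_RATE, DRIVE_HOURS)

# AND gate over independent events: both must fail during the same drive.
p_naive_top = p_front * p_rear

print(f"P(front fails during drive) = {p_front:.3e}")
print(f"P(naive top event)          = {p_naive_top:.3e}")
```

Under these assumptions the naive model puts total loss of braking at about one in a million per drive.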

If we modeled the total loss of braking using the above fault tree, we’d be dead wrong. Since the system, as designed here, is fully redundant (in terms of braking power), many failures of either front or rear brakes could go unnoticed almost indefinitely. To make a point, let’s say the maintenance events that would eventually detect this condition occur every two years.

To correct the fault tree (leaving the system design alone for now), we’d start with something like what appears below. Note that with the specified failure rates, the probability of either the front or rear brakes being in a failed state at the beginning of a drive is roughly equal to one, using the standard calculus of probability given an exposure time and a failure rate based on an exponential distribution. This means that, as modeled, the redundancy is essentially useless. We call the failure that goes unnoticed for a long time period a latent failure.
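The "roughly equal to one" claim is easy to verify. The sketch below assumes that two years means 17,520 calendar hours (the text does not say whether exposure is counted in calendar or operating hours) and that the tree is an AND of a latent rear brake failure and a front brake failure during the drive. All names are illustrative.

```python
import math

def failure_probability(rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of at least one failure during the exposure time (exponential model)."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

BRAKE_FAILURE_RATE = 1e-3             # per hour, from the text
DRIVE_HOURS = 1.0                     # one-hour drive, from the text
LATENT_EXPOSURE_HOURS = 2 * 365 * 24  # assumption: two calendar years = 17,520 hr

# Latent event: rear brakes already failed, unnoticed, when the drive begins.
p_rear_latent = failure_probability(BRAKE_FAILURE_RATE, LATENT_EXPOSURE_HOURS)
# Active event: front brakes fail during the drive itself.
p_front_active = failure_probability(BRAKE_FAILURE_RATE, DRIVE_HOURS)

p_top = p_rear_latent * p_front_active

print(f"P(rear brakes latent-failed) = {p_rear_latent:.8f}")
print(f"P(top event)                 = {p_top:.3e}")
```

With the latent probability so close to one, the top event is governed almost entirely by the single active failure, which is why the redundancy buys nothing here.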

That fault tree isn’t quite correct either. It models the rear brakes as having failed first – a latent failure that silently awaits failure of the front brakes. But in an equally likely scenario, the front brakes could fail first. That means either of two similar scenarios could occur. In one (as shown above) the loss of braking in the rear brakes goes unnoticed for a long time period, combining with a suddenly apparent failure of the front brakes. In the other scenario, the front brake failure is latent. A corrected version appears below. Note that the effect of this correction is to double the top event probability, making our design look even worse and making redundancy seem not so magic.
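The doubling can be sketched the same way. This example assumes an OR gate over the two scenarios and adds their probabilities, which is the usual rare-event approximation and slightly overstates the result because the scenarios overlap. The 17,520-hour latent exposure is an assumption (two calendar years), and the helper names are hypothetical.

```python
import math

def failure_probability(rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of at least one failure during the exposure time (exponential model)."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

def scenario_probability(rate: float, latent_hours: float, drive_hours: float) -> float:
    """One brake pair fails latently, then the other fails during the drive (AND gate)."""
    p_latent = failure_probability(rate, latent_hours)
    p_active = failure_probability(rate, drive_hours)
    return p_latent * p_active

BRAKE_FAILURE_RATE = 1e-3             # per hour, from the text
DRIVE_HOURS = 1.0                     # one-hour drive, from the text
LATENT_EXPOSURE_HOURS = 2 * 365 * 24  # assumption: two calendar years

p_rear_latent_first = scenario_probability(BRAKE_FAILURE_RATE, LATENT_EXPOSURE_HOURS, DRIVE_HOURS)
p_front_latent_first = scenario_probability(BRAKE_FAILURE_RATE, LATENT_EXPOSURE_HOURS, DRIVE_HOURS)

# OR gate, rare-event approximation: sum the two scenario probabilities.
p_top = p_rear_latent_first + p_front_latent_first

print(f"P(one scenario) = {p_rear_latent_first:.3e}")
print(f"P(top event)    = {p_top:.3e}")
```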

The fault tree tells us that our brake system design isn’t very good. We need to reduce the probability of the latent failures, thereby getting some value from the redundancy. We can do this by adding a pressure sensor to detect loss of pressure to front or rear brakes, along with an indicator on the dashboard to report that this failure was detected. To keep this example simple, assume we can use the same sensor for both systems.

The astute designer will likely see where this is heading. Failure of the pressure sensor to detect low pressure is now a latent failure. And so is failure of the indicator in the dashboard. Again for simplicity of example, we’ll model these as a single unit; but failure of that unit to tell the driver that either the front or the rear brakes are incapacitated is a latent failure that could go undetected for years. The resulting fault tree would look like this:

Although their failures are latent, monitoring subsystems are usually more reliable than the things they monitor. I’ve shown that in this example by using a failure rate of one per 100,000 hrs for the monitor equipment. With an exposure time of two years, the probability that the monitor is in a failed state during a drive is 0.16. Consequently, adding all the monitoring equipment only changed our top event probability from 2E-3 to 3.2E-4. It’s better, but not impressively so. And we added cost to the car and more things that can fail.
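The 0.16 and 3.2E-4 figures can be reproduced with a short sketch. Two assumptions are worth flagging: the 0.16 value corresponds to a monitor failure rate of 1E-5 per hour over 17,520 calendar hours, and each branch is taken to be an AND of the latent brake failure, the latent monitor failure, and the active failure of the other brake pair. Names are illustrative.

```python
import math

def failure_probability(rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of at least one failure during the exposure time (exponential model)."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

BRAKE_FAILURE_RATE = 1e-3              # per hour, from the text
MONITOR_FAILURE_RATE = 1e-5            # per hour, assumed: the rate that yields 0.16
DRIVE_HOURS = 1.0                      # one-hour drive, from the text
MONITOR_EXPOSURE_HOURS = 2 * 365 * 24  # assumption: two calendar years

p_monitor_failed = failure_probability(MONITOR_FAILURE_RATE, MONITOR_EXPOSURE_HOURS)
# The brake's latent exposure is taken equal to the monitor's exposure.
p_brake_latent = failure_probability(BRAKE_FAILURE_RATE, MONITOR_EXPOSURE_HOURS)
p_brake_active = failure_probability(BRAKE_FAILURE_RATE, DRIVE_HOURS)

# One branch: undetected latent failure of one pair AND active failure of the other.
p_branch = p_brake_latent * p_monitor_failed * p_brake_active
# Two symmetric branches under an OR gate (rare-event approximation).
p_top = 2 * p_branch

print(f"P(monitor failed) = {p_monitor_failed:.3f}")
print(f"P(top event)      = {p_top:.2e}")
```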

It’s important to realize that, at the bottom of the above fault tree segment, the exposure time to loss of braking in both rear brakes cannot be shorter than the exposure time to failure of its monitor. That is almost always the case for monitored components in any redundant design.

We could reduce the top event’s likelihood by shortening the system’s maintenance intervals. Checking the monitoring system every two months, instead of every two years, would get us roughly another factor of twelve.
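A quick sketch shows how the inspection interval drives the monitor's latent failure probability. It assumes a monitor failure rate of 1E-5 per hour (the rate consistent with the 0.16 figure) and counts exposure in calendar hours; both are assumptions, and the names are illustrative.

```python
import math

def failure_probability(rate_per_hour: float, exposure_hours: float) -> float:
    """Probability of at least one failure during the exposure time (exponential model)."""
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

MONITOR_FAILURE_RATE = 1e-5  # per hour, assumed
HOURS_PER_YEAR = 365 * 24    # calendar hours, assumed

two_years = 2 * HOURS_PER_YEAR
two_months = two_years / 12

p_two_years = failure_probability(MONITOR_FAILURE_RATE, two_years)
p_two_months = failure_probability(MONITOR_FAILURE_RATE, two_months)
improvement = p_two_years / p_two_months

print(f"P(monitor failed), two-year interval:  {p_two_years:.4f}")
print(f"P(monitor failed), two-month interval: {p_two_months:.4f}")
print(f"Improvement factor: {improvement:.1f}")
```

Under these assumptions the factor comes out a little over eleven rather than exactly twelve, because the exponential is no longer linear at a two-year exposure.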

But such an inspection need not verify that each element of the monitoring subsystem is separately functional. We need only verify whether the subsystem, taken as a group, is capable of detecting a failure. For now, if it has failed, we aren’t really concerned with what went wrong in particular. We can test the monitor by inducing it to test for pressure when none exists. If the monitor reports the condition as a pressure failure, it is good.

We don’t need an auto mechanic for this. The driver could run the test on startup. We’ll design the monitor to test for no pressure as the car starts, when no brakes are applied. We call on the driver to verify that the warning indicator illuminates.

Two failure modes of the monitored brake system – which now includes an operator – should now be apparent. An unlikely failure mode is that the indicator somehow illuminates (“fails high”) during the test sequence, indicating no pressure at startup, but later fails to illuminate when pressure actually fails to reach the brakes.

A far more likely fault state is operator error. This condition exists when the operator fails to notice that the indicator does not illuminate during the startup test sequence. This fault does not involve exposure time at all. The operator either remembers or forgets. For an untrained operator, the error probability in such a situation may be 100%. A trained operator will make this error somewhere between one and ten percent of the time, especially when distracted. Two operators who monitor each other will do better. Two operators with a checklist will do better still.

With the startup-check procedures in place, our system no longer has any latent failures. It has high-probability error states, but they have the advantage of only contributing to the top event in conjunction with another failure. The resulting fault tree makes us feel better about driving:

While the system we’ve modeled here is not representative of modern brake systems (and my example has other shortcuts that would be deemed foul in a real analysis), this example shows how fault trees can be used in preliminary system design to make better system-design choices. It really only begins to make that point. In a quad-redundant flight control system, such analysis has far more impact. This sort of modeling can also reveal weak spots in driverless cars, chemical batch processes, uranium refinement, complex surgery procedures, cyber-security, synthetic biology, and operations where human checks and balances (a form of redundancy) are important.

In the above fault trees, two similar branches appear, one each for front and rear brakes. The real-world components and their failures represented in each branch are physically distinct. In the final version, immediately above, indicator failure and operator error occur in both branches. Unlike the other initiator events, these events logically belong to both branches, but each represents the same real-world event. The operator’s failure to verify the function of the pressure warning indicator, for example, if it occurred, would occur in both branches simultaneously. The event IDs (e.g., “OE1” above) remind us that this is the case. Appearance of the same event in multiple branches of a tree can profoundly impact the top event probability in a way that isn’t obvious unless you’re familiar with Boolean algebra. We’ll go there next time.