Last week I showed how to begin building a fault tree from the top down, explained a tree’s structure, and looked at intermediate events and their associated logic gates. Now we can briefly cover initiators, the events at the bottom of the tree.
In most fault trees, most initiators get their probability values from the failure rate of a component related to the specific failure mode in question, and from the time window during which the mission or operation is exposed to that failure. For example, resistors fail open more often than they fail short, and the consequences of those two failure modes are often very different.
Some missions are exposed to certain failures for a fixed number of times regardless of the length of the mission. Aircraft are exposed to aborted takeoffs commanded by the control tower at some historical rate. This has nothing to do with the duration of a flight. Initial Public Offerings occur once in a firm’s lifetime, and occasionally fail miserably at a low historical rate, often with no chance of recovery. If modeling these events, you might infer their probabilities from the known cases in the total population.
Most hardware failures and many human errors are modeled as occurring at a fixed rate over the duration of a mission or life of a project. This assumes, in the case of mechanical or electrical equipment, that infant-mortality cases have been removed from the population by a burn-in process, as is often applied to 100% of semiconductors in critical applications. Likewise, wear-out failures are prevented in critical applications by maintenance, non-destructive testing, and replacement of finite-life structural components.
Between the extremes of infant mortality and wear-out – during the “useful-life” period – most equipment fails at a roughly constant rate. During that normal-life period, the probability of failure during a mission (e.g., a flight, reactor-time in a chemical batch process, or the time between scheduled maintenance in power-generation) is a simple function of the exposure time and a historical failure rate.
The model for this is commonly called an exponential failure distribution. The probability P of a failure in time interval T for a component having a failure rate R (where R equals the reciprocal of mean time between failures, MTBF) is given by:
P = 1 - e^(-RT)
where “e” is Euler’s Number, the number having a natural log of 1.
As a historical note, when the product R * T is small, you can approximate P as P = RT and leave your slide rule in the desk. Otherwise, fault tree software will let you supply values for R and T and will calculate P for each initiator event where you didn’t assign P directly. As another historical note, events for which P is supplied directly are sometimes called “undeveloped events” and those given R and T values are often called “basic events.” “Undeveloped” partly stems from an old practice of breaking trees into chunks to ease computation (the top of one tree supplying the probability to an “undeveloped” (developed elsewhere) event). Try to avoid this; it risks grave computational errors, for reasons we’ll cover later. I’ll call any event that doesn’t have kids an initiator.
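To see how good the P = RT shortcut is, here is a minimal Python sketch comparing it with the exponential model, P = 1 - e^(-RT). The function names and the sample exposure times are my own illustrative choices, not values from any particular fault tree tool; the 1E-6/hr rate is just a convenient round number.

```python
import math

def failure_probability(rate_per_hour: float, exposure_hours: float) -> float:
    """Exponential model: P = 1 - e^(-R*T)."""
    # -expm1(-x) equals 1 - e^(-x) but keeps precision when x is very small.
    return -math.expm1(-rate_per_hour * exposure_hours)

def failure_probability_approx(rate_per_hour: float, exposure_hours: float) -> float:
    """Slide-rule shortcut: P ~= R*T, reasonable only when R*T is small."""
    return rate_per_hour * exposure_hours

# Hypothetical initiator: R = 1E-6 per hour, with three illustrative exposure times.
RATE = 1e-6
for hours in (10.0, 1_000.0, 1_000_000.0):
    exact = failure_probability(RATE, hours)
    approx = failure_probability_approx(RATE, hours)
    print(f"T = {hours:>11,.0f} hr  exact = {exact:.6e}  approx = {approx:.6e}")
```

For short exposures the two agree to several digits. Once R * T approaches 1 the shortcut overstates P badly, and past that point it returns a "probability" greater than 1.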
You might be wondering where failure rates come from. Good question. Sources include GIDEP, IEEE 500, Backblaze hard drive data, USAF Rome Laboratory, MIL-HDBK-217F, and RIAC. And those who write procurement specs should require vendors to supply detailed data of this sort.
Above I used the example of different failure rates for the different failure modes of a resistor. Since the effort of building fault trees is usually only justified for catastrophic fault states (hazards), you’re unlikely to see a resistor failure appear as an initiator. The top-down development of a tree need only descend to a point dictated by logic and availability of historical failure-rate data. So initiators might specify a failure rate for electronics at the “box” (component) level or perhaps the circuit-board level when a box contains redundant boards, as would be the case for an auto-land controller in my aircraft braking example.
The Human Factor
For fault-tree purposes, human errors are modeled as faults. Error is generally modeled as the probability of a mistake or omission per relevant action. This typically enters fault trees as events involving maintenance errors and primary operator errors – like pilots for aircraft and chemists for batch processes. Fault trees point out the needs for operator redundancy and for monitors.
Monitors might take the form of more humans, i.e., inspectors or copilots. Or monitors might take the form of machines that watch the output of human operators or machines that check the output of other machines.
One thing history has taught us – in general, humans make poor monitors, particularly when called upon to ensure that a machine is working properly. Picture Homer Simpson, eyes glazed over, while the needle is in the red.
Bored humans do poor work (reminder: blog post on human capital risk), and scared humans are even worse. This is particularly relevant for critical operations in degraded systems where skill is required – think fighter pilots and deep divers.
While diodes, motors, pumps, valves and surprisingly many other things fail at a rate of about one per million hours (1E-6/hr) – and RAM is better still – humans mess up surprisingly often. For example:
- Omission of step in batch process by skilled operator: 0.003 to 0.005
- Arithmetic error in single simple calculation: 0.03
- Human monitor (inspector) doesn’t catch error: 0.03
- Omission of step during stressful emergency procedure: 0.1
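As a rough illustration of what a human monitor buys you, the sketch below feeds two of the values above into an AND gate: the error occurs, and the inspector misses it. It assumes the two events are independent, which is generous for real people working side by side. The function and variable names are mine.

```python
import math

def and_gate(*probabilities: float) -> float:
    """AND gate: all inputs must occur. Assumes the inputs are independent."""
    return math.prod(probabilities)

# Per-action values taken from the list above.
P_ARITHMETIC_ERROR = 0.03
P_INSPECTOR_MISSES = 0.03

p_undetected_error = and_gate(P_ARITHMETIC_ERROR, P_INSPECTOR_MISSES)
print(f"Unchecked calculation error:   {P_ARITHMETIC_ERROR:.4f}")
print(f"Error that slips past monitor: {p_undetected_error:.4f}")
```

Even with the inspector, the undetected error lands near 9 in 10,000 per action, still far worse than a 1E-6/hr hardware item over a mission of a few hours.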
Commercial aircraft operation uses redundant pilots and redundancy in critical procedures to deal with errors of omission. Knowledge of error criticality does little to prevent critical errors. For example, failure to deploy flaps for takeoff – about as critical an error as can be imagined – resulted in several crashes, including Delta 1141 in 1988, despite redundancy and checklists. Non-human monitoring of humans’ configuration of control surfaces is a better approach.
Fault trees are great for helping us allocate redundancy in system design. To get this right, we need to take a close look at the failure rates and exposure times supplied to initiator events in redundant designs. I intended to cover that today, but this post is already pushing the limits. I’ll get to it next time.
In the meantime, consider two other aspects of redundancy, failure rates, exposure time, and probability. First, imagine two components in parallel, each having P = .01. This arrangement may cost much less than one component having P = .0001. Then again, two components in parallel will weigh more than twice the weight of one, and will take up at least twice the space.
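The arithmetic behind that comparison is a two-input AND gate. Here is a minimal sketch, assuming the two components fail independently (the next paragraph questions exactly that assumption); the function name is my own.

```python
import math

def parallel_failure_probability(component_probs: list[float]) -> float:
    """AND gate: a redundant set fails only if every member fails.

    Assumes the member failures are independent of one another.
    """
    return math.prod(component_probs)

# Two components in parallel, each with P = .01 as in the text.
p_pair = parallel_failure_probability([0.01, 0.01])

# The single, better component the pair is being compared against.
p_single = 0.0001

print(f"Redundant pair:   {p_pair:.6f}")
print(f"Single component: {p_single:.6f}")
```

On paper, and only under the independence assumption, the cheap pair matches the expensive single unit.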
Second, consider the fault-tree ramifications of choosing the two-in-parallel design in the above example. Are there any possible common-mode (or common-cause) failures of both the components? What happens if one explodes and takes out the other? What happens if the repair crew installs both of them backwards after scheduled maintenance?
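One simple way to see why these questions matter is to add a single common-cause event that defeats both components, combined through an OR gate with the independent pair. In the sketch below the common-cause probability of .001 is purely hypothetical, chosen only to show the effect; the function names are mine, and the gates assume their inputs are independent of each other.

```python
import math

def and_gate(probabilities: list[float]) -> float:
    """AND gate: all independent inputs must occur."""
    return math.prod(probabilities)

def or_gate(probabilities: list[float]) -> float:
    """OR gate: at least one independent input occurs."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)

P_COMPONENT = 0.01      # each parallel component, from the example above
P_COMMON_CAUSE = 0.001  # HYPOTHETICAL: e.g., both units installed backwards

p_independent_pair = and_gate([P_COMPONENT, P_COMPONENT])
p_with_common_cause = or_gate([p_independent_pair, P_COMMON_CAUSE])

print(f"Pair, independent failures only: {p_independent_pair:.7f}")
print(f"Pair, with common-cause event:   {p_with_common_cause:.7f}")
```

With these made-up numbers the common-cause event dominates the result: the redundant pair ends up roughly ten times worse than the independent calculation suggests.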