*William Storage – Jan 3, 2017*

*VP, LiveSky, Inc.
Visiting Scholar, UC Berkeley Center for Science, Technology, Medicine & Society*

Our intuitions about risk and probability are usually poor, despite our resolute belief that we judge risk well. In their new book, *How to Measure Anything in Cybersecurity Risk*, Hubbard and Seiersen challenge their industry’s dogged reliance on ineffective methods to assess and manage cybersecurity risk – despite a lack of evidence that these methods have any value at all. Indeed, as Tony Cox and others have noted, we have evidence that in some cases they are worse than useless; they do harm. In Hubbard and Seiersen’s view, cybersecurity has strayed from established, sound engineering risk methods and adopted the worst risk practices of project management and ERM. Poorly designed risk programs combined with increased exposure to cyber threats set the stage for huge potential exposures, which might make the JP Morgan and Target attacks look tiny.

Ten years ago I bought Hubbard’s *How to Measure Anything: Finding the Value of Intangibles in Business* and found his pragmatic approach refreshing. The new book continues along that path, highly accessible and free of business jargon. An empiricist, Hubbard subscribes to Rear Admiral Meyer’s policy of *build a little; test a little; learn a lot*. Hubbard’s “Rule of Five” is a great example of an underappreciated practical tool. Take a random sample of five from any population. There is a 93.75% chance [1 - 2 × (1/2)^5] that the median of the entire population is between the maximum and minimum values of your sample, because the only ways to miss are for all five values to fall above the median or all five to fall below it, each with probability (1/2)^5. This assumes a continuous probability distribution, but does not require any other knowledge about the distribution. Cool stuff.
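The 93.75% figure is easy to check by simulation. The sketch below is illustrative and not taken from the book; the skewed lognormal population and the function name `rule_of_five_hit_rate` are assumptions chosen only to show that the result does not depend on the shape of the distribution.

```python
import random
import statistics

def rule_of_five_hit_rate(population, trials=20_000, seed=1):
    """Fraction of random 5-item samples whose min..max range brackets the population median."""
    rng = random.Random(seed)
    median = statistics.median(population)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, 5)  # five items drawn without replacement
        if min(sample) < median < max(sample):
            hits += 1
    return hits / trials

# Hypothetical, deliberately skewed population (assumption for illustration).
_rng = random.Random(0)
population = [_rng.lognormvariate(0, 1.5) for _ in range(100_000)]

exact = 1 - 2 * (1 / 2) ** 5  # 0.9375
estimate = rule_of_five_hit_rate(population)
print(f"exact: {exact:.4f}  simulated: {estimate:.4f}")
```

The simulated hit rate should land close to the exact value regardless of which continuous distribution generates the population.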

The authors fight nobly against ordinal scoring of risks, heat maps, RPN, and bad justification of soft analyses. Noting that ISO 31000 says the risk map is “strongly applicable for risk identification,” the authors argue that ISO gives zero evidence for strong applicability and that much evidence to the contrary is available. They give clear examples of Tony Cox’s point that risk matrices are ambiguity amplifiers.

They also touch on some of the psychological issues of scoring. They cite Cox’s work showing that arbitrarily reversing impact and probability scales (i.e., making “1” stand for high rather than low probability) in study groups changed the outcome of risk matrix exercises. That is, workers repeatably reach different conclusions about risk as a consequence of the scoring scheme used. I think I can even top that. I’ve seen cases in project management work where RPN was calculated using 10 = high for probability, 10 = high for severity, and 10 = high for detectability. See the problem here? If RPN is used to generate a ranked list of risks, severity (undesirable) and detectability (presumably desirable) both increase the value of RPN. Yes, I’m serious; and I have presenter’s slides from a respected risk consultancy to prove it. Users of that RPN-centric framework are so removed from the fundamentals of risk analysis that they failed to notice it was producing nonsense results.
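To make the detectability problem concrete, here is a minimal sketch with two hypothetical risks that differ only in how easy they are to detect. The risk names, scores, and function names are invented for illustration. The second function emulates the usual FMEA convention, in which a detection score of 10 means hardest to detect, by inverting the flawed scale. This isolates the sign error only; it does nothing to rescue RPN from the objections to ordinal scoring.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    probability: int    # 1-10, 10 = most likely
    severity: int       # 1-10, 10 = most severe
    detectability: int  # 1-10, 10 = EASIEST to detect (the flawed convention)

def rpn_flawed(r: Risk) -> int:
    # Easy detection raises the score, so well-monitored risks rank as worse.
    return r.probability * r.severity * r.detectability

def rpn_conventional(r: Risk) -> int:
    # Invert the scale so that 10 = hardest to detect, as in usual FMEA practice.
    return r.probability * r.severity * (11 - r.detectability)

# Hypothetical risks, identical except for detectability.
risks = [
    Risk("hidden failure", 5, 8, 2),
    Risk("obvious failure", 5, 8, 9),
]

flawed_top = max(risks, key=rpn_flawed).name
conventional_top = max(risks, key=rpn_conventional).name
print("flawed scale ranks highest:      ", flawed_top)
print("conventional scale ranks highest:", conventional_top)
```

With the flawed scale the failure that everyone can see outranks the one nobody can, which is exactly backwards for prioritization.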

*Cybersecurity Risk* says it is the first in a series. It serves as a great intro for those unfamiliar with risk quantification. I strongly recommend it to anyone needing to ramp up cyber risk analysis and those who need to unlearn the rubbish spread by PMI and some of the standards bodies.

I have a few quibbles. One deals with epistemology and semantics, several with methodology and approach. On the former, consider this statement from Chapter 2: *“The definition of measurement itself is widely misunderstood. If one understands what ‘measurement’ actually means, a lot more things become measurable.”*

How can a definition, in the lexicographical sense, be misunderstood? I don’t think it can. The authors can claim to have a more useful definition, and can argue that its common use would better serve our needs. But words mean what their users intend them to mean; and it’s tough to argue that most people use a term wrong.

What Hubbard and Seiersen mean by measurement is different from what most people mean by it; and appealing to a “correct” definition will not persuade opponents that quantified estimates by humans have no important differences from measurements of the physical world performed with instruments.

I’m in full agreement that quantification of degrees of belief is useful and desirable, but we can’t sweep the distinction between quantified opinion and measurement by instruments under the carpet. We could even argue that it is the authors, not most people, who misunderstand the definition of measurement – or at least that they misuse the term – if they believe that measurement of human opinions about a thing is equal to the measurement of physical attributes of the thing.

On this topic, they write, *“This conception of measurement might be new to many readers, but there are strong mathematical foundations – as well as practical reasons – for looking at measurement this way. A measurement is, ultimately, just information.”* Indeed there are practical reasons for quantifying opinions about facts of the world. But there are *no* mathematical foundations for looking at measurement this way. There are mathematical means of ensuring that statements about human beliefs are collectively coherent and that quantified beliefs are rationally updated according to new evidence. That is, we can avoid *Dutch Book* arguments using the axioms of probability (see Frank Ramsey on “Truth and Probability,” 1926), and we can ensure that updates of beliefs with new evidence are coherent across time via Bayesian inference rules. But mathematics is completely silent on the authors’ interpretation of measurement. Strictly speaking, there are no *mathematical reasons* for doing anything outside of mathematics, and coherence should be no endorsement of an interpretation of measurement or a belief system.
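For readers unfamiliar with Dutch Book arguments, the following sketch shows what coherence buys. It assumes a bettor who treats belief × stake as a fair price for a ticket paying the stake if the event occurs; the function name and the belief values are hypothetical. If the bettor’s beliefs in A and not-A sum to more than 1, a bookie who sells both tickets profits whatever happens.

```python
def bettor_net(belief_a, belief_not_a, a_occurs, stake=1.0):
    """Net result for a bettor who buys one ticket on A and one on not-A,
    paying belief * stake for each; a ticket pays `stake` if its event occurs."""
    cost = (belief_a + belief_not_a) * stake
    payout_a = stake if a_occurs else 0.0
    payout_not_a = 0.0 if a_occurs else stake
    return payout_a + payout_not_a - cost

# Incoherent beliefs: P(A) + P(not A) = 1.2, violating the probability axioms.
incoherent = [bettor_net(0.6, 0.6, outcome) for outcome in (True, False)]

# Coherent beliefs: P(A) + P(not A) = 1.
coherent = [bettor_net(0.6, 0.4, outcome) for outcome in (True, False)]

print("incoherent bettor's net by outcome:", incoherent)  # negative either way
print("coherent bettor's net by outcome:  ", coherent)    # no guaranteed loss
```

Note what this does and does not show: the axioms guarantee internal consistency of the beliefs, and say nothing about whether the beliefs correspond to the world.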

On their claim that measurements are ultimately information, I think most information experts would disagree. Measurement is not information but *data*. It becomes information when it is structured and used in a context where it can guide action. This is not a trivial distinction. Crackpot rigor and data fetish plague information technology right now; and spurious correlation is the root of much evil.

Hubbard and Seiersen write that *“the method described so far requires the subjective evaluation of quantitative probabilities.”* This seems misleading. More accurately, the method requires rational evaluation of quantified subjective probabilities. Is this nit-picking? Perhaps, but I think accuracy about where subjectivity lies is important; and it is the probabilities that are subjective. I agree that quantified expert opinions are valuable and sometimes the only data on which to base risk assessments. But why pretend that people have no grounds for arguing otherwise?

A similar issue involves statistical significance. The authors argue that most people are wrong about significance, and that all data is significant. They set up a straw man by not allowing that common usage admits two meanings of data “significance.” One indicates noteworthiness, and another is the common but arguable technical meaning of having a p-value less than 0.05. Most competent professionals understand the difference, and are not arguing that a p-value of 0.06 is insignificant in all ways.

The authors are also a bit sloppy with the concept of proof. For example, they use the phrase “scientifically proven.” They refer to empirical observations being *proven* true (p. 150), and discuss drugs *proven* not to be a placebo. If we’re to argue against soft methods claimed to be “proven” (as the authors do), we need to be disciplined about the concept of proof. Proof is in the realm of math; nothing is scientifically proven. I believe in the methods promoted by this book, and I have supporting evidence; but no, they aren’t scientifically proven to be better.

They also call Monte Carlo a “proven method.” Monte Carlo is proven only in a trivial sense, not in a sense that means it gives good answers. Opponents would say it is proven to give most users enough rope to hang themselves. Monte Carlo simulations are just tools – like explosives, chainsaws and log splitters, perhaps. The book gives the impression that Monte Carlo is self-justifying. All you have to do is plug in the right distribution and the right mean value and you’re safe.
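A small experiment shows how much rides on the choice of distribution. The sketch below is not from the book; the dollar figures, the standard deviation, and the lognormal shape parameter are all assumptions. It simulates two loss models with the same mean and compares their 99th percentiles.

```python
import math
import random
import statistics

def percentile(values, q):
    """Empirical q-th percentile (0-100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

rng = random.Random(42)
n = 50_000
mean_loss = 1_000_000.0  # hypothetical mean annual loss in dollars

# Model A: normal distribution with an assumed standard deviation.
normal_losses = [rng.gauss(mean_loss, 300_000.0) for _ in range(n)]

# Model B: lognormal with the SAME mean; sigma is an assumed shape parameter.
sigma = 1.5
mu = math.log(mean_loss) - sigma ** 2 / 2  # so that exp(mu + sigma^2 / 2) equals mean_loss
lognormal_losses = [rng.lognormvariate(mu, sigma) for _ in range(n)]

for label, losses in (("normal", normal_losses), ("lognormal", lognormal_losses)):
    print(f"{label:10s} mean={statistics.fmean(losses):14,.0f} "
          f"p99={percentile(losses, 99):14,.0f}")
```

Both models agree on the mean, yet the heavy-tailed one puts the 1-in-100 loss several times higher. The simulation itself cannot tell you which shape is right; that judgment has to come from somewhere else.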

By “quantitative methods” the authors seem to mean primarily quantification of expert opinion with possible Bayesian updates and Monte Carlo methods. I too believe in Bayesian inference, but as with their use of “measurement,” this use of the phrase “quantitative methods” seems aimed at blurring an important distinction. It does the cause of Bayesianism no good to dodge objections to subjectivist interpretations of probability by grouping it together with all other quantitative methods. Finite-element stress modeling is also a quantitative method, but it is far less controversial than subjective probability. Quantified measurements by instruments and quantified degrees of belief are fundamentally different.

In criticizing soft methods with imprecise scales, the authors use, as an example, how ridiculous it would be for an engineer to say the mass of a component on an airplane is “medium.” But the engineer has access to a means of calculating or measuring the mass of that component that is fundamentally different from measuring the engineer’s estimate of its mass. Yes, error exists in measurement by instruments, and such measurements rely on many assumptions; but measurement of mass is still different in kind from quantification of estimates about mass. I’d be just as worried about an engineer’s quantified estimate of what the weight of a wing should be (“about 250 tons”) as I would about her guessing it to be “medium.”

There are well-known failures of Bayesian belief networks; and there are some non-trivial objections to Bayesianism. For example, Bayesianism puts absolutely no constraints on initial hypothesis probabilities; it is indifferent to the subjectivity of both the hypothesis and its prior. Likewise, there’s the problem of old evidence having zero confirming power in a Bayesian model, the equating of increased probability with hypothesis confirmation, the catch-all hypothesis problem, and, particularly, the potentially large number of iterations needed to wash out priors. These make real mischief in some applications of Bayesianism.
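The wash-out problem can be illustrated with a conjugate Beta-binomial model. This is a sketch under my own assumptions, not a method from the book: two analysts start from different Beta priors, both see idealized data in which an event occurs exactly once in every ten trials, and we count the trials needed before their posterior means agree to within 0.01. The priors and the function names are hypothetical.

```python
def posterior_mean(alpha, beta, successes, trials):
    """Mean of the Beta posterior after binomial data under a Beta(alpha, beta) prior."""
    return (alpha + successes) / (alpha + beta + trials)

def trials_to_agree(prior_a, prior_b, rate_num=1, rate_den=10,
                    tol=0.01, max_trials=1_000_000):
    """Smallest trial count, in steps of rate_den, at which two posterior means
    differ by less than tol. Data are idealized: exactly rate_num successes
    in every rate_den trials."""
    for trials in range(rate_den, max_trials + 1, rate_den):
        successes = trials * rate_num // rate_den
        gap = abs(posterior_mean(*prior_a, successes, trials)
                  - posterior_mean(*prior_b, successes, trials))
        if gap < tol:
            return trials
    return None

# Two weakly held priors converge quickly.
mild = trials_to_agree((1, 1), (1, 9))

# One analyst holds a strong prior that the event is very rare.
stubborn = trials_to_agree((1, 1), (1, 999))

print("trials to agree, weak priors:     ", mild)
print("trials to agree, one strong prior:", stubborn)
```

In a domain where each “trial” is a year of operating experience or a major breach, thousands of observations are not on offer, so the strong prior never washes out in practice.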

One of my concerns with this book’s methodology stems from its failure to differentiate risks from hazards. In proposing an alternative to risk matrices, the authors propose a 5-step process that starts with “define a list of risks” (p. 37). This is jumping to potentially dangerous conclusions. The starting point should be a list of hazards, not risks. Those hazards serve as a basis for hypothetical risks, after assigning arbitrary or particularly useful (to the business exposed to them) impact levels for which a probability determination is sought. For example, for the hazard of sabotage, cyber risk analysts might want to separately examine sabotage incidents having economic impacts of $100K and $10M, or some other values, which, presumably, would not be equally likely. Confusing hazards with risks is something these authors should want to carefully avoid, especially since it is a fundamental flaw of frameworks that rely on risk registers and risk matrices.
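As a sketch of how one hazard yields several distinct risks, the code below attaches a probability to each chosen impact level. Every number in it is a hypothetical placeholder, and the lognormal loss model is an assumption made only for illustration; the point is the structure (one hazard, several impact thresholds, a different probability for each), not the values.

```python
import math
from statistics import NormalDist

def exceedance_probability(threshold, p_incident, mu, sigma):
    """P(an incident occurs in the period AND its loss is at least `threshold`),
    assuming the loss given an incident is lognormal(mu, sigma)."""
    z = (math.log(threshold) - mu) / sigma
    return p_incident * (1 - NormalDist().cdf(z))

# Hazard: sabotage. All parameters are hypothetical placeholders, not estimates.
p_incident = 0.10                    # assumed annual probability of any sabotage incident
mu, sigma = math.log(500_000), 1.2   # assumed median loss of $500K given an incident

# Two risks derived from the one hazard, defined by impact level.
risks = {
    threshold: exceedance_probability(threshold, p_incident, mu, sigma)
    for threshold in (100_000, 10_000_000)
}

for threshold, probability in risks.items():
    print(f"sabotage with loss >= ${threshold:>12,}: annual probability {probability:.5f}")
```

A register entry reading simply “sabotage” hides the fact that these two risks differ in likelihood by orders of magnitude.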

A more serious flaw in the book’s approach to cyber-risk analysis is, in my view, its limited and poorly applied use of the concept of decomposition. While the book sometimes refers to parametric variations analysis as decomposition, its use of decomposition is mainly focused on the type of judgmental decomposition investigated by McGregor and Armstrong in the 80’s and 90’s. That is, given a target value that is difficult to estimate, one breaks the problem down into bits that are easier to estimate. McGregor and Armstrong noted from the start of their work that translating this idea into practice is difficult, and is most valuable when the target quantity is “extreme,” i.e., very difficult to estimate. Further, estimation errors for the components must not have strong positive correlations between one another. This is a difficult requirement to meet in cybersecurity.
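The correlation requirement can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the target is a product of components, so that multiplicative estimation errors add in log space, and that every pair of component errors shares the same correlation. The function name and the numbers are hypothetical.

```python
import math

def combined_log_error_sd(component_sd, n_components, correlation):
    """Standard deviation of the summed log-errors of n multiplied component
    estimates, each with the same log-error SD and a common pairwise correlation.
    Var(sum) = sd^2 * (n + n * (n - 1) * rho)."""
    variance = component_sd ** 2 * (
        n_components + n_components * (n_components - 1) * correlation
    )
    return math.sqrt(variance)

component_sd = 0.5   # assumed log-error SD of each component estimate
n_components = 4

for rho in (0.0, 0.8, 1.0):
    sd = combined_log_error_sd(component_sd, n_components, rho)
    print(f"pairwise correlation {rho:.1f}: combined log-error SD = {sd:.3f}")
```

With independent errors the combined spread grows with the square root of the number of components; with strongly correlated errors it grows almost linearly, and the decomposition can end up worse than a direct estimate. In cybersecurity, where the same few experts often estimate every component, strong positive correlation is the likely case.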

Also, the authors fail to mention that McGregor and Armstrong ultimately concluded, after twenty years of research, that judgmental decomposition had a more limited value than they thought it would early in their research. Hubbard and Seiersen report that McGregor and Armstrong found that simple (low variable count) decompositions reduced error by *“a factor of as much as 10 or even 100.”* I don’t think that is an accurate statement of McGregor and Armstrong’s finding.

While they seem to overstate the value of judgmental decomposition, the authors miss the value of *functional* decomposition. Functional decomposition of many of the system states being modeled will certainly lead to better estimates. One way it does this is by making explicit all the Boolean logic (AND and OR logic) in the interaction of components of the hazardous system-state being modeled. The book never mentions fault trees or similar decomposition methods. It may be that the systems exposed to cyber risk cannot be meaningfully decomposed in a functional sense, but I doubt it. If this were true, it would be hard to see how the problem space could accommodate McGregor and Armstrong’s criterion that decomposed elements must be easier to estimate than global ones. For more on estimating low-level probabilities in multiple-level systems, see Section 4 of NASA/SP-2009-569 – *Bayesian Inference for NASA Probabilistic Risk and Reliability Analysis*.
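A minimal fault-tree sketch shows what making the Boolean logic explicit looks like. The scenario, event names, and probabilities are hypothetical, and the gate formulas assume the basic events are independent, which a real analysis would have to justify.

```python
from math import prod

def and_gate(probabilities):
    """Output event requires ALL inputs to occur (inputs assumed independent)."""
    return prod(probabilities)

def or_gate(probabilities):
    """Output event occurs if ANY input occurs (inputs assumed independent)."""
    return 1 - prod(1 - p for p in probabilities)

# Hypothetical basic-event probabilities for one period.
p_phishing_succeeds = 0.05
p_unpatched_server_exploited = 0.02
p_egress_monitoring_fails = 0.10

# Intermediate event: attacker gains a foothold by either route.
p_foothold = or_gate([p_phishing_succeeds, p_unpatched_server_exploited])

# Top event: data exfiltration requires a foothold AND a monitoring failure.
p_exfiltration = and_gate([p_foothold, p_egress_monitoring_fails])

print(f"P(foothold)     = {p_foothold:.4f}")
print(f"P(exfiltration) = {p_exfiltration:.4f}")
```

Each basic event is something a system owner can reason about directly, and the structure shows where a control (here, egress monitoring) multiplies down the top-event probability rather than merely adding to a score.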

Like Hubbard’s first book, this one includes an excellent set of training materials for calibration of subject matter experts in the style developed by Lichtenstein, Fischhoff and Phillips in their work with the Office of Naval Research in the 70’s. On the question of whether calibrated estimators are successful in real-world predictions, Hubbard attributes degrees of success to how closely practitioners follow his recommendations for calibration training (p. 153). This is naive and overconfident.

Using exactly the same calibration workshop procedures, I have seen dramatically different results with different situations in different industries. I suspect two additional factors govern the success of predictions by calibrated experts. One deals with relative degrees of ignorance. The overconfidence of experts can be tamed somewhat through calibration exercises using trivia quizzes. The ranges of possible values for the population of Sweden or the birth year of Isaac Newton are pretty well bounded for most people. The range of possible values for the probability of rare events – about which we are very ignorant – is not similarly bounded. One-per-million and one-per-trillion differ by a factor of a million, but are both exceedingly rare in the judgment of many experts in the subject matter of real-world predictions. Calibration seems to have less impact in estimates where values range over many orders of magnitude and include extremely large or extremely small numbers.

A more important limitation of expert calibration exists in situations where real-world predictions cannot be practically conducted in the absence of social influence. That peer influence greatly damages the wisdom of crowds is well-documented. The most senior or most charismatic member of a team can influence all assignments of subjective probability. Then the wisdom of the crowd transforms into a consensus-building exercise, right where consensus is most destructive to good predictions. For more on this aspect of collective predictions, see Seidenfeld, Kadane and Schervish’s “*On the Shared Preferences of Two Bayesian Decision Makers.*” They note that an outstanding challenge for Bayesian decision theory is to extend its norms of rationality from individuals to groups. This challenge is huge. Organizational dynamics in large corporations combined with the empirically valid stereotype of shy engineers and scientists can render expert calibration ineffective. Still, peer influence with calibration is better than peer influence alone.

A general complaint I have with Hubbard’s work is that when he deals with matters of engineering, he seems not to clearly understand engineering risk assessment. When he talks about Fukushima he seems to have no familiarity with FHA, FMEA and fault tree/PSA – and misses opportunities for functional decomposition as mentioned above. By framing risk analysis as a matter for the quants, he misses the point, shown well by history, that risk must be tightly integrated with those who “own” the systems. I discovered in aerospace risk workshops I ran 25 years ago that it is far easier to teach risk analysis to aviation engineers than it is to teach aerospace engineering to data scientists; systems knowledge always trumps analytical methods.

Despite these complaints, I strongly recommend this book. Hubbard and Seiersen deserve praise for a valiant effort to dislodge some increasingly dangerous but established doctrine. Hopefully, it’s not an exercise in trying to sell brains to dinosaurs.

The last chapter of *Cybersecurity Risk* is particularly noteworthy. In it the authors dress down standards bodies and the consultancies who peddle facile risk training. They suggest that standards bodies should themselves be subject to metrics and empirical testing. This point is strengthened by the realization that auditors check compliance with standards; they do not independently determine whether those standards are effective. In that sense, standards based on established methods, but methods for which there is no evidence of added value, are doubly harmful. Through auditing, such standards prevent adoption of more effective methods while the auditing process inadvertently endorses ineffective methods embodied in standards. The authors describe scenarios in cybersecurity that parallel what I’ve seen in pharma – managers managing to an audit rather than managing to actual risks.

*Cybersecurity Risk*’s last chapter also calls for sanity in rolling out cybersecurity risk management. The authors seek to position quantitative cybersecurity risk management (CSRM) as a C-level strategic practice, noting that risk management must be a *program* rather than a bag of tactical quantitative tools. They’re not trying to usurp the corporate investment decision process (as some ERM initiatives seem to be) but are arguing that the CSRM function should be the first gate for executive or board-level technology investment consideration, thereby eliminating the weak, qualitative risk-register stuff commonly heaped on decision makers. Their proposed structure also removes cybersecurity risk from the domain of CTOs and technology implementors, a status quo they liken to foxes guarding the hen house. A CSRM program should protect the firm from bad technology investments and optimize technology investments in relation to probable future losses. Operationally, a big part of this function is answering the question, *“Are our investments working well together in addressing key risks?”*

– – – – – – – –

Are you in the San Francisco Bay area?

If so, consider joining the *Risk Management meetup group*.

Risk management has evolved separately in various industries. This group aims to cross-pollinate, compare and contrast the methods and concepts of diverse areas of risk including enterprise risk (ERM), project risk, safety, product reliability, aerospace and nuclear, financial and credit risk, market, data and reputation risk.

This meetup will build community among risk professionals – internal auditors and practitioners, external consultants, job seekers, and students – by providing forums and events that showcase current trends, case studies, and best practices in our profession with a focus on practical application and advancing the state of the art.