Category Archives: Decision-making

Multiple-Criteria Decision Analysis in the Engineering and Procurement of Systems

The use of weighted-sum value matrices is a core component of many system-procurement and organizational decisions, including risk assessments. In recent years the USAF has eliminated weighted-sum evaluations from most procurement decisions, on the grounds that system requirements should set accurate performance levels that, once met, reduce procurement decisions to simple competition on price. This probably oversimplifies things. For example, the acquisition cost of an aircraft system might be easy to establish, but the life-cycle cost of systems that include wear-out or limited-fatigue-life components requires forecasting and engineering judgment. In other areas of systems engineering, such as trade studies, maintenance planning, spares allocation, and especially risk analysis, multi-attribute or multi-criterion decisions are common.

Weighted-sum criterion matrices (and their relatives, e.g., weighted-product, AHP, etc.) are often criticized in engineering decision analysis for some valid reasons. These include non-independence of criteria, difficulties in normalizing and converting measurements and expert opinions into scores, and logical/philosophical concerns about decomposing subjective decisions into constituents.

Years ago, while working through the issues of using weighted-sum matrices to select subcontractors for aircraft systems, a team of systems engineers and I experimented with comparing the problems we encountered in vendor selection to an unrelated multi-attribute decision process: mate selection. We encountered the same issues in attempting to create criteria, weight those criteria, and establish criterion scores in both decision processes, even though one process seems highly technical and the other completely non-technical. This exercise emphasized the degree to which aircraft-system vendor selection involves subjective decisions. It also revealed that despite the weaknesses of using weighted sums to make decisions, the process of identifying, weighting, and scoring the criteria for a decision greatly enhanced the engineers' ability to give an expert opinion. But this final expert opinion was often at odds with the one derived from weighted-sum scoring, even after attempts to adjust the weightings of the criteria.

Weighted-sum and related numerical approaches to decision-making interest me because I encounter them in my work with clients. They are central to most risk-analysis methodologies, and, therefore, central to risk management. The topic is inherently multidisciplinary, since it entails engineering, psychology, economics, and, in cases where weighted sums derive from multiple participants, social psychology.

This post is an introduction-after-the-fact to my previous post, How to Pick a Spouse. I'm writing this brief prequel because blog-excerpting tools tend to use only the first few lines of a post, and on that basis my post appeared to be about mate selection rather than decision analysis, its main point.

If you’re interested in multi-attribute decision-making in the engineering of systems, please continue now to How to Pick a Spouse.

F-16

————-

Katz’s Law: Humans will act rationally when all other possibilities have been exhausted.

How to Pick a Spouse

Beckhap's Law asserts that brains times beauty equals a constant. Can this be true? Are intellect and beauty quantifiable? Is beauty a property of the subject of investigation, or a quality of the mind of the beholder? Are any other relevant variables (attributes) intimately tied to brains or beauty? Assuming brains and beauty are both desirable, Beckhap's Law implies an optimization exercise: picking the point on the reciprocal function representing the best compromise between brains and beauty. Presumably, this point differs for every evaluator. It also raises questions about the marginal utility of brains and beauty. Is it possible that too much brain or too much beauty could be a liability? (Engineers would call this an edge-case check of Beckhap's validity.) Is Beckhap's Law of any use without a cost axis? Other axes? In practice, if taken seriously, Beckhap's Law might be merely one constraint in a multi-attribute decision process for selecting a spouse. It also sheds light on the problems of Air Force procurement of the components of a weapon system, and on a lot of other decisions. I'll explain why.

C-17 aircraft photo

I’ll start with an overview of how the Air Force oversees contract awards for aircraft subsystems – at least how it worked through most of USAF history, before recent changes in procurement methods.  Historically, after awarding a contract to an aircraft maker, the aircraft maker’s engineers wrote specs for its systems. Vendors bid on the systems by creating designs described in proposals submitted for competition. The engineers who wrote the specs also created a list of a few dozen criteria, with weightings for each, on which they graded the vendors’ proposals. The USAF approved this criteria list and their weightings before vendors submitted their proposals to ensure the fairness deserved by taxpayers. Pricing and life-cycle cost were similarly scored by the aircraft maker. The bidder with the best total score got the contract.

A while back I headed a team of four engineers, all single men, designing and spec'ing out systems for a military jet. It took most of a year to write these specs. Six months later we received proposals hundreds of pages long. We graded the proposals according to our pre-determined list of criteria. After computing the weighted sums (the sum of score times weight for each criterion), I asked the engineers whether the results agreed with their subjective judgments. That is, did the winner by score match the best bidder as judged by these engineers independent of the scoring process? For only about half of them it did. I asked the team why they thought the scored results differed from their subjective judgments.
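For readers who haven't seen one of these matrices, here is a minimal sketch of the arithmetic in Python. The criteria, weights, and grades below are invented for illustration; they are not the ones we actually used.

```python
# A minimal sketch of weighted-sum scoring. Criteria, weights, and grades
# are invented for illustration, not taken from the actual evaluation.

criteria_weights = {
    "reliability": 0.30,
    "maintainability": 0.20,
    "product_weight": 0.25,
    "interface_compliance": 0.25,
}

# Each vendor receives a 0-100 grade on each criterion.
vendor_scores = {
    "Vendor A": {"reliability": 85, "maintainability": 70,
                 "product_weight": 90, "interface_compliance": 80},
    "Vendor B": {"reliability": 90, "maintainability": 80,
                 "product_weight": 60, "interface_compliance": 85},
}

def weighted_sum(scores, weights):
    """Sum of (grade x weight) over all criteria."""
    return sum(scores[criterion] * weight for criterion, weight in weights.items())

for vendor, scores in vendor_scores.items():
    print(vendor, round(weighted_sum(scores, criteria_weights), 1))
```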

They proposed several theories. A systems engineer, viewing the system from the perspective of its interactions and interfaces with the entire aircraft, may not be familiar with all the internal details of the system while writing specs. You learn a lot of these details by reading the vendors' proposals, so you're better suited to create the criteria list after reading the proposals. But the criteria and their weightings are fixed at that point because of the fairness concern. Anonymized proposals might preserve fairness and allow better criteria lists, one engineer offered.

But there was more to the disconnect between their subjective judgments of "best candidate" and the computed results. Someone immediately cited the problem of normalization. Converting weight in pounds, for example, to a dimensionless score (e.g., a grade of 0 to 100) was problematic. If minimum product weight is the goal, how do you convert three vendors' product weights into grades on the 100-point scale? Giving the lowest weight 100 points and subtracting the percentage weight delta of the others feels arbitrary, because it is. Doing so compresses the scores excessively, making you want to assign a higher weighting to product weight to compensate for the clustering of the product-weight scores. Since you're not allowed to do that, you invent some other ad hoc means of increasing the difference between scores. In other words, you work around the weighted-sum concept to try to comply with the spirit of the rules without actually breaking them. But you still end up with a method in which you're not terribly confident.
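To make the compression concrete, here is one version of that arbitrary conversion, with made-up vendor weights. The "dock the percentage delta" rule is just one of many equally arbitrary choices.

```python
# Illustrative only: converting three hypothetical product weights (in pounds)
# into 0-100 grades by giving the lightest product 100 points and docking the
# others their percentage weight penalty relative to the lightest.

product_weights_lb = {"Vendor A": 102.0, "Vendor B": 105.0, "Vendor C": 110.0}

lightest = min(product_weights_lb.values())

grades = {
    vendor: 100.0 - 100.0 * (lb - lightest) / lightest
    for vendor, lb in product_weights_lb.items()
}

for vendor, grade in grades.items():
    print(f"{vendor}: {grade:.1f}")
# Vendor A: 100.0, Vendor B: 97.1, Vendor C: 92.2
# Roughly an 8% spread in a key physical attribute collapses into a few grade
# points, which tempts the evaluator to inflate the criterion's weighting.
```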

A bright young engineer named Hui then hit on a major problem of the weighted-sum scoring approach. He offered that the criteria in our lists were not truly independent; they interacted with each other. Further, he noted, it would be impossible to create a list of criteria that were truly independent. Nature, physics and engineering design just don’t work like that. On that thought, another engineer said that even if the criteria represented truly independent attributes of the vendors’ proposed systems, they might not be independent in a mental model of quality judgment. For example, there may be a logical quality composed of a nonlinear relationship between reliability, spares cost, support equipment, and maintainability. Engineering meets philosophy.

We spent lunch critiquing and philosophizing about multi-attribute decision-making. Where else is this relevant, I asked. Hui said, “Hmmm, everywhere?” “Dating!” said Eric. “Dating, or marriage?”, I asked. They agreed that while their immediate dating interests might suggest otherwise, all four were in fact interested in finding a spouse at some point. I suggested we test multi-attribute decision matrices on this particular decision. They accepted the challenge. Each agreed to make a list of past and potential future candidates to wed, without regard for the likelihood of any mutual interest the candidate might have. Each also would independently prepare a list of criteria on which they would rate the candidates. To clarify, each engineer would develop their own criteria, weightings, and scores for their own candidates only. No multi-party (participatory) decisions were involved; these involve other complex issues beyond our scope here (e.g., differing degrees of over/under-confidence in participants, doctrinal paradox, etc.). Sharing the list would be optional.

Nevertheless, on completing their criteria lists, everyone was happy to share criteria and weightings. There were quite a few non-independent attributes related to appearance, grooming and dress, even within a single engineer’s list. Likewise with intelligence. Then there was sense of humor, quirkiness, religious compatibility, moral virtues, education, type A/B personality, all the characteristics of Myers-Briggs, Eysenck, MMPI, and assorted personality tests. Each engineer rated a handful of candidates and calculated the weighted sum for each.

I asked everyone if their winning candidate matched their subjective judgment of who the winner should have been. A resounding no, across the board.

Some adherents of rigid multi-attribute decision processes address such disconnects between intuition and weighted-sum decision scores by suggesting that in this case we merely adjust the weightings. For example, MindTools suggests:

“If your intuition tells you that the top scoring option isn’t the best one, then reflect on the scores and weightings that you’ve applied. This may be a sign that certain factors are more important to you than you initially thought.”

To some, this sounds like an admission that subjective judgment is more reliable than the results of the numerical exercise. Regardless, no amount of adjusting scores and weights left the engineers confident that the method worked. No adjustment to the weight coefficients seemed to properly express tradeoffs between some of the attributes. I.e., no tweaking of the system ordered the candidates (from high to low) in a way that made sense to each evaluator. This meant the redesigned formula still wasn’t trustworthy. Again, the matter of complex interactions of non-independent criteria came up. The relative importance of attributes seems to change as one contemplates different aspects of a thing. A philosopher’s perspective would be that normative statements cannot be made descriptive by decomposition. Analytic methods don’t answer normative questions.
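A toy example, entirely made up but in the spirit of what we saw, shows why weight-tweaking can fail in principle. If an evaluator's intuition favors a balanced candidate, no positive linear weighting of two interacting criteria will put that candidate on top:

```python
# A hypothetical illustration: an intuitive ranking that no positive linear
# weighting can reproduce. Candidates are scored 0-10 on two criteria.

candidates = {
    "A (strong on criterion 1)": (9, 1),
    "B (strong on criterion 2)": (1, 9),
    "C (balanced)": (5, 5),
}

# Suppose the evaluator's subjective judgment puts C first. For any weights
# w1, w2 > 0:
#   C beats A requires 5*w1 + 5*w2 > 9*w1 + 1*w2, i.e., w2 > w1
#   C beats B requires 5*w1 + 5*w2 > 1*w1 + 9*w2, i.e., w1 > w2
# The two conditions contradict each other, so no weighted sum ranks C first.

for w1 in (0.2, 0.4, 0.6, 0.8):
    w2 = 1.0 - w1
    best = max(candidates,
               key=lambda name: w1 * candidates[name][0] + w2 * candidates[name][1])
    print(f"w1={w1:.1f}, w2={w2:.1f}: winner is {best}")
# The balanced candidate never wins: a preference for balance is an
# interaction between criteria that a linear weighting cannot express.
```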

Interestingly, all the engineers felt that listing criteria and scoring them helped them make better judgments about the ideal spouse, even though those better judgments were not the ones that resulted directly from the weighted-sum analysis.

Fact is, picking which supplier should get the contract and picking the best spouse candidate are normative, subjective decisions. No amount of dividing a subjective decision into components makes it objective. Nor does any amount of ranking or scoring. A quantified opinion is still an opinion. This doesn’t mean we shouldn’t use decision matrices or quantify our sentiments, but it does mean we should not hide behind such quantifications.

From the perspective of psychology, decomposing the decision into parts seems to make sense. Expert opinion is known to be sometimes marvelous, sometimes terribly flawed. Daniel Kahneman writes extensively on associative coherence, finding that our natural, untrained tendency is to reach conclusions first and justify them second. Kahneman and Gary Klein looked in detail at expert opinions in "Conditions for Intuitive Expertise: A Failure to Disagree" (American Psychologist, 2009). They found that short-answer expert opinion can be very poor. But they also found that the subjective judgments of experts forced to examine details and contemplate alternatives, particularly when they have sufficient experience to close the intuition feedback loop, are greatly improved.

Their findings seem to support the aircraft engineers' views of the weighted-sum analysis process. Despite the risk of confusing reasons with causes, enumerating the evaluation criteria and formally assessing them aids the subjective decision process. Doing so left the engineers more confident about their decisions, for spouse and for aircraft system, though those decisions differed from the ones produced by weighted sums. In the case of the aircraft systems, the engineers had to live with the results of the weighted-sum scoring.

I was one of the engineers who disagreed with the results of the aircraft system decisions.  The weighted-sum process awarded a very large contract to the firm whose design I judged inferior. Ten years later, service problems were severe enough that the Air Force agreed to switch to the vendor I had subjectively judged best. As for the engineer-spouse decisions, those of my old engineering team are all successful so far. It may not be a coincidence that the divorce rates of engineers are among the lowest of all professions.

——————-

Hedy Lamarr was granted a patent for spread-spectrum communication technology, paving the way for modern wireless networking.

Hedy Lamarr

Collective Decisions and Social Influence

Victrola

People have practiced collective decision-making here and there since antiquity. Many see modern social connectedness as offering great new possibilities for the concept. I agree, with a few giant caveats. I'm fond of the topic because I do some work in the field and because it is multidisciplinary, standing at the intersection of technology and society. I've written a couple of recent posts on related topics. A lawyer friend emailed me to say she was interested in my recent post on Yelp and crowd wisdom. She said the color-coded scatter plots were pretty, but she wondered if I had a version with less whereas and more therefore. I'll do that here and give some high points from some excellent studies I've read on the topic.

First, in my post on the Yelp data, I accepted that many studies have shown that crowds can be wise. When large random crowds respond individually to certain quantitative questions, the median or geometric mean of their answers (though not the arithmetic mean) is often more accurate than the answers of panels of experts. This requires that crowd members know at least a little something about the matter they're voting on.
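To see why the choice of aggregator matters, here is a small simulation with invented guesses, not data from any study. Estimation errors tend to be multiplicative, so a few wild overestimates drag the arithmetic mean upward while the median and geometric mean stay close to the truth.

```python
# Illustrative simulation: individual guesses of a quantity (true value 1000)
# drawn with multiplicative (lognormal) error. Numbers are simulated, not
# taken from any published study.
import math
import random
import statistics

random.seed(1)
true_value = 1000
guesses = [true_value * math.exp(random.gauss(0, 0.6)) for _ in range(500)]

arith_mean = statistics.mean(guesses)
median = statistics.median(guesses)
geo_mean = math.exp(statistics.mean(math.log(g) for g in guesses))

print(f"true: {true_value}")
print(f"mean: {arith_mean:.0f}  median: {median:.0f}  geometric mean: {geo_mean:.0f}")
# The arithmetic mean is pulled up by a few huge guesses; the median and
# geometric mean typically land much closer to the true value.
```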

Then my experiments with Yelp data confirmed what others found in more detailed studies of similar data:

  1. Yelp raters tend to give extreme ratings.
  2. Ratings are skewed toward the high end.
  3. Even a rater who rates high on average still rates many businesses very low.
  4. Many businesses in certain categories have bimodal distributions – few average ratings, many high and low ratings.
  5. Young businesses are more likely to show bimodal distributions; established ones tend to be right-skewed.

I noted that these characteristics would reduce statisticians' confidence in conclusions drawn from the data. I then speculated that social influence contributed to these characteristics of the data, which are also seen in detailed published studies of Amazon, IMDb, and other high-volume sites. Some of those studies actually quantified social influence.

Two of my favorite studies show how mild social influence can damage crowd wisdom; and how a bit more can destroy it altogether. Both studies are beautiful examples of design of experiments and analysis of data.

In one (Lorenz et al., full citation below), the experimenters put six questions to twelve groups of twelve students. In half the groups, people answered the questions with no knowledge of the other members' responses. In the other groups, the experimenters reported information about the group's responses to all twelve people in that group, and each member could then give a new answer. They repeated the process five times, allowing each member to revise and re-revise his response with knowledge of his group's answers, and did statistical analyses on the results. The results showed that while the groups were initially wise, knowledge of the answers of others narrowed the range of answers, but this reduced range did not reduce collective error. This convergence is often called the social influence effect.
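Here is a toy simulation in the spirit of that experiment, though it is not the Lorenz protocol or their data: each member repeatedly nudges his estimate toward the group mean, so the spread of answers collapses while the group's collective error stays anchored wherever it started.

```python
# Toy simulation of the social influence effect (illustration only; not the
# Lorenz et al. protocol or data). Twelve members start with independent,
# noisy estimates of a true value, then repeatedly pull their answers toward
# the group's current mean.
import random
import statistics

random.seed(7)
true_value = 100.0
estimates = [random.gauss(true_value, 30) for _ in range(12)]

for round_number in range(6):
    spread = max(estimates) - min(estimates)
    median_error = abs(statistics.median(estimates) - true_value)
    print(f"round {round_number}: spread = {spread:5.1f}, median error = {median_error:5.1f}")
    group_mean = statistics.mean(estimates)
    # Each member moves 30% of the way toward the group's current mean.
    # This update leaves the group mean unchanged, so agreement grows while
    # accuracy remains anchored by the initial (possibly biased) mean.
    estimates = [e + 0.3 * (group_mean - e) for e in estimates]
```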

A related aspect of the change in a group's answers might be termed the range reduction effect. It describes the fact that the correct answer moves progressively toward the periphery of the ordered group of answers as members revise their answers. A key consequence of this effect is that representatives of the crowd become less valuable in giving advice to external observers.

The most fascinating aspect of this study was the confidence effect. Communication of responses by other members of a group increased individual members’ confidence about their responses during convergence of their estimates – despite no increase in accuracy. One needn’t reach far to find examples in the form of unfounded guru status, overconfident but misled elitists, and Teflon financial advisors.

Another favorite of the many studies quantifying social influence (Salganik et al.) built a music site where visitors could listen to previously unreleased songs and download them. Visitors were randomly placed in one of eight isolated groups. All groups listened to songs, rated them, and were allowed to download a copy. In some of the groups visitors could see a download count for each song, though this information was not emphasized. The download count, where present, was a weak indicator of the preferences of other visitors. Ratings from groups with no download-count information yielded a measurement of song quality as judged by a large population (14,000 participants total). Behavior of the groups with visible download counts allowed the experimenters to quantify the effect of mild social influence.

The results of the music experiment were profound. They showed that mild social influence contributes greatly to inequality of outcomes in the music market. More importantly, by comparing the isolated populations that could see download counts, they showed that social influence introduces instability and unpredictability into the results. That is, wildly different "hits" emerged in otherwise identical groups when social influence was possible. In an identical parallel universe, Rihanna did just OK and Donnie Darko packed theaters for months.
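A crude simulation shows how a visible download count can make outcomes diverge across otherwise identical groups. This is only a sketch of the cumulative-advantage mechanism, not the Music Lab's actual model or data.

```python
# Toy cumulative-advantage simulation (illustration only; not the Salganik et
# al. experiment or its data). Identical "worlds" get the same songs with the
# same intrinsic appeal; each listener chooses partly on appeal and partly on
# the downloads accumulated so far in that world.
import random

random.seed(42)
NUM_SONGS = 10
appeal = [random.uniform(0.3, 0.7) for _ in range(NUM_SONGS)]  # same in every world

def run_world(influence, listeners=2000):
    downloads = [0] * NUM_SONGS
    for _ in range(listeners):
        total = sum(downloads) or 1
        # Choice weight blends intrinsic appeal with popularity so far.
        weights = [
            (1 - influence) * appeal[s] + influence * downloads[s] / total
            for s in range(NUM_SONGS)
        ]
        downloads[random.choices(range(NUM_SONGS), weights=weights)[0]] += 1
    return downloads

for world in range(3):
    counts = run_world(influence=0.8)
    top = max(range(NUM_SONGS), key=counts.__getitem__)
    print(f"world {world}: top song = #{top} with {counts[top]} downloads")
# With influence near zero the download ranking tracks appeal in every world;
# as influence grows, early random choices snowball, and repeated runs tend to
# crown different songs in identical worlds.
```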

Engineers and mathematicians might correctly see this instability as something like a third-order dynamic system, highly sensitive to initial conditions. The first vote cast in each group was the flapping of the butterfly's wings in Brazil that set off a tornado in Texas.

This study’s authors point out the ramifications of their work on our thoughts about popular success. Hit songs, top movies and superstars are orders of magnitude more successful than their peers. This leads to the sentiment that superstars are fundamentally different from the rest. Yet the study’s results show that success was weakly related to quality. The best songs were rarely unpopular; and the worst rarely were hits. Beyond that, anything could and did happen.

This probably explains why profit-motivated experts do so poorly at predicting which products will succeed, even minutes before a superstar emerges.

When information about a group is available, its members do not make decisions independently, but are influenced subtly or strongly by their peers. When more group information is present (stronger social influence), collective results become increasingly skewed and increasingly unpredictable.

The wisdom of crowds comes from aggregation of independent input. It is a matter of statistics, not of social psychology. This crucial fact seems to be missed by many of the most distinguished champions of crowdsourcing, collective wisdom, crowd-based-design and the like. Collective wisdom can be put to great use in crowdsourcing and collective decision making. The wisdom of crowds is real, and so is social influence; both can be immensely useful. Mixing the two makes a lot of sense in the many business cases where you seek bias and non-individualistic preferences, such as promoting consumer sales.

But extracting “truth” from a crowd is another matter – still entirely possible, in some situations, under controlled conditions. But in other situations, we’re left with the dilemma of encouraging information exchange while maintaining diversity, independence, and individuality. Too much social influence (which could be quite a small amount) in certain collective decisions about governance and the path forward might result in our arriving at a shocking place and having no idea how we got there. History provides some uncomfortable examples.

_______________

Sources cited:

Jan Lorenz, Heiko Rauhut, Frank Schweitzer, and Dirk Helbing, "How social influence can undermine the wisdom of crowd effect," Proceedings of the National Academy of Sciences, May 31, 2011.

Matthew J. Salganik, Peter Sheridan Dodds, et al., "Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market," Science, February 10, 2006.