
Judgment under uncertainty: Heuristics and biases

PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge, United Kingdom

CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011-4211, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa
http://www.cambridge.org

© Cambridge University Press 1982

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 1982
Reprinted 1982, 1983 (twice), 1984, 1985 (twice), 1986, 1987, 1988, 1990, 1991, 1993, 1994, 1998, 1999, 2001

Printed in the United States of America
Typeset in Times

A catalog record for this book is available from the British Library
Library of Congress Cataloging in Publication Data is available

ISBN 0 521 28414 7 paperback

1. Judgment under uncertainty: Heuristics and biases

Amos Tversky and Daniel Kahneman

Many decisions are based on beliefs concerning the likelihood of uncertain events such as the outcome of an election, the guilt of a defendant, or

the future value of the dollar. These beliefs are usually expressed in statements such as “I think that . . . ,” “chances are . . . ,” “it is unlikely that . . . ,” and so forth. Occasionally, beliefs concerning uncertain events

are expressed in numerical form as odds or subjective probabilities. What determines such beliefs? How do people assess the probability of an uncertain event or the value of an uncertain quantity? This article shows that people rely on a limited number of heuristic principles which reduce

the complex tasks of assessing probabilities and predicting values to simpler judgmental operations. In general, these heuristics are quite useful, but sometimes they lead to severe and systematic errors. The subjective assessment of probability resembles the subjective assessment of physical quantities such as distance or size. These judgments are all based on data of limited validity, which are processed according to heuristic rules. For example, the apparent distance of an object is determined in part by its clarity. The more sharply the object is seen, the closer it appears to be. This rule has some validity, because in any given scene the

more distant objects are seen less sharply than nearer objects. However, the reliance on this rule leads to systematic errors in the estimation of distance. Specifically, distances are often overestimated when visibility is

poor because the contours of objects are blurred. On the other hand,

distances are often underestimated when visibility is good because the objects are seen sharply. Thus, the reliance on clarity as an indication of distance leads to common biases. Such biases are also found in the

intuitive judgment of probability. This article describes three heuristics that are employed to assess probabilities and to predict values. Biases to which these heuristics lead are enumerated, and the applied and theoretical implications of these observations are discussed.

[This chapter originally appeared in Science, 1974, 185, 1124-1131. Copyright © 1974 by the American Association for the Advancement of Science. Reprinted by permission.]


Representativeness

Many of the probabilistic questions with which people are concerned belong to one of the following types: What is the probability that object A belongs to class B? What is the probability that event A originates from process B? What is the probability that process B will generate event A? In answering such questions, people typically rely on the representativeness

heuristic, in which probabilities are evaluated by the degree to which A is representative of B, that is, by the degree to which A resembles B. For example, when A is highly representative of B, the probability that A originates from B is judged to be high. On the other hand, if A is not similar to B, the probability that A originates from B is judged to be low. For an illustration of judgment by representativeness, consider an individual who has been described by a former neighbor as follows: “Steve is very shy and withdrawn, invariably helpful, but with little interest in people, or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail.” How do people

assess the probability that Steve is engaged in a particular occupation from a list of possibilities (for example, farmer, salesman, airline pilot, librarian,

or physician)? How do people order these occupations from most to least likely? In the representativeness heuristic, the probability that Steve is a

librarian, for example, is assessed by the degree to which he is representa-

tive of, or similar to, the stereotype of a librarian. Indeed, research with problems of this type has shown that people order the occupations by probability and by similarity in exactly the same way (Kahneman & Tversky, 1973, 4). This approach to the judgment of probability leads to serious errors, because similarity, or representativeness, is not influenced

by several factors that should affect judgments of probability.

Insensitivity to prior probability of outcomes

One of the factors that have no effect on representativeness but should

have a major effect on probability is the prior probability, or base-rate frequency, of the outcomes. In the case of Steve, for example, the fact that

there are many more farmers than librarians in the population should enter into any reasonable estimate of the probability that Steve is a librarian rather than a farmer. Considerations of base-rate frequency, however, do not affect the similarity of Steve to the stereotypes of

librarians and farmers. If people evaluate probability by representativeness, therefore, prior probabilities will be neglected. This hypothesis was tested in an experiment where prior probabilities were manipulated

Judgment under uncertainty

5

(Kahneman & Tversky, 1973, 4). Subjects were shown brief personality descriptions of several individuals, allegedly sampled at random from a

group of 100 professionals - engineers and lawyers. The subjects were asked to assess, for each description, the probability that it belonged to an engineer rather than to a lawyer. In one experimental condition, subjects were told that the group from which the descriptions had been drawn

consisted of 70 engineers and 30 lawyers. In another condition, subjects were told that the group consisted of 30 engineers and 70 lawyers. The odds that any particular description belongs to an engineer rather than to a lawyer should be higher in the first condition, where there is a majority

of engineers, than in the second condition, where there is a majority of lawyers. Specifically, it can be shown by applying Bayes’ rule that the ratio of these odds should be (.7/.3)², or 5.44, for each description. In a sharp violation of Bayes’ rule, the subjects in the two conditions produced

essentially the same probability judgments. Apparently, subjects evaluated the likelihood that a particular description belonged to an engineer rather

than to a lawyer by the degree to which this description was representative of the two stereotypes, with little or no regard for the prior probabilities of the categories. The subjects used prior probabilities correctly when they had no other

information. In the absence of a personality sketch, they judged the probability that an unknown individual is an engineer to be .7 and .3,

respectively, in the two base-rate conditions. However, prior probabilities were effectively ignored when a description was introduced, even when this description was totally uninformative. The responses to the following

description illustrate this phenomenon: Dick is a 30 year old man. He is married with no children. A man of high ability and high motivation, he promises to be quite successful in his field. He is well

liked by his colleagues.

This description was intended to convey no information relevant to the question of whether Dick is an engineer or a lawyer. Consequently, the

probability that Dick is an engineer should equal the proportion of engineers in the group, as if no description had been given. The subjects, however, judged the probability of Dick being an engineer to be .5 regardless of whether the stated proportion of engineers in the group was .7 or .3. Evidently, people respond differently when given no evidence

and when given worthless evidence. When no specific evidence is given, prior probabilities are properly utilized; when worthless evidence is given, prior probabilities are ignored (Kahneman & Tversky, 1973, 4).
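The normative benchmark that these judgments violate can be written out in a short sketch. The likelihood ratio of any particular description is unknown, so the values looped over below are purely hypothetical; only the 70/30 and 30/70 base rates come from the experiment. Whatever the description says, Bayes' rule fixes the ratio of posterior odds across the two conditions at (.7/.3)² ≈ 5.44.

```python
# Hedged sketch of the Bayes'-rule benchmark for the engineer/lawyer study.

def posterior_odds_engineer(likelihood_ratio, n_engineers, n_lawyers):
    """Posterior odds (engineer : lawyer) = likelihood ratio x prior odds."""
    return likelihood_ratio * (n_engineers / n_lawyers)

for lr in (0.5, 1.0, 3.0):                       # hypothetical description strengths
    high_base = posterior_odds_engineer(lr, 70, 30)
    low_base = posterior_odds_engineer(lr, 30, 70)
    print(lr, round(high_base / low_base, 2))    # always (7/3)**2 = 5.44
```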

Insensitivity to sample size

To evaluate the probability of obtaining a particular result in a sample

drawn from a specified population, people typically apply the representa-


tiveness heuristic. That is, they assess the likelihood of a sample result, for example, that the average height in a random sample of ten men will be 6

feet (180 centimeters), by the similarity of this result to the corresponding parameter (that is, to the average height in the population of men). The similarity of a sample statistic to a population parameter does not depend on the size of the sample. Consequently, if probabilities are assessed by representativeness, then the judged probability of a sample statistic will be

essentially independent of sample size. Indeed, when subjects assessed the distributions of average height for samples of various sizes, they produced identical distributions. For example, the probability of obtaining an aver-

age height greater than 6 feet was assigned the same value for samples of 1000, 100, and 10 men (Kahneman & Tversky, 1972b, 3). Moreover, subjects failed to appreciate the role of sample size even when it was emphasized

in the formulation of the problem. Consider the following question: A certain town is served by two hospitals. In the larger hospital about 45 babies are born each day, and in the smaller hospital about 15 babies are born each day. As you know, about 50 percent of all babies are boys. However, the exact percentage varies from day to day. Sometimes it may be higher than 50 percent, sometimes

lower. For a period of 1 year, each hospital recorded the days on which more than 60 percent of the babies born were boys. Which hospital do you think recorded more such days?

The larger hospital (21)
The smaller hospital (21)
About the same (that is, within 5 percent of each other) (53)

The values in parentheses are the number of undergraduate students who chose each answer.
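The sampling-theory comparison can be computed directly. A minimal sketch, assuming births are independent with P(boy) = .5, the idealization in the problem; the scaling to 365 days is only illustrative.

```python
# Exact binomial tail for "more than 60 percent boys" on a single day.
from math import comb

def p_more_than_60_percent_boys(n_births):
    # strictly more than 60%: keep counts k with k/n > 0.6, using integers to avoid rounding issues
    favorable = sum(comb(n_births, k) for k in range(n_births + 1) if 10 * k > 6 * n_births)
    return favorable / 2 ** n_births

for n in (15, 45):
    p = p_more_than_60_percent_boys(n)
    print(f"{n} births/day: P(day exceeds 60% boys) = {p:.3f}, about {365 * p:.0f} days/year")
# The smaller hospital's probability (about .15) is roughly double the larger one's (about .07).
```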

Most subjects judged the probability of obtaining more than 60 percent boys to be the same in the small and in the large hospital, presumably because these events are described by the same statistic and are therefore equally representative of the general population. In contrast, sampling theory entails that the expected number of days on which more than 60 percent of the babies are boys is much greater in the small hospital than in the large one, because a large sample is less likely to stray from 50 percent. This fundamental notion of statistics is evidently not part of people’s repertoire of intuitions. A similar insensitivity to sample size has been reported in judgments of

posterior probability, that is, of the probability that a sample has been drawn from one population rather than from another. Consider the following example: Imagine an urn filled with balls, of which 2/3 are of one color and 1/3 of another. One individual has drawn 5 balls from the urn, and found that 4 were red and 1 was white. Another individual has drawn 20 balls and found that 12 were red and 8 were white. Which of the two individuals should feel more confident that the urn


contains 2/3 red balls and 1/3 white balls, rather than the opposite? What odds should each individual give? In this problem, the correct posterior odds are 8 to 1 for the 4:1 sample and 16 to 1 for the 12:8 sample, assuming equal prior probabilities. However, most people feel that the first sample provides much stronger evidence for the hypothesis that the urn is predominantly red, because the

proportion of red balls is larger in the first than in the second sample. Here again, intuitive judgments are dominated by the sample proportion and are essentially unaffected by the size of the sample, which plays a

crucial role in the determination of the actual posterior odds (Kahneman & Tversky, 1972b). In addition, intuitive estimates of posterior odds are far less extreme than the correct values. The underestimation of the impact of evidence has been observed repeatedly in problems of this type (W. Edwards, 1968, 25; Slovic & Lichtenstein, 1971). It has been labeled

"conservatism."

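For the urn problem above, the posterior odds follow from Bayes' rule with equal priors on the two compositions. Writing r and w for the numbers of red and white balls drawn, the likelihood ratio depends only on the difference r - w, which is why the larger but less lopsided sample is in fact the stronger evidence:

```latex
\frac{P(\text{2/3 red} \mid r, w)}{P(\text{1/3 red} \mid r, w)}
  = \frac{(2/3)^{r}(1/3)^{w}}{(1/3)^{r}(2/3)^{w}}
  = 2^{\,r - w}
\qquad\Rightarrow\qquad
2^{4-1} = 8 \text{ to } 1, \qquad 2^{12-8} = 16 \text{ to } 1.
```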
Misconceptions of chance

People expect that a sequence of events generated by a random process will represent the essential characteristics of that process even when the sequence is short. In considering tosses of a coin for heads or tails, for

example, people regard the sequence H-T-H-T-T-H to be more likely than the sequence H-H-H-T-T-T, which does not appear random, and also more likely than the sequence H-H-H-H-T-H, which does not represent the fairness of the coin (Kahneman & Tversky, 1972b, 3). Thus, people expect

that the essential characteristics of the process will be represented, not only globally in the entire sequence, but also locally in each of its parts. A

locally representative sequence, however, deviates systematically from chance expectation: it contains too many alternations and too few runs. Another consequence of the belief in local representativeness is the well-known gambler’s fallacy. After observing a long run of red on the

roulette wheel, for example, most people erroneously believe that black is now due, presumably because the occurrence of black will result in a more representative sequence than the occurrence of an additional red. Chance is commonly viewed as a self-correcting process in which a deviation in

one direction induces a deviation in the opposite direction to restore the

equilibrium. In fact, deviations are not “corrected” as a chance process unfolds, they are merely diluted. Misconceptions of chance are not limited to naive subjects. A study of

the statistical intuitions of experienced research psychologists (Tversky & Kahneman, 1971, 2) revealed a lingering belief in what may be called the “law of small numbers,” according to which even small samples are highly

representative of the populations from which they are drawn. The responses of these investigators reflected the expectation that a valid


hypothesis about a population will be represented by a statistically significant result in a sample - with little regard for its size. As a consequence, the researchers put too much faith in the results of small samples and grossly overestimated the replicability of such results. In the actual conduct of research, this bias leads to the selection of samples of inade-

quate size and to overinterpretation of findings.

Insensitivity to predictability

People are sometimes called upon to make such numerical predictions as the future value of a stock, the demand for a commodity, or the outcome of a football game. Such predictions are often made by representativeness. For example, suppose one is given a description of a company and is asked to predict its future profit. If the description of the company is very favorable, a very high profit will appear most representative of that description; if the description is mediocre, a mediocre performance will

appear most representative. The degree to which the description is favorable is unaffected by the reliability of that description or by the degree to

which it permits accurate prediction. Hence, if people predict solely in terms of the favorableness of the description, their predictions will be insensitive to the reliability of the evidence and to the expected accuracy of the prediction. This mode of judgment violates the normative statistical theory in

which the extremeness and the range of predictions are controlled by considerations of predictability. When predictability is nil, the same prediction should be made in all cases. For example, if the descriptions of

companies provide no information relevant to profit, then the same value (such as average profit) should be predicted for all companies. If predict-

ability is perfect, of course, the values predicted will match the actual values and the range of predictions will equal the range of outcomes. In general, the higher the predictability, the wider the range of predicted values. Several studies of numerical prediction have demonstrated that intui-

tive predictions violate this rule, and that subjects show little or no regard for considerations of predictability (Kahneman & Tversky, 1973, 4). In one of these studies, subjects were presented with several paragraphs, each describing the performance of a student teacher during a particular practice lesson. Some subjects were asked to evaluate the quality of the lesson described in the paragraph in percentile scores, relative to a specified population. Other subjects were asked to predict, also in percentile scores, the standing of each student teacher 5 years after the practice lesson. The judgments made under the two conditions were identical. That is, the prediction of a remote criterion (success of a teacher after 5 years) was identical to the evaluation of the information on which the prediction was based (the quality of the practice lesson). The students who made


these predictions were undoubtedly aware of the limited predictability of teaching competence on the basis of a single trial lesson 5 years earlier; nevertheless, their predictions were as extreme as their evaluations.

The illusion of validity

As we have seen, people often predict by selecting the outcome (for example, an occupation) that is most representative of the input (for example, the description of a person). The confidence they have in their prediction depends primarily on the degree of representativeness (that is, on the quality of the match between the selected outcome and the input)

with little or no regard for the factors that limit predictive accuracy. Thus, people express great confidence in the prediction that a person is a librarian when given a description of his personality which matches the

stereotype of librarians, even if the description is scanty, unreliable, or outdated. The unwarranted confidence which is produced by a good fit

between the predicted outcome and the input information may be called the illusion of validity. This illusion persists even when the judge is aware

of the factors that limit the accuracy of his predictions. It is a common observation that psychologists who conduct selection interviews often experience considerable confidence in their predictions, even when they

know of the vast literature that shows selection interviews to be highly fallible. The continued reliance on the clinical interview for selection, despite repeated demonstrations of its inadequacy, amply attests to the strength of this effect.

The internal consistency of a pattern of inputs is a major determinant of one's confidence in predictions based on these inputs. For example, people express more confidence in predicting the final grade-point average of a

student whose first-year record consists entirely of B's than in predicting the grade-point average of a student whose first-year record includes

many A's and C's. Highly consistent patterns are most often observed when the input variables are highly redundant or correlated. Hence, people tend to have great confidence in predictions based on redundant

input variables. However, an elementary result in the statistics of correlation asserts that, given input variables of stated validity, a prediction based

on several such inputs can achieve higher accuracy when they are independent of each other than when they are redundant or correlated. Thus, redundancy among inputs decreases accuracy even as it increases

confidence, and people are often confident in predictions that are quite likely to be off the mark (Kahneman & Tversky, 1973, 4).

Misconceptions of regression

Suppose a large group of children has been examined on two equivalent versions of an aptitude test. If one selects ten children from among those


who did best on one of the two versions, he will usually find their performance on the second version to be somewhat disappointing. Conversely, if one selects ten children from among those who did worst

on one version, they will be found, on the average, to do somewhat better on the other version. More generally, consider two variables X and Y which have the same distribution. If one selects individuals whose average X score deviates from the mean of X by k units, then the average of their Y scores will usually deviate from the mean of Y by less than k units. These observations illustrate a general phenomenon known as regression

toward the mean, which was first documented by Galton more than 100 years ago. In the normal course of life, one encounters many instances of regression toward the mean, in the comparison of the height of fathers and sons, of the intelligence of husbands and wives, or of the performance of individuals on consecutive examinations. Nevertheless, people do not develop correct intuitions about this phenomenon. First, they do not

expect regression in many contexts where it is bound to occur. Second, when they recognize the occurrence of regression, they often invent spurious causal explanations for it (Kahneman & Tversky, 1973, 4). We

suggest that the phenomenon of regression remains elusive because it is incompatible with the belief that the predicted outcome should be maximally representative of the input, and, hence, that the value of the

outcome variable should be as extreme as the value of the input variable. The failure to recognize the import of regression can have pernicious

consequences, as illustrated by the following observation (Kahneman & Tversky, 1973, 4). In a discussion of flight training, experienced instructors noted that praise for an exceptionally smooth landing is typically followed by a poorer landing on the next try, while harsh criticism after a rough landing is usually followed by an improvement on the next try. The instructors concluded that verbal rewards are detrimental to learning, while verbal punishments are beneficial, contrary to accepted psychological doctrine. This conclusion is unwarranted because of the presence of regression toward the mean. As in other cases of repeated examination, an improvement will usually follow a poor performance and a deterioration will usually follow an outstanding performance, even if the instructor does not respond to the trainee’s achievement on the first attempt. Because

the instructors had praised their trainees after good landings and admonished them after poor ones, they reached the erroneous and potentially harmful conclusion that punishment is more effective than reward. Thus, the failure to understand the effect of regression leads one to

overestimate the effectiveness of punishment and to underestimate the effectiveness of reward. In social interaction, as well as in training, rewards are typically administered when performance is good, and punishments are typically administered when performance is poor. By


regression alone, therefore, behavior is most likely to improve after punishment and most likely to deteriorate after reward. Consequently, the

human condition is such that, by chance alone, one is most often rewarded for punishing others and most often punished for rewarding them. People are generally not aware of this contingency. In fact, the elusive role of regression in determining the apparent consequences of reward and punishment seems to have escaped the notice of students of this area.
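The two-version selection effect described above is easy to reproduce in a small simulation; the population size, noise level, and the choice of the top ten are illustrative assumptions, not figures from the chapter.

```python
# Regression toward the mean: select the top scorers on one test version and
# look at their scores on an equivalent second version.
import random

random.seed(0)
N = 1000
ability = [random.gauss(100, 10) for _ in range(N)]      # latent ability
version1 = [a + random.gauss(0, 10) for a in ability]    # ability + test noise
version2 = [a + random.gauss(0, 10) for a in ability]    # independent noise, equivalent test

top10 = sorted(range(N), key=lambda i: version1[i], reverse=True)[:10]
mean = lambda xs: sum(xs) / len(xs)
print(mean([version1[i] for i in top10]))   # well above 100: selected for luck as well as ability
print(mean([version2[i] for i in top10]))   # closer to 100: the luck does not repeat
```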

Availability

There are situations in which people assess the frequency of a class or the

probability of an event by the ease with which instances or occurrences can be brought to mind. For example, one may assess the risk of heart attack among middle-aged people by recalling such occurrences among

one's acquaintances. Similarly, one may evaluate the probability that a given business venture will fail by imagining various difficulties it could encounter. This judgmental heuristic is called availability. Availability is a

useful clue for assessing frequency or probability, because instances of large classes are usually recalled better and faster than instances of less frequent classes. However, availability is affected by factors other than frequency and probability. Consequently, the reliance on availability leads to predictable biases, some of which are illustrated below.

Biases due to the retrievability of instances

When the size of a class is judged by the availability of its instances, a class

whose instances are easily retrieved will appear more numerous than a class of equal frequency whose instances are less retrievable. In an

elementary demonstration of this effect, subjects heard a list of well-known personalities of both sexes and were subsequently asked to judge whether the list contained more names of men than of women. Different

lists were presented to different groups of subjects. In some of the lists the men were relatively more famous than the women, and in others the

women were relatively more famous than the men. In each of the lists, the subjects erroneously judged that the class (sex) that had the more famous personalities was the more numerous (Tversky & Kahneman, 1973, 11).

In addition to familiarity, there are other factors, such as salience, which affect the retrievability of instances. For example, the impact of seeing a house burning on the subjective probability of such accidents is probably greater than the impact of reading about a fire in the local paper. Furthermore, recent occurrences are likely to be relatively more available

than earlier occurrences. It is a common experience that the subjective probability of traffic accidents rises temporarily when one sees a car

overturned by the side of the road.


Biases due to the effectiveness of a search set

Suppose one samples a word (of three letters or more) at random from an English text. Is it more likely that the word starts with r or that r is the third letter? People approach this problem by recalling words that begin with r (road) and words that have r in the third position (car) and assess

the relative frequency by the ease with which words of the two types come to mind. Because it is much easier to search for words by their first letter than by their third letter, most people judge words that begin with a given consonant to be more numerous than words in which the same consonant

appears in the third position. They do so even for consonants, such as r or k, that are more frequent in the third position than in the first (Tversky & Kahneman, 1973, 11).

Different tasks elicit different search sets. For example, suppose you are asked to rate the frequency with which abstract words (thought, love) and

concrete words (door, water) appear in written English. A natural way to answer this question is to search for contexts in which the word could

appear. It seems easier to think of contexts in which an abstract concept is mentioned (love in love stories) than to think of contexts in which a concrete word (such as door) is mentioned. If the frequency of words is

judged by the availability of the contexts in which they appear, abstract words will be judged as relatively more numerous than concrete words. This bias has been observed in a recent study (Galbraith & Underwood, 1973) which showed that the judged frequency of occurrence of abstract words was much higher than that of concrete words, equated in objective frequency. Abstract words were also judged to appear in a much greater

variety of contexts than concrete words.

Biases of imaginability

Sometimes one has to assess the frequency of a class whose instances are not stored in memory but can be generated according to a given rule. In

such situations, one typically generates several instances and evaluates frequency or probability by the ease with which the relevant instances can be constructed. However, the ease of constructing instances does not always reflect their actual frequency, and this mode of evaluation is prone

to biases. To illustrate, consider a group of 10 people who form committees of k members, 2 ≤ k ≤ 8. How many different committees of k members can be formed? The correct answer to this problem is given by the binomial coefficient (10 choose k), which reaches a maximum of 252 for k = 5. Clearly, the number of committees of k members equals the number of committees of (10 - k) members, because any committee of k members defines a unique group of (10 - k) nonmembers.
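The correct counts are easy to tabulate (a minimal sketch using Python's math.comb), and they form the symmetric, bell-shaped function that the intuitive estimates discussed below fail to track.

```python
from math import comb

for k in range(2, 9):
    print(k, comb(10, k))
# 2: 45, 3: 120, 4: 210, 5: 252, 6: 210, 7: 120, 8: 45
# The counts peak at k = 5 and are equal for k and 10 - k, whereas the median
# intuitive estimates fell monotonically from about 70 (k = 2) to about 20 (k = 8).
```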

One way to answer this question without computation is to mentally construct committees of k members and to evaluate their number by the


ease with which they come to mind. Committees of few members, say 2, are more available than committees of many members, say 8. The simplest scheme for the construction of committees is a partition of the group into

disjoint sets. One readily sees that it is easy to construct five disjoint committees of 2 members, while it is impossible to generate even two

disjoint committees of 8 members. Consequently, if frequency is assessed by imaginability, or by availability for construction, the small committees will appear more numerous than larger committees, in contrast to the correct bell-shaped function. Indeed, when naive subjects were asked to estimate the number of distinct committees of various sizes, their estimates were a decreasing monotonic function of committee size (Tversky & Kahneman, 1973, 11). For example, the median estimate of the number of committees of 2 members was 70, while the estimate for committees of 8 members was 20 (the correct answer is 45 in both cases). Imaginability plays an important role in the evaluation of probabilities in real-life situations. The risk involved in an adventurous expedition, for example, is evaluated by imagining contingencies with which the expedition is not equipped to cope. If many such difficulties are vividly

portrayed, the expedition can be made to appear exceedingly dangerous, although the ease with which disasters are imagined need not reflect their

actual likelihood. Conversely, the risk involved in an undertaking may be grossly underestimated if some possible dangers are either difficult to conceive of, or simply do not come to mind.

Illusory correlation

Chapman and Chapman (1969) have described an interesting bias in the judgment of the frequency with which two events co-occur. They

presented naive judges with information concerning several hypothetical mental patients. The data for each patient consisted of a clinical diagnosis

and a drawing of a person made by the patient. Later the judges estimated the frequency with which each diagnosis (such as paranoia or suspicious-

ness) had been accompanied by various features of the drawing (such as peculiar eyes). The subjects markedly overestimated the frequency of co-occurrence of natural associates, such as suspiciousness and peculiar

eyes. This effect was labeled illusory correlation. In their erroneous judgments of the data to which they had been exposed, naive subjects

"rediscovered" much of the common, but unfounded, clinical lore

concerning the interpretation of the draw-a-person test. The illusory correlation effect was extremely resistant to contradictory data. It persisted even when the correlation between symptom and diagnosis was actually

negative, and it prevented the judges from detecting relationships that were in fact present. Availability provides a natural account for the illusory-correlation effect. The judgment of how frequently two events co-occur could be


based on the strength of the associative bond between them. When the association is strong, one is likely to conclude that the events have been frequently paired. Consequently, strong associates will be judged to have

occurred together frequently. According to this view, the illusory correlation between suspiciousness and peculiar drawing of the eyes, for example, is due to the fact that suspiciousness is more readily associated with the eyes than with any other part of the body. Lifelong experience has taught us that, in general, instances of large classes are recalled better and faster than instances of less frequent classes;

that likely occurrences are easier to imagine than unlikely ones; and that

the associative connections between events are strengthened when the events frequently co-occur. As a result, man has at his disposal a procedure (the availability heuristic) for estimating the numerosity of a class, the

likelihood of an event, or the frequency of co-occurrences, by the ease with which the relevant mental operations of retrieval, construction, or association can be performed. However, as the preceding examples have demonstrated, this valuable estimation procedure results in systematic

errors.

Adjustment and anchoring

In many situations, people make estimates by starting from an initial value

that is adjusted to yield the final answer. The initial value, or starting point, may be suggested by the formulation of the problem, or it may be

the result of a partial computation. In either case, adjustments are typically insufficient (Slovic & Lichtenstein, 1971). That is, different starting points yield different estimates, which are biased toward the initial values. We call this phenomenon anchoring.

Insufficient adjustment

In a demonstration of the anchoring effect, subjects were asked to estimate various quantities, stated in percentages (for example, the percentage of African countries in the United Nations). For each quantity, a number between 0 and 100 was determined by spinning a wheel of fortune in the subjects’ presence. The subjects were instructed to indicate first whether

that number was higher or lower than the value of the quantity, and then to estimate the value of the quantity by moving upward or downward from the given number. Different groups were given different numbers for each quantity, and these arbitrary numbers had a marked effect on

estimates. For example, the median estimates of the percentage of African countries in the United Nations were 25 and 45 for groups that received 10 and 65, respectively, as starting points. Payoffs for accuracy did not reduce

the anchoring effect. Anchoring occurs not only when the starting point is given to the


subject, but also when the subject bases his estimate on the result of some incomplete computation. A study of intuitive numerical estimation illus-

trates this effect. Two groups of high school students estimated, within 5 seconds, a numerical expression that was written on the blackboard. One group estimated the product

8 × 7 × 6 × 5 × 4 × 3 × 2 × 1

while another group estimated the product

1 × 2 × 3 × 4 × 5 × 6 × 7 × 8

To rapidly answer such questions, people may perform a few steps of computation and estimate the product by extrapolation or adjustment.

Because adjustments are typically insufficient, this procedure should lead to underestimation. Furthermore, because the result of the first few steps

of multiplication (performed from left to right) is higher in the descending sequence than in the ascending sequence, the former expression

should be judged larger than the latter. Both predictions were confirmed. The median estimate for the ascending sequence was 512, while the median estimate for the descending sequence was 2,250. The correct answer is 40,320.
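A small sketch of the anchoring account; treating the first three left-to-right products as the anchor is an illustrative choice, not something the study measured.

```python
from functools import reduce
from operator import mul

descending = [8, 7, 6, 5, 4, 3, 2, 1]
ascending = descending[::-1]

def partial_products(seq, steps=3):
    out, p = [], 1
    for x in seq[:steps]:
        p *= x
        out.append(p)
    return out

print(partial_products(descending))   # [8, 56, 336]  -> a high anchor
print(partial_products(ascending))    # [1, 2, 6]     -> a low anchor
print(reduce(mul, descending))        # 40320: the true product dwarfs both median estimates
```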

Biases in the evaluation of conjunctive and disjunctive events

In a recent study by Bar-Hillel (1973) subjects were given the opportunity to bet on one of two events. Three types of events were used: (i) simple events, such as drawing a red marble from a bag containing 50 percent red

marbles and 50 percent white marbles; (ii) conjunctive events, such as drawing a red marble seven times in succession, with replacement, from a bag containing 90 percent red marbles and 10 percent white marbles; and (iii) disjunctive events, such as drawing a red marble at least once in seven successive tries, with replacement, from a bag containing 10 percent red marbles and 90 percent white marbles. In this problem, a significant

majority of subjects preferred to bet on the conjunctive event (the probability of which is .48) rather than on the simple event (the probability of which is .50). Subjects also preferred to bet on the simple event rather than

on the disjunctive event, which has a probability of .52. Thus, most subjects bet on the less likely event in both comparisons. This pattern of choices illustrates a general finding. Studies of choice among gambles and

of judgments of probability indicate that people tend to overestimate the probability of conjunctive events (Cohen, Chesnick, & Haran, 1972, 24) and to underestimate the probability of disjunctive events. These biases are readily explained as effects of anchoring. The stated probability of the elementary event (success at any one stage) provides a natural starting

point for the estimation of the probabilities of both conjunctive and disjunctive events. Since adjustment from the starting point is typically


insufficient, the final estimates remain too close to the probabilities of the elementary events in both cases. Note that the overall probability of a conjunctive event is lower than the probability of each elementary event,

whereas the overall probability of a disjunctive event is higher than the probability of each elementary event. As a consequence of anchoring, the overall probability will be overestimated in conjunctive problems and underestimated in disjunctive problems. Biases in the evaluation of compound events are particularly significant in the context of planning. The successful completion of an undertaking, such as the development of a new product, typically has a conjunctive character: for the undertaking to succeed, each of a series of events must

occur. Even when each of these events is very likely, the overall probability of success can be quite low if the number of events is large. The general tendency to overestimate the probability of conjunctive events leads to unwarranted optimism in the evaluation of the likelihood that a plan will succeed or that a project will be completed on time. Conversely, disjunctive structures are typically encountered in the evaluation of risks. A complex system, such as a nuclear reactor or a human body, will malfunc-

tion if any of its essential components fails. Even when the likelihood of failure in each component is slight, the probability of an overall failure can be high if many components are involved. Because of anchoring, people will tend to underestimate the probabilities of failure in complex

systems. Thus, the direction of the anchoring bias can sometimes be inferred from the structure of the event. The chain-like structure of

conjunctions leads to overestimation, the funnel-like structure of disjunctions leads to underestimation.
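The three probabilities quoted above can be verified in a few lines, which also shows how far the elementary-event anchors (.9 and .1) sit from the compound probabilities.

```python
p_simple = 0.50                      # one draw from a bag with 50% red marbles
p_conjunctive = 0.9 ** 7             # seven reds in a row, 90% red bag, with replacement
p_disjunctive = 1 - (1 - 0.1) ** 7   # at least one red in seven draws, 10% red bag

print(round(p_conjunctive, 2))   # 0.48: below .50, yet anchored judgments start near .9
print(round(p_disjunctive, 2))   # 0.52: above .50, yet anchored judgments start near .1
```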

Anchoring in the assessment of subjective probability distributions

In decision analysis, experts are often required to express their beliefs about a quantity, such as the value of the Dow-Jones average on a particular day, in the form of a probability distribution. Such a distribution is usually constructed by asking the person to select values of the quantity that correspond to specified percentiles of his subjective probability distribution. For example, the judge may be asked to select a

number, X90, such that his subjective probability that this number will be higher than the value of the Dow-Jones average is .90. That is, he should

select the value X90 so that he is just willing to accept 9 to 1 odds that the Dow-Jones average will not exceed it. A subjective probability distribution for the value of the Dow-Jones average can be constructed from several such judgments corresponding to different percentiles. By collecting subjective probability distributions for many different

quantities, it is possible to test the judge for proper calibration. A judge is

properly (or externally) calibrated in a set of problems if exactly Π percent of the true values of the assessed quantities falls below his stated values of


XΠ. For example, the true values should fall below X01 for 1 percent of the quantities and above X99 for 1 percent of the quantities. Thus, the true values should fall in the confidence interval between X01 and X99 on 98 percent of the problems. Several investigators (Alpert & Raiffa, 1969, 21; Stael von Holstein,

1971b; Winkler, 1967) have obtained probability distributions for many quantities from a large number of judges. These distributions indicated

large and systematic departures from proper calibration. In most studies, the actual values of the assessed quantities are either smaller than X01 or greater than X99 for about 30 percent of the problems. That is, the subjects state overly narrow confidence intervals which reflect more certainty than is justified by their knowledge about the assessed quantities. This bias is common to naive and to sophisticated subjects, and it is not eliminated by

introducing proper scoring rules, which provide incentives for external

calibration. This effect is attributable, in part at least, to anchoring. To select X90 for the value of the Dow-Jones average, for example, it is natural to begin by thinking about one’s best estimate of the Dow-Jones and to adjust this value upward. If this adjustment - like most others - is insufficient, then X90 will not be sufficiently extreme. A similar anchoring effect will occur in the selection of X10, which is presumably obtained by adjusting one's best estimate downward. Consequently, the confidence interval between X10 and X90 will be too narrow, and the assessed probability distribution will be too tight. In support of this interpretation it can be shown that subjective probabilities are systematically altered by a procedure in which one's best estimate does not serve as an anchor.

Subjective probability distributions for a given quantity (the Dow-Jones average) can be obtained in two different ways: (i) by asking the subject to

select values of the Dow-Jones that correspond to specified percentiles of his probability distribution and (ii) by asking the subject to assess the

probabilities that the true value of the Dow-Jones will exceed some

specified values. The two procedures are formally equivalent and should yield identical distributions. However, they suggest different modes of adjustment from different anchors. In procedure (i), the natural starting point is one's best estimate of the quantity. In procedure (ii), on the other hand, the subject may be anchored on the value stated in the question. Alternatively, he may be anchored on even odds, or 50-50 chances, which

is a natural starting point in the estimation of likelihood. In either case, procedure (ii) should yield less extreme odds than procedure (i).

To contrast the two procedures, a set of 24 quantities (such as the air distance from New Delhi to Peking) was presented to a group of subjects who assessed either X10 or X90 for each problem. Another group of subjects

received the median judgment of the first group for each of the 24

quantities. They were asked to assess the odds that each of the given values exceeded the true value of the relevant quantity. In the absence of any bias, the second group should retrieve the odds specified to the first group,


that is, 9:1. However, if even odds or the stated value serve as anchors, the odds of the second group should be less extreme, that is, closer to 1:1.

Indeed, the median odds stated by this group, across all problems, were 3:1. When the judgments of the two groups were tested for external calibration, it was found that subjects in the first group were too extreme, in accord with earlier studies. The events that they defined as having a probability of .10 actually obtained in 24 percent of the cases. In contrast, subjects in the second group were too conservative. Events to which they

assigned an average probability of .34 actually obtained in 26 percent of the cases. These results illustrate the manner in which the degree of calibration depends on the procedure of elicitation.

Discussion

This article has been concerned with cognitive biases that stem from the reliance on judgmental heuristics. These biases are not attributable to

motivational effects such as wishful thinking or the distortion of judgments by payoffs and penalties. Indeed, several of the severe errors of judgment reported earlier occurred despite the fact that subjects were encouraged to be accurate and were rewarded for the correct answers (Kahneman & Tversky, 1972b, 3; Tversky & Kahneman, 1973, 11). The reliance on heuristics and the prevalence of biases are not restricted

to laymen. Experienced researchers are also prone to the same biases when they think intuitively. For example, the tendency to predict the

outcome that best represents the data, with insufficient regard for prior probability, has been observed in the intuitive judgments of individuals who have had extensive training in statistics (Kahneman 8: Tversky, 1973, 4; Tversky 8.: Kahneman, 1971, 2). Although the statistically sophisticated

avoid elementary errors, such as the gambler’s fallacy, their intuitive judgments are liable to similar fallacies in more intricate and less transparent problems. It is not surprising that useful heuristics such as representativeness and availability are retained, even though they occasionally lead to errors in

prediction or estimation. What is perhaps surprising is the failure of people to infer from lifelong experience such fundamental statistical rules as regression toward the mean, or the effect of sample size on sampling variability. Although everyone is exposed, in the normal course of life, to numerous examples from which these rules could have been induced, very few people discover the principles of sampling and regression on their

own. Statistical principles are not learned from everyday experience because the relevant instances are not coded appropriately. For example, people do not discover that successive lines in a text differ more in average word length than do successive pages, because they simply do not attend to the average word length of individual lines or pages. Thus, people do


not learn the relation between sample size and sampling variability,

although the data for such learning are abundant. The lack of an appropriate code also explains why people usually do not detect the biases in their judgments of probability. A person could conceivably learn whether his judgments are externally calibrated by keeping a tally of the proportion of events that actually occur among those to which he assigns the same probability. However, it is not natural to

group events by their judged probability. In the absence of such grouping it is impossible for an individual to discover, for example, that only 50

percent of the predictions to which he has assigned a probability of .9 or higher actually come true.
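A minimal sketch of the tally that would be needed; the records below are invented, and real use would require logging one's own forecasts and their outcomes.

```python
# Group one's probability judgments into buckets and compare each bucket's hit
# rate with the judged probability itself (external calibration).
from collections import defaultdict

records = [  # (judged probability, did the event occur?) - invented data
    (0.9, True), (0.9, False), (0.9, True), (0.9, False),
    (0.6, True), (0.6, True), (0.6, False),
    (0.3, False), (0.3, True), (0.3, False),
]

buckets = defaultdict(list)
for p, occurred in records:
    buckets[p].append(occurred)

for p in sorted(buckets):
    hits = buckets[p]
    print(f"judged {p:.1f}: observed frequency {sum(hits) / len(hits):.2f} over {len(hits)} events")
# A calibrated judge's observed frequencies would match the judged probabilities.
```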

The empirical analysis of cognitive biases has implications for the theoretical and applied role of judged probabilities. Modern decision theory (de Finetti, 1963; Savage, 1954) regards subjective probability as the

quantified opinion of an idealized person. Specifically, the subjective probability of a given event is defined by the set of bets about this event

that such a person is willing to accept. An internally consistent, or coherent, subjective probability measure can be derived for an individual

if his choices among bets satisfy certain principles, that is, the axioms of the theory. The derived probability is subjective in the sense that different

individuals are allowed to have different probabilities for the same event. The major contribution of this approach is that it provides a rigorous subjective interpretation of probability that is applicable to unique events and is embedded in a general theory of rational decision. It should perhaps be noted that, while subjective probabilities can sometimes be inferred from preferences among bets, they are normally not

formed in this fashion. A person bets on team A rather than on team B because he believes that team A is more likely to win; he does not infer this belief from his betting preferences. Thus, in reality, subjective proba-

bilities determine preferences among bets and are not derived from them,

as in the axiomatic theory of rational decision (Savage, 1954). The inherently subjective nature of probability has led many students to the belief that coherence, or internal consistency, is the only valid criterion by which judged probabilities should be evaluated. From the standpoint of the formal theory of subjective probability, any set of

internally consistent probability judgments is as good as any other. This criterion is not entirely satisfactory, because an internally consistent set of subjective probabilities can be incompatible with other beliefs held by the

individual. Consider a person whose subjective probabilities for all possible outcomes of a coin-tossing game reflect the gambler's fallacy. That is, his estimate of the probability of tails on a particular toss increases with the number of consecutive heads that preceded that toss. The judgments of

such a person could be internally consistent and therefore acceptable as adequate subjective probabilities according to the criterion of the formal


theory. These probabilities, however, are incompatible with the generally held belief that a coin has no memory and is therefore incapable of generating sequential dependencies. For judged probabilities to be consid-

ered adequate, or rational, internal consistency is not enough. The judgments must be compatible with the entire web of beliefs held by the individual. Unfortunately, there can be no simple formal procedure for assessing the compatibility of a set of probability judgments with the judge's total system of beliefs. The rational judge will nevertheless strive

for compatibility, even though internal consistency is more easily achieved and assessed. In particular, he will attempt to make his probability judgments compatible with his knowledge about the subject matter, the

laws of probability, and his own judgmental heuristics and biases.

Summary

This article described three heuristics that are employed in making

judgments under uncertainty: (i) representativeness, which is usually employed when people are asked to judge the probability that an object or

event A belongs to class or process B; (ii) availability of instances or scenarios, which is often employed when people are asked to assess the frequency of a class or the plausibility of a particular development; and (iii) adjustment from an anchor, which is usually employed in numerical

prediction when a relevant value is available. These heuristics are highly economical and usually effective, but they lead to systematic and predictable errors. A better understanding of these heuristics and of the biases to which they lead could improve judgments and decisions in situations of uncertainty.

Part II

Representativeness

2. Belief in the law of small numbers

Amos Tversky and Daniel Kahneman

“Suppose you have run an experiment on 20 subjects, and have obtained a significant result which confirms your theory (z = 2.23, p < .05, two-

tailed). You now have cause to run an additional group of 10 subjects. What do you think the probability is that the results will be significant, by a one-tailed test, separately for this group?”

If you feel that the probability is somewhere around .85, you may be pleased to know that you belong to a majority group. Indeed, that was the

median answer of two small groups who were kind enough to respond to a questionnaire distributed at meetings of the Mathematical Psychology

Group and of the American Psychological Association. On the other hand, if you feel that the probability is around .48, you belong to a minority. Only 9 of our 84 respondents gave answers between

.40 and .60. However, .48 happens to be a much more reasonable estimate than .85.¹

¹ The required estimate can be interpreted in several ways. One possible approach is to follow common research practice, where a value obtained in one study is taken to define a plausible alternative to the null hypothesis. The probability requested in the question can then be interpreted as the power of the second test (i.e., the probability of obtaining a significant result in the second sample) against the alternative hypothesis defined by the result of the first sample. In the special case of a test of a mean with known variance, one would compute the power of the test against the hypothesis that the population mean equals the mean of the first sample. Since the size of the second sample is half that of the first, the computed probability of obtaining z ≥ 1.645 is only .473. A theoretically more justifiable approach is to interpret the requested probability within a Bayesian framework and compute it relative to some appropriately selected prior distribution. Assuming a uniform prior, the desired posterior probability is .478. Clearly, if the prior distribution favors the null hypothesis, as is often the case, the posterior probability will be even smaller.

[This chapter originally appeared in Psychological Bulletin, 1971, 76, 105-110. Copyright © 1971 by the American Psychological Association. Reprinted by permission.]

Apparently, most psychologists have an exaggerated belief in the likelihood of successfully replicating an obtained finding. The sources of such beliefs, and their consequences for the conduct of scientific inquiry, are what this paper is about. Our thesis is that people have strong intuitions about random sampling; that these intuitions are wrong in fundamental

respects; that these intuitions are shared by naive subjects and by trained scientists; and that they are applied with unfortunate consequences in the course of scientific inquiry. We submit that people view a sample randomly drawn from a popula-

tion as highly representative, that is, similar to the population in all essential characteristics. Consequently, they expect any two samples drawn from a particular population to be more similar to one another and

to the population than sampling theory predicts, at least for small samples. The tendency to regard a sample as a representation is manifest in a wide variety of situations. When subjects are instructed to generate a random sequence of hypothetical tosses of a fair coin, for example, they

produce sequences where the proportion of heads in any short segment stays far closer to .50 than the laws of chance would predict (Tune, 1964).

Thus, each segment of the response sequence is highly representative of the “fairness” of the coin. Similar effects are observed when subjects successively predict events in a randomly generated series, as in probability learning experiments (Estes, 1964) or in other sequential games of chance. Subjects act as if every segment of the random sequence must

reflect the true proportion: if the sequence has strayed from the population proportion, a corrective bias in the other direction is expected. This has been called the gambler’s fallacy. The heart of the gambler's fallacy is a misconception of the fairness of the laws of chance. The gambler feels that the fairness of the coin entitles

him to expect that any deviation in one direction will soon be cancelled by a corresponding deviation in the other. Even the fairest of coins, however, given the limitations of its memory and moral sense, cannot be as fair as the gambler expects it to be. This fallacy is not unique to gamblers.

Consider the following example: The mean IQ of the population of eighth graders in a city is known to be 100. You have selected a random sample of 50 children for a study of educational achievements. The first child tested has an IQ of 150. What do you expect the mean IQ to be for the whole sample?

The correct answer is 101. A surprisingly large number of people believe that the expected IQ for the sample is still 100. This expectation can be

justified only by the belief that a random process is self-correcting. Idioms such as “errors cancel each other out” reflect the image of an active self-correcting process. Some familiar processes in nature obey such laws:

a deviation from a stable equilibrium produces a force that restores the equilibrium. The laws of chance, in contrast, do not work that way: deviations are not canceled as sampling proceeds, they are merely diluted.
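The arithmetic behind the answer of 101 is a one-line check:

    # The first child's IQ of 150 is not cancelled out by later observations;
    # it is merely diluted among the remaining 49 children, who are expected to average 100.
    expected_sample_mean = (150 + 49 * 100) / 50
    print(expected_sample_mean)    # 101.0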


Thus far, we have attempted to describe two related intuitions about chance. We proposed a representation hypothesis according to which people believe samples to be very similar to one another and to the population from which they are drawn. We also suggested that people believe sampling to be a self-correcting process. The two beliefs lead to the same consequences. Both generate expectations about characteristics of samples, and the variability of these expectations is less than the true

variability, at least for small samples. The law of large numbers guarantees that very large samples will indeed be highly representative of the population from which they are drawn. If, in addition, a self-corrective tendency is at work, then small samples should also be highly representative and similar to one another. People's intuitions about random sampling appear to satisfy the law of

small numbers, which asserts that the law of large numbers applies to small numbers as well. Consider a hypothetical scientist who lives by the law of small numbers.

How would his belief affect his scientific work? Assume our scientist studies phenomena whose magnitude is small relative to uncontrolled variability, that is, the signal-to-noise ratio in the messages he receives from nature is low. Our scientist could be a meteorologist, a pharmacolo-

gist, or perhaps a psychologist. If he believes in the law of small numbers, the scientist will have

exaggerated confidence in the validity of conclusions based on small samples. To illustrate, suppose he is engaged in studying which of two toys infants will prefer to play with. Of the first five infants studied, four have shown a preference for the same toy. Many a psychologist will feel

some confidence at this point, that the null hypothesis of no preference is false. Fortunately, such a conviction is not a sufficient condition for journal publication, although it may do for a book. By a quick computa-

tion, our psychologist will discover that the probability of a result as extreme as the one obtained is as high as 3/8 under the null hypothesis.
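The psychologist's quick computation is easy to reproduce; a minimal sketch of the two-tailed binomial calculation (a 4-1 or more extreme split toward either toy):

    from math import comb

    # Probability of a split at least as extreme as 4-1 among 5 infants
    # when there is in fact no preference between the two toys.
    p = sum(comb(5, k) for k in (0, 1, 4, 5)) / 2 ** 5
    print(p)    # 0.375, i.e., 3/8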

To be sure, the application of statistical hypothesis testing to scientific inference is beset with serious difficulties. Nevertheless, the computation

of significance levels (or likelihood ratios, as a Bayesian might prefer) forces the scientist to evaluate the obtained effect in terms of a valid

estimate of sampling variance rather than in terms of his subjective biased estimate. Statistical tests, therefore, protect the scientific community against overly hasty rejections of the null hypothesis (i.e., Type I error) by

policing its many members who would rather live by the law of small numbers. On the other hand, there are no comparable safeguards against

the risk of failing to confirm a valid research hypothesis (i.e., Type II error).

Imagine a psychologist who studies the correlation between need for achievement and grades. When deciding on sample size, he may reason as

follows: "What correlation do I expect? r = .35. What N do I need to make the result significant? (Looks at table.) N = 33. Fine, that's my sample."


The only flaw in this reasoning is that our psychologist has forgotten about sampling variation, possibly because he believes that any sample must be highly representative of its population. However, if his guess

about the correlation in the population is correct, the correlation in the sample is about as likely to lie below or above .35. Hence, the likelihood of

obtaining a significant result (i.e., the power of the test) for N = 33 is about .50.
In a detailed investigation of statistical power, J. Cohen (1962, 1969) has provided plausible definitions of large, medium, and small effects and an extensive set of computational aids to the estimation of power for a variety of statistical tests. In the normal test for a difference between two means, for example, a difference of .25σ is small, a difference of .50σ is medium, and a difference of 1σ is large, according to the proposed definitions. The mean IQ difference between clerical and semiskilled workers is a medium effect. In an ingenious study of research practice, J. Cohen (1962) reviewed all the statistical analyses published in one volume of the Journal of Abnormal and Social Psychology, and computed the likelihood of detecting each of the three sizes of effect. The average power was .18 for the

detection of small effects, .48 for medium effects, and .83 for large effects. If psychologists typically expect medium effects and select sample size as in the above example, the power of their studies should indeed be about .50.
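The .50 figure for the hypothetical psychologist can be checked with a rough power calculation based on Fisher's z transformation; a sketch, assuming a two-tailed .05 test of the observed correlation:

    from math import atanh, sqrt
    from scipy.stats import norm, t

    n, rho = 33, 0.35
    t_crit = t.ppf(0.975, n - 2)                   # critical t for the test of r
    r_crit = t_crit / sqrt(t_crit ** 2 + n - 2)    # corresponding critical correlation
    se = 1 / sqrt(n - 3)                           # standard error of Fisher's z
    power = 1 - norm.cdf((atanh(r_crit) - atanh(rho)) / se)
    print(round(power, 2))                         # roughly .50, as stated in the text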

Cohen's analysis shows that the statistical power of many psychological studies is ridiculously low. This is a self-defeating practice: it makes for frustrated scientists and inefficient research. The investigator who tests a

valid hypothesis but fails to obtain significant results cannot help but regard nature as untrustworthy or even hostile. Furthermore, as Overall (1969) has shown, the prevalence of studies deficient in statistical power is not only wasteful but actually pernicious: it results in a large proportion of

invalid rejections of the null hypothesis among published results. Because considerations of statistical power are of particular importance in the design of replication studies, we probed attitudes concerning replication in our questionnaire. Suppose one of your doctoral students has completed a difficult and time-

consuming experiment on 40 animals. He has scored and analyzed a large number of variables. His results are generally inconclusive, but one before-after comparison yields a highly significant t = 2.70, which is surprising and could be of major theoretical significance. Considering the importance of the result, its surprisal value, and the number of

analyses that your student has performed, would you recommend that he replicate the study before publishing? If you recommend replication, how many animals would you urge him to run?

Among the psychologists to whom we put these questions there was

overwhelming sentiment favoring replication: it was recommended by 66


out of 75 respondents, probably because they suspected that the single significant result was due to chance. The median recommendation was for the doctoral student to run 20 subjects in a replication study. It is instructive to consider the likely consequences of this advice. If the mean and the variance in the second sample are actually identical to those in the first sample, then the resulting value of t will be 1.88. Following the reasoning of Footnote 1, the student's chance of obtaining a significant result in the replication is only slightly above one-half (for p = .05, one-tail test). Since we had anticipated that a replication sample of 20 would appear reasonable to our respondents, we added the following question:

Assume that your unhappy student has in fact repeated the initial study with 20 additional animals, and has obtained an insignificant result in the same direction, t = 1.24. What would you recommend now? Check one: [the numbers in

parentheses refer to the number of respondents who checked each answer]
(a) He should pool the results and publish his conclusion as fact. (0)
(b) He should report the results as a tentative finding. (26)
(c) He should run another group of [median 20] animals. (21)
(d) He should try to find an explanation for the difference between the two groups. (30)

Note that regardless of one’s confidence in the original finding, its credibility is surely enhanced by the replication. Not only is the experimental effect in the same direction in the two samples but the magnitude

of the effect in the replication is fully two-thirds of that in the original study. In view of the sample size (20), which our respondents recom-

mended, the replication was about as successful as one is entitled to expect. The distribution of responses, however, reflects continued skepticism

concerning the student's finding following the recommended replication. This unhappy state of affairs is a typical consequence of insufficient statistical power.
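The "two-thirds" figure follows from the reported t values and sample sizes; a minimal check, treating each t as a before-after (one-sample) t so that the implied standardized effect is t divided by the square root of N:

    from math import sqrt

    d_original = 2.70 / sqrt(40)       # original study: t = 2.70, N = 40
    d_replication = 1.24 / sqrt(20)    # replication:    t = 1.24, N = 20
    print(round(d_replication / d_original, 2))    # about 0.65, i.e., two-thirds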

In contrast to Responses b and c, which can be justified on some grounds, the most popular response, Response d, is indefensible. We doubt that the same answer would have been obtained if the respondents had realized that the difference between the two studies does not even approach significance. (If the variances of the two samples are equal, t for

the difference is .53.) In the absence of a statistical test, our respondents followed the representation hypothesis: as the difference between the two samples was larger than they expected, they viewed it as worthy of

explanation. However, the attempt to "find an explanation for the difference between the two groups" is in all probability an exercise in explaining noise. Altogether our respondents evaluated the replication rather harshly.

This follows from the representation hypothesis: if we expect all samples to be very similar to one another, then almost all replications of a valid


hypothesis should be statistically significant. The harshness of the criterion for successful replication is manifest in the responses to the following question: An investigator has reported a result that you consider implausible. He ran 15 subjects, and reported a significant value, t = 2.46. Another investigator has attempted to duplicate his procedure, and he obtained a nonsignificant value of t with the same number of subjects. The direction was the same in both sets of data.

You are reviewing the literature. What is the highest value of t in the second set of data that you would describe as a failure to replicate?

The majority of our respondents regarded t = 1.70 as a failure to replicate. If the data of two such studies (t = 2.46 and t = 1.70) are pooled, the value of t for the combined data is about 3.00 (assuming equal variances). Thus, we are faced with a paradoxical state of affairs, in which the same data that would increase our confidence in the finding when viewed as part of the

original study, shake our confidence when viewed as an independent study. This double standard is particularly disturbing since, for many

reasons, replications are usually considered as independent studies, and hypotheses are often evaluated by listing confirming and disconfirming reports. Contrary to a widespread belief, a case can be made that a replication

sample should often be larger than the original. The decision to replicate a once obtained finding often expresses a great fondness for that finding and a desire to see it accepted by a skeptical community. Since that community unreasonably demands that the replication be independently significant, or at least that it approach significance, one must run a large sample. To illustrate, if the unfortunate doctoral student whose thesis was

discussed earlier assumes the validity of his initial result (t = 2.70, N = 40), and if he is willing to accept a risk of only .10 of obtaining a t lower than 1.70, he should run approximately 50 animals in his replication study. With a somewhat weaker initial result (t = 2.20, N = 40), the size of the replication sample required for the same power rises to about 75.
That the effects discussed thus far are not limited to hypotheses about means and variances is demonstrated by the responses to the following

question:

You have run a correlational study, scoring 20 variables on 100 subjects. Twenty-seven of the 190 correlation coefficients are significant at the .05 level, and 9 of these are significant beyond the .01 level. The mean absolute level of the significant correlations is .31, and the pattern of results is very reasonable on theoretical grounds. How many of the 27 significant correlations would you expect to be significant again, in an exact replication of the study, with N = 40?

With N = 40, a correlation of about .31 is required for significance at the .05 level. This is the mean of the significant correlations in the original study. Thus, only about half of the originally significant correlations (i.e., 13 or 14) would remain significant with N = 40. In addition, of course, the


correlations in the replication are bound to differ from those in the original study. Hence, by regression effects, the initially significant coefficients are most likely to be reduced. Thus, 8 to 10 repeated significant correlations from the original 27 is probably a generous estimate of what one is entitled to expect. The median estimate of our respondents is 18. This is more than the number of repeated significant correlations that will be found if the correlations are recomputed for 40 subjects randomly selected from the original 100! Apparently, people expect more than a mere duplication of the original statistics in the replication sample: they expect a duplication of the significance of results, with little regard for sample size. This expectation requires a ludicrous extension of the repre-

sentation hypothesis: even the law of small numbers is incapable of generating such a result.
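The critical correlation of .31 can be verified directly; a sketch, assuming a two-tailed .05 test with N = 40:

    from math import sqrt
    from scipy.stats import t

    df = 40 - 2
    t_crit = t.ppf(0.975, df)
    r_crit = t_crit / sqrt(t_crit ** 2 + df)
    print(round(r_crit, 2))    # about 0.31, the mean of the originally significant correlations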

The expectation that patterns of results are replicable almost in their entirety provides the rationale for a common, though much deplored practice. The investigator who computes all correlations between three indexes of anxiety and three indexes of dependency will often report and

interpret with great confidence the single significant correlation obtained. His confidence in the shaky finding stems from his belief that the obtained correlation matrix is highly representative and readily replicable.

In review, we have seen that the believer in the law of small numbers practices science as follows:

1. He gambles his research hypotheses on small samples without realizing that the odds against him are unreasonably high. He overestimates power.
2. He has undue confidence in early trends (e.g., the data of the first few subjects) and in the stability of observed patterns (e.g., the number and identity of significant results). He overestimates significance.
3. In evaluating replications, his or others', he has unreasonably high expectations about the replicability of significant results. He underestimates the breadth of confidence intervals.
4. He rarely attributes a deviation of results from expectations to sampling variability, because he finds a causal "explanation" for any discrepancy. Thus, he has little opportunity to recognize sampling variation in action. His belief in the law of small numbers, therefore, will forever remain intact.

Our questionnaire elicited considerable evidence for the prevalence of the belief in the law of small numbers.² Our typical respondent is a believer, regardless of the group to which he belongs. There were practically no differences between the median responses of audiences at a

² W. Edwards (1968, 25) has argued that people fail to extract sufficient information or certainty from probabilistic data; he called this failure conservatism. Our respondents can hardly be described as conservative. Rather, in accord with the representation hypothesis, they tend to extract more certainty from the data than the data, in fact, contain.


mathematical psychology meeting and at a general session of the Ameri-

can Psychological Association convention, although we make no claims for the representativeness of either sample. Apparently, acquaintance with formal logic and with probability theory does not extinguish erroneous intuitions. What, then, can be done? Can the belief in the law of

small numbers be abolished or at least controlled? Research experience is unlikely to help much, because sampling variation is all too easily “explained.” Corrective experiences are those that provide neither motive nor opportunity for spurious explanation. Thus, a student in a statistics course may draw repeated samples of given size from a population, and learn the effect of sample size on sampling variability from personal observation. We are far from certain, however, that expectations can be corrected in this manner, since related biases, such as the

gambler’s fallacy, survive considerable contradictory evidence. Even if the bias cannot be unlearned, students can learn to recognize its existence and take the necessary precautions. Since the teaching of statistics is not short on admonitions, a warning about biased statistical intuitions may not be out of place. The obvious precaution is computation. The

believer in the law of small numbers has incorrect intuitions about significance level, power, and confidence intervals. Significance levels are usually computed and reported, but power and confidence limits are not. Perhaps they should be.

Explicit computation of power, relative to some reasonable hypothesis, for instance, J. Cohen's (1962, 1969) small, large, and medium effects,

should surely be carried out before any study is done. Such computations will often lead to the realization that there is simply no point in running the study unless, for example, sample size is multiplied by four. We refuse

to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis. In addition, computations of power are essential to the interpretation of negative results, that is, failures to reject the null hypothesis. Because readers’ intuitive estimates of power are likely to be wrong, the publication of computed values does not appear to be a waste of either readers’ time or journal space.

In the early psychological literature, the convention prevailed of reporting, for example, a sample mean as M ± PE, where PE is the probable error (i.e., the 50% confidence interval around the mean). This convention was later abandoned in favor of the hypothesis-testing formulation. A confidence interval, however, provides a useful index of sampling variability, and it is precisely this variability that we tend to underestimate. The

emphasis on significance levels tends to obscure a fundamental distinction between the size of an effect and its statistical significance. Regardless of sample size, the size of an effect in one study is a reasonable estimate of

the size of the effect in replication. In contrast, the estimated significance level in a replication depends critically on sample size. Unrealistic expectations concerning the replicability of significance levels may be corrected


if the distinction between size and significance is clarified, and if the computed size of observed effects is routinely reported. From this point of

view, at least, the acceptance of the hypothesis-testing model has not been an unmixed blessing for psychology. The true believer in the law of small numbers commits his multitude of sins against the logic of statistical inference in good faith. The representation hypothesis describes a cognitive or perceptual bias, which operates regardless of motivational factors. Thus, while the hasty rejection of the

null hypothesis is gratifying, the rejection of a cherished hypothesis is aggravating, yet the true believer is subject to both. His intuitive expecta-

tions are governed by a consistent misperception of the world rather than by opportunistic wishful thinking. Given some editorial prodding, he may

be willing to regard his statistical intuitions with proper suspicion and replace impression formation by computation whenever possible.

3. Subjective probability: A judgment of representativeness

Daniel Kahneman and Amos Tversky

Subjective probabilities play an important role in our lives. The decisions we make, the conclusions we reach, and the explanations we offer are usually based on our judgments of the likelihood of uncertain events such as success in a new job, the outcome of an election, or the state of the

market. Indeed an extensive experimental literature has been devoted to the question of how people perceive, process, and evaluate the probabilities of uncertain events in the contexts of probability learning, intuitive

statistics, and decision making under risk. Although no systematic theory about the psychology of uncertainty has emerged from this literature, several empirical generalizations have been established. Perhaps the most general conclusion, obtained from numerous investigations, is that people do not follow the principles of probability theory in judging the likeli-

hood of uncertain events. This conclusion is hardly surprising because many of the laws of chance are neither intuitively apparent, nor easy to apply. Less obvious, however, is the fact that the deviations of subjective

from objective probability¹ seem reliable, systematic, and difficult to eliminate. Apparently, people replace the laws of chance by heuristics, which sometimes yield reasonable estimates and quite often do not.

¹ We use the term "subjective probability" to denote any estimate of the probability of an event, which is given by a subject, or inferred from his behavior. These estimates are not assumed to satisfy any axioms or consistency requirements. We use the term "objective probability" to denote values calculated, on the basis of stated assumptions, according to the laws of the probability calculus. It should be evident that this terminology is noncommittal with respect to any philosophical view of probability.
This chapter is an abbreviated version of a paper that appeared in Cognitive Psychology, 1972, 3, 430-454. Copyright © 1972 by Academic Press, Inc. Reprinted by permission.


In the present paper, we investigate in detail one such heuristic called representativeness. A person who follows this heuristic evaluates the

probability of an uncertain event, or a sample, by the degree to which it is: (i) similar in essential properties to its parent population; and (ii) reflects the salient features of the process by which it is generated. Our thesis is that, in many situations, an event A is judged more probable than an event

B whenever A appears more representative than B. In other words, the ordering of events by their subjective probabilities coincides with their

ordering by representativeness. Representativeness, like perceptual similarity, is easier to assess than to characterize. In both cases, no general definition is available, yet there are many situations where people agree which of two stimuli is more similar

to a standard, or which of two events is more representative of a given process. In this paper we do not scale representativeness, although this is a feasible approach. Instead, we consider cases where the ordering of events according to representativeness appears obvious, and show that people

consistently judge the more representative event to be the more likely, whether it is or not. Although representativeness may play an important

role in many varieties of probability judgments, e.g., political forecasting and clinical judgment, the present treatment is restricted to essentially repetitive situations where objective probabilities are readily computable.

Most data reported in this paper were collected in questionnaire form from a total of approximately 1500 respondents in Israel. The respondents

were students in grades 10, 11, and 12 of college-preparatory high schools (ages 15-18). Special efforts were made to maintain the attention and the motivation of the subjects (Ss). The questionnaires were administered in

quiz-like fashion in a natural classroom situation, and the respondents’ names were recorded on the answer sheets. Each respondent answered a

small number (typically 2-4) of questions each of which required, at most, 2 min. The questions were introduced as a study of people's intuitions about chance. They were preceded by standard oral instructions which

explained the appropriate question in detail. The experimental design was counterbalanced to prevent confounding with school or age. Most ques-

tions were pretested on university undergraduates (ages 20-25) and the results of the two populations were indistinguishable.

Determinants of representativeness

In this section we discuss the characteristics of samples, or events, that make them representative, and demonstrate their effects on subjective probability. First, we describe some of the features that determine the

similarity of a sample to its parent population. Then, we turn to the analysis of the determinants of apparent randomness.


Similarity of sample to population

The notion of representativeness is best explicated by specific examples. Consider the following question:

All families of six children in a city were surveyed. In 72 families the exact order of births of boys and girls was G B G B B G. What is your estimate of the number of families surveyed in which the exact order of births was B G B B B B?

The two birth sequences are about equally likely, but most people will surely agree that they are not equally representative. The sequence with five boys and one girl fails to reflect the proportion of boys and girls in the population. Indeed, 75 of 92 Ss judged this sequence to be less likely than the standard sequence (p < .01 by a sign test).

. . . but no one would call a carpenter an incompetent judge of distances simply because he too sees the illusion. Clinicians must be made aware of illusory correlations if they are to compensate for them. Ideally, the clinician should experience such illusions firsthand. It may be sound training policy to require each graduate student in clinical psychology to serve as an observer in tasks like the ones


we have described. He could then examine closely the size and source of the illusory correlations he experiences and thereby, one hopes, learn to guard against such errors in his clinical practice. The experience would also remind him that his senses are fallible, that his clinical judgments must be checked continually against objective measures, and that his professional task is one of the most difficult and complex in all of psychology.

18. Probabilistic reasoning in clinical medicine: Problems and opportunities

David M. Eddy

To a great extent, the quality and cost of health care are determined by the decisions made by physicians whose ultimate objective is to design and

administer a treatment program to improve a patient's condition. Most of the decisions involve many factors, great uncertainty, and difficult value

questions. This chapter examines one aspect of how these decisions are made, studying the use of probabilistic reasoning to analyze a particular problem: whether to perform a biopsy on a woman who has a breast mass that might be malignant. Specifically, we shall study how physicians process information about the results of a mammogram, an X-ray test used to diagnose breast cancer. The evidence presented shows that physicians do not manage uncertainty very well, that many physicians make major

errors in probabilistic reasoning, and that these errors threaten the quality of medical care.

The problem

A breast biopsy is not a trivial procedure. The most common type (around 80%) is the excisional biopsy, in which the suspicious mass is removed surgically for microscopic examination and histological diagnosis by a

pathologist. Usually the patient is admitted to a hospital and given a full set of preoperative diagnostic tests. The biopsy is almost always done under general anesthesia (with a probability of approximately 2 out of 10,000 of an anesthetic death). A small (1- to 2-in.) incision is made, and tissue the size of a pecan to a plum is removed. In many cases (perhaps 1 in

The preparation of this paper was supported by a grant from The Henry J. Kaiser Family Foundation.


2) the loss of tissue is barely noticeable; in others there is an indentation remaining. In an occasional case (perhaps 1 in 200) there is an infection or drainage that can persist for several weeks. The charge is approximately $700. This procedure can be done on an outpatient basis and under local anesthesia. As an alternative to the excisional biopsy, some surgeons prefer in some cases to obtain tissue by using a needle. This can be done on an outpatient basis, leaves no scar or other residual effects, and is far less expensive. However, it is thought by many physicians to be less reliable in that an existing malignant lesion may be missed. An important factor that affects the need for biopsy is the possibility that the breast mass is a cancer. To estimate this possibility, a physician can

list the possible diseases, assess the frequencies with which various signs and symptoms occur with each disease, compare this information with the findings in the patient, estimate the chance that she has each of the diseases on the list, and perform a biopsy if the probability of cancer or another treatable lesion is high enough. To help the physician, many

textbooks describe how non-malignant diseases can be differentiated from cancer. For example, the following passage describes one such benign disease - chronic cystic disease.

Chronic cystic disease is often confused with carcinoma of the breast. It usually occurs in parous women with small breasts. It is present most commonly in the upper outer quadrant but may occur in other parts and eventually involve the entire breast. It is often painful, particularly in the premenstrual period, and accompanying menstrual disturbances are common. Nipple discharge, usually serous, occurs in approximately 15% of the cases, but there are no changes in the nipple itself. The lesion is diffuse without sharp demarcation and without fixation to the overlying skin. Multiple cysts are firm, round, and fluctuant and may transilluminate if they contain clear fluid. A large cyst in an area of chronic cystic disease feels like a tumor, but it is usually smoother and well delimited. The axillary lymph nodes are usually not enlarged. Chronic cystic disease infrequently shows large bluish cysts. More often, the cysts are multiple and small.¹ (del Regato, 1970, pp. 860-861)

Similar descriptions are available for fibroadenomas, fat necrosis, trauma, and a half dozen other breast conditions, as well as for cancer. This type of probabilistic information can be used to help a physician analyze the possible causes of a patient's breast mass. With assessments of the values of the possible outcomes (e.g., properly diagnosing a cancer, doing an unnecessary biopsy of a non-malignant lesion, not biopsying and missing a malignant lesion, and properly deciding not to biopsy a benign lesion), the physician can assess the chance that the patient, with her particular signs and symptoms, has cancer, and the physician can select an action.

¹ In this and all subsequent quotations, the italics are added.


The case of mammography

Other diagnostic tests are available to help the physician estimate the chance that a particular woman's breast lesion is malignant. Perhaps the

most important and commonly used is mammography. The value of this test rests on the fact that the components of malignant cells absorb X rays

differently from the components of non-malignant cells. By studying the mammograms, a radiologist may be able to see certain signs that occur with different frequencies in different lesions, and from this information a judgment can be made about the nature of the lesion in question. Typically, mammograms are classified as positive or negative for cancer. Occasionally an expanded classification scheme is used, such as one containing the three classes: malignant, suspicious, and benign. The test is not perfect, in that some malignant lesions are incorrectly classified as benign and some benign lesions are called malignant. Thus, one factor that is very important to the clinician is the accuracy of the test.

Probabilistic reasoning

Let us develop this notion more precisely. The purpose of a diagnostic test is to provide information to a clinician about the condition of a patient. The physician uses this information to revise the estimate of the patient's condition and to select an action based on that new estimate. The action may be an order for further diagnostic tests, or if the physician is sufficiently confident of the patient's condition, a therapeutic action may be taken. The essential point is that the physician can have degrees of certainty about the patient's condition. The physician will gather evidence to refine this certainty that the patient does or does not have cancer, and when that certainty becomes sufficiently strong (in the context of the severity of the disease and the change in prognosis with treatment), action will be taken. We can associate a probability, the physician's subjective probability that the patient has cancer, with this degree of certainty. The impact on patient care of a diagnostic test such as mammography, therefore, lies in its power to change the physician's certainty or subjective probability that the

patient has cancer. The notion of a subjective probability or degree of certainty appears in many different forms in the medical vernacular. For example, one author writes that "because the older age group has the greatest proportion of malignant lesions, there is heightened index of suspicion of cancer in the mind of a clinician who faces an older patient" (Gold, 1969, p. 162). Another author states that the mammogram can reduce the number of breast biopsies "in many instances when the examining physician's rather firm opinion of benign disease is supported by a firm mammographic diagnosis


of benignancy" (Wolfe, 1964, p. 253). A third describes it this way: "If the subjective impression of the clinician gives enough reason for suspicion of carcinoma, the clinician will be compelled to biopsy despite a negative mammogram" (Clark, et al., 1965, p. 133). Other expressions that reflect this notion include "confidence level" (Byrne, 1974, p. 37), "impression of malignancy" (Wolfe, 1967, p. 135), "a more positive diagnosis" (Egan, 1972, p. 392), and so forth. These statements are not precise because few physicians are formally acquainted with the concepts of subjective probability and decision analysis. Nonetheless, there is ample evidence that the notions of degrees of certainty are natural to physicians and are used by them to help select a course of action.

Interpreting the accuracy of mammography

Now consider a patient with a breast mass that the physician thinks is probably benign. Let this probability be 99 out of 100. You can interpret the phrase "that the physician thinks is probably [99 out of 100] benign" as follows. Suppose the physician has had experience with a number of women who, in all important aspects such as age, symptoms, family history, and physical findings are similar to this particular patient. And suppose the physician knows from this experience that the frequency of cancer in this group is, say, 1 out of 100. Lacking any other information, the physician will therefore assign (perhaps subconsciously) a subjective probability of 1% to the event that this patient has cancer.

Now let the physician order a mammogram and receive a report that in the radiologist's opinion the lesion is malignant. This is new information and the actions taken will obviously depend on the physician's new estimate of the probability that the patient has cancer. A physician who turns to the literature can find innumerable helpful statements, such as the following: "The accuracy of mammography is approximately 90 percent" (Wolfe, 1966, p. 214); "In [a patient with a breast mass] a positive [mammogram] report of carcinoma is highly accurate" (Rosato, Thomas, & Rosato, 1973, p. 491); and "The accuracy of mammography in correctly diagnosing malignant lesions of the breast averages 80 to 85 percent" (Cohn, 1972, p. 93). If more detail is desired, the physician can find many statements like "The results showed 79.2 per cent of 475 malignant lesions were correctly diagnosed and 90.4 per cent of 1,105 benign lesions were correctly diagnosed, for an overall accuracy of 87 per cent" (Snyder, 1966, p. 217). At this point you can increase your appreciation of the physician's problem by estimating for yourself the new probability that this patient has cancer: The physician thinks the lump is probably (99%) benign, but the radiologist has produced a positive X-ray report with the accuracy just given.


Table 1. Accuracy of mammography in diagnosing benign and malignant lesions

Results of X ray    Malignant lesion (cancer)    Benign lesion (no cancer)
Positive                     .792                         .096
Negative                     .208                         .904

Source: The numbers are from Snyder (1966).

Bayes' formula can be applied to assess the probability. This formula tells us that

P(ca | pos) = P(pos | ca) P(ca) / [P(pos | ca) P(ca) + P(pos | benign) P(benign)]

where
P(ca | pos) is the probability that the patient has cancer, given that she has a positive X-ray report (the posterior probability)
P(pos | ca) is the probability that, if the patient has cancer, the radiologist will correctly diagnose it (the true-positive rate, or sensitivity)
P(ca) is the probability that the patient has cancer (prior probability)
P(benign) is the prior probability that the patient has benign disease [P(benign) = 1 - P(ca)]
P(pos | benign) is the probability that, if the patient has a benign lesion, the radiologist will incorrectly diagnose it as cancer (the false-positive rate)

Table 1 summarizes the numbers given by Snyder. The entries in the cells are the appropriate probabilities (e.g., P(pos | ca) = .792). Using 1% as the physician's estimate of the prior probability that the mass is malignant and taking into account the new information provided by the test, we obtain

P(ca | pos) = (0.792)(0.01) / [(0.792)(0.01) + (0.096)(0.99)] = 0.077

Thus, the physician should estimate that there is approximately an 8% chance that the patient has cancer.
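The same computation can be sketched in a few lines of code, using the Snyder (1966) rates from Table 1 and the 1% prior:

    p_pos_given_ca = 0.792        # true-positive rate (sensitivity)
    p_pos_given_benign = 0.096    # false-positive rate
    p_ca = 0.01                   # physician's prior probability of cancer

    p_ca_given_pos = (p_pos_given_ca * p_ca) / (
        p_pos_given_ca * p_ca + p_pos_given_benign * (1 - p_ca))
    print(round(p_ca_given_pos, 3))    # 0.077, i.e., roughly an 8% chance of cancer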

Incorrect probabilistic reasoning

Unfortunately, most physicians (approximately 95 out of 100 in an informal sample taken by the author) misinterpret the statements about the accuracy of the test and estimate P(ca | pos) to be about 75%. Other


investigators have observed similar results (Casscells, Schoenberger, & Grayboys, 1978). When asked about this, the erring physicians usually report that they assumed that the probability of cancer given that the patient has a positive X ray [P(ca | pos)] was approximately equal to the probability of a positive X ray in a patient with cancer [P(pos | ca)]. The latter probability is the one measured in clinical research programs and is very familiar, but it is the former probability that is needed for clinical decision making. It seems that many if not most physicians confuse the two.

There are really two types of accuracy for any test designed to determine whether or not a specific disease is present. The retrospective accuracy concerns P(pos | ca) and P(neg | no ca). (The abbreviation "no ca" refers to the event the patient does not have cancer. This can occur because she either has a benign disease or she has no disease at all.) This accuracy, the one usually referred to in the literature on mammography, is determined by looking back at the X-ray diagnosis after the true (histological) diagnosis is known. Let us use the term predictive accuracy to describe P(ca | pos) and P(benign | neg), the accuracy important to the clinician who has an X-ray report of an as yet undiagnosed patient and who wants to predict that patient's disease state.

Confusing retrospective accuracy versus predictive accuracy. A review of the

medical literature on mammography reveals a strong tendency to equate the predictive accuracy of a positive report with the retrospective accuracy of an X-ray report; that is, to equate P(ca | pos) = P(pos | ca). There are many reasons to suspect that this error is being made. First, the wordings of many of the statements in the literature strongly suggest that the authors believe that the predictive accuracy [P(ca | pos)] equals the retrospective accuracy [P(pos | ca)] that they report in their studies. For example, a 1964 article in Radiology stated, "the total correctness of the X-ray diagnosis was 674 out of 759, or 89 percent" (vol. 84, p. 254). A contributor to Clinical Obstetrics and Gynecology in 1966 said, "Asch found a 90 percent correlation of mammography with the pathologic findings in 500 patients" (vol. 9, p. 217). "The agreement in radiologic and pathologic diagnosis was 91.6 percent" (Egan, 1972, p. 379). All of these statements imply that if the patient has a positive test the test will be correct and the patient will have cancer 90% of the time. This is not true.

Second, some authors make the error explicitly. The following appeared in a 1972 issue of Surgery, Gynecology and Obstetrics in an article entitled "Mammography in its Proper Perspective" and was intended to rectify some confusion that existed in the literature: "In women with proved carcinoma of the breast, in whom mammograms are performed, there is no X-ray evidence of malignant disease in approximately one out of five patients examined. If then on the basis of a negative mammogram, we are to defer biopsy of a solid lesion of the breast, then there is a one in five


chance that we are deferring biopsy of a malignant lesion" (vol. 134, p. 98). The author has incorrectly stated that P(neg | ca) = .2 implies P(ca | neg) = .2. His error becomes very serious when he concludes that "to defer biopsy of a clinically benign solid lesion of the breast that has been called benign on mammography is to take a step backward in the eradication of carcinoma of the breast in our female population." The chance that such a patient has cancer depends on the prior probability, but is less than 1 in 100. His analysis is in error by more than a factor of 20.

Surgery, Gynecology and Obstetrics published in 1970 (vol. 131, pp. 93-98) the findings of another research group, who computed the "correlation of radiographic diagnosis with pathologic diagnosis" as follows. They took all the patients with histologically proven diagnoses and separated them into three groups on the basis of the X-ray diagnosis - "benign," "carcinoma," and "suspected carcinoma." In the "X-ray benign" ("negative" in our terminology) group, the tally showed that 84% in fact had benign lesions. It was also noted that 87.5% of the "X-ray carcinoma" (or "positive") group had biopsy-proven malignant lesions. Thus, P(ca | pos) = 87.5% and P(benign | neg) = 84%. But the authors mistook this predictive accuracy for the retrospective accuracy. They stated that "A correct mammographic diagnosis was made in 84 percent of those with benign lesions and in 87.5 percent of those with carcinoma." In fact, the true-positive rate [P(pos | ca)] in this study was actually 66% and the true-negative rate [P(neg | benign)] was 54%.

In a letter to the editor in the September 11, 1976, issue of the National

Observer, a physician presented five "observations and facts" to support his opinion that "routine [i.e., screening] mammography is not in the best interest of the population at large at any age." Here is the first set of observations.

(1) The accuracy of the examination of mammography is reported to be between 80 percent and 90 percent, depending on such factors as the age of the patient, whether or not she has fibrocystic disease, the type of radiographic equipment, the experience of the radiologist, and what our definition of "accurate" is. . . . Even if we conclude that accuracy is 85 percent generally (and I am sure that not every radiologist in the nation can approach that figure in his own practice), then that means that 15 percent of the women X-rayed will wind up with incorrect interpretations of the findings, or more likely, their mammograms will simply fail to demonstrate the disease.

This means that 15 percent of the women will be given a false sense of security if they are told their X-rays are normal, if indeed they already have cancer. It is difficult to assess the harm done to this group, for they would obviously be better off with no information rather than with incorrect information. Told that her mammogram is normal and she need not come back for one more year, a woman with breast cancer may well ignore a lump in her breast which might otherwise send her to the doctor immediately.

There are several errors in this author's reasoning. First, the “accuracy”

of mammography cannot be expressed as a single number. Assume the


author means that the true-positive and true-negative rates both equal 85%. Second, these rates (of 85%) are observed when mammography is used to make a differential diagnosis of known signs and symptoms. Such lesions are generally more advanced than the lesions being sought in a screening examination, which is the situation the author is addressing. More reasonable estimates for the true-positive and true-negative rates in screening programs are 60% and 98%, respectively.

Third, even using 85%, we find several inaccuracies in the reasoning.

Consider the second sentence. There are two ways an incorrect interpretation can occur: (a) the patient can have cancer and a negative examination, P(ca,neg); or (b) she can have a positive examination but not have cancer, P(no ca,pos).² From elementary probability theory we know that

P(ca,neg) = P(neg | ca) P(ca)

P(neg | ca) is the complement of P(pos | ca) and therefore equals .15 in this case. We do not know P(ca) precisely, but for a screening population we are reasonably certain that it is less than .005. That is, fewer than 5 out of 1,000 women harbor an asymptomatic but mammogram-detectable cancer of the breast. Thus,

P(ca,neg) ≤ (.15)(.005) = .00075

Also,

P(no ca,pos) = P(pos | no ca) P(no ca) ≈ (.15)(.995) = .14925

The total probability of an incorrect interpretation [i.e., P(ca,neg) + P(no ca,pos)] is the sum of these two numbers, which is 15%, as the author states. However, this does not mean that "more likely, their mammograms will simply fail to demonstrate the disease." P(ca,neg) = .00075 is not more likely than P(no ca,pos) = .14925. It is about 200 times less likely. Another problem is that 85% "accurate" does not mean that "15 percent of the women will be given a false sense of security if they are told their X-rays are normal." The author appears to be trying to estimate P(ca | neg). Now by Bayes' formula,

P(ca | neg) = P(neg | ca) P(ca) / [P(neg | ca) P(ca) + P(neg | no ca) P(no ca)]
            = (.15)(.005) / [(.15)(.005) + (.85)(.995)] = .00089

That is, if 10,000 asymptomatic women are screened, and if we use the author's misestimate of the accuracy, 8,458 of them will leave with a negative examination.

² P(A,B) is the joint probability that both events A and B occur.
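A short sketch of the joint probabilities and the P(ca | neg) computed just above, under the letter writer's 85% figure and a screening prior of .005:

    p_ca = 0.005
    p_neg_given_ca = 0.15      # miss rate implied by 85% "accuracy"
    p_pos_given_noca = 0.15    # false-positive rate implied by 85% "accuracy"

    p_ca_and_neg = p_neg_given_ca * p_ca                     # 0.00075
    p_noca_and_pos = p_pos_given_noca * (1 - p_ca)           # 0.14925, about 200 times larger
    p_ca_given_neg = p_ca_and_neg / (
        p_ca_and_neg + (1 - p_neg_given_ca) * (1 - p_ca))    # about .00089
    print(p_ca_and_neg, p_noca_and_pos, round(p_ca_given_neg, 5))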


Table 2. Presence of cancer and results of X rays in 1,000 women who have abnormal physical examinations

                                Women with cancer    Women with no cancer    Total
Women with positive X rays             74                    110              184
Women with negative X rays              6                    810              816
Total                                  80                    920            1,000

Note: A true-positive rate of .92 (P(pos | ca) = 0.92) implies that of 80 women who have cancer, 74 will have positive X rays and 6 will have negative X rays. Of all the women with positive X rays, 74/184 have cancer, or P(ca | pos) = 74/184 = 40%.
Source: The numbers are from Wolfe (1964).

The author thinks that about 1,269 of them will have a false sense of security. In fact, only about 9 will. This number has been overestimated by a factor of about 150. Finally, adding the phrase, "if indeed they already have cancer" further confuses the meaning of the sentence. The phrases "a false sense of security," "if [given] they are told their X-rays are normal," and "if they already have cancer" translate symbolically into P(ca | neg, ca). This probability is 1, not .15.

The importance of P(ca). In addition to confusing the two accuracies, many

authors do not seem to understand that, for a test of constant retrospective accuracy, the meaning to the physician of the test results (the predictive

accuracy) depends on the initial risk of cancer in the patient being mammogrammed. Even if it is assumed that the true-positive and true-negative rates are constant for all studies, the proper interpretation of the test results - the chance that a patient with a positive (or negative) mammogram has cancer - will depend on the prevalence of cancer in the population from which the patient was selected, on the pretest probability that a patient has cancer. This can be extremely important when one

compares the use of the test in a diagnostic clinic (where women have signs and symptoms of breast disease) with its use in a screening clinic for asymptomatic women.

The importance of this is shown by an example. Suppose a clinician's practice is to mammogram women who have an abnormal physical examination. The frequency of cancer in such women has been found in one study to be approximately 8% (Wolfe, 1964). In one series of mammograms in this population, a true-positive rate of 92% and a true-negative rate of 88% was obtained (Wolfe, 1964). Let the physician now face a


patient who he feels is representative of this sample population (i.e., let P(ca) = 8%). Suppose he orders a mammogram and receives a positive result from the radiologist. His decision to order a biopsy should be based

on the new probability that the patient has cancer. That probability can be calculated to be 40% (see Table 2). Would a negative report have ruled out

cancer? The probability that this woman, given a negative report, still has cancer is slightly less than 1%. The logic for this estimate is shown in Table 2. Now, suppose the clinician orders the test to screen for cancer in a woman who has no symptoms and a negative physical examination. The prevalence of mammography-detectable cancer in such women is about

.10% (e.g., Shapiro, Strax, & Venet, 1967). For the purposes of this example, let the retrospective accuracy of the radiologist be unchanged - that is, in this population of patients let him again have a true-positive rate of 92% and a true-negative rate (for the diagnosis of benign lesions) of 88%.³ The literature provides data only on the retrospective accuracy of the test in women who have cancer and benign diseases. In one study about 60% of these women had no disease at all (Wolfe, 1965). Thus, in this case,

P(ca | pos) = [P(pos | ca)P(ca)] / [P(pos | ca)P(ca) + P(pos | benign)P(benign) + P(pos | no disease)P(no disease)]

P(benign), P(no disease), and P(pos | no disease) are not discussed explicitly in the literature. This is instructive and it leads us to suspect that their importance in the analysis of these problems is not understood. For this example, we shall use the data presented by Wolfe (1965) and assume that P(no disease) is about 60% and P(benign) is about 40%. We shall also make an assumption favorable to mammography and let P(pos | no disease) be 0%. To continue with this example, say the radiologist reports that the mammogram in this asymptomatic woman is positive. Given the positive mammography report, the probability that the patient has cancer (P(ca | pos)) is about 1 out of 49, or about 2.0% (Table 3). In the previous example that involved women with symptoms, P(ca | pos) was 40%. Thus,

depending on who is being examined, there can be about a twentyfold difference in the chance that a woman with a positive mammogram has

cancer.
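The twentyfold contrast can be reproduced with one small function. A sketch, using the rates assumed in the text (P(pos | ca) = .92, P(pos | benign) = .12, and P(pos | no disease) taken as 0) and the two priors of Tables 2 and 3:

    def p_ca_given_pos(p_ca, p_benign, p_no_disease):
        # Bayes' formula with a three-way partition of the population.
        numerator = 0.92 * p_ca
        denominator = numerator + 0.12 * p_benign + 0.0 * p_no_disease
        return numerator / denominator

    # Diagnostic clinic (Table 2): 8% prior; the remaining women are treated as benign.
    print(round(p_ca_given_pos(0.08, 0.92, 0.0), 2))       # about 0.40
    # Screening clinic (Table 3): prior of .001, benign .400, no disease .599.
    print(round(p_ca_given_pos(0.001, 0.400, 0.599), 3))   # about 0.02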

This raises a major question about medical reasoning - when trying to evaluate a patient's signs and symptoms, how should a physician use information about the basic frequency of the possible diseases in the

³ This is not a good assumption, since the "accuracy" changes as the population being examined changes. For example, the true-positive rate is lower when one is using the test in an asymptomatic population because the cancers tend to be much smaller and harder to detect. The assumption is made only to demonstrate the importance of P(ca).


Table 3. Presence of cancer and results of X ray in 1,000 women who have no symptoms

                                Women with cancer    Women with benign lesions    Women with no cancer    Total
Women with positive X rays             1                       48                          0                49
Women with negative X rays             0                      352                        599               951
Total                                  1                      400                        599             1,000

Note: A true-positive rate of 0.92 implies that the X ray will detect cancer in the one woman who has the disease. A true-negative rate of 0.88 for benign disease implies that of 400 women with benign disease, 352 will have negative X rays, whereas in 48 the X ray will be positive. Thus, 49 women will have positive X rays, but only one has cancer, or P(ca | pos) = 1/49 = 2%.

population at large? The profession appears to be confused about this issue. On the one hand, physicians make statements that the relative

commonness of a disease should not affect the estimate of the probability

that a particular patient has the disease. This notion appears in several maxims, such as, "The patient is a case of one" and, "Statistics are for dead men." In discussions of specific problems, the idea is sometimes expressed subtly as in the statement, "The younger women obviously have a fewer number of the malignancies which, however, should exert very little influence on the individual case" (Wolfe, 1967, p. 138). It can also be stated explicitly and presented as a rule to be obeyed. For example, the following appeared in a textbook on clinical diagnosis: "When a patient consults his

physician with an undiagnosed disease, neither he nor the doctor knows whether it is rare until the diagnosis is finally made. Statistical methods can only be applied to a population of thousands. The individual either has a rare disease or doesn't have it; the relative incidence of two diseases is completely irrelevant to the problem of making his diagnosis" (DeGo-

win & DeGowin, 1969, p. 6). On the other hand, these statements are often inconsistent with the behavior of physicians who try, however imperfectly, to use this diagnostic information. Witness the following maxims that are passed on in medical schools: "When you hear hoofbeats, think of horses not of zebras," "Common things occur most commonly," "Follow Sutton's law:

go where the money is," and so forth. It appears that many physicians sense the value of information on the prior probability of a disease but that the formal lessons of probability theory are not at all well understood. Without a formal theory, physicians tend to make the same kinds of errors

in probabilistic reasoning that have been observed in other contexts (Kahneman & Tversky, 1973, 4; Lyon & Slovic, 1976).


Implications: Mammograms and biopsies

These problems can have important practical implications. For instance, in the examples just cited two authors based their conclusions on incorrect probabilistic reasoning. One incorrectly argued that a woman with a breast mass that appears benign on physical examination and benign on the X ray still has a 20% chance of having cancer and recommended that

she be biopsied. Another author based a recommendation against screening on a gross misestimate of the frequency with which women would

have a false sense of security (i.e., have a cancer missed by the mammogram). Both authors may have come to the same conclusion with correct reasoning, but they may not have. The oalae of diagnostic information. The value of mammography in women

who have symptoms and signs of breast disease lies in its ability to provide diagnostic information that will affect the clinician's decision to biopsy. More precisely, the outcome of the test should change a clinician’s estimate of the probability that the patient has cancer. As one author puts it: Mammography can assist the clinician in differentiating between benign and

malignant lesions. . . . Some lesions. especially the small ones, may lack the characteristics that give the clinician an index of suspicion high enough to justify

biopsy. It is here that the .. . mammogram may provide additional objective evidence. Thus, in the case of an indeterminate lesion of the breast, mammography can aid the physician in deciding whether to perform a biopsy study (Clark dz

Robbins, 1965, p. 125). For any diagnostic test to be useful it must provide information that can

potentially change a decision about how the patient should be managed to call for a biopsy in some patients who would otherwise not be biopsied, and, we should hope, to obviate biopsies in some women who would otherwise receive them. This notion is developed formally in statistical decision theory and has been used to analyze some medical problems in a

research setting (e.g., Lusted et al., 1977). Many physicians recognize that the X-ray report carries useful information that should help in patient management, but precisely how the information should be used is ordinarily not stated. The explanations

given by most authors contain few specific directions. "Mammography is not designed to dictate treatment procedures but may provide, in certain cases, just that bit of more precise information, so that undesirable sequelae are avoided" (Egan, 1972, p. 392). "Mammography is a valuable

adjunctive to the surgeon in the diagnosis and treatment of breast lesions" (Lyons, l9?5, p. 231). "Mammography may assist in clarifying confusing palpable findings" (Egan, 1969, p. 146). It "plays a supportive or auxiliary role . . ." (Block 8: Reynolds, 195-’4, p. 589). The precise nature and degree of the support is usually left to the clinician's judgment.


Mammograms and biopsies: The practice. It seems that the role of mammography in such cases is only partially understood. To understand this, let us examine the impact that clinical investigators predict mammography will have on the need to biopsy diseased breasts. While the statements quoted above imply that the use of X rays should help select patients for biopsy, an equal number of statements suggest that mammography cannot, indeed should not, perform this function. "Any palpable lesion requires verification by excision and biopsy regardless of the X-ray findings" (Lesnick, 1966, p. 2007). "While mammography is usually definitive it is not a substitute for biopsy" (Egan, 1969, p. 148). "In no way would this procedure detract from the importance of biopsy. As a matter of fact, the use of routine mammography will reaffirm the importance of biopsy, since X-ray evidence of a malignant lesion requires biopsy for confirmation. . . . It in no way detracts from the importance of the biopsy. . . . [B]iopsy is as much a necessity for the confirmation of X-ray findings as it is for the confirmation of physical signs" (Gershon-Cohen & Borden, 1964, pp. 2753, 2754). "It is apparent that mammography is not a substitute for surgery" (DeLuca, 1974, p. 318). "Let us state emphatically that mammography is not a substitute for biopsy" (McClow & Williams, 1973, p. 618).

One of the most precise policy statements on how mammography should be used to help select patients for biopsy appeared in Archives of Surgery in 1966 (vol. 93, pp. 853-856). A careful examination of the directions reveals that only half of the test's potential is used. The scheme for using mammography "to determine the treatment or disposition of each patient" involves three categories of patients:

Category A: "The patients with a 'lump' or 'dominant lesion' in the breast are primarily surgical problems and there should be no delay in obtaining a biopsy. Mammography, in this instance, is strictly complementary. . . . It may disclose occult tumors" (p. 854).

Category B: "The patients have symptoms referable to the breast but no discrete mass or 'dominant lesion'. . . . In this category, the surgeon and clinician will find the greatest yield from mammography because here the modality is confirmatory." Here the mammogram will give confirmation and encouragement "if the clinical impression is benign. It should not, however, dissuade him from a prior opinion to biopsy" (p. 855).

Category C: These patients have no signs or symptoms, there are no clinical indications for biopsy, and a mammogram can only increase the number of biopsies.

Thus, the author has outlined a plan that nullifies the value of mammographic information in selecting patients in whom a biopsy can be avoided. Only the added bit of information that implies biopsy is used. The information that might eliminate a biopsy is ignored.

Mammograms and biopsies: The potential. To appreciate how problems in probabilistic reasoning can affect the actual delivery of medical care, let us


now examine the role that mammography might play in differential diagnosis and in the selection of patients for biopsy. As described above, the purpose of the test is to change the decision maker's subjective estimate of the chance that a patient has cancer. If that probability is high enough (as determined by the physician and patient), biopsy is recommended. Call this probability the biopsy threshold. Now consider the impact of the test on the management of two groups of patients.

The first group consists of those patients who, on the basis of a history and physical examination, are thought by the clinician to have clinically obvious cancer. Using data published by Friedman et al. (1966), let the prior probability (the frequency) of cancer in this group be 90%. If a mammogram were performed on such a patient, a positive result would increase the probability of cancer [P(ca | pos)] to perhaps 95%. A negative mammogram would still leave the patient with a 71% chance of having cancer. This high probability is the motivation of such statements as: "If the subjective impression of the clinician gives enough reason for suspicion of cancer, the clinician will be compelled to biopsy" (Clark et al., 1965, p. 133). A 71% chance of malignancy is still high enough that almost anyone would want to be biopsied.

Now consider a second group of patients who have a dominant mass that is not obviously carcinoma. In one study the probability that such a mass is malignant was 14% (Friedman et al., 1966). In the absence of further information, the clinical policy in such cases is to biopsy the lesion: "If a dominant lump develops, it should be removed and examined microscopically" (del Regato, 1970, p. 861). Using this as a guideline, let us suppose that the patient's biopsy threshold is 10%. That is, if, to the best of the physician's knowledge, the probability that his patient has cancer is above 10%, then the patient and physician agree that a biopsy should be done.

4 Anyone needing to be convinced of the existence of a biopsy threshold can reason as follows. Can we agree that no one is willing to be biopsied if the chance of cancer is 1 in 30 trillion? And can we agree that virtually everyone wants to confirm the diagnosis and be treated if that chance is 93 in 100? If so, then somewhere between 1 in 30 trillion and 93 in 100 everyone has a biopsy threshold. Of course if a woman refuses biopsy and treatment even when the chance of cancer is certain, then she has no threshold.

5 The biopsy threshold is a very fascinating and important number. Shifting it exerts great leverage on the number of women biopsied, the frequency of unproductive biopsies, the cost of managing a patient, as well as a patient's prognosis. Because of risk aversion and the fact that they are making the decision for someone else, physicians generally set the biopsy threshold quite low. The statement "if there is any chance that the lesion is malignant, a biopsy should be done" is typical. "If the physician is not completely satisfied that the lesion is benign, it should be biopsied without delay" (Allen, 1965, p. 640). There is evidence that women themselves generally set the threshold higher than do physicians - although there is wide variation. For example, we can examine data from a large clinical trial in which mammography and a breast physical examination were used to screen asymptomatic women for breast cancer (Shapiro, Strax, & Venet, 1971). Depending on how the breast lesion was detected (i.e., by which test or combination of tests), the probability that a woman's breast disease was cancer


Using a biopsy threshold of 10%, we can determine the impact of a mammogram on the management of 1,000 such patients. Without the test, all patients would have to be biopsied, 860 of them unproductively. The approximate fate of the original 1,000 patients with a dominant lesion when mammography is used is presented in Figure 1.

Patients with positive mammograms have a 53% chance of having cancer and, since we have assumed they have a biopsy threshold of 10%, they should be biopsied. Because the probability is 34% that a patient with an uncertain mammogram has cancer, these patients should also be biopsied. Patients with a negative mammogram have a 4% chance of having cancer, and, since this is below their assumed biopsy threshold (10%), they would not want to be biopsied but would prefer to be followed closely. The total number of immediate biopsies has been reduced from 1,000 to 240. At least 30 more biopsies will have to be done eventually because 30 of the 760 remaining patients have cancer. In this way, the expected benefits from having a mammogram (such as a reduction of the chance of an unnecessary biopsy from approximately 86% to a little over 13%) can be compared with the costs (e.g., a radiation hazard and about $375), and the slight decrease in expected survival (there is a 3%


varied from 15% to 54%. On the basis of a positive physical examination, physicians recommended that 545 women who had negative mammograms be biopsied. Despite the fact that the frequency of cancer in this group was 15%, 31% of the women declined the recommended biopsy. The frequency of cancer in women who had a positive mammogram and a negative breast physical examination was 20%, but 29% of the women in this group declined a recommended biopsy. In women who had positive results on both tests, the frequency of cancer was 54% and only 5% of these women preferred not to be biopsied at the recommended time. Thus, from this crude information it appears that about 31% of women had a biopsy threshold greater than 15%, 29% of women had a biopsy threshold greater than 20%, and in 5% of women the threshold exceeded 54%.

6 To sketch the impact of mammography on these patients (and the patients with other signs and symptoms) much information is needed that is not directly available in the literature. It is fortunate that in one study (Friedman et al., 1966) the data on the frequency of cancer and the retrospective accuracy of mammography are presented separately for three groups of patients - those with obvious carcinoma, those with a dominant mass, and patients with other signs and/or symptoms of breast disease. The published data are incomplete, however, and the data on the frequency of an uncertain X-ray diagnosis in benign and malignant lesions are not included. The data available in the Friedman study were used, and for this example the following assumptions were made: (1) Lesions not biopsied were in fact benign, (2) lesions not biopsied were coded negative, (3) half of the benign lesions that were not coded negative were coded positive (the other half being coded uncertain), and (4) half of the malignant lesions that were not coded positive were coded negative. The first two assumptions are the most optimistic interpretation of mammography's accuracy. The third and fourth assumptions are very important, and as the false-positive (or false-negative) rate tends toward zero, the power of a positive (negative) X-ray report to rule cancer in (out) increases. Likewise, as the false-positive or false-negative rates increase, the test loses its predictive power. Interpretation of Friedman's data is made even more difficult by its presentation in terms of breasts rather than patients. Nonetheless, there is much information in this report and it is reasonable to use it in this example provided the reader understands that this is an illustration, not a formal analysis. A formal analysis of these questions would require better data. The figures for the accuracy used in the text for the evaluation of the patients in group 2 are as follows: P(pos | ca) = .52, P(uncertain | ca) = .24, P(neg | ca) = .24, P(pos | benign) = .075, P(uncertain | benign) = .075, and P(neg | benign) = .85.
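Applying Bayes' theorem to these figures reproduces the probabilities shown in Figure 1 below. The sketch that follows is an added illustration (not part of the original chapter); it assumes the accuracy values just listed, a prior probability of cancer of 14%, and the 10% biopsy threshold used in the text, and the small differences from the figure's 140/100/760 split are due to rounding.

```python
# Illustrative sketch (not part of the original chapter): Bayes' theorem
# applied to the dominant-mass group, using the accuracy figures from the
# footnote and a prior probability of cancer of 0.14.

prior_ca = 0.14
prior_benign = 1 - prior_ca

# P(report | disease state), as listed in the footnote above
likelihood = {
    "pos":       {"ca": 0.52, "benign": 0.075},
    "uncertain": {"ca": 0.24, "benign": 0.075},
    "neg":       {"ca": 0.24, "benign": 0.85},
}

biopsy_threshold = 0.10
n_patients = 1000
biopsies = 0

for report, p in likelihood.items():
    p_report = p["ca"] * prior_ca + p["benign"] * prior_benign
    posterior = p["ca"] * prior_ca / p_report
    n_group = round(p_report * n_patients)
    decision = "biopsy" if posterior >= biopsy_threshold else "follow"
    if decision == "biopsy":
        biopsies += n_group
    print(f"{report:9s}: {n_group:4d} patients, "
          f"P(ca | report) = {posterior:.0%}, -> {decision}")

# Figure 1 rounds the groups to 140, 100, and 760 patients
# (240 immediate biopsies).
print(f"Immediate biopsies: {biopsies} of {n_patients}")
```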


Figure 1. Probability of cancer in women with dominant lesions. [Decision tree: of 1,000 patients, the mammogram is positive for 0.14 (140 patients), uncertain for 0.10 (100), and negative for 0.76 (760); within these groups the probability of cancer is 0.53 (74 patients) vs. no cancer 0.47 (66), 0.34 (34) vs. 0.66 (66), and 0.04 (30) vs. 0.96 (730), respectively.]

chance that diagnosis of a malignant lesion will be postponed a month or so). If the notion of a biopsy threshold and some simple probability theory were used, many patients in this group who had negative mammograms would be spared a biopsy. In the absence of this type of analysis, "the surgical consensus here is that all patients [in this group] should have a biopsy, regardless of mammographic findings" (Friedman et al., 1966, p. 839).

The importance of the biopsy threshold in this example should be stressed. If the physician and his patient had set the threshold at 1% - that is, if the patient felt that a 1 in 100 chance of having cancer was sufficient to warrant a biopsy - then a negative mammogram report would not have eliminated the need for the biopsy (a 4% chance of cancer would exceed this threshold). The mammogram may have given the clinician some information, but this information would not have contributed to the decision to biopsy. Use of mammography in this case would have to be justified on other grounds.

The practice revisited. This type of analysis helps make clear the potential

usefulness of mammography in the differential diagnosis of various lesions. It also helps us evaluate the following policy statements:

1. "Mammography adds little to the management of the clinically [i.e., physically] palpable breast nodule that, on the basis of its own characteristics, requires biopsy" (from Archives of Surgery, 1974, vol. 108, p. 589). In the study of the patients with a dominant mass, biopsy was required on


clinical grounds alone. The use of mammography split the group into subgroups with frequencies of cancer ranging from 53% to 4%. Biopsy might be avoided in the latter group and the number of biopsies might be reduced 73% (from 1,000 per 1,000 to 270 per 1,000).

2. "For clinical purposes mammography must provide accuracy at approximately the 100 percent level before it alone can direct management" (from Archives of Surgery, 1974, vol. 108, p. 589). In a population like the second group discussed above, it might be quite rational to let mammography select patients for biopsy. Recall that the true-positive rate used in that example was 52% and that a more accurate test would be even more valuable.

3. "Mammography is not a substitute for biopsy" (from Oncology, 1969, vol. 23, p. 148). The purpose of both mammography and biopsy is to provide information about the state of the patient. Some patients, in the absence of mammography, require biopsy. In some of these patients a negative mammogram would obviate the biopsy, and in these cases the mammogram would replace the biopsy.

4. "Every decision to biopsy should be preceded by a mammogram" (from Oncology, 1969, vol. 23, p. 146). Consider clinically obvious carcinoma. The probability of cancer will be above almost anyone's biopsy threshold no matter what the outcome of the mammogram. The primary justification for this policy in such a case must lie in the chance that the clinically obvious is benign (otherwise the patient would have to have a mastectomy [breast removal] anyway) and that there is a hidden, nonpalpable, malignant lesion. The probability of this compound event is the product of the probabilities of the two events, which is extremely small (on the order of 1 out of 5,000).

5. "To defer biopsy of a clinically benign lesion of the breast which has been called benign on mammography is to take a step backward in the eradication of carcinoma of the breast" (from Surgery, Gynecology and Obstetrics, 1972, vol. 134, p. 93). Let "clinically benign" be represented by a P(ca) of 5%. After a negative mammogram, the probability that such a patient has cancer is approximately 1%. Out of 100 biopsies, 99 would be unproductive. Is the deferral of biopsy here a step backward or forward? The other point is that if the policy were followed, all lesions from "clinically benign" through clinically obvious carcinoma would require a biopsy no matter what the outcome of the test was. This seems to contradict the author's statement that "when used in its proper perspective, mammography is an excellent adjunct to the physician in the management of carcinoma of the breast" (from Surgery, Gynecology and Obstetrics, 1972, vol. 134, p. 93).

6. "Mammography must never be used instead of biopsy when dealing with a 'dominant lesion' of the breast and should never change the basic surgical approach in breast diseases, i.e., a 'lump is a lump' and must be


biopsied either by incision or aspiration" (from Archives of Surgery, 1966, vol. 93, p. 854). Patients with dominant lesions and biopsy thresholds over 5% would disagree with this statement.

7. "The fallacy comes in relying on [mammography] in doubtful cases. It is essential after examining and palpating the breast to decide whether you would or would not do a biopsy if X-ray were not available. If you would do a biopsy, then do it. If you are sure there is no indication for surgery or physical examination, then order a mammogram. As soon as one says to himself, and particularly if he says to a patient, 'I am not quite sure about this - let's get an X-ray,' one unconsciously has committed himself to reliance on the negativity of the mammogram, when one should only rely on positivity. This is a psychological trap into which we all tend to fall and is much more serious than a certain number of false-positive diagnoses reached with mammography" (Rhoads, 1969, p. 1182). Not a single biopsy will be avoided by this policy. This is a shame because, as the author of the above statement himself puts it, "there are few areas in which so much surgery is necessitated which could be avoided by better methods of diagnosis than the breast."

We are now in a position to appreciate the following story that appeared in the San Francisco Chronicle (Kushner, 1976). A woman reporter had just discovered a mass in her breast and described a consultation with her

physician.

"I'd like you to get a xeromammogram. It's a new way to make mammograms - pictures of the breasts."

"Is it accurate?"

He shrugged. "Probably about as accurate as any picture can be. You know," he warned, "even if the reading is negative - which means the lump isn't malignant - the only way to be certain is to cut the thing out and look at it under a microscope."

The woman then discussed the problem with her husband.

"What did the doctor say?"

"He wants to do a xeromammogram. Then, whatever the result is the lump will have to come out."

"So why get the X-ray taken in the first place?"

"It's something to go on, I guess. And our doctor says it's right about 85 percent of the time. . . . So, first I've scheduled an appointment to have a thermogram. If that's either positive or negative, and if it agrees with the Xerox pictures from the mammogram, the statistics say the diagnosis would be 95 percent reliable."

In summary, it would seem reasonable to ask that if the purpose of mammography is to help physicians distinguish benign from malignant breast disease, thereby sparing some patients a more extensive and

traumatic procedure such as a biopsy, then we ought to let the test perform that function. If on the other hand the physician should always adhere to a prior biopsy decision and be unmoved by the mammogram outcome, then


we ought not to claim that the purpose of the test is to help distinguish benign from malignant disease, since that distinction will be made definitively from a biopsy. Finally, if the purpose of the test is to search for hidden and clinically unsuspected cancer in a different area of the breast (away from a palpable mass that needs biopsy anyway), we ought to

recognize explicitly that the chances of such an event are extremely small and that the use of the test amounts to screening. My purpose is not to argue for a specific mammography or biopsy

policy - to do so would require better data and a better assessment of patient values. It is to suggest that we have not developed a formal way of reasoning probabilistically about this type of problem, that clinical judgment may be faulty, and that current clinical policies may be inconsistent or incorrect.

Discussion

These examples have been presented to illustrate the complexity of medical decision-making and to demonstrate how some physicians

manage one aspect of this complexity - the manipulation of probabilities. The case we have studied is a relatively simple one, the use of a single

diagnostic test to sort lesions into two groups, benign and malignant. The data base for this problem is relatively good. The accuracy and diagnostic

value of the test has been studied and analyzed in many institutions for many years. As one investigator put it, “I know of no medical procedure

that has been more tested and retested than mammography" (Egan, 1971, p. 1555).

The probabilistic tools discussed in this chapter have been available for centuries. In the last two decades they have been applied increasingly to medical problems (e.g., Lusted, 1968), and the use of systematic methods for managing uncertainty has been growing in medical school curricula, journal articles, and postgraduate education programs. At present, however, the application of these techniques has been sporadic and has not yet filtered down to affect the thinking of most practitioners. As illustrated in

this case study, medical problems are complex, and the power of formal probabilistic reasoning provides great opportunities for improving the

quality and effectiveness of medical care.

19. Learning from experience and suboptimal rules in decision making

Hillel J. Einhorn

Current work in decision-making research has clearly shifted from representing choice processes via normative models (and modifications thereof) to an emphasis on heuristic processes developed within the general framework of cognitive psychology and theories of information processing (Payne, 1980; Russo, 1977; Simon, 1978; Slovic, Fischhoff, & Lichtenstein, 1977; Tversky & Kahneman, 1974, 1, 1980). The shift in emphasis from questions about how well people perform to how they perform is certainly important (e.g., Hogarth, 1975). However, the usefulness of studying both questions together is nowhere more evident than in the study of heuristic rules and strategies. The reason for this is that the comparison of heuristic and normative rules allows one to examine discrepancies between actual and optimal behavior, which then raises questions regarding why such discrepancies exist. In this chapter, I focus on how one learns both types of rules from experience. The concern with learning from experience raises a number of issues that have not been adequately addressed; for example: Under what conditions are heuristics learned? How are they tested and maintained in the face of experience? Under what conditions do we fail to learn about the biases and mistakes that can result from their use?

The importance of learning for understanding heuristics and choice behavior can be seen by considering the following:

1. The ability to predict when a particular rule will be employed is currently inadequate (Wallsten, 1980). However, concern for how and under what conditions a rule is learned should increase one's ability to

This is an abbreviated version of a paper that appeared in T. S. Wallsten (Ed.), Cognitive Processes in Choice and Decision Behavior. Hillsdale, N.J.: Lawrence Erlbaum Associates, Inc., 1980. Reprinted by permission. This research was supported by a grant from the Illinois Department of Mental Health and Developmental Disabilities, Research and Development No. F40-02. I would like to thank Robin Hogarth for his comments on an earlier version of this paper.


predict when it is likely to be used. For example, if a rule is learned in situations where there is little time to make a choice, prediction of the use of such a rule is enhanced by knowing the time pressure involved in the task.

2. A concomitant of (1) is that it should be possible to influence how people judge and decide by designing situations in which tasks incorporate or mimic initial learning conditions. The implications of this for both helping and manipulating people are enormous (Fischhoff, Slovic, & Lichtenstein, 1978, 1980).

3. Consideration of learning focuses attention on environmental variables and task structure. Therefore, variables such as amount of reinforcement, schedules of reinforcement, number of trials (= amount of experience), etc., should be considered in understanding judgment and decision behavior (cf. Estes, 1976). Although the importance of the task for understanding behavior has been continually stressed (Brunswik, 1943; Castellan, 1977; Cronbach, 1975; Dawes, 1975b; W. Edwards, 1971; Einhorn & Hogarth, 1978; Simon & Newell, 1971), psychologists seem as prone to what Ross (1977) calls the fundamental attribution error (underweighting environmental factors in attributing causes) as anyone else.

4. A major variable in understanding heuristics is outcome feedback. Since outcome feedback is the main source of information for evaluating the quality of our decision/judgment rules, knowledge of how task variables both affect outcomes and influence the way outcomes are coded and stored in memory becomes critical in explaining how heuristics are learned and used.

5. The area of learning is the focal point for considering the relative merits of psychological versus economic explanations of choice behavior. Some economists have argued that although one does not act “rationally” all the time, one will learn the optimal rule through interaction with the environment. Vague assertions about equilibrium, efficiency, and evolutionary concepts are advanced to bolster this argument. Therefore, study of how (and how well) people learn from experience is important in casting light on the relative merits of psychological and economic theories

of choice.

Learning from experience: How?

It is obvious that decision making is action oriented; one has to choose

what action to take in order to satisfy basic needs and wants. Therefore, it is important for any organism to learn the degree to which actions will lead to desirable or undesirable outcomes. This means that a great deal of learning from experience must involve the learning of action-outcome linkages. Furthermore, since actions and outcomes are contiguous, people

are prone to see the links between them as representing cause-and-effect relationships (Michotte, 1963). Therefore, the strong tendency to see


causal relations can be seen as an outgrowth of the need to take action to satisfy basic needs. Moreover, as pointed out by Kahneman and Tversky (1979b), the learning of causal relationships and the organizing of events into causal "schemata" allow people to achieve a coherent interpretation of their experience. Finally, the learning of action-outcome links is important for understanding how people learn their own tastes or utilities. For example, consider a child who chooses a particular vegetable to

eat, experiences an unpleasant taste, and thereby learns to associate a negative utility with that food. Note that it is typically by choosing that consequences can be experienced and utility learned. Therefore, the

learning of action-outcome links and the learning of utility are closely tied together.

Although we learn from experience by taking action, how does one initially learn which alternative to choose? Undoubtedly, much initial

learning occurs by trial and error; that is, people randomly choose an option and observe the outcome (cf. Campbell, 1960). The process by which trial-and-error learning gives way to the development of strategies or rules is not well known (cf. Siegler, 1979). However, one can speculate that both reinforcement from trial-and-error learning and generalization (both stimulus and response) play an important role (Staddon & Simmelhag, 1971). In any event, the rules we develop seem directly tied to learning what outcomes will follow from particular actions. As described

above, learning from experience is basically inductive in nature, that is, one experiences specific instances or cases and heuristics are developed to

provide some general way to deal with them. The inductive nature of learning from experience has several implications regarding heuristics:

1. Specificity of rules. If learning occurs inductively via specific cases,

then heuristic rules should be extremely context dependent. Much evidence now suggests that this is indeed the case (Grether & Plott, 1979; Lichtenstein & Slovic, 1971; Simon & Hayes, 1976; Tversky & Kahneman, 1980). The way in which a problem is worded or displayed or a particular response is asked for all seem to make an important difference in the way information is processed and responses generated. A dramatic example of this specificity can be seen in the work of Simon and Hayes (1976) on "problem isomorphs." They have shown that different surface wordings

of structurally identical problems (i.e., problems that can be solved using

identical principles) greatly change how people represent the problem in memory and consequently solve it. An important implication of this result is that in order to make heuristic models more predictive, one must

contend with the task as represented and not necessarily with the task structure as seen by an experimenter. A particularly timely example of the importance of this phenomenon in predicting behavior is provided by

observing that behavior depends on whether a tax cut is represented as a

gain or a smaller loss (Kahneman & Tversky, 1979b).

2. Generality of rules. If heuristics are rules learned through induction, it


is necessary to group tasks by similarity or else there would be as many

rules as situations. Since this latter possibility is unacceptable, heuristics must have some generality over tasks. However, this conclusion contradicts what was said above about context dependence and specificity of rules. This paradox can be resolved if one considers the range of tasks to which a rule can be applied. For example, consider the rule "Never order fish in a meat restaurant." While such a rule is general with respect to a certain type of restaurant, it is certainly more specific than the rule "Judge the probability with which event B comes from process A by their degree of similarity" (Tversky & Kahneman, 1974, 1). The latter heuristic is clearly

at a much higher level of generality. In fact, it may be that heuristics like representativeness, availability, anchoring, and adjusting are "metaheuristics," that is, they are rules on how to generate rules. Therefore, when confronted by problems that one has not encountered before (like judging

probabilities of events), or problems whose specificity makes them seem novel, metaheuristics direct the way in which specific rules can be formed to solve the problem. The idea of a metaheuristic allows one to retain the generality that any rule necessarily implies, yet at the same time allows for the important effects of context, wording, response mode, and so on. In order to illustrate, consider the study by Slovic, Fischhoff, and Lichtenstein (1976; see also Chapter 33) in which people were asked to judge the relative probabilities of death from unusual causes. For example, which has a higher probability: being killed by lightning or dying from emphysema? When confronted with such a question, there are many ways to attempt an answer. One rule that could be used would be: "Think of all the people I know that have died from the two causes and pick the event which caused more deaths." In my own case, I would choose emphysema (which does have a higher probability, although most people pick being killed by lightning). However, I could have just as easily developed a rule that would lead to the opposite answer: for example, "Think of all of the cases of being killed by lightning and of death from emphysema that I

have ever heard about (newspapers, television, etc.).” If this were my rule, I would choose being killed by lightning as being more probable. Note that in both cases I have used an availability heuristic. Clearly, the way in which a question is phrased could induce specific rules that lead to different results, yet these specific rules could be classified under a single,

more general strategy, or metaheuristic (also see Einhorn, Kleinmuntz, & Kleinmuntz, 1979).

3. Strength of heuristics. If heuristics are learned inductively, then learning occurs over many trials with many reinforcements. As will be discussed, because of the way feedback occurs and because of the methods

that we use to test rules via experience, positive reinforcement can occur

even for incorrect rules (Wason, 1960). Moreover, in addition to the large number of reinforcements that we experience, the size or intensity of reinforcement can be large. For example, gaining a sizable amount of


money following the use of some rule for picking stocks should have a considerable reinforcement effect. Therefore, unlike laboratory studies of human learning, where ethical considerations prevent large positive and negative reinforcements, our own experience poses no such constraints.

Learning from experience: How well?

The question of how well we learn from experience focuses attention on

comparing heuristic rules with optimal rules. Therefore, it must be asked how the latter are learned and what the implications are for applying

them in our own experience? Optimal rules, such as Bayes’ theorem, optimization, etc., are learned deductively. In fact, much of what can be called formal learning is of a deductive character, that is, we are taught scientific laws, logical principles, mathematical and statistical rules, etc. Such rules are by their very nature abstract and context independent. Furthermore, when context can influence the form of a rule, one is

frequently told that the rule holds, “other things being equal.“ Of course, in our own experience other things are rarely equal, which makes the learning of optimal rules via induction so difficult. (The original discoverers or inventors of optimal rules overcame these difficulties; however, this

distinguishes them from the rest of us.) The abstract nature of deductive rules has important implications

regarding the difficulty people have of applying optimal methods in specific situations. This difficulty centers around the ability to discern the structure of tasks that are embedded in a rich variety of detail. Therefore,

when one is faced with a specific problem that is rich in detail and in which details may be irrelevant or redundant, one's attention to specifics is likely to divert attention from the general structure of the problem. In fact, the very abstractness of deductively learned optimal rules may prevent them from being retrieved from memory (cf. Nisbett et al., 1976, 7). Abstract rules, therefore, may not be very "available" in specific cases. However, this begs the question since it is important to know why these rules are not available.

Consider the way action-outcome combinations are likely to be organized and stored in memory. In particular, consider whether such information is more likely to be organized and stored by content or task

structure. It would seem easier and more "natural" to organize action-outcome combinations by subject matter rather than by structure: for example, experiences with schools, parents, members of the opposite sex, etc., rather than Bayesian problems, selection situations, optimization problems, and so on. The fact that content can differ while structure

remains the same is quite difficult to see (Einhorn et al., 1979; Kahneman & Tversky, 1979b; Simon & Hayes, 1976). Therefore, I think it unlikely that most people organize their experiences by task structure. This is not to say that one could not be trained to do so. In fact, much of professional


Figure 1. Venn diagram showing the relationship between the hypothesis (H), datum (D), and report of datum (D').

this example. Consider that the general makes the logical error and estimates the chance of war at .75. He then sends his troops to the border

thereby causing an invasion by the enemy. Therefore, the faulty reasoning of the general is reinforced by outcome feedback: “After all,” he might say, “those SOB’s did invade us, which is what we thought they'd do.”

The two examples illustrate the basic point of this chapter: Without knowledge of task structure, outcome feedback can be irrelevant or even harmful for correcting poor heuristics. Moreover, positive outcome feed-

back without task knowledge tends to keep us unaware that our rules are poor, since there is very little motivation to question how successes were

achieved. The conditions under which outcome feedback does not play a correcting role vis-à-vis heuristics and strategies are denoted outcome-irrelevant learning structures (OILS). Such structures may be much more common than we think. Before examining one such structure in detail, consider probabilistic judgments within the framework of OILS, since

much of the work on heuristics is directly concerned with this type of judgment. Consider that you judge the probability of some event to be .70.

Let us say that the event doesn't happen. What does this outcome tell you about the quality of the rules used to generate the judgment? One might argue that any single outcome is irrelevant in assessing the "goodness" (i.e., degree of calibration) of probabilistic judgments. Therefore, in an

important sense, immediate outcome information is irrelevant for correcting poor heuristics. It is only if one keeps a “box score” of the relative frequency of outcomes when one judges events with a given probability

that one can get useful feedback from outcomes. However, this is likely to be a necessary but not sufficient condition for making well-calibrated judgments. First, over what time period does one keep the box score before

deciding that the judgment is or isn’t calibrated? Furthermore, how close

is "close enough” in order to say that the judgment is accurate (in the sense of being well calibrated)? Note that this whole mode of evaluating outcomes involves reinforcement that is delayed for long time periods.


Thus it is not clear that such feedback will have much of a self-correcting effect. Second, in order to learn about the goodness of rules for estimating probability, one's box score must include not only one's estimates and the resulting outcomes but also one's rules for deriving those estimates. For example, if I kept a record of outcomes for 100 cases in which I gave estimates of .7, what would the information that 53 of those times the event happened tell me about the quality of the rules I used? Since it is likely that many different rules could have been used to estimate probabilities in the 100 different situations, the outcome information is irrelevant

and outcome feedback is not useful unless one is aware of one's rules and a record is kept of their use (cf. Nisbett & Wilson, 1977, on whether we are aware of our own cognitive processes).

I do not mean to imply that it is impossible to learn to make well-calibrated probability judgments. If one makes many probability judgments in the same situation, such as weather forecasters and horse-racing handicappers do, and outcome feedback is quickly received, such conditions may not be outcome irrelevant, and feedback can be self-correcting. However, such conditions would seem to be the exception rather than the

rule for most of us. Although probabilistic judgments typically occur in OILS, what about non-probabilistic judgments? Surely, if one makes a prediction about something one can check to see if the prediction is correct or not.

Therefore, it would seem that outcomes should be relevant for providing self-correcting feedback. The remainder of this chapter discusses this issue

within the context of one general and prevalent task structure, although the specific content of such tasks may be quite different.

Selection task

A very general task involving non-probabilistic judgments is now exam-

ined, since outcome information seems both available and relevant for providing self-correcting feedback. The task to be considered is one in which judgments are made for the purpose of choosing between alternative actions. For example, consider a situation with two possible actions, A and B. Denote by x an overall, evaluative judgment, which may itself be a function of various types and amounts of information. Furthermore, let x_c be a cutoff point such that

    if x ≥ x_c, take action A;
    if x < x_c, take action B.          (1)

Although simplistic, Equation 1 applies to many judgment/decision situations, for example: job hiring, promotion, admission to school, loan and credit granting, assignment to remedial programs, admission to social programs, journal article acceptance, grant awarding, etc. In these cases, a

1 Much of this section is drawn from Einhorn and Hogarth (1978).


Figure 2. Action-outcome combinations that result from using judgment to make an accept-reject decision. [Plot of the criterion y (performance) against the judgment x, divided into four quadrants by the cutoffs x_c and y_c: positive hits (x ≥ x_c, y ≥ y_c), false positives (x ≥ x_c, y < y_c), false negatives (x < x_c, y ≥ y_c), and negative hits (x < x_c, y < y_c); x ≥ x_c means "accept" and x < x_c "reject," while y ≥ y_c is labeled "success" and y < y_c "failure."]

judgment of the degree of “deservedness” typically determines which

action is to be taken, since the preferred action cannot be given to all. In order to compare judgment with a standard, the existence of a criterion, denoted y, is assumed to serve as the basis for evaluating the

accuracy of judgment. While the practical difficulties of finding and developing adequate criteria are enormous, the focus here is theoretical: The concept of a criterion is what is necessary for this analysis. To be consistent with the formulation of judgment, it is further assumed that the criterion has a cutoff point (y_c) such that y ≥ y_c and y < y_c serve as the basis for evaluating the outcomes of judgment. Thus, as far as learning about judgment is concerned, representation of outcomes in memory is often of

categorical form, that is, successes and failures (cf. Estes, 1976). It is very important to note that the structure of the task is one in which judgments (predictions) lead to differential actions and that outcomes are

then used as feedback for determining the accuracy of the predictions. The formal structure can be seen by considering the regression of y on .r and

the four quadrants that result from the intersection of x, and y, as illustrated in Figure 2. Denote the correct predictions as positive and negative hits and the two types of errors as false positives (y -: y,|x a .r,) and false negatives (y a y,|x -=: x.,). To estimate the relationship between r and yr (i.e., the correlation between .r and y, pg) it is necessary to have

information on each judgment—outcome combination. Assume first that such information becomes available over time (i.e., sequentially), and consider the experimental evidence concerned with learning the rela-


tionship between x and y in such circumstances. Research on the ability to judge the contingency between x and y from information in 2 x 2 tables (Jenkins & Ward, 1965; Smedslund, 1963, 1966; Ward & Jenkins, 1965) indicates that people judge the strength of relationships by the frequency of positive hits (in the terminology of Figure 2), while generally ignoring

information in the three other cells. These results are extremely important, since they say that even when all of the relevant outcome information is available, people don't use it. This means that in laboratory studies that have outcome-relevant learning structures, people have transformed them

into outcome-irrelevant learning structures. How can this be explained? The explanation advanced here is that our experience in real-world tasks is such that we develop rules and methods that seem to “work”

reasonably well. However, these rules may be quite poor and our unawareness of their inadequacy is profound. This lack of awareness exists because positive outcome feedback can occur in spite of, rather than because of, our

predictive ability. In order to illustrate, consider the study by Wason (1960) in which he presented subjects with a three-number sequence, for

example: 2, 4, 6. Subjects were required to discover the rule to which the three numbers conformed (the rule being three ascending numbers). To discover the rule, they were permitted to generate sets of three numbers which the experimenter classified as conforming or not conforming to the rule. At any point, subjects could stop when they thought they had

discovered the rule. The correct solution to this task should involve a search for disconfirming evidence rather than the accumulation of

confirming evidence. For example, if someone believed that the rule had something to do with even numbers, this could only be tested by trying a sequence involving an odd number (i.e., accumulating vast amounts of

confirming instances of even-number sequences would not lead to the rule). The fact that only 6 of 29 subjects found the correct rule the first time they thought they did, illustrates the dangers of induction by simple

enumeration. As Wason (1960) points out, the solution to this task must involve "a willingness to attempt to falsify hypotheses, and thus to test those intuitive ideas which so often carry the feeling of certitude" (p. 139, italics added). It is important to emphasize that in Wason's experiment, where actions were not involved, a search for disconfirming evidence is possible. However, when actions are based on judgment, learning based on discon-

added). It is important to emphasize that in Wason's experiment, where actions were not involved, a search for disconfirming evidence is possible. However, when actions are based on judgment, learning based on discon-

firming evidence becomes more difficult to achieve. Consider how one might erroneously learn an incorrect rule for making judgments and focus on the hypothetical case of a manager learning about his predictive ability concerning the "potential" of job candidates. The crucial factor here is that

actions (e.g., accept/do not accept) are contingent on judgment. At a subsequent date the manager can only examine accepted candidates to see how many are "successful." If there are many successes, which is likely, these instances all confirm the rule. Indeed, the important


point here is that it would be difficult to disconfirm the rule, even though it might be erroneous. One way in which the rule could be tested would be for the manager to accept a subset of those he judged to have low potential and then to observe their success rate. If their rate was as high as those judged to be of high potential, the rule would be disconfirmed. However, a systematic search for disconfirming evidence is rare and could be objected to on utilitarian and even ethical grounds, that is, one would have to withhold the preferred action from some of those judged most deserv-

ing and give it to some judged least deserving. Therefore, utilitarian and/or

tion of possibly disconfirming information. Note that the tendency not to test hypotheses by disconfirming instances is a direct consequence of the

task structure in which actions are taken on the basis of judgment. Wason (1960) points out, "In real life there is no authority to pronounce judgment on inferences: the inferences can only be checked against the evidence" (p. 139). As a result, large amounts of positive feedback can lead to reinforcement of a non-valid rule. Although outcomes contingent on the action-not-taken may not be sought, it is still the case that one can examine the number of positive hits

and false positives as a way to check on the accuracy of one’s predictions. Therefore, while such information is incomplete for accurately assessing the relationship between predictions and outcomes, such information is

what most people have available. It is therefore important to consider the factors that affect these variables.

Factors affecting positive hits and false positives

Consider Figure 2 again and note that there are three factors that affect the rates of positive hits and false positives: the location of x_c, y_c, and the "tilt" of the ellipse (which is the correlation between x and y). For example, if x_c is moved to the right, holding y_c and ρ_xy constant, there is a point at which there will be no false positives. Of course, there will be a corresponding increase in false negatives. However, if one doesn't have information about these cases (as is generally the situation), one's experience of success can be quite convincing that judgmental quality is high. Therefore, when the criterion for giving the preferred action is raised (increasing x_c), the probability p(x ≥ x_c) (also called the selection ratio, φ) is decreased, and this leads to high positive hit and low false-positive rates. The second factor, y_c, obviously affects outcomes, since the level of y_c defines success and failure. Note that when y_c is lowered, the probability p(y ≥ y_c) (also called the base rate, br) is raised and one's experience of successes may be high irrespective of judgmental ability; that is, if one randomly assigned people to the various actions, one would experience a success rate equal to p(y ≥ y_c). Therefore, to judge one's predictive ability, the comparison of


the positive hit rate with p(y ≥ y_c) should be made and judgmental ability assessed on the marginal increase in successes. The third factor, ρ_xy, affects outcomes in a straightforward manner; namely, the larger ρ_xy, the greater the positive hit rate.
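These three factors can be made concrete with a small simulation. The sketch below is an added illustration (not part of the original chapter); it assumes a bivariate normal relation between the standardized judgment x and criterion y, and all names and parameter values are arbitrary.

```python
# Minimal simulation sketch (not from the original chapter): how the
# selection ratio (phi), the base rate (br), and the judgment-criterion
# correlation (rho) jointly determine the positive hit rate.
import numpy as np

rng = np.random.default_rng(0)

def positive_hit_rate(rho, selection_ratio, base_rate, n=200_000):
    # Draw standardized judgment x and criterion y with correlation rho.
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    x_c = np.quantile(x, 1 - selection_ratio)   # cutoff so P(x >= x_c) = phi
    y_c = np.quantile(y, 1 - base_rate)         # cutoff so P(y >= y_c) = br
    accepted = x >= x_c
    return np.mean(y[accepted] >= y_c)          # P(success | accepted)

# Even a modest correlation yields a high positive hit rate when the
# preferred action is given sparingly and successes are common anyway.
for phi in (0.5, 0.2, 0.05):
    rate = positive_hit_rate(rho=0.3, selection_ratio=phi, base_rate=0.6)
    print(f"rho=0.3, base rate=0.6, selection ratio={phi}: hit rate={rate:.2f}")
```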

The effects of these three factors on the positive hit rate are well known. Taylor and Russell (1939), for example, have shown that one can increase the positive hit rate, for any given ρ_xy and base rate, by reducing the selection ratio (φ), that is, by giving the preferred action to a smaller percentage (assuming ρ_xy ≥ 0). Thus, even if ρ_xy is low, it is possible to have a high positive hit rate depending on the values of φ and br. Taylor and Russell (1939) provide tables of positive hit rates for a wide range of values of ρ_xy, φ, and br. Examination of these tables shows that low correlations between judgments and criteria are not incompatible with large positive hit rates.

In addition to the three factors already mentioned, a fourth factor must be considered. This can be illustrated by imagining the following experi-

ment. Assume that a series of judgments is made about some persons. Of those judged to be above x_c, randomly assign half to action A and half to action B. Similarly do the same for those judged below x_c. At some later point in time, measure performance and calculate the proportion of persons with y ≥ y_c in each cell (each person is assigned a 0 or 1 to indicate whether he or she is below or above the cutoff on y - the proportion above y_c being simply the mean of that cell). This is a 2 x 2 factorial design with one factor being "judgment" and the other "type of action." Note that because the criterion cannot be measured immediately before the decision (indeed, if it could, there would be no need for judgment), people receiving actions A and B have also received different experimental treatments. If this experiment were done, one could test for the main effect of judgment (which measures its accuracy); the main effect for the action, that is, whether receiving A or B in itself causes differences in performance; and the interaction between judgment and action. Observe that the advantage of the experiment is that it allows one to untangle the accuracy of judgment from the treatment effects of the action. However, such an

experiment is rarely done, even conceptually, and especially not by people without extensive training in experimental design. Therefore, judgmental accuracy will almost always be confounded with possible treatment effects

due to actions. Furthermore, and with reference to the earlier discussion,

this experiment allows one to examine disconfirming information. In contrast to most real judgmental tasks, therefore, it would permit one to disconfirm the hypothesis of judgmental accuracy as well as to estimate any treatment effects due to the action.
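The logic of this hypothetical experiment can also be simulated. The following sketch is an added illustration (not from the original chapter); it deliberately assumes that the judgment x is pure noise while action A has a genuine treatment effect, so the randomized 2 x 2 design exposes a large action effect and essentially no judgment effect.

```python
# Sketch (not part of the original chapter) of the 2 x 2 design described
# above, under the deliberately pessimistic assumption that judgment x is
# unrelated to y and only the action itself affects the outcome.
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

x = rng.normal(size=n)            # judgment (here: pure noise)
treatment_effect = 1.0            # action A boosts the outcome
x_c, y_c = 0.0, 0.5               # cutoffs for judgment and success

# Randomly give action A to half of those above x_c and half of those below.
action_A = rng.random(n) < 0.5
y = rng.normal(size=n) + treatment_effect * action_A
success = y >= y_c

for judged_high in (True, False):
    for a in (True, False):
        cell = (x >= x_c) == judged_high
        cell &= action_A == a
        row = "judged high" if judged_high else "judged low"
        col = "action A" if a else "action B"
        print(f"{row:11s} | {col}: success rate = {success[cell].mean():.2f}")

# The rows (judgment) hardly differ while the columns (action) do:
# the design separates judgmental accuracy from treatment effects.
```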

An example of treatment effects is shown in Figure 3. The dotted ellipse is that shown in Figure 2 and represents the “true” relationship between judgements and outcomes. The shaded portion indicates those outcomes


low because of the failure to adequately understand the task structure. Therefore, although one might suppose that non-probabilistic judgments

are made in an outcome-relevant learning structure, when judgments are made for the purpose of deciding between actions, outcome information may be irrelevant for providing self-correcting feedback.

Conclusion

The basic theme of this chapter has been that outcome information, without knowledge of task structure, can be irrelevant for providing self-correcting feedback about poor heuristics. It has also been argued that knowledge of task structure is difficult to achieve because of the inductive

way in which we learn from experience (cf. Hammond, 1978, on Galilean vs. Aristotelian modes of thought). These conclusions raise two issues that

will be briefly discussed.

It may be the case that even with knowledge of task structure, one chooses to act in such a way that learning is precluded. For example,

consider a waiter in a busy restaurant. Because he doesn't have time to give good service to all the customers at his station, he makes a prediction about which customers are likely to leave good or poor tips. Good or bad service

is then given depending on the prediction. If the quality of service has a treatment effect on the size of the tip, the outcomes "confirm" the original

predictions. Note that the waiter could perform an experiment to disentangle the treatment effects of quality of service from his predictions if he

was aware of the task structure: that is, he could give poor service to some of those he judged to leave good tips and good service to some of those judged to leave poor tips. However, note that the waiter must be willing to

risk the possible loss of income if his judgment is accurate, against learning that his judgment is poor. The latter information may have long-run benefits in that it could motivate the person to try to make better predictions or, if this is not possible, to use a strategy of giving good or

poor service randomly, thus saving much mental effort. In organizational decisions, the long-run benefits from knowing about the accuracy of one’s

predictions could be substantial. For example, if selection interviews do not predict performance (independent of treatment effects), why spend money and time using them? Therefore, the costs and benefits of short-run strategies for action versus long-run strategies for learning need to be more fully investigated. The second issue can be raised by stating the following question: If

people learn and continue to use poor rules, does this not contradict the evolutionary concept of survival of the fittest? I take this question to mean that those who use bad rules should be less likely to survive than those

who use better rules (they are more fit).¹ However, the use of better rules can still be quite removed from the use of optimal rules. The concept of most "fit" involves a relative ordering, while optimality implies some absolute level. Therefore, the fact that suboptimal rules are maintained in the face of experience is not contradicted by Darwinian theory. Perhaps the most succinct way of putting this is to quote Erasmus: "In the land of the blind, the one-eyed man is king."

¹ I would like to thank J. E. R. Staddon for raising the points discussed in this section.

28. The robust beauty of improper linear models

Robyn M. Dawes

If Y and the Xs are all expressed in standard scores, the covariance of Y with X1 + X2 + . . . + XM is r1 + r2 + . . . + rM (the sum of the correlations). The variance of Y is 1, and the variance of the sum of the Xs is M + M(M - 1)r̄, where r̄ is the average intercorrelation between the Xs. Hence, the correlation of the average of the Xs with Y is (Σri)/√[M + M(M - 1)r̄]; this is greater than (Σri)/√[M + M² - M] = (Σri)/M, the average ri. Because each of the random models is positively correlated with the criterion, the correlation of the average, which is the unit-weighted model, is higher than the average of the correlations.
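The algebra above is easy to check numerically. The following sketch (illustrative only; the validities and sample size are made up) generates standardized predictors that each correlate positively with a criterion and compares the average of the individual correlations with the correlation of the unit-weighted composite, both directly from the data and through the closed-form expression in the note.

```python
# Numerical check (made-up data, not from the chapter) of the claim that the
# unit-weighted composite of positively valid predictors correlates with the
# criterion at least as highly as the average individual predictor does.
import numpy as np

rng = np.random.default_rng(1)
n, M = 100_000, 5
y = rng.normal(size=n)

# Predictors with assorted (positive) validities.
weights = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
X = weights * y[:, None] + rng.normal(size=(n, M))
X = (X - X.mean(0)) / X.std(0)                 # put predictors in standard scores
y = (y - y.mean()) / y.std()

r_i = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(M)])
composite = X.mean(axis=1)                     # the unit-weighted model
r_composite = np.corrcoef(composite, y)[0, 1]

# Closed form from the note: sum(r_i) / sqrt(M + M(M-1) * rbar),
# where rbar is the average intercorrelation among the predictors.
inter = np.corrcoef(X, rowvar=False)
rbar = (inter.sum() - M) / (M * (M - 1))
r_formula = r_i.sum() / np.sqrt(M + M * (M - 1) * rbar)

print("average r_i:          ", r_i.mean().round(3))
print("composite r (data):   ", r_composite.round(3))
print("composite r (formula):", r_formula.round(3))
```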

The fact that different linear composites correlate highly with each other was first pointed out 40 years ago by Wilks (1938). He considered only situations in which there was positive correlation between predictors. This result seems to hold generally as long as these intercorrelations are not negative: for example, the correlation between X + 2Y and 2X + Y is .80 when X and Y are uncorrelated. The ways in which outputs are relatively insensitive to changes in coefficients (provided changes in sign are not involved) have been investigated most recently by Green (1977), Wainer (1976), Wainer and Thissen (1976), W. Edwards (1978), and Gardiner and Edwards (1975).

Dawes and Corrigan (1974, p. 105) concluded that "the whole trick is to

know what variables to look at and then know how to add." That principle is well illustrated in the following study, conducted since the Dawes and Corrigan article was published. In it, Hammond and Adelman (1976) both investigated and influenced the decision about what type of bullet should be used by the Denver City Police, a decision having much more obvious social impact than most of those discussed above. To quote Hammond and Adelman (1976):

In 1974, the Denver Police Department (DPD), as well as other police departments throughout the country, decided to change its handgun ammunition. The principal reason offered by the police was that the conventional round-nosed bullet provided insufficient "stopping effectiveness" (that is, the ability to incapacitate and thus to prevent the person shot from firing back at a police officer or others).

The DPD chief recommended (as did other police chiefs) that the conventional bullet be replaced by a hollow-point bullet. Such bullets, it was contended, flattened on impact, thus decreasing penetration, increasing stopping effectiveness, and decreasing ricochet potential. The suggested change was challenged by the American Civil Liberties Union, minority groups, and others. Opponents of the change claimed that the new bullets were nothing more than outlawed "dumdum" bullets, that they created far more injury than the round-nosed bullet, and should, therefore, be barred from use. As is customary, judgments on this matter

were formed privately and then defended publicly with enthusiasm and tenacity, and the usual public hearings were held. Both sides turned to ballistics experts for

scientific information and support. (p. 392)

The disputants focused on evaluating the merits of specific bullets,

confounding the physical effect of the bullets with the implications for social policy; that is, rather than separating questions of what it is the bullet should accomplish (the social policy question) from questions concerning ballistic characteristics of specific bullets, advocates merely argued for one bullet or another. Thus, as Hammond and Adelman

pointed out, social policymakers inadvertently adopted the role of (poor) ballistics experts, and vice versa. What Hammond and Adelman did was to discover the important policy dimensions from the policymakers, and then they had the ballistics experts rate the bullets with respect to these

dimensions. These dimensions turned out to be stopping effectiveness

404

CORRECTIVE PROCEDURES

The data indicate otherwise. In the L.R. Goldberg (1970) study, for example, only 5 of 29 trained clinicians were better than the unit-

weighted model, and none did better than the proper one. In the Wiggins and Kohen (1971) study, no judges were better than the unit-weighted model, and we replicated that effect at Oregon. In the Libby (1976) study,

only 9 of 43 judges did better than the ratio of assets to liabilities at predicting bankruptcies (3 did equally well). While it is then conceded that clinicians should be able to predict diagnosis of neurosis or psychosis, that graduate students should be able to predict graduate success, and that bank loan officers should be able to predict bankruptcies, the possibility is

raised that perhaps the experts used in the studies weren't the right ones. This again is arguing from a vacuum: If other experts were used, then the

results would be different. And once again no such experts are produced, and once again the appropriate response is to ask for a reason why these

hypothetical other people should be any different. (As one university vice-president told me, “Your research only proves that you used poor judges; we could surely do better by getting better judges” — apparently

not from the psychology department.) A final technical objection concerns the nature of the criterion variables. They are admittedly short-term and unprofound (e.g., GPAs, diagnoses); otherwise, most studies would be infeasible. The question is then raised of whether the findings would be different if a truly long-range important

criterion were to be predicted. The answer is that of course the findings could be different, but we have no reason to suppose that they would be different. First, the distant future is in general less predictable than the immediate future, for the simple reason that more unforeseen, extraneous,

or self-augmenting factors influence individual outcomes. (Note that we are not discussing aggregate outcomes, such as an unusually cold winter in the Midwest, which is in general spread out over three months.) Since, then, clinical prediction is poorer than linear to begin with, the hypothesis would hold only if linear prediction got much worse over time than did clinical prediction. There is no a priori reason to believe that this differential deterioration in prediction would occur, and none has ever been suggested to me. There is certainly no evidence. Once again, the objection consists of an argument from a vacuum. Particularly compelling is the fact that people who argue that different

criteria or judges or variables or time frames would produce different results have had 25 years in which to produce examples, and they have

failed to do so. Psychological One psychological resistance to using linear models lies in our selective memory about clinical prediction. Our belief in such prediction is rein-

forced by the availability (Tversky & Kahneman, 1974) of instances of

406

CORRECTIVE PROCEDURES

Now what are we dealing with? We are dealing with personality and intellectual characteristics of [uniformly bright] people who are about 20 years old . . . . Why are we so convinced that this prediction can be made at all? Surely, it is not necessary to read Ecclesiastes every night to understand the role of chance. . .. Moreover, there are clearly positive feedback effects in professional development

that exaggerate threshold phenomena. For example, once people are considered sufficiently "outstanding" that they are invited to outstanding institutions, they have outstanding colleagues with whom to interact - and excellence is exacerbated. This same problem occurs for those who do not quite reach such a threshold level. Not only do all these factors mitigate against successful long-range prediction, but studies of the success of such prediction are necessarily limited to those

accepted, with the incumbent problems of restriction of range and a negative covariance structure between predictors (Dawes, 1975).

Finally, there are all sorts of nonintellectual factors in professional success that could not possibly be evaluated before admission to graduate school, for example, success at forming a satisfying or inspiring libidinal

relationship, not yet evident genetic tendencies to drug or alcohol addiction, the misfortune to join a research group that “blows up,” and so on, and so forth. Intellectually, I find it somewhat remarkable that we are able to predict even 16% of the variance. But I believe that my own emotional response is indicative of those of my colleagues who simply assume that the future is more predictable. I want it to be predictable, especially when the aspect of it that I want to predict is important to me. This desire, I suggest, translates itself into an implicit assumption that the future is in fact highly predictable, and it would then logically follow that if something is not a very good predictor,

something else might do better (although it is never correct to argue that it necessarily will). Statistical prediction, because it includes the specification (usually a low

correlation coefficient) of exactly how poorly we can predict, bluntly

strikes us with the fact that life is not all that predictable. Unsystematic clinical prediction (or ”postdiction”), in contrast, allows us the comforting illusion that life is in fact predictable and that we can predict it.

Ethical

When I was at the Los Angeles Renaissance Fair last summer, I overheard a young woman complain that it was "horribly unfair" that she had been rejected by the Psychology Department at the University of California, Santa Barbara, on the basis of mere numbers, without even an interview. "How can they possibly tell what I'm like?" The answer is that they can't. Nor could they with an interview (Kelly, 1954). Nevertheless, many people maintain that making a crucial social choice without an interview is dehumanizing. I think that the question of whether people are treated in a fair manner has more to do with the question of whether or not they


have been dehumanized than does the question of whether the treatment is face to face. (Some of the worst doctors spend a great deal of time conversing with their patients, read no medical journals, order few or no tests, and grieve at the funerals.) A GPA represents 3½ years of behavior on the part of the applicant. (Surely, not all the professors are biased against his or her particular form of creativity.) The GRE is a more carefully devised test. Do we really believe that we can do a better or a fairer job by a 10-minute folder evaluation or a half-hour interview than is done by these two mere numbers? Such cognitive conceit (Dawes, 1976, p. 7) is unethical, especially given the fact of no evidence whatsoever

indicating that we do a better job than does the linear equation. (And even making exceptions must be done with extreme care if it is to be ethical, for

if we admit someone with a low linear score on the basis that he or she has some special talent, we are automatically rejecting someone with a higher

score, who might well have had an equally impressive talent had we taken the trouble to evaluate it.) No matter how much we would like to see this or that aspect of one or another of the studies reviewed in this article changed, no matter how psychologically uncompelling or distasteful we may find their results to be, no matter how ethically uncomfortable we may feel at “reducing

people to mere numbers,” the fact remains that our clients are people who deserve to be treated in the best manner possible. If that means - as it appears at present — that selection, diagnosis, and prognosis should be

based on nothing more than the addition of a few numbers representing values on important attributes, so be it. To do otherwise is to cheat the people we serve.

29. The vitality of mythical numbers

Max Singer

It is generally assumed that heroin addicts in New York City steal some two to five billion dollars worth of property a year, and commit approxi-

mately half of all the property crimes. Such estimates of addict crime are used by an organization like RAND, by a political figure like Howard Samuels, and even by the Attorney General of the United States. The estimate that half the property crimes are committed by addicts was

originally attributed to a police official and has been used so often that it is now part of the common wisdom.

The amount of property stolen by addicts is usually estimated in something like the following manner: There are 100,000 addicts with an average habit of $30.00 per day. This

means addicts must have some $1.1 billion a year to pay for their heroin (100,000 x 365 x $30.00). Because the addict must sell the property he steals to a fence for only about a quarter of its value, or less, addicts must steal some $4 to $5 billion a year to pay for their heroin. These calculations can be made with more or less sophistication. One can allow for the fact that the kind of addicts who make their living

illegally typically spend upwards of a quarter of their time in jail, which would reduce the amount of crime by a quarter. (The New York Times

recently reported on the death of William “Donkey” Reilly. A 74-year-old ex-addict who had been addicted for 54 years, he had spent 30 of those

years in prison.) Some of what the addict steals is cash, none of which has to go to a fence. A large part of the cost of heroin is paid for by dealing in

the heroin business, rather than stealing from society, and another large part by prostitution, including male addicts living off prostitutes. But no

matter how carefully you slice it, if one tries to estimate the value of property stolen by addicts by assuming that there are 100,000 addicts and estimating what is the minimum amount they would have to steal to support themselves and their habits (after making generous estimates for legal income), one comes up with a number in the neighborhood of $1 billion a year for New York City.

This chapter originally appeared in The Public Interest, 1971, 23, 3-9. Copyright © 1971 by National Affairs, Inc. Reprinted by permission.

But what happens if you approach the question from the other side? Suppose we ask, "How much property is stolen - by addicts or anyone else?" Addict theft must be less than total theft. What is the value of property stolen in New York City in any year? Somewhat surprisingly to me when I first asked, this turned out to be a difficult question to answer, even approximately. No one had any estimates that they had even the faintest confidence in, and the question doesn't seem to have been much asked. The amount of officially reported theft in New York City is

approximately $300 million a year, of which about $100 million is the

value of automobile theft (a crime that is rarely committed by addicts). But it is clear that there is a very large volume of crime that is not reported; for example, shoplifting is not normally reported to the police. (Much property loss to thieves is not reported to insurance companies either, and the insurance industry had no good estimate for total theft.)

It turns out, however, that if one is only asking a question like, "Is it possible that addicts stole $1 billion worth of property in New York City

last year?” it is relatively simple to estimate the amount of property stolen. It is clear that the two biggest components of addict theft are shoplifting and burglary. What could the value of property shoplifted by addicts be?

All retail sales in New York City are on the order of $15 billion a year. This includes automobiles, carpets, diamond rings, and other items not usually available to shoplifters. A reasonable number for inventory loss to retail establishments is 2%. This number includes management embezzlements, stealing by clerks, shipping departments, truckers, etc. (Department stores, particularly, have reported a large increase in shoplifting in recent

years, but they are among the most vulnerable of retail establishments and not important enough to bring the overall rate much above 2%.) It is

generally agreed that substantially more than half of the property missing from retail establishments is taken by employees, the remainder being lost

to outside shoplifters. But let us credit shoplifters with stealing 1% of all the property sold at retail in New York City - this would be about $150 million a year. What about burglary? There are something like two and one-half million

households in New York City. Suppose that on the average one out of five of them is robbed or burglarized every year. This takes into account that in some areas burglary is even more commonplace, and that some households are burglarized more than once a year. This would mean 500,000 burglaries a year. The average value of property taken in a burglary might be on the order of $200. In some burglaries, of course, much larger amounts of property are taken, but these higher value burglaries are much rarer, and often are committed by

non-addict professional thieves. If we use the number of $200 x 500,000


burglaries, we get $100 million of property stolen from people's homes in a

year in New York City. Obviously, none of these estimated values is either sacred or substantiated. You can make your own estimate. The estimates here have the character that it would be very surprising if they were wrong by a factor of

10, and not very important for the conclusion if they were wrong by a factor of two. (This is a good position for an estimator to be in.)
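The arithmetic of the competing estimates is easy to reproduce. The sketch below uses only the assumed figures quoted in the text (it introduces no new data) to recompute the conventional top-down number and the bottom-up shoplifting and burglary figures.

```python
# Back-of-the-envelope arithmetic from the text, using the chapter's assumed
# numbers. The point is the rough order of magnitude, not any particular figure.

# The conventional, "mythical" estimate.
addicts = 100_000
habit_per_day = 30.0
heroin_bill = addicts * 365 * habit_per_day       # ~ $1.1 billion a year
fence_ratio = 0.25                                # fences pay ~ a quarter of value
implied_theft = heroin_bill / fence_ratio         # ~ $4-5 billion a year

# The bottom-up checks.
retail_sales = 15e9
shoplifting = 0.01 * retail_sales                 # ~ $150 million a year
burglaries = 2_500_000 / 5                        # 1 household in 5 burglarized
burglary_loss = burglaries * 200                  # ~ $100 million a year

print(f"heroin bill:   ${heroin_bill / 1e9:.1f} billion")
print(f"implied theft: ${implied_theft / 1e9:.1f} billion")
print(f"shoplifting:   ${shoplifting / 1e6:.0f} million")
print(f"burglary:      ${burglary_loss / 1e6:.0f} million")
```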

Obviously not all addict theft is property taken from stores or from people's homes. One of the most feared types of addict crime is property taken from the persons of New Yorkers in muggings and other forms of

robbery. We can estimate this, too. Suppose that on the average, one person in 10 has property taken from his person by muggers or robbers each year. That would be 800,000 such robberies, and if the average one produced $100 (which it is very unlikely to do), $80 million a year would be taken in this form of theft. So we can see that if we credit addicts with all of

the theft from homes, and all of the theft from persons, total property stolen by addicts in a year in New York City amounts to some $330

million. You can throw in all the “fudge factors" you want, add all the other miscellaneous crimes that addicts commit, but no matter what you do, it is difficult to find a basis for estimating that addicts steal over a half billion dollars a year, and a quarter billion looks like a better estimate, although perhaps on the high side. After all, there must be some thieves

who are not addicts. Thus, I believe we have shown that whereas it is widely assumed that addicts steal from $2 billion to $5 billion a year in New York City, the actual number is ten times smaller, and that this can be demonstrated by five minutes of thought.¹

¹ Mythical numbers may be more mythical and have more vitality in the area of crime than in most areas. In the early 1950s the Kefauver Committee published a $20 billion estimate for the annual "take" of gambling in the United States. The figure actually was "picked from a hat." One staff member said: "We had no real idea of the money spent. The California Crime Commission said $12 billion. Virgil Petersen of Chicago said $30 billion. We picked $20 billion as the balance of the two." An unusual example of a mythical number that had a vigorous life - the assertion that 28 Black Panthers had been murdered by police - is given a careful biography by Edward Jay Epstein in the February 13, 1971, New Yorker. (It turned out that there were 19 Panthers killed, ten of them by the police, and eight of these in situations where it seems likely that the Panthers took the initiative.)

So what? A quarter billion dollars' worth of property is still a lot of property. It exceeds the amount of money spent annually on addict rehabilitation and other programs to prevent and control addiction. Furthermore, the value of the property stolen by addicts is a small part of the total cost to society of addict theft. A much larger cost is paid in fear, changed neighborhood atmosphere, the cost of precautions, and other echoing and re-echoing reactions to theft and its danger. One point in this exercise in estimating the value of property stolen by

addicts is to shed some light on people's attitudes toward numbers. People


feel that there is a lot of addict crime, and that $2 billion is a large number, so they are inclined to believe that there is $2 billion worth of addict theft. But $250 million is a large number, too, and if our sense of perspective were not distorted by daily consciousness of federal expenditures, most people would be quite content to accept $250 million a year as a lot of theft. Along the same lines, this exercise is another reminder that even responsible officials, responsible newspapers, and responsible research

groups pick up and pass on as gospel numbers that have no real basis in fact. We are reminded by this experience that because an estimate has been used widely by a variety of people who should know what they are talking

about, one cannot assume that the estimate is even approximately correct. But there is a much more important implication of the fact that there

cannot be nearly so much addict theft as people believe. This implication is that there probably cannot be as many addicts as many people believe. Most of the money paid for heroin bought at retail comes from stealing, and most addicts buy at retail. Therefore, the number of addicts is basically

- although imprecisely - limited by the amount of theft. (The estimate developed in a Hudson Institute study was that close to half of the volume of heroin consumed is used by people in the heroin distribution system who do not buy at retail, and do not pay with stolen property but with

their "services" in the distribution system.²) But while the people in the business (at lower levels) consume close to half the heroin, they are only some one-sixth or one-seventh of the total number of addicts. They are the ones who can afford big habits. The most popular, informal estimate of addicts in New York City is 100,000-plus (usually with an emphasis on the "plus"). The federal register

in Washington lists some 30,000 addicts in New York City, and the New York City Department of Health's register of addicts' names lists some 70,000. While all the people on those lists are not still active addicts - many of them are dead or in prison - most people believe that there are many addicts who are not on any list. It is common to regard the estimate

of 100,000 addicts in New York City as a very conservative one. Dr. Judianne Densen-Gerber was widely quoted early in 1970 for her estimate that there would be over 100,000 teenage addicts by the end of the

summer. And there are obviously many addicts of 20 years of age and more.³ In discussing the number of addicts in this article, we will be talking

² A parallel datum was developed in a later study by St. Luke's Hospital of 81 addicts, average age 34. More than one-half of the heroin consumed by these addicts, over a year, had been paid for by the sale of heroin. Incidentally, these 81 addicts had stolen an average of $9,000 worth of property in the previous year.
³ Among other recent estimators we may note a Marxist, Sol Yurick, who gives us "500,000 junkies" (Monthly Review, December 1970), and William R. Corson, who contends, in the December 1970 Penthouse, that "today at least 2,500,000 black Americans are hooked on heroin."


about the kind of person one thinks of when the term "addict" is used.⁴ A

better term might be "street addict.” This is a person who normally uses heroin every day. He is the kind of person who looks and acts like the normal picture of an addict. We exclude here the people in the medical profession who are frequent users of heroin or other opiates, or are addicted to them, students who use heroin occasionally, wealthy people who are addicted but do not need to steal and do not frequent the normal

addict hangouts, etc. When we are addressing the "addict problem," it is much less important that we include these cases; while they are undoubtedly problems in varying degrees, they are a very different type of problem than that posed by the typical street addict. The amount of property stolen by addicts suggests that the number of New York City street addicts may be more like 70,000 than 100,000, and almost certainly cannot be anything like the 200,000 number that is

sometimes used. Several other simple ways of estimating the number of

street addicts lead to a similar conclusion. Experience with the addict population has led observers to estimate that

the average street addict spends a quarter to a third of his time in prison. (Some students of the subject, such as Edward Preble and John J. Casey, Jr., believe the average to be over 40%.) This would imply that at any one time, one-quarter to one-third of the addict population is in prison, and that the total addict population can be estimated by multiplying the number of

addicts who are in prison by three or four. Of course the number of addicts who are in prison is not a known quantity (and, in fact, as we have indicated above, not even a very precise concept). However, one can make reasonable estimates of the number of addicts in prison (and for this

purpose we can include the addicts in various involuntary treatment centers). This number is approximately 14,000-17,000, which is quite compatible with an estimate of 70,000 total New York City street addicts.
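The implied total is a one-line calculation; the sketch below simply multiplies the assumed prison figures by the reciprocal of the assumed fraction of time spent in prison.

```python
# Prison-based estimate described above, with the chapter's assumed figures:
# if a quarter to a third of street addicts are in prison at any one time,
# the total population is roughly 3 to 4 times the number behind bars.
addicts_in_prison = (14_000, 17_000)
multiplier = (3, 4)                     # reciprocal of 1/3 and 1/4

low = addicts_in_prison[0] * multiplier[0]
high = addicts_in_prison[1] * multiplier[1]
print(f"implied street-addict population: {low:,} to {high:,}")
# ~42,000 to ~68,000: consistent with "more like 70,000 than 100,000,"
# and nowhere near the 200,000 figure that is sometimes used.
```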

Another way of estimating the total number of street addicts in New York City is to use the demographic information that is available about the

addict population. For example, we can be reasonably certain that some

25% of the street addict population in New York City is Puerto Rican, and some 50% are blacks. We know that approximately five out of six street addicts are male, and that 50% of the street addicts are between the ages of

⁴ There is an interesting anomaly about the word "addict." Most people, if pressed for a definition of an "addict," would say he is a person who regularly takes heroin (or some such drug) and who, if he fails to get his regular dose of heroin, will have unpleasant or painful withdrawal symptoms. But this definition would not apply to a large part of what is generally recognized as the "addict population." In fact, it would not apply to most certified addicts. An addict who has been detoxified or who has been imprisoned and kept away from drugs for a week or so would not fit the normal definition of "addict." He no longer has any physical symptoms resulting from not taking heroin. "Donkey" Reilly would certainly fulfill most people's ideas of an addict, but for 30 of the 54 years he was an "addict" he was in prison, and he was certainly not actively addicted to heroin during most of the time he spent in prison, which was more than half of his "addict" career (although a certain amount of drugs are available in prison).

30. Intuitive prediction: Biases and corrective procedures

Daniel Kahneman and Amos Tversky

whereas what one knows about the sales of novels is distributional information. Similarly, in predicting the longevity of a patient, the singular information includes his age, state of health, and past medical history, whereas the distributional information consists of the relevant population statistics. The singular information describes the specific features of the problem that distinguish it from others, while the distributional information characterizes the outcomes that have been observed in cases of the same general class. The present concept of distributional data does not coincide with the Bayesian concept of a prior probability distribu-

tion. The former is defined by the nature of the data, whereas the latter is defined in terms of the sequence of information acquisition.

Many prediction problems are essentially unique in the sense that little, if any, relevant distributional information is available. Examples are the forecast of demand for nuclear energy in the year 2000, or of the date by

which an effective cure for leukemia will be found. In such problems, the expert must rely exclusively on singular information. However, the evidence suggests that people are insufficiently sensitive to distributional data even when such data are available. Indeed, recent research suggests

that people rely primarily on singular information, even when it is scanty and unreliable, and give insufficient weight to distributional information

(Kahneman & Tversky, 1973, 4; Tversky & Kahneman, Chap. 10). The context of planning provides many examples in which the distribu-

tion of outcomes in past experience is ignored. Scientists and writers, for example, are notoriously prone to underestimate the time required to

complete a project, even when they have considerable experience of past failures to live up to planned schedules. A similar bias has been documented in engineers' estimates of the completion time for repairs of power stations (Kidd, 1970). Although this planning fallacy is sometimes attributable to motivational factors such as wishful thinking, it frequently

occurs even when underestimation of duration or cost is actually penalized.

The planning fallacy is a consequence of the tendency to neglect distributional data and to adopt what may be termed an internal approach to prediction, in which one focuses on the constituents of the specific problem rather than on the distribution of outcomes in similar cases. The

internal approach to the evaluation of plans is likely to produce underestimation. A building can only be completed on time, for example, if there

are no delays in the delivery of materials, no strikes, no unusual weather conditions, and so on. Although each of these disturbances is unlikely, the probability that at least one of them will occur may be substantial. This combinatorial consideration, however, is not adequately represented in people's intuitions (Bar-Hillel, 1973). Attempts to combat this error by adding a slippage factor are rarely adequate, since the adjusted value tends to remain too close to the initial value that acts as an anchor (Tversky &

Kahneman, 1974, 1). The adoption of an external approach that treats the


specific problem as one of many could help overcome this bias. In this approach, one does not attempt to divine the specific manner in which a plan might fail. Rather, one relates the problem at hand to the distribution

of completion time for similar projects. It is suggested that more reasonable estimates are likely to be obtained by asking the external question: how long do such projects usually last? and not merely the internal question: what are the specific factors and difficulties that operate in the particular problem? The tendency to neglect distributional information and to rely mainly on singular information is enhanced by any factor that increases the

perceived uniqueness of the problem. The relevance of distributional data can be masked by detailed acquaintance with the specific case or by

intense involvement in it. The perceived uniqueness of a problem is also

influenced by the formulation of the question that the expert is required to answer. For example, the question of how much the development of a new product will cost may induce an internal approach in which total costs are broken down into components. The equivalent question of the percentage

by which costs will exceed the current budget is likely to call to mind the

distribution of cost overruns for developments of the same general kind. Thus, a change of units - for example, from costs to overruns - could alter

the manner in which the problem is viewed. The prevalent tendency to underweigh or ignore distributional infor-

mation is perhaps the major error of intuitive prediction. The consideration of distributional information, of course, does not guarantee the accuracy of forecasts. It does, however, provide some protection against

completely unrealistic predictions. The analyst should therefore make every effort to frame the forecasting problem so as to facilitate utilizing all

the distributional information that is available to the expert.

Regression and intuitive prediction

In most problems of prediction, the expert has both singular information about the specific case and distributional information about the outcomes

in similar cases. Examples are the counselor who predicts the likely achievements of a student, the banker who assesses the earning potential of a small business, the publisher who estimates the sales of a textbook, or the economist who forecasts some index of economic growth. How do people predict in such situations? Psychological research (Kahneman & Tversky, 1973, 4; Ross, 1977) suggests that intuitive predic-

tions are generated according to a simple matching rule: the predicted value is selected so that the standing of the case in the distribution of

outcomes matches its standing in the distribution of impressions. The following example illustrates this rule. An editor reviewed the manuscript of a novel and was favorably impressed. He said: “This book reads like a best-seller. Among the books of this type that were published in recent


years, I would say that only one in twenty impressed me more.” If the editor were now asked to estimate the sales of this novel, he would

probably predict that it will be in the top 5 percent of the distribution of sales. There is considerable evidence that people often predict by matching prediction to impression. However, this rule of prediction is unsound because it fails to take uncertainty into account. The editor of our example

would surely admit that sales of books are highly unpredictable. In such a situation of high uncertainty, the best prediction of the sales of a book should fall somewhere between the value that matches one's impression and the average sales for books of its type.

One of the basic principles of statistical prediction, which is also one of the least intuitive, is that the extremeness of predictions must be moder-

ated by considerations of predictability. Imagine, for example, that the publisher knows from past experience that the sales of books are quite unrelated to his initial impressions. Manuscripts that impressed him

favorably and manuscripts that he disliked were equally likely to sell well or poorly. In such a case of zero predictability, the publisher's best guess about sales should be the same for all books — for example, the average of the relevant category — regardless of his personal impression of the

individual book. Predictions are allowed to match impressions only in the case of perfect predictability. In intermediate situations, which are of course the most common, the prediction should be regressive; that is, it should fall between the class average and the value that best represents one’s impression of the case at hand. The lower the predictability, the closer the prediction should be to the class average. Intuitive predictions are typically nonregressive: people often make extreme predictions on the

basis of information whose reliability and predictive validity are known to be low. . . .
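A small sketch (with an invented reference-class distribution; all numbers are purely illustrative) makes the contrast concrete: the matching rule reads the prediction off the same percentile of the outcome distribution that the case occupies in the distribution of impressions, while the regressive rule shrinks that value toward the class average in proportion to predictability.

```python
# Illustrative contrast between the intuitive "matching" rule and a regressive
# prediction, using an invented distribution of past sales.
import numpy as np

rng = np.random.default_rng(2)
past_sales = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)   # a reference class

impression_percentile = 95        # "only one in twenty impressed me more"
matched = np.percentile(past_sales, impression_percentile)      # matching rule
class_mean = past_sales.mean()

for rho in (1.0, 0.5, 0.0):       # predictability of sales from impressions
    regressed = class_mean + rho * (matched - class_mean)
    print(f"rho = {rho:.1f}: predict {regressed:,.0f} copies")
# With perfect predictability the prediction matches the impression; with zero
# predictability it falls back to the class average, as the text prescribes.
```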

A corrective procedure for prediction

How can the expert be guided to produce properly regressive predictions? How can he be led to use the singular and distributional information that is available to him, in accordance with the principles of statistical predic-

tion? In this section a five-step procedure that is designed to achieve these objectives is proposed.

Step 1: Selection of a reference class

The goal of this stage is to identify a class to which the case at hand can be referred meaningfully and for which the distribution of outcomes is known or can be assessed with reasonable confidence. In the predictions of the sales of a book or of the gross earnings of a film,

for example, the selection of a reference class is straightforward. It is


distribution, over various domains of advanced technology, of England's

share of the world market in the year 2000? How do you expect the particular case of transportation systems to compare to other technolo-

gies?" Note that the distribution of outcomes is not known in this problem. However, the required distribution could probably be estimated on the basis of the distribution of values for England's present share of the world market in different technologies, adjusted by an assessment of the

long-term trend of England's changing position in world trade.

Step 3: Intuitive estimation

One part of the information the expert has about a problem is summarized by the distribution of outcomes in the reference class. In addition, the

expert usually has a considerable amount of singular information about the particular case, which distinguishes it from other members of the class. The expert should now be asked to make an intuitive estimate on the basis

of this singular information. As was noted above, this intuitive estimate is likely to be nonregressive. The objective of the next two steps of the

procedure is to correct this bias and obtain a more adequate estimate.

Step 4: Assessment of predictability

The expert should now assess the degree to which the type of information that is available in this case permits accurate prediction of outcomes. In the context of linear prediction, the appropriate measure of predictability is ρ, the product-moment correlation between predictions and outcomes.

Where records of past predictions and outcomes exist, the required value could be estimated from these records. In the absence of such data, one must rely on subjective assessments of predictability. A statistically sophisticated expert may be able to provide a direct estimate of ρ on the basis of his experience. When statistical sophistication is lacking, the analyst should resort to less direct procedures. One such procedure requires the expert to compare the predictability of

the variable with which he is concerned to the predictability of other variables. For example, the expert could be fairly confident that his ability to predict the sales of books exceeds the ability of sportscasters to predict

point spread in football games, but is not as good as the ability of weather forecasters to predict temperature two days ahead of time. A skillful and diligent analyst could construct a rough scale of predictability based on computed correlations between predictions and outcomes for a set of phenomena that range from highly predictable - for example, temperature - to highly unpredictable - for example, stock prices. The analyst would then be in a position to ask the expert to locate the predictability of the target quantity on this scale, thereby providing a numerical estimate of ρ.

An alternative method for assessing predictability involves questions


such as: If you were to consider two novels that you are about to publish,

how often would you be right in predicting which of the two will sell more copies? An estimate of the ordinal correlation between predictions and outcomes can now be obtained as follows: If p is the estimated proportion of pairs in which the order of outcomes was correctly predict-

ed, then r = 2p - 1 provides an index of predictive accuracy, which ranges from zero when predictions are at chance level to unity when predictions are perfectly accurate. In many situations r can be used as a crude approximation for ρ. Estimates of predictability are not easy to make, and they should be examined carefully. The expert could be subject to the hindsight fallacy (Fischhoff, 1975), which leads to an overestimate of the predictability of outcomes. The expert could also be subject to an availability bias (Tversky & Kahneman, 1973, 11) and might recall for the most part surprises, or

memorable cases in which strong initial impressions were later confirmed.

Step 5: Correction of the intuitive estimate

To correct for nonregressiveness, the intuitive estimate should be adjusted toward the average of the reference class. If the intuitive estimate was nonregressive, then under fairly general conditions the distance between the intuitive estimate and the average of the class should be reduced by a factor of ρ, where ρ is the correlation coefficient; that is, the corrected estimate is the class average plus ρ times the difference between the intuitive estimate and that average. This procedure provides an estimate of the quantity, which, one hopes, reduces the nonregressive error. For example, suppose that the expert's intuitive prediction of the sales of a given book is 12,000 and that, on average, books in that category sell 4,000 copies. Suppose further that the expert believes that he would correctly order pairs of manuscripts by their future sales on 80 percent of comparisons. In this case, r = 2(.80) - 1 = 0.6, and the regressed estimate of sales would be 4,000 + 0.6(12,000 - 4,000) = 8,800. The effect of this correction will be substantial when the intuitive estimate is relatively extreme and predictability is moderate or low. The rationale for the computation should be carefully explained to the expert,

who will then decide whether to stand by his original prediction, adopt the computed estimate, or correct his assessment to some intermediate value. The procedure that we have outlined is open to several objections that are likely to arise in the interaction between analyst and expert. First, the

expert could question the assumption that his initial intuitive estimate was nonregressive. Fortunately, this assumption can be verified by asking the expert to estimate (1) the proportion of cases in the reference class - for example, manuscripts - that would have made a stronger impression on him and (2) the proportion of cases in the reference class for which the

outcome exceeds his intuitive prediction — for example, the proportion of


books that sold more than 12,000 copies. If the two proportions are approximately the same, the prediction was surely nonregressive.
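Steps 3 through 5 can be written down compactly. The sketch below uses the worked numbers from the text (a class average of 4,000 copies, an intuitive estimate of 12,000, and 80 percent correct pairwise orderings); the function name and the clamping of r are illustrative choices rather than part of the procedure as stated.

```python
# Minimal sketch of steps 3-5 of the corrective procedure described above.
def regressed_estimate(intuitive, class_average, p_correct_order):
    """Correct a (presumed nonregressive) intuitive estimate.

    p_correct_order is the expert's estimated proportion of pairs whose order
    of outcomes he would predict correctly; r = 2p - 1 is used as a crude
    stand-in for the correlation between predictions and outcomes.
    """
    r = 2 * p_correct_order - 1          # ordinal index of predictive accuracy
    r = max(0.0, min(1.0, r))            # clamp to a usable range (illustrative)
    return class_average + r * (intuitive - class_average)

print(f"{regressed_estimate(12_000, 4_000, 0.80):,.0f}")  # 8,800, as in the text
print(f"{regressed_estimate(12_000, 4_000, 0.50):,.0f}")  # 4,000: chance-level ordering
```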

A more general objection may question the basic idea that predictions should be regressive. The expert could point out, correctly, that the

present procedure will usually yield conservative predictions that are not far from the average of the class and is very unlikely to predict an exceptional outcome that lies beyond all previously observed values. The answer to this objection is that a fallible predictor can retain a chance to correctly predict a few exceptional outcomes only at the cost of erroneously identifying many other cases as exceptional. Nonregressive predictions over-predict: they are associated with a substantial probability that any high prediction is an overestimate and any low prediction is an

underestimate. In most situations, this bias is costly, and should be eliminated. . . .

Concluding remarks

The approach presented here is based on the following general notions about forecasting. First, that most predictions and forecasts contain an irreducible intuitive component. Second, that the intuitive predictions of knowledgeable individuals contain much useful information. Third, that these intuitive judgments are often biased in a predictable manner. Hence,

the problem is not whether to accept intuitive predictions at face value or to reject them, but rather how they can be debiased and improved. The analysis of human judgment shows that many biases of intuition

stem from the tendency to give little weight to certain types of information, for example, the base-rate frequency of outcomes and their predictability. The strategy of debiasing presented in this paper attempts to elicit

from the expert relevant information that he would normally neglect, and to help him integrate this information with his intuitive impressions in a

manner that respects basic principles of statistical prediction. . . .

31. Debiasing

Baruch Fischhoff

Once a behavioral phenomenon has been identified in some experimental context, it is appropriate to start questioning its robustness. A popular and often productive questioning strategy might be called destructive testing, after a kindred technique in engineering. A proposed design is subjected to conditions intended to push it to and beyond its limits of viability. Such controlled destruction can clarify where it is to be trusted and why it

works when it does. Applied to a behavioral phenomenon, this philosophy would promote research attempting to circumscribe the conditions for its observation and the psychological processes that must be evoked or controlled in order to eliminate it. Where the phenomenon is a judgmental bias, destructive testing takes the form of debiasing efforts. Destructive testing shows where a design fails; when a bias fails, the result is improved

judgment. The study of heuristics and biases might itself be seen as the application of destructive testing to the earlier hypothesis that people are competent intuitive statisticians. Casual observation suggests that people's judgment is generally "good enough" to let them make it through life without getting into too much trouble. Early studies (Peterson & Beach, 1967) supported this belief, indicating that, to a first approximation, people might be described as veridical observers and normative judges. Subse-

quent studies, represented in this volume, tested the accuracy of this approximation by looking at the limits of people's apparent successes.

Could better judgment have made them richer or healthier? Can the success they achieved be attributed to a lenient environment, which does not presume particularly knowledgeable behavior? Tragic mistakes provide important insight into the nature and quality of people's decision-making processes; fortunately, they are rare enough that we have too small a data base to disentangle the factors that may have led people astray.

My thanks to Ruth Beyth-Marom, Don MacGregor, and Paul Slovic for their helpful comments on earlier drafts of this paper. This work was supported by the Office of Naval Research under Contract N00014-80-C-0150 to Perceptronics, Inc.

Judgment research has used the destructive-testing strategy to generate biased judgments in moderately well-characterized situations. The theoretician hopes that a pattern of errors and successes will emerge that lends itself to few possible explanations. Thus, the study of biases

clarifies the sources and limits of apparent wisdom, just as the study of debiasing clarifies the sources and limits of apparent folly. Both are

essential to the study of judgment. Although some judgment studies are primarily demonstrations that a

particular bias can occur under some, perhaps contrived, conditions, many other studies have attempted to stack the deck against the observation of bias. Some of these are explicitly debiasing studies, conducted in the hope that procedures that prove effective in the laboratory will also improve

performance in the field. Others had the more theoretical goal of clarifying the contexts that induce suboptimal judgments. The core of this

chapter is a review of studies that can be construed as efforts to reduce two familiar biases, hindsight bias and overconfidence. It considers failures as well as successes in the belief that (a) failure helps clarify the virulence of a problem and the need for corrective or protective measures, and (b) the

overall pattern of studies is the key to discovering the psychological dimensions that are important in characterizing real-life situations and

anticipating the extent of biased performance in them. The review attempts to be exhaustive, subject to the following three selection criteria:

1. Only studies published in sources with peer review are considered. Thus, responsibility for quality control is externalized.
2. Anecdotal evidence is (with a few exceptions) excluded. Although such reports are the primary source of information about some kinds of debiasing attempts (e.g., use of experts), they are subject to interpretive and selection biases that require special attention beyond the scope of this summary (see Chap. 23).
3. Some empirical evidence is offered. Excluded are suggestions that have yet to be tested and theoretical arguments (e.g., about the ecological validity of experiments) that cannot be tested.

Prior to that review, a framework for debiasing efforts will be offered,

characterizing possible approaches and the assumptions underlying them. Such a framework might reveal recurrent patterns when applied to a variety of judgmental biases.

Debiasing methods

When there is a problem, it is natural to look for a culprit. Debiasing

procedures may be most clearly categorized according to their implicit


allegation of culpability. The most important distinction is whether responsibility for biases is laid at the doorstep of the judge, the task, or some mismatch between the two. Do the biases represent artifacts of incompetent experimentation and dubious interpretation, clear-cut cases of judgmental fallibility, or the unfortunate result of judges having, but misapplying, the requisite cognitive skills? As summarized in Table 1, and described below, each of these categories can be broken down further according to what might be called the depth of the problem. How fundamental is the difficulty? Are technical or structural changes needed? Strategies for developing debiasing techniques are quite different for the different causal categories.

Table 1. Debiasing methods according to underlying assumption

Faulty tasks
  Unfair tasks: Raise stakes; Clarify instructions/stimuli; Discourage second-guessing; Use better response modes; Ask fewer questions
  Misunderstood tasks: Demonstrate alternative goal; Demonstrate semantic disagreement; Demonstrate impossibility of task; Demonstrate overlooked distinction

Faulty judges
  Perfectible individuals: Warn of problem; Describe problem; Provide personalized feedback; Train extensively
  Incorrigible individuals: Replace them; Recalibrate their responses; Plan on error

Mismatch between judges and task
  Restructuring: Make knowledge explicit; Search for discrepant information; Decompose problem; Consider alternative situations; Offer alternative formulations
  Education: Rely on substantive experts; Educate from childhood

Faulty tasks

Unfair tasks. Experimentalists have standard questions that they pose to their own and others' work. Studies are published only if they instill


lence, and resilience of such indelible biases. However, because improved

judgment is not the intent of these corrective actions, they will be

considered only cursorily here.

Mismatch between judge and task

Restructuring. Perhaps the most charitable, and psychological, viewpoint is to point no fingers and blame neither judge nor task. Instead, assume that

the question is acceptably posed and that the judge has all requisite skills, but somehow these skills are not being used. In the spirit of human engineering, this approach argues that the proper unit of observation is the person-task system. Success lies in making them as compatible as possible. Just as a mechanically intact airplane needs good instrument design to become flyable, an honest (i.e., not misleading) judgment task

may only become tractable when it has been restructured to a form that allows respondents to use their existing cognitive skills to best advantage.

Although such cognitive engineering tends to be task specific, a number of recurrent strategies emerge: (a) forcing respondents to express what

they know explicitly rather than letting it remain "in the head"; (b) encouraging respondents to search for discrepant evidence, rather than collecting details corroborating a preferred answer; (c) offering ways to decompose an overwhelming problem to more tractable and familiar components; (d) suggesting that respondents consider the set of possible situations that they might have encountered in order to understand better the specific situation at hand; and (e) proposing alternative formulations of the presented problem (e.g., using different terms, concretizing, offer-

ing analogies).

Education. A variant on the people-task "systems" approach is to argue that people can do this task, but not these people. The alternatives are to use: (a) experts who, along with their substantive knowledge, have acquired some special capabilities in processing information under conditions of uncertainty; or (b) a new breed of individual, educated from some early age to think probabilistically. In a sense, this view holds that although people are not, in principle, incorrigible, most of those presently around

are. Education differs from training (a previous category) in its focus on developing general capabilities rather than specific skills.

Hindsight bias: An example of debiasing efforts

A critical aspect of any responsible job is learning from experience. Once we know how something turned out, we try to understand why it happened and to evaluate how well we, or others, planned for it. Although such outcome knowledge is thought to confer the wisdom of


hindsight on our judgments, its advantages may be oversold. In hindsight, people consistently exaggerate what could have been anticipated in foresight. They not only tend to view what has happened as having been inevitable, but also to view it as having appeared "relatively inevitable" before it happened. People believe that others should have been able to

anticipate events much better than was actually the case. They even misremember their own predictions so as to exaggerate in hindsight what

they knew in foresight (Fischhoff, 1975). Although it is flattering to believe that we would have known all along what we could only know in hindsight, that belief hardly affords us a fair appraisal of the extent to which surprises and failures are inevitable. It is both unfair and self-defeating to castigate decision makers who have erred in fallible systems, without admitting to that fallibility and doing something to improve the system. By encouraging us to exaggerate the extent of our knowledge, this bias can make us overconfident in our predictive ability. Perception of a surprise-free past may portend a surpriseful future. Research on this bias has included investigations of most of the possible debiasing strategies included in the previous section. Few of these techniques have successfully reduced the hindsight bias; none has eliminated

it. They are described below and summarized in Table 2.

Faulty tasks

Unfair tasks. In an initial experimental demonstration of hindsight bias (Fischhoff, 1975), subjects read paragraph-long descriptions of a historical event and assessed the probability that they would have assigned to each of its possible outcomes had they not been told what happened. Regard-

less of whether the reported outcome was true or false (i.e., whether it happened in reality), subjects believed that they would have assigned it a higher probability than was assigned by outcome-ignorant subjects. This study is listed among the debiasing attempts, since by concentrating on a few stories it answered the methodological criticism of "asking too many

questions" that might be leveled against subsequent studies. Other studies that asked few questions without eliminating hindsight bias include Slovic and Fischhoff (19727), who had subjects analyze the likelihood of possible outcomes of several scientific experiments; Mitchell and Kalb (in

press), who had nurses evaluate incidents taken from hospital settings; and Pennington, Rutter, Ivlclienna, and Morley (I980), who had women assess their personal probability of receiving a positive result on a single

pregnancy test (although the low power of this study renders its conclusion somewhat tentative). Other attempts to demonstrate an artifactual source of hindsight bias that have been tried and failed include: substituting rating-scale judg-

ments of "surprisingness" for probability assessments (Slovic 8: Fischhoff,

Other attempts to demonstrate an artifactual source of hindsight bias that have been tried and failed include: substituting rating-scale judgments of "surprisingness" for probability assessments (Slovic & Fischhoff, 1977); using more homogeneous items to allow fuller evocation of one set of knowledge, rather than using general-knowledge questions scattered over a variety of content areas, none of which might be thought about very deeply (Fischhoff & Beyth, 1975); imploring subjects to work harder (Fischhoff, 1977b); trying to dispel doubts about the nature of the experiment (G. Wood, 1978); and using contemporary events that judges have considered in foresight prior to making their hindsight assessments (Fischhoff & Beyth, 1975).

Misunderstood tasks. One possible attraction of hindsight bias is that it may be quite flattering to represent oneself as having known all along what was going to happen. One pays a price for such undeserved self-flattery only if (a) one's foresight leads to an action that appears foolish in hindsight or (b) systematic exaggeration of what one knew leads to overconfidence in what one presently knows, possibly causing capricious actions or failure to seek needed information. Since these long-range consequences are not very relevant in the typical experiment, one might worry about subjects being tempted to paint themselves in a favorable light. Although most experiments have been posed as tests of subjects' ability to reconstruct a foresightful state of knowledge, rather than as tests of how extensive that knowledge was, temptations to exaggerate might still remain. If so, they would reflect a discrepancy between subjects' and experimenters' interpretations of the task. One manipulation designed to eliminate this possibility requires subjects first to answer questions and then to remember their own answers, with the acuity of their memory being at issue (Fischhoff, 1977b; Fischhoff & Beyth, 1975; Pennington et al., 1980; G. Wood, 1978). A second manipulation requires hindsight subjects to estimate the foresight responses of their peers, on the assumption that they have no reason to exaggerate what others knew (Fischhoff, 1975; G. Wood, 1978). Neither manipulation has proven successful. Subjects remembered themselves to have been more knowledgeable than was, in fact, the case. They were uncharitable second-guessers in the sense of exaggerating how much others would have (or should have) known in foresight.

Faulty judges

Learning to avoid the biases that arise from being a prisoner of one's present perspective constitutes a, or perhaps the, focus of historians' training (see Chap. 23). There have, however, been no empirical studies of the success of these efforts. The emphasis that historians place on primary sources, with their fossilized records of the perceptions of the past, may reflect a feeling that the human mind is sufficiently incorrigible to require that sort of discipline by document. Although it used a vastly less rigorous procedure, the one experimental training study offers no reason for optimism: Fischhoff (1977b) explicitly described the bias to subjects and asked them to avoid it in their judgments, to no avail.


Mismatch between judges and tasks

Restructuring. Three strategies have been adopted to restructure hindsight tasks, so as to make them more compatible with the cognitive skills and predispositions that judges bring to them. One such strategy separates subjects in time from the report of the event, in hopes of reducing its tendency to dominate their perceptual field (Fischhoff & Beyth, 1975; G. Wood, 1978); this strategy was not effective. With the second strategy, judges assess the likelihood of the reported event's recurring rather than the likelihood of its happening in the first place, in the hope that uncertainty would be more available in the forward-looking perspective (Mitchell & Kalb, in press; Slovic & Fischhoff, 1977); this, too, failed. The final strategy requires subjects to indicate how they could have explained the occurrence of the outcome that did not happen (Slovic