Stanovich Chapter 5 & 6
Intuitive Psychology
If our intuitive (or "folk") theories about objects in motion are inaccurate, it is hard to believe that our folk theories in the more complex domain of human behavior will be exceedingly accurate. Indeed, this research literature serves to warn us that personal experience is no guarantee against incorrect beliefs about human psychology. Psychologist Dan Ariely (2015) tells the story of suffering burns over 70 percent of his body as the result of an accident when he was 18 years old. He describes many months of subsequent treatment in which bandages that were removed quickly caused him great pain. The theory held by the nurses was that a quick removal (which caused a sharp pain) was preferable to a slow removal, which would cause a longer—although less intense—pain. After leaving the hospital and beginning his career as a psychology student, Ariely conducted experiments to test the nurses' belief. To his surprise, Ariely found that the slower procedure—lower pain intensity over a longer period—would have reduced the pain perception in such situations. He said that by the time he had finished, he realized that the nurses in the burn unit were kind and generous individuals with a lot of experience in soaking and removing bandages, but that "despite all their experience, they erred in treating the patients they cared so much about. They still didn't have the right theory about what would minimize their patients' pain. How could they be so wrong, I wondered, considering their vast experience? Perhaps other professionals might also be misunderstanding the consequences of their behaviors and make poor decisions" (Ariely, 2015, p. C3). Research indicates that intuitive judgments of pain intensity in other people are quite bad, even among physicians with much clinical experience (Tait et al., 2009).
As discussed in Chapter 4, reliance on testimonials, case study evidence, and "common practice" can often obscure the need for a control group to check the veracity of a conclusion derived from informal observation. For example, Dingfelder (2006) describes how many medical professionals believe that they should not advise individuals with Tourette syndrome (described in Chapter 2) to suppress their tics (involuntary vocal expressions). The physicians believed that this caused a so-called rebound effect—a higher rate of tics occurring after the suppression. This belief, though, is based on informal observation rather than controlled experimentation. When the proper experimentation was done—observing the number of tics systematically by comparing a period of suppression to a period of nonsuppression—it appeared that there was no "rebound" effect at all following tic suppression. In Chapter 1, we illustrated that a number of commonsense (or folk) beliefs about human behavior are wrong, and this was just a small sample. For example, it turns out that there is no strong evidence indicating that highly religious people are more altruistic than less religious people (Paloutzian & Park, 2005). Studies have indicated that there is no simple relationship between degree of religiosity and the tendency to engage in charitable acts, to aid other people in distress, or to abstain from cheating other people. Incorrect intuitive theories are not limited to psychology. For example, they are rampant in the world of sport and physical fitness. Quantitative analyses have indicated that in football (at all levels, from high school to professional) most coaches increase their probability of winning by going for it on fourth down when their teams are at midfield (Moskowitz & Wertheim, 2011). Similar analyses have shown that, overall, coaches should punt less and onside kick more.
Statistics prove that if coaches reoriented their strategies in these respects, they would win more games (Moskowitz & Wertheim, 2011). Now, coaches might have a variety of reasons for ignoring this statistical advice (fear of being second-guessed, for example), but these reasons do not apply to the fans. Nevertheless, fans have the incorrect intuitive theory that the coaches are right. Incorrect beliefs about human behavior can have very practical consequences. Keith and Beins (2008) mention that among their students, typical views about cell phones and driving are captured by statements such as "Talking doesn't impair my driving" and "I talk on the phone to keep myself from falling asleep." The students seem completely oblivious to the fact that driving while using a cell phone (even a hands-free phone) seriously impairs concentration and attention (Kunar et al., 2008; Richtel, 2014; Strayer et al., 2016; Strayer & Drews, 2007) and is a cause of accidents and deaths (McEvoy et al., 2005; Novotny, 2009; Parker-Pope, 2009; Richtel, 2014). It is just as dangerous as drunk driving. Texting while driving is particularly lethal. The list of popular beliefs that are incorrect is long. For example, many people believe that a full moon affects human behavior. It doesn't (Univ. of California, 2013; Foster & Roenneberg, 2008). Some people believe that "opposites attract." They don't (Youyou et al., 2017). Some people believe that you shouldn't change an answer on a multiple-choice test. They're wrong (Kruger et al., 2005). Some people believe that "familiarity breeds contempt." It doesn't (Claypool et al., 2008; Zebrowitz et al., 2008). Some people believe that people behave like robots under hypnosis. They don't (Lilienfeld, 2014). And the list goes on and on and on (see Lilienfeld et al., 2010).
The many inadequacies in people's intuitive theories of behavior illustrate why we need the controlled experimentation of psychology: so that we can progress beyond our flat-earth conceptions of human behavior to a more accurate scientific conceptualization.
The Case of Clever Hans, the Wonder Horse
The necessity of eliminating alternative explanations of a phenomenon by the use of experimental control is well illustrated by a story that is famous in the annals of behavioral science: that of Clever Hans, the mathematical horse. Over 100 years ago, a German schoolteacher presented to the public a horse, Clever Hans, who supposedly knew how to solve mathematical problems. When Hans was given addition, subtraction, and multiplication problems by his trainer, he would tap out the answer to the problems with his hoof. The horse's responses were astoundingly accurate. Many people were amazed and puzzled by Clever Hans's performance. Was the horse really demonstrating an ability thus far unknown in his species? Imagine what the public must have thought. Compelling testimonials to Hans's unique ability appeared in the German press. A group of "experts" observed Hans and attested to his abilities. Everyone was baffled. And bafflement was bound to remain as long as the phenomenon was merely observed in isolation—without controlled observations being carried out. The mystery was soon dispelled, however, when a psychologist, Oskar Pfungst, undertook systematic studies of the horse's ability (Heinzen et al., 2014). In the best traditions of experimental design, Pfungst systematically manipulated the conditions under which the animal performed, thus creating "artificial" situations (see Chapter 7) that would allow tests of alternative explanations of the horse's performance. After much careful testing, Pfungst found that the horse did have a special ability, but it was not a mathematical one. In fact, the horse was closer to being a behavioral scientist than a mathematician. You see, Hans was a very careful observer of human behavior. As it was tapping out its answer, it would watch the head of the trainer or other questioner. As Hans approached the answer, the trainer would involuntarily tilt his head slightly, and Hans would stop. 
Pfungst found that the horse was extremely sensitive to visual cues. It could detect extremely small head movements. Pfungst tested this hypothesis by having the problems presented by a trainer who was occluded from the horse's view. The animal lost its "mathematical abilities" when the trainer was out of view. (Modern versions of Pfungst's techniques are used to test whether cues from their handlers are affecting drug-sniffing police dogs, Lit et al., 2011.) The case of Clever Hans is a good context in which to illustrate the importance of carefully distinguishing between the description of a phenomenon and the explanation of a phenomenon. That the horse tapped out the correct answers to mathematical problems presented by the trainer is not in dispute. The trainer was not lying. Many observers attested to the fact that the horse actually did tap out the correct answers to mathematical problems presented by the trainer. It is in the next step that the problem arises: making the inference that the horse was tapping out the correct answers because the horse had mathematical abilities. Inferring that the horse had mathematical abilities was a hypothesized explanation of the phenomenon. It did not follow logically—from the fact that the horse tapped out the correct answers to mathematical problems—that the horse had mathematical abilities. Positing that the horse had mathematical abilities was only one of many possible explanations of the horse's performance. It was an explanation that could be put to empirical test. When put to such a test, the explanation was falsified. The Clever Hans case shows the danger of jumping too fast from a behavioral description to a theoretical explanation for the behavioral pattern. A contemporary example concerns the behavioral claim often heard in political discussions that "women earn only 77 cents for every dollar a man makes" (or 79 cents, the number varies). 
There is nothing wrong with the statement per se, because it is simply a descriptive fact. But in political discussions, an unjustified theoretical leap is often made when this descriptive fact is used to press the theoretical argument that the "77 cents" statement is a direct indicator of discrimination: that it means that women are paid less for doing the same work as men. That statement is a theoretical inference, not a descriptive fact. And the theoretical inference is wrong. Women are not paid 23 percent less for doing exactly the same work as men. The 77 cents figure refers to total earnings, not wages in the same job. It is possible for there to be a substantial gap in earnings even when the wages paid to males and females are absolutely equal for doing the same work. Many people do not know that the 77 cents figure is arrived at simply by adding up all the income of full-time (over 30 hours) workers and dividing by the number of workers. It does not take into account the type of occupation, the years of experience in that occupation, the exact number of hours worked, prior education, overtime worked, the job classification, skills involved, and a host of other variables. When all of these variables are statistically controlled (using the multiple regression techniques mentioned in Chapter 5), the pay gap largely disappears (Bertrand et al., 2010; Black et al., 2008; CONSAD, 2009; Kolesnikova & Liu, 2011; O'Neill & O'Neill, 2012; Solberg & Laughlin, 1995). Thus, the descriptive fact of the 77 cents earnings gap is not evidence for the theory that women earn 23 percent less than men for doing exactly the same job. The "77 cents" statement should not be used, in political discussions, as a justification for social policies that assume the existence of large gender discrimination in wages.
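The arithmetic behind this composition effect is easy to demonstrate. In the sketch below (all occupations, wages, and head counts are invented for illustration), men and women earn exactly the same wage within each occupation, yet the aggregate earnings ratio still falls well below 1 because the two groups are distributed differently across occupations.

```python
# Hypothetical workforce: equal pay within each occupation,
# but different gender composition across occupations.
workforce = [
    # (occupation, annual_wage, n_men, n_women) -- invented figures
    ("engineering", 90_000, 700, 300),
    ("teaching",    50_000, 300, 700),
]

men_total = sum(wage * n_men for _, wage, n_men, _ in workforce)
women_total = sum(wage * n_women for _, wage, _, n_women in workforce)

men_avg = men_total / sum(n_men for _, _, n_men, _ in workforce)
women_avg = women_total / sum(n_women for _, _, _, n_women in workforce)

print(f"average male earnings:   {men_avg:,.0f}")    # 78,000
print(f"average female earnings: {women_avg:,.0f}")  # 62,000
print(f"earnings ratio: {women_avg / men_avg:.2f}")  # 0.79
```

Even though every engineer earns 90,000 and every teacher earns 50,000 regardless of gender, the aggregate "cents on the dollar" figure is 0.79. This is exactly why a raw earnings ratio, by itself, cannot support an inference about unequal pay for the same work.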
The error in the case of the pay gap, jumping too fast from descriptive facts to a particular hypothesized explanation, is just like what happened in the Clever Hans case. Before the intervention of Pfungst, the experts who looked at the horse had made this fundamental error: They had not seen that there might be alternative explanations of the horse's performance. They thought that, once they had observed that the trainer was not lying and that the horse actually did tap out the correct answers to mathematical problems, it necessarily followed that the horse had mathematical abilities. Pfungst was thinking more scientifically and realized that that was only one of many possible explanations of the horse's performance, and that it was necessary to set up controlled conditions in order to differentiate alternative explanations. By having the horse answer questions posed by the trainer from behind a screen, Pfungst set up conditions in which he would be able to differentiate two possible explanations: that the horse had mathematical abilities or that the horse was responding to visual cues. If the horse actually had such abilities, putting the trainer behind a screen should make no difference in its performance. On the other hand, if the horse was responding to visual cues, then putting the trainer behind a screen should disrupt its performance. When the latter happened, Pfungst was able to rule out the hypothesis that the horse had mathematical abilities (Heinzen et al., 2014). Note also the link here to the principle of parsimony discussed in Chapter 3—the principle that states that when two theories have the same explanatory power, the simpler theory (the one involving fewer concepts and conceptual relationships) is preferred. The two theories in contention here—that the horse had true mathematical abilities and that the horse was reading behavioral cues—are vastly different in parsimony.
The latter requires no radical adjustments in prior psychological and brain theory. It simply requires us to adjust slightly our view of the potential sensitivity of horses to behavioral cues (which was already known to be high). The former theory—that horses can truly learn arithmetic—requires us to alter dozens of concepts in evolutionary science, cognitive science, comparative psychology, and brain science. It is unparsimonious in the extreme because it does not cohere with the rest of science and, thus, requires that many other concepts in science be altered if it is to be considered true (we shall discuss the so-called principle of connectivity in Chapter 8).
Intuitive Physics
Actually, the three questions posed at the beginning of this chapter were derived from the work that psychologists have done on so-called "intuitive physics," that is, people's beliefs about the motion of objects. Interestingly, these beliefs are often at striking variance from how moving objects actually behave (Bloom & Weisberg, 2007; Riener et al., 2005). For example, in the first problem, once the string on the circling ball is cut, the ball will fly in a straight line at a 90-degree angle to the string (tangent to the circle). McCloskey (1983) found that one-third of the college students who were given this problem thought, incorrectly, that the ball would fly in a curved trajectory. About half of McCloskey's subjects, when given problems similar to the bomber pilot example, thought that the bomb should be dropped directly over the target, thus displaying a lack of understanding of the role of an object's initial motion in determining its trajectory. The bomb should actually be dropped five miles before the plane reaches the target. The subjects' errors were not caused by the imaginary nature of the problem. When subjects were asked to walk across a room and, while moving, drop a golf ball on a target on the floor, the performance of more than half of them indicated that they did not know that the ball would move forward as it fell. Finally, many people are not aware that a bullet fired from a rifle will hit the ground at the same time as a bullet dropped from the same height. You can assess your own performance on this little quiz. Chances are that you missed at least one if you have not had a physics course recently. "Physics course!" you might protest. "Of course I haven't had a physics class recently. This quiz is unfair!" But hold on a second. Why should you need a physics course? You have seen literally hundreds of falling objects in your lifetime. You have seen them fall under naturally occurring conditions. 
Moving objects surround you every day, and you are seeing them in their "real-life" state. You certainly cannot claim that you have not experienced moving and falling objects. Granted, you have never seen anything quite like the bullet example. But most of us have seen children let go of whirling objects, and many of us have seen objects fall out of planes. And besides, it seems a little lame to protest that you have not seen these exact situations. Given your years of experience with moving and falling objects, why can't you accurately predict what will happen in a situation only slightly out of the ordinary? It is critical to understand that the layperson's beliefs are inaccurate precisely because his or her observations are "natural," rather than controlled in the manner of the scientist's. Thus, if you missed a question on the little quiz at the beginning of the chapter, don't feel ignorant or inadequate. Simply remember that some of the world's greatest minds observed falling objects for centuries without formulating a physics of motion any more accurate than that of the modern high school sophomore. Psychological research on intuitive physics demonstrates something of fundamental importance in understanding why scientists behave as they do. Despite extensive experience with moving and falling objects, people's intuitive theories of motion are remarkably inaccurate. Experience provides no inoculation against these intuitive errors. For example, experienced taxi drivers make most of the same speed and journey-time errors that nonprofessional drivers do (Peer & Solomon, 2012).
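The bullet and bomber examples both follow from one fact of elementary mechanics: the vertical and horizontal components of motion are independent, so gravity pulls a fired bullet and a dropped bullet down at exactly the same rate. A short sketch makes this concrete (the heights, speeds, and altitude below are arbitrary illustrative values, not figures from the text):

```python
import math

g = 9.81  # gravitational acceleration, m/s^2

def fall_time(height_m):
    """Time to fall from rest through height_m, ignoring air resistance."""
    return math.sqrt(2 * height_m / g)

# A bullet fired horizontally at 800 m/s and a bullet simply dropped from
# the same 1.5 m height take the same time to reach the ground:
# horizontal velocity never enters the calculation.
t = fall_time(1.5)
print(f"both bullets land after about {t:.3f} s")

# The bomber example: an object released from a moving plane keeps the
# plane's horizontal speed, so it must be released well before the target.
altitude_m = 5000        # assumed altitude
plane_speed_ms = 250     # assumed airspeed
lead_distance = plane_speed_ms * fall_time(altitude_m)
print(f"release the bomb roughly {lead_distance / 1000:.1f} km early")
```

With these assumed numbers the bomb must be released about 8 km before the target; the exact lead distance depends on altitude and airspeed, but it is never zero, which is precisely what the intuitive "drop it directly overhead" answer gets wrong.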
The Importance of Control Groups
All sciences contain examples of mistaken conclusions drawn from studies that fell short of the full controls of the true experiment. Ross and Nisbett (1991) discuss the medical findings on the portacaval shunt, a treatment for cirrhosis of the liver that was popular many years ago. The studies on the treatment were assembled, and an interesting pattern was revealed. In 96.9 percent of the studies that did not contain a control group, the physicians judged the treatment to be at least moderately effective. In the studies in which there was a control group but in which random assignment to conditions was not used (thus falling short of true experimental design), 86.7 percent of the studies were judged to have shown at least moderate effectiveness. However, in the studies in which there was a control group formed by true random assignment, only 25 percent of the studies were judged to have shown at least moderate effectiveness. Thus, the effectiveness of this particular treatment—now known to be ineffective—was vastly overestimated by studies that did not employ complete experimental controls. The positive results found using less controlled procedures were due to placebo effects and/or biases resulting from nonrandom assignment. For example, selection effects (see Chapter 5) may operate to cause spurious positive effects when random assignment is not used. If the patients chosen for a treatment tend to be "good candidates" or tend to be those with vocal and supportive families, there may be differences between them and the control group irrespective of the effectiveness of the treatment. The tendency to see the necessity of acquiring comparative information before coming to a conclusion is apparently not a natural one—which is why training in all the sciences includes methodology courses that stress the importance of constructing control groups. 
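The selection effect described above can be made concrete with a small simulation (all numbers here are invented). The simulated treatment has zero real effect, yet when "good candidates" with a better prognosis are preferentially given it, the treated group improves more than the controls; under random assignment, the illusion disappears.

```python
import random

def run_study(n_patients, randomize):
    """Simulate a treatment with ZERO real effect: each patient recovers
    according to his or her own baseline prognosis, treated or not."""
    improved = {"treated": 0, "control": 0}
    counts = {"treated": 0, "control": 0}
    for _ in range(n_patients):
        baseline = random.uniform(0.2, 0.8)  # patient's recovery probability
        if randomize:
            group = random.choice(["treated", "control"])
        else:
            # Selection effect: better-prognosis patients are more
            # likely to be chosen for the treatment.
            group = "treated" if random.random() < baseline else "control"
        counts[group] += 1
        improved[group] += random.random() < baseline  # treatment adds nothing
    return {g: improved[g] / counts[g] for g in counts}

random.seed(1)
print("nonrandom assignment:", run_study(20_000, randomize=False))
print("random assignment:   ", run_study(20_000, randomize=True))
```

The nonrandomized version shows the treated group improving at a visibly higher rate than the controls even though the treatment does nothing, which is the same pattern the portacaval shunt studies produced when they lacked random assignment.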
The "nonvividness" of the control group—the group treated just like the experimental group except for the absence of a critical factor—makes it difficult to see how essential such a group is. Psychologists have done extensive research on the tendency for people to ignore essential comparative (control group) information. For example, in a much researched paradigm (Stanovich, 2010), subjects are shown a 2 × 2 matrix such as the one shown here that summarizes the data from an experiment.

                  Improvement    No Improvement
    Treatment         200               75
    No Treatment       50               15

The numbers in the table represent the number of people in each cell. Specifically, 200 people received the treatment and showed improvement in the condition being treated, 75 received the treatment and showed no improvement, 50 received no treatment and showed improvement, and 15 received no treatment and showed no improvement. The subjects are asked to indicate the degree of effectiveness of the treatment. Many subjects think that the treatment in question is effective, and a considerable number of subjects think that the treatment has substantial effectiveness. They focus on the large number of cases (200) in the cell indicating people who received treatment and showed improvement. Secondarily, they focus on the fact that more people who received treatment showed improvement (200) than showed no improvement (75). In fact, the particular treatment tested in this experiment is completely ineffective. In order to understand why the treatment is ineffective, it is necessary to concentrate on the two cells that represent the outcome for the control group (the no-treatment group). There we see that 50 of 65 subjects in the control group, or 76.9 percent, improved when they got no treatment. This contrasts with 200 of 275, or 72.7 percent, who improved when they received the treatment.
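The comparison the text walks through can be checked directly. The correct analysis ignores the seductively large 200 and instead compares improvement rates within each row of the table:

```python
# Cell counts from the 2 x 2 treatment matrix in the text.
table = {
    ("treatment", "improved"): 200,
    ("treatment", "not_improved"): 75,
    ("no_treatment", "improved"): 50,
    ("no_treatment", "not_improved"): 15,
}

def improvement_rate(group):
    """Proportion improved within one row of the table."""
    improved = table[(group, "improved")]
    total = improved + table[(group, "not_improved")]
    return improved / total

treated = improvement_rate("treatment")       # 200 / 275
untreated = improvement_rate("no_treatment")  # 50 / 65

print(f"treated:   {treated:.1%}")    # 72.7%
print(f"untreated: {untreated:.1%}")  # 76.9%
# The no-treatment rate is higher, so the treatment is ineffective.
```

Raw cell counts mislead because the two groups differ in size; only the within-group rates carry the comparative information the control group exists to provide.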
Thus, the percentage of improvement is actually larger in the no-treatment group, an indication that this treatment is totally ineffective. The tendency to ignore the outcomes in the no-treatment cells and focus on the large number in the treatment/improvement cell seduces many people into viewing the treatment as effective. In short, it is relatively easy to draw people's attention away from the fact that the outcomes in the control condition are a critical piece of contextual information in interpreting the outcome in the treatment condition. Many fields, not just psychology, have gradually developed an awareness of the necessity of comparative information when evaluating evidence. This is a fairly recent development even in the medical field, for instance (Gawande, 2010; Lewis, 2017). Neurologist Robert Burton (2008) describes well the path that medicine has taken: "being a good doctor requires sticking with the best medical evidence, even if it contradicts your personal experience. We need to distinguish between gut feeling and testable knowledge, between hunches and empirically tested evidence" (p. 161). Business and governments have increasingly turned to controlled experimentation to find out how to optimize their policies, and such studies can lead to surprises. Some years ago, the State of Oregon sought to test the long-held idea that providing uninsured citizens with health insurance would drive down government health costs because insured people would be less likely to walk into emergency rooms for treatment (Sanger-Katz, 2014). The uninsured using emergency rooms is a cause of increased government and hospital costs. In order to test this idea, and to see how large the savings were, the State of Oregon ran a true experiment in which they randomly chose some uninsured people to receive insurance and had an equal-sized group who lost the insurance lottery serve as the control group.
This type of investigation is known as a field experiment—where a variable is manipulated in a nonlaboratory setting. In the case of the Oregon experiment, the findings were surprising. Giving the experimental group insurance did not reduce government costs, and the insured group even turned out to use emergency rooms more than the control group did! However, not all the outcomes were negative: The experimental group was found to have better mental health and quality of life. Another example of a field experiment was a study run to see whether some progress could be made on the problem of high school students being accepted to universities but never making it to campus in the fall (Castleman, 2015; Kirp, 2017). Not surprisingly, many of these are low-income students who are the first in their families to go to college. Researchers conducted a field experiment on 5,000 students the summer before college, in which the experimental group received relevant text messages like "Have you chosen your courses yet?". Of this group, 72 percent ended up enrolled in college that fall, compared with 66.4 percent of the control group. This intervention also proved to be very cost-effective. International aid organizations are likewise turning to studies with manipulated variables (true experiments) to find out "what works" (Banerjee & Duflo, 2009; Duflo & Karlan, 2016). Aid organizations often evaluate themselves, and thus end up claiming that everything they do works, which is implausible. Such an approach of course means that money will be misspent. In order to use aid money efficiently—that is, to save more lives—it is essential to make a judgment about which programs work better than others. Field experiments help with that judgment. It is sometimes hard for the public to understand that field experiments are necessary in order to achieve something else that they want—that tax money be used efficiently, to help the most people.
For example, New York City attempted an experimental test of one of its programs—Homebase—that tries to prevent people from becoming homeless (Buckley, 2010). More people are eligible for this program (a person must be behind on rent and in danger of eviction) than can be served, even though the program includes job training, counseling, and other aid. Thus, the city did the logical thing to test the efficacy of the program: They randomly assigned people to the Homebase program (until the money—$23 million—ran out) and followed up an equal number of people who were not included in the program. This design allowed the city to determine how many people (either just a few, or perhaps many) were saved from homelessness by this expenditure of $23 million. Unfortunately, many citizens and groups in New York did not see it that way. They reacted emotionally to the vivid word "experiment" and objected to this controlled study that would allow the city to spend its money better. They thought the homeless were being treated like guinea pigs or lab rats. What these critics were forgetting was that no one was being denied service by this experiment. The same number of people would receive Homebase whether they were randomly assigned or not. The only difference was that by collecting information from the control group, rather than simply ignoring those who were not in the program, the city would be able to determine whether the program actually works! The confusions about field experiments in the Homebase example are quite common ones. People do not seem to understand that by doing field experiments on the effects of social aid in real environments, we can maximize the number served by finding out what works best. As one international aid expert, Esther Duflo, noted, "it doesn't seem like a hugely innovative view of the world, but most people who are not economists don't get it. They don't get the idea that there are budget constraints" (Parker, 2010, p. 87).
It is easy to detect a little frustration in Duflo's voice as we read this. Duflo is running up against something we will discuss many times in this book—what is obvious to a scientist is often totally missed by a layperson. It seems obvious to Duflo that, with a fixed aid budget, a given program can serve only a certain number of people. Another program that is more efficient would serve more people for the same fixed cost. And the only way to figure out whether a program is more efficient is to run a true experiment. Perhaps a reframing would help people. One of Duflo's colleagues helping to run the experiments on aid in impoverished countries notes that she is often told that "you shouldn't be experimenting on people" and replies "OK, so you have no idea whether your program works—and that's not experimental?" (p. 87). She has the right idea.
Comparison, Control, and Manipulation
Although many large volumes have been written on the subject of scientific methodology, it is simply not necessary for the layperson, who may never actually carry out an experiment, to become familiar with all the details and intricacies of experimental design. The most important characteristics of scientific thinking are actually quite easy to grasp. Scientific thinking is based on the ideas of comparison, control, and manipulation. To achieve a more fundamental understanding of a phenomenon, a scientist compares conditions in the world. Without this comparison, we are left with isolated instances of observations, and the interpretation of these isolated observations is highly ambiguous, as we saw in Chapter 4 in our discussion of testimonials and case studies. By comparing results obtained in different—but controlled—conditions, scientists rule out certain explanations and confirm others. The essential goal of experimental design is to isolate a variable. When a variable is successfully isolated, the outcome of the experiment will eliminate a number of alternative theories that may have been advanced as explanations. Scientists weed out the maximum number of incorrect explanations either by directly controlling the experimental situation or by observing the kinds of naturally occurring situations that allow them to test alternative explanations. The latter situation was illustrated quite well in the cholera example. Snow did not simply pick any two water companies. He was aware that water companies might supply different geographic localities that had vastly different health-related socioeconomic characteristics. Merely observing the frequency of cholera in the various localities would leave many alternative explanations of any observed differences in cholera incidence. 
Highly cognizant that science advances by eliminating possible explanations (recall our discussion of falsifiability in Chapter 2), Snow looked for and found a comparison that would eliminate a large class of explanations based on health-related correlates of SES. Snow was fortunate to find a naturally occurring situation that allowed him to eliminate alternative explanations. But it would be absurd for scientists to sit around waiting for circumstances like Snow's to occur. Instead, most scientists try to restructure the world in ways that will differentiate alternative hypotheses. To do this, they must manipulate the variable believed to be the cause (contamination of the water supply, in Snow's case) and observe whether a differential effect (cholera incidence) occurs while they keep all other relevant variables constant. The variable manipulated is called the independent variable and the variable upon which the independent variable is posited to have an effect is called the dependent variable. Thus, the best experimental design is achieved when the scientist can manipulate the variable of interest and control all the other extraneous variables affecting the situation. Note that Snow did not do this. He was not able to manipulate the degree of water contamination himself but instead found a situation in which the contamination varied and in which other variables, mainly those having to do with SES, were—by lucky chance—controlled. However, this type of naturally occurring situation is not only less common but also less powerful than direct experimental manipulation. Joseph Goldberger did directly manipulate the variables he hypothesized to be the causes of the particular phenomenon he was studying (pellagra). Although Goldberger observed and recorded variables that were correlated with pellagra, he also directly manipulated two other variables in his series of studies. 
Recall that he induced pellagra in a group of prisoners given a low-protein diet and also failed to induce it in a group of volunteers, including himself and his wife, who ingested the excrement of pellagra victims. Thus, Goldberger went beyond observing naturally occurring correlations and created a special set of circumstances designed to yield data that would allow a stronger inference by ruling out a wider set of alternative explanations than Snow's did. This is precisely the reason why scientists attempt to manipulate a variable and to hold all other variables constant: in order to eliminate alternative explanations.
Why Goldberger's Evidence Was Better
Goldberger had a type of evidence (a controlled manipulation, discussed further in the next chapter) that is derived when the investigator, instead of simply observing correlations, actually manipulates the critical variable. This approach often involves setting up special conditions that rarely occur naturally—and to call Goldberger's special conditions unnatural is an understatement! Confident that pellagra was not contagious and not transmitted by the bodily fluids of the victims, Goldberger had himself injected with the blood of a victim. He inserted throat and nose secretions from a victim into his own mouth. According to Bronfenbrenner and Mahoney (1975), two researchers describing Goldberger's efforts, he and his assistants even ate dough balls that contained the urine and feces of pellagra victims! Despite all of these extreme interventions, neither Goldberger nor the other volunteers came down with pellagra. In short, Goldberger had created the conditions necessary for the infectious transmission of the disease, and nothing had happened. Goldberger had now manipulated the causal mechanism suggested by others and had shown that it was ineffective, but it was still necessary to test his own causal mechanism. Goldberger got two groups of prisoners from a state prison farm who were free of pellagra to volunteer for his experiment. One group was given the high-carbohydrate, low-protein diet that he suspected was the cause of pellagra, while the other group received a more balanced diet. Within five months, the low-protein group was ravaged by pellagra, while the other group showed no signs of the disease. After a long struggle, Goldberger's hypothesis was eventually accepted because it matched the empirical evidence better than any other. The history of pellagra illustrates the human cost of basing social and economic policy on mistaken inferences from correlational studies. This is not to say that we should never use correlational evidence. Quite the contrary. 
In many instances, it is all we have to work with (see Chapter 8), and in some cases, it is all we need (for instance, when prediction, rather than determination of cause, is the goal). Scientists often have to use incomplete knowledge to solve problems. The important thing is that we approach correlational evidence with a certain skepticism. Examples such as the pellagra-sewage case occur with considerable frequency in all areas of psychology. The example illustrates what is termed the third-variable problem: the fact that the correlation between the two variables—in this case, pellagra incidence and sewage conditions—may not indicate a direct causal path between them but may arise because both variables are related to a third variable that has not even been measured. Pellagra incidence is related to SES (and to diet—the real causal variable) and SES is also related to sewerage quality. Correlations like that between sewage and pellagra are often called spurious correlations: correlations that arise not because a causal link exists between the two variables that are measured, but because both variables are related to a third variable (or just show a chance relationship, see Vigen, 2015). It is sometimes easy to fall into the trap of ignoring possible third variables. When we see a study showing correlations between parenting behaviors and their children's psychological characteristics, it is tempting to automatically think that the parenting behaviors determined (that is, caused) the children's psychological characteristics. But this automatic tendency is wrong because it ignores the genetic connection between the parents and their children—a third variable that may be responsible for the parent-child correlations (McAdams et al., 2014; Plomin et al., 2016). Let's consider a contemporary example of a third variable problem. For decades, debates have raged over the relative efficacy of public and private schools. 
Some of the conclusions drawn in this debate vividly demonstrate the perils of inferring causation from correlational evidence. The question of the efficacy of private versus public schools is an empirical problem that can be attacked with the investigative methods of the social sciences. This is not to imply that it is an easy problem, only that it is a scientific problem, and potentially solvable. All advocates of the superiority of private schools implicitly recognize this, because at the crux of their arguments is an empirical fact: Student achievement in private schools exceeds that in public schools. This fact is not in dispute—educational statistics are plentiful and largely consistent across various studies. The problem is the use of these achievement data to conclude that the education received in private schools causes the superior test scores. The outcome of educational testing is a function of many different variables, all of which are correlated. In order to evaluate the relative efficacy of public schools and private schools, we need more complex statistics than merely the relationship between the type of school attended and school achievement. For example, educational achievement is related to many different indicators of family background, such as parental education, number of parents in the home, SES, the number of books in the home, and other factors. These characteristics are also related to the probability of sending a child to a private school. Thus, family background is a potential third variable that may affect the relationship between academic achievement and the type of school. In short, the relationship may have nothing to do with the effectiveness of private schools but may be the result of the fact that economically advantaged children do better academically and are more likely to attend private schools. 
Fortunately, there exist complex correlational statistics such as multiple regression, partial correlation, and path analysis that were designed to deal with problems such as this one (Wheelan, 2013). These statistics allow the correlation between two variables to be recalculated after the influence of other variables is removed, or "factored out" or "partialed out." Using these more complex correlational techniques, researchers have analyzed a large set of educational statistics on high school students (Carnoy et al., 2005). They found that, after variables reflecting the students' home backgrounds and general mental ability were factored out, there was little relationship between school achievement and the type of school attended. Academic achievement is linked to private school attendance not primarily because of any direct causal mechanism, but because the family background and the general cognitive level of students in private schools are different from those of children in public schools. It requires similar statistical techniques (regression, partial correlation) to untangle the relationship between happiness and longevity. There is indeed a positive correlation: happier people live longer. But from this correlation alone, we cannot conclude that happiness is the cause of greater longevity. In fact, some studies have shown (by using the statistical regression technique) that the correlation between happiness and longevity disappears when levels of health are controlled (Liu et al., 2016; Vyse, 2016a). The complex correlational statistics that allow us to partial out the effects of a third variable do not always reduce the magnitude of the original correlation. Sometimes the original correlation between two variables remains even after the partialing out of the third variable, and this result itself can be informative. Such an outcome indicates that the original correlation was not due to a spurious relationship with that particular third variable. 
Of course, it does not remove the possibility of a spurious relationship due to some other variable. For example, it turns out that the violent crime rate in the United States is higher in the southern states than in the northern states. Anderson and Anderson (1996) tested what has been called the heat hypothesis—that "uncomfortably warm temperatures produce increases in aggressive motives and (sometimes) aggressive behavior" (p. 740). Not surprisingly, they did find a correlation between the average temperature in a city and its violent crime rate. What gives the heat hypothesis more credence, however, is that they found that the correlation between temperature and violent crime remained significant even after variables such as unemployment rate, per capita income, poverty rate, education, population size, median age of population, and several other variables were statistically controlled (see also, Larrick et al., 2011; Plante & Anderson, 2017).
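The logic of "partialing out" a third variable can be sketched numerically. The toy simulation below is illustrative only: the variable names and effect sizes are invented, not taken from Carnoy et al. (2005). It generates data in which family background drives both private-school attendance and achievement (with no direct school effect at all), then computes a partial correlation by correlating the residuals left over after regressing each variable on the background measure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical third variable: a family-background composite (SES, parental
# education, etc.), in arbitrary standardized units
background = rng.normal(size=n)

# Both outcomes depend on background; school type has NO direct effect here
private = background + rng.normal(size=n)            # propensity to attend private school
achievement = 2.0 * background + rng.normal(size=n)  # achievement test score

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def residuals(y, x):
    """Residuals of y after a simple linear regression on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw = corr(private, achievement)
partial = corr(residuals(private, background),
               residuals(achievement, background))

print(f"raw correlation:     {raw:.2f}")      # sizable and positive
print(f"partial correlation: {partial:.2f}")  # near zero once background is removed
```

Because the simulated school variable has no direct causal path to achievement, the sizable raw correlation shrinks to roughly zero once the shared background variable is statistically removed, which is exactly the pattern the text describes for the real school data.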
Snow and Cholera
In his studies of pellagra, Joseph Goldberger was partially guided by his hunch that the disease was not contagious. But 70 years earlier, John Snow, in his search for the causes of cholera, bet the opposite way and also won (Johnson, 2007; Shapin, 2006). Many competing theories had been put forth to explain the repeated outbreaks of cholera in London in the 1850s. Many doctors believed that the exhalations of victims were inhaled by others who then contracted the disease. This was called the miasmal theory. By contrast, Snow hypothesized that the disease was spread by the water supply, which had become contaminated with the excrement of victims. Snow set out to test his theory. Fortunately, there were many different sources of water supply in London, each serving different districts, so the incidence of cholera could be matched with the different water supplies, which varied in degree of contamination. Snow realized, however, that such a comparison would be subject to severe selection biases (recall the discussion in Chapter 5). The districts of London varied greatly in wealth, so any correlation between water supply and geography could just as easily be due to any of the many other economically related variables that affect health, such as diet, stress, job hazards, and quality of clothing and housing. In short, the possibility of obtaining a spurious correlation was nearly as high as in the case of the pellagra-sewage example discussed in Chapter 5. However, Snow was astute enough to notice and to exploit one particular situation that had occurred. In one part of London, there happened to be two water companies that supplied a single neighborhood unsystematically. That is, on a particular street, a few houses were supplied by one company, then a few by the other, because in earlier days the two companies had been in competition. There were even cases in which a house had water from a company different from the one supplying the houses on either side of it. 
Thus, Snow had uncovered a case in which the SES of the people supplied by two water companies was virtually identical, or at least as close as it could be in a naturally occurring situation like this. Such a circumstance would still not have been of any benefit if the water from the two companies had been equally contaminated because Snow would have had no difference to associate with cholera incidence. Fortunately, this was not the case. After the previous London cholera epidemic, one company, the Lambeth Company, had moved upstream on the Thames to escape the London sewage. The Southwark and Vauxhall Company, however, had stayed downstream. Thus, the probability was that the water of the Lambeth Company was much less contaminated than the water of the Southwark and Vauxhall Company. Snow confirmed this by chemical testing. All that remained was to calculate the cholera death rates for the houses supplied by the two water companies. The rate for the Lambeth Company was 37 deaths per 10,000 houses, compared with a rate of 315 per 10,000 houses for the Southwark and Vauxhall Company. In this chapter, we will discuss how the Snow and Goldberger stories both illustrate the logic of scientific thinking. Without an understanding of this logic, the things scientists do may seem mysterious, odd, or downright ridiculous.
The Toaster-Contraceptive Study
Many years ago, a large-scale study of the factors related to the use of contraceptive devices was conducted in Taiwan. A research team of social scientists collected data on a wide range of behavioral and environmental variables. The researchers were interested in seeing what variables best predicted the adoption of birth control methods. After collecting the data, they found that the one variable most strongly related to contraceptive use was the number of electrical appliances (toasters, fans, etc.) in the home (Li, 1975). This result probably does not tempt you to propose that the teenage pregnancy problem should be dealt with by passing out free toasters in high schools. But why aren't you tempted to think so? The correlation between appliances and contraceptive use was indeed strong, and this variable was the single best predictor among the many variables that were measured. Your reply, I hope, will be that it is not the strength but the nature of the relationship that is relevant. Starting a free toaster program would imply the belief that toasters cause people to use contraceptives. The fact that we view this suggestion as absurd means that, at least in clear-cut cases such as this, we recognize that two variables may be associated without having a causal relationship. In this example, we can guess that the relationship exists because contraceptive use and the number of electrical appliances in the home are linked through some other variable that relates to both. Socioeconomic status (SES) would be one likely candidate for a mediating variable. We know that SES is related to contraceptive use. All we need now is the fact that families at higher socioeconomic levels tend to have more electrical appliances in their homes, and we have the linkage. Of course, other variables may mediate this correlation. 
However, the point is that, no matter how strong the correlation is between the number of toasters and contraceptive use, the relationship does not indicate a causal connection. The contraceptive example makes it very easy to understand the fundamental principle of this chapter: The presence of a correlation does not necessarily imply causation. In this chapter, we will discuss two problems that prevent the drawing of a causal inference: the third-variable problem and the directionality problem. The toaster-contraceptive study is an example of the third-variable problem.
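The toaster-contraceptive pattern is easy to reproduce with made-up numbers. In the sketch below, every figure is invented for illustration: a household's SES drives both its number of appliances and its propensity to use contraception, while neither variable has any direct effect on the other, yet the two end up strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Third variable: household socioeconomic status (arbitrary standardized units)
ses = rng.normal(size=n)

# Both variables depend on SES; neither directly affects the other
appliances = 2.0 * ses + rng.normal(size=n)          # appliance count (continuous proxy)
contraceptive_use = 1.5 * ses + rng.normal(size=n)   # propensity to use contraception

r = np.corrcoef(appliances, contraceptive_use)[0, 1]
print(f"correlation(appliances, contraceptive use) = {r:.2f}")  # strongly positive
```

Handing out free toasters in this simulated world would change nothing, because the appliance count is a symptom of SES, not a cause of anything.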
The Third-Variable Problem
Sometimes the third variable causing a misleading linkage between two other variables is pretty easy to see. Suppose I told you that, across the 365 days of the year, at all of the beach resorts in America, there is a correlation between the number of ice cream cones sold and the number of drownings: the more ice cream sold, the more drownings there were. Here, it is easy to see that the association comes about not because eating ice cream makes people drown when they go in the water. Instead, a third variable, the temperature, links these two variables. On hot days, when there are a lot of people eating ice cream, there are also a lot of people swimming, and the more people there are out swimming, the more there are who will drown. The limitations of correlational evidence are not always so easy to recognize as in the ice cream and toaster examples. When the causal link seems obvious to us, when we have a strong preexisting bias, or when our interpretations become dominated by our theoretical orientation, it is tempting to treat correlations as evidence of causation. In the early 1900s, thousands of Americans in the South suffered and died of a disease called pellagra. Characterized by dizziness, lethargy, running sores, vomiting, and severe diarrhea, the disease was thought to be infectious and to be caused by a living microorganism of "unknown origin." It is not surprising, then, that many physicians of the National Association for the Study of Pellagra were impressed by evidence that the disease was linked to sanitary conditions. It seemed that homes in Spartanburg, South Carolina, that were free of pellagra invariably had inside plumbing and good sewerage. By contrast, the homes of pellagra victims often had inferior sewerage. This correlation coincided quite well with the idea of an infectious disease transmitted, because of poor sanitary conditions, via the excrement of pellagra victims. 
One physician who doubted this interpretation was Joseph Goldberger, who, at the direction of the surgeon general of the United States, had conducted several investigations of pellagra. Goldberger thought that pellagra was caused by inadequate diet. Many victims had lived on high-carbohydrate, extremely low-protein diets, characterized by small amounts of meat, eggs, and milk and large amounts of corn, grits, and mush. Goldberger thought that the correlation between sewage conditions and pellagra did not reflect a causal relationship in either direction (much as in the toaster-birth control example). Goldberger thought that the correlation arose because families with sanitary plumbing were likely to be economically advantaged. This economic discrepancy would also be reflected in their diets, which would contain more animal protein. But wait a minute! Why should Goldberger get away with his causal inference? After all, both sides were just sitting there with their correlations, Goldberger with pellagra and diet and the other physicians with pellagra and sanitation. Why shouldn't the association's physicians be able to say that Goldberger's correlation was equally misleading? Why was he justified in rejecting the hypothesis that an infectious organism was transmitted through the excrement of pellagra victims because of inadequate sewage disposal? Well, the reason Goldberger was justified has to do with one small detail that I neglected to mention: Goldberger had eaten the excrement of pellagra victims.
Clever Hans in the 1990s and in the Present Day
The Clever Hans story is a historical example that has been used in methodology classes for many years to teach the important principle of the necessity of experimental control. No one ever thought that an actual Clever Hans case could happen again—but it did. Throughout the early 1990s, researchers the world over watched in horrified anticipation—almost as if observing cars crash in slow motion—while a modern Clever Hans case unfolded before their eyes and had tragic consequences. Autism is a developmental disability characterized by impairment in reciprocal social interaction, delayed and often qualitatively abnormal language development, and a restricted repertoire of activities and interests (Baron-Cohen, 2008). The extremely noncommunicative nature of many autistic children, who may be normal in physical appearance, makes the disorder a particularly difficult one for parents to accept. It is, therefore, not hard to imagine the excitement of parents of autistic children when, in the late 1980s and early 1990s, they heard of a technique coming out of Australia that enabled autistic children who had previously been totally nonverbal to communicate. This technique for unlocking communicative capacity in nonverbal autistic individuals was called facilitated communication, and it was uncritically trumpeted in the most popular media outlets of that era (Hagen, 2012; Heinzen et al., 2014; Offit, 2008). The claim was made that autistic individuals and other children with developmental disabilities who had previously been nonverbal had typed highly literate messages on a keyboard when their hands and arms had been supported over the typewriter by a sympathetic "facilitator." Not surprisingly, these startlingly verbal performances on the part of autistic children who had previously shown very limited linguistic behavior spawned incredible hopes among frustrated parents of autistic children. 
It was claimed that the technique also worked for nonverbal individuals with severe intellectual disability. Although the excitement of the parents is easy to understand, the gullibility of many professionals is not so easy to accept. Unfortunately, claims for the efficacy of facilitated communication were disseminated to hopeful parents by many media outlets before any controlled studies had been conducted. Had the professionals involved had even minimal training in the principles of experimental control, they would have immediately recognized the parallel to the Clever Hans case. The facilitator, almost always a sympathetic individual who was genuinely concerned that the child succeed, had numerous opportunities to consciously or unconsciously direct the child's hand to the vicinity of keys on the keyboard. That cuing by the facilitator was occurring should also have been suggested by the additional observation that the children sometimes typed out complicated messages while not even looking at the keyboard. Additionally, highly literate poetic English prose was produced by children who had not been exposed to the alphabet. For example, one child allegedly typed "Am I a slave or am I free? Am I trapped or can I be seen as an easy and rational spirit?" (Offit, 2008, p. 7). Another won an international writing competition (Hagen, 2012). A number of controlled studies have now tested the claims of facilitated communication by using appropriate experimental controls. Each study has unequivocally demonstrated the same thing: The autistic child's performance depended on tactile cuing by the facilitator (Hagen, 2012; Heinzen et al., 2014; Offit, 2008). The controls used in several of the studies resembled those of the classic Clever Hans case. A controlled situation was set up in which both the child and the facilitator were presented with a drawing of an object but in which they could not see each other's drawing. 
When both child and facilitator were looking at the same drawing, the child typed the correct name of the drawing. However, when the child and the facilitator were shown different drawings, the child typed the name of the facilitator's drawing, not the one at which the child was looking. Thus, the responses were determined by the facilitator rather than the child. The conclusion that facilitated communication was a Clever Hans phenomenon and not a breakthrough therapeutic technique brought no joy to the investigators involved in conducting the studies. But this sad story gets even worse. At some centers, during facilitated sessions on the keyboard, clients reported having been sexually abused by a parent in the past (Offit, 2008). Children were removed from their parents' homes, only to be returned when the charges of abuse proved to be groundless. As a result of the controlled studies, competent professional opinion finally began to be heard above the media din. Importantly, it is increasingly recognized that treatments that lack empirical foundation are not benignly neutral ("Oh, well, it might work, and so what if it doesn't?"). The implementation of unproven treatments has real costs. Also, with facilitated communication, we have another example of the harm done by reliance on testimonial evidence and the fallacy of the idea that therapeutic fads and pseudoscience do no harm (see Chapter 4). We can also see that there is simply no substitute for the control and manipulation of the experimental method when we want to explain behavior. However, decades after facilitated communication was debunked as a bogus technique, it still reappears in schools and in popular culture. Writing about the technique making a comeback in schools, writer Kendrick Frazier (2015) called it "returned from the dead" (sometimes under new names like "supported typing"). 
In 2011, a few unfortunate parents were still being accused of sexual abuse by their children when the children were subjected to facilitated communication sessions (Hagen, 2012). In 2015, a Rutgers University professor thought that a client whom she was "facilitating" had actually consented to sexual activity. At her later trial, her defense was that she thought that he had no intellectual impairment at all because of what he typed (with her assistance) during the sessions (Radford, 2016a). CNN, MSNBC, and the BBC ran stories on such cases entirely without skepticism long after the technique was exposed as a case of the Clever Hans phenomenon (Hagen, 2012). A film touting the technique played in over 100 cities in 2011, almost 20 years after it was first debunked (Hagen, 2012). It is indeed the bogus remedy that just will not die. On World Autism Day, April 2, 2016, Apple got in on the act by introducing a video of an autistic child claimed to be writing on an iPad with the aid of facilitated communication—called "rapid prompting" in this case (Shermer, 2016; Vyse, 2016b). Note also the link to the principle of parsimony. That the severe linguistic difficulties of autistic children could be solved by a single "magic bullet" (see Chapter 9) intervention flies in the face of decades of work on the cognitive, neuropsychological, and brain characteristics of autistic children (Baron-Cohen, 2005; Oberman & Ramachandran, 2007; Rajendran & Mitchell, 2007; Wellman et al., 2011). It would require that too much else that we know about cognition and neurology be overturned. The existence of facilitated communication would show no connectivity with the rest of science (see Chapter 8). Finally, the example of facilitated communication illustrates something discussed previously in the Clever Hans case: the importance of carefully distinguishing between the description of a phenomenon and the explanation of a phenomenon. 
The term "facilitated communication" is not a neutral description of what occurred between facilitator and child. Instead, it posits a theoretical outcome—that communication actually occurred and had been truly enhanced by the facilitator. But that is the very thing that had to be proved! What we had here was a child tapping keys. Perhaps things would have proceeded more rationally had it originally been labeled "surprising tapping." What needed to be determined was whether the "surprising tapping" was true communication. The premature labeling of the phenomenon (key tapping) with a theory (that it represented true communication) likely made it more difficult for these practitioners to realize that further investigation was necessary to see if this theoretical label was warranted. Other fields—not just psychology—struggle with the problem of prematurely labeling a phenomenon with a theory. The legal system still uses the term "shaken baby syndrome" even though the American Academy of Pediatrics has recommended that the term be discarded. The problem is exactly like the Clever Hans and facilitated communication examples we have been discussing. The term "shaken baby syndrome" is a theory of why a particular child has presented with head trauma. The phenomenon is the nature of the head trauma itself. The precise reason for the head trauma is what has to be explained by whatever theory we have of how the trauma occurred. The legal system is still working through the implications of abandoning terminology that was once standard but that we now know to be misleading (Tuerkheimer, 2010). Traffic safety engineers likewise feel that the term traffic "accident" carries too much theory with it (Richtel, 2016). The word accident implies randomness and unpredictability and luck—pure happenstance. Safety engineers know all too well that automobile crash risk has strong statistical relationships to many behaviors, none of which are random or happenstance. 
The engineers have in mind cases like St. Louis Cardinals pitcher Josh Hancock who slammed his rented SUV into a truck stopped on the highway with lights flashing (Vanderbilt, 2008). Calling the crash random and unpredictable (an "accident") seems not at all right when we consider that Hancock was speeding (a strong risk factor), had an alcohol concentration twice the legal limit (a strong risk factor), and was on a cell phone at the time of the crash (a strong risk factor). Oh, and he had crashed another SUV just two days before (Vanderbilt, 2008). Terming this an "accident" conveys a theory of randomness and unpredictability that does not seem right when the chosen behaviors were so wantonly reckless as in this case. The description of what happened is—a crash. As a theory, accident seems not quite right.
Prying Variables Apart: Special Conditions
The Goldberger pellagra example illustrates a very important lesson that can greatly aid in dispelling some misconceptions about the scientific process, particularly as it is applied in psychology. The occurrence of any event in the world is often correlated with many other factors. In order to separate, to pry apart, the causal influence of many simultaneously occurring events, we must create situations that will never occur in the ordinary world. Scientific experimentation breaks apart the natural correlations in the world to isolate the influence of a single variable. Psychologists operate in exactly the same manner: by isolating variables via manipulation and control. For example, cognitive psychologists interested in the reading process have studied the factors that make word perception easier or more difficult. Not surprisingly, they have found that longer words are more difficult to recognize than shorter words. At first glance, we might think that the effect of word length would be easy to measure: Simply create two sets of words, one long and one short, and measure the difference in reader recognition speed between the two. Unfortunately, it is not that easy. Long words also tend to be less frequent in language, and frequency itself also affects perception. Thus, any difference between long and short words may be due to length, frequency, or a combination of these two effects. In order to see whether word length affects perception independently of frequency, researchers must construct special word sets in which length and frequency do not vary together. Similarly, Goldberger was able to make a strong inference about causation because he set up a special set of conditions that does not occur naturally. (Considering that one manipulation involved the ingestion of bodily discharges, this is putting it mildly!) 
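The word-recognition confound described above can be made concrete with a toy simulation. All of the parameters below are invented for illustration: recognition time is assumed to depend on both word length and word frequency, and because long words tend to be rare, a naive long-versus-short comparison mixes the two effects together, while comparing word sets matched on frequency isolates the length effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# In natural text, length and frequency are negatively related
length = rng.integers(3, 12, n)                      # word length in letters
log_freq = 8 - 0.5 * length + rng.normal(0, 1, n)    # words get rarer as length grows

# Hypothetical model: BOTH length and frequency affect recognition time (ms)
rt = 400 + 10 * length - 15 * log_freq + rng.normal(0, 20, n)

short, long_ = length <= 5, length >= 9

# Naive comparison: confounds the length effect with the frequency effect
naive_diff = rt[long_].mean() - rt[short].mean()

# Matched comparison: restrict both sets to a narrow frequency band
band = (log_freq > 3.5) & (log_freq < 4.5)
matched_diff = rt[long_ & band].mean() - rt[short & band].mean()

print(f"naive long-short RT difference:   {naive_diff:.0f} ms")
print(f"matched long-short RT difference: {matched_diff:.0f} ms")
```

The naive difference comes out much larger than the matched one, because in the unmatched sets the long words are also the rare words; the matched sets break apart the natural covariation, just as the text describes.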
Recall that Oskar Pfungst had to set up some special conditions for testing Clever Hans, including trials in which the questioner did not know the answer. Dozens of people who merely observed the horse answer questions under normal conditions (in which the questioner knew the answer) never detected how the horse was accomplishing its feat. Instead, they came to the erroneous conclusion that the horse had true mathematical knowledge. Likewise, note the unusual conditions that were necessary to test the claims of facilitated communication. The stimuli presented to the facilitator and the child had to be separated in such a way that neither could see the stimulus presented to the other. Such unusual conditions are necessary in order to test the alternative hypotheses for the phenomenon. Many classic experiments in psychology involve this logic of prying apart the natural relationships that exist in the world so that it can be determined which variable is the dominant cause. Psychologist Harry Harlow's famous experiments (Harlow & Suomi, 1970; Tavris, 2014) provide a case in point. Harlow wanted to test a prevailing hypothesis about infant-mother attachment: that attachment resulted from the mother providing the infant's source of food. However, the problem was that, of course, mothers provide much more than nourishment (comfort, warmth, caressing, stimulation, etc.). Harlow examined the behavior of infant macaque monkeys in situations in which he isolated only one of the variables associated with attachment by giving the animals choices among "artificial" mothers. For example, he found that the contact comfort provided by a "mother" made of terrycloth was preferred to that provided by a "mother" made of wire mesh. After two weeks of age, the infant preferred a cold terrycloth mother to a warm wire one, a finding indicating that the contact comfort was more attractive than warmth. 
Finally, Harlow found that the infants preferred the terrycloth mother even when their nourishment came exclusively from a wire mother. Thus, the hypothesis that attachment was due solely to the nourishment provided by mothers was falsified. This was possible only because Harlow was able to pry apart variables that naturally covary in the real world. Creating special conditions to test for actual causal relationships is a key tool we can use to prevent pseudoscientific beliefs from attacking us like a virus (Stanovich, 2004, 2009, 2011). Consider the case of therapeutic touch (TT)—a fad that swept the North American nursing profession in the 1990s. TT practitioners massage not the patient's body but instead the patient's so-called energy field. That is, they move their hands over the patient's body but do not actually massage it. Practitioners reported "feeling" these energy fields. Well, you guessed it. This ability to feel "energy fields" is tested properly by creating exactly the type of special conditions as in the Clever Hans and facilitated communication claims—that is, testing whether practitioners, when visually blinded, could still feel whether their hands were in proximity to a human body. Research has demonstrated the same thing as in the Clever Hans and facilitated communication cases—when vision is occluded, this ability to feel at a distance is no greater than chance (Hines, 2003; Shermer, 2005). This example actually illustrates something that was mentioned in an earlier chapter—that the logic of the true experiment is really so straightforward that a child could understand it. This is because one of the published experiments showing that TT is ineffective was done as a school science project (Dacey, 2008). In short, it is often necessary for scientists to create special conditions that will test a particular theory about a phenomenon. Merely observing the event in its natural state is rarely sufficient. 
People observed falling and moving objects for centuries without arriving at accurate principles and laws about motion and gravity. Truly explanatory laws of motion were not derived until Galileo and other scientists set up some rather artificial conditions for the observation of the behavior of moving objects. In Galileo's time, smooth bronze balls were rarely seen rolling down smooth inclined planes. Lots of motion occurred in the world, but it was rarely of this type. However, it was just such an unnatural situation, and others like it, that led to our first truly explanatory laws of motion and gravity. Speaking of laws of motion, didn't you take a little quiz at the beginning of this chapter?
Summary
The central point of this chapter was to convey that the mere existence of a relationship between two variables does not guarantee that changes in one are causing changes in the other. Correlation does not imply causation. Two problems in interpreting correlational relationships were discussed. In the third-variable problem, the correlation between the two variables may not indicate a direct causal path between them but instead may arise because both variables are related to a third variable that has not even been measured. If, in fact, the potential third variable has been measured, correlational statistics such as partial correlation (to be discussed again in Chapter 8) can be used to assess whether that third variable is determining the relationship. The other thing that makes the interpretation of correlations difficult is the existence of the directionality problem: the fact that even if two variables are causally related, the direction of that relationship is not indicated by the mere presence of the correlation. Selection bias is the reason for many spurious relationships in the behavioral sciences: the fact that people choose their own environments to some extent and thus create correlations between behavioral characteristics and environmental variables. As will be illustrated extensively in the next two chapters, the only way to ensure that selection bias is not operating is to conduct a true experiment in which the investigator manipulates the critical variable and randomly assigns subjects to conditions.
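The partial-correlation idea mentioned above can be sketched with simulated data. The formula used is the standard first-order partial correlation; the data-generating model (a third variable z driving both x and y, with no direct link between them) is invented for illustration:

```python
import math
import random
random.seed(2)

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va  = sum((x - ma) ** 2 for x in a)
    vb  = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Hypothetical data: a third variable z drives both x and y;
# x and y have no direct causal connection at all.
z = [random.gauss(0, 1) for _ in range(5000)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)

# Partial correlation of x and y with z held constant:
r_xy_given_z = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(round(r_xy, 2), round(r_xy_given_z, 2))
```

The raw correlation between x and y is substantial even though neither causes the other; once z is partialed out, the relationship essentially vanishes, which is what partial correlation is used to check when a potential third variable has actually been measured.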
Summary
The heart of the experimental method involves manipulation and control. This is why an experiment allows stronger causal inferences than a correlational study. In a correlational study, the investigator simply observes whether the natural fluctuation in two variables displays a relationship. By contrast, in a true experiment the investigator manipulates the variable hypothesized to be the cause and looks for an effect on the variable hypothesized to be the effect while holding all other variables constant by control and randomization. This method removes the third-variable problem present in correlational studies. The third-variable problem arises because, in the natural world, many different things are related. The experimental method may be viewed as a way of prying apart these naturally occurring relationships. It does so because it isolates one particular variable (the hypothesized cause) by manipulating it and holding everything else constant. However, in order to pry apart naturally occurring relationships, scientists often have to create special conditions that are unknown in the natural world.
Selection Bias
The term self-selection bias refers to situations where people select themselves into a particular group rather than being randomly assigned (see Chapter 6) to the group. Self-selection creates spurious correlations between personal variables and environmental characteristics—correlations that do not indicate a causal relationship. The correlations arise because people with certain behavioral/biological characteristics have chosen certain types of environments, not because the environments caused the behavioral/biological characteristics. Self-selection is better explained by considering some examples. Let's look at a straightforward example that illustrates the importance of selection factors in creating spurious correlations: Quickly, name a state with an above-average incidence of deaths due to respiratory illness. One answer to this question would be, of course, Arizona. ... What? Wait a minute! Arizona has clean air, doesn't it? Does the smog of Los Angeles spread that far? Has the suburban sprawl of Phoenix become that bad? No, it can't be. Let's slow down a minute. Maybe Arizona does have good air. And maybe people with respiratory illnesses tend to move there. And then they die. There you have it. A situation has arisen in which, if we're not careful, we may be led to think that Arizona's air is killing people. However, selection factors are not always so easy to discern. They are often overlooked, particularly when there is a preexisting desire to see a certain type of causal link. Tempting correlational evidence combined with a preexisting bias may deceive even the best of minds. Let's consider some specific cases. An example from clinical psychology demonstrates how tricky and "perverse" the selection bias problem can be. It has sometimes been demonstrated that the cure rate for various addictive-appetite problems such as obesity, heroin use, and cigarette smoking is lower for those who have had psychotherapy than for those who have not.
The reason, you will be glad to know, is not that psychotherapy makes addictive behavior more resistant to change. It is that, among those who seek psychotherapy, the disorder is more intractable (Satel & Lilienfeld, 2013), and self-cures have been ineffective. In short, "hard cases" seek psychotherapy more than "easy cases." This "hard case" self-selection bias is so ubiquitous that it has been called the clinician's illusion (Satel & Lilienfeld, 2013)—the illusion being that clinicians tend to overgeneralize the characteristics of the extreme cases they see to the much larger population of milder cases that are less likely to come into contact with a clinician. This type of self-selection bias comes into play when organizations or governments launch so-called scorecards to rate physicians. New York did this some years ago when the state started publishing the mortality rates of cardiologists (Wheelan, 2013). The problem was that a cardiologist could simply boost their ratings by seeking out easy cases and avoiding the most difficult cases! Jumping to conclusions when selection effects are present can lead us to make bad real-world choices. Many women were once encouraged to take hormone replacement therapy (HRT) after menopause because of reports that it lowered the probability of heart disease. But the early studies that had indicated this had simply compared groups of women who had chosen to take HRT (i.e., who self-selected the treatment) with those who had not chosen to take HRT. However, true experiments (using random assignment, see Chapter 6) conducted later on found that HRT actually did not reduce the likelihood of heart disease at all (Bluming & Tavris, 2009; Seethaler, 2009). The earlier studies involving self-selected samples had seemed to indicate that it did because women who chose to have HRT were more physically active, less obese, and less likely to smoke than women who did not choose HRT. Selection bias can lead to some surprising conclusions.
During World War II an analyst was trying to determine where to place extra armor on aircraft based on the pattern of bullet holes in the returning planes (Ellenberg, 2014). His decision was to put the extra armor in the places that were free of bullet holes on the returning aircraft that he analyzed. He did not put the extra armor in the places where there were a lot of bullet holes. His reasoning was that the planes had probably been pretty uniformly hit with bullets. Where he found the bullet holes on the returning aircraft told him that, in those places, the plane could be hit and still return. Those areas that were free of bullet holes on returning planes had probably been hit—but planes hit there did not return. Hence, it was the places on the returning planes without bullet holes that needed more armor! It is easy to use selection effects to "set up" people to make a causal inference. How about this one: Republicans enjoy sex more than Democrats. It's an absolute fact. Statistics show that the average Republican voter is more satisfied with their sex life than the average Democratic voter (Blastland & Dilnot, 2009). What is it about Republicanism that makes people sexier? OK, you guessed it. That's not right. Politics doesn't change anyone's sex life. What accounts for the data, then? Two things. First, married people vote Republican more than single people. Second, surveys show that married people report more satisfaction with their sex lives than single people. Republicanism doesn't change anyone's sex life; it's just that a demographic group (married people) that reports higher satisfaction is more prone to vote Republican. Examples such as the "sexy Republican" show us how careful we have to be when selection effects might be operating.
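The wartime armor example is a classic case of survivorship bias, and its logic can be sketched as a simulation. The four aircraft sections, the per-section survival probabilities, and the one-hit-per-plane simplification are all invented for illustration:

```python
import random
random.seed(3)

# Hypothetical model: bullets strike four sections of a plane uniformly,
# but the chance that a hit downs the plane depends on the section.
sections = ["engine", "cockpit", "fuselage", "wings"]
p_survive_hit = {"engine": 0.3, "cockpit": 0.5, "fuselage": 0.95, "wings": 0.9}

holes_on_returners = {s: 0 for s in sections}
for _ in range(20000):                         # one bullet strike per plane, for simplicity
    hit = random.choice(sections)
    if random.random() < p_survive_hit[hit]:   # the plane makes it home
        holes_on_returners[hit] += 1

# Among RETURNING planes, the most vulnerable section shows the fewest holes,
# because planes hit there mostly never came back to be counted.
least_holed = min(holes_on_returners, key=holes_on_returners.get)
print(holes_on_returners, least_holed)
```

Even though bullets strike every section equally often in this model, the returning fleet shows the fewest holes exactly where a hit is most lethal, which is why the analyst armored the clean spots.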
Economist Steven Landsburg (2007) demonstrates how much of the data showing productivity tied to the use of technology might be interpreted as causal when in fact it is only correlational data containing selection effects. Within corporations, it is often the most productive employees who are given the most advanced technology. Thus, when a correlation is calculated, productivity will be correlated with technology use. But it is not that the technology improved the performance of these employees, because they were already more productive before they received the advanced technology. An important real-life health issue that implicates selection effects strongly is the debate about the health outcomes of alcohol consumption. Numerous studies have found that moderate drinkers have better health outcomes than not only frequent drinkers but also abstainers (Rabin, 2009). Aware of selection effects, neither you nor I will be tempted to tell anyone abstaining from alcohol that they would improve their health by drinking a little. This is because people select themselves into drinking groups by deciding how much to drink. As Rabin (2009) explains, it has been found that moderate drinkers are moderate in everything they do. They exercise moderately and eat moderately. They tend to do a lot of things right. So of course the problem is that we do not know whether it is the moderate drinking itself that leads to positive health outcomes or whether it is all of the other good characteristics of the moderate drinking group (their exercise levels, diet, etc.). Because of selection effects, we cannot say that the moderate drinking itself is the cause. It is likewise with some correlational studies that have shown that wine drinkers have better health outcomes than do beer drinkers or liquor/cocktail drinkers (University of California, 2015a).
The problem is that wine drinkers generally have healthier habits than beer or liquor drinkers and are different demographically. Wine drinkers smoke less, for instance, and they are more educated and affluent. When studies have employed statistical regression techniques to control for these factors, the association of positive health outcomes with wine drinking disappears. In short, the consumer's rule for this chapter is simple: Be on the lookout for instances of selection bias, and avoid inferring causation when data are only correlational. It is true that complex correlational designs do exist that allow limited causal inferences. It is also true that correlational evidence is helpful in demonstrating convergence on a hypothesis (see Chapter 8). Nevertheless, it is probably better for the consumer to err on the side of skepticism than to be deceived by correlational relationships that falsely imply causation.
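The regression-control logic described above can be sketched with simulated data. The data-generating model is invented: healthy habits drive both wine drinking and health, and wine itself has zero direct effect. A simple regression of health on wine alone then shows a "benefit" that disappears once habits are controlled:

```python
import random
random.seed(4)

n = 5000
# Hypothetical model: healthy habits drive both wine drinking and health;
# wine has no direct effect on health at all.
habits = [random.gauss(0, 1) for _ in range(n)]
wine   = [h + random.gauss(0, 1) for h in habits]
health = [2 * h + random.gauss(0, 1) for h in habits]

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

w, hb, y = center(wine), center(habits), center(health)
dot = lambda a, b: sum(p * q for p, q in zip(a, b))

# Simple regression: health on wine alone (spurious positive slope).
b_naive = dot(w, y) / dot(w, w)

# Multiple regression: health on wine, controlling for habits
# (solving the 2x2 normal equations directly).
Sww, Shh, Swh = dot(w, w), dot(hb, hb), dot(w, hb)
Swy, Shy = dot(w, y), dot(hb, y)
det = Sww * Shh - Swh ** 2
b_wine_controlled = (Shh * Swy - Swh * Shy) / det
print(round(b_naive, 2), round(b_wine_controlled, 2))
```

The wine coefficient is strongly positive until the confounded variable is entered into the regression, at which point it collapses toward zero, mirroring how the association of wine with positive health outcomes disappeared when smoking, education, and affluence were statistically controlled.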
The Directionality Problem
There is no excuse for making causal inferences on the basis of correlational evidence when it is possible to manipulate variables in a way that would legitimately justify a causal inference. Yet this is a distressingly common occurrence when psychological issues are involved. A well-known example in the area of educational psychology illustrates this point quite well. Since the beginning of the scientific study of reading about a hundred years ago, researchers have known that there is a correlation between eye movement patterns and reading ability. Poorer readers make more erratic movements, display more regressions (movements from right to left), and make more fixations (stops) per line of text. On the basis of this correlation, some educators hypothesized that deficient oculomotor skills were the cause of reading problems, and many eye-movement training programs were developed and administered to elementary school children. These programs were instituted long before it was ascertained whether the correlation really indicated that erratic eye movements caused poor reading. It is now known that the correlation between eye movements and reading ability reflects a causal relationship that runs in exactly the opposite direction (Rayner et al., 2012). Erratic eye movements do not cause reading problems. Instead, slow recognition of words and difficulties with comprehension lead to erratic eye movements. When children are taught to recognize words efficiently and to comprehend better, their eye movements change. Training children's eye movements does nothing to improve their reading comprehension. For more than a decade now, research has clearly pointed to word decoding and a language problem in phonological processing as the sources of reading problems (Cunningham & Zibulsky, 2014; Hulme & Snowling, 2013; Seidenberg, 2017; Willingham, 2017). Very few cases of reading disability are due to difficulties in the area of eye movement patterns.
Many school districts still have in their storage basements the dusty "eye movement trainers" that represent thousands of dollars of equipment money wasted because of the temptation to see a correlation as proof of a causal hypothesis. Consider another somewhat similar example. An extremely popular hypothesis in the fields of education and counseling psychology has been that school achievement problems, drug abuse problems, teenage pregnancy, bullying, and many other problem behaviors are the result of low self-esteem. It was assumed that the causal direction of the linkage was obvious: Low self-esteem led to problem behaviors, and high self-esteem led to high educational achievement and accomplishments in other domains. This assumption of causal direction provided the motivation for many educational programs for improving self-esteem. The problem here was the same as that in the eye movement example: An assumption of causal direction was made from the mere existence of a correlation. It turns out that the relationship between self-esteem and school achievement, if it exists at all, is more likely to be in the opposite direction: Superior accomplishment in school (and in other aspects of life) leads to high self-esteem, rather than the reverse (Krueger et al., 2008; Lilienfeld et al., 2012). An example often pointed to in research methodology textbooks concerns a group of islanders in the New Hebrides who believed that lice made people healthy, because healthy islanders often had a lot of lice and ill islanders often did not. It turns out that almost all islanders had some lice most of the time. When islanders became ill, however, their fevers made their bodies too hot for the lice, which then left (Mazur, 2016). Unhealthy people thus quickly became free of lice. This created the observation that healthy people had more lice than unhealthy people. But the causality ran in the other direction: poor health led to fewer lice (and better health led to more lice), rather than lice leading to health.
Problems of determining the direction of causation are common in psychological research. For example, psychologist Jonathan Haidt (2006) has discussed research showing that there is a correlation between altruism and happiness. There is research showing, for instance, that people who do volunteer work are happier than those who do not. Of course, it was necessary to make sure that a third variable wasn't accounting for the link between altruism and happiness. Once third variables were eliminated, it was necessary to determine the direction of the linkage: Was it that happiness caused people to be altruistic or was it that acts of altruism made people happy ("it is more blessed to give than to receive")? When the proper controlled studies were done, using the logic of the true experiment to be described in Chapter 6, it was found that there was a causal relationship running in both directions: Being happy makes people more altruistic, and performing altruistic acts makes people happier. Psychologists Chris Chabris and Dan Simons (2013) discussed a study in which researchers surveyed 2,881 people in 228 census tracts and found that the census tracts with more outdoor food advertising had people who were more obese. Chabris and Simons discuss how the study was presented as if it were correct to assume that the food advertising was affecting people and making them more obese. Readers of this book will by now have thought of the alternative interpretation that runs in the opposite direction: advertisers might place more ads in neighborhoods that are high consumers of their food items. Earlier we warned about the tendency, when seeing a study showing correlations between parenting behaviors and children's psychological characteristics, to assume automatically that the parenting behaviors caused those characteristics.
We pointed out that the genetic connection between the parents and their children might be a third variable responsible for the parent-child correlations. But, additionally, there may be a directionality problem as well: the child's behaviors might be evoking parental responses (Jaffee et al., 2012). The direction of causation might actually be from child to parent. Our discussion thus far has identified the two major classes of ambiguity present in a simple correlation between two variables. One is called the directionality problem and is illustrated by the eye movement and self-esteem examples. Before immediately concluding that a correlation between variable A and variable B is due to changes in A causing changes in B, we must first recognize that the direction of causation may be the opposite, that is, from B to A. The second problem is the third-variable problem, and it is illustrated by the pellagra example (and the toaster-birth control and private-school-achievement examples). The correlation between the two variables may not indicate a causal path in either direction but may arise because both variables are related to a third variable.
The Case of Clever Hans
This chapter starts with a quiz. Don't worry; it's not about what you read in the last chapter. In fact, it should be easy because it's about the observable motion of objects in the world, something with which we have all had much experience. There are just three questions in the quiz. For the first, you will need a piece of paper. Imagine that a person is whirling a ball attached to a string around his or her head. Draw a circle that represents the path of the ball as viewed from above the person's head. Draw a dot somewhere on the circle and connect the dot to the center of the circle with a line. The line represents the string, and the dot represents the ball at a particular instant in time. Imagine that at exactly this instant, the string is cut. Your first task is to indicate with your pencil the subsequent flight of the ball. For your next problem, imagine that you are a bomber pilot flying toward a target at 500 miles per hour at a height of 20,000 feet. To simplify the problem, assume that there is no air resistance. The question here is, at which location would you drop your bomb: before reaching the target, directly over the target, or when you have passed the target? Indicate a specific distance in front of the target, directly over the target, or a specific distance past the target. Finally, imagine that you are firing a rifle from shoulder height. Assume that there is no air resistance and that the rifle is fired exactly parallel to the ground. If a bullet that is dropped from the same height as the rifle takes one-half second to hit the ground, how long will it take the bullet that is fired from the rifle to hit the ground if its initial velocity is 2,000 feet per second? And the answers—oh, yes, the answers. They appear later on in this chapter. But, first, in order to understand what the accuracy of our knowledge about moving objects has to do with psychology, we need to explore more fully the nature of the experimental logic that scientists use. 
In this chapter, we will discuss principles of experimental control and manipulation.
Random Assignment in Conjunction with Manipulation Defines the True Experiment
We are not saying here that Snow's approach was without merit. But scientists do prefer to manipulate the experimental variables more directly because direct manipulation generates stronger inferences. Consider Snow's two groups of subjects: those whose water was supplied by the Lambeth Company and those whose water was supplied by the Southwark and Vauxhall Company. The mixed nature of the water supply system in that neighborhood probably ensured that the two groups would be of roughly equal social status. However, the drawback of the type of research design used by Snow is that the subjects themselves determined which group they would be in (self-selection). They did this by signing up with one or the other of the two water companies years before. We must consider why some people signed up with one company and some with another. Did one company offer better rates? Did one advertise the medicinal properties of its water? We do not know. The critical question is, might people who respond to one or another of the advertised properties of the product differ in other, health-related ways? The answer to this question has to be that it is a possibility. A design such as Snow's cannot rule out the possibility of spurious correlates more subtle than those that are obviously associated with SES. This is precisely the reason that scientists prefer direct manipulation of the variables they are interested in. When manipulation is combined with a procedure known as random assignment (in which the subjects themselves do not determine which experimental condition they will be in but, instead, are randomly assigned to one of the experimental groups), scientists can rule out alternative explanations of data patterns that depend on the particular characteristics of the subjects. Random assignment ensures that the people in the conditions compared are roughly equal on all variables because, as the sample size increases, random assignment tends to balance out chance factors.
This is because the assignment of the participants is left up to an unbiased randomization device rather than the explicit choices of a human. Please note here that random assignment is not the same thing as random sampling. The difference will be discussed in Chapter 7. Random assignment is a method of assigning subjects to the experimental and control groups so that each subject in the experiment has the same chance of being assigned to either of the groups. Flipping a coin is one way to decide to which group each subject will be assigned. In actual experimentation, a computer-generated table of random numbers is most often used. By using random assignment, the investigator is attempting to equate the two groups on all behavioral and biological variables prior to the investigation—even ones that the investigator has not explicitly measured or thought about. How well random assignment works depends on the number of subjects in the experiment. As you might expect, the more the better. That is, the more subjects there are to assign to the experimental and control groups, the closer the groups will be matched on all variables prior to the manipulation of the independent variable. The use of random assignment ensures that there will be no systematic bias in how the subjects are assigned to the two groups. The groups will not always be matched exactly on every variable, but to the extent that they are mismatched, random assignment ensures that the mismatch does not systematically favor either the experimental or the control group. Perhaps it will be easier to understand how random assignment eliminates the problem of systematic bias if we focus on the concept of replication: the repeating of an experiment in all of its essential features to see if the same results are obtained. Imagine an experiment conducted by a developmental psychologist who is interested in the effect of early enrichment experiences for preschool children.
Children randomly assigned to the experimental group receive the enrichment activities designed by the psychologist during their preschool day-care period. Children randomly assigned to the control group participate in more traditional playgroup activities for the same period. The dependent variable is the children's school achievement, which is measured at the end of the children's first year in school to see whether children in the experimental group have outperformed those in the control group. An experiment like this would use random assignment to ensure that the groups start out relatively closely matched on all extraneous variables that could affect the dependent variable of school achievement. These extraneous variables are sometimes called confounding variables. Some possible confounding variables are intelligence test scores and home environment. Random assignment will roughly equate the two groups on these variables. However, particularly when the number of subjects is small, there may still be some differences between the groups. For example, if after random assignment the intelligence test scores of children in the experimental group averaged 105.6 and those of children in the control group averaged 101.9 (this type of difference could occur even if random assignment has been properly used), we might worry that any difference in academic achievement in favor of the experimental group was due to the higher intelligence test scores of children in that group rather than to the enrichment program. Here is where the importance of replication comes in. Subsequent studies may again show IQ differences between the groups after random assignment, but the lack of systematic bias in the random assignment procedure ensures that the difference will not always be in favor of the experimental group. 
In fact, what the property of no systematic bias ensures is that, across a number of similar studies, any IQ differences will occur approximately half of the time in favor of the experimental group and half of the time in favor of the control group. In Chapter 8 we will discuss how multiple experiments such as these are used to converge on a conclusion. Thus, there are really two strengths in the procedure of random assignment. One is that in any given experiment, as the sample size gets larger, random assignment ensures that the two groups are relatively matched on all extraneous variables. However, even in experiments where the matching is not perfect, the lack of systematic bias in random assignment allows us to be confident in any conclusions about cause—as long as the study can be replicated. This is because, across a series of such experiments, differences between the two groups on confounding variables will balance out.
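The two properties of random assignment discussed above (closer matching of the groups as sample size grows, and no systematic bias across replications) can be illustrated with a small simulation. The IQ distribution and the sample sizes are invented for illustration:

```python
import random
random.seed(5)

def group_gap(n):
    """Draw 2n subjects (IQ ~ N(100, 15)), randomly assign half to each
    group, and return the experimental-minus-control difference in mean IQ."""
    iq = [random.gauss(100, 15) for _ in range(2 * n)]
    random.shuffle(iq)                      # the unbiased randomization device
    exp, ctl = iq[:n], iq[n:]
    return sum(exp) / n - sum(ctl) / n

# Property 1: larger samples -> smaller chance imbalance on the confound.
small_gaps = [abs(group_gap(10)) for _ in range(500)]
large_gaps = [abs(group_gap(1000)) for _ in range(500)]
mean_small = sum(small_gaps) / 500
mean_large = sum(large_gaps) / 500
print(round(mean_small, 2), round(mean_large, 2))

# Property 2: no systematic bias. Across many replications the chance
# IQ difference favors each group about equally often.
share_favoring_experimental = sum(group_gap(50) > 0 for _ in range(2000)) / 2000
print(share_favoring_experimental)          # close to 0.5
```

With only 10 subjects per group, chance IQ gaps of several points (like the 105.6 versus 101.9 example) are routine; with 1000 per group they shrink to a fraction of a point. And because neither group is systematically favored, such gaps wash out across a series of replicated experiments.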