RA2-Exam 3
Independent Groups
Random Assignment -Participants are assigned to groups so that each participant has an equal chance of being in any group -As groups get larger, we can place more confidence in random assignment achieving what we want it to; random assignment tends to create equal groups in the long run
Rationale of ANOVA
F = (between-groups variability) / (within-groups variability) -When the IV has a significant effect on the DV the F ratio will be large -When the IV has no effect or only a small effect, the F ratio will be small (near 1)
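To make the ratio concrete, here is a minimal sketch of a one-way ANOVA in Python with SciPy; the three groups and their scores are hypothetical.

```python
from scipy import stats

# Hypothetical scores for three levels of one IV
group1 = [4, 5, 6, 5, 4]
group2 = [7, 8, 6, 7, 8]
group3 = [9, 8, 9, 10, 9]

# f_oneway computes F = between-groups variability / within-groups variability
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A large F suggests the IV had an effect; an F near 1 suggests it did not
```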
Rationale for ANOVA
The rationale behind ANOVA for factorial designs is the same as we saw before, with one major modification -We still use ANOVA to partition (i.e., divide) the variability into two sources - IV variability and error variability -With factorial designs, the sources of treatment variability increase: main effects and interactions
Dealing with More than Two IVs
The simplest possible factorial design with three IVs (often referred to as a three-way design) has two levels of each IV -This design represents a 2 x 2 x 2 experiment -This design would require eight total groups if it is planned as a completely independent groups design
Step 1: Stating the Null Hypothesis
We begin by stating the null hypothesis for the Jones study. Here, the null hypothesis would be that there is no difference between the mean of the birthday match group and the mean of the control group
Interpreting Computer Statistical Output
One-way ANOVA for Independent Samples -Source table >A table that contains the results of ANOVA. Source refers to the source of the different types of variation -The probability of a statistic is never .000; no matter how large the statistic gets, the probability only approaches 0 -In light of this problem, report p < .001 if you ever have such a result on your computer printout (see the sketch below)
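A small sketch of that reporting rule; the helper function is hypothetical, written only to illustrate never printing p = .000:

```python
def format_p(p: float) -> str:
    # Report very small probabilities as p < .001, never as .000
    return "p < .001" if p < .001 else f"p = {p:.3f}"

print(format_p(0.0000004))  # p < .001
print(format_p(0.023))      # p = 0.023
```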
Factorial Design Effects
-Main Effect >A main effect refers to the sole effect of one IV in a factorial design >There is the possibility of a main effect for each factor -Interaction >The joint, simultaneous effect of multiple IVs on the DV >The effect of one IV depends on the specific level of the other IV >There is the possibility of an interaction for each possible combination of the manipulated factors -We use the ___-way phrasing to describe the possible interactions; more complex factorial designs can have an interaction of all factors, plus interactions of each combination of those factors
Counterbalancing
Because order effects are potential internal validity problems in a within-groups design, experimenters want to avoid them. When researchers use counterbalancing, they present the levels of the independent variable to participants in different sequences. With counterbalancing, any order effects should cancel each other out when all the data are collected
Manipulated Variable
A manipulated variable is a variable that is controlled, such as when the researchers assign participants to a particular level (value) of the variable. For example, Mueller and Oppenheimer manipulated notetaking by flipping a coin to determine whether a person would take notes with a laptop or in longhand. (In other words, the participants did not get to choose which form they would use.) Notetaking method was a variable because it had more than one level (laptop and longhand), and it was a manipulated variable because the experimenter assigned each participant to a particular level. The van Kleef team similarly manipulated the size of the pasta serving bowls by flipping a coin ahead of time to decide which session participants were in (Participants did not choose the bowl size from which they would serve themselves)
Latin Square
Another technique for partial counterbalancing is to use a Latin square, a formal system to ensure that every condition appears in each position at least once. A Latin square for six conditions (conditions 1 through 6) looks like this:
1 2 6 3 5 4
2 3 1 4 6 5
3 4 2 5 1 6
4 5 3 6 2 1
5 6 4 1 3 2
6 1 5 2 4 3
The first row is set up according to a formula, and then the conditions simply go in numerical order down each column. Latin squares work differently for odd and even numbers of conditions
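A minimal sketch of that construction in Python, assuming an even number of conditions; the first row follows the common balanced pattern 1, 2, n, 3, n-1, 4, ...:

```python
def balanced_latin_square(n):
    # Build the first row: 1, 2, n, 3, n-1, 4, ...
    first, lo, hi, take_low = [1], 2, n, True
    while len(first) < n:
        first.append(lo if take_low else hi)
        lo, hi = (lo + 1, hi) if take_low else (lo, hi - 1)
        take_low = not take_low
    # Each later row adds 1 to every entry (wrapping n back to 1),
    # so conditions run in numerical order down each column
    rows = [first]
    for _ in range(n - 1):
        rows.append([x % n + 1 for x in rows[-1]])
    return rows

for row in balanced_latin_square(6):
    print(row)  # reproduces the six-condition square above
```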
Practical Considerations of Multi-Group Designs
-Independent Groups >You must take into account the large number of participants needed to make random assignment feasible and to fill the multiple groups -Related-Groups >Matched sets --You must consider the difficulty of finding three (or more) participants to match on the extraneous variable you choose >Natural sets --May be limited by the size of the natural sets you intend to study >Repeated measures --Each participant must be measured at least three times --Need large numbers of participants in order to counterbalance all possible orders!
Variables
-Independent Variable (IV): Variables that the experimenter purposely manipulates >The IV constitutes the reason the research is being conducted; the experimenter is interested in determining its effect -Dependent Variable (DV): A response or behavior that is measured. Assumption: changes in the DV are directly due to the IV manipulation -Control variables: Factors that an experimenter holds constant >A confound is a factor that varies systematically with the IV, provides an alternative explanation for the results, and threatens internal validity
Variables
-Independent Variables (IVs): >IVs are those variables that the experimenter purposely manipulates >The IV constitutes the reason the research is being conducted; the experimenter is interested in determining the effect it has >Typically manipulated with conditions -Dependent Variable (DV): >A response or behavior that is measured >Changes in the DV are thought to be due to the IV manipulation
Related Groups
-Matched pairs >Participants are measured and equated on some variable before the experiment --Measure a variable that might confound if not controlled --Participants similar on this variable are identified and assigned to the different IV conditions -Natural pairs >Participants with a pre-existing biological or social relationship are compared -Repeated measures >Participants experience multiple IV conditions >Participants serve as their own controls
How Many Total Effects?
-The number of Main Effects + Interactions varies with the number of IVs: >Total Number of Effects = 2^k - 1, where k is the number of IVs -If we have 2 IVs: >2 main effects: one for the first IV, one for the second IV >1 two-way interaction -If we have 3 IVs: >3 main effects: one for the first IV, one for the second IV, and one for the third IV >4 Interactions: three two-way interactions, and one three-way interaction -If we have 4 IVs: >4 main effects: one for the first IV, one for the second IV, one for the third IV, and one for the fourth IV >11 Interactions: six two-way interactions, four three-way interactions, and one four-way interaction
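A quick sketch verifying the 2^k - 1 count: the number of j-way effects among k IVs is "k choose j" (j = 1 counts the main effects):

```python
from math import comb

for k in (2, 3, 4):
    # effects[j] = number of j-way effects (main effects when j == 1)
    effects = {j: comb(k, j) for j in range(1, k + 1)}
    total = sum(effects.values())
    print(f"k={k}: {effects} -> total {total} (= 2**{k} - 1 = {2**k - 1})")
```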
Repeated Measures Cautionary Notes
-Things to consider about using repeated measures >Can we remove the effect of the IV? >Can we measure the DV more than once? >Can our participants cope with repeated testing? -Order Effects >Sequence or order effects are produced by exposure to the sequential presentation of the treatments >Carryover, Practice, and Fatigue Effects
Obtained Results and Truth
-Type I Error: False Positive (Rejects the null when it is True) >The experimenter directly controls this by setting the significance level >p = α = .05 -Type II Error: False Negative (Retains the null when it is False) >Not under the direct control of the experimenter, but stronger IVs and larger samples yield higher power > p = β = ?
Variables
-Types of IVs >Physiological >Experience >Stimulus or environmental >Participant Characteristics -Types of DVs >Frequency or Rate >Degree or Amount >Latency >Duration >Correctness/Accuracy
Using Measured IVs
-Using a measured rather than a manipulated IV results in ex post facto research >A research approach in which the experimenter cannot directly manipulate the IV but can only classify, categorize, or measure the IV because it is predetermined in the participants (e.g., age or sex) -Without the control that comes from directly causing an IV to vary, we must exercise extreme caution in drawing conclusions from such studies -We can develop an experiment that uses one or more manipulated IVs and one or more measured IVs at the same time
How Many Statistics?
-We will calculate one F-score (and one p-value) for EVERY main effect and EVERY interaction to analyze the statistical significance of all potential effects -If we have 2 IVs: >2 main effects = one F-score and p-value for EACH IV main effect >1 Interaction = one F-score and p-value for the only two-way interaction -If we have 3 IVs: >3 main effects = one F-score and p-value for EACH IV main effect >4 Interactions = one F-score and p-value for EACH of the three two-way interactions, and one F-score and p-value for the three-way interaction
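A minimal sketch of getting one F and one p per effect, using pandas and statsmodels; the data frame and the column names (dv, iv1, iv2) are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical balanced 2 x 2 data set
df = pd.DataFrame({
    "dv":  [3, 4, 5, 6, 7, 8, 2, 3, 4, 9, 8, 7],
    "iv1": ["a"] * 6 + ["b"] * 6,
    "iv2": (["x"] * 3 + ["y"] * 3) * 2,
})

# The * operator expands to both main effects plus the interaction
model = smf.ols("dv ~ C(iv1) * C(iv2)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # rows: iv1, iv2, iv1:iv2, Residual
```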
Assigning Participants to Groups
-When we have only one IV, we have two options - independent groups or related groups -With factorial designs, we have the same two options, plus a third hybrid option >Independent groups >Related groups >Mixed assignment: at least one IV assigned with independent groups and at least one IV assigned with related groups --Combine the advantages of the two types of designs; straightforwardness of independent groups plus conserving participants with repeated measures
Partial Counterbalancing
As the number of conditions increases, however, the number of possible orders needed for full counterbalancing increases dramatically. For example, a study with four conditions requires 24 possible sequences! If experimenters want to put at least a few participants in each order, the need for participants can quickly increase, counteracting the efficiency of a repeated-measures design. Therefore, they might use partial counterbalancing, in which only some of the possible condition orders are represented. One way to partially counterbalance is to present the conditions in a randomized order for every subject. (This is easy to do when an experiment is administered by a computer; the computer delivers conditions in a new random order for each participant.)
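A sketch of this randomized-order approach; the condition labels and participant count are hypothetical:

```python
import random
from math import factorial

conditions = ["A", "B", "C", "D"]
print(factorial(len(conditions)))  # 24 possible orders for 4 conditions

# Partial counterbalancing: each participant gets a fresh random order
for participant in range(3):
    order = random.sample(conditions, k=len(conditions))
    print(f"Participant {participant + 1}: {order}")
```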
Why Bother?
Experiments Support Causal Claims! -Experiments establish covariance >IV <-> DV, established with a comparison group -Experiments establish temporal precedence >IV first, DV second -Well-designed experiments establish internal validity >IV -> DV, holding potential confounds constant
Alpha Level
Finally, at what probability level should we decide to reject the null hypothesis? Why did we reject the null hypothesis at p=.02 but not at p=.16? In psychological science, the level researchers typically use is 5% or less, or p < .05. This decision point is called the alpha level: the point at which researchers decide whether p is too high (and therefore retain the null hypothesis) or low enough (and therefore reject it)
Confounds
For any given research question, there can be several possible alternative explanations, which are known as confounds, or potential threats to internal validity. The word confound can mean "confused": When a study has a confound, you are confused about what is causing the change in the dependent variable. Is it the intended causal variable (such as bowl size)? Or is there some alternative explanation (such as generous attitude of the research assistants)? Internal validity is subject to a number of distinct threats. As experimenters design and interpret studies, they keep these threats to internal validity in mind and try to avoid them
Concurrent-measures design
In a concurrent-measures design, participants are exposed to all the levels of an independent variable at roughly the same time, and a single attitudinal or behavioral preference is the dependent variable. An example is a study investigating infant cognition, in which infants were shown two faces at the same time, a male face and a female face; an experimenter recorded which face they looked at the longest. The independent variable is the gender of the face, and babies experience both levels (male and female) at the same time. The baby's looking preference is the dependent variable. This study found that babies show a preference for looking at female faces, unless their primary caretaker is male
Main Effect
In a factorial design, researchers test each independent variable to look for a main effect - the overall effect of one independent variable on the dependent variable, averaging over the levels of the other independent variable. In other words, a main effect is a simple difference. In a factorial design with two independent variables, there are two main effects
Step 4: Decide whether to reject or retain the null hypothesis
In the fourth step, we made a decision based on the probability we obtained in Step 3. Since the probability of choosing a person with an IQ of 130 or higher just by chance is so small, we rejected our original assumption. That is, we rejected the null hypothesis that Sarah is not able to identify smart people
Independent-groups design
In the notetaking and pasta bowl studies, there were different participants at each level of the independent variable. In the notetaking study, some participants took notes on laptops and others took notes in longhand. In the pasta bowl study, some participants were in the large-bowl condition and others were in the medium-bowl condition. Both of these studies used an independent-groups design, in which different groups of participants are placed into different levels of the independent variable. This type of design is also called a between-subjects design or between-groups design
Null Hypothesis Significance Testing (NHST)
NHST follows a set of steps to determine whether the result from a study is statistically significant. For example, we can estimate the probability that we would get a similar result just by chance. There are several statistical tests: the t test, the F test, and tests of the significance of a correlation and beta. All of them share the same underlying logical steps
Step 3: Calculating the Probability of the Result, or One Even More Extreme, If the Null Hypothesis is True
Next we calculate the probability of getting this result, or a result even more extreme, if the null hypothesis is true. To do so, we compare the t value from our sample of data to the types of t values we are likely to get if the null hypothesis is true -By far the most common way to estimate this probability is to use a sampling distribution of t. This method estimates the probability of obtaining the t we got, just by chance, from a null hypothesis population
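A sketch of this estimate using SciPy's t distribution, plugging in the t and df from the Jones birthday-match example described elsewhere in these notes (t = 2.57, df = 50):

```python
from scipy import stats

t_obtained, df = 2.57, 50
# Probability of a t this large or larger, in either tail, if the null is true
p_two_tailed = 2 * stats.t.sf(abs(t_obtained), df)
print(f"p = {p_two_tailed:.3f}")
```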
Step 2: Computing the F ratio
Next, we calculate the F ratio for the study. In their article, the researchers reported that they obtained an F value of 5.23 for the results. This means the variability between the three means was 5.23 times larger than the variability within the three means. That sounds like a large ratio, but is it statistically significant?
Interpreting Computer Statistical Output
One-way ANOVA for Independent Samples -Sum of squares >The amount of variability in the DV attributable to each source -Mean squares >The "averaged" variability for each source >The mean square is computed by dividing each source's sum of squares by its degrees of freedom -Statistical presentation adopts the form: F(df numerator, df denominator) = ...
Practice Effects and Carryover Effects
Order effects can include practice effects, also known as fatigue effects, in which a long sequence might lead participants to get better at the task, or to get tired or bored toward the end. Order effects also include carryover effects, in which some form of contamination carries over from one condition to the next. For example, imagine sipping orange juice right after brushing your teeth; the first taste contaminates your experience of the second one
Main Effects May or May Not Be Statistically Significant
Researchers look at the marginal means to inspect the main effects in a factorial design, and they use statistics to find out whether the difference in the marginal means is statistically significant. Mueller and Oppenheimer asked whether the observed differences between laptop and longhand notetaking were statistically significant. Similarly, Bartholow and Heinz explored whether the overall difference in reaction times to the two word types was statistically significant. They also investigated whether the overall difference in reaction times after the two types of photos was statistically significant. In their study, neither of these main effects was statistically significant. The observed difference in marginal means is about what you would expect to see by chance if there were no difference in the population
Statistical Power
The probability that a statistical test will show a statistically significant result when an IV truly has an effect in the population -The likelihood of rejecting the null hypothesis when the experimental hypothesis is in fact true -The power of our statistical test is related to the number of participants tested > More participants -> Less sampling error -> More power
Design Confound
A design confound is an experimenter's mistake in designing the independent variable; it is a second variable that happens to vary systematically along with the intended independent variable and therefore is an alternative explanation for the results. As such, a design confound is a classic threat to internal validity. If the van Kleef team had accidentally served a more appetizing pasta in the large bowl than the medium bowl, the study would have a design confound because the second variable (pasta quality) would have systematically varied along with the independent variable. If the research assistants had treated the large-bowl group with a more generous attitude, the treatment of each participant would have been a design confound, too
The Decision to Reject the Null Hypothesis
The probability we obtained in Step 3 was so small that we rejected the null hypothesis assumption. In other words, we concluded that our data on Sarah's abilities were statistically significant. When we reject the null hypothesis, we are essentially saying: Data like these could have come about by chance but data like these happen very rarely by chance; therefore we are pretty sure the data were not the result of chance
t test
The t test allows researchers to test whether the difference between two group means in an independent-groups design is statistically significant. For example, 52 students were told they would be evaluating a person based on just a single information sheet about their partner. The partner's ID number, printed on the information sheet, either matched the student's own birthday (the birthday match group), or it did not (the control group). The researchers were investigating whether participants would like the partner more if the person was similar to them in this superficial way. After the students read the person's description (and saw their ID number), the researchers had the students rate how much they liked the person on a scale of 1 to 9. The researchers obtained the following results: -Mean liking rating in the birthday match group: 8.10 -Mean liking rating in the control group: 7.15 To conduct inferential statistics for this study, we ask whether these two means, 8.10 and 7.15, are significantly different from each other. In other words, what is the probability that this difference, or an even larger one, could have been obtained by chance alone, even if there's no mean difference in the population?
F test
The t test for independent groups is the appropriate test for evaluating whether two group means are significantly different. When a study compares three or more groups to each other, the appropriate test is the F test, obtained from an analysis of variance, also referred to as ANOVA
The t Test
The t test is an inferential statistical test used to evaluate the difference between the means of two groups 1. Calculate the mean and variance for each group 2. Calculate the t-score 3. Calculate the Degrees of Freedom (df) >The ability of a number in a specified set to assume any value >N = n1 + n2 >df = N - 2
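A from-scratch sketch of these three steps for two independent groups (pooled-variance form); the scores are hypothetical:

```python
import numpy as np

g1 = np.array([8, 9, 7, 8, 9, 8])
g2 = np.array([7, 6, 8, 7, 6, 7])

# Step 1: mean and variance for each group
n1, n2 = len(g1), len(g2)
m1, m2 = g1.mean(), g2.mean()
v1, v2 = g1.var(ddof=1), g2.var(ddof=1)

# Step 2: pooled variance, then the t-score
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Step 3: degrees of freedom, df = N - 2
df = n1 + n2 - 2
print(f"t({df}) = {t:.2f}")
```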
Main Effect = Overall Effect
The term main effect can be misleading because it seems to suggest that it is the most important effect in a study. It is not. In fact, when a study's results show an interaction, the interaction itself is the most important effect. Think of a main effect instead as an overall effect - the overall effect of one independent variable at a time
Experiment
The word experiment is common in everyday use. Colloquially, "to experiment" means to try something out. A cook might say he experimented with a recipe by replacing the eggs with applesauce. A friend might say she experimented with a different driving route to the beach. In psychological science, the term experiment specifically means that the researcher manipulated at least one variable and measured another. Experiments can take place in a laboratory and just about anywhere else: movie theaters, conference halls, zoos, daycare centers, and even online environments - anywhere a researcher can manipulate one variable and measure another
Step 2: Data Collection
To test Sarah's ability to identify smart people, we asked her to demonstrate it by locating a smart person in the crowd. We then gave the person she chose an IQ test to see whether she was correct. That was the datum: She correctly identified a person with an IQ of 130
Statistical Choices
-A fifth factor that influences power involves the statistical choices the researcher makes. A researcher selects a statistical test (e.g., a sign test, a chi-square test, a t test, an F test, analysis of covariance, or multiple regression) to compute the results from a sample of data. Some of these tests make it more difficult to find significant results and thus increase the chance of Type II errors. Researchers must carefully decide which statistical test is appropriate -Another statistical choice that affects power is the decision to use a "one-tailed" or a "two-tailed" test. The choice is related to the researcher's hypothesis. For example, if we are only interested in finding out whether the schizophrenia drug significantly reduces symptoms, we use a one-tailed test. If we are interested in finding out whether the schizophrenia drug significantly reduces symptoms or significantly increases them, we use a two-tailed test. When we have a good idea about the direction in which the effect will occur, a one-tailed test is more powerful than a two-tailed test -In summary, power is the probability of not making a Type II error. It is the probability that a researcher will be able to reject the null hypothesis if it deserves to be rejected. Researchers have more power to detect an effect in a population if: 1. They select a larger (less stringent) alpha level 2. The effect size in the population is large rather than small 3. The sample size is large rather than small 4. The data have lower levels of unsystematic variability 5. They use the most appropriate statistical test
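A sketch of the one- vs two-tailed contrast for the same obtained t; the numbers are hypothetical:

```python
from scipy import stats

t, df = 1.80, 30
p_one = stats.t.sf(t, df)           # directional: all of alpha in one tail
p_two = 2 * stats.t.sf(abs(t), df)  # nondirectional: alpha split across tails
print(f"one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
# The one-tailed p is half the two-tailed p, so a predicted-direction
# effect is easier to detect (more power)
```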
Degree Of Unsystematic Variability
-A fourth factor that influences power is the amount of unsystematic variability in a sample's data. All else being equal, when a study's design introduces more unsystematic variability into the results, researchers have less power to detect effects that are really there. Unsystematic variability, like having too many different flavors in two bowls of salsa, prevents researchers from seeing a clear effect resulting from their experimental manipulations -Three sources of unsystematic variability are measurement error, irrelevant individual differences, and situation noise. Measurement error occurs when the variables are measured with a less precise instrument or with less careful coding. Irrelevant individual differences among study participants can obscure the differences between two groups and thus weaken power. Using a repeated-measures design in an experiment reduces the impact of individual differences and strengthens the study's power. Situation noise can weaken power by adding extraneous sources of variability into the results. Researchers can conduct studies in controlled conditions to avoid such situation noise
Moving on to Multi-Group Designs
-A multi-group design is appropriate when you find the answer to your basic question and wish to go further, moving on to more complex and interesting questions -Again, three questions guide experimental design: >How many Independent Variables (IV)? --An experiment must have at least one IV >How many levels of the IV? --Multi-Group Designs: one IV with three or more levels >How are participants assigned to levels? --Same options, though these are complicated by the logistical challenges of managing multiple groups
Understanding Interactions
-A significant interaction means that the effects of the various IVs are not straightforward and simple -For this reason, we virtually ignore our IV main effects when we find a significant interaction -Sometimes interactions are difficult to interpret, particularly when we have more than two IVs or many levels of an IV -A strategy that often helps us to make sense of an interaction is to produce a line graph -Strategy: a summary of the DV on the y-axis, one IV on the x-axis, with separate lines for the other IV, and separate graphs if we have three or more IVs -If the lines of the graph cross or converge, this suggests an interaction (the effects of one IV change as the second IV is varied) -Non-significant interactions typically show lines that are close to parallel
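A minimal sketch of that line-graph strategy with matplotlib; the cell means are hypothetical:

```python
import matplotlib.pyplot as plt

levels_a = ["A1", "A2"]   # IV A on the x-axis
means_b1 = [4.0, 8.0]     # cell means for IV B level B1
means_b2 = [6.0, 5.0]     # cell means for IV B level B2

plt.plot(levels_a, means_b1, marker="o", label="B1")
plt.plot(levels_a, means_b2, marker="o", label="B2")
plt.ylabel("Mean DV")
plt.legend(title="IV B")
plt.show()  # crossing (non-parallel) lines suggest an A x B interaction
```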
Post Hoc Tests
-ANOVA represents the primary statistical test -To discern where the significance lies in a multiple-group experiment, we must do additional statistical post hoc comparisons (aka follow-up tests) >Post hoc comparisons --Statistical comparisons made between group means after finding a significant F ratio
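A sketch of one common post hoc comparison, Tukey's HSD, using SciPy (available in recent SciPy versions); the three groups' scores are hypothetical:

```python
from scipy import stats

control  = [5, 6, 5, 7, 6]
venting  = [8, 9, 8, 7, 9]
exercise = [6, 7, 6, 6, 7]

# Pairwise comparisons of all group means after a significant F
result = stats.tukey_hsd(control, venting, exercise)
print(result)  # table of mean differences with adjusted p-values,
               # showing which specific groups differ
```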
Hypothesis Testing
-Alternative Hypothesis (aka our real or actual hypothesis) >A hypothesis that differences between conditions are not due to chance (i.e., are due to the IV) --Assuming proper experimental control, the differences between groups are interpreted to be due to the IV -Null Hypothesis >A hypothesis that all differences between groups are due to chance (i.e., not our IV) >The hypothesis that the DV is equal for each group --If a result is not significant, then we retain the null and conclude that our IV most likely did not affect the DV -Select α >The criterion for deciding whether the obtained p-value is low enough to reject the null hypothesis
Experimental Design
-Before conducting an experiment, review the literature! >If you find no answer in a search of the literature, then you should consider conducting an experiment -Experimental design is the general plan for... >Selecting and sampling participants, >Assigning participants to experimental conditions, >Controlling extraneous variables, >Recording dependent measure data -Principle of Parsimony >Theories should remain simple until the simple explanations are no longer valid >Likewise, sometimes it is better to use simpler designs
Counterbalancing
-Control order effects by manipulating the sequence of levels participants experience -Within-Subject counterbalancing >Every participant experiences each level of the IV, in each possible order -Within-Group counterbalancing >Different orders to different participants >Randomly assign subsets of participants to experience each potential sequence
Hypothesis Testing
-Directional Hypotheses >Specify the outcome of the experiment --Group A will be greater than Group B --Group A will be less than Group B >One-tailed test: the whole α in one tail of the distribution -Nondirectional Hypotheses >Do not predict the exact directional outcome of an experiment --Group A will be different than Group B >Two-tailed test: α divided between the two tails of the distribution, making each cutoff more stringent but allowing detection of an effect in either direction
Nomenclature
-Even more descriptively, we adopt a convention of referring to factorial designs with a phrase that indicates both the number of IVs and the number of levels manipulated per IV >This adopts the form: A x B x C x.... x Y x Z >The simplest possible factorial design is known as a 2 x 2 design -The number of numbers tells us how many IVs there are in the design -The value of each number tells us how many levels there are of each IV
Alpha Conventions
-Even though the alpha level is conventionally set at .05, researchers can set it at whatever they want it to be, depending on the priorities of their research. If they prioritize being conservative, they might set alpha lower than 5%-perhaps at 1% or .01. By setting alpha lower, they are trying to minimize the chances of making a Type I error-of accidentally concluding there is a true effect or relationship when there actually is none -In contrast, there may be situations in which researchers choose to set alpha higher-say .10. In this case, they would be comfortable making false conclusions 10% of the time if, in truth, there is no effect. By setting alpha higher, they are also increasing the probability of making a Type I error. Researchers may wish to set alpha higher as one of the strategies to avoid missing an effect of a drug or intervention-that is, to avoid a Type II error
Very Quick Review
-Experimental design is our plan for manipulating at least one IV, holding possible confounds constant, and measuring a DV -Simplest case scenario: One IV, Two Levels -How do we assign participants to groups? >Independent Groups with Random Assignment >Related Groups through matched pairs, natural pairs, or repeated measures -Slightly more complex designs: One IV, Three or more levels -Experiments can support Causal Claims
Rationale for ANOVA
-Factorial design experiments are analyzed with ANOVA -For a two-way factorial design we use the following equations: -Fa = (IV A variability) / (Error variability) -Fb = (IV B variability) / (Error variability) -Faxb = (Interaction variability) / (Error variability)
Experimental Design: Doubling the Basic Building Block
-Factorial designs allow us to look at combinations of IVs at the same time >Are the differences for one IV dependent on differences for another IV? -The factorial design gets its name because we refer to each IV as a factor >Multiple IVs yield a factorial design -A factorial design is more realistic because there are probably few, if any, situations in which your behavior is affected by only a single factor at a time
Very Quick Review
-Factorial designs involve manipulating multiple IVs simultaneously -Factorial designs provide an opportunity to examine the effects of each factor, as well as their interactions -Line graphs can be useful for illustrating interactions -We use ANOVA to detect main effects and interactions >Though sometimes difficult to interpret, statistically significant interactions tell us that the effects of one IV are dependent on the effects of another IV
Very Quick Review
-Factorial designs involve manipulating multiple IVs simultaneously -These designs are multiplicative in their complexity, but better resemble what people might experience in their actual lives -We have conventions for describing these designs: >___-way, where ___ is the number of IVs >Factor1 x Factor2 x Factor3... -We have the same options for assigning participants to conditions - independent groups or related groups - but also a hybrid Mixed Assignment option
Inferential Statistics
-How are the results analyzed? -Multi-Groups Design: >Independent Groups --Independent Groups ANOVA >Related Groups --Repeated Measures ANOVA >In each case, the ANOVA will tell the experimenter whether the IV has an effect on the DV, but post hoc tests are needed to determine what conditions differ
Inferential Statistics
-How are the results analyzed? -Simplest case scenario: One IV, Two Levels -Independent Groups: Independent Samples t-test -Related Groups: Paired Samples t-test
Sample Size and Effect Size
-How large does a study's sample size need to be? Suppose we come across a study that found a statistically significant result but had a very small sample size-only 20 participants. Should the researchers conducting this study have used more participants-say 100 or 200 more? -There are two issues to consider. First, large samples are necessary when researchers are trying to detect a small effect size. In contrast, if researchers are confident in advance that they are trying to detect a large effect size, a small sample may be all that is necessary. Think again about the lost object analogy. If, on the one hand, you are looking for a lost skateboard, you can find it easily even if you only have a small candle in the dark room. If the skateboard is there (like a large effect size), your candle (like a small sample) is all you need to detect it. (Of course, you would also find it with a powerful flashlight, but all you need is the candle.) On the other hand, if you are looking for a lost earring, you may not be able to find it with the small candle, which is not powerful enough to illuminate the darkest corners. You need a big flashlight (like a large sample) to be sure to find the earring (like a small effect size) that is there. Therefore, the smaller the effect size in the population, the larger the sample needed to reject the null hypothesis (and therefore to avoid a Type II error) -The second issue is that a small sample is more likely to return a fluke result that cannot be replicated. Therefore, when a researcher is unsure how large an effect is in the population, a large sample is more likely to (1) prevent a Type II error and (2) produce results that can be replicated
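A sketch of this effect size/sample size trade-off using statsmodels' power tools; the target power (.80) and alpha (.05) follow common conventions and are assumptions here:

```python
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for d in (0.8, 0.5, 0.2):  # large, medium, small effect sizes
    # Solve for the per-group n needed to reach 80% power
    n = power_analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d}: about {n:.0f} participants per group")
# Smaller effects (the lost earring) demand much larger samples
```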
Factorial Designs
-How many IVs? -Manipulate 2 or more Factors in a single experiment -Theoretically, there is no limit to the number of IVs that can be used in an experiment >Practically speaking, however, it is unlikely that you would want to design an experiment with more than two or three IVs -We often refer to factorial designs as ___-way designs, where we fill in the ___ with the number of IVs
Step 3: Calculate the probability of getting such data, or even more extreme data, if the null hypothesis is true
-In this step, we calculated a probability: We used our knowledge of the normal distribution to compute the probability that Sarah could have chosen a person with an IQ of 130 or greater, just by chance, if the null hypothesis is true. What is the probability that Sarah could have chosen a person this smart or smarter by chance if she is not in fact able to identify smart people? -In making the calculations, we used what we know about IQ-that it is normally distributed with a mean of 100 and a standard deviation of 15-to estimate what percentage of people in the stadium would have each range of IQ scores. We also used our understanding of chance. Combined, these two pieces of knowledge directed us to predict that if Sarah had chosen a person at random, she would have chosen a person with an IQ of 130 or higher about 2% of the time
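A sketch of that calculation with SciPy, using the stated IQ distribution (mean 100, SD 15):

```python
from scipy import stats

# Chance of drawing someone with an IQ of 130 or higher at random
p = stats.norm.sf(130, loc=100, scale=15)  # upper-tail probability
print(f"p = {p:.3f}")  # about .023, i.e., roughly 2% of the time
```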
Why Experiments Support Causal Claims
1. Covariance: Do the results show that the causal variable is related to the effect variable? Are distinct levels of the independent variables associated with different levels of the dependent variable? 2. Temporal precedence: Does the study design ensure that the causal variable comes before the outcome variable in time? 3. Internal validity: Does the study design rule out alternative explanations for the results?
Step 3: Calculating the probability of the f value we obtained, or an even larger value, if the null hypothesis is true
-Next, we compare the F we obtained in Step 2 to a sampling distribution of F values, similar to what we did for the t value. We derive the sampling distribution of F by assuming that the null hypothesis is true in the population and then imagining that we run the study many more times, drawing different samples from that null hypothesis population. If we use the same design and run the study again and again, what values of F would we expect from such a population if the null hypothesis is true? -If there is no difference between means in the population, then most of the time the variance between the groups will be about the same as the variance within the groups. When these two variances are about the same, the F ratio will be close to 1 most of the time. Therefore, most of the F values we get from the null hypothesis population will be close to 1.0 -The sampling distribution of F is not symmetrical. While it is possible to get an F that is lower than 1 (e.g., when the between-groups variance is smaller than the within-groups variance), it is not possible to get an F that is less than zero. Zero is the lowest that F can be. However, F can be very large sometimes, such as when the between-groups variance is much larger than the within-groups variance -Just like t, the sampling distribution of F will take on slightly different shapes, depending on the degrees of freedom for F. The degrees of freedom for the sampling distribution of F contain two values. For the numerator (the between-groups variance), it is the number of groups minus 1 (in this example, 3 - 1, or 2). The degrees of freedom value for the denominator (the within-groups variance) is computed from the number of participants in the study, minus the number of groups. (There were 602 students in Bushman's study, about 200 in each group, so the degrees of freedom would be 602 - 3 = 599) -By deriving the sampling distribution, a computer program will calculate the probability of getting the F we got, or one more extreme, if the null hypothesis is true. The larger the F we obtained, the less likely it is to have happened just by chance if the null hypothesis is true. In the case of the Bushman study, the computer reported that value as p = .0056. This means that if the null hypothesis is true, we could get an F value of 5.23 or larger only 0.56% of the time, just by chance -Usually we use the computer to tell us the exact probability of getting the F we got if the null hypothesis is true. If a computer is not available, we can use a table like the one in Appendix B, Critical Values of F. The critical value of F is the F value associated with our alpha level. According to this table, the critical value of F at 2 and 599 degrees of freedom is 3.10. This means that if we ran the study many times, we would get an F of 3.10 or higher 5% of the time when the null hypothesis is true -In Step 4, we will compare this critical value (3.10) to the one we obtained in our study (5.23)
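A sketch of the same two quantities with SciPy's F distribution, plugging in the values reported above (F = 5.23, df = 2 and 599):

```python
from scipy import stats

p = stats.f.sf(5.23, dfn=2, dfd=599)        # P(F >= 5.23) if the null is true
f_crit = stats.f.isf(0.05, dfn=2, dfd=599)  # the F value cut off by the top 5%
print(f"p = {p:.4f}, critical F = {f_crit:.2f}")
```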
Step 4: Deciding Whether to Reject or Retain the Null Hypothesis
-Now we are ready for Step 4 of the hypothesis-testing process. Again, according to the computer, the probability of the Jones team obtaining the t of 2.57 in a study with 52 people was p = .013. This p is smaller than the conventional alpha level of .05, so we reject the null hypothesis and conclude that the difference between the two groups is statistically significant. In other words, we said: It is possible to get a t of 2.57 just by chance, but the probability of getting a t of that size (or larger) is very small - only .013; therefore we assume the difference between the two means did not happen just by chance -When we use a table of critical values of t, we compare the t we obtained to the critical value of t we looked up. Because the t we obtained (2.57) is greater than the critical t associated with the .05 alpha level, we can reject the null hypothesis. This means the probability of getting the t we got is something less than .05 if the null hypothesis is true -Notice that the two methods - the computer-derived probability and the critical value method - lead to the same conclusion from different directions. When using the computer's estimate of probability, we decide whether or not that estimate of p is smaller than our alpha level of .05. In contrast, when we use the critical values table, we are assessing whether our obtained t is larger than the critical value of t. Larger values of t are associated with smaller probabilities
Interpretation: Making Sense of our Statistics
-Our statistical analyses of factorial designs will provide us more information than we got from simpler one-way designs -The equations involved with the analyses are not necessarily more complicated, but they do provide more information because we have multiple main effects and interaction effects to analyze -With factorial designs we can examine main effects and interactions. This means we have more statistical effects! >We can statistically analyze the main effect of EACH IV, as well as all possible combinations of ALL IVs
Experimental Design and Statistics
-Selecting the appropriate experimental design determines the particular statistical test you will use to analyze your data -You should determine your experimental design before you begin collecting data to ensure there will be an appropriate statistical test you can use to analyze your data
Inferential Statistics
-Statistical Significance >An inferential statistical test can tell us whether the results of an experiment occur frequently or rarely by chance --As observed relations or differences increase, the test statistic increases >Results that occur rarely by chance (p < .05) are significant --As our test statistic increases, our p-value decreases! -Caution: Statistical significance varies with the presumed actual relations, but also with sample size -Caution: In research, significant has a special statistical meaning, and it is not the same as big, meaningful, dramatic, important, compelling, or convincing
Alpha Is the Type I Error Rate
-The alpha we set in advance is also the probability of making a Type I error if the null hypothesis is true. Returning to our example, when we computed that the probability of Sarah choosing someone with a 130 IQ by chance was .02 (if the null hypothesis is really true), we rejected the null hypothesis. However, it's still possible that Sarah did not have special abilities and that she simply got lucky this time. In this case, we would have drawn the wrong conclusion -When we set the alpha in advance at α = .05, we are admitting that 5% of the time, when the null hypothesis is true, we will end up rejecting the null hypothesis anyway. Because we know how chance works, we know that if we asked someone who has no special abilities to select 100 people in the stadium at random, 5% of them will have an IQ of 125 or higher (the IQ score associated with the top 5% of people). However, we are admitting that we are comfortable with the possibility that we will make this particular mistake-a Type I error-5% of the time when the null hypothesis is true
Effect Size
-The magnitude or size of the experimental treatment >A significant statistical test tells us only that the IV had an effect; it does not tell us about the size of the effect >Cohen's d --d values greater than .80 reflect large effect sizes
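A sketch of computing Cohen's d from group summaries with a pooled standard deviation; the means echo the birthday-match example, while the SDs and group sizes are hypothetical stand-ins:

```python
import math

m1, m2 = 8.10, 7.15                 # group means
s1, s2, n1, n2 = 1.2, 1.3, 26, 26   # hypothetical SDs and group sizes

# Pooled SD, then d = mean difference in pooled-SD units
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / sp
print(f"d = {d:.2f}")  # d >= .80 would count as a large effect
```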
Step 2: Computing the t Test For Independent Groups
-The next step is to organize the data and decide which statistical test to use. The appropriate inferential test in this case is the t test for independent groups. The t test helps us estimate whether the difference in scores between two samples is significantly greater than zero -Two features of the data influence the value of t. One is the difference between the two means themselves, and the other is how much variability there is within each group. The t test is a ratio of these two values. The numerator of the t test is the simple difference between the means of the two groups: Mean 1 minus Mean 2. The denominator of the t test contains information about the variance within each of the two groups, as well as the number of cases (n) that make up each of the means -t will be larger if the difference between M1 and M2 is larger. The value of t will also be larger if the SD^2 (variance) for each mean is smaller. The less the two groups overlap-either because their means are farther apart or because there is less variability within each group-the larger the value of t will be. Sample size (n) also influences t. All else being equal, if n is larger, then the denominator of the t formula becomes smaller, making t larger -The Jones team obtained a t value from their study of 2.57. Once we know this t value, we must decide whether it is statistically significant
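A sketch of this step from summary statistics alone, via SciPy's ttest_ind_from_stats. The means (8.10 and 7.15) come from the study description; the standard deviations, and the even 26/26 split of the 52 students, are assumptions for illustration:

```python
from scipy import stats

t, p = stats.ttest_ind_from_stats(
    mean1=8.10, std1=1.20, nobs1=26,   # birthday match group (SD assumed)
    mean2=7.15, std2=1.30, nobs2=26,   # control group (SD assumed)
)
print(f"t = {t:.2f}, p = {p:.3f}")
```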
Step 4: Deciding whether to reject or retain the null hypothesis
-When we use the computer, we simply compare the computer's estimated p value, p = .0056, to the conventional alpha level of .05. In this case, we will reject the null hypothesis and conclude that there is a significant difference among the means of the three groups in the Bushman study. In other words, we have concluded that there is a significant difference in anger among the control, venting, and exercise groups -Using the critical values table, we can compare the value we obtained, 5.23, to the critical value we looked up, 3.10. Because the F we obtained is even larger than the critical value, we can also reject the null hypothesis and conclude that there is a significant difference among the means of the three groups in this study -Just as with the t test, the two decision techniques lead to the same conclusion from different directions. When we use the computer's estimate of exact probability, we are assessing whether the computer's estimate of p is smaller than our alpha level of .05. But when we use the critical values table, we are assessing whether our obtained F is larger than the critical value of F. Larger F values are associated with smaller probabilities under the null hypothesis assumption -A statistically significant F means that there is some difference, somewhere, among the groups in your study. The next step is to conduct a set of post hoc comparisons to identify which of the group means is significantly different from the others
Rationale of ANOVA
-With ANOVA we are comparing the ratio of between-groups variability to within-groups variability -Between-Groups Variability >Variability in DV scores that is due to the effects of the IV -Within-Groups Variability (Error Variability) >Variability in DV scores that is due to factors other than the IV (individual differences, measurement error, and extraneous variation)
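A from-scratch sketch of partitioning these two sources into mean squares; the group scores are hypothetical:

```python
import numpy as np

groups = [np.array([4, 5, 6]), np.array([7, 8, 9]), np.array([5, 6, 7])]
grand_mean = np.concatenate(groups).mean()
k = len(groups)
n_total = sum(len(g) for g in groups)

# Between-groups: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-groups (error): spread of scores around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)       # IV effect (plus error)
ms_within = ss_within / (n_total - k)   # error variability alone
print(f"F = {ms_between / ms_within:.2f}")
```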
Advantages of Repeated Measures Designs
1. Participants in the groups are equivalent because they are the same participants and serve as their own controls 2. Gives researchers more power to notice differences between conditions 3. Requires fewer participants
Repeated-measures design
A repeated-measures design is a type of within-groups design in which participants are measured on a dependent variable more than once, after exposure to each level of the independent variable. Here's an example. Humans are social animals, and we know that many of our thoughts and behaviors are influenced by the presence of other people. Happy times may be happier, and sad times sadder, when experienced with others. Researchers Erica Boothby and her colleagues used a repeated-measures design to investigate whether a shared experience would be intensified even when people do not interact with the other person. They hypothesized that sharing a good experience with another person makes it even better than it would have been if experienced alone -They recruited 23 college women to a laboratory. Each participant was joined by a female confederate. The two sat side-by-side, facing forward, and never spoke to each other. The experimenter explained that each person in the pair would do a variety of activities, including tasting some dark chocolates and viewing some paintings. During the experiment, the order of activities was determined by drawing cards. The drawings were rigged so that the real participant's first two activities were always tasting chocolates. In addition, the real participant tasted the first chocolate at the same time the confederate was also tasting it, but tasted the second chocolate while the confederate was viewing a painting. The participant was told that the two chocolates were different, but in fact they were exactly the same. After tasting each chocolate, participants rated how much they liked it. The results showed that people liked the chocolate more when the confederate was also tasting it -In this study, the independent variable had two levels: sharing and not sharing an experience. Participants experienced both levels, making it a within-groups design. The dependent variable was the rating of the chocolate. It was a repeated-measures design because people rated the chocolate twice (i.e., repeatedly)
Sample Size
A second factor that influences power is the sample size used in a study. All else being equal, a study that has a larger sample will have more power to reject the null hypothesis if there really is an effect in the population. A large sample is analogous to carrying a big, bright flashlight and a small sample is analogous to carrying a candle, or a weak flashlight. The bright, powerful flashlight will enable you to find what you are looking for more easily. If you carry a candle, you might mistakenly conclude that your missing object is not in the dark room when it really is (a Type II error). Similarly, if a drug really reduces the symptoms of schizophrenia in the population, we are more likely to detect that effect in a study with a larger sample than with a smaller sample -Besides being less powerful, another problem with small samples is replicability. In a small sample, one or two unusual scores might occur simply by chance, leading to a fluke result that cannot be replicated. In contrast, in a large sample, extreme scores are more likely to be cancelled out by scores in the other direction, making such one-time results less likely
Inferential Statistics
A set of techniques that uses the laws of chance and probability to help researchers make decisions about the meaning of their data and the inferences they can make from that information
Effect Size
A third factor that influences power is the size of the effect in the population. All else being equal, when there is a large effect size in the population, there is a greater chance of conducting a study that rejects the null hypothesis. As an analogy, no matter what kind of light you use, you are much more likely to find a skateboard than an earring. Skateboards are just easier to see. Similarly, if, in the population, a drug has a large effect on schizophrenia symptoms, we would be likely to detect that effect easily, even if we used a small sample and a low alpha level. But if the drug has a small effect on schizophrenia symptoms in the population, we are less likely to find it easily and we might miss it in our sample. When there is a small effect size in the population, Type II errors are more likely to occur
Interaction Effect
Adding an additional independent variable allows researchers to look for an interaction effect (or interaction) - whether the effect of the original independent variable (cell phone use) depends on the level of another independent variable (driver age). Therefore, an interaction of two independent variables allows researchers to establish whether or not "it depends." They can now ask: Does the effect of cell phones depend on age?
Marginal Means
Bartholow and Heinz conducted a study with two independent variables, word type and photo type, measuring participants' reaction times to the words. First, to look for a main effect of word type, you would compute the reaction time to aggressive words (averaging across the two photo conditions), and the reaction time to neutral words (averaging across the two photo conditions). The results are two marginal means. Marginal means are the arithmetic means for each level of an independent variable, averaging over levels of the other independent variable. If the sample sizes are unequal, the marginal means will be computed using the weighted average, counting the larger sample more. There's not much difference overall between reaction times to the aggressive words (555 ms) and neutral words (557 ms). We would say that there appears to be no main effect of word type -Second, to find the main effect of photo type, the other independent variable, you would compute the reaction time after seeing the alcohol photos, averaged across the two word type conditions, and the reaction time after seeing the plant photos, also averaged across the two word type conditions. Here again, there is not much overall difference: On average, people are about as fast to respond after an alcohol photo (556.5 ms) as they are after a plant photo (555.5 ms). There appears to be no main effect of photo type
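A sketch of computing marginal means with a pandas pivot table; the reaction-time values and column names are hypothetical stand-ins patterned on this study:

```python
import pandas as pd

df = pd.DataFrame({
    "word":  ["aggressive", "aggressive", "neutral", "neutral"] * 2,
    "photo": ["alcohol", "plant"] * 4,
    "rt":    [560, 550, 553, 561, 558, 552, 555, 559],
})

# margins=True appends the marginal means for each row and column
table = df.pivot_table(values="rt", index="word", columns="photo",
                       aggfunc="mean", margins=True, margins_name="Marginal")
print(table)
```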
Main Effects? Is There a Difference (Factorial)
Because there are three independent variables in a 2 x 2 x 2 design, there will be three main effects to test. Each main effect represents a simple, overall difference: the effect of one independent variable, averaged across the other two independent variables. Remember that main effects test only one independent variable at a time. When describing each main effect, you don't mention the other two independent variables because you averaged across them
Power
Besides providing the ability to use each participant as his or her own control, within-groups designs also give researchers more power to notice differences between conditions. Statistically speaking, when extraneous differences (unsystematic variability) in personality, food preferences, gender, ability, and so on are held constant across all conditions, researchers will be more likely to detect an effect of the independent variable manipulation if there is one. In this context, the term power refers to the probability that a study will show a statistically significant result when an independent variable truly has an effect in the population. For example, if mindfulness training really does improve GRE scores, will the study's results find a difference? Maybe not. If extraneous differences exist between two groups, too much unsystematic variability may be obscuring a true difference. It's like being at a noisy party-your ability to detect somebody's words is hampered when many other conversations are going on around you
Mixed Factorial Design
In a mixed factorial design, one independent variable is manipulated as independent-groups and the other is manipulated as within-groups. The Strayer and Drews study on cell phone use while driving for two different age groups is an example of a mixed factorial design. Age was an independent-groups participant variable: Participants in one group were younger and those in the other group were older. But the cell phone condition independent variable was manipulated as within-groups. Each participant drove in both the cell phone and the control conditions of the study. If Strayer and Drews had wanted 50 people in each cell of their 2 x 2 mixed design, they would have needed a total of 100 people: 50 younger drivers and 50 older drivers, each participating at both levels of the cell phone condition
Pretest/posttest design
In a pretest/posttest design, or equivalent groups, pretest/posttest design, participants are randomly assigned to at least two different groups and are tested on the key dependent variable twice-once before and once after exposure to the independent variable
Three-Way Interactions: Are the Two-Way Interactions Different?
In a three-way design, the final result is a single three-way interaction. In the cell phone example, this would be the three-way interaction among driver age, cell phone condition, and traffic condition. A three-way interaction, if it is significant, means that the two-way interaction between two of the independent variables depends on the level of the third independent variable. In mathematical terms, a significant three-way interaction means that the "difference in differences...is different."
Two-Way Interactions: Is There a Difference in Differences? (Factorial)
In a three-way design, there are three possible two-way interactions. In the driving with cell phone example, these would be: 1. Age x traffic condition (a two-way interaction averaging over the cell phone condition variable) 2. Age x cell phone condition (a two-way interaction averaging over the traffic condition variable) 3. Cell phone condition x traffic condition (a two-way interaction averaging over the age variable) To inspect each of these two-way interactions, you construct three 2 x 2 tables. After computing the means, you can investigate the difference in differences using the table, just as you did for a two-way design. Alternatively, it might be easier to look for the interaction by graphing it and checking for non-parallel lines. (Statistical tests will show whether each two-way interaction is statistically significant or not)
Within-groups design
In a within-groups design, or within-subjects design, there is only one group of participants, and each person is presented with all levels of the independent variable. For example, Mueller and Oppenheimer might have run their study as a within-groups design if they had asked each participant to take notes twice: once using a laptop and another time in longhand
Within-Groups Factorial Designs
In a within-groups factorial design (also called a repeated-measures factorial), both independent variables are manipulated as within-groups. If the design is 2 x 2, there is only one group of participants, but they participate in all four combinations, or cells, of the design. The Bartholow and Heinz study was a within-groups factorial design. All participants saw both alcohol photos and plant photos, which alternated over successive trials. In addition, all participants responded to both aggression-related words and neutral words -A within-groups factorial design requires fewer participants. If Bartholow and Heinz had decided to use 50 people in each cell of their study, they would need a total of only 50 people because every person participates in each of the four cells. Therefore, within-groups designs make efficient use of participants. Because it was a within-groups design, the researchers counterbalanced the order of presentation of photos and words by having the computer present the photos and their subsequent words in a different random order for each participant
Independent Variable Conditions
In an experiment, the manipulated (causal) variable is the independent variable. The name comes from the fact that the researcher has some "independence" in assigning people to different levels of this variable. A study's independent variable should not be confused with its levels, which are also referred to as conditions. The independent variable in the van Kleef study was serving bowl size, which had two conditions: medium and large
Selection Effects
In an experiment, when the kinds of participants in one level of the independent variable are systematically different from those in the other, selection effects can result. They can also happen when the experimenters let participants choose (select) which group they want to be in. A selection effect may result if the experimenters assign one type of person (e.g., all the women, or all who sign up early in the semester) to one condition, and another type of person (e.g., all the men, or all those who wait until later in the semester) to another condition
Independent-Groups Factorial Designs
In an independent-groups factorial design (also known as a between-subjects factorial), both independent variables are studied as independent groups. Therefore, if the design is a 2 x 2, there are four different groups of participants in the experiment. The DeWall team's study on alcohol, aggression, and body weight was an independent-groups factorial: Some lighter-weight men drank a placebo beverage, other light men drank an alcoholic one, some heavier men drank a placebo beverage, and other heavy men drank an alcoholic one. In other words, there were different men in each cell. If the researchers decided to use 50 participants in each cell of the design, they would have needed a full 200 participants: 50 in each of the four groups.
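The participant counts in these three design types (200 for a fully independent-groups 2 x 2, 100 for the mixed design, 50 for the fully within-groups design) follow a single rule: multiply the cell size by the number of levels of every between-subjects factor, since within-groups factors reuse the same people. A minimal sketch, using a hypothetical helper function:

    def participants_needed(n_per_cell, levels_between=()):
        """Total participants: n per cell times the product of the levels of
        every independent-groups (between-subjects) factor. Within-groups
        factors reuse the same people, so they add nothing to the total."""
        total = n_per_cell
        for levels in levels_between:
            total *= levels
        return total

    print(participants_needed(50, levels_between=(2, 2)))  # fully between 2 x 2 -> 200
    print(participants_needed(50, levels_between=(2,)))    # mixed 2 x 2 -> 100
    print(participants_needed(50))                         # fully within 2 x 2 -> 50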
Samples and Populations
In doing research with human participants, our interest is not in the samples we have measured, but in what these samples tell us about the population from which they were drawn -We want to generalize, or infer, from our samples to the larger population
Manipulation Checks and Pilot Studies
In other studies, researchers need to use manipulation checks to collect empirical data on the construct validity of their independent variables. A manipulation check is an extra dependent variable that researchers can insert into an experiment to convince them that their experimental manipulation worked. The same procedure might also be used in a pilot study. A pilot study is a simple study, using a separate group of participants, that is completed before (or sometimes after) conducting the study of primary interest. Kaplan and Pascoe might have exposed a separate group of students to either a serious or a humorous lecture, and then asked them how amusing they found it. Researchers may use pilot study data to confirm the effectiveness of their manipulation before using it in the target study
Avoiding Selection Effects with Matched Groups
In the simplest type of random assignment, researchers assign participants at random to one condition or another in the experiment. In certain situations, researchers may wish to be absolutely sure the experimental groups are as equal as possible before they administer the independent variable. In these cases, they may choose to use matched groups, or matching -To create matched groups from a sample of 30, the researchers would first measure the participants on a particular variable that might matter to the dependent variable. Student ability, operationalized by GPA, for instance, might matter in a study of notetaking. They would next match participants up in pairs, starting with the two having the highest GPAs, and within that matched set, randomly assign one of them to each of the two notetaking conditions. They would then take the pair with the next-highest GPAs and within that set again assign randomly to the two groups. They would continue this process until they reach the participants with the lowest GPAs and assign them at random, too
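As an illustration, here is a minimal Python sketch of that matching procedure, using invented participant IDs and GPAs; the rank-pair-then-randomly-assign logic is the part that mirrors the description above:

    import random

    def matched_groups(participants, key, seed=None):
        """Rank participants on the matching variable, pair adjacent ones,
        and randomly assign one member of each pair to each condition."""
        rng = random.Random(seed)
        ranked = sorted(participants, key=key, reverse=True)  # highest GPA first
        condition_a, condition_b = [], []
        for i in range(0, len(ranked) - 1, 2):
            pair = [ranked[i], ranked[i + 1]]
            rng.shuffle(pair)          # random assignment within the matched pair
            condition_a.append(pair[0])
            condition_b.append(pair[1])
        return condition_a, condition_b

    sample = [(f"P{i:02d}", gpa) for i, gpa in
              enumerate([3.9, 3.7, 3.7, 3.5, 3.2, 3.1, 2.9, 2.8, 2.5, 2.3])]
    laptop, longhand = matched_groups(sample, key=lambda p: p[1], seed=1)

Each pair is closely matched on GPA, so any ability differences between the two notetaking conditions should be minimal before the manipulation even begins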
The t Test
Interpretation of t Value -Textbook: Determine the degrees of freedom (df) involved, and use a table to determine the t necessary for a p-value less than alpha >This table contains t values that occur by chance >Compare your t value to these chance values >To be significant, the calculated t must be equal to or larger than the one in the table -Reality: Statistical analysis software will give a significance value that corresponds with the obtained t value, which can be compared to alpha (e.g., .05) >The p-value is inversely related to the t value (as t goes up, p goes down)
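In practice, the software route looks something like the following sketch (Python with scipy; the two groups here are randomly generated stand-ins, not data from any real study):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    group1 = rng.normal(5.0, 1.0, 26)   # stand-in for a treatment group
    group2 = rng.normal(4.3, 1.0, 26)   # stand-in for a control group
    result = stats.ttest_ind(group1, group2)
    df = len(group1) + len(group2) - 2
    print(f"t({df}) = {result.statistic:.2f}, p = {result.pvalue:.3f}")
    # Significant at alpha = .05 whenever the reported p is at or below .05
    print("significant" if result.pvalue <= .05 else "not significant")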
Confidence Interval
It's becoming more common for researchers to present the confidence intervals for their results. A confidence interval provides a range which is likely to include the true population value (e.g., the mean of a population, or the difference between two means in the population). By convention, confidence intervals correspond to the traditional alpha level of .05 and provide us with 95% confidence that our interval contains the true population value
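For two independent groups, the 95% confidence interval for the difference between means can be computed from the pooled variance. A minimal sketch using the standard pooled-variance t interval (the data arrays are placeholders):

    import numpy as np
    from scipy import stats

    def ci_mean_difference(x, y, confidence=0.95):
        """Pooled-variance t interval for the difference between two means."""
        nx, ny = len(x), len(y)
        diff = np.mean(x) - np.mean(y)
        pooled = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        se = np.sqrt(pooled * (1 / nx + 1 / ny))
        t_crit = stats.t.ppf((1 + confidence) / 2, df=nx + ny - 2)
        return diff - t_crit * se, diff + t_crit * se

    # If the interval excludes 0, the difference is significant at alpha = .05
    low, high = ci_mean_difference([5.1, 4.8, 5.6, 5.0], [4.2, 4.5, 3.9, 4.4])
    print(f"95% CI: [{low:.2f}, {high:.2f}]")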
Describing Interactions in Words
It's one thing to determine that a study has an interaction effect; it's another to describe the pattern of the interaction in words. Since there are many possible patterns for interactions, there's no standard way to describe one: It depends on how the graph looks, as well as on how the researcher frames the results -A foolproof way to describe an interaction is to start with one level of the first independent variable (that is, the first category on the x-axis), explain what's happening with the second independent variable, then move to the next level of the first independent variable (the next category on the x-axis) and do the same thing. For example, for the interaction in the Bartholow and Heinz study, you might describe it like this: "When people saw photos of alcohol, they were quicker to recognize aggression words than neutral words, but when people saw photos of plants, they were slower to recognize aggression words than neutral words." As you move across the x-axis, you make it clear that the effect of the other independent variable (word type) is changing -Another way to describe interactions involves key phrases. Some interactions, like a crossover interaction, can be described using the phrase "it depends," as in: "The memory capacity of children depends on their level of expertise." Other interactions can be described using the phrase "especially for," as in: "Alcohol leads to aggression, especially for heavy guys."
Measured Variables
Measured variables take the form of records of behavior or attitudes, such as self-reports, behavioral observations, or physiological measures. After an experimental situation is set up, the researchers simply record what happens. In their first study, Mueller and Oppenheimer measured student performance on the essay questions. After manipulating the notetaking method, they watched and recorded - that is, they measured - how well people answered the factual and conceptual questions. The van Kleef team manipulated the serving bowl size, and then measured two variables: how much pasta people took and how much they ate
Analyzing Multi-Group Designs
Multi-group designs are primarily analyzed with analysis of variance (ANOVA) -An ANOVA used to analyze a multi-group design with one IV is known as a one-way ANOVA
Alpha Level
The alpha level is the first factor that influences power. Usually set at .05, alpha is the point at which researchers decide whether or not to reject the null hypothesis. When researchers set alpha lower in a study (say, at .01), it will be more difficult for them to reject the null hypothesis. But if it is harder to reject the null hypothesis, it is also more likely that they will retain the null hypothesis even if it deserves to be rejected. Therefore, when researchers set the alpha level low (usually to avoid Type I errors), they increase the chances of making a Type II error. In one sense, Type I and Type II errors compete: As the chance of Type I errors goes down, the chance of Type II errors goes up -Because of this competition between Type I and Type II errors, alpha levels are conventionally set at .05. According to most researchers, this level is low enough to keep Type I error rates under control but high enough that they can still find a significant result (keeping Type II error rates down)
Comparison Group
The covariance criterion might seem obvious. In our everyday reasoning, though, we tend to ignore its importance because most of our personal experiences do not have the benefit of a comparison group, or comparison condition. For instance, you might suspect that your mom's giant pasta bowl is making you eat too much, but without a comparison bowl, you cannot know for sure. An experiment, in contrast, provides the comparison group you need. Therefore, an experiment is a better source of information than your own experience because an experiment allows you to ask and answer: Compared to what?
Dependent Variable
The measured variable is the dependent variable, or outcome variable. How a participant acts on the measured variable depends on the level of the independent variable. Researchers have less control over the dependent variable; they manipulate the independent variable and then watch what happens to people's self-reports, behaviors, or physiological responses. A dependent variable is not the same as its levels, either. The dependent variable in the van Kleef study was the amount of pasta eaten (not "200 calories")
Posttest-only design
The posttest-only design is one of the simplest independent-groups experimental designs. In the posttest-only design, also known as an equivalent groups, posttest-only design, participants are randomly assigned to independent variable groups and are tested on the dependent variable once
Control Group Treatment Group(s) Placebo Group
There are a couple of ways an independent variable might be designed to show covariance. Your early science classes may have emphasized the importance of a control group in an experiment. A control group is a level of an independent variable that is intended to represent "no treatment" or a neutral condition. When a study has a control group, the other level or levels of the independent variable are usually called the treatment group(s). For example, if an experiment is testing the effectiveness of a new medication, the researchers might assign some participants to take the medication (the treatment group) and other participants to take an inert sugar pill (the control group). When the control group is exposed to an inert treatment such as a sugar pill, it is called a placebo group, or a placebo control group.
Two Possible Errors
There are also two ways to make an incorrect decision. (1) We could conclude that Sarah probably has special abilities, when she really does not. This kind of mistake is known as a Type I Error, a "false positive." (2) We could conclude that Sarah probably does not have special abilities, when she really does. This kind of mistake is known as a Type II Error, or a "miss."
Full Counterbalancing
There are two methods for counterbalancing an experiment: full and partial. When a within-groups experiment has only two or three levels of an independent variable, researchers can use full counterbalancing, in which all possible condition orders are represented. For example, a repeated-measures design with two conditions is easy to counterbalance because there are only two orders (A->B and B->A). In a repeated-measures design with three conditions - A, B, and C - each group of participants could be randomly assigned to one of the six following sequences: A->B->C, A->C->B, B->A->C, B->C->A, C->A->B, C->B->A
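Enumerating every possible order is exactly what full counterbalancing requires, and Python's itertools makes the idea concrete. A minimal sketch (participant IDs are invented; each sequence ends up used equally often):

    import random
    from itertools import permutations

    conditions = ["A", "B", "C"]
    orders = list(permutations(conditions))       # all 3! = 6 possible sequences

    participants = [f"P{i:02d}" for i in range(12)]
    random.seed(0)
    random.shuffle(participants)                  # random assignment to sequences
    assignment = {p: orders[i % len(orders)]      # cycle so each order is used equally
                  for i, p in enumerate(participants)}
    for person, order in sorted(assignment.items()):
        print(person, "->".join(order))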
Two Possible Correct Conclusions
There are two ways to make a correct decision. (1) We could conclude from the sample of behavior that Sarah has a special ability to identify smart people (reject the null hypothesis), and in truth, Sarah really does-so our conclusion is correct. (2) We could decide that we cannot confidently conclude that Sarah can identify smart people (retain the null hypothesis), and in truth, she really cannot. Again, our conclusion would be correct
Experimental Design
Three questions guide experimental design: -How many Independent Variables (IV)? >An experiment must have at least one IV -How many levels of the IV? >Although an experiment may have only one IV, it must have at least two levels (a.k.a. groups or conditions) >Simplest case scenario: one IV with two levels -How are participants assigned to levels?
The Decision to Retain the Null Hypothesis
To clarify this decision, consider a situation in which we would retain the null hypothesis: the first scenario, in which Sarah identified a person with a score of 115. The probability that Sarah would identify a person who is at least that smart just by chance, even if she is not special, is 16% or .16. That probability seemed high enough that we would not reject our initial assumption that she is not able to detect smart people (the null hypothesis)
Step 1: Stating the null hypothesis
To go through the hypothesis-testing steps for the Bushman study, which compared groups who sat quietly, vented their anger, or exercised, we start by assuming the null hypothesis. In this case, we assume there is no difference among the three groups. In other words, the null hypothesis is that all possible differences among the three means equal zero
Sampling Distribution of t
To start, it helps to know that we never actually create a sampling distribution; instead, we estimate what its properties will be. To do so, we theorize about what values of t we would get if the null hypothesis is true in the population. If we were to run the study many, many times, drawing different random samples from this same null hypothesis population, what values of t would we get? Most of the time, we should find only a small difference in the means, so t would be close to zero most of the time (i.e., the numerator of the t test would be close to zero, making t close to zero). Therefore, the average of the t values should be around 0.00. And just by chance, t might be a little lower than 0.00 about half the time and a little higher about half the time. Sometimes, we might even get a t that is much higher or lower than 0.00. Thus, when the null hypothesis is true, we will still get a variety of values of t, but they will average around zero -In null hypothesis significance testing, sampling distributions of t are always centered at zero because they are always created based on the assumption that the null hypothesis is true. But the width of the sampling distribution of t will depend on the sample size (i.e., the sample size of the study that we hypothetically run many times). When the sample size in our study is small, the sampling distribution will be wider and will result in more values at the extremes. This occurs because a small sample is more likely, just by chance, to obtain t values that are far from the mean because of sampling error. In contrast, in a large sample, the sampling distribution will be thinner -Sampling distributions are created based on sample size, but they do not use sample sizes directly; instead, they use a slightly smaller number called degrees of freedom. In the present example, the degrees of freedom are computed from the number of people in the first group, minus 1, plus the number of people in the second group, minus 1: (26-1)+(26-1)=50. -This sampling distribution of t tells us what values of t we would be likely to get if the null hypothesis is true. It also tells us that occasionally it is possible (but very rare) to get t values of 1.8, 2.0, or even larger, just by chance
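Although we never literally build the sampling distribution, simulating one makes the logic visible. The sketch below (hypothetical normal populations; two groups of 26 as in the example, so df = 50) repeatedly draws samples from a single null-hypothesis population and collects the resulting t values:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n1 = n2 = 26                          # df = (26-1)+(26-1) = 50
    ts = []
    for _ in range(10_000):
        a = rng.normal(0, 1, n1)          # both samples come from the same
        b = rng.normal(0, 1, n2)          # population: the null is true
        ts.append(stats.ttest_ind(a, b).statistic)
    ts = np.asarray(ts)
    print(f"mean t: {ts.mean():.3f}")                                  # about 0.00
    print(f"share with |t| >= 2.0: {(np.abs(ts) >= 2.0).mean():.3f}")  # rare, about .05

The simulated t values cluster around zero, with extreme values appearing only occasionally, just as the passage describes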
Avoiding Selection Effects with Random Assignment
Well-designed experiments often use random assignment to avoid selection effects. In the pasta bowl study, an experimenter flipped a coin to decide which participants would be in each group, so each one had an equal chance of being in the large-bowl or medium-bowl condition. What does this mean? Suppose that, of the 68 participants who volunteered for the study, 20 were exceptionally hungry that day. Probabilistically speaking, the coin flips would have placed about 10 of the hungry people in the medium-bowl condition and about 10 in the large-bowl condition. Similarly, if 12 of the participants were dieting, random assignment would place about 6 of them in each group. In other words, since the researchers used random assignment, it's very unlikely, given the random (deliberately unsystematic) way people were assigned to each group, that all the hungry people, dieters, and so on would have been clustered in the same group
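The logic of random assignment can be sketched in a few lines of Python (a shuffle standing in for the coin flips; the participant IDs are placeholders):

    import random

    volunteers = [f"P{i:02d}" for i in range(68)]  # the study's 68 volunteers
    random.seed(0)
    random.shuffle(volunteers)                     # deliberately unsystematic ordering
    half = len(volunteers) // 2
    medium_bowl, large_bowl = volunteers[:half], volunteers[half:]
    # Hungry people, dieters, and so on now end up scattered across both groups
    print(len(medium_bowl), len(large_bowl))       # 34 and 34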
Inferential statistics
What do statistics do for us? -We are always uncertain. Statistics are tools to evaluate the likelihood of what we believe to be true -Statistics inform our decisions!
Interactions are more important than main effects
When researchers analyze the results of a factorial design, they look at main effects for each independent variable and they look for interactions. When a study shows both a main effect and an interaction, the interaction is almost always more important
Control Variable
When researchers are manipulating an independent variable, they need to make sure they are varying only one thing at a time - the potential causal force or proposed "active ingredient" (e.g., only the form of notetaking, or only the size of the serving bowl). Therefore, besides the independent variable, researchers also control potential third variables (or nuisance variables) in their studies by holding all other factors constant between the levels of the independent variable. For example, Mueller and Oppenheimer manipulated the method people used to take notes, but they held constant a number of other potential variables: People in both groups watched lectures in the same room and had the same experimenter. They watched the same videos and answered the same questions about them, and so on. Any variable that an experimenter holds constant on purpose is called a control variable
Procedures Behind Counterbalancing
When researchers counterbalance conditions (or levels) in a within-groups design, they have to split their participants into groups; each group receives one of the condition sequences. How do the experimenters decide which participants receive the first order of presentation and which ones receive the second? Through random assignment, of course! They might recruit, say, 30 participants to a study and randomly assign 15 of them to receive the order A then B, and assign 15 of them to the order B then A
Factorial Designs Study Two Independent Variables
When researchers want to test for interactions, they do so with factorial designs. A factorial design is one in which there are two or more independent variables (also referred to as factors). In the most common factorial design, researchers cross the two independent variables; that is, they study each possible combination of the independent variables. Strayer and Drews created a factorial design to test whether the effect of driving while talking on a cell phone depended on the driver's age. They used two independent variables (cell phone use and driver age), creating a condition representing each possible combination. To cross the two independent variables, they essentially overlaid one independent variable on top of another. This overlap process created four unique conditions, or cells: younger drivers using cell phones, younger drivers not using cell phones, older drivers using cell phones, and older drivers not using cell phones
Step 1: Assume there is no effect (The Null Hypothesis)
When we tested Sarah's abilities, we started with the skeptical assumption that she was not able to identify smart people. In statistical hypothesis testing, this kind of starting assumption is known as the null hypothesis. Null means "nothing," and colloquially, a null hypothesis means "assume that nothing is going on." Depending on the research question, the null hypothesis can mean that a person does not have special abilities, that an independent variable does not have an effect on the dependent variable, or that two variables are not correlated with each other
Variations on Factorial Designs
When you add a level to an IV in a factorial design, you add several groups to your experiment because each new level must be added under each level of your other independent variable(s) -Adding levels in a factorial design increases groups in a multiplicative fashion -For example, whereas a 2 x 2 design requires 4 total groups, a 3 x 2 design requires 6 total groups
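Because groups multiply, the total cell count is simply the product of each IV's number of levels; a one-line helper makes the point (a sketch, not from any particular package):

    from math import prod

    def total_cells(*levels_per_iv):
        """Number of groups (cells) in a fully crossed factorial design."""
        return prod(levels_per_iv)

    print(total_cells(2, 2))      # 2 x 2 -> 4 groups
    print(total_cells(3, 2))      # 3 x 2 -> 6 groups
    print(total_cells(2, 2, 2))   # three-way 2 x 2 x 2 -> 8 groups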
Power
Whereas preventing a Type I error involves only one factor (alpha level), preventing a Type II error depends on a set of factors, collectively known as power. Formally defined, power is the likelihood of not making a Type II error when the null hypothesis is false. In positive terms, it is the probability that a researcher will be able to reject the null hypothesis if it should be rejected (i.e., if there is really some effect in the population). If we are testing a schizophrenia drug that, in truth, does reduce the symptoms of schizophrenia, then power refers to how likely a study is to detect that effect by finding a significant result -The alpha level, the sample size, the effect size, the degree of variability in the sample, and the choice of statistical test all have an impact on the power of a study, and thus they all influence the Type II error rate
Detecting Interactions from a Graph
While it's possible to compute interactions from a table, it is sometimes easier to notice them on a graph. When results from a factorial design are plotted as a line graph and the lines are not parallel, there may be an interaction, something you would confirm with a significance test. If the lines are crossed, you would suspect an interaction. If the lines are parallel, there probably is no interaction. The lines don't have to cross to indicate an interaction; they simply have to be nonparallel.
Using the Sampling Distribution to Evaluate Significance, or p
Why did we go to the trouble of estimating that sampling distribution anyway? Doing so helps us complete Step 3 of the null hypothesis testing process: Now that we have derived the sampling distribution, we can use it to evaluate the probability of getting the t obtained in the Jones study (2.57), or an even more extreme value of t, if the null hypothesis is true in the population -One way to determine this probability is to find where our obtained t falls on the x-axis of the sampling distribution. Then we use calculus to determine the area under the curve from that point outward, which gives us the probability of obtaining a t as large as, or larger than, 2.57 when the null hypothesis is true. Researchers typically compute this probability using a computer program, such as SPSS, JASP, or R. The p that the computer reports is the exact probability of getting a t that extreme or more extreme if the null hypothesis is true. In the Jones et al. case, the computer reported that the probability of obtaining a t value of 2.57 or more extreme, with 50 degrees of freedom, is exactly .013. -Another way to determine the probability of a particular t value is to use the table in Appendix B, Critical Values of t. This table shows the probability of obtaining different values of t for different degrees of freedom. Such a table does not give the exact area under the curve, as the computer can do. However, using that table, we can look up the critical value of t - the t value that is associated with our alpha level. For example, we can look up the critical t value associated with a .05 alpha level and 50 degrees of freedom: 2.009. This critical value of t means that in a sampling distribution of t based on 50 degrees of freedom, we would get a t of 2.009 or more extreme 5% of the time, just by chance. In Step 4, we will compare the critical value of t to the t we actually obtained, 2.57
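Both routes, the exact p and the critical value, are one line each with scipy, using the numbers from the passage (t = 2.57, df = 50, alpha = .05):

    from scipy import stats

    t_obtained, df, alpha = 2.57, 50, .05
    p_two_tailed = 2 * stats.t.sf(t_obtained, df)   # exact area in both tails
    t_critical = stats.t.ppf(1 - alpha / 2, df)     # critical value for alpha = .05
    print(f"p = {p_two_tailed:.3f}")                # 0.013, matching the study
    print(f"critical t = {t_critical:.3f}")         # 2.009, matching the table
    print("reject the null" if abs(t_obtained) >= t_critical else "retain the null")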
Order Effects
Within-groups designs have the potential for a particular threat to internal validity: Sometimes, being exposed to one condition changes how participants react to the other condition. Such responses are called order effects, and they happen when exposure to one level of the independent variable influences responses to the next level. An order effect in a within-groups design is a confound, meaning that behavior at later levels of the independent variable might be caused not by the experimental manipulation, but rather by the sequence in which the conditions were experienced
Demand Characteristic
Within-groups designs have three main disadvantages. First, they have the potential for order effects. Second, a within-groups design might not always be possible or practical. A third problem occurs when people see all levels of the independent variable and then change the way they would normally act. If participants in the van Kleef pasta bowl study had seen both the medium and large serving bowls (instead of just one or the other), they might have thought, "I know I'm participating in a study at the moment; seeing these two bowls makes me wonder whether it has something to do with serving bowl size." As a result, they might have changed their spontaneous behavior. A cue that can lead participants to guess an experiment's hypothesis is known as a demand characteristic, or an experimental demand. Demand characteristics create an alternative explanation for a study's results. You would have to ask: Did the manipulation really work, or did the participants simply guess what the researchers expected them to do, and act accordingly?
Estimating Interactions from a Table
You can use a table to estimate whether a study's results show an interaction. Because an interaction is a difference in differences, you start by computing two differences. Begin with one level of the first independent variable: the alcohol photos. The difference in reaction time between the aggressive and neutral words for the alcohol photos is 551-562=-11ms. Then go to the second level of the first independent variable: the plant photos. The difference in reaction time between the aggressive and neutral words for the plant photos is 559-552=7ms. (Be sure to compute the difference in the same direction both times; in this case, always subtracting the results for the neutral words from those for the aggressive words.) There are two differences: -11ms and 7ms. These differences are different: One is negative and one is positive. Indeed, statistical tests told the researchers that the difference of 18ms is statistically significant. Therefore, you can conclude that there is an interaction in this factorial study
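The same arithmetic in code, using the four cell means quoted above (in ms); nothing here is new data, just the difference-in-differences computation spelled out:

    # Mean reaction times (ms) from the description above
    rt = {("alcohol", "aggressive"): 551, ("alcohol", "neutral"): 562,
          ("plant", "aggressive"): 559, ("plant", "neutral"): 552}

    diff_alcohol = rt[("alcohol", "aggressive")] - rt[("alcohol", "neutral")]  # -11 ms
    diff_plant = rt[("plant", "aggressive")] - rt[("plant", "neutral")]        # +7 ms
    interaction = diff_alcohol - diff_plant   # -18 ms: the difference in differences
    print(diff_alcohol, diff_plant, interaction)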
Participant Variable
You might have noticed that one of the variables, cell phone use, was truly manipulated; the researchers had participants either talk or not talk on cell phones while driving. The other variable, age, was not manipulated; it was a measured variable. The researchers did not assign people to be older or younger; they simply selected participants who fit those levels. Age is an example of a participant variable - a variable whose levels are selected (i.e., measured), not manipulated. Because the levels are not manipulated, variables such as age, gender, and ethnicity are not truly "independent" variables. However, when they are studied in a factorial design, researchers often call them independent variables for the sake of simplicity
Systematic Variability Is the Problem
You need to be careful before accusing a study of having a design confound. Not every potentially problematic variable is a confound. Consider the example of the pasta bowl experimenters. It might be the case that some of the research assistants were generous and welcoming, and others were reserved. The attitude of the research assistants is a problem for internal validity only if it shows systematic variability with the independent variable. Did the generous assistants work only with the large-bowl group and the reserved ones only with the medium-bowl group? Then it would be a design confound. However, if the research assistants' demeanor showed unsystematic variability (random or haphazard) across both groups, then their attitude would not be a confound