ANOVA
Replication Crisis Culprits
1. Reliance on p-values only
2. Easy manipulation of significance
3. Publication bias
Scientific method = the key process for experimentation, used to observe and answer important questions about the world.
WITHIN-GROUP VARIABILITY (Error variability)
The amount of variability there is around each individual condition mean.
•If there is not a lot of variability around each condition mean, SSwithin is small
•Greater variability around each condition mean increases SSwithin, producing a larger within-groups variance
SStotal
an index of the overall variability around the grand mean. ∑(X − GM)²
post hoc test
multiple comparisons test that is appropriate after examination of the data
a priori test
multiple comparisons test that must be planned before examination of the data
cell
scores that receive the same combination of treatments
residual variance
variability due to unknown or uncontrolled variables; error term
Post-hoc tests
Between-groups design: Tukey tests, Bonferroni, others
Within-groups design: Tukey is not designed for within-groups designs; can use Bonferroni, though.
There are many versions of post-hoc tests. Some are more conservative than others; in other words, corrections intended to avoid Type I errors can make it very difficult to reject H0.
repeated-measures ANOVA
statistical technique for designs with repeated measures of subjects or groups of matched subjects
Publication bias
•Publishing results based on significance instead of the direction or strength of the study findings
•Privileges significance over estimation
•Most null findings (i.e., no significant results) never see the light of day
•Leads to a biased understanding of psychological phenomena
•May lead researchers to engage in shady research practices: p-hacking and HARKing
Reproducibility
•Obtaining consistent results using the same original data, methodology, and analysis
BETWEEN-GROUP VARIABILITY (Treatment variability)
The amount of variability there is around the overall mean (the Grand Mean).
•Imagine A = B = C. Then the average of (A, B, C) = A = B = C, and each condition mean will not deviate from the Grand Mean.
•Now imagine they are not equal. Each condition mean will deviate from the Grand Mean.
•The more they deviate from the Grand Mean, the larger the effect and the larger the between-groups variance.
Simple effect
The effect of one IV on the DV, at a specific level of the other IV
SSbetween
an index of the amount of variability between conditions: measured as variability of the condition means around the grand mean. ∑(X̄ − GM)², i.e., condition mean − grand mean, for each score.
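The three SS cards above fit together as a partition: SStotal = SSbetween + SSwithin. A minimal sketch, using made-up scores for three conditions (the data are hypothetical, chosen only to illustrate the identity):

```python
# Hypothetical toy data: three conditions, n = 4 scores each
groups = [[2, 4, 3, 5], [6, 7, 5, 8], [9, 8, 10, 9]]

all_scores = [x for g in groups for x in g]
grand_mean = sum(all_scores) / len(all_scores)

# SStotal: squared deviation of every score from the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

# SSbetween: squared deviation of each condition mean from the grand mean,
# counted once per score in that condition (hence the n * ... weighting)
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# SSwithin: squared deviation of each score from its own condition mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# The partition identity: SStotal = SSbetween + SSwithin
assert abs(ss_total - (ss_between + ss_within)) < 1e-9
```

The identity is why ANOVA can "partition" variance: every deviation from the grand mean splits exactly into a between-condition part and a within-condition part.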
Effects in Two-way ANOVAs
answer three questions:
•Is there an effect of the first IV? (main effect)
•Is there an effect of the second IV? (main effect)
•Is there an interaction between the IVs?
You can have ANY combination of main effects and interactions. When there is a significant interaction, the main effects don't tell the whole story; main effects can even be misleading when there is a significant interaction. We would follow up with tests of simple effects to characterize the interaction (ANOVA does not actually give us answers about each simple effect).
The interaction typically overrides main effects:
•IF there is an interaction, we pay attention to the interaction and to the simple effects.
•IF there is NOT an interaction, we pay most attention to the main effects.
BUT, in practice: to know whether there really are main effects or an interaction, we need to run the ANOVA; visual inspection alone cannot tell us which (if any) effects are statistically significant.
Null hypotheses (driving example):
•H0: μFocused = μDistracted (driving is not affected by level of distraction)
•H0: μMorning = μNight (driving is not affected by time of day)
•[not easily formulated as symbols] The effect of distraction on driving does not depend on time of day
Research hypotheses:
•H1: μFocused ≠ μDistracted (driving is affected by level of distraction): main effect of distraction?
•H1: μMorning ≠ μNight (driving is affected by time of day): main effect of time of day?
•[not easily formulated as symbols] The effect of distraction on driving depends on time of day: interaction?
dfrows(time of day) = Nrows − 1 = 2 − 1 = 1
dfcolumns(distraction) = Ncolumns − 1 = 2 − 1 = 1
dfinteraction = (dfrows)(dfcolumns) = (1)(1) = 1
dfwithin = dfF,M + dfF,N + dfD,M + dfD,N = 2 + 2 + 2 + 2 = 8 (each cell: 3 − 1 = 2)
dftotal = Ntotal − 1 = 12 − 1 = 11
Because the df for between and within are the same for all three tests, we can use the same F critical value. In a 2 x 3 factorial design, dfrows and dfcolumns would be different.
If done by hand, we would next need to calculate sums of squares for each source of variance:
SStotal = ∑(X − GM)² = 536.25
SSA = ∑(X̄A − GM)² = 216.75
SSB = ∑(X̄B − GM)² = 168.75
SSwithin = ∑(X − X̄cell)² = 132
SSAxB = SStotal − SSA − SSB − SSwithin = 18.75
Each F is MSeffect / MSwithin; then reject or fail to reject the null for each of the three tests.
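The df bookkeeping and the subtraction trick for SSAxB can be checked in a few lines. A sketch using the numbers above (2 x 2 design, n = 3 per cell, and the slide's SS values):

```python
# Degrees of freedom for the 2 x 2 (time of day x distraction) design, n = 3 per cell
n_rows, n_cols, n_per_cell = 2, 2, 3
df_rows = n_rows - 1                                  # 1
df_cols = n_cols - 1                                  # 1
df_interaction = df_rows * df_cols                    # 1
df_within = n_rows * n_cols * (n_per_cell - 1)        # 4 cells x 2 = 8
df_total = n_rows * n_cols * n_per_cell - 1           # 12 - 1 = 11
assert df_rows + df_cols + df_interaction + df_within == df_total

# SS for the interaction by subtraction, using the values from the slide
ss_total, ss_a, ss_b, ss_within = 536.25, 216.75, 168.75, 132.0
ss_axb = ss_total - ss_a - ss_b - ss_within
assert abs(ss_axb - 18.75) < 1e-9

# Each F is MS(effect) / MS(within); the same MSwithin serves all three tests
ms_within = ss_within / df_within                     # 16.5
f_a = (ss_a / df_rows) / ms_within
f_b = (ss_b / df_cols) / ms_within
f_axb = (ss_axb / df_interaction) / ms_within
```

Note how the four df sources sum exactly to dftotal, mirroring the SS partition.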
factorial ANOVA
experimental design with two or more independent variables
factor
independent variable
SSwithin
is an index of the variability around each condition mean. BUT, in a repeated-measures design, subjects variability is already accounted for, so that part is removed from this SS.
Between-groups ANOVA: SSwithin = ∑(X − X̄)², i.e., each score − its sample mean.
Within-groups (repeated-measures) ANOVA: SSwithin = SStotal − SSbetween − SSsubjects.
one-way ANOVA
statistical test of the hypothesis that two or more population means in an independent-samples design are equal
error term
variance due to factors not controlled in the experiment
interaction
when the effect of one independent variable on the dependent variable depends on the level of another independent variable
ANOVA nomenclature
•preceded by two adjectives that indicate:
1. Number of independent variables: "one-way," "two-way"
2. Research design: "between-groups," "repeated-measures" (or "within-groups")
ANOVA asks: is there much more between-condition variability (e.g., L vs. R) than within-condition variability (within L or within R)? (Of course, we'd often have three conditions, not just L vs. R.) In other words, is it likely that these samples came from the same population?
Tukey HSD test
significance test for all possible pairs of treatments in a multitreatment experiment
Interaction
When the effect of one independent variable depends on the specific level of the other independent variable; that is, when the effect of one IV on the DV changes across levels of the other IV.
To define/describe an interaction, start with the simple effects, then compare them.
•Simple effect: the effect of one IV at a specific level of the other IV
If the simple effects differ from each other in size or direction, there is an interaction, which is the primary reason for running a factorial experiment.
Multiple-Comparison Tests
When you calculate a significant F in ANOVA, how do you know where the differences are?
•a priori (planned ahead of time)
•post-hoc ("after the fact")
The omnibus ANOVA does not tell us which means differ, so we must do follow-up tests (only if the overall F is significant): Tukey's HSD, Bonferroni, Scheffé, LSD.
Tukey's HSD test: developed in reaction to the LSD test. Studies have shown the procedure accurately maintains alpha levels at their intended values as long as statistical model assumptions are met (i.e., normality, homogeneity, independence). Tukey's HSD was designed for equal sample sizes per group, but can be adapted to unequal sample sizes as well (the simplest adaptation uses the harmonic mean of the n-sizes as n*). The critical difference is HSD = q√(MSwithin / n), where q is the studentized range statistic.
Scheffé's test: perhaps the most popular of the post hoc procedures, the most flexible, and the most conservative. Scheffé's procedure corrects alpha for all pairwise (simple) comparisons of means, but also for all complex comparisons of means (contrasts involving more than two means at a time). As a result, Scheffé's is also the least statistically powerful procedure. It is commonly applied to pairwise comparisons, but it is a poor choice of procedure unless complex comparisons are being made.
multiple-comparisons test
tests for statistical significance between treatment means or combinations of means
PLANNED COMPARISON
IF, before data collection, we had reasons to expect a difference between 2-week and 8-week (and for some reason we're not much interested in the other pairwise comparisons), we could run a single paired-samples t test. But you can't go back after the fact and act like you only wanted to compare those two. That's shady science.
Qualitative Interaction
One IV reverses its effect depending on the level of the other IV
"angles" of an interaction
There are two different ways to express an interaction pattern from any 2x2 factorial ANOVA:
•Does the effect of IV-A depend on the different levels of IV-B?
•Does the effect of IV-B depend on the different levels of IV-A?
One-way repeated-measures ANOVA
A hypothesis test used when you have:
•one nominal IV with at least 3 levels
•a scale DV
•a within-subjects design (same people in each group)
Also called a one-way within-groups ANOVA.
A major advantage over one-way between-groups ANOVA: repeated measures allows us to account for one more source of variance, the variance due to participants (subjects).
dfbetween = Ngroups − 1 (groups: "treatments," i.e., conditions)
dfsubjects = n − 1
dfwithin = (dfbetween)(dfsubjects)
dftotal = dfbetween + dfsubjects + dfwithin, OR dftotal = N − 1 = 12 − 1 = 11, where N is the number of observations.
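The repeated-measures df partition can be verified with the slide's numbers (3 conditions, 4 subjects, 12 observations). A quick sketch:

```python
# 3 conditions measured on the same 4 subjects: 12 observations total
n_groups, n_subjects = 3, 4
df_between = n_groups - 1                 # 2
df_subjects = n_subjects - 1              # 3  (the extra source of variance)
df_within = df_between * df_subjects      # 6
df_total = n_groups * n_subjects - 1      # 11, i.e., N - 1

# The three sources account for every degree of freedom
assert df_between + df_subjects + df_within == df_total
```

Pulling dfsubjects out of the error term is exactly what shrinks MSwithin relative to a between-groups design.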
ANOVA key points
Basic idea:
•Are the 3(+) sample means more spread out than we'd expect due to chance (under H0)?
•Is there more variance between the different groups than within the groups? (We KNOW there's going to be some variance.)
Remember that variance is the standard deviation squared.
Within-groups variance: the amount of variance between means expected under H0 (no effect).
F = variance of the means / mean of the variances.
Hypotheses:
•H0: μ1 = μ2 = μ3 = μ4 = ... = μn
•H1: At least one μ differs from the others
Grand mean
the mean of all scores, regardless of treatment
Mean square
the variance; a sum of squares divided by its degrees of freedom
One-way ANOVA
•Nominal IV, 3+ levels
•Scale DV
•Between-groups or repeated-measures
•Does the one single factor (IV) have an effect on the DV?
Bonferroni
(manual) correction. If you run separate t tests for each comparison, adjust alpha by dividing by the number of tests you run:
•Three comparisons: 2-week vs. 4-week, 2-week vs. 8-week, 4-week vs. 8-week
•alpha = .05 becomes .05/3 = .017
•p-value has to be less than .017 to reject the null
Bonferroni is a strict correction, which means it lowers power.
If you select the Bonferroni correction in JASP, it adjusts the p-value for you by multiplying it by the number of comparisons; you then USE the original alpha (the p-value has to be less than .05 to reject the null). You will get the same answers (decision to reject/fail to reject the null) as with the "manual" Bonferroni correction.
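The equivalence of the two Bonferroni styles (divide alpha, or multiply p) is easy to demonstrate. A sketch with hypothetical p-values for the three comparisons described above:

```python
# Hypothetical raw p-values for the three pairwise comparisons
p_values = [0.004, 0.030, 0.200]
alpha, k = 0.05, len(p_values)

# "Manual" Bonferroni: compare each raw p to alpha / k  (.05 / 3 = .017)
manual_reject = [p < alpha / k for p in p_values]

# Software-style Bonferroni (as in JASP): multiply each p by k, keep alpha = .05
adjusted_reject = [min(p * k, 1.0) < alpha for p in p_values]

# Both routes give identical reject / fail-to-reject decisions
assert manual_reject == adjusted_reject
```

The `min(p * k, 1.0)` clamp just keeps adjusted p-values from exceeding 1, which is how most software reports them.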
F statistic
Figure 11.1 shows three treatment populations that are identical. Each population produced a sample whose mean is projected onto a dependent-variable line at the bottom of the figure. Note that the three means are fairly close together. Thus, a variance calculated from these three means will be small. This is a between-treatments estimate, the numerator of the F ratio: a small numerator of the F statistic.
In Figure 11.2, the null hypothesis is false. The mean of Population C is greater than that of Populations A and B. Look at the projection of the sample means onto the dependent-variable line. A variance calculated from these sample means will be larger than the one for the means in Figure 11.1. By studying Figures 11.1 and 11.2, you can convince yourself that the between-treatments estimate of the variance (the numerator of the F ratio) is larger when the null hypothesis is false: a large numerator of the F statistic.
F is also a family of distributions, identified by two df measures:
•one df is associated with the numerator (number of samples, i.e., conditions)
•the other df is associated with the denominator (sample sizes)
F = 1 is the expected value under the null hypothesis in ANOVA. There is no one-tailed vs. two-tailed distinction for F. With two groups, F = t²; F with one df in the numerator corresponds to the t distribution.
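The F = t² relationship can be checked numerically. A sketch with two hypothetical groups, computing the pooled-variance t statistic and the one-way ANOVA F statistic by hand from the same data:

```python
import math

# Two hypothetical groups (two levels of one IV)
g1, g2 = [1, 2, 3], [4, 5, 6]
n1, n2 = len(g1), len(g2)
m1, m2 = sum(g1) / n1, sum(g2) / n2

# Independent-samples t statistic with pooled variance
ss1 = sum((x - m1) ** 2 for x in g1)
ss2 = sum((x - m2) ** 2 for x in g2)
sp2 = (ss1 + ss2) / (n1 + n2 - 2)                    # pooled variance
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# One-way ANOVA F statistic on the same data
gm = (sum(g1) + sum(g2)) / (n1 + n2)                 # grand mean
ms_between = (n1 * (m1 - gm) ** 2 + n2 * (m2 - gm) ** 2) / 1   # df_between = 1
ms_within = (ss1 + ss2) / (n1 + n2 - 2)              # df_within = 4
f = ms_between / ms_within

# With exactly two groups, F equals t squared
assert abs(f - t ** 2) < 1e-9
```

This is also why there is no one-tailed F test: squaring t folds both tails into the upper tail of F.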
SSsubjects
an index of the variability of each subject's overall performance around the grand mean. ∑(X̄participant − GM)²
Factorial ANOVA
•2+ nominal IVs, each with 2+ levels
•Scale DV
•Between-groups, repeated-measures, or mixed design
Two-Way ANOVA
Main effects:
1) Does Factor A by itself have an overall effect on the DV?
2) Does Factor B by itself have an overall effect on the DV?
Interaction:
3) Does the effect of one factor vary depending on the other factor? Do the two factors combined have an effect on the DV that we wouldn't understand from just looking at each factor by itself?
t test
•Nominal IV, 2 levels
•Scale DV
•Independent or paired samples
Assumptions
•The inferential tests we have covered are all called parametric tests. They work best when certain assumptions about the underlying populations are met.
•One key assumption in ANOVA is that variances are equal in the underlying populations being sampled from: samples come from populations with similar variances. This is called homogeneity of variances (a.k.a. homoscedasticity).
•General rule: if the largest variance is more than twice the smallest variance, the assumption of homogeneity is violated. (In the class example, the largest variance was not more than twice the smallest, so the assumption was met.)
•If the assumption is met, the ANOVA works as intended; in other words, we can have confidence in the conclusion at the end.
•If the assumption is violated: with a large sample size, we can run the ANOVA without major concern, BUT with a small sample size, the results of the ANOVA might be wrong.
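The "twice the smallest variance" rule of thumb is simple enough to encode directly. A minimal sketch (the variances fed in are hypothetical):

```python
def homogeneity_ok(variances):
    """Rule of thumb from above: the assumption is met if the largest
    sample variance is no more than twice the smallest."""
    return max(variances) <= 2 * min(variances)

# Hypothetical condition variances
assert homogeneity_ok([4.1, 5.0, 6.8]) is True    # 6.8 <= 2 * 4.1, assumption met
assert homogeneity_ok([2.0, 3.5, 5.2]) is False   # 5.2 > 2 * 2.0, assumption violated
```

Note this is only a classroom heuristic; formal checks (e.g., Levene's test) exist in standard statistics packages.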
Partitioning of Variance
Between-groups: Between/Within
Repeated-measures: Between/Within/Subjects
MSbetween → effect; MSwithin → "error" (a.k.a. "residual")
Decreasing MSwithin (by accounting for subject variance) increases the F statistic.
Two-way between-groups ANOVA: the same MSwithin is used for all three F ratios; three different MSbetween values are used for the three different F ratios.
statistical tests
Differences / error (variability):
•Between-group: measure of differences between conditions, the "effect"
•Within-group: measure of variability within conditions, the "noise"
t statistic = difference between means / standard error
F statistic = between-groups variance / within-groups variance
If H0 is true, this ratio is ~1; if H0 is false, this ratio is >1. How much greater than 1? We'll use critical values/cutoffs.
In fact: between-group differences / within-group variability. (*"Between-group" can mean "between-condition" in a repeated-measures design; groups are also called "treatments," i.e., conditions.)
Sampling distributions recap: the standard normal z distribution; t is a family of distributions; t approaches the normal distribution with higher N; the t distribution with df = ∞ is a normal curve.
patterns of interaction
Three patterns of interaction are possible:
•An effect is larger at one level than at the other, but both are in the same direction → quantitative
•An effect is present at one level but not the other (i.e., one simple effect is null)
•An effect is reversed at one level compared to the other (i.e., the simple effects are in opposite directions) → qualitative
Replicability
•Obtaining consistent results across several studies that aim to answer the same scientific question with different data
Analysis of Variance (ANOVA)
•Starts with the assumption that all groups (a.k.a. treatments; conditions) will yield the same outcome: the null hypothesis
•After data collection, one could end up with the claim that all groups do NOT produce the same outcome: the alternative hypothesis
Why not just run a bunch of t tests? The probability of not committing a Type I error (correctly retaining the null) on 2 t tests is P(A and B) = P(A) × P(B) = .95 × .95 = .9025. If I ran all 6 t tests, there would be a 26% chance of committing a Type I error somewhere in there! ANOVA keeps your Type I error rate low.
•A hypothesis test used when you have a scale DV and one categorical IV with *at least* 3 levels; could be a between-groups design or a within-groups design
•Helps control the Type I error rate by allowing us to do just ONE overall test (the omnibus ANOVA)
•Also helps control the Type I error rate in another way, by making our comparisons more conservative (more on this later)
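The familywise error arithmetic above generalizes to any number of tests. A quick sketch confirming the "26% across 6 t tests" figure:

```python
# Familywise Type I error risk when running k independent tests at alpha = .05
alpha = 0.05
k = 6                                  # 6 pairwise t tests among 4 groups

p_no_error = (1 - alpha) ** k          # chance of no Type I error anywhere
p_at_least_one = 1 - p_no_error        # chance of at least one Type I error

# With k = 2 this reproduces .9025; with k = 6 the error risk is about 26%
assert round((1 - alpha) ** 2, 4) == 0.9025
assert round(p_at_least_one, 2) == 0.26
```

This inflation is the whole motivation for the single omnibus F test.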
t-test
•when comparing two sets of scores •One-sample t test: Sample mean compared to known population •Career satisfaction for sample of army nurses vs. known pop of civilian nurses •Paired-samples t test: Same people measured on two levels of an IV (or natural pairs or matched pairs designs) •Task performance on a big monitor versus small monitor •Independent-samples t test: Different people measured on two levels of an IV •Memory performance w/ or w/o caffeine
One-way between-groups ANOVA
F = MSB / MSW = variance of the means / mean of the variances = between-groups variance / within-groups variance.
Note: your book also refers to these as treatment variance (between) and error variance (within).
MSB and MSW are two estimates of the population variance. When H0 is true, MSB ≈ MSW and thus F ≈ 1. When H0 is false, MSB will be larger than MSW and thus F > 1. SO, F = 1 is the expected value under the null hypothesis in ANOVA.
1. Set null and alternative hypotheses. The populations you are comparing may be exactly the same (null hypothesis, H0) or one or more of them may have a different mean (alternative hypothesis, H1).
•H0: no differences between population means (μ1 = μ2 = μ3)
•H1: at least one population mean is different from the mean of all of them combined (i.e., the Grand Mean)
2. Obtain data from samples to represent the populations you are interested in.
3. Tentatively assume that the null hypothesis is correct. If the populations are all the same, any differences among sample means are the result of chance.
4. Perform operations on the data using the procedures of ANOVA until you have calculated an F value.
5. Choose a sampling distribution that shows the probability of the F value when H0 is true.
6. Compare your calculated F value to the critical value: if the F statistic is larger than the critical value, reject the null hypothesis.
7. Come to a conclusion about H0.
8. Tell the story of what the data show. If there are three or more treatments and you reject H0, a conclusion about the relationships of the specific groups requires further data analysis.
---
•dfbetween = Ngroups − 1
•dfwithin = df1 + df2 + df3, where each is n − 1
•dftotal = dfbetween + dfwithin = Ntotal − 1
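Steps 4 and the df formulas above can be sketched end to end on toy data (the three groups below are hypothetical, chosen so the arithmetic is clean):

```python
# Hypothetical scores for three conditions
groups = [[2, 3, 4], [4, 5, 6], [8, 9, 10]]

scores = [x for g in groups for x in g]
gm = sum(scores) / len(scores)                 # grand mean
means = [sum(g) / len(g) for g in groups]      # condition means

df_between = len(groups) - 1                   # Ngroups - 1
df_within = sum(len(g) - 1 for g in groups)    # sum of (n - 1) per group

# MSbetween: variance of the condition means around the grand mean
ms_between = sum(len(g) * (m - gm) ** 2
                 for g, m in zip(groups, means)) / df_between

# MSwithin: pooled variance of scores around their own condition means
ms_within = sum((x - m) ** 2
                for g, m in zip(groups, means) for x in g) / df_within

f = ms_between / ms_within
# An F far above 1 suggests between-groups variance exceeds chance levels;
# the decision still requires comparing f to the critical value for (2, 6) df
assert f > 1
```

To finish the NHST steps you would look up the critical F for (df_between, df_within) at your chosen alpha, e.g., via an F table or `scipy.stats.f.ppf`.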
Easy manipulation of significance
Heavily influenced by...
•Power: the probability of making a correct decision (rejecting the null) when the null hypothesis is false. The distribution of p values varies substantially depending on sample size and power; increasing power increases your chance of rejecting the null.
•Alpha level: a less stringent alpha (.05) makes it easier to make a Type I error, and a more stringent alpha (.01) makes it easier to make a Type II error. It's a good idea to always follow up with additional studies, but this illuminates the tricky balance we as researchers encounter between minimizing Type I and Type II errors.
•One-tailed versus two-tailed tests: one-tailed tests have more power to detect an effect, but may be totally misleading if the effect is in the opposite direction.
•Sample size: the bigger the N, the more reliable the estimate and the greater the likelihood of rejecting the null.
Manipulation of any of those factors could lead to a Type I error if the null hypothesis is true. Many of these factors are easy to manipulate after the data have been collected and analyzed, which makes it easy to "find" effects after initially retaining the null.
Quantitative Interaction
One IV exhibits a strengthening or weakening of its effect at one or more levels of the other IV, but the direction of the initial effect does not change
analysis of variance
inferential statistics technique for comparing means, comparing variances, and assessing interactions. Also, the omnibus ANOVA is actually testing whether the average deviation of the means from the grand mean is more extreme than would be expected if they were all from overlapping population distributions. In other words, the omnibus ANOVA does not actually provide evidence that any particular pair will be different.
carryover effect
one level of IV continues to affect participant's response in next treatment condition
main effect
significance test of the deviations of the mean levels of one independent variable from the grand mean. The effect of one IV, ignoring (or really, averaging out) the influence of the other IV. Compare marginal means to assess it. Similar to a t test or one-way ANOVA.
Effect size
standardized measure of the magnitude of the effect the IV has on the DV. There are a variety of effect size measures for ANOVA; for the overall IV you can use:
•η² ("eta squared")
•R² ("R squared")
•ω² ("omega squared")
They are all basically the proportion of variability in the DV that can be accounted for by the IV (some version of SSbetween/SStotal). Different heuristics apply to different effect sizes, e.g., for ω²: 0.01 = small, 0.06 = medium, 0.14+ = large effect.
Omega squared is a more conservative (and arguably more accurate) measure of effect size: eta squared and R squared are overestimates when sample size is small. The drawback of eta squared is that it is a biased measure of population variance explained (although it is accurate for the sample); it always overestimates it, though this bias gets very small as sample size increases. Omega squared has the same basic interpretation but uses unbiased estimates of the variance components; because of this, omega squared is always smaller than eta squared.
You can also calculate a d value for any comparison between 2 groups: d = (X̄1 − X̄2) / ŝwithin, where ŝwithin = √MSwithin.
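The eta-squared vs. omega-squared comparison can be made concrete. A sketch using hypothetical one-way ANOVA sums of squares, with the standard one-way formula ω² = (SSbetween − dfbetween·MSwithin) / (SStotal + MSwithin):

```python
# Hypothetical one-way ANOVA results, for illustration only
ss_between, ss_within = 56.0, 6.0
df_between, df_within = 2, 6

ss_total = ss_between + ss_within
ms_within = ss_within / df_within

# Eta squared: proportion of total variability attributable to the IV (sample)
eta_sq = ss_between / ss_total

# Omega squared: bias-corrected population estimate of the same proportion
omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)

# Omega squared corrects eta squared's overestimation, so it is always smaller
assert omega_sq < eta_sq
```

With these numbers eta squared is about .90 and omega squared about .86; the gap shrinks as sample size (and hence df_within) grows.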
sum of squares
sum of squared deviations from the mean
F test
test of the statistical significance of differences among means, or two variances, or of an interaction
F
test statistic for ANOVA: F = MSbetween / MSwithin = (SSbetween / dfbetween) / (SSwithin / dfwithin)
df: how we identify the F distribution. SS: sum of squares, the basis for variance. Mean square: what we call variance in an ANOVA.
F distribution
theoretical sampling distribution of F values
Guidelines and Best Practices
•Formulate a priori hypotheses and pre-register your study (https://www.cos.io/initiatives/prereg)
•Plan analyses prior to data collection: choose your alpha, sample size, and tailed test (one-tailed vs. two-tailed)
•Interpret findings considering the p-value AND the confidence interval and effect size
•Increase the reliability of findings by increasing power and sample size, and by choosing an alpha level and tailed test that balance Type I and Type II error
Don'ts:
•Only consider the p-value
•Collect more data just to find a significant finding (p-hacking)
•Propose a post-hoc hypothesis after seeing the data (HARKing)
Replicability or reproducibility crisis
•Ongoing methodological crisis in the field of psychology around replicating and reproducing scientific findings
•In 2015, the Open Science Collaboration attempted to reproduce the findings of 100 journal articles published in three high-impact journals
•How many studies do you think failed to replicate? More than half! Only 39% of the study results were replicated; 61% did not hold up
•Many effect sizes were smaller than those in the original published manuscripts, even when in the same direction
Examples:
•Power posing: several studies attempted to replicate these findings to no avail. There was a small effect on self-reported feelings of powerfulness, but no physiological impact.
•Does smiling make you happier? According to the facial feedback hypothesis, people's affective responses can be influenced by their own facial expression (e.g., smiling, pouting), even when their expression did not result from their emotional experiences. The original Strack et al. (1988) study reported a rating difference of 0.82 units on a 10-point Likert scale; a meta-analysis revealed a rating difference of 0.03 units with a 95% confidence interval ranging from -0.11 to 0.16.
•Is willpower a finite resource? The state of reduced self-control capacity was termed ego depletion. Baumeister and colleagues tested their model using a sequential-task experimental paradigm, in which participants engaged in two consecutive tasks. For participants randomly allocated to the experimental (ego-depletion) group, both tasks required self-control. For participants allocated to the control (no-depletion) group, only the second task required self-control, whereas the first task required little or none. The self-control tasks required participants to alter or modify an instinctive, well-learned response, akin to resisting an impulse or temptation (Baumeister, Vohs, & Tice, 2007).
Reliance on p-values only
•p is the probability of the observed results (or more extreme) occurring, IF H0 is true
•p is NOT the probability that the data are due to chance
•p is NOT the probability that H0 is true
•p is NOT the probability that H0 is false
•p is NOT the probability of a Type I error
•p is NOT the probability of making a wrong decision
•The complement of p (which is 1 − p) is NOT the probability that H1 is true or false
Can be misleading:
•Results in binary thinking: p = .049 is significant but p = .05 is not? Despite being just the probability of the data (or more extreme) under the null hypothesis, p values can dictate how we interpret and communicate our results. Some psychologists would even call p = .079 a "marginal" or "trending" significance, which is wrong!
•Easy to misinterpret: p is not Reliability (it doesn't tell you the probability of replication), not Error (it doesn't tell you whether you're making a correct rejection), and not Hypothesis (it doesn't tell you the probability of the hypothesis)
•Not a measure of estimation. Confidence interval (CI): a more precise and accurate measure of the sample mean as an estimate of the true population mean. A 95% CI means we can be 95% confident that the interval captures the true population parameter; the wider the interval, the less confident we are in our findings.
•Statistical significance is not necessarily meaningful. Effect size (e.g., Cohen's d) is an indicator of the magnitude or meaningfulness of the observed effect. If we found a significant result but d = .02 (a small effect), how might we interpret our findings?
---
•A p-value is the probability of the observed results (or more extreme) occurring, IF H0 is true.
Reliance on p-values only can result in:
•Binary thinking
•Dismissal of estimation (use confidence intervals and effect sizes to bolster interpretation)
•Conflation of significance with meaningfulness