midterm 2 300B conceptual
How do we compare the within-groups and between-groups estimates of population variance?
• If the null hypothesis is true, the ratio of the between-groups estimate to the within-groups estimate should be approximately one to one.
• If the null hypothesis is false, the between-groups estimate should be bigger than the within-groups estimate, because the former is influenced by two sources of variation: variation of the scores within each population and variation of the population means from each other.
four circumstances to test for power
• Before the experiment
• After the experiment: (1) if you retained the null but expected to reject it; (2) if you retained the null and wanted to; (3) if you rejected the null as expected
what is power? statistical definition
Power = 1 − β
• Probability of finding a significant difference if the effect you are looking for is real
• Probability of correctly rejecting a false H0
• Probability of NOT making a Type 2 error
4 ways to test credibility of an outcome
(1) Hypothesis testing
(2) Effect size and power
(3) Replication: have other studies support your finding
(4) r²: "proportion of variability accounted for"
variability: μ1 = 700 and μ0 = 500 in both cases, but σ is reduced from 100 to 50. What happens to 1 − β (power)?
1 − β (power) increases as σ decreases
Advantages of F-test (3)
1. Allows us to analyze more than 2 groups while controlling for Type 1 error
2. Better interpretation of the impact of the IV on the DV, because you can look at all the groups as a whole rather than running separate studies to compare different levels of the IV
3. More powerful:
• Less unexplained variability in the analysis (you run one single study instead of many separate studies)
• Pools variability across multiple groups to estimate the population variance
Steps for Computation of ANOVA
1. Compute the grand mean and the group means
2. Compute SST, SSgroups, SSerror
3. Compute dfT, dfgroups, dferror
4. Compute MSgroups, MSerror
5. Compute F(obs)
6. Compute η²
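As a minimal sketch of these six steps in Python with NumPy: the individual scores below are invented, chosen so the group means match the cookie example used later (24.6, 18.8, 16.8), but the within-group spreads are made up, so F and η² here will not match that example's F(2,12) = 5.72.

```python
import numpy as np

# Hypothetical scores for k = 3 groups, n = 5 each; means match the cookie
# example (24.6, 18.8, 16.8) but the spreads (and so F, eta^2) are invented.
groups = [np.array([24.0, 28, 22, 25, 24]),
          np.array([19.0, 17, 21, 18, 19]),
          np.array([15.0, 18, 16, 17, 18])]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()                                  # Step 1
group_means = [g.mean() for g in groups]

ss_total = ((all_scores - grand_mean) ** 2).sum()               # Step 2: SST
ss_groups = sum(len(g) * (m - grand_mean) ** 2                  # SSgroups
                for g, m in zip(groups, group_means))
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)     # SSerror

k, N = len(groups), len(all_scores)
df_groups, df_error = k - 1, N - k                              # Step 3 (dfT = N - 1)

ms_groups = ss_groups / df_groups                               # Step 4
ms_error = ss_error / df_error

f_obs = ms_groups / ms_error                                    # Step 5
eta_sq = ss_groups / ss_total                                   # Step 6

print(f"F({df_groups},{df_error}) = {f_obs:.2f}, eta^2 = {eta_sq:.2f}")
```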
Controlling the probability of making a Type 1 error
Error rate: the probability of making a Type 1 error for a specific set of analyses. Two types:
(1) Error rate per comparison (αPC)
• Probability of making a Type 1 error for each individual comparison
• Set by the researcher
(2) Error rate per experiment / experiment-wise error (αEW), e.g., Dunn's αEW < .05
• Probability of a Type 1 error for a family of comparisons
• Involves a uniquely computed table of critical values
Methods for reducing Variability: Monitor Procedure We can reduce variability by increasing the internal validity of our experiment How?
(1) Monitor our experimental procedures closely:
• Employ a tighter methodology/protocol (make sure all Ps are treated the same)
• Take care to reduce experimenter bias (be neutral when running the experiment)
• Increase the reliability of the DV (make sure measurements are done properly)
Methods for reducing Variability: Research Design We can reduce variability by increasing the internal validity of our experiment. How?
(2) Consider the choice of research design:
• The related-samples design is more powerful than the independent-samples design
• It reduces the variability caused by the bigger N (more individual differences) in an independent-samples design
Population, Sampling, and Sample dist
Three sampling distributions of means are pooled to create one sampling distribution of the F-statistic. When k = 3, there are 10 relevant distributions in total.
F distribution: a mathematically defined family of curves that serves as the comparison distribution in an analysis of variance. An F table gives you the cutoff scores on the F distribution at different alpha levels.
Do not increase sample size after analyzing your data
Adding participants after you have analyzed your data (a posteriori) increases your chance of making a Type 1 error.
One-way ANOVA
An analysis of variance in which the groups are defined on only one independent variable with several levels (e.g., drug dosage: 2, 4, or 8 mg). New test statistic/test ratio: F, which can be used to compare sample means (like t).
Same t(obs), but in one case we reject the null hypothesis and in the other we retain it
As df increases, the critical value, t(crit), decreases
How do we compute power? different situations
Can compute power directly for:
(1) An empirical population where all members of the population experience the treatment
Need to calculate effect size to find power for:
(1) An empirical population where only some members experience the treatment (a sample from the empirical population)
(2) A theoretical population with a related-samples design
(3) A theoretical population with an independent-samples design
Between-groups estimate of the population variance
Even when populations are identical (the null hypothesis is true), samples from the different populations will each be a little different (each mean won't be exactly the same). Variability between sample means depends on how much variation there is in the population:
• A lot of variation in the population → more variation in the sample means
• A little variation in the population → little variation in the sample means
• Variation among the means of samples taken from identical populations is related directly to the variation of the scores in each of those populations
• We can estimate the variance in the population from the variation among the means of our samples
• When the null hypothesis is true, the variation between sample means is also determined by chance factors (just like the within-groups estimate of the population variance)
• Called the between-groups estimate of the population variance, or mean square between groups (MSgroups)
How do we interpret r²?
For independent- or related-samples designs:
• Small: r² = .10
• Medium: r² = .25
• Large: r² = .40
For multi-group designs (k > 2), we report a different statistic, R² (η²) (more on this later):
• Small: R² = .01
• Medium: R² = .06
• Large: R² = .14
What is the null hypothesis? Several populations being compared all have the same mean:
H0: μ1 = μ2 = μ3
Basic Logic of the Analysis of Variance
Hypothesis testing in analysis of variance asks whether the means of the samples differ more than you would expect if the null hypothesis were true. To answer this, we focus on the variances and consider two different ways to estimate population variance:
(1) Within-groups estimate of the population variance
(2) Between-groups estimate of the population variance
Between-groups estimate of the population variance: if the null is or isn't true
If the null hypothesis is true:
• The estimate gives an accurate indication of the variation within the population (due to chance factors)
If the null hypothesis is false:
• The estimate is influenced by both the variation within the population and the variation among the population means (variation due to a treatment effect)
How do we use effect size to compute power?: For theoretical populations - independent samples design - if n isn't equal
If the ns are unequal, there are two options:
(1) If the total number of participants is large and the difference between n1 and n2 is relatively small, use the smaller of n1 and n2 as an estimate of n
(2) If the total number of participants is small and the difference between n1 and n2 is large, calculate the harmonic mean of n
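A minimal sketch of option (2), with hypothetical group sizes:

```python
# Harmonic mean of two unequal group sizes: 2 / (1/n1 + 1/n2).
# n1 = 10 and n2 = 30 are invented values for illustration.
def harmonic_n(n1: int, n2: int) -> float:
    return 2 / (1 / n1 + 1 / n2)

print(harmonic_n(10, 30))  # 15.0 -- closer to the smaller n than the arithmetic mean (20)
```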
Power as a function of alpha (α)
If we increase our cutoff for α (the cutoff would move to the left), we also decrease β and increase 1 − β, so power increases as alpha increases
Reporting r²
In the context of reporting your t statistic:
• SE = 1.16, t(14) = 1.72, p > .10, r² = .17 (or as a percentage, r²(100) = 17%)
Or:
• "17% of the variability in (name the DV/behaviour) is associated with (name your IV), but 83% of the variability in (name of DV) is unaccounted for"
Within-groups estimate of population variance
Like the t test:
• The population variance is unknown but can be estimated using the scores in the samples
• Assume that all populations have the same variance: σ1² = σ2² = σ3²
Take all the variance estimates from the samples and combine them into a single pooled estimate → the within-groups estimate of the population variance, or mean square error (MSerror):
• Not affected by whether the null hypothesis is true or false
• The estimate will come out the same whether the population means are all the same (null hypothesis true) or not all the same (null hypothesis false)
• Focuses on the variation inside each population, i.e., chance factors/unexplained variability among participants (individual differences, experimenter error)
Factors that influence power
Major factors (3):
(1) Effect size, which is made up of:
(a) Treatment effect (mean difference)
(b) Variability
(2) Sample size (NT or N) or the size of each group (n or Nk)
Minor factors (3):
• Alpha level
• Directional vs. non-directional hypothesis
• Choice of research design
Characteristics of the null and alternative hypothesis distributions
Null distribution (H0):
• Kurtotic
• Assumed to be a normal distribution
• Defined by μ0 & σ
Treatment/alternative distribution (H1):
• Kurtotic
• Assumed to be a normal distribution
• Defined by μ1 & σ
How do we use effect size to compute power?: empirical population
Only some members of the population experience the treatment (single-sample design):
• μ0 & σ are known; μ1 is estimated from the sample mean
• Use d to estimate γ, then use that to find power: δ = d · √N
• δ is a value used in referring to power tables that combines d and sample size
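As a rough sketch of this pipeline, power can be approximated from the normal distribution rather than read off a printed power table (the two should agree closely); μ0, σ, the sample mean, and N below are invented numbers.

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma = 500.0, 100.0        # known population parameters (hypothetical)
sample_mean, N = 540.0, 25       # mu1 is estimated from the sample mean

d = (sample_mean - mu0) / sigma  # estimated effect size
delta = d * sqrt(N)              # delta combines d and sample size

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)         # two-tailed cutoff on H0
power = 1 - norm.cdf(z_crit - delta)     # P(reject H0 | effect is real)
print(f"d = {d:.2f}, delta = {delta:.2f}, power = {power:.2f}")  # ~0.52
```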
Why don't we want to do multiple t-tests?
Performing multiple t-tests increases the likelihood that we make a Type 1 error (α).
Type 1 error: rejecting the null hypothesis when the null hypothesis is actually true.
• If α = .05, then the probability of a Type 1 error is .05 for each test
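A quick illustration: if the comparisons were independent, the probability of at least one Type 1 error across c tests would be 1 − (1 − α)^c, which grows quickly.

```python
# Familywise Type 1 error rate for c independent comparisons at alpha = .05.
alpha = 0.05
for c in (1, 3, 6, 10):
    print(c, round(1 - (1 - alpha) ** c, 3))  # 0.05, 0.143, 0.265, 0.401
```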
F-statistic qualities
Shape: asymmetric and positively skewed
X-axis:
• No upper limit of values
• Significant outcomes at the positive end of the tail
Central tendency:
• If equal Ns, median = 1.00
Family or series of curves:
• 2 df calculations (from within-groups and between-groups variance)
Fcrit varies as a function of α, dfgroups (between-groups df), and dferror (within-groups df)
How do we interpret Cohen's effect size
• Small: .20
• Medium: .50
• Large: .80
Step 6. Interpretation and Report of Findings - ANOVA
Statistical outcome: F(2,12) = 5.72, p < .05 (note: 2 = dfgroups; 12 = dferror)
Interpretation: the value of p(obs) suggests a systematic relationship between type of cookie and the number of chocolate chips per cookie. However, we do not know which brand of cookies had the most (or the fewest) chocolate chips per cookie.
"The mean number of chocolate chips varied as a function of type of cookie (regular Chips Ahoy M = 24.6; Chewy Chips Ahoy M = 18.8; Chunky Chips Ahoy M = 16.8), F(2,12) = 5.72, p < .05, η² = .49."
This is still incomplete because we don't know the direction of the effect, i.e., which types of cookie differ from each other.
Using r² to compare results
Study A: r² = 0.081
• Only 8% of the variability between the two groups can be explained; the other 92% of the variability in the scores is due to other factors
Study B: r² = 0.224
• 22% of the variability between the two groups can be explained; the other 78% of the variability in the scores is due to other factors
Based on r², the finding from Study B is more impressive because more of the variability (difference) in the scores between the two groups can be explained by the IV
In Hypothesis Testing, we calculate p(obs). What does p(obs) represent?
The probability that, given that the null hypothesis is true, we could have obtained these results by chance
Example write up power
To achieve a power of .70 with an effect size of 0.44, 12 subjects (6 males, 6 females) were randomly assigned to each of the four treatment conditions (N = 48). Power was calculated based on previously obtained effect sizes.
Variability: 3 types
Total variability (SST): the total amount of variability in the Y variable
Explained variability (SSR): variability in the Y variable that can be accounted for by our knowledge of the X variable
Unexplained variability (SSE): variability in the Y variable that cannot be accounted for by our knowledge of the X variable
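A small sketch of the decomposition using a least-squares line; the X and Y values are invented, and r² = SSR / SST ties this card to the r² cards above.

```python
import numpy as np

# Hypothetical X and Y data that follow a roughly linear trend.
x = np.array([1.0, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.3])

slope, intercept = np.polyfit(x, y, 1)     # least-squares line
y_hat = slope * x + intercept

ss_t = ((y - y.mean()) ** 2).sum()         # total variability (SST)
ss_e = ((y - y_hat) ** 2).sum()            # unexplained variability (SSE)
ss_r = ss_t - ss_e                         # explained variability (SST = SSR + SSE)

print(round(ss_r / ss_t, 3))               # r^2, near 1 for this tidy line
```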
research design and power
Use a related-samples research design if possible (this design reduces the variability value in the denominator)
Basic premise of the analysis of variance
When the null hypothesis is true, the ratio of the between-groups population variance estimate to the within-groups population variance estimate should be about 1. When the null hypothesis is false, this ratio should be greater than 1.
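A short simulation illustrates the premise; the group count, n, and population parameters are arbitrary choices. When H0 is true, both estimates average out to the same population variance (σ² = 100 here), so their ratio hovers around 1.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, reps = 3, 5, 10_000
between, within = [], []
for _ in range(reps):
    # H0 true: all k groups drawn from one population (mu = 50, sigma = 10)
    groups = rng.normal(loc=50, scale=10, size=(k, n))
    grand = groups.mean()
    between.append(n * ((groups.mean(axis=1) - grand) ** 2).sum() / (k - 1))
    within.append(groups.var(axis=1, ddof=1).mean())  # pooled estimate (equal n)

print(np.mean(between), np.mean(within))  # both near sigma^2 = 100, ratio about 1
```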
As the difference between μ0 and μ1 increases (with standard deviation constant), effect size: a) Increases b) Decreases c) Stays the same
a
Treatment effect or mean difference: If the mean difference is larger (μ1 = 850 vs. μ1 = 700) when μ0 = 500, what happens to 1 − β (power)? a) Power increases b) Power decreases c) Power stays the same
a
Does the probability 1-α pertain to the null or the treatment/alternative distribution? a. Null Distribution (H0) b. Treatment / Alternative Distribution (H1)
a.
Are you rejecting or retaining the null if it falls in the 1-α area? a. Rejecting H0 b. Retaining H0
b
Which of the following is incorrect about the within-groups variance estimate: a) Another term for it is "mean square error" b) It is only a good estimate if the null hypothesis is true c) It is also called MSerror d) It focuses on variability due to chance factors (also called "unexplained variability")
b
What does β represent? a) Probability of rejecting the null hypothesis when it is true b) Probability of rejecting the null hypothesis when it is false c) Probability of retaining the null hypothesis when it is false d) Probability of retaining the alternative hypothesis when it is true
c
Which of the following is correct: a) More power = large magnitude of difference, larger standard deviation, larger sample, larger alpha b) More power = large magnitude of difference, smaller standard deviation, larger sample, smaller alpha c) More power = large magnitude of difference, smaller standard deviation, larger sample, larger alpha d) More power = smaller magnitude of difference, smaller standard deviation, larger sample, smaller alpha
c
Power is: a) The probability that the null hypothesis is true b) The probability that the null hypothesis is false c) The probability a false null hypothesis will be rejected d) The probability a true null hypothesis will be rejected
c)
What is the within-groups population variance estimate based on? a) Take the smallest variance estimate from the different samples b) Take the largest variance estimate from the different samples c) Take the average of the variance estimates from all the samples d) Take the mean of all the sample means and then use that to calculate variance
c)
When the null hypothesis is true: a) Only within-groups estimate of variance is a good estimate of population variance b) Only between-groups estimate of variance is a good estimate of population variance c) Both are good estimates of population variance
c) This means that when the null hypothesis is true, both estimates should be fairly similar. When the null hypothesis is false, the between-groups estimate of population variance should be much larger than the within-groups estimate
Interpretation of d, f, r², and η² (small / medium / large)
• d: .20 / .50 / .80
• f: .10 / .25 / .40
• r²: .10 / .25 / .40
• η²: .01 / .06 / .14
How do we use effect size to compute power?: For theoretical populations - independent samples design
d = (X̄1 − X̄2) / Sp
δ = d · √(n/2) → find power in the power table
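A minimal sketch for the independent-samples case, again approximating power from the normal distribution instead of a power table; the two sample means, pooled SD, and per-group n are invented numbers.

```python
from math import sqrt
from scipy.stats import norm

xbar1, xbar2, s_pooled, n = 85.0, 78.0, 12.0, 20  # hypothetical; n per group

d = (xbar1 - xbar2) / s_pooled     # estimated effect size
delta = d * sqrt(n / 2)            # delta for the independent-samples design

power = 1 - norm.cdf(norm.ppf(0.975) - delta)  # two-tailed, alpha = .05
print(f"d = {d:.2f}, delta = {delta:.2f}, power = {power:.2f}")  # ~0.45
```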
How do we use effect size to compute power?: For theoretical populations - related samples design
d = (D̄ − μD) / SD
δ = d · √N → refer to the power table to find power
If the null hypothesis is false between-groups estimate of population variance will: a) Only be influenced by chance alone b) Only be influenced by variation within each sample c) Only be influenced by differences in population means d) Be influenced by both chance and differences in population means
d)
conceptual relationship between t and r
r = covariance / (SX·SY) = variability explained / total variability
t = estimated mean difference / SE
Conceptually, both are ratios of systematic variability to unsystematic variability, which is why r² can be computed directly from t(obs)
calculating r² in independent samples
r² = t(obs)² / (t(obs)² + df)
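For example, plugging in the t(14) = 1.72 outcome from the reporting card above:

```python
t_obs, df = 1.72, 14
r_sq = t_obs ** 2 / (t_obs ** 2 + df)  # note: df belongs inside the denominator
print(round(r_sq, 2))                  # 0.17, matching the reporting example
```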
r² as a proportion
r² = variability explained / total variability, i.e., the proportion of variability in Y accounted for by the variability in X
Advantages of Multiple Comparisons
• Can do specific comparisons between means
• Determine which experimental condition responded best to your treatment
• Controls the inflated probability of making a Type 1 error
Although the ANOVA was significant, we do not know which brand(s) differ from one another; multiple comparisons tell us.
Disadvantage of F-Test
• We are comparing several condition means, so we don't know where the difference is
• E.g., 3 groups: neutral, standard smile, and authentic smile. How do we know which groups actually differ from each other?
• Solution: apply a set of procedures called multiple comparisons to the data
What is effect size?
• The degree to which the experimental manipulation separates the H0 and H1 distributions
• Takes into account the difference between μ0 and μ1
• This distance is expressed in standard-deviation units (takes into account variability)
• The larger the effect size → the smaller the overlap of H0 and H1
• Effect size is a standard metric, so sample size and type of measure do not influence its value
• Effect size ≠ power
Effect Size when k > 2
• Effect size was d when we looked at independent-samples t-tests
• When k > 2, effect size = f
Effect size with a theoretical population
• Estimate effect size from a sample → use Cohen's d
• The exact formula varies for each research design
• Called statistical effect size or estimated effect size
• Same interpretation as γ
d = ((X̄1 − X̄2) − 0) / Sp
What is Power? • Probability of correctly rejecting a false null hypothesis
• Example: if power = 0.65, it means that if the null hypothesis is false to the degree we expect, the probability is .65 that the results of the experiment will lead us to reject H0
If power = 0.20:
• If the new teaching method is actually the better method, the probability is 0.20 that the results of the experiment will lead us to reject the null hypothesis that the traditional method is better
• Low probability: we are less likely to reject the null hypothesis, given that the null is false
• If an experiment is more powerful, there is a greater probability of rejecting a false H0 compared to a less powerful experiment
Problem with relying on Hypothesis Testing
• Hypothesis testing does not tell us how strongly our treatment of participants is related to their change in performance (behaviour)
• Study A: t(80) = 2.67, p = .009; Study B: t(16) = 2.15, p = .047
• Both studies reject the null hypothesis and suggest that there is a systematic relationship between the IV and DV
• However, p(obs) does not tell us how strong the relationship between the IV and DV is, or how much of an effect exists
When to test for power - After your experiment
• If you retained the null but expected to reject it: How can I improve my design? Was N too small? Was my treatment condition not strong enough?
• If you retained the null and wanted to: compute power to demonstrate that the experiment could have detected an effect if the effect was really there
• If you rejected the null as expected: Is my outcome of practical significance?
Step 2. Select Sampling Distribution -ANOVA
• Model of hypothesis testing: random sampling model
• Research design: independent-samples research design with 1 factor, k = 3 independent groups
• Type of data: score data (ratio)
• Statistic for analysis: group means
• Population parameters: not known
• Sample size: nj = 5 (N = 15)
Sampling distribution: "distribution of the F-statistic for df of 2 & 12" (dfgroups and dferror)
Planned comparisons: t-test, 1 pairwise comparison. If you only have one planned comparison, run the t-test as we normally would. Features (3):
• Most powerful option when you only have two groups
• Use the t-table of critical values with df = N − 2 (just as you would in an independent-samples t-test)
• Use standard α (i.e., α = .05)
Tukey's HSD Test
• Most widely used, least controversial
• Best for k > 3; less powerful for k = 3
• Homogeneity of σ² cannot be violated
• Requires a significant F(obs)
• Uses the experiment-wise error rate (αEW) through a specific critical-values table
• The MSerror value from the ANOVA is used for the pooled error term
• A unique table of critical values has been created for the distribution of Studentized q (the Q-statistic)
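SciPy ships an implementation of this test (scipy.stats.tukey_hsd, available since SciPy 1.8), so a sketch is straightforward; the three score lists are the same made-up k = 3 data used in the ANOVA sketch earlier.

```python
from scipy.stats import tukey_hsd

# Hypothetical scores for three groups (same invented data as the ANOVA sketch).
group1 = [24, 28, 22, 25, 24]
group2 = [19, 17, 21, 18, 19]
group3 = [15, 18, 16, 17, 18]

res = tukey_hsd(group1, group2, group3)
print(res)  # pairwise mean differences with experiment-wise-controlled p-values
```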
Directional vs. Non-directional hypothesis and power
• Power is higher with a one-tailed test than with a two-tailed test, as long as the hypothesized direction is correct
• A one-tailed test at the 0.05 level has the same power as a two-tailed test at the 0.10 level
What is 'good' Power?
• Power of 0.90 is excellent
• For the behavioural sciences, 0.50 to 0.70 is acceptable
Type 2 Error (β)
• Probability of not finding an effect of the new teaching method, given that the new teaching method is actually better than the traditional method
• Probability of Vancouver not beating LA when Vancouver actually is the better team
• Probability of retaining H0 when the null hypothesis is false and should be rejected
• Power = 1 − β
Assumptions of a one-way ANOVA
• Scores in each sample come from a population where the DV is normally distributed
• Homogeneity of population variances: σ1² = σ2² = σ3²
• Random sampling and random assignment of participants, so the data are independent of each other
What is power? Conceptual Definition
• The sensitivity of an experiment to detect a real effect of the independent variable on participants' behaviour
• Assuming that the new method is actually better, does the experiment have a high probability of providing strong evidence that the new method is better than the traditional method?
t-test, >1 pairwise comparison. When you have more than one pairwise comparison, you increase the Type 1 error rate. Solution:
• Set αPC for each comparison. As a general rule, divide α by the number of comparisons you are making (the Bonferroni correction).
• If you have 2 pairwise comparisons, you can set αPC at .025 & .025 so Σ = .05
• If you have 3 pairwise comparisons, you can set αPC at .0167, .0167, & .0167 so Σ = .05
• Apply a standard t-test to each pairwise comparison, but use your αPC instead of α = .05 to determine significance. That is, p(obs) must be less than αPC to be significant (a sketch follows below).
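A sketch of the procedure with three hypothetical smile groups (the scores are invented); scipy's ttest_ind runs each pairwise t-test, and each p(obs) is judged against αPC.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Invented data for the neutral / standard smile / authentic smile example.
groups = {"neutral":   [12, 15, 11, 14, 13],
          "standard":  [16, 18, 15, 17, 19],
          "authentic": [20, 22, 19, 21, 23]}

pairs = list(combinations(groups, 2))
alpha_pc = 0.05 / len(pairs)           # 3 comparisons -> alpha_PC = .0167

for a, b in pairs:
    t, p = ttest_ind(groups[a], groups[b])
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.4f}, significant = {p < alpha_pc}")
```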
Features of Planned Comparisons (3)
• A significant F(obs) value is not required (p can be > .05)
• Must justify the choice of comparisons before data are collected, following the same guidelines as those for choosing a directional hypothesis
• The total number of comparisons cannot exceed k − 1
Planned (A priori) Comparison
• A subset of comparisons planned and justified by the researcher before data are collected
Example from the research report:
• Based on the introduction, you may already have an idea which groups may differ from each other
• Based on past research, you might predict that one group will have higher well-being and social-engagement scores than another (e.g., the aerobics group will be higher in well-being than the bridge group)
Post Hoc or A posteriori Comparisons
• Tests between sample means that were not specified before data were collected
• Assumes all possible pairwise comparisons will be made (whether they are or not)
• Used for exploratory analysis
Example from the research report:
• You might not have a prediction about whether the current-events group differs from the handicrafts group in social engagement, but you still want to look at this comparison after seeing the data; we would compare these to each other in post hoc comparisons
You can (and often should) do both planned and post hoc comparisons
How do we measure effect size?
• Two parts: treatment effect and variability
• Called gamma (γ) or Cohen's effect size (d)
• γ = effect size in SD units for an empirical population:
γ = (μ1 − μ0) / σ
Sample Size and Power
• When you have too few participants in each condition, it is hard to reject H0
• When you have too many participants in each condition, it is easy to reject H0. BUT, as we learned last time, this isn't the best way to judge the impressiveness of your outcome. Why? A large value of N decreases the standard error, and the sampling distribution changes shape:
σX̄ = σ / √N
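A quick check of the formula with an arbitrary σ = 100: quadrupling N halves the standard error.

```python
from math import sqrt

sigma = 100  # arbitrary population SD
for N in (4, 25, 100, 400):
    print(N, sigma / sqrt(N))  # 50.0, 20.0, 10.0, 5.0
```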
When to test for power - Before your experiment
• The best time to conduct your power analysis
• Are the features of the experimental design sufficient to detect an effect of the IV on the DV?
• Determine what your expected effect size is (through pilot studies or previous experiments)
• You can then calculate what your sample size should be by manipulating the power equation
Variability "accounted for"
• r² should not be interpreted as cause and effect if the IV was measured and not manipulated
Variability explained (η2)
• η² is called "eta-squared"
• "The proportion of total variability of the scores from the grand mean that is accounted for by the variability between the group means"
• "How much variation in the DV is accounted for / explained by / associated with your IV"