MODULE 4: Hypothesis Testing & Assumptions
Things that are related to Power
-Sample size -Alpha level (with a caveat!) -Reliable measures (psychometric data available) -Appropriate research design -Appropriate analysis strategy -Understand the relationship between significance level (alpha), effect size, power, and confidence in the results
Multicollinearity & Singularity
-Multicollinearity *Variables are too highly correlated -Singularity *Variables are redundant -Must be considered in analysis strategy as well as instrument development -Redundant variables inflate the error term
What if our data do not meet assumptions?
-Outliers? -Transform the data *Positive skew **Log transformation **Square root transformation **Reciprocal transformation *Negative skew **Reverse score transformation -Use built-in "corrections" -Use a non-parametric test
Effect size measures
•There are several effect size measures that can be used: -Cohen's d -Pearson's r -Odds Ratio/Risk rates •Pearson's r is a good intuitive measure -Oh, apart from when group sizes are different ...
General Assumptions for Parametric Tests
-Normality -Linearity -Homoscedasticity -No multicollinearity and singularity
The mean of the z distribution equals:
0
If alpha equals 0.05, how many times out of 100 would you expect to reject the null hypothesis when the null hypothesis is in fact "true"?
5
In a normal distribution approximately ____ of the scores will fall within 1 standard deviation of the mean.
68%
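The 68% figure (and its 2-SD companion, roughly 95%) can be checked directly from the normal CDF; a quick sketch using only Python's standard library:

```python
from statistics import NormalDist

# Proportion of a normal distribution within k standard deviations of the mean.
nd = NormalDist()  # standard normal: mean 0, sd 1
within_1_sd = nd.cdf(1) - nd.cdf(-1)
within_2_sd = nd.cdf(2) - nd.cdf(-2)
print(round(within_1_sd, 4))  # 0.6827 -> the "approximately 68%" rule
print(round(within_2_sd, 4))  # 0.9545 -> approximately 95% within 2 SDs
```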
Effect Sizes...
Sex has a significant effect on both sea lions and pugs: with sea lions, sex has a LARGE effect size; with pugs, sex has a SMALL effect size. In other words, both results are "significant," but the size of the sex difference varies.
What does the central limit theorem state?
The means of many (large enough) random samples from a population will be normally distributed
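A quick simulation of the idea, drawing from an obviously non-normal (uniform) population; the sample size and repetition count here are arbitrary choices for illustration:

```python
import random
from statistics import mean, stdev

# Draw many sample means from a uniform population on [0, 10].
# The CLT says the means should centre on the population mean (5)
# with spread sigma / sqrt(n), where sigma = 10/sqrt(12) for this population.
random.seed(42)
n, n_samples = 50, 2000
sample_means = [mean(random.uniform(0, 10) for _ in range(n))
                for _ in range(n_samples)]
print(round(mean(sample_means), 2))   # close to the population mean of 5
print(round(stdev(sample_means), 2))  # close to (10/sqrt(12))/sqrt(50) ~ 0.41
```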
If we drew a random sample from an introductory family sciences course, to whom could we best generalize our results?
all family science students
If we set alpha at .01 instead of .05, we have a:
lower risk of a Type I error and greater risk of a Type II error
Which of the following is NOT a transformation that can be used to correct skewed data?
tangent transformation
Which of the following assumptions are underlying the use of parametric tests (based on the normal distribution)?
All of the above: the data should be normally distributed; the samples being tested should have approximately equal variance; the data should be at interval or ratio level.
The Kolmogorov-Smirnov test can be used to test:
whether data are normally distributed
Parametric vs. Non-Parametric
-A statistical test can be either parametric or non-parametric *Parametric **Assumes a normal distribution **Homogeneous variance **Independent data **More precise measurements, more conclusions **Examples: correlation, regression, ANOVA, t-tests *Non-Parametric **Any distribution, any variance **Used when data don't fit the assumptions **Nominal and ordinal data **Simpler, but less information **Examples: chi-square, Mann-Whitney U, Spearman correlation
Assumptions
-An assumption is a condition that ensures that what you're attempting to do works. * If met, then we can take the test statistic and p-value at face value and interpret accordingly. *If violated, then we know the test statistic and p-value will be inaccurate and could lead us to the wrong interpretation. -Each test that we will study this semester (e.g., correlation, regression, ANOVA, etc.) has several required assumptions that must be met
Statistical Significance
-Based on the probability of obtaining a particular statistical outcome by chance given that there is no true relation in the population (i.e., the null hypothesis is true). -If our statistical analysis tells us that the probability of getting these results by chance is less than 5% (less than .05): *Reject the null hypothesis *Results are "significant" -If our analysis tells us that the obtained probability is greater than 5% (greater than .05): *Fail to reject the null hypothesis
so why an alpha level of .05?
-Consider the consequences of making a Type I error *Lots of money at stake (e.g., new study will cost $10,000,000) *Benefits vs. risks of SSRIs for adolescents *Kaplan course will increase GRE scores -For an exploratory/pilot study, consider using .10 -A lower alpha level increases the probability of making a Type II error -Remember: alpha/beta levels are not related to truth, but to our chances of discovering it
Independence
•The errors in your model should not be related to each other. •If this assumption is violated: -Confidence intervals and significance tests will be invalid. -You should apply the techniques covered in Chapter 21.
Normally distributed something or other
•The normal distribution is relevant to: -Parameters -Confidence intervals around a parameter -Null hypothesis significance testing •This assumption tends to get incorrectly translated as 'your data need to be normally distributed'.
Example: Hygiene Scores at the Download Festival
In SPSS: Analyze - Descriptive Statistics - Frequencies - Statistics (Skewness, Kurtosis) - Charts (Histogram; select "Show normal curve"). You can also request skewness and kurtosis statistics via the Descriptives menu.
It is always preferable to make a Type II error
It depends on the cost of making a Type I or Type II error
Which of the following tests whether variances are homogeneous?
Levene's test
Which of the following will increase the power of a hypothesis test?
change the sample size from 25 to 100
The power of a hypothesis test is defined as:
the probability that the test will reject H0 if there is a real treatment effect
The assumption of homogeneity of variance is met when:
the variances in different groups are approximately equal
Aims 2
•Assumptions of parametric tests based on the normal distribution •Understand the assumption of normality -Graphical displays -Skew -Kurtosis -Normality tests •Understand Homogeneity of Variance -Levene's Test •Know how to correct problems in the data -Log, Square Root and Reciprocal Transformations -Pitfalls and alternatives -Robust tests
Transforming data
•Log transformation (log(xi)): -Reduces positive skew. •Square root transformation (√xi): -Also reduces positive skew; can also be useful for stabilizing variance. •Reciprocal transformation (1/xi): -Dividing 1 by each score also reduces the impact of large scores. This transformation reverses the ordering of the scores; you can avoid this by reversing the scores before the transformation: 1/(xhighest - xi).
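A sketch of how these transformations tame positive skew, using a made-up positively skewed variable and a simple moment-based skewness measure (not the small-sample-corrected version SPSS reports):

```python
import math
from statistics import mean, pstdev

def skewness(xs):
    """Simple moment-based skewness: 0 for symmetric data, > 0 for positive skew."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

# A hypothetical positively skewed variable (long right tail).
scores = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 12, 20]
print(round(skewness(scores), 2))                         # strongly positive
print(round(skewness([math.log(x) for x in scores]), 2))  # log: skew shrinks
print(round(skewness([math.sqrt(x) for x in scores]), 2)) # sqrt: milder correction
print(round(skewness([1 / x for x in scores]), 2))        # reciprocal (reverses ordering)
```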
How is power calculated?
-Power = 1 - beta -G*Power is the most common tool for calculating it -What's adequate power? *Most are okay with 0.8 (although some prefer 0.85 or 0.90)
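As a rough sketch of the kind of calculation G*Power performs, the sample size needed for a two-sided, two-group comparison can be approximated with a normal (z) formula; the effect sizes below are the conventional small/medium benchmarks, not values from the notes:

```python
from math import ceil
from statistics import NormalDist

# z-approximation to the per-group sample size for a two-sided two-sample test.
def n_per_group(d, alpha=0.05, power=0.80):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = z.inv_cdf(power)           # 0.84 for power = .80
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.5))              # medium effect: 63 per group
print(n_per_group(0.2))              # small effect: 393 per group
print(n_per_group(0.5, power=0.90))  # more power costs more participants
```

Note how halving the effect size quadruples (roughly) the required n, which is why small effects need large samples.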
What is power?
-A measure of sensitivity of the study to detect a real effect of the independent variable (IV) -The ability to detect an effect (large or small) if there is one -Power = 1 - Beta (Beta = Type II error)
Test Statistics
-A statistic for which the frequency of particular values is known (e.g., t or F). -Observed values can be used to test hypotheses. *We can calculate the probability of obtaining a certain value of the test statistic if there were no effect (i.e., if the null hypothesis were true). *test statistic = variance explained by the model / variance not explained by the model = effect/error *We want the number on top (effect) to be bigger than the number on the bottom (error), so that the test statistic is greater than 1.
Type I & Type II Error
-As you decrease the risk of a Type I error, you increase the chances of a Type II error. -It is difficult to determine whether either error has been made because actual population parameters are not usually known. -Type I generally seen as more serious, in practical terms (claiming a significant effect when none actually exists)
Homogeneity of Variance / Homoscedasticity
-Assumption of equal variances *Across variables *Across groups -Tests *Levene's Test *Mauchly's W (Sphericity)
Bottom Line
-Even if we fail to reject the null hypothesis, it does not necessarily mean the null hypothesis is true. *This is because a hypothesis test does not determine which hypothesis is true, or even which is most likely: it only assesses whether available evidence exists to reject the null hypothesis.
Significance (α)
-How "different" is different? -Probability: *How likely are we to get a result by chance? -Significance: *α ("alpha"): the level of significance, the chance-probability threshold against which the p-value is compared -Common α level = .05 *In the social sciences, an error risk of less than 5% is considered "significant"
Statistical Test of Normality
-Kolmogorov-Smirnov test -Does the distribution as a whole deviate from a normal distribution? -In SPSS: *Analyze - Descriptive Statistics - Explore - Select your variable(s) - Plots - Normality plots with tests
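The D statistic behind the Kolmogorov-Smirnov test is just the biggest gap between the data's empirical CDF and a fitted normal CDF; a minimal sketch with made-up data (no p-value, unlike SPSS's output):

```python
from statistics import NormalDist, mean, stdev

# Largest gap between the empirical CDF and a normal CDF fitted to the data.
def ks_statistic(xs):
    xs = sorted(xs)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

symmetric = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
skewed = [1, 1, 1, 2, 2, 3, 5, 9, 15, 25, 40]
print(round(ks_statistic(symmetric), 3))  # small gap: close to normal
print(round(ks_statistic(skewed), 3))     # larger gap: deviates from normal
```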
How do we know if data are normally distributed?
-Large samples - generally normal -Visual inspection (graph) -Descriptive statistics *Review values of skewness and kurtosis (closer to 0 = more normal) -Statistical tests of normality *Kolmogorov-Smirnov Test
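The "closer to 0 = more normal" rule for skewness and kurtosis can be illustrated with simple moment-based versions of those statistics (SPSS applies small-sample corrections, so its exact values differ slightly):

```python
from statistics import mean, pstdev

# Descriptive check of normality: skewness and excess kurtosis are both
# roughly 0 for normally distributed data.
def moments(xs):
    m, s, n = mean(xs), pstdev(xs), len(xs)
    skew = sum(((x - m) / s) ** 3 for x in xs) / n
    excess_kurtosis = sum(((x - m) / s) ** 4 for x in xs) / n - 3
    return round(skew, 2), round(excess_kurtosis, 2)

print(moments([2, 3, 4, 5, 5, 5, 6, 7, 8]))    # symmetric: skewness = 0
print(moments([1, 1, 2, 2, 3, 4, 7, 12, 22]))  # positive skew: skewness well above 0
```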
Misconceptions around p-values
-Misconception 1: A significant result means that the effect is important. *No, because significance depends on sample size: with a large enough sample, even a tiny, unimportant effect can reach significance. -Misconception 2: A non-significant result means that the null hypothesis is true. *No, a non-significant result tells us only that the effect is not big enough to be found (given our sample size); it doesn't tell us that the effect size is zero. -Misconception 3: A significant result means that the null hypothesis is false. *No, it is logically not possible to conclude this.
Effect Size
-Standardized measure of the size of an effect *Effect sizes complement other statistics, such as p-values *Allows for objective evaluation of the size of an observed effect *A measure of strength of relationship -Common effect size measures: *Cohen's d *Eta-squared (or partial eta-squared) *R-squared *Cramer's V
Inferential Statistics
-Take info & values from sample and infer back to population -Used to make conclusions about population -Estimating generalizability of our findings -Hypothesis testing
Hypothesis Testing
-Testing differences between groups -The hypothesis makes a prediction about the difference -Null hypothesis: H0 *Predicts no difference; this is what the researcher sets out to reject -Alternative/Experimental hypothesis: H1 or HA *A directional test (one-tailed) predicts a difference in a certain direction (above or below the mean); you are only allowed to look at one side of the distribution *A non-directional test (two-tailed) predicts just a difference, not its direction
Hypothesis Testing 2
-The null hypothesis is formulated in such a manner that you will be able to statistically test whether the variable has a relationship with another variable *The assumption is that there exists no relationship between the variables or that they are distinct from each other *e.g., Communication training for couples is unrelated to marital happiness (no relationship) -The alternative hypothesis is a complete and absolute contradiction of the null hypothesis *Sometimes also referred to as the research hypothesis *e.g., Communication training for couples is related to marital happiness.
Errors
-Type I error *When we say there's a difference, but there's really no difference *The probability is the α-level (usually .05) *"False positive" *Most scientists are very concerned about this one -Type II error *When we say there's no difference, but there really is a difference *The probability is the β-level (usually .2) *"False negative"
One-tailed or two-tailed test?
-Usually two-tailed (non-directional) -One-tailed (directional) only if: *A two-tailed prediction doesn't make sense (e.g., new types of tires will last longer, exercise will make you live longer, warm milk will make you sleep longer) *You are attempting to replicate previous research showing clear results in one direction
Principles of Hypothesis Testing
-We assume the null is true (there is no effect). -We fit a statistical model to our data that represents the alternative hypothesis and see how well it fits (in terms of the variance it explains). -To determine how well the model fits, we calculate the probability (called the p-value) of getting that 'model' if the null hypothesis were true. -If that probability is very small, we conclude the model fits the data well and we gain confidence in the alternative hypothesis.
Evaluating the marijuana experiment
-We need to determine the probability of getting 9 out of 10 pluses if chance alone is responsible, and compare this probability with alpha. *From the binomial distribution, the probability of 9 pluses is .0098 *This is lower than .05, so we "reject the null." *This means that only 98 times in 10,000 would we get 9 pluses if chance alone were the cause. *We conclude that marijuana affects appetite. **(We assume that the sample was randomly selected and representative of the population; we thus assume this conclusion applies to the general population of marijuana users in our area.)
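The .0098 figure is the binomial probability of exactly 9 "pluses" in 10 chance-only (p = .5) trials, which can be checked directly:

```python
from math import comb

# Probability of exactly 9 "pluses" in 10 trials when chance (p = .5)
# alone is operating, as in the marijuana/appetite example.
p_nine = comb(10, 9) * 0.5 ** 9 * 0.5 ** 1  # 10 ways to place the single minus
print(round(p_nine, 4))  # 0.0098 -> below alpha = .05, so reject the null
```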
Family-wise Error Rate
-When doing multiple pairwise tests, α is cumulative *Group A vs. Group B, α = .05 (5%) *Group A vs. Group C, α = .05 (5%) *Group B vs. Group C, α = .05 (5%) **Total: roughly 15% error possible -Better to compare all 3 with one test and one error rate (α = .05, 5%) *e.g., ANOVA or regression -The more tests you run, the more Type I error you can accumulate, so economize the number of tests you run
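The ~15% figure is an approximation; for k independent tests at α = .05 the familywise rate is 1 - (1 - α)^k:

```python
# Familywise error rate for k independent tests:
# 1 minus the probability of making NO Type I error on any of them.
alpha, k = 0.05, 3
familywise = 1 - (1 - alpha) ** k
print(round(familywise, 3))  # 0.143 -- close to the ~15% quoted for three t-tests
```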
Alpha level
-Why .05? -If statistical analysis tells us that the probability of getting these results by chance is less than 5% (less than .05): *Reject the null hypothesis; results are "significant" -If analysis tells us the obtained probability is greater than 5% (greater than .05): *Fail to reject the null hypothesis -This is NOT related to "truth", but to the risk we're willing to take (a 5% error risk)
The pharmacologist mentioned above sets out to test his hypothesis with a preliminary pilot study. When analyzing the results of the study, would you use a one-tailed or a two-tailed test?
A two-tailed non-directional test
Which of the following is a suggested program to use when calculating necessary sample size of a study in order to achieve the desired level of power?
G*Power
One vs. Two Tailed
H0: we expect the sample mean to be equal to the population mean. One-tailed (predicting a direction): *H1 (left tail): children watch less than 3 hours of TV per week *H1 (right tail): children watch more than 3 hours of TV per week. Two-tailed (not predicting a direction, only that there is a difference): *H1: children do not watch 3 hours of TV per week
H0 vs. H1
Remember: H0 (the null hypothesis) is what is tested quantitatively in terms of your alpha level *BUT when stating your hypothesis, state the experimental/alternative hypothesis (H1), which is your actual prediction -Every statistical test evaluates the null hypothesis of no relationship between the variables -We can never prove anything, but we can reject things: science proceeds by rejecting or failing to reject the null; nothing is absolutely true in science
A pharmacologist recently developed a drug that, in theory, should have an effect on the activity of the lateral hypothalamus. It is likely that such a drug would have an effect on appetite. Which of the following would be an accurate statement of the pharmacologist's alternative/experimental hypothesis?
The drug will have an effect on the activity of the lateral hypothalamus and thus on a person's appetite.
A Type I error means that a researcher has
falsely concluded that a treatment has an effect
errors explained
If the null is true and we retain it = correct; if the null is false and we retain it = Type II error; if the null is true and we reject it = Type I error; if the null is false and we reject it = correct
If a distribution of raw scores is negatively skewed, transforming the raw scores into z scores will result in a _______ distribution.
negatively skewed
Which of the following is a form of standard score?
percentile
Additivity and linearity
•The outcome variable is, in reality, linearly related to any predictors. •If you have several predictors then their combined effect is best described by adding their effects together. •If this assumption is not met then your model is invalid.
When does the assumption of normality matter?
•In small samples. -The central limit theorem allows us to forget about this assumption in larger samples. •In practical terms, as long as your sample is fairly large, outliers are a much more pressing concern than normality.
Spotting outliers example
•A biologist was worried about the potential health effects of music festivals. •Download Music Festival •Measured the hygiene of 810 concert-goers over the three days of the festival. •Hygiene was measured using a standardised technique: -Score ranged from 0 to 4 •0 = you smell like a corpse rotting up a skunk's arse •4 = you smell of sweet roses on a fresh spring day
Researcher degrees of freedom
•A scientist has many decisions to make when designing and analysing a study. -The alpha level, the level of power, how many participants should be collected, which statistical model to fit, how to deal with extreme scores, which control variables to consider, which measures to use, and so on •Researchers might use these researcher degrees of freedom to present their results in the most favourable light (Simmons, Nelson, & Simonsohn, 2011) •Fanelli (2009) pooled data from studies in which scientists reported on other scientists' behaviour. -On average, 14.12% had observed others fabricating or falsifying data, or altering results to improve the outcome -A disturbingly high 28.53% reported other questionable practices
Effect sizes
•An effect size is a standardized measure of the size of an effect: -Standardized = comparable across studies -Not (as) reliant on the sample size -Allows people to objectively evaluate the size of observed effect.
Reducing bias
•Analyse with robust methods: -Bootstrapping -Methods based on medians and trimming. •Trim the data: -Delete a certain amount of scores from the extremes. •Winsorizing: -Substitute outliers with the highest value that isn't an outlier. •Transform the data: -By applying a mathematical function to scores.
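Minimal sketches of trimming and winsorizing; the 10% level and the scores are arbitrary choices for illustration:

```python
def trim(xs, prop=0.10):
    """Delete the top and bottom `prop` of scores."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    return xs[k:len(xs) - k] if k else xs

def winsorize(xs, prop=0.10):
    """Replace extreme scores with the nearest retained score."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    lo, hi = xs[k], xs[-k - 1]
    return [min(max(x, lo), hi) for x in xs]

scores = [2, 3, 3, 4, 4, 5, 5, 6, 7, 40]  # 40 is an outlier
print(trim(scores))       # highest and lowest scores removed
print(winsorize(scores))  # 40 pulled down to the next highest score (7)
```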
Benefits of Bayesian approaches
•Bayesian approaches specifically evaluate the evidence for the null hypothesis -Unlike NHST, you can draw conclusions about the likelihood that the null hypothesis is true •Unlike p-values, Bayesian methods are not confounded by the sample size or the stopping rules applied to data collection. •Bayesian methods focus on estimating parameter values (which quantify effects) or evaluating relative evidence for the alternative hypothesis (Bayes factors) -Because there is no black-and-white thinking, only estimation and interpretation, behaviour such as p-hacking is circumvented
Incentive structures
•Beth and Danielle -Can people with extreme views literally not see grey areas? -Participants see words displayed in various shades of grey and for each word they try to match its colour by clicking along a gradient of greys from near white to near black. •Danielle: -finds that politically moderate participants are significantly more accurate (the shades of grey they chose more closely matched the colour of the word) than participants with extreme political views to the left or right. •Beth -Found no significant differences in accuracy between groups. •Danielle's study was no different than Beth's (except for the results) but: -Danielle's interesting, surprising and media-friendly result is more likely to be published than Beth's null finding. -Danielle now has a better-looking CV than Beth: she will be a stronger candidate for jobs, research funding, and internal promotion. •Fanelli, 2010 -In the USA, scientists from institutions that publish a large amount, are more likely to publish results supporting their hypothesis. -Given that a good proportion of hypotheses ought to be wrong, do those working in high-stress 'publish or perish' environments less often have wrong hypotheses, or do they cheat more?
Homoscedasticity/ Homogeneity of Variance 2
•Can affect the two main things that we might do when we fit models to data: -Parameters -Null Hypothesis significance testing
Assessing homoscedasticity/ homogeneity of variance
•Graphs (see lectures on regression) •Levene's Tests -Tests if variances in different groups are the same. -Don't use: •In large samples, small differences will be significant. •In small samples, big differences won't be significant •Solutions -Robust tests (Welch's t, Welch's F) -Adjusted standard errors
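Levene's statistic is essentially a one-way ANOVA on absolute deviations from each group's mean; a bare-bones sketch with made-up groups (SPSS would also report a p-value from the F distribution):

```python
from statistics import mean

# Levene's W: one-way ANOVA F statistic computed on absolute deviations
# from each group's mean.
def levene_W(groups):
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    k = len(z)
    N = sum(len(g) for g in z)
    grand = mean([v for g in z for v in g])
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in z) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in z for v in g) / (N - k)
    return between / within

similar = [[4, 5, 6, 5], [5, 6, 7, 6]]       # similar spread
different = [[5, 5, 5, 5.1], [1, 9, 2, 12]]  # very different spread
print(round(levene_W(similar), 2))    # near 0: variances look homogeneous
print(round(levene_W(different), 2))  # large: homogeneity is doubtful
```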
Logical flaw
•If the null hypothesis is true, then it is highly unlikely to get this test statistic: -This test statistic has occurred. -Therefore, the null hypothesis is highly unlikely. •By the same (flawed) logic: if 'person plays guitar' is true, then it is highly unlikely that he plays in Iron Maiden -This person plays in Iron Maiden -Therefore, 'person plays guitar' is highly unlikely.
NHST and wider problems in science
•Incentive structures and publication bias •Researcher degrees of freedom •p-hacking and HARKing
Meta-analysis
•Meta-analysis involves computing effect sizes for a series of studies that investigated the same research question, and taking a weighted average of those effect sizes. •Each effect size is weighted by its precision (i.e., how good an estimate of the population it is) before the average is computed. •Large studies, which will yield effect sizes that are more likely to closely approximate the population, are given more 'weight' than smaller studies, which should have yielded imprecise effect size estimates. •The aim of meta-analysis is not to look at p-values and assess 'significance', but to estimate the population effect. -Therefore, it overcomes the same problems of NHST that we discussed for effect sizes
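A fixed-effect, inverse-variance weighted average is the simplest version of this; the effect sizes and variances below are hypothetical:

```python
# Sketch of a fixed-effect meta-analysis: each study's effect size is
# weighted by its precision (inverse variance), so large, precise studies
# pull the pooled estimate more than small, imprecise ones.
studies = [  # (effect size d, variance of that estimate) -- hypothetical
    (0.60, 0.04),   # small study, imprecise
    (0.25, 0.01),   # larger study, more precise
    (0.30, 0.005),  # largest study, most precise
]
weights = [1 / v for _, v in studies]
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
print(round(pooled, 3))  # 0.308 -- dominated by the two precise studies
```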
Misconceptions around p-values
•Misconception 1: A significant result means that the effect is important. -No, because significance depends on sample size. •Misconception 2: A non-significant result means that the null hypothesis is true. -No, a non-significant result tells us only that the effect is not big enough to be found (given our sample size); it doesn't tell us that the effect size is zero. •Misconception 3: A significant result means that the null hypothesis is false. -No, it is logically not possible to conclude this.
Preregistration
•Open science -A movement to make the process, data and outcomes of research freely available to everyone. •Pre-registration of research -The practice of making all aspects of your research process (rationale, hypotheses, design, data processing strategy, data analysis strategy) publicly available before data collection begins. -Registered reports in an academic journal •If the protocol is deemed to be rigorous enough and the research question novel enough, the protocol is accepted by the journal typically with a guarantee to publish the findings no matter what they are -Public websites (e.g., the Open Science Framework).
Assumptions 2
•Parametric tests based on the normal distribution assume: -Additivity and linearity -Normality something or other -Homogeneity of Variance -Independence
Aims and Objectives
•Problems with NHST -Misconceptions -All-or-nothing thinking -NHST and wider issues in science •A phoenix from the EMBERS -Effect sizes -Meta-analysis -Bayesian Estimation -Registration -Sense
Sense
•The ASA statement on p-values (Wasserstein & American Statistical Association, 2016). -The ASA points out that p-values can indicate how incompatible the data are with a specified statistical model (e.g., how incompatible the data are with the null hypothesis). You are at liberty to use the degree of incompatibility to inform your own beliefs about the relative plausibility of the null and alternative hypotheses, as long as you don't interpret p-values as a measure of the probability that the hypothesis in question is true. They are also not the probability that the data were produced by random chance alone. -Scientific conclusions and policy decisions should not be based only on whether a p-value passes a specific threshold. -Don't p-hack. Be fully transparent about the number of hypotheses explored during the study, and all data collection decisions and statistical analyses. -Don't confuse statistical significance with practical importance. A p-value does not measure the size of an effect and is influenced by the sample size, so you should never interpret a p-value in any way that implies that it quantifies the size or importance of an effect. -'By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.'
Advantages of effect sizes
•They encourage interpreting effects on a continuum and not applying a categorical decision rule such as 'significant' or 'not significant'. •Effect sizes and sample size -Effect sizes are affected by sample size (larger samples yield better estimates of the population effect size), but, unlike p-values, there is no decision rule attached to effect sizes so the interpretation of effect sizes is not confounded by sample size. •Effect sizes and researcher degrees of freedom -Although there are researcher degrees of freedom (not related to sample size) that researchers could use to maximize (or minimize) effect sizes, there is less incentive to do so because effect sizes are not tied to a decision rule in which effects either side of a certain threshold have qualitatively opposite interpretations.
To transform ... or not
•Transforming the data helps as often as it hinders the accuracy of F (Games & Lucas, 1966). •Games (1984): -The central limit theorem: sampling distribution will be normal in samples > 40 anyway. -Transforming the data changes the hypothesis being tested •E.g. when using a log transformation and comparing means you change from comparing arithmetic means to comparing geometric means -In small samples it is tricky to determine normality one way or another. -The consequences for the statistical model of applying the 'wrong' transformation could be worse than the consequences of analysing the untransformed scores.
Spotting normality
•We don't have access to the sampling distribution so we usually test the observed data •Central Limit Theorem -If N > 30, the sampling distribution is normal anyway (arguably) •Graphical displays -P-P Plot (or Q-Q plot) -Histogram •Values of skew/kurtosis -0 in a normal distribution -Convert to z (by dividing value by SE) •Kolmogorov-Smirnov Test -Tests if data differ from a normal distribution -Don't use: •In large samples, small differences will be significant. •In small samples, big differences won't be significant
Homoscedasticity/ Homogeneity of Variance
•When testing several groups of participants, samples should come from populations with the same variance. •In correlational designs, the variance of the outcome variable should be stable at all levels of the predictor variable.
p-hacking and HARKing
•p-hacking -Researcher degrees of freedom that lead to the selective reporting of significant p-values •Trying multiple analyses/measuring multiple outcomes but reporting only the significant results •Stopping data collection at a point other than when the pre-determined sample size is reached •Including (or not) data based on the effect they have on the p-value •Including (or excluding) variables in an analysis based on how those variables affect the p-value •Merging groups of variables or scores to yield significant results •Transforming, or otherwise manipulating, scores to yield significant p-values •HARKing -The practice in research articles of presenting a hypothesis that was made after data collection as though it were made before data collection
Effect size measures
•r = .1, d = .2 (small effect): -the effect explains 1% of the total variance. •r = .3, d = .5 (medium effect): -the effect accounts for 9% of the total variance. •r = .5, d = .8 (large effect): -the effect accounts for 25% of the variance. •Beware of these 'canned' effect sizes though: -The size of effect should be placed within the research context.
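The variance-explained figures follow directly from squaring r:

```python
# 'Variance explained' is r squared: r = .1 -> 1%, r = .3 -> 9%, r = .5 -> 25%.
for r in (0.1, 0.3, 0.5):
    print(f"r = {r}: {r ** 2:.0%} of variance")
```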