Psyc 201 - Chapter 13: Inferential Statistics
t test
- The most common null hypothesis test in psychological research for comparing two means, whether between two groups or between a sample and a population
Mean Squares Between Groups
- (MSB) and represents the differences between group means
Mean Squares Within Groups
- (MSW) and represents the differences within groups (essentially error)
Null Hypothesis Testing
- A common form of statistical hypothesis testing in which researchers calculate the probability of obtaining their result if the null hypothesis is true; they then decide whether to reject or retain the null hypothesis based on their calculations
Two-Tailed Test
- A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution - Used more commonly, and can detect extreme scores in either direction, but is less powerful than a one-tailed test
p value
- A number between 0 and 1, interpreted in the following way: a small p value (typically ≤ .05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis - If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained
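A minimal sketch of this decision rule in Python (assuming scipy is installed; the sample data and the population mean of 100 are made up for illustration):

```python
from scipy import stats

# Made-up sample: does its mean differ from a population mean of 100?
sample = [104, 98, 110, 102, 107, 99, 105, 111]
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

alpha = 0.05  # the a priori significance level
if p_value <= alpha:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: retain the null hypothesis")
```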
Critical Values
- A point on the test distribution that is compared to the test statistic to determine whether to reject the null hypothesis
Test Statistic
- A statistic that is computed only to help find the p value
Sampling Error
- An error that occurs when a sample somehow does not represent the target population - Random variability in a statistic from sample to sample
The File Drawer Problem
- Another problem with NHST and Type I errors is called the File Drawer Problem
‣ Researchers tend to publish significant results, but file nonsignificant results away in a drawer
‣ Consequently, the proportion of published results that are Type I errors is likely higher than 5%
‣ Furthermore, the strength of the relationship between variables is likely overstated in the published literature
Retain the Null Hypothesis
- Concluding that any differences we see in the sample are just due to chance, and that there is no systematic difference in the population
Parameters
- Any numerical quantity that characterizes a given population or some aspect of it - Tells us something about the whole population
Statistically Significant
- If our observed probability (p) is less than our a priori α, then we state that we have a statistically significant result ‣ This means that our result is unlikely to be due to chance alone and instead reflects a systematic difference
Key Takeaways Part 1
- Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
- The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favour of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
- The probability of obtaining the sample result if the null hypothesis were true (the p value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
- Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.
Type I Error
- Occurs when we reject the Null Hypothesis when it is actually true... this is represented by α
Type II Error
- Occurs when we retain the Null Hypothesis, but should have rejected it... this is represented by β
‣ Type I and Type II errors are linked together
‣ If you try to minimize the Type I error, you will increase the Type II error
‣ If you try to minimize the Type II error, you will increase the Type I error
Practical Significance
- The importance or usefulness of the result in some real-world context
Key Takeaways Part 3
- The decision to reject or retain the null hypothesis is not guaranteed to be correct. A Type I error occurs when one rejects the null hypothesis when it is true. A Type II error occurs when one fails to reject the null hypothesis when it is false.
- The statistical power of a research design is the probability of rejecting the null hypothesis given the expected relationship strength in the population and the sample size. Researchers should make sure that their studies have adequate statistical power before conducting them.
- Null hypothesis testing has been criticized on the grounds that researchers misunderstand it, that it is illogical, and that it is uninformative. Others argue that it serves an important purpose, especially when used with effect size measures, confidence intervals, and other techniques. It remains the dominant approach to inferential statistics in psychology.
Reject the Null Hypothesis
- Concluding that the differences we see in the sample are not due to chance, and that our independent variable had an effect on our dependent variable
Alternative Hypothesis (H1)
- The idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population
α (alpha)
- The probability we are willing to accept of rejecting the null hypothesis when it is actually true ‣ For most psychological studies, we set α = .05
One-way ANOVA
- The test statistic for the ANOVA is called F. It is a ratio of two estimates of the population variance based on the sample data. - Used to compare the means of more than two samples (M1, M2...MG) in a between-subjects design
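A minimal sketch of a one-way ANOVA in Python (assuming scipy is installed; the three groups are made-up data):

```python
from scipy import stats

# Made-up scores for three independent groups (between-subjects design)
group1 = [4, 5, 6, 5, 4]
group2 = [6, 7, 8, 7, 6]
group3 = [5, 6, 5, 6, 5]

# F is the ratio of between-group to within-group variance (MSB / MSW)
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```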
Null Hypothesis (H0)
- There is no relationship in the population and that the relationship in the sample reflects only sampling error - The sample relationship "occurred by chance"
Post Hoc Comparisons
- These allow us to compare two means to see if they are different... but doing multiple tests inflates our probability of making a Type I error (falsely rejecting the Null) ‣ Consequently, we make adjustments to the standard t-test to account for this increase in error
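One common adjustment is the Bonferroni correction, which divides α by the number of comparisons. A minimal sketch in Python (assuming scipy; the data are made up, and Bonferroni is only one of several possible corrections):

```python
from scipy import stats

# Pairwise t tests after a significant ANOVA (made-up groups from above)
group1 = [4, 5, 6, 5, 4]
group2 = [6, 7, 8, 7, 6]
group3 = [5, 6, 5, 6, 5]
pairs = [(group1, group2), (group1, group3), (group2, group3)]

alpha = 0.05
bonferroni_alpha = alpha / len(pairs)  # stricter threshold per comparison

for i, (a, b) in enumerate(pairs, start=1):
    t_stat, p = stats.ttest_ind(a, b)
    print(f"pair {i}: p = {p:.4f}, significant? {p <= bonferroni_alpha}")
```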
Key Takeaways Part 2
- To compare two means, the most common null hypothesis test is the t test. The one-sample t test is used for comparing one sample mean with a hypothetical population mean of interest, the dependent-samples t test is used to compare two means in a within-subjects design, and the independent-samples t test is used to compare two means in a between-subjects design.
- To compare more than two means, the most common null hypothesis test is the analysis of variance (ANOVA). The one-way ANOVA is used for between-subjects designs with one independent variable, the repeated-measures ANOVA is used for within-subjects designs, and the factorial ANOVA is used for factorial designs.
- A null hypothesis test of Pearson's r is used to compare a sample value of Pearson's r with a hypothetical population value of 0.
One-Sample t test (Single-Sample t test)
- Used to compare a sample mean (M) with a hypothetical population mean (μ0) that provides some interesting standard of comparison
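A minimal sketch showing the t formula alongside scipy's built-in test (Python; the sample data and μ0 are made up):

```python
import numpy as np
from scipy import stats

# Made-up sample and a hypothetical population mean of interest
sample = np.array([24, 27, 21, 25, 26, 22, 28, 23])
mu0 = 22

# t = (M - mu0) / (s / sqrt(N)), with df = N - 1
m, s, n = sample.mean(), sample.std(ddof=1), len(sample)
t_manual = (m - mu0) / (s / np.sqrt(n))

t_scipy, p = stats.ttest_1samp(sample, popmean=mu0)
print(t_manual, t_scipy, p)  # the two t values match
```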
Dependent-Samples t test
- Used to compare two means for the same sample tested at two different times or under two different conditions
- For example, you might measure scores before and after training (a time-based study), or you might look at the difference between words and non-words (a condition-based study)
‣ This type of test is also known as a Related-Samples, a Within-Groups, or a Paired-Samples test
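A minimal sketch of a dependent-samples t test in Python (assuming scipy; the before/after scores are made up):

```python
from scipy import stats

# Made-up scores for the same people before and after training
before = [12, 15, 11, 14, 13, 16, 12, 15]
after  = [14, 17, 13, 15, 15, 18, 13, 17]

# ttest_rel pairs the scores, so each person serves as their own control
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```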
One-Tailed Test
- Used to detect a difference only if the t-statistic of our sample is more extreme in the direction that we set a priori. ‣ This can be a more powerful option, but only if the predicted direction is correct... otherwise you risk missing an effect in the opposite direction.
Independent-Samples t test
- Used to test the differences of the means of two separate, independent groups
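A minimal sketch of an independent-samples t test, including the one-tailed option from the previous entry (Python; the `alternative` argument requires a recent scipy version, and the data are made up):

```python
from scipy import stats

# Made-up scores for two separate, independent groups
treatment = [34, 38, 35, 40, 37, 36, 39]
control   = [31, 33, 30, 34, 32, 33, 31]

# Two-tailed: detects a difference in either direction
t_stat, p_two = stats.ttest_ind(treatment, control)

# One-tailed: direction set a priori (here, treatment > control)
_, p_one = stats.ttest_ind(treatment, control, alternative='greater')

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```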
Analysis of Variance (ANOVA)
- When there are more than two group or condition means to be compared, the most common null hypothesis test is the analysis of variance (ANOVA) - Tells us whether at least one group mean differs from the others, but not which means differ
Testing Pearson's Correlation
‣ We can also test whether Pearson's correlation coefficient, r, is statistically significant.
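A minimal sketch of testing Pearson's r against a population value of 0 (Python with scipy; the paired data are made up):

```python
from scipy import stats

# Made-up paired observations (e.g., study hours and exam scores)
hours  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [55, 60, 58, 65, 70, 68, 75, 80]

r, p_value = stats.pearsonr(hours, scores)  # tests H0: population r = 0
print(f"r = {r:.2f}, r^2 = {r**2:.2f}, p = {p_value:.4f}")
```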
Statistical Vs. Practical Significance
‣ Just because we have statistical significance does not mean that we have practical significance
‣ This has to do with our sample size and our ability to reject the null hypothesis
‣ As our sample size increases, it is easier to reject the null hypothesis
‣ As our sample size increases, weaker and weaker relationships can reach statistical significance
‣ This means that although we might have a statistically significant result, we might not have a practically meaningful difference
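A minimal sketch of this point: with a huge sample, even a trivially small difference comes out statistically significant (Python; the simulated 0.05 SD effect and sample sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two populations whose means differ by a trivial 0.05 standard deviations
for n in (20, 200_000):
    a = rng.normal(0.00, 1, n)
    b = rng.normal(0.05, 1, n)
    t_stat, p = stats.ttest_ind(a, b)
    print(f"N = {n}: p = {p:.4f}")  # the tiny effect is 'significant' at huge N
```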
The Replicability Crisis
‣ Recently, there has been a focus on the replicability crisis in Psychology, in that many previously published studies have failed to replicate
‣ In one large-scale replication project, only 39 of 100 studies replicated, and in general the effect sizes were half of those found in the original studies
‣ In some cases, researchers even found the opposite result
The t-Distribution
‣ The t-statistic is useful because it follows a very specific distribution when the null hypothesis is true
‣ The t-distribution has the following properties:
‣ It is unimodal
‣ It is symmetrical
‣ It has a mean of 0
‣ Its precise shape is determined by the degrees of freedom (df), which are based on the sample size, N
‣ Knowing the t-value and its df allows us to determine p
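A minimal sketch of turning a t-value and df into a two-tailed p value (Python with scipy; the t-value and df are made up):

```python
from scipy import stats

# Two-tailed p value from a t statistic and its degrees of freedom
t_value, df = 2.31, 14
p = 2 * stats.t.sf(abs(t_value), df)  # area in both tails beyond |t|
print(f"p = {p:.4f}")
```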
Alternative to NHST
‣ There are a couple of alternatives to NHST
‣ One alternative is to use confidence intervals instead of NHST
‣ A confidence interval around the mean is a range of values computed so that some percentage of the time (usually 95%) the population parameter will lie within that range
‣ But this is really just NHST in disguise!
‣ Another alternative is to use Bayesian statistics
‣ This is gaining in popularity, but can be difficult to understand
- An approach in which the researcher specifies the probability that the null hypothesis and any important alternative hypotheses are true before conducting the study, conducts the study, and then updates the probabilities based on the data
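A minimal sketch of a 95% confidence interval around a sample mean (Python with scipy; the sample is made up):

```python
import numpy as np
from scipy import stats

# Made-up sample; 95% confidence interval around its mean
sample = np.array([24, 27, 21, 25, 26, 22, 28, 23])

ci_low, ci_high = stats.t.interval(
    0.95,                     # confidence level
    df=len(sample) - 1,
    loc=sample.mean(),
    scale=stats.sem(sample),  # standard error of the mean
)
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```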
Problems with NHST
‣ There are several problems with NHST
‣ Many researchers do not understand what p actually represents
‣ There is a common misconception that if your p = .01, you have a 99% chance of replicating the finding... this is wrong!
‣ Some have argued that setting α = .05 is arbitrary
‣ Why can you publish a report if p = .04, but not if p = .06?
‣ p-values do not tell you anything about the strength of the relationship between two variables, only whether a difference likely exists
‣ We can combine p-values with r² values to better understand the relationship
Open Science Practices
‣ There has been a push to increase the transparency of science
‣ Some journals now mark papers that follow the practices of open science:
‣ Pre-registering hypotheses and data analysis plans
‣ Openly sharing research materials (to allow replication)
‣ Making raw data available to other researchers
p-Hacking
‣ Finding p < .05 is treated as the holy grail of science because it means the results can be published
‣ But this leads to a problem where researchers may be inadvertently inflating their probability of a Type I error
‣ This is known as p-hacking