exam 2
1. confidence intervals 2. hypothesis testing
inferential statistics
probability of obtaining the observed statistical test value (or one more extreme) when H0 is true
p (p value)
experimental design with samples whose dependent variable scores cannot logically be paired
independent-samples design
statistical test for a difference in population means that can detect positive and negative differences
two-tailed test of significance
rejection of the null hypothesis when it is true
type I error
•Two uses •1. Sets the significance level: Probability chosen as the criterion for rejecting the null hypothesis. •2. Sets the Type I error rate
Alpha
•Pick people at random from the population •Every member of the population has an equal chance of being selected into the study •Ensures our sample is representative of the population •This is the ideal sampling method (but it's rare) •How much attention you pay to sampling depends on your goal: are you trying to generalize? Political polling, etc., really needs to worry about the sampling procedure used
Unbiased (random) sampling
mean value of a random variable over an infinite number of samplings
expected value
a group that receives treatment in an experiment and whose dependent variable scores are compared
experimental group
goal in inferential statistics is usually to figure something out about a population We typically use samples to do this •"To use a sample is to agree to accept some uncertainty about the results." Tales of Distributions (Spatz, 2019; p. 152)
inferential statistics
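Type of test: goal of statistical test; sampling distribution; formulas
•One-sample t test
-Goal: compare $\bar{X}$ to a known $\mu$
-Sampling distribution: t; sampling distribution of means, with $\mu_{\bar{X}} = \mu_0$ and $\hat{s}_{\bar{X}} = \hat{s}/\sqrt{N}$
-$t = (\bar{X} - \mu_0)/\hat{s}_{\bar{X}}$
-$d = (\bar{X} - \mu_0)/\hat{s}$
•Paired-samples t test
-Goal: compare 2 $\bar{X}$s from the same/paired participants
-Sampling distribution: t; sampling distribution of mean differences, with $\mu_{\bar{D}} = 0$ and $\hat{s}_{\bar{D}} = \hat{s}_D/\sqrt{N}$
-$t = \bar{D}/\hat{s}_{\bar{D}}$ OR $t = (\bar{X}_1 - \bar{X}_2)/\hat{s}_{\bar{D}}$
-$d = \bar{D}/\hat{s}_D$ OR $d = (\bar{X}_1 - \bar{X}_2)/\hat{s}_D$
•Independent-samples t test
-Goal: compare 2 $\bar{X}$s from different participants
-Sampling distribution: t; sampling distribution of differences between means, with $\mu_{\bar{X}_1 - \bar{X}_2} = 0$ and standard error $\hat{s}_{\bar{X}_1 - \bar{X}_2}$
-$t = (\bar{X}_1 - \bar{X}_2)/\hat{s}_{\bar{X}_1 - \bar{X}_2}$
-$d = (\bar{X}_1 - \bar{X}_2)/\hat{s}_p$
•All 3 tests: $\bar{X}$, s, d, CI; data visualizations
•NHST: set null & alternative hypotheses (one- & two-tailed); set α and determine $t_\alpha$; calculate $t_{obtained}$ and compare it to $t_\alpha$
•Sampling distribution: the distribution that represents all possible values of the sample statistic under the null
-Created by taking an infinite number of random samples of the same N from the same population
-Theoretical and mathematically defined (e.g., t distributions)
-Its mean is the "expected value"; its SD is the "standard error"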
ok dude
difference so large that chance is not a plausible explanation for the difference •The observed difference is very unlikely to have occurred if the null hypothesis (e.g., no difference) were true. •Synonymous with "rejected the null hypothesis"
statistically significant
•Use participants who are readily available •Often a university's "psychology subject pool" WEIRD: Western, Educated, Industrialized, Rich, Democratic •Problem: questionable generalizability, but statistical conclusions can be robust •Replication helps us to overcome some of the shortcomings w/this technique
Biased (convenience) sampling
Distribution of ALL the individual scores in the population. often ~Normal distribution.
Population Distribution
probability of a type II error
beta
sample selected in such a way that not all samples from the population have an equal chance of being chosen
biased sample
a no-treatment group to which other groups are compared
control group
number from a sampling distribution that determines whether the null hypothesis is rejected Threshold for statistical significance.
critical value
concept in mathematical statistics that determines the distribution that is appropriate for sample data
degrees of freedom
paired-samples design in which individuals are paired by the researcher before the experiment
matched pairs
Paired-samples design in which pairing occurs without intervention by the researcher.
natural pairs
hypothesis about a population or relationship among populations
null hypothesis
process that produces probabilities that are accurate when the null hypothesis is true
null hypothesis significance testing
•We will use the term "Population" not just to mean "all people" or "all the things that could be measured" •We will also use it to mean something like, "all people, in this specific situation" •Example: "Does meditation improve attention in college students?" •The population we're interested in is "college students who meditate" •And the "score" we're interested in is a measure of attention •And we would compare it to another population: "college students who do not meditate"*
population
$t = (\bar{X} - \mu_{\bar{X}})/\hat{s}_{\bar{X}}$ (used when we DON'T know σ; we estimate it with $\hat{s}$, computed from the sample standard deviation)
t-statistic
1.What is the size of this difference? •Effect size 2.What's a plausible interval for our estimate of a population parameter? •Confidence interval 3.How likely/unlikely is it that this sample would occur, if there is truly no difference? •Null hypothesis significance testing (NHST)
three questions/answers
one value/level of independent variable
treatment
failure to reject a null hypothesis that is false
type II error
•provide a range of plausible values for the population parameter •Confidence intervals are an interval estimate. •They include the population parameter a certain percentage of the time •Often set at 90%, 95% or 99% •For example, a 90% CI would contain the population parameter on 90% of samples •Compared to a point estimate, a confidence interval allows you to infer more about the mean of an unmeasured population •Broadly: Think about CI as using a margin of error around a point estimate •CIs give us a sense of the precision of our point estimate...narrower is better!
Interval estimates
a difference that is not statistically significant
NS
There are actually 3 possible alternative hypotheses, and you must choose one before you start your experiment. The first is non-directional (µ0 ≠ µ1); the two directional ones are: 2. µ0 > µ1 (army nurses show lower career satisfaction than other nurses) 3. µ0 < µ1 (army nurses show higher career satisfaction than other nurses)
ONE-TAILED/DIRECTIONAL
•complete set of the things we're interested in
Population
•a set of data drawn from a population
Sample
Compares a sample mean with a standard or known population mean (unknown sigma) t distribution Sampling Distribution of Means
Single-sample t-test
the probability of a type I error
alpha
hypothesis about population parameters that is accepted if null hypothesis is rejected
alternative hypothesis
the sampling distribution of the mean approaches a normal curve as N gets larger
central limit theorem
Questions to ask before drawing a conclusion:
•Did we assign participants to groups randomly? > Guards against extraneous variables. Random assignment is critical in between-groups designs; within-groups designs don't have this issue because each person is compared to themselves.
•Do our results fit well with previous research (confirm or extend it)? > If yes, then we can trust them more; if not, then we need to ask why.
•Is the NHST approach appropriate, or is it questionable here? > With a large enough sample, even a trivially small difference can reach p < .05, and the result is easy to misinterpret.
•Did we take previous research into account? > Science is cumulative, but NHST does not consider base rates (priors) of things happening in the world.
conclusion
•Sample means are not typically identical to population mean •More often close to population mean than far from it •BUT, impossible to know for sure how close it is, from a single sample •Small samples miss the target by more than do large samples •BUT, even moderately sized samples can provide somewhat stable estimates of populations •When sample means are averaged in the long-run, they do equal population mean
sampling
subset of population. chosen so that all samples of the specified size have equal probability of being selected
random sample
area of a sampling distribution that corresponds to test statistic values that lead to rejection of the null hypothesis
rejection/critical region
convenience samples; seldom obtained using random sampling
research sample
the standard deviation of a sampling distribution
standard error
decreases as N increases
standard error of the mean
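A quick way to see this card in action is to plug a few sample sizes into σ/√N. This is only a sketch; the SD of 15 is a made-up value for illustration, not something from the notes.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: the SD divided by the square root of N."""
    return sd / math.sqrt(n)

sd = 15.0  # hypothetical population SD, chosen only for illustration
for n in (4, 25, 100, 400):
    # Quadrupling N cuts the standard error in half
    print(f"N = {n:3d}  standard error = {standard_error(sd, n):.2f}")
```

Because N sits under a square root, you need four times as many participants to halve the standard error.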
•The sampling distribution of the mean approaches a normal curve as N (sample size) increases
•The mean of the sampling distribution (the expected value) = the mean of the population: $\mu_{\bar{X}} = \mu$
•The standard deviation of the sampling distribution of the mean is called the standard error. It is equal to the population standard deviation divided by the square root of sample size: $\sigma_{\bar{X}} = \sigma/\sqrt{N}$
*Note: These characteristics may not apply to other statistics (e.g., mode, median, variance), but the sampling distribution of means is the one most commonly used for the inferential statistical approach we use, NHST.
•If the population distribution is normal: the sampling distribution is normal (for any N)
•If the population distribution is not normal (e.g., skewed): the sampling distribution becomes more normal as N increases
•The sampling distribution becomes approximately normal with moderate to large sample sizes, even for skewed population data (N = 30 is a good number to start with)
---
•A sampling distribution of means is hypothetical
•It is the distribution of means that you'd get from a given population if you re-sampled a bazillion times (with the same sample size every time)
•The sampling distribution of means is approximately normal, even when the population distribution is not (as long as N is large enough)
•Mathematically defined
•We can identify probabilities of specific means being drawn from a population
Central Limit Theorem (p. 159 of Tales textbook)
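The "skewed population, more normal sampling distribution" idea can be checked with a small simulation. This is a sketch under assumptions that are mine, not the textbook's: the population is exponential (strongly right-skewed), "a bazillion" is replaced by 5,000 samples, and skewness (0 for a symmetric curve) stands in for "how normal does it look".

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Draw `reps` samples of size n from a skewed population; return their means."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

def skewness(values):
    """Average cubed z-score: 0 for a symmetric distribution, positive for right skew."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - m) / sd) ** 3 for v in values)

rng = random.Random(1)          # fixed seed so the run is repeatable
means_n2 = sample_means(2, 5000, rng)
means_n30 = sample_means(30, 5000, rng)

# The distribution of means is much less skewed with N = 30 than with N = 2
print("skew of means, N = 2: ", round(skewness(means_n2), 2))
print("skew of means, N = 30:", round(skewness(means_n30), 2))
```

The individual scores stay just as skewed; only the distribution of sample means moves toward a symmetric, normal-looking shape.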
•Difference (deviation) divided by a standard deviation (not by the standard error)
•d is a difference between means in terms of population standard deviation units
•This gives us a standardized sense of how big the difference is (or how far off our sample mean is from the comparison/standard value)
•Cohen's heuristics: 0.2, 0.5, and 0.8 for small, medium, and large effects (respectively)
•One-sample case: sample mean minus the standard comparison value (the population mean under the null), divided by the population SD estimated from the sample. A positive d indicates a sample mean larger than the hypothesized population mean; a negative d indicates a sample mean smaller than it.
•The magnitude of d is the important part: the difference between means, in standard deviation units. If the means are the same, d = 0.
•When we originally looked at d, it was population mean 1 vs. population mean 2, divided by the population SD. Which mean goes first is an arbitrary decision, so the sign of d doesn't matter.
•Conventionally, the "experimental" group is group 1 and the control group is group 2, so differences favoring the experimental group are positive.
Cohen's d
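As a rough sketch of the one-sample version of d, the function below divides the difference by the estimated SD and attaches Cohen's labels. The cutoffs between labels are an assumption for illustration (Cohen's 0.2, 0.5, and 0.8 are benchmarks, not strict bins), and the numbers in the example are hypothetical.

```python
def cohens_d(sample_mean, comparison_mean, sd_estimate):
    """Difference between means in standard deviation units."""
    return (sample_mean - comparison_mean) / sd_estimate

def describe(d):
    """Label the magnitude of d using Cohen's heuristics (sign is ignored)."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Hypothetical values: sample mean 105, null mean 100, estimated SD 10
d = cohens_d(105, 100, 10)
print(d, describe(d))
```

Flipping the order of subtraction flips the sign of d but not its label, which is why the magnitude is the part to interpret.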
CI = point estimate (sample statistic) ± critical value ($t_\alpha$ from the t table) × standard error (SD of the sampling distribution). The second part (critical value × standard error) is the margin of error. Remember that $\hat{s}$ is computed with N − 1 in the denominator.
Example: 95% CI: [-0.17, 5.77]
•95% Confidence Interval: an interval around the point estimate that would include the true population parameter 95% of the time in the long run (if we were to randomly sample from the same population over and over again, using the same sample size each time).
•95% Confidence Interval: The book gives the interpretation that we can be 95% confident that the mean will be within the confidence interval. Some statisticians argue that this is technically not correct (and that the first interpretation above is the only valid one). However, the book's interpretation can also be a useful way to think about it.
•A 90% CI means that for 90% of all possible samples (of the same size), the interval will capture the true population parameter. A 95% CI means that for 95% of all possible samples (of the same size), the interval around the sample statistic will capture the true population parameter.
•Only sample statistics in the outer 5% of the sampling distribution have confidence intervals that "miss" the true population parameter.
•Note: The true population parameter is constant! All that we have is our sample, so we don't know for sure whether our sample mean is close to the population mean (or whether our CI contains the population mean). Still, a confidence interval is more useful in estimating the population parameter than a point estimate alone.
•Key point: confidence intervals provide a sense of the precision of our estimate. Narrower confidence intervals indicate more precise estimates. Narrow CIs usually come from less variable populations and bigger sample sizes.
•CIs allow you to decide something about the mean of an unmeasured population: they provide interval estimates for a population parameter.
Confidence Interval around a Sample Mean (about a Population Mean)
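The recipe "point estimate ± critical value × standard error" can be sketched in a few lines. The five scores are made up for illustration, and the critical value 2.776 is the two-tailed .05 value for df = 4 as read from a t table; it is passed in by hand because the notes look critical values up rather than computing them.

```python
import math
import statistics

def confidence_interval(scores, t_crit):
    """CI around a sample mean: mean +/- t_crit * (s_hat / sqrt(N))."""
    n = len(scores)
    mean = statistics.fmean(scores)
    s_hat = statistics.stdev(scores)      # uses N - 1 in the denominator
    standard_error = s_hat / math.sqrt(n)
    margin_of_error = t_crit * standard_error
    return mean - margin_of_error, mean + margin_of_error

scores = [4, 6, 5, 7, 8]                  # hypothetical data, N = 5, df = 4
lower, upper = confidence_interval(scores, t_crit=2.776)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

A larger N or a smaller s-hat shrinks the standard error, which is what makes the interval narrower.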
t distribution; Sampling Distribution of Differences Between Means
•Used to analyze a between-groups design
•Like the paired-samples t test, we have two samples standing for two populations. Unlike the paired-samples t test, each sample experiences a different level of the IV (between-groups design): one sample experiences condition A, whereas the other sample experiences condition B.
----
Descriptives:
•Here, we don't want to treat row 1 of the data as one person in two conditions; we need to treat each group separately. So no individual difference scores: we get the means first and then the difference between them.
•How many participants are in each group? What are the two means? What are the two standard deviations ($\hat{s}$)? (Calculate deviations, squared deviations, then $\hat{s}$ with N − 1.) Show the means in a bar graph.
Effect size:
•$d = (\bar{X}_1 - \bar{X}_2)/\hat{s}_p$
•Samples with a larger sample size give a more precise estimate of variability, so they should be weighted more heavily. The pooled formula does that: it is a calculation using the sample size and $\hat{s}$ of each group.
Confidence interval:
•$(\bar{X}_1 - \bar{X}_2) \pm t_\alpha (s_{\bar{X}_1 - \bar{X}_2})$
•Point estimate (difference between the two means); critical value (from the t table); standard error of a difference (pooled error)
•df = N1 + N2 - 2
•Good news! You won't have to calculate the standard error; we will provide this information.
NHST:
1. Set up the hypotheses and assume the null is true.
•Null hypothesis (H0): Caffeine doesn't affect memory (μ1 = μ2).
•Two-tailed alternative (H1): Caffeine affects memory (μ1 ≠ μ2).
•One-tailed alternatives: Caffeine improves memory (μ1 > μ2), or caffeine impairs memory (μ1 < μ2).
•Choosing a two-tailed or one-tailed t test is an important decision that will affect our stats. The null hypothesis for a one-tailed test is different and must also be directional. For the first one-tailed test, the null would be that caffeine doesn't affect memory or impairs it. For the second one-tailed test, the null would be that caffeine doesn't affect memory or improves it.
2. Sampling distribution of differences between means, with $\mu_{\bar{X}_1 - \bar{X}_2} = 0$. Here is what this sampling distribution is really about: it is a distribution of differences between the means of two samples, and it is centered on 0.
•1) Take a sample from the population and calculate the mean
•2) Take another sample from the population and calculate the mean
•3) Calculate the difference between the two group means
•4) Repeat this process to get all possible differences between group means
•5) Plot the differences between means of the pairs of groups we sampled from the population
3. Choose a value to set critical values and rejection regions: alpha (α)
4. Calculate the t statistic from your experiment: $t = (\bar{X}_1 - \bar{X}_2)/s_{\bar{X}_1 - \bar{X}_2}$. We already have the pieces from calculating the confidence interval!
5. Draw a conclusion.
Independent-samples t test
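Here is a sketch of the whole independent-samples calculation on made-up caffeine/placebo memory scores. Two things are assumptions rather than material from the notes: the pooled SD and standard error use the usual pooled-variance formulas (the notes only say the course provides the standard error), and the critical value 2.306 is the two-tailed .05 table value for df = 8.

```python
import math
import statistics

def independent_t(group1, group2, t_crit):
    """Pooled-variance independent-samples t test from raw scores."""
    n1, n2 = len(group1), len(group2)
    diff = statistics.fmean(group1) - statistics.fmean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)  # N - 1
    # Pooled SD: each group's variance weighted by its degrees of freedom
    s_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    # Standard error of the difference between means
    se_diff = s_pooled * math.sqrt(1 / n1 + 1 / n2)
    return {
        "df": n1 + n2 - 2,
        "t": diff / se_diff,
        "d": diff / s_pooled,
        "ci": (diff - t_crit * se_diff, diff + t_crit * se_diff),
        "reject_null": abs(diff / se_diff) > t_crit,
    }

caffeine = [8, 9, 7, 10, 6]   # hypothetical memory scores, group 1
placebo = [5, 6, 4, 7, 3]     # hypothetical memory scores, group 2
result = independent_t(caffeine, placebo, t_crit=2.306)
print(result)
```

With these numbers the difference is 3 points and the standard error works out to about 1, so t is about 3.0 on 8 df, which is beyond the critical value; the CI around the difference also excludes 0.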
•Start with two equivalent groups
•Treat them identically, EXCEPT...
•...manipulate one thing. That is, create one difference in the way the groups are treated.
•This is called the Independent Variable (IV)
•Measure the groups on a thing you think would be influenced by the manipulation
•This is called the Dependent Variable (DV)
•If the difference is not likely due to chance factors, then conclude that the difference is due to the manipulation (i.e., the IV)
2 LEVELS OF THE INDEPENDENT VARIABLE; DEPENDENT VARIABLE (SEE OUTLINES)
•Between-groups design = between-subjects design = independent-samples design: participants get different levels of the IV
•Within-groups design = within-subjects design = paired-samples design = repeated-measures design: participants get both levels of the IV (the dependent variable can be measured after each level, or just once after both levels when the two are woven together)
Differentiating paired vs. independent samples analyses:
•The design of your experiment determines whether you should conduct a paired-samples or independent-samples analysis. Do you have a within-groups or between-groups design? Do scores in one condition need to be matched with scores in the other condition?
•The IV does NOT determine which analysis is appropriate. The same IV can be manipulated in both within-groups and between-groups designs.
Logic of a Simple Experiment
"Statistically significant" = "reject the null hypothesis" = "p < alpha" = "FOUND an effect"
•Method 1: Calculate a test statistic and compare it to a critical cutoff value.
-Use α to look up the critical cutoff value in a table (t-critical values)
-Compute the test statistic
-IF the test statistic is more extreme than the critical value, THEN reject H0
-This is the method we use in our hand calculations.
•Method 2: Compare the p-value to α
-IF the p-value is lower than α, THEN reject H0
-For example, if α = .05 and p = .03, we reject the null hypothesis. But if p = .08, we fail to reject the null.
-THIS is the method we use with JASP (and most stats software)
-We won't learn how to calculate a p-value by hand
-When you see "p < .05" in research reports, that indicates a statistically significant effect (assuming α = .05)
•What p is and is not:
-p is the probability of the observed results (or more extreme) occurring, IF H0 is true
-p is NOT the probability that the data are due to chance. (This one is tricky. The key distinction that makes the statement accurate is "IF the null hypothesis is true.")
-p is NOT the probability that H0 is true
-p is NOT the probability that H0 is false
-p is NOT the probability of a Type I error
-p is NOT the probability of making a wrong decision
-The complement of p (which is 1 - p) is NOT the probability that H1 is true or false
•Method 3: Confidence interval
-Calculate a confidence interval based on the chosen α
-If α = .05, then use a 95% confidence interval
-IF the null-hypothesized value (the mean of the sampling distribution under H0) is NOT contained in the confidence interval, THEN reject H0
•ALL three methods will give the same conclusion. They are interchangeable.
Making a reject/retain decision in NHST: Three ways
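The claim that the methods agree can be sketched for Methods 1 and 3 with a one-sample example. Method 2 is left out because the p-value comes from software (e.g., JASP) rather than hand calculation. The summary statistics are hypothetical, and 2.064 is the two-tailed .05 table value for df = 24.

```python
import math

def decide(sample_mean, mu0, s_hat, n, t_crit):
    """Return (t, reject by critical value, reject by confidence interval)."""
    se = s_hat / math.sqrt(n)
    t = (sample_mean - mu0) / se
    # Method 1: is the test statistic more extreme than the critical value?
    reject_by_t = abs(t) > t_crit
    # Method 3: does the CI around the sample mean miss the null value?
    lower, upper = sample_mean - t_crit * se, sample_mean + t_crit * se
    reject_by_ci = not (lower <= mu0 <= upper)
    return t, reject_by_t, reject_by_ci

# Hypothetical study: null mean 50, s-hat 5, N = 25 (so the standard error is 1)
print(decide(52, 50, 5, 25, t_crit=2.064))   # t = 2.0: both methods retain H0
print(decide(53, 50, 5, 25, t_crit=2.064))   # t = 3.0: both methods reject H0
```

The two checks are the same inequality rearranged: |t| exceeds the critical value exactly when the null value sits more than one margin of error away from the sample mean.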
•Recognize 2 possibilities for the population sampled from, which has a mean $\mu_1$:
-Null hypothesis (hypothesis of equality): $H_0$
-Alternative hypothesis (hypothesis of difference): $H_1$
•Assume for the time being that $H_0$ is true
•Choose a sampling distribution that gives all possible values of a statistic under $H_0$
•Choose a value to set critical values and rejection regions: alpha (α)
-*NOTE: It's actually best practice to use a lower df if you cannot find the exact df in the table, because it makes for a more conservative test. But as we'll see, I think that's a moot point when we consider "close" rejections as somewhat problematic.
-*Alpha corresponds to critical t and $\bar{X}$ values: 1. the career satisfaction values correspond to the different alpha values we can set, 2. which correspond to different t values in this exact t distribution
•Collect a sample from the population and calculate $\bar{X}$
•Calculate the difference between the sample mean and the null hypothesis mean. Then calculate the test statistic.
--
•OK, so how do we generate probabilities? What information do we need if we are to use a t distribution? There is no need to actually build the empirical sampling distribution; we can just use the theoretical sampling distribution (the t distribution).
•So we can get a t value for our sample mean, locate it on the sampling distribution, and determine how likely such a value would be if the null were true
--
•Make a decision. Use the sampling distribution to determine the probability of the pattern actually observed plus those more extreme
-Small probability: reject the null hypothesis. If $\bar{X} > \mu_0$, conclude that $\mu_1 > \mu_0$. If $\bar{X} < \mu_0$, conclude that $\mu_1 < \mu_0$.
-Large probability: conclude that the data are consistent with the null hypothesis (i.e., no difference was detected)
NHST
•NHST always begins with a claim about a parameter, such as a population mean, μ0
•µ0 = standard, or known, population
•µ1 = new/unknown population that you are using the sample $\bar{X}$ to represent
•"The logic of NHST is indirect. Researchers (much like the suspicious sophomore) usually believe that the hypothesis of difference is correct. Support for the hypothesis of difference would be provided if the hypothesis of equality is discredited. Fortunately, the hypothesis of equality produces predictions about outcomes. NHST compares actual outcomes to those predictions. If the outcome is not what the hypothesis of equality predicts, that hypothesis is discredited, leaving only the hypothesis of difference."
•Two possible outcomes:
1. Hypothesis of equality (null hypothesis): my sample comes from a population which is not different from the known population. This hypothesis is what generates mathematical predictions that we can test the obtained results against. (How likely are the obtained results if there is truly no difference?)
2. Hypothesis of difference (alternative hypothesis): my sample comes from a population which is different from the known population. This is what we usually think is true when we embark on a research project.
--
Note: In NHST, we never "prove the null"
•If we reject the null hypothesis, we are claiming we have strong support for the alternative hypothesis
•With α = .05, this means there is a less than 5% chance this data pattern (or one more extreme) could have happened under the null!
•If we fail to reject the null (aka "retain" the null), we are not making a strong claim about the null (i.e., we cannot claim equivalence), because there are several possibilities:
-There really is no effect (null is true)
-Our research design could have been weak (null is false, but we didn't detect it)
-Or others...
•This roundabout logic is a major criticism of NHST, and something we will return to later in the semester
•NHST generates the probability that, if the null hypothesis is true in reality, we would obtain these results or results more extreme than these
NHST Logic
1. Set up the null and alternative hypotheses (assume the null is true for now)
2. Specify and characterize the appropriate sampling distribution
•Type of distribution
•df
•µ0
•Standard error (calculated from $\hat{s}$)
3. Find the critical value based on the chosen alpha
4. Calculate the mean for the sample
5. Calculate the test statistic
6. Make a decision: use the sampling distribution to determine the probability of that sample mean (or one more extreme) if the null is true.
•If the test statistic is more extreme than the critical value, reject H0 (claim a difference)
•If the test statistic is not more extreme than the critical value, retain H0 (we cannot claim a difference)
Use other information (effect size and confidence interval) to give context to this result.
Example write-up: We reject the null hypothesis: army nurses ($\bar{X}$ = 25) have significantly greater career satisfaction than the general population of civilian nurses (µ = 22), t(99) = 6.98, p < .05. The effect size of this difference is d = 0.70, which is a medium-to-large effect by Cohen's heuristics. Also, 95% CI: [24.15, 25.85], indicating a fairly precise estimate of the population mean of army nurse career satisfaction.
NHST for one-sample t-test revised
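The army nurse write-up can be reproduced from summary statistics. Two inputs are assumptions: the card does not report s-hat, so 4.3 is back-calculated from d = 0.70 and the 3-point difference, and 1.984 is the two-tailed .05 critical value for about 100 df as read from a t table. N = 100 follows from t(99).

```python
import math

sample_mean = 25.0   # army nurses' mean career satisfaction (from the card)
mu0 = 22.0           # civilian nurse population mean (from the card)
n = 100              # implied by df = 99
s_hat = 4.3          # ASSUMPTION: back-calculated, not reported on the card
t_crit = 1.984       # ASSUMPTION: table value, two-tailed alpha = .05

standard_error = s_hat / math.sqrt(n)
t_obtained = (sample_mean - mu0) / standard_error
d = (sample_mean - mu0) / s_hat
ci = (sample_mean - t_crit * standard_error,
      sample_mean + t_crit * standard_error)

print(f"t({n - 1}) = {t_obtained:.2f}")
print(f"d = {d:.2f}")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
print("reject H0" if abs(t_obtained) > t_crit else "retain H0")
```

Under those assumptions the output matches the reported t, d, and interval, which is a handy check that the three numbers in a write-up hang together.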
We want to know if one population has the same characteristics (e.g., mean) as another population on some measure. If we knew the values for both populations, we could simply compare them directly; there would be no need for inferential stats at all in that case.
In a one-sample design, we compare a sample mean from one population with a standard or known mean of another population.
•If sigma is known, we could run a one-sample z test
•If sigma is unknown, we run a one-sample t test
1. Set up the null and alternative hypotheses (assume the null is true for now)
2. Specify and characterize the appropriate sampling distribution
3. Find the critical value based on the chosen alpha
4. Calculate the mean for the sample
5. Calculate the test statistic
6. Make a decision: use the sampling distribution to determine the probability of that sample mean (or one more extreme) if the null is true.
•Small probability of the sample mean (if the null were true): reject the null (claim a difference)
•Large probability of the sample mean (if the null were true): retain the null (we cannot claim a difference)
How extreme is extreme enough to reject the null?
•We set alpha (α), our cutoff value: typically α = .05 (i.e., p < .05), two-tailed (so .025 in each tail)
•We look up the t value corresponding to that cutoff (the critical value), a.k.a. $t_{critical}$
•This determines our rejection region, where a sample mean must be located to reject the null
•If our calculated test statistic is more extreme than the critical value ($|t_{statistic}| > t_{critical}$ for a two-tailed test), it falls in the rejection region: we reject the null hypothesis and call the difference "statistically significant"
•This gives us a reasonable balance:
-If we set a more lax criterion, we would have too many false findings
-If we set a stricter criterion, we would rarely conclude we'd found anything
But...instead of a nuanced statement of likelihood, NHST is a binary decision based on somewhat arbitrary convention
ONE-SAMPLE t TEST
t distribution; Sampling Distribution of Mean Differences
•Used to analyze a within-groups design, or a between-groups design with pairs that are logically matched
•Compares two samples, standing for two populations, in which the same (or paired) participants experienced both levels of the IV
•Primarily used for within-groups, or repeated-measures, designs: participants are in both conditions of the experiment and are measured in each condition
-Memory strategy #1 (retrieval practice) vs. memory strategy #2 (restudy)
-Congruent and incongruent trials in a Stroop color-naming task
•Can also be used for:
-Natural pairs: contrasting two members of a pair on the DV, where the pairing occurs without intervention by the researcher. Siblings: Are first-born kids more intelligent than second-born? Parent/child: Is a child more extraverted than their parent?
-Matched pairs: individuals are paired by the researcher before the experiment. Example: pairs for a collaborative memory experiment
---
Mean of differences!
•Mean difference: $\bar{X}_D$ or $\bar{D} = \Sigma D/N$
•Estimated population SD (of the differences): $\hat{s}_D = \sqrt{\Sigma(D - \bar{X}_D)^2/(N-1)}$
•Effect size: $d = \bar{D}/\hat{s}_D$
•Remember that the sign depends on the order of subtraction you chose. Sometimes there's a reason to do it one way or another (post - pre; second born - first born), sometimes not. Here the order was large - small, so negative differences mean large was faster.
•Standard error: $\hat{s}_{\bar{D}} = \hat{s}_D/\sqrt{N}$
•Confidence interval: lower limit $= \bar{D} - t_\alpha(\hat{s}_{\bar{D}})$; upper limit $= \bar{D} + t_\alpha(\hat{s}_{\bar{D}})$
Step 1: Set up the null and alternative hypotheses
•H0: The difference in speed between the two monitor sizes is 0; $\mu_{difference} = 0$
•H1: The difference in speed between the two monitor sizes is NOT 0; $\mu_{difference} \neq 0$
•Assume for the time being that $H_0$ is true
Step 2: Specify and characterize the appropriate sampling distribution that gives all possible values of a statistic under $H_0$. What is our statistic here? The MEAN DIFFERENCE. The paired-samples t test sampling distribution is the distribution of mean difference scores, with $\mu_{\bar{D}} = 0$
Step 3: Choose a value to set critical values and rejection regions: alpha (α)
Step 4: Collect sample(s) from the populations and calculate the statistic $\bar{D}$
Step 5: Calculate the test statistic: $t = (\bar{X}_1 - \bar{X}_2)/\hat{s}_{\bar{D}}$ OR $t = \bar{D}/\hat{s}_{\bar{D}}$
Step 6: Make a decision. Use other information (effect size and confidence interval) to give context to this result.
Reversing the order of subtraction: not much would change. The signs flip, but the conclusions are the same.
Paired-samples t test
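The paired-samples steps can be sketched with difference scores. The task times below are made up for a large-vs-small monitor comparison like the one on the card (order of subtraction: large - small), and 2.776 is the two-tailed .05 table value for df = 4.

```python
import math
import statistics

def paired_t(cond1, cond2, t_crit):
    """Paired-samples t test computed from difference scores (cond1 - cond2)."""
    diffs = [a - b for a, b in zip(cond1, cond2)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)          # D-bar
    s_hat_d = statistics.stdev(diffs)            # estimated SD of differences (N - 1)
    se = s_hat_d / math.sqrt(n)                  # standard error of D-bar
    t = mean_diff / se
    return {
        "mean_diff": mean_diff,
        "t": t,
        "df": n - 1,
        "d": mean_diff / s_hat_d,
        "ci": (mean_diff - t_crit * se, mean_diff + t_crit * se),
        "reject_null": abs(t) > t_crit,
    }

large = [10, 12, 9, 11, 13]    # hypothetical task times (s), large monitor
small = [12, 15, 10, 14, 14]   # hypothetical times for the same people, small monitor
result = paired_t(large, small, t_crit=2.776)
print(result)
```

Swapping the two arguments flips the sign of the mean difference, t, d, and the CI limits, but the reject/retain decision stays the same.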
•Variability between conditions is typically lower in a paired-samples t-test run on a within-groups design •Comparing each person to themselves tends to reduce random noise •As a result, a within-groups design tends to allow for finding effects with smaller sample sizes
Paired-samples vs. Independent-samples t-test
•Statistic as an estimator of a parameter •As a single value, X ̅ is our best estimate of μ
Point estimates
Too much power and you run the risk of detecting trivially small differences - But we can use effect size and reason to decide whether these are practically important differences Too little power and you risk not detecting real differences that could be important - This is often a worse problem than "too much power" Typically looking for power = .80 Practical use of Power • To determine the number of participants (subjects) required to detect an effect
Practical use of Power
Distribution(s) of the individual scores of a sample from the population.
1. Small samples look nothing like the population distribution. Small samples are not awesome. And new samples of the same size look different from each other.
2. A larger sample starts to look more like the population distribution. Nice. Larger samples also look a bit more like each other than smaller samples do. They are still not perfect stand-ins for the population, but they are much better than the small samples.
•Larger samples are better than small ones. They tend to be more like each other, and more like the population.
•There is much more variability in means with small samples than with larger ones, along with a wider range and more extreme deviations from the true population mean.
Sample Distributions
Distribution of MEANS of a *large set of samples* from the population (*large set of samples* technically means an infinite number). It allows us to assess the likelihood of a specific sample mean.
•Infinite number of samples of the same size drawn from a population
•The mean of each sample is calculated
•The means are used as the "data points" for building the sampling distribution of means
•Mean of the sampling distribution = $\mu_{\bar{X}}$
-Equal to μ (the mean of the population)
•Standard deviation of the sampling distribution = $\sigma_{\bar{X}}$
-Also called the "standard error"
-Smaller than σ (SD of the population): $\sigma_{\bar{X}} = \sigma/\sqrt{N}$
---
BUT...we typically have only sample data...
•In practice, we don't take an infinite number of samples and construct this distribution.
•The sampling distribution is estimated from what data we do have.
*SEE TABLE OF DISTINGUISHING; SE decreases as N increases
Sampling Distribution of means
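The two claims on this card (the mean of the sampling distribution equals μ, and its SD equals σ/√N) can be checked by brute force. This is a sketch with assumed values: a normal population with μ = 100 and σ = 15, samples of N = 25, and 4,000 samples standing in for "infinite".

```python
import math
import random
import statistics

MU, SIGMA, N, REPS = 100.0, 15.0, 25, 4000   # all hypothetical values

rng = random.Random(7)                        # fixed seed for a repeatable run
means = [statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(N))
         for _ in range(REPS)]

mean_of_means = statistics.fmean(means)       # should land close to MU
sd_of_means = statistics.pstdev(means)        # should land close to SIGMA / sqrt(N)
predicted_se = SIGMA / math.sqrt(N)

print("mean of sample means:", round(mean_of_means, 2))
print("SD of sample means:  ", round(sd_of_means, 2))
print("sigma / sqrt(N):     ", predicted_se)
```

In practice only one sample is available, so the formula (with s-hat in place of σ) does the job that the 4,000 simulated samples do here.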
•Extremity of underlying difference (i.e., true effect size) •Observations that are far different from null are more likely to lead to rejecting the null hypothesis (i.e., big effects are easier to "find") •One-tailed versus two-tailed tests •Well, this one depends •A one-tailed test in the right direction increases likelihood of rejecting the null (compared to a two-tailed test) •A one-tailed test in the wrong direction means there is no chance of rejecting the null •Alpha chosen at the start of the study •Smaller alpha makes it less likely to reject null (less likely to find an effect) •Sample size •Larger sample sizes make it more likely to reject the null hypothesis
Some elements that affect statistical significance
The probability of making a correct decision (rejecting the null) when the null hypothesis is false
-Probability of finding an effect, if it exists
-Also, the probability that we will avoid a Type II error
-Probability: 1 - β; aim for 80% or 0.8
---
Ways to increase power:
1) Use a higher alpha level (e.g., 10% instead of 5%). What's wrong with this approach? Because this also increases the probability of a Type I error, it is not usually a good method for increasing statistical power. However, this approach can be useful in early stages of research, but it should be followed by additional studies with a strict alpha to confirm any findings.
2) Use a one-tailed hypothesis instead of a two-tailed hypothesis. A two-tailed test divides alpha into two tails. When we use a one-tailed test, putting the entire alpha into just one tail, we increase the chances of rejecting the null hypothesis (IF we were correct about the direction of the difference), which increases statistical power. What's wrong with this approach? IF the effect is in the opposite direction, we have no chance of detecting it. You must have good reasoning for choosing a one-tailed test besides increasing power, for example, strong predictions from prior research/theory.
3) Increase sample size (N)!!!!
4) Somehow reduce variability. As sample size increases (from panel (a) to panel (b) in the figure), the distributions of means become narrower and therefore provide more power. The same is true for decreasing standard deviations (variability).
5) Somehow make the difference between population means bigger. As the difference between means becomes larger, there is less overlap between the curves. In the figure, the lower pair has a larger mean difference and therefore less overlap, which translates to higher statistical power.
Statistical Power
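A rough sketch of how alpha, tails, sample size, and effect size move power. It is an approximation that uses the normal (z) curve in place of the t distribution for a one-sample test; the function name `approx_power` and its parameters are made up for illustration, and the one-tailed branch assumes the predicted direction is positive.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n, alpha=0.05, two_tailed=True):
    """Approximate power of a one-sample test (normal approximation, not exact t).

    d is the true effect size in SD units, n is the sample size.
    The one-tailed branch assumes the predicted direction is positive.
    """
    z = NormalDist()
    # Distance between the null and true sampling distributions, in standard errors.
    shift = d * sqrt(n)
    if two_tailed:
        z_crit = z.inv_cdf(1 - alpha / 2)  # alpha split across two tails
        return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)
    z_crit = z.inv_cdf(1 - alpha)  # all of alpha in the upper tail
    return 1 - z.cdf(z_crit - shift)


if __name__ == "__main__":
    for n in (16, 32, 64):
        print(n, round(approx_power(0.5, n), 3))
```

Calling it with a negative `d` and `two_tailed=False` gives a power near zero, which matches the warning about a one-tailed test in the wrong direction.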
•The term "significant" in "statistically significant" does not mean "important" -It's simply another way to say "rejected the null hypothesis" •Often, when we "find an effect" (i.e., reject the null, claim statistical significance), that effect is useful for science •Sometimes it's also useful in solving real-world problems -"Testing enhances learning" is a useful product of science •BUT, sometimes statistically significant effects are not particularly valuable •How to sort out statistical significance from practical importance: -Start by looking at effect size. •If it's small, then it's possible the effect won't really occur much naturally, in real life. •If it's large, then perhaps it will be present in real life, naturally. -BUT, also look at context of: •Other relevant research (helps interpret effect size measure) •Applied domain (helps us determine whether we should care in the real world)
Statistical significance versus practical importance
In the book, they use a sampling distribution of mean differences (differences from in-class material highlighted in yellow). Because we're conceptualizing this in terms of differences, μ_X̄ = 0. We just express these as differences instead of means. It also allows you to intuitively check your work. All of the math and the formulas are the same, but now we are thinking of the difference as a unit, rather than just the mean. This will actually be a sampling distribution we use going forward...
book
•IF the observed value is very close to the critical value, then consider running the study again, ideally with a larger sample •For example, if t-critical = 1.98 and t-statistic = 1.95 (just short of the cutoff) •Or if t-critical = 1.98 and t-statistic = 1.995 (just past the cutoff) •It could be a fluke, either way. Just to be safe, run the study again, with more data. If it's a real effect, you'll probably get it. If it's not, then good thing you double-checked. Larger effects more often lead to rejecting the null hypothesis. Larger effects are easier to detect.
borderline
range of scores that is expected to contain a parameter. Although population means are generally unknowable, a sample can often be obtained. A confidence interval (CI) is an inferential statistic that uses the sample mean and standard error to shed light on the whereabouts of the population mean. Calculating a CI produces a lower and an upper limit. The limits define an interval that is expected, with a particular degree of confidence, to contain the population mean. Commonly chosen degrees of confidence are 95% and 99%. Confidence intervals are used to bracket not only population means but other population parameters as well. CIs are an important part of the "new statistics" (Cumming, 2014). The terms "margin of error," "interval estimate," and "confidence interval" all represent the same idea. Specifically, a confidence interval is an interval estimate based on a sample statistic; it includes the population mean a certain percentage of the time if the same population is sampled from repeatedly. (Note: We are not saying that we are confident that the population mean falls in the interval; we are merely saying that we expect to find the population mean within a certain interval a certain percentage of the time, usually 95%, when we conduct this same study with the same sample size.) Because surveys only talk to a sample of the population, we know that the result probably won't exactly match the "true" result that we would get if we interviewed everyone in the population. The margin of sampling error describes how close we can reasonably expect a survey result to fall relative to the true population value. A margin of error of plus or minus 3 percentage points at the 95% confidence level means that if we fielded the same survey 100 times, we would expect the result to be within 3 percentage points of the true population value 95 of those times. --- End by stating that this is a narrow CI, indicating fairly precise estimation. Why is it so narrow?
Large sample size, which has two impacts on this equation: 1. Makes SE smaller 2. Makes the critical value smaller (not easy to see in this version of the t table, but it does). 95% CI: [24.15, 25.85]
confidence interval
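A minimal sketch of the CI arithmetic for a mean: lower and upper limits are the sample mean minus and plus (critical t × standard error). The function name and the numbers below are hypothetical and are not the data behind the [24.15, 25.85] interval; the critical value 2.064 is the tabled two-tailed 95% t value for df = 24.

```python
from math import sqrt


def confidence_interval(mean, s_hat, n, t_crit):
    """Return (lower, upper) limits of a CI about a sample mean.

    t_crit is looked up in a t table for df = n - 1 and the chosen confidence level.
    """
    se = s_hat / sqrt(n)   # estimated standard error of the mean
    margin = t_crit * se   # half-width of the interval
    return mean - margin, mean + margin


# Hypothetical sample: mean 50, s-hat 10, N = 25, so df = 24 and t_crit = 2.064.
lower, upper = confidence_interval(mean=50.0, s_hat=10.0, n=25, t_crit=2.064)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")  # 95% CI: [45.87, 54.13]
```

A larger N shrinks the interval twice over: the standard error gets smaller, and so does the critical t.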
d = (X̄ − μ₀) / ŝ, where μ₀ is the claimed population mean and ŝ is the estimated population standard deviation
effect size index
Based on our decision, we can be either correct or incorrect, and there are a number of ways to classify this. There is the true situation (the null is true or the alternative is true). Then there is the decision (you either fail to reject the null or reject the null). Type I error: rejecting the null hypothesis when the null hypothesis is true; saying that something happened when it didn't (FALSE ALARM; probability = alpha). Type II error: failing to reject the null hypothesis when the null hypothesis is false; saying that nothing happened when it did (MISS; probability = beta). [Graphic: true state of reality vs. NHST decision]
errors
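The true-state-versus-decision grid can be written out as a small lookup. This is only an illustrative sketch; the function name and the wording of the returned labels are made up here.

```python
def classify_decision(null_is_true: bool, rejected_null: bool) -> str:
    """Label one cell of the 2 x 2 grid: true state of reality vs. NHST decision."""
    if null_is_true and rejected_null:
        return "Type I error (false alarm)"
    if not null_is_true and not rejected_null:
        return "Type II error (miss)"
    if null_is_true:
        return "Correct decision (retained a true null)"
    return "Correct decision (rejected a false null)"


# Print all four cells of the grid.
for null_is_true in (True, False):
    for rejected_null in (True, False):
        print(null_is_true, rejected_null, "->", classify_decision(null_is_true, rejected_null))
```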
Where does a score fall in a sample distribution? Where does a score fall in a population distribution? Where does a sample mean fall in a sampling distribution? And if the distribution is normal, this info also tells us about probabilities of drawing a given score or sample. FYI: For some reason, your book uses μ instead of μ_X̄ in this formula. First two formulas: How many standard deviations away does this score fall from the mean, in this sample (far-left formula) or in this population (middle formula)? Formula on the right: How many standard errors (standard deviations in a sampling distribution of means) does this mean fall from the mean of means (expected value) in this sampling distribution of means?
locating values in a distribution
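The three formulas this card refers to did not survive in the text. A likely reconstruction, assuming the standard z-score forms (the symbols here are assumptions: S for the sample standard deviation, σ for the population standard deviation, and σ_X̄ for the standard error), in left, middle, right order:

```latex
\[
z = \frac{X - \bar{X}}{S}
\qquad
z = \frac{X - \mu}{\sigma}
\qquad
z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}
\]
```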
statistical test of the hypothesis that a sample with mean X-bar came from a population with a mean mu.
one-sample t test
statistical test that can detect a positive difference in population means or a negative difference, but not both
one-tailed test of significance
When a one-tailed test may be justified: •Previously reported directions in the literature •Previously reported effect sizes are small •Only one direction is interesting, and missing the other direction would not be a big issue One-tailed (directional) •PROS: Increases power and thus the likelihood of finding an effect •CONS: Can lead to missing an opposite effect OR can lead to more false positives in the predicted direction, because all of alpha sits in one tail Two-tailed (non-directional) •PROS: More conservative about false positives, so you are less likely to make a Type I error •CONS: Suffers from a loss of power, so you can miss an effect that exists
one-tailed vs. two-tailed tests
•How to interpret this "p < ⍺" statement: •p < .05: less than a 5% probability of getting the results obtained (or results more extreme) if the null hypothesis is true •occurring fewer than 5 times in 100, if the null is true (i.e., a 5% false-alarm rate) •Thus, "p < .05" is synonymous with "rejected the null hypothesis" in this example •More generally, p < ⍺ is synonymous with "rejected the null hypothesis" •p ≥ .05: greater than or equal to a 5% probability of getting the results obtained (or results more extreme) if the null hypothesis is true •occurring 5 or more times in 100 •Thus, "p ≥ .05" is synonymous with "retained the null hypothesis" (i.e., failed to reject the null) •With JASP (and other stats programs), we will get exact p-values •A p-value is the probability of the obtained statistical test value (or one more extreme) when H0 is true •Low probabilities count as evidence against H0 •When lower than ⍺, reject H0 FORMAT: (µ = 22), t(99) = 0.23, p > .05.
p values
experimental design in which scores from each group are logically matched
paired-samples design
1-beta; prob of rejecting a false null hypothesis
power
using chance to assign participants to groups so that extraneous variables affect groups equally
random assignment
experimental design in which each subject contributes to more than one treatment
repeated measures
theoretical distribution of a statistic based on all possible random samples drawn from the same population
sampling distribution
the distribution of sample means over repeated sampling from one population 1. every sample is random sample drawn from specified population 2. sample size N is same for all samples 3. number of samples very large 4. mean calculated for each 5. sample means arranged into freq distribution Remember: means near to the mean of the sampling distribution are more likely to occur. Means in the tails are very unlikely.
sampling distribution of the mean
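The five steps above can be imitated with a small simulation. This is only a sketch: a true sampling distribution is theoretical, so a large number of samples stands in for "very large", and the normal population (mean 100, SD 15) and the function name are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples random samples of size n and return the list of sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]  # steps 1-2: random sample, same N
        means.append(sum(sample) / n)                      # step 4: mean of each sample
    return means                                           # step 5: the values to tabulate


if __name__ == "__main__":
    sample_means = simulate_sample_means(mu=100, sigma=15, n=25, num_samples=20000)
    print("mean of sample means:", round(mean(sample_means), 2))  # close to mu
    print("SD of sample means:  ", round(stdev(sample_means), 2))  # close to sigma / sqrt(N)
    print("sigma / sqrt(N):     ", 15 / sqrt(25))
```

The mean of the simulated means lands near the population mean (the expected value), and their standard deviation lands near σ/√N (the standard error).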
•Frequency distributions of scores (ch 2) •Sampling distribution: a distribution that represents all possible values of the sample statistic under the null -Created by taking an infinite number of random samples of the same N from the same population -Theoretical and mathematically defined (e.g., t distributions) -Mean is the "expected value", SD is the "standard error" •If scores are distributed normally, you can use a z score to determine the probability of occurrence of any particular score (ch 7) •Now...frequency distribution OF STATISTICS (not scores): Sampling distributions •A bunch of separate samples •All drawn from the same population •If this is distributed normally, you could use z scores (and soon, t scores) to estimate the probability of occurrence of any particular value of that statistic. One of the more complex and important ideas of the whole semester: "Here is a progression of ideas that lead to what a sampling distribution is and how it is used. In Ch 2, you studied frequency distributions of scores. If those scores are distributed normally, you can use z scores to determine the probability of the occurrence of any particular score (Ch 7). Now imagine a frequency distribution, not of scores but of statistics, each calculated from separate samples that are all drawn from the same population. This distribution has a form, and if it is normal, you could use z scores to determine the probability of the occurrence of any particular value of the statistic. A distribution of sample statistics is called a sampling distribution." p.150
sampling distributions
probability (alpha) chosen as the criterion for rejecting the null hypothesis
significance level
standard deviation of a sampling distribution of differences between means
standard error of a difference
•Collect a sample of participants •Perform calculations to allow them to stand in for a population •Draw inferences about population
statistical test logic
The sampling distribution we've discussed so far is normally distributed, and we know sigma (so we can use z). But we typically don't know σ, which we need to calculate the standard error σ_X̄, so we estimate it from the sample: ŝ_X̄. If you have a large enough sample, estimating from the sample is basically like estimating from the population. But if your sample is smaller, you need to use a different sampling distribution! •This is a common issue! We often only have sample statistics, not population parameters. t distributions: a family of curves for small samples. Some things to note: 1) We use t distributions when we do not know sigma 2) There are multiple t distributions with slightly different shapes 3) The t distributions are similar to the standard normal (z) distribution 4) The t distributions are sampling distributions of the mean 5) Each t distribution is identified by its df (which is set by N) •Degrees of freedom, or df = N - 1 •The t table also allows us to find the probability of samples (and it will be important) •Like the standard normal z distribution, t distributions are symmetrical and centered on 0. •Values on the distribution represent standard deviations (standard errors) from the mean. •The t distributions have greater variance than the z: wider, with fatter tails. •The exact shape of a t distribution depends on the sample size. With smaller samples, a greater proportion of the distribution is contained in the tails. •With bigger and bigger sample sizes, the t distribution gets closer to the standard normal (z) distribution.
t distribution
There are three rows across the top of the table; the row you use depends on the kind of problem you are working on. In this chapter, use the top row because the problems are about confidence intervals. The second and third rows are used for problems in Chapter 9 and Chapter 10. The body of Table D contains t values. They are used most frequently to find a probability or percent associated with a particular t value. For data with a df value intermediate between tabled values, be conservative and use the t value associated with the smaller df. Table D differs in several ways from the normal curve table you have been using. In the normal curve table, the z scores are in the margin and the probability figures are in the body of the table. In Table D, the probability figures are in the headings at the top and the t values are in the body of the table. In the normal curve table, there are hundreds of probability figures, and in Table D there are only six. These six are the ones commonly used by researchers. Figure labels: probability figures (the most commonly used cutoffs) run across the top; the t statistic values associated with those probabilities fill the body; the bottom row is the same as the z values.
t table
t = difference / standard error. Numerator: difference between means. Denominator: standard error (SD of the sampling distribution). t is doubly affected by sample size. d = difference / standard deviation. Numerator: difference between means, like the t test. Denominator: now deviations are in POPULATION SD units. d is barely affected by sample size. t is strongly affected by sample size (larger N = larger t). The estimated population SD would change a little bit with fluctuations in sample size. So, observed Cohen's d CHANGES VERY LITTLE WITH CHANGES IN SAMPLE SIZE. Large difference between obtained sample and population... but no statistically significant difference?! Effect size: Independent of statistical significance, what is the size of the effect? Low power is the likely reason. Power: Your ability to detect a statistically significant effect when there really is an effect (when the null is truly false). In general, if your results are just at the border of significance, slightly above or slightly below the threshold for statistical significance, run the study again with a larger sample size. Can find even tiny effects with a large enough sample size. *For large sample sizes (typically above 500), critical values are approximately equal to those of the standard normal z distribution (so, 1.96 in this case) •It is a mistake to equate statistical significance with practical importance -Practical importance depends on context, usefulness -Effect size can be an indicator of practical importance •Statistical significance shifts with sample size -Larger samples allow us to find even small effects •Both good and bad, in a way •Effect size does not change much with sample size -Use this to help interpret the NHST decision
t vs. d
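A small sketch of why t grows with sample size while d does not, using the one-sample formulas t = (X̄ − μ₀)/ŝ_X̄ and d = (X̄ − μ₀)/ŝ. The function name and the numbers are hypothetical, and ŝ is held fixed across sample sizes as a simplification (in real data it would fluctuate a little).

```python
from math import sqrt


def one_sample_t_and_d(mean, mu0, s_hat, n):
    """Return (t, d) for a one-sample comparison of a sample mean to mu0."""
    se = s_hat / sqrt(n)        # standard error shrinks as N grows
    t = (mean - mu0) / se       # difference in standard-error units
    d = (mean - mu0) / s_hat    # difference in standard-deviation units
    return t, d


# Same 2-point difference and same s-hat, two different sample sizes.
for n in (25, 400):
    t, d = one_sample_t_and_d(mean=52.0, mu0=50.0, s_hat=10.0, n=n)
    print(f"N = {n}: t = {t:.2f}, d = {d:.2f}")
```

With N = 25 the t value is 1.00, and with N = 400 it is 4.00, while d stays at 0.20 in both cases.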
