BIO 206: Exam #2
CI that are calc. from sample data can be used to determine the range of possible values for pop. parameters - 95% CI go along with α = 0.05 - Would have smaller range than a 99% CI
12. Know how confidence intervals are used to test hypotheses.
Alternative to F-Test since it's so sensitive - Assumes that the frequency distribution of measurements is roughly symmetrical within all groups - Performs better than F-test when this assumption is violated & can be applied to more than 2 groups
12. What is Levene's Test?
H0: Pop. is normally distributed Ha: Pop. is not normally distributed According to Central Limit Theorem, don't have to test for normality when sample size very large
13. What are the H0 & Ha for the normality test? Do you need to test for normality when the sample size is very large?
- Misleading because they are really requirements that must be fulfilled before carrying out the test - Risk of Type I error increases if assumptions are violated
13. Why is the word "assumptions" misleading? What happens if violated assumptions but go ahead & do test anyway?
Confidence intervals related to standard error with the 2SE Rule of Thumb (Calc. approximate 95% confidence intervals) 1.) Calc. mean of sample (Ybar) 2.) Calc. SE of sample 3.) Multiply SE by 2 4.) Subtract 2SE from Ybar to get lower limit 5.) Add 2SE to Ybar to get upper limit 6.) Interpret confidence limit
4. How are confidence intervals related to standard error? Describe the rule of thumb
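A minimal Python sketch of the 2SE rule of thumb described above. The sample values are invented for illustration and the function name approx_95_ci is hypothetical, not something from the course.

```python
import math
import statistics

def approx_95_ci(sample):
    """Approximate 95% CI for the mean using the 2SE rule of thumb."""
    ybar = statistics.mean(sample)                          # step 1: sample mean
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # step 2: standard error
    margin = 2 * se                                         # step 3: 2SE
    return ybar - margin, ybar + margin                     # steps 4 and 5: limits

# Hypothetical measurements, invented for illustration
sample = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0, 5.0]
lower, upper = approx_95_ci(sample)
print(f"approx. 95% CI: {lower:.2f} to {upper:.2f}")
```

Step 6 (interpretation) is then a sentence of the form "we are about 95% confident the pop. mean lies between the lower and upper limits".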
No Overlap: Can reject H0 & conclude pops. differ at .05 alpha Mean Overlap: Can't reject the H0 Overlap but not w/ Mean: Cannot make claim about the significance
4. Use confidence intervals to determine whether 2 means are significantly different. What can conclude if 2 CIs of 2 groups don't overlap? What can you conclude if CIs of at least 1 group overlaps the mean of the other group? What can you conclude if the CIs of 2 groups overlap, but neither includes mean of the other group?
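The three overlap cases above can be sketched as a small Python function. The name compare_cis and the example numbers are hypothetical; this only encodes the rule as stated in the notes and is not a substitute for a formal test.

```python
def compare_cis(mean1, ci1, mean2, ci2):
    """Apply the CI-overlap rule to two group means and their 95% CIs."""
    lo1, hi1 = ci1
    lo2, hi2 = ci2
    if hi1 < lo2 or hi2 < lo1:
        return "no overlap: reject H0 at alpha = 0.05"
    if lo2 <= mean1 <= hi2 or lo1 <= mean2 <= hi1:
        return "a CI includes the other mean: fail to reject H0"
    return "overlap without including a mean: no claim about significance"

# Hypothetical group means and intervals
print(compare_cis(5.0, (4.0, 6.0), 9.0, (8.0, 10.0)))
print(compare_cis(5.0, (4.0, 6.0), 5.5, (4.5, 6.5)))
print(compare_cis(5.0, (4.0, 6.0), 7.0, (5.5, 8.5)))
```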
Confidence Intervals: range of values that's likely to contain pop. parameter (Used for precision & hypothesis testing) - Number in the middle of the 2 limits is the sample estimate (e.g. Ybar), b/c the interval is built as estimate +/- a margin - Believe the pop. parameter (μ) is between those 2 limits & want to accurately describe & make conclusions about it
4. What are confidence intervals used for? It contains 2 numbers equally spaced on either side of 1 other number: What is that number?
1.) State the Hypotheses 2.) Compute Test Statistic 3.) Determine the P-Value (Large or small) 4.) Draw Appropriate Conclusion (Reject H0 if p-value small b/c result unlikely to happen by chance)
6. Be able to use the 4 steps of hypothesis testing to be able to detect patterns (Differences between groups & relationships between variables)
Power: probability that a random sample will lead to rejection of a false H0 - Desirable so can draw accurate conclusions about a population from sample data - Can increase power if sample size large, true discrepancy from H0 is large, or if variability in pop. is low
6. Discuss power statistics. What is power in statistics? Why is it desirable? What can you do to increase power?
Type I: rejecting a true H0 (Significance level α sets prob. of committing Type I error) - Can reduce chance by having smaller α (Usually to 0.01) (Can increase chance of Type II error) Type II: failing to reject a false H0 - Study w/ low prob. of Type II error said to have high power (prob. random sample leads to rejection of false H0)
6. Explain what Type I & Type II errors are & what the consequence of each type of error is?
- CI that are calculated from sample data can be used to determine the range of possible values for pop. parameter - 95% CI go with an α of 0.05 - Higher-confidence CI (99%) has a bigger range since there's more likelihood parameter is in that range, while lower-confidence CI (95%) has smaller range, which is why we're less confident in it
6. Know how confidence intervals are used to test hypotheses.
- Observed come from own data from a sample - Expected comes from a model assuming H0 is true & results due to chance
8. All X2 tests compare observed to expected values. Where do the observed values come from? Where do expected values come from?
- Larger the discrepancy between O & E, larger the X2 value will be - Determine whether large or small by comparing to critical value/CV - Reject H0 if test statistic is > CV (O doesn't = E, data don't match predicted results which assume H0 is true) - Significant difference between the 2 & something is scientifically interesting
8. If the observed values are very dif. from expected values, will X2 be large or small? How do you determine if X2 is large or small? If X2 value (test statistic) is large, will you reject H0 & if reject H0, what can you conclude about observed & expected values?
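A short Python sketch of how the X2 statistic grows with the discrepancy between O & E, and how it is compared to a critical value. The counts are hypothetical; 3.841 is the standard table value for df = 1 at α = 0.05.

```python
def chi_square(observed, expected):
    """X2 = sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for 2 categories (df = 2 - 1 = 1)
observed = [30, 20]
expected = [25, 25]          # what H0 predicts
critical_value = 3.841       # X2 critical value for df = 1, alpha = 0.05

x2 = chi_square(observed, expected)
reject_h0 = x2 > critical_value
print(f"X2 = {x2:.3f}, reject H0: {reject_h0}")
```

Making the observed counts more lopsided (e.g. 40 and 10) would raise X2 past the critical value.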
Goodness of Fit test is more powerful
8. If there are only 2 categories, the X2 Goodness of Fit test is equivalent to the binomial test. However, the X2 GoF is better why?
- F-Test: evaluates whether 2 pop. variances are equal (H0: variances equal & Ha: variances not equal) - Test Stat: F, calc. from ratio of the 2 sample variances - F is close to 1 when the variances are similar (Large & small separated by CV) - Very sensitive to assumption of being normally distributed, so alternative is Levene's test, but F-Test does have more power
12. What's the F-Test? What is it?
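A minimal sketch of the F test statistic in Python. Putting the larger sample variance in the numerator is an assumed convention here, and the two groups are invented data; the critical value would still come from an F table.

```python
import statistics

def f_ratio(sample1, sample2):
    """Ratio of the two sample variances, larger variance on top (assumed convention)."""
    v1 = statistics.variance(sample1)
    v2 = statistics.variance(sample2)
    return max(v1, v2) / min(v1, v2)

# Hypothetical measurements for 2 groups
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

f_stat = f_ratio(group_a, group_b)
print(f"F = {f_stat:.2f}")  # near 1 would suggest similar variances
```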
- Changes each measurement by the same mathematical formula for data to be compared & become more similar to normal distribution - 3 most common transformations: log, arcsine, & square-root transformation
13. How are data transformations used to normalize data so they can be analyzed with a parametric test? Which transformations are most commonly used in the sciences?
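The three transformations named above can be sketched in Python as follows. The function names are hypothetical, the natural log is assumed for the log transformation, and the arcsine transformation is taken in its usual form for proportions, arcsin of the square root.

```python
import math

def log_transform(y):
    """Natural log; requires y > 0."""
    return math.log(y)

def sqrt_transform(y):
    """Square root; often used for counts."""
    return math.sqrt(y)

def arcsine_transform(p):
    """arcsin(sqrt(p)); used for proportions between 0 and 1."""
    return math.asin(math.sqrt(p))

# Every measurement is changed by the same formula
counts = [1, 4, 9, 16]
print([sqrt_transform(c) for c in counts])
print(arcsine_transform(0.25))
```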
- Normality Test determines whether sample data has been drawn from normally distributed pop. - Many statistical tests assume that data come from a normally distributed population --> PARAMETRIC TESTS - Nonparametric tests are for data not normally distributed - But parametric tests are preferred because they have MORE POWER to be able to make better conclusions
10. How is the normality test used? Why is it important to assess normality before doing many statistical tests? What are parametric & nonparametric tests?
- CLT: if randomly sample large # of indivs. from pop., mean or sum of sample will be approximately normally distributed - Important b/c many statistical tests require a normally distributed sampling distribution & we can meet that requirement if the sample size is large, even if pop. from which sample was taken was not normally distributed - Important b/c can use statistical tests that are based on normal distributions (Parametric tests) for more power
10. What is the Central Limit Theorem (CLT) and why is it important?
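A small simulation sketch of the CLT in Python. The skewed population (exponential with mean 1), the sample sizes, and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_means(n, reps=2000):
    """Means of `reps` random samples of size n from a skewed (exponential) pop."""
    return [sum(random.expovariate(1.0) for _ in range(n)) / n
            for _ in range(reps)]

means_small = sample_means(5)
means_large = sample_means(50)

# Sample means center on the pop. mean (1.0), and their spread shrinks as n grows
print(statistics.mean(means_large), statistics.stdev(means_large))
print(statistics.mean(means_small), statistics.stdev(means_small))
```

Plotting a histogram of means_large would show a roughly bell-shaped curve even though the population itself is strongly right-skewed.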
Continuous Unimodal & Symmetrical Area under curve between 2 points corresponds with probability in that range Mean = mode = median 95% of area under curve lies within 1.96 s.d. of the mean * Shape & Location of normal distrib. can be completely described by mean & standard deviation*
10. What is the shape of the normal distribution?
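The "95% within 1.96 s.d." statement can be checked with Python's standard library. This sketch uses the standard normal (mean 0, s.d. 1) as an assumed example; any mean and s.d. give the same area.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)  # standard normal distribution

# Area under the curve between -1.96 and +1.96 s.d. of the mean
area = z.cdf(1.96) - z.cdf(-1.96)
print(f"area within 1.96 s.d. = {area:.4f}")
```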
Because it can be used to describe many biological variables and many statistical techniques have been developed for dealing with variables with a normal distribution
10. Why is it important that the normal distribution is a frequency distribution that can be used to approximate the sampling distribution of many continuous variables?
- Less variability goes with smaller SEs, which goes with narrower CIs that are less likely to overlap & more likely to reject H0 - More variability decreases likelihood of rejecting H0 - As α decreases (.05 to .01), width of CI broadens, so precision goes down
11. How does the amount of variability in a sample affect the likelihood of rejecting H0 in hypothesis tests? What happens to width of CI & precision as α changes (i.e. .05 v. .01)
If large dif. between Ybar & μ, absolute value of t will be large (But must compare to critical value --> df is n-1) - If exceeds critical value, can reject H0, claim have dif. means, claim results not due to chance & due to something scientifically interesting
11. If the t-value is large enough, it may indicate that the null can't be believed. What is dividing point between "large" & "small" t-values? If t-value exceeds that point, what will you conclude about the difference between the means?
1. Measure the PRECISION of an estimate of pop. parameter 2. Find most likely range of values within which pop. parameter might lie 3. Evaluate hypotheses about pop. parameters
11. What are the 3 things confidence intervals are used for?
1-Sample: compares mean of sample (Ybar) to hypothesized pop. mean (μ) - H0: pop. mean equals the hypothesized value & Ha: pop. mean differs from it - Increasing sample size makes more likely to reject - Assumptions are: that data from random sample & variable normally distributed in the pop.
11. What is one-sample t-test & when is it used? What is H0 & Ha? Where do numbers often included in H0 come from? Does sample size affect likelihood of rejecting H0? What are assumptions of one-sample t-test?
- Sampling Distribution - T-distribution used to assess the significance of the dif. between 2 sample means or the dif. between a sample mean & mean of pop. or a hypothesized mean - Assumes H0 is true
11. What is the Student's T-Distribution? Explain what sampling distributions show, how they are developed (Including the 1 important assumption used in the process), & how they are used.
Formula: Ybar +/- t × SE - Increasing sample size decreases SE - Smaller SE goes with narrower range for CI (More precise) - Can be more confident with larger interval, but that means it is less precise - If hypothesized value lies within range, fail to reject H0, so more difficult to reject with larger range
11. What is the formula for confidence interval for the mean? As sample size increases, does SE increase or decrease? What does that do to size of CI & estimate of precision? How does that affect likelihood of rejecting H0 in hypothesis tests using CI?
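A minimal Python sketch of the CI formula for the mean, Ybar +/- t × SE. The data are invented, the function name is hypothetical, and the critical t is passed in from a table: 2.306 is the two-tailed value for α = 0.05 with df = 8.

```python
import math
import statistics

def t_confidence_interval(sample, t_crit):
    """CI for the mean: Ybar +/- t * SE, with t taken from a t table."""
    ybar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return ybar - t_crit * se, ybar + t_crit * se

# Hypothetical sample of n = 9, so df = 8
sample = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0, 5.0]
t_crit = 2.306  # two-tailed critical t for alpha = 0.05, df = 8

ci_lower, ci_upper = t_confidence_interval(sample, t_crit)
print(f"95% CI: {ci_lower:.3f} to {ci_upper:.3f}")
```

A hypothesized value inside this interval would lead to failing to reject H0; a larger sample would shrink SE and narrow the interval.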
- Distribution with 0 at center, CV confines total of 5% of area (2.5% on both sides) - Assumes null is true - If assumption true, no dif. between means you're comparing & t-value will be close to 0 - May be something other than 0 if sampling error has occurred
11. What is the sampling distribution for t like? What does it assume is true? What does it mean if assumption is true? What might cause t-value not equal to that even if H0 is true?
Formula: t = (Ybar - μ) / SE - Larger sample size helps to decrease SE, & more likely to reject H0 - Larger s.d. increases SE, & less likely to reject H0 - Need large t value to reject, so bigger dif. of Ybar & μ makes rejection more likely
11. What is the t-value formula & how does it change? Since you're more likely to reject the H0 if t-value is large, how does sample size, s.d, SE, & dif. of Ybar & mew affect prob. of rejecting H0?
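The t-value formula can be sketched in Python as below. The sample and the hypothesized mean of 4.5 are made up, and 2.306 is the two-tailed table value for α = 0.05 with df = 8.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (Ybar - mu0) / SE, where SE = s / sqrt(n)."""
    ybar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (ybar - mu0) / se

# Hypothetical sample (n = 9, df = 8) and hypothesized pop. mean
sample = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0, 5.0]
t_value = one_sample_t(sample, mu0=4.5)
t_crit = 2.306  # two-tailed critical t for alpha = 0.05, df = 8

reject_h0 = abs(t_value) > t_crit
print(f"t = {t_value:.3f}, reject H0: {reject_h0}")
```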
Type I error
11. What type of error would it be if, due to chance, get a large t-value, even though H0 is true. In that case, you would reject the H0 & conclude that the populations that the means came from have dif. means, even though actually don't?
- t-values in the middle fall within the central 95% of the distribution, not in the rejection regions on either end - t has to fall within a rejection region, which is marked by the critical values on either end, to reject H0 - Combined area of the rejection regions is equal to 0.05, or α - Calc. p-value from the area in the tails beyond the observed t - P-value represents prob. of getting a t at least that extreme if H0 is true
11. With t-values, why does value near middle indicate data are consistent with H0? How far from middle does t have to be before can reject H0? Where are critical values of important figure? Where is rejection region & what is area equal to? How can you calc p-value from this figure & what does it represent?
Populations are the focuses of studies & samples are smaller subsets taken from populations to study them - Have to use samples to make conclusions on populations most of the time b/c populations are often too big for each individual to be analyzed & impossible to study whole pop. - Important to estimate pop. parameters so have an understanding of biological variables
4. What are samples & populations? Why can't we directly make conclusions about populations? Why is it important to estimate pop. parameters?
Sampling Distribution: probability distribution of all of the possible values for an estimate that might be obtained when sampling a pop. - Usually have sample means all compiled together of the dif. samples - Sample size impacts width & thus precision (Not accuracy) of the sampling distribution - Higher sample size = skinnier
4. What are sampling distributions? How are they constructed & what do they show? How represent error associated with taking samples? How are they affected by sample size & how does that affect the precision of estimates based on sampling distributions.
- Statistics are values describing samples from pops. while parameters are values describing a whole pop. - Statistics are not exact estimates but rather approximations since whole population can't be studied & samples are subject to sampling error
4. What are statistics & parameters? Why are statistics not exact estimates of parameters?
We are 95% confident that the population parameter lies within the range of ... to ...
4. What is an example of an interpretation of a confidence interval?
Sampling Error: statistical error that comes from the random chance of a random sample not representing the entire pop. of data - More sampling error reduces the precision - Necessary to distinguish between effects so don't make a Type I error - Reduce sampling error by increasing sample size
4. What is sampling error? Does it reduce precision or accuracy? Why is data analysis necessary to distinguish between effects from sampling error & true biological effects? How do you reduce sampling error?
- Standard Error is the measure of variability about a sampling distribution - Reflects precision of an estimate since reflects the difs. between an estimate & the target parameter - Smaller the SE, more precise & less uncertainty there is about target parameter in pop. - Decrease SE by increasing sample size
4. What is standard error, what does it measure, & how is it used?
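A quick Python sketch of how SE of the mean shrinks as sample size grows, using SE = s / sqrt(n). The s.d. of 8.0 and the sample sizes are arbitrary illustration values.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: s / sqrt(n)."""
    return sd / math.sqrt(n)

# Same variability, increasing sample size
for n in (4, 16, 64):
    print(n, standard_error(8.0, n))
```

Quadrupling the sample size halves the SE, which is why larger samples give more precise estimates.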
- Important to quant. how precise so make sure you're making accurate conclusions & not falsely rejecting H0 - 2 Methods: SE & CI - 2 Goals: Estimation & Calc. how precise estimates are
4. Why is it important to quantify how precise estimates of pops. are? How is this quantification accomplished (2 methods)? What are the 2 goals of statistical inference?
Random trial = process that has 2 or more outcomes - Outcome of a random trial can't be predicted with certainty - Only one outcome is observed from each repetition of a random trial
5. Explain what random trials are
Probability (of an event) = proportion of all random trials in which the specified event occurs if we repeat the process many times (Proportions; always 0-1) - Calculated by: fraction of trials in which event occurs (# favorable outcomes / sample space) - Tells you likelihood of an event occurring
5. How is probability calculated & what does it tell you?
- Bell-curved; symmetrical about mean - Unimodal - Prob. density highest exactly at mean - Mean, median, & mode all same Can be used to approximate SAMPLING DISTRIBUTION of estimates, especially of sample means - & good for frequency distributions of biological values
5. Normal distribution is most important prob. distribution for continuous variables. Describe its shape & how the data are distributed relative to the mean of the distribution.
- Event has prob. of 0 if event never happens - Event has prob. of 1 if event always happens
5. What does it mean if the probability of an event is 0? What does it mean if the probability of an event is 1?
- Between any 2 values of contin. variable, infinite # of other possible values - Described with curve whose height (Y-AXIS) is probability density (Describe prob. of any range of values for the variable) - Normal Distribution = continuous prob. distrib. - In continuous, can't determine prob. of specific event but instead prob. of obtaining values falling in certain range (Calc. area under curve between 2 points) - Area under entire curve always equal to 1
5. What does the probability distribution of a continuous variable look like? How is the y-axis labelled? What's the dif. between discrete & continuous prob. distributions?
- Can be for categorical & discrete numerical variables - All possible outcomes taken into account, so sum of all probs. must add to 1 - Y-axis: Probability (.05, .1, etc.) - Prob. of an outcome can be determined by looking at the height of the bar that represents outcome - Discrete, indiv. outcomes (1, 2, 3, yes/no, true/false, etc.)
5. What does the probability distribution of a discrete variable look like? How is the y-axis labelled? In these distributions, how can prob. of an outcome be determined?
P-value is the prob. of getting a result as extreme as or more extreme than what you observed, assuming the H0 is true - Small = only small chance of getting that result if H0 true, so have evidence allowing to doubt H0 & maybe reject it - Type I Error = rejecting true H0 (α gives prob. of committing Type I error); so, can reject H0 if p-value < or equal to 0.05 (Or whatever alpha) - If p-value greater than α, conclude fail to reject H0 - If reject H0, the smaller the p-value is, greater evidence is that results didn't occur by chance (Still very slim chance) - Significance values aren't absolute lines & researchers can consider whether results strong enough if close
6. Understand & explain P-values. How does the p-value relate to the likelihood of Type I error?
One-tailed: Ha only includes values either higher or lower than the one stated in H0 - Should be used if want to determine if dif. between groups in specific direction Two-tailed: Ha includes all feasible values for the pop. parameter other than the one stated in H0 (P-value is twice that of 1-tailed)
6. What are directional (one-sided/one-tailed) & nondirectional (two-sided/two-tailed) alternative hypotheses? When should they be used?
- Null Distrib. = sampling distribution of outcomes for test stat. under assumption H0 is true - Created by obtaining data & organizing in computer for sampling distrib. of value & probability - Mismatch between data & expectation under H0 can be quite large, even when H0 is true, particularly if not many data - So calc. prob. of mismatch as extreme as or more extreme than that observed & see if data compatible with H0 - Null distribs. assume H0 is true, meaning all outcomes described by the sampling distrib. are determined purely by chance
6. What are null sampling distributions, how are they created, & how are they used to test hypotheses?
POPULATIONS, not samples! - Take samples to get data to be used to calc. stats., which are used to test hypotheses, leading to pop. conclusions - Hypotheses must be written about the populations & conclusions must be written about pops. that samples came from
6. What are scientific hypotheses about?
Test Statistic: number calculated from the data that is used to evaluate how compatible the data are with the result expected under H0 (Analyze these to make conclusions)
6. What are test statistics? Where do they come from & how are they used?
- Used to assess how compatible the data in a sample are with the H0 - Always start w/ idea that any results you obtained were due to chance & if observe what appears to be pattern in data, yet fail to reject H0, apparent pattern was artifact of sampling error, caused by chance
6. What is hypothesis testing used for?
Hypothesis testing uses sample data to make inferences about pop. from which sample was taken - Only asks whether parameter differs from specific "null" expectation - HT asks "Is there any effect at all?" - Uses probability to see if results due to chance or something significant
6. What is the basic theory behind hypothesis testing? (REALLY, REALLY IMPORTANT)
- Alternative hypothesis is non specific & includes all other feasible values for pop. so H0 only statement that can be tested with data - If we reject H0, data inconsistent with H0 & say data support alternative hypothesis - Ruled out H0 value, tells which direction true value likely lies compared to null hypothesized value (But use estimation to provide magnitudes)
6. Why is the null hypothesis the only hypothesis that can be tested directly? What can we conclude if we reject the H0?
H0: The relative frequency of successes in the pop. is p0 (Null expectation/"p0" can be any specific proportion between 0 & 1, inclusive) - H0 can be true if nothing but chance influenced results (Often, null expectation is equal to 0.5) - Binomial test gives an exact P-value & can be applied in principle to any data that fit into 2 categories - Test statistic is just the number of "successes" - Probabilities of indiv. events can be summed to produce p-values that can be used to test hypotheses using the binomial test
7. Binomial test for analyzing: What's the null for this test? How is the test statistic for a binomial test found?
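A sketch of how probabilities of indiv. outcomes are summed into an exact binomial P-value, in Python. The data (9 successes in 10 trials, p0 = 0.5) are invented, and the two-sided convention used here (summing every outcome no more probable than the observed one) is an assumption; some courses double the one-tailed probability instead, which gives the same answer when p0 = 0.5.

```python
from math import comb

def binomial_prob(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_test_p(successes, n, p0):
    """Two-sided exact P-value: sum the probabilities of all outcomes
    that are no more probable under H0 than the observed outcome."""
    p_obs = binomial_prob(successes, n, p0)
    return sum(
        binomial_prob(k, n, p0)
        for k in range(n + 1)
        if binomial_prob(k, n, p0) <= p_obs * (1 + 1e-9)  # tolerance for float ties
    )

# Hypothetical data: test statistic is the number of successes
p_value = binomial_test_p(successes=9, n=10, p0=0.5)
print(f"P = {p_value:.4f}")
```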
X-Axis: Number or proportion of successes Y-Axis: Probability of # of successes occurring - Height of bars is the probability
7. Binomial variable sampling distribution: What are the values on the X & Y-axes? What do the heights of the bars represent?
Two Proportion Test: want to compare 2 proportions to each other (Does proportion 1 differ significantly from proportion 2?) - Is a one variable test (Like Binomial) but take 2 samples
7. How do you use the Two Proportion test?
- Gives an exact P-value, conditional prob. of obtaining a test stat. at least as extreme as ours given that the H0 is true
7. What does the binomial test p-value represent? How is it used to test the H0?
Only two (bi) possible outcomes & both are named (-nominal) categories - Binomial test applies the binomial sampling distribution to hypothesis testing for a PROPORTION - Used when a variable in a pop. has 2 possible states (i.e. "success" & "failure") - Analyzes binomial data with categorical variables
7. What kind of variables can be analyzed with a binomial test? Recognize binomial variables in graphs, tables, & text
- Values near 0 have higher probability b/c O more similar to E & similar to H0 - Less likely for larger numbers b/c bigger difference & less like what is expected, more in rejection region - No negative values b/c nominal variables don't have directionality - Skewed shape can vary from degrees of freedom - Rejection region is at Critical Value & beyond (If X2 value in that region, role of chance in influencing results is lower & something scientifically interesting)
8. Understand how sampling distribution of X2 values is used to test hypotheses regarding O & E values. Why does sampling distrib. for X2 have particular shape (Values near 0 w/ high prob. & values larger lower) Why no negative values? Where is rejection region & what does it mean?
1.) Indivs. in data set are from random sample of whole pop. 2.) None of the categories should have an expected frequency less than 1 3.) No more than 20% of the categories should have expected frequencies less than 5 - If assumptions not met & test anyway, chance of Type I error increases
8. What 3 assumptions must be met for a X2 Goodness of Fit Test & Contingency Test to be valid? Why is it important to check to make sure those assumptions are true before you do that statistical test?
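The two expected-frequency assumptions above can be checked with a few lines of Python. The function name and the example expected frequencies are hypothetical, and the random-sample assumption can't be checked from the numbers at all.

```python
def expected_freq_assumptions_met(expected):
    """True if no expected frequency is < 1 and no more than 20% are < 5."""
    none_below_1 = all(e >= 1 for e in expected)
    share_below_5 = sum(e < 5 for e in expected) / len(expected)
    return none_below_1 and share_below_5 <= 0.20

# Hypothetical sets of expected frequencies
print(expected_freq_assumptions_met([10, 10, 10, 10, 4]))     # 1 of 5 below 5
print(expected_freq_assumptions_met([10, 10, 4, 4]))          # half below 5
print(expected_freq_assumptions_met([0.5, 20, 20, 20, 20, 20]))  # one below 1
```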
- Used with categorical variables (Either 1 variable/1 sample: Goodness of Fit, or 2 variables: Contingency & Fisher's) - In X2 tests, sample data are compared to the data that would be expected (Observed vs. expected) if H0 of probability model is true
8. What kind of data can be analyzed with X2 tests (Goodness of Fit, Contingency analysis, & Fisher's Exact Test)?
Confidence Int. is the range of plausible values for the pop. parameter, calc. from sample data - For OR, H0 is OR = 1 (Meaning no difference) - Say that 95% confident the OR/RR of population is between xxx and xxx - If 1 isn't in that range, then reject H0
9. Confidence intervals for odds & risk ratios
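One way to get a CI for an odds ratio is sketched below in Python. The notes don't say which method the course uses, so the log-based interval here (SE of ln OR = sqrt(1/a + 1/b + 1/c + 1/d), with 1.96 for 95%) is an assumption, and the 2x2 counts are invented.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate 95% CI for the odds ratio from a 2x2 table.
    a, b = outcome yes/no in group 1; c, d = outcome yes/no in group 2."""
    or_hat = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Hypothetical counts: 10 of 100 treated vs. 30 of 100 controls had the outcome
or_hat, lower, upper = odds_ratio_ci(10, 90, 30, 70)
reject_h0 = not (lower <= 1 <= upper)  # H0: OR = 1
print(f"OR = {or_hat:.3f}, 95% CI {lower:.3f} to {upper:.3f}, reject H0: {reject_h0}")
```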
Used in contingency table analysis - Used to test equality - Used especially when more than 20% of cells have expected frequencies <5 - Assumptions are randomness
9. Under what conditions is the Fisher's Exact Test used?
- X2 contingency analysis - looking for relationships between 2 variables - Relative Risk - mostly clinical/medical trials - Odds Ratio - mostly clinical/medical trials - Fisher's Exact - relationships between 2 variables
9. What are the 4 types of analyses you can do with a 2x2 contingency table? What are each good for?
Contingency analysis is a hypothesis test used to check whether 2 categorical variables are independent or not H0: There's no association between variables (Independent) Ha: There is an association between the variables (Dependent) Assumptions: Categorical variables, all observations independent, cells in table are mutually exclusive, expected values in cells should be 5 or greater in at least 80% of cells - If assumptions not met, risk of error will increase - Can use Fisher's Exact
9. What are the null & alternative hypotheses of contingency tests? What are assumptions for contingency test? What can you do if assumptions violated & what are consequences if don't meet assumptions but go ahead & do contingency test anyway?
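A minimal Python sketch of where expected values come from in a contingency test: under H0 (independence), each expected count is row total × column total / grand total. The 2x2 table of counts is hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence of the 2 variables."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_contingency(table):
    """X2 = sum over all cells of (O - E)^2 / E."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical observed counts
table = [[10, 20],
         [30, 40]]
expected = expected_counts(table)
x2 = chi_square_contingency(table)
print(expected, round(x2, 3))
```

All expected values here are 5 or greater, so the expected-frequency assumption would be met for this table.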
Absolute Risk Reduction = risk in the control group minus risk in the treatment group - Tells how much a particular treatment decreases (or increases) risk compared to a control, in absolute terms - Not the same as one minus relative risk (1 - RR), which is the relative risk reduction: how much smaller the risk in the treatment group is as a proportion of the control risk
9. What is absolute risk reduction & what does it tell you?
Also measures magnitude of association between 2 categorical variables when each variable has only 2 categories - Calculated by doing: Proportion of success divided by Proportion of failure (Odds) - Odds ratio is odds of success in 1 group divided by odds of success in second group - Represents odds that an outcome will occur given a particular exposure *Odds & risk ratios are very common in clinical studies*
9. What is the odds ratio, how is it calculated, what is it used for, & what does it tell you?
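The odds ratio calculation can be sketched in Python as follows. The counts are hypothetical and "success" here just means the outcome of interest.

```python
def odds(successes, failures):
    """Odds = proportion of success / proportion of failure = successes / failures."""
    return successes / failures

def odds_ratio(success1, failure1, success2, failure2):
    """Odds of success in group 1 divided by odds of success in group 2."""
    return odds(success1, failure1) / odds(success2, failure2)

# Hypothetical counts: group 1 has 10 successes & 90 failures, group 2 has 20 & 80
or_value = odds_ratio(10, 90, 20, 80)
print(f"OR = {or_value:.3f}")  # OR = 1 would mean no association
```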
Relative Risk: prob. of an undesired outcome in the treatment group divided by prob. of the same outcome in control group - Compares risk of an event occurring between two groups, used in cohort studies (Prospective study) - Number Needed to Treat (NNT): often in health trials; # of indivs. who would have to be treated (Experience the explanatory variable) for 1 person to benefit
9. What kinds of studies produce data that can be analyzed by calculating relative risk? What is Number Needed to Treat (NNT)?
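A short Python sketch of relative risk and NNT. The counts are invented, and NNT is computed with its standard definition, 1 / absolute risk reduction, which the notes don't spell out.

```python
def risk(events, total):
    """Proportion of indivs. in a group with the undesired outcome."""
    return events / total

# Hypothetical trial: 10 of 100 treated & 20 of 100 controls had the outcome
risk_treatment = risk(10, 100)
risk_control = risk(20, 100)

relative_risk = risk_treatment / risk_control            # RR
absolute_risk_reduction = risk_control - risk_treatment  # difference in risk
nnt = 1 / absolute_risk_reduction                        # number needed to treat

print(f"RR = {relative_risk:.2f}, ARR = {absolute_risk_reduction:.2f}, NNT = {nnt:.0f}")
```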
Columns = Explanatory variable Rows = Response variable
9. What variables go in columns & rows when creating a contingency table to present data from 2 categorical variables?