EXST FINAL

Create a dataset that contains only the variables During and During_S. Call this dataset TensionD (Hint: we learned how to drop variables in a previous lab)

data TensionD;
  set tension;
  keep During During_S;
run;

True or False: ANOVA is a statistical measure adopted to analyze differences in group variances

False

True or False: If the correlation between two quantitative variables is zero, then they are independent.

False

True or False: If you do not have a significant interaction, you would still want to conduct post-hoc tests such as the Tukey and Bonferroni to assess hypothesized pairwise comparisons.

False

List all assumptions of ANOVA

- Independence of Observations
- Normality of Residuals
- Equality of Variance

i. State the type of t-test to conduct - paired or independent. ii. In a maximum of 2 sentences, explain your choice or answer the supplementary question in the scenario. Scenario: A professor wants to infer about the potential final grade of his students based on scores from 2 exams - their midterm score and the final project.

Paired t-test - this is because each student is subjected to repeated sampling (one for the midterm, and one for the final project).

Two television stations in Baton Rouge conducted a poll to determine the proportion of LSU students that are likely to vote in favor of a new education policy. TV Station A sampled 200 LSU students, whereas TV Station B sampled 2000 LSU students. Both TV stations tested this hypothesis at the same significance and confidence level. Circle ONE option indicating whether the scenario is more likely to happen to TV Station A, to TV Station B, or is equally likely for either TV station. Explain why. Scenario: Reject the null hypothesis when it is NOT true.

TV Station B. This is the power of the test, which, given the same significance level, is higher for the larger sample size.
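
As a rough illustration of the answer above, the sketch below approximates the power of a two-sided z-test for one proportion at n = 200 and n = 2000. The null value 0.50 and the true proportion 0.55 are assumptions chosen only for illustration (the card gives neither), and the function name `power_one_proportion` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_proportion(p0, p_true, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test for a proportion."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se_null = sqrt(p0 * (1 - p0) / n)          # standard error under H0
    se_true = sqrt(p_true * (1 - p_true) / n)  # standard error under the true p
    upper = p0 + z_crit * se_null              # reject if p-hat is above this
    lower = p0 - z_crit * se_null              # reject if p-hat is below this
    # Probability that the sample proportion lands in either rejection region
    return (1 - z.cdf((upper - p_true) / se_true)) + z.cdf((lower - p_true) / se_true)


# Assumed values: H0 proportion 0.50, true proportion 0.55
power_a = power_one_proportion(0.50, 0.55, n=200)    # TV Station A
power_b = power_one_proportion(0.50, 0.55, n=2000)   # TV Station B
print(f"Power with n=200:  {power_a:.3f}")
print(f"Power with n=2000: {power_b:.3f}")
```

Under these assumed values, Station B's power is much higher than Station A's, even though both use the same alpha.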

True or False: If two quantitative variables are independent, then there is no correlation between them.

True

True or False: Correlation sometimes implies causation.

True. For example, the "fasten seat belt" sign being on and turbulence are correlated, but turning on the sign does NOT cause turbulence. However, hitting yourself and getting a bruise are also correlated, and here the hit does cause the bruising.

Which of the following is true about the F ratio? Check all the correct answers. A. it has no negative values B. it is positively skewed C. the mean of the F distribution equals zero D. it must be greater than 1 to determine if there is a significant interaction

A,B,D

True or False: Regression assumes independence of variables. EXPLAIN this answer!

False. We assume there is some relationship amongst the variable(s), meaning they are not independent. We run a regression to see how variables relate to each other!

List all assumptions of Linear Regressions

LINE:
- Linearity
- Independence of Observations
- Normality of Residuals
- Equality of Variance

To crack statistics exams, there are two proposed strategies: A - read textbook and lecture notes only, B - work practice problems only. Obviously, a combination of the 2 is most likely the best, but due to limited time students must choose between strategies A and B only. The question of interest is: "Is strategy A better than B?"
1. Briefly describe how you would collect data and design a study to answer the question above.
2. In the sampling plan described above, what are your variables of interest?
3. State the relevant hypotheses.
4. What assumptions should you be concerned with, based on your choice of hypothesis test?
5. How does violation of the assumption(s) affect your final decision?
6. What information from your hypothesis test answers the following question: "Given that strategies A and B produce equivalent results, then out of pure luck or chance, what is the likelihood of getting a result (that is, a sample statistic) from another sample that is at least as extreme as the result reported in your current example?"

1. Take a random sample of students and randomly divide them into two groups. Ask one group to read the textbook for a specified amount of time, and the other group to work problems for the same amount of time. The material studied should be the same. Afterwards, give all the students the same test on the material studied and record their exam scores. The exam graders should not know which students studied in which way.
2. Response variable = exam score. Independent variable = group (textbook (A) or practice problems (B)).
3. Null: mean exam score of group A = mean exam score of group B. Alternative: mean exam score of group A > mean exam score of group B. Test at the 5% significance level.
4. Normality and independence.
5. If the assumptions are violated, inferences made from the test are not reliable, since the sampling distribution of the test statistic depends on the CLT. However, inferences are relatively reliable with sufficiently large sample sizes from random sampling.
6. The p-value from this test.

Explain what a p-value is in simple terms.

A p-value is the probability of getting results at least as extreme as the ones you observed, assuming that the null hypothesis is true.

One problem with hypothesis testing is that a real effect may not be detected. This problem is most likely to occur when: A. the true effect is small and the sample size is small. B. the true effect is large and the sample size is small. C. the true effect is small and the sample size is large. D. the true effect is large and the sample size is large.

A. If the true effect is small, we need a larger sample size to detect it. If the true effect is large, we can detect it with a smaller sample size. Small sample sizes and small true effects lead to real effects not being detected, because the effect will be lost in the randomness of the sample. As sample size goes up, power (the ability to detect a true effect) goes up.

A test to screen for a serious but curable disease is similar to hypothesis testing, with a null hypothesis of no disease, and an alternative hypothesis of disease. If the null hypothesis is rejected treatment will be given. Otherwise, it will not. Assuming the treatment does not have serious side effects, in this scenario it is better to increase the probability of: A. making a Type 1 error, providing treatment when it is not needed. B. making a Type 1 error, not providing treatment when it is needed. C. making a Type 2 error, providing treatment when it is not needed. D. making a Type 2 error, not providing treatment when it is needed.

A. Providing treatment to a healthy patient (while maybe unnecessary) would be a preferable mistake to not treating a person who is actually ill. A Type 1 error is rejecting the null when it is actually true (providing treatment when it is not needed), and a Type 2 error is not rejecting the null (not providing treatment) when the null is actually false. Clearly, a Type 2 error is a bigger mistake than a Type 1 error in this situation, so it is better to accept a higher probability of a Type 1 error.

Which of the following are the assumptions required to perform a one-way between groups Analysis of Variance? Check all the correct answers. A. Independence of Observations B. Absence of Multicollinearity C. Homogeneity of Variance D. Normality of the Sampling Distribution

A,C,D

The dataset Students, which you can find on MOODLE, contains information about 24 students from different Statistics classes from my years at Florida Polytechnic. I chose the students randomly and divided them into 3 categories based on their letter grade (A, B, C). I also catalogued their studying habits and exam scores. The column Grade has their grade and the column Gender their gender. The column StudyW is how much they study each week for the class, and the column StudyE is how much they study before an exam, according to a questionnaire I administered at regular intervals. The column Book is Y if the student uses the book and N if not. The column midterm contains their midterm scores and the column final contains their scores on the final. If you wanted to test whether the students in the different groups (different letter grades) study the same number of hours before an exam, which test would you use and why?

Balanced one-way ANOVA, because there are three groups in this case and all of them have the same number of observations (multiple means with equal sample sizes). The students were chosen randomly, so the observations are independent within and between groups. Use Levene's test to check homoscedasticity.

The dataset Students, which you can find on MOODLE, contains information about 24 students from different Statistics classes from my years at Florida Polytechnic. I chose the students randomly and divided them into 3 categories based on their letter grade (A, B, C). I also catalogued their studying habits and exam scores. The column Grade has their grade and the column Gender their gender. The column StudyW is how much they study each week for the class, and the column StudyE is how much they study before an exam, according to a questionnaire I administered at regular intervals. The column Book is Y if the student uses the book and N if not. The column midterm contains their midterm scores and the column final contains their scores on the final. If you wanted to test whether there is a difference in letter grade between the two genders, which test would you use and why?

Chi-square test of independence, because both variables (letter grade and gender) are categorical.
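
A minimal sketch of the chi-square computation for a gender-by-grade table follows. The counts are hypothetical (they are NOT taken from the Students dataset), and the p-value uses the closed form exp(-x/2), which holds only because a 2 x 3 table has 2 degrees of freedom. With only 24 students the expected counts fall below 5, so in practice the chi-square approximation would be questionable.

```python
from math import exp

# Hypothetical counts (not the real data): rows = gender, columns = grade A, B, C
observed = [
    [5, 4, 3],
    [3, 4, 5],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (2 - 1) * (3 - 1) = 2

# Upper-tail p-value; this closed form is valid only for df == 2
p_value = exp(-chi_square / 2)
print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.4f}")
```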

Two television stations in Baton Rouge conducted a poll to determine the proportion of LSU students that are likely to vote in favor of a new education policy. TV Station A sampled 200 LSU students, whereas TV Station B sampled 2000 LSU students. Both TV stations tested this hypothesis at the same significance and confidence level. Circle ONE option indicating whether the scenario is more likely to happen to TV Station A, to TV Station B, or is equally likely for either TV station. Explain why. Scenario: Produce a confidence interval that captures the true [population] proportion.

Equally likely. The chance of the confidence interval capturing the true proportion is specified by the confidence level. Because both stations are using the same confidence level, they have the same chance of capturing the truth.
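
What does change with sample size is the width of the interval, not its coverage. The sketch below computes the half-width of a normal-approximation (Wald) interval for both stations. The sample proportion 0.5 is an assumed value for illustration, and `wald_half_width` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def wald_half_width(p_hat, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_crit * sqrt(p_hat * (1 - p_hat) / n)


# Assumed sample proportion of 0.5 for both stations
half_a = wald_half_width(0.5, 200)    # TV Station A
half_b = wald_half_width(0.5, 2000)   # TV Station B
print(f"Half-width, n=200:  {half_a:.4f}")
print(f"Half-width, n=2000: {half_b:.4f}")
print(f"Ratio: {half_a / half_b:.3f}")  # sqrt(2000 / 200) = sqrt(10)
```

Both intervals have the same 95% chance of capturing the true proportion; Station B's interval is simply narrower.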

Two television stations in Baton Rouge conducted a poll to determine the proportion of LSU students that are likely to vote in favor of a new education policy. TV Station A sampled 200 LSU students, whereas TV Station B sampled 2000 LSU students. Both TV stations tested this hypothesis at the same significance and confidence level. Circle ONE option indicating whether the scenario is more likely to happen to TV Station A, to TV Station B, or is equally likely for either TV station. Explain why. Scenario: Reject the null hypothesis when it is true.

Equally likely. This is the probability of a Type I error, which is the significance level, alpha (α). Because both TV stations are using the same significance level, they have the same probability of a Type I error.

If there is a statistically significant difference between two population means, this indicates the difference is large (i.e. of practical significance). True or False? Explain.

False. Hypothesis tests with small effect sizes (low practical significance) can produce very low p-values when you have a large sample size and/or the data have low variability. Consequently, effect sizes that are trivial in the practical sense can be highly statistically significant. Small effect sizes can produce tiny p-values when:
1. You have a very large sample size. As the sample size increases, the hypothesis test gains greater statistical power to detect small effects. With a large enough sample size, the hypothesis test can detect an effect that is so minuscule that it is meaningless in a practical sense.
2. The sample variability is very low. When your sample data have low variability, hypothesis tests can produce more precise estimates of the population's effect. This precision allows the test to detect tiny effects.
Statistical significance indicates only that you have sufficient evidence to conclude that an effect exists. It is a mathematical definition that does not know anything about the subject area and what constitutes an important effect.
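
To make the sample-size point concrete, here is a small sketch using a one-sample z-test with known standard deviation, where the test statistic is the effect (in standard-deviation units) times the square root of n. The effect size 0.01 and the two sample sizes are assumed values for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_in_sd, n):
    """Two-sided p-value of a one-sample z-test when the observed mean
    sits `effect_in_sd` standard deviations away from the null value."""
    z_stat = effect_in_sd * sqrt(n)
    return 2 * NormalDist().cdf(-abs(z_stat))


tiny_effect = 0.01  # assumed: one hundredth of a standard deviation
p_small_n = two_sided_p(tiny_effect, n=100)
p_large_n = two_sided_p(tiny_effect, n=250_000)
print(f"n = 100:     p = {p_small_n:.3f}")
print(f"n = 250,000: p = {p_large_n:.2e}")
```

The effect is identical in both cases; only the sample size changes whether it is flagged as statistically significant.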

The null hypothesis of Ho: X̄ = 0 is a correct way to state the null hypothesis. True or False? Why?

False. X̄ will vary from sample to sample and almost never equal 0, while µ is fixed (since µ is a population parameter). The purpose of hypothesis testing is to determine whether there is enough statistical evidence collected from a sample in favor of a certain belief, or hypothesis, about a parameter. Since we collect X̄ (a sample statistic) to make inference on µ (a population parameter) in hypothesis testing, we MUST state our null and alternative hypotheses in terms of µ. A valid null hypothesis would be Ho: µ = 0.

If I am interested in testing Ho: µ = 0 versus Ha: µ ≠ 0 and collected a sample average and constructed a 95% confidence interval of [-2.23,4.47], I would reject the null hypothesis at an alpha = 0.05 level since my confidence interval contains negative and positive values. True or False? Explain.

False. A confidence interval from X̄ gives a range of plausible values for µ. In particular, a 95% confidence interval gives all values of µ that would NOT be rejected at the alpha = (1 - 0.95) level. Since 0 is contained in the 95% confidence interval and alpha = 0.05, we would not reject Ho.
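
The duality between the interval and the test can be checked numerically. The sketch below assumes the interval is a symmetric, normal-based (z) interval, which the card does not state; under that assumption the standard error and an approximate p-value can be backed out from the endpoints.

```python
from statistics import NormalDist

lower, upper = -2.23, 4.47   # the 95% confidence interval from the card
mu_0 = 0.0                   # null value

# Decision rule: reject H0 at alpha = 0.05 only if mu_0 is outside the interval
reject_h0 = not (lower <= mu_0 <= upper)

# Assumption: symmetric z-interval, so estimate +/- z_crit * SE
z = NormalDist()
z_crit = z.inv_cdf(0.975)
estimate = (lower + upper) / 2
standard_error = (upper - lower) / 2 / z_crit
z_stat = (estimate - mu_0) / standard_error
p_value = 2 * z.cdf(-abs(z_stat))

print(f"Reject H0? {reject_h0}")
print(f"Approximate p-value: {p_value:.2f}")
```

The p-value is well above 0.05, which agrees with 0 lying inside the interval.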

I can propose a null and alternative hypothesis after observing my data. True or False? Explain. For example, I know that the null hypothesis will be Ho: µ = 0. However, in determining my alternative hypothesis, after observing an X̄ of 6, I elect to use Ha: µ > 0 instead of Ha: µ ≠ 0 since the sample average was greater than 0.

False. Unless we are specifically interested in Ha: µ > 0 BEFORE collecting our data, we cannot change our hypothesis after collecting the data to better fit it. This is sometimes called 'fishing' for results.

If I don't reject the null hypothesis (say I got a p-value of .78 with an alpha of 0.05), I can conclude the null hypothesis is true. True or False? Explain.

False, we do not conclude the null hypothesis is true (hence, the language of 'do not reject' as opposed to 'accept'). If there is greater than a 5% (assuming alpha = 0.05) chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained.

True or False: If there is no to very little correlation between two quantitative variables, then the variables are not related.

False. Correlation only measures the strength of LINEAR relationships. The 2 variables may be related nonlinearly.
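
A quick sketch of this point: below, y is completely determined by x (y = x squared), yet the Pearson correlation is zero because the relationship is not linear. The helper `pearson_r` is written out by hand so the example needs only the standard library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]     # perfectly dependent on x, but not linearly
r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```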

True or False: A high correlation coefficient always implies a high strength of relationship between the variables.

False. Extreme outliers can inflate the actual correlation between two variables.

True or False: A higher R-squared is always preferred.

False. Unless you are comparing models with an equal number of predictors, a higher R-squared may just be due to the model having relatively more predictor variables.
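
One common way to account for the number of predictors is adjusted R-squared, which the card itself does not mention; it is shown here only as an illustration. The sample size and R-squared values below are made up.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical models fitted to the same 30 observations
adj_small = adjusted_r_squared(0.80, n_obs=30, n_predictors=2)
adj_big = adjusted_r_squared(0.81, n_obs=30, n_predictors=10)
print(f"2 predictors:  R^2 = 0.80, adjusted = {adj_small:.3f}")
print(f"10 predictors: R^2 = 0.81, adjusted = {adj_big:.3f}")
```

In this made-up case the larger model has the higher R-squared but the lower adjusted R-squared.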

i. State the type of t-test to conduct - paired or independent. ii. In a maximum of 2 sentences, explain your choice or answer the supplementary question in the scenario. Scenario: To test the efficacy of 2 drugs - A and B - a researcher randomly samples 40 rats; 20 of them are randomly given drug A, and the other 20 are given drug B. One of the rats in group B died suddenly of natural causes before the end of the study period.

Independent t-test - this is because each rat was measured under only one drug, not subjected to repeated sampling. Since the independent t-test does not require equal sample sizes in each group, the death of one rat can be ignored.

The dataset Students, which you can find on MOODLE, contains information about 24 students from different Statistics classes from my years at Florida Polytechnic. I chose the students randomly and divided them into 3 categories based on their letter grade (A, B, C). I also catalogued their studying habits and exam scores. The column Grade has their grade and the column Gender their gender. The column StudyW is how much they study each week for the class, and the column StudyE is how much they study before an exam, according to a questionnaire I administered at regular intervals. The column Book is Y if the student uses the book and N if not. The column midterm contains their midterm scores and the column final contains their scores on the final. If you wanted to test whether there is a difference in study hours per week between the students who use the book and those who don't, which test would you use and why?

Independent t-test for two means, because this comparison involves two means, one from the group that uses the book and one from the group that doesn't. The students were chosen randomly, so the observations are independent within and between groups.

A student analyzed data for a one-way ANOVA situation for which there were 3 levels, and 21 people measured at each level. Unfortunately, after running the analysis, the student lost the computer output. She said "All I remember is that one of the mean squares was 100 and the other one was 500, but I can't remember which was which. And I remember that the p-value for the test was about .01." Based on this information, can you construct the analysis of variance table? If so, fill it in. If not, explain why not

LAB 10 SOLUTIONS
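
The remembered facts are enough to pin the table down. The sketch below is one way to work it out, not the official lab solution: it derives the degrees of freedom, tries both assignments of the two mean squares, and keeps the one whose p-value is near .01. The p-value uses a closed form that holds only when the numerator has exactly 2 degrees of freedom.

```python
groups, n_per_group = 3, 21
n_total = groups * n_per_group      # 63 observations
df_between = groups - 1             # 2
df_within = n_total - groups        # 60
df_total = n_total - 1              # 62


def p_value_f_two_num_df(f_stat, df_den):
    """Upper-tail p-value of an F statistic with exactly 2 numerator df:
    P(F > f) = (1 + 2 f / df_den) ** (-df_den / 2)."""
    return (1 + 2 * f_stat / df_den) ** (-df_den / 2)


# Try both ways of assigning the two remembered mean squares
p_if_between_500 = p_value_f_two_num_df(500 / 100, df_within)  # F = 5
p_if_between_100 = p_value_f_two_num_df(100 / 500, df_within)  # F = 0.2

# Only MS_between = 500 gives p near .01, so that is the assignment to use
ms_between, ms_within = 500, 100
ss_between = ms_between * df_between
ss_within = ms_within * df_within
ss_total = ss_between + ss_within
f_stat = ms_between / ms_within

print(f"{'Source':<10}{'df':>4}{'SS':>8}{'MS':>8}{'F':>6}{'p':>9}")
print(f"{'Between':<10}{df_between:>4}{ss_between:>8}{ms_between:>8}{f_stat:>6.1f}{p_if_between_500:>9.4f}")
print(f"{'Within':<10}{df_within:>4}{ss_within:>8}{ms_within:>8}")
print(f"{'Total':<10}{df_total:>4}{ss_total:>8}")
```

So yes, the table can be constructed: the larger mean square must be the between-groups one, because the reverse assignment gives F = 0.2 and a p-value far above .01.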

A study attempts to investigate the relationship between lung cancer and smoking. Similar patients in a hospital were sampled for this study - some with lung cancer and some other ailments (but not lung cancer). The researcher then recorded whether or not the patient smoked. A hypothesis test was conducted to determine whether the proportion of smokers is higher for the patients with lung cancer than for patients without lung cancer. The reported p-value was less than 0.0001. Does this provide significant evidence that smoking causes lung cancer? Why or why not?

No. This was an observational study which cannot be used to establish causality. An experimental study, with structures to control for any confounding effects, on the other hand CAN be used to establish causality.

i. State the type of t-test to conduct - paired or independent. ii. In a maximum of 2 sentences, explain your choice or answer the supplementary question in the scenario. Scenario: A study was conducted to determine whether shared movie preferences among married couples are an indicator of a long-lasting relationship. The partners in each couple were made to select, in isolation, which movies they like from a list of movies never seen by either partner. One couple divorced before the study was over. Does this new information affect your choice of t-test to conduct? If not, do you ignore this information - what do you do?

Paired t-test - this is because each couple is subjected to repeated sampling (one from each partner). Drop all the information corresponding to that couple (both partners) since paired t-test requires equal sample sizes.

The dataset Students, which you can find on MOODLE, contains information about 24 students from different Statistics classes from my years at Florida Polytechnic. I chose the students randomly and divided them into 3 categories based on their letter grade (A, B, C). I also catalogued their studying habits and exam scores. The column Grade has their grade and the column Gender their gender. The column StudyW is how much they study each week for the class, and the column StudyE is how much they study before an exam, according to a questionnaire I administered at regular intervals. The column Book is Y if the student uses the book and N if not. The column midterm contains their midterm scores and the column final contains their scores on the final. If you wanted to compare the difference between the midterm and the final (both out of a hundred), which test would you use and why?

Paired t-test, because the same students provide both the midterm scores and the final scores (both samples consist of the same test subjects).
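
A paired t-test is a one-sample t-test on the per-student differences. The sketch below shows the arithmetic on five made-up score pairs (hypothetical values, not the Students dataset).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical scores for the same five students (illustration only)
midterm = [70, 80, 65, 90, 75]
final = [74, 85, 70, 92, 80]

# Work with the within-student differences
diffs = [f - m for m, f in zip(midterm, final)]
mean_diff = mean(diffs)
se_diff = stdev(diffs) / sqrt(len(diffs))   # standard error of the mean difference
t_stat = mean_diff / se_diff
df = len(diffs) - 1

print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.2f}, df = {df}")
```

The t statistic would then be compared against a t distribution with n - 1 degrees of freedom.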

This dataset contains the ratings of two groups of TV viewers on an assortment of shows. The question is whether adolescents are harsher critics than adults. This is an example of an independent-samples t-test. What does 'independent' mean here?

Paired-samples t-tests (dependent) compare scores on two different variables but for the same group of cases; independent-samples t-tests compare scores on the same variable but for two different groups of cases. A similar question to ours that would call for a dependent t-test is if we took ONE group, say adolescents, and compared their ratings of shows before vs. after napping.

In what way is linear regression a "generalized form" of correlation? (Hint: think of the transition from two-sample t-test to ANOVA)

Regression can be used to assess the strength of linear relationship among more than 2 variables.

True or False: ANOVA is robust to violations of the homogeneity of variance assumption, so if you fail Levene's test by a small amount, you're still ok.

True

True or False: Regression assumes independence of error terms.

True. This assumption means the error terms are uncorrelated. If observations were dependent, the errors would be correlated. One form of this dependence is called autocorrelation, and it frequently arises in time-series data.

Explain the difference between a type 1 and type 2 error.

Type 1 error (α) is rejecting the null hypothesis when it is actually true. As explained above, the p-value is calculated by assuming the null hypothesis is true. Thus, we reject when p < alpha because the results are very unlikely under this null hypothesis. However, by this logic, our type 1 error rate will be alpha. Type 2 error (β) is not rejecting the null hypothesis when it is actually false. Power is the probability of NOT making a type 2 error (1 - β). If the true difference between the hypothesized value and the actual value is small (low power), the type 2 error rate will go up. If the sample size is small (low power), the type 2 error rate will go up. Holding sample size and true difference constant, there is a tradeoff between type 1 and type 2 error. However, we can lower our probability of a type 2 error without raising type 1 error, and so avoid this tradeoff, if we increase our sample size or if the true difference is actually larger.

List 2 scenarios under which the correlation coefficient of 2 variables may not be reliable.

When there are:
- Confounding variables. E.g., there is a high correlation between ice cream sales and number of deaths by drowning. However, this is so because of the summertime (the confounding variable).
- Extreme observations or outliers. (Check this with an example by introducing a large outlier in two uncorrelated variables.)
- Repeated measures (that is, measurements across space and time). E.g., correlating the week 3 blood pressure with the week 4 blood pressure among the same patients.
- Observations that were not randomly sampled.
- Distinct subpopulations that may be correlated differently (Simpson's Paradox).
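
The outlier check suggested above can be sketched as follows. The four base points are made-up values chosen so that their correlation is exactly zero; adding a single extreme point then pushes the correlation close to 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up, uncorrelated data
xs = [1, 2, 3, 4]
ys = [2, 1, 1, 2]
r_without = pearson_r(xs, ys)

# Same data plus one extreme observation at (20, 20)
r_with = pearson_r(xs + [20], ys + [20])

print(f"r without outlier: {r_without:.3f}")
print(f"r with outlier:    {r_with:.3f}")
```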

