Statistics Final
The alternative hypothesis (Ha) for a significant linear relationship will always be ...
≠ (not equal to)
two tailed test
≠ (not equal to)
Cluster design
NOT an experimental design!! (does not exist, only exists for sampling not design)
Correlation coefficient unit
r (the correlation coefficient itself has no units)
Coefficient of determination unit
R^2
Experimental study
assign subjects to their treatments (random assignments)
Observational study
do not have random assignments
b_1
for every one-unit increase in x, we expect y to change by b_1 on average (an increase if b_1 is positive, a decrease if b_1 is negative)
µ (mu) symbol = _____
mean
Randomized design
random, no groupings (each subject receives one treatment)
σ (sigma) symbol = _____
standard deviation (SD)
CL + alpha = ______
100%
How do you find median
Put the numbers in order and find the middle number (with an even number of values, average the two middle numbers)
NC State wants to expand the wellness center. They decided to ask the students if they felt there was a need for an extension or not. They randomly select 100 freshmen and ask for their opinion. Which type of sampling bias is this most likely to result in? o Undercoverage o Response o Non-Response o Selection
a) Undercoverage Why -
Which of the following is TRUE about the shape of the sampling distribution of the sample mean? o The shape of the sampling distribution is always the same shape as the population. o Increasing the sample size causes the shape of the sampling distribution to become narrower. o Increasing the sample size can cause the sampling distribution to become more skewed. o Changing the sample size does not cause the shape of the sampling distribution to change
b) Increasing the sample size causes the shape of the sampling distribution to become narrower. Why -
Which of the following problems does single-blinding address? o Non-adherence o Non-response o Subject Bias o Researcher Bias
c) Subject Bias Why -
Random sampling
every sample has the same probability of being selected
one tailed test
less than (<) or greater than (>)
p-value
the smallest significance level at which Ho can be rejected; fail to reject Ho when p-value > significance level
Conducting a hypothesis test: your Ho & Ha should always have _______ _______ but ________ __________
the same number, but different relational symbols (ex. Ho: p = 0.5, Ha: p > 0.5): they both contain the value 0.5, but one contains an equals symbol & the other contains a greater than symbol
SD of the sample mean (ȳ) formula =
σ / √n
Requirement for experimental design?
Randomization (random assignment)
How do you find range
Subtract the lowest number from the highest number (max-min)
A researcher claims that 71% of Americans approve of unions, which would be the highest number since 1965. a) Suppose a random sample of 120 Americans is selected. Find the probability that more than 98 people in the sample approve of unions, or explain why this calculation cannot be done b) A random sample of 200 Canadians is found and 152 approve of unions. We want to compare this to the American population. Could this sample proportion of Canadians who approve of unions reasonably have come from a population with the same proportion claimed by the researcher (i.e. for Americans)?
a) 0.0049 b) z = 1.56 is not an unusual score (i.e. is within 2 standard deviations of the mean). Therefore, it is reasonable to conclude that the proportions could be the same in Canada and America.
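A minimal Python sketch of the two calculations above, assuming the Normal approximation for the sample proportion applies (np and n(1-p) are both large here). The helper name `z_for_sample_proportion` is made up for illustration. Using the unrounded z-score, part a) comes out near 0.0050; the 0.0049 on the card comes from rounding z to 2.58 before using the table.

```python
from math import sqrt
from statistics import NormalDist

p = 0.71  # proportion claimed by the researcher


def z_for_sample_proportion(successes, n, p):
    """z-score of the sample proportion, assuming the claimed p is true."""
    p_hat = successes / n
    sd_p_hat = sqrt(p * (1 - p) / n)  # SD of the sampling distribution of p-hat
    return (p_hat - p) / sd_p_hat


# a) P(more than 98 of 120 Americans approve)
z_a = z_for_sample_proportion(98, 120, p)
prob_a = 1 - NormalDist().cdf(z_a)

# b) 152 of 200 Canadians approve: is the z-score within 2 SDs of the mean?
z_b = z_for_sample_proportion(152, 200, p)
is_unusual = abs(z_b) > 2

print(round(z_a, 2), round(prob_a, 4))
print(round(z_b, 2), is_unusual)
```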
The world's heaviest onion weighed in at 8.5kg (or 18.74 lbs). Mammoth onions are known for their large size and have an average weight of 2.6kg with a standard deviation of 0.97kg. If we assume the weights of Mammoth onions are normally distributed, find a) The probability that a random Mammoth onion reaches a weight of at most 1.5kg. b) The weight that corresponds to the largest 15% of Mammoth onions. Round to 2 decimals
a) 0.1292 b) 3.61 Why -
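One way to check these two answers is with Python's built-in `statistics.NormalDist`. This is only a sketch; the variable names are made up. Software uses the unrounded z-score, so part a) comes out slightly below the table answer of 0.1292 (which uses z = -1.13), while part b) still rounds to 3.61.

```python
from statistics import NormalDist

# Mammoth onion weights in kg, assumed Normal as stated in the problem
onions = NormalDist(mu=2.6, sigma=0.97)

# a) P(weight <= 1.5 kg)
prob_at_most_1_5 = onions.cdf(1.5)

# b) the largest 15% start at the 85th percentile
cutoff_top_15 = onions.inv_cdf(0.85)

print(round(prob_at_most_1_5, 4), round(cutoff_top_15, 2))
```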
In which of the following cases can both the mean and mode of a sample be used to describe the center of the distribution? o A symmetrical distribution with one peak o A symmetrical distribution with two peaks o A skewed distribution with one peak o A skewed distribution with two peaks
a) A symmetrical distribution with one peak Why -
A p-value is calculated for the test above to be 0.0483. Which of the following is correct? o At a significance level of 1%, we would Fail to Reject the null hypothesis. o At a significance level of 5%, we would Fail to Reject the null hypothesis. o At a significance level of 10%, we would Fail to Reject the null hypothesis. o We would always Reject the null hypothesis because the sample proportion is larger than 50%.
a) At a significance level of 1%, we would Fail to Reject the null hypothesis. Why -
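The decision rule behind this answer (reject Ho when the p-value is at or below the significance level, otherwise fail to reject) can be sketched in a few lines of Python. The dictionary name `decisions` is just for illustration.

```python
p_value = 0.0483

# Compare the same p-value against each significance level from the question
decisions = {}
for alpha in (0.01, 0.05, 0.10):
    decisions[alpha] = "reject Ho" if p_value <= alpha else "fail to reject Ho"

for alpha, decision in decisions.items():
    print(f"alpha = {alpha:.2f}: {decision}")
```

Only the 1% level gives "fail to reject Ho", which matches option a).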
Viewership for 19 of the top 20 US networks fell last year, suggesting that traditional TV is struggling to compete in the age of streaming. Sporting giant ESPN was the only network in the top 20 to record higher viewing figures. This news led Statistics student Aleksander to wonder about college students' TV viewing habits. Aleksander randomly selects 3 of the 7 dormitories on his campus and plans to leave a survey with each student in the 3 selected dorms. a) What type of sampling method is Aleksander using? b) Identify one potential issue with this plan. Explain the issue using statistical language. c) Aleksander is a sports fan and wants to know if other students feel the same. Which question below is better to ask: • Question A: "Which type of television programming do you watch most often: News, Reality TV, Scripted TV, Sports, Other?" • Question B: "Do you watch sports more than other television programming?"
a) Cluster sampling, with the dormitories representing the clusters. b) Possible answers: Non-response bias: students may choose not to respond to the survey left at their dormitory. Undercoverage/bad sampling frame: Aleksander only selected students who live on-campus, which may not reflect the habits of all of the college's students c) Question A: The wording of Question B is leading respondents to say "yes" because it only provides one option. Question B may also lead to response bias because Aleksander clearly wants to know about sports and not other types of programming.
We are interested in estimating the average commuting time of all NCSU commuters. Let's say we have some way of randomly sampling enough commuters to study this. Which inference method would work the best to achieve our goal? o Confidence interval for a mean o Confidence interval for a proportion o Hypothesis test for means o Hypothesis test for proportions o Matched pairs test for means
a) Confidence interval for a mean Why -
Which of the following is an example of response bias? o To achieve a higher Google review score, a restaurant offered discount coupons to all customers who gave 5/5 Google review scores. o To investigate customer satisfaction, a restaurant encouraged all customers to finish an online survey. But only 10% of customers had indeed submitted the surveys. o To estimate the proportion of all NCSU undergraduates who owned a car, the school officer randomly queried 100 undergraduates in the centennial campus. o None of the above are examples of response bias.
a) To achieve a higher Google review score, a restaurant offered discount coupons to all customers who gave 5/5 Google review scores. Why -
Renewables are set to become the world's leading source of electricity generation by 2025, with solar power leading the charge, according to a new report from the International Energy Agency. After solar tax credits, the average cost for a solar panel system on an average-size house in the U.S. is $14,696. The standard deviation of the cost is $3,163. a) A random sample of 50 U.S. homes is selected and the cost to install a solar panel system is recorded for each. Describe the sampling distribution of the sample mean cost for solar panel systems in samples of 50 U.S. houses. What conditions allow for this? b) Which is more likely? • Scenario A: A random sample of 50 U.S. homes has a sample mean of $15,571 or more. • Scenario B: A random sample of 50 U.S. homes has a sample mean of $14,242 or less.
a) We do not know the shape of the population, but we have a sufficiently large random sample (n=50) to determine, by the Central Limit Theorem, that the distribution of the sample mean will be approximately Normal, with mean $14,696 and standard deviation $3,163/√50 ≈ $447.32. b) Scenario B
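A short Python sketch of why Scenario B is more likely, assuming the Central Limit Theorem lets us treat the sample mean as approximately Normal with standard deviation σ/√n. Variable names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 14696, 3163, 50

# SD of the sample mean: sigma / sqrt(n)
sd_of_mean = sigma / sqrt(n)
sample_mean_dist = NormalDist(mu=mu, sigma=sd_of_mean)

# Scenario A: sample mean of $15,571 or more (upper tail)
prob_a = 1 - sample_mean_dist.cdf(15571)

# Scenario B: sample mean of $14,242 or less (lower tail)
prob_b = sample_mean_dist.cdf(14242)

print(round(sd_of_mean, 2), round(prob_a, 4), round(prob_b, 4))
```

$14,242 is about 1 SD below the mean while $15,571 is about 2 SDs above it, so the tail for Scenario B is larger.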
Cluster sampling
all observations come from a few randomly selected groups, and everyone in a selected group is observed (not all groups will give information)
The cost per acre of undeveloped land in rural Pennsylvania is Normally distributed with a mean of $5000 and a standard deviation of $800. Estimate the percent of rural undeveloped land whose cost per acre lies between $4200 and $5800. o 34% o 68% o 60% o 90%
b) 68% Why -
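$4200 and $5800 are exactly one standard deviation below and above the mean, so the 68-95-99.7 rule gives about 68%. A quick check in Python (a sketch, with made-up variable names):

```python
from statistics import NormalDist

# Cost per acre, Normal with mean $5000 and SD $800 as stated in the problem
land = NormalDist(mu=5000, sigma=800)

# Share of land costing between $4200 and $5800 (mean plus or minus 1 SD)
share_between = land.cdf(5800) - land.cdf(4200)

print(round(share_between * 100, 1))
```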
Midterm scores on a statistics course were normally distributed with mean 𝜇 = 78 and standard deviation 𝜎 = 7. The z-score corresponding to Harry's score is 𝑧 = 1.20. Ron and Ginny have scores of 85 and 87, respectively. What is the correct order of best (highest) to worst (lowest) score? o Harry, Ron, Ginny o Ginny, Harry, Ron o Ginny, Ron, Harry o Harry, Ginny, Ron
b) Ginny, Harry, Ron Why -
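To compare the three, convert Harry's z-score back to a raw score with score = μ + z·σ. A small Python sketch (names are for illustration only):

```python
mu, sigma = 78, 7

scores = {
    "Harry": mu + 1.20 * sigma,  # z = 1.20 converted to a raw score
    "Ron": 85,
    "Ginny": 87,
}

# Sort names from highest score to lowest
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, round(scores["Harry"], 1))
```

Harry's raw score is about 86.4, which falls between Ginny's 87 and Ron's 85.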
How do critical values (i.e. confidence coefficients or multipliers) change when the confidence level increases? o If the confidence level increases, the critical values get closer to 0. o If the confidence level increases, the critical values get farther from 0. o It depends on whether we are using 𝑧 or 𝑡 critical values. o The change in the critical values depends on the sample size as well, so we cannot say how it will change.
b) If the confidence level increases, the critical values get farther from 0. Why -
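A sketch of how the z* multiplier grows with the confidence level, using `statistics.NormalDist` from the Python standard library. The helper name `z_star` is made up; the same pattern holds for t* at a fixed sample size.

```python
from statistics import NormalDist


def z_star(confidence_level):
    """Critical value that leaves (1 - CL)/2 in each tail of the standard Normal."""
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)


z_values = {cl: z_star(cl) for cl in (0.80, 0.90, 0.95, 0.99)}
for cl, z in z_values.items():
    print(f"{cl:.0%} confidence: z* = {z:.3f}")
```

Because the margin of error is the multiplier times the standard error, a larger multiplier also means a wider interval.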
The average farm size in Brazil in 1980 was 70.7 hectares. A recent study sampled 100 random farms in Brazil and calculated their average size to be 72.8. The following hypothesis test was conducted: 𝐻0: 𝜇 = 70.7 𝑣𝑠 𝐻𝑎: 𝜇 > 70.7 where 𝜇 is the current mean farm size in Brazil. The p-value was found to be 0.02. Which is the correct interpretation of the p-value? o The probability that the current mean farm size in Brazil is greater than the farm size in 1980 is 0.02. o If the current mean farm size in Brazil is the same as in 1980, the probability of observing a sample mean farm size in Brazil of 72.8 hectares or more is 0.02. o The probability of observing a sample mean farm size in Brazil greater than 72.8 hectares is 0.02. o The probability that 100 farms in Brazil have a mean size that is equal to 70.7 hectares is 0.02.
b) If the current mean farm size in Brazil is the same as in 1980, the probability of observing a sample mean farm size in Brazil of 72.8 hectares or more is 0.02. Why -
The NC State statistics department wants to know how much time all NC State students spend daily on their phones. They randomly selected 500 students from a total of 35,000 students at the school. They found out that the average time for the selected 500 students was 7.2 hours. According to the university's census report, the average time of all 35,000 students was 8.5 hours. Which of the following combinations is correct? o Population is 500 students, sample is 35,000 students, parameter is 8.5 hours, and statistic is 7.2 hours. o Population is 35,000 students, sample is 500 students, parameter is 8.5 hours, and statistic is 7.2 hours. o Population is 500 students, sample is 35,000 students, parameter is 7.2 hours, and statistic is 8.5 hours. o Population is 35,000 students, sample is 500 students, parameter is 7.2 hours, and statistic is 8.5 hours.
b) Population is 35,000 students, sample is 500 students, parameter is 8.5 hours, and statistic is 7.2 hours. Why -
Neil works for a large pharmaceutical company and he has recently developed a drug which he believes will reduce hair loss in men over the age of 35. Neil decides to use a completely randomized experimental design to test the drug. The hair loss drug itself will be given to participants in the treatment group via a pill capsule ("treatment pill"). How should he structure the control group? o The participants in the control group will be given nothing. o The participants in the control group will be given a sugar pill capsule that looks identical to the treatment pill, but will have no biological effect. o The participants in the control group will be given the treatment pill. o There is no control group in a completely randomized design, therefore we cannot answer this question.
b) The participants in the control group will be given a sugar pill capsule that looks identical to the treatment pill, but will have no biological effect. Why -
The average farm size in Brazil in 1980 was 70.7 hectares. A recent study sampled 100 random farms in Brazil and calculated their average size to be 72.8. The following hypothesis test was conducted: 𝐻0: 𝜇 = 70.7 𝑣𝑠 𝐻𝑎: 𝜇 > 70.7 where 𝜇 is the current mean farm size in Brazil. The p-value was found to be 0.02. The null hypothesis is rejected. Which of the following correctly interprets this decision? o There is sufficient evidence to suggest that the mean farm size in Brazil in 1980 is greater than 70.7 hectares. o There is sufficient evidence to suggest that the current mean farm size in Brazil is greater than 70.7 hectares. o There is sufficient evidence to suggest that the mean farm size in Brazil in 1980 is 70.7 hectares. o There is sufficient evidence to suggest that the current mean farm size in Brazil is 70.7 hectares
b) There is sufficient evidence to suggest that the current mean farm size in Brazil is greater than 70.7 hectares. Why -
Researchers conducted a study of 500 college athletes to determine the average heart rate after exercise among athletes. They found their 95% confidence interval to be (150,160). Which of the following interpretations is correct? o There is a 95% probability that the true average heart rate among athletes after exercise is between 150 rpm and 160 rpm. o We are 95% confident that the true average heart rate among athletes after exercise is between 150 rpm and 160 rpm. o We are 95% confident that the average heart rate among these 500 athletes after exercise is between 150 rpm and 160 rpm. o We are confident that 95% of all athletes have a mean heart rate after exercise that is between 150 rpm and 160 rpm.
b) We are 95% confident that the true average heart rate among athletes after exercise is between 150 rpm and 160 rpm. Why -
Bill records the 1-mile-run times of his classmates. Most of them complete the run in 8 to 10 minutes but a few of his classmates were exceptionally fast and completed the run in 5 to 7 minutes. If Bill plotted the running times of his classmates, what shape would the plot have? o Right-skewed o Symmetric o Left-skewed o Normal
c) Left-skewed Why -
A researcher is developing a new flu vaccine and is concerned that participant political leanings may affect reporting of side-effects of the vaccine. Assuming he's conducting a double-blind study against a placebo, which of the following would be the best experimental design to use to achieve our goal of measuring vaccine effectiveness including side-effects, and why? o Observational study, since we want to observe the difference between the vaccine and the placebo. o Completely Randomized Design (CRD), to make sure that the experiment is unbiased and all the participants are randomly assigned to take the vaccine or the placebo. o Randomized Block Design (RBD) and use political orientation as blocks, to best account for the effect that different political leanings may have on the reporting of side-effects. o Randomized Block Design (RBD) and use the treatment as blocks, to best account for the placebo effect.
c) Randomized Block Design (RBD) and use political orientation as blocks, to best account for the effect that different political leanings may have on the reporting of side-effects. Why -
The Dean wants to make an 89% confidence interval for the true average number of pages per book in DH Hill Library. Assuming he will meet all the necessary conditions, what is the best way he can make the interval as narrow as possible? o Use a smaller test statistic o Decrease the confidence level o Take the largest sample possible o Replace t* with z*
c) Take the largest sample possible Why -
In the U.S. about 1.5% of undergraduate students have children whereas 12% of graduate students have children. Assume two random samples of the same size of undergraduate and graduate students are found. Which estimate of the population proportion (undergraduate students with children or graduate students with children) is expected to be more variable? o We cannot tell based on the information provided. o The estimate of the proportion of undergraduate students with children is expected to be more variable. o The estimate of the proportion of graduate students with children is expected to be more variable. o The estimates of the proportions will have equal variance since the samples are the same size.
c) The estimate of the proportion of graduate students with children is expected to be more variable. Why -
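The reason is the formula SD(p̂) = √(p(1-p)/n): for the same n, p(1-p) is larger for p = 0.12 than for p = 0.015. A minimal Python sketch, where the sample size n = 100 is a made-up value (the question only says the two samples are the same size):

```python
from math import sqrt


def sd_of_p_hat(p, n):
    """Standard deviation of the sample proportion."""
    return sqrt(p * (1 - p) / n)


n = 100  # hypothetical common sample size

sd_undergrad = sd_of_p_hat(0.015, n)
sd_grad = sd_of_p_hat(0.12, n)

# p(1-p) is largest at p = 0.5 and shrinks to 0 as p approaches 0 or 1
grid = [i / 100 for i in range(101)]
p_with_largest_sd = max(grid, key=lambda p: sd_of_p_hat(p, n))

print(round(sd_undergrad, 4), round(sd_grad, 4), p_with_largest_sd)
```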
Which of the following is NOT true about the sample proportion 𝑝̂? o The mean of 𝑝̂is equal to the population proportion 𝑝. o The variance of 𝑝̂depends on the sample size 𝑛. o The variance of 𝑝̂is largest when the population proportion 𝑝 is equal to 1. o If the sample size 𝑛 is large enough, the sampling distribution of 𝑝̂is approximately normal.
c) The variance of 𝑝̂is largest when the population proportion 𝑝 is equal to 1. Why -
Which of the following is FALSE about variable type? o We can calculate the mean and median of quantitative variables. o We can draw a boxplot of quantitative variables. o We can draw a histogram of categorical variables. o We can calculate frequency of categorical variables.
c) We can draw a histogram of categorical variables. Why -
Changing units does what to R^2?
changing units does nothing to change R^2
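A quick way to convince yourself: compute R^2 on some data, convert the units, and compute it again. The sketch below uses made-up height and weight values and a hand-written `correlation` helper (not a library function).

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data for illustration only
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]

# Same data in different units
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_squared_original = correlation(heights_in, weights_lb) ** 2
r_squared_converted = correlation(heights_cm, weights_kg) ** 2

print(round(r_squared_original, 4), round(r_squared_converted, 4))
```

The two printed values agree, because rescaling x or y by a positive constant cancels out of r.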
The school officers wanted to estimate the proportion of undergraduates in NCSU who took the COVID-19 Booster Vaccines. To conveniently estimate this proportion, they sent questionnaires to all NCSU undergraduates through emails. They received 5000 response emails among which 200 students reported that they had taken the COVID-19 Booster Vaccines. What type of sampling method was used here? o A simple random sample. o A systematic sample. o A stratified random sample, using whether a student had taken the COVID-19 Booster Vaccine or not as the strata. o A volunteer response sample
d) A volunteer response sample Why -
We decided to construct a 95% confidence interval for the true proportion of NCSU students who have the habit of eating breakfast, from a random sample of 2000 NCSU students. The resulting interval is (0.77, 0.83). Which of the following is a correct interpretation of the confidence LEVEL? o We are 95% confident that the sample proportion of these 2000 NCSU students who have the habit of eating breakfast is between 77% and 83%. o 95% of all random samples of 2000 NCSU students will show that between 77% and 83% of students have the habit of eating breakfast. o There is a 95% probability that the true proportion of NCSU students who eat breakfast is between 0.77 and 0.83. o If we take random samples over many times, then about 95% of the confidence intervals that we build through these random samples will contain the true proportion of NCSU students who eat breakfast.
d) If we take random samples over many times, then about 95% of the confidence intervals that we build through these random samples will contain the true proportion of NCSU students who eat breakfast. Why -
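The "many repeated samples" idea can be simulated. The sketch below is an illustration under made-up assumptions: a true proportion of 0.80, samples of size 200, and 2000 repetitions with a fixed random seed. The helper name `proportion_interval` is hypothetical.

```python
import random
from math import sqrt


def proportion_interval(successes, n, z_star=1.96):
    """95% confidence interval for a proportion: p-hat plus or minus z* times SE."""
    p_hat = successes / n
    moe = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe


rng = random.Random(0)             # fixed seed so the run is repeatable
true_p, n, reps = 0.80, 200, 2000  # made-up values for illustration

hits = 0
for _ in range(reps):
    # Draw one random sample and count the "yes" answers
    successes = sum(rng.random() < true_p for _ in range(n))
    low, high = proportion_interval(successes, n)
    hits += low <= true_p <= high  # did this interval capture the true p?

coverage = hits / reps
print(coverage)
```

The share of intervals that capture the true proportion should land close to 0.95, which is what the confidence level describes.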
Which of the following numerical summaries are measurements of spread (i.e. how different the values in a dataset are from one another)? o Min, Max, Boxplot o Mean, Median, Quartiles o Range, Standard Deviation, Z-scores o Range, Interquartile Range, Variance o Normal Distribution, Standard Normal Distribution, Sample Statistics
d) Range, Interquartile Range, Variance Why -
Neil works for a large pharmaceutical company and he has recently developed a drug which he believes will reduce hair loss in men over the age of 35. Neil decides to use a completely randomized experimental design to test the drug. The hair loss drug itself will be given to participants in the treatment group via a pill capsule ("treatment pill"). Identify the response variable. o Whether or not they receive the hair loss drug. o Men over the age of 35. o Men over the age of 35 who receive the hair loss drug. o Reduction in hair loss in men over the age of 35.
d) Reduction in hair loss in men over the age of 35. Why -
True or False: Confidence intervals are better to report than sample statistics? o False, since they do not always contain the true parameter o False, because a one number summary is better than a range o True, because confidence intervals give the probability of the parameter o True, because confidence intervals give a range of numbers that we believe will include the population parameter
d) True, because confidence intervals give a range of numbers that we believe will include the population parameter Why -
In the statistical paradigm, we are primarily interested in doing which of the following? o Discovering surprising characteristics of our population. o Avoiding bias by randomly sampling from the population. o Using the central limit theorem to make our populations normally distributed. o Using sample statistics to make inferences about population parameters.
d) Using sample statistics to make inferences about population parameters. Why -
Which set of hypotheses can be used to investigate the question, "Has the number of students who own a dog increased?" o 𝐻0: 𝑝 = 0.55 𝑣𝑠 𝐻𝑎: 𝑝 > 0.55 o 𝐻0: 𝑝 = 0.50 𝑣𝑠 𝐻𝑎: 𝑝 > 0.55 o 𝐻0: 𝑝 = 0.55 𝑣𝑠 𝐻𝑎: 𝑝 > 0.50 o 𝐻0: 𝑝 = 0.50 𝑣𝑠 𝐻𝑎: 𝑝 > 0.50
d) 𝐻0: 𝑝 = 0.50 𝑣𝑠 𝐻𝑎: 𝑝 > 0.50 Why -
When we fail to reject Ho, we _______
do not have sufficient evidence at alpha to conclude ...
To determine if interferon increases visual acuity in patients with macular degeneration, measurements are taken on the affected eye before interferon treatment and again 8 weeks after use. The results are compared. Which inference method would work the best to determine if interferon treatments were successful? o Confidence interval for a mean o Confidence interval for a proportion o Hypothesis test for means o Hypothesis test for proportions o Matched pairs test for means
e) Matched pairs test for means Why -
When p-value is greater than alpha, we ______
fail to reject Ho
Matched pairs
has to be paired up as individuals, not just two groups
To narrow a confidence interval ....
increase sample size or decrease Z*
When CL increases, multiplier _______, & MOE _________
increases, increases (the opposite happens when CL decreases)
The more skewed the population, the ______ the sample size has to be for the CLT to apply
larger
Negative correlation coefficient = ______ slope of regression line
negative
Hawthorne effect
people in an experiment know they are being watched, so their behavior reflects that & they don't act as they normally would (not accurate behavior)
Subject bias
people in experiment behave how they think you want them to behave (intentionally control their behavior in an unnatural way)
Stratified sampling
a random sample of observations is taken from every group (all groups will give information)
CI for proportion steps
step 1. check conditions (on formula sheet: random sample & np>12 & nq>12). step 2. use the CI for p formula (on formula sheet); find z* from the infinity row of the t-table for the confidence level you want. step 3. plug in values (p̂, z*, n), where n = sample size and p̂ = 'found' value / sample size; your final answer for the CI should have two values [ ___,___ ]. step 4. interpret (we are __% confident that p is between these values, where p is the parameter described in the problem).
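The steps above can be sketched in Python. The data here are made up (160 "yes" answers out of 200), and the cutoff of 12 is taken from the formula sheet mentioned on the card. The conditions are checked with p̂ because p is unknown when building an interval.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 160, 200   # hypothetical sample: 160 of 200 said "yes"
confidence_level = 0.95

# step 1: check conditions, using p-hat in place of the unknown p
p_hat = successes / n
conditions_ok = n * p_hat > 12 and n * (1 - p_hat) > 12

# step 2: z* for the chosen confidence level (same as the t-table infinity row)
z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

# step 3: plug in p-hat, z*, and n
moe = z_star * sqrt(p_hat * (1 - p_hat) / n)
ci_low, ci_high = p_hat - moe, p_hat + moe

# step 4: interpret
print(f"We are {confidence_level:.0%} confident that p is between "
      f"{ci_low:.3f} and {ci_high:.3f}.")
```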
Hypothesis test for two means steps
step 1. state the hypotheses (Ho & Ha). step 2. check conditions. step 3. conduct the test (test statistic formula from formula sheet). step 4. find the p-value. step 5. reject or fail to reject Ho (interpret & specify parameters).
correlation close to -1 or 1 = _______ linear relationship
strong (& vice versa)
Strength & direction of linear relationship
strong or weak positive or negative (four possible outcomes)
How do you find mean
sum of values divided by number of values
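Mean, median, and range on a small made-up data set, using Python's `statistics` module (a sketch for checking hand calculations):

```python
from statistics import mean, median

data = [3, 7, 8, 5, 12]  # made-up values

data_mean = mean(data)              # sum of values / number of values = 35 / 5
data_median = median(data)          # middle of the sorted list 3, 5, 7, 8, 12
data_range = max(data) - min(data)  # highest minus lowest

# With an even number of values, the median is the average of the two middle ones
even_median = median([3, 5, 7, 8])

print(data_mean, data_median, data_range, even_median)
```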
Block design
subjects are first grouped into blocks by a shared characteristic, then the random assignment of subjects to treatments is carried out separately within each block (every treatment is used within every block)
correlation close to 0 = ______ linear relationship
weak (& vice versa)
Non-adherence bias
when people/subjects purposefully don't follow the experiment (take a drug that they're told not to take, or the opposite, etc.)
Systematic sampling
will not come up (don't need to know)
Central limit theorem
with a big enough sample size, ȳ approximately follows a Normal distribution, whatever the shape of the population
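A small simulation makes this concrete. The sketch below assumes a strongly right-skewed population (exponential, chosen only for illustration) and compares the skewness of sample means for n = 2 and n = 50. The helper names and the fixed seed are made up.

```python
import random


def skewness(values):
    """Skewness from the second and third central moments (0 means symmetric)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def sample_means(n, reps, rng):
    """Means of `reps` random samples of size n from a right-skewed population."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]


rng = random.Random(1)  # fixed seed so the run is repeatable

skew_n2 = skewness(sample_means(2, 2000, rng))
skew_n50 = skewness(sample_means(50, 2000, rng))

print(round(skew_n2, 2), round(skew_n50, 2))
```

The sample means for n = 50 are much closer to symmetric than those for n = 2, which is the Central Limit Theorem at work.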