Psych 100A Textbook Q's
B
(a) population, (b) sample, and (c) sampling. Which distribution would be used to answer the following question: 2. In a sample of students from UCLA, what proportion have GPAs greater than 3.2?
C
(a) population, (b) sample, and (c) sampling. Which distribution would be used to answer the following question: 3. If the population average time between dental appointments is 12 months with a standard deviation of 4 months, what is the probability that a sample of 4 individuals has an average time between dental appointments of 8 months?
A
(a) population, (b) sample, and (c) sampling. Which distribution would be used to answer the following question: What is the probability that an individual in the United States is older than 70 years old?
C
If the research question we are most interested in is whether education predicts infant mortality rate, which confidence interval would we be most interested in? A. Confidence interval for the mean B. Confidence interval for the intercept C. Confidence interval for the slope D. All of the above are required to answer the research question
B
4) What R code would create a new variable called DEPRESS that contains the difference between PRE versus POST depression severity? a) Depression = PRE - POST b) PHQ9$DEPRESS <- PHQ9$PRE - PHQ9$POST c) Depression$PRE - Depression$POST d) Depression(Depression$PRE - Depression$POST)
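The assignment pattern in option (b) can be sketched as follows. This is a minimal illustration that assumes a small made-up data frame named PHQ9; only the column names PRE, POST, and DEPRESS come from the question, and the values are invented.

```r
# Hypothetical data: the numbers are invented for illustration.
PHQ9 <- data.frame(PRE = c(15, 12, 20), POST = c(10, 12, 14))

# Create a new column that stores the PRE minus POST difference.
# Using PHQ9$DEPRESS on the left saves the result inside the data frame.
PHQ9$DEPRESS <- PHQ9$PRE - PHQ9$POST

print(PHQ9)
```

Writing `Depression = PRE - POST` would fail here because PRE and POST exist only as columns of the data frame, not as standalone objects.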
A
A ____________ distribution can be used to investigate how unusual a specific individual is, but a ____________ distribution is used to investigate how unusual a sample statistic (e.g., sample mean) is. a. sample distribution; sampling distribution b. sampling distribution; sample distribution c. population distribution; sample distribution d. sampling distribution; population distribution
B
A student who is majoring in Civil Engineering finds out that the median income for her major is $50,000, the z score being .86. How should we interpret this z score? A This major's median earnings are higher than 86% of the other majors. B The major's median earnings are .86 standard deviations larger than the average median earning. C The probability that this student will earn more than the average college graduate is .86. D Using the empirical rule, .86 of all median earnings fall between 0 and this value.
A, C
Above is a dotplot generated using the following command: gf_point(Infant.Mortality~1, data = swiss)%>% gf_hline(yintercept = mean(swiss$Infant.Mortality)) What does the horizontal line across the plot correspond to (Check all that apply)? A. Average Infant mortality B. Average education C. The intercept from a simple model predicting infant mortality D. The intercept from a complex model predicting infant mortality from education
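A quick check of why the horizontal line is both the average and the intercept of the simple model, using the swiss data that ships with base R. The sketch uses only base R and writes the empty model as `~ 1`; the object names are illustrative.

```r
# swiss is part of the built-in datasets package.
# "~ 1" fits an intercept-only (empty/simple) model.
empty_model <- lm(Infant.Mortality ~ 1, data = swiss)

b0 <- unname(coef(empty_model)[1])          # intercept of the empty model
grand_mean <- mean(swiss$Infant.Mortality)  # plain average of the outcome

print(c(intercept = b0, mean = grand_mean))
```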
D
Above is a histogram of the birthweight (wt) in ounces for the 100 newborns in this data set. gf_histogram(~ wt, data = Gestation100, fill = "royalblue2", bins = 15) If we wanted to explore the idea that mother's smoking might explain the variation in birth weight, what visualization might be helpful to us? response - incorrect A gf_histogram(~ wt, data = Gestation100) %>% gf_facet_grid(smoke ~ .) B gf_point(wt ~ smoke, data = Gestation100) C gf_boxplot(wt ~ smoke, data = Gestation100) D All of the above would be helpful to us as we explore our data.
B
Above is a sampling distribution of F-values generated through a shuffling routine. What do we use this distribution for? A. We can compare this distribution to one made with bootstrapping to see if they look different B. We can compare our observed F-value to the values in the distribution to see how unusual our F is C. If our sample distribution does not look like this distribution, we reject the simple model D. This distribution is not useful for comparing to our distribution because our sample size is 74, but the sample size here is 1000
C
Above is the output estimating a simple model (also called null or empty model) for fullSpeed, how can we interpret the intercept? A. The population average speed of light is 299845 B. The speed of light is always 299845 C. In this experiment, the average speed of light was 299845 D. All runs in this experiment showed a speed of light of exactly 299845
B
Above we have included the ANOVA tables for Wetsuit = NoWetsuit + other stuff. Which distance is the basis of the SS Error? response - correct A The distance between data points and the empty model's prediction. B The distance between data points and the NoWetsuit model's prediction. C The distance between the NoWetsuit model's prediction and the empty model's prediction. D The distance between the NoWetsuit model's residual and the empty model's residual.
A
Above we have included the ANOVA tables for two models: wt = age + other stuff and wt = smoke.factor + other stuff. A classmate takes a look at these results and suggests that you judge these models using F instead of PRE. Why is that good advice? response - correct A Because F takes degrees of freedom into account and the smoke.factor model might just have a big PRE because that model used more degrees of freedom. B Because when you are comparing a regression model against a group model, you can't use PRE to compare them. C Because F is the only statistic that allows us to make comparisons of explanatory variables that are measured differently. D Because PRE is based on SS, which is a less familiar statistic to most people. F is based on variance, which is a more well known statistic.
D
Above we have included the ANOVA tables for two models: wt = age + other stuff and wt = smoke.factor + other stuff. Why do these two models have the same value for SS total (25356)? response - incorrect A Because both SS totals are based on the residuals from the empty model. B Because both SS totals are based on the same outcome variable. C Because both SS totals are based on the values from the same data set. D All of the above reasons together explain why the SS totals are the same.
B
Above we have included the ANOVA tables for wt = smoke.factor + other stuff. Which distance is the basis of the SS error? response - incorrect A The distance between data points and the empty model's prediction. B The distance between data points and the smoke.factor model's prediction. C The distance between the smoke.factor model's prediction and the empty model's prediction. D The distance between the smoke.factor model's residual and the empty model's residual.
C
Assume this code has already been run: wt.stats <- favstats( ~ wt, data = Gestation100) What will the following line of code do? rnorm(100, wt.stats$mean, wt.stats$sd) response - correct A Create a single, normal curve of weights, using the mean and SD of wt. B Generate a random sample of one data point from a normal distribution with a mean of 100. C Generate a random sample of 100 data points from a normal distribution with the same center and spread as wt. D Generate a sampling distribution of 100 means from a normal distribution with the same center and spread as wt.
C
Based on the descriptions and calculations above, would 12 months be included in a 95% confidence interval calculated from a sample with an average time between dental appointments of 10 months and a standard deviation of 4 months? (a) Yes, 12 would be included in the interval (b) There is a 95% chance 12 would be included in the interval (c) No, 12 would not be included in the interval (d) It is impossible to tell.
A
Consider a case where the population variance (σ²) is known. In this case, we do not need to estimate variance in the sample. In order to generate a confidence interval for the mean of the distribution, which mathematical function could we use to represent the sampling distribution: A. Normal distribution with sample mean and known variance B. Normal distribution with sample mean and sample variance C. t-distribution with degrees of freedom = n D. t-distribution with degrees of freedom = n-1
A
Consider a model where we predict infant mortality from education. What would the proper word equation for such a model be? A. Infant.Mortality = Education + Error B. Education = Infant.Mortality + Error C. Infant= Mortality + Education D. Mortality = Education + Infant + Error
A
Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The mean of the sampling distribution of the mean for Study 1 will be the same as the mean of the sampling distribution of the mean for Study 2. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW
C
Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The sample mean from Study 1 will be smaller than the sample mean from Study 2. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW
B
Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The sampling distribution of the mean from Study 1 will look exactly the same as the population distribution. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW
A
Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The sampling distribution of the mean from Study 1 will be more normally distributed than the sampling distribution of the mean from Study 2. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW
B
Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The standard error of the mean from Study 1 will be greater than the standard error of the mean from Study 2. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW
C
Economic theories suggest that if there is less supply, there will be more demand. Perhaps if there are fewer of a particular major, then those majors might be paid higher wages. We used totalgrads (how many people graduated with that major) to predict median_income for that major in a model called totalgrads.model. When we calculated the best fitting estimates, we found a negative slope (b1 = −.01989). Choose the correct interpretation for this value. A This is how much error has been explained per degree of freedom. B This is the proportion of b1 that could have been randomly generated if the empty model were true. C This is the decrement in predicted median earnings (in thousands of dollars) for each additional thousand graduates of that major. D This is the thousands of dollars predicted for the median income of a major that had 0 graduates.
C
Economic theories suggest that if there is less supply, there will be more demand. Perhaps if there are fewer of a particular major, then those majors might be paid higher wages. We've now used totalgrads (how many people graduated with that major) to predict the median_income of that major in a model called totalgrads.model (R code follows). totalgrads.model <- lm(median_income ~ total, data = collegegrads) Which of the following statements about SS is true? response - correct A SS Error for totalgrads.model should be greater than SS Total because the dots on this plot are not close to the regression model B The model (totalgrads.model) is not a very good explanatory model, so SS Model should be smaller than 1. C SS Model will be smaller than SS Total because SS Model is always smaller than SS Total, whether a model explains a lot of variation or not. D SS Error will be greater than SS Model because SS Error is always bigger than SS Model, whether a model explains a lot of variation or not.
B
Estimate the linear model predicting Infant Mortality from Education. What is the interpretation of the slope? A. For provinces with -.03% education beyond primary school, infant mortality is expected to increase by 1% B. For each 1% increase in education beyond primary school, infant mortality is expected to decrease by .03% C. For each 20.27% increase in education beyond primary school, infant mortality is expected to decrease by 0.03% D. None of the above
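The model in this question can be estimated with base R on the built-in swiss data. The sketch below only fits the model and pulls out the two estimates; the object names are illustrative.

```r
# Fit the regression model: Infant.Mortality = Education + Error
education_model <- lm(Infant.Mortality ~ Education, data = swiss)

b <- coef(education_model)
intercept <- unname(b["(Intercept)"])  # predicted mortality when Education is 0
slope <- unname(b["Education"])        # change in prediction per 1-point increase in Education

print(b)
```

The slope is read as the change in predicted infant mortality for each one-unit increase in Education, which is the form of interpretation used in option B.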
C
Given this distribution of data, could the population of newborn weights be shaped like a normal distribution? response - correct A No, the sample distribution doesn't look curved enough to have come from a normal distribution. B We cannot tell unless we did a few calculations to see if this fits the empirical rule. C Yes, it is possible because samples often do not look exactly like the populations that they were drawn from. D This sample distribution could only have come from a normal distribution because it is roughly symmetric and largely clustered in the center of the distribution.
D
If I want to draw a set of 30 random cases from my dataset with 1500 observations which function would I use: a) select(dataset, c(1:1500, 30)) b) dataset [ , sample(1:1500,30)] c) select (dataset, 30) d) dataset [sample (1:1500, 30) , ]
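The row-indexing idiom in option (d) can be tried on a made-up data frame. The data frame below is hypothetical and exists only so the code runs.

```r
set.seed(1)

# Hypothetical dataset with 1500 observations.
dataset <- data.frame(id = 1:1500, score = rnorm(1500))

# sample(1:1500, 30) draws 30 distinct row numbers.
# Placing it BEFORE the comma selects rows and keeps every column.
picked <- dataset[sample(1:1500, 30), ]

dim(picked)
```

Putting the sampled numbers after the comma, as in option (b), would try to select 30 columns instead of 30 rows.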
A
If the distribution of NoWetsuit was more variable (that is, has a greater standard deviation) than the distribution of Wetsuit, what would be true about the confidence interval of NoWetsuit compared to Wetsuit? response - correct A The NoWetsuit confidence interval would be wider. B The less "confident" you can be in the NoWetsuit confidence interval. C The NoWetsuit confidence interval should be based on the t rather than z distribution. D The NoWetsuit confidence interval should be calculated with the Central Limit Theorem and not with bootstrapping or simulation.
A
If the population average time between dental appointments is 12 months with a standard deviation of 4 months, what is the probability that a sample of 4 individuals has an average time between dental appointments of 8 months or lower? (a) 0.02 (b) 0.98 (c) 0.16 (d) 0.84
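The arithmetic behind this question can be sketched in R, assuming the sampling distribution of the mean is treated as normal. The standard error is the population standard deviation divided by the square root of the sample size.

```r
pop_mean <- 12   # population mean, in months
pop_sd   <- 4    # population standard deviation, in months
n        <- 4    # sample size

se <- pop_sd / sqrt(n)          # standard error of the mean: 4 / 2 = 2
z  <- (8 - pop_mean) / se       # 8 months is 2 standard errors below the mean

# Probability that a sample mean is 8 or lower.
p <- pnorm(8, mean = pop_mean, sd = se)
print(round(p, 2))
```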
C
If the researchers are interested in whether wearing wetsuits affects swimming velocity, what is the outcome variable of interest? response - correct A Wetsuit B NoWetsuit C The difference in velocity between swimming in NoWetsuit compared to Wetsuit. D Type of athlete
D
If the sample size were larger, would the confidence interval be wider or narrower? A. Wider, because more people leads to more variability (i.e., higher standard deviation) B. Wider, because the sample size is positively related to margin of error C. Narrower, because the standard deviation will get smaller with more people D. Narrower, because the sample size is negatively related to margin of error
C
If we wanted to create a sampling distribution of the shuffled slopes (b1), how would you modify this code? shuffledSDob1 <- do(10000) * b1(median ~ total, data = collegegrads) response - correct A Shuffle the do() function like this: shuffle(do(10000)) B Shuffle the data like this: shuffle(collegegrads) C Shuffle either median or total like this: shuffle(median) or shuffle(total) D Shuffle the b1() function like this: shuffle(b1(median ~ total, data = collegegrads))
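The shuffling idea can be sketched without the course functions. The code below is a base R approximation, not the course's `do()`, `b1()`, and `shuffle()` code. It uses the built-in swiss data as a stand-in for collegegrads, `sample()` for the shuffle, and `replicate()` for the repetition.

```r
set.seed(42)

# Observed slope from the real (unshuffled) data.
observed_b1 <- unname(coef(lm(Infant.Mortality ~ Education, data = swiss))[2])

# Shuffle ONE variable, refit, and keep the slope; repeat 1000 times.
shuffled_b1 <- replicate(1000, {
  shuffled_outcome <- sample(swiss$Infant.Mortality)  # permutation without replacement
  unname(coef(lm(shuffled_outcome ~ swiss$Education))[2])
})

# Proportion of shuffled slopes at least as extreme as the observed one.
p_value <- mean(abs(shuffled_b1) >= abs(observed_b1))
print(p_value)
```

Shuffling only one of the two variables breaks the pairing between them, so the resulting slopes show what to expect when there is no relationship.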
B
If we wanted to explore the idea that mother's age might explain the variation in birth weight, what visualization might be helpful to us? response - correct A gf_histogram(~ wt, data = Gestation100) %>% gf_facet_grid(age ~ .) B gf_point(wt ~ age, data = Gestation100) C gf_boxplot(wt ~ age, data = Gestation100) D All of the above would be helpful to us as we explore our data.
A
If we were to create a 99% confidence interval for the same data instead, would that confidence interval be wider (bigger) or narrower (smaller) than the 95% confidence interval. A. Wider/Bigger B. Narrower/Smaller C. It depends on sample size D. Impossible to tell
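The effect of the confidence level on interval width can be checked directly. The data set behind this question is not shown, so the sketch assumes the built-in swiss data as a stand-in and compares intervals for the mean of Infant.Mortality.

```r
# Intercept-only model, so the interval is for the mean.
empty_model <- lm(Infant.Mortality ~ 1, data = swiss)

ci95 <- confint(empty_model, level = 0.95)
ci99 <- confint(empty_model, level = 0.99)

# Width = upper bound minus lower bound.
width95 <- ci95[1, 2] - ci95[1, 1]
width99 <- ci99[1, 2] - ci99[1, 1]

print(c(width95 = width95, width99 = width99))
```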
B
If we were to use this data to guess the mean weight of newborns in the population, what would we be trying to estimate? response - correct A Standard error B β0 C Xi D SStotal
B
If you created a bootstrapped sampling distribution of 10,000 means from your sample of SpeedUp, what qualities would you expect it to have? response - correct A A roughly normal shape, and a standard deviation similar to the standard deviation of the sample B A roughly normal shape, and a standard deviation smaller than the standard deviation of the sample C A mean similar to the sample mean, and a standard deviation similar to the standard deviation of the sample D A shape similar to that of the sample, and a standard deviation smaller than the standard deviation of the sample
A
If you decide you want to increase your level of confidence in your estimate of Wetsuit (from 95% to 99%), what will happen to your confidence interval? response - correct A It will become wider. B It will become narrower. C It will become both wider and less reliable. D It will become both narrower and less reliable.
B
If you fit a model that predicts Mins by including FTMade as an explanatory variable, how many parameters would the model have? A 2: Mins and FTMade B 2: the y-intercept and the slope of the regression line C 2: the mean of Mins and the increment added for each free throw made that exceeds the mean number of free throws made D 4: Yi, b0, b1, Xi
D
If you want to know if a regression model is better than a simple model in terms of making a prediction, what parameter should you make a sampling distribution of? response - correct A The mean B The standard deviation C The confidence interval D The slope
C
If you were to calculate the sum of the residuals from the empty model of Mins, what would it be? A Less than the sum of the residuals from the Points.model of Mins B More than the sum of the residuals from the Points.model of Mins C 0 D It's impossible to tell
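The claim that residuals from the empty model sum to zero can be verified numerically. The NBA data are not available here, so the sketch assumes the built-in swiss data as a stand-in for Mins.

```r
# Empty model: every prediction is the mean of the outcome.
empty_model <- lm(Infant.Mortality ~ 1, data = swiss)

# Residuals are deviations from the mean, which balance out.
residual_sum <- sum(resid(empty_model))

# Expect a value that is zero up to floating-point rounding.
print(residual_sum)
```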
C
If you've identified your confidence interval for SpeedUp, what exactly are you confident about? response - incorrect A You're confident that the means of samples of SpeedUp falls within a normal range. B You're confident that at least 95% of athletes would swim faster when wearing a wetsuit. C You're confident that the true effect of wearing a wetsuit on swimming velocity lies within it. D You're confident that the SlowDown will be normally distributed in the population.
D
In addition to the NBA player data for the 2011 season, we also have a similar data frame called NBAPlayers2015 (with many of the same variables). Which of these values will be the same if we create two models with these lines of R code: Points11.model <- lm(Min ~ Points, data = NBAPlayers2011) Points15.model <- lm(Min ~ Points, data = NBAPlayers2015) response - incorrect A The SS total for both these models will be the same, because they have the same outcome variable. B The SS model for both these models will be the same because they have the same explanatory variable. C The best-fitting estimate of the empty model will be the same because it will be the mean number of minutes played. D None of these values (SS total, SS model, mean) will be the same.
D
In which case would you use the t-distribution over the normal distribution to make a confidence interval? (a) When you do not want to make assumptions about the distribution of the errors (b) When generating a confidence interval for a slope instead of a mean (c) When sample size is very small (n < 30) (d) When the population standard deviation is unknown
A
In which of the following situations can you use model comparison and/or confidence intervals to choose between the complex and simple model: three-group model (a) model comparison (b) confidence intervals (c) both
C
In which of the following situations can you use model comparison and/or confidence intervals to choose between the complex and simple model: regression model (a) model comparison (b) confidence intervals (c) both
C
In which of the following situations can you use model comparison and/or confidence intervals to choose between the complex and simple model: two group model (a) model comparison (b) confidence intervals (c) both
D
James computed a 95% confidence interval for the mean of a variable using a sample of size 100 as CI = 4 to 12. Based on this, James concluded that the likelihood of finding an individual with a score of 3.5 is 5% or less. Why is this conclusion FALSE? A. This conclusion is not FALSE, it is TRUE B. We can only be exactly 95% confident when sample size is infinite, so based on a sample of 100 we're less than 95% sure C. Based on the score we can calculate an exact probability of the individual's score, so we don't need to conclude it's 5% or less. D. The confidence interval is a range of possible means, not a range of possible individuals.
A
Let's compare two models, one that treats Points as a quantitative variable to predict Mins, and the other that uses Points to create 24 groups (Points24Group) before using it to predict Mins. The supernova tables below show that the PRE for Points24Group reduces the total variation in Mins by 77%, but the Points model reduces it by 65%. Why isn't the Points24Group model better than the Points model of Mins? F ratio of points = 322, F ratio of Points24Group = 22.25 A The F ratio shows that the Points model explains more variation per degree of freedom than the Points24Group. B The SS error is bigger for the Points model, which demonstrates its advantage over the Points24Group model. C The Points model is far more elegant because the name is shorter and less clunky. D Trick question! The Points24Model is better than the Points model because the PRE is bigger, the SS model is bigger, and the SS error is smaller. There are no measures that suggest that the Points model is better.
A
Let's imagine a world where the population average speed of light is 299792 with a standard deviation of 105. Fill in the blank for the following code to create a sampling distribution of means. SDOM <- do(1000)*mean(rnorm(20, mean = ________, sd = ________)) A. 299792; 105 B. 105; 299792 C. 299792; 105/sqrt(20) D. 105/sqrt(20); 299795
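The filled-in code can be approximated in base R, with `replicate()` standing in for `do(1000) *`. Note that `rnorm()` takes the population mean and standard deviation; the division by `sqrt(20)` shows up in the result rather than in the input.

```r
set.seed(2024)

# 1000 samples of size 20 from the imagined population; keep each sample's mean.
sdom <- replicate(1000, mean(rnorm(20, mean = 299792, sd = 105)))

# The spread of the means (standard error) should be near 105 / sqrt(20).
print(c(mean_of_means = mean(sdom),
        simulated_se  = sd(sdom),
        theory_se     = 105 / sqrt(20)))
```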
C
Now imagine that the same student simply wanted to know whether her original score was a 1470. How would you get her answer? A SAT[4]><1470 B SAT[4]=1470 C SAT[4]==1470 D SAT[4]<-1470
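The difference between testing and assigning can be seen on a small made-up vector. The scores below are hypothetical; only the name SAT and the value 1470 come from the question.

```r
# Hypothetical scores; the fourth one is the student's.
SAT <- c(1200, 1350, 1510, 1470)

# == asks a question and returns TRUE or FALSE without changing SAT.
is_1470 <- SAT[4] == 1470
print(is_1470)

# By contrast, SAT[4] <- 1470 (or SAT[4] = 1470) would overwrite the stored score.
```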
A
One researcher wondered if some of the variation in the difference in velocity came from the type of swimmer they were. Triathletes swim in wetsuits more often than competitive swimmers do, and she worried that their experience would influence the results of this study. Above is a faceted histogram of difference in velocity by the Type of athlete. The two vertical lines depict the mean of the swimmer group and the mean of the triathlete group. The lines on the two graphs line up, so the group means are very similar. What would be the PRE of this model: Type.model <- lm(SpeedUp ~ Type, data = Wetsuits) A Close to 0 B Close to 1 C Around .5 D I would need to run the supernova() function to be able to say.
B
One student suggests that players who make a lot of free throws (FTMade) are better and they would see more game time. Another student argues that making free throws doesn't make you a better player—having a higher free throw percentage (FTPct) is the sign of a better player, and suggests that would explain the variation in minutes played (Mins). Which of these plots would depict the relationship between Mins and one of these explanatory variables? response - incorrect A Box plots (i.e., gf_boxplot) B Scatter plots (i.e., gf_point) C Faceted histograms (i.e., gf_histogram with gf_facet_grid) D All of the above would be able to depict these relationships equally clearly.
D
Our Points.model of the outcome variable Mins can be represented as: Yi=b0+b1Xi+ei LeBron James scored 2,111 points in the 2011 season. In this equation, what part represents the prediction the Points.model would make for minutes played by LeBron James? A b0 B b1 C b1Xi D b0+b1Xi
B
Perhaps the mother's age (age) could explain some of the variation in birth weight (wt). Above, we have depicted the output of this model: age.model <- lm(wt ~ age, data = Gestation100) What does the .43 mean (the value labeled age in the output)? A This is the correlation between age and wt. B This is the increment to add on to the prediction of wt for every year of mother's age. C This is the increment to add on to the prediction of mother's age for every ounce of wt. D This is the percentage of newborn weights that are predicted accurately by using mother's age in the model.
B
Perhaps the mother's age (age) could explain some of the variation in birth weight (wt). Above, we have depicted the output of this model: age.model <- lm(wt ~ age, data = Gestation100) intercept = 109.26, age = 0.43 Which equation represents this model with the best-fitting estimates? response - correct A Yi = 109.26 + .43 + ei B Yi = 109.26 + .43Xi + ei C Yi = 109.26X1i + .43X2i + ei D Yi = 109.26b0 + .43b1 + ei
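The fitted equation can be used to make a prediction by plugging in a value for Xi. The estimates below come from the question; the mother's age of 30 is a hypothetical input chosen for illustration.

```r
b0 <- 109.26  # intercept: predicted wt (ounces) when age is 0
b1 <- 0.43    # slope: ounces added to the prediction per year of age

age <- 30     # hypothetical mother's age

# Prediction is b0 + b1 * Xi; the error term ei is not part of the prediction.
predicted_wt <- b0 + b1 * age
print(predicted_wt)
```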
B
Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). How would we depict the null model of maximum velocity in a wetsuit on this plot? response - incorrect A A line using the best-fitting estimates from lm(Wetsuit ~ NoWetsuit, data = Wetsuits) B A horizontal line at the mean of Wetsuit C A vertical line at the mean of NoWetsuit D A dot in the middle of this plot
B
Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). Which is the best way to take a look at this relationship (between Wetsuit and NoWetsuit)? response - incorrect A gf_histogram(~ Wetsuit, data = Wetsuits) %>% gf_facet_grid(NoWetsuit ~ .) B gf_point(Wetsuit ~ NoWetsuit, data = Wetsuits) C gf_boxplot(Wetsuit ~ NoWetsuit, data = Wetsuits) D All of these would effectively depict this relationship.
D
Recall that the variable SpeedUp is the difference between swimming with Wetsuit versus NoWetsuit. Why might you want to find the point below which 2.5% of bootstrapped sample means for SpeedUp fall, and the point above which 2.5% of simulated sample means for SpeedUp fall? response - incorrect A The points would help you determine whether the sample would be considered likely to have come from a population with a mean between those points. B It would enable you to calculate the critical distance. C It would enable you to establish the confidence interval. D All of the above.
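Finding the 2.5% cutoffs on a bootstrapped sampling distribution can be sketched in base R. The twelve SpeedUp values below are invented for illustration, and `sample(..., replace = TRUE)` stands in for the course's `resample()`.

```r
set.seed(7)

# Hypothetical SpeedUp differences for 12 athletes (m/sec).
speed_up <- c(0.08, 0.10, 0.07, 0.08, 0.10, 0.11,
              0.05, 0.05, 0.06, 0.07, 0.10, 0.08)

# Resample with replacement and keep the mean, 10,000 times.
boot_means <- replicate(10000, mean(sample(speed_up, replace = TRUE)))

# Cut off the lowest 2.5% and the highest 2.5% of bootstrapped means.
ci <- unname(quantile(boot_means, c(0.025, 0.975)))
print(ci)
```

The middle 95% of the bootstrapped means lies between these two cutoffs.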
A
Sampling distributions are good tools for visualizing which of the following ideas: a. sampling variability b. measurement error c. individual differences d. sums of squares
D
SpeedUp contains the swimming velocity with Wetsuit minus NoWetSuit. Could these differences in swimming velocity be normally distributed in the population? response - correct A No, that is not possible, because this distribution shows that the data are not clustered in the center. B No, that is not possible, because the sample was too small (n = 12), so the population could not have been normally distributed. C No, that is not possible, because this variable was created by subtracting two measurements. D It is possible.
A
Take a look at the model that we fit in this output. How would we represent this number with General Linear Model (GLM) notation? Intercept = 0.0775 response - correct A b0 B b1 C β0 D β1
D
The PRE from the total.model of median income is .0114. Which of the following lines of code would let us see the distribution of PREs if there were no relationship between median earnings and total graduates? response - correct A SDoPRE <- do(10000) * PRE(shuffle(median_income) ~ totalgrads, data = collegegrads) B SDoPRE <- do(10000) * PRE(median_income ~ shuffle(totalgrads), data = collegegrads) C SDoPRE <- do(10000) * PRE(resample(median_income) ~ totalgrads, data = collegegrads) D All of the above would create a sampling distribution of PREs based on the empty model.
A
The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: Wetsuit_i = b0 + b1 NoWetsuit_i + e_i. If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, how big is the standard error of the sampling distribution of b1? A Roughly .118 divided by 2 B Roughly .118 divided by square root of 12 C Roughly .9547 divided by 2 D Roughly .9547 divided by square root of 12
C
The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: Wetsuit_i = b0 + b1 NoWetsuit_i + e_i. If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, which of the following is NOT a correct interpretation? A We are 95% confident that the true slope of the DGP will be in this range. B There is a 95% chance that if you repeated this experiment with a different set of swimmers, the slope of the regression line will fall within this confidence interval. C 95% of all Wetsuit velocities have this relationship with the NoWetsuit velocity. D The true parameter (β1) will very likely fall inside this interval.
A
The best-fitting model using NoWetsuit velocity to predict Wetsuit is this: Yi=0.1423+0.9547Xi+ei How should we interpret 0.9547? response - correct A The increment to add on to the prediction of Wetsuit for every 1 m/sec of NoWetsuit B The increment to add on to the prediction of Wetsuit to each person's NoWetsuit C The difference between average Wetsuit and NoWetsuit D The prediction for Wetsuit when NoWetsuit is 0
B
The mean maximum swim velocity when wearing a wetsuit (i.e., Wetsuit) is 1.51 m/sec. If the critical distance is 0.08 m/sec, what's the range of possible values within which you're 95% confident that actual population mean would fall? response - correct A 1.47 m/sec to 1.55 m/sec B 1.43 m/sec to 1.59 m/sec C It depends on the standard deviation of the sample of Wetsuit. D It depends on the standard deviation of the population of Wetsuit.
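The interval here is the sample mean plus and minus the critical distance (margin of error). Both numbers come from the question.

```r
sample_mean       <- 1.51  # mean Wetsuit velocity, m/sec
critical_distance <- 0.08  # margin of error, m/sec

# Lower and upper bounds of the 95% confidence interval.
ci <- c(lower = sample_mean - critical_distance,
        upper = sample_mean + critical_distance)
print(ci)
```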
C
The mean of the distribution of median earnings in this data is $40,151.45. Is this a good estimate of the median earnings of college graduates in the United States? response - correct A Yes because the mean is an estimate that balances the residuals. B No because the mean is a terrible estimate of a median. C No because the observations in this data set are majors, not individual graduates. D No because some people (e.g., famous college dropouts like Mark Zuckerberg and Bill Gates) earn a lot more than college graduates.
D
The output of favstats(~ wt, data = Gestation100) is shown above. If they had collected a different sample of 100 newborns, what value would be different? response - correct A Max B Mean C Median D Most likely, all of the above
B
The sampling distribution of means above was created with this code: SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd)) gf_histogram(~ mean, data = SDoM, fill = "burlywood") If you were to stack up all the bars, what would the total be? response - incorrect A 100 B 10,000 C The number of objects in the population D The sum of all the values in this distribution
A
The sampling distribution of means above was created with this code: SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd)) gf_histogram(~ mean, data = SDoM, fill = "burlywood") Which of the following is a true statement? response - correct A The standard error of this distribution is smaller than wt.stats$sd. B The standard error of this distribution is larger than wt.stats$sd. C The standard error of this distribution is approximately equal to wt.stats$sd. D It's impossible to know from this information alone.
D
The sampling distribution of means above was created with this code: SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd)) gf_histogram(~ mean, data = SDoM, fill = "burlywood") Someone tells you that their baby was 116 ounces at birth. What is the likelihood of a baby having a birth weight of 116 or lower in the population? response - incorrect A Probably close to 0 because such a low birth weight is in the lowest tail of this sampling distribution. B Probably close to .05 because this is a normally distributed sampling distribution. C I could find out by running this R code: tally(~mean <= 116, data = SDoM, format = "proportion"). D I am not sure because I cannot tell the likelihood of a single baby having such a birth weight from this sampling distribution of means.
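The three SDoM questions above can be explored with a base-R sketch. Here replicate() stands in for do() from the mosaic package, and the population mean and standard deviation are made-up placeholders for wt.stats$mean and wt.stats$sd, so the exact numbers are assumptions:

```r
set.seed(1)
# Hypothetical stand-ins for wt.stats$mean and wt.stats$sd (ounces)
pop_mean <- 120
pop_sd <- 18

# 10,000 simulated samples of n = 100, keeping only each sample's mean
sample_means <- replicate(10000, mean(rnorm(100, pop_mean, pop_sd)))

n_means <- length(sample_means)   # one value per simulated sample
std_error <- sd(sample_means)     # spread of the sample means

# Proportion of individual babies at or below 116 (under the assumed normal population)
prop_individuals <- pnorm(116, pop_mean, pop_sd)
# Proportion of sample means at or below 116 (a different question)
prop_means <- mean(sample_means <= 116)
```

The histogram bars would stack up to n_means, the standard error comes out well below pop_sd, and the two proportions differ because a sampling distribution of means says little about how unusual a single baby is.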
A
There are actually five experiments in this dataset (all the same experiment run a few different times). First, we're going to look at just the data from Experiment 3. Which of the following commands would create a new data set called Exp3, which contains only the data from the third experiment (HINT: you can try out different commands until you find one that creates what you want). A. Exp3 <- filter(morley, Expt == 3) B. Exp3 <- select(morley, Expt == 3) C. Exp3 <- sample(morley, sum(Expt == 3)) D. Exp3 <- resample(morley, sum(Expt == 3))
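As a rough check of what the row-filtering command produces, here is a base-R sketch. It uses subset() rather than dplyr's filter() so that it runs without extra packages; that swap is an assumption made for portability, not the course's own code:

```r
# morley ships with base R: 5 experiments (Expt) with 20 runs each
Exp3 <- subset(morley, Expt == 3)   # keeps only rows where Expt is 3

nrow(Exp3)          # number of runs in the third experiment
unique(Exp3$Expt)   # should contain only the value 3
```

Filtering keeps rows and leaves the columns alone, whereas select() picks columns.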
d
This is the distribution of median earnings for different college majors. What makes the mean a good model of this distribution? response - correct A It's your only option, because you can't take the median of a group of medians. B The mean is always the best model for a skewed distribution. C The mean is the best statistic because we can simulate sampling distributions of the mean. D The mean is a model that spends only one degree of freedom and minimizes squared error.
a
Using the code below, we created a model to predict median earnings using STEM as the explanatory variable, and then to construct 95% confidence intervals around the parameter estimates. The output is pictured above. STEM.model <- lm(median_income ~ STEM, data = collegegrads) confint(STEM.model) If we repeated this study but our sample size was larger and thus our standard error was smaller, what would be different about the confidence interval? response - incorrect A It's likely that the 95% CI of b1 would be smaller. B It's likely that the 95% CI of b1 would be larger. C There is no way to tell because standard error is not related to confidence intervals. D The confidence interval would stay the same as long as the confidence level is the same.
C
We fit a model of Mins predicted by FTMade and called it FTMade.model (the output is below). If you know a player had 0 free throws, how many minutes would you predict he played? intercept = 1662.309 and FTMade = 2.834 A 0 B 2.83 C 1662.31 D 1662.31 + 2.83
D
We fit a model of Mins predicted by FTMade and called it FTMade.model (the output is below). We also fit a model of Mins predicted by Points (points scored) and called it Points.model (output below). From these best-fitting parameters, can we tell which model explains more variation: FTMade.model or Points.model? A Yes, FTMade.model is a better model, because the increment of time added on per free throw made is larger than the increment of time added on per point scored. B Yes, FTMade.model is a better model because the intercept is larger than the intercept for Points.model. C No, we should never compare models that have different explanatory variables because they are in different units. D No, we cannot tell from the best-fitting estimates how much variability has been explained by a model.
C
We found the best-fitting estimates and put them into the FTMade.model of the outcome variable Mins: Yi=1662.31+2.83Xi+ei LeBron James played 3,063 minutes, scored 2,111 points, and made 758 free throws in the 2011 season. What is the FTMade.model's prediction for minutes played by LeBron James? A 3063 B 1662.31 C 1662.31 + 2.83*758 D 1662.31 + 2.83*758 + 2111
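Plugging a value of X into the fitted equation gives the model's prediction. A small sketch using the estimates and LeBron James's numbers from the question (the variable names are illustrative):

```r
b0 <- 1662.31   # intercept: predicted minutes with 0 free throws made
b1 <- 2.83      # increment in predicted minutes per free throw made
ft_made <- 758  # free throws made

predicted_mins <- b0 + b1 * ft_made   # the model's prediction
residual <- 3063 - predicted_mins     # actual minutes minus predicted minutes
print(c(predicted = predicted_mins, residual = residual))
```

Points scored play no role here because the FTMade.model has only one explanatory variable.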
D
We have quantified error from the FTMade.model of Mins and the Points.model of Mins by using the supernova() function. Which of the following are reasons to think that the Points.model is better than the FTMade.model? A The FTMade.model's PRE is less than the Points.model's PRE. B The FTMade.model's SS model is less than the Points.model's SS model. C The Points.model's SS error is less than the FTMade.model's SS error. D All of the above
C
We looked only at the majors that were considered STEM majors. boxplot(median_income ~ major_category, data = STEMonly) major_category.model <- lm(median_income ~ major_category, data = STEMonly) 4 major categories!!! How would you specify the major_category.model using GLM notation? response - correct A Yi=b1X1i+b2X2i+b3X3i+b4X4i+ei B Yi=b0+b1X1i+b2X2i+b3X3i+ei C Yi=b0+b1X1i+b2X2i+b3X3i+ei D Yi=b0+b4X4+ei
A
We made the variable smoke into a factor and saved it as smoke.factor. Then we fit a model that used smoke.factor to explain variation in wt. All the R code for this is shown below. Gestation100$smoke.factor <- factor(Gestation100$smoke, levels = c(0:3), labels = c("never", "smokes now", "until current pregnancy", "once did, not now")) smoke.factor.model <- lm(wt ~ smoke.factor, data = Gestation100) Which GLM equation specifies the smoke.factor.model? response - correct A Yi = b0 + b1X1i + b2X2i + b3X3i + ei B Yi = b0 + b1X1i + b2X2i + b3X3i + b4X4i + ei C Yi = b0X0i + b1X1i + b2X2i + b3X3i + ei D Yi = b0 + b3X3i + ei
B
We made the variable smoke into a factor and saved it as smoke.factor. Then we fit a model that used smoke.factor to explain variation in wt. The ANOVA table for this model is shown above. Why is the degrees of freedom for the Model 3? response - correct A Because there are three parts of the model: Model, Error, and Total. B Because there were three additional parameters estimated in the smoke.factor model compared to the empty model. C Because there were three variables involved in this study: smokers, nonsmokers, and weight. D Because there were three data points that we excluded to fit this model.
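To see why a four-level factor yields an intercept plus three additional parameters, one can inspect the design matrix R builds. A sketch with made-up smoke codes (the real Gestation100 data is not used here):

```r
# Made-up smoke codes, two observations per level
smoke <- c(0, 1, 2, 3, 0, 1, 2, 3)
smoke.factor <- factor(smoke, levels = 0:3,
                       labels = c("never", "smokes now",
                                  "until current pregnancy", "once did, not now"))

# Design matrix for a model with smoke.factor as the only explanatory variable
X <- model.matrix(~ smoke.factor)
colnames(X)   # intercept (reference level) plus one dummy column per other level
ncol(X)       # total number of parameters
```

The first level becomes the reference group captured by b0, so the model estimates three parameters beyond the empty model, which is where the Model degrees of freedom of 3 comes from.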
A
We made this scatterplot to explore the idea that free throw percentage (FTPct) would predict how many minutes a player gets to play. If we fit an empty model to this data, how would we depict it on this plot? response - correct A A horizontal line that shows the mean for minutes played. B A vertical line that shows the mean free throw percentage. C A diagonal line that bisects the cloud of points. D You would not be able to represent the empty model visually because it is a single number.
b
We ran this code to calculate the residuals from the STEM model. STEM.model <- lm(median_income ~ STEM, data = collegegrads) collegegrads$resid <- resid(STEM.model) Interpret the residual (-7.86) for the Molecular Biology major. response - correct A This major's median earning is about $8,000 less than the Grand Mean. B This major's median earning is about $8,000 less than the STEM model's predicted median earning for this major. C The STEM model would predict that graduates of this major make $8,000 less than the average major. D The STEM model's prediction is $8,000 less for this major than the empty model's prediction.
D
We use this code to calculate the correlation coefficient (Pearson's r) for Mins and Points: cor(Mins ~ Points, data = NBAPlayer2011) What have we found? A A measure of how tight the data points are around the regression line B The slope of the regression line between the standardized Mins and Points C The strength of a bivariate relationship D All of the above
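One way to check that Pearson's r is also the slope between standardized variables is to compare the two directly. This sketch uses the built-in mtcars data as a stand-in for NBAPlayer2011, and z() is a hypothetical helper defined here:

```r
# Helper that converts a variable to z scores
z <- function(x) (x - mean(x)) / sd(x)

# Pearson's r between two quantitative variables
r <- cor(mtcars$mpg, mtcars$wt)

# Slope of the regression line fit to the standardized variables
std_slope <- coef(lm(z(mpg) ~ z(wt), data = mtcars))[[2]]

print(c(r = r, std_slope = std_slope))
```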
B
What does the orange (normal) curve drawn on this histogram represent? A This represents the population from which these data were drawn randomly. B This represents a normal distribution that was fit to the mean and standard deviation of this data. C This represents a normal curve that shows the 95% of data points that lie within two standard deviations of the mean. D This is another way of representing the sample distribution.
A
What is the Bonferroni adjustment used for? (a) Adjusting Type I error rate when we do multiple comparisons (b) Correcting for bias in a confidence interval (c) Accounting for variance explained while adjusting for degrees of freedom (d) Adjusting variance estimates to take into account measurement error
A
What is the purpose of using the shuffle method as compared to bootstrapping? A. Shuffling provides an estimated sampling distribution assuming the simple model is true B. Shuffling provides an estimated sampling distribution assuming the complex model is true C. Shuffling maintains the original range of the data, whereas bootstrapping does not D. There is no advantage of using shuffling over bootstrapping
A
What kind of distribution would this code create? do(10000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12)) response - incorrect A A sampling distribution of bootstrapped slopes B A sampling distribution of means C A sampling distribution of the mean difference between Wetsuit and NoWetsuit D The population distribution that our sample could have come from
B
What will the following code do? resample(Wetsuits, 12) response - correct A Take a new sample from the population of Wetsuits B Create a new sample from the observations in Wetsuits C Create a new sample from Wetsuits that is the same as the original sample D Select a random observation from the 12 observations in Wetsuits
A
What will the following code do? xqt(.025, df = 999) response - correct A Return t critical for a sample size of 1000 B Return a square root C Return the confidence interval D Return the percentage of data points that fall below .025, given df = 999
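The xqt() function comes from the mosaic package; base R's qt() returns the same quantile without the plot. A quick sketch:

```r
# Critical t values that cut off the lowest and highest 2.5% with df = 999
t_lower <- qt(.025, df = 999)
t_upper <- qt(.975, df = 999)
print(c(t_lower, t_upper))
```

With a sample of 1000 there are 999 degrees of freedom for a mean, and the cutoffs land close to the familiar normal values of roughly plus or minus 1.96.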
B
What would this code show us? SDob1 <- do(100000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12)) SDob1 <- arrange(SDob1, desc(b1)) SDob1$b1[2500] response - correct A The wetsuit velocity of the top 2500 swimmers in the population B The highest population increment that could have produced our sample and it would still be considered likely C The lowest population mean that could have produced our sample and it would still be considered likely D The highest population mean that could have produced our sample and it would still be considered likely
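The bootstrapping questions above can be sketched in base R. The data frame below is a made-up stand-in for Wetsuits (12 hypothetical swimmers), and sample() with replacement plays the role of resample(), so the results are illustrative only:

```r
set.seed(42)
# Hypothetical stand-in for the Wetsuits data: 12 swimmers, made-up velocities
NoWetsuit <- runif(12, 1.1, 1.7)
Wetsuits <- data.frame(NoWetsuit = NoWetsuit,
                       Wetsuit = 0.14 + 0.95 * NoWetsuit + rnorm(12, 0, 0.02))

# Resample the 12 rows with replacement and refit the slope each time
boot_b1 <- replicate(2000, {
  rows <- sample(nrow(Wetsuits), replace = TRUE)
  coef(lm(Wetsuit ~ NoWetsuit, data = Wetsuits[rows, ]))[["NoWetsuit"]]
})

# Middle 95% of the bootstrapped slopes
ci <- quantile(boot_b1, c(0.025, 0.975), na.rm = TRUE)

# Same idea as arrange(desc(b1)) then picking the cutoff for the top 2.5%
upper_cutoff <- sort(boot_b1, decreasing = TRUE)[50]
print(ci)
```

Each resample reuses the original observations rather than drawing new swimmers from the population, and the sorted cutoff approximates the upper bound of the interval for the slope.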
C
What's the purpose of generating a sampling distribution of means of SpeedUp by resampling (also called bootstrapping)? response - correct A These values can supplement your existing data if your sample is too small. B This distribution can confirm the accuracy of your calculated sample mean of SpeedUp. C This distribution can help you quantify how much your best estimate of the population mean could vary. D Bootstrapping eliminates the element of chance from the sampling process.
B
What's the value in using the t distribution? response - incorrect A It's less variable than the normal distribution. B It works well as a model of the sampling distribution if the sample size is small, or standard deviation of the population is unknown. C It helps us determine the degrees of freedom from our data. D It works well as a model of the population if the sample size is small, or standard deviation of the population is unknown.
B
When calculating PRE which two sums of squares are used for this calculation? A. SSModel & SSError B. SSModel & SSTotal C. SSError & SSTotal
A
When calculating the F-ratio which two sums of squares are used for this calculation? A. SSModel & SSError B. SSModel & SSTotal C. SSError & SSTotal
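Both PRE and the F-ratio can be rebuilt by hand from the sums of squares in an ANOVA table. A sketch using the built-in swiss data and base R's anova() in place of supernova() (that substitution is an assumption for portability):

```r
model <- lm(Infant.Mortality ~ Education, data = swiss)
tab <- anova(model)

ss_model <- tab["Education", "Sum Sq"]
ss_error <- tab["Residuals", "Sum Sq"]
ss_total <- ss_model + ss_error

# PRE: proportion of total error removed by the model
pre <- ss_model / ss_total

# F: mean square for the model divided by mean square for error
f_ratio <- (ss_model / tab["Education", "Df"]) /
           (ss_error / tab["Residuals", "Df"])

print(c(PRE = pre, F = f_ratio))
```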
a
When we calculated the best fitting estimates, we found a negative slope (b1 = −.01989). How can we find the confidence interval around this slope? response - incorrect A Bootstrap b1s by resampling from this sample and finding the cut off for the highest and lowest 2.5% of b1s B Randomize b1s by shuffling the totalgrads values that go with median_incomes values and find the highest and lowest 2.5% of b1s C confint(empty.model) D All of the above
B
Which of the following R code would create a new variable called SpeedUp that contains the difference between swimming with Wetsuit versus NoWetsuit? response - correct A SpeedUp = Wetsuit - NoWetsuit B Wetsuits$SpeedUp <- Wetsuits$Wetsuit - Wetsuits$NoWetsuit C SpeedUp(Wetsuits$Wetsuit - Wetsuits$NoWetsuit) D SpeedUp$Wetsuit - SpeedUp$Wetsuit
A
Which of the following chunks of code creates a sampling distribution of F-ratios using the shuffle method? A. SDoF <- do(1000)*fVal(Infant.Mortality~shuffle(Education), data = swiss) B. SDoF<- do(1000)*fVal(Infant.Mortality~Education, data = resample(swiss)) C. SDoF <- do(1000)*fVal(shuffle(Infant.Mortality~Education, data = swiss)) D. SDoF <- shuffle(do(1000)*fVal(Infant.Mortality~Education, data = swiss))
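A shuffled sampling distribution of F can be approximated in base R as well. In this sketch f_val() is a hypothetical helper standing in for fVal(), and sample() without replacement stands in for shuffle():

```r
set.seed(7)
# Hypothetical helper: F-ratio for a one-predictor model
f_val <- function(formula, data) anova(lm(formula, data = data))[1, "F value"]

# Shuffle only the explanatory variable, breaking any real relationship
SDoF <- replicate(1000, {
  shuffled <- transform(swiss, Education = sample(Education))
  f_val(Infant.Mortality ~ Education, shuffled)
})

observed_f <- f_val(Infant.Mortality ~ Education, swiss)
# Proportion of shuffled Fs at least as large as the observed F
p_value <- mean(SDoF >= observed_f)
```

Because shuffling destroys the link between Education and Infant.Mortality, the resulting distribution shows what F-ratios look like when the simple (empty) model is true.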
C
Which of the following does not influence the width of a confidence interval for the mean? (a) Sample size (b) Confidence level (c) Size of effect (d) Standard deviation
c
Which of the following is an accurate interpretation of the 95% confidence interval for the b1 coefficient in the regression model? A. 95% of the time you'll have to wait between 10.11 minutes and 11.34 minutes to see an eruption. B. There is a 95% chance that β_1 is between 10.11 and 11.34 C. If the population slope β_1 is between 10.11 and 11.34 then our observed slope is considered likely (no less than 5%). D. If I ran another study of Old Faithful, with the same sample size, there is a 95% chance I'll get the exact same slope estimate
C
Which of the following is the correct interpretation of MS Total (297,230) in the supernova table above? A This is, roughly, the total number of points in the data frame. B This is, roughly, the total number of squared means based on the empty model. C This is, roughly, the average squared residual from the mean. D This is, roughly, the standard deviation from the mean.
B
Which of the following is the correct interpretation of PRE (0.65) in the supernova table above? A 65% of the players' minutes in the data frame can be predicted with their Points. B 65% of the SS from the empty model can be explained by adding Points to the complex model. C 65% of the Points model can be proportionally reduced by the empty model. D The Points model's SS total will be 65% of the SS total from the empty model.
B
Which of the following is the correct interpretation of PRE (0.98) in the supernova table above? response - correct A 98% of the Wetsuit velocities in the data frame can be predicted with the corresponding NoWetsuit velocity. B 98% of the SS from the empty model can be explained by adding NoWetsuit to the complex model. C 98% of the NoWetsuit model can be proportionally reduced by the empty model. D The NoWetsuit model's SS Total will be 98% of the SS Total from the empty model.
A, B
Which of the following methods would be appropriate for creating a confidence interval for the slope (b1) CHECK ALL THAT APPLY A. t-distribution B. bootstrapping C. shuffling D. normal distribution
B
Which of the following would be an accurate interpretation of a confidence interval for the average time to dental appointments? (a) 95% of individuals will have time to dental appointments within this range (b) For 95/100 studies of time to dental appointments, their confidence interval will capture the true mean. (c) There is a 5% chance that the true mean is not inside the calculated confidence interval. (d) There is a 95% chance that the true mean is 10 months.
A
Which of the following would be the R command for estimating a simple model predicting Infant.Mortality? A. lm(Infant.Mortality~NULL, data = swiss) B. lm(Education~Infant.Mortality, data = swiss) C. predict(Infant.Mortality) D. lm(NULL~Infant.Mortality, data = swiss)
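Fitting the simple (empty) model on the built-in swiss data shows that its single parameter estimate is the sample mean. A short sketch, with confint() added to show the interval around that estimate:

```r
# Empty model: no explanatory variable, only an intercept
empty_model <- lm(Infant.Mortality ~ NULL, data = swiss)

coef(empty_model)              # b0, the one estimated parameter
mean(swiss$Infant.Mortality)   # the same value, computed directly
confint(empty_model)           # 95% confidence interval around b0
```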
A, B, D
Which of the variables below would be appropriate for a histogram? (Check all that apply.) A College B IQ C Pres2008 D Population
B
Why is the SS Total the same value for the FTMade.model and the Points.model? A Both are based on the residuals from the mean of the same explanatory variable. B Both are based on residuals from the mean of the same outcome variable. C All models that use the same data frame will have the same SS total. D Both models are based on the same number of values (n = 176).
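SS Total depends only on the outcome variable, which can be checked by fitting two models with different explanatory variables. This sketch uses mtcars as a stand-in for the NBA data:

```r
# Two models of the same outcome (mpg) with different explanatory variables
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ hp, data = mtcars)

# SS Total = SS Model + SS Error for each model
ss_total_1 <- sum(anova(m1)[["Sum Sq"]])
ss_total_2 <- sum(anova(m2)[["Sum Sq"]])

# SS Total computed directly from residuals around the mean of the outcome
ss_total_direct <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

print(c(ss_total_1, ss_total_2, ss_total_direct))
```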
A
Why might it be helpful to calculate means from random samples of 100 drawn from a normal distribution with the same mean and standard deviation as wt? response - correct A Because the resulting sampling distribution would give us an idea of how much random sample means could vary. B Because if the resulting sampling distribution is roughly normal, that proves that our sample came from a normally distributed population. C Because this helps us fit the best fitting normal curve over our sample distribution. D Because the mean of the sampling distribution will tell us what the true mean of the population is.
B
Why might we prefer to create a confidence interval using a t-distribution rather than a normal distribution? A. There is no reason to prefer a t-distribution over a normal distribution B. The t-distribution takes into account the fact that population variance is unknown C. Normal distribution makes an assumption about the distribution of the errors, but the t-distribution does not D. The t-distribution will give a narrower interval, which is more precise, so it is preferred
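Comparing critical values shows how the t-distribution widens an interval to account for estimating the population variance. A small sketch (the degrees of freedom are chosen for illustration):

```r
z_crit <- qnorm(.975)                # normal critical value
t_crit_small <- qt(.975, df = 11)    # small sample, e.g. n = 12
t_crit_large <- qt(.975, df = 999)   # large sample, e.g. n = 1000

print(c(z = z_crit, t_df11 = t_crit_small, t_df999 = t_crit_large))
```

The t critical value is larger than the normal one, noticeably so for small samples, so a t-based interval is wider rather than narrower.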
B
Why would you choose to use bootstrapping to create a confidence interval rather than simulation? A. It creates a symmetric confidence interval B. It does not require an assumption about the shape of the errors (e.g., normally distributed) C. Bootstrapping will always give the same answer, whereas simulation gives a different answer every time you run the code. D. There are no reasons to choose bootstrapping over simulation.
B
You run the following R code: Points.model <- lm(Mins ~ Points, data = NBAPlayers2011) Points.model When you do so, you get the following output: intercept = 1156.0680 points = 1.062 Which of the following equations represents the fitted model? response - correct A Yi=1156.68 + 1.06 + ei B Yi= 1156.68+1.06Xi+ei C Yi=1.06 + 1156.68 + ei D Yi=1.06+1156.68Xi+ei
B
You're interested in computing the confidence interval around the estimated mean of SpeedUp (how much a swimmer's velocity increases by having a wetsuit). What should you add to the following code in order to do so? Empty.model <- lm(SpeedUp ~ NULL, data = Wetsuits) response - correct A xqt(.025, df = 999) B confint(Empty.model) C SDoM <- do(10000) * mean(resample(Empty.model)) D Empty.model
B
if my data is a good representation of the population, use: (a) shuffling (b) bootstrapping (c) mathematical F distribution (d) mathematical normal distribution (e) mathematical t distribution
D
if the population mean is the mean in my data and the variance is known, use: (a) shuffling (b) bootstrapping (c) mathematical F distribution (d) mathematical normal distribution (e) mathematical t distribution
E
if the population mean is the mean in my data and the variance is unknown, use: (a) shuffling (b) bootstrapping (c) mathematical F distribution (d) mathematical normal distribution (e) mathematical t distribution
a, c
if the simple model is true in the population, use... (a) shuffling (b) bootstrapping (c) mathematical F distribution (d) mathematical normal distribution (e) mathematical t distribution
B
which of the following is an example of measurement error? a) we can use year in college to predict amount of school spirit, but even when we've accounted for year there's still some variability (or error) around the prediction b) 2 nurses measure the blood pressure of a patient at the same time, one on the left arm and one on the right. one nurse gets 115/67 and the other gets 117/65 c) when asked about salary one participant from Canada reports his salary in Canadian dollars instead of American dollars whereas all the other participants report their salary in American dollars d) Alex makes a mistake when entering the number of sexual partners a participant has, and enters 115 instead of 15
C
If we express our model as Yi = b0 + b1X1i + b2X2i + ei which part represents the model's prediction for Exercise? A Yi B b0 C b0 + b1X1i + b2X2i D b1X1i
B
Let's say you wanted to create a vector called "SAT" from a list of SAT scores. How would you do that? A SAT <- (1300, 1120, 1050, 1470, 1350) B SAT <- c(1300, 1120, 1050, 1470, 1350) C SAT <- vector(1300, 1120, 1050, 1470, 1350) D vector <- SAT(1300, 1120, 1050, 1470, 1350)
B
To examine the distribution of Happiness, which would be more useful? A A tally B A histogram C In this case, either would be useful. D In this case, neither would be useful.
D
What notation cannot be used to represent the mean of the sample? A Yi B β0 C μ D All of the above
C
Which of the following R code would quickly help you find the number of states in which McCain won the 2008 election, and the number of states in which Obama won? A winner(~ Pres2008, data = USStates) B gf_boxplot(~ Pres2008, data = USStates) C tally(~ Pres2008, data = USStates) D str(~ Pres2008, data = USStates)
D
Why is the mean a good model for hwy? A Because the mean is the only true statistical model that can represent a population parameter. B Because the mean is the best model whenever you make a visualization of data. C Because the mean is the best model for all categorical variables. D Because the mean is a model that balances the residuals and minimizes the sum of squared residuals.
D
Broadly speaking, what do we study when we study statistics? A Data B Formulas C Variables D Variation
D
Does this show that cardiovascular health (that is, being in a lower pulse group) causes students to also exercise more? A Yes, because the F statistic is quite large (around 10), and the PRE is reasonable. B Yes, because we have discovered the best fitting parameter estimates. C No, because this analysis actually shows that exercising more causes lower pulse rates. D No, because the design of this study was correlational, so we cannot determine causation.
B
Having a low resting heart rate (recorded in the variable Pulse) is supposed to be an indicator of good cardiovascular health. Let's say we wanted to create three groups based on Pulse: low, medium, and high. Which of the following code would do that, and save the values in a new variable called Pulse3Group? A Pulse3Group <- ntile(3) B StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 3) C StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 2) D StudentSurvey <- ntile(StudentSurvey$Pulse3Group)
C
How will this correction (changing 54 back to 27) affect the mean and the median? A Both the mean and median will be equally affected by this correction. B The median will be affected more than will the mean. C The median will be affected less than will the mean. D It's impossible to say how the median will be affected by the correction.
D
How would you create a plot to look at the distribution of Distance? A gf_plot(~Distance$BikeCommute) B gf_histogram(~ Distance) C gf_histogram(BikeCommute, Distance) D gf_histogram(~ Distance, data = BikeCommute)
B
How would you quickly find the total number of water samples (or test tubes) collected across all of the lakes in your study? A sample(sum, FloridaLakes) B sum(FloridaLakes$NumSamples) C tally(~NumSamples, data = FloridaLakes) D arrange(FloridaLakes, NumSamples)
A
How would you use R to calculate variance in hwy? A var(mpg$hwy) B lm(hwy ~ var, data = mpg) C anova(mpg, data = hwy) D favstats(~ hwy, data = hwy)
B
If a data point is very far away from the mean, what would you expect for the residual? A When farther away, the more positive the residual. B When farther away, the larger the absolute value of the residual. C When farther away, the more variable the residual. D When a data point is very far away from the mean, the residual should be 0, because the mean balances the residuals.
C
If the mean for TopSpeed is 33.6, what will the empty model predict for each observation's TopSpeed? A A value within one quarter of 33.6 B 0 C 33.6 D It's impossible to say.
D
If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the data? A -10 B 33.6 C 57.2 D 23.6
A
If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the residual? A -10 B 33.6 C 57.2 D 23.6
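Under the empty model, DATA = MODEL + ERROR, so the residual is the observed value minus the mean. A sketch with the numbers from the question (the variable names are illustrative):

```r
model_prediction <- 33.6   # empty model predicts the mean for every observation
data_value <- 23.6         # the observed TopSpeed

residual <- data_value - model_prediction
print(residual)
```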
B
If the z score for your friend's car's highway miles per gallon is found to be .6, what does that mean? A The car's highway miles per gallon is 60% better than the other configurations of cars in the distribution. B The car's highway miles per gallon is .6 standard deviations larger than the mean for hwy. C The car's highway miles per gallon is now smaller because .6 is smaller than 27. D The car's highway miles per gallon should be a whole number (like in the Empirical Rule), which clearly suggests an error in the calculation.
A
If we fit a normal curve on the distribution of hwy (see visualization below), what is it that we're modeling with it? A Error around the model for hwy B The median C The empty model for hwy D Sample statistics
C
If we used this code to fit the empty model: Empty.model <- lm(Exercise ~ NULL, data = StudentSurvey) And then used the predict() function to make a prediction for each student's number of hours exercised per week, what value would it predict for each student? A It would depend on how much they actually exercised. B The value would be the mean number of hours exercised by that student that year and would vary for each student. C The value would be the mean number of hours exercised by this sample and would be the same for each student. D You would not be able to determine the value because it is represented by b0.
B
If you create an empty model of TopSpeed, what would it mean to have an "empty model"? A The model would be the best way of explaining how many variables contribute to TopSpeed (such as time of year and type of bike). B The model would include only mean TopSpeed. C The model would predict a different TopSpeed depending on the situation. D None of the above.
B
If you fit a model that predicts Mins by including FTMade as an explanatory variable, how many parameters would the model have? A 2: Mins and FTMade B 2: the y-intercept and the slope of the regression line C 2: the mean of Mins and the increment added for each free throw made that exceeds the mean number of free throws made D 4: Yi, b0, b1, Xi
B
If you ran the R code below, what would you be able to tell from the output? Empty.model <- lm(hwy ~ NULL, data = mpg) Empty.model A How much error there is around the empty model B The mean C β0 D All of the above
D
If you ran the R code below, what would you be able to tell from the output? Empty.model <- lm(hwy ~ NULL, data = mpg) anova(Empty.model) A How much error there is around the empty model B The sum of the squared residuals C The sum of squares D All of the above
C
If you want to quickly see the name of the lake with the lowest average mercury level, what R command might you run? A arrange(FloridaLakes) B tally(FloridaLakes$AvgMercury) C arrange(FloridaLakes, AvgMercury) D str(FloridaLakes)
A, B, D
If you want to use R to get the sum of 10 and 20, what code would you write? (Check all that apply.) A sum(10+20) B sum(10,20) C Sum(10+20) D 10+20
D
If you want to write a note to yourself about your R code, but you want R to ignore it, how would you do so? A You can't. R is very literal and will run everything entered. B Simply type "[ignore]" at the start of the line. C Include a tab at the start of the line. D Include a hashtag (#) at the start of the line.
C
If you wanted to calculate a z score for a hwy of 27, how would it be affected by the standard deviation for hwy? A If the standard deviation is large, the z score should also be very large and positive. B If the standard deviation is large, the absolute z score should also be large but we won't be able to tell if it is positive or negative. C If the standard deviation is large, the z score should be small and positive. D Standard deviation and z are unrelated because they measure different things about the distribution.
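A z score divides the distance from the mean by the standard deviation, so the same distance shrinks as the standard deviation grows. A sketch where the mean and standard deviations are made-up values and z_score() is a hypothetical helper:

```r
# Hypothetical helper: how many standard deviations x is from the mean
z_score <- function(x, mean_x, sd_x) (x - mean_x) / sd_x

# Same observation (27) and assumed mean (23), two different spreads
z_small_sd <- z_score(27, mean_x = 23, sd_x = 4)
z_large_sd <- z_score(27, mean_x = 23, sd_x = 8)

print(c(z_small_sd, z_large_sd))
```

Whether the z score is positive or negative depends on which side of the mean the observation falls, which is an assumption in this example.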
A, D
If you wanted to generalize to all lakes in Florida, but only included lakes within a 50 km radius of the research center in your study, what should concern you? (Check all that apply.) A The sample is not random. B The sample is not convenient. C The sample will not have variation. D The sample may not represent the population you want to know about.
B
If you wanted to get the five-number summary for PhysicalActivity, what R code would you run? A sort(USStates, PhysicalActivity) B favstats(~ PhysicalActivity, data = USStates) C gf_histogram(USStates$PhysicalActivity) D makefun(USStates.PhysicalActivity)
A
If you were interested in proportions rather than counts, which argument would you add to your code above? A format = "proportion" B "proportion" C format = "relative frequency" D "percentage"
B
If you'd like to see an overview of what's in the data frame—a list of your variables, whether they're numeric or factors, and so forth—what command would you use? A tally() B str() C c() D sort()
C
If you're told that there's random measurement error in how one of your variables was recorded, what do you know for sure? A Your data are biased. B A mistake was made when the data were either recorded or entered. C There will be more variation than you might expect. D All of the above
C
If you've calculated the standard deviation for hwy, what have you found? A Roughly the total squared deviations from the mean, in squared highway miles per gallon B Roughly the average squared deviation from the mean, in squared highway miles per gallon C Roughly the average deviation from the mean, in highway miles per gallon D None of the above
A
Imagine that the PhysicalActivity histogram is skewed to the right. That is, the skinny, longer tail is on the right. What does that mean? A The population in most states is sedentary. B The population in most states is very active. C The US population, overall, is sedentary. D The US population, overall, is very active.
C
Imagine that you created a histogram of PhysicalActivity. You meant to set it to have 15 bins, but you accidentally set 5 bins instead. How would the result be different from what you hoped? A Your histogram would be missing data. B You would see more bars than you hoped. C You would see less detail than you would have otherwise depicted. D There would be no difference because your N is 50.
D
Imagine that you have both the empty model for Exercise and the complex model for Exercise (i.e., the model that includes Pulse3Group). What would you do if you wanted to compare how well they predict Exercise? A Compare the SS from each model B Look at the reduction in error in the Pulse3Group model C Examine the PRE D Any of the above
B
Imagine that you wrote the following code. What would it do? gf_boxplot(Happiness ~ Stress, data = Fingers, color = "orange") %>% gf_jitter() A Create two, separate plots (a boxplot and a jitter plot) B Create a single plot (a boxplot with an overlaid jitter plot) C Create a jitter plot (the last command written) D Create a boxplot (with the jitter code omitted because it's incomplete)
A
Imagine that you've calculated SS for both the empty model and the complex model for Exercise. What will be true about these SS? A SS leftover from the empty model will be greater than the SS leftover from the complex model. B SS leftover from the empty model will be smaller than the SS leftover from the complex model. C SS leftover from the empty model will be equal to the SS leftover from the complex model. D In both cases the SS will be 0 because the residuals are balanced by the mean.
B
Imagine you make three histograms: one for TopSpeed, one for the predicted values based on the empty model for TopSpeed, and one for the residuals. Which two distributions will have a similar shape? A TopSpeed and the predicted values B TopSpeed and residuals C Predicted values and residuals D No two of these distributions will have a similar shape.
B
In GLM notation, which of the following represents the model (or prediction)? A Yi B b0 C ei D None of the above
A
In Yi = 10.38 - .85X1i - 3.14X2i + ei what does X1i stand for? A Whether someone is in the medium pulse group or not B The number of members in the medium pulse group C The intercept for Pulse3Groupmed D Whether someone is in the low or medium or high group
C
In a study designed to find out what explains variation in Happiness, _____ would be the outcome variable and _____ would be the explanatory variable. A Stress; Happiness B Stress; the cause of happiness C Happiness; Stress D Happiness; the rating of happiness
C
In general, in R, __________ is where you type in code and __________ is where the code runs. A The script window (i.e., script.R); the script window (i.e., script.R) B The R Console; the R Console C The script window (i.e., script.R); the R Console D The R Console; the script window (i.e., script.R)
A, D
In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had completed a physical activity in the last month. What's our goal in analyzing data like this? (Check all that apply.) A It helps us understand the population. B It helps us understand each individual in the sample. C Solely to help us understand this particular sample. D It helps us understand the processes that produced the variation we see.
A
In your study, you tested two types of energy drinks (SuperBuzz and StayFocused). You found that students who consumed SuperBuzz rated themselves as more alert on average than did those who drank StayFocused. Your roommate suspects that you are being fooled by chance (also called Type 1 error). What's her concern? A The difference you found was the result of sampling variation. B Your study didn't have enough participants. C Your random assignment wasn't really random. D Your random selection wasn't really random.
B
Interpret a PRE of 0.05 A There is a .05 chance that we have made a truly explanatory model. B .05 of the total variation in exercise hours is explained by the pulse groups. C .05 of the sample has a relationship between exercise hours and pulse groups. D .05 of the complex model's sum of squares can be explained by Pulse3Group.
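A worked sketch of how PRE (proportional reduction in error) is computed from the two sums of squares. The data frame `DemoSurvey` and its values are invented for illustration and give a much larger PRE than the .05 in the card; only base R functions are used.

```r
# Hypothetical data: exercise hours for six people in three pulse groups.
DemoSurvey <- data.frame(
  Exercise    = c(2, 4, 6, 8, 9, 13),
  Pulse3Group = factor(rep(c("low", "med", "high"), each = 2),
                       levels = c("low", "med", "high"))
)

# Empty model: every prediction is the grand mean.
empty_model   <- lm(Exercise ~ NULL, data = DemoSurvey)
# Complex model: every prediction is the person's group mean.
complex_model <- lm(Exercise ~ Pulse3Group, data = DemoSurvey)

SS_total <- sum(resid(empty_model)^2)    # error left over from the empty model: 76
SS_error <- sum(resid(complex_model)^2)  # error left over from the group model: 12

# PRE: the share of the empty model's error that the group model removes.
PRE <- (SS_total - SS_error) / SS_total  # 64 / 76, about 0.84

print(coef(complex_model))  # b0, b1, b2: one intercept plus two group terms
print(PRE)
```

Because the group model's leftover error can never exceed the empty model's, PRE falls between 0 and 1.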
A
Let's say a researcher hopes to explore the hypothesis that knowing about someone's stress level can help to predict their happiness. What word equation best captures this idea? A Happiness = Stress + other stuff B Stress = Happiness + other stuff C Other stuff = Stress + Happiness D Happiness = Stress
A
Let's say we wanted to write a word equation to explain the variation in the time it takes to bike to work. We think that the Distance of a person's commute is an important explanatory variable. What would the word equation look like? A Time = Distance + other stuff B Distance = Time + other stuff C Other stuff = Distance + Time D Model = Time + Distance
A, B, C
Let's say you make several histograms in the process of exploring the data. Among them is a frequency histogram of PhysicalActivity and a relative frequency histogram of PhysicalActivity. If you used default settings for each of them, what do the two have in common? (Check all that apply.) A They display the same variable. B They have the same number of bars. C The shape of the distribution would be the same. D Their axes would have the same labels.
C
Let's say you want to create an R object, so you can call it up later. The object you want to create represents the Oxford Dictionaries Word of the Year for 2017, which happens to have been "Youthquake." How would you create that object? A oxfordword2017 <- youthquake B youthquake" <- oxfordword2017 C oxfordword2017 <- "youthquake" D "youthquake" <- oxfordword2017 E You can't. R objects must be numeric.
C
Let's say you want to filter your data so that you do NOT include lakes that have missing data for Chlorophyll. What line of code will do that? A filter(FloridaLakes, Chlorophyll != "0") B filter(FloridaLakes, Chlorophyll == "0") C filter(FloridaLakes, Chlorophyll != "NA") D filter(FloridaLakes, Chlorophyll == "NA")
B
Let's say you've calculated the sum of squares for hwy. What would be the advantage of dividing that number by n - 1 (i.e., dividing it by the df)? A It turns the sum of squares into a measure of spread. B You can use it to compare error across samples of different sizes. C You would have calculated the population variance. D None of the above. There's no advantage of dividing SS by the df.
A
Let's split GPA into three groups—low, medium, and high—and then create a faceted histogram. What goes in the blanks in the following code? SleepStudy$GPA3Group <- ntile(_____, 3) gf_dhistogram(~ Happiness, data = _____) %>% gf_facet_grid(GPA3Group ~ .) A SleepStudy$GPA; SleepStudy B SleepStudy$GPA; SleepStudy$GPA C SleepStudy; SleepStudy D GPA3Group; SleepStudy
D
Maybe considering yourself a morning person (a "lark") or an evening person (an "owl") is related to variation in GPA. Which of the following plots would help us see whether variation in GPA is related to variation in LarkOwl? A gf_histogram(~ GPA, data = SleepStudy) %>% gf_facet_grid(LarkOwl ~ .) B gf_boxplot(GPA ~ LarkOwl, data = SleepStudy) C gf_point(GPA ~ LarkOwl, data = SleepStudy) D All of the above
B
One student suggests that players who make a lot of free throws (FTMade) are better and they would see more game time. Another student argues that making free throws doesn't make you a better player—having a higher free throw percentage (FTPct) is the sign of a better player, and suggests that would explain the variation in minutes played (Mins). Which of these plots would depict the relationship between Mins and one of these explanatory variables? A Box plots (i.e., gf_boxplot) B Scatter plots (i.e., gf_point) C Faceted histograms (i.e., gf_histogram with gf_facet_grid) D All of the above would be able to depict these relationships equally clearly.
A
The College Board discovered a mistake! All of their tests administered in 2017 were scored 50 points lower than they should have been. Assuming they have a vector called SAT.2017 that includes all test scores, how would they add 50 points to each score in the vector? A SAT.2017 + 50 B SAT.2017 <- add(50) C sum(SAT,50) D sum(SAT+50)
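A brief sketch of vectorized addition in base R. The three scores are hypothetical. Note that `SAT.2017 + 50` on its own only displays the corrected scores; keeping them requires assigning the result back.

```r
# Hypothetical scores, each recorded 50 points too low.
SAT.2017 <- c(1250, 1070, 1000)

# Arithmetic on a vector applies to every element.
corrected <- SAT.2017 + 50

# Overwrite the original vector to keep the correction.
SAT.2017 <- corrected
print(SAT.2017)  # 1300 1120 1050
```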
C
The average Distance of this person's bike commute is just over 27 miles. Imagine that you've discovered that one of your observations has been recorded incorrectly. Instead of a distance of around 27 miles, the distance for one of the commutes has been entered as 54 miles! You make the correction to your data frame. How will the correction affect the mean? A The mean will be unaffected by the correction. B The mean will be higher because of the correction. C The mean will be lower because of the correction. D It's impossible to say how the mean will be affected by the correction.
D
The data frame is currently organized alphabetically by lake. What if you'd like to see it ordered by average mercury level, with the most polluted lake appearing first on the list? Save the result back into FloridaLakes. A FloridaLakes <- sample(FloridaLakes, desc(AvgMercury)) B FloridaLakes <- sum(AvgMercury) C FloridaLakes <- tally(~ NumSamples, AvgMercury) D FloridaLakes <- arrange(FloridaLakes, desc(AvgMercury))
A
The mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6. What part of this GLM notation represents 23.6? Yi = Ȳ + ei A Yi B Ȳ C ei D None of the above
C
The same student has a tendency to come back repeatedly to ask the same question. With that in mind, you decide to save the answer above to an R object so you can readily call it up later. You decide to call the R object annoyingstudent. Which line of code would create that object? A annoyingstudent <- SAT[4]><1470 B annoyingstudent <- SAT[4]=1470 C annoyingstudent <-SAT[4]==1470 D SAT[4]=1470 <- annoying student
C
The study included 53 lakes in Florida. Where is this information in the data frame? A One of the values in the data frame B The number of variables in the data frame C The number of rows of data D The number of columns of data
D
The sum of squares gets larger as A The variation increases B The sample size increases C The spread of the distribution increases D All of the above
B
The variable Pres2008 is categorical. It indicates whether it was McCain or Obama who won the state in the 2008 election. Which is the more appropriate visual representation for this data? A Histogram B Bar graph C Both are equally appropriate. D Neither is appropriate for a categorical variable.
B
Think back to the vector called SAT. Let's imagine that the student who earned the fourth score in this vector would like to know her score. You try sat[4], but get an error. What did you do wrong? A Because sat isn't numeric, it needs quotes around it. B Because R is case sensitive, SAT needs to be capitalized. C SAT needs to be capitalized and in quotes. D That shouldn't have produced an error.
B
Wanting to see "MyUniversity" in the R Console, you've just run the following command: print(MyUniversity). However, R returned an error message. What's the correct command, if you want to print "MyUniversity"? A Print(MyUniversity) B print("MyUniversity") C #"MyUniversity" D Print "MyUniversity"
A
We are going to try and explain variation in Exercise hours with cardiovascular health (Pulse3Group). Assume our model is the following: Exercise = Pulse3Group + other stuff If we write the model in GLM notation, what does Yi represent? A Each person's value for Exercise B The average Exercise for all participants C The deviation between each person's Exercise and the average Exercise for all participants D It might be any of the above, depending on which interpretation you're using.
C
We are going to try and explain variation in Exercise hours with cardiovascular health (Pulse3Group). Assume our model is the following: Exercise = Pulse3Group + other stuff If we write the model in GLM notation, which equation represents this Pulse3Group model? A Yi = b0 + ei B Yi = b0 + b1Xi + ei C Yi = b0 + b1X1i + b2X2i + ei D Pulse3Groupi = b0 + b1Exercisei + ei
B
We can calculate the residuals from both the empty model and the complex model. What is similar about these two sets of residuals? A The values of the residuals from the empty model will be the same as the values of the residuals from the complex model. B The residuals represent the difference between the data and the model's prediction. C The residuals represent the difference between the data and the Grand Mean. D In both cases, the residuals can be reduced to near 0 simply by being careful with measurement and data entry.
A
What R code created the line that indicates the mean? A gf_vline(xintercept = 33.6, color = "blue") B gf_mean(mean = 33.6, color = "blue") C gf_mean(33.6, color = "blue") D None of the above
D
What R code will output the standard deviation for hwy? A sd(mpg$hwy) B sqrt(var(mpg$hwy)) C favstats(~ hwy, data = mpg) D All of the above
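A small base R sketch connecting sum of squares, variance, and standard deviation. It uses a hypothetical five-value `hwy` vector rather than the real `mpg` data, so the numbers are illustrative only.

```r
# Hypothetical highway mpg values (mean = 29).
hwy <- c(29, 29, 31, 30, 26)
n   <- length(hwy)

SS       <- sum((hwy - mean(hwy))^2)  # sum of squared deviations: 14
variance <- SS / (n - 1)              # SS divided by df: 3.5
std_dev  <- sqrt(variance)            # back in the original units (mpg)

# The built-in functions give the same answers.
print(c(variance, var(hwy)))
print(c(std_dev, sd(hwy), sqrt(var(hwy))))
```

Dividing by n - 1 is what makes the result comparable across samples of different sizes, since SS alone keeps growing as observations are added.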
B
What R code would create a distribution of Smokers? A histogram(Smokers, USStates) B gf_histogram(~ Smokers, data = USStates) C histogram ~ Smokers D gf_histogram ~ Smokers
D
What R code would produce a relative frequency histogram of PhysicalActivity? A gf_histogram(~ PhysicalActivity, data = USStates) B gf_relativehist(~ PhysicalActivity, data = USStates) C gf_densityhist(~ PhysicalActivity, data = USStates) D gf_histogram(..density..~ PhysicalActivity, data = USStates)
C
What R code would you use to fit the empty model for TopSpeed? A gf_histogram(NULL ~ TopSpeed, data = BikeCommute) B NULL(TopSpeed, data = BikeCommute) C lm(TopSpeed ~ NULL, data = BikeCommute) D gf(TopSpeed ~ NULL, data = BikeCommute)
A
What gives you the output of the 5 first values of all the variables? A head() B str() C c() D sort()
C
What notation can be used to represent the mean of the population? A β0 B μ C Both of the above D None of the above
B
What will happen in R if you run: print("StatsCourse")? A R will send "StatsCourse" to your local printer. B R will display: "StatsCourse". C R will show the full data file names "StatsCourse". D R will return an error message. R is a programming language, not a printer interface.
B
What will the following code produce? SAT <- c(1300,1120,1050,1470,1350) First.Score <- SAT[1] Second.Score <- SAT[2] First.Higher <- First.Score > Second.Score First.Higher A 180 B TRUE C 1300 D First.Score > Second.Score
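The code in the card above can be run as written in base R. The sketch below adds two illustrative checks that are not part of the card: a test of the fourth score and a lowercase lookup that fails because R is case sensitive. The names `fourth_is_1470` and `lowercase_lookup` are made up for this example.

```r
SAT <- c(1300, 1120, 1050, 1470, 1350)

First.Score  <- SAT[1]
Second.Score <- SAT[2]

# A comparison returns a logical value, not a number.
First.Higher <- First.Score > Second.Score
print(First.Higher)  # TRUE, since 1300 > 1120

# == tests equality; a single = would try to assign instead.
fourth_is_1470 <- SAT[4] == 1470
print(fourth_is_1470)  # TRUE

# sat (lowercase) was never created, so looking it up raises an error.
lowercase_lookup <- tryCatch(sat[4], error = function(e) conditionMessage(e))
print(lowercase_lookup)
```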
D
What would the following R code do, beyond creating a histogram? gf_histogram(~ College, data = USStates) %>% gf_labs(title = "Distribution of Residents with College Degrees", x = "Percentage") A Specify the data frame and specify the variable to be used on the x-axis. B Give the histogram a title and color the histogram red. C Create a new data frame with distribution of residents with college degrees (in percent). D Give the histogram a title and specify the label for the x-axis.
D
What's true about data? A They require that you've selected a sample. B They are the result of measurement. C They represent something about the world. D All of the above
C
What's true of sampling variation? A It's almost purely theoretical. We rarely encounter it. B It leads to bias. C Because of it, no sample will perfectly reflect the population. D All of the above
D
What's true of the distribution of any variable, if your model is the mean of that variable? A The distribution of the variable is more narrow than the distribution of its residual. B The distribution of the variable is always centered on a number lower than is the distribution of its residual. C The distribution of the variable is always centered on 0, whereas the center of the distribution of its residual is unpredictable. D The distribution of the variable is the same shape as the distribution of its residual.
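A minimal sketch of the empty model in base R, assuming a made-up data frame `BikeDemo` whose five TopSpeed values were picked so that the mean is 33.6. It shows that the residuals are the original values shifted to center on 0, which is why the two distributions share a shape.

```r
# Hypothetical data: five top speeds with a mean of 33.6.
BikeDemo <- data.frame(TopSpeed = c(23.6, 33.6, 43.6, 30.0, 37.2))

# Empty model: the only parameter estimate, b0, is the mean.
empty_model <- lm(TopSpeed ~ NULL, data = BikeDemo)
b0 <- unname(coef(empty_model))

# Yi = b0 + ei, so each residual is the observation minus the mean.
BikeDemo$Predicted <- predict(empty_model)
BikeDemo$Residual  <- resid(empty_model)
print(BikeDemo)

# Same spread as the original variable, but the residuals sum to 0.
print(c(sd(BikeDemo$TopSpeed), sd(BikeDemo$Residual)))
print(sum(BikeDemo$Residual))
```

For the observation of 23.6, Yi is 23.6, the prediction is 33.6, and the residual ei is -10.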
B
When Pulse3Group is included in our model to explain variation in Exercise, how is error from this more complex model calculated? A The deviation of each person's Exercise from the Grand Mean for Exercise B The deviation of each person's Exercise from the mean Exercise of their Pulse3Group C The deviation of each Pulse3Group's mean to the Grand Mean for Exercise D None of the above
A
When you add an explanatory variable to your model, what should be the effect on the Sum of Squares from the empty model? A It should remain unchanged. B It should go up. C It should go down. D It depends on how much variation is accounted for by the explanatory variable.
B, C
Which of the following are quantitative variables? (Check all that apply.) A LarkOwl B CognitionZscore C Happiness D Stress
B, C, D
Which of the following from FloridaLakes are quantitative variables? (Check all that apply.) A Lake B pH C NumSamples D MinMercury
D
Which of these options could be used to depict the relationship between Exercise and Pulse3Group? A gf_boxplot(Exercise ~ Pulse3Group, data = StudentSurvey) B gf_histogram(~ Exercise, data = StudentSurvey) %>% gf_facet_grid(Pulse3Group ~ .) C gf_point(Exercise ~ Pulse3Group, data = StudentSurvey) D All of the above
A
Why is the mean a good model for this distribution? A Because the mean balances the deviations above and below the mean. B Because the mean balances the number of values above and below the mean. C Because the mean is the midpoint of the range. D All of the above are reasons why the mean is a good model.
D
Why might the mean be a good simple model for this distribution? A Because the mean is the only statistically acceptable value of b0. B Because the mean is the most frequent value in this distribution. C Because in all skewed distributions, the mean is the best model because it is different from the median. D Because the mean is a model that balances the deviations from the model and minimizes the sum of squared residuals.
B, C
Why should you look at a histogram of a variable before you do other statistical analyses? (check all that apply) A You'll need the results from your histogram in order to write additional R code. B You might catch errors made in data collection/entry. C You can see the shape of the distribution to see if it makes sense. D R won't be able to run other functions on the data frame unless you make a histogram first.
A, B, C, D
You create a histogram of IQ and find that it looks relatively normal. Which of the following statements are likely true? (Check all that apply.) A It's unimodal. B Most scores are clumped at the center. C It's roughly symmetrical. D It's bell-shaped.
B
You decide to conduct a study of energy drinks using undergraduates from your school. You select participants by randomly choosing ID numbers from among all ID numbers of current students. Once chosen, you randomly pick one of two energy drinks for students to consume weekly, throughout the school term. The first step is an example of _____ and the second is an example of _____. A Random assignment; random selection B Random selection; random assignment C Random assignment, random assignment D Random selection, random selection
B
You decide to make a relative frequency histogram for PhysicalActivity and have added the following to your code: %>% gf_density() What will you now see that you didn't see before? A An error message. You're missing the argument between the parentheses. B A smooth density plot overlaying your bars. C A smooth density plot instead of bars. D A y-axis that now displays density.
C
You run the following command: RandomLakes <- sample(FloridaLakes, 10). What will be the result? A A printout of random lakes B A new data frame of 10 lakes drawn randomly from the population C A new data frame of 10 lakes drawn randomly from your FloridaLakes data frame D None of the above
A
You'd like to divide the original data frame into three groups with low, medium, and high levels of average mercury. What R function would you use to do this and save the result as a new variable called MercGroup? A FloridaLakes$MercGroup <- ntile(FloridaLakes$AvgMercury, 3) B MercGroup <- sort(AvgMercury, 3) C arrange(FloridaLakes, 3) D ntile(FloridaLakes$AvgMercury, 3)
A
You'd like to see the first 10 rows of FloridaLakes, so you run head(FloridaLakes). It doesn't give you what you wanted. Why not? A You didn't indicate that you wanted to see 10 rows. B head() displays variable names. C head() can only be applied to vectors. D This is an odd request, so there's no R command for it.
C
You've been commissioned to do a study of all lakes with average mercury levels above 1. You want to save the data of the lakes that meet this criterion to a new data frame called HighMercury. What's wrong with the following code? HighMercury <- filter(floridalakes, avgmercury > 1) A It's missing quotation marks around the number 1. B It doesn't appropriately name the new data frame. C There are capitalization errors. D Nothing
D
If you wanted to see the distribution for College (percentage of residents with college degrees), and ran the following R code, what would be wrong? gf_histogram(~ College, data = USStates, bins = 10) A "bins = 10" will return an error B "gf_" is unnecessary C The "~" is unnecessary D Nothing