Math 227 - Stats ALL

B

Which distribution, (a) population, (b) sample, or (c) sampling, would be used to answer the following question: In a sample of students from UCLA, what proportion have GPAs greater than 3.2?

C

Which distribution, (a) population, (b) sample, or (c) sampling, would be used to answer the following question: If the population average time between dental appointments is 12 months with a standard deviation of 4 months, what is the probability that a sample of 4 individuals has an average time between dental appointments of 8 months?

A

Which distribution, (a) population, (b) sample, or (c) sampling, would be used to answer the following question: What is the probability that an individual in the United States is older than 70 years old?

Which of the following R code would quickly help you find the number of states in which McCain won the 2008 election and the number of states in which Obama won?

(tally)

After running the following R code: Age.model <- lm(Salary ~ Age, data = SalaryGender) Age.model The following is outputted in the R Console: A teacher is 46 years old. What is the Age.model's prediction for their Salary (in thousands of dollars)? 52.5245

-9.305 + 1.319*46

rflip()

Simulates a coin toss; rflip(3) simulates 3 coin tosses

C

If the research question we are most interested in is whether education predicts infant mortality rate, which confidence interval would we be most interested in? A. Confidence interval for the mean B. Confidence interval for the intercept C. Confidence interval for the slope D. All of the above are required to answer the research question

Using the SpeedDating data frame, fit a model of LikeM using SharedInterestsM as the explanatory variable. What is the 95% confidence interval for β1?

.35 to .53

If you use shuffle() to create a randomized sampling distribution of b1 (a group difference) based on a sample of data, what will be the mean of the resulting sampling distribution?

0
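
A minimal base-R sketch of why the shuffled distribution centers on 0. It does not use mosaic's shuffle() or do(); sample() stands in for the shuffling step, and the data, group labels, and the helper name group_diff are invented for illustration.

```r
set.seed(1)

# Hypothetical data: 20 outcomes in two groups with a real difference of about 3
group   <- rep(c("a", "b"), each = 10)
outcome <- c(rnorm(10, mean = 10, sd = 2), rnorm(10, mean = 13, sd = 2))

# Difference in group means (plays the role of b1 in a two-group model)
group_diff <- function(y, g) mean(y[g == "b"]) - mean(y[g == "a"])

# Shuffle the labels (a stand-in for shuffle()), breaking any link to the outcome
shuffled_diff <- replicate(5000, group_diff(outcome, sample(group)))

observed <- group_diff(outcome, group)
center   <- mean(shuffled_diff)
print(c(observed = observed, shuffled_center = center))
```

Because shuffling assigns the labels at random, positive and negative differences are equally likely, so the shuffled values pile up around 0 even though the observed difference is not 0.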

In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. What's our goal in analyzing data like this?

1) It helps us understand the population 2) It helps us understand the processes that produce the variation we see.

SleepStudy$GPA3Group <- ntile(_____, 3) gf_dhistogram(~ Happiness, data = _____) %>% gf_facet_grid(GPA3Group ~ .)

1) SleepStudy$GPA 2) SleepStudy

What do frequency histograms and relative frequency histograms have in common?

1) They display the same variable 2) They have the same number of bars 3) The shape of the distribution would be the same

If you wanted to know how many students were in Fingers, which of the following commands could help you?

1) str() 2) Fingers 3) tail()

Why should you look at a histogram of a variable before you do other statistical analyses? (check all that apply)

1) You might catch errors in the data 2) You can see the shape of the distribution to see if it makes sense

If you're told that there's random measurement error in how one of your variables was recorded, what do you know for sure?

1) Your data is biased 2) A mistake was made 3) There will be more variation than you may have expected

The mean maximum swim velocity when wearing a wetsuit (i.e., Wetsuit) is 1.51 m/sec. If the margin of error is 0.08 m/sec, what's the range of possible values within which you're 95% confident that actual population mean would fall?

1.43 m/sec to 1.59 m/sec

We want to construct a 95% confidence interval. The margin of error is 3.0. What is the approximate value of the standard error?

1.5
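
The arithmetic behind the two margin-of-error cards above can be sketched in base R. The numbers come from the cards; the variable names are illustrative, and the "2 standard errors" multiplier is the usual rule of thumb for a 95% interval (the normal-based value is about 1.96).

```r
# Card 1: interval = estimate +/- margin of error
estimate <- 1.51   # mean Wetsuit velocity (m/sec), from the card
moe      <- 0.08   # margin of error (m/sec), from the card
interval <- estimate + c(-1, 1) * moe
print(round(interval, 2))

# Card 2: for a 95% interval, the margin of error is roughly 2 standard errors
moe_2     <- 3.0
se_approx <- moe_2 / 2              # rule-of-thumb multiplier of 2
se_z      <- moe_2 / qnorm(0.975)   # normal multiplier of about 1.96
print(c(se_approx = se_approx, se_z = se_z))
```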

If you use the distribution of WgtGain4 in the FatMice18 data frame (shown in this histogram) as a probability model, what is the likelihood of a mouse in a future study gaining more than 15 grams of weight?

1/18

In addition to the NBA player data for the 2011 season, we also have a similar data frame called NBAPlayers2015 (with many of the same variables). We have created scatterplots of Mins by Points for both seasons (black dots represent the 2011 season and purple dots represent the 2015 season). Based on these scatter plots, which season had a higher correlation between minutes played and points scored?

2011 looks more like a line

Based on information in the supernova() table above, how would you calculate the approximate value of PRE? PRE = SS Model / SS Total

2638 divided by 10485

B

What R code would create a new variable called DEPRESS that contains the difference between PRE versus POST depression severity? a) Depression = PRE - POST b) PHQ9$DEPRESS <- PHQ9$PRE - PHQ9$POST c) Depression$PRE - Depression$POST d) Depression(Depression$PRE - Depression$POST)

If we created a boxplot and then chained a jitter plot onto it, what proportion of the points would fall inside the box?

50%

Using the SpeedDating data frame, run favstats on AttractiveF. At what value would the sum of squared errors (sum of squares) be lowest?
favstats(~AttractiveF, data=SpeedDating)
 min Q1 median Q3 max     mean       sd   n missing
   1  5      6  8  10 6.273723 1.917694 274       2

6.3 ~ mean
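
A small base-R check of the idea that the sum of squares is lowest at the mean. The ratings vector is made up (it is not the SpeedDating data), and sum_sq is a hypothetical helper written for this sketch.

```r
# Hypothetical ratings on a 1-10 scale (not the SpeedDating data)
ratings <- c(1, 5, 5, 6, 6, 6, 7, 8, 8, 10)

# Sum of squared errors around a candidate model value m
sum_sq <- function(m) sum((ratings - m)^2)

candidates <- c(median = median(ratings),
                mean   = mean(ratings),
                q3     = unname(quantile(ratings, 0.75)))
ss <- sapply(candidates, sum_sq)
print(ss)

# A numerical search over the data range lands on the mean as well
best <- optimize(sum_sq, interval = range(ratings))$minimum
print(c(best = best, mean = mean(ratings)))
```

Any other candidate, including the median, leaves a larger sum of squares, which is why the card's answer is the mean rather than one of the quartiles.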

In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. Given this information, interpret the 70.3 in the table below.

70.3 percent of the surveyed residents said that they competed in a physical activity that month.

The best-fitting model using AttractiveM to predict LikeM can be specified like this: LikeMi = b0 + b1AttractiveMi + ei Which of the following is an INCORRECT interpretation of the confidence interval for β1 in this model?

95% of all LikeM ratings have this relationship with AttractiveM ratings.

The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: Wetsuiti = b0 + b1NoWetsuiti + ei If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, which of the following is NOT a correct interpretation?

95% of all Wetsuit velocities have this relationship with the NoWetsuit velocity.

Which of the following is the correct interpretation of PRE (0.98) in the supernova table above?

98% of the SS from the empty model can be explained by adding NoWetsuit to the complex model.

If the blue circles in the diagram above represent data points in a group model, which distance would be used to calculate the sum of squares error?

A Distance from Point to Y-hat

A

A ____________ distribution can be used to investigate how unusual a specific individual is, but a ____________ distribution is used to investigate how unusual a sample statistic (e.g., sample mean) is. a. sample distribution; sampling distribution b. sampling distribution; sample distribution c. population distribution; sample distribution d. sampling distribution; population distribution

To examine the distribution of Happiness, which would be more useful?

A histogram

Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). How would we depict the null model of maximum velocity in a wetsuit on this plot?

A horizontal line at the mean of Wetsuit

If you created a bootstrapped sampling distribution of 10,000 means from your sample of SpeedUp, what qualities would you expect it to have?

A roughly normal shape, and a standard deviation smaller than the standard deviation of the sample

Which distribution would you use to create a confidence interval around a parameter estimate?

A sampling distribution

What kind of distribution would this code create? do(10000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12))

A sampling distribution of bootstrapped slopes

You decide to make a relative frequency histogram for PhysicalActivity and have added the following to your code: %>% gf_density() What will you now see that you didn't see before?

A smooth, density plot overlaying your bars.

B

A student who is majoring in Civil Engineering finds out that the median income for her major is $50,000, the z score being .86. How should we interpret this z score? A) This major's median earnings are higher than 86% of the other majors. B) The major's median earnings are .86 standard deviations larger than the average median earning. C) The probability that this student will earn more than the average college graduate is .86. D) Using the empirical rule, .86 of all median earnings fall between 0 and this value.

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the residual? A) -10 B) 33.6 C) 57.2 D) 23.6

A) -10

Find the average squared deviation for Siblings in the StudentSurvey data frame A) 1.39 B) 1.73 C) 1.18 D) 362

A) 1.39

Tally up the number of lakes for which the variable AgeData is 0. How many are there? A) 10 B) 22 C) 34 D) 43

A) 10

Fit the NULL or empty model of Exercise in the StudentSurvey data frame. What is the sum of squares for this model? A) 11864 B) 360 C) 32.956 D) 9.054

A) 11864

Fit the NULL or empty model of WgtGain4 in the FatMice18 data frame. What is the sum of squares for this model? A) 186.28 B) 17 C) 10.957 D) 0

A) 186.28

What's the mean level of Chlorophyll in the FloridaLakes data frame? A) 23.12 B) 0.7 C) 3.0 D) 152.4

A) 23.12

The FloridaLakes data frame includes a variable called Calcium. How many lakes have a Calcium level that exceeds 5.0? (Hint: Try using the tally command.) A) 35 B) 18 C) 12 D) 53

A) 35

Using the DataCamp window above, determine how many variables there are in the FloridaLakes data frame. A) 5 B) 12 C) 53 D) 94

A) 5

Use lm() to explore the model that includes Gender to explain Height in the StudentSurvey data frame. How many inches must you add to the mean height for females to get the mean height for males? A) 5.151 B) 6.5695 C) 4.783 D) 6.112

A) 5.151

Using the FatMice18 data frame, run lm() to fit the model for WgtGain4, using Light as an explanatory variable. If Yi = b0 + b1Xi + ei represents the fitted model, what is the value of b0? A) 5.889 B) 5.000 C) Whether a mouse is in the LL group or not D) The number of mice in the LL group

A) 5.889

Use lm() to fit the empty model for Fat in the NutritionStudy data frame. What is the coefficient? A) 77.03 B) 57.0 C) 65.12 D) none of the above

A) 77.03

Between what two values do we see the middle 50% of all IQ scores in the USStates data frame? A) 98.5 and 102.7 B) 94.2 and 104.3 C) 100.3 and 100.9 D) 98.6 and 100.3

A) 98.5 and 102.7

We are going to try and explain variation in Exercise hours with cardiovascular health (Pulse3Group). Assume our model is the following: Exercise = Pulse3Group + other stuff If we write the model in GLM notation, what does Yi represent? A) Each person's value for Exercise B) The average Exercise for all participants C) The deviation between each person's Exercise and the average Exercise for all participants D) It might be any of the above, depending on which interpretation you're using

A) Each person's value for Exercise

If we fit a normal curve on the distribution of hwy, what is it that we're modeling with it? A) Error around the model for hwy B) The median C) The empty model for hwy D) Sample statistics

A) Error around the model for hwy

You'd like to divide the original data frame into 3 groups with low, medium, and high levels of average mercury. What R function would you use to do this and save the result as a new variable called MercGroup? A) FloridaLakes$MercGroup <- ntile(FloridaLakes$AvgMercury, 3) B) MercGroup <- sort(AvgMercury, 3) C) arrange(FloridaLakes, 3) D) ntile(FloridaLakes$AvgMercury, 3)

A) FloridaLakes$MercGroup <- ntile(FloridaLakes$AvgMercury, 3)

Let's say a researcher hopes to explore the hypothesis that knowing about someone's stress level can help predict their happiness. What word equation best captures this idea? A) Happiness = Stress + other stuff B) Stress = Happiness + other stuff C) Other stuff = Stress + Happiness D) Happiness = Stress

A) Happiness = Stress + other stuff

In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. What's our goal in analyzing data like this? (check all that apply) A) It helps us to understand the population B) It helps us understand each individual in the sample C) Solely to help us understand this particular sample D) It helps us understand the processes that produced the variation we see

A) It helps us to understand the population D) It helps us understand the processes that produced the variation we see

When you add an explanatory variable to your model, what should be the effect on the Sum of Squares from the empty model? A) It should remain unchanged B) It should go up C) It should go down D) It depends on how much variation is accounted for by the explanatory variable

A) It should remain unchanged

Using the USStates data frame, make a bar graph to illustrate the number of states that voted for McCain and Obama (recorded as the variable Pres2008). Based on what you see, which of the following statements is true? A) McCain won in more than 20 states B) Obama won in more than 30 states C) Obama won in almost twice as many states as did McCain D) None of the above is true

A) McCain won in more than 20 states

Which of the following lines of R code would save only the patients with less than 200 drinks per week into a new data frame called NutriStudy? A) NutriStudy <- filter(NutritionStudy, Alcohol <200) B) NutriStudy <- arrange(NutritionStudy, Alcohol) C) filter(NutritionStudy, Alcohol != 200) D) NutriStudy <- tally(NutritionStudy$Alcohol)

A) NutriStudy <- filter(NutritionStudy, Alcohol <200)

In these three visualizations (gf_point(), gf_jitter(), and gf_boxplot()), how do we typically write code for the outcome and explanatory variable? A) Outcome ~ Explanatory B) Explanatory ~ Outcome C) ~ Outcome, data = Explanatory D) ~ Explanatory, fill =~Outcome

A) Outcome ~ Explanatory

What kind of variables should go in gf_histogram()? A) Quantitative B) Categorical

A) Quantitative

The College Board discovered a mistake! All of their tests administered in 2017 were scored 50 points lower than they should have been. Assuming they have a vector called SAT.2017 that includes all test scores, how would they add 50 points to each score in the vector? A) SAT.2017 + 50 B) SAT.2017 <- add(50) C) sum(SAT,50) D) sum(SAT+50)

A) SAT.2017 + 50
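
A short base-R illustration of the vectorized addition on this card. The three scores are invented stand-ins for the College Board's vector, and corrected is a hypothetical name; note that the expression alone only prints the result, so an assignment is needed to keep it.

```r
# Hypothetical scores standing in for the real SAT.2017 vector
SAT.2017 <- c(1010, 1250, 980)

SAT.2017 + 50                # adds 50 to every element; SAT.2017 itself is unchanged
corrected <- SAT.2017 + 50   # assign the result to keep the corrected scores
print(corrected)
```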

Imagine that you've calculated SS for both the empty model and the complex model for Exercise. What will be true about these SS? A) SS leftover from the empty model will be greater than the SS leftover from the complex model B) SS leftover from the empty model will be smaller than the SS leftover from the complex model C) SS leftover from the empty model will be equal to the SS leftover from the complex model D) In both cases the SS will be 0 because the residuals are balanced by the mean

A) SS leftover from the empty model will be greater than the SS leftover from the complex model
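
To see the comparison concretely, here is a minimal base-R sketch using lm() on simulated data. The group labels and Exercise values are made up (this is not the StudentSurvey data), and the sums of squares are computed by hand from the residuals rather than with supernova().

```r
set.seed(2)

# Hypothetical data: exercise hours in three made-up pulse groups
pulse_group <- factor(rep(c("low", "med", "high"), each = 8))
exercise    <- c(rnorm(8, 10, 3), rnorm(8, 9, 3), rnorm(8, 7, 3))

empty_model   <- lm(exercise ~ 1)            # predicts the grand mean for everyone
complex_model <- lm(exercise ~ pulse_group)  # predicts each group's own mean

ss_total <- sum(resid(empty_model)^2)    # SS left over from the empty model
ss_error <- sum(resid(complex_model)^2)  # SS left over from the group model
pre      <- (ss_total - ss_error) / ss_total

print(c(ss_total = ss_total, ss_error = ss_error, pre = pre))
```

Adding the explanatory variable does not change the SS from the empty model; it only splits that total into a part the model accounts for and a smaller leftover error.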

Let's split GPA into three groups -- low, medium, and high -- and then create a faceted histogram. What goes in the blanks in the following code? SleepStudy$GPA3Group <- ntile(___,3) gf_histogram(..density..~Happiness, data = _____) %>% gf_facet_grid (GPA3Group ~ .) A) SleepStudy$GPA; SleepStudy B) SleepStudy$GPA; SleepStudy$GPA C) SleepStudy; SleepStudy D) GPA3Group; SleepStudy

A) SleepStudy$GPA; SleepStudy

In the supernova table for Light.model, which uses Light to explain variation in WgtGain, why is the degrees of freedom for Model equal to 1? A) The Light model uses up a degree of freedom by estimating one more parameter, b1, than the empty model B) The Light model uses up a degree of freedom by estimating one more parameter, X1i, than the empty model C) The Light model is evaluated by calling one more function in R, supernova(), than the empty model. Calling additional functions spends degrees of freedom D) The Light model discards one of the data points used to fit the empty model and thus loses a degree of freedom

A) The Light model uses up a degree of freedom by estimating one more parameter, b1, than the empty model

When Pulse3Group is included in our model to explain variation in Exercise, how is error from this more complex model calculated? A) The deviation of each person's Exercise from the Grand Mean for Exercise B) The deviation of each person's Exercise from the mean Exercise of their Pulse3Group C) The deviation of each Pulse3Group's mean to the Grand Mean for Exercise D) None of the above

B) The deviation of each person's Exercise from the mean Exercise of their Pulse3Group

In your study, you tested two types of energy drinks (SuperBuzz and StayFocused). You found that students who consumed SuperBuzz rated themselves as more alert on average than did those who drank StayFocused. Your roommate suspects that you are being fooled by chance (also called Type 1 error). What's her concern? A) The difference you found was the result of sampling variation. B) Your study didn't have enough participants. C) Your random assignment wasn't really random. D) Your random selection wasn't really random.

A) The difference you found was the result of sampling variation.

Someone has a hypothesis that Gender can be used to explain the number of Piercings that students in the StudentSurvey data frame have. Fit the model and save it. Then, create a function that takes the model as an input. Finally, use your function to make a prediction for males. (Note that males are coded as "M") What does the output tell you? A) The mean number of piercings for males is 0.717 B) The mean number of piercings for males is 2.98 lower than piercings for females C) The error for male is 0.17 D) You should expect the next randomly chosen male to have 1.71 piercings

A) The mean number of piercings for males is 0.717

Let's say we want to compare the Light model for weight gain (WgtGain4 = Light + error) to the empty model (WgtGain4 = mean + error). What does the "mean" in the empty model equation refer to? A) The mean of WgtGain4 for all the mice B) The mean WgtGain4 for the first Light condition C) The mean of Light D) The mean of the residuals

A) The mean of WgtGain4 for all the mice

USStates Imagine that the PhysicalActivity histogram is skewed to the right. That is, the skinny, longer tail is on the right. A) The population in most states is sedentary B) The population in most states is very active C) The US population, overall, is sedentary D) The US population, overall, is very active

A) The population in most states is sedentary

If you wanted to generalize to all lakes in Florida but only included lakes within a 50 km radius of the research center in your study, what should concern you? (check all that apply) A) The sample is not random B) The sample is not convenient C) The sample will not have variation D) The sample may not represent the population you want to know about

A) The sample is not random D) The sample may not represent the population you want to know about

Let's say you make several histograms in the process of exploring the data. Among them is a frequency histogram of PhysicalActivity and a relative frequency histogram of PhysicalActivity. If you used default settings for each of them, what do the two have in common? (check all that apply) A) They display the same variable B) They have the same number of bars C) The shape of the distribution would be the same D) Their axes would have the same labels

A) They display the same variable B) They have the same number of bars C) The shape of the distribution would be the same

In our example about Thumb length and Sex, which is the outcome variable? A) Thumb B) Sex

A) Thumb

In Yi = 10.38 - .85X1i - 3.14X2i + ei what does X1i stand for? A) Whether someone is in the medium pulse group or not B) The number of members in the medium phase group C) The intercept for Pulse3Groupmed D) Whether someone is in the low or medium or high group

A) Whether someone is in the medium pulse group or not

The mean Alcohol is 3.279 drinks per day. A particular patient consumes 2 drinks per day. Which of the following symbols would be used to represent the value 2 in the notation of the General Linear Model? A) Yi B) b0 C) ei D) none of the above

A) Yi

The mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6. What part of this GLM notation represents 23.6 Yi= Yhat + ei A) Yi B) Yhat C) ei D) none of the above

A) Yi

You'd like to see the first 10 rows of FloridaLakes, so you run head(FloridaLakes). It doesn't give you what you wanted. Why not? A) You didn't indicate that you wanted to see 10 rows. B) head() displays variable names C) head() can only be applied to vectors D) This is an odd request, so there's no R command for it

A) You didn't indicate that you wanted to see 10 rows.

Using the USStates data frame, experiment with different numbers of bins in your histogram of Smokers. If you want to see gaps between the blocks in your histogram, which is the better choice? A) a small number of bins B) a large number of bins C) The appearance of gaps is inherent to the data itself; gaps are unrelated to the number of bins

B) a large number of bins

Why is the mean a good model for this distribution? A) because the mean balances the deviations above and below the mean B) because the mean balances the number of values above and below the mean C) because the mean is the midpoint of the range D) all of the above are reasons why the mean is a good model

A) because the mean balances the deviations above and below the mean

From the data set USStates, which of the variables below would be appropriate for a histogram? (check all that apply) A) college B) IQ C) Pres2008 D) Population

A) college B) IQ D) Population

Using the SleepStudy data frame, produce a jitter plot to examine ClassesMissed by Gender (coded 0 for female, 1 for male). Among students who missed no class were there more females or more males? A) females B) males

A) females

If you were interested in proportions rather than counts, which argument would you add to your code? A) format = "proportion" B) "proportion" C) format = "relative frequency" D) "percentage"

A) format = "proportion"

You create a histogram of IQ and find that it looks relatively normal. Which of the following statements are likely true? (check all that apply) A) it's unimodal B) Most scores are clumped at the center C) It's roughly symmetrical D) It's bell-shaped

A) it's unimodal B) Most scores are clumped at the center C) It's roughly symmetrical D) It's bell-shaped

Which is larger in the FloridaLakes data frame, the mean Calcium or the median Calcium? A) mean B) median C) they are equal

A) mean

If you want to use R to get the sum of 10 and 20, what code would you write? (check all that apply) A) sum(10+20) B) sum(10,20) C) Sum(10+20) D) 10+20

A) sum(10+20) B) sum(10,20) D) 10+20
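
The three working options can be checked directly in base R. The variable names a, b, d, and outcome are only for this sketch; the last step assumes a plain R session in which no loaded package defines a function called Sum().

```r
a <- sum(10 + 20)  # adds first, then sums the single value
b <- sum(10, 20)   # sums its two arguments
d <- 10 + 20       # plain arithmetic
print(c(a, b, d))

# R is case sensitive. Assuming no loaded package defines Sum(), the call fails,
# and tryCatch() captures the error message instead of stopping the script.
outcome <- tryCatch(Sum(10 + 20), error = function(e) conditionMessage(e))
print(outcome)
```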

Use "%>%" notation to add gf_density() (a density plot) to your density histogram of Smokers in USStates. What does the curve look like? A) unimodal B) bimodal C) uniform D) skewed

A) unimodal

How would you use R to calculate variance in hwy? A) var(mpg$hwy) B) lm(hwy ~ var, data = mpg) C) anova(mpg, data = hwy) D) favstats(~hwy, data = hwy)

A) var(mpg$hwy)

If we add more mice to the study, which of these would not be affected? A) β0 B) b0 C) Ybar D) n

A) β0

A, C

Above is a dotplot generated using the following command: gf_point(Infant.Mortality~1, data = swiss)%>% gf_hline(yintercept = mean(swiss$Infant.Mortality)) What does the horizontal line across the plot correspond to (Check all that apply)? A. Average Infant mortality B. Average education C. The intercept from a simple model predicting infant mortality D. The intercept from a complex model predicting infant mortality from education

D

Above is a histogram of the birthweight (wt) in ounces for the 100 newborns in this data set. gf_histogram(~ wt, data = Gestation100, fill = "royalblue2", bins = 15) If we wanted to explore the idea that mother's smoking might explain the variation in birth weight, what visualization might be helpful to us? A. gf_histogram(~ wt, data = Gestation100) %>% gf_facet_grid(smoke ~ .) B. gf_point(wt ~ smoke, data = Gestation100) C. gf_boxplot(wt ~ smoke, data = Gestation100) D. All of the above would be helpful to us as we explore our data.

B

Above is a sampling distribution of F-values generated through a shuffling routine. What do we use this distribution for? A. We can compare this distribution to one made with bootstrapping to see if they look different B. We can compare our observed F-value to the values in the distribution to see how unusual our F is C. If our sample distribution does not look like this distribution, we reject the simple model D. This distribution is not useful for comparing to our distribution because our sample size is 74, but the sample size here is 1000

C

Above is the output estimating a simple model (also called null or empty model) for fullSpeed, how can we interpret the intercept? A. The population average speed of light is 299845 B. The speed of light is always 299845 C. In this experiment, the average speed of light was 299845 D. All runs in this experiment showed a speed of light of exactly 299845

B

Above we have included the ANOVA tables for Wetsuit = NoWetsuit + other stuff. Which distance is the basis of the SS Error? A. The distance between data points and the empty model's prediction. B. The distance between data points and the NoWetsuit model's prediction. C. The distance between the NoWetsuit model's prediction and the empty model's prediction. D. The distance between the NoWetsuit model's residual and the empty model's residual.

A

Above we have included the ANOVA tables for two models: wt = age + other stuff and wt = smoke.factor + other stuff. A classmate takes a look at these results and suggests that you judge these models using F instead of PRE. Why is that good advice? A. Because F takes degrees of freedom into account and the smoke.factor model might just have a big PRE because that model used more degrees of freedom. B. Because when you are comparing a regression model against a group model, you can't use PRE to compare them. C. Because F is the only statistic that allows us to make comparisons of explanatory variables that are measured differently. D. Because PRE is based on SS, which is a less familiar statistic to most people. F is based on variance, which is a more well known statistic.

D

Above we have included the ANOVA tables for two models: wt = age + other stuff and wt = smoke.factor + other stuff. Why do these two models have the same value for SS total (25356)? A. Because both SS totals are based on the residuals from the empty model. B. Because both SS totals are based on the same outcome variable. C. Because both SS totals are based on the values from the same data set. D. All of the above reasons together explain why the SS totals are the same.

B

Above we have included the ANOVA tables for wt = smoke.factor + other stuff. Which distance is the basis of the SS error? A. The distance between data points and the empty model's prediction. B. The distance between data points and the smoke.factor model's prediction. C. The distance between the smoke.factor model's prediction and the empty model's prediction. D. The distance between the smoke.factor model's residual and the empty model's residual.

Recall that the variable SpeedUp is the difference between swimming with Wetsuit versus NoWetsuit. Why might you want to find the point below which 2.5% of bootstrapped sample means for SpeedUp fall, and the point above which 2.5% of simulated sample means for SpeedUp fall?

All of the above.

What does "alpha" do in a jitter plot?

Alpha can take values from 0 (more transparent) to 1 (more opaque).

What does a point represent in a scatterplot?

An observational unit's values (e.g., a student's Thumb length and Sex)

After fitting the regression model (Age.model), we construct a 95% confidence interval for the estimate of β1 by running the following code: confint(Age.model) Here's the output:
i. What's our 95% confidence interval for the estimate of β1?
ii. Notice that our confidence interval contains 0. What does this mean? Choose one: It suggests we should retain the empty model.
iii. Which of the following is an incorrect interpretation of the confidence interval for β1 in this model?

Answer 1: (-24.40652, 6.098267) Answer 2: It suggests we should retain the empty model. Answer 3: 95% of all Points will have this relationship with Age.

A sampling distribution of means below was created with this code:
Points.stats <- favstats(~Points, data=NBAPlayers2011)
SDoM <- do(10000) * mean(rnorm(176, Points.stats$mean, Points.stats$sd))
gf_histogram(~ mean, data = SDoM, color="grey", fill = "burlywood", alpha=0.75)
i. If you were to stack up all the bars, what would the total be?
ii. Which of the following is a true statement? Choose one: The standard error of this distribution is smaller than Points.stats$sd.
iii. Someone tells you that their favorite NBA player scored 900 points in the 2010-2011 NBA season. What is the likelihood of an NBA player having scored 900 points or lower in the population?
iv. Pictured below is a sampling distribution of Points, color-coded by mean Points that are less than 875 and greater than 875.
Points.stats <- favstats(~Points, data=NBAPlayers2011)
SDoM <- do(10000) * mean(rnorm(176, Points.stats$mean, Points.stats$sd))
gf_histogram(~ mean, data = SDoM, color="grey", fill = ~mean<875)
The chance of getting a sample mean less than 875 is:

Answer 1: 10,000 Answer 2: The standard error of this distribution is smaller than Points.stats$sd. Answer 3: We can't be sure because we can't tell the likelihood of a single NBA player having such a point total from this sampling distribution of means, we'd need to instead look at our sample distribution. Answer 4: Very unlikely

i. What would be the purpose of generating a sampling distribution of means of Points by resampling (bootstrapping)? ii. If you created a bootstrapped sampling distribution of 10,000 means from your sample of Points, what qualities would you expect it to have? iii. If you bootstrap a sampling distribution based on your sample data, what will be the mean of the bootstrapped distribution?

Answer 1: The distribution can help you quantify how much your best estimate of the population mean could vary. Answer 2: A roughly normal shape, and a standard deviation smaller than the standard deviation of the sample. Answer 3: The mean of your sample
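A bootstrapped sampling distribution can be sketched in base R by resampling the sample with replacement. The vector `points_sample` is simulated here as a hypothetical stand-in for the Points variable; in the course's mosaic idiom the same idea would use `do()` with `resample()`.

```r
set.seed(42)
# Hypothetical stand-in for the 176 observed Points values
points_sample <- rnorm(176, mean = 800, sd = 500)

# Each bootstrap replicate: resample 176 values WITH replacement, then take the mean
boot_means <- replicate(10000, mean(sample(points_sample, replace = TRUE)))

print(c(sample_mean = mean(points_sample), boot_mean = mean(boot_means)))
print(c(sample_sd = sd(points_sample), boot_sd = sd(boot_means)))
```

The bootstrapped means centre on the sample mean (not the unknown population mean), and their standard deviation is far smaller than the standard deviation of the sample itself.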

According to the Central Limit Theorem, is each of the following True or False? 1. The shape of the distribution of means will typically be normal in shape, provided the sample size is large enough OR if the shape of the population is normal. 2. The mean of the sampling distribution will be the same as the mean of the population from which the samples are randomly chosen. That is, the sample means will center around the true population mean. 3. The standard error will be smaller for larger sample sizes. Even more specifically, the standard error will be equal to the population standard deviation divided by the square root of the sample size.

Answer 1: True Answer 2: True Answer 3: True
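The third statement (standard error = population standard deviation / √n) can be checked by simulation. This sketch assumes a skewed population, Exponential with rate 1 (whose standard deviation is 1), chosen only for illustration; `se_sim` is a hypothetical helper name.

```r
set.seed(7)
sigma <- 1  # sd of an Exponential(rate = 1) population (assumed for illustration)

# Hypothetical helper: simulate many sample means of size n and return their sd
se_sim <- function(n, reps = 5000) {
  sd(replicate(reps, mean(rexp(n, rate = 1))))
}

sizes <- c(25, 100, 400)
simulated   <- sapply(sizes, se_sim)   # standard error estimated by simulation
theoretical <- sigma / sqrt(sizes)     # sigma / sqrt(n) from the Central Limit Theorem

print(data.frame(n = sizes, simulated, theoretical))
```

Quadrupling the sample size roughly halves the standard error, even though the population itself is not normal.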

Here is the supernova table for Age2Group.model: i. Interpret the PRE. Select one: 0.14 of the total variation in salary is explained by the age groups. ii. Why is the degrees of freedom for Model equal to 1? Select one: The Age2Group model discards one of the data points used to fit the empty model and thus loses a degree of freedom.

Answer 1: 0.14 of the total variation in salary is explained by the age groups. Answer 2: The Age2Group model discards one of the data points used to fit the empty model and thus loses a degree of freedom.

Suppose you make the plot below to further explore the idea that Age would predict a teacher's Salary. i. If you fit an empty model to this data, how would you depict it on this plot? Select one: [ Select ] ["A vertical line that shows the mean Age.", "You would not be able to represent the empty model visually because it is a single number.", "A horizontal line that shows the mean Salary.", "A diagonal line that bisects the cloud of points."] ii. If you fit a model that predicts Salary by including Age as a quantitative explanatory variable, how many parameters would the model have?

Answer 1: A horizontal line that shows the mean Salary. Answer 2: Two: b0 (the intercept) and b1 (the slope for Age).

If we write the empty model in GLM notation, Y i = b 0 + e i, i. What is the value of Y i? Select one: Each individual teacher's salary. ii. What is the value of b 0? Select one: The mean salary of this sample, 52.5245. iii. What is the value of e i? Select one: The difference between each teacher's salary and the mean salary of this sample.

Answer 1: Each individual teacher's salary. Answer 2: The mean salary of this sample, 52.5245. Answer 3: The difference between each teacher's salary and the mean salary of this sample.
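The three pieces of Y i = b 0 + e i can be pulled out of a fitted empty model. A minimal sketch, assuming a short hypothetical `salary` vector; the course writes the empty model as `lm(Y ~ NULL, ...)`, and `~ 1` is used here as the equivalent intercept-only formula.

```r
# Hypothetical salaries in thousands of dollars (not the course data)
salary <- c(38, 45, 52, 61, 49, 70)

# Intercept-only (empty) model
empty.model <- lm(salary ~ 1)

b0 <- coef(empty.model)[["(Intercept)"]]  # the single parameter estimate: the sample mean
e  <- resid(empty.model)                  # e_i: each salary minus the mean

print(b0)
print(e)
```

Because the model predicts the mean for everyone, the residuals are just deviations from the mean and they sum to zero.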

After running supernova(Age.model), the following gets outputted in the R Console: i. Which of the following is the correct interpretation of MS Total (1782.607)? Select one: [ Select ] ["This is, roughly, the total number of points in the data frame.", "This is, roughly, the average squared residual from the mean.", "This is, roughly, the standard deviation from the mean.", "This is, roughly, the total number of squared means based on the empty model."] ii. Which of the following is the correct interpretation of PRE (0.23) in the supernova table? Select one: [ Select ] ["23% of the SS from the empty model can be explained by adding Age to the complex model.", "The Age model's SS total will be 23% of the SS total from the empty model.", "23% of the teachers' salaries in the data frame can be predicted with their Age.", "23% of the Age model can be proportionally reduced by the empty model."]

Answer 1: This is, roughly, the average squared residual from the mean. Answer 2: 23% of the SS from the empty model can be explained by adding Age to the complex model.
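Both quantities can be computed by hand from the residuals of the two models. This is a sketch on simulated data (the `Age` and `Salary` values are assumptions), not a reproduction of the supernova output above.

```r
set.seed(3)
# Simulated stand-in data (hypothetical, not SalaryGender)
Age <- round(runif(80, 25, 65))
Salary <- 10 + 1.0 * Age + rnorm(80, sd = 25)

empty.model <- lm(Salary ~ 1)    # empty model: predicts the mean for everyone
Age.model   <- lm(Salary ~ Age)  # complex model: adds Age

SS_total <- sum(resid(empty.model)^2)  # error left over from the empty model
SS_error <- sum(resid(Age.model)^2)    # error left over from the Age model
SS_model <- SS_total - SS_error        # error reduced by adding Age

PRE      <- SS_model / SS_total              # proportional reduction in error
MS_total <- SS_total / (length(Salary) - 1)  # SS Total divided by its df

print(c(PRE = PRE, MS_total = MS_total))
```

For a single-predictor model, PRE matches R-squared, and MS Total is the sample variance of the outcome.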

What are outliers for a box plot?

Any data points that fall beyond the whiskers (above the upper whisker or below the lower one); a boxplot depicts them as individual points.

C

Assume this code has already been run: wt.stats <- favstats( ~ wt, data = Gestation100) What will the following line of code do? rnorm(100, wt.stats$mean, wt.stats$sd) A Create a single, normal curve of weights, using the mean and SD of wt. B Generate a random sample of one data point from a normal distribution with a mean of 100. C Generate a random sample of 100 data points from a normal distribution with the same center and spread as wt. D Generate a sampling distribution of 100 means from a normal distribution with the same center and spread as wt.

Below is a normal model of a population. There are more or less likely values of this variable. What part of the population would be considered "unlikely" to be randomly selected (according to the definition of "unlikely" agreed upon by the statistics community)?

B .10% to .5% under the tail of graph

StudentSurvey Pull up the supernova table for Pulse3Group.model. Interpret the PRE A) There is a .05 chance that we have made a truly explanatory model B) .05 of the total variation in exercise hours is explained by the pulse groups C) .05 of the sample has a relationship between exercise hours and pulse groups D) .05 of the complex model's sum of squares can be explained by Pulse3Group

B) .05 of the total variation in exercise hours is explained by the pulse groups

In the supernova table of Light.model, which uses Light to explain variation in WgtGain4, what does the PRE of .60 mean? A) There is a .60 chance that this explanatory variable helps us make better predictions of the outcome variable B) .60 of the sum of squares from the empty model is explained by the Light groups C) .60 of the sample has a relationship between WgtGain4 and Light groups D) .60 of the sum of squares from the Light.model is explained by the Light groups

B) .60 of the sum of squares from the empty model is explained by the Light groups

Use ntile() to create groups of lakes that are low, medium, and high in Chlorophyll. Save this in the FloridaLakes data frame as a new variable called Chlorophyll3Group. If you then use the head() and select() commands to print out the first 6 rows of Chlorophyll3Group, what do you get as a result? A) 1, 1, 2, 2, 3, 3 B) 1, 1, 3, 1, 1, 3 C) 1, 2, 3, 1, 2, 3 D) 1, 3, 3, 2, 1, 2

B) 1, 1, 3, 1, 1, 3

Using the StudentSurvey data frame, run favstats on Siblings. At what point would the sum of squared errors (sum of squares) be lowest? A) 1 B) 1.7 C) 0 D) 2

B) 1.7

Using the StudentSurvey data frame, run favstats on Siblings. At what value would the sum of squared errors (sum of squares) be lowest? A) 1 B) 1.7 C) 0 D) 2

B) 1.7

Again using lm() to explore the model that includes Gender to explain Height in the StudentSurvey data frame, what is the mean height for female? A) 51.51 B) 65.695 C) 70.846 D) 63.896

B) 65.695

Using the FatMice18 data frame, run favstats on WgtGain4. At what value would the sum of squared errors (sum of squares) be lowest? A) 3 B) 8.39 C) 0 D) You can never be sure of the value at which the sum of squares would be lowest

B) 8.39

To examine the distribution of Happiness, which would be more useful? A) A tally B) A histogram C) In this case, either would be useful. D) In this case, neither would be useful

B) A histogram

You decide to make a relative frequency histogram for PhysicalActivity and have added the following to your code: %>% gf_density() What will you now see that you didn't see before? A) An error message. You're missing the argument between the parentheses B) A smooth, density plot overlaying your bars C) A smooth, density plot instead of bars D) A y-axis that now displays density

B) A smooth, density plot overlaying your bars

Which of the following from the NutritionStudy is a quantitative variable? A) Vitamin (vitamin use: 1 = regular, 2=occasional, or 3=no) B) Alcohol (number of alcoholic drinks consumed per week) C) Gender (Female or Male) D) EverSmoke (smoking status: Never, Former, or Current)

B) Alcohol (number of alcoholic drinks consumed per week)

Think back to the vector called SAT. Let's imagine that the student who earned the fourth score in this vector would like to know her score. You try sat[4], but get an error. What did you do wrong? A) Because sat isn't numeric, it needs quotes around it. B) Because R is case sensitive, SAT needs to be capitalized. C) SAT needs to be capitalized and in quotes. D) That shouldn't have produced an error.

B) Because R is case sensitive, SAT needs to be capitalized.

Let's say we calculate the residuals from both the empty model and the complex model. What is similar about these two sets of residuals? A) The values of the residuals from the empty model will be the same as the values of residuals from the complex model. B) Both sets of residuals represent the difference between the data and the model's prediction. C) Both sets of residuals represent the difference between the data and the Grand Mean. D) In both cases, the residuals can be reduced to near 0 simply by being careful with measurement and data entry.

B) Both sets of residuals represent the difference between the data and the model's prediction.

What kind of variables should go in ```tally()```? A) Quantitative B) Categorical

B) Categorical

You have learned to make some pretty fancy histograms now. Let's take a moment to reflect. What kind of variables should go in ```gf_facet_grid()```? A) Quantitative B) Categorical

B) Categorical

Imagine that you wrote the following code. What would it do? gf_boxplot(Happiness ~ Stress, data = Fingers, color = "orange") %>% gf_jitter() A) Create two, separate plots (a box plot and a jitter plot) B) Create a single plot (a box plot with an overlaid jitter plot) C) Create a jitter plot (the last command written) D) Create a box plot (with the jitter code omitted because it's incomplete)

B) Create a single plot (a box plot with an overlaid jitter plot)

You suspect that in the SleepStudy, Gender can be used to explain sleep quality (PoorSleepQuality). Produce a jitter plot to explore whether your suspicion might be right. Which of the following is true? A) Gender clearly predicts sleep quality B) Gender does not appear to predict sleep quality C) The jitter plot doesn't address my suspicion

B) Gender does not appear to predict sleep quality

In the jitter plot above, which makes use of transparency, what visual feature indicates a higher frequency of data points? A) More transparency B) Less transparency

B) Less transparency

Using the NutritionStudy data frame, make a histogram of Alcohol. What is represented on the y-axis? A) Number of drinks consumed per week B) Number of patients C) The count of alcoholic drinks D) Number of variables

B) Number of patients

What will happen in R if you run: print("StatsCourse") A) R will send "StatsCourse" to your local printer B) R will display: "StatsCourse" C) R will show the full data file named "StatsCourse" D) R will return an error message. R is a programming language, not a printer interface

B) R will display: "StatsCourse"

You decide to conduct a study of energy drinks using undergraduates from your school. You select participants by randomly choosing ID numbers from among all ID numbers of current students. Once chosen, you randomly pick one of two energy drinks for students to consume weekly, throughout the school term. The first step is an example of _____ and the second is an example of _____. A) Random assignment; random selection B) Random selection; random assignment C) Random assignment, random assignment D) Random selection, random selection

B) Random selection; random assignment

If you've calculated the variance for WgtGain4, what have you found? A) Roughly the total squared residual from the empty model, in squared grams B) Roughly the average squared residual from the empty model, in squared grams C) Roughly the average residual from the empty model, in grams D) The sum of the residuals from the mean

B) Roughly the average squared residual from the empty model, in squared grams

Let's say you wanted to create a vector called "SAT" from a list of SAT scores. How would you do that? A) SAT <- (1300, 1120, 1050, 1470, 1350) B) SAT <- c(1300, 1120, 1050, 1470, 1350) C) SAT <- vector(1300, 1120, 1050, 1470, 1350) D) vector <- SAT(1300, 1120, 1050, 1470, 1350)

B) SAT <- c(1300, 1120, 1050, 1470, 1350)

In our example about Thumb length and Sex, which is the explanatory variable? A) Thumb B) Sex

B) Sex

Let's say we wanted to create three groups based on Pulse: low, medium, and high. Which of the following code would do that, and save the values in a new variable called Pulse3Group? A) Pulse3Group <- ntile (3) B) StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 3) C) StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 2) D) StudentSurvey <- ntile(StudentSurvey$Pulse3Group)

B) StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 3)

SAT <- c(1300,1120,1050,1470,1350) First.Score <- SAT[1] Second.Score <- SAT[2] First.Higher <- First.Score > Second.Score First.Higher A) 180 B) TRUE C) 1300 D) First.Score > Second.Score

B) TRUE

If the z score for your friend's car's highway miles per gallon is found to be .6, what does that mean? A) The car's highway miles per gallon is 60% better than the other configurations of cars in the distributions B) The car's highway miles per gallon is .6 standard deviations larger than the mean for hwy C) The car's highway miles per gallon is now smaller because .6 is smaller than 27 D) The car's highway miles per gallon should be a whole number (like in the Empirical Rule), which clearly suggests an error in the calculation

B) The car's highway miles per gallon is .6 standard deviations larger than the mean for hwy

FatMice18 If we run lm() to fit the model for WgtGain4, using Light as an explanatory variable, how is error from the model calculated for each mouse? A) The deviation of each mouse's WgtGain4 from the Grand Mean of WgtGain4 B) The deviation of each mouse's WgtGain4 from the mean WgtGain4 for their Light group C) The deviation of each Light group's mean to the Grand Mean of WgtGain4 D) The deviation between the mean WgtGain4 of the two Light groups

B) The deviation of each mouse's WgtGain4 from the mean WgtGain4 for their Light group
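The claim that group-model error is measured from each mouse's own group mean can be verified directly. A minimal sketch on a tiny made-up data frame; the `mice` values and the group labels "LD" and "DM" are hypothetical, not FatMice18.

```r
# Tiny hypothetical data set (not FatMice18)
mice <- data.frame(
  Light    = c("LD", "LD", "LD", "DM", "DM", "DM"),
  WgtGain4 = c(5, 6, 7, 9, 11, 13)
)

# Group model: Light explains variation in WgtGain4
Light.model <- lm(WgtGain4 ~ Light, data = mice)

# ave() returns, for every row, the mean of that row's own group
group_means <- ave(mice$WgtGain4, mice$Light)
manual_resid <- mice$WgtGain4 - group_means

# The model's residuals match the deviations from the group means
print(cbind(mice, group_means, manual_resid, model_resid = resid(Light.model)))
```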

If you create an empty model of TopSpeed, what would it mean to have an "empty model"? A) The model would be the best way of explaining how many variables contribute to TopSpeed (such as time of year and type of bike) B) The model would include only mean TopSpeed C) The model would predict a different TopSpeed depending on the situation D) None of the above

B) The model would include only mean TopSpeed

What would be true about the empty model for Fat? A) The model would be the best way of explaining how many variables contribute to Fat (such as smoking status and gender) B) The model would make the same prediction (the mean of Fat) for every person regardless of their values on other variables C) The model would predict a different value for Fat for each person, depending on their values on other variables D) The model would predict 0 grams for every person's value on Fat

B) The model would make the same prediction (the mean of Fat) for every person regardless of their values on other variables

If the z score for a mouse's weight gain is -0.7, what does that mean? A) The mouse's weight gain is 70% lower than the average mouse in the distribution B) The mouse's weight gain is 0.7 standard deviations lower than the mean of WgtGain4 C) The mouse lost 0.7 grams of weight D) The mouse's weight gain is lower than 70% of the entire sample

B) The mouse's weight gain is 0.7 standard deviations lower than the mean of WgtGain4
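A z score is the distance from the mean measured in standard deviations: z = (value − mean) / sd. The sketch below uses a hypothetical `wgt` vector, not the FatMice18 data.

```r
# Hypothetical weight gains in grams (not FatMice18)
wgt <- c(3, 5, 6, 8, 9, 11, 14)

# z score for every value: how many sds above (+) or below (-) the mean
z <- (wgt - mean(wgt)) / sd(wgt)

# Going the other way: the raw weight gain that sits 0.7 sds below the mean
raw_at_minus_0.7 <- mean(wgt) + (-0.7) * sd(wgt)

print(round(z, 2))
print(raw_at_minus_0.7)
```

Because the standard deviation is in the denominator, the same raw distance from the mean gives a smaller z score when the standard deviation is larger.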

We can calculate the residuals from both the empty model and the complex model. What is similar about these two sets of residuals? A) The values of the residuals from the empty model will be the same as the values of the residuals from the complex model B) The residuals represent the difference between the data and the model's prediction C) The residuals represent the difference between the data and the Grand Mean D) In both cases, the residuals can be reduced to near 0 simply by being careful with measurement and data entry

B) The residuals represent the difference between the data and the model's prediction

Where should you look in the histogram to notice within group variation? A) The center of the distribution B) The spread of the distribution C) The density of the distribution D) The skew of the distribution

B) The spread of the distribution

In the SleepStudy, might Stress be a predictor of Happiness? What do you see in a boxplot? A) There's less variability in happiness among participants with high stress than there is among participants with normal stress B) There are more outliers among participants with normal stress than there are among participants with high stress C) Stress does not appear to explain happiness

B) There are more outliers among participants with normal stress than there are among participants with high stress

Imagine you make three histograms: one for TopSpeed, one for the predicted values based on the empty model for TopSpeed, and one for the residuals. Which two distributions will have a similar shape? A) TopSpeed and the predicted values B) TopSpeed and residuals C) Predicted values and residuals D) No two of these distributions will have a similar shape

B) TopSpeed and residuals

If a data point is very far away from the mean, what would you expect for the residual? A) When farther away, the more positive the residual B) When farther away, the larger the absolute value of the residual C) When farther away, the more variable the residual D) When a data point is very far away from the mean, the residual should be 0, because the mean balances the residuals

B) When farther away, the larger the absolute value of the residual

Let's say you've calculated the sum of squares for hwy. What would be the advantage of dividing that number by n-1 (i.e., dividing it by the df)? A) It turns the sum of squares into a measure of spread B) You can use it to compare error across samples of different sizes C) You would have calculated the population variance D) None of the above. There's no advantage of dividing SS by the df

B) You can use it to compare error across samples of different sizes

Why should you look at a histogram of a variable before you do other statistical analyses? (check all that apply) A) You'll need the results from your histogram in order to write additional R code B) You might catch errors made in data collection/entry C) You can see the shape of the distribution to see if it makes sense D) R won't be able to run other functions on the data frame unless you make a histogram first

B) You might catch errors made in data collection/entry C) You can see the shape of the distribution to see if it makes sense

To examine the distribution of Happiness, which would be more useful? A) a tally B) a histogram C) In this case, either would be useful D) in this case, neither would be useful

B) a histogram

Which of the following cannot be calculated from the NutritionStudy data set? A) an estimate B) a parameter C) a statistic D) a simple model

B) a parameter

The NutritionStudy data frame includes information on the number of Calories patients consumed per day. Produce a histogram of Calories, without indicating a particular number of bins or a particular bin size. What is the peak of the histogram? A) around 1200 B) around 1600 C) around 2000 D) around 2200

B) around 1600

In GLM notation, which of the following represents the model (or prediction)? A) Yi B) b0 C) ei D) none of the above

B) b0

The variable Pres2008 is categorical. It indicates whether it was McCain or Obama who won the state in the 2008 election. Which is the more appropriate visual representation for this data? A) histogram B) bar graph C) both are equally appropriate D) neither is appropriate for a categorical variable

B) bar graph

Imagine that you wrote the following code. What would it do? gf_boxplot(Happiness ~ Stress, data = Fingers, color = "orange") %>% gf_jitter() A) create two, separate plots (a box plot and a jitter plot) B) create a single plot (a box plot with an overlaid jitter plot) C) create a jitter plot (the last command written) D) create a box plot (with the jitter code omitted because it's incomplete)

B) create a single plot (a box plot with an overlaid jitter plot)

If you wanted to get the five-number summary for PhysicalActivity, what R code would you run? A) sort(USStates, PhysicalActivity) B) favstats(~PhysicalActivity, data = USStates) C) gf_histogram (USStates$PhysicalActivity) D) makefun (USStates.PhysicalActivity)

B) favstats(~PhysicalActivity, data = USStates)

Using the StudentSurvey data frame, create a faceted histogram for Weight by Gender. For whom is likely the mean a better model? A) males B) females C) the mean is an equally good model for males and females D) histogram cannot be used to answer this question

B) females

If you print the residuals for StudentSurvey.modelGPA, what will you see? A) 3.158 B) for each participant in the study, the difference between his/her GPA and the mean GPA C) for each participant in the study, the model (i.e. the mean) for GPA D) the GPA of each participant in the study

B) for each participant in the study, the difference between his/her GPA and the mean GPA

Which R code would create a distribution of Smokers? A) histogrm(Smokers, USStates) B) gf_histogram(~Smokers, data = USStates) C) histogram ~ Smokers D) gf_histogram ~ Smokers

B) gf_histogram(~Smokers, data = USStates)

Use gf_point to examine ClassesMissed by Gender (coded 0 for females and 1 for males). Locate the SleepStudy participant who missed the most classes. Is it a male or a female? A) female B) male C) no single student stands out

B) male

Which of the following from FloridaLakes are quantitative variables? (check all that apply) A) Lake B) pH C) NumSamples D) MinMercury

B) pH C) NumSamples D) MinMercury

Wanting to see "MyUniversity" in the R Console, you've just run the following command: print(MyUniversity). However, R returned an error message. What's the correct command, if you want to print "MyUniversity"? A) Print(MyUniversity) B) print("MyUniversity") C) #"MyUniversity" D) Print "MyUniversity"

B) print("MyUniversity")

You decide to conduct a study of energy drinks using undergraduates from your school. You select participants by randomly choosing ID numbers from among all ID numbers of current students. Once chosen, you randomly pick one of two energy drinks for students to consume weekly, throughout the school term. The first step is an example of _____ and the second is an example of _____. A) random assignment; random selection B) random selection; random assignment C) random assignment; random assignment D) random selection; random selection

B) random selection; random assignment

If you'd like to see an overview of what's in the data frame -- a list of your variables, whether they're numeric or factors, and so forth -- what command would you use? A) tally() B) str() C) c() D) sort()

B) str()

How would you quickly find the total number of water samples (or test tubes) collected across all of the lakes in your study? A) sample(sum, FloridaLakes) B) sum(FloridaLakes$NumSamples) C) tally(~NumSamples, data = FloridaLakes) D) arrange(FloridaLakes, NumSamples)

B) sum(FloridaLakes$NumSamples)

If you ran the R code below, what would you be able to tell from the output? Empty.model <- lm(hwy ~ NULL, data = mpg) Empty.model A) How much error there is around the model B) the mean C) β0 D) all of the above

B) the mean

Change your histogram of Smokers in the USStates data frame into a density histogram by adding "..density.." to your code. What changed? A) the x-axis B) the y-axis C) the shape of the distribution D) nothing

B) the y-axis

If you use lm() to fit the empty model for LikeM, and then use confint() to find the confidence interval, what does the confidence interval tell you?

B. It gives you a range of possible β0s that could have generated your sample. C. It gives you a range of possible μs that could have generated your sample. Both B and C are correct.

C

Based on the descriptions and calculations above, would 12 months be included in a 95% confidence interval calculated from a sample that found an average time between dental appointments of 10 months with a standard deviation of 4 months? (a) Yes, 12 would be included in the interval (b) There is a 95% chance 12 would be included in the interval (c) No, 12 would not be included in the interval (d) It is impossible to tell.

Why might the mean be a good simple model for the distribution of salaries, below:

Because the mean is a model that balances the deviations from the model and minimizes the sum of squared residuals.

Above we show the favstats() for females' ratings of their date's intelligence (IntelligentF). Why might it be helpful to examine the distribution of means from random samples of n=273 drawn from a normal distribution with the same mean and standard deviation as IntelligentF?

Because the resulting sampling distribution would give us an idea of how much random sample means could vary.

Why might it be helpful to calculate means from random samples of 176 drawn from a normal distribution with the same mean and standard deviation as Points?

Because the resulting sampling distribution would give us an idea of how much random sample means could vary.

The remaining questions on this exam have to do with the following data frame: Data from the 2010-2011 regular season from a sample of NBA players has been stored in a data frame called NBAPlayers2011. The data frame consists of 176 observations on the following 25 variables. Below is a histogram of Points for the 176 players in this data set, created with the following code: gf_histogram(~Points, data=NBAPlayers2011, color="grey", fill="seagreen", alpha=0.5) If we were to use this data to guess the mean number of points scored in the population, what would we be trying to estimate?

β0 (beta 0)

D

Broadly speaking, what do we study when we study statistics? A Data B Formulas C Variables D Variation

USStates What proportion of states (recorded as the variable Pres2008) was won by Obama? (Hint: use the "tally" command) A) .51 B) .53 C) .56 D) .59

C) .56

Use the DataCamp window above to write some code that will show you the values for Age and Alcohol for patients in the NutritionStudy data frame. The last study participant is 45 years old. How many alcoholic drinks does she consume per week? A) 1.8 B) 2.2 C) 0.2 D) 3.0

C) 0.2

The USStates data frame includes information on the percentage of residents in each state who smoke. Data is coded under the variable named Smokers. Produce a histogram of Smokers, without indicating a particular number of bins or indicating a particular bin size. Where is the peak of the histogram? A) 15 B) 17 C) 20 D) 25

C) 20

If the mean for TopSpeed is 33.6, what will the empty model predict for each observation's TopSpeed? A) a value within one quarter of 33.6 B) 0 C) 33.6 D) it's impossible to say

C) 33.6

The FloridaLakes data frame includes information collected by researchers when they analyzed samples of water (collected in standardized test tubes) from a number of lakes. Using the DataCamp window above, determine how many lakes are included in the data frame. A) 5 B) 12 C) 53 D) 94

C) 53

You run the following command: RandomLakes <- sample(FloridaLakes, 10). What will be the result? A) A printout of random lakes B) A new data frame of 10 lakes drawn randomly from the population C) A new data frame of 10 lakes drawn randomly from your FloridaLakes data frame D) None of the above

C) A new data frame of 10 lakes drawn randomly from your FloridaLakes data frame
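The course's `sample(FloridaLakes, 10)` relies on the version of `sample()` the course loads, which accepts a data frame. The same idea can be sketched in base R by sampling row numbers. The `lakes` data frame below is a hypothetical stand-in with 53 rows, not the real FloridaLakes data.

```r
set.seed(10)
# Hypothetical stand-in for FloridaLakes: 53 rows, made-up mercury values
lakes <- data.frame(
  ID = 1:53,
  AvgMercury = round(runif(53, 0.05, 1.3), 2)
)

# Draw 10 row numbers without replacement, then keep those rows
RandomLakes <- lakes[sample(nrow(lakes), 10), ]

print(RandomLakes)
print(nrow(RandomLakes))
```

Every sampled row comes from the existing data frame, which is why the result describes a sample of your sample rather than a fresh draw from the population.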

What's true of sampling variation? A) It's almost purely theoretical. We rarely encounter it. B) It leads to bias C) Because of it, no sample will perfectly represent the population D) All of the above

C) Because of it, no sample will perfectly represent the population

You've just run the following code: tally(~WgtGain4 > 10, data = FatMice18, format="proportion") You've gotten the following output: WgtGain4 > 10 TRUE FALSE 0.2222 0.7778 What can you now say? A) Approximately 22% of mice gained more than 10 grams of weight B) If another mouse were randomly selected and added to this data set, the likelihood that it would gain more than 10 grams would be 22% C) Both of the above D) None of the above

C) Both of the above
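The proportion reported by `tally(..., format = "proportion")` can be reproduced in base R. The `gain` vector below is a hypothetical set of 18 weight gains chosen so that 4 of them exceed 10; it is not the FatMice18 data.

```r
# Hypothetical weight gains for 18 mice (not FatMice18); 4 values exceed 10
gain <- c(4, 5, 6, 7, 7, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 12, 14, 17)

# Base-R stand-in for tally(~ gain > 10, format = "proportion")
props <- prop.table(table(gain > 10))

# Shortcut: the mean of a TRUE/FALSE vector is the proportion of TRUEs
p_over_10 <- mean(gain > 10)

print(props)
print(p_over_10)
```

The TRUE and FALSE proportions always sum to 1, so a table showing the same value twice (other than 0.5) signals a transcription error.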

In a study to find out if smoking habits explain variation in fat consumption, _____ would be the outcome variable and _____ would be the explanatory variable A) EverSmoke; Fat B) EverSmoke; the cause of fatty eating C) Fat; EverSmoke D) Fat; the rate of eating fatty foods

C) Fat; EverSmoke

In a study designed to find out what explains variation in Happiness, _____ would be the outcome variable and _____ would be the explanatory variable. A) Stress; Happiness B) Stress; the cause of happiness C) Happiness; Stress D) Happiness; the rating of happiness

C) Happiness; Stress

If you wanted to calculate a z score for a hwy of 27, how would it be affected by the standard deviation for hwy? A) If the standard deviation is large, the z score should also be very large and positive B) If the standard deviation is large, the absolute z score should also be large but we won't be able to tell if it is positive or negative C) If the standard deviation is large, the z score should be small and positive D) Standard deviation and z score are unrelated because they measure different things about the distribution

C) If the standard deviation is large, the z score should be small and positive

Construct a boxplot using data from NutritionStudy that illustrates how females and males (coded in the variable Gender) differ in daily consumption of Calories. Which of the following is true? A) Q1 is equal across the two groups B) There are many outliers on the low end of calorie consumption C) More than half of females consume less than 2000 calories per day D) There are more male outliers than female outliers

C) More than half of females consume less than 2000 calories per day

If you calculate the standard deviation for hwy, what have you found? A) Roughly the total squared deviations from the mean, in squared highway miles per gallon B) Roughly the average squared deviation from the mean, in squared highway miles per gallon C) Roughly the average deviation from the mean, in highway miles per gallon D) None of the above

C) Roughly the average deviation from the mean, in highway miles per gallon

Now imagine that the same student simply wanted to know whether her original score was a 1470. How would you get her answer? A) SAT[4]><1470 B) SAT[4]=1470 C) SAT[4]==1470 D) SAT[4]<-1470

C) SAT[4]==1470
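The indexing and comparison steps from the SAT questions can be run together. The scores are the ones used in the questions above; the `tryCatch()` wrapper is added only so the lowercase `sat` error can be shown without stopping the script.

```r
SAT <- c(1300, 1120, 1050, 1470, 1350)

# Square brackets pick out an element by position
fourth <- SAT[4]

# == tests equality and returns TRUE or FALSE; a single = or <- would assign instead
is_1470 <- SAT[4] == 1470

# R is case sensitive: no object called sat exists, so sat[4] raises an error
lowercase_result <- tryCatch(sat[4], error = function(e) "error")

print(fourth)
print(is_1470)
print(lowercase_result)
```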

How will this correction (changing 54 back to 27) affect the mean and the median? A) Both the mean and median will be equally affected by this correction B) The median will be affected more than the mean C) The median will be affected less than the mean D) It's impossible to say how the median will be affected by the correction

C) The median will be affected less than the mean
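A small made-up example shows why the median moves less than the mean when one extreme value is corrected. The `hwy_bad` and `hwy_fixed` vectors are hypothetical; only the 54-to-27 correction comes from the question.

```r
# Hypothetical highway mpg values; the last one was mis-entered as 54
hwy_bad   <- c(24, 25, 26, 27, 54)
# Same values after correcting 54 back to 27
hwy_fixed <- c(24, 25, 26, 27, 27)

mean_change   <- mean(hwy_bad) - mean(hwy_fixed)      # the mean uses every value's size
median_change <- median(hwy_bad) - median(hwy_fixed)  # the median only uses the middle position

print(c(mean_change = mean_change, median_change = median_change))
```

In this example the median does not move at all, while the mean drops by several miles per gallon.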

In a histogram, the height of the bars tells you how many people there are with that thumb length. In a jitter plot, what tells you how many people there are with a particular thumb length? A) The height of the dots B) The number of dots in a column C) The number of dots in a row D) The center of the dots

C) The number of dots in a row

The nutrition study included 315 patients. Where is this information represented in the data frame? A) One of the values in the data frame B) The number of variables in the data frame C) The number of rows in the data frame D) The number of columns in the data frame

C) The number of rows in the data frame

The study included 53 lakes in Florida. Where is the information in this data frame? A) One of the values in the data frame B) The number of variables in the data frame C) The number of rows of data D) The number of columns of data

C) The number of rows of data

In general, in R, __________ is where you type in code and __________ is where the code runs. A) The script window (i.e., script.R); the script window (i.e., script.R) B) The R Console; the R Console C) The script window (i.e., script.R); the R Console D) The R Console; the script window (i.e., script.R)

C) The script window (i.e., script.R); the R Console

You've been commissioned to do a study of all lakes with average mercury levels above 1. You want to save the data of the lakes that meet this criterion to a new data frame called HighMercury. What's wrong with the following code? HighMercury <- filter(floridalakes,avgmercury >1) A) It's missing quotation marks around the number 1 B) It doesn't appropriately name the new data frame C) There are capitalization errors D) Nothing

C) There are capitalization errors

If you're told that there's random measurement error in how one of your variables was recorded, what do you know for sure? A) your data are biased B) A mistake was made when the data were either recorded or entered C) There will be more variation than you might expect D) All of the above

C) There will be more variation than you might expect

What's the name of the LAST lake in the FloridaLakes data frame? A) Trout B) Weir C) Yale D) Rosseau

C) Yale

Imagine that you created a histogram of PhysicalActivity. You meant to set it to have 15 bins, but you mis-typed and set 5 bins instead. How would the result be different from what you hoped? A) Your histogram would be missing data B) You would see more bars than you hoped C) You would see less detail than you would have otherwise depicted D) There would be no difference because your N is 50

C) You would see less detail than you would have otherwise depicted

You run the following command: RandomPatients <- sample(NutritionStudy, 10). What will be the result? A) a printout of random patients B) a new data frame of 10 patients drawn randomly from the population C) a new data frame of 10 patients drawn randomly from the NutritionStudy data frame D) none of the above

C) a new data frame of 10 patients drawn randomly from the NutritionStudy data frame

The same student has a tendency to come back repeatedly to ask the same question. With that in mind, you decide to save the answer above to an R object so you can readily call it up later. You decide to call the R object annoyingstudent. Which line of code would create that object? A) annoyingstudent <- SAT[4]><1470 B) annoyingstudent <- SAT[4]=1470 C) annoyingstudent <-SAT[4]==1470 D) SAT[4]=1470 <- annoying student

C) annoyingstudent <-SAT[4]==1470

If you want to quickly see the name of the lake with the lowest average mercury level, what R command might you run? A) arrange(FloridaLakes) B) tally (FloridaLakes$AvgMercury) C) arrange (FloridaLakes, AvgMercury) D) str(FloridaLakes)

C) arrange (FloridaLakes, AvgMercury)

If we express our model as Yi = b0 + b1X1i + b2X2i + ei which part represents the model's prediction for Exercise? A) Yi B) b0 C) b0 + b1X1i + b2X2i D) b1X1i

C) b0 + b1X1i + b2X2i

Let's say you want to filter your data so that you do NOT include lakes that have missing data for Chlorophyll. What line of code will do that? A) filter(FloridaLakes, Chlorophyll != "0") B) filter(FloridaLakes, Chlorophyll == "0") C) filter(FloridaLakes, Chlorophyll != "NA") D) filter(FloridaLakes, Chlorophyll == "NA")

C) filter(FloridaLakes, Chlorophyll != "NA")
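
For comparison, a base R sketch of the same idea using is.na(), which is a common way to test for missing values. The small data frame below is a hypothetical stand-in for FloridaLakes, and the Chlorophyll values are invented:

```r
# Hypothetical stand-in for FloridaLakes with one missing Chlorophyll value
lakes <- data.frame(
  Lake = c("Lake 1", "Lake 2", "Lake 3"),
  Chlorophyll = c(0.7, NA, 128.3)
)

# is.na() flags missing values; ! negates it, so only complete rows are kept
complete_lakes <- lakes[!is.na(lakes$Chlorophyll), ]
print(complete_lakes)
```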

In the FloridaLakes data frame, what kind of variable is AgeData? A) numeric B) factor C) integer D) none of the above

C) integer

What R code would you use to fit the empty model for TopSpeed? A) gf_histogram(NULL ~ TopSpeed, data = BikeCommute) B) NULL(TopSpeed, data = Bike Commute) C) lm(TopSpeed~NULL, data = BikeCommute) D) gf(TopSpeed ~ NULL, data = BikeCommute)

C) lm(TopSpeed~NULL, data = BikeCommute)

What notation can be used to represent the mean of the population? A) β0 B) μ C) Both of the above D) none of the above

C) Both of the above

Let's say you want to create an R object, so you can call it up later. The object you want to create represents the Oxford Dictionaries Word of the Year for 2017, which happens to have been "Youthquake." How would you create that object? A) oxfordword2017 <- youthquake B) youthquake" <- oxfordword2017 C) oxfordword2017 <- "youthquake" D) "youthquake" <- oxfordword2017 E) You can't. R objects must be numeric.

C) oxfordword2017 <- "youthquake"

Try creating a NULL or empty model for TV viewing using the StudentSurvey data frame, and then look at the SS by using anova(). What unit is associated with the number 11224? A) minutes B) hours C) square hours D) impossible to tell

C) square hours

Which of the following R code would quickly help you find the number of states in which McCain won the 2008 election and the number of states in which Obama won? A) winner(~Pres2008, data = USStates) B) gf_boxplot(~Pres2008, data = USStates) C) tally(~Pres2008, data = USStates) D) str(~Pres2008, data = USStates)

C) tally(~Pres2008, data = USStates)

The average Distance of this person's bike commute is just over 27 miles. Imagine that you've discovered that one of your observations has been recorded incorrectly. Instead of a distance of around 27 miles, the distance for one of the commutes has been entered as 54 miles! You make the correction to your data frame. How will the correction affect the mean? A) The mean will be unaffected by the correction B) the mean will be higher because of the correction C) the mean will be lower because of the correction D) It's impossible to say how the mean will be affected by the correction

C) the mean will be lower because of the correction

Using the SleepStudy data frame, create a boxplot to explore whether Stress (coded as normal or high) might be used to explain GPA. Which of the following statements are true? A) Participants with normal stress levels have slightly higher GPA B) Participants with high stress levels vary in GPA more than do individuals with normal stress levels C) the sole outlier is a participant with a normal level of stress

C) the sole outlier is a participant with a normal level of stress

What are the sources of variation? eg) thumb = sex + other stuff

Can be explained : portion of the total variation we were able to attribute to "sex". Unexplained: everything included in "other stuff". Can be real (probably can figure it out) or induced (measurement error, sampling error, mistakes).

What kind of variables should go in tally()?

Categorical

Use the DataCamp window above to construct a faceted histogram of how much fun males perceive females to be (FunM) by males' race (RaceM) in the SpeedDating data frame. Which of the race groups looks most like the panel below?

Caucasian

How do you distinguish the groups by color when you relate 2 variables?

Chaining a new function to the histogram eg) gf_histogram(~ Thumb, data = Fingers, fill = ~Sex) %>% gf_refine(scale_fill_manual(values = c("purple", "orange"))) OR WITH gf_refine(scale_color_manual(values = c("purple", "orange")))

Random assignment

Assigning participants to experimental groups by chance, without bias

One researcher wondered if some of the variation in the difference in velocity came from the type of swimmer they were. Triathletes swim in wetsuits more often than competitive swimmers do, and she worried that their experience would influence the results of this study. Above is a faceted histogram of difference in velocity by the Type of athlete. The two vertical lines depict the mean of the swimmer group and the mean of the triathlete group. What would be the PRE of this model: Type.model <- lm(SpeedUp ~ Type, data = Wetsuits)

Close to 0

What are box plots useful for?

Comparing the distribution of an outcome variable across different levels of a categorical explanatory variable.

A

Consider a case where the population variance (σ 2 ) is known. In this case, we do not need to estimate variance in the sample. In order to generate a confidence interval for the mean of the distribution, which mathematical function could we use to represent the sampling distribution: A. Normal distribution with sample mean and known variance B. Normal distribution with sample mean and sample variance C. t-distribution with degrees of freedom = n D. t-distribution with degrees of freedom = n-1

A

Consider a model where we predict infant mortality from education. What would the proper word equation for such a model be? A. Infant.Mortality = Education + Error B. Education = Infant.Mortality + Error C. Infant= Mortality + Education D. Mortality = Education + Infant + Error

A

Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The population mean of the sampling distribution of the mean for Study 1 will be the same as the population mean of the sampling distribution of the mean for Study 2. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW

C

Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The sample mean from Study 1 will be smaller than the sample mean from Study 2. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW

B

Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The sampling distribution of the mean from Study 1 will look exactly the same as the population distribution. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW

A

Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The sampling distribution of the mean from Study 1 will be more normally distributed than the sampling distribution of the mean from Study 2. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW

B

Consider two studies (Study 1 and Study 2) with different sample sizes (n1 > n2) but of the same variables and from the same populations (e.g., two studies of average well-being drawn from the population of UCLA students, one with a sample size of n1 = 100 and another with a sample size of n2 = 30). Use what you know about the Central Limit Theorem to evaluate the following statement: The standard error of the mean from Study 1 will be greater than the standard error of the mean from Study 2. A. TRUE B. FALSE C. IMPOSSIBLE TO KNOW

What will the following code do? resample(Wetsuits, 12)

Create a new sample from the observations in Wetsuits

what does gf_jitter() do?

Creates a jitter plot. eg) gf_jitter(Thumb ~ Sex, data = Fingers)

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the data? A) -10 B) 33.6 C) 57.2 D) 23.6

D) 23.6

In USStates, what's the median percentage of residents with college degrees? (the variable is aptly named College) A) 20.2 B) 31.2 C) 43.4 D) 30.6

D) 30.6

Use the DataCamp window above to find the product of 12345 and 34567. What's the result? (Hint: Use "prod" to find the product in the same way that you'd use "sum" to find the sum.) A) 746355795 B) 475839694 C) 375844135 D) 426729615

D) 426729615

In the NutritionStudy data frame, the number 6.3 appears in the column labeled Fiber. This 6.3 is an example of A) A variable B) A condition C) A unit sampled D) A value

D) A value

If you ran the R code below, what would you be able to tell from the output? Empty.model <- lm(hwy ~ NULL, data = mpg) anova(Empty.model) A) How much error there is around the empty model B) The sum of the squared residuals C) The sum of squares D) All of the above

D) All of the above

The sum of squares gets larger as A) The variation increases B) The sample size increases C) The spread of the distribution increases D) All of the above

D) All of the above

What's true about data? A) They require that you've selected a sample B) They are the result of measurement C) They represent something about the world D) All of the above

D) All of the above

Which of these options could be used to depict the relationship between Exercise and Pulse3Group? A) gf_boxplot(Exercise ~ Pulse3Group, data = StudentSurvey) B) gf_histogram(~Exercise, data = StudentSurvey) %>% gf_facet_grid(Pulse3Group~.) C) gf_point(Exercise ~ Pulse3Group, data = StudentSurvey) D) All of the above

D) All of the above

What does a point represent in a scatterplot? A) A data frame (e.g., Fingers) B) An explanatory variable (e.g., Sex) C) An outcome variable (e.g., Thumb) D) An observational unit's values (e.g., a student's Thumb length and Sex)

D) An observational unit's values (e.g., a student's Thumb length and Sex)

Imagine that you have both the empty model for Exercise and the complex model for Exercise (i.e., the model that includes Pulse3Group). What would you do if you wanted to compare how well they predict exercise? A) Compare the SS from each model B) Look at the reduction in error in the Pulse3Group model C) Examine the PRE D) Any of the above

D) Any of the above

The data frame is currently organized alphabetically by lake. What if you'd like to see it ordered by average mercury level, with the most polluted lake appearing first on the list? Save the result back into FloridaLakes. A) FloridaLakes <- sample(FloridaLakes, desc(AvgMercury)) B) FloridaLakes <- sum(AvgMercury) C) FloridaLakes <- tally(~ NumSamples, AvgMercury) D) FloridaLakes <- arrange(FloridaLakes, desc(AvgMercury))

D) FloridaLakes <- arrange(FloridaLakes, desc(AvgMercury))

What would the following R code do, beyond creating a histogram? gf_histogram(~College, data = USStates) %>% gf_labs(title = "Distribution of Residents with College Degrees", x = "Percentage") A) Specify the data frame and specify the variable to be used on the x axis B) Give the histogram a title and color the histogram red C) Create a new data frame with distribution of residents with college degrees (in percent) D) Give the histogram a title and specify the label for the x axis

D) Give the histogram a title and specify the label for the x axis

If you want to write a note to yourself about your R code but you want R to ignore it, how would you do so? A) You can't. R is very literal and will run everything entered. B) Simply type"[ignore]" at the start of the line. C) Include a tab at the start of the line. D) Include a hashtag (#) at the start of the line.

D) Include a hashtag (#) at the start of the line.

Run the supernova table for Pulse3Group.model. Does this show that cardiovascular health (that is, being in a lower pulse group) causes students to exercise more? A) Yes, because the F statistic is quite large (around 10), and the PRE is reasonable B) Yes, because we have discovered the best fitting parameter estimates C) No, because this analysis actually shows that exercising more causes lower pulse rates D) No, because the design of this study was correlational, so we cannot determine causation

D) No, because the design of this study was correlational, so we cannot determine causation

Arrange the FloridaLakes data frame by the variable called Calcium. What is the name of the lake with the lowest amount of Calcium? A) Annie B) lamonia C) Ocean Pond D) Ocheese Pond

D) Ocheese Pond

Still curious about Stress as an explanatory variable in the SleepStudy, you construct a boxplot to see if it's related to DepressionScore. Which of the following is true? A) There's more variability in depression among participants with normal stress levels that there is among participants with high stress levels. B) There are more outliers among participants with high stress levels than there are among participants with normal stress levels C) Stress does not appear to explain depression D) Q3 for participants with normal stress levels is roughly the same as Q1 for participants with high stress participants

D) Q3 for participants with normal stress levels is roughly the same as Q1 for participants with high stress participants

What's true of the distribution of any variable, if your model is the mean of the variable A) The distribution of the variable is more narrow than the distribution of the residual B) The distribution of the variable is always centered on a number lower than is the distribution of its residual C) The distribution of the variable is always centered on 0, whereas the center of the distribution of its residual is unpredictable D) The distribution of the variable is the same shape as the distribution of its residual

D) The distribution of the variable is the same shape as the distribution of its residual

NutritionStudy If the distribution of Fat were roughly symmetrical and bell-shaped, what would that mean? A) Most people consume more grams of fat than the average for the sample B) It is unlikely that there would be a confounding variable C) the distribution would be very narrow D) The most frequent number in the distribution would be very close to the average scores

D) The most frequent number in the distribution would be very close to the average scores

Construct a scatterplot to explore the relationship between GPA and Happiness among participants in the SleepStudy. What seems to be true? A) GPA does not appear to predict happiness B) The participants with the lowest GPA are NOT the least happy C) The participants with the highest GPA are NOT universally happy D) All of the above

D) all of the above

Maybe considering yourself a morning person (a "lark") or an evening person (an "owl") is related to variation in GPA. Which of the following plots would help us to see whether variation in GPA is related to variation in LarkOwl A) gf_histogram(~GPA, data = SleepStudy) %>% gf_facet_grid (LarkOwl~.) B) gf_boxplot(GPA ~ LarkOwl, data = SleepStudy) C) gf_point(GPA ~ LarkOwl, data = SleepStudy) D) all of the above

D) all of the above

Take the StudentSurvey data frame and use lm() to fit the empty model for GPA. Save the results in an R object StudentSurvey.modelGPA. What do you get when you print StudentSurvey.modelGPA? A) the "intercept" B) 3.518 C) the mean for GPA D) all of the above

D) all of the above

Take the StudentSurvey data frame and use lm() to fit the empty model for SAT. What is 1204? A) an unbiased estimate of SAT B) the estimate of SAT that has the least error C) the mean SAT score D) all of the above

D) all of the above

The mean of Alcohol is 3.279 per day. A particular patient consumes 2 drinks per day. Which of the following represents the residual for the patient under the empty model? A) Yi - b0 B) 2 - 3.279 C) ei D) all of the above

D) all of the above

Use lm() to fit the empty model for TV in the Student Survey data frame. What can you say based on the output? A) The mean of the distribution of hours spent viewing TV by students in this data frame is 6.054 B) The best fitting number for the empty model is 6.054 C) 6.054 is an unbiased estimate D) all of the above

D) all of the above

What R code will output the standard deviation for hwy? A) sd(mpg$hwy) B) sqrt(var(mpg$hwy)) C) favstats(~hwy, data = mpg) D) all of the above

D) all of the above
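
A short base R check that the first two routes give the same standard deviation. This is only a sketch: mtcars$mpg (built into R) stands in for mpg$hwy, and favstats() is left out because it needs the mosaic package.

```r
# mtcars ships with base R; its mpg column stands in for mpg$hwy here
x <- mtcars$mpg

sd_direct  <- sd(x)         # standard deviation directly
sd_via_var <- sqrt(var(x))  # square root of the variance

# The same quantity by hand: sum of squares divided by n - 1, then square root
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

print(c(sd_direct, sd_via_var, sd_by_hand))
```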

What notation cannot be used to represent the mean of the sample? A) Yi B) β0 C) μ D) all of the above

D) all of the above

Which of the following word equations represents the hypothesis that smoking explains variation in fat consumption A) smoking = fat consumption B) fat consumption = smoking C) smoking = fat consumption + other stuff D) fat consumption = smoking + other stuff

D) fat consumption = smoking + other stuff

What R code would produce a relative frequency histogram of PhysicalActivity? A) gf_histogram(~PhysicalActivity, data = USStates) B) gf_relativehist(~PhysicalActivity, data = USStates) C) gf_densityhist(~PhysicalActivity, data = USStates) D) gf_histogram(..density..~PhysicalActivity, data = USStates)

D) gf_histogram(..density..~PhysicalActivity, data = USStates)

How would you create a plot to look at the distribution of Distance? A) gf_plot(~Distance$BikeCommute) B) gf_histogram(~Distance) C) gf_histogram(BikeCommute, Distance) D) gf_histogram(~Distance, data = BikeCommute)

D) gf_histogram(~Distance, data = BikeCommute)

Create a box plot for College in the USStates data frame. How many outliers do you see? A) 1 on the high end B) 2 on the low end C) 1 on the high end and 2 on the low end D) none

D) none

If you wanted to see the distribution for College (percent of residents with college degrees), and run the following R code, what would be wrong? gf_histogram(~College, data = USStates, bins = 10) A) "bins = 10" will return an error B) "gf_" is unnecessary C) The "~" is unnecessary D) nothing

D) nothing

Broadly speaking, what do we study when we study statistics? A) data B) formulas C) variables D) variation

D) variation

D

Does this show that cardiovascular health (that is, being in a lower pulse group) causes students to also exercise more? A Yes, because the F statistic is quite large (around 10), and the PRE is reasonable. B Yes, because we have discovered the best fitting parameter estimates. C No, because this analysis actually shows that exercising more causes lower pulse rates. D No, because the design of this study was correlational, so we cannot determine causation.

Revise your basic histogram of Smokers in the USStates data frame so that it includes just 5 bins. Locate the bin that represents the states with the highest number of smokers. Around what number is that bin centered? A) 10 B) 15 C) 20 D) 25 E) 30

E) 30

FunM.model <-lm(LikeM ~ FunM, data = SpeedDating) Using the R code above, we fit this model: Yi=b0+b1Xi+ei What does the Xi refer to?

Each male's value on FunM

c

Economic theories suggest that if there is less supply, there will be more demand. Perhaps if there are fewer of a particular major, then those majors might be paid higher wages. We used totalgrads (how many people graduated with that major) to predict median_income for that major in a model called totalgrads.model. When we calculated the best fitting estimates, we found a negative slope (b1 = −.01989). Choose the correct interpretation for this value. A This is how much error has been explained per degree of freedom. B This is the proportion of b1 that could have been randomly generated if the empty model were true. C This is the decrement in predicted median earnings (in thousands of dollars) for each additional thousand graduates of that major. D This is the thousands of dollars predicted for the median income of a major that had 0 graduates.

c

Economic theories suggest that if there is less supply, there will be more demand. Perhaps if there are fewer of a particular major, then those majors might be paid higher wages. We've now used totalgrads (how many people graduated with that major) to predict the median_income of that major in a model called totalgrads.model (R code follows). totalgrads.model <- lm(median_income ~ total, data = collegegrads) Which of the following statements about SS is true? A SS Error for totalgrads.model should be greater than SS Total because the dots on this plot are not close to the regression model B The model (totalgrads.model) is not a very good explanatory model, so SS Model should be smaller than 1. C SS Model will be smaller than SS Total because SS Model is always smaller than SS Total, whether a model explains a lot of variation or not. D SS Error will be greater than SS Model because SS Error is always bigger than SS Model, whether a model explains a lot of variation or not.

B

Estimate the linear model predicting Infant Mortality from Education. What is the interpretation of the slope? A. For provinces with -.03% education beyond primary school, infant mortality is expected to increase by 1% B. For each 1% increase in education beyond primary school, infant mortality is expected to decrease by .03% C. For each 20.27% increase in education beyond primary school, infant mortality is expected to decrease by 0.03% D. None of the above

TRUE or FALSE: Correlation implies causation.

False

If you want to label the different levels after you split the variable into two groups you would use this....

Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1, 2), labels = c("short", "tall"))

You can also use arithmetic operators to summarize variables. For example, it turns out that the ratio of Index to Ring finger (that is, Index divided by Ring) is often used in health research as a crude measure of prenatal testosterone exposure. Use the division operator, /, to create this summary variable.

Fingers$IndexRingRatio <- Fingers$Index/Fingers$Ring

How could you see the sum of sex if you just made it categorical?

Fingers$Sex <- as.numeric(Fingers$Sex) sum(Fingers$Sex)

We use the following code to turn Sex into a factor, and then replace the old version of the variable, which was numeric, with the new version, a factor:

Fingers$Sex <- factor(Fingers$Sex)

Assume we run these two lines of code: AttractiveF.stats <- favstats(~ AttractiveF, data = SpeedDating) rnorm(100, AttractiveF.stats$mean, AttractiveF.stats$sd) What will the second line of code do?

Generate a random sample of 100 data points from a normal distribution with the same center and spread as AttractiveF.
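
A base R sketch of the same idea. The ratings vector is hypothetical and stands in for SpeedDating$AttractiveF, and mean() and sd() replace favstats(), which comes from the mosaic package:

```r
set.seed(7)  # makes the random draw reproducible

# Hypothetical attractiveness ratings standing in for AttractiveF
ratings <- c(5, 6, 7, 8, 6, 7, 9, 4, 6, 7)

# Draw 100 values from a normal distribution with the sample's mean and sd
simulated <- rnorm(100, mean = mean(ratings), sd = sd(ratings))

length(simulated)
head(simulated)
```

The simulated values will not match the original ratings; they only share (approximately) the same center and spread.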

C

Given this distribution of data, could the population of newborn weights be shaped like a normal distribution? A No, the sample distribution doesn't look curved enough to have come from a normal distribution. B We cannot tell unless we did a few calculations to see if this fits the empirical rule. C Yes, it is possible because samples often do not look exactly like the populations that they were drawn from. D This sample distribution could only have come from a normal distribution because it is roughly symmetric and largely clustered in the center of the distribution.

In a study designed to find out what explains variation in Happiness, _____ would be the outcome variable and _____ would be the explanatory variable.

Happiness; Stress. (Happiness: measure of degree of happiness, where higher values are happier. Stress: coded stress score, normal or high.)

B

Having a low resting heart rate (recorded in the variable Pulse) is supposed to be an indicator of good cardiovascular health. Let's say we wanted to create three groups based on Pulse: low, medium, and high. Which of the following code would do that, and save the values in a new variable called Pulse3Group? A Pulse3Group <- ntile(3) B StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 3) C StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 2) D StudentSurvey <- ntile(StudentSurvey$Pulse3Group)

C

How will this correction (changing 54 back to 27) affect the mean and the median? A Both the mean and median will be equally affected by this correction. B The median will be affected more than will the mean. C The median will be affected less than will the mean. D It's impossible to say how the median will be affected by the correction.

D

How would you create a plot to look at the distribution of Distance? A gf_plot(~Distance$BikeCommute) B gf_histogram(~ Distance) C gf_histogram(BikeCommute, Distance) D gf_histogram(~ Distance, data = BikeCommute)

B

How would you quickly find the total number of water samples (or test tubes) collected across all of the lakes in your study? A sample(sum, FloridaLakes) B sum(FloridaLakes$NumSamples) C tally(~NumSamples, data = FloridaLakes) D arrange(FloridaLakes, NumSamples)

A

How would you use R to calculate variance in hwy? A var(mpg$hwy) B lm(hwy ~ var, data = mpg) C anova(mpg, data = hwy) D favstats(~ hwy, data = hwy)

D

If I want to draw a set of 30 random cases from my dataset with 1500 observations which function would I use: a) select(dataset, c(1:1500, 30)) b) dataset [ , sample(1:1500,30)] c) select (dataset, 30) d) dataset [sample (1:1500, 30) , ]
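
A small sketch of row sampling with bracket indexing. The data frame here is simulated purely for illustration:

```r
set.seed(1)  # makes the random draw reproducible

# Simulated data frame with 1500 observations
dataset <- data.frame(id = 1:1500, score = rnorm(1500))

# sample(1:1500, 30) picks 30 row numbers; placing them BEFORE the comma
# selects rows, while leaving the column slot empty keeps every column
random_cases <- dataset[sample(1:1500, 30), ]

dim(random_cases)
```

Putting the sampled numbers after the comma instead would ask for columns 1 to 1500, which is why option b does not draw cases.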

B

If a data point is very far away from the mean, what would you expect for the residual? A When farther away, the more positive the residual. B When farther away, the larger the absolute value of the residual. C When farther away, the more variable the residual. D When a data point is very far away from the mean, the residual should be 0, because the mean balances the residuals.

A

If the distribution of NoWetsuit was more variable (that is, has a greater standard deviation) than the distribution of Wetsuit, what would be true about the confidence interval of NoWetsuit compared to Wetsuit? A The NoWetsuit confidence interval would be wider. B The less "confident" you can be in the NoWetsuit confidence interval. C The NoWetsuit confidence interval should be based on the t rather than z distribution. D The NoWetsuit confidence interval should be calculated with the Central Limit Theorem and not with bootstrapping or simulation.

C

If the mean for TopSpeed is 33.6, what will the empty model predict for each observation's TopSpeed? A A value within one quarter of 33.6 B 0 C 33.6 D It's impossible to say.

D

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the data? A -10 B 33.6 C 57.2 D 23.6

A

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the residual? A -10 B 33.6 C 57.2 D 23.6
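
The residual arithmetic can be sketched in a few lines, using only the two numbers given in the question:

```r
mean_top_speed <- 33.6  # the empty model's prediction for every observation
observed       <- 23.6  # the data: this observation's actual TopSpeed

# residual = data - model prediction
residual <- observed - mean_top_speed
print(residual)
```

A negative residual means the observation falls below the model's prediction.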

A

If the population average time between dental appointments is 12 months with a standard deviation of 4 months, what is the probability that a sample of 4 individuals has an average time between dental appointments of 8 months or lower? (a) 0.02 (b) 0.98 (c) 0.16 (d) 0.84
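
One way to check this numerically, assuming the sampling distribution of the mean is normal (an assumption the question implies but does not state):

```r
pop_mean <- 12  # population mean, in months
pop_sd   <- 4   # population standard deviation, in months
n        <- 4   # sample size

# Standard error of the mean: sd of the sampling distribution
se <- pop_sd / sqrt(n)

# Probability that a sample mean is 8 months or lower
p <- pnorm(8, mean = pop_mean, sd = se)
print(round(p, 2))
```

A sample mean of 8 sits two standard errors below 12, so the probability is the lower tail beyond z = -2.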

C

If the researchers are interested in whether wearing wetsuits affects swimming velocity, what is the outcome variable of interest? A Wetsuit B NoWetsuit C The difference in velocity between swimming in NoWetsuit compared to Wetsuit. D Type of athlete

D

If the sample size were larger, would the confidence interval be wider or narrower? A. Wider, because more people leads to more variability (i.e., higher standard deviation) B. Wider, because the sample size is positively related to margin of error C. Narrower, because the standard deviation will get smaller with more people D. Narrower, because the sample size is negatively related to margin of error
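
A brief sketch of how the margin of error shrinks as n grows. The standard deviation of 10 is an arbitrary assumption, and the z-based formula is used for simplicity:

```r
sigma <- 10           # assumed standard deviation (hypothetical)
n     <- c(30, 100)   # two sample sizes to compare

# 95% margin of error: critical value times the standard error
margin <- qnorm(0.975) * sigma / sqrt(n)
names(margin) <- paste0("n=", n)

print(round(margin, 2))
```

Because n sits in the denominator under a square root, quadrupling the sample size halves the margin of error.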

B

If the z score for your friend's car's highway miles per gallon is found to be .6, what does that mean? A The car's highway miles per gallon is 60% better than the other configurations of cars in the distribution. B The car's highway miles per gallon is .6 standard deviations larger than the mean for hwy. C The car's highway miles per gallon is now smaller because .6 is smaller than 27. D The car's highway miles per gallon should be a whole number (like in the Empirical Rule), which clearly suggests an error in the calculation.
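
A z score is just a residual from the mean expressed in standard deviations. A sketch with mtcars$mpg (built into R) standing in for the hwy variable:

```r
x <- mtcars$mpg  # stand-in for mpg$hwy

# z = (value - mean) / standard deviation
z <- (x - mean(x)) / sd(x)

# A car with z = 0.6 would sit 0.6 standard deviations above the mean
value_at_z_0.6 <- mean(x) + 0.6 * sd(x)
print(value_at_z_0.6)
```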

C

If we express our model as Yi = b0 + b1X1i + b2X2i + ei which part represents the model's prediction for Exercise? A Yi B b0 C b0 + b1X1i + b2X2i D b1X1i

A

If we fit a normal curve on the distribution of hwy (see visualization below), what is it that we're modeling with it? A Error around the model for hwy B The median C The empty model for hwy D Sample statistics

C

If we used this code to fit the empty model: Empty.model <- lm(Exercise ~ NULL, data = StudentSurvey) And then used the predict() function to make a prediction for each student's number of hours exercised per week, what value would it predict for each student? A It would depend on how much they actually exercised. B The value would be the mean number of hours exercised by that student that year and would vary for each student. C The value would be the mean number of hours exercised by this sample and would be the same for each student. D You would not be able to determine the value because it is represented by b0.
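
A small base R sketch of what the card above describes, using a made-up data frame (toy and its five values are hypothetical, not the StudentSurvey data):

```r
# Hypothetical hours of exercise per week for five students
toy <- data.frame(Exercise = c(2, 5, 7, 10, 6))

# Empty model: no explanatory variable, only the mean
empty_model <- lm(Exercise ~ NULL, data = toy)

# One prediction per student; every prediction is the sample mean
predictions <- predict(empty_model)
predictions

# Residuals around the mean sum to zero (up to rounding error)
sum(resid(empty_model))
```

Every student gets the same predicted value, the sample mean, and the residuals balance out around it.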

C

If we wanted to create a sampling distribution of the shuffled slopes (b1), how would you modify this code? shuffledSDob1 <- do(10000) * b1(median ~ total, data = collegegrads) A Shuffle the do() function like this: shuffle(do(10000)) B Shuffle the data like this: shuffle(collegegrads) C Shuffle either median or total like this: shuffle(median) or shuffle(total) D Shuffle the b1() function like this: shuffle(b1(median ~ total, data = collegegrads))
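
A rough base R illustration of shuffling one variable before refitting. This is a sketch under assumptions: x and y are made-up data, and sample() and replicate() stand in for the shuffle() and do() functions used in the card.

```r
set.seed(42)                       # make the shuffles reproducible
x <- 1:30                          # hypothetical explanatory variable
y <- 2 * x + rnorm(30)             # hypothetical outcome related to x

# Slope from the real (unshuffled) data
real_b1 <- unname(coef(lm(y ~ x))[2])

# Shuffle the outcome 1000 times and refit, keeping each slope
shuffled_b1 <- replicate(1000, unname(coef(lm(sample(y) ~ x))[2]))

# Shuffling breaks the link between x and y, so the slopes center near 0
mean(shuffled_b1)
```

Shuffling either variable (but not both the whole data frame or the function call) is what breaks the relationship, which is why the shuffled slopes cluster around 0 even though the real slope is close to 2.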

B

If we wanted to explore the idea that mother's age might explain the variation in birth weight, what visualization might be helpful to us? A gf_histogram(~ wt, data = Gestation100) %>% gf_facet_grid(age ~ .) B gf_point(wt ~ age, data = Gestation100) C gf_boxplot(wt ~ age, data = Gestation100) D All of the above would be helpful to us as we explore our data.

A

If we were to create a 99% confidence interval for the same data instead, would that confidence interval be wider (bigger) or narrower (smaller) than the 95% confidence interval? A. Wider/Bigger B. Narrower/Smaller C. It depends on sample size D. Impossible to tell

B

If we were to use this data to guess the mean weight of newborns in the population, what would we be trying to estimate? A Standard error B β0 C Xi D SStotal

B

If you create an empty model of TopSpeed, what would it mean to have an "empty model"? A The model would be the best way of explaining how many variables contribute to TopSpeed (such as time of year and type of bike). B The model would include only mean TopSpeed. C The model would predict a different TopSpeed depending on the situation. D None of the above.

B

If you created a bootstrapped sampling distribution of 10,000 means from your sample of SpeedUp, what qualities would you expect it to have? A A roughly normal shape, and a standard deviation similar to the standard deviation of the sample B A roughly normal shape, and a standard deviation smaller than the standard deviation of the sample C A mean similar to the sample mean, and a standard deviation similar to the standard deviation of the sample D A shape similar to that of the sample, and a standard deviation smaller than the standard deviation of the sample

A

If you decide you want to increase your level of confidence in your estimate of Wetsuit (from 95% to 99%), what will happen to your confidence interval? A It will become wider. B It will become narrower. C It will become both wider and less reliable. D It will become both narrower and less reliable.

B

If you fit a model that predicts Mins by including FTMade as an explanatory variable, how many parameters would the model have? A 2: Mins and FTMade B 2: the y-intercept and the slope of the regression line C 2: the mean of Mins and the increment added for each free throw made that exceeds the mean number of free throws made D 4: Yi, b0, b1, Xi

B

If you ran the R code below, what would you be able to tell from the output? Empty.model <- lm(hwy ~ NULL, data = mpg) Empty.model A How much error there is around the empty model B The mean C β0 D All of the above

D

If you ran the R code below, what would you be able to tell from the output? Empty.model <- lm(hwy ~ NULL, data = mpg) anova(Empty.model) A How much error there is around the empty model B The sum of the squared residuals C The sum of squares D All of the above

D

If you want to know if a regression model is better than a simple model in terms of making a prediction, what parameter should you make a sampling distribution of? A The mean B The standard deviation C The confidence interval D The slope

C

If you want to quickly see the name of the lake with the lowest average mercury level, what R command might you run? A arrange(FloridaLakes) B tally(FloridaLakes$AvgMercury) C arrange(FloridaLakes, AvgMercury) D str(FloridaLakes)

A, B, D

If you want to use R to get the sum of 10 and 20, what code would you write? (Check all that apply.) A sum(10+20) B sum(10,20) C Sum(10+20) D 10+20

D

If you want to write a note to yourself about your R code, but you want R to ignore it, how would you do so? A You can't. R is very literal and will run everything entered. B Simply type "[ignore]" at the start of the line. C Include a tab at the start of the line. D Include a hashtag (#) at the start of the line.

C

If you wanted to calculate a z score for a hwy of 27, how would it be affected by the standard deviation for hwy? A If the standard deviation is large, the z score should also be very large and positive. B If the standard deviation is large, the absolute z score should also be large but we won't be able to tell if it is positive or negative. C If the standard deviation is large, the z score should be small and positive. D Standard deviation and z are unrelated because they measure different things about the distribution.
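
To see the effect of the standard deviation on z, here is a small sketch with hypothetical numbers (the mean of 23 and both standard deviations are made up for illustration, not taken from the mpg data):

```r
# z score: distance from the mean, measured in standard deviations
z_score <- function(x, m, s) (x - m) / s

# Same score (27) and same mean (23), two different spreads
z_small_sd <- z_score(27, m = 23, s = 2)   # small standard deviation
z_large_sd <- z_score(27, m = 23, s = 8)   # large standard deviation

c(small_sd = z_small_sd, large_sd = z_large_sd)
```

The score stays above the mean in both cases, so z stays positive, but dividing by a larger standard deviation shrinks it (2 versus 0.5 here).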

A, D

If you wanted to generalize to all lakes in Florida, but only included lakes within a 50 km radius of the research center in your study, what should concern you? (Check all that apply.) A The sample is not random. B The sample is not convenient. C The sample will not have variation. D The sample may not represent the population you want to know about.

B

If you wanted to get the five-number summary for PhysicalActivity, what R code would you run? A sort(USStates, PhysicalActivity) B favstats(~ PhysicalActivity, data = USStates) C gf_histogram(USStates$PhysicalActivity) D makefun(USStates.PhysicalActivity)

A

If you were interested in proportions rather than counts, which argument would you add to your code above? A format = "proportion" B "proportion" C format = "relative frequency" D "percentage"

C

If you were to calculate the sum of the residuals from the empty model of Mins, what would it be? A Less than the sum of the residuals from the Points.model of Mins B More than the sum of the residuals from the Points.model of Mins C 0 D It's impossible to tell

B

If you'd like to see an overview of what's in the data frame—a list of your variables, whether they're numeric or factors, and so forth—what command would you use? A tally() B str() C c() D sort()

C

If you're told that there's random measurement error in how one of your variables was recorded, what do you know for sure? A Your data are biased. B A mistake was made when the data were either recorded or entered. C There will be more variation than you might expect. D All of the above

C

If you've calculated the standard deviation for hwy, what have you found? A Roughly the total squared deviations from the mean, in squared highway miles per gallon B Roughly the average squared deviation from the mean, in squared highway miles per gallon C Roughly the average deviation from the mean, in highway miles per gallon D None of the above

C

If you've identified your confidence interval for SpeedUp, what exactly are you confident about? A You're confident that the means of samples of SpeedUp falls within a normal range. B You're confident that at least 95% of athletes would swim faster when wearing a wetsuit. C You're confident that the true effect of wearing a wetsuit on swimming velocity lies within it. D You're confident that the SlowDown will be normally distributed in the population.

A

Imagine that the PhysicalActivity histogram is skewed to the right. That is, the skinny, longer tail is on the right. What does that mean? A The population in most states is sedentary. B The population in most states is very active. C The US population, overall, is sedentary. D The US population, overall, is very active.

C

Imagine that you created a histogram of PhysicalActivity. You meant to set it to have 15 bins, but you accidentally set 5 bins instead. How would the result be different from what you hoped? A Your histogram would be missing data. B You would see more bars than you hoped. C You would see less detail than you would have otherwise depicted. D There would be no difference because your N is 50.

D

Imagine that you have both the empty model for Exercise and the complex model for Exercise (i.e., the model that includes Pulse3Group). What would you do if you wanted to compare how well they predict Exercise? A Compare the SS from each model B Look at the reduction in error in the Pulse3Group model C Examine the PRE D Any of the above

B

Imagine that you wrote the following code. What would it do? gf_boxplot(Happiness ~ Stress, data = Fingers, color = "orange") %>% gf_jitter() A Create two, separate plots (a boxplot and a jitter plot) B Create a single plot (a boxplot with an overlaid jitter plot) C Create a jitter plot (the last command written) D Create a boxplot (with the jitter code omitted because it's incomplete)

A

Imagine that you've calculated SS for both the empty model and the complex model for Exercise. What will be true about these SS? A SS leftover from the empty model will be greater than the SS leftover from the complex model. B SS leftover from the empty model will be smaller than the SS leftover from the complex model. C SS leftover from the empty model will be equal to the SS leftover from the complex model. D In both cases the SS will be 0 because the residuals are balanced by the mean.

B

Imagine you make three histograms: one for TopSpeed, one for the predicted values based on the empty model for TopSpeed, and one for the residuals. Which two distributions will have a similar shape? A TopSpeed and the predicted values B TopSpeed and residuals C Predicted values and residuals D No two of these distributions will have a similar shape.

If the sampling distribution of means is normal, the underlying population distribution is:

Impossible to tell

B

In GLM notation, which of the following represents the model (or prediction)? A Yi B b0 C ei D None of the above

A

In Yi = 10.38 - .85X1i - 3.14X2i + ei what does X1i stand for? A Whether someone is in the medium pulse group or not B The number of members in the medium pulse group C The intercept for Pulse3Groupmed D Whether someone is in the low or medium or high group

C

In a study designed to find out what explains variation in Happiness, _____ would be the outcome variable and _____ would be the explanatory variable. A Stress; Happiness B Stress; the cause of happiness C Happiness; Stress D Happiness; the rating of happiness

D

In addition to the NBA player data for the 2011 season, we also have a similar data frame called NBAPlayers2015 (with many of the same variables). Which of these values will be the same if we create two models with these lines of R code: Points11.model <- lm(Min ~ Points, data = NBAPlayers2011) Points15.model <- lm(Min ~ Points, data = NBAPlayers2015) A The SS total for both these models will be the same, because they have the same outcome variable. B The SS model for both these models will be the same because they have the same explanatory variable. C The best-fitting estimate of the empty model will be the same because it will be the mean number of minutes played. D None of these values (SS total, SS model, mean) will be the same.

C

In general, in R, __________ is where you type in code and __________ is where the code runs. A The script window (i.e., script.R); the script window (i.e., script.R) B The R Console; the R Console C The script window (i.e., script.R); the R Console D The R Console; the script window (i.e., script.R)

A, D

In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. What's our goal in analyzing data like this? (Check all that apply.) A It helps us understand the population. B It helps us understand each individual in the sample. C Solely to help us understand this particular sample. D It helps us understand the processes that produced the variation we see.

D

In which case would you use the t-distribution over the normal distribution to make a confidence interval? (a) When you do not want to make assumptions about the distribution of the errors (b) When generating a confidence interval for a slope instead of a mean (c) When sample size is very small (n < 30) (d) When the population standard deviation is unknown

A

In which of the following situations can you use model comparison and/or confidence intervals to choose between the complex and simple model: three-group model (a) model comparison (b) confidence intervals (c) both

C

In which of the following situations can you use model comparison and/or confidence intervals to choose between the complex and simple model: regression model (a) model comparison (b) confidence intervals (c) both

C

In which of the following situations can you use model comparison and/or confidence intervals to choose between the complex and simple model: two group model (a) model comparison (b) confidence intervals (c) both

A

In your study, you tested two types of energy drinks (SuperBuzz and StayFocused). You found that students who consumed SuperBuzz rated themselves as more alert on average than did those who drank StayFocused. Your roommate suspects that you are being fooled by chance (also called Type 1 error). What's her concern? A The difference you found was the result of sampling variation. B Your study didn't have enough participants. C Your random assignment wasn't really random. D Your random selection wasn't really random.

Sample distributions are made up of _________; sampling distributions are made up of _________.

Individual scores; sample statistics

In the FloridaLakes data frame, what kind of variable is AgeData?

Integer

B

Interpret a PRE of 0.05 A There is a .05 chance that we have made a truly explanatory model. B .05 of the total variation in exercise hours is explained by the pulse groups. C .05 of the sample has a relationship between exercise hours and pulse groups. D .05 of the complex model's sum of squares can be explained by Pulse3Group.

In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. What's our goal in analyzing data like this? (Check all that apply.)

It helps us understand the population. It helps us understand the processes that produced the variation we see.

SpeedUp contains the swimming velocity with Wetsuit minus NoWetsuit. Could these differences in swimming velocity be normally distributed in the population?

It is possible.

What happens to the sampling distribution as we increase sample size?

It looks more normal

When you add an explanatory variable to your model, what should be the effect on the Total Sum of Squares (from the empty model)?

It should remain unchanged.

You fit a regression model, then construct a 95% confidence interval for the estimate of β1. If the confidence interval includes 0, what does this mean?

It suggests we should retain the empty model.

If you decide you want to increase your level of confidence for your estimate of LikeM (from 95% to 99%), what will happen to your confidence interval?

It will become wider.

If you decide you want to increase your level of confidence in your estimate of Wetsuit (from 95% to 99%), what will happen to your confidence interval?

It will become wider.

What would happen if we went from a 95% confidence interval to a 90% confidence interval?

It will get narrower because we have less confidence

What's the value in using the t distribution?

It works well as a model of the sampling distribution if the sample size is small, or standard deviation of the population is unknown.

How would the variable Fingers$IndexRingRatio be represented in a tidy data frame?

It would be represented as a new column

What would the sampling distribution of means look like for samples of n = 1?

It would have the same shape and standard deviation as the population distribution.

If you increase your sample size in a study, how does it affect the 95% confidence interval around a parameter estimate?

It would make the confidence interval narrower.

FunM.model <- lm(LikeM ~ FunM, data = SpeedDating) confint(FunM.model) Using the code above, we created a model to predict LikeM using FunM as the explanatory variable. We then constructed 95% confidence intervals around the parameter estimates. The output is pictured below. If we repeated this study and found a larger standard error, what would be different about the confidence interval for β1?

It's likely that the 95% confidence interval for β1 would be wider.

D

James computed a 95% confidence interval for the mean of a variable using a sample of size 100 as CI = 4 to 12. Based on this James concluded the likelihood of finding an individual with a score of 3.5 is 5% or less. Why is this conclusion FALSE? A. This conclusion is not FALSE, it is TRUE B. We can only be exactly 95% confident when sample size is infinite, so based on a sample of 100 we're less than 95% sure C. Based on the score we can calculate an exact probability of the individual's score, so we don't need to conclude it's 5% or less. D. The confidence interval is a range of possible means, not range of possible individuals.

If, in the population, females' mean intelligence ratings of males is 7.7, with a standard deviation of 1.2, how likely is it that we would randomly draw a sample of n=276 with a mean of less than 7.5?

Less than 5%
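
Worked check for the card above, as a base R sketch that assumes the sampling distribution of the mean is approximately normal (variable names are made up):

```r
# Values taken from the question above
pop_mean <- 7.7
pop_sd   <- 1.2
n        <- 276

# Standard error of the mean for samples of this size
se <- pop_sd / sqrt(n)

# Probability of drawing a sample mean below 7.5
p <- pnorm(7.5, mean = pop_mean, sd = se)
p < 0.05
```

With n = 276 the standard error is only about 0.07, so a sample mean of 7.5 is far out in the lower tail and the probability is well under 5%.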

In the jitter plot above, which makes use of transparency, what visual feature indicates a higher frequency of data points?

Less transparency

A

Let's compare two models, one that treats Points as a quantitative variable to predict Mins, and the other that uses Points to create 24 groups (Points24Group) before using it to predict Mins. The supernova tables below show that the Points24Group model reduces the total variation in Mins by 77% (its PRE), but the Points model reduces it by 65%. Why isn't the Points24Group model better than the Points model of Mins? (F ratio of Points = 322; F ratio of Points24Group = 22.25) A The F ratio shows that the Points model explains more variation per degree of freedom than the Points24Group. B The SS error is bigger for the Points model, which demonstrates its advantage over the Points24Group model. C The Points model is far more elegant because the name is shorter and less clunky. D Trick question! The Points24Model is better than the Points model because the PRE is bigger, the SS model is bigger, and the SS error is smaller. There are no measures that suggest that the Points model is better.

A

Let's imagine a world where the population average speed of light is 299792 with a standard deviation of 105. Fill in the blank for the following code to create a sampling distribution of means. SDOM <- do(1000)*mean(rnorm(20, mean = ________, sd = ________)) A. 299792; 105 B. 105; 299792 C. 299792; 105/sqrt(20) D. 105/sqrt(20); 299795
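
A base R sketch of the same simulation, using replicate() as a stand-in for the do() function in the card (an assumption made so the example runs without extra packages):

```r
set.seed(1)                        # make the simulation reproducible

# 1000 samples of n = 20 from the population; keep each sample mean
sdom <- replicate(1000, mean(rnorm(20, mean = 299792, sd = 105)))

mean(sdom)                         # close to the population mean
sd(sdom)                           # close to the standard error below
105 / sqrt(20)                     # theoretical standard error
```

rnorm() draws individual observations, so it takes the population mean and population standard deviation. The smaller spread of 105 / sqrt(20) shows up in the resulting means rather than being passed in.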

A

Let's say a researcher hopes to explore the hypothesis that knowing about someone's stress level can help to predict their happiness. What word equation best captures this idea? A Happiness = Stress + other stuff B Stress = Happiness + other stuff C Other stuff = Stress + Happiness D Happiness = Stress

A

Let's say we wanted to write a word equation to explain the variation in the time it takes to bike to work. We think that the Distance of a person's commute is an important explanatory variable. What would the word equation look like? A Time = Distance + other stuff B Distance = Time + other stuff C Other stuff = Distance + Time D Model = Time + Distance

A, B, C

Let's say you make several histograms in the process of exploring the data. Among them is a frequency histogram of PhysicalActivity and a relative frequency histogram of PhysicalActivity. If you used default settings for each of them, what do the two have in common? (Check all that apply.) A They display the same variable. B They have the same number of bars. C The shape of the distribution would be the same. D Their axes would have the same labels.

C

Let's say you want to create an R object, so you can call it up later. The object you want to create represents the Oxford Dictionaries Word of the Year for 2017, which happens to have been "Youthquake." How would you create that object? A oxfordword2017 <- youthquake B youthquake" <- oxfordword2017 C oxfordword2017 <- "youthquake" D "youthquake" <- oxfordword2017 E You can't. R objects must be numeric.

C

Let's say you want to filter your data so that you do NOT include lakes that have missing data for Chlorophyll. What line of code will do that? A filter(FloridaLakes, Chlorophyll != "0") B filter(FloridaLakes, Chlorophyll == "0") C filter(FloridaLakes, Chlorophyll != "NA") D filter(FloridaLakes, Chlorophyll == "NA")

B

Let's say you wanted to create a vector called "SAT" from a list of SAT scores. How would you do that? A SAT <- (1300, 1120, 1050, 1470, 1350) B SAT <- c(1300, 1120, 1050, 1470, 1350) C SAT <- vector(1300, 1120, 1050, 1470, 1350) D vector <- SAT(1300, 1120, 1050, 1470, 1350)

B

Let's say you've calculated the sum of squares for hwy. What would be the advantage of dividing that number by n - 1 (i.e., dividing it by the df)? A It turns the sum of squares into a measure of spread. B You can use it to compare error across samples of different sizes. C You would have calculated the population variance. D None of the above. There's no advantage of dividing SS by the df.

A

Let's split GPA into three groups—low, medium, and high—and then create a faceted histogram. What goes in the blanks in the following code? SleepStudy$GPA3Group <- ntile(_____, 3) gf_dhistogram(~ Happiness, data = _____) %>% gf_facet_grid(GPA3Group ~ .) A SleepStudy$GPA; SleepStudy B SleepStudy$GPA; SleepStudy$GPA C SleepStudy; SleepStudy D GPA3Group; SleepStudy

Random Sampling

Making a selection from a population without bias

D

Maybe considering yourself a morning person (a "lark") or an evening person (an "owl") is related to variation in GPA. Which of the following plots would help us see whether variation in GPA is related to variation in LarkOwl? A gf_histogram(~ GPA, data = SleepStudy) %>% gf_facet_grid(LarkOwl ~ .) B gf_boxplot(GPA ~ LarkOwl, data = SleepStudy) C gf_point(GPA ~ LarkOwl, data = SleepStudy) D All of the above

The output of favstats(~Points, data=NBAPlayers2011) is shown below. If we had collected a different sample of 176 NBA players, what value(s) would be different?

Most likely, all of these would be different

Suppose there is a correlation between a teacher's salary and their age. There might not be a causation relationship due to gender. Gender is called a: Regression line Causation variable Correlation coefficient Confounding variable

NOT Causation variable

After running lm(Salary~Age, data=SalaryGender), the following is outputted in the R Console: Call: lm(formula = Salary ~ Age, data = SalaryGender) Coefficients: (Intercept) Age -9.305 1.319 According to our model, someone who is 0 years old would have a salary of -9.305 (thousands of dollars). This of course makes no sense. It doesn't make sense because of: Modeling Extrapolation Causation Regression Correlation

NOT Regression

After running the code, cor(Salary ~ Age, data = SalaryGender), the following result is outputted in the R Console: [1] 0.4770431 Now that we know r = 0.47 for a regression line, what percent of variation is explained by the explanatory variable? 50% 100% 0% We can't determine the value. 53% 0.47$ 47%

NOT We can't determine the value.

A sample mean is 24 and the margin of error is 3.5. Would a population mean of 28.4 be considered likely?

No
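
The arithmetic behind the card above, as a short sketch (variable names are made up):

```r
sample_mean <- 24
margin      <- 3.5

# Interval of plausible population means
lower <- sample_mean - margin
upper <- sample_mean + margin

# Is 28.4 inside the interval?
candidate <- 28.4
inside <- candidate >= lower && candidate <= upper
inside
```

The interval runs from 20.5 to 27.5, and 28.4 falls just above it.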

What's true of sampling variation?

No sample will perfectly represent the population

Will all samples drawn from a population always have the same mean?

No, because of sampling variation

Do these results show that light treatment (that is, being in the LL group) causes mice to eat more?

No, because the experiment shows that light causes mice to gain more weight, but does not prove that their weight gain is caused by eating more.

Based on the data shown in the boxplot, can we conclude that smoking causes changes in fat consumption?

No, because these data are the result of a correlation study, not an experiment.

We can calculate the residuals from the complex model and the mean residuals for the three groups with this code. StudentSurvey$Residuals <- resid(Pulse3Group.model); mean(Residuals ~ Pulse3Group, data = StudentSurvey) Here is the output: Have you done something wrong in R?

No. Means always balance residuals.

We used AgeM to predict ratings of fun (FunM). The F value for this model in the table above is .02. What does this F ratio tell us?

None of the above.

The sampling distribution of means from a uniform distribution will likely be:

Normal

C

Now imagine that the same student simply wanted to know whether her original score was a 1470. How would you get her answer? A SAT[4]><1470 B SAT[4]=1470 C SAT[4]==1470 D SAT[4]<-1470
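
A quick base R sketch of the SAT vector cards, reusing the scores listed earlier in this set:

```r
# Build the vector with c()
SAT <- c(1300, 1120, 1050, 1470, 1350)

# == tests for equality; a single = or <- would assign instead
is_1470 <- SAT[4] == 1470
is_1470

# Arithmetic applies to every element of the vector
SAT_plus_50 <- SAT + 50
SAT_plus_50
```

The comparison returns TRUE and leaves SAT unchanged, whereas SAT[4] <- 1470 would overwrite the fourth score.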

The histogram below shows the distribution of Alcohol with an outlier removed. What is the "count" on the y-axis a count of?

Number of patients

You'd like to divide the patients in the data frame into two equal groups, those who consume relatively low amounts of Cholesterol per day and those who consume relatively high amounts of Cholesterol per day. You want to save this categorization in a variable called CholesterolGroup. What R code could you use to do this? A) NutritionStudy$CholesterolGroup <- ntile(NutritionStudy$Cholesterol, 2) B) ntile(NutritionStudy$Cholesterol, 2) C) arrange(Cholesterol, 2) D) NutritionStudy$CholesterolGroup <- str(NutritionStudy$Cholesterol, 2)

NutritionStudy$CholesterolGroup <- ntile(NutritionStudy$Cholesterol, 2)

A

One researcher wondered if some of the variation in the difference in velocity came from the type of swimmer they were. Triathletes swim in wetsuits more often than competitive swimmers do, and she worried that their experience would influence the results of this study. Above is a faceted histogram of difference in velocity by the Type of athlete. The two vertical lines depict the mean of the swimmer group and the mean of the triathlete group. (The lines on the two graphs line up, so the group means are very similar.) What would be the PRE of this model: Type.model <- lm(SpeedUp ~ Type, data = Wetsuits) A Close to 0 B Close to 1 C Around .5 D I would need to run the supernova() function to be able to say.

B

One student suggests that players who make a lot of free throws (FTMade) are better and they would see more game time. Another student argues that making free throws doesn't make you a better player—having a higher free throw percentage (FTPct) is the sign of a better player, and suggests that would explain the variation in minutes played (Mins). Which of these plots would depict the relationship between Mins and one of these explanatory variables? A Box plots (i.e., gf_boxplot) B Scatter plots (i.e., gf_point) C Faceted histograms (i.e., gf_histogram with gf_facet_grid) D All of the above would be able to depict these relationships equally clearly.

The histogram above shows the distribution of population in millions across states. Which of the following statements are true based on the histogram? (Check all that apply.)

Only a few states have very high populations. The shape of the distribution is skewed to the right.

D

Our Points.model of the outcome variable Mins can be represented as: Yi=b0+b1X1+ei LeBron James scored 2,111 points in the 2011 season. In this equation, what part represents the prediction the Points.model would make for minutes played by LeBron James? A b0 B b1 C b1Xi D b0+b1Xi

B

Perhaps the mother's age (age) could explain some of the variation in birth weight (wt). Above, we have depicted the output of this model: age.model <- lm(wt ~ age, data = Gestation100) What does the .43 mean? (It is the value listed for age in the output.) A This is the correlation between age and wt. B This is the increment to add on to the prediction of wt for every year of mother's age. C This is the increment to add on to the prediction of mother's age for every ounce of wt. D This is the percentage of newborn weights that are predicted accurately by using mother's age in the model.

B

Perhaps the mother's age (age) could explain some of the variation in birth weight (wt). Above, we have depicted the output of this model: age.model <- lm(wt ~ age, data = Gestation100) (intercept = 109.26, age = 0.43) Which equation represents this model with the best-fitting estimates? A Yi = 109.26 + .43 + ei B Yi = 109.26 + .43Xi + ei C Yi = 109.26X1i + .43X2i + ei D Yi = 109.26b0 + .43b1 + ei

Which scatterplot shows a larger value of Pearson's r? (The one whose points fall closer to a line.)

Plot B

Which distribution do the numbers in the confidence interval represent? For example, if we are trying to estimate the confidence interval of the mean, which distribution's mean are we 95% confident lies in this range?

Population

Which of the following would have the same exact value?

Population mean and sampling distribution mean

Which of the following describes the error pointed to by the green label? ei = yi - ŷ r Grand Mean Negative residual Positive residual PRE

Positive residual

B

Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). How would we depict the null model of maximum velocity in a wetsuit on this plot? A A line using the best-fitting estimates from lm(Wetsuit ~ NoWetsuit, data = Wetsuits) B A horizontal line at the mean of Wetsuit C A vertical line at the mean of NoWetsuit D A dot in the middle of this plot

B

Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). Which is the best way to take a look at this relationship (between Wetsuit and NoWetsuit)? response - incorrect A gf_histogram(~ Wetsuit, data = Wetsuits) %>% gf_facet_grid(NoWetsuit ~ .) B gf_point(Wetsuit ~ NoWetsuit, data = Wetsuits) C gf_boxplot(Wetsuit ~ NoWetsuit, data = Wetsuits) D All of these would effectively depict this relationship.

D

Recall that the variable SpeedUp is the difference between swimming with Wetsuit versus NoWetsuit. Why might you want to find the point below which 2.5% of bootstrapped sample means for SpeedUp fall, and the point above which 2.5% of simulated sample means for SpeedUp fall? response - incorrect A The points would help you determine whether the sample would be considered likely to have come from a population with a mean between those points. B It would enable you to calculate the critical distance. C It would enable you to establish the confidence interval. D All of the above.

What will the following code do? xqt(.025, df = 999)

Return t critical for a sample size of 1000
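
xqt() comes from the mosaic package; base R's qt() returns the same quantile without drawing the plot. A minimal sketch, assuming a sample of 1000 (so df = 999). The variable names are illustrative, not from the course materials.

```r
n  <- 1000
df <- n - 1

# Lower-tail cutoff: 2.5% of the t distribution falls below this value
t_lower <- qt(0.025, df = df)

# Upper-tail cutoff: 2.5% of the t distribution falls above this value
t_upper <- qt(0.975, df = df)

t_lower  # about -1.96
t_upper  # about  1.96
```

Note that passing .025 returns the negative (lower-tail) critical value; the positive one is its mirror image.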

The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: Wetsuiti = b0 + b1NoWetsuiti + ei If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, how big is the standard error of the sampling distribution of b1?

Roughly .118 divided by 2

The best-fitting model using AttractiveM to predict LikeM can be specified like this: LikeMi = b0 + b1AttractiveMi + ei If the 95% confidence interval for β1 is 0.7139 plus or minus 0.0814, how big is the standard error of the sampling distribution of b1?

Roughly 0.0814 divided by 2
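
The two standard-error cards above rely on the rule of thumb that a 95% margin of error is about two standard errors, so halving the margin recovers the standard error. A rough sketch in base R; the variable names are illustrative, and the exact multiplier depends on the degrees of freedom.

```r
# Margins of error quoted on the cards (95% confidence intervals for the slope)
margin_wetsuit <- 0.118   # m/sec, Wetsuit ~ NoWetsuit
margin_likem   <- 0.0814  # LikeM ~ AttractiveM

# Rule of thumb: margin of error is roughly 2 standard errors
se_wetsuit <- margin_wetsuit / 2
se_likem   <- margin_likem / 2

se_wetsuit  # 0.059
se_likem    # 0.0407
```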

Assuming they have a vector called SAT.2017 that includes all test scores, how would they add 50 points to each score in the vector?

SAT.2017 + 50

A

Sampling distributions are good tools for visualizing which of the following ideas: a. sampling variability b. measurement error c. individual differences d. sums of squares

How can sampling distributions help us interpret our data?

Sampling distributions give us a way to assess whether a relationship we've observed in our data is likely to have occurred just by chance.

Think back to the vector called SAT. Let's imagine that the student who earned the fourth score in this vector would like to know her score. Now imagine that the same student simply wanted to know whether her original score was a 1470. How would you get her answer?

SAT[4] == 1470

What is a scatterplot commonly used for?

Show the relationship between an outcome variable and an explanatory variable, e.g., gf_point(Thumb ~ Sex, data = Fingers)

D

SpeedUp contains the swimming velocity with Wetsuit minus NoWetsuit. Could these differences in swimming velocity be normally distributed in the population? A No, that is not possible, because this distribution shows that the data are not clustered in the center. B No, that is not possible, because the sample was too small (n = 12), so the population could not have been normally distributed. C No, that is not possible, because this variable was created by subtracting two measurements. D It is possible.

What is the difference between Standard Deviation and Standard Error?

Standard Error applies to a sampling distribution; Standard Deviation applies to sample or population distributions.

If you want to use R to get the sum of 10 and 20, what code would you write? (Check all that apply.)

sum(10 + 20), sum(10, 20), and 10 + 20

What will the following code produce? SAT <- c(1300,1120,1050,1470,1350) First.Score <- SAT[1] Second.Score <- SAT[2] First.Higher <- First.Score > Second.Score First.Higher

TRUE
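
The SAT vector cards can be tried together in base R. The sketch below reuses the SAT values from the card above; SAT.corrected is a hypothetical name added for illustration.

```r
SAT <- c(1300, 1120, 1050, 1470, 1350)

# Indexing: the fourth score in the vector
SAT[4]             # 1470

# Logical comparison: was the fourth score a 1470?
SAT[4] == 1470     # TRUE

# Comparing two stored scores, as on the card above
First.Higher <- SAT[1] > SAT[2]
First.Higher       # TRUE

# Vectorized arithmetic: add 50 points to every score at once
SAT.corrected <- SAT + 50
SAT.corrected      # 1350 1170 1100 1520 1400
```

R is case sensitive, so sat[4] or Sat[4] would fail with an "object not found" error unless objects with those names exist.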

A

Take a look at the model that we fit in this output. How would we represent this number with General Linear Model (GLM) notation? Intercept = 0.0775 response - correct A b0 B b1 C β0 D β1

What is a correlational study or observational study?

Taking a random sample from a population, and then measuring some variables.

A

The College Board discovered a mistake! All of their tests administered in 2017 were scored 50 points lower than they should have been. Assuming they have a vector called SAT.2017 that includes all test scores, how would they add 50 points to each score in the vector? A SAT.2017 + 50 B SAT.2017 <- add(50) C sum(SAT,50) D sum(SAT+50)

If the distribution of NoWetsuit was more variable (that is, has a greater standard deviation) than the distribution of Wetsuit, what would be true about the confidence interval of NoWetsuit compared to Wetsuit?

The NoWetsuit confidence interval would be wider.

d

The PRE from the total.model of median income is .0114. Which of the following lines of code would let us see the distribution of PREs if there were no relationship between median earnings and total graduates? response - correct A SDoPRE <- do(10000) * PRE(shuffle(median_income) ~ totalgrads, data = collegegrads) B SDoPRE <- do(10000) * PRE(median_income ~ shuffle(totalgrads), data = collegegrads) C SDoPRE <- do(10000) * PRE(resample(median_income) ~ totalgrads, data = collegegrads) D All of the above would create a sampling distribution of PREs based on the empty model.

Where does the code run?

The R Console

We were interested in whether the pulse groups also explain some of the variation we see in students' number of piercings. Here is the supernova table for that model. Why does this table have a smaller Sum of Squares Total (1699) than the supernova table for Exercise explained by Pulse3Group (11864)?

The SS Total depends on the variation in the outcome variable. Piercings is a different outcome variable so it has a different SS Total.

C

The average Distance of this person's bike commute is just over 27 miles. Imagine that you've discovered that one of your observations has been recorded incorrectly. Instead of a distance of around 27 miles, the distance for one of the commute has been entered as 54 miles! You make the correction to your data frame. How will the correction affect the mean? A The mean will be unaffected by the correction. B The mean will be higher because of the correction. C The mean will be lower because of the correction. D It's impossible to say how the mean will be affected by the correction.

A

The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: Wetsuiti = b0 + b1NoWetsuiti + ei If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, how big is the standard error of the sampling distribution of b1? response - incorrect A Roughly .118 divided by 2 B Roughly .118 divided by square root of 12 C Roughly .9547 divided by 2 D Roughly .9547 divided by square root of 12

C

The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: Wetsuiti = b0 + b1NoWetsuiti + ei If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, which of the following is NOT a correct interpretation? response - incorrect A We are 95% confident that the true slope of the DGP will be in this range. B There is a 95% chance that if you repeated this experiment with a different set of swimmers, the slope of the regression line will fall within this confidence interval. C 95% of all Wetsuit velocities have this relationship with the NoWetsuit velocity. D The true parameter (β1) will very likely fall inside this interval.

A

The best-fitting model using NoWetsuit velocity to predict Wetsuit is this: Yi=0.1423+0.9547Xi+ei How should we interpret 0.9547? response - correct A The increment to add on to the prediction of Wetsuit for every 1 m/sec of NoWetsuit B The increment to add on to the prediction of Wetsuit to each person's NoWetsuit C The difference between average Wetsuit and NoWetsuit D The prediction for Wetsuit when NoWetsuit is 0

D

The data frame is currently organized alphabetically by lake. What if you'd like to see it ordered by average mercury level, with the most polluted lake appearing first on the list? Save the result back into FloridaLakes. A FloridaLakes <- sample(FloridaLakes, desc(AvgMercury)) B FloridaLakes <- sum(AvgMercury) C FloridaLakes <- tally(~ NumSamples, AvgMercury) D FloridaLakes <- arrange(FloridaLakes, desc(AvgMercury))

If the researchers are interested in whether wearing wetsuits affects swimming velocity, what is the outcome variable of interest?

The difference in velocity between swimming in NoWetsuit compared to Wetsuit.

In your study, you tested two types of energy drinks (SuperBuzz and StayFocused). You found that students who consumed SuperBuzz rated themselves as more alert on average than did those who drank StayFocused. Your roommate suspects that you are being fooled by chance (also called Type 1 error). What's her concern?

The difference you found was the result of sampling variation.

Above we have included the ANOVA tables for Wetsuit = NoWetsuit + other stuff. Which distance is the basis of the SS Error?

The distance between data points and the NoWetsuit model's prediction.

Below is a boxplot of Calories consumed per day by Gender. On the right you see the distribution for males. The two rectangles that compose the "box" portion of the plot have different heights. What does that mean?

The distribution of Calories consumed by males is skewed.

What does the law of large numbers have to do with randomness?

The distribution of a random event will eventually be perfectly regular.

What happens in a path diagram?

The explanatory variable "points to" the outcome variable.

What is the explanatory variable?

The explanatory variables are the variables we use to explain variation in the outcome variable

What would this code show us? SDob1 <- do(100000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12)) SDob1 <- arrange(SDob1, desc(b1)) SDob1$b1[2500]

The highest population increment that could have produced our sample and it would still be considered likely

Suppose we construct a sampling distribution of bootstrapped slopes for Points vs. Age. bootSDob1 <- do(1000) * b1(Points ~ Age, data = resample(NBAPlayers2011, 176)) bootSDob1$b1[25] What does this 25th value tell us?

The highest population increment that could have produced our sample and it would still be considered likely.

The best-fitting model using NoWetsuit velocity to predict Wetsuit is this: Yi = 0.1423 + 0.9547Xi + ei How should we interpret 0.9547?

The increment to add on to the prediction of Wetsuit for every 1 m/sec of NoWetsuit

B

The mean maximum swim velocity when wearing a wetsuit (i.e., Wetsuit) is 1.51 m/sec. If the critical distance is 0.08 m/sec, what's the range of possible values within which you're 95% confident that actual population mean would fall? response - correct A 1.47 m/sec to 1.55 m/sec B 1.43 m/sec to 1.59 m/sec C It depends on the standard deviation of the sample of Wetsuit. D It depends on the standard deviation of the population of Wetsuit.
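
The interval in the card above is just the sample mean plus or minus the critical distance. A small worked sketch in base R, using the numbers from the card (variable names are illustrative):

```r
sample_mean       <- 1.51  # m/sec, mean of Wetsuit
critical_distance <- 0.08  # m/sec, margin of error

# 95% confidence interval: mean minus and plus the critical distance
ci <- c(lower = sample_mean - critical_distance,
        upper = sample_mean + critical_distance)

ci  # lower 1.43, upper 1.59
```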

A

The mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6. What part of this GLM notation represents 23.6? Yi = Ȳ + ei A Yi B Ȳ C ei D None of the above

c

The mean of the distribution of median earnings in this data is $40,151.45. Is this a good estimate of the median earnings of college graduates in the United States? response - correct A Yes because the mean is an estimate that balances the residuals. B No because the mean is a terrible estimate of a median. C No because the observations in this data set are majors, not individual graduates. D No because some people (e.g., famous college dropouts like Mark Zuckerberg and Bill Gates) earn a lot more than college graduates.

If you bootstrap a sampling distribution based on your sample of data, what will be the mean of the bootstrapped distribution?

The mean of your sample

In a jitter plot, what tells you how many people there are with a particular thumb length?

The number of dots in a row

For which scatterplot does the slope of the regression line equal the correlation coefficient? gf_point(Salary ~ Age, data = SalaryGender, size = 4, color = "black")%>% gf_lm() gf_point(zSalary ~ zAge, data = SalaryGender, size = 4, color = "firebrick")%>% gf_lm() Both of them. Neither of them. The one on the left. The one on the right.

The one on the right.
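
The reason the standardized plot is the right answer can be checked numerically: when both variables are converted to z-scores, the regression slope equals Pearson's r. A sketch in base R with simulated data; the Age and Salary values below are made up for illustration and are not the SalaryGender data.

```r
set.seed(3)

# Hypothetical data, generated only for this illustration
Age    <- round(runif(30, 25, 65))
Salary <- 20 + 0.9 * Age + rnorm(30, sd = 8)

# Standardize both variables (z-scores)
zAge    <- as.numeric(scale(Age))
zSalary <- as.numeric(scale(Salary))

raw_slope <- coef(lm(Salary ~ Age))[["Age"]]     # slope in original units
z_slope   <- coef(lm(zSalary ~ zAge))[["zAge"]]  # slope in z-score units
r         <- cor(Salary, Age)                    # Pearson's r

c(raw_slope = raw_slope, z_slope = z_slope, r = r)
```

The raw slope generally differs from r because it carries the units of Salary per year of Age; only the standardized slope matches.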

What is an outcome variable?

The outcome variable is the variable whose variation we are trying to explain.

D

The output of favstats(~ wt, data = Gestation100) is shown above. If they had collected a different sample of 100 newborns, what value would be different? response - correct A Max B Mean C Median D Most likely, all of the above

What is random assignment?

The process of deciding, without bias, which members of your sample will be placed in each experimental condition.

When we calculate the residuals from both the empty model and the complex model, what is similar about these two sets of residuals?

The residuals represent the difference between the data and the model's prediction.

C

The same student has a tendency to come back repeatedly to ask the same question. With that in mind, you decide to save the answer above to an R object so you can readily call it up later. You decide to call the R object annoyingstudent. Which line of code would create that object? A annoyingstudent <- SAT[4]><1470 B annoyingstudent <- SAT[4]=1470 C annoyingstudent <-SAT[4]==1470 D SAT[4]=1470 <- annoying student

B

The sampling distribution of means above was created with this code: SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd)) gf_histogram(~ mean, data = SDoM, fill = "burlywood") If you were to stack up all the bars, what would the total be? response - incorrect A 100 B 10,000 C The number of objects in the population D The sum of all the values in this distribution

A

The sampling distribution of means above was created with this code: SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd)) gf_histogram(~ mean, data = SDoM, fill = "burlywood") Which of the following is a true statement? response - correct A The standard error of this distribution is smaller than wt.stats$sd. B The standard error of this distribution is larger than wt.stats$sd. C The standard error of this distribution is approximately equal to wt.stats$sd. D It's impossible to know from this information alone.

D

The sampling distribution of means above was created with this code: SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd)) gf_histogram(~ mean, data = SDoM, fill = "burlywood") Someone tells you that their baby was 116 ounces at birth. What is the likelihood of a baby having a birth weight of 116 or lower in the population? response - incorrect A Probably close to 0 because such a low birth weight is in the lowest tail of this sampling distribution. B Probably close to .05 because this is a normally distributed sampling distribution. C I could find out by running this R code: tally(~mean <= 116, data = SDoM, format = "proportion"). D I am not sure because I cannot tell the likelihood of a single baby having such a birth weight from this sampling distribution of means.

If we wanted to know the range within which 95% of individual scores should fall, which distribution would we need to create a model of?

The sampling distribution of sample means

Where do you type in code?

The script window

If you want to know if a regression model is better than a simple model in terms of making a prediction, what parameter should you make a sampling distribution of?

The slope

What is standard error?

The standard deviation of the sampling distribution

What is standard error?

The standard deviation of the sampling distribution of an estimate

The sampling distribution of means above was created with these three lines of code: SincereF.stats <- favstats(~ SincereF, data = SpeedDating) SDoM <- do(10000) * mean(rnorm(272, SincereF.stats$mean, SincereF.stats$sd)) gf_histogram(~ mean, data = SDoM, color = "darkorchid4", fill = "darkorchid1") What is true about the standard error of this distribution?

The standard error of this distribution is smaller than SincereF.stats$sd.
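
The claim that the standard error is smaller than the standard deviation can be checked by simulation. The sketch below uses base R's replicate() in place of mosaic's do(), and the population mean and SD are assumed values chosen for illustration, not the SincereF or wt statistics.

```r
set.seed(1)

pop_mean <- 120  # assumed population mean (illustrative)
pop_sd   <- 18   # assumed population standard deviation (illustrative)
n        <- 100  # size of each simulated sample

# Simulate 2000 samples and keep the mean of each one
sample_means <- replicate(2000, mean(rnorm(n, pop_mean, pop_sd)))

se_sim    <- sd(sample_means)  # spread of the sampling distribution
se_theory <- pop_sd / sqrt(n)  # 1.8

c(se_sim = se_sim, se_theory = se_theory, pop_sd = pop_sd)
```

Because each mean averages over n observations, the sampling distribution is narrower than the distribution of individual scores by a factor of about sqrt(n).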

C

The study included 53 lakes in Florida. Where is this information in the data frame? A One of the values in the data frame B The number of variables in the data frame C The number of rows of data D The number of columns of data

D

The sum of squares gets larger as A The variation increases B The sample size increases C The spread of the distribution increases D All of the above

Imagine we drew two random samples from a population, and measured each case sampled on the same outcome variable. One sample had an n = 30, the other an n = 60. Which of the following statements is true?

The sum of squares of the larger sample would almost certainly be greater than the sum of squares of the smaller sample.

If we used this code to fit the empty model: Empty.model <- lm(Salary ~ NULL, data = SalaryGender) And then used the predict() function to make a prediction for each teacher's salary, what value would it predict for each teacher?

The value would be the mean salary of this sample and would be the same for each teacher.
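
A quick check that the empty model predicts the mean for every case. This sketch uses hypothetical salaries rather than the real SalaryGender data, and writes the model as Salary ~ 1, the usual base-R spelling of an intercept-only model.

```r
# Hypothetical salaries (in thousands); not the real SalaryGender data
SalaryGender <- data.frame(Salary = c(42, 55, 61, 48, 70))

# Empty (intercept-only) model: a single parameter, b0
Empty.model <- lm(Salary ~ 1, data = SalaryGender)

# One prediction per teacher, and they are all the same value
predictions <- predict(Empty.model)
predictions

mean(SalaryGender$Salary)  # 55.2
```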

B

The variable Pres2008 is categorical. It indicated whether it was McCain or Obama who won the state in the 2008 election. Which is the more appropriate visual representation for this data? A Histogram B Bar graph C Both are equally appropriate. D Neither are appropriate for a categorical variable.

In the jitter plot below, we've put a green box around a dense row of data and a red box around a less dense row of data. What does the density of dots represent?

There are a lot of individuals who have the same value on the y-axis.

A

There are actually five experiments in this dataset (all the same experiment run a few different times). First, we're going to look at just the data from Experiment 3. Which of the following commands would create a new data set called Exp3, which contains only the data from the third experiment (HINT: you can try out different commands until you find one that creates what you want). A. Exp3 <- filter(morley, Expt == 3) B. Exp3 <- select(morley, Expt == 3) C. Exp3 <- sample(morley, sum(Expt == 3)) D. Exp3 <- resample(morley, sum(Expt == 3))

The plot above shows males' liking of females (LikeM) as a function of whether or not they want to date them again (DecisionM, Yes or No). What does the plot show?

There were females that males liked a lot, but with whom they did not want to go out on another date.

B

Think back to the vector called SAT. Let's imagine that the student who earned the fourth score in this vector would like to know her score. You try sat[4], but get an error. What did you do wrong? A Because sat isn't numeric, it needs quotes around it. B Because R is case sensitive, SAT needs to be capitalized. C SAT needs to be capitalized and in quotes. D That shouldn't have produced an error.

What's the purpose of generating a sampling distribution of means of SpeedUp by resampling (also called bootstrapping)?

This distribution can help you quantify how much your best estimate of the population mean could vary.

SpeedUp contains the swimming velocity with Wetsuit minus NoWetsuit. The histogram above was created with this code: gf_histogram(~ SpeedUp, data = Wetsuits, bins = 6, fill = "black", color = "royalblue1", alpha = .8) How would you modify this code to look at the distribution of swimmers that were faster with a wetsuit and those that were faster with no wetsuit?

This histogram shows that all swimmers were faster with a wetsuit.

d

This is the distribution of median earnings for different college majors. What makes the mean a good model of this distribution? response - correct A It's your only option, because you can't take the median of a group of medians. B The mean is always the best model for a skewed distribution. C The mean is the best statistic because we can simulate sampling distributions of the mean. D The mean is a model that spends only one degree of freedom and minimizes squared error.

Continuing with Age.model, which uses Age to explain variation in Points, Age.model <- lm(formula = Points ~ Age, data=NBAPlayers2011) Age.model What does the -9.154 mean?

This is the increment to add to the prediction of Points for every year of an NBA player's Age. -9.154 is the right coefficient

Why do some players get more playing time and some players see less game time? Let's take a look at a histogram of number of minutes played to start exploring this variation. What does the orange curve drawn on this histogram represent?

This represents a normal distribution that was fit to the mean and standard deviation of this data.

B

To examine the distribution of Happiness, which would be more useful? A A tally B A histogram C In this case, either would be useful. D In this case, neither would be useful.

What is type 1 error?

Type I error is when we conclude that some variable we manipulated—the smiley face in this case—had an effect when in fact the observed difference was simply due to random sampling variation.

If we made a histogram of 211 random numbers generated from a computer or 20-sided die, what would the resulting distribution look like? What shape would you expect?

Uniform

How do you visualize the relationship between two variables?

Using the fill argument, e.g., gf_histogram(~ Thumb, data = Fingers, fill = ~Sex)

a

Using the code below, we created a model to predict median earnings using STEM as the explanatory variable, and then to construct 95% confidence intervals around the parameter estimates. The output is pictured above. STEM.model <- lm(median_income ~ STEM, data = collegegrads) confint(STEM.model) If we repeated this study but our sample size was larger and thus our standard error was smaller, what would be different about the confidence interval? response - incorrect A It's likely that the 95% CI of b1 would be smaller. B It's likely that the 95% CI of b1 would be larger. C There is no way to tell because standard error is not related to confidence intervals. D The confidence interval would stay the same as long as the confidence level is the same.

What is within-group variation?

Variation among members of the same group

What happens in a word equation?

Variation in the outcome is explained by the explanatory.

Where on this boxplot would you look to see evidence of within-group variation in fat consumed?

Vertically, within each boxplot

B

Wanting to see "MyUniversity" in the R Console, you've just run the following command: print(MyUniversity). However, R returned an error message. What's the correct command, if you want to print "MyUniversity"? A Print(MyUniversity) B print("MyUniversity") C #"MyUniversity" D Print "MyUniversity"

A

We are going to try and explain variation in Exercise hours with cardiovascular health (Pulse3Group). Assume our model is the following: Exercise = Pulse3Group + other stuff If we write the model in GLM notation, what does Yi represent? A Each person's value for Exercise B The average Exercise for all participants C The deviation between each person's Exercise and the average Exercise for all participants D It might be any of the above, depending on which interpretation you're using.

C

We are going to try and explain variation in Exercise hours with cardiovascular health (Pulse3Group). Assume our model is the following: Exercise = Pulse3Group + other stuff If we write the model in GLM notation, which equation represents this Pulse3Group model? A Yi = b0 + ei B Yi = b0 + b1Xi + ei C Yi = b0 + b1X1i + b2X2i + ei D Pulse3Groupi = b0 + b1Exercisei + ei

B

We can calculate the residuals from both the empty model and the complex model. What is similar about these two sets of residuals? A The values of the residuals from the empty model will be the same as the values of the residuals from the complex model. B The residuals represent the difference between the data and the model's prediction. C The residuals represent the difference between the data and the Grand Mean. D In both cases, the residuals can be reduced to near 0 simply by being careful with measurement and data entry.

C

We fit a model of Mins predicted by FTMade and called it FTMade.model (the output is below). If you know a player had 0 free throws, how many minutes would you predict he played? intercept = 1552.309 and FTMade = 2.834 A 0 B 2.83 C 1662.31 D 1662.31 + 2.83

D

We fit a model of Mins predicted by FTMade and called it FTMade.model (the output is below). We also fit a model of Mins predicted by Points (points scored) and called it Points.model (output below). From these best-fitting parameters, can we tell which model explains more variation: FTMade.model or Points.model? A Yes, FTMade.model is a better model, because the increment of time added on per free throw made is larger than the increment of time added on per point scored. B Yes, FTMade.model is a better model because the intercept is larger than the intercept for Points.model. C No, we should never compare models that have different explanatory variables because they are in different units. D No, we cannot tell from the best-fitting estimates how much variability has been explained by a model.

C

We found the best-fitting estimates and put them into the FTMade.model of the outcome variable Mins: Yi=1662.31+2.83Xi+ei LeBron James played 3,063 minutes, scored 2,111 points, and made 758 free throws in the 2011 season. What is the FTMade.model's prediction for minutes played by LeBron James? A 3063 B 1662.31 C 1662.31 + 2.83*758 D 1662.31 + 2.83*758 + 2111
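
The prediction in the card above can be worked out by plugging the free throws made into the fitted equation. A sketch in base R using the numbers from the card; the variable names are illustrative.

```r
b0 <- 1662.31  # intercept of FTMade.model
b1 <- 2.83     # increment in minutes per free throw made

ft_made     <- 758   # LeBron James's free throws made
actual_mins <- 3063  # minutes he actually played

# Model prediction: only FTMade enters the model, not points scored
predicted_mins <- b0 + b1 * ft_made
predicted_mins  # 3807.45

# Residual: actual minus predicted
residual_mins <- actual_mins - predicted_mins
residual_mins   # -744.45
```

The points scored (2,111) are not part of this model, so they play no role in the prediction.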

D

We have quantified error from the FTMade.model of Mins and the Points.model of Mins by using the supernova() function. Which of the following are reasons to think that the Points.model is better than the FTMade.model? A The FTMade.model's PRE is less than the Points.model's PRE. B The FTMade.model's SS model is less than the Points.model's SS model. C The Points.model's SS error is less than the FTMade.model's SS error. D All of the above

C

We looked only at the majors that were considered STEM majors. boxplot(median_income ~ major_category, data = STEMonly) major_category.model <- lm(median_income ~ major_category, data = STEMonly) There are 4 major categories. How would you specify the major_category.model using GLM notation? A Yi=b1X1i+b2X2i+b3X3i+b4X4i+ei B Yi=b1X1i+b2X2i+b3X3i+ei C Yi=b0+b1X1i+b2X2i+b3X3i+ei D Yi=b0+b4X4+ei

A

We made the variable smoke into a factor and saved it as smoke.factor. Then we fit a model that used smoke.factor to explain variation in wt. All the R code for this is shown below. Gestation100$smoke.factor <- factor(Gestation100$smoke, levels = c(0:3), labels = c("never", "smokes now", "until current pregnancy", "once did, not now")) smoke.factor.model <- lm(wt ~ smoke.factor, data = Gestation100) Which GLM equation specifies the smoke.factor.model? response - correct A Yi = b0 + b1X1i + b2X2i + b3X3i + ei B Yi = b0 + b1X1i + b2X2i + b3X3i + b4X4i + ei C Yi = b0X0i + b1X1i + b2X2i + b3X3i + ei D Yi = b0 + b3X3i + ei

B

We made the variable smoke into a factor and saved it as smoke.factor. Then we fit a model that used smoke.factor to explain variation in wt. The ANOVA table for this model is shown above. Why is the degrees of freedom for the Model 3? response - correct A Because there are three parts of the model: Model, Error, and Total. B Because there were three additional parameters estimated in the smoke.factor model compared to the empty model. C Because there were three variables involved in this study: smokers, nonsmokers, and weight. D Because there were three data points that we excluded to fit this model.

A

We made this scatterplot to explore the idea that free throw percentage (FTPct) would predict how many minutes a player gets to play. If we fit an empty model to this data, how would we depict it on this plot? response - correct A A horizontal line that shows the mean for minutes played. B A vertical line that shows the mean free throw percentage. C A diagonal line that bisects the cloud of points. D You would not be able to represent the empty model visually because it is a single number.

b

We ran this code to calculate the residuals from the STEM model. STEM.model <- lm(median_income ~ STEM, data = collegegrads) collegegrads$resid <- resid(STEM.model) Interpret the residual (-7.86) for the Molecular Biology major. response - correct A This major's median earning is about $8,000 less than the Grand Mean. B This major's median earning is about $8,000 less than the STEM model's predicted median earning for this major. C The STEM model would predict that graduates of this major make $8,000 less than the average major. D The STEM model's prediction is $8,000 less for this major than the empty model's prediction.

D

We use this code to calculate the correlation coefficient (Pearson's r) for Mins and Points: cor(Mins ~ Points, data = NBAPlayers2011) What have we found? A A measure of how tight the data points are around the regression line B The slope of the regression line between the standardized Mins and Points C The strength of a bivariate relationship D All of the above

Which of the following R code would create a new variable called SpeedUp that contains the difference between swimming with Wetsuit versus NoWetsuit?

Wetsuits$SpeedUp <- Wetsuits$Wetsuit - Wetsuits$NoWetsuit

A

What R code created the line that indicates the mean? A gf_vline(xintercept = 33.6, color = "blue") B gf_mean(mean = 33.6, color = "blue") C gf_mean(33.6, color = "blue") D None of the above

D

What R code will output the standard deviation for hwy? A sd(mpg$hwy) B sqrt(var(mpg$hwy)) C favstats(~ hwy, data = mpg) D All of the above
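
The first two options in the card above can be compared directly in base R. The sketch uses a short hypothetical vector in place of mpg$hwy (mpg ships with ggplot2); favstats() from the mosaic package also reports sd among its summary statistics.

```r
# Hypothetical highway mileage values standing in for mpg$hwy
hwy <- c(29, 29, 31, 30, 26, 26, 27)

sd_direct   <- sd(hwy)         # standard deviation directly
sd_from_var <- sqrt(var(hwy))  # square root of the variance

sd_direct
sd_from_var

# Both routes give the same number
all.equal(sd_direct, sd_from_var)
```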

B

What R code would create a distribution of Smokers? A histogram(Smokers, USStates) B gf_histogram(~ Smokers, data = USStates) C histogram ~ Smokers D gf_histogram ~ Smokers

D

What R code would produce a relative frequency histogram of PhysicalActivity? A gf_histogram(~ PhysicalActivity, data = USStates) B gf_relativehist(~ PhysicalActivity, data = USStates) C gf_densityhist(~ PhysicalActivity, data = USStates) D gf_histogram(..density..~ PhysicalActivity, data = USStates)

C

What R code would you use to fit the empty model for TopSpeed? A gf_histogram(NULL ~ TopSpeed, data = BikeCommute) B NULL(TopSpeed, data = BikeCommute) C lm(TopSpeed ~ NULL, data = BikeCommute) D gf(TopSpeed ~ NULL, data = BikeCommute)

B

What does the orange (normal) curve drawn on this histogram represent? A This represents the population from which these data were drawn randomly. B This represents a normal distribution that was fit to the mean and standard deviation of this data. C This represents a normal curve that shows the 95% of data points that lie within two standard deviations of the mean. D This is another way of representing the sample distribution.

A

What gives you the output of the 5 first values of all the variables? A head() B str() C c() D sort()

A

What is the Bonferroni adjustment used for? (a) Adjusting Type I Error rate when we do multiple comparisons (b) Correcting for bias in a confidence interval (c) Accounting for variance explained while adjusting for degrees of freedom (d) Adjusting variance estimates to take into account measurement error

A

What is the purpose of using the shuffle method as compared to bootstrapping? A. Shuffling provides an estimated sampling distribution assuming the simple model is true B. Shuffling provides an estimated sampling distribution assuming the complex model is true C. Shuffling maintains the original range of the data, whereas bootstrapping does not D. There is no advantage of using shuffling over bootstrapping

A

What kind of distribution would this code create? do(10000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12)) A A sampling distribution of bootstrapped slopes B A sampling distribution of means C A sampling distribution of the mean difference between Wetsuit and NoWetsuit D The population distribution that our sample could have come from

C

What notation can be used to represent the mean of the population? A β0 B μ C Both of the above D None of the above

D

What notation cannot be used to represent the mean of the sample? A Yi B β0 C μ D All of the above

B

What will happen in R if you run: print("StatsCourse")? A R will send "StatsCourse" to your local printer. B R will display: "StatsCourse". C R will show the full data file names "StatsCourse". D R will return an error message. R is a programming language, not a printer interface.

B

What will the following code do? resample(Wetsuits, 12) A Take a new sample from the population of Wetsuits B Create a new sample from the observations in Wetsuits C Create a new sample from Wetsuits that is the same as the original sample D Select a random observation from the 12 observations in Wetsuits

A

What will the following code do? xqt(.025, df = 999) A Return t critical for a sample size of 1000 B Return a square root C Return the confidence interval D Return the percentage of data points that fall below .025, given df = 999
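A small sketch of the same lookup in base R. It assumes that qt() returns the same quantile that xqt() reports (xqt() comes from an add-on package and also draws a plot); the variable names are hypothetical.

```r
n <- 1000                         # sample size, so df = n - 1 = 999

# Cutoff below which 2.5% of the t distribution falls.
t_crit_lower <- qt(0.025, df = n - 1)

# The t distribution is symmetric, so the upper cutoff is the mirror image.
t_crit_upper <- -t_crit_lower

c(t_crit_lower, t_crit_upper)
```

With df this large the cutoffs sit close to the familiar normal value of about 1.96.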

B

What will the following code produce? SAT <- c(1300,1120,1050,1470,1350) First.Score <- SAT[1] Second.Score <- SAT[2] First.Higher <- First.Score > Second.Score First.Higher A 180 B TRUE C 1300 D First.Score > Second.Score

D

What would the following R code do, beyond creating a histogram? gf_histogram(~ College, data = USStates) %>% gf_labs(title = "Distribution of Residents with College Degrees", x = "Percentage") A Specify the data frame and specify the variable to be used on the x-axis. B Give the histogram a title and color the histogram red. C Create a new data frame with distribution of residents with college degrees (in percent). D Give the histogram a title and specify the label for the x-axis.

B

What would this code show us? SDob1 <- do(100000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12)) SDob1 <- arrange(SDob1, desc(b1)) SDob1$b1[2500] A The wetsuit velocity of the top 2500 swimmers in the population B The highest population increment that could have produced our sample and it would still be considered likely C The lowest population mean that could have produced our sample and it would still be considered likely D The highest population mean that could have produced our sample and it would still be considered likely

C

What's the purpose of generating a sampling distribution of means of SpeedUp by resampling (also called bootstrapping)? A These values can supplement your existing data if your sample is too small. B This distribution can confirm the accuracy of your calculated sample mean of SpeedUp. C This distribution can help you quantify how much your best estimate of the population mean could vary. D Bootstrapping eliminates the element of chance from the sampling process.

B

What's the value in using the t distribution? A It's less variable than the normal distribution. B It works well as a model of the sampling distribution if the sample size is small, or standard deviation of the population is unknown. C It helps us determine the degrees of freedom from our data. D It works well as a model of the population if the sample size is small, or standard deviation of the population is unknown.

D

What's true about data? A They require that you've selected a sample. B They are the result of measurement. C They represent something about the world. D All of the above

C

What's true of sampling variation? A It's almost purely theoretical. We rarely encounter it. B It leads to bias. C Because of it, no sample will perfectly reflect the population. D All of the above

D

What's true of the distribution of any variable, if your model is the mean of that variable? A The distribution of the variable is more narrow than the distribution of its residual. B The distribution of the variable is always centered on a number lower than is the distribution of its residual. C The distribution of the variable is always centered on 0, whereas the center of the distribution of its residual is unpredictable. D The distribution of the variable is the same shape as the distribution of its residual.

If you simulate 10,000 samples of 176 NBA players and the points they scored in the 2010-2011 NBA season, then calculate the mean of each sample, and then plot the resulting distribution of means in a histogram, what will be the mean of this sampling distribution?

Whatever mean you set when you ran the simulation

You're interested in females' ratings of males' intelligence. You simulate 500 samples of 276 ratings, calculate the mean of each sample, and plot the resulting distribution of means in a histogram. What will be the mean of this sampling distribution?

Whatever mean you set when you ran the simulation
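The two simulation cards above can be sketched in base R. The population mean and standard deviation below are made-up values chosen only for illustration, and rnorm() assumes a normal population.

```r
set.seed(1)                      # make the simulation repeatable

pop_mean <- 50                   # hypothetical mean set for the simulation
pop_sd   <- 10                   # hypothetical standard deviation

# Draw 10,000 samples of 176 values each and keep the mean of every sample.
sample_means <- replicate(10000, mean(rnorm(176, mean = pop_mean, sd = pop_sd)))

# The center of the sampling distribution lands very near pop_mean.
mean(sample_means)
```

The individual sample means vary, but their average settles on whatever mean was set for the simulated population.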

B

When Pulse3Group is included in our model to explain variation in Exercise, how is error from this more complex model calculated? A The deviation of each person's Exercise from the Grand Mean for Exercise B The deviation of each person's Exercise from the mean Exercise of their Pulse3Group C The deviation of each Pulse3Group's mean to the Grand Mean for Exercise D None of the above

B

When calculating PRE which two sums of squares are used for this calculation? A. SSModel & SSError B. SSModel & SSTotal C. SSError & SSTotal

A

When calculating the F-ratio which two sums of squares are used for this calculation? A. SSModel & SSError B. SSModel & SSTotal C. SSError & SSTotal
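The two calculations above can be worked by hand in base R. This is a sketch on the built-in mtcars data frame (a stand-in for the course data); the variable names are hypothetical.

```r
# Fit a one-predictor model on a built-in data frame.
model <- lm(mpg ~ wt, data = mtcars)

ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # error from the empty model
ss_error <- sum(resid(model)^2)                     # error left in the complex model
ss_model <- ss_total - ss_error                     # error reduced by the model

# PRE uses SS Model and SS Total.
pre <- ss_model / ss_total

# F uses SS Model and SS Error, each divided by its degrees of freedom.
df_model <- 1
df_error <- nrow(mtcars) - 2
f_ratio  <- (ss_model / df_model) / (ss_error / df_error)

c(PRE = pre, F = f_ratio)
```

Both values should match what summary(model) reports as R-squared and the F-statistic.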

a

When we calculated the best fitting estimates, we found a negative slope (b1 = −.01989). How can we find the confidence interval around this slope? A Bootstrap b1s by resampling from this sample and finding the cut off for the highest and lowest 2.5% of b1s B Randomize b1s by shuffling the totalgrads values that go with median_incomes values and find the highest and lowest 2.5% of b1s C confint(empty.model) D All of the above
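A rough base R sketch of the bootstrapping approach in option A. It uses the built-in swiss data frame as a stand-in (the data behind this card is not available here), and sample() with replace = TRUE plays the role of resample().

```r
set.seed(2)

# Refit the model on 2,000 bootstrap samples and keep each slope (b1).
boot_b1 <- replicate(2000, {
  rows <- sample(nrow(swiss), replace = TRUE)   # resample rows with replacement
  coef(lm(Infant.Mortality ~ Education, data = swiss[rows, ]))[["Education"]]
})

# Cut off the lowest and highest 2.5% of bootstrapped slopes.
ci <- quantile(boot_b1, c(0.025, 0.975))
ci
```

The interval can be compared with confint() on the fitted (not the empty) model; the two should be similar but not identical.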

A

When you add an explanatory variable to your model, what should be the effect on the Sum of Squares from the empty model? A It should remain unchanged. B It should go up. C It should go down. D It depends on how much variation is accounted for by the explanatory variable.

B

Which of the following R code would create a new variable called SpeedUp that contains the difference between swimming with Wetsuit versus NoWetsuit? A SpeedUp = Wetsuit - NoWetsuit B Wetsuits$SpeedUp <- Wetsuits$Wetsuit - Wetsuits$NoWetsuit C SpeedUp(Wetsuits$Wetsuit - Wetsuits$NoWetsuit) D SpeedUp$Wetsuit - SpeedUp$Wetsuit

C

Which of the following R code would quickly help you find the number of states in which McCain won the 2008 election, and the number of states in which Obama won? A winner(~ Pres2008, data = USStates) B gf_boxplot(~ Pres2008, data = USStates) C tally(~ Pres2008, data = USStates) D str(~ Pres2008, data = USStates)

B, C

Which of the following are quantitative variables? (Check all that apply.) A LarkOwl B CognitionZscore C Happiness D Stress

A

Which of the following chunks of code creates a sampling distribution of F-ratios using the shuffle method? A. SDoF <- do(1000)*fVal(Infant.Mortality~shuffle(Education), data = swiss) B. SDoF<- do(1000)*fVal(Infant.Mortality~Education, data = resample(swiss)) C. SDoF <- do(1000)*fVal(shuffle(Infant.Mortality~Education, data = swiss)) D. SDoF <- shuffle(do(1000)*fVal(Infant.Mortality~Education, data = swiss))
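The shuffle approach in option A can be approximated in base R. In this sketch f_val() is a hypothetical helper standing in for fVal(), and sample() without replacement stands in for shuffle(); swiss ships with base R.

```r
set.seed(3)

# Hypothetical helper: pull the F-ratio out of a fitted lm.
f_val <- function(formula, data) {
  unname(summary(lm(formula, data = data))$fstatistic["value"])
}

observed_f <- f_val(Infant.Mortality ~ Education, swiss)

# Shuffle only the explanatory variable, breaking any real relationship,
# then recompute F. Repeat 1,000 times.
shuffled_f <- replicate(1000, {
  shuffled <- swiss
  shuffled$Education <- sample(shuffled$Education)  # permute, no replacement
  f_val(Infant.Mortality ~ Education, shuffled)
})

# Proportion of shuffled Fs at least as large as the observed F.
mean(shuffled_f >= observed_f)
```

Because the shuffling assumes the simple (empty) model is true, the resulting distribution shows which F values arise by chance alone.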

C

Which of the following does not influence the width of a confidence interval for the mean? (a) Sample Size (b) Confidence level (c) Size of effect (d) Standard deviation

B, C, D

Which of the following from FloridaLakes are quantitative variables? (Check all that apply.) A Lake B pH C NumSamples D MinMercury

c

Which of the following is an accurate interpretation of the 95% confidence interval for the b1 coefficient in the regression model? A. 95% of the time you'll have to wait between 10.11 minutes and 11.34 minutes to see an eruption. B. There is a 95% chance that β_1 is between 10.11 and 11.34 C. If the population slope β_1 is between 10.11 and 11.34 then our observed slope is considered likely (no less than 5%). D. If I ran another study of Old Faithful, with the same sample size, there is a 95% chance I'll get the exact same slope estimate

C

Which of the following is the correct interpretation of MS Total (297,230) in the supernova table above? A This is, roughly, the total number of points in the data frame. B This is, roughly, the total number of squared means based on the empty model. C This is, roughly, the average squared residual from the mean. D This is, roughly, the standard deviation from the mean.

B

Which of the following is the correct interpretation of PRE (0.65) in the supernova table above? A 65% of the players' minutes in the data frame can be predicted with their Points. B 65% of the SS from the empty model can be explained by adding Points to the complex model. C 65% of the Points model can be proportionally reduced by the empty model. D The Points model's SS total will be 65% of the SS total from the empty model.

B

Which of the following is the correct interpretation of PRE (0.98) in the supernova table above? A 98% of the Wetsuit velocities in the data frame can be predicted with the corresponding NoWetsuit velocity. B 98% of the SS from the empty model can be explained by adding NoWetsuit to the complex model. C 98% of the NoWetsuit model can be proportionally reduced by the empty model. D The NoWetsuit model's SS Total will be 98% of the SS Total from the empty model.

A, B

Which of the following methods would be appropriate for creating a confidence interval for the slope (b1) CHECK ALL THAT APPLY A. t-distribution B. bootstrapping C. shuffling D. normal distribution

B

Which of the following would be an accurate interpretation of a confidence interval for the average time to dental appointments? (a) 95% of individuals will have time to dental appointments within this range (b) For 95/100 studies of time to dental appointments, their confidence interval will capture the true mean. (c) There is a 5% chance that the true mean is not inside the calculated confidence interval. (d) There is a 95% chance that the true mean is 10 months.

A

Which of the following would be the R command for estimating a simple model predicting Infant.Mortality? A. lm(Infant.Mortality~NULL, data = swiss) B. lm(Education~Infant.Mortality, data = swiss) C. predict(Infant.Mortality) D. lm(NULL~Infant.Mortality, data = swiss)

A, B, D

Which of the variables below would be appropriate for a histogram? (Check all that apply.) A College B IQ C Pres2008 D Population

D

Which of these options could be used to depict the relationship between Exercise and Pulse3Group? A gf_boxplot(Exercise ~ Pulse3Group, data = StudentSurvey) B gf_histogram(~ Exercise, data = StudentSurvey) %>% gf_facet_grid(Pulse3Group ~ .) C gf_point(Exercise ~ Pulse3Group, data = StudentSurvey) D All of the above

B

Why is the SS Total the same value for the FTMade.model and the Points.model? A Both are based on the residuals from the mean of the same explanatory variable. B Both are based on residuals from the mean of the same outcome variable. C All models that use the same data frame will have the same SS total. D Both models are based on the same number of values (n = 176).

D

Why is the mean a good model for hwy? A Because the mean is the only true statistical model that can represent a population parameter. B Because the mean is the best model whenever you make a visualization of data. C Because the mean is the best model for all categorical variables. D Because the mean is a model that balances the residuals and minimizes the sum of squared residuals.

A

Why is the mean a good model for this distribution? A Because the mean balances the deviations above and below the mean. B Because the mean balances the number of values above and below the mean. C Because the mean is the midpoint of the range. D All of the above are reasons why the mean is a good model.

A

Why might it be helpful to calculate means from random samples of 100 drawn from a normal distribution with the same mean and standard deviation as wt? A Because the resulting sampling distribution would give us an idea of how much random sample means could vary. B Because if the resulting sampling distribution is roughly normal, that proves that our sample came from a normally distributed population. C Because this helps us fit the best fitting normal curve over our sample distribution. D Because the mean of the sampling distribution will tell us what the true mean of the population is.

D

Why might the mean be a good simple model for this distribution? A Because the mean is the only statistically acceptable value of b0. B Because the mean is the most frequent value in this distribution. C Because in all skewed distributions, the mean is the best model because it is different from the median. D Because the mean is a model that balances the deviations from the model and minimizes the sum of squared residuals.

B

Why might we prefer to create a confidence interval using a t-distribution rather than a normal distribution? A. There is no reason to prefer a t-distribution over a normal distribution B. The t-distribution takes into account the fact that population variance is unknown C. Normal distribution makes an assumption about the distribution of the errors, but the t-distribution does not D. The t-distribution will give a narrower interval, which is more precise, so it is preferred

B, C

Why should you look at a histogram of a variable before you do other statistical analyses? (check all that apply) A You'll need the results from your histogram in order to write additional R code. B You might catch errors made in data collection/entry. C You can see the shape of the distribution to see if it makes sense. D R won't be able to run other functions on the data frame unless you make a histogram first.

B

Why would you choose to use bootstrapping to create a confidence interval rather than simulation? A. It creates a symmetric confidence interval B. It does not require an assumption about the shape of the errors (e.g., normally distributed) C. Bootstrapping will always give the same answer, whereas simulation gives a different answer every time you run the code. D. There are no reasons to choose bootstrapping over simulation.

Explanatory variable goes on the __________ and is this ________________

X Axis Independent Variable

Outcome variable

Y Axis Dependent Variable

After running the following R code: Age.model <- lm(Salary ~ Age, data = SalaryGender) Age.model The following is outputted in the R Console: Which of the following represents the fitted model?

Yi = −9.035 + 1.319Xi + ei

The arrange() function will print out the data set sorted by Age. But if you printed out MindsetMatters, it won't be sorted by Age. Why?

YOU DIDN'T SAVE IT

Given the distribution of data, could the population of Points be shaped like a normal distribution?

Yes, it is possible because samples often do not look exactly like the populations that they were drawn from.

Given this distribution of data, could the population of females' ratings of males' intelligence be shaped like a normal distribution?

Yes, it is possible because samples often do not look exactly like the populations that they were drawn from.

Suppose we make a model using Age to explain variation in Points, and call it Age.model: Age.model <- lm(formula = Points ~ Age, data=NBAPlayers2011) Age.model Which of these equations represents this model with the best-fitting estimates?

Yi = 1238.768 - 9.15Xi + ei (the intercept comes first, then the Age coefficient times Xi; the sign is minus because the Age coefficient is negative)

We wonder if females liked males of the same race as them (LikeF) more than males of a different race. To investigate this question we created a new variable called RaceMatch with this code: SpeedDating$RaceMatch <- SpeedDating$RaceM == SpeedDating$RaceF We then fit a model of LikeF using RaceMatch as the explanatory variable. How would you represent this RaceMatch model in GLM notation?

Yi=b0+b1Xi+ei

A, B, C, D

You create a histogram of IQ and find that it looks relatively normal. Which of the following statements are likely true? (Check all that apply.) A It's unimodal. B Most scores are clumped at the center. C It's roughly symmetrical. D It's bell-shaped.

B

You decide to conduct a study of energy drinks using undergraduates from your school. You select participants by randomly choosing ID numbers from among all ID numbers of current students. Once chosen, you randomly pick one of two energy drinks for students to consume weekly, throughout the school term. The first step is an example of _____ and the second is an example of _____. A Random assignment; random selection B Random selection; random assignment C Random assignment, random assignment D Random selection, random selection

B

You decide to make a relative frequency histogram for PhysicalActivity and have added the following to your code: %>% gf_density() What will you now see that you didn't see before? A An error message. You're missing the argument between the parentheses. B A smooth density plot overlaying your bars. C A smooth density plot instead of bars. D A y-axis that now displays density.

B

You run the following R code: Points.model <- lm(Mins ~ Points, data = NBAPlayers2011) Points.model When you do so, you get the following output: intercept = 1156.0680 points = 1.062 Which of the following equations represents the fitted model? A Yi = 1156.68 + 1.06 + ei B Yi = 1156.68 + 1.06Xi + ei C Yi = 1.06 + 1156.68 + ei D Yi = 1.06 + 1156.68Xi + ei

C

You run the following command: RandomLakes <- sample(FloridaLakes, 10). What will be the result? A A printout of random lakes B A new data frame of 10 lakes drawn randomly from the population C A new data frame of 10 lakes drawn randomly from your FloridaLakes data frame D None of the above

What is gf_facet_grid()?

You use it to split a histogram into separate panels by group, e.g. gf_histogram(~ Thumb, data = Fingers) %>% gf_facet_grid(. ~ Sex)

A

You'd like to divide the original data frame into three groups with low, medium, and high levels of average mercury. What R function would you use to do this and save the result as a new variable called MercGroup? A FloridaLakes$MercGroup <- ntile(FloridaLakes$AvgMercury, 3) B MercGroup <- sort(AvgMercury, 3) C arrange(FloridaLakes, 3) D ntile(FloridaLakes$AvgMercury, 3)

A

You'd like to see the first 10 rows of FloridaLakes, so you run head(FloridaLakes). It doesn't give you what you wanted. Why not? A You didn't indicate that you wanted to see 10 rows. B head() displays variable names. C head() can only be applied to vectors. D This is an odd request, so there's no R command for it.

If you've identified your confidence interval for SpeedUp, what exactly are you confident about?

You're confident that the true effect of wearing a wetsuit on swimming velocity lies within it.

B

You're interested in computing the confidence interval around the estimated mean of SpeedUp (how much a swimmer's velocity increases by having a wetsuit). What should you add to the following code in order to do so? Empty.model <- lm(SpeedUp ~ NULL, data = Wetsuits) A xqt(.025, df = 999) B confint(Empty.model) C SDoM <- do(10000) * mean(resample(Empty.model)) D Empty.model

C

You've been commissioned to do a study of all lakes with average mercury levels above 1. You want to save the data of the lakes that meet this criterion to a new data frame called HighMercury. What's wrong with the following code? HighMercury <- filter(floridalakes, avgmercury > 1) A It's missing quotation marks around the number 1. B It doesn't appropriately name the new data frame. C There are capitalization errors. D Nothing

Use the sum() function to

add up numbers

The command _________ can create new data frames with different summary values based on different groupings.

aggregate() **The command aggregate() is quite useful because you can ask for different summary functions (FUN) other than mean. You can ask for max (for maximum value in the group), min (for minimum value), sum (the total), or median (the middle number).

The same student has a tendency to come back repeatedly to ask the same question. With that in mind, you decide to save the answer above to an R object so you can readily call it up later. You decide to call the R object annoyingstudent

annoyingstudent <- SAT[4] == 1470

If you want to sort a whole data frame, we will use this function

arrange() **except now you have to specify both the name of the data frame, and the name of the variable on which you want to sort.

If you want to quickly see the name of the lake with the lowest average mercury level, what R command might you run?

arrange(FloridaLakes, AvgMercury)

We can also turn a factor back into a numeric variable by using the _______ function

as.numeric()

Take a look at the model that we fit in this output. How would we represent this number with General Linear Model (GLM) notation? coefficient (intercept):

b0

What kind of variables should go in gf_facet_grid()?

categorical

You're interested in computing the confidence interval around the estimated mean of SpeedUp (how much a swimmer's velocity increases by having a wetsuit). What should you add to the following code in order to do so? Empty.model <- lm(SpeedUp ~ NULL, data = Wetsuits)

confint(Empty.model)
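A sketch of what confint() gives for an empty model, using the built-in mtcars data frame as a stand-in for Wetsuits. The formula is written as mpg ~ 1, the base R spelling of an intercept-only model; treating it as equivalent to the card's ~ NULL form is an assumption of this sketch.

```r
# Intercept-only (empty) model: the single estimate b0 is the sample mean.
empty_model <- lm(mpg ~ 1, data = mtcars)

# 95% confidence interval around the estimated mean.
ci_from_model <- confint(empty_model)
ci_from_model

# The one-sample t interval is built from the same mean, standard error,
# and degrees of freedom, so it should agree.
ci_from_t <- t.test(mtcars$mpg)$conf.int
ci_from_t
```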

What is random sampling?

everyone in the population has an equal chance of being studied

D

If you wanted to see the distribution for College (percentage of residents with college degrees) and ran the following R code, what would be wrong? gf_histogram(~ College, data = USStates, bins = 10) A "bins = 10" will return an error B "gf_" is unnecessary C The "~" is unnecessary D Nothing

R has a way to let you specify whether a variable is categorical, using the _______command.

factor()

A factor variable, in R, is always categorical. In the Fingers data frame, Sex is coded as 1 or 2. In order for R to know that it is categorical, we can tell it by using the command _____________

factor(Fingers$Sex)

Let's say you want to filter your data so that you do NOT include lakes that have missing data for Chlorophyll. What line of code will do that?

filter(FloridaLakes, Chlorophyll != "NA")
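For comparison, a base R sketch of dropping rows with missing values by testing for NA directly with is.na(). The data frame Toy and its values are made up for illustration; this is an alternative phrasing, not the form the card expects.

```r
# Hypothetical lakes; the second one is missing its Chlorophyll reading.
Toy <- data.frame(
  Lake        = c("A", "B", "C"),
  Chlorophyll = c(3.2, NA, 1.6)
)

# Keep only the rows where Chlorophyll is not missing.
Complete <- Toy[!is.na(Toy$Chlorophyll), ]
Complete
```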

If you were interested in proportions rather than counts, which argument would you add to your code above?

format="proportion"

Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). Which is the best way to take a look at this relationship (between Wetsuit and NoWetsuit)?

gf_point(Wetsuit ~ NoWetsuit, data = Wetsuits)

What shows you just the first few rows of a data frame?

head()

The command head() shows you the first six rows of a data frame, but if you wanted to look at a different number of rows, you can just add in a number at the end like ______

head(Fingers, 3)
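A short illustration of the default and the optional row count, using the built-in mtcars data frame as a stand-in for Fingers.

```r
# With no second argument, head() returns the first six rows.
default_rows <- head(mtcars)
nrow(default_rows)

# A second argument asks for a different number of rows.
first_three <- head(mtcars, 3)
nrow(first_three)
```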

Maybe considering yourself a morning person (a "lark") or an evening person (an "owl") is related to variation in GPA. Which of the following plots would help us see whether variation in GPA is related to variation in LarkOwl?

histogram, scatterplot, and boxplot all show variation

Here is a depiction of the relationship between Salary and Age2Group: When Age2Group is included as an explanatory variable in our model of Salary, we write: SalaryGender$Age2Group <- ntile(SalaryGender$Age, 2) SalaryGender$Age2Group <- factor(SalaryGender$Age2Group, levels = c(1,2), labels = c("young", "old")) lm(formula = Salary ~ Age2Group, data = SalaryGender) As a result, the following is printed in the R Console:

i. How would you interpret 31.87? This represents the difference in average salaries for teachers in the high group relative to the low group.
ii. In Yi = 36.59 + 31.87Xi + ei, what does Xi stand for? Whether a teacher is in the young group or not.
iii. When Age2Group is included in our model to explain variation in Salary, how is error from this more complex model calculated? The deviation of each teacher's Salary from the mean Salary of their Age2Group (Young or Old).

B

if my data is a good representation of the population, use: (a) shuffling (b) bootstrapping (c) mathematical F distribution (d) mathematical normal distribution (e) mathematical t distribution

D

if the population mean is the mean in my data and the variance is known, use: (a) shuffling (b) bootstrapping (c) mathematical F distribution (d) mathematical normal distribution (e) mathematical t distribution

E

if the population mean is the mean in my data and the variance is unknown, use: (a) shuffling (b) bootstrapping (c) mathematical F distribution (d) mathematical normal distribution (e) mathematical t distribution

a, c

if the simple model is true in the population, use... (a) shuffling (b) bootstrapping (c) mathematical F distribution (d) mathematical normal distribution (e) mathematical t distribution

What will happen in R if you run: print("StatsCourse")?

it will print out StatsCourse

The rows in this data frame represent _____ and columns represent _____.

lakes; qualities of the lake

Which of these functions might help us recode Height as a categorical variable?

ntile()

if you want to split up the students in the Fingers data set into two groups by their Height (a short group and a tall group), you could use the function

ntile()

Each row is a __________

observation

Consider the following model: Tips = smiley face + other stuff Where would sampling variation go in this model of the DGP?

other stuff

If we are mostly going to put the outcome variable on the y-axis (and the explanatory variable on the x-axis), what order would you expect in our R code?

outcome ~ explanatory

What's the correct command, if you want to print "MyUniversity"?

print("MyUniversity")

There are some instances where you want to add a label onto a number. There are other cases where you want to change the numbers themselves. How would you do this?

recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)

We can use the __________function to look at just a few specific variables.

select()

Write some code that will show you the values for Age and Alcohol for patients in the NutritionStudy data frame.

select(NutritionStudy, Age, Alcohol)

What code would you use to put a vector in a certain order?

sort(myvector) **Only for vectors

The _____________ command tells you the type of each variable in a data frame.

str()

Using the DataCamp window above, determine how many lakes are included in the data frame.

str()

What shows us the overall structure of the data frame, including number of observations, number of variables, names of variables, and so on?

str()

How would you quickly find the total number of water samples (or test tubes) collected across all of the lakes in your study?

sum(FloridaLakes$NumSamples)

We want to construct a 95% confidence interval for a population mean. However, we don't know the population standard deviation and we have a small sample size. Which is the most appropriate distribution to use for a sampling distribution?

t distribution

The FloridaLakes data frame includes a variable called Calcium. How many lakes have a Calcium level that exceeds 5.0? (Hint: Try using the tally command.)

tally()

We could represent the same pattern as sort in a frequency table using the command ______

tally() we can also specify the variable and data frame separately like this: tally(~ Age, data = MindsetMatters)

Statistics is

the study of variation

Each column is a __________

variable

B

which of the following is an example of measurement error? a) we can use year in college to predict amount of school spirit, but even when we've accounted for year there's still some variability (or error) around the prediction b) 2 nurses measure the blood pressure of a patient at the same time, one on the left arm and one on the right. one nurse gets 115/67 and the other gets 117/65 c) when asked about salary one participant from Canada reports his salary in Canadian dollars instead of American dollars whereas all the other participants report their salary in American dollars d) Alex makes a mistake when entering the number of sexual partners a participant has, and enters 115 instead of 15

Let's say you want to create an R object, so you can call it up later. The object you want to create represents the Oxford Dictionaries Word of the Year for 2017, which happens to have been "youthquake." How would you create that object?

wordoftheyear2017 <- "youthquake"

If you have fewer bins

you see less detail

If we add more participants to the SpeedDating study, which of these could not be affected?

β0

