Math 227 - Final

The variable Pres2008 is categorical. It indicates whether McCain or Obama won the state in the 2008 election. Which is the more appropriate visual representation for this data?

Bar Graph

Think back to the vector called SAT. Let's imagine that the student who earned the fourth score in this vector would like to know her score. You try sat[4], but get an error. What did you do wrong?

Because R is case sensitive, SAT needs to be capitalized.

Why is the mean a good model for this distribution?

Because the mean balances the deviations above and below the mean.

Why might it be helpful to calculate means from random samples of 100 drawn from a normal distribution with the same mean and standard deviation as wt?

Because the resulting sampling distribution would give us an idea of how much random sample means could vary.

If you've identified your confidence interval for SpeedUp, what exactly are you confident about?

You're confident that the true effect of wearing a wetsuit on swimming velocity lies within it.

Which of the following is the correct interpretation of PRE (0.65) in the supernova table above?

65% of the SS from the empty model can be explained by adding Points to the complex model.

Again using lm() to explore the model that includes Gender to explain Height in the StudentSurvey data frame, what is the mean height for females?

65.695

In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. Given this information, interpret the 70.3 in the table below.

70.3 percent of the surveyed residents said that they competed in a physical activity that month.

Use lm() to fit the empty model for Fat in the NutritionStudy data frame. What is the coefficient?

77.03

Using the FatMice18 data frame, run favstats on WgtGain4. At what value would the sum of squared errors (sum of squares) be lowest?

8.39

You run the following command: RandomLakes <- sample(FloridaLakes, 10). What will be the result?

A new data frame of 10 lakes drawn randomly from your FloridaLakes data frame

If you created a bootstrapped sampling distribution of 10,000 means from your sample of SpeedUp, what qualities would you expect it to have?

A roughly normal shape, and a standard deviation smaller than the standard deviation of the sample

Which distribution would you use to create a confidence interval around a parameter estimate?

A sampling distribution

What kind of distribution would this code create? do(10000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12))

A sampling distribution of bootstrapped slopes

We fit two models in which LikeF was the outcome variable. The first used AttractiveF as the explanatory variable, the second, IntelligentF. Above we show the analysis of variance tables produced by supernova() for the two models. Which of the following would make you think that the AttractiveF model explains more variation in LikeF than the IntelligentF model?

All of the above

Which of these options could be used to depict the relationship between Exercise and Pulse3Group?

All of the above

Above we have included the ANOVA tables for two models: wt = age + other stuff and wt = smoke.factor + other stuff. Why do these two models have the same value for SS total (25356)?

All of the above reasons together explain why the SS totals are the same.

If we wanted to explore the idea that mother's smoking might explain the variation in birth weight, what visualization might be helpful to us?

All of the above would be helpful to us as we explore our data.

Construct a scatterplot to explore the relationship between GPA and Happiness among participants in the SleepStudy. What seems to be true?

All of the above.

Look at the image below. What are examples of a research unit (or case), a variable, and a value, respectively?

Annie, AvgMercury, 1.33

The graph below depicts the standard normal distribution with mean 0 and standard deviation 1. Find the area of the shaded region (to the right of z = −1.82) by using the xpnorm() function.

0.9656
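
Worked example: a minimal sketch of the same calculation in base R. It uses pnorm() as a stand-in for mosaic's xpnorm() (which also draws the shaded curve); the variable name area_right is only illustrative.

```r
# Area to the right of z = -1.82 under the standard normal curve.
# lower.tail = FALSE asks for the upper-tail area P(Z > -1.82).
area_right <- pnorm(-1.82, mean = 0, sd = 1, lower.tail = FALSE)
round(area_right, 4)
```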

Use ntile() to create groups of lakes that are low, medium, and high in Chlorophyll. Save this in the FloridaLakes data frame as a new variable called Chlorophyll3Group. If you then use the head() and select() commands to print out the first six rows of Chlorophyll3Group, what do you get as a result?

1, 1, 3, 1, 1, 3

Above we have included the ANOVA tables for two models: wt = age + other stuff and wt = smoke.factor + other stuff. A classmate takes a look at these results and suggests that you judge these models using F instead of PRE. Why is that good advice?

Because F takes degrees of freedom into account and the smoke.factor model might just have a big PRE because that model used more degrees of freedom.

What's true of sampling variation?

Because of it, no sample will perfectly reflect the population.

Why might the mean be a good simple model for the distribution of salaries, below:

Because the mean is a model that balances the deviations from the model and minimizes the sum of squared residuals.

Why might the mean be a good simple model for this distribution?

Because the mean is a model that balances the deviations from the model and minimizes the sum of squared residuals.

Here we have depicted the mean as a vertical blue line. Why is the mean a good model for hwy?

Because the mean is a model that balances the residuals and minimizes the sum of squared residuals
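
Worked example: a small sketch of both properties, using made-up numbers (the vector y is hypothetical and not taken from any of the course data frames).

```r
# Hypothetical data, for illustration only.
y <- c(12, 15, 19, 22, 32)
m <- mean(y)  # 20

# Property 1: residuals from the mean sum to zero (the deviations balance).
sum_resid <- sum(y - m)

# Property 2: the mean gives the smallest sum of squared residuals.
ss <- function(guess) sum((y - guess)^2)
ss_at_mean  <- ss(m)      # 238
ss_below    <- ss(m - 1)  # 243
ss_above    <- ss(m + 1)  # 243

c(sum_resid, ss_at_mean, ss_below, ss_above)
```

Any value other than the mean, whether higher or lower, produces a larger sum of squares.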

After fitting the regression model (Age.model), we construct a 95% confidence interval for the estimate of β1 by running the following code: confint(Age.model). Here's the output:
i. What's our 95% confidence interval for the estimate of β1?
ii. Notice that our confidence interval contains 0. What does this mean?
iii. Which of the following is an incorrect interpretation of the confidence interval for β1 in this model?

i. (-24.40652, 6.098267)
ii. It suggests we should retain the empty model.
iii. 95% of all Points will have this relationship with Age.

The sampling distribution of means above was created with this code:
SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd))
gf_histogram(~ mean, data = SDoM, fill = "burlywood")
If you were to stack up all the bars, what would the total be?

10,000
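
Worked example: a rough base R sketch of the same simulation. It swaps mosaic's do() for replicate(), and the population mean and standard deviation below are assumed values, not the real wt statistics.

```r
set.seed(1)

# Assumed population values, for illustration only.
pop_mean <- 120
pop_sd   <- 18

# Draw 10,000 samples of n = 100 and keep the mean of each one.
sample_means <- replicate(10000, mean(rnorm(100, pop_mean, pop_sd)))

length(sample_means)  # one mean per simulated sample, so 10000 in total
sd(sample_means)      # should land close to pop_sd / sqrt(100)
```

Every simulated mean falls in exactly one histogram bar, which is why the bar counts add up to the number of samples drawn.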

Above we show the favstats() for females' ratings of their date's intelligence (IntelligentF). Why might it be helpful to examine the distribution of means from random samples of n=273 drawn from a normal distribution with the same mean and standard deviation as IntelligentF?

Because the resulting sampling distribution would give us an idea of how much random sample means could vary.

Determine whether each numerical value in the following statement is a statistic or a parameter. Bon Air Elementary School has 1000 students. The principal of this school thinks that the average IQ of students at Bon Air is at least 110. To prove her point, she administers an IQ test to 20 randomly selected students. Among the sampled students, the average IQ is 108. Fill in the blanks with "statistic" or "parameter."

1000: Parameter; 110: Parameter; 20: Statistic; 108: Statistic

We made the variable smoke into a factor and saved it as smoke.factor. Then we fit a model that used smoke.factor to explain variation in wt. The best-fitting estimates for the model from the lm() function are shown above. What would be the model's prediction for the weight of a baby born to a mother who smoked until the current pregnancy?

122.16 + 3.84

If you use lm() to fit the empty model for LikeM, and then use confint() to find the confidence interval, what does the confidence interval tell you?

Both B and C are correct.

We fit two models in which LikeF was the outcome variable. The first used AttractiveF as the explanatory variable, the second, IntelligentF. Above we show the analysis of variance tables produced by supernova() for the two models. Why is the SS total the same value for the two models?

Both are based on residuals from the mean of the same outcome variable.

Why is the SS Total the same value for the FTMade.model and the Points.model?

Both are based on residuals from the mean of the same outcome variable.

Below we have depicted a histogram and the favstats for hwy. Using the Empirical Rule, estimate the probability that a car would have a hwy above 29.4 (depicted in red). Note, we have depicted the original data distribution in the histogram but we don't want you to think about the data. We want you to estimate the probability of a value greater than 29.4 based on a normal model of the distribution.

16%
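
Worked example: the answer implies that 29.4 sits about one standard deviation above the mean of hwy (an inference from the 16%, since the favstats output is not shown). Under that assumption the Empirical Rule and the exact normal tail agree closely:

```r
# Empirical Rule: about 68% of values fall within 1 SD of the mean,
# so the two tails share the remaining 32%.
rule_tail <- (100 - 68) / 2              # 16

# Exact upper-tail area beyond z = 1 under a normal model.
exact_tail <- pnorm(1, lower.tail = FALSE)

c(rule_tail, round(100 * exact_tail, 1))
```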

We fit a model of Mins predicted by FTMade and called it FTMade.model (the output is below). If you know a player had 0 free throws, how many minutes would you predict he played?

1662.31

LeBron James played 3,063 minutes, scored 2,111 points, and made 758 free throws in the 2011 season. What is the FTMade.model's prediction for minutes played by LeBron James?

1662.31 + 2.83*758
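
Worked example: the prediction written out as arithmetic, using the two estimates quoted in the answers above (intercept 1662.31, slope 2.83). The helper predict_mins() is a hypothetical name for illustration.

```r
b0 <- 1662.31  # predicted minutes for a player with 0 free throws
b1 <- 2.83     # increment in predicted minutes per free throw made

predict_mins <- function(ft_made) b0 + b1 * ft_made

pred_zero   <- predict_mins(0)    # 1662.31
pred_lebron <- predict_mins(758)  # 3807.45

# Residual for LeBron James: actual minutes minus predicted minutes.
resid_lebron <- 3063 - pred_lebron
c(pred_zero, pred_lebron, resid_lebron)
```

The model overpredicts here: the actual 3,063 minutes is well below the predicted value, so the residual is negative.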

Fit the NULL or empty model of WgtGain4 in the FatMice18 data frame. What is the sum of squares for this model?

186.28

Using the SpeedDating data frame, fit a model in which LikeF is the outcome variable and FunF is the explanatory variable. Based on the model, what would you predict a male's LikeF rating would be if his FunF rating was 0?

2.5013

The USStates data frame includes information on the percentage of residents in each state who smoke. Data is coded under the variable named Smokers. Produce a histogram of Smokers, without indicating a particular number of bins or indicating a particular bin size. Where is the peak of the histogram?

20

None of these values (SS total, SS model, mean) will be the same.

2011

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the data?

23.6

Based on information in the supernova() table above, how would you calculate the approximate value of PRE?

2638 divided by 10485
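
Worked example: PRE is the share of the empty model's error that the complex model removes. The sketch assumes 2638 is SS Model and 10485 is SS Total in the table, which is what the answer implies.

```r
ss_model <- 2638   # assumed: variation explained by the model
ss_total <- 10485  # assumed: total variation around the grand mean

pre <- ss_model / ss_total
round(pre, 2)  # about 0.25

# Equivalent form: 1 minus the share of error left over.
ss_error <- ss_total - ss_model
pre_alt  <- 1 - ss_error / ss_total
```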

If you fit a model that predicts Mins by including FTMade as an explanatory variable, how many parameters would the model have?

2: the y-intercept and the slope of the regression line

Based on the data shown in the boxplot, can we conclude that smoking causes changes in fat consumption?

No, because these data are the result of a correlation study, not an experiment.

We fit two models in which LikeF was the outcome variable. The first used AttractiveF as the explanatory variable, the second, IntelligentF. Based on the parameter estimates for the two models (shown above), can you tell which model explains more variation in LikeF?

No, it's not possible to tell from the parameter estimates how much variation has been explained by a model.

We fit a model of Mins predicted by FTMade and called it FTMade.model (the output is below). We also fit a model of Mins predicted by Points (points scored) and called it Points.model (output below). From these best-fitting parameters, can we tell which model explains more variation: FTMade.model or Points.model?

No, we cannot tell from the best-fitting estimates how much variability has been explained by a model.

We can calculate the residuals from the complex model and the mean residuals for the three groups with this code:
StudentSurvey$Residuals <- resid(Pulse3Group.model)
mean(Residuals ~ Pulse3Group, data = StudentSurvey)
Here is the output: Have you done something wrong in R?

No. Means always balance residuals.

Create a box plot for College in the USStates data frame. How many outliers do you see?

None

We used AgeM to predict ratings of fun (FunM). The F value for this model in the table above is .02. What does this F ratio tell us?

None of the above.

What notation cannot be used to represent the mean of the sample?

None of these choices can be used to represent the mean of the sample.

In addition to the NBA player data for the 2011 season, we also have a similar data frame called NBAPlayers2015 (with many of the same variables). Which of these values will be the same if we create two models with these lines of R code: Points11.model <- lm(Min ~ Points, data = NBAPlayers2011) Points15.model <- lm(Min ~ Points, data = NBAPlayers2015)

None of these values (SS total, SS model, mean) will be the same.

The sampling distribution of means from a uniform distribution will likely be:

Normal

You decide to conduct a study of energy drinks using undergraduates from your school. You select participants by randomly choosing ID numbers from among all ID numbers of current students. Once chosen, you randomly pick one of two energy drinks for students to consume weekly, throughout the school term. The first step is an example of _____ and the second is an example of _____.

Random selection; random assignment

What will the following code do? xqt(.025, df = 999)

Return t critical for a sample size of 1000
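
Worked example: a sketch with base R's qt(), standing in for mosaic's xqt() (which also plots the distribution). With n = 1000 the degrees of freedom are n - 1 = 999.

```r
n <- 1000

# Cutoffs that leave 2.5% in each tail of the t distribution.
t_lower <- qt(0.025, df = n - 1)
t_upper <- qt(0.975, df = n - 1)

c(t_lower, t_upper)  # symmetric about 0, close to -1.96 and 1.96
```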

If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, how big is the standard error of the sampling distribution of b1?

Roughly .118 divided by 2

If the 95% confidence interval for β1 is 0.7139 plus or minus 0.0814, how big is the standard error of the sampling distribution of b1?

Roughly 0.0814 divided by 2
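
Worked example: a 95% margin of error is roughly two standard errors, so dividing the margin by 2 recovers the standard error. The second line uses the normal critical value as an assumed, slightly more precise divisor; the exact t value would depend on degrees of freedom not given here.

```r
margin <- 0.0814

se_rough  <- margin / 2             # 0.0407
se_normal <- margin / qnorm(0.975)  # divisor is about 1.96

c(se_rough, se_normal)
```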

Now imagine that the same student simply wanted to know whether her original score was a 1470. How would you get her answer?

SAT[4]==1470

Imagine that you've calculated SS for both the empty model and the complex model for Exercise. What will be true about these SS?

SS leftover from the empty model will be greater than the SS leftover from the complex model

Above is the supernova table for Light.model, which uses Light to explain variation in WgtGain4. Why does the table show a smaller sum of squares error (73.78) than sum of squares total (186.28)?

SS total is based on residuals from the Grand Mean. SS error is based on residuals left over after some of the total variation is explained by the difference in group means.
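
Worked example: a base R sketch of the two sums of squares, using made-up weight gains for two groups (hypothetical data, not FatMice18).

```r
# Hypothetical weight gains for two groups, for illustration only.
gain  <- c(4, 6, 5, 9, 11, 10)
group <- factor(c("A", "A", "A", "B", "B", "B"))

# SS Total: squared residuals from the grand mean (7.5).
ss_total <- sum((gain - mean(gain))^2)

# SS Error: squared residuals from each case's own group mean (5 and 10).
group_means <- ave(gain, group)  # ave() defaults to the group mean
ss_error <- sum((gain - group_means)^2)

c(ss_total, ss_error)  # 41.5 and 4
```

The difference between the two (37.5 here) is the variation explained by the group means.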

What will the following code produce?
SAT <- c(1300, 1120, 1050, 1470, 1350)
First.Score <- SAT[1]
Second.Score <- SAT[2]
First.Higher <- First.Score > Second.Score
First.Higher

TRUE
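
Worked example: the SAT cards in this set can be checked with a few lines of base R. The tryCatch() wrapper is only there so the case-sensitivity error is captured instead of stopping the script.

```r
SAT <- c(1300, 1120, 1050, 1470, 1350)

fourth    <- SAT[4]          # 1470
is_1470   <- SAT[4] == 1470  # TRUE
first_higher <- SAT[1] > SAT[2]  # TRUE, since 1300 > 1120

# R is case sensitive: the lowercase name refers to a different object.
# (Assumes no object called `sat` exists in the session.)
lower_case <- tryCatch(sat[4], error = function(e) "error")
```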

Use the DataCamp window above to construct a faceted histogram of how much fun males perceive females to be (FunM) by males' race (RaceM) in the SpeedDating data frame. Which of the race groups looks most like the panel below?

Caucasian

One researcher wondered if some of the variation in the difference in velocity came from the type of swimmer they were. Triathletes swim in wetsuits more often than competitive swimmers do, and she worried that their experience would influence the results of this study. Above is a faceted histogram of difference in velocity by the Type of athlete. The two vertical lines depict the mean of the swimmer group and the mean of the triathlete group. What would be the PRE of this model: Type.model <- lm(SpeedUp ~ Type, data = Wetsuits)

Close to 0

Which of the following word equations represents the hypothesis that smoking explains variation in fat consumption?

Fat consumption = Smoking + other stuff

In a study designed to find out if smoking habits explain variation in fat consumption, _______ would be the outcome variable and ______ would be the explanatory variable.

Fat; EverSmoke

Using the SleepStudy data frame, produce a jitter plot to examine ClassesMissed by Gender (coded 0 for female, 1 for male). Among students who missed no class were there more females or more males?

Females

Using the StudentSurvey data frame, create a faceted histogram for Weight by Gender. The mean is likely to be a better model for which gender?

Females

The data frame is currently organized alphabetically by lake. What if you'd like to see it ordered by average mercury level, with the most polluted lake appearing first on the list? Save the result back into FloridaLakes.

FloridaLakes <- arrange(FloridaLakes, desc(AvgMercury))

You'd like to divide the original data frame into three groups with low, medium, and high levels of average mercury. What R function would you use to do this and save the result as a new variable called MercGroup?

FloridaLakes$MercGroup <- ntile(FloridaLakes$AvgMercury, 3)

If you print the residuals for StudentSurvey.modelGPA, what will you see?

For each participant in the study, the difference between his/her GPA and the mean GPA

You suspect that in the SleepStudy, Gender can be used to explain sleep quality (PoorSleepQuality). Produce a jitter plot to explore whether your suspicion might be right. Which of the following is true?

Gender does not appear to predict sleep quality.

Assume we run these two lines of code:
AttractiveF.stats <- favstats(~ AttractiveF, data = SpeedDating)
rnorm(100, AttractiveF.stats$mean, AttractiveF.stats$sd)
What will the second line of code (in red) do?

Generate a random sample of 100 data points from a normal distribution with the same center and spread as AttractiveF.

Let's say a researcher hopes to explore the hypothesis that knowing about someone's stress level can help to predict their happiness. What word equation best captures this idea?

Happiness = Stress + other stuff

In a study designed to find out what explains variation in Happiness, _____ would be the outcome variable and _____ would be the explanatory variable.

Happiness; Stress

If the distribution of NoWetsuit was more variable (that is, has a greater standard deviation) than the distribution of Wetsuit, what would be true about the confidence interval of NoWetsuit compared to Wetsuit?

The NoWetsuit confidence interval would be wider

Why does this table have a smaller Sum of Squares Total (1699) than the supernova table for Exercise explained by Pulse3Group (11864)?

The SS Total depends on the variation in the outcome variable. Piercings is a different outcome variable so it has a different SS Total.

Consider the mouse who gained the least amount of weight in this study (circled in red in the jitter plot above). What is true about the residual of this mouse from the empty model?

The absolute value of the residual is relatively large.

If the z score for your friend's car's highway miles per gallon is found to be .6, what does that mean?

The car's highway miles per gallon is .6 standard deviations larger than the mean for hwy.

If we run lm() to fit the model for WgtGain4, using Light as an explanatory variable, how is error from the model calculated for each mouse?

The deviation of each mouse's WgtGain4 from the mean WgtGain4 for their Light group

When Pulse3Group is included in our model to explain variation in Exercise, how is error from this more complex model calculated?

The deviation of each person's Exercise from the mean Exercise of their Pulse3Group

In your study, you tested two types of energy drinks (SuperBuzz and StayFocused). You found that students who consumed SuperBuzz rated themselves as more alert on average than did those who drank StayFocused. Your roommate suspects that you are being fooled by chance (also called Type 1 error). What's her concern?

The difference you found was the result of sampling variation.

Above we have included the ANOVA tables for Wetsuit = NoWetsuit + other stuff. Which distance is the basis of the SS Error?

The distance between data points and the NoWetsuit model's prediction.

Above we have included the ANOVA tables for wt = smoke.factor + other stuff. Which distance is the basis of the SS error?

The distance between data points and the smoke.factor model's prediction.

In the plot below, what does the point circled in red represent?

The happiness of a student with high stress

What would this code show us? SDob1 <- do(100000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12)) SDob1 <- arrange(SDob1, desc(b1)) SDob1$b1[2500]

The highest population increment that could have produced our sample and it would still be considered likely

Suppose we construct a sampling distribution of bootstrapped slopes for Points vs. Age. bootSDob1 <- do(1000) * b1(Points ~ Age, data = resample(NBAPlayers2011, 176)) bootSDob1$b1[25] What does this 25th value tell us?

The highest population increment that could have produced our sample and it would still be considered likely.

The best-fitting model using NoWetsuit velocity to predict Wetsuit is this: Yi = 0.1423 + 0.9547Xi + ei. How should we interpret 0.9547?

The increment to add on to the prediction of Wetsuit for every 1 m/sec of NoWetsuit

Above are the favstats for Alcohol. The average number of drinks per week is 3.28 but the median is 0.3. The maximum value in this distribution is 203—that is a lot of alcoholic drinks per week (almost 30 per day)! That seems to be a mistake. Which of the following would change more if we were to exclude the maximum value from the analysis?

The mean

If you ran the R code below, what would you be able to tell from the output? Empty.model <- lm(hwy ~ NULL, data = mpg) Empty.model

The mean

Even in this highly skewed distribution, the mean can be a good model. What makes the mean a good model?

The mean balances the deviation above and below the mean.

Someone has a hypothesis that Gender can be used to explain the number of Piercings that students in the StudentSurvey have. Fit the model and save it. Then, create a function that takes the model as an input. Finally, use your function to make a prediction for males. (Note that males are coded as "M".) What does the output tell you?

The mean number of piercings for males is 0.171.

Let's say we want to compare the Light model for weight gain (WgtGain4 = Light + error) to the empty model (WgtGain4 = mean + error). What does the "mean" in the empty model word equation refer to?

The mean of WgtGain4 for all the mice

If you bootstrap a sampling distribution based on your sample of data, what will be the mean of the bootstrapped distribution?

The mean of your sample
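
Worked example: a base R sketch of bootstrapping, resampling with replacement via sample() in place of mosaic's resample(). The 12 values in x are made up for illustration and are not the SpeedUp data.

```r
set.seed(3)

# Hypothetical sample of 12 observations, for illustration only.
x <- c(0.06, 0.08, 0.10, 0.04, 0.09, 0.07, 0.11, 0.05, 0.08, 0.12, 0.03, 0.09)

# Resample the data with replacement 10,000 times, keeping each mean.
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

mean(x)           # the sample mean
mean(boot_means)  # the bootstrapped distribution centers near it
sd(x)             # spread of the individual values
sd(boot_means)    # much smaller: roughly sd(x) / sqrt(12)
```

This also illustrates the earlier card on bootstrapped means: the distribution of means is narrower than the sample itself.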

The average Distance of this person's bike commute is just over 27 miles. Imagine that you've discovered that one of your observations has been recorded incorrectly. Instead of a distance of around 27 miles, the distance for one of the commutes has been entered as 54 miles! You make the correction to your data frame. How will the correction affect the mean?

The mean will be lower because of the correction.

How will this correction (changing 54 back to 27) affect the mean and the median?

The median will be affected less than will the mean.

If you create an empty model of TopSpeed, what would it mean to have an "empty model"?

The model would include only mean TopSpeed.

What would be true about the empty model for Fat?

The model would make the same prediction (the mean of Fat) for every person regardless of their values on other variables.

If the distribution of Fat were roughly symmetrical and bell-shaped, what would that mean?

The most frequent number in the distribution would be very close to the average scores.

If the z score for a mouse's weight gain is -0.7, what does that mean?

The mouse's weight gain is 0.7 standard deviations lower than the mean of WgtGain4.

The nutrition study included 315 patients. Where is this information represented in the data frame?

The number of rows in the data frame

The study included 53 lakes in Florida. Where is this information in the data frame?

The number of rows of data

For which scatterplot does the slope of the regression line equal the correlation coefficient?

The one on the right

Imagine that the PhysicalActivity histogram is skewed to the right. That is, the skinny, longer tail is on the right. What does that mean?

The population in most states is sedentary

We can calculate the residuals from both the empty model and the complex model. What is similar about these two sets of residuals?

The residuals represent the difference between the data and the model's prediction.

When we calculate the residuals from both the empty model and the complex model, what is similar about these two sets of residuals?

The residuals represent the difference between the data and the model's prediction.

If we wanted to know the range within which 95% of individual scores should fall, which distribution would we need to create a model of?

The sampling distribution of sample means

In general, in R, __________ is where you type in code and __________ is where the code runs.

The script window (i.e., script.R); the R Console

If you want to know if a regression model is better than a simple model in terms of making a prediction, what parameter should you make a sampling distribution of?

The slope

Using the SleepStudy data frame, create a boxplot to explore whether Stress (coded as normal or high) might be used to explain GPA. Which of the following statements are true?

The sole outlier is a participant with a normal level of stress.

What is standard error?

The standard deviation of the sampling distribution

What is standard error?

The standard deviation of the sampling distribution of an estimate

The sampling distribution of means above was created with these three lines of code:
SincereF.stats <- favstats(~ SincereF, data = SpeedDating)
SDoM <- do(10000) * mean(rnorm(272, SincereF.stats$mean, SincereF.stats$sd))
gf_histogram(~ mean, data = SDoM, color = "darkorchid4", fill = "darkorchid1")
What is true about the standard error of this distribution?

The standard error of this distribution is smaller than SincereF.stats$sd.

Imagine we drew two random samples from a population, and measured each case sampled on the same outcome variable. One sample had an n = 30, the other an n = 60. Which of the following statements is true?

The sum of squares of the larger sample would almost certainly be greater than the sum of squares of the smaller sample.

In the SleepStudy, might Stress be a predictor of Happiness? What do you see in boxplot?

There are more outliers among participants with normal stress than there are among participants with high stress.

The plot above shows males' liking of females (LikeM) as a function of whether or not they want to date them again (DecisionM, Yes or No). What does the plot show?

There were females that males liked a lot, but with whom they did not want to go out on another date.

Let's say we wanted to write a word equation to explain the variation in the time it takes to bike to work. We think that the Distance of a person's commute is an important explanatory variable. What would the word equation look like?

Time = Distance + other stuff

Imagine you make three histograms: one for TopSpeed, one for the predicted values based on the empty model for TopSpeed, and one for the residuals. Which two distributions will have a similar shape?

TopSpeed and residuals

According to the Central Limit Theorem, is each of the following True or False?
i. The shape of the distribution of means will typically be normal in shape, provided the sample size is large enough OR if the shape of the population is normal.
ii. The mean of the sampling distribution will be the same as the mean of the population from which the samples are randomly chosen. That is, the sample means will center around the true population mean.
iii. The standard error will be smaller for larger sample sizes. Even more specifically, the standard error will be equal to the population standard deviation divided by the square root of the sample size.

i. True; ii. True; iii. True
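
Worked example: a simulation sketch of the three Central Limit Theorem claims, using a uniform population on [0, 1] (mean 0.5, standard deviation sqrt(1/12)). The sample size and number of replications are arbitrary choices for illustration.

```r
set.seed(2)

n <- 50  # size of each sample (an arbitrary choice)

# 5,000 sample means drawn from a decidedly non-normal population.
means <- replicate(5000, mean(runif(n, min = 0, max = 1)))

center      <- mean(means)          # near the population mean, 0.5
se_observed <- sd(means)            # spread of the sample means
se_theory   <- sqrt(1/12) / sqrt(n) # population SD / sqrt(n)

c(center, se_observed, se_theory)
# hist(means) would show a roughly bell-shaped distribution.
```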

Use "%>%" notation to add gf_density() (a density plot) to your density histogram of Smokers in USStates. What does the curve look like?

Unimodal

Broadly speaking, what do we study when we study statistics?

Variation

Where on this boxplot would you look to see evidence of within-group variation in fat consumed?

Vertically, within each boxplot

SpeedUp contains the swimming velocity with Wetsuit minus NoWetsuit. Could these differences in swimming velocity be normally distributed in the population?

It is possible.

Which of the following would be a correct interpretation of the number 3.0806 in the supernova() table above?

It is, roughly, the average squared residual from the Grand Mean.

What happens to the sampling distribution as we increase sample size?

It looks more normal

When you add an explanatory variable to your model, what should be the effect on the Sum of Squares from the empty model?

It should remain unchanged.

When you add an explanatory variable to your model, what should be the effect on the Total Sum of Squares (from the empty model)?

It should remain unchanged.

What's the value in using the t distribution?

It works well as a model of the sampling distribution if the sample size is small, or standard deviation of the population is unknown.

What would the sampling distribution of means look like for samples of n = 1 ?

It would have the same shape and standard deviation as the population distribution.

If you increase your sample size in a study, how does it affect the 95% confidence interval around a parameter estimate?

It would make the confidence interval narrower.

If we repeated this study and found a larger standard error, what would be different about the confidence interval for β1?

It's likely that the 95% confidence interval for β1 would be wider.

If, in the population, females' mean intelligence ratings of males is 7.7, with a standard deviation of 1.2, how likely is it that we would randomly draw a sample of n=276 with a mean of less than 7.5? (Run a simulation in the R window above to answer this question.)

Less than 5%
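
Worked example: a base R sketch of this question, first by simulation (replicate() standing in for mosaic's do()) and then with the normal model of the sampling distribution. The number of simulated samples is an arbitrary choice.

```r
set.seed(4)

pop_mean <- 7.7
pop_sd   <- 1.2
n        <- 276

# Simulation: proportion of sample means that fall below 7.5.
sim_means <- replicate(10000, mean(rnorm(n, pop_mean, pop_sd)))
p_sim <- mean(sim_means < 7.5)

# Normal model: the standard error is pop_sd / sqrt(n).
se <- pop_sd / sqrt(n)
p_model <- pnorm(7.5, mean = pop_mean, sd = se)

c(p_sim, p_model)  # both well under 0.05
```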

Use gf_point() to examine ClassesMissed by Gender (coded 0 for females, 1 for males). Locate the SleepStudy participant who missed the most classes. Is it a female or a male?

Male

The output of favstats(~Points, data=NBAPlayers2011) is shown below. If we had collected a different sample of 176 NBA players, what value(s) would be different?

Most likely, all of these would be different

Someone has a hypothesis that younger people drink more alcohol than older people. Based on the scatterplot of number of drinks per week by age, which of the following observations is true?

Most people do not drink more than five drinks per week.

A sample mean is 24 and the margin of error is 3.5. Would a population mean of 28.4 be considered likely?

No
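
Worked example: the interval is the sample mean plus or minus the margin of error, and a candidate population mean is considered likely only if it falls inside.

```r
sample_mean <- 24
margin      <- 3.5

lower <- sample_mean - margin  # 20.5
upper <- sample_mean + margin  # 27.5

candidate <- 28.4
inside <- candidate >= lower & candidate <= upper  # FALSE
```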

Will all samples drawn from a population always have the same mean?

No, because of sampling variation

Do these results show that light treatment (that is, being in the LL group) causes mice to eat more?

No, because the experiment shows that light causes mice to gain more weight, but does not prove that their weight gain is caused by eating more.

Let's say you want to create an R object, so you can call it up later. The object you want to create represents the Oxford Dictionaries Word of the Year for 2017, which happens to have been "Youthquake." How would you create that object?

oxfordword2017 <- "youthquake"

The rows in the NutritionStudy data frame represent _____ and the columns represent _____.

patients; variables

Wanting to see "MyUniversity" in the R Console, you've just run the following command: print(MyUniversity). However, R returned an error message. What's the correct command, if you want to print "MyUniversity"?

print("MyUniversity")

If you'd like to see an overview of what's in the data frame—a list of your variables, whether they're numeric or factors, and so forth—what command would you use?

str()

Which of the following R code would quickly help you find the number of states in which McCain won the 2008 election, and the number of states in which Obama won?

tally(~ Pres2008, data = USStates)

How would you use R to calculate variance in hwy?

var(mpg$hwy)

The remaining questions on this exam have to do with the following data frame: Data from the 2010-2011 regular season from a sample of NBA players has been stored in a data frame called NBAPlayers2011. The data frame consists of 176 observations on the following 25 variables. Below is a histogram of Points for the 176 players in this data set, created with the following code: gf_histogram(~Points, data=NBAPlayers2011, color="grey", fill="seagreen", alpha=0.5) If we were to use this data to guess the mean number of points scored in the population, what would we be trying to estimate?

β0

If we add more mice to the study, which of these would not be affected?

β0

If we add more participants to the SpeedDating study, which of these could not be affected?

β0

If we were to use this data to guess the mean weight of newborns in the population, what would we be trying to estimate?

β0

If a data point is very far away from the mean, what would you expect for the residual?

The farther away the point, the larger the absolute value of the residual.

In Yi = 10.38 - .85 X1i - 3.14 X2i + ei, what does X1i stand for?

Whether someone is in the medium pulse group or not

Suppose we make a model using Age to explain variation in Points, and call it Age.model: Age.model <- lm(formula = Points ~ Age, data=NBAPlayers2011) Age.model Which of these equations represents this model with the best-fitting estimates?

Yi = 1238.768 − 9.154 Xi + ei

In Question 15, we created an empty model. Which of the following is the correct way of writing this best fitting empty model in General Linear Model (GLM) notation?

Yi = 27.79 + ei

After running the following R code: Age.model <- lm(Salary ~ Age, data = SalaryGender) Age.model The following is outputted in the R Console: Which of the following represents the fitted model?

Yi = −9.305 + 1.319 Xi + ei

What's the name of the LAST lake in the FloridaLakes data frame?

Yale

Given this distribution of data, could the population of females' ratings of males' intelligence be shaped like a normal distribution?

Yes, it is possible because samples often do not look exactly like the populations that they were drawn from.

If we write the model in GLM notation, which equation represents this Pulse3Group model?

Yi = b0 + b1X1i + b2X2i + ei

If you want to quickly see the name of the lake with the lowest average mercury level, what R command might you run?

arrange(FloridaLakes, AvgMercury)

You're interested in computing the confidence interval around the estimated mean of SpeedUp (how much a swimmer's velocity increases by having a wetsuit). What should you add to the following code in order to do so? Empty.model <- lm(SpeedUp ~ NULL, data = Wetsuits)

confint(Empty.model)

This bike rider has an intuition that the kind of bike he uses affects the top speed he can reach on the bike. To get a sense of the distribution of top speed, he looks at the output. Which function created this output?

favstats()

If you wanted to get the five-number summary for PhysicalActivity, what R code would you run?

favstats(~ PhysicalActivity, data = USStates)

What R code would you use to fit the empty model for TopSpeed?

lm(TopSpeed ~ NULL, data = BikeCommute)

Continuing with the murders data frame and the red lines of code from Questions 14 and 15... If the mean of murders_per_million is 27.79125 and New York's murders_per_million is 26.679599, what is New York's residual?

-1.111651

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the residual?

-10
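The two residual answers above follow the same rule: a residual is the observed value minus the model's prediction, and the empty model predicts the mean for every observation. A minimal base-R sketch using only the numbers quoted in the questions (the variable names are illustrative and are not taken from the course data frames):

```r
# Residual = observed value - prediction of the empty model (the mean)
mean_top_speed     <- 33.6
observed_top_speed <- 23.6
resid_top_speed    <- observed_top_speed - mean_top_speed
resid_top_speed    # -10

mean_mpm     <- 27.79125    # mean of murders_per_million
new_york_mpm <- 26.679599   # New York's murders_per_million
resid_ny     <- new_york_mpm - mean_mpm
resid_ny         # -1.111651
```

A negative residual simply means the observation sits below the mean.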

You create a histogram of IQ and find that it looks relatively normal. Which of the following statements are likely true? (Check all that apply.)

-It's unimodal. -Most scores are clumped at the center. -It's roughly symmetrical. -It's bell-shaped.

If you wanted to generalize to all lakes in Florida, but only included lakes within a 50 km radius of the research center in your study, what should concern you? (Check all that apply.)

-The sample is not random. -The sample may not represent the population you want to know about.

Why should you look at a histogram of a variable before you do other statistical analyses? (check all that apply)

-You'll need the results from your histogram in order to write additional R code. -You might catch errors made in data collection/entry. -You can see the shape of the distribution to see if it makes sense. -R won't be able to run other functions on the data frame unless you make a histogram first. (ALL ^)

Maybe considering yourself a morning person (a "lark") or an evening person (an "owl") is related to variation in GPA. Which of the following plots would help us see whether variation in GPA is related to variation in LarkOwl?

-gf_histogram(~ GPA, data = SleepStudy) %>% gf_facet_grid(LarkOwl ~ .) -gf_boxplot(GPA ~ LarkOwl, data = SleepStudy) -gf_point(GPA ~ LarkOwl, data = SleepStudy) -ALL OF THE ABOVE (<- answer)

Interpret the PRE.

.05 of the total variation in exercise hours is explained by the pulse groups.

What proportion of states (recorded as the variable Pres2008) was won by Obama? (Hint: use the "tally" command.)

.56

Above is the supernova() table for Light.model, which uses Light to explain variation in WgtGain4. What does the PRE of .60 mean?

.60 of the sum of squares from the empty model is explained by the Light groups.

If you use shuffle() to create a randomized sampling distribution of b1 (a group difference) based on a sample of data, what will be the mean of the resulting sampling distribution?

0

If you were to calculate the sum of the residuals from the empty model of Mins, what would it be?

0
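Residuals from the empty model always sum to zero because the mean balances the deviations above and below it. A small base-R sketch, assuming a made-up vector in place of a real variable such as Mins:

```r
# Hypothetical values standing in for a variable such as Mins
y <- c(12, 30, 25, 8, 40)

# The empty model predicts the mean for every observation
residuals_empty <- y - mean(y)

sum(residuals_empty)   # 0 (up to floating-point rounding on other data)
```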

The annual precipitation amounts in a certain mountain range are normally distributed with a mean of 109 inches, and a standard deviation of 10 inches. Use the xpnorm() function to find the probability that the annual precipitation will be less than 98 inches next year.

0.1357
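xpnorm() comes from the mosaic package and also draws a picture of the curve. If that package is not available, base R's pnorm() returns the same lower-tail probability. A minimal sketch of the calculation:

```r
# P(precipitation < 98) for a normal distribution with mean 109 and sd 10
p_below_98 <- pnorm(98, mean = 109, sd = 10)
p_below_98   # about 0.1357
```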

i. What would be the purpose of generating a sampling distribution of means of Points by resampling (bootstrapping)? Choose one: ii. If you created a bootstrapped sampling distribution of 10,000 means from your sample of Points, what qualities would you expect it to have? Choose one: iii. If you bootstrap a sampling distribution based on your sample data, what will be the mean of the bootstrapped distribution? Choose one:

1- The distribution can help you quantify how much your best estimate of the population mean could vary. 2- A roughly normal shape, and a standard deviation smaller than the standard deviation of the sample. 3- The mean of your sample

Here is the supernova table for Age2Group.model: i. Interpret the PRE. Select one: ii. Why is the degrees of freedom for Model equal to 1? Select one:

1- 0.14 of the total variation in salary is explained by the age groups. 2- The Age2Group model uses up a degree of freedom by estimating one more parameter, b1, than the empty model.

The mean maximum swim velocity when wearing a wetsuit (i.e., Wetsuit) is 1.51 m/sec. If the margin of error is 0.08 m/sec, what's the range of possible values within which you're 95% confident that actual population mean would fall?

1.43 m/sec to 1.59 m/sec

We want to construct a 95% confidence interval. The margin of error is 3.0. What is the approximate value of the standard error?

1.5
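The two answers above use the same relationship: a 95% confidence interval is the estimate plus or minus the margin of error, and the margin of error is roughly two standard errors. A short base-R sketch with the numbers from the questions (variable names are illustrative):

```r
# Confidence interval = estimate +/- margin of error
sample_mean     <- 1.51
margin_of_error <- 0.08
ci <- c(lower = sample_mean - margin_of_error,
        upper = sample_mean + margin_of_error)
ci   # lower 1.43, upper 1.59

# Working backwards from a margin of error of 3.0
se_rule_of_thumb <- 3.0 / 2             # 1.5, using the "about 2 SE" rule
se_normal        <- 3.0 / qnorm(0.975)  # about 1.53, using the normal critical value
```

Both versions round to about 1.5, which is why the approximate answer works.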

Revise your basic histogram of Smokers in the USStates data frame so that it includes just 5 bins. Locate the bin that represents the states with the highest percentage of residents who smoke. Around what number is that bin centered?

30

In USStates, what's the median percentage of residents with college degrees? (The variable is aptly named College.)

30.6

If the mean for TopSpeed is 33.6, what will the empty model predict for each observation's TopSpeed?

33.6

The FloridaLakes data frame includes a variable called Calcium. How many lakes have a Calcium level that exceeds 5.0? (Hint: Try using the tally command.)

35

Use the DataCamp window above to find the product of 12,345 and 34,567. What's the result? (Hint: Use "prod" to find the product in the same way that you'd use "sum" to find the sum.)

426729615

After running the code, cor(Salary ~ Age, data = SalaryGender), the following result is outputted in the R Console: [1] 0.4770431 Now that we know r = 0.47 for a regression line, what percent of variation is explained by the explanatory variable?

About 23% (the proportion of variation explained is r squared: 0.477² ≈ 0.23)
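For a simple regression, the proportion of variation explained (PRE) is the square of Pearson's r, not r itself. A quick base-R check using the correlation printed in the question:

```r
# Pearson's r reported by cor(Salary ~ Age, data = SalaryGender)
r <- 0.4770431

# Proportion of variation in Salary explained by Age
pre <- r^2
pre   # about 0.23
```

This agrees with the PRE of 0.23 reported in the supernova table for Age.model.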

Using the DataCamp window above, determine how many variables there are in the FloridaLakes data frame.

5

Use lm() to explore the model that includes Gender to explain Height in the StudentSurvey data frame. How many inches must you add to the mean height for females to get the mean height for males?

5.151

Using the FatMice18 data frame, run lm() to fit the model for WgtGain4, using Light as an explanatory variable. If Yi=b0+b1Xi+ei represents the fitted model, what is the value of b0?

5.889

The best-fitting model using AttractiveM to predict LikeM can be specified like this: LikeMi = b0 + b1 AttractiveMi + ei. Which of the following is an INCORRECT interpretation of the confidence interval for β1 in this model?

95% of all LikeM ratings have this relationship with AttractiveM ratings.

The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: Wetsuiti = b0 + b1 NoWetsuiti + ei. If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, which of the following is NOT a correct interpretation?

95% of all Wetsuit velocities have this relationship with the NoWetsuit velocity.

Which of the following is the correct interpretation of PRE (0.98) in the supernova table above?

98% of the SS from the empty model can be explained by adding NoWetsuit to the complex model.

Between what two values do we see the middle 50% of all IQ scores in the USStates data frame?

98.5 and 102.7

If the blue circles in the diagram above represent data points in a group model, which distance would be used to calculate the sum of squares error?

A

To examine the distribution of Happiness, which would be more useful?

A histogram

Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). How would we depict the null model of maximum velocity in a wetsuit on this plot?

A horizontal line at the mean of Wetsuit

We made this plot to explore whether variation in being liked (LikeF) might be explained by being perceived as fun (FunF). If we fit the empty model to this data, how would we depict it on this plot?

A horizontal line drawn at the mean of LikeF

Where on the density histogram would you look to see evidence of between-group variation in Happiness?

Across the two histograms

Which of the following from the NutritionStudy is a quantitative variable?

Alcohol (number of alcoholic drinks consumed per week)

Based on the two supernova tables above we would argue that Light (top table) explains more variation in WgtGain4 than does CageLoc (the cage location, bottom table). What in the table would support this argument?

All of the above

How should we interpret this boxplot?

All of the above

If you ran the R code below, what would you be able to tell from the output? Empty.model <- lm(hwy ~ NULL, data = mpg) anova(Empty.model)

All of the above

Take the StudentSurvey data frame and use lm() to fit the empty model for GPA. Save the results in an R object StudentSurvey.modelGPA. What do you get when you run StudentSurvey.modelGPA in R?

All of the above

Take the StudentSurvey data frame and use lm() to fit the empty model for SAT. What is 1204?

All of the above

The sum of squares gets larger as:

All of the above

Suppose there is a correlation between a teacher's salary and their age. There might not be a causal relationship because of gender. Gender is called a:

Confounding variable

What will the following code do? resample(Wetsuits, 12)

Create a new sample from the observations in Wetsuits

Imagine that you wrote the following code. What would it do? gf_boxplot(Happiness ~ Stress, data = SleepStudy, color = "orange") %>% gf_jitter()

Create a single plot (a boxplot with an overlaid jitter plot)

Consider the following two normal curves, a and b: Which has the larger mean and which has the larger standard deviation (curve a or curve b)?

Curve b has the larger mean; Curve a has the larger standard deviation

Below is a normal model of a population. There are more or less likely values of this variable. What part of the population would be considered "unlikely" to be randomly selected (according to the definition of "unlikely" agreed upon by the statistics community)?

D

In the figure below, which part represents the probability that a car would have a hwy above 29.4 (depicted in red)? Which part represents the z score for 29.4?

D; C

FunM.model <- lm(LikeM ~ FunM, data = SpeedDating) Using the R code above, we fit this model: Yi = b0 + b1Xi + ei. What does the Xi refer to?

Each male's value on FunM

If we write the model in GLM notation, what does Yi represent?

Each person's value for Exercise

If we fit a normal curve on the distribution of hwy (see visualization below), what is it that we're modeling with it?

Error around the model for hwy

After running lm(Salary~Age, data=SalaryGender), the following is outputted in the R Console: Call: lm(formula = Salary ~ Age, data = SalaryGender) Coefficients: (Intercept) Age -9.305 1.319 According to our model, someone who is 0 years old would have a salary of -9.305 (thousands of dollars). This of course makes no sense. It doesn't make sense because of:

Extrapolation

TRUE or FALSE: Correlation implies causation.

False

The sampling distribution of means above was created with this code: SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd)); gf_histogram(~ mean, data = SDoM, fill = "burlywood") Someone tells you that their baby was 116 ounces at birth. What is the likelihood of a baby having a birth weight of 116 or lower in the population?

I am not sure because I cannot tell the likelihood of a single baby having such a birth weight from this sampling distribution of means.

Which of the following statements are true? I. The mean of a population is denoted by Ȳ ("Y-bar"). II. Sample size is never bigger than population size. III. The population mean is a statistic.

II. only

The mean of hwy is 23.44. If you wanted to calculate a z score for a hwy of 27, how would it be affected by the standard deviation for hwy?

If the standard deviation is large, the z score should be small and positive.

If the sampling distribution of means is normal, the underlying population distribution is:

Impossible to tell

If you want to write a note to yourself about your R code, but you want R to ignore it, how would you do so?

Include a hashtag (#) at the start of the line

Sample distributions are made up of _________; sampling distributions are made up of _________.

Individual scores; sample statistics
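The difference between a sample distribution and a sampling distribution can be seen by simulation. The sketch below uses base R's replicate() in place of the mosaic do() idiom quoted elsewhere in this set; the population mean, standard deviation, and sample size are made-up values, not taken from any course data frame.

```r
set.seed(227)

# Hypothetical population parameters and sample size
pop_mean <- 120
pop_sd   <- 20
n        <- 100

# Sample distribution: individual scores from one sample
one_sample <- rnorm(n, mean = pop_mean, sd = pop_sd)

# Sampling distribution: 10,000 sample means, one per simulated sample
sample_means <- replicate(10000, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

sd(one_sample)     # close to pop_sd
sd(sample_means)   # close to pop_sd / sqrt(n) = 2, the standard error
```

The spread of the sample means is much smaller than the spread of individual scores, which is why a sampling distribution of means cannot tell you how likely a single observation is.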

In the FloridaLakes data frame, what kind of variable is AgeData?

Integer

If you wanted to see the distribution for College (percentage of residents with college degrees), and run the following R code, what would be wrong? gf_histogram(~ College, data = USStates, bins = 10)

Nothing

What does the distance between the two points (shown in the red rectangle) mean?

Nothing

The histogram below shows the distribution of Alcohol with an outlier removed. What is the "count" on the y-axis a count of?

Number of patients

Using the NutritionStudy data frame, make a histogram of Alcohol. What is represented on the y-axis?

Number of patients

Which of the following lines of R code would save only the patients with less than 200 drinks per week into a new data frame called NutriStudy?

NutriStudy <- filter(NutritionStudy, Alcohol <200)

You'd like to divide the patients in the data frame into two equal groups, those who consume relatively low amounts of Cholesterol per day, and those who consume relatively high amounts of Cholesterol per day. You want to save this categorization in a variable called CholesterolGroup. What R code could you use to do this?

NutritionStudy$CholesterolGroup <- ntile(NutritionStudy$Cholesterol, 2)

Arrange the FloridaLakes data frame by the variable called Calcium. What is the name of the lake with the lowest amount of Calcium?

Ocheese Pond

Which scatterplot shows a larger value of Pearson's r?

Plot B

How can sampling distributions help us interpret our data?

Sampling distributions give us a way to assess whether a relationship we've observed in our data is likely to have occurred just by chance.

One student suggests that players who make a lot of free throws (FTMade) are better and they would see more game time. Another student argues that making free throws doesn't make you a better player—having a higher free throw percentage (FTPct) is the sign of a better player, and suggests that would explain the variation in minutes played (Mins). Which of these plots would depict the relationship between Mins and one of these explanatory variables?

Scatter plots (i.e., gf_point)

You are interested in females' rating of how much they like their male dates (LikeF). In particular, you wonder if variation in LikeF is better explained by how attractive they think the male is (AttractiveF), or by how fun they think the male is (FunF). Which of these plots would best help you explore this question?

Scatter plots (i.e., gf_point)

Let's split GPA into three groups—low, medium, and high—and then create a faceted histogram. What goes in the blanks in the following code? SleepStudy$GPA3Group <- ntile(_____, 3) gf_dhistogram(~ Happiness, data = _____) %>% gf_facet_grid(GPA3Group ~ .)

SleepStudy$GPA; SleepStudy

What is represented by the numbers on the y-axis of this histogram?

Specific cars

Which of the following cannot be calculated from murders_per_million?

A parameter

Which of the following cannot be calculated from the NutritionStudy data set?

A parameter

If all the values of a data set are the same, all of the following must equal zero except for which one?

Mean

After running the following R code: Age.model <- lm(Salary ~ Age, data = SalaryGender) Age.model The following is outputted in the R Console: A teacher is 46 years old. What is the Age.model's prediction for their Salary (in thousands of dollars)?

-9.305 + 1.319*46
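The expression above is the fitted regression line evaluated at Age = 46. Carried through in base R, with the coefficients copied from the lm() output quoted in this set (variable names are illustrative):

```r
# Coefficients from lm(Salary ~ Age, data = SalaryGender)
b0 <- -9.305   # intercept
b1 <- 1.319    # slope for Age

# Predicted Salary (in thousands of dollars) for a 46-year-old teacher
predicted_salary <- b0 + b1 * 46
predicted_salary   # 51.369
```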

Which of the following are quantitative variables? (Check all that apply.)

-CognitionZscore -Happiness

Which of the variables below would be appropriate for a histogram? (Check all that apply.)

-College -IQ -Population

In our USStates data frame, the variable PhysicalActivity was obtained by surveying a random sample of residents in each state and asking them if they had competed in a physical activity in the last month. What's our goal in analyzing data like this? (Check all that apply.)

-It helps us understand the population. -It helps us understand each individual in the sample. -Solely to help us understand this particular sample. -It helps us understand the processes that produced the variation we see. (ALL ^)

Which of the following from FloridaLakes are quantitative variables? (Check all that apply.)

-Lake -pH -NumSamples -MinMercury (ALL ^)

The histogram above shows the distribution of population in millions across states. Which of the following statements are true based on the histogram? (Check all that apply.)

-Only a few states have very high populations. -Most states have a population of about 0. -Only small states show up on the far right tail. -The shape of the distribution is skewed to the right. (ALL ^)

Let's say you make several histograms in the process of exploring the data. Among them is a frequency histogram of PhysicalActivity and a relative frequency histogram of PhysicalActivity. If you used default settings for each of them, what do the two have in common? (Check all that apply.)

-They display the same variable. -They have the same number of bars. -The shape of the distribution would be the same. -Their axes would have the same labels. (ALL ^)

What's true about data?

-They require that you've selected a sample. -They are the result of measurement. -They represent something about the world. -ALL OF THE ABOVE (<- answer)

If you're told that there's measurement error in how one of your variables was recorded, which of the following could be true?

-Your measurements are biased. -A mistake was made when the data were either recorded or entered. -BOTH OF THE ABOVE (<- answer)

If you want to use R to get the sum of 10 and 20, what code would you write? (Check all that apply.)

-sum(10+20) -sum(10,20) -10+20

Using the SpeedDating data frame, fit a model of LikeM using SharedInterestsM as the explanatory variable. What is the 95% confidence interval for β1?

.35 to .53

Use the DataCamp window above to write some code that will show you the values for Age and Alcohol for patients in the NutritionStudy data frame. The last study participant is 45 years old. How many alcoholic drinks does she consume per week?

0.2

A sampling distribution of means below was created with this code: Points.stats <- favstats(~Points, data=NBAPlayers2011) SDoM <- do(10000) * mean(rnorm(176, Points.stats$mean, Points.stats$sd)) gf_histogram(~ mean, data = SDoM, color="grey", fill = "burlywood", alpha=0.75) i. If you were to stack up all the bars, what would the total be? Choose one: [ Select ] ["The sum of all the values in this distribution", "The number of objects in the population", "10,000", "176"] ii. Which of the following is a true statement? Choose one: [ Select ] ["The standard error of this distribution is smaller than Points.stats$sd.", "The standard error of this distribution is approximately equal to Points.stats$sd.", "It's impossible to know from this information alone.", "The standard error of this distribution is larger than Points.stats$sd."] iii. Someone tells you that their favorite NBA player scored 900 in the 2010-2011 NBA season. What is the likelihood of an NBA player having scored 900 points or lower in the population? Choose one: [ Select ] ["Probably close to 0 because such a low point total is in the lowest tail of this sampling distribution.", "I could find out by running the R code: tally(~mean <= 900, data = SDoM, format = "proportion").", "Probably close to 0.5 because this is a normally distributed sampling distribution.", "We can't be sure because we can't tell the likelihood of a single NBA player having such a point total from this sampling distribution of means, we'd need to instead look at our sample distribution."] iv. Pictured below is a sampling distribution of Points, color-coded by mean Points that are less than 875 and greater than 875. Points.stats <- favstats(~Points, data=NBAPlayers2011) SDoM <- do(10000) * mean(rnorm(176, Points.stats$mean, Points.stats$sd)) gf_histogram(~ mean, data = SDoM, color="grey", fill = ~mean<875) The chance of getting a sample mean less than 875 is: Choose one:

1- 10,000 2-The standard error of this distribution is smaller than Points.stats$sd. 3- We can't be sure because we can't tell the likelihood of a single NBA player having such a point total from this sampling distribution of means, we'd need to instead look at our sample distribution. 4- Very unlikely

Suppose you make the plot below to further explore the idea that Age would predict a teacher's Salary. i. If you fit an empty model to this data, how would you depict it on this plot? Select one: ii. If you fit a model that predicts Salary by including Age as a quantitative explanatory variable, how many parameters would the model have? Select one:

1- A horizontal line that shows the mean Salary. 2- Two: The y-intercept and the slope of the regression line.

If we fit an empty model to this data, how would we depict it on this plot?

A horizontal line that shows the mean for minutes played.

If we write the empty model in GLM notation, Y i = b 0 + e i, i. What is the value of Y i? Select one: [ Select ] ["Each individual teacher's salary.", "The median salary of this sample, 39.3.", "The difference between each teacher's salary and the mean salary of this sample.", "The mean salary of this sample, 52.5245.", "The population mean salary, which we're trying to predict."] ii. What is the value of b 0? Select one: [ Select ] ["The difference between each teacher's salary and the mean salary of this sample.", "The median salary of this sample, 39.3.", "The mean salary of this sample, 52.5245.", "The population mean salary, which we're trying to predict.", "Each individual teacher's salary."] iii. What is the value of e i? Select one: [ Select ] ["Each individual teacher's salary.", "The difference between each teacher's salary and the mean salary of this sample.", "The mean salary of this sample, 52.5245.", "The median salary of this sample, 39.3.", "The population mean salary, which we're trying to predict."]

1- Each individual teacher's salary 2- The mean salary of this sample, 52.5245. 3- The difference between each teacher's salary and the mean salary of this sample.

For questions 13-20, refer to the following data frame, and continue to add each red line of code into an R sandbox: The data frame murders contains data collected by the FBI. They tracked total gun murders in each state. (This data frame actually includes the District of Columbia, DC, as a "state.") Here are some details about the data frame: state: US state abb: Abbreviation of US state region: Geographical US region population: State population, number of people in that state (2010) total: Number of gun murders in that state (2010) You can run str(murders) in your R sandbox to see the structure of the murders data frame. States varied in how many gun murders they saw in 2010. Run the following code to see the distribution of total: gf_histogram(~ total, data = murders, color="blue", fill="grey") (i.) What is the shape of the distribution? pick one: [ Select ] ["Bimodal", "Symmetric", "Negatively Skewed", "Positively Skewed", "Uniform"] (ii.) Without any other information, what statistic would be the best measure of center? pick one: [ Select ] ["residual", "mean", "median", "variance", "standard deviation"] (iii.) On the graph below, the mean and median have been added to the histogram. Which line, red or green, is the mean and which line is the median? Fill in the blanks. The green line is the mean . The red line is the [ Select ] ["median", "mean"] . (iv.) What would you get if you were to total up the height (the "count") of all the bars in the histogram above? (pick one.) If you were to total up the height (the "count") of all the bars, then you would get the [ Select ] ["sum of the residuals", "total number of states", "total number of murders", "sum of squares"] . (v.) Even though this distribution is shaped the way that it is, the mean can still be a good model. What makes the mean a good model? (choose one) pick one: [ Select ] ["The mean is the most common number in the distribution.", "The mean is the midpoint of the range.", "The mean balances the deviations above and below the mean.", "The mean balances the number of values above and below the mean."] (vi.) Why else is the mean a good model? pick one: [ Select ] ["Because the mean is the only true statistical model that can represent a population parameter.", "Because the mean is a model that balances the residuals and minimizes the sum of squared residuals.", "Because when you quantify the total error for the mean, you get 0.", "Because the mean is the best model whenever you make a visualization of data.", "Because the mean is the best model for all variables."]

1- Negatively skewed 2- Standard Deviation 3- Median 4- Mean 5- Total # of states 6- The mean balances the deviations above and below the mean 7- Because the mean is a model that balances the residuals and minimizes the sum of squared residuals.

After running supernova(Age.model), the following gets outputted in the R Console: i. Which of the following is the correct interpretation of MS Total (1782.607)? Select one: ii. Which of the following is the correct interpretation of PRE (0.23) in the supernova table? Select one:

1- This is, roughly, the average squared residual from the mean. 2- 23% of the SS from the empty model can be explained by adding Age to the complex model.

Here is a depiction of the relationship between Salary and Age2Group: When Age2Group is included as an explanatory variable in our model of Salary, we write: SalaryGender$Age2Group <- ntile(SalaryGender$Age, 2) SalaryGender$Age2Group <- factor(SalaryGender$Age2Group, levels = c(1,2), labels = c("young", "old")) lm(formula = Salary ~ Age2Group, data = SalaryGender) As a result, the following is printed in the R Console: i. How would you interpret 31.87? Select one: [ Select ] ["This is the number of teachers, on average, who are old.", "This represents the average salaries of teachers in the old group.", "This represents the difference in average salaries for teachers in the high group relative to the grand mean.", "This represents the difference in average salaries for teachers in the high group relative to the low group."] ii. In Y i = 36.59 + 31.87 X i + e i, what does X i stand for? Select one: [ Select ] ["Whether a teacher is in the old group or not.", "The intercept for Age2Group.", "The average salary of teachers in the old group.", "The average salary of teachers in the young group.", "Whether a teacher is in the young group or not.", "The number of teachers in the old group."] iii. When Age2Group is included in our model to explain variation in Salary, how is error from this more complex model calculated? Select one: The deviation of each Age2Group's mean to the Grand Mean for Salary.

1- This represents the difference in average salaries for teachers in the high group relative to the low group 2- Whether a teacher is in the old group or not. 3- The deviation of each teacher's Salary from the mean Salary of their Age2Group (Young or Old).

Continuing with the murders data frame and the red lines of code from Questions 14 and 15... We can fit a normal curve on the distribution of murders_per_million by adding the code, below, to our R Sandbox. outcome.stats <- favstats(~murders_per_million, data=murders) gf_dhistogram(~murders_per_million, data=murders, color="blue", fill="grey")%>% gf_labs(title="Total Gun Murders in Each State", x="Murders Per Million", y="Density")%>% gf_vline(xintercept=~mean, color="red", data=outcome.stats)%>% gf_dist("norm", color="red", params=list(outcome.stats$mean, outcome.stats$sd)) Here's what gets outputted in the R Console, Plots window: (i.) What is it that we're modeling with the normal curve? pick one: [ Select ] ["The empty model for murders_per_million.", "The median", "Sample statistics", "The error around the model for murders_per_million."] (ii.) In the figure below, which part represents the probability that a state would have a murders_per_million above 49 (depicted in green)? Which part represents the z-score for 49? pick one: [ Select ] ["C; D", "B; D", "D; A", "A; B", "C; B"] (iii.) Use the xpnorm() function to find the probability that a state would have a murders_per_million above 49. The probability is: [ Select ] ["0.1939", "Can't be determined.", "0.8061", "0.86"] (iv.) Suppose we run the following code: tally(~murders_per_million > 49, data = murders, format="proportion") We get the following output: murders_per_million > 49 TRUE FALSE 0.07843137 0.92156863 What can we now say? pick one: Both of these options are correct.

1-The error around the model for murders_per_million. 2- C;D 3- 0.1939 4- Both of these options are correct
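
As a rough check on the 0.1939 answer, the same upper-tail probability can be computed with base R's pnorm() instead of xpnorm(). This is a sketch that assumes the mean (about 27.79) and standard deviation (24.56118) quoted in these questions; the variable names are illustrative.

```r
m <- 27.79      # mean of murders_per_million quoted in the questions
s <- 24.56118   # standard deviation quoted in the questions

# z-score for a value of 49 under the normal model
z <- (49 - m) / s

# Probability of a value above 49 under the normal model (upper tail)
p_above <- pnorm(49, mean = m, sd = s, lower.tail = FALSE)
round(p_above, 4)
```

The normal model gives roughly 0.19, while the tally() output in part (iv.) shows that only about 0.08 of the states in the data are actually above 49, so the model-based and data-based probabilities need not agree.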

You run the following command: RandomPatients <- sample(NutritionStudy, 10). What will be the result?

A new data frame of 10 patients drawn randomly from the NutritionStudy data frame

Continuing with the murders data frame... Since population is typically such a large number (many millions), R will usually put that number in scientific notation. The following code creates a new variable called pop_in_millions and saves it in the murders data frame. For instance, California's population is 37,253,956. Dividing by 1,000,000 will make California's pop_in_millions about 37.25. Enter the following code into your R Sandbox: murders$pop_in_millions <- murders$population / 1000000 We'll also create a new variable called murders_per_million: for every million people in a state, this will show how many gun murders there are. We'll take total and divide by pop_in_millions. Enter the following code into your R Sandbox, in a new line: murders$murders_per_million <- murders$total / murders$pop_in_millions Finally, run the following code in your R sandbox: favstats(~murders_per_million, data=murders) (i.) As we can see, the average murders_per_million is about 27.79. Imagine that you've discovered that one of your observations has been recorded incorrectly. Instead of correctly recording 1257 murders in California, the number of murders has been entered as 1527 in the murders data frame. Now, suppose you correct this mistake, and change the 1527 entry to 1257. How will the correction (changing 1527 to 1257) affect the mean? pick one: The mean will be lower because of the correction. (ii.) What does the standard deviation, 24.56118, tell us? pick one: Roughly the average squared deviation from the mean, in squared murders per million. (iii.) If we had calculated the variance for murders_per_million, what would we have found? pick one: The sum of the residuals from the mean.

1-The mean will be lower because of the correction. 2-Roughly the average deviation from the mean, in murders per million. 3-Roughly the average squared residual from the mean, in squared murders per million.

Continuing with the murders data frame and the red lines of code from Question 14... Let's create an empty model for murders_per_million by running the following code: emptymodel <- lm(murders_per_million ~ NULL, data = murders) emptymodel (i.) What does it mean to have an "empty model"? pick one: [ Select ] ["The model would predict a different amount of murders_per_million depending on the situation.", "The model is linear.", "The model would be the best way of explaining how many variables contribute to murders_per_million (such as pop_in_millions).", "The model would include only the mean of murders_per_million."] (ii.) After running the emptymodel R code, what are we able to tell from the output? pick one: [ Select ] ["the population mean", "How much error there is around the empty model", "the sample mean", "the standard deviation"] (iii.) Run the emptymodel R code. What is the coefficient? pick one: [ Select ] ["27.79", "24.56", "51", "26.87"] (iv.) In using lm() to fit the empty model for murders_per_million in the murders data frame, what can we say based on the output? pick one: [ Select ] ["The mean of the distribution of murders_per_million in this data frame is about 27.79.", "The best-fitting number for the empty model is about 27.79.", "27.79 is an unbiased estimate.", "We can say ALL of these things."] (v.) If you ran the R code below, what would you be able to tell from the output? emptymodel <- lm(murders_per_million ~ NULL, data = murders) anova(emptymodel) pick one: [ Select ] ["The sum of the squared residuals", "How much error there is around the empty model", "The sum of squares", "We can say ALL of these things."]

1-The model would include only the mean of murders_per_million. 2-the sample mean 3-27.79 4-We can say ALL of these things. 5-We can say ALL of these things.

Using the StudentSurvey data frame, run favstats on Siblings. At what value would the sum of squared errors (sum of squares) be lowest?

1.7
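
The reason the answer is the mean can be seen by trying many candidate values and computing the sum of squares around each one. The sketch below uses a hypothetical vector of sibling counts (not the StudentSurvey data); the helper ss() is an assumed name.

```r
# Hypothetical sibling counts; NOT the StudentSurvey data
siblings <- c(0, 1, 1, 2, 2, 3, 4)

# Sum of squared errors around a candidate value
ss <- function(guess, y) sum((y - guess)^2)

# Try a grid of candidate values and keep the one with the smallest SS
candidates <- seq(0, 4, by = 0.01)
ss_values  <- sapply(candidates, ss, y = siblings)
best <- candidates[which.min(ss_values)]

best            # grid value with the lowest sum of squares
mean(siblings)  # the sample mean, for comparison
```

The best grid value lands next to the sample mean, which is why reading the mean off favstats() answers this kind of question.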

If you use the distribution of WgtGain4 in the FatMice18 data frame (shown in this histogram) as a probability model, what is the likelihood of a mouse in a future study gaining more than 15 grams of weight?

1/18

Tally up the number of lakes for which the variable AgeData is 0. How many are there?

10

The FloridaLakes data frame includes information collected by researchers when they analyzed samples of water (collected in standardized test tubes) from a number of lakes. Using the DataCamp window above, determine how many lakes are included in the data frame.

53

Using the SpeedDating data frame, run favstats on AttractiveF. At what value would the sum of squared errors (sum of squares) be lowest?

6.3

Experiment, using different numbers of bins in your histogram of Smokers. If you don't want to see gaps between the blocks in your histogram, which is the better choice?

A small number of bins

You decide to make a relative frequency histogram for PhysicalActivity and have added the following to your code: %>% gf_density() What will you now see that you didn't see before?

A smooth density plot overlaying your bars.

In the NutritionStudy data frame, the number 6.3 appears in the column labeled Fiber. This 6.3 is an example of:

A value

At a college, the scores on the chemistry final exam are approximately normally distributed, with a mean of 75 and a standard deviation of 12. The scores on the statistics final are also approximately normally distributed, with a mean of 80 and a standard deviation of 8. (a) A student scored 81 on the chemistry final. What is the student's standard score (z-score)? z = [ Select ] ["-0.50", "-1.50", "0.50", "-1.00", "1.50", "1.00"] (b) The same student scored 84 on the statistics final. What is the student's standard score (z-score)? z = 1.00 (c) Relative to the students in each respective class, in which subject did this student do better? pick one: [ Select ] ["The student did better on the Chemistry final than on the Statistics final.", "The student did equally well on both exams, relative to the students in each respective class.", "The student didn't do well on either exam.", "The student did better on the Statistics final than on the Chemistry final."] (d) Using the Range Rule of Thumb, determine whether the student's score on either exam was significantly (unusually) low or significantly (unusually) high. pick one: [ Select ] ["The student's score on their statistics final was unusually high, but their score on the chemistry final was usual.", "The student's score on their chemistry final was unusually high, but their score on the statistics final was usual.", "The student scored unusually high on both exams.", "The student's score wasn't unusually high nor low on either exam."]

A) 0.50 B) 0.50 C) The student did equally well on both exams, relative to the students in each respective class. D)The student's score wasn't unusually high nor low on either exam.
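
The arithmetic behind these answers is z = (score - mean) / sd. A minimal sketch in base R follows; the helper names z_score() and unusual() are assumptions, and the cutoff of 2 standard deviations is the usual reading of the Range Rule of Thumb.

```r
# z-score: how many standard deviations a score sits from the mean
z_score <- function(x, mean, sd) (x - mean) / sd

z_chem  <- z_score(81, mean = 75, sd = 12)  # (81 - 75) / 12
z_stats <- z_score(84, mean = 80, sd = 8)   # (84 - 80) / 8

# Range Rule of Thumb: flag scores more than 2 SDs from the mean
unusual <- function(z) abs(z) > 2

c(chemistry = z_chem, statistics = z_stats)
c(chemistry = unusual(z_chem), statistics = unusual(z_stats))
```

Both z-scores come out to 0.5, so the student did equally well on both exams relative to each class, and neither score is flagged as unusual.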

The mean of Alcohol is 3.279 per day. A particular patient consumes 2 drinks per day. Which of the following represents the residual for this patient under the empty model?

All of the above

The scatter plot of Happiness by GPA is below. The mean is drawn in orange. What does the scatter plot tell you about the relationship between Happiness and GPA? (Assume that the maximum Happiness score was 36.)

All of the above

Use lm() to fit the empty model for TV in the StudentSurvey data frame. What can you say based on the output?

All of the above

We have quantified error from the FTMade.model of Mins and the Points.model of Mins by using the supernova() function. Which of the following are reasons to think that the Points.model is better than the FTMade.model?

All of the above

We use this code to calculate the correlation coefficient (Pearson's r) for Mins and Points: cor(Mins ~ Points, data = NBAPlayer2011) What have we found?

All of the above

What R code will output the standard deviation for hwy?

All of the above

What notation cannot be used to represent the mean of the sample?

All of the above

Recall that the variable SpeedUp is the difference between swimming with Wetsuit versus NoWetsuit. Why might you want to find the point below which 2.5% of bootstrapped sample means for SpeedUp fall, and the point above which 2.5% of simulated sample means for SpeedUp fall?

All of the above.

The sum of squares gets larger as:

All of the answer choices are correct.

If the standard deviation for a set of data is 0, then which of the following must be true?

All of the data values are identical.

Imagine that you have both the empty model for Exercise and the complex model for Exercise (i.e., the model that includes Pulse3Group). What would you do if you wanted to compare how well they predict Exercise?

Any of the above

The NutritionStudy data frame includes information on the number of Calories patients consumed per day. Produce a histogram of Calories, without indicating a particular number of bins or a particular bin size. What is the peak of the histogram?

Around 1600

If the diagram above represents data points in a regression model, which distance would represent the reduction in error of the regression model compared to the empty model?

B

Why might it be helpful to calculate means from random samples of 176 drawn from a normal distribution with the same mean and standard deviation as Points?

Because the resulting sampling distribution would give us an idea of how much random sample means could vary.

The histogram below was created with this code: gf_histogram(~ hwy, data = mpg, fill = "magenta", bins = 10) Why does this histogram look different than the one in the previous question?

Because the values of hwy (i.e., highway miles per gallon) were put into fewer bins

We made the variable smoke into a factor and saved it as smoke.factor. Then we fit a model that used smoke.factor to explain variation in wt. The ANOVA table for this model is shown above. Why is the degrees of freedom for the Model 3?

Because there were three additional parameters estimated in the smoke.factor model compared to the empty model.

What is the observational unit in this data frame?

Bike commutes

What notation can be used to represent the mean of the population?

Both of the above

You've just run the following code: tally(~ WgtGain4 > 10, data = FatMice18, format="proportion") You've gotten the following output: What can you now say?

Both of the above

Let's say we calculate the residuals from both the empty model and the complex model. What is similar about these two sets of residuals?

Both sets of residuals represent the difference between the data and the model's prediction.

Above is the supernova table for Light.model, which uses Light to explain variation in WgtGain4. Why is the degrees of freedom for Model equal to 1?

The Light model uses up a degree of freedom by estimating one more parameter, b1, than the empty model.

Assume this code has already been run: wt.stats <- favstats( ~ wt, data = Gestation100) What will the following line of code do? rnorm(100, wt.stats$mean, wt.stats$sd)

Generate a random sample of 100 data points from a normal distribution with the same center and spread as wt.

What would the following R code do, beyond creating a histogram? gf_histogram(~ College, data = USStates) %>% gf_labs(title = "Distribution of Residents with College Degrees", x = "Percentage")

Give the histogram a title and specify the label for the x-axis.

Here is a density histogram of self-reported Happiness faceted by Stress (high vs. normal). Where on the density histogram would you look to see evidence of within group variation in Happiness?

Horizontally, along the x-axis

You fit a regression model, then construct a 95% confidence interval for the estimate of β1. If the confidence interval includes 0, what does this mean?

It suggests we should retain the empty model.

If you decide you want to increase your level of confidence for your estimate of LikeM (from 95% to 99%), what will happen to your confidence interval?

It will become wider.

If you decide you want to increase your level of confidence in your estimate of Wetsuit (from 95% to 99%), what will happen to your confidence interval?

It will become wider.

What would happen if we went from a 95% confidence interval to a 90% confidence interval?

It will get narrower because we have less confidence
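
How the interval width changes with the confidence level can be illustrated with a t-based interval for a mean. This is a sketch on hypothetical data; ci_mean() is an assumed helper, not a function from the course materials.

```r
# Hypothetical measurements; not from any course data frame
x <- c(4.1, 5.3, 6.0, 5.5, 4.8, 6.2, 5.1, 5.9)

# t-based confidence interval for the mean at a given confidence level
ci_mean <- function(x, level) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)                       # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)   # critical t value
  mean(x) + c(-1, 1) * t_crit * se
}

# Interval widths at three confidence levels
w90 <- diff(ci_mean(x, 0.90))
w95 <- diff(ci_mean(x, 0.95))
w99 <- diff(ci_mean(x, 0.99))
c(w90 = w90, w95 = w95, w99 = w99)
```

The widths increase from 90% to 95% to 99% because a larger critical value is needed to capture more of the sampling distribution.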

Using the USStates data frame, make a bar graph to illustrate the number of states that voted for McCain and Obama (recorded as the variable Pres2008). Based on what you see, which of the following statements is true?

McCain won in more than 20 states.

Michelle's doctor told her that the standardized score (z-score) for her systolic blood pressure, as compared to the blood pressure of other women her age, is 1.50 . Which of the following is the best interpretation of this standardized score?

Michelle's systolic blood pressure is 1.50 standard deviations above the mean systolic blood pressure of women her age.

Construct a boxplot using data from the NutritionStudy that illustrates how females and males (coded in the variable Gender) differ in daily consumption of Calories. Which of the following is true?

More than half of females consume less than 2,000 calories per day.

Above we show the favstats() for females' ratings of their date's intelligence (IntelligentF). If the researchers collected a new sample of 273 speed dates, what value in this output would be different?

Most likely, all of the above

The output of favstats(~ wt, data = Gestation100) is shown above. If they had collected a different sample of 100 newborns, what value would be different?

Most likely, all of the above

Does this show that cardiovascular health (that is, being in a lower pulse group) causes students to also exercise more?

No, because the design of this study was correlational, so we cannot determine causation.

Use the DataCamp window above to construct a faceted histogram of Fat by EverSmoke in the NutritionStudy data frame. Which of the three EverSmoke groups looks like the panel below?

Patients who CURRENTLY smoke

Which distribution do the numbers in the confidence interval represent? For example, if we are trying to estimate the confidence interval of the mean, which distribution's mean are we 95% confident lies in this range?

Population

Which of the following would have the same exact value?

Population mean and sampling distribution mean

Which of the following describes the error pointed to by the green label?

Positive residual

Still curious about Stress as an explanatory variable in the SleepStudy, you construct a boxplot to see if it's related to DepressionScore. Which of the following is true?

Q3 for participants with normal stress levels is roughly the same as Q1 for participants with high stress levels.

What will happen in R if you run: print("StatsCourse")?

R will display: "StatsCourse".

If you've calculated the standard deviation for hwy, what have you found?

Roughly the average deviation from the mean, in highway miles per gallon

If you've calculated the variance for WgtGain4, what have you found?

Roughly the average squared residual from the empty model, in squared grams

Let's say you wanted to create a vector called "SAT" from a list of SAT scores. How would you do that?

SAT <- c(1300, 1120, 1050, 1470, 1350)

The College Board discovered a mistake! All of their tests administered in 2017 were scored 50 points lower than they should have been. Assuming they have a vector called SAT.2017 that includes all test scores, how would they add 50 points to each score in the vector?

SAT.2017 + 50
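
The vector cards above can be tried together in a few lines. In this sketch, SAT.2017 is filled with the same five scores purely as a hypothetical stand-in, and corrected is an assumed name.

```r
# Create the vector; R is case sensitive, so it must later be called SAT, not sat
SAT <- c(1300, 1120, 1050, 1470, 1350)

# Fourth score in the vector
SAT[4]

# Hypothetical stand-in for the 2017 scores
SAT.2017 <- SAT

# Adding 50 applies to every element; assign the result to keep it
corrected <- SAT.2017 + 50
corrected
```

Running SAT.2017 + 50 on its own only prints the adjusted scores; assigning the result, as above, is what stores them.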

What is the difference between Standard Deviation and Standard Error?

Standard Error applies to a sampling distribution; Standard Deviation applies to sample or population distributions.

Having a low resting heart rate (recorded in the variable Pulse) is supposed to be an indicator of good cardiovascular health. Let's say we wanted to create three groups based on Pulse: low, medium, and high. Which of the following code would do that, and save the values in a new variable called Pulse3Group?

StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 3)

Let's compare two models, one that treats Points as a quantitative variable to predict Mins, and the other that uses Points to create 24 groups (Points24Group) before using it to predict Mins. The supernova tables below show that the PRE for Points24Group reduces the total variation in Mins by 77%, but the Points model reduces it by 65%. Why isn't the Points24Group model better than the Points model of Mins?

The F ratio shows that the Points model explains more variation per degree of freedom than the Points24Group.

Which of the following R code would create a new variable called SpeedUp that contains the difference between swimming with Wetsuit versus NoWetsuit?

Wetsuits$SpeedUp <- Wetsuits$Wetsuit - Wetsuits$NoWetsuit

If you simulate 10,000 samples of 176 NBA players and the points they scored in the 2010-2011 NBA season, then calculate the mean of each sample, and then plot the resulting distribution of means in a histogram, what will be the mean of this sampling distribution?

Whatever mean you set when you ran the simulation

If the researchers are interested in whether wearing wetsuits affects swimming velocity, what is the outcome variable of interest?

The difference in velocity between swimming in NoWetsuit compared to Wetsuit.

Below is a boxplot of Calories consumed per day by Gender. On the right you see the distribution for males. The two rectangles that compose the "box" portion of the plot have different heights. What does that mean?

The distribution of Calories consumed by males is skewed.

What's true of the distribution of any variable, if your model is the mean of that variable?

The distribution of the variable is the same shape as the distribution of its residual.

A researcher was interested in the fat content in patients' diets. What does the 50.1 (in the green box) represent?

The number of grams of fat consumed per day by this patient.

Above is the histogram for WgtGain4. What would you get if you were to total up the height (the "count") of all the bars?

The number of mice in the FatMice18 data frame

You're interested in females' ratings of males' intelligence. You simulate 500 samples of 276 ratings, calculate the mean of each sample, and plot the resulting distribution of means in a histogram. What will be the mean of this sampling distribution?

Whatever mean you set when you ran the simulation

The sampling distribution of means above was created with this code: SDoM <- do(10000) * mean(rnorm(100, wt.stats$mean, wt.stats$sd)) gf_histogram(~ mean, data = SDoM, fill = "burlywood") Which of the following is a true statement?

The standard error of this distribution is smaller than wt.stats$sd.
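
A base R version of this simulation shows why the standard error is smaller than the standard deviation. The sketch uses replicate() in place of do() from the mosaic package, and the population mean and standard deviation are hypothetical values, not the wt statistics.

```r
set.seed(1)  # for reproducibility

# Hypothetical population parameters; NOT the values from wt.stats
pop_mean <- 120
pop_sd   <- 18

# Draw 10,000 samples of size 100 and record each sample mean
sample_means <- replicate(10000, mean(rnorm(100, pop_mean, pop_sd)))

se_sim    <- sd(sample_means)    # spread of the sampling distribution
se_theory <- pop_sd / sqrt(100)  # sd divided by the square root of n

c(simulated = se_sim, theoretical = se_theory, population_sd = pop_sd)
mean(sample_means)               # close to pop_mean
```

The simulated standard error is close to the standard deviation divided by the square root of the sample size, and the mean of the sample means is close to whatever mean was set in the simulation.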

Imagine we drew two random samples from a population, and measured each case sampled on the same outcome variable. One sample had an n=30, the other an n=60. Which of the following statements is true?

The sum of squares of the larger sample would almost certainly be greater than the sum of squares of the smaller sample.

Below is the histogram for hwy. What would you get if you were to total up the height (the "count") of all the bars?

The total number of cars in the mpg data frame

If we used this code to fit the empty model: Empty.model <- lm(Exercise ~ NULL, data = StudentSurvey) And then used the predict() function to make a prediction for each student's number of hours exercised per week, what value would it predict for each student?

The value would be the mean number of hours exercised by this sample and would be the same for each student.

If we used this code to fit the empty model: Empty.model <- lm(Salary ~ NULL, data = SalaryGender) And then used the predict() function to make a prediction for each teacher's salary, what value would it predict for each teacher?

The value would be the mean salary of this sample and would be the same for each teacher.
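
The behavior of predict() on an empty model can be confirmed on a tiny made-up data set. The values and the names toy and empty_model below are hypothetical.

```r
# Hypothetical hours of exercise per week; not the StudentSurvey data
toy <- data.frame(exercise = c(2, 5, 7, 10, 6))

# Empty model: no explanatory variable, only the intercept b0
# (exercise ~ 1 is an equivalent way to write this formula)
empty_model <- lm(exercise ~ NULL, data = toy)

b0    <- unname(coef(empty_model))     # the single estimated parameter
preds <- unname(predict(empty_model))  # one prediction per row

b0
preds
sum(resid(empty_model)^2)              # sum of squares around the mean
```

Every prediction equals b0, which is the sample mean, and the sum of squared residuals is the sum of squares around that mean.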

Change your histogram of Smokers in the USStates data frame into a density histogram by using gf_dhistogram() instead of gf_histogram(). What changed?

The y-axis

In the jitter plot below, we've put a green box around a dense row of data and a red box around a less dense row of data. What does the density of dots represent?

There are a lot of individuals who have the same value on the y-axis.

You've been commissioned to do a study of all lakes with average mercury levels above 1. You want to save the data of the lakes that meet this criterion to a new data frame called HighMercury. What's wrong with the following code? HighMercury <- filter(floridalakes, avgmercury > 1)

There are capitalization errors.

Based on this histogram, which of the following observations are true?

There are fewer people who consider themselves larks than those who consider themselves neither a Lark nor an Owl.

What's the purpose of generating a sampling distribution of means of SpeedUp by resampling (also called bootstrapping)?

This distribution can help you quantify how much your best estimate of the population mean could vary.

SpeedUp contains the swimming velocity with Wetsuit minus NoWetsuit. The histogram above was created with this code: gf_histogram(~ SpeedUp, data = Wetsuits, bins = 6, fill = "black", color = "royalblue1", alpha = .8) How would you modify this code to look at the distribution of swimmers that were faster with a wetsuit and those that were faster with no wetsuit?

This histogram shows that all swimmers were faster with a wetsuit.

Perhaps the mother's age (age) could explain some of the variation in birth weight (wt). Above, we have depicted the output of this model: age.model <- lm(wt ~ age, data = Gestation100) What does the .43 mean?

This is the increment to add on to the prediction of wt for every year of mother's age.

Continuing with Age.model, which uses Age to explain variation in Points, Age.model <- lm(formula = Points ~ Age, data=NBAPlayers2011) Age.model What does the -9.154 mean?

This is the increment to add to the prediction of Points for every year of an NBA player's Age.

Which of the following is the correct interpretation of MS Total (297,230) in the supernova table above?

This is, roughly, the average squared residual from the mean.

Why do some players get more playing time and some players see less game time? Let's take a look at a histogram of number of minutes played to start exploring this variation. What does the orange curve drawn on this histogram represent?

This represents a normal distribution that was fit to the mean and standard deviation of this data.

Interpret the -3.14.

This represents the difference in average hours of exercise for people in the high pulse group relative to the low pulse group.

Given the distribution of data, could the population of Points be shaped like a normal distribution?

Yes, it is possible because samples often do not look exactly like the populations that they were drawn from.

Given this distribution of data, could the population of newborn weights be shaped like a normal distribution?

Yes, it is possible because samples often do not look exactly like the populations that they were drawn from.

The mean of Alcohol is 3.279 drinks per day. A particular patient consumes 2 drinks per day. Which of the following symbols would be used to represent the value 2 in the notation of the General Linear Model?

Yi

The mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6. What part of this GLM notation represents 23.6? Yi = Ȳ + ei

Yi

We made the variable smoke into a factor and saved it as smoke.factor. Then we fit a model that used smoke.factor to explain variation in wt. All the R code for this is shown below. Gestation100$smoke.factor <- factor(Gestation100$smoke, levels = c(0:3), labels = c("never", "smokes now", "until current pregnancy", "once did, not now")) smoke.factor.model <- lm(wt ~ smoke.factor, data = Gestation100) Which GLM equation specifies the smoke.factor.model?

Yi = b0 + b1X1i + b2X2i + b3X3i + ei

Perhaps the mother's age (age) could explain some of the variation in birth weight (wt). Above, we have depicted the output of this model: age.model <- lm(wt ~ age, data = Gestation100) Which equation represents this model with the best-fitting estimates?

Yi= 109.26 + .43Xi + ei

Which of the following equations represents the fitted model?

Yi= 1156.68+1.06Xi+ei

We wonder if females liked males of the same race as them (LikeF) more than males of a different race. To investigate this question we created a new variable called RaceMatch with this code: SpeedDating$RaceMatch <- SpeedDating$RaceM == SpeedDating$RaceF We then fit a model of LikeF using RaceMatch as the explanatory variable. How would you represent this RaceMatch model in GLM notation?

Yi=b0+b1Xi+ei

Let's say you've calculated the sum of squares for hwy. What would the advantage be of dividing that number by n - 1 (i.e., dividing it by the df)?

You can use it to compare error across samples of different sizes.

You'd like to see the first 10 rows of FloridaLakes, so you run head(FloridaLakes). It doesn't give you what you wanted. Why not?

You didn't indicate that you wanted to see 10 rows.

Imagine that you created a histogram of PhysicalActivity. You meant to set it to have 15 bins, but you accidentally set 5 bins instead. How would the result be different from what you hoped?

You would see less detail than you would have otherwise depicted.

The same student has a tendency to come back repeatedly to ask the same question. With that in mind, you decide to save the answer above to an R object so you can readily call it up later. You decide to call the R object annoyingstudent. Which line of code would create that object?

annoyingstudent <-SAT[4]==1470

In GLM notation, which of the following represents the model (or prediction)?

b0

Take a look at the model that we fit in this output. How would we represent this number with General Linear Model (GLM) notation?

b0

If we express our model as Yi = b0 + b1X1i + b2X2i+ ei which part represents the model's prediction for Exercise?

b0 + b1X1i + b2X2i

LeBron James scored 2,111 points in the 2011 season. In this equation, what part represents the prediction the Points.model would make for minutes played by LeBron James?

b0+b1Xi

Let's say you want to filter your data so that you do NOT include lakes that have missing data for Chlorophyll. What line of code will do that?

filter(FloridaLakes, Chlorophyll != "NA")

If you were interested in proportions rather than counts, which argument would you add to your code above?

format = "proportion"

What R code would produce a relative frequency histogram of PhysicalActivity?

gf_dhistogram(~ PhysicalActivity, data = USStates)

How would you create a plot to look at the distribution of Distance?

gf_histogram(~ Distance, data = BikeCommute)

What R code would create a distribution of Smokers?

gf_histogram(~ Smokers, data = USStates)

What R code produced the plot below?

gf_point(Happiness ~ Stress, data = SleepStudy)

Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). Which is the best way to take a look at this relationship (between Wetsuit and NoWetsuit)?

gf_point(Wetsuit ~ NoWetsuit, data = Wetsuits)

If we wanted to explore the idea that mother's age might explain the variation in birth weight, what visualization might be helpful to us?

gf_point(wt ~ age, data = Gestation100)

Above is a histogram for TopSpeed. What R code created the line that indicates the mean?

gf_vline(xintercept = 33.6, color = "blue")

Here are some data from a study of mercury levels in Florida lakes. Researchers analyzed samples of water (collected in standardized test tubes) from each lake. The study included 53 lakes in Florida and put it in a data frame called FloridaLakes. What R command produced the printout below?

head()

The rows in this data frame represent _____ and columns represent _____.

lakes; qualities of the lake

How would you quickly find the total number of water samples (or test tubes) collected across all of the lakes in your study?

sum(FloridaLakes$NumSamples)

We want to construct a 95% confidence interval for a population mean. However, we don't know the population standard deviation and we have a small sample size. Which is the most appropriate distribution to use for a sampling distribution?

t distribution

If your score on this statistics test is converted to a z-score, which of these z-scores would you most prefer?

z = 2.00

