MATH 2.0
StudentSurvey is a data frame with 362 observations on the following 17 variables:
Year: Year in school
Gender: Student's gender (F or M)
Smoke: Smoker? (No or Yes)
Award: Preferred award (Academy, Nobel, or Olympic)
HigherSAT: Which SAT is higher? (Math or Verbal)
Exercise: Hours of exercise per week
TV: Hours of TV viewing per week
Height: Height (in inches)
Weight: Weight (in pounds)
Siblings: Number of siblings
BirthOrder: Birth order (1 = oldest)
VerbalSAT: Verbal SAT score
MathSAT: Math SAT score
SAT: Combined Verbal + Math SAT
GPA: College grade point average
Pulse: Pulse rate (beats per minute)
Piercings: Number of body piercings
Using the StudentSurvey data frame, run favstats on Siblings. At what point would the sum of squared errors (sum of squares) be lowest?
1.7
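The answer is the mean because the mean is the value that minimizes the sum of squared errors. A minimal base R sketch of that idea follows; the `siblings` vector is hypothetical data made up for illustration (it is not the StudentSurvey data), and `sum_sq` is a helper name introduced here.

```r
# Hypothetical sibling counts, NOT the StudentSurvey data
siblings <- c(0, 1, 1, 2, 2, 3, 4)

# Sum of squared errors around any candidate value (helper defined for this sketch)
sum_sq <- function(center, values) sum((values - center)^2)

# Try a grid of candidate values and keep the one with the lowest sum of squares
candidates <- seq(0, 4, by = 0.01)
ss <- sapply(candidates, sum_sq, values = siblings)
best <- candidates[which.min(ss)]

best            # grid value with the smallest sum of squares
mean(siblings)  # the mean, for comparison
```

The grid search lands on the candidate closest to the mean, which is why the mean reported by favstats is the answer.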
If you use the distribution of WgtGain4 in the FatMice18 data frame (shown in this histogram) as a probability model, what is the likelihood of a mouse in a future study gaining more than 15 grams of weight?
1/18
Fit the NULL or empty model of WgtGain4 in the FatMice18 data frame. What is the sum of squares for this model?
186.28
Using the FatMice18 data frame, run favstats on WgtGain4. At what value would the sum of squared errors (sum of squares) be lowest?
8.39
If you ran the R code below, what would you be able to tell from the output?
Empty.model <- lm(hwy ~ NULL, data = mpg)
anova(Empty.model)
All of the above (How much error there is around the empty model) (The sum of the squared residuals) (The sum of squares)
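To see what the empty model and its anova table report, here is a rough sketch that runs without extra packages. It uses the built-in `mtcars` data frame as a stand-in, which is an assumption made for self-containment; the cards themselves use `hwy` from the `mpg` data frame.

```r
# Stand-in data: built-in mtcars, NOT the mpg data frame from the cards
Empty.model <- lm(mpg ~ NULL, data = mtcars)

# Printing the model shows a single estimate, the intercept, which equals the mean
Empty.model
mean(mtcars$mpg)

# The anova table has one Residuals row: df = n - 1 and the sum of squares
anova(Empty.model)

# The same sum of squares computed by hand from the residuals
ss_by_hand <- sum(resid(Empty.model)^2)
ss_by_hand
```

The hand-computed value should match the Sum Sq entry in the anova table, since both are the sum of squared residuals from the mean.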
What R code will output the standard deviation for hwy?
All of the above (sd(mpg$hwy)) (sqrt(var(mpg$hwy))) (favstats(~ hwy, data = mpg))
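The equivalence of these routes can be checked in base R. The sketch below again uses `mtcars$mpg` as a stand-in for `mpg$hwy` (an assumption, to avoid loading packages) and leaves out `favstats`, which comes from the mosaic package.

```r
# Stand-in values: mtcars$mpg instead of mpg$hwy
x <- mtcars$mpg

sd_direct   <- sd(x)         # standard deviation directly
sd_from_var <- sqrt(var(x))  # square root of the variance

# By hand: square root of the sum of squares divided by n - 1
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

c(sd_direct, sd_from_var, sd_by_hand)
```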
The sum of squares gets larger as:
All of the above (The variation increases) (The sample size increases) (The spread of the distribution increases)
In the figure below, which part represents the probability that a car would have a hwy above 29.4 (depicted in red)? Which part represents the z score for 29.4?
D; C
If we fit a normal curve on the distribution of hwy (see visualization below), what is it that we're modeling with it?
Error around the model for hwy
The mean of hwy is 23.44. If you wanted to calculate a z score for a hwy of 27, how would it be affected by the standard deviation for hwy?
The z score is positive because 27 is above the mean, and the larger the standard deviation, the smaller the z score (the closer it is to 0).
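A short sketch of how the standard deviation changes the z score for the same hwy of 27. The mean of 23.44 comes from the card; the two standard deviations (2 and 10) are hypothetical values chosen only to show the contrast, and `z_score` is a helper defined for this example.

```r
# z score: how many standard deviations a value sits from the mean
z_score <- function(value, center, spread) (value - center) / spread

hwy_mean <- 23.44  # from the card
score    <- 27

# Hypothetical standard deviations, for contrast only
z_small_sd <- z_score(score, hwy_mean, spread = 2)
z_large_sd <- z_score(score, hwy_mean, spread = 10)

z_small_sd  # larger z score when the spread is small
z_large_sd  # smaller z score when the spread is large
```

Both values are positive because 27 is above the mean; only their size depends on the standard deviation.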
If you've calculated the standard deviation for hwy, what have you found?
Roughly the average deviation from the mean, in highway miles per gallon
If you've calculated the variance for WgtGain4, what have you found?
Roughly the average squared residual from the empty model, in squared grams
What is represented by the numbers on the y-axis of this histogram?
The number of cars in each bin (the count)
If the z score for your friend's car's highway miles per gallon is found to be .6, what does that mean?
The car's highway miles per gallon is .6 standard deviations larger than the mean for hwy.
What's true of the distribution of any variable, if your model is the mean of that variable?
The distribution of the variable has the same shape as the distribution of its residuals.
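The reason is that residuals from the mean are just the original values shifted by a constant. A minimal sketch, using `mtcars$mpg` as a stand-in variable (an assumption; any numeric variable would do):

```r
# Stand-in variable: mtcars$mpg
x <- mtcars$mpg

# Residuals from the empty model are the values minus the mean
res <- x - mean(x)

# Shifting by a constant leaves the spread unchanged...
sd(x)
sd(res)

# ...and centers the distribution at 0, so the residuals balance out
sum(res)
```

A histogram of `res` would look like the histogram of `x` slid along the x-axis so that it is centered on 0.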
If you ran the R code below, what would you be able to tell from the output?
Empty.model <- lm(hwy ~ NULL, data = mpg)
Empty.model
The mean
Let's say we want to compare the Light model for weight gain (WgtGain4 = Light + error) to the empty model (WgtGain4 = mean + error). What does the "mean" in the empty model word equation refer to?
The mean of WgtGain4 for all the mice
If the z score for a mouse's weight gain is -0.7, what does that mean?
The mouse's weight gain is 0.7 standard deviations lower than the mean of WgtGain4.
If a data point is very far away from the mean, what would you expect for the residual?
The farther the point is from the mean, the larger the absolute value of its residual.
Let's say you've calculated the sum of squares for hwy. What would the advantage be of dividing that number by n - 1 (i.e., dividing it by the df)?
You can use it to compare error across samples of different sizes.
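A small sketch of why dividing by n - 1 helps. Both samples below are hypothetical; the larger one simply repeats the smaller one four times, so the spread of the values is the same while the sample size differs. `sum_of_squares` is a helper defined for this example.

```r
# Hypothetical samples with the same spread but different sizes
small_sample <- c(20, 22, 25, 27, 31)
large_sample <- rep(small_sample, times = 4)

# Sum of squared deviations from the mean
sum_of_squares <- function(values) sum((values - mean(values))^2)

ss_small <- sum_of_squares(small_sample)
ss_large <- sum_of_squares(large_sample)

# Dividing by the degrees of freedom (n - 1) gives the variance
var_small <- ss_small / (length(small_sample) - 1)
var_large <- ss_large / (length(large_sample) - 1)

c(ss_small, ss_large)    # the sum of squares grows with sample size
c(var_small, var_large)  # the variances stay on a comparable scale
```

The sum of squares quadruples just because there are four times as many values, while the variance stays in the same range.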
How would you use R to calculate variance in hwy?
var(mpg$hwy)
If we add more mice to the study, which of these would not be affected?
β0
Below we have depicted a histogram and the favstats for hwy. Using the Empirical Rule, estimate the probability that a car would have a hwy above 29.4 (depicted in red). Note, we have depicted the original data distribution in the histogram but we don't want you to think about the data. We want you to estimate the probability of a value greater than 29.4 based on a normal model of the distribution.
16%
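The 16% comes from the Empirical Rule: about 68% of a normal distribution lies within one standard deviation of the mean, leaving about 32% in the two tails, or 16% in each. The sketch below assumes a standard deviation of 5.95 for hwy; that value is not stated in the cards and is used here so that 29.4 sits about one standard deviation above the mean of 23.44.

```r
hwy_mean <- 23.44  # from the cards
hwy_sd   <- 5.95   # ASSUMED standard deviation, not stated in the cards

# How many standard deviations above the mean is 29.4?
z <- (29.4 - hwy_mean) / hwy_sd

# Empirical Rule: 68% within one sd, so half of the remaining 32% is above
empirical_estimate <- (1 - 0.68) / 2

# Upper-tail probability from the normal model itself
normal_estimate <- 1 - pnorm(z)

z
empirical_estimate
normal_estimate
```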
Below is a normal model of a population. There are more or less likely values of this variable. What part of the population would be considered "unlikely" to be randomly selected (according to the definition of "unlikely" agreed upon by the statistics community)?
B
Here we have depicted the mean as a vertical blue line. Why is the mean a good model for hwy?
Because the mean is a model that balances the residuals and minimizes the sum of squared residuals
The histogram below was created with this code: gf_histogram(~ hwy, data = mpg, fill = "magenta", bins = 10) Why does this histogram look different than the one in the previous question?
Because the values of hwy (i.e., highway miles per gallon) were put into fewer bins
You've just run the following code: tally(~ WgtGain4 > 10, data = FatMice18, format="proportion") You've gotten the following output: What can you now say?
Both of the above (Approximately 22% of mice gained more than 10 grams of weight.) (If another mouse were randomly selected and added to this data set, the likelihood that it would gain more than 10 grams would be 22%.)
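The same proportion can be computed in base R by taking the mean of a logical comparison. The 18 weight gains below are hypothetical values invented so that the proportions match the cards (4 of 18 above 10 grams, 1 of 18 above 15 grams); they are not the FatMice18 data.

```r
# Hypothetical weight gains for 18 mice, NOT the FatMice18 data
wgt_gain <- c(2, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 11, 12, 14, 17)

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUEs
prop_over_10 <- mean(wgt_gain > 10)
prop_over_15 <- mean(wgt_gain > 15)

prop_over_10  # proportion gaining more than 10 grams
prop_over_15  # proportion gaining more than 15 grams
```

Treating the data distribution as a probability model means reading these proportions as the likelihood for a future mouse.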
Consider the mouse who gained the least amount of weight in this study (circled in red in the jitter plot above). What is true about the residual of this mouse from the empty model?
The absolute value of the residual is relatively large.
The next set of questions is based on a data frame called FatMice18, which contains data for 18 mice. Mice were randomly assigned to one of two light treatments: LD (a normal light/dark cycle) or LL (light in the day and light at night as well). The researchers tracked the weight gained by each mouse (in grams) over four weeks of this light treatment. Above is the histogram for WgtGain4. What would you get if you were to total up the height (the "count") of all the bars?
The number of mice in the FatMice18 data frame
What is the difference between a residual and the standard deviation?
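A residual is the difference between a single observed value and the value the model predicts for it. The standard deviation summarizes all of the residuals in one number: the residual standard deviation is simply the standard deviation of the residual values, roughly their typical size.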
Below is the histogram for hwy. What would you get if you were to total up the height (the "count") of all the bars?
The total number of cars in the mpg data frame