Ch7 Review
Imagine that you have both the empty model for Exercise and the complex model for Exercise (i.e., the model that includes Pulse3Group). What would you do if you wanted to compare how well they predict Exercise?
Compare the SS from each model; look at the reduction in error in the Pulse3Group model; examine the PRE.
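A minimal sketch of that comparison in base R. The data frame `demo` is simulated and only stands in for StudentSurvey; its values, and the names `empty_model`, `complex_model`, `ss_total`, `ss_error`, `ss_model`, and `pre`, are assumptions made for illustration.

```r
# Simulated stand-in for StudentSurvey (hypothetical data, not the real survey)
set.seed(1)
demo <- data.frame(
  Pulse3Group = factor(rep(c("low", "med", "high"), each = 20),
                       levels = c("low", "med", "high")),
  Exercise = c(rnorm(20, mean = 10), rnorm(20, mean = 9), rnorm(20, mean = 7))
)

empty_model   <- lm(Exercise ~ NULL, data = demo)         # predicts the grand mean
complex_model <- lm(Exercise ~ Pulse3Group, data = demo)  # predicts each group's mean

ss_total <- sum(resid(empty_model)^2)    # error left over from the empty model
ss_error <- sum(resid(complex_model)^2)  # error left over from the complex model
ss_model <- ss_total - ss_error          # reduction in error
pre      <- ss_model / ss_total          # proportional reduction in error

c(SS_Total = ss_total, SS_Error = ss_error, SS_Model = ss_model, PRE = pre)
```

All three answers read off the same quantities: the two sums of squares, their difference, and that difference as a proportion of SS Total.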
We are going to try to explain variation in Exercise hours with cardiovascular health (Pulse3Group). Assume our model is the following: Exercise = Pulse3Group + other stuff. If we write the model in GLM notation, what does $Y_i$ represent?
Each person's value for Exercise
When you add an explanatory variable to your model, what should be the effect on the Sum of Squares from the empty model?
It should remain unchanged.
Does this show that cardiovascular health (that is, being in a lower pulse group) causes students to also exercise more?
No, because the design of this study was correlational, so we cannot determine causation.
Imagine that you've calculated SS for both the empty model and the complex model for Exercise. What will be true about these SS?
SS leftover from the empty model will be greater than the SS leftover from the complex model.
Why does this table have a smaller Sum of Squares Total (1699) than the supernova table for Exercise explained by Pulse3Group (11864)?
The SS Total depends on the variation in the outcome variable. Piercings is a different outcome variable so it has a different SS Total.
We can calculate the residuals from both the empty model and the complex model. What is similar about these two sets of residuals?
The residuals represent the difference between the data and the model's prediction.
If we used this code to fit the empty model, Empty.model <- lm(Exercise ~ NULL, data = StudentSurvey), and then used the predict() function to make a prediction for each student's number of hours exercised per week, what value would it predict for each student?
The value would be the mean number of hours exercised by this sample and would be the same for each student.
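This can be checked on a small made-up sample. The sketch below assumes a hypothetical data frame `demo` with ten invented Exercise values; only `lm()`, `predict()`, and `mean()` come from the card itself or base R.

```r
# Hypothetical weekly exercise hours for ten students (not real survey data)
set.seed(2)
demo <- data.frame(Exercise = round(runif(10, min = 0, max = 20)))

empty_model <- lm(Exercise ~ NULL, data = demo)
predictions <- predict(empty_model)  # one prediction per student

# Every prediction is the same value, and that value is the sample mean
unique(round(predictions, 10))
mean(demo$Exercise)
```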
In $Y_i = 10.38 - 0.85X_{1i} - 3.14X_{2i} + e_i$, what does $X_{1i}$ stand for?
Whether someone is in the medium pulse group or not
Which of these options could be used to depict the relationship between Exercise and Pulse3Group?
gf_boxplot(Exercise ~ Pulse3Group, data = StudentSurvey); gf_histogram(~ Exercise, data = StudentSurvey) %>% gf_facet_grid(Pulse3Group ~ .); gf_point(Exercise ~ Pulse3Group, data = StudentSurvey)
We are going to try to explain variation in Exercise hours with cardiovascular health (Pulse3Group). Assume our model is the following: Exercise = Pulse3Group + other stuff. If we write the model in GLM notation, which equation represents this Pulse3Group model?
$Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$
If we express our model as $Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$, which part represents the model's prediction for Exercise?
$b_0 + b_1X_{1i} + b_2X_{2i}$
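To see how the prediction part works with a three-group variable, here is a rough sketch using base R's `model.matrix()`. The data frame `demo` and its values are simulated assumptions, and the low group is assumed to be the reference level.

```r
# Simulated stand-in data (hypothetical), with "low" as the reference group
set.seed(3)
demo <- data.frame(
  Pulse3Group = factor(rep(c("low", "med", "high"), each = 5),
                       levels = c("low", "med", "high")),
  Exercise = c(rnorm(5, mean = 10), rnorm(5, mean = 9), rnorm(5, mean = 7))
)

model <- lm(Exercise ~ Pulse3Group, data = demo)

# Columns: (Intercept), Pulse3Groupmed (X1), Pulse3Grouphigh (X2).
# X1 is 1 for people in the medium group and 0 otherwise; X2 likewise for high.
X <- model.matrix(model)
b <- coef(model)  # b0, b1, b2

# Prediction for each person: b0 + b1 * X1i + b2 * X2i
manual <- as.vector(X %*% b)
all.equal(manual, unname(predict(model)))
```

Under this coding, people in the low group have both X values equal to 0, so their prediction is just b0.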
PRE = 0.05. Interpret the PRE.
0.05 (5%) of the total variation in exercise hours is explained by the pulse groups.
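For reference, PRE can be written in terms of the sums of squares discussed in the other cards. The subscripted labels below are a notational assumption: SS Total is the error left over from the empty model, and SS Error is the error left over from the complex model.

```latex
\[
\mathrm{PRE}
  = \frac{SS_{\text{Total}} - SS_{\text{Error}}}{SS_{\text{Total}}}
  = \frac{SS_{\text{Model}}}{SS_{\text{Total}}}
\]
```

A PRE of 0.05 therefore means the complex model still leaves 95% of SS Total unexplained.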
We can calculate the residuals from the complex model and the mean residuals for the three groups with this code: StudentSurvey$Residuals <- resid(Pulse3Group.model); mean(Residuals ~ Pulse3Group, data = StudentSurvey). Here is the output: low: -3.9e-16, med: 1.06e-15, high: -1.23e-15. Have you done something wrong in R?
No. The mean always balances the residuals, so the residuals within each group average to zero. The tiny values in the output are zero up to floating-point rounding.
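The same behavior shows up on simulated data. This sketch swaps in base R's `tapply()` for the formula form of `mean()` used in the card, and the data frame `demo` is a hypothetical stand-in for StudentSurvey.

```r
# Simulated stand-in data (hypothetical)
set.seed(4)
demo <- data.frame(
  Pulse3Group = factor(rep(c("low", "med", "high"), each = 15),
                       levels = c("low", "med", "high")),
  Exercise = c(rnorm(15, mean = 10), rnorm(15, mean = 9), rnorm(15, mean = 7))
)

model <- lm(Exercise ~ Pulse3Group, data = demo)
demo$Residuals <- resid(model)

# Mean residual within each group: not exactly 0, but negligibly small
group_means <- tapply(demo$Residuals, demo$Pulse3Group, mean)
group_means
all(abs(group_means) < 1e-10)
```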
Having a low resting heart rate (recorded in the variable Pulse) is supposed to be an indicator of good cardiovascular health. Let's say we wanted to create three groups based on Pulse: low, medium, and high. Which of the following code would do that, and save the values in a new variable called Pulse3Group?
StudentSurvey$Pulse3Group <- ntile(StudentSurvey$Pulse, 3)
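To illustrate the idea of splitting a variable into three equal-sized, ordered groups, here is a base R sketch. It is not the implementation of `ntile()`, only a rank-based approximation of the same idea. The `pulse` values are invented, and attaching the labels "low", "med", and "high" with `factor()` is an assumed extra step.

```r
# Hypothetical resting pulse values for nine students
pulse <- c(58, 72, 64, 80, 90, 66, 70, 55, 77)

# Rank the values, then cut the ranks into three equal-sized bins (1, 2, 3)
ranks <- rank(pulse, ties.method = "first")
group <- floor(3 * (ranks - 1) / length(pulse)) + 1

# Assumed labeling step so the groups print as low / med / high
pulse3 <- factor(group, levels = 1:3, labels = c("low", "med", "high"))
table(pulse3)
```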
When Pulse3Group is included in our model to explain variation in Exercise, how is error from this more complex model calculated?
The deviation of each person's Exercise from the mean Exercise of their Pulse3Group
When Pulse3Group is included as an explanatory variable in our model of Exercise, we get this as the output of lm(Exercise ~ Pulse3Group, data = StudentSurvey).
This represents the difference in average hours of exercise for people in the high pulse group relative to the low pulse group.
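A small made-up example of that interpretation, assuming the low pulse group is the reference level. The data frame `demo` and its six values are hypothetical, chosen so the group means are easy to check by hand.

```r
# Hypothetical data: group means are low = 11, med = 10, high = 7
demo <- data.frame(
  Pulse3Group = factor(c("low", "low", "med", "med", "high", "high"),
                       levels = c("low", "med", "high")),
  Exercise = c(12, 10, 9, 11, 6, 8)
)

model <- lm(Exercise ~ Pulse3Group, data = demo)
group_means <- tapply(demo$Exercise, demo$Pulse3Group, mean)

coef(model)  # intercept is the low group's mean; the others are differences from it
high_vs_low <- unname(group_means["high"] - group_means["low"])
high_vs_low  # matches the Pulse3Grouphigh coefficient
```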