Chapter 5

Ace your homework & exams now with Quizwiz!

What is the difference between a model of data and a model of DGP ?

The notation is different. We use Greek letters to refer to the model of DGP and Roman Letters to refer to the model of data . Our certainty about each model's accuracy is different. We can be certain about our sample model statistics, but never truly know the parameters in the model of the DGP.

This bike rider has an intuition that the kind of bike he uses affects the top speed he can reach on the bike. To get a sense of the distribution of top speed, he looks at the output. Which function created this output?

favstats()

How would you create a plot to look at the distribution of Distance?

gf_histogram(~ Distance, data = BikeCommute)

Above is a histogram for TopSpeed. What R code created the line that indicates the mean?

gf_vline(xintercept = 33.6, color = "blue")

What is the difference between 𝑏0 and 𝑌⎯⎯⎯⎯Y¯?

.

What is the difference between 𝑏0 and 𝛽0?

.

How will this correction (changing 54 back to 27) affect the mean and the median?

The median will be affected less than will the mean.

If you create an empty model of TopSpeed, what would it mean to have an "empty model"?

The model would include only mean TopSpeed.

What would be true about the empty model for Fat?

The model would make the same prediction (the mean of Fat) for every person regardless of their values on other variables.

Which of the following cannot be calculated from the NutritionStudy data set?

A parameter

When we ran this R code to fit the empty model to our data for Thumb length from the full Fingers data set, what was the number 60.1?

A statistic and A parameter estimate

What notation can be used to represent the mean of the population?

-𝛽0 -𝜇

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the residual?

-10

When we ran this R code to fit the empty model to our data for Thumb length from the full Fingers data set, what was the number 60.1?

-A statistic -A parameter estimate

Take the StudentSurvey data frame and use lm() to fit the empty model for SAT. What is 1204?

-An unbiased estimate of SAT -The estimate of SAT that has the least error -The mean SAT score

Using the StudentSurvey data frame, create a faceted histogram for Weight by Gender. The mean is likely to be a better model for which gender?

-Females

If you print the residuals for StudentSurvey.modelGPA, what will you see?

-For each participant in the study, the difference between his/her GPA and the mean GPA

Take the StudentSurvey data frame and use lm() to fit the empty model for GPA. Save the results in an R object StudentSurvey.modelGPA. What do you get when you run StudentSurvey.modelGPA in R?

-The "intercept" -3.158 -The mean for GPA

Why is the mean unbiased?

-The errors in both directions from the mean are equal. -The sum of the differences on the left of the mean is always the same as that on the right of the mean -The mean is not more influenced by bigger values than by smaller values.

How is the mean the "middle" of the distribution

-The mean is a balancing point for the error above and below the mean. -The deviations above the mean are equal to the absolute deviations below the mean.

Use lm() to fit the empty model for TV in the StudentSurvey data frame. What can you say based on the output?

-The mean of the distribution of hours spent viewing TV by students in this data frame is 6.054. -The best-fitting number for the empty model is 6.054. -6.054 is an unbiased estimate.

What is the difference between a model of data and a model of the DGP?

-The notation is different. We use Greek letters (e.g., 𝛽β, 𝜎σ) to refer to the model of the DGP and Roman letters (e.g., b, s) to refer to the model of data. -Our certainty about each model's accuracy is different. We can be certain about our sample model statistics, but never truly know the parameters in the model of the DGP.

How is the median the "middle of the distribution?

-There are an equal number of data points above and below the median. -The number of values above the median are equal to the number of values below the median.

What notation cannot be used to represent the mean of the sample?

-𝑌𝑖 -𝛽0 -𝜇

The mean of Alcohol is 3.279 per day. A particular patient consumes 2 drinks per day. Which of the following represents the residual for this patient under the empty model?

-𝑌𝑖−𝑏0 - 2-3.279 -𝑒𝑖

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the data?

23.6

If the mean for TopSpeed is 33.6, what will the empty model predict for each observation's TopSpeed?

33.6

Use lm() to fit the empty model for Fat in the NutritionStudy data frame. What is the coefficient?

77.03

Why is the mean a good model for this distribution?

Because the mean balances the deviations above and below the mean.

When we use the mean as a model, why do we call it a "parameter estimate"?

Because we can't calculate the mean of the DGP, we must estimate it.

Why do we call it "parameter estimate" ?

Because we cant calculate the mean of the DGP , we must estimate it

What is the observational unit in this data frame?

Bike commutes

𝑌𝑖

Data

ei

Error

Which of the following word equations represents the hypothesis that smoking explains variation in fat consumption.

Fat consumption = Smoking + other stuff

𝑏0

Model or Prediction

Using the NutritionStudy data frame, make a histogram of Alcohol. What is represented on the y-axis?

Number of patients

Above are the favstats for Alcohol. The average number of drinks per week is 3.28 but the median is 0.3. The maximum value in this distribution is 203—that is a lot of alcoholic drinks per week (almost 30 per day)! That seems to be a mistake. Which of the following would change more if we were to exclude the maximum value from the analysis?

The mean

Even in this highly skewed distribution, the mean can be a good model. What makes the mean a good model?

The mean balances the deviation above and below the mean.

The average Distance of this person's bike commute is just over 27 miles. Imagine that you've discovered that one of your observations has been recorded incorrectly. Instead of a distance of around 27 miles, the distance for one of the commute has been entered as 54 miles! You make the correction to your data frame. How will the correction affect the mean?

The mean will be lower because of the correction.

∑𝑖=1𝑛(𝑌𝑖−𝑌¯)=0 Examine the equation above. Which verbal statement best describes the meaning of the equation?

The sum of the deviations of each person i's score, from 1 to n, is equal to 0.

Let's say we wanted to write a word equation to explain the variation in the time it takes to bike to work. We think that the Distance of a person's commute is an important explanatory variable. What would the word equation look like

Time = Distance + other stuff

Imagine you make three histograms: one for TopSpeed, one for the predicted values based on the empty model for TopSpeed, and one for the residuals. Which two distributions will have a similar shape?

TopSpeed and residuals

We run a study and calculate a parameter estimate, 𝑏0 = 25. We decide to run the study again, and again find a parameter estimate of 𝑏0 = 25. Based on these two studies, what do we know about the population parameter 𝛽0?

We know that 25 is our best estimate of 𝛽0β0, but we can never be sure of the true population parameter.

We run a study and calculate a parameter estimate, 𝑏0b0 = 25. We decide to run the study again, and again find a parameter estimate of 𝑏0b0 = 25. Based on these two studies, what do we know about the population parameter 𝛽0β0?

We know that 25 is our best estimate of 𝛽0β0, but we can never be sure of the true population parameter.

What R code would you use to fit the empty model for TopSpeed?

lm(TopSpeed ~ NULL, data = BikeCommute)

Once we have outcome.stats saved, which of these lines of codes would return the median?

outcome.stats$median

Describes a population

parameter

Describes a sample

statistic

Mean of a sample

𝑌⎯⎯⎯⎯

The mean of Alcohol is 3.279 drinks per day. A particular patient consumes 2 drinks per day. Which of the following symbols would be used to represent the value 2 in the notation of the General Linear Modal?

𝑌𝑖

The mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6. What part of this GLM notation represents 23.6? 𝑌𝑖=𝑌⎯⎯⎯⎯+𝑒𝑖

𝑌𝑖

In GLM notation, which of he following represents the model (or prediction)?

𝑏0

Should 60.1 be represented in the GLM notation as 𝑏0 or 𝛽0?

𝑏0

Should 60.1 be represented in the GLM notation as 𝑏0b0 or 𝛽0β0?

𝑏0

Error around a sample model

𝑒𝑖

Mean of a population

𝜇

Error around a population model

𝜖𝑖


Related study sets

BACC 512 CHAPTER 12 Current Liabilities

View Set

Unit Test: Alcohol, Tobacco, and Other Drugs

View Set

Adam Smith and Free Market Economy

View Set

Structure, Function, and Regulation of the Heart

View Set

BUS 404 Final - Chapter 16 Stern

View Set

Motor Speech Disorders - Midterm Exam

View Set