CHAPTER 5

What are residuals?

The "leftovers" from our data once we take out the model. Under the empty model, you get them by subtracting the mean from each data point.
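
As a minimal sketch of that subtraction, the R snippet below uses a made-up vector (`top_speed` is hypothetical, not taken from any course data frame) and computes each residual by hand.

```r
# Hypothetical toy values, chosen so that the mean is 33.6
top_speed <- c(23.6, 30.0, 33.6, 38.0, 42.8)

# Under the empty model, the model is just the mean
model <- mean(top_speed)

# Residual = data point minus the model
residuals_by_hand <- top_speed - model

print(model)
print(residuals_by_hand)
```

An observation of 23.6 sits 10 below the mean of 33.6, so its residual is negative.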

What value should you choose as the model for a categorical variable?

The mode

What are statistics?

A statistic is anything you can compute to summarize something about your data; the mean is our first example of a statistic.

What number should you choose as the model for a numerical variable?

Mean or median

How would you define "mean" in terms of error?

The mean is the number that balances the amount of deviation above and below it, yielding the same amount of error above it as below it.
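
A small sketch of this balancing idea, using a deliberately skewed, made-up vector (`drinks` is hypothetical): the total deviation above the mean matches the total deviation below it, even though the number of values on each side differs.

```r
# Hypothetical, right-skewed toy data
drinks <- c(0, 0, 1, 2, 17)

deviations <- drinks - mean(drinks)

# Total deviation above and below the mean (as positive amounts)
total_above <- sum(deviations[deviations > 0])
total_below <- sum(abs(deviations[deviations < 0]))

# Count of values above and below the mean
n_above <- sum(deviations > 0)
n_below <- sum(deviations < 0)

print(c(total_above = total_above, total_below = total_below))
print(c(n_above = n_above, n_below = n_below))
```

The mean balances the amount of deviation, not the count of values; balancing the count is what the median does.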

What is the model of the DGP?

Yi = β0 + εi

What is a difference between a statistic and a parameter?

Statistics are computed from the sample data; parameters describe the population and can only be estimated.

Find the average squared deviation for Siblings in the StudentSurvey data frame. A 1.39 B 1.73 C 1.18 D 362

A 1.39

Fit the NULL or empty model of Exercise in the StudentSurvey data frame. What is the sum of squares for this model? A 11864 B 360 C 32.956 D 9.054

A 11864

Use lm() to fit the empty model for Fat in the NutritionStudy data frame. What is the coefficient? A 77.03 B 57.0 C 65.12 D None of the above

A 77.03

Even in this highly skewed distribution, the mean can be a good model. What makes the mean a good model? A The mean balances the deviation above and below the mean. B The mean balances the number of values above and below the mean. C The mean is the most common number in this distribution. D All of the above are reasons why the mean is a good model.

A The mean balances the deviation above and below the mean.

The mean of Alcohol is 3.279 drinks per day. A particular patient consumes 2 drinks per day. Which of the following symbols would be used to represent the value 2 in the notation of the General Linear Model? A Yi B b0 C ei D None of the above

A Yi

What is a parameter?

A parameter is a number that summarizes something about a population.

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the residual? A -10 B 33.6 C 57.2 D 23.6

A) -10

Why is the mean a good model for this distribution? (graph is skewed, but the x-axis is by .5) A Because the mean balances the deviations above and below the mean. B Because the mean balances the number of values above and below the mean. C Because the mean is the midpoint of the range. D All of the above are reasons why the mean is a good model.

A) Because the mean balances the deviations above and below the mean

Why is the mean unbiased? (check all that apply) A) The errors in both directions from the mean are equal. B) The sum of the difference on the left of the mean is always the same as that on the right of the mean. C) The mean is a statistic and all statistics are unbiased. D) The mean is a model and all models are unbiased. E) The mean is not more influenced by bigger values than by smaller values.

A) The errors in both directions from the mean are equal. B) The sum of the difference on the left of the mean is always the same as that on the right of the mean. E) The mean is not more influenced by bigger values than by smaller values.

How is the median the "middle" of the distribution? (check all that apply) A) There are an equal number of data points above and below the median. B) There are an equal number of variables above and below the median. C) The median is a balancing point for the value above and below the median. D) Values above the median are equal to values below the median. E) The number of values above the median are equal to the number of values below the median.

A) There are an equal number of data points above and below the median. E) The number of values above the median are equal to the number of values below the median.

Let's say we wanted to write a word equation to explain the variation in the time it takes to bike to work. We think that the Distance of a person's commute is an important explanatory variable. What would the word equation look like? A) Time = Distance + other stuff B) Distance = time + other stuff C) Other stuff = Distance + time D) Model = time + Distance

A) Time = Distance + other stuff

The mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6. What part of this GLM notation represents 23.6? Yi=Y¯+ei A Yi B Y¯ C ei D None of the above

A) Yi

This bike rider has an intuition that the kind of bike he uses affects the top speed he can reach on the bike. To get a sense of the distribution of top speed, he looks at the output. Which function created this output? (The output includes the five-number summary.) A favstats() B tally() C arrange() D head()

A) favstats()

Above is a histogram for TopSpeed. What R code created the line that indicates the mean? ( Image is a vertical line at some x point) A gf_vline(xintercept = 33.6, color = "blue") B gf_mean(mean = 33.6, color = "blue") C gf_mean(33.6, color = "blue") D None of the above

A) gf_vline(xintercept = 33.6, color = "blue")

Which of the following cannot be calculated from the NutritionStudy data set? A An estimate B A parameter C A statistic D A simple model

B A parameter

If you print the residuals for StudentSurvey.modelGPA, what will you see? A 3.158 B For each participant in the study, the difference between his/her GPA and the mean GPA C For each participant in the study, the model (i.e. the mean) for GPA D The GPA of each participant in the study

B For each participant in the study, the difference between his/her GPA and the mean GPA

Using the NutritionStudy data frame, make a histogram of Alcohol. What is represented on y axis? A Number of drinks consumed per week B Number of patients C The count of alcoholic drinks D Number of variables

B Number of patients

What would be true about the empty model for Fat? A The model would be the best way of explaining how many variables contribute to Fat (such as smoking status and gender). B The model would make the same prediction (the mean of Fat) for every person regardless of their values on other variables. C The model would predict a different value for Fat for each person, depending on their values on other variables. D The model would predict 0 grams for every person's value on Fat.

B The model would make the same prediction (the mean of Fat) for every person regardless of their values on other variables.

Using the StudentSurvey data frame, create a faceted histogram for Weight by Gender. For whom is the mean likely a better model? A males B females C The mean is an equally good model for males and females. D Histogram cannot be used to answer this question.

B females

If you create an empty model of TopSpeed, what would it mean to have an "empty model"? A The model would be the best way of explaining how many variables contribute to TopSpeed (such as time of year and type of bike). B The model would include only mean TopSpeed. C The model would predict a different TopSpeed depending on the situation. D None of the above.

B) The model would include only mean TopSpeed.

Imagine you make three histograms: one for TopSpeed, one for the predicted values based on the empty model for TopSpeed, and one for the residuals. Which two distributions will have a similar shape? A TopSpeed and the predicted values B TopSpeed and residuals C Predicted values and residuals D No two of these distributions will have a similar shape.

B) TopSpeed and residuals

In GLM notation, which of the following represents the model (or prediction)? A Yi B b0 C ei D None of the above

B) b0

Above are the favstats for Alcohol. The average number of drinks per week is 3.28 but the median is 0.3. The maximum value in this distribution is 203 -- that is a lot of alcoholic drinks per week (almost 30 per day). That seems to be a mistake. Which of the following would change more if we were to exclude the maximum value from the analysis? A The median B The minimum C The mean D All of these values (median, minimum, and mean) will change a lot.

C The mean

Try creating a NULL or empty model for TV viewing using the StudentSurvey data frame, and then look at the SS by using anova(). What unit is associated with the number 11224? A minutes B hours C square hours D impossible to tell

C square hours
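
To see why the sum of squares carries squared units, here is a hedged sketch on made-up data (`tv_hours` is a hypothetical vector of hours, not the StudentSurvey values). Each residual is in hours, so each squared residual, and their sum, is in square hours.

```r
# Hypothetical hours of TV per week for five students
tv_hours <- c(2, 5, 6, 10, 12)

# Fit the empty model (mean only)
empty_model <- lm(tv_hours ~ NULL)

# Sum of squares: square each residual (hours -> square hours), then add
ss <- sum(resid(empty_model)^2)
print(ss)

# anova() reports the same quantity in its "Sum Sq" column
print(anova(empty_model))
```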

If the mean for TopSpeed is 33.6, what will the empty model predict for each observation's TopSpeed? A A value within one quarter of 33.6 B 0 C 33.6 D It's impossible to say.

C) 33.6

What notation can be used to represent the mean of the population? A β0 B μ C Both of the above D None of the above

C) Both of the above

What R code would you use to fit the empty model for TopSpeed? A gf_histogram(NULL ~ TopSpeed, data = BikeCommute) B NULL(TopSpeed, data = BikeCommute) C lm(TopSpeed ~ NULL, data = BikeCommute) D gf(TopSpeed ~ NULL, data = BikeCommute)

C) lm(TopSpeed ~ NULL, data = BikeCommute)
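
The same call can be tried on a small stand-in data frame. In this sketch `BikeCommute_toy` and its values are hypothetical, used only so the code runs without the course data; the formula `TopSpeed ~ 1` is the more common way of writing the same intercept-only model.

```r
# Hypothetical stand-in for the BikeCommute data frame
BikeCommute_toy <- data.frame(TopSpeed = c(23.6, 30.0, 33.6, 38.0, 42.8))

# Fit the empty model: no explanatory variables, only an intercept
empty_model <- lm(TopSpeed ~ NULL, data = BikeCommute_toy)

# The single coefficient (b0) is the mean of TopSpeed
b0 <- coef(empty_model)[["(Intercept)"]]
print(b0)

# The empty model predicts that same value for every observation
predictions <- predict(empty_model)
print(predictions)
```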

The average Distance of this person's bike commute is just over 27 miles. Imagine that you've discovered that one of your observations has been recorded incorrectly. Instead of a distance of around 27 miles, the distance for one of the commutes has been entered as 54 miles! You make the correction to your data frame. How will the correction affect the mean? A. The mean will be unaffected by the correction. B. The mean will be higher because of the correction. C. The mean will be lower because of the correction. D. It's impossible to say how the mean will be affected by the correction.

C. The mean will be lower because of the correction.

How will this correction (changing 54 back to 27) affect the mean and the median? A. Both the mean and median will be equally affected by this correction. B. The median will be affected more than will the mean. C. The median will be affected less than will the mean. D. It's impossible to say how the median will be affected by the correction.

C. The median will be affected less than will the mean.
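
A quick sketch of this effect with made-up distances (both vectors are hypothetical; only the 54 and 27 come from the question): correcting the one bad entry pulls the mean down, while the median barely moves.

```r
# Hypothetical commute distances; the last entry was mistyped as 54
distance_wrong <- c(25, 26, 27, 28, 54)

# Same data after correcting 54 to 27
distance_fixed <- c(25, 26, 27, 28, 27)

# How much each summary changes because of the correction
mean_change   <- mean(distance_fixed) - mean(distance_wrong)
median_change <- median(distance_fixed) - median(distance_wrong)

print(c(mean_change = mean_change, median_change = median_change))
```

In this toy case the mean drops by several miles and the median does not change at all.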

Take the StudentSurvey data frame and use lm() to fit the empty model for GPA. Save the results in an R object StudentSurvey.modelGPA. What do you get when you print StudentSurvey.modelGPA? A The "intercept" B 3.158 C The mean for GPA D All of the above

D All of the above

Take the StudentSurvey data frame and use lm() to fit the empty model for SAT. What is 1204? A An unbiased estimate of SAT B The estimate of SAT that has the least error C The mean SAT score D All of the above

D All of the above

The mean of Alcohol is 3.279 drinks per day. A particular patient consumes 2 drinks per day. Which of the following represents the residual for this patient under the empty model? A Yi−b0 B 2 - 3.279 C ei D All of the above

D All of the above

Use lm() to fit the empty model for TV in the StudentSurvey data frame. What can you say based on the output? A The mean of the distribution of hours spent viewing TV by students in this data frame is 6.054. B The best fitting number for the empty model is 6.054. C 6.054 is an unbiased estimate. D All of the above.

D All of the above.

Which of the following word equations represents the hypothesis that smoking explains variation in fat consumption? A smoking = fat consumption B fat consumption = smoking C smoking = fat consumption + other stuff D fat consumption = smoking + other stuff

D fat consumption = smoking + other stuff

If the mean of TopSpeed is 33.6 and a given observation has a TopSpeed of 23.6, what is the data? A -10 B 33.6 C 57.2 D 23.6

D) 23.6

How is the mean the "middle" of the distribution? (check all that apply) A) The mean is a balancing point for the values above and below the mean. B) There are an equal number of values above and below the mean. C) There are an equal number of variables above and below the mean. D) The mean is a balancing point for the error above and below the mean. E) The deviations above the mean are equal to the absolute deviations below the mean.

D) The mean is a balancing point for the error above and below the mean. E) The deviations above the mean are equal to the absolute deviations below the mean.

How would you create a plot to look at the distribution of Distance? A gf_plot(~Distance$BikeCommute) B gf_histogram(~ Distance) C gf_histogram(BikeCommute, Distance) D gf_histogram(~ Distance, data = BikeCommute)

D) gf_histogram(~ Distance, data = BikeCommute)

What is the statistical model?

DATA = MODEL + ERROR Each data point in a distribution can be decomposed into two parts: the model (i.e., the number we are using to represent the whole distribution), and the data point's deviation from the model (the error).
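
The decomposition can be checked directly in R. This is a minimal sketch on a made-up vector (`alcohol` is hypothetical), using `fitted()` for the model part and `resid()` for the error part of an empty model.

```r
# Hypothetical drinks-per-week values
alcohol <- c(0, 0.3, 2, 5, 9.1)

# Fit the empty model
fit <- lm(alcohol ~ NULL)

model_part <- fitted(fit)  # MODEL: the mean, repeated for every data point
error_part <- resid(fit)   # ERROR: each point's deviation from the mean

# DATA = MODEL + ERROR
rebuilt <- model_part + error_part
print(cbind(data = alcohol, model = model_part, error = error_part))
```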

What is the best model for a categorical variable?

If a distribution is for a categorical variable, the best model is generally the category that is most frequent.

If a distribution is roughly normal and symmetrical, which model would be better: mean or median?

Mean. A roughly normal, symmetrical distribution has few outliers to pull the mean away from the center; the more outliers there are, the more they affect the mean.

If a distribution is skewed (left or right), which model would be better: mean or median?

Median, because it is affected far less by outliers than the mean is.

What is the relation between residuals and the mean?

The residuals (or error) around the mean always sum to 0.

How do you tell the general linear model apart from a model of the DGP?

Whenever you see Greek letters, you can be pretty sure we are talking about parameters of the population. Roman letters are generally used to represent estimates calculated from data.

Is there more error when the distribution has more spread?

Yes

What is the general linear model?

Yi = b0 + ei, that is, data = mean + error

What model is best suited for a bell-shaped, roughly symmetrical distribution?

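The best model is a number at the middle of the distribution, such as the mean. (For a skewed distribution, the best model might instead be a number toward where the middle is when you ignore the long tail on one side or the other.)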

What is a simple/empty/null model?

A model that uses only the mean to represent the distribution of a quantitative variable, with no explanatory variables.

What do these letters represent? μ εi ȳ ei

μ = mean of the population; εi = error around the population model; ȳ = mean of the sample; ei = error around the sample model

