Chapter 6


The histogram below was created with this code: gf_histogram(~ hwy, data = mpg, fill = "magenta", bins = 10) Why does this histogram look different than the one in the previous question?

Because the values of hwy (i.e., highway miles per gallon) were put into fewer bins

How would you use R to calculate variance in hwy?

var(mpg$hwy)

Below is a normal model of a population. There are more or less likely values of this variable. What part of the population would be considered "unlikely" to be randomly selected (according to the definition of "unlikely" agreed upon by the statistics community)?

B tail ends

According to the normal model, what is the probability of a student having a thumb longer than 65.1?

.28
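
The .28 can be checked with base R's pnorm(). This is only a sketch: it assumes the normal model uses the mean (60.1 mm) and standard deviation (8.726695 mm) quoted elsewhere in this set.

```r
# Assumed parameters of the normal model (values quoted in other cards)
thumb_mean <- 60.1
thumb_sd <- 8.726695

# Upper-tail probability: P(Thumb > 65.1)
p_longer <- pnorm(65.1, mean = thumb_mean, sd = thumb_sd, lower.tail = FALSE)
round(p_longer, 2)
```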

The "unlikely" part is the .05 part that is farthest away from the mean (in both left and right directions). Label the regions below with the appropriate proportions. (You're free to use a value more than once.)

0.95 is the center of the bell-shaped curve; 0.025 is in each tail (the two tails together make up the .05)

The expression above suggests a series of mathematical operations. Place them in their proper order.

1) For each data point, subtract the mean 2) Take the absolute value 3) Take the sum across all rows in the data frame
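
The three steps can be sketched in base R. The six thumb lengths below are assumed for illustration (they are not given in this set); they were chosen so that the mean is 62.

```r
# Hypothetical thumb lengths in mm (assumed values, mean = 62)
thumb <- c(56, 60, 61, 63, 64, 68)

deviations <- thumb - mean(thumb)  # 1) subtract the mean from each data point
abs_deviations <- abs(deviations)  # 2) take the absolute value
sad <- sum(abs_deviations)         # 3) sum across all rows
sad
```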

Which is the visual definition of standard deviation? (Check all that apply.)

Roughly the average residual from the model
Roughly the average vertical distance from the mean

𝑧𝑖=(𝑌𝑖−𝑌¯)/𝑠 And what is the denominator?

Sample standard deviation

Which of these measures of spread might be most useful in measuring how far 65.1 mm is from the mean?

Standard Deviation, average error: 8.726695

What is the correct interpretation of the value 8.726695?

The average deviation in this distribution is roughly 8.73 mm.

What is the correct interpretation of the value 76.1552?

The average squared deviation in this distribution is about 76 squared mm.
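
The two values are linked: the variance is the standard deviation squared. A quick check, using only the numbers quoted above:

```r
sd_thumb <- 8.726695     # standard deviation, in mm
var_thumb <- sd_thumb^2  # variance, in squared mm
round(var_thumb, 4)
```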

What's true of the distribution of any variable, if your model is the mean of that variable?

The distribution of the variable is the same shape as the distribution of its residual.

Let's ask a different question: In which DGP, the one that produced Sample 1 or Sample 2, above, is it more likely that the next observation will be greater than 110? Explain how you answered this question.

This time the answer is different. You would be more confident in getting a score higher than 110 if your data distribution were the one on the bottom, precisely because there is more error (and a wider spread) in Sample 2. A fairly large proportion of scores in Sample 2 are above 110, whereas a much smaller proportion are above 110 in Sample 1.

Standard Deviation

average deviation from the mean, in pounds

Variance

average squared deviation from the mean, in squared pounds

Which of the following R commands would help you figure out variance in pounds lost? (Check all that apply.)

var(MindsetMatters$PoundsLost)

Which of the following R commands would help you figure out SS for pounds lost? (Check all that apply.)

var(MindsetMatters$PoundsLost) * (75 - 1)
anova(Empty.model)
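
Both routes give the same SS, because variance is SS divided by n - 1. The sketch below uses a small made-up data frame (`dat`, six hypothetical values) in place of MindsetMatters, so n is 6 rather than 75.

```r
# Hypothetical stand-in for MindsetMatters (values are made up)
dat <- data.frame(PoundsLost = c(2.5, -1.0, 0.5, 3.0, 1.5, -0.5))
n <- nrow(dat)

# Route 1: undo the division by n - 1
ss_from_var <- var(dat$PoundsLost) * (n - 1)

# Route 2: read the Sum Sq column of the empty model's ANOVA table
Empty.model <- lm(PoundsLost ~ NULL, data = dat)
ss_from_anova <- anova(Empty.model)[["Sum Sq"]]

# Route 3: square and sum the residuals directly
ss_from_resid <- sum(resid(Empty.model)^2)

c(ss_from_var, ss_from_anova, ss_from_resid)
```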

If the simple model of our TinyFingers data was a number other than the mean (62), then which of the following would be true?

The SS is bigger than the SS for the mean.
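
This can be tried out by computing SS around several candidate model values. The thumb lengths are assumed (the TinyFingers values are not listed in this set); they were picked to reproduce the mean of 62 and the variance of 16.4 mentioned in these cards.

```r
# Assumed TinyFingers-like thumb lengths in mm (mean = 62)
thumb <- c(56, 60, 61, 63, 64, 68)

# Sum of squared deviations around any candidate model value
ss_around <- function(model_value) sum((thumb - model_value)^2)

candidates <- c(60, 61, 62, 63, 64)
ss_values <- sapply(candidates, ss_around)
data.frame(model = candidates, SS = ss_values)
```

In this example the smallest SS sits at 62, the mean.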

SS

the sum of the areas of all the squares

If the z score for your friend's car's highway miles per gallon is found to be .6, what does that mean?

The car's highway miles per gallon is .6 standard deviations larger than the mean for hwy.

As we have said before, there are often many different words that represent the same ideas in statistics. For example, the mean is sometimes referred to as the average. You will see the phrase "mean squared error" in statistics. Which of these do you think it refers to? Try to figure it out.

Variance

What if, in addition to knowing Zelda's thumb length is 65.1 mm, we know also that the mean of the distribution of thumb lengths is 60.1 mm? What does this tell us that we didn't know before we knew the mean?

We now know that this thumb is longer than average and that's about it.

Let's say you've calculated the sum of squares for hwy. What would the advantage be of dividing that number by n - 1 (i.e., dividing it by the df)?

You can use it to compare error across samples of different sizes.
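
A small illustration, with assumed data: duplicating every observation doubles SS, while the variance stays in the same range because it is scaled by n - 1.

```r
# Assumed thumb lengths in mm, then the same sample with every value repeated
thumb <- c(56, 60, 61, 63, 64, 68)
thumb_doubled <- rep(thumb, 2)

ss <- function(x) sum((x - mean(x))^2)

c(ss(thumb), ss(thumb_doubled))    # SS grows with sample size
c(var(thumb), var(thumb_doubled))  # variance stays comparable
```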

Is this a sample statistic or a population parameter? (∑ᵢ₌₁ⁿ (𝑌𝑖−𝑌¯)^2) / (𝑛−1)

sample statistic

What R code will output the standard deviation for hwy?

sd(mpg$hwy)
sqrt(var(mpg$hwy))
favstats(~ hwy, data = mpg)

If you had to write one line of code to represent the sum of squared deviations (SS), which would it be?

sum(resid(Empty.model)^2)

variance

the average area of the squares

Below we have depicted a histogram and the favstats for hwy. Using the Empirical Rule, estimate the probability that a car would have a hwy above 29.4 (depicted in red). Note, we have depicted the original data distribution in the histogram but we don't want you to think about the data. We want you to estimate the probability of a value greater than 29.4 based on a normal model of the distribution.

16%
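
The 16% follows from the Empirical Rule if 29.4 lies about one standard deviation above the mean of hwy. That is an assumption here, since the favstats output is not reproduced in this set. A sketch of the arithmetic:

```r
within_one_sd <- 0.68                    # Empirical Rule: share within 1 SD
above_one_sd <- (1 - within_one_sd) / 2  # the rest is split between two tails

# Exact upper tail beyond z = 1 under a normal model, for comparison
exact_tail <- pnorm(1, lower.tail = FALSE)

c(empirical = above_one_sd, exact = round(exact_tail, 3))
```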

Sum of absolute deviations

∑|𝑌𝑖−𝑌¯|

What value of Thumb is represented by the black, vertical line in the picture above?

65.1

Why is the sum of the residuals equal to 0?

Because the mean balances the residuals
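
A quick check of the balancing property, using assumed thumb lengths (not taken from this set):

```r
# Assumed thumb lengths in mm (mean = 62)
thumb <- c(56, 60, 61, 63, 64, 68)

residuals_from_mean <- thumb - mean(thumb)
sum_at_mean <- sum(residuals_from_mean)  # negatives and positives cancel

sum_at_61 <- sum(thumb - 61)             # any other model leaves them unbalanced
c(sum_at_mean, sum_at_61)
```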

Here we have depicted the mean as a vertical blue line. Why is the mean a good model for hwy?

Because the mean is a model that balances the residuals and minimizes the sum of squared residuals

If you ran the R code below, what would you be able to tell from the output?
Empty.model <- lm(hwy ~ NULL, data = mpg)
anova(Empty.model)

How much error there is around the empty model
The sum of the squared residuals
The sum of squares

What does the distribution of the summary variable sum generally look like?

Normal

var1 <- resample(-3:3, 1000) What do you think the distribution of var1 will look like?

Roughly flat, like a uniform distribution
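
To see this without the mosaic package, base R's sample() with replace = TRUE can stand in for resample(). The seed is arbitrary and is set only so the run can be repeated.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Base R stand-in for resample(-3:3, 1000): draw 1000 values with replacement
var1 <- sample(-3:3, 1000, replace = TRUE)

# Counts per value; each of the seven integers should come up about equally often
table(var1)
```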

If you've calculated the standard deviation for hwy, what have you found?

Roughly the average deviation from the mean, in highway miles per gallon

What does it mean that these three measures of error (SS, variance, and standard deviation) are minimized at the mean?

These measures are smallest when the mean is the model.

Which is the correct interpretation of this z score?

This thumb is .57 standard deviations (less than 1 standard deviation) above the mean.
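
The .57 comes from the z score formula, z = (Y − mean) / s. A sketch with the thumb values quoted in this set (65.1 mm, mean 60.1 mm, standard deviation 8.726695 mm):

```r
thumb <- 65.1
thumb_mean <- 60.1
thumb_sd <- 8.726695

z <- (thumb - thumb_mean) / thumb_sd  # deviation divided by standard deviation
round(z, 2)
```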

Based on this picture, eyeball the proportion of thumbs that might be at least 65.1.

.30

In the distribution above, we've drawn some light gray lines to illustrate the boundaries of Zone 2 (the data within 2 standard deviations from the mean). Let's apply the statistical community's idea of "unlikely." Choose all of the "unlikely" thumb lengths.

82, 80, 90, 38
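
One way to check these answers is to flag any value whose z score falls outside ±2. The sketch assumes the mean (60.1) and standard deviation (8.726695) quoted elsewhere in this set; 65.1 and 50 are added as hypothetical contrast values.

```r
thumb_mean <- 60.1
thumb_sd <- 8.726695

# The four answers plus two hypothetical contrast values
candidates <- c(82, 80, 90, 38, 65.1, 50)

z <- (candidates - thumb_mean) / thumb_sd
unlikely <- abs(z) > 2  # outside Zone 2

data.frame(thumb = candidates, z = round(z, 2), unlikely = unlikely)
```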

Empirical Rule

Approximately 68 percent of the scores in a normal distribution are within one standard deviation, plus or minus, of the mean. Approximately 95 percent of the scores are within two standard deviations. Approximately 99.7 percent of scores are within three standard deviations of the mean (in other words, almost all of them).
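
These percentages can be recovered from the standard normal distribution with pnorm():

```r
k <- 1:3

# Proportion of a normal distribution within k standard deviations of the mean
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 3)
```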

Now make a prediction: If one new, randomly-selected observation were added to each of these samples, in which sample would the new observation be more likely to be greater than the mean (around 100)?

Both are equally likely.

In the figure below, which part represents the probability that a car would have a hwy above 29.4 (depicted in red)? Which part represents the z score for 29.4?

D; C

𝑧𝑖=(𝑌𝑖−𝑌¯)/𝑠 What is the numerator in the formula above?

Deviation

Create an empty model of PoundsLost from MindsetMatters

Empty.model <- lm(PoundsLost ~ NULL, data = MindsetMatters)
sum(resid(Empty.model)^2)

If we fit a normal curve on the distribution of hwy (see visualization below), what is it that we're modeling with it?

Error around the model for hwy

If you wanted to calculate a z score for a hwy of 27, how would it be affected by the standard deviation for hwy?

If the standard deviation is large, the z score should be small and positive.

What is the range of z scores for the data points in Zone 1?

If we are one standard deviation in the positive direction, the z score would be 1. If we are one standard deviation in the negative direction, the z score would be -1. So Zone 1, +/- one standard deviation, would contain all the data for which z scores fall between -1 and 1.

When we model error with a normal distribution, we think about our data in a smooth way rather than a jagged way. If we are interested in the probability of a student having a thumb longer than 65.1 mm, what part of the picture above should we look at?

The green part

If you ran the R code below, what would you be able to tell from the output?
Empty.model <- lm(hwy ~ NULL, data = mpg)
Empty.model

The mean
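
A short check that the empty model's single estimate is the mean. The data frame `dat` is a hypothetical stand-in for mpg, with made-up hwy values.

```r
# Hypothetical stand-in for the mpg data frame
dat <- data.frame(hwy = c(29, 26, 31, 24, 20, 27))

Empty.model <- lm(hwy ~ NULL, data = dat)

model_estimate <- unname(coef(Empty.model))  # the intercept of the empty model
sample_mean <- mean(dat$hwy)
c(model_estimate, sample_mean)
```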

Below is the histogram for hwy. What would you get if you were to total up the height (the "count") of all the bars?

The number of car configurations in the mpg data frame

You probably noticed that the border (the black line representing 65.1 mm) is not labeled 65.1. Instead, it is labeled "z = 0.57." What does this z score mean?

The number of standard deviations that fit between the mean and 65.1 mm

SS

Total squared deviations from the mean, in squared pounds

What is represented by the numbers on the y-axis (count) of this histogram?

Specific cars

Which of the following R commands would help you figure out standard deviation for pounds lost? (Check all that apply.)

favstats(~ PoundsLost, data = MindsetMatters)
sd(MindsetMatters$PoundsLost)
sqrt(var(MindsetMatters$PoundsLost))

Which of these lines of R code do you think would give you SS?

sum(resid(Empty.model)^2)

Deviations

𝑌𝑖−𝑌¯

Remember that the variance of Thumb from TinyFingers was 16.4. Which of the following is a good estimate for the standard deviation of Thumb from TinyFingers?

4

Now compare the two z scores (2 vs. 0.4). Which is more impressive, a player with a z score of 2 or one with a z score of 0.4? Why?

A z score of 2 is more impressive—it's two standard deviations above the mean. It should be harder to score two standard deviations above the mean than to score 0.4 (or less than one half) a standard deviation above the mean.

Interpret these z scores. They are measures, but what is the unit of measurement? 2 what? 0.4 what?

A z score represents the number of standard deviations a score is above (if positive) or below (if negative) the mean. So, the units are standard deviations. A z score of 2 is two standard deviations above the mean. A z score of 0.4 is 0.4 standard deviations above the mean.

# modify this to create a boolean variable that records whether a Thumb is greater than 65.1
# modify this to find the proportion of GreaterThan65.1

Fingers$GreaterThan65.1 <- Fingers$Thumb > 65.1
tally(~ GreaterThan65.1, data = Fingers, format = "proportion")
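
The same idea works in base R, where the mean of a logical vector is the proportion of TRUE values. `Fingers_demo` below is a made-up data frame, not the real Fingers data.

```r
# Hypothetical thumb lengths in mm (not the real Fingers data)
Fingers_demo <- data.frame(
  Thumb = c(52.0, 58.3, 60.1, 64.9, 65.1, 66.0, 70.2, 75.5)
)

# Boolean variable: TRUE when Thumb is strictly greater than 65.1
Fingers_demo$GreaterThan65.1 <- Fingers_demo$Thumb > 65.1

# Proportion of TRUE values
prop_greater <- mean(Fingers_demo$GreaterThan65.1)
prop_greater

# Proportions of FALSE and TRUE, similar to tally(..., format = "proportion")
prop.table(table(Fingers_demo$GreaterThan65.1))
```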

Why do we put parentheses in the expression (37,000 - 35,000)/1,000?

If we did this calculation without parentheses, the calculation would be 37,000 - (35,000 / 1,000) because order of operations, our cultural conventions for how we do arithmetic, says that division is done before subtraction.
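
R follows the same order of operations, so the two versions give very different results:

```r
with_parens <- (37000 - 35000) / 1000   # subtraction first: a z score of 2
without_parens <- 37000 - 35000 / 1000  # division first: 37000 - 35
c(with_parens, without_parens)
```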

Say you have a video game called Kargle. A friend of yours says that their high score is 37,000 points. Is that a good score? How do you know? What else would you want to know to answer this question?

Let's say you know that the mean score across all players of the game is 35,000. How would that help you? Clearly it would help. You would know that the score of 37,000 is above the average by 2,000 points. But even though it helps you interpret the meaning of the 37,000, it's not enough. What it doesn't tell you is how far above the average 37,000 points is in relation to the whole distribution.

What is the difference between a z score and a standard deviation?

Standard deviation (SD) is roughly the average deviation of all scores from the mean. It can be seen as an indicator of the spread of the distribution. A z score uses SD as a sort of ruler for measuring how far an individual score is above or below the mean.

What is the Data Generating Process (DGP) of var1 like?

The DGP is the function resample(), which randomly picks, with replacement, one of the integers from -3 to 3 on each draw.

What is the difference between a residual and the standard deviation?

A residual is the difference between a single observed value and the model's prediction (for the empty model, the mean). The standard deviation summarizes all of the residuals at once: it is roughly the average size of the residuals, so it shows how much the data points spread around the model.

The sum of squares gets larger as:

The variation increases
The sample size increases
The spread of the distribution increases

If a data point is very far away from the mean, what would you expect for the residual?

The farther a data point is from the mean, the larger the absolute value of its residual

If you had to write one line of code to represent the Sum of Absolute Deviations (SAD), which would it be?

sum(abs(resid(TinyEmpty.model)))

Absolute deviations

|𝑌𝑖−𝑌¯|

