Chapter 11
Residual Plot
-A *residual plot* is a plot in which the residuals are plotted against the values of the explanatory variable x.
-In other words, the points on the residual plot are *(x, y − ŷ)*.
Compute the correlation coefficient
-We will denote size by x and selling price by y.
-We compute the correlation coefficient using the following steps:
-Step 1: Compute the sample means and standard deviations.
-Step 2: Compute the quantities *(x − x̄)/s_x* and *(y − ȳ)/s_y*.
-Step 3: Compute the products *[(x − x̄)/s_x] × [(y − ȳ)/s_y]*.
-Step 4: Add the products computed in Step 3, and divide the sum by n − 1.
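The four steps above can be sketched in code. The data values below are hypothetical sizes (square feet) and selling prices (thousands of dollars), made up for illustration; only the procedure matches the steps in the notes.

```python
# Four-step computation of the correlation coefficient r.
# The data values are hypothetical, chosen only to illustrate the steps.
from statistics import mean, stdev

x = [2100, 2300, 2500, 2800, 3000]   # size in square feet (hypothetical)
y = [250, 340, 380, 440, 500]        # price in thousands (hypothetical)
n = len(x)

# Step 1: sample means and sample standard deviations
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Steps 2 and 3: standardize each value and form the products
products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)]

# Step 4: add the products and divide the sum by n - 1
r = sum(products) / (n - 1)
print(round(r, 4))
```

Note that `statistics.stdev` computes the sample standard deviation (dividing by n − 1), which is the one these steps require.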
Negative Association
Two variables are *negatively associated* if large values of one variable are associated with small values of the other.
Correlation Is Not Causation
-A group of elementary school children took a vocabulary test.
-It turned out that children with larger shoe sizes tended to get higher scores on the test, and those with smaller shoe sizes tended to get lower scores.
-As a result, there was a large positive correlation between vocabulary and shoe size.
-Does this mean that learning new words causes one's feet to grow, or that growing feet cause one's vocabulary to increase?
-The fact that shoe size and vocabulary are correlated *does not mean that changing one variable will cause the other to change*.
-*Correlation is not the same as causation*.
-In general, when two variables are correlated, *we cannot conclude that changing the value of one variable will cause a change in the value of the other*.
Scatterplots
-A real estate agent wants to study the relationship between the size of a house and its selling price.
-It is reasonable to suspect that the selling price is related to the size of the house.
-A good way to visualize a relationship like this is with a scatterplot.
-In a scatterplot, *each individual in the data set contributes an ordered pair of numbers*, and each ordered pair is plotted on a set of axes.
Correlation Coefficient is not Resistant
-A statistic is resistant if its value is *not affected much by extreme data values*.
-The correlation coefficient is *not resistant*.
-It may be misleading when *outliers are present*.
Ordered Pairs
-An ordered pair consists of values of two variables for each individual in the data set.
-In the preceding table, the ordered pairs are (inflation rate, unemployment rate).
-To study the relationship between education and income, the ordered pair might be (number of years of education, annual income).
Least-Squares Regression Line
-For example, say we have a table of data that presents the size in square feet and selling price in thousands of dollars for a sample of houses.
-We can use these data to predict the selling price of a house based on its size.
-The key is to *summarize the data with a straight line*.
-We want to find the line that *fits the data best*.
-For each line, we draw the vertical distances from the points to the line.
-We determine exactly how well a line fits the data by *squaring the vertical distances and adding them up*.
-The *line that fits best* is the *line for which this sum of squared vertical distances is as small as possible*.
-This line is called the *least-squares regression line*.
Residual
-Given a point (x, y) and the least-squares regression line ŷ = b_0 + b_1x, the *residual* for the point (x, y) is the *difference between the observed value y and the predicted value ŷ*: *Residual = y − ŷ*.
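As a quick sketch, a residual can be computed from the fitted line used in the house-price example in this chapter, ŷ = 160.1939 + 0.0992x; the observed point below is hypothetical.

```python
# Residual = observed y minus predicted y-hat.
# Fitted line from the chapter's house-price example; the observed
# point (x_obs, y_obs) is hypothetical.
b0, b1 = 160.1939, 0.0992

x_obs, y_obs = 2800, 405.0      # hypothetical observed house
y_hat = b0 + b1 * x_obs         # predicted selling price
residual = y_obs - y_hat        # observed minus predicted
print(round(residual, 1))
```

A negative residual, as here, means the observed price falls below the line; a positive residual means it falls above.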
Equation of the Least-Squares Regression Line
-Given ordered pairs (x, y), with sample means x̄ and ȳ, sample standard deviations s_x and s_y, and correlation coefficient r, the *equation of the least-squares regression line for predicting y from x* is: *ŷ = b_0 + b_1x*
-where b_1 = r(s_y/s_x) is the *slope*
-and b_0 = ȳ − b_1x̄ is the *y-intercept*.
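These two formulas can be applied directly to summary statistics. The data below are hypothetical; the computation of b_1 and b_0 follows the definitions above.

```python
# Slope and intercept of the least-squares line from summary statistics.
# The data values are hypothetical, for illustration only.
from statistics import mean, stdev

x = [2100, 2300, 2500, 2800, 3000]   # hypothetical sizes
y = [250, 340, 380, 440, 500]        # hypothetical prices
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

b1 = r * (s_y / s_x)      # slope
b0 = y_bar - b1 * x_bar   # y-intercept
print(round(b1, 4), round(b0, 2))
```

Since b_0 = ȳ − b_1x̄, the fitted line always passes through the point of averages (x̄, ȳ).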
Checking Assumptions in the Linear Model
-In practice, we do not see the entire population, so we must use the *sample to check that the assumptions are satisfied*.
-This is done with a *residual plot*.
Positive Linear Association
-Observe that larger sizes tend to be associated with larger prices, and smaller sizes tend to be associated with smaller prices.
-We refer to this as a *positive association* between size and selling price.
-In addition, the points tend to cluster around a straight line.
-We describe this by saying that the relationship between the two variables is *linear*.
-Therefore, we can say that the scatterplot exhibits a *positive linear association* between size and selling price.
Assumptions for the Linear Model
-The *mean of the y-values within a strip* is denoted *μ_(y|x)*. As x varies, the values μ_(y|x) follow a straight line: *μ_(y|x) = β_0 + β_1x*
-The *amount of vertical spread is approximately the same in each strip*, except perhaps near the ends.
-The *y-values within a strip are approximately normally distributed*. This assumption is not necessary if the sample size is large (n > 30).
Properties of the Correlation Coefficient
-The correlation coefficient is always between *−1 and 1*, inclusive. In other words, *−1 ≤ r ≤ 1*.
-The value of the correlation coefficient *does not depend on the units of the variables*.
-If we measure x and y in different units, the correlation will still be the same.
-It does not matter which variable is x and which is y.
-The correlation coefficient *measures only the strength of the linear relationship between variables*, and can be misleading when the relationship is nonlinear.
-The correlation coefficient is sensitive to outliers, and can be misleading when outliers are present.
Point of Averages
-The least-squares regression line goes through the point of averages (x̄, ȳ).
-The *point of averages* is the *point that represents the average of the x data values and the y data values*.
Conditions for the Residual Plots
-The residual plot must satisfy the following conditions in order for the linear model assumptions to be satisfied:
1. The residual plot must *not exhibit an obvious pattern*.
2. The *vertical spread of the points in the residual plot must be roughly the same across the plot*.
3. There must be *no outliers*.
Constructing Confidence Intervals for the Slope
-The slope b_1 of the least-squares regression line is a point estimate of the population slope β_1.
-When the assumptions of the linear model are satisfied, we can construct a confidence interval for β_1.
-To form a confidence interval, we need:
1. a *point estimate*
2. a *standard error*
3. a *critical value*
Standard Error of 𝑏_1
-To compute the standard error of b_1, we first compute a quantity called the *residual standard deviation*.
-The *residual standard deviation*, denoted *s_e*, *measures the spread of the points on the scatterplot around the least-squares regression line*.
-Formula for the *residual standard deviation*: *s_e = √[∑(y − ŷ)² / (n − 2)]*
-Formula for the *standard error of b_1*: *s_b = s_e / √[∑(x − x̄)²]*
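Putting the pieces together, here is a sketch of a 95% confidence interval for β_1. The data are hypothetical, and the critical value 3.182 is the t critical value for a 95% confidence level with n − 2 = 3 degrees of freedom.

```python
# Sketch: residual standard deviation s_e, standard error s_b, and a
# 95% confidence interval for the slope beta_1. Data are hypothetical.
from math import sqrt
from statistics import mean

x = [2100, 2300, 2500, 2800, 3000]   # hypothetical sizes
y = [250, 340, 380, 440, 500]        # hypothetical prices
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx                # slope (point estimate of beta_1)
b0 = y_bar - b1 * x_bar       # intercept

# Residual standard deviation: s_e = sqrt(sum((y - y_hat)^2) / (n - 2))
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = sqrt(sse / (n - 2))

# Standard error of b1: s_b = s_e / sqrt(sum((x - x_bar)^2))
s_b = s_e / sqrt(sxx)

t_crit = 3.182                # t critical value, 95% level, df = n - 2 = 3
ci = (b1 - t_crit * s_b, b1 + t_crit * s_b)
print(ci)
```

The interval has the usual form (point estimate) ± (critical value) × (standard error), matching the three ingredients listed above.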
Predicted Value
-We can use the least-squares regression line to predict the value of the outcome variable if we are given the value of the explanatory variable.
-The value of ŷ that is computed is the *predicted value*.
EXAMPLE: Use the least-squares regression line ŷ = 160.1939 + 0.0992x to predict the selling price of a house of size 2800 square feet.
Solution: Substitute x = 2800 into the equation:
ŷ = 160.1939 + 0.0992(2800)
ŷ = 438.0 thousand dollars
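The substitution in the example can be written as a small function; `predict_price` is a hypothetical helper name, and the coefficients are those of the example's fitted line.

```python
# Predicted value from the example's least-squares regression line
# y_hat = 160.1939 + 0.0992 x. `predict_price` is a hypothetical helper.
def predict_price(size_sqft):
    """Predicted selling price (thousands of dollars) from house size."""
    return 160.1939 + 0.0992 * size_sqft

print(round(predict_price(2800), 1))   # 438.0 thousand dollars
```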
Linear Model Equation
-When the assumptions of the linear model hold, the points (x, y) satisfy the following linear model equation: *y = β_0 + β_1x + ε*
-where *β_0 is the y-intercept*
-*β_1 is the slope of the line*
-*ε is a random error*
-The y-intercept b_0 and the slope b_1 of the least-squares line are *estimates of β_0 and β_1*.
Linear Model
-When the points on a scatterplot are a random sample from a population, we can imagine plotting every point in the population on a scatterplot.
-Then, *if certain assumptions are met*, we say that *the population follows a linear model*.
-The intercept b_0 and the slope b_1 of the least-squares regression line are then estimates of a population intercept β_0 and a population slope β_1.
-We cannot determine the exact values of β_0 and β_1, because we cannot observe the entire population.
-However, we can use the sample points to construct *confidence intervals and test hypotheses about β_0 and β_1*.
Correlation Coefficient
-When two variables have a linear relationship, we want to measure *how strong the relationship is*.
-A numerical measure of the *strength of the linear relationship between two variables* is called the *correlation coefficient*.
-Given ordered pairs (x, y), with sample means x̄ and ȳ, sample standard deviations s_x and s_y, and sample size n, the correlation coefficient r is given by: *r = 1/(n − 1) ∑ [(x − x̄)/s_x][(y − ȳ)/s_y]*
-We often refer to *r* as the *correlation between x and y*.
Explanatory Variable
-The variable *we are given* is called the *explanatory variable*, or *predictor variable*.
-In the equation of the least-squares regression line, *x represents the explanatory variable* and y represents the outcome variable.
Outcome Variable
-The variable *we want to predict* (in this case, selling price) is called the *outcome variable*, or *response variable*.
-In the equation of the least-squares regression line, x represents the explanatory variable and *y represents the outcome variable*.
Linear Relationship
Two variables have a *linear relationship* if the data tend to cluster around a straight line when plotted on a scatterplot.
Interpreting the Correlation Coefficient
When two variables have a linear relationship, the correlation coefficient can be interpreted as follows:
-If *r is positive*, the two variables have a *positive linear association*.
-If *r is negative*, the two variables have a *negative linear association*.
-If *r is close to 0*, the *linear association is weak*.
-The *closer r is to 1*, the more *strongly positive the linear association is*.
-The *closer r is to −1*, the *more strongly negative the linear association is*.
-If *r = 1*, the *points lie exactly on a straight line with positive slope*; in other words, the variables have a *perfect positive linear association*.
-If *r = −1*, the *points lie exactly on a straight line with negative slope*; in other words, the variables have a *perfect negative linear association*.
-When two variables are *not linearly related*, the correlation coefficient *does not provide a reliable description of the relationship between the variables*.