Soc 106 - CH 9.1-9.4 - Linear Regression & Correlation
LINEAR REGRESSION FUNCTION
A probabilistic model uses α + βx to represent the mean of the y-values, rather than y itself, as a function of x. For a given value of x, α + βx represents the mean of the conditional distribution of y for subjects having that value of x. That is, the equation

E(y) = α + βx

models the relationship between x and the mean of the conditional distribution of y. For y = annual income, in dollars, and x = number of years of education, suppose E(y) = −5000 + 3000x. For instance, those having a high school education (x = 12) have a mean income of E(y) = −5000 + 3000(12) = 31,000 dollars. The model states that the mean income is 31,000 but allows different subjects having x = 12 to have different incomes.
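A minimal sketch of this conditional-mean computation, using the coefficients from the income example above:

    # E(y) = alpha + beta * x: mean of the conditional distribution of y at x.
    def conditional_mean(x, alpha=-5000.0, beta=3000.0):
        return alpha + beta * x

    print(conditional_mean(12))   # 31000.0: mean income at 12 years of education
    print(conditional_mean(16))   # 43000.0: mean income at 16 years of education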
r-SQUARED: PROPORTIONAL REDUCTION IN PREDICTION ERROR
A related measure of association summarizes how well x can predict y. If we can predict y much better by substituting x-values into the prediction equation ŷ = a + bx than without knowing the x-values, the variables are judged to be strongly associated. This measure of association has four elements:

- A rule for predicting y without using x. We refer to this as Rule 1.
- A rule for predicting y using information on x. We refer to this as Rule 2.
- A summary measure of prediction error for each rule, E1 for errors by Rule 1 and E2 for errors by Rule 2.
- The difference in the amount of error with the two rules, E1 − E2. Converting this reduction in error to a proportion provides the definition

r² = (E1 − E2)/E1.

Rule 1 (Predicting y without using x): The best predictor is ȳ, the sample mean.

Rule 2 (Predicting y using x): When the relationship between x and y is linear, the prediction equation ŷ = a + bx provides the best predictor of y.

Prediction errors: The prediction error for each subject is the difference between the observed and predicted values of y. The prediction error using Rule 1 is y − ȳ, and the prediction error using Rule 2 is y − ŷ, the residual. For each predictor, some prediction errors are positive, some are negative, and the sum of the errors equals 0. We summarize the prediction errors by their sums of squared values: E1 = TSS = Σ(y − ȳ)² and E2 = SSE = Σ(y − ŷ)², so that r² = (TSS − SSE)/TSS. - p. 264
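A short sketch of this proportional reduction in error computation; the x and y arrays here are illustrative made-up data, not values from the text:

    import numpy as np

    # Illustrative data (made up for this sketch).
    x = np.array([10.7, 14.5, 18.3, 22.1, 26.4])
    y = np.array([3.9, 6.0, 9.5, 11.2, 20.3])

    # Rule 1: predict y by its mean; Rule 2: predict y by the least squares line.
    b, a = np.polyfit(x, y, 1)            # slope, intercept
    E1 = np.sum((y - y.mean())**2)        # TSS: errors without using x
    E2 = np.sum((y - (a + b * x))**2)     # SSE: errors using x
    r_squared = (E1 - E2) / E1
    print(r_squared)                      # equals np.corrcoef(x, y)[0, 1]**2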
We present three different, but related, aspects of regression analysis:
1. We investigate whether an association exists between the variables by testing the hypothesis of statistical independence.
2. We study the strength of their association using the correlation measure of association.
3. We estimate a regression equation that predicts the value of the response variable from the value of the explanatory variable.
Regression function
An equation of the form E(y) = α + βx that relates values of x to the mean of the conditional distribution of y is called a regression function. A regression function is a mathematical function that describes how the mean of the response variable changes according to the value of an explanatory variable. The function E(y) = α + βx is called a linear regression function, because it uses a straight line to relate the mean of y to the values of x. In practice, the regression coefficients α and β are unknown. Least squares provides the sample prediction equation ŷ = a + bx. At any particular value of x, ŷ = a + bx estimates the mean of y for all subjects in the population having that value of x.
CORRELATION IMPLIES REGRESSION TOWARD THE MEAN
Another interpretation of the correlation relates to its standardized slope property. We can rewrite the equality r = (s_x/s_y)b as

s_x · b = r · s_y.

Now, the slope b is the change in ŷ for a one-unit increase in x. An increase in x of s_x units has a predicted change of s_x · b units. (For instance, if s_x = 10, an increase of 10 units in x corresponds to a change in ŷ of 10b.) See Figure 9.11. Since s_x · b = r · s_y, an increase of s_x in x corresponds to a predicted change of r standard deviations in the y-values.

For example, let's start at the point (x̄, ȳ), through which the prediction equation passes, and consider the impact of x moving above x̄ by a standard deviation. Suppose that r = 0.5. An increase of s_x in x, from x̄ to (x̄ + s_x), corresponds to a predicted increase of 0.5s_y in y, from ȳ to (ȳ + 0.5s_y). We predict that y is closer to its mean, in standard deviation units, than x is to its mean. This is called regression toward the mean. The larger the absolute value of r, the stronger the association, in the sense that a standard deviation change in x corresponds to a greater proportion of a standard deviation change in y.
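A small numeric check of this property. The summary statistics below are assumed values chosen to match the r = 0.5 illustration above:

    # Assumed summary statistics (illustrative, matching r = 0.5 above).
    x_bar, s_x = 12.0, 3.0
    y_bar, s_y = 30000.0, 13000.0
    r = 0.5

    b = r * s_y / s_x               # slope implied by the standardized-slope property
    # Moving one standard deviation above the mean of x ...
    y_hat = y_bar + b * s_x         # ... predicts a rise of r standard deviations in y
    print((y_hat - y_bar) / s_y)    # 0.5, i.e., r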
LINEAR FUNCTIONS: INTERPRETING THE y-INTERCEPT AND SLOPE
Any particular formula might provide a good description or a poor one of how y relates to x. This chapter introduces the simplest type of formula: a straight line. For it, y is said to be a linear function of x. Each real number x, when substituted into the formula y = α + βx, yields a distinct value for y. In a graph, the horizontal axis, the x-axis, lists the possible values of x. The vertical axis, the y-axis, lists the possible values of y. The axes intersect at the point where x = 0 and y = 0, called the origin.

At x = 0, the equation y = α + βx simplifies to y = α + β(0) = α. Thus, the constant α in this equation is the value of y when x = 0. Now, points on the y-axis have x = 0, so the line has height α at the point of its intersection with the y-axis. Because of this, α is called the y-intercept. The slope β equals the change in y for a one-unit increase in x. That is, for two x-values that differ by 1.0 (such as x = 0 and x = 1), the y-values differ by β. Two x-values that are 10 units apart differ by 10β in their y-values. In the context of a regression analysis, α and β are called regression coefficients.
MODELS ARE SIMPLE APPROXIMATIONS FOR REALITY
As Section 7.5 (page 193) explained, a model is a simple approximation for the relationship between variables in the population. The linear function provides a simple model for the relationship between two quantitative variables. For a given value of x, the model y = α + βx predicts a value for y. The better these predictions tend to be, the better the model.

As we shall discuss in some detail in Chapter 10, association does not imply causation. For example, consider the interpretation of the slope in Example 9.2 above of "When the percentage with income below the poverty level increases by 1, the violent crime rate increases by about 25 crimes a year per 100,000 population." This does not mean that if we had the ability to go to a state and increase the percentage of people living below the poverty level from 10% to 11%, we could expect the number of crimes to increase in the next year by 25 crimes per 100,000 people. It merely means that, based on current data, if one state had a 10% poverty rate and another had an 11% poverty rate, we'd predict that the state with the higher poverty rate would have 25 more crimes per year per 100,000 people.

But, as we'll see in Section 9.3, a sensible model is actually a bit more complex than the one we've presented so far, allowing variability in y-values at each value of x. That model, not merely a straight line, is what we mean by a regression model. Before introducing the complete model in Section 9.3, we'll next see how to find the best approximating straight line.
Example 9.5 - p. 257
Describing How Income Varies, for Given Education. Again, suppose E(y) = −5000 + 3000x describes the relationship between mean annual income and number of years of education. The slope β = 3000 implies that mean income increases $3000 for each year increase in education. Suppose also that the conditional distribution of income is normal, with σ = 13,000. According to this model, for individuals with x years of education, their incomes have a normal distribution with a mean of E(y) = −5000 + 3000x and a standard deviation of 13,000.

Those having a high school education (x = 12) have a mean income of E(y) = −5000 + 3000(12) = 31,000 dollars and a standard deviation of 13,000 dollars. So, about 95% of the incomes fall within two standard deviations of the mean, that is, between 31,000 − 2(13,000) = 5000 and 31,000 + 2(13,000) = 57,000 dollars. Those with a college education (x = 16) have a mean annual income of E(y) = −5000 + 3000(16) = 43,000 dollars, with about 95% of the incomes falling between $17,000 and $69,000. Figure 9.7 pictures this regression model.

In Figure 9.7, each conditional distribution is normal, and each has the same standard deviation, σ = 13,000. In practice, the distributions would not be exactly normal, and the standard deviation need not be the same for each. No model ever holds exactly in practice. It is merely a simple approximation for reality. For sample data, we'll learn about ways to check whether a particular model is realistic. The most important assumption is that the regression equation is linear. The scatterplot helps us check whether this assumption is badly violated, as we'll discuss later in the chapter.
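A quick sketch of these interval computations; the coefficients and σ are the ones given in the example:

    alpha, beta, sigma = -5000.0, 3000.0, 13000.0

    for x in (12, 16):                       # high school, college
        mean = alpha + beta * x              # conditional mean of income
        lo, hi = mean - 2 * sigma, mean + 2 * sigma
        print(x, mean, (lo, hi))             # ~95% of incomes fall in (lo, hi)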
EFFECT OF OUTLIERS ON THE PREDICTION EQUATION
Figure 9.5 plots the prediction equation from Example 9.4 over the scatterplot. The diagram shows that the observation for D.C. (the sole point in the top-right corner) is a regression outlier: it falls quite far from the trend that the rest of the data follow. This observation seems to have a substantial effect. The line seems to be pulled up toward it and away from the center of the general trend of points.

Let's now refit the line using the observations for the 50 states but not the one for D.C. Table 9.3 shows that the prediction equation is ŷ = −0.86 + 0.58x. Figure 9.5 also shows this line, which passes more directly through the 50 points. The slope is 0.58, compared to 1.32 when the observation for D.C. is included. The one outlying observation has the impact of more than doubling the slope!

An observation is called influential if removing it results in a large change in the prediction equation. Unless the sample size is large, an observation can have a strong influence on the slope if its x-value is low or high compared to the rest of the data and if it is a regression outlier.

In summary, the line for the data set including D.C. seems to distort the relationship for the 50 states. It seems wiser to use the equation based on the 50 states alone rather than a single equation for both the 50 states and D.C. This line for the 50 states better represents the overall trend. In reporting these results, we would note that the murder rate for D.C. falls outside this trend, being much larger than this equation predicts.
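A small sketch of this refitting idea on made-up data (not the actual state data), showing how one high-leverage regression outlier can change the slope:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(8, 27, 50)                   # made-up "poverty rates"
    y = -0.9 + 0.6 * x + rng.normal(0, 2, 50)    # made-up "murder rates"

    # Append one high-leverage regression outlier (a D.C.-like point).
    x_all = np.append(x, 28.0)
    y_all = np.append(y, 78.0)

    b_with, a_with = np.polyfit(x_all, y_all, 1)
    b_without, a_without = np.polyfit(x, y, 1)
    print(b_with, b_without)    # the outlier inflates the slope substantially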
residual
For an observation, the difference between an observed value and the predicted value of the response variable, y − ŷ, is called the residual.
the linear regression model
For the linear model y = α + βx, each value of x corresponds to a single value of y. Such a model is said to be deterministic. It is unrealistic in social science research, because we do not expect all subjects who have the same x-value to have the same y-value. Instead, the y-values vary.

For example, let x = number of years of education and y = annual income. The subjects having x = 12 years of education do not all have the same income, because income is not completely dependent upon education. Instead, a probability distribution describes annual income for individuals with x = 12. It is the conditional distribution of the y-values at x = 12. A separate conditional distribution applies for those with x = 13 years of education. Each level of education has its own conditional distribution of income. For example, the mean of the conditional distribution of income would likely be higher at higher levels of education.

A probabilistic model for the relationship allows for variability in y at each value of x. We now show how a linear function is the basis for a probabilistic model.
PREDICTION ERRORS ARE CALLED RESIDUALS
For the prediction equation ŷ = −0.86 + 0.58x for the 50 states, a comparison of the actual murder rates to the predicted values checks the goodness of the equation. For example, Massachusetts had poverty rate x = 10.7 and y = 3.9. The predicted murder rate (ŷ) at x = 10.7 is ŷ = −0.86 + 0.58(10.7) = 5.4. The prediction error is the difference between the actual y-value of 3.9 and the predicted value of 5.4, or y − ŷ = 3.9 − 5.4 = −1.5. The prediction equation overestimates the murder rate by 1.5. Similarly, for Louisiana, x = 26.4 and ŷ = −0.86 + 0.58(26.4) = 14.6. The actual murder rate is y = 20.3, so the prediction is too low. The prediction error is y − ŷ = 20.3 − 14.6 = 5.7. The prediction errors are called residuals.

Table 9.3 shows the murder rates, the predicted values, and the residuals for the first four states in the data file. A positive residual results when the observed value y is larger than the predicted value ŷ, so y − ŷ > 0. A negative residual results when the observed value is smaller than the predicted value. The smaller the absolute value of the residual, the better is the prediction, since the predicted value is closer to the observed value.

In a scatterplot, the residual for an observation is the vertical distance between its point and the prediction line. Figure 9.6 illustrates this. For example, the observation for Louisiana is the point with (x, y) coordinates (26.4, 20.3). The prediction is represented by the point (26.4, 14.6) on the prediction line, obtained by substituting x = 26.4 into the prediction equation ŷ = −0.86 + 0.58x. The residual is the difference between the observed and predicted points, which is the vertical distance y − ŷ = 20.3 − 14.6 = 5.7.
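A minimal sketch reproducing these two residuals; the prediction equation and the two data points are from the example:

    # Prediction equation for the 50 states: y-hat = -0.86 + 0.58x.
    def predict(x):
        return -0.86 + 0.58 * x

    for state, x, y in [("Massachusetts", 10.7, 3.9), ("Louisiana", 26.4, 20.3)]:
        residual = y - predict(x)    # observed minus predicted
        print(state, round(predict(x), 1), round(residual, 1))
    # Note: with the rounded coefficients -0.86 and 0.58, the results here
    # (5.3, -1.4 and 14.5, 5.8) differ slightly from the text's 5.4, -1.5 and
    # 14.6, 5.7, which use the unrounded least squares estimates.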
CONDITIONAL VARIATION TENDS TO BE LESS THAN MARGINAL VARIATION
From pages 43 and 105, a point estimate of the population standard deviation of a variable y is

s_y = √[Σ(y − ȳ)² / (n − 1)].

This is the standard deviation of the marginal distribution of y, because it uses only the y-values. It ignores values of x. To emphasize that this standard deviation depends on values of y alone, the remainder of the text denotes it by s_y in a sample and σ_y in a population. It differs from the standard deviation of the conditional distribution of y for a fixed value of x. To reflect its conditional form, that standard deviation is sometimes denoted by s_y|x for the sample estimate and σ_y|x for the population. For simplicity, we use s and σ.

The sum of squares Σ(y − ȳ)² in the numerator of s_y is called the total sum of squares. In Table 9.4 for the 50 student GPAs, it is 15.29. Thus, the marginal standard deviation of GPA is s_y = √[15.29/(50 − 1)] = 0.56. - p. 259

Typically, less spread in y-values occurs at a fixed value of x than totaled over all such values. We'll see that the stronger the association between x and y, the less the conditional variability tends to be relative to the marginal variability. For example, suppose the marginal distribution of college GPAs (y) at your school falls between 1.0 and 4.0, with s_y = 0.60. Suppose we could predict college GPA perfectly using x = high school GPA, with the prediction equation ŷ = 0.40 + 0.90x. Then SSE = 0, and the conditional standard deviation would be s = 0. In practice, perfect prediction would not happen. However, the stronger the association in terms of less prediction error, the smaller the conditional variability would be. See Figure 9.8, which portrays a marginal distribution that is much more spread out than each conditional distribution.
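A tiny sketch of this computation; TSS = 15.29 and n = 50 are the values from Table 9.4:

    import math

    TSS, n = 15.29, 50
    s_y = math.sqrt(TSS / (n - 1))    # marginal standard deviation of GPA
    print(round(s_y, 2))              # 0.56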
Correlation is a standardized slope
Here is the simple connection between the slope estimate and the sample correlation:

r = (s_x / s_y) b.

If the sample spreads are equal (s_x = s_y), then r = b. The correlation is the value the slope would take for units such that the variables have equal standard deviations. For example, when the variables are standardized by converting their values to z-scores, both standardized variables have standard deviations of 1.0. Because of the relationship between r and b, the correlation is also called the standardized regression coefficient for the model E(y) = α + βx. In practice, it is not necessary to standardize the variables, but we can interpret the correlation as the value the slope would equal if the variables were equally spread out.

The point estimate r of the correlation was proposed by the British statistical scientist Karl Pearson in 1896, just four years before he developed the chi-squared test of independence for contingency tables. In fact, this estimate is sometimes called the Pearson correlation.
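A quick numeric check of this identity on made-up data:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(size=100)    # made-up linear relationship

    b = np.polyfit(x, y, 1)[0]            # least squares slope
    r = np.corrcoef(x, y)[0, 1]           # Pearson correlation
    print(np.isclose(r, b * x.std(ddof=1) / y.std(ddof=1)))    # True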
Example 9.9 - p. 265
In practice, it is unnecessary to perform this computation, since software reports r or r² or both.
The Correlation
On page 53, we introduced the correlation between quantitative variables. Its value, unlike that of the slope b, does not depend on the units of measurement. The formulas for the correlation and for the slope (page 252) have the same numerator, relating to the covariation of x and y. The correlation is a standardized version of the slope. The standardization adjusts the slope b for the fact that the standard deviations of x and y depend on their units of measurement. Let s_x and s_y denote the marginal sample standard deviations of x and y.
Example 9.4
Predicting Murder Rate from Poverty Rate. For the 51 observations on y = murder rate and x = poverty rate in Table 9.1, SPSS software provides the results shown in Table 9.2. Murder rate has ȳ = 8.7 and s = 10.7, indicating that it is probably highly skewed to the right. The box plot for murder rate in Figure 9.3 shows that the extreme outlying observation for D.C. contributes to this.

The estimates of α and β are listed under the heading B. The estimated y-intercept is a = −10.14, listed opposite (Constant). The estimate of the slope is b = 1.32, listed opposite the variable name of which it is the coefficient in the prediction equation, POVERTY. Therefore, the prediction equation is ŷ = a + bx = −10.14 + 1.32x.

The slope b = 1.32 is positive. So, the larger the poverty rate, the larger is the predicted murder rate. The value 1.32 indicates that an increase of 1 in the percentage living below the poverty level corresponds to an increase of 1.32 in the predicted murder rate.

Similarly, an increase of 10 in the poverty rate corresponds to a 10(1.32) = 13.2-unit increase in predicted murder rate. If one state has a 12% poverty rate and another has a 22% poverty rate, for example, the predicted annual number of murders per 100,000 population is 13.2 higher in the second state than in the first. This differential of about 13 murders per 100,000 population translates to 130 per million or 1300 per 10 million population. If the two states each had populations of 10 million, the one with the higher poverty rate would be predicted to have 1300 more murders per year.
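For readers working along in software, a sketch of how such a fit is produced. The arrays here are placeholders; the actual 51 observations are in Table 9.1:

    import numpy as np

    # Placeholder data; substitute the 51 (poverty, murder) pairs from Table 9.1.
    poverty = np.array([10.7, 26.4, 16.0, 12.2])
    murder  = np.array([3.9, 20.3, 11.4, 5.2])

    b, a = np.polyfit(poverty, murder, 1)    # least squares slope and intercept
    print(f"y-hat = {a:.2f} + {b:.2f} x")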
Figure 9.6
Prediction Equation and Residuals. A residual is a vertical distance between a data point and the prediction line.
Example 9.2
Straight Lines for Predicting Violent Crime Rate. For the 50 states, consider y = violent crime rate and x = poverty rate. We'll see that a straight line cannot perfectly represent the relation between them, but the line y = 210 + 25x provides a type of approximation. The y-intercept of 210 represents the violent crime rate at poverty rate x = 0 (unfortunately, there are no such states). The slope equals 25. When the percentage with income below the poverty level increases by 1, the violent crime rate increases by about 25 crimes a year per 100,000 population.

By contrast, if instead x = percentage of the population living in urban areas, the straight line approximating the relationship is y = 26 + 8x. The slope of 8 is smaller than the slope of 25 when poverty rate is the explanatory variable. An increase of 1 in the percentage below the poverty level corresponds to a greater change in the violent crime rate than an increase of 1 in the percentage urban.

Figure 9.2 shows the lines relating the violent crime rate to poverty rate and to urban residence. Generally, the larger the absolute value of β, the steeper the line. When β is positive, y increases as x increases: the straight line goes upward, like these two lines. Then, large values of y occur with large values of x, and small values of y occur with small values of x. When a relationship between two variables follows a straight line with β > 0, the relationship is said to be positive.

When β is negative, y decreases as x increases. The straight line then goes downward, and the relationship is said to be negative. For instance, the equation y = 1756 − 16x, which has slope −16, approximates the relationship between y = violent crime rate and x = percentage of residents who are high school graduates. For each increase of 1.0 in the percentage who are high school graduates, the violent crime rate decreases by about 16. Figure 9.2 also shows this line.

When β = 0, the graph is a horizontal line. The value of y is constant and does not vary as x varies. If two variables are independent, with the value of y not depending on the value of x, a straight line with β = 0 represents their relationship. The line y = 800 shown in Figure 9.2 is an example of a line with β = 0.
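A tiny sketch evaluating these three example lines at a couple of x-values, to make the slope signs concrete:

    # The three approximating lines from Example 9.2.
    lines = {
        "poverty":  lambda x: 210 + 25 * x,     # beta > 0, positive relationship
        "urban":    lambda x: 26 + 8 * x,       # beta > 0, shallower slope
        "hs grads": lambda x: 1756 - 16 * x,    # beta < 0, negative relationship
    }
    for name, f in lines.items():
        print(name, [f(x) for x in (10, 20)])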
Example 9.6 - p. 258
TV Watching and Grade Point Averages. A survey of 50 college students in an introductory psychology class observed self-reports of y = high school GPA and x = weekly number of hours viewing television. The study reported ŷ = 3.44 − 0.03x. Software reports the sums of squares shown in Table 9.4. This type of table is called an ANOVA table. Here, ANOVA is an acronym for analysis of variance. The residual sum of squares in using x to predict y was SSE = 11.66. The estimated conditional standard deviation is

s = √[SSE/(n − 2)] = √(11.66/48) = 0.49.

At any fixed value x of TV viewing, the model predicts that GPAs vary around a mean of 3.44 − 0.03x with a standard deviation of 0.49. At x = 20 hours a week, for instance, the conditional distribution of GPA is estimated to have a mean of 3.44 − 0.03(20) = 2.84 and a standard deviation of 0.49.

The term (n − 2) in the denominator of s is the degrees of freedom (df) for the estimate. When a regression equation has p unknown parameters, then df = n − p. The equation E(y) = α + βx has two parameters (α and β), so df = n − 2. The table in the above example lists SSE = 11.66 and its df = n − 2 = 50 − 2 = 48. The ratio of these, s² = 0.24, is listed on the printout in the Mean Square column. Most software calls this the residual mean square. Its square root is the estimate of the conditional standard deviation of y, s = √0.24 = 0.49. Software refers to this by various names, including Root MSE (Stata and SAS), for the square root of the mean square error; Residual standard error (R); and Standard error of the estimate (SPSS).
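A one-line check of this estimate; SSE and n are the values from Table 9.4:

    import math

    SSE, n = 11.66, 50
    s = math.sqrt(SSE / (n - 2))    # residual standard error, df = n - 2
    print(round(s, 2))              # 0.49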
Properties of the correlation
The correlation is valid only when a straight-line model is sensible for the relationship between x and y. Since r is proportional to the slope of a linear prediction equation, it measures the strength of the linear association.

- −1 ≤ r ≤ 1. The correlation, unlike the slope b, must fall between −1 and +1. We shall see why later in the section.
- r has the same sign as the slope b. This holds because their formulas have the same numerator, relating to the covariation of x and y, and positive denominators. Thus, r > 0 when the variables are positively related, and r < 0 when the variables are negatively related.
- r = 0 for those lines having b = 0. When r = 0, there is not a linear increasing or linear decreasing trend in the relationship.
- r = ±1 when all the sample points fall exactly on the prediction line. These correspond to perfect positive and negative linear associations. There is then no prediction error when we use ŷ = a + bx to predict y.
- The larger the absolute value of r, the stronger the linear association. Variables with a correlation of −0.80 are more strongly linearly associated than variables with a correlation of 0.40. Figure 9.9 shows scatterplots having various values for r.
- The correlation, unlike the slope b, treats x and y symmetrically. The prediction equation using y to predict x has the same correlation as the one using x to predict y.
- The value of r does not depend on the variables' units. For example, if y is the number of murders per 1,000,000 population instead of per 100,000 population, we obtain the same value of r = 0.63. Also, when murder rate predicts poverty rate, the correlation is the same as when poverty rate predicts murder rate, r = 0.63 in both cases. (See the sketch after this paragraph.)

The correlation is useful for comparing associations for variables having different units. Another potential predictor of murder rate is the mean number of years of education completed by adult residents in the state. Poverty rate and education have different units, so a one-unit change in poverty rate is not comparable to a one-unit change in education. Their slopes from the separate prediction equations are not comparable. The correlations are comparable. Suppose the correlation of murder rate with education is −0.30. Since the correlation of murder rate with poverty rate is 0.63, and since 0.63 > |−0.30|, murder rate is more strongly associated with poverty rate than with education.

We emphasize that the correlation describes linear relationships. For curvilinear relationships, the best-fitting prediction line may be completely or nearly horizontal, and r = 0 when b = 0. See Figure 9.10. A low absolute value for r does not then imply that the variables are unassociated, but merely that the association is not linear. - p. 262
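A small numeric illustration of the symmetry and unit-invariance properties, on made-up data:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(5, 25, 50)
    y = 2 + 0.6 * x + rng.normal(0, 3, 50)    # made-up positive relationship

    r_xy = np.corrcoef(x, y)[0, 1]
    r_yx = np.corrcoef(y, x)[0, 1]              # symmetric in x and y
    r_rescaled = np.corrcoef(x, 10 * y)[0, 1]   # change of units for y
    print(np.isclose(r_xy, r_yx), np.isclose(r_xy, r_rescaled))    # True True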
A SCATTERPLOT PORTRAYS THE DATA
The first step of model fitting is to plot the data, to reveal whether a model with a straight-line trend makes sense. The data values (x, y) for any one subject form a point relative to the x- and y-axes. A plot of the n observations as n points is called a scatterplot.

Figure 9.3 indicates that the trend of points seems to be approximated fairly well by a straight line. One point, however, is far removed from the rest. This is the point for the District of Columbia (D.C.). It had a murder rate much higher than that of any state. This point lies far from the overall trend. Figure 9.3 also shows box plots for these variables. They reveal that D.C. is an extreme outlier on murder rate. In fact, it falls 6.5 standard deviations above the mean. We shall see that outliers can have a serious impact on a regression analysis.

The scatterplot provides a visual check of whether a relationship is approximately linear. When the relationship seems highly nonlinear, it is not sensible to use a straight-line model. Figure 9.4 illustrates such a case. This figure shows a negative relationship over part of the range of x-values and a positive relationship over the rest. These cancel each other out using a straight-line model. For such data, a nonlinear model presented in Section 14.5 is more appropriate. - p. 251
linear function
The formula y = α + βx expresses observations on y as a linear function of observations on x. The formula has a straight-line graph with slope β (beta) and y-intercept α (alpha).
Least Squares Estimates
The least squares estimates a and b are the values that provide the prediction equation ŷ = a + bx for which the residual sum of squares, SSE = Σ(y − ŷ)², is a minimum. - p. 255

The prediction line ŷ = a + bx is called the least squares line, because it is the one with the smallest sum of squared residuals. This value of SSE is smaller than the SSE for any other straight-line predictor, such as ŷ = −0.88 + 0.60x. In this sense, the data fall closer to this line than to any other line. Most software (e.g., R, SPSS, Stata) calls SSE the residual sum of squares. It describes the variation of the data around the prediction line. Table 9.3 reports it in the Sum of Squares column, in the row labeled Residual.

Besides making the errors as small as possible in this summary sense, the least squares line

- Has some positive residuals and some negative residuals, but the sum (and mean) of the residuals equals 0.
- Passes through the point (x̄, ȳ).

The first property tells us that the too-low predictions are balanced by the too-high predictions. Just as deviations of observations from their mean ȳ satisfy Σ(y − ȳ) = 0, the residuals satisfy Σ(y − ŷ) = 0. The sketch below verifies both properties.
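A quick check of these two properties on made-up data:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 30)
    y = 1 + 2 * x + rng.normal(0, 1, 30)    # made-up data

    b, a = np.polyfit(x, y, 1)
    residuals = y - (a + b * x)
    print(np.isclose(residuals.sum(), 0))            # residuals sum to 0
    print(np.isclose(a + b * x.mean(), y.mean()))    # line passes through (x-bar, y-bar)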
DESCRIBING VARIATION ABOUT THE REGRESSION LINE
The linear regression model has an additional parameter σ describing the standard deviation of each conditional distribution. That is, σ measures the variability of the y-values for all subjects having the same x-value. We refer to σ as the conditional standard deviation.

A model also assumes a particular probability distribution for the conditional distribution of y. This is needed to make inferences about the parameters. For quantitative variables, the most common assumption is that the conditional distribution of y is normal at each fixed value of x, with unknown standard deviation σ.
measuring linear association: the correlation
The linear regression model uses a straight line to describe the relationship. For this model, this section introduces two measures of the strength of association between two quantitative variables.
RESIDUAL MEAN SQUARE: ESTIMATING CONDITIONAL VARIATION
The ordinary linear regression model assumes that the standard deviation σ of the conditional distribution of y is identical at the various values of x. The estimate of σ uses SSE = Σ(y − ŷ)², which measures sample variability about the least squares line. The estimate is

s = √[SSE/(n − 2)].

If the constant variation assumption is not valid, then s summarizes the average variability about the line.
PROPERTIES OF r-SQUARED
The properties of r² follow directly from those of the correlation r or from its definition in terms of the sums of squares.

- Since −1 ≤ r ≤ 1, r² falls between 0 and 1.
- The minimum possible value for SSE is 0, in which case r² = (TSS − 0)/TSS = 1. For SSE = 0, all sample points must fall exactly on the prediction line. In that case, there is no error using x to predict y with the prediction equation. This condition corresponds to r = ±1.
THE SLOPE AND STRENGTH OF ASSOCIATION
The slope b of the prediction equation tells us the direction of the association. Its sign indicates whether the prediction line slopes upward or downward as x increases, that is, whether the association is positive or negative. The slope does not, however, directly tell us the strength of the association. The reason is that its numerical value is intrinsically linked to the units of measurement.

For example, consider the prediction equation ŷ = −0.86 + 0.58x for y = murder rate and x = percentage living below the poverty level. A one-unit increase in x corresponds to a b = 0.58 increase in the predicted number of murders per 100,000 people. This is equivalent to a 5.8 increase in the predicted number of murders per 1,000,000 people. So, if murder rate is the number of murders per 1,000,000 population instead of per 100,000 population, the slope is 5.8 instead of 0.58. The strength of the association is the same in each case, since the variables and data are the same. Only the units of measurement for y differ. The slope b doesn't directly indicate whether the association is strong or weak, because we can make b as large or as small as we like by an appropriate choice of units.

The slope is useful for comparing effects of two predictors having the same units. For instance, the prediction equation relating murder rate to percentage living in urban areas is ŷ = 3.28 + 0.06x. A one-unit increase in the percentage living in urban areas corresponds to a 0.06 predicted increase in the murder rate, whereas a one-unit increase in the percentage below the poverty level corresponds to a 0.58 predicted increase in the murder rate. An increase of 1 in percentage below the poverty level has a much greater effect on the murder rate than an increase of 1 in percentage urban.

The measures of association we now study do not depend on the units of measurement. Like the measures of association that Chapter 8 presented for categorical data, their magnitudes indicate the strength of association.
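A short sketch of this units effect, on made-up data: rescaling y multiplies the slope but leaves the correlation unchanged.

    import numpy as np

    rng = np.random.default_rng(4)
    poverty = rng.uniform(8, 27, 50)
    murders_per_100k = -0.9 + 0.6 * poverty + rng.normal(0, 2, 50)    # made up

    b1 = np.polyfit(poverty, murders_per_100k, 1)[0]
    b2 = np.polyfit(poverty, 10 * murders_per_100k, 1)[0]    # per 1,000,000 instead
    r1 = np.corrcoef(poverty, murders_per_100k)[0, 1]
    r2 = np.corrcoef(poverty, 10 * murders_per_100k)[0, 1]
    print(np.isclose(b2, 10 * b1))    # True: slope scales with the units of y
    print(np.isclose(r1, r2))         # True: correlation does not change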
this chapter
This chapter presents methods for analyzing association between quantitative response and explanatory variables. The analyses are collectively called a regression analysis.
SUMS OF SQUARES DESCRIBE CONDITIONAL AND MARGINAL VARIABILITY
To summarize, the correlation r falls between −1 and +1. It indicates the direction of the association, positive or negative, through its sign. It is a standardized slope, equaling the slope when x and y are equally spread out. A one standard deviation change in x corresponds to a predicted change of r standard deviations in y. The square of the correlation has a proportional reduction in error interpretation related to predicting y using ŷ = a + bx rather than ȳ.

The total sum of squares, TSS = Σ(y − ȳ)², summarizes the variability of the observations on y, since this quantity divided by n − 1 is the sample variance s²_y of the y-values. Similarly, SSE = Σ(y − ŷ)² summarizes the variability around the prediction equation, which refers to variability for the conditional distributions. For example, when r² = 0.39, the variability in y using x to make the predictions is 39% less than the overall variability of the y-values. Thus, the r² result is often expressed as "the poverty rate explains 39% of the variability in murder rate" or "39% of the variance in murder rate is explained by its linear relationship with the poverty rate." Roughly speaking, the variance of the conditional distribution of murder rate for a given poverty rate is 39% smaller than the variance of the marginal distribution of murder rate.

This interpretation has the weakness, however, that variability is summarized by the variance. Many statisticians find r² less useful than r because, being based on sums of squares, it uses the square of the original scale of measurement. It's easier to interpret the original scale than a squared scale. This is also the advantage of the standard deviation over the variance.
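A tiny sketch of the r² computation from the two sums of squares. The 0.39 value is from the murder-rate example; the TSS and SSE here are illustrative numbers chosen to give that ratio:

    TSS = 100.0    # illustrative total sum of squares
    SSE = 61.0     # illustrative residual sum of squares
    r_squared = (TSS - SSE) / TSS
    print(r_squared)    # 0.39: prediction error is reduced 39% by using x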
Least Squares Prediction Equation
Using sample data, we can estimate the equation for the simple straight-line model. The process treats α and β in the equation y = α + βx as parameters and estimates them.
linear relationships
We let y denote the response variable and x denote the explanatory variable. We analyze how values of y tend to change from one subset of the population to another, as defined by values of x. For quantitative variables, a mathematical formula describes how the conditional distribution of y (such as y = crime rate) varies according to the value of x (such as x = percentage below the poverty level). Does the crime rate tend to be higher for states that have higher poverty rates?
PREDICTION EQUATION HAS LEAST SQUARES PROPERTY
We summarize the size of the residuals by the sum of their squared values. This quantity, denoted by SSE, is

SSE = Σ(y − ŷ)².

In other words, the residual is computed for every observation in the sample, each residual is squared, and then SSE is the sum of these squares. The symbol SSE is an abbreviation for sum of squared errors. This terminology refers to the residual being a measure of prediction error from using ŷ to predict y.

The better the prediction equation, the smaller the residuals tend to be and, hence, the smaller SSE tends to be. Any particular equation has corresponding residuals and a value of SSE. The prediction equation specified by the formulas on page 252 for the estimates a and b of α and β has the smallest value of SSE out of all possible linear prediction equations.
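A small sketch comparing SSE for the least squares line against another candidate line, on made-up data:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(8, 27, 50)
    y = -0.9 + 0.6 * x + rng.normal(0, 2, 50)    # made-up data

    def sse(a, b):
        return np.sum((y - (a + b * x))**2)

    b_ls, a_ls = np.polyfit(x, y, 1)             # least squares fit
    print(sse(a_ls, b_ls) < sse(-0.88, 0.60))    # True: least squares minimizes SSE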
PREDICTION EQUATION
When the scatterplot suggests that the model y = α + βx may be appropriate, we use the data to estimate this line. The notation

ŷ = a + bx

represents a sample equation that estimates the linear model. In the sample equation, the y-intercept (a) estimates the y-intercept α of the model and the slope (b) estimates the slope β. The sample equation ŷ = a + bx is called the prediction equation, because it provides a prediction ŷ for the response variable at any value of x. The prediction equation is the best straight line, falling closest to the points in the scatterplot, in a sense explained later in this section. The formulas for a and b in the prediction equation ŷ = a + bx are

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²   and   a = ȳ − b x̄.

If an observation has both x- and y-values above their means, or both x- and y-values below their means, then (x − x̄)(y − ȳ) is positive. The slope estimate b tends to be positive when most observations are like this, that is, when points with large x-values also tend to have large y-values and points with small x-values tend to have small y-values.

We shall not dwell on these formulas or even illustrate how to use them, as anyone who does any serious regression modeling uses statistical software. The appendix at the end of the text provides details. Internet applets are also available.
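A direct implementation of these two formulas as a sketch, verified against library output on made-up data:

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 10, 40)
    y = 3 + 1.5 * x + rng.normal(0, 1, 40)    # made-up data

    # The least squares formulas for the slope and intercept.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()
    print(np.allclose([b, a], np.polyfit(x, y, 1)))    # True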