CH4
Extrapolation
ESTIMATING a VALUE OUTSIDE the RANGE OF MEASURED DATA - The use of a regression line for prediction well outside the range of values of the explanatory variable X that you used to obtain the line - Such predictions are often NOT ACCURATE
When Examining Scatterplots:
Identify deviations from the overall pattern by looking at the scatter of the data points about the regression line - Look for points with large Residuals—Outliers of the relationships—& any other unusual observations
Exercise 4.39: A study shows that there is a positive correlation between the size of a hospital (measured by its number of beds X) & the median number of days Y that patients remain in the hospital - Does this mean that you can shorten a hospital stay by choosing a small hospital? - Why?
No - It is more probable that those in critical condition are sent to large hospitals
When you report a regression, give r² as a measure of how successful the regression was in explaining the response
Perfect correlation (r = -1 or r = 1) means that the points lie exactly on a line - Then r² = 1 & all the variation in one variable is accounted for by the linear relationship with the other variable. If r = -0.7 or r = 0.7 - Then r² = 0.49 - About half the variation is accounted for by the linear relationship
Y-Hat (Ŷ)
The PREDICTED/EXPECTED/AVERAGE RESPONSE FOR ANY value of X - Because of the scatter of points about the line, the predicted response will usually not be exactly the same as the actually observed response (Y) - Residuals = Y - Ŷ
Intercept (a)
The VALUE OF Y WHEN X EQUALS ZERO - a = My - b(Mx). Although we need its value to complete the Regression Equation, it is MEANINGFUL in context ONLY WHEN the EXPLANATORY VARIABLE CAN actually TAKE VALUES CLOSE TO ZERO - Ex: the relationship between time to finish a 1-mile race (X) & pulse rate at the finish line (Y) • Because it is physiologically impossible to run 1 mile in 0 time, the intercept would have no useful interpretation in this particular context
Least-Squares Regression Line
The line that makes the SUM OF the SQUARED VERTICAL DISTANCES of the data points from the line AS SMALL AS POSSIBLE
Suppose we pull Subject 16's point in the scatterplot straight down
This altered point is now an outlier of the relationship in addition to being an outlier in the X direction. What happens to the regression line? - Figure 4.4(b) shows the result - The dashed line is the regression line with Subject 16's altered data - Because there are no other points with similar X-values, the dashed line chases the outlier - An outlier in X pulls the least-squares line toward itself - If the outlier in the X direction does not lie close to the line calculated from the other observations, it will be influential. When studying one variable by itself, an outlier that lies far above the other values pulls up the Mean (Mx) toward it &, therefore, is always influential - In the regression setting, however, not all outliers are influential
Equation of the Least-Squares Regression Line
We have data on an Explanatory Variable (X) & a Response Variable (Y) for n individuals From the data, we can calculate: - The Means (Mx & My) & Standard Deviations (SDx & SDy) of the 2 Variables - Their Correlation (r) Least-Squares Regression Line: - Ŷ = a + bx - Slope = b = r x (SDy/SDx) - Intercept = a = My - b(Mx)
Apply Your Knowledge 4.11: In 2013, the U.S. Food & Drug Administration made an announcement regarding widely used sleep drugs containing the active ingredient zolpidem. It contained the following statement: - "Since women eliminate zolpidem from their bodies more slowly than men, the FDA has notified the manufacturers that the recommended dose should be lowered for women" a. Drug safety initially had been established by studying zolpidem blood levels as a function of time since ingestion of the drug - State the explanatory & response variables here b. A lurking variable was later revealed, leading the FDA to issue a notification - What is this lurking variable? - What is the reason cited explaining the impact of this lurking variable?
a. Time since ingestion (explanatory) & zolpidem blood level (response) b. Sex (women eliminate the drug more slowly)
Cautions about Correlation & Regression
1. Beware extrapolation - Suppose that you have data on a child's growth between 3 and 8 years of age - You find a strong linear relationship between age X and height Y - If you fit a regression line to these data and use it to predict height at age 25 years, you will predict that the child will be more than 8 feet tall - Growth slows down and then stops at maturity, so extending the straight line to adult ages is foolish - Few relationships are linear for all values of X - Don't make predictions for values of X outside the range that actually appears in your data 2. Beware the lurking variable - The relationship between two variables can often be understood only by taking other variables into account - Lurking variables can make a correlation or regression misleading - You should always think about possible lurking variables before you draw conclusions based on correlation or regression - You should also consider that an association can sometimes be markedly different for different groups - The failure to recognize such issues can lead to an incorrect interpretation of the association between two variables - Teasing out the relative influence of multiple factors in complex environments can be very challenging - Identifying potential confounding variables requires some expertise on the topic studied or at the very least a healthy dose of common sense - Planning a multivariate study and analysis that includes potential lurking variables in addition to your main variables is an important part of statistics necessary to reach sound conclusions
Facts about Least-Squares Regression
1. The distinction between explanatory & response variables is essential in regression - We want to express Y as a function of X - Least-squares regression makes the distances of the data points from the line small only in the Y direction - If we reverse the roles of the two variables, we get a different least-squares regression line 2. There is a close connection between correlation & the slope of the least-squares line - The slope is b = r x (SDy/SDx) - You see that the slope & the correlation always have the same sign - Ex: if a scatterplot shows a positive association, then both b & r are positive - They should not be confused, however - The slope measures the average change in Y for a one-unit change in X - The correlation measures the strength & direction of the linear relationship 3. The least-squares regression line always passes through the point (Mx, My) on the graph of Y against X - Remember that the Intercept is a = My - b(Mx) - So, when X = Mx, Y = [My - b(Mx)] + b(Mx) = My 4. The correlation r describes the strength of a straight-line relationship - In the regression setting, the square of the correlation, r², is the fraction of the variation in the values of Y that is explained by the least-squares regression of Y on X
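Facts 1 to 3 can be checked numerically. The sketch below uses a small made-up data set (the values of `x` and `y` are hypothetical, chosen only for illustration) and a hand-written helper `least_squares` that is not from the text. It fits Y on X, then X on Y, and confirms that the first line passes through (Mx, My).

```python
# Hypothetical data, used only to illustrate Facts 1-3.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def mean(values):
    return sum(values) / len(values)

def least_squares(explanatory, response):
    """Return (intercept, slope) for predicting response from explanatory."""
    m_e, m_r = mean(explanatory), mean(response)
    s_er = sum((e - m_e) * (r - m_r) for e, r in zip(explanatory, response))
    s_ee = sum((e - m_e) ** 2 for e in explanatory)
    slope = s_er / s_ee
    return m_r - slope * m_e, slope

a, b = least_squares(x, y)          # regression of Y on X
a_rev, b_rev = least_squares(y, x)  # roles reversed: regression of X on Y

print(f"Y on X: y-hat = {a:.2f} + {b:.2f}x")
print(f"X on Y: x-hat = {a_rev:.2f} + {b_rev:.2f}y")

# Fact 3: the fitted value at x = Mx equals My.
print(a + b * mean(x), mean(y))
```

For these made-up values the Y-on-X slope is 0.6, while the X-on-Y line, rewritten with Y on the left, has slope 1. The two lines are different, as Fact 1 says.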
Check Your Skills 4.19: For a class project, you measure the weight in grams (g) & the tail length in millimeters (mm) of a group of mice The equation of the least-squares line for predicting tail length from weight is: - predicted tail length = 20 + (3 × weight) How much (on average) does tail length increase for each additional gram of weight? a. 3 mm b. 20 mm c. 23 mm
A
Check Your Skills 4.20: According to the regression line in Exercise 4.19, what is the predicted tail length for a mouse weighing 18 g? a. 74 mm b. 54 mm c. 34 mm
A
Check Your Skills 4.21: If you had measured the tail length in Exercise 4.19 in centimeters instead of millimeters (1 centimeter = 10 millimeters), what would the slope of the regression line be? a. 3/10 = 0.3 b. 3 c. (3)(10) = 30
A
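The three mouse questions (4.19 to 4.21) come down to reading the slope, substituting a value, and rescaling units. A minimal sketch, where the function and variable names are illustrative rather than from the text:

```python
def predicted_tail_length_mm(weight_g):
    """Regression line from Check Your Skills 4.19."""
    return 20 + 3 * weight_g

slope_mm_per_g = 3                       # 4.19: tail length rises 3 mm per added gram
pred_18 = predicted_tail_length_mm(18)   # 4.20: 20 + 3 * 18
slope_cm_per_g = slope_mm_per_g / 10     # 4.21: 1 cm = 10 mm, so the slope shrinks tenfold

print(pred_18, slope_cm_per_g)
```

Changing the units of Y rescales the slope but does not change the relationship itself.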
Lurking/Confounding Variable
A variable that is NOT ONE OF THE EXPLANATORY/RESPONSE VARIABLES in a study, yet MAY INFLUENCE the INTERPRETATION OF RELATIONSHIPS AMONG THOSE VARIABLES
Association ≠ Causation
An association between an Explanatory Variable X & a Response Variable Y, even if it is very strong, is not by itself good evidence that changes in X actually cause changes in Y
Ex 4.1: Figure 4.1 repeats Figure 3.2 (page 73), with the addition of a Regression Line for predicting fat gain from NEA change. What does this line mean? Suppose that an individual's NEA increases by 400 Cal when she overeats - From 400 Cal on the X axis, go up to the Regression Line & then over to the Y axis - The graph shows that the fat gain predicted by the linear model is a bit more than 2 kg
Any straight line describing the NEA data has the Form: - fat gain = a + (b x NEA change) - Where "fat gain" is Y & "NEA change" is X. The line in Figure 4.1 is the Regression Line with the equation: - fat gain = 3.505 - (0.00344 x NEA change). This equation makes it easy to predict fat gain - If a person's NEA increases by 400 Cal when she overeats, substitute X = 400 in the equation - The fat gain predicted by this model is: • fat gain = 3.505 - (0.00344 x 400) = 2.13 kg. The slope, b = -0.00344 - Tells us that fat gained decreases by 0.00344 kg, on average, for each added Calorie of NEA - The Slope of a Regression Line is the expected rate of change in the Response Variable as the Explanatory Variable changes. The intercept, a = 3.505 kg - The estimated fat gain if NEA does not change (X = 0) when a person overeats
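The substitution step can be written as a one-line function. This is only a sketch of the arithmetic in the example, and the function name is made up for illustration.

```python
def predicted_fat_gain_kg(nea_change_cal):
    """Regression line from the NEA example: fat gain = 3.505 - 0.00344 * NEA change."""
    return 3.505 - 0.00344 * nea_change_cal

pred_400 = predicted_fat_gain_kg(400)  # 3.505 - 1.376
pred_0 = predicted_fat_gain_kg(0)      # the intercept: no change in NEA

print(round(pred_400, 2), pred_0)
```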
Check Your Skills 4.22: Because elderly people may have difficulty standing straight to have their heights measured, a study looked at predicting overall height from height to the knee - Here are the data (in centimeters, cm) for 5 elderly men:
Knee height X (cm):     57.7   47.4   43.5   44.8   55.2
Overall height Y (cm):  192.1  153.3  146.4  162.7  169.1
Use your calculator or software: - What is the equation of the least-squares regression line for predicting height from knee height? a. Ŷ = 2.4 + 44.1X b. Ŷ = 44.1 + 2.4X c. Ŷ = -2.5 + 0.32X
B
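In place of a calculator, the fit can be reproduced with a few lines of plain Python. The sketch below uses the five data pairs from Check Your Skills 4.22. The variable names are illustrative, and the slope is computed from sums of deviations, which is algebraically the same as b = r x (SDy/SDx).

```python
knee = [57.7, 47.4, 43.5, 44.8, 55.2]          # X, cm
height = [192.1, 153.3, 146.4, 162.7, 169.1]   # Y, cm

n = len(knee)
mean_x = sum(knee) / n
mean_y = sum(height) / n

# Sums of cross-products and squared deviations about the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(knee, height))
s_xx = sum((x - mean_x) ** 2 for x in knee)

b = s_xy / s_xx            # slope
a = mean_y - b * mean_x    # intercept = My - b(Mx)

print(f"y-hat = {a:.1f} + {b:.1f}x")
```

Rounded to one decimal place, the result agrees with answer B.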
Residuals
Because we use the line to predict Y from X, the prediction errors we make are errors in Y - The vertical direction in the Scatterplot The VERTICAL DISTANCES OF THE POINTS FROM THE LINE - Y - Ŷ A good Regression Line makes these AS SMALL AS POSSIBLE - The most common way to make the COLLECTION OF VERTICAL PREDICTION ERRORS "as small as possible" is the LEAST-SQUARES method
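To see what "as small as possible" means in practice, one can compare the sum of squared residuals for several candidate lines. The sketch below reuses the knee-height data from Check Your Skills 4.22 and its three answer choices. The helper name `sum_squared_residuals` is made up for illustration.

```python
knee = [57.7, 47.4, 43.5, 44.8, 55.2]          # X, cm
height = [192.1, 153.3, 146.4, 162.7, 169.1]   # Y, cm

# Candidate lines as (intercept, slope), taken from the answer choices
candidates = {"a": (2.4, 44.1), "b": (44.1, 2.4), "c": (-2.5, 0.32)}

def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances (Y - Y-hat) for one candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(knee, height))

sse = {label: sum_squared_residuals(a, b) for label, (a, b) in candidates.items()}
best = min(sse, key=sse.get)

for label, value in sse.items():
    print(label, round(value, 1))
print("smallest sum of squared residuals:", best)
```

Of the three candidates, line b leaves by far the smallest squared prediction errors, which is why it is the one closest to the least-squares line.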
Ex 4.10: Obese parents tend to have obese children - But is obesity determined genetically? The results of a study of Mexican American girls aged 9 to 12 years are typical - The investigators measured body mass index (BMI), a measure of weight relative to height, for both the girls & their mothers - People with high BMI are labeled obese - The correlation between the BMI of daughters & the BMI of their mothers was r = 0.506
Body type is in part determined by heredity - Daughters inherit half their genes from their mothers. As a consequence, there is the potential for a direct causal link between the BMI of mothers & daughters - However, mothers who are obese may also set an example of little exercise & poor eating habits - It is also possible that both mothers & daughters are affected by the same environmental conditions, such as living in an area with limited access to healthy food (a "food desert") & no parks or public structures to facilitate exercise. Heredity, behavior, & environmental factors all likely contribute to some extent to the mother-daughter correlation. The lesson here is more subtle than just "association does not imply causation" - Even when direct causation is present, it may not be the whole explanation for a correlation. You must still worry about lurking variables - Careful statistical studies try to anticipate lurking variables and measure them - The mother-daughter study did measure TV viewing, exercise, & diet - Elaborate statistical analysis can account for the effects of these variables to come closer to the direct effect of mother's BMI on daughter's BMI - Even so, this remains a second-best approach to determining causation. When possible, the best way to get good evidence that X causes Y is to do an experiment in which we change X & keep lurking variables under control. When experiments cannot be done, explaining an observed association can be difficult & controversial - Many of the sharpest disputes in which statistics plays a role involve questions of causation that cannot be settled by experiment - Does using cell phones cause brain tumors? - Are our extreme levels of carbon dioxide emissions causing global warming? - Is human activity triggering a new, planet-wide mass extinction? - All these questions have become public issues - All concern associations among variables - & all have this in common: They try to pinpoint cause & effect in a setting involving complex relations among many interacting variables
Check Your Skills 4.23: Marco's mom measured his height in centimeters (cm) every few months when he was between ages 4.5 & 7.5 The recorded values lie close to the line whose equation is: - Ŷ = 80.46 + 6.25X Using this equation to find out whether Marco will be taller than his mom by age 14 (she is 168 cm tall) is an example of... a. An influential point b. Prediction within range c. Extrapolation
C
Check Your Skills 4.24: Across American cities, the number of pharmacists & the number of deaths in a given year are positively associated What can be concluded from this association? a. Pharmacists sell dangerous drugs, therefore more pharmacists lead to more deaths b. Patients near death are prescribed lots of medication, therefore more deaths means more pharmacists are needed c. Larger cities tend to have both more deaths & more pharmacists, therefore the association may not be causal
C
The Limitations of Correlation & Regression
Correlation and regression lines describe only linear relationships - You can do the calculations for any relationship between two quantitative variables, but the results are useful only if the scatterplot shows a linear pattern Correlation and least-squares regression lines are not resistant - Always plot your data and look for potentially influential observations
Ex 4.11: Despite the difficulties, it is sometimes possible to build a strong case for causation in the absence of experiments - The evidence that smoking causes lung cancer is about as strong as nonexperimental evidence can be
Doctors had long observed that most patients with lung cancer were smokers. Comparison of smokers & "similar" nonsmokers showed a very strong association between smoking & death from lung cancer. Could the association be explained by lurking variables? - Ex: Might there be a genetic factor that predisposes people both to nicotine addiction & to lung cancer? Smoking & lung cancer would then be positively associated even if smoking had no direct effect on the lungs - How were these objections overcome? Let's answer this question in general terms: What are the criteria for establishing causation when we cannot do an experiment? - The association is strong • The association between smoking and lung cancer is very strong - The association is consistent • Many studies of different kinds of people in many countries link smoking to lung cancer • That reduces the chance that a lurking variable specific to one group or one study explains the association - Higher doses are associated with stronger responses • People who smoke more cigarettes per day or who smoke over a longer period get lung cancer more often • People who stop smoking reduce their risk - The alleged cause precedes the effect in time • Lung cancer develops after years of smoking • The number of men dying of lung cancer rose as smoking became more common, with a lag of about 30 years • Lung cancer kills more men than any other form of cancer • Lung cancer was rare among women until women began to smoke • Lung cancer in women rose along with smoking rates, again with a lag of about 30 years, & has now passed breast cancer as the leading cause of cancer death among women - The alleged cause is plausible • Experiments with animals show that tars from cigarette smoke do cause cancer. In 2014, the U.S. Surgeon General released a comprehensive report on the health consequences of smoking on the 50-year anniversary of the first Surgeon General's warning about the health hazards of smoking - This latest report lists 12 cancers (including lung cancer) & numerous chronic diseases causally linked to smoking - The evidence for causation is overwhelming because of the sheer volume of scientific studies supporting the criteria described previously. When attempting to establish causation, however, keep in mind that 1 observational study cannot provide evidence as strong as the evidence provided by 1 well-designed experiment
Working with Logarithm Transformations
Logarithm transformations are quite frequent in biology to study problems ranging from population growth to animal physiology A regression line can be computed using the transformed data, & prediction within the range of the data can be performed as with simple linear relationships - The difference is that we will need to transform the predicted Y back into its original (non-logarithmic) unit
Ex 4.8: High blood pressure & arterial stiffness are risk factors for cardiovascular disease, the leading cause of death worldwide - Understanding the physiology of cardiovascular health is important for developing preventive & curative approaches to the disease The sympathetic nervous system is known to affect vasoconstriction - A research team investigated the relationship between sympathetic nerve activity (measured in number of bursts per 100 heartbeats) & an indicator of arterial stiffness (augmented aortic pressure, in millimeters of mercury, mm Hg) in 44 healthy young adults - The results, displayed in Figure 4.7(a), show no clear relationship between sympathetic nerve activity & the arterial stiffness indicator (r = -0.17)
Men & women have a somewhat different cardiovascular physiology, so the researchers decided to examine their findings separately for men & for women - The results are displayed in Figure 4.7(b) & (c), respectively To the researchers' surprise, a moderate linear relationship was noticeable in both cases, but the relationship was positive for men (r = 0.53) & negative for women (r = -0.58) In this case, combining data from men & women created the appearance of no relationship between sympathetic nerve activity & the arterial stiffness indicator - Gender is a Lurking Variable that drastically affects the relationship between the 2 Quantitative Variables studied
Ex 4.3: For the NEA data, r = -0.7786 & r² = 0.6062 By comparison, Figure 3.1 (page 71) shows a stronger linear relationship in which the points are more tightly concentrated along a line - Here, r = 0.9448 & r² = 0.8927
Nearly 61% of the variation in fat gain is accounted for by the linear relationship with change in NEA - The other 39% is individual variation among subjects that is not explained by the linear relationship Approximately 89% of the variation in manatee deaths from collisions with powerboats in a given year is explained by the number of powerboats registered that year - Only about 11% is variation among years with the same number of powerboat registrations
Influential Observations/Points
REMOVING IT would markedly CHANGE the RESULT OF the CALCULATION. The result of a statistical calculation may be of little practical use if it depends strongly on one or more of these - Points that are outliers in either the X or Y direction of a scatterplot are often influential for the correlation - Points that are Outliers in the X direction are often influential for the least-squares regression line
Ex 4.7: A number of studies have found that individuals who consume moderate amounts of alcohol are more likely to live longer - News reports of the potential health benefit of moderate alcohol consumption often follow the publication of such studies - But can we legitimately conclude that alcohol consumption itself is a cause of greater longevity?
Researchers reviewed 87 different scientific studies examining the association between alcohol consumption & longevity - They found that moderate drinkers tended to be healthier not necessarily because of their alcohol consumption, but because they tended to be better educated & more affluent - A person's socioeconomic situation produces a number of lurking variables that explain, at least in part, why longevity is associated with alcohol consumption. In addition, people who do not consume alcohol may have radically different reasons for abstaining - Some people may choose to abstain from alcohol as a lifelong lifestyle decision, whereas others may have been compelled to stop drinking because of poor health - These individuals are likely to lower the average longevity of all study participants who do not drink alcohol, so that the group with moderate drinking appears to have the better outcome
Ex 4.2: Example 4.1 gives the equation for the Least-Squares Regression Line between NEA change & fat gain - We can obtain this equation fairly quickly. From the raw data given in Example 3.5 (page 72), we find that the sample Means & Standard Deviations for the 2 Variables are:
       NEA change (X)   Fat gain (Y)
M      324.75           2.388
SD     257.66           1.139
The Correlation between them is r = -0.779. This is all the information we need to find the Slope & the Intercept
Slope = r x (SDy/SDx) = -0.779 x (1.139/257.66) = -0.779 x 0.00442 = -0.00344
Intercept = My - b(Mx) = 2.388 - (-0.00344)(324.75) = 2.388 - (-1.11714) = 3.505
Ŷ = 3.505 - 0.00344X
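The same calculation can be sketched in Python from the summary statistics alone. Variable names are illustrative. Note one assumption made here to match the card: the slope is rounded to five decimal places before the intercept is computed, as in the worked arithmetic above. Carrying the unrounded slope through gives a very slightly larger intercept (about 3.506).

```python
r = -0.779
mean_x, sd_x = 324.75, 257.66   # NEA change (Cal)
mean_y, sd_y = 2.388, 1.139     # fat gain (kg)

slope_unrounded = r * sd_y / sd_x        # b = r x (SDy/SDx)
slope = round(slope_unrounded, 5)        # rounded as in the worked example
intercept = mean_y - slope * mean_x      # a = My - b(Mx)

print(f"y-hat = {intercept:.3f} + ({slope:.5f})x")
```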
Review of Straight Lines
Suppose that Y is a Response Variable (plotted on the vertical axis) & X is an Explanatory Variable (plotted on the horizontal axis) A straight line relating Y to X has an equation of the form: - y = a + bx - b = Slope - a = Intercept
Slope (b)
The AMOUNT BY WHICH Y CHANGES WHEN X INCREASES BY ONE UNIT - b = r x (SDy/SDx). The SIZE of it DEPENDS ON the UNITS in which WE MEASURE THE TWO VARIABLES - Ex: in the NEA example, the slope (b = -0.00344) is the change in fat gain in kilograms when NEA increases by 1 Cal • There are 1000 grams in a kilogram • If we measured fat gain in grams, the slope would be 1000 times larger in size, b = -3.44 • You CAN'T DETERMINE HOW IMPORTANT A RELATIONSHIP IS BY LOOKING AT THE SIZE of the slope of the Regression Line
Regression Line
The LINE OF BEST FIT A straight line that summarizes the linear relationship between 2 variables, but only in a specific setting: - When ONE OF THE VARIABLES IS THOUGHT TO HELP EXPLAIN/PREDICT THE OTHER - That is, REGRESSION DESCRIBES a RELATIONSHIP BETWEEN AN EXPLANATORY VARIABLE (X) & a RESPONSE VARIABLE (Y) It is a MATHEMATICAL MODEL that: - Represents the UNDERLYING PATTERN in the data - Provides an equation showing how the RESPONSE VARIABLE Y CHANGES, ON AVERAGE, AS the EXPLANATORY VARIABLE X CHANGES • This equation may be used to PREDICT the EXPECTED VALUE OF Y FOR A GIVEN VALUE OF X
Ex 4.9: Even reputable publications like the New England Journal of Medicine can display a certain sense of humor - Right around Nobel Prize season, the journal published a satirical editorial describing an analysis of the correlation between chocolate consumption X (in kilograms per year per capita) & the number of Nobel laureates Y (per 10 million population) among nations that have ever received a Nobel Prize A scatterplot in the editorial shows a positive linear trend with r = 0.791
The basic meaning of causation is that by changing X we can bring about a change in Y Could a country consume more chocolate to boost its number of Nobel laureates? - No As the editorial points out, chocolate consumption is computed at the population level & there is no indication of how much chocolate Nobel laureates actually consume - This makes it difficult to justify a plausible cause-&-effect mechanism between per-capita chocolate consumption & per-capita number of Nobel laureates Correlations such as this are sometimes called "nonsense correlations" or "spurious correlations" - In such a case, the correlation is real - What is nonsense is the conclusion that changing one of the variables causes changes in the other - A lurking variable (e.g., possibly a country's wealth in this case) that influences both X & Y can create a high correlation even though there is no direct connection between X & Y
Ex 4.4: "Empathy" means being able to understand what others feel. To see how the brain expresses empathy, researchers recruited 16 couples in their mid-twenties who were married or had been dating for at least two years - They zapped the man's hand with an electrode while the woman watched and while they measured the activity in several parts of the woman's brain that would respond to her own pain - Brain activity was recorded as a fraction of the activity observed when the woman herself was zapped with the electrode - The women also completed a psychological test that measures empathy - The research objective is to see if women who score higher in empathy respond more strongly when their partner has a painful experience - Here are data for one brain region:
Subject   Empathy score   Brain activity
1         38              −0.120
2         53              0.392
3         41              0.005
4         55              0.369
5         56              0.016
6         61              0.415
7         62              0.107
8         48              0.506
9         43              0.153
10        47              0.745
11        56              0.255
12        65              0.574
13        19              0.210
14        61              0.722
15        32              0.358
16        105             0.779
Figure 4.3 is a scatterplot of these data, with empathy score as the explanatory variable X and brain activity as the response variable Y
The plot shows a positive association - That is, women who score higher in empathy do indeed react more strongly to their partners' pain - The overall pattern is moderately linear, with correlation r = 0.515 The solid line on the plot is the least-squares regression line of brain activity on empathy score - Its equation is Ŷ = 0.00761X - 0.0578 Figure 4.3 shows one unusual observation - Subject 16 is an Outlier in the X direction, with an empathy score 40 points higher than the score for any other subject - However, this point fits the overall pattern &, therefore, is not an Outlier of the relationship - Does this unusual data point have any influence on the correlation & regression equation? Because of this point's extreme position on the empathy scale X, Subject 16 has a strong influence on the correlation - Dropping this point reduces the correlation from r = 0.515 to r = 0.331 - We say that Subject 16 is influential for calculating the correlation Is this observation also influential for the least-squares line? - Figure 4.4(a) shows that it is not - The regression line calculated without Subject 16 (dashed) differs little from the line that uses all the observations (solid) - The reason that this Outlier in the X direction has little influence on the regression line is that it is not an Outlier of the relationship & lies close to the dashed regression line calculated from the other observations
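The influence of Subject 16 can be checked by computing the correlation and the least-squares line twice, once with all 16 subjects and once without the last one. This is a sketch in plain Python using the data from this example. The helper `summarize` is made up for illustration and is not a library function.

```python
from math import sqrt

empathy = [38, 53, 41, 55, 56, 61, 62, 48, 43, 47, 56, 65, 19, 61, 32, 105]
activity = [-0.120, 0.392, 0.005, 0.369, 0.016, 0.415, 0.107, 0.506,
            0.153, 0.745, 0.255, 0.574, 0.210, 0.722, 0.358, 0.779]

def summarize(x, y):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    s_xx = sum((xi - mx) ** 2 for xi in x)
    s_yy = sum((yi - my) ** 2 for yi in y)
    slope = s_xy / s_xx
    return my - slope * mx, slope, s_xy / sqrt(s_xx * s_yy)

a_all, b_all, r_all = summarize(empathy, activity)
# Subject 16 is the last observation (empathy score 105).
a_15, b_15, r_15 = summarize(empathy[:-1], activity[:-1])

print(f"all 16 subjects:    r = {r_all:.3f}, y-hat = {a_all:.4f} + {b_all:.5f}x")
print(f"without Subject 16: r = {r_15:.3f}, y-hat = {a_15:.4f} + {b_15:.5f}x")
```

The correlation drops noticeably when Subject 16 is removed, while the two fitted lines stay close to each other, which is the distinction the example draws between influence on r and influence on the regression line.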
Ex 4.6: Humans have large brains compared with other mammals, but they are also a relatively large mammal species - How does an animal's brain relate to its body size? Figure 4.5(a) shows the scatterplot of brain weight in grams against body weight in kilograms for 96 species of mammals - Figure 4.5(b) shows the scatterplot for the same data after a logarithm transformation - The line is the least-squares regression line for predicting Y from X The regression line for the untransformed data in Figure 4.5(a) is clearly unsatisfactory because most mammals are so small compared to elephants - The correlation between brain weight & body weight is r = 0.86, but this value is misleading - If we remove the elephant data point, the correlation for the other 95 species drops to r = 0.50 - Humans, dolphins, & hippos are also clear outliers of the relationship
The scatterplot using transformed data in Figure 4.5(b) has no obvious outliers and is clearly linear, with correlation r = 0.96 The least-squares regression for predicting the logarithm of brain weight from the logarithm of body weight has the equation: - log(brain weight) = 1.01 + [0.72 × log(body weight)] For a mammal species weighing, on average, 100 kg, the logarithm is log(100) = 2, & the predicted logarithm of brain weight is - log(brain weight) = 1.01 + (0.72 × 2) = 2.45 To undo the logarithm transformation, recall that for common logarithms with base 10, - Y = 10^[log(Y)] Thus, the predicted brain weight, in grams, for the 100-kg species is - brain weight = 10^(2.45) = 282 g Examine Figure 4.5 & verify that this prediction makes sense on both scatterplots, although the scatterplot of transformed data is much easier to use
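The predict-then-back-transform step can be sketched as a small function. The function name is illustrative. The coefficients are the ones given in this example, and base-10 logarithms are assumed, as stated above.

```python
import math

def predicted_brain_weight_g(body_weight_kg):
    """Predict on the log scale, then undo the base-10 logarithm."""
    log_brain = 1.01 + 0.72 * math.log10(body_weight_kg)
    return 10 ** log_brain

pred_100kg = predicted_brain_weight_g(100)  # log10(100) = 2, so log(brain) = 2.45
print(round(pred_100kg))
```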
Interpreting r²
The squared value of the correlation coefficient tells us what fraction of the variation in the Y variable is explained by the least-squares regression model. When there is a linear relationship, some of the variation in Y is accounted for by the fact that as X changes, the predicted response Ŷ is pulled along with it - This is the variation in Y that is explained by the regression model - r² = (variation in Ŷ as X pulls it along the line)/(total variation in the observed values of Y) = (total variation in observed values of Y - variation of the Residuals)/(total variation in observed values of Y). Unless all the data points lie exactly on a line, there is additional variation in the actual response Y that appears as the scatter of points above & below the line - Recall that the least-squares method gives us, by definition, the line with the smallest possible sum of squared vertical distances from the points to the line - These squared vertical distances are the still-unexplained variation in Y
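The two ratios above can be checked against the squared correlation on a small data set. The sketch below reuses the knee-height data from Check Your Skills 4.22 and takes "variation" to mean a sum of squared deviations, which is an assumption about how the card's wording maps to a formula. Variable names are illustrative.

```python
from math import sqrt

knee = [57.7, 47.4, 43.5, 44.8, 55.2]          # X, cm
height = [192.1, 153.3, 146.4, 162.7, 169.1]   # Y, cm

n = len(knee)
mx, my = sum(knee) / n, sum(height) / n
s_xy = sum((x - mx) * (y - my) for x, y in zip(knee, height))
s_xx = sum((x - mx) ** 2 for x in knee)
total_variation = sum((y - my) ** 2 for y in height)

b = s_xy / s_xx
a = my - b * mx
fitted = [a + b * x for x in knee]

explained_variation = sum((f - my) ** 2 for f in fitted)             # variation in Y-hat
residual_variation = sum((y - f) ** 2 for y, f in zip(height, fitted))

r = s_xy / sqrt(s_xx * total_variation)
r2_explained = explained_variation / total_variation
r2_residual = (total_variation - residual_variation) / total_variation

print(round(r ** 2, 3), round(r2_explained, 3), round(r2_residual, 3))
```

All three printed values agree, which is the identity the card describes.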
Association DOES NOT IMPLY Causation
When we study the relationship between two variables, we are often interested in whether changes in the explanatory variable cause changes in the response variable A strong association between two variables is not enough to draw conclusions about cause and effect Sometimes an observed association really does reflect cause & effect - A household that heats with natural gas uses more gas in colder months because cold weather requires burning more gas to stay warm In other cases, an association is explained by lurking variables, & the conclusion that X causes Y is either wrong or not proved The important caution that association does not necessarily imply causation applies whether the variables considered are quantitative or categorical
Exercise 4.37: Exercise 3.32 (page 88) examined the relationship between mean seed weight (in milligrams, mg) & mean number of seeds produced in a year by several common tree species. After converting the data from Table 3.5 (page 89) into a logarithm scale, a scatterplot of the converted data shows a linear relationship that can be described by the least-squares regression equation: - Ŷ = 4.238 - 0.567X - Ŷ = the log of seed count (base 10) - X = the log of seed weight (base 10) a. Using this equation, predict the log of seed count Ŷ for a tree species with a mean seed weight of 1000 mg (X = 3) b. Following Example 4.2, undo the logarithm transformation to find the predicted mean annual seed count for a tree species with a mean seed weight of 1000 mg
a. Ŷ = 4.238 - (0.567 × 3) = 2.537 b. Predicted mean seed count = 10^(2.537) = 344.3
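A quick check of both parts, using X = 3 as given in the exercise. The variable names are illustrative.

```python
log_seed_weight = 3                                # log10 of 1000 mg
log_seed_count = 4.238 - 0.567 * log_seed_weight   # part a: prediction on the log scale
seed_count = 10 ** log_seed_count                  # part b: undo the base-10 logarithm

print(round(log_seed_count, 3), round(seed_count, 1))
```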
Exercise 4.27: Exercises 3.26 & 3.27 (page 82) describe an automated computerized grading system for oysters based on either a 2D or a 3D image-processing approach. The data appear in Table 3.3 (page 86).
Actual (cm3)   2D (thousand pixels)   3D (million voxels)
13.04          47.907                 5.136699
11.71          41.458                 4.795151
17.42          60.891                 6.453115
7.23           29.949                 2.895239
10.03          41.616                 3.672746
15.59          48.070                 5.728880
9.94           34.717                 3.987582
7.53           27.230                 2.678423
12.73          52.712                 5.481545
12.66          41.500                 5.016762
10.53          31.216                 3.942783
10.84          41.852                 4.052638
13.12          44.608                 5.334558
8.48           35.343                 3.527926
14.24          47.481                 5.679636
11.11          40.976                 4.013992
15.35          65.361                 5.565995
15.44          50.910                 6.303198
5.67           22.895                 1.928109
8.26           34.804                 3.450164
10.95          37.156                 4.707532
7.97           29.070                 3.019077
7.34           24.590                 2.768160
13.21          48.082                 4.945743
7.83           32.118                 3.138463
11.38          45.112                 4.410797
11.22          37.020                 4.558251
9.25           39.333                 3.449867
13.75          51.351                 5.609681
14.37          53.281                 5.292105
a. If you haven't done so already, make a scatterplot of the relationship between actual volume (response) & 2D volume reconstruction (explanatory), & another scatterplot showing the relationship between actual volume & 3D volume reconstruction - Compare the two image-processing approaches b. Obtain the equation for the least-squares regression line describing actual volume as a function of 2D reconstruction - Use the regression equation to predict the actual volume of an oyster with a 2D reconstruction of 35 thousand pixels c. Obtain the equation for the least-squares regression line describing actual volume as a function of 3D reconstruction - Use the regression equation to predict the actual volume of an oyster with a 3D reconstruction of 4.5 million volume pixels d. What percent of the variation in actual oyster volumes is explained by the 2D program? - By the 3D program? - Which is the more accurate predictive model? - Explain your answer e. Explain in the context of this exercise why it is important to include both the equation for the least-squares regression line & the value of r² when reporting on a linear association
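a. Both relationships are linear, but the relationship is stronger with the 3D approach b. Ŷ = 0.3367 + 0.2649X - For X = 35, Ŷ = 9.608 cm^3 c. Ŷ = 0.419572 + 2.475224X - For X = 4.5, Ŷ = 11.558 cm^3 d. r² = 0.846 for 2D - r² = 0.954 for 3D - The 3D approach explains more of the variation in actual volume (95.4% versus 84.6%), so it is the more accurate predictive model e. The regression equation alone is not sufficient to judge a model - The equation gives the predicted volume, while r² tells how much of the variation in actual volume the line accounts for, that is, how far the predictions can be trusted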
Exercise 4.35: Drilling down beneath a lake in Alaska yields chemical evidence of past changes in climate - Biological silicon, left by the skeletons of single-celled creatures called diatoms, is a measure of the abundance of life in the lake - A rather complex variable based on the ratio of certain isotopes relative to ocean water gives an indirect measure of moisture, mostly from snow - As we drill down, we look further into the past - Here are data from 2300 to 12,000 years ago:
Isotope (%)   Silicon (mg/g)
−19.90        97
−19.84        106
−19.46        118
−20.20        141
−20.71        154
−20.80        265
−20.86        267
−21.28        296
−21.63        224
−21.63        237
−21.19        188
−19.37        337
a. Make a scatterplot of silicon (response) against isotope (explanatory) - Ignoring the outlier, describe the direction, form, & strength of the relationship - The researchers say that this relationship & relationships among other variables they measured are evidence for cyclic changes in climate that are linked to changes in the sun's activity b. The researchers single out one point: "The open circle in the plot is an outlier that was excluded in the correlation analysis" - Circle this outlier on your graph - What is the correlation with & without this point? - The point strongly influences the correlation - Explain why the outlier moves r in the direction revealed by your calculations c. Is the outlier also strongly influential for the regression line? - Calculate & draw on your graph the regression lines with & without the outlier, then discuss what you see - Explain why adding the outlier moves the regression line in the direction shown on your graph
a. Ignoring the outlier, the plot shows a moderately strong, negative linear relationship b. For all 12 points: r = −0.3387 - Without the outlier: r = −0.7866 - The outlier weakens the linear pattern c. For all 12 points: Ŷ = −492.6 − 33.79X - Without the outlier: Ŷ = −1371.6 − 75.52X - The line is pulled toward the outlier, reducing the very large vertical deviation of this point from the line
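The effect of the outlier on r and on the regression line can be checked with a short, standard-library-only Python sketch. The helper `fit` is a hypothetical name introduced here; it applies the usual least-squares formulas, and the outlier is assumed to be the last pair in the exercise's table (−19.37, 337).

```python
import math

# Isotope (%) and silicon (mg/g) pairs from Exercise 4.35.
# The last pair (-19.37, 337) is assumed to be the excluded outlier.
isotope = [-19.90, -19.84, -19.46, -20.20, -20.71, -20.80,
           -20.86, -21.28, -21.63, -21.63, -21.19, -19.37]
silicon = [97, 106, 118, 141, 154, 265, 267, 296, 224, 237, 188, 337]

def fit(xs, ys):
    """Return (r, slope, intercept) for the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return sxy / math.sqrt(sxx * syy), slope, my - slope * mx

r_all, b_all, a_all = fit(isotope, silicon)                # all 12 points
r_trim, b_trim, a_trim = fit(isotope[:-1], silicon[:-1])   # outlier dropped

print(f"all 12 points:   r = {r_all:.4f}, slope = {b_all:.2f}, intercept = {a_all:.1f}")
print(f"without outlier: r = {r_trim:.4f}, slope = {b_trim:.2f}, intercept = {a_trim:.1f}")
```

Dropping one point moves r from about −0.34 to about −0.79 and more than doubles the size of the slope, which is what makes this point influential.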
Apply Your Knowledge 4.9: Exercise 3.33 (page 88) examined the relationship between the total number of stem cell divisions in the lifetime of a given tissue & the lifetime risk of cancer for that tissue, based on 31 types of cancers for which this information is known in the U.S. population - After the data from Table 3.6 (page 89) are converted to a logarithm scale (base 10), a scatterplot of the results shows a linear pattern - Figure 4.6 gives part of the regression output for the log-transformed data using statistical software:
Term           Estimate    Std. Error   t Ratio   Prob > |t|
Intercept      -7.61081    0.722916     -10.53    < 0.0001*
LogDivisions   0.5326381   0.07317      7.28      < 0.0001*
a. Rounding the coefficients to four decimal places, give the equation of the least-squares regression line modeling LogRisk (Y) as a function of LogDivisions (X) b. Using this equation, predict the log of lifetime cancer risk Ŷ for a tissue with a lifetime total of 1 billion (10^9) stem cell divisions - Be sure to first find the value of X that corresponds to 1 billion divisions c. Following the steps described in Example 4.6, undo the logarithm transformation to find the predicted lifetime cancer risk for a tissue with a lifetime total of 1 billion stem cell divisions
a. LogRisk = -7.6108 + 0.5326(LogDivisions) b. Predicted LogRisk = -2.8174 c. Predicted Risk = 0.00152
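The arithmetic behind parts b & c, written out step by step using the rounded coefficients from part a:

```latex
\begin{align*}
X &= \log_{10}\!\left(10^{9}\right) = 9 \\
\hat{Y} &= -7.6108 + 0.5326\,(9) = -7.6108 + 4.7934 = -2.8174 \\
\text{Predicted risk} &= 10^{\hat{Y}} = 10^{-2.8174} \approx 0.00152
\end{align*}
```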
Apply Your Knowledge 4.3: Exercise 3.2 discusses a study in which scientists examined data on Mean summer sea-surface temperatures (in degrees Celsius) & Mean coral growth (in millimeters per year) over a several-year period at locations in the Red Sea - Here are the data:
Temperature (°C)   29.68   29.87   30.01   30.25   30.47   30.65   30.90
Growth (mm/y)      2.63    2.57    2.67    2.60    2.47    2.39    2.25
a. Use your calculator to find: - The Mean & Standard Deviation of both sea-surface temperature X & growth Y - The Correlation r between X & Y - Use these basic measures to find the equation of the least-squares line for predicting Y from X b. Enter the data into your software or calculator & use the regression function to find the least-squares line - The result should agree with your work in part a up to roundoff error
a. Mx = 30.261; SDx = 0.4391 - My = 2.511; SDy = 0.1502 - r = −0.8986 - b = r × (SDy/SDx) = (−0.8986)(0.1502/0.4391) = −0.3074 - a = My − b(Mx) = 2.511 − (−0.3074)(30.261) = 11.813 b. Ŷ = 11.813 − 0.3074X
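Parts a & b describe two routes to the same line: one through the summary measures (means, standard deviations, r) and one directly from the data. A minimal Python sketch of both routes follows; the helper names `mean` and `sample_sd` are made up for illustration, and small differences from the hand calculation above come from rounding the intermediate values.

```python
import math

temperature = [29.68, 29.87, 30.01, 30.25, 30.47, 30.65, 30.90]  # X, in °C
growth = [2.63, 2.57, 2.67, 2.60, 2.47, 2.39, 2.25]              # Y, in mm/y

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

n = len(temperature)
mx, my = mean(temperature), mean(growth)
sdx, sdy = sample_sd(temperature), sample_sd(growth)
sxy = sum((x - mx) * (y - my) for x, y in zip(temperature, growth))

# Route 1 (part a): build the line from the summary measures.
r = sxy / ((n - 1) * sdx * sdy)
b = r * sdy / sdx
a = my - b * mx

# Route 2 (part b): least-squares slope computed directly from the data.
sxx = sum((x - mx) ** 2 for x in temperature)
b_direct = sxy / sxx
a_direct = my - b_direct * mx

print(f"r = {r:.4f}, b = {b:.4f}, a = {a:.3f}")
print(f"direct fit: b = {b_direct:.4f}, a = {a_direct:.3f}")
```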
Exercise 4.41: Researchers assessed coffee drinking every 4 years (using validated food questionnaires) for more than 200,000 adults followed for as long as 30 years - They found that higher coffee consumption was associated with lower risk of total mortality - The published report states, "Compared with nondrinkers, coffee consumption of 1 to 5 cups per day was associated with lower risk of mortality, whereas coffee consumption of more than 5 cups per day was not associated with risk of mortality" a. Should doctors encourage adult patients to drink more coffee based on this study? - Explain your reasoning b. Can you think of any reason why adults who do not drink coffee may have a lower longevity than adults who do?
a. No - Association does not imply causation - There are likely confounding variables b. Answers will vary - They might have a lower socioeconomic status that limits their access to healthy food & health care, for instance
Apply Your Skills 4.13: A large study of the detailed lifestyle of California Seventh-Day Adventists found that a woman's level of education is one of the better predictors of her risk of getting breast cancer later in life, with higher levels of education associated with higher risks of breast cancer a. Do you think that we should ask women to drop out of school in order to reduce the incidence of breast cancer? - Is education a plausible candidate to be a direct cause of breast cancer? - Explain b. Another variable associated with the risk of breast cancer is the age at which a woman has her first child, with later ages associated with higher cancer rates - Is the age at first child a plausible cause of breast cancer? - Explain - How does knowing about the association between age at first child & breast cancer help us understand the association between education & breast cancer?
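a. No - There is no biological reason why education would cause breast cancer, so education is not a plausible direct cause b. Age at first child is a more plausible contributor than education because it is a biological factor - Women with higher education tend to have babies later in life, so age at first child may be a lurking variable behind the association between education & breast cancer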
Exercise 4.29: Measuring tree height is not an easy task - How well does trunk diameter predict tree height? - A survey of 958 live trees in an old-growth forest in Canada provides us with the following information: - The mean tree height is 15.6 meters (m), with standard deviation 13.4 m - The mean diameter, measured at "breast height" (1.3 m above ground), is 23.4 centimeters (cm), & the standard deviation is 23.5 cm - The correlation between the height & the diameter is very high: - r = 0.96 a. What are the slope & intercept of the regression line to predict tree height from trunk diameter? - Draw a graph of this regression line b. The tree diameters ranged from 1 to 101 cm - Predict the height of a tree in this area that is 50 cm in diameter at breast height - Use r² to assess the reliability of this prediction
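a. Slope = r × (SDy/SDx) = 0.96(13.4/23.5) = 0.547 m of height per cm of diameter - Intercept = My − b(Mx) = 15.6 − 0.547(23.4) = 2.791 m b. Prediction = 2.791 + 0.547(50) = 30.2 m - A high r² (here, 0.96² = 0.9216) implies a reliable prediction, & 50 cm lies well within the observed diameter range (1 to 101 cm)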
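Because only summary statistics are given, the line has to be built from b = r × (SDy/SDx) and a = My − b(Mx). A small Python sketch of that calculation, with variable names chosen here purely for illustration:

```python
# Summary statistics quoted in Exercise 4.29.
mean_height_m, sd_height_m = 15.6, 13.4   # response Y
mean_diam_cm, sd_diam_cm = 23.4, 23.5     # explanatory X
r = 0.96

slope = r * sd_height_m / sd_diam_cm              # b = r * (SDy / SDx)
intercept = mean_height_m - slope * mean_diam_cm  # a = My - b * Mx
predicted_height_m = intercept + slope * 50       # tree with a 50 cm diameter
r_squared = r ** 2                                # share of variation explained

print(f"slope = {slope:.3f} m per cm, intercept = {intercept:.3f} m")
print(f"predicted height at 50 cm = {predicted_height_m:.1f} m, r^2 = {r_squared:.4f}")
```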
Exercise 4.25: The study described in Exercise 4.4 also examined the impact of tara gum on the perceived creaminess of fruit yogurt drinks - Participants were asked to rate creaminess on a scale of 0 to 100 (with 0 meaning "not at all") for fruit yogurt drinks containing various concentrations of tara gum ranging from 0 to 0.47 g per 100 g of drink - The observed relationship between perceived creaminess & tara gum concentration was clearly linear & very strong, with r² = 0.95 - Here is the equation of the least-squares regression line: - creaminess = 29.38 + 101.98(concentration) a. What is the slope of the regression line? - Is the relationship between creaminess & tara gum concentration positive or negative? b. What is the predicted perceived creaminess for a tara gum concentration of 0.2 g/100 g? - Use r² to assess the reliability of this prediction c. Using this model, what would be the predicted perceived creaminess for a tara gum concentration of 1 g/100 g? - Explain why this is not a legitimate prediction
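a. Slope = 101.98 - Positive relationship b. Prediction = 29.38 + 101.98(0.2) = 49.8 - Very reliable with r² = 0.95 c. Prediction = 29.38 + 101.98(1) = 131.4, which is nonsense because it exceeds the maximum rating of 100 - This is Extrapolation: 1 g/100 g lies far outside the observed range of 0 to 0.47 g/100 g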
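One way to make the extrapolation warning concrete is to have the prediction report whether X falls inside the range the line was fitted on. The sketch below is a hypothetical helper, not something from the study; the range 0 to 0.47 g/100 g and the coefficients come from the exercise.

```python
# Observed range of tara gum concentration (g per 100 g) used to fit the line.
X_MIN, X_MAX = 0.0, 0.47

def predict_creaminess(concentration):
    """Return (predicted creaminess, True if concentration is within the fitted range)."""
    prediction = 29.38 + 101.98 * concentration
    within_range = X_MIN <= concentration <= X_MAX
    return prediction, within_range

for concentration in (0.2, 1.0):
    prediction, within_range = predict_creaminess(concentration)
    note = "within observed range" if within_range else "EXTRAPOLATION, do not trust"
    print(f"{concentration} g/100 g -> {prediction:.1f} ({note})")
```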
Apply Your Knowledge 4.5: Example 3.6 (page 73) described a study of Magellanic penguins' breeding - Figure 3.3 (page 73) shows the relationship between starvation & fledging in the Punta Tombo colony for the 28 years between 1983 & 2010 - There are two outliers on this scatterplot, which the researchers excluded from their analysis because of those years' highly unusual climatic conditions - Here is a quote from the publication: - "Starvation strongly affected reproductive success" - "When a higher percentage of chicks starved, a lower percentage fledged (r² = 0.70) excluding 1991 & 1999" a. Interpret the quoted value 0.70 in the context of this study b. What is the value of the correlation between percent of chicks starved & percent of chicks fledged?
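a. Starvation rate explained 70% of the variation in fledging rate b. r = −√0.70 ≈ −0.84 - The sign is negative because a higher percentage starved goes with a lower percentage fledged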
Apply Your Knowledge 4.1: The Department of Motor Vehicles warns of the effect of drinking alcohol on blood alcohol content (BAC, in percent of volume) - One "drink" is defined as 12 ounces (oz) of beer, or 4 oz of table wine, or 1.25 oz of 80-proof liquor - The impact of a single drink on a person's BAC depends on sex & weight a. For a 160-pound man, the blood alcohol content as a function of the number of drinks consumed in 1 hour can be expressed by the Regression Line: - BAC = 0.023 × number of drinks - Say in words what the slope of this line tells you - Why do you think the Intercept is zero? b. For a woman weighing 120 pounds, BAC increases by 0.038, on average, for each drink taken in one hour - What is the equation of the Regression Line for predicting BAC from number of drinks for a 120-pound woman? c. Find the predicted BAC for a 160-pound male who consumed 4 drinks over the last hour & for a 120-pound woman having 2 drinks in 1 hour - Explain why these values may not be the individuals' actual blood alcohol contents
a. The Slope indicates that for each drink consumed, BAC increases by 0.023, on average - The intercept is zero because someone who doesn't have any drinks should have a BAC of 0 b. BAC = 0.038 × number of drinks c. BAC = 0.023 × 4 = 0.092, on average, for a man consuming 4 drinks - BAC = 0.038 × 2 = 0.076, on average, for a woman consuming 2 drinks - These are predicted average values; an individual's actual BAC will usually differ from the prediction because of scatter about the line
Exercise 4.33: Table 4.1 presents four sets of data prepared by the statistician Frank Anscombe to illustrate the dangers of calculating without first plotting the data:
Data Set A
X   10     8      13     9      11     14     6      4      12     7      5
Y   8.04   6.95   7.58   8.81   8.33   9.96   7.24   4.26   10.84  4.82   5.68
Data Set B
X   10     8      13     9      11     14     6      4      12     7      5
Y   9.14   8.14   8.74   8.77   9.26   8.10   6.13   3.10   9.13   7.26   4.74
Data Set C
X   10     8      13     9      11     14     6      4      12     7      5
Y   7.46   6.77   12.74  7.11   7.81   8.84   6.08   5.39   8.15   6.42   5.73
Data Set D
X   8      8      8      8      8      8      8      8      8      8      19
Y   6.58   5.76   7.71   8.84   8.47   7.04   5.25   5.56   7.91   6.89   12.50
a. Without making scatterplots, find the correlation & the least-squares regression line for all four data sets - What do you notice? - Use the regression line to predict Ŷ for X = 10 b. Make a scatterplot for each of the data sets & add the regression line to each plot c. In which of the 4 cases would you be willing to use the regression line to describe the dependence of Y on X? - Explain your answer in each case
a. They all give similar results - Ŷ = 3.0 + 0.5X - r = 0.82 - Very close slopes, intercepts, & correlations - Ŷ = 8 for X = 10 c. Only the first relationship is linear & appropriate for regression - The second relationship is strongly curved - The third has a strong outlier - The last has only two values for X
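Part a can be reproduced with a short, standard-library-only Python sketch. The helper `fit` is a name introduced here for illustration; it applies the usual least-squares formulas to each of the four data sets from the exercise.

```python
import math

x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # X shared by data sets A, B, C
datasets = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8] * 10 + [19],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
}

def fit(xs, ys):
    """Return (r, slope, intercept) for the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return sxy / math.sqrt(sxx * syy), slope, my - slope * mx

results = {name: fit(xs, ys) for name, (xs, ys) in datasets.items()}
for name, (r, slope, intercept) in results.items():
    y_hat_10 = intercept + slope * 10   # prediction at X = 10
    print(f"Set {name}: r = {r:.3f}, Y-hat = {intercept:.2f} + {slope:.3f}X, "
          f"Y-hat(10) = {y_hat_10:.2f}")
```

The four summaries come out nearly identical, which is exactly why the scatterplots in part b are needed before trusting any of them.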
Apply Your Knowledge 4.7: Here are the data for the study of Magellanic penguins in Example 3.6 & Exercise 4.5 - The values with * next to them correspond to the 2 years with highly unusual climatic conditions that the researchers excluded from their analysis Percent starved Percent fledged 39.5 40.8 20.9 43.7 25.9 45.5 44.0 45.8 66.0 1.7 86.4 6.8 69.8 8.6 54.0 20.3 66.0 24.2 28.1 54.2 55.3 26.0 16.8 55.7 46.9 27.2 40.8 26.9 27.4 27.1 38.3 27.8 32.4 57.3 39.0 44.4 28.6 58.2 11.8* 60.5* 28.6* 31.4* 47.2 34.7 29.0 37.1 18.6 66.4 39.2 42.6 46.0 32.8 20.9* 11.2* 34.0* 10.5* a. Are the two outliers (* values) influential for the correlation? - Obtain r with & without these two data points b. Are the two outliers influential for regression? - Obtain the equation of the least-squares regression line with & without these two data points - Add both regression lines to the same scatterplot, as in Figure 4.4(a) c. What are the two excluded data points? - Are they outliers in the X direction? - In the Y direction? - Are they outliers of the relationship? - Explain how this is relevant to your conclusions in parts a & b
a. Using all points, r = −0.684 - Without the outliers, r = −0.836 b. Using all points, Ŷ = 61.39 − 0.6806X - Without the outliers, Ŷ = 68.18 − 0.7883X c. They are outliers of the relationship & are influential, especially for the correlation
Exercise 4.31: Exercise 3.36 (page 90) gives data from a study showing that social exclusion causes "real pain" - That is, activity in an area of the brain that responds to physical pain increases as distress from social exclusion increases - A scatterplot shows a moderately strong linear relationship - Figure 4.10 shows regression output from software for these data:
Term        Estimate    Std. Error   t Value   P > |t|
Intercept   -0.126085   0.024646     -5.116    0.000336
Distress    0.060782    0.009979     6.091     7.84e-05
Multiple r-squared: 0.7713 - Adjusted r-squared: 0.7505
a. What is the equation of the least-squares regression line for predicting brain activity from social distress score? - Use the equation to predict brain activity for a social distress score of 2.0 b. What percent of the variation in brain activity among these subjects is explained by the straight-line relationship with social distress score?
a. Ŷ = 0.06078X - 0.1261 - Prediction = −0.00454 b. 77.1%