Introduction to Statistics: Chapter 4 Homework (Regression Analysis: Exploring Associations Between Variables)
The figure shows a scatterplot of the height of the left seat of a seesaw and the height of the right seat of the same seesaw. Estimate the numerical value of the correlation, and explain the reason for your estimate.
The correlation is r=−1 because there is a perfect negative linear association. NOTE: View the scatterplot graph!
If the correlation between height and weight of a large group of people is 0.74, find the coefficient of determination (as a percent) and explain what it means. Assume that height is the predictor and weight is the response, and assume that the association between height and weight is linear.
The coefficient of determination is 54.76%. Therefore, 54.76% of the variation in weight can be explained by the regression line.
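The arithmetic behind this answer can be checked with a short sketch. The variable names are illustrative and not part of the original problem.

```python
# Coefficient of determination from a correlation coefficient.
r = 0.74                      # correlation between height and weight (from the problem)
r_squared = r ** 2            # coefficient of determination as a proportion
percent = round(r_squared * 100, 2)

print(f"Coefficient of determination: {percent}%")
```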
The following scatterplot shows the age and weight for some women. Some of them exercised regularly, and some did not. Explain what it means that the blue line (for those who did not exercise) is a bit steeper than the red line (for those who did exercise).
Among those who exercise, the effect of age on weight is less. An additional year of age does not lead to as great an increase in the average weight for exercisers as it does for non-exercisers. NOTE: View the scatterplot!
The accompanying scatterplot shows data on age, denoted A, of a sample of students and the number of college credits, C, attained. Comment on the strength, direction, and shape of the trend.
The trend is linear, positive, and strong until around age 24, when the trend becomes negative and weak. NOTE: View the scatterplot of age and credits.
The accompanying graph shows the average car insurance premium for a sample of ages. Complete parts (a) and (b) below.
a. Explain what the graph tells about insurance rates for drivers at different ages. Explain why insurance rates might follow this trend. Choose the correct choice below. Ans: As drivers' age increases, premiums tend to decrease, before increasing again at 65 years of age. Younger drivers and older drivers are more likely to get into accidents, so they are charged more for premiums. b. Would it be appropriate to do a linear regression analysis on these data? Why or why not? Choose the correct choice below. Ans: It would not be appropriate, because the data do not follow a linear trend. NOTE: View the graph of average car insurance premiums.
Does a correlation of -0.4 or +0.5 give a larger coefficient of determination? We say that the linear relationship that has the larger coefficient of determination is more strongly correlated. Which of the values shows a stronger correlation?
A correlation of +0.5 gives a larger coefficient of determination and shows a stronger correlation.
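A quick sketch of the comparison: squaring removes the sign, so only the distance of r from 0 matters. The helper name is an illustrative assumption.

```python
def coefficient_of_determination(r):
    """Square of the correlation coefficient (as a proportion, not a percent)."""
    return r ** 2

candidates = [-0.4, 0.5]

# The stronger linear association is the one with the larger r squared.
stronger = max(candidates, key=coefficient_of_determination)

for r in candidates:
    print(r, coefficient_of_determination(r))
print("Stronger correlation:", stronger)
```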
The following figure shows a scatterplot of the educational level of twins. Describe the scatterplot. Explain the trend and mention any unusual points.
The point that shows one twin with 1 year of education and the other twin with 12 years is an outlier. The point that shows one twin with 15 years of education and the other with 8 years is unusual. The trend is positive. In general, if one twin has a higher-than-average level of education, so does the other twin.
A graph shows the relationship between SAT score and college GPA. SAT score is the predictor variable and college GPA is the response variable. If the variables are reversed so that college GPA was the predictor and SAT score was the response variable, what effect would this have on the numerical value of the correlation coefficient?
The correlation coefficient would not change, because the correlation coefficient does not depend on the order of the variables.
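One way to see this is to compute r both ways on a small data set. The SAT and GPA values below are made up for illustration (the original graph's data are not given), and pearson_r is a hand-written helper, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
sat = [1050, 1200, 1310, 1420, 1500]
gpa = [2.7, 3.1, 3.0, 3.6, 3.8]

r_sat_predicts_gpa = pearson_r(sat, gpa)
r_gpa_predicts_sat = pearson_r(gpa, sat)

print(r_sat_predicts_gpa, r_gpa_predicts_sat)
```

The formula treats x and y symmetrically, so swapping the roles of predictor and response leaves r unchanged. The regression line, by contrast, does change when the variables are swapped.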
The accompanying graph shows the winning percentages in singles matches and doubles matches for a sample of male professional tennis players. Complete parts (a) through (c) below.
a. Based on this scatterplot, does there appear to be a strong linear association between these two variables? Choose the correct choice below. Ans: No, the scatterplot does not show any association between these variables. b. Would the numerical value of the correlation between these two variables be close to negative one, positive one, or zero? Give a reason for your answer. Ans: The numerical value of the correlation would be close to 0, because the scatterplot shows no trend. c. Based on this graph, can one accurately predict a professional tennis player's doubles winning percentage based on his singles winning percentage? Ans: Since the scatterplot shows no association between these variables, singles winning percentage cannot be used to predict doubles winning percentage. NOTE: View the scatterplot!
Suppose that students who scored much lower than the mean on their first statistics test were given special tutoring in the subject. Suppose that they tended to show some improvement on the next test. Explain what might cause the rise in grades other than the tutoring program itself.
Regression toward the mean might contribute to raising the scores of the students who scored low on the first test.
The correlation between height and arm span in a sample of adult women was found to be r=0.949. The correlation between arm span and height in a sample of adult men was found to be r=0.854. Which association—the association between height and arm span for women, or the association between height and arm span for men—is stronger? Explain.
The association between height and arm span for women is stronger because the value of r is farther from 0.
A doctor is studying cholesterol readings in his patients. After reviewing the cholesterol readings, he calls the patients with the highest cholesterol readings (the top 5% of readings in his office) and asks them to come back to discuss cholesterol-lowering methods. When he tests these patients a second time, the average cholesterol readings tended to have gone down somewhat. Explain what statistical phenomenon might have been partly responsible for this lowering of the readings.
The cholesterol going down might be partly caused by regression toward the mean, since the second measurement is closer to the mean.
The accompanying table shows the calories in a five-ounce serving and the percent alcohol content for a sample of wines. Complete parts (a) through (e) below.
a. Make a scatterplot using percent alcohol as the independent variable and calories as the dependent variable. Include the regression line on the scatterplot. Based on the scatterplot, does there appear to be a strong linear relationship between these variables? Choose the correct graph below. Ans: Graph (Positive Trend beginning with x-value 9) Based on the scatterplot, does there appear to be a strong linear relationship between these variables? Ans: While there appears to be a strong positive linear relationship, one of the points is an influential point. b. Find the numerical value for the correlation between percent alcohol and calories. Explain what the sign of the correlation means in the context of this problem. Ans: r=0.935 Explain what the sign of the correlation means in the context of this problem. Ans: The large and positive correlation suggests that wines with more alcohol tend to have higher calorie content than wines with less alcohol. c. Report the equation of the regression line and interpret the slope of the regression line in the context of this problem. Ans: Predicted Calories=-76.8+(19.6) Percent Alcohol Interpret the slope of the regression line in the context of this problem. Select the correct choice below and fill in the answer box within your choice. Ans: The slope is 19.6. For each increase of 1% alcohol content, the calories go up by an average of 19.6 calorie(s). d. Find and interpret the value of the coefficient of determination. Ans: The coefficient of determination is 87.5% Interpret the coefficient of determination. Select the correct choice below and fill in the answer box to complete your choice. Ans: 87.5% of the variation in calories is explained by the variation in percent alcohol. e. Add a new point to the data, a wine that is 20% alcohol that contains 0 calories. Find r and the regression equation after including this new point. 
Ans: r=-0.373 Determine the new regression equation. Ans: Predicted Calories=198.4+(-6.4) Percent Alcohol What was the effect of this one data point on the value of r and the slope of the regression equation? Ans: The new data point is an influential point, which has made the correlation weaker and negative and has changed the slope from positive (19.6) to negative (-6.4).
Assume that in a political science class, the teacher gives a midterm exam and a final exam. Assume that the association between midterm and final scores is linear. The summary statistics shown below have been simplified for clarity. Also, r=0.7 and n=28.
Step 1: Find the equation of the line to predict the final exam score from the midterm score. a. First find the slope, b. Ans: b=0.7 b. Then find the y-intercept, a, from the equation a = ȳ − b·x̄. Ans: a=21.6 c. Write out the following equation: Predicted y=a+bx. However, use "Predicted Final" instead of "Predicted y" and "Midterm" in place of x. Ans: Predicted Final = 21.6 + 0.7 Midterm Step 2: Use the equation to predict the final exam score for a student who gets 88 on the midterm. Ans: The predicted final exam grade is 83. Step 3: Your predicted final exam grade should be less than 88. Why? Ans: Regression toward the mean, because the student's predicted final score is closer to the mean than was their midterm score.
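The steps can be sketched in code using the standard formulas b = r·(s_y / s_x) and a = ȳ − b·x̄. The summary table is not reproduced in this text, so the means and standard deviations below are hypothetical placeholders, chosen only so that the result matches the reported line. Only r = 0.7 and the final equation come from the problem.

```python
def regression_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Return (intercept a, slope b) of the least-squares line from summary statistics."""
    b = r * sd_y / sd_x          # slope
    a = mean_y - b * mean_x      # intercept: the line passes through (mean_x, mean_y)
    return a, b

# ASSUMED summary statistics (hypothetical; the original table is not shown).
a, b = regression_from_summary(r=0.7, mean_x=72, sd_x=10, mean_y=72, sd_y=10)

midterm = 88
predicted_final = a + b * midterm

print(f"Predicted Final = {a:.1f} + {b:.1f} Midterm")
print(f"Predicted final for a midterm of {midterm}: {predicted_final:.1f}")
```

Because the slope is less than 1 here, a midterm score above the mean yields a predicted final that is closer to the mean, which is the regression-toward-the-mean effect described in Step 3.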
In the accompanying scatterplots, the first graph shows the years a person was employed before working at the company and the salary at the company. The second graph shows the years employed at the company and the salary. Which graph shows a stronger relationship and could do a better job predicting salary at the company?
The years employed at the company shows a stronger relationship and is a better predictor of salary, because the vertical spread of salary is narrower. NOTE: View the scatterplots.
The following figure shows the number of units that students were enrolled in and the number of hours (per week) that they reported studying. Do you think there is a positive trend, a negative trend, or no noticeable trend? Explain what this means about the students.
There appears to be a positive trend. It appears that the number of hours of homework tends to increase slightly with enrollment in more units. NOTE: View the scatterplot!
4.4 Answer the questions using complete sentences. a. What is an influential point? How should influential points be treated when doing a regression analysis? b. What is the coefficient of determination and what does it measure? c. What is extrapolation? Should extrapolation ever be used?
a. Choose the correct answer below. Ans: An influential point is an outlier whose presence or absence has a large effect on the regression analysis. If the data have one or more influential points, perform the regression analysis with and without these points and comment on the differences. b. Choose the correct answer below. Ans: The coefficient of determination is the square of the correlation coefficient, r. The coefficient of determination measures how much of the variation in the response variable is explained by the explanatory variable. c. Choose the correct answer below. Ans: Extrapolation is using the regression line to make predictions beyond the range of x-values in the data. Extrapolation should not be used.
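The advice in part (a), running the analysis with and without the suspect point, can be sketched as follows. The data are invented for illustration (five points on a perfect line plus one far-off point) and do not come from any problem in this chapter. pearson_r and slope are hand-written helpers.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope for predicting y from x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: a perfect positive trend...
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# ...plus one point far from the rest in both x and y.
x_with = x + [20]
y_with = y + [0]

r_without, r_with = pearson_r(x, y), pearson_r(x_with, y_with)
slope_without, slope_with = slope(x, y), slope(x_with, y_with)

print("without point:", r_without, slope_without)
print("with point:   ", r_with, slope_with)
```

A single point is enough to flip the sign of both r and the slope in this toy data set, which is why both fits should be reported and compared.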
Answer the questions using complete sentences. a. An economist noted the correlation between consumer confidence and monthly personal savings was negative. As consumer confidence increases, would monthly personal savings be expected to increase, decrease, or remain constant? b. A study found a correlation between higher education and lower death rates. Does this mean that one can live longer by going to college? Why or why not?
a. Choose the correct answer below. Ans: Monthly personal savings would be expected to decrease, because a negative correlation means that as one variable increases, the other variable tends to decrease. b. Choose the correct answer below. Ans: No, because an association between two variables is not sufficient evidence by itself to conclude that a cause-and-effect relationship exists between the variables.
Assume that in a sociology class, the teacher gives a midterm exam and a final exam. Assume that the association between midterm and final scores is linear. The summary statistics are shown below. Also, r = 0.75 and n = 24.
a. Find and report the equation of the regression line to predict the final exam score from the midterm score. Ans: Predicted Final Grade=18.25+0.75 Midterm Grade b. For a student who gets 52 on the midterm, predict the final exam score. Ans: The predicted final exam grade is 57. c. Your answer to part (b) should be higher than 52. Why? Ans: The student's final score should be higher than his or her midterm score because of regression toward the mean: predictor values far from the mean tend to produce predicted response values that are closer to the mean. d. Consider a student who gets a 100 on the midterm. Without doing any calculations, state whether the predicted score on the final exam would be higher, lower, or the same as 100. Ans: The predicted score on the final exam would be lower than 100 because of regression toward the mean.
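Parts (b) and (d) can be checked by plugging both midterm scores into the reported line. The function name is illustrative.

```python
def predicted_final(midterm):
    """Reported regression line: Predicted Final = 18.25 + 0.75 * Midterm."""
    return 18.25 + 0.75 * midterm

low = predicted_final(52)     # a midterm well below the mean
high = predicted_final(100)   # a midterm well above the mean

print(f"Midterm 52  -> predicted final {low}")
print(f"Midterm 100 -> predicted final {high}")
```

The low scorer is predicted to do better than 52 and the high scorer worse than 100. Both predictions move toward the middle, as regression toward the mean suggests.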
The figure shows a scatterplot of birthrate (live births per 1000 women) and age of the mother in the United States. Would it make sense to find the correlation for this data set? Explain. According to this graph, at approximately what age does the highest fertility rate occur? (Source: Helmut T. Wendel and Christopher S. Wendel (Editors): Vital Statistics of the United States: Births, Life Expectancy, Deaths, and Selected Health Data. Second Edition, 2006, Bernan Press.)
Ans: No, it would not make sense to find the correlation because the trend is not linear.
The scatterplot shows the heights of mothers and daughters. Complete parts (a) through (e) below. Daughter=40.13+0.379 Mother
a. As the data are graphed, which is the independent and which the dependent variable? Ans: The independent variable is mother's height and the dependent variable is daughter's height. b. From the graph, approximate the predicted height of the daughter of a mother who is 65 inches (5 feet 5 inches) tall. Ans: The predicted height of the daughter of a mother who is 65 inches tall is about 65 inches. c. From the equation, determine the predicted height of the daughter of a mother who is 65 inches tall. Ans: The predicted height of the daughter of a mother who is 65 inches tall is about 64.77 inches. d. Interpret the slope. Choose the correct answer below. Ans: For each additional inch in the mother's height, the average daughter's height increases by about 0.379 inch. e. What other factors besides mother's height might influence the daughter's height? Select all that apply. Ans: The father's height and The daughter's nutrition during formative years NOTE: Look at the scatterplot graph!
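Part (c) is a direct substitution into the given equation, Daughter = 40.13 + 0.379 Mother. A minimal sketch follows; the function name is illustrative.

```python
def predicted_daughter_height(mother_height_in):
    """Given regression line: Daughter = 40.13 + 0.379 * Mother (inches)."""
    return 40.13 + 0.379 * mother_height_in

prediction = predicted_daughter_height(65)
print(f"Predicted daughter height: {prediction:.3f} inches")
```

The result, 40.13 + 24.635 = 64.765, is what the answer above reports as about 64.77 inches, and it agrees with the rough graph estimate of 65 inches in part (b).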
4.2 Complete parts (a) and (b) below.
a. The scatterplot to the right shows the college tuition and percentage acceptance at some colleges. Would it make sense to find the correlation using this data set? Why or why not? Ans: No. Linear regression is not appropriate because the trend is not linear. b. The scatterplot to the right shows the composite grade on the ACT (American College Testing) exam and the English grade on the same exam. Would it make sense to find the correlation using this data set? Why or why not? Ans: Yes. There is no reason why linear regression would not be appropriate because the trend is linear. NOTE: Look at the scatterplots!
Five people were asked how many female first cousins they had and how many male first cousins. The data are shown in the table. Assume the trend is linear, find the correlation, and comment on what it means.
r=.847 People with many female cousins tend to have many male cousins.
The accompanying scatterplot shows a solid blue line for predicting weight from age of men; the dotted red line is for predicting weight from age of women. The data were collected from a large statistics class. a. Which line is higher and what does that mean? b. Which line has a steeper slope and what does that mean?
a. The men's line is higher, which means that men tend to weigh more than women at all ages shown. b. The men's line has a steeper slope, which means that older men tend to outweigh younger men more than older women outweigh younger women.
4.1 The accompanying scatterplots show SAT scores and GPA in college for a sample of students. The top graph uses the SAT critical reading score to predict GPA in college and the bottom graph shows the SAT math score to predict GPA. Which is the better predictor of GPA for these students, critical reading SAT or math SAT? Explain your answer.
The critical reading SAT is a better predictor of GPA, because the vertical spread of GPA is narrower. NOTE: View the scatterplots.
A newspaper published an article with the headline "Positive Correlation Found between Gym Usage and GPA." Explain what a positive correlation means in the context of this headline.
In general, higher gym usage is associated with higher GPA.
The accompanying scatterplot shows data on age and GPA for a sample of college students. Comment on the trend of the scatterplot. Is the trend positive, negative, or near zero?
The graph shows a trend near zero since the points show no pattern as age increases. The association between age and GPA is near zero. NOTE: View the scatterplot of age and GPA.
A newspaper published an article with the headline "Study Finds Correlation between Education, Life Expectancy." Would you expect this correlation to be positive or negative? Explain your reasoning in the context of the headline.
We would expect this correlation to be positive, since people who are able to afford college are more likely to also have access to health care.
4.3 The scatterplot shows the median starting salaries and the median mid-career salaries for graduates at a selection of colleges. Complete parts (a) through (e) below. Mid-Career=−5200+1.915 Start Med
a. As the data are graphed, which is the independent and which the dependent variable? Ans: The independent variable is median starting salary and the dependent variable is median mid-career salary. b. Why do you suppose median salary at a school is used instead of the mean? Ans: Salary distributions are usually skewed, making the median a more meaningful measure of center. c. Using the graph, estimate the median mid-career salary for a median starting salary of $70,000. Ans: The median mid-career salary for a median starting salary of $70,000 is about $125,000.00. d. Use the equation to predict the median mid-career salary for a median starting salary of $70,000. Ans: The median mid-career salary for a median starting salary of $70,000 is about $128,850. e. What other factors besides starting salary might influence mid-career salary? Select all that apply. Ans: The amount of additional education required and The number of hours worked per week NOTE: Look at the scatterplot graph!
Indicate which variable you think should be the predictor (x) and which variable should be the response (y). Explain your choices. a. You have collected data on used cars for sale. The variables are price and odometer readings of the cars. b. Research is conducted on monthly household expenses. Variables are monthly water bill and household size. c. A personal trainer gathers data on the weights and time spent in the gym for each of her clients.
a. Choose the correct choice below. Ans: The predictor is odometer reading. The response is price. Cars with higher mileage have been on the road longer, and have likely retained less of their original value. b. Choose the correct choice below. Ans: The predictor is household size. The response is monthly water bill. Larger households tend to use more water. c. Choose the correct choice below. Ans: The predictor is time spent in the gym. The response is weight. Clients who have spent more time working out are more likely to have lost weight.
Indicate which variable you think should be the predictor (x) and which variable should be the response (y). Explain your choices. a. A researcher measures subjects' stress levels and blood pressures. b. Workers who commute by car record the length of their commutes (in miles) and the amount spent monthly on gasoline purchases. c. Amusement parks record the heights and maximum speeds of roller coasters.
a. Choose the correct choice below. Ans: The predictor is stress level. The response is blood pressure. The subjects' blood pressures are likely to be explained by their stress levels; higher stress is associated with higher blood pressure. b. Choose the correct choice below. Ans: The predictor is commute length. The response is monthly gasoline purchases. Workers with longer commutes will need more gas to get to and from work; longer commute length is associated with higher gas purchases. c. Choose the correct choice below. Ans: The predictor is height. The response is maximum speed. A roller coaster has a higher speed if it starts from a greater height; taller height is associated with greater maximum speed.
The accompanying table gives the distance from Boston to each city and the cost of a train ticket from Boston to that city for a certain date. Complete parts (a) through (e) below.
a. Use technology to produce a scatterplot. Based on the scatterplot, is there a strong linear relationship between these two variables? Explain. Ans: Graph (Positive trend; not straight line) Based on the scatterplot, is there a strong linear relationship between these two variables? Explain. Ans: The association is linear and strong. The scatterplot shows that as distance increases, ticket price increases at a roughly constant rate. b. Compute r and write the equation of the regression line. Ans: r=0.903 Write the equation of the regression line. Use the words "Ticket Price" and "Distance" in the equation. Ans: Predicted Ticket Price=20.7+0.2308 Distance c. Provide an interpretation of the slope of the regression line. Select the correct choice below and fill in the answer box to complete your choice. Ans: An additional mile of distance between the two cities is associated with an increase of $.2308 in the average train ticket price. d. Provide an interpretation of the y-intercept of the regression line or explain why it would not be appropriate to do so. Select the correct choice below and, if necessary, fill in the answer box to complete your choice. Ans: It is not appropriate to interpret the y-intercept, because it does not make sense for two cities to be 0 miles apart. e. Use the regression equation to predict the cost of a train ticket from Boston to Louisville, a distance of 825 miles. Ans: The predicted ticket price is $211.11.
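Parts (c) and (e) can be illustrated together: the slope is the change in the predicted price per extra mile, and the Louisville prediction is a substitution of 825 miles. The function name is illustrative.

```python
def predicted_ticket_price(distance_miles):
    """Reported line: Predicted Ticket Price = 20.7 + 0.2308 * Distance."""
    return 20.7 + 0.2308 * distance_miles

louisville = predicted_ticket_price(825)

# Slope interpretation: the predicted price rises by the slope for each extra mile.
one_more_mile = predicted_ticket_price(826) - predicted_ticket_price(825)

print(f"Predicted price for 825 miles: ${louisville:.2f}")
print(f"Increase per additional mile: ${one_more_mile:.4f}")
```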
The accompanying table shows a list of the weights and costs of some turkeys at different supermarkets. Complete parts (a) through (f) below.
a. Make a scatterplot with weight on the x-axis and cost on the y-axis. Include the regression line on the scatterplot. Ans: Graph (Positive trend starting at y-value 8) b. Find the numerical value for the correlation between weight and cost. Explain what the sign of the correlation shows. Find the numerical value for the correlation. Ans: r=0.915 The large and positive correlation suggests that heavier turkeys tend to have higher costs than lighter turkeys. c. Report the equation of the best-fit straight line, using weight as the predictor (x) and cost as the response (y). Ans: Predicted Cost=-4.66+(1.58) Weight d. Report the slope and intercept of the regression line and explain what they show. If the intercept is not appropriate to report, explain why. Ans: The slope is $1.58. For each increase of 1 pound in the weight of the turkey, the cost goes up by an average of $1.58. Report the intercept of the regression line and explain what it shows. Select the correct choice below and fill in the answer box(es) to complete your choice. Ans: The intercept is -4.66. The interpretation of the intercept is inappropriate, because it is not possible to have a turkey that weighs 0 pounds. e. Add a new point to the data, a 30-pound turkey that is free. Give the new value for r and the new regression equation. Explain what the negative correlation implies. What happened? Ans: r=−0.385 Determine the new regression equation. Ans: Predicted Cost=27.31+(-0.57) Weight What does the negative correlation imply? Ans: A negative correlation suggests that larger turkeys tend to have a lower cost. What happened when the new data point was added? Ans: The 30-pound free turkey was an influential point, which significantly changed the results. f. Find and interpret the coefficient of determination using the original data. Ans: The coefficient of determination is 83.7%. About 83.7% of the variation in cost is explained by the regression on weight. NOTE: Use StatCrunch!
Chapter 4 Review The scatterplot shows the shoe size and height for some men (M) and women (F). Complete parts (a) through (c) below.
a. Why did we not extend the red line (for the women) all the way to 74 inches, instead stopping at 69 inches? Choose the correct answer below. Ans: There are no women taller than 69 inches, so the line should stop at 69 inches to avoid extrapolating. b. How do we interpret the fact that the blue line is above the red line? Choose the correct answer below. Ans: Men who are the same height as women wear shoes that are on average, larger sizes. c. How do we interpret the fact that the two lines are (nearly) parallel? Choose the correct answer below. Ans: The mean increase in shoe size based on height is the same for men and women. NOTE: View the scatterplot!
The accompanying graph shows the monthly premiums for a 10-year $250,000 male life insurance policy by age of purchase. For example, a 20-year-old male could purchase such a policy for about $10 per month, while a 50-year-old male would pay about $24 per month for the same policy. Complete parts (a) and (b) below.
a. Explain what the graph tells about life insurance rates for males at different ages. Explain why life insurance rates might follow this trend. Choose the correct answer below. Ans: As the insured's age increases, premiums increase at an increasing rate. Older people are more likely to die, and senior citizens are very likely to die within 10 years, so older people are charged more for insurance. b. Would it be appropriate to do a linear regression analysis on these data? Why or why not? Choose the correct answer below. Ans: It would not be appropriate, because the data do not follow a linear trend. NOTE: View the scatterplot of age and premiums.
A professor went to a website for rating professors and looked up the quality rating and also the "easiness" of the six full-time professors in one department. The ratings are 1 (lowest quality) to 5 (highest quality) and 1 (hardest) to 5 (easiest). The numbers given are averages for each professor. Assume the trend is linear, find the correlation, and comment on what it means.
r=.883 Comment on the meaning of the correlation. Choose the correct interpretation below. Ans: The professors that have high easiness scores tend to also have high quality scores.
Use a computer or statistical calculator to calculate the correlation coefficient in parts a through c below.
a. The table shows the approximate distance between selected cities and the approximate cost of flights between those cities. Calculate the correlation coefficient between cost and miles. Ans: r=.986 b. This table shows the same information, except that the distance was converted to kilometers by multiplying the numbers of miles by 1.609 and rounding to the nearest kilometer. What happens to the correlation coefficient when numbers are multiplied by a positive constant? Ans: The correlation is .986. The correlation coefficient remains the same when the numbers are multiplied by a positive constant. c. Suppose a tax is added to each flight. Fifty-five dollars is added to every flight, no matter how long it is. The table shows the new data. What happens to the correlation coefficient when a constant is added to each number? Ans: The correlation is .986. The correlation coefficient remains the same when a constant is added to each number. NOTE: Use StatCrunch! (Stats>Regression>Simple Linear)
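Parts (b) and (c) can be demonstrated without StatCrunch. The flight table is not reproduced in this text, so the miles and costs below are hypothetical stand-ins. pearson_r is a hand-written helper. The conversion here is exact, whereas the problem also rounds to the nearest kilometer, which can change r very slightly.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
miles = [300, 650, 1200, 1800, 2500]
cost = [120, 180, 260, 330, 450]

kilometers = [m * 1.609 for m in miles]   # multiply by a positive constant
cost_with_tax = [c + 55 for c in cost]    # add a constant to every value

r_original = pearson_r(miles, cost)
r_scaled = pearson_r(kilometers, cost)
r_shifted = pearson_r(miles, cost_with_tax)

print(r_original, r_scaled, r_shifted)
```

Correlation is computed from standardized distances from the mean, so changing units or adding a flat fee leaves it unchanged.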
The accompanying table gives the distance from a particular city to seven other cities (in thousands of miles) and gives the time for one randomly chosen, commercial airplane to make that flight. Do a complete regression analysis that includes a scatterplot with the line, interprets the slope and intercept, and predicts how much time a nonstop flight from this city would take to another city that is located 3000 miles away.
Draw a scatterplot for the round-trip flight data. Be sure that distance is the x-variable and time is the y-variable, because time is being predicted from distance. Graph the best-fit line using technology. Choose the correct scatterplot below. Ans: The graph (positive trend starting with 1) Does it seem that the trend is linear, or is there a noticeable curve? Ans: The linear model is appropriate because there is a linear trend in the data. Find the equation for predicting time (in hours) from miles (in thousands). Ans: Predicted Time=0.72+1.91 Thousand Miles Interpret the slope in the context of the problem. Select the correct choice below and fill in the answer box to complete your choice. Ans: For every additional thousand miles, on average, the time goes up by 1.91 hours. Interpret the intercept in the context of the problem. Although there are no flights with a distance of zero, try to explain what might cause the added time that the intercept represents. Ans: A trip of zero miles would take about .72 hours. However, a trip would never be exactly zero miles, so this time might account for delays in taking off and landing. Using the regression line, how long should it take to fly nonstop 3000 miles? Ans: It would take, on average, about 6.45 hours to fly 3000 miles. NOTE: View the distances and flight times.
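The final prediction depends on a unit conversion that is easy to miss: the predictor is in thousands of miles, so 3000 miles enters the equation as 3. A minimal sketch follows; the function name is illustrative.

```python
def predicted_time_hours(distance_miles):
    """Reported line: Predicted Time = 0.72 + 1.91 * (distance in thousands of miles)."""
    thousand_miles = distance_miles / 1000   # convert miles to the model's units
    return 0.72 + 1.91 * thousand_miles

flight_3000 = predicted_time_hours(3000)
intercept_time = predicted_time_hours(0)     # the added time the intercept represents

print(f"Predicted time for 3000 miles: {flight_3000:.2f} hours")
print(f"Time at zero distance (intercept): {intercept_time:.2f} hours")
```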