Learning Path 12


influential observation.

An outlier such that removing the outlier would have a drastic effect on the analysis (such as on the slope of the least-squares regression line).

Here is the least-squares regression equation for Example 12.1 (the "Clear-Cut" example): y^ = 4.3516 + 21.3217x Which of the following is a correct interpretation of the slope?

As the age of the tree increases by 1 year, the most recent five-year growth is predicted to increase by 21.322 cm.

General interpretation of the slope of the least-squares regression equation

As the explanatory variable increases by one unit, the response variable is predicted to change by the value of the slope. If the slope is positive, y is predicted to increase; if the slope is negative, y is predicted to decrease. The response variable is predicted to be the value of the y-intercept when the explanatory variable is equal to 0; that is, the y-intercept is the value of y when x = 0. We increase x by 1 unit when interpreting the slope.
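
Using the Clear-Cut equation above (y^ = 4.3516 + 21.3217x), we can check this directly by comparing the predictions at age x + 1 and age x:

[4.3516 + 21.3217(x + 1)] - [4.3516 + 21.3217x] = 21.3217

The difference is exactly the slope, which is why the slope is always interpreted as the predicted change in the response for a one-unit increase in the explanatory variable.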

True or False? The best-fitting line is the line with the smallest sum of the squared horizontal distances from the points to the line. We call this line the least-squares regression line.

False. The best-fitting line is the line with the smallest sum of the squared vertical distances from the points to the line. We call this line the least-squares regression line.
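
To make the "smallest sum of squared vertical distances" idea concrete, here is a minimal sketch in Python (the course output shown elsewhere comes from R; the data below are made up purely for illustration):

import numpy as np

# Hypothetical data, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# np.polyfit with degree 1 returns the least-squares slope and intercept
b, a = np.polyfit(x, y, 1)

def sse(intercept, slope):
    # sum of squared vertical distances from the points to the line
    return np.sum((y - (intercept + slope * x)) ** 2)

print(sse(a, b))         # the least-squares line gives the smallest possible sum
print(sse(a, b + 0.5))   # any other line (here, a steeper one) gives a larger sum

No other choice of intercept and slope can produce a smaller sum of squared vertical distances than the least-squares line.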

Important Notes!!

If you use x and y^ in your equation, ALWAYS DEFINE WHAT x AND y^ REPRESENT IN THE CONTEXT OF THE PROBLEM. Recall, x is the explanatory variable and y is the response variable. y^ = 4.3516 + 21.3217x, where x = age of tree (years) and y^ = predicted most recent five-year growth (cm). Include units if a variable has units. Always include the term "predicted" when defining what y^ represents. This is because not all observed values fall exactly on the least-squares regression line; the prediction represents what we would expect to observe for a given value of x. Instead of using "x" and "y^", you can use one-word names that describe the response and explanatory variables: growth^ = 4.3516 + 21.3217(age). Regardless of whether you use "x" and "y^" or one-word names, always make sure you have a hat over the response variable to indicate that it represents a predicted value.

Why does the least squares regression line have its name? It is named as it is because the line is linear so there are no squared terms in the equation of the regression line. If there are any squared terms we should make sure that there are as few as possible. It is named as it is because we want the line that gives us the least amount of calculations. It is named as it is because we want the line that minimizes the squared vertical distance between the data and the line.

It is named as it is because we want the line that minimizes the squared vertical distance between the data and the line.

Additional notes:

Observed values below the regression line will have negative residuals. Observed values above the regression line will have positive residuals. Observed values on the regression line will have a residual of 0. The sum of the residuals is 0; this is always true!
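
A quick way to convince yourself of these facts is to compute the residuals from a fitted line. A minimal Python sketch with made-up data (the course itself works from R output):

import numpy as np

# Hypothetical data, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

b, a = np.polyfit(x, y, 1)    # least-squares slope and intercept
predicted = a + b * x
residuals = y - predicted     # residual = observed - predicted

print(residuals)              # negative for points below the line, positive for points above
print(residuals.sum())        # essentially 0 (up to floating-point rounding)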

certain situations in which using a least-squares regression equation for prediction may not be appropriate.

Only use the least-squares regression equation for prediction if the value of x for which you want to predict is within the range of the x-values used to obtain the least-squares regression equation. Predicting for x-values outside the range of x-values used to obtain the least-squares regression equation is called extrapolation and is not recommended. Only use the least-squares regression equation for prediction if the relationship between the response and explanatory variables is linear! Outliers can have an influence on the least-squares regression line! Be cautious with using the least-squares regression equation when there are outliers!
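
One practical way to guard against extrapolation in code is to refuse to predict outside the observed range of x. A small Python helper along these lines (the function name and its interface are hypothetical, not part of the course materials):

import numpy as np

def safe_predict(x_new, x_obs, intercept, slope):
    # Only predict for x-values inside the range used to fit the line;
    # anything outside that range would be extrapolation.
    if not (np.min(x_obs) <= x_new <= np.max(x_obs)):
        raise ValueError("x_new is outside the observed x-range (extrapolation)")
    return intercept + slope * x_new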

Which of the following formulas is used to calculate the residuals? Residual = response - predicted Residual = explanatory variable - predicted Residual = observed - predicted

Residual = observed - predicted

Strategy for dealing with outliers

Step 1: Determine if there was a data entry error. If so, correct the error and continue the analysis. If not, move on to Step 2. In the long-jump example, all but one average takeoff error are between 0 and 0.25; the outlier had an average takeoff error of 0.5. Perhaps this jumper's average takeoff error was actually 0.05 and the person entering the data put the decimal point in the wrong place. Often, it may be difficult to know if a data entry error was made, but if you see an unusual value, you might bring it to the attention of the researcher so that the researcher can investigate further. If we are able to determine that there was a data entry error, we'd correct it and move on. If a data entry error wasn't made, or we can't determine whether one was made, move to Step 2.

Step 2: Determine if the outlier is from a different population. If so, the outlier can be removed from the analysis. As discussed previously, only remove the outlier if it is from a different population; otherwise, do two analyses, one with the outlier and one without. In the long-jump example, suppose that all jumpers but the outlier were experienced jumpers, and the jumper with the outlying takeoff error had never done the long jump before but had to step in at this meet. That makes this jumper from a different population - he/she was not an experienced long-jumper while all others were. This would be a reason to remove the outlier from the analysis.

In summary, only remove an outlier for a non-statistical reason. Do not remove an outlier because it's influencing the analysis - that is a "statistical" reason. If you can't determine whether the outlier is from a different population, do an analysis with and without the outlier and report both.

Researchers are interested in predicting the height of evergreen trees (in feet) in a forest using the amount of rainfall (in inches) for October. They obtain information from 100 forests on the amount of rainfall in October as well as the height of the tallest evergreen tree in each forest (calculated during the following month of May). They perform simple linear regression on this data and obtain the following output from R Studio: (Intercept) 19.28 Oct rainfall 1.853 What is the interpretation of the y-intercept in the context of this problem?

The height of an evergreen tree is predicted to be 19.28 feet if there are 0 inches of rainfall in the month of October. The equation of the least-squares regression line will be y^ = 19.28 + 1.853x. If x = 0, then we have y^ = 19.28. When x = 0, the amount of rainfall for the month of October is 0. Therefore, the height of an evergreen tree is predicted to be 19.28 feet if there are 0 inches of rainfall in the month of October.

The following output was obtained after performing a simple linear regression on a dataset that contained only two variables: water (the amount of water in fluid ounces) and height (in inches). The variable water is the amount of water applied to a plant weekly and the variable height is the height of the plant after a 6-week period of watering. (Intercept) = 20.810 water = 0.8239 Which of the following is the correct interpretation of the predicted value when x = 2? (Round your answer to two decimal places.) The predicted height of a plant that received 2 fluid ounces of water each week for six weeks is 42.44 inches. The predicted height of a plant that received 2 fluid ounces of water for six weeks is 25.75 inches. The predicted height of a plant that received 2 fluid ounces of water each week for six weeks is 22.46 inches.

The predicted height of a plant that received 2 fluid ounces of water each week for six weeks is 22.46 inches. To make this prediction use y^ = 20.810 + 0.8239x = 20.810 + 0.8239(2) = 20.810 + 1.6478 = 22.4578, which is 22.46 inches after rounding.
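
The same arithmetic in Python, using the intercept and slope from the output above:

a, b = 20.810, 0.8239    # intercept and slope from the regression output
x = 2                    # fluid ounces of water per week
y_hat = a + b * x        # 22.4578
print(round(y_hat, 2))   # 22.46 inches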

least-squares regression line.

The straight line that "fits best" between points on a scatterplot. It is determined by finding the line that produces the smallest sum of squared vertical distances from points to the line.

y-intercept:

The value of the y-coordinate when the x-coordinate is 0. That is, it's the value of the response variable when the explanatory variable is 0. Visually, it's the y-coordinate at the point the least-squares regression line crosses the y-axis.

Which of the following statements most closely matches the definition of extrapolation? Using the equation from the least-squares regression line to predict beyond the range of the x-values in the sample data. Creating a data set that contains the variables that we want to use for making a prediction. Using extra information about the data that is not included in the graph to help those who did not perform the analysis better understand the interpretation of a prediction. Using multiple explanatory variables in a model.

Using the equation from the least-squares regression line to predict beyond the range of the x-values in the sample data.

Any point that has a negative residual would be considered an outlier.

false

Return to the example from the previous two questions where the least-squares regression equation was generated for predicting the heart weight (in grams) of cats using their body weight (in kilograms) from a data set of 144 domestic cats. Below is the R output from the linear model. Using the output, what is the direction of the association between heart weight and body weight?

positive. If the sign of the slope is positive, this means that for a single-unit increase in the explanatory variable, the predicted value of the response variable will also increase. This indicates a positive association. In short - if the slope is positive, then the direction of the association is positive.

extrapolation

predicting for values that are outside the range of the x-values used to obtain the least-squares regression line.

Interpreting the slope of the least-squares regression equation:

remember that x represents the explanatory variable and y represents the response variable. So, increasing x by one unit means the value of the explanatory variable increases by one unit, and how much "y" changes means how much the response variable will change. Putting it all together, a general way to interpret the slope in the context of the problem is as follows: as the explanatory variable increases by one unit, the response variable is predicted to change by the value of the slope.

residual.

residual = observed value - predicted (or "expected") value

We can use the least-squares regression equation to predict a value of the _____________ variable for different values of the ___________ variable.

response, explanatory

slope:

the "steepness" of a line. It is how much y (the response variable) changes for a one unit increase in x (the explanatory variable)

Which line will R choose as the "best-fitting" line?

the line with the smallest sum of the squared vertical distances

The only time we can feel comfortable using the least-squares regression equation for prediction is when

the relationship between the explanatory and response variables is linear, we predict for x-values that are within the range of the x-values in the sample, and outliers are not influential on the analysis

A point on the least-squares regression line has a residual of 0

true

Tim's height is much more than what it is predicted to be for his age. Therefore, Tim has a positive residual.

true

If we extrapolate, we are making an unreliable guess that the same linear relationship will continue to exist. True or False

true. If we establish a linear relationship between two variables on some range of x-values, does this guarantee that a linear relationship will exist outside of this range? The answer is no. Extrapolating is using this linear equation for values of x where a linear relationship hasn't been established.

Return to the example from the previous question where the least-squares regression equation was generated for predicting the heart weight (in grams) of cats using their body weight (in kilograms) from a data set of 144 domestic cats. In the correct least-squares regression equation from the last question, what does y^ represent and what does x represent?

y^ = predicted heart weight of cats in grams; x = body weight of cats in kilograms

Which of the following is the correct formula to use for writing the equation of the least-squares regression line? y^ = ax + b, where a is the y-intercept, b is the slope, and y^ is the predicted value of the explanatory variable. y^ = a + bx, where a is the y-intercept, b is the slope, and y^ is the predicted value of the response variable.

y^ = a + bx, where a is the y-intercept, b is the slope, and y^ is the predicted value of the response variable.

equation of the least-squares regression line (i.e. "least-squares regression equation")

y^ = a + bx, where y^ = the predicted value of the response variable, x = a value of the explanatory variable, a = the y-intercept of the least-squares regression line, and b = the slope of the least-squares regression line.
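
For reference, the slope and intercept of the least-squares line can be computed directly from the data (here x̄ and ȳ denote the sample means of the explanatory and response variables):

b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
a = ȳ - b·x̄

These are the values reported as the intercept and slope coefficients in regression output such as the R output shown earlier.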

