chapter 4: stats 10
Always use a phrase like ____ ____ when describing an association because the trend you are describing has variability - the association you are describing may not be true for all individuals.
"tends to"
• Correlation says
"there seems to be a linear association between these two variables" but it doesn't tell us what that association is
• The response variable
(dependent variable) is the variable of interest. • It is the outcome variable. • It goes on the y-axis.
The explanatory variable
(independent variable) is the variable that serves as the predictor. • It goes on the x-axis.
Correlation is always between _____ and _____
-1 and +1.
We have no correlation when r is equal to
0 (r = 0).
Trend: the general tendency of the scatterplot as you read from left to right. Typical trends:
1. Increasing (uphill), called a positive association 2. Decreasing (downhill), called a negative association 3. No trend, if there is neither an uphill nor downhill tendency
Examining Scatterplots Note three features:
1. Trend (like center) 2. Strength (like spread) 3. Shape
The correlation between weight and gas mileage of cars is -0.96. Which of the following is the correct value and interpretation of r-squared of the linear model for predicting gas mileage from weight?
92% of the variation in the gas mileage is explained by the weight of the cars.
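As a quick check of the arithmetic behind this answer, squaring the correlation from the question gives the coefficient of determination. This is only a sketch; the variable names are illustrative.

```python
# Correlation between weight and gas mileage, as given in the question
r = -0.96

# r-squared is the correlation squared; the negative sign disappears
r_squared = r ** 2

print(f"r^2 = {r_squared:.4f}, about {r_squared:.0%} of the variation explained")
```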
Understanding the Correlation Coefficient
• Changing the order of the variables does not change the correlation coefficient (r). • Adding a constant or multiplying by a positive constant does not affect r. • The correlation coefficient is unitless. • The trend must be linear.
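The first three properties can be checked numerically. The sketch below uses a small made-up data set (not from the chapter) and a hypothetical helper `pearson_r`. It compares r before and after swapping the variables, adding a constant, and multiplying by a positive constant.

```python
import math
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up toy data for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)                       # order of variables reversed
r_shifted = pearson_r([v + 10 for v in x], y)     # constant added to x
r_scaled = pearson_r([v * 2.54 for v in x], y)    # x multiplied by a positive constant

print(math.isclose(r, r_swapped), math.isclose(r, r_shifted), math.isclose(r, r_scaled))
```

Multiplying by 2.54 mimics a unit change (inches to centimeters), which is why r being unitless and r being unaffected by positive rescaling go together.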
Extrapolation
Do not extrapolate beyond the data, i.e., do not make a prediction for an x value outside the range of the data - the linear model may no longer hold outside that range.
Interpretation of the slope:
For each unit increase in x, we expect y to increase/decrease on average by the value of the slope. • A positive slope implies that as x increases y increases. • A negative slope implies that as x increases y decreases. • When correlation (r) is positive slope (b) will be positive and when correlation (r) is negative slope (b) will be negative
Assessing Model Fit
In order to assess if the linear model is a good fit, we look to see how much of the variation in the response variable is accounted for by the model, i.e. explained by the explanatory variable.
Use Scatterplots to Investigate Associations Between ________ Variables
Numerical
Finding the Regression Line
The regression line is the line for which the sum of the squares of the vertical distances between the observed y-values (y) and the values predicted by the line (ŷ) is the smallest. That is, it minimizes ∑(y − ŷ)^2. • Also called the least squares line.
Intercept
Once we have the slope (b), we can calculate the intercept (a): a = ȳ − b·x̄, where x̄ and ȳ are the means of x and y. • We need to find the means of variables x and y. • Interpretation of the intercept: When x equals 0, we expect y to equal the intercept.
Coefficient of Determination
The coefficient of determination, or the correlation coefficient squared (r-squared), gives the percentage of the variation in y that is explained by x. • When interpreting a regression model, we need to explain in context what r^2 means. • Burger King model: 69% of the variation in the fat content (response or y-variable) is explained by the protein content (explanatory or x-variable) of the burger.
correlation coefficient (r)
The correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between two numerical variables. • Also called the Pearson correlation coefficient. • Correlation only makes sense if the trend is linear. • Correlation has no units.
Linear Model
The linear model is just an equation of a straight line through the data. • The points in the scatterplot don't all line up, but a straight line can summarize the general pattern. • The linear model can help us understand how the explanatory (independent) and response (dependent) variables are associated.
Slope
The slope can be calculated using the correlation coefficient, r, and the standard deviations of the explanatory variable (x) and the response variable (y): slope = r (Sy/Sx)
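The slope formula can be sketched in code as follows, together with the standard least-squares intercept (the line passes through the point of means, so intercept = mean of y − slope × mean of x). The function name `line_from_summary` and the summary statistics in the example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def line_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (intercept, slope) of the regression line from summary statistics."""
    slope = r * (s_y / s_x)              # slope = r (Sy/Sx)
    intercept = y_bar - slope * x_bar    # line passes through (x_bar, y_bar)
    return intercept, slope

# Hypothetical summary statistics for illustration only
intercept, slope = line_from_summary(r=0.5, s_x=2.0, s_y=4.0, x_bar=10.0, y_bar=30.0)
print(f"predicted y = {intercept} + {slope} * x")
```

Because the standard deviations are always positive, the slope takes its sign from r.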
Negative correlation
We have negative correlation when r is less than 0 (r < 0). • The closer the correlation is to -1, the stronger the association between the two variables.
Positive Correlation
We have positive correlation when r is greater than 0 (r > 0). • The closer the correlation is to 1, the stronger the association between the two variables.
Correlation Does Not Mean ________
causation • Do not conclude that a cause-and-effect relationship between two variables exists just because there is a strong correlation.
The sign of a correlation coefficient gives the _____ ___
direction of the association: + means increasing, - means decreasing
regression line
is a tool for making predictions about future observations. • It is a useful method for summarizing a linear relationship.
residual
is the difference between the observed value and its associated predicted value. Residual = observed value - predicted value = y − ŷ
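A minimal sketch of the residual calculation, using made-up observed and predicted values (not from the chapter). A positive residual means the point lies above the line; a negative residual means it lies below.

```python
def residual(observed, predicted):
    """Residual = observed value - predicted value."""
    return observed - predicted

# Hypothetical values for illustration only
above = residual(observed=12.0, predicted=10.0)  # observed point is above the line
below = residual(observed=7.5, predicted=10.0)   # observed point is below the line

print(above, below)
```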
Each point in the scatterplot represents ____ _________
one observation
Regression Line ____ matters
order
Correlation is sensitive to ____
outliers
Finding the Correlation Coefficient
r = ∑(zx × zy) / (n − 1), where zx is the z-score for the x-variable, zy is the z-score for the y-variable, and n is the sample size.
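The formula can be followed step by step in code. This is a sketch using only the Python standard library and a small made-up data set (not from the chapter). It assumes the z-scores use the sample standard deviation (n − 1 in the denominator), which is what makes the division by n − 1 come out right.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Compute r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)           # sample standard deviations
    z_x = [(x - x_bar) / s_x for x in xs]     # z-scores for the x-variable
    z_y = [(y - y_bar) / s_y for y in ys]     # z-scores for the y-variable
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Made-up toy data for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 3))
```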
Coefficient of Determination
r^2 = 0 means that none of the variability in y is explained by x. r^2 = 1 means that all of the variability in y is explained by x. • While the correlation coefficient is between -1 and 1, r^2 is between 0 and 1 (0 ≤ r^2 ≤ 1). • We would like r^2 to be as close to 100% as possible.
Prediction
Simply plug in the value for x in the equation for the regression line and calculate predicted y. Predicted Fat = 6.8 + 0.97 × Protein. For Protein = 30: Predicted Fat = 6.8 + 0.97 × 30 = 35.9 g, so ŷ = 35.9 g.
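The same plug-in calculation can be written as a small function. The intercept and slope come from the Burger King model on this card; the function and constant names are illustrative.

```python
INTERCEPT = 6.8  # intercept of the Burger King model from the card
SLOPE = 0.97     # slope of the Burger King model from the card

def predicted_fat(protein):
    """Predicted fat for a given protein value, using the card's regression line."""
    return INTERCEPT + SLOPE * protein

fat_hat = predicted_fat(30)
print(round(fat_hat, 1))
```

The model should only be used for protein values inside the range of the original data, which the card does not state.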
Scatterplots with small amounts of scatter or little vertical variation indicate a ____ association.
strong
Scatterplots with large amounts of scatter or vertical variation indicate a _____ association.
weak
predicted value
ŷ represents the predicted value or the estimate made from a model.
equation for a straight line.
y = mx + b, where m is the slope and b is the intercept, or y = a + bx, where a is the intercept and b is the slope. In both forms, y is the y-variable and x is the x-variable.
A correlation near ____ corresponds to a weak linear association.
zero
Pitfalls to Avoid
• Don't fit linear models to nonlinear associations. • Correlation is not causation. • Beware of outliers! Try the regression and correlation with and without influential points to see the differences. • Be careful with regressions on aggregate data (data of means rather than individuals). • Don't extrapolate.