L8: Correlation and Simple Linear Regression
True or false: A correlation coefficient of -0.75 indicates a weak relationship while a correlation coefficient of +0.75 suggests a strong relationship.
False: Correlation coefficients range from -1 to +1. The value indicates the strength, and the sign tells the direction of the relationship. Correlation coefficients of -0.75 and +0.75 describe equally strong linear relationships; -0.75 is a strong negative association (Y decreases as X increases) and +0.75 is a strong positive association (Y increases as X increases).
True or false: The R2 for a simple linear regression indicates how much of the variability in the independent variable (X) is explained by differences in the dependent variable (Y).
False: The R2 for a simple linear regression indicates how much of the variability in the dependent variable (Y) is explained by differences in the independent variable (X).
True or false: The null hypothesis for a correlation analysis or a simple linear regression analysis is that the slope=1.
False: The null hypothesis for a correlation analysis or a simple linear regression analysis is that the slope=0.
Scatter diagram and regression line
For a simple linear regression, writing Y = a + bX avoids the subscripts. My guess: when X = 15, Y = 5; when X = 30, Y = 17. The change in Y is 17 - 5 = 12 and the change in X is 30 - 15 = 15, so the slope (change in Y over change in X) is 12/15 = 0.80. A 1-unit increase in X therefore implies a 0.80-unit increase in Y. The intercept looks to be around -7 to -10. Note: there are no data points below X = 10, so the intercept (the value of Y when X = 0) is not especially relevant here, but it is still needed to specify the actual line.
Concept of best fit
For any line with intercept b0 and slope b1, the error for observation i is ei = Yi - b0 - b1Xi. Linear regression, also called ordinary least squares (OLS) regression, chooses b0 and b1 to minimize the sum of squared errors over all the observations.
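The least-squares idea on this card can be sketched in a few lines of Python. This is only an illustration: the function names and the five data points are made up for the example, and the closed-form solution shown is the standard one for a single X variable.

```python
def ols_fit(xs, ys):
    """Return (b0, b1) that minimize the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products of deviations, and sum of squared X deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept, computed from the slope
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of e_i**2, where e_i = Y_i - b0 - b1 * X_i."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
b0, b1 = ols_fit(xs, ys)
best_sse = sum_squared_errors(xs, ys, b0, b1)
```

Any other line, for example one with a slightly different slope or intercept, gives a larger sum of squared errors on the same data.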
Interpreting slope
The slope from the regression line is -0.164, suggesting that for each minute the infant is on the bypass pump, there is an associated decrease of 0.164 units in the psychomotor developmental score one year later. The regression line begins at 101.25 at 0 minutes and declines to a score of about 90 after 70 minutes on the bypass pump.
True or false: A scatter plot can be used to look for evidence of a reasonably linear relationship between measurement variables.
True: A scatter plot is typically generated as a first step when evaluating associations between two continuous variables. If the relationship is reasonably linear, then correlation analysis or linear regression is appropriate. If the relationship is non-linear, other statistical methods should be employed.
True or false: A correlation coefficient that is weak can be statistically significant.
True: A weak relationship can be statistically significant in a large sample size.
True or false: Correlation analysis can be used to evaluate relationships between two continuous variables.
True: Correlation analysis can be used to examine relationships between two continuous (measurement) variables. Simple linear regression is another type of analysis that can evaluate relationships between two continuous variables.
Evaluating fitted regression model
Two sets of topics: 1) The estimated coefficients come from the data, so we can estimate 95% CIs and test hypotheses for the slope (b1), just as we did for the correlation coefficient (also a t-test). 2) How well does the estimated linear regression model explain the variation in the variable Y? A statistic called R2 (the coefficient of determination) shows how much of the variation in Y is "explained" by X, as a proportion or percentage. It might be useful if prediction is of interest.
Use data to estimate the coefficients
Y = b0 + b1X - b0 and b1 are the estimated regression coefficients, our best guesses of the population parameters; they are sample estimates in the same way that little r is the sample estimate of the population correlation
Simple linear regression equation line
Y= dependent, outcome variable X= independent, predictor variable y= B0+B1x - B0, B1 are the regression coefficients/ parameters - B0 is the Y-intercept, B1 is the slope
two continuous variables
Y = dependent, outcome variable (also referred to as the left-hand side (LHS) variable); X = independent, predictor variable (also referred to as the right-hand side (RHS) variable)
Sample correlation
r, with -1 ≤ r ≤ +1 - summarizes the closeness of the points in the scatter plot to a straight line - magnitude (closer to -1 or +1) indicates strength - sign indicates the direction of the relationship (positive vs negative/inverse)
Ordinary least squares (OLS)
- OLS regression is used to estimate a best line for the data (simple linear regression: one continuous X and one continuous Y variable) - b0 = intercept, b1 = ∆Y/∆X - b0 and b1 are estimated regression coefficients, our best estimates of the population parameters - for any X, use the estimated coefficients b0 and b1 to predict Y - Y hat is a predicted value; X is an actual value
Error in any line
- difference between observation (i) and the line - Remember, for any observation in the data and any line (intercept b0 and slope b1): Yi = b0 + b1Xi + ei
Simple linear regression
- only one X and one Y Y= b0 + b1X - the slope describes the association between X and Y - b1 is the expected (mean) change in Y for a one unit change in X - slope of 0 indicates no linear association between X and Y - example interpretation: 1 year increase in age (X) associated with 2 lb. weight gain (Y) on average; slope is 2.
Correlation coefficient r if reasonably linear
- there is a strong non-linear relationship - it is important to examine the plot, since r will be misleading if there is a non-linear relationship
Calculate covariance of X and Y
-The new component - covariance - looks at how far each observed (X,Y) pair is from the mean of X and the mean of Y, simultaneously. - We copy the deviations from the mean BMI (column 1) and the deviations from the mean SBP (column 2) from the computations of the respective sample variances. - We compute the cross-products - being sure to retain the signs. The first person has BMI ~9 units below the mean BMI and SBP ~ 20 units below the mean SBP. Note that person 5 has BMI just below the mean BMI and SBP just above the mean SBP, thus their cross product is negative (discordant).
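The covariance steps on this card (deviations from each mean, signed cross-products, then a sum) can be sketched as follows. The BMI and SBP values are hypothetical, not the data from the lecture table, and the function names are made up for this example. The sketch assumes the usual sample formulas with n - 1 in the denominator.

```python
import math


def sample_covariance(xs, ys):
    """Sum of signed cross-products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # A cross-product is negative when one value is below its mean
    # and the other is above its mean (a discordant pair).
    cross_products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(cross_products) / (n - 1)


def sample_correlation(xs, ys):
    """r = cov(X, Y) / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # sample SD of X
    s_y = math.sqrt(sample_covariance(ys, ys))  # sample SD of Y
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical BMI and SBP values, for illustration only.
bmi = [22.0, 25.0, 28.0, 31.0, 34.0]
sbp = [110.0, 122.0, 118.0, 135.0, 140.0]
cov_xy = sample_covariance(bmi, sbp)
r = sample_correlation(bmi, sbp)
```

Dividing the covariance by both standard deviations removes the units, which is why r always falls between -1 and +1.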
correlation coefficient ranges from −1 to 1
A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation (association) between the variables.
Tip for CIs
0 is the null value here; sometimes the null value will not be zero. If the CI includes the null value (zero), the result is not statistically significant; if the CI excludes it, the result is statistically significant. The point estimate is always included in the confidence interval, whatever the type of test, so its presence in the CI says nothing about statistical significance. - Even if there is no statistical significance, the point estimate will still be in the confidence interval.
Main assumptions for this linear regression model
1) A linear relationship is reasonable (look at the scatter plot) 2) The errors across observations are not correlated (independent observations) 3) The errors are normally distributed 4) The variance of the errors does not depend on the X variable (called homoskedasticity).
Simple linear regression
regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line - Correlation tells us the nature (direction) and strength of the association between X and Y - sometimes we want to estimate the equation of the line that best describes the association. Once we have a regression equation we can use it for prediction. Recall y=mx+b. Here we use b0 and b1 for the y-intercept (the value of Y when X=0) and the slope (change in Y relative to a 1 unit change in X). Simple regression refers to one X and one Y; multivariable regression refers to multiple (more than one) X and one Y.
r close to -1
strong, negative linear association
r close to 1
strong, positive linear association
Heteroskedasticity
Plot with random data showing heteroskedasticity: the variance of the y-values of the dots increases with increasing values of x.
Homoskedasticity
Plot with random data showing homoskedasticity: at each value of x, the y-values of the dots have about the same variance.
What does R-squared tell us?
R-squared tells us how much of the variation in Y is explained by X, the exposure variable of interest.
How well does gestational age predict birthweight?
R-squared=0.6683 - gestational age explains 66.8% of the variability in birthweight
A dataset containing measurements of body mass index and systolic blood pressure was analyzed using both correlation analysis and simple linear regression. The correlation analysis resulted in r=0.51. Which of the following is the best estimate of the multiple R2 obtained from the simple linear regression analysis?
R2=0.26 - Exactly! In simple linear regression, the R2 value is the square of the correlation coefficient. 0.51^2=0.26
Scatter diagram and regression line: Correlation and regression
Correlation: How closely do points cluster around the regression line? Regression: What is the prediction equation (key: slope) that describes the linear association between x and y?
A prospective cohort study recorded the bodyweight of adult participants annually. A simple linear regression analysis was performed in which the dependent variable was bodyweight in pounds and the independent variable was age in years. The Y-intercept was 37, and the slope was 4.2. The R2 was 0.26, and the p-value was 0.01. Which of the following is the best interpretation of these results?
Correct answer: For each additional year of age there was a 4.2 pound increase in weight on average, and changes in age accounted for about 26% of the variability in weight. Incorrect options: (1) Subjects generally gained 37 pounds over the 4.2 year period of observation, and the results were statistically significant. (2) For each additional year of age there was a 0.26 pound increase in weight on average, and the results were statistically significant. (3) The results are not interpretable because the R2=0.26 is <1.
Statistical Significance of a correlation coefficient
H0: No linear association between X and Y - The guidelines (e.g., r between 0.4 and 0.6 indicates a moderate correlation) are just that, guidelines. We can also run a formal test for statistical significance. The null hypothesis is always H0: rho=0 and the alternative is almost always two-sided, testing whether there is any significant correlation. The test statistic is t (Z can be used for large samples, but t is always fine, and R also produces a t statistic). - We can determine statistical significance by comparing the test statistic to the appropriate two-sided critical value with df=n-2, or use R to obtain a p-value. Another approach is to construct a CI around r and see whether the null value (0) falls within the CI. - The denominator of the t statistic is the standard error of r. - If the null hypothesis is true, we expect r, and therefore the t statistic, to be close to zero. - A large t statistic means the data are not compatible with the null: it gives a small p-value, so we reject the null and say that there is a statistically significant relationship.
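As a rough sketch of the test statistic described on this card, the function below divides r by its standard error, assuming the usual form SE(r) = sqrt((1 - r^2) / (n - 2)). The function name and the example values of r and n are hypothetical. Finding the p-value would still require a t distribution with n - 2 degrees of freedom (for example from R), which is not shown here.

```python
import math


def correlation_t_statistic(r, n):
    """t = r / SE(r), with SE(r) = sqrt((1 - r**2) / (n - 2)) and df = n - 2."""
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if abs(r) >= 1:
        raise ValueError("r must be strictly between -1 and 1")
    standard_error = math.sqrt((1 - r ** 2) / (n - 2))
    return r / standard_error


# Hypothetical sample correlation and sample size, for illustration only.
t_stat = correlation_t_statistic(0.5, 27)
```

With r = 0 the statistic is exactly 0, and it grows in magnitude as r moves away from 0 or as n increases, which is why a weak correlation can still be statistically significant in a large sample.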
Correlation coefficient r
In correlation analysis, we summarize the association (the scatter diagram) in a single number - the correlation coefficient. - The sign indicates the direction (+/-) and the magnitude (farther from 0, closer to 1) indicates strength. - Zero correlation indicates no linear association between X and Y. Focus on "linear": this is why it is always important to look at the scatter diagram, as there may be an association (e.g., curvilinear, J-shaped or U-shaped) that is not captured by the correlation coefficient.
Compute r and find p-value
Interpretation There is a very strong, positive, statistically significant linear association between BMI and SBP (r=0.86, p=0.0014).
Least squares estimates of regression parameters
Note the relationship between the correlation (r) and the slope: if the correlation is positive, so is the slope (and likewise for negative). The correlation is unitless. The slope has units attached to it: it reflects the change in Y (in whatever units Y is measured in) per 1-unit change in X (in whatever units X is measured in). sx and sy are the sample standard deviations of X and Y; X bar and Y bar are the means of X and Y, respectively. The Y-intercept is computed second, as it is based on the slope. A standard deviation is never negative, so the sign of the slope always matches the sign of r.
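The relationships described on this card correspond to the standard least squares formulas for simple linear regression, written here in LaTeX using the card's own symbols (r, s_x, s_y, and the two means). The card itself does not show the formulas, so treat this as a reconstruction of what it most likely refers to:

```latex
b_1 = r \, \frac{s_y}{s_x}, \qquad b_0 = \bar{Y} - b_1 \bar{X}
```

Because s_x and s_y are both positive, b_1 takes its sign from r, and b_0 can only be computed once b_1 is known.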
Which of the following is the best definition of slope?
The change in Y for each increment in X - The slope is the steepness of a linear regression equation. It captures the change in the Y variable for each unit change in the X variable. Slope can be simply described as the change in Y over the change in X (i.e., rise over run). - The value of Y when X equals 0 is the Y-intercept, the place where the line crosses the Y-axis. The variability in Y that's explained by the X variable is called R-squared.
Questions that can be answered by simple linear regression?
- How closely does the accumulation of savings correlate with time, or what is the average savings per week? - We can also ask about the probability that the apparent relationship is just the result of random error.
Using regression results: Prediction: What is the expected birthweight for an infant with gestational age of 35 weeks? birth_wt = -4020.05 + 180.46(gest_age)
birth_wt = -4020.05 + 180.46(35)=2296.05 grams
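The same prediction can be written as a small Python function. The intercept and slope are the estimates from the card; the function and parameter names are made up for this sketch.

```python
def predict_birth_weight(gest_age_weeks, intercept=-4020.05, slope=180.46):
    """Predicted birth weight in grams from the fitted line b0 + b1 * X."""
    return intercept + slope * gest_age_weeks


predicted_grams = predict_birth_weight(35)  # about 2296.05 grams
```

Predictions are only reasonable within the range of gestational ages seen in the data; at 0 weeks the line gives a negative weight, which is why the intercept is not interpretable here.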
Regression usage
can't make an inference about causality
Sample correlation coefficient
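covariance: how X and Y vary together - the sample correlation coefficient r is the covariance of X and Y divided by the square root of the product of the sample variance of X and the sample variance of Y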
correlation
direction and strength of the linear association between variables; neither variable plays a special role - does not matter which variable is X, and which is Y
regression
equation that best describes the linear relationship between variables - regression alone does not address causation or the usefulness of any prediction
Regression line: best fit but not perfect
error is the difference between the observation and the line
True or false: A correlation matrix is a table that summarizes how multiple continuous variables correlate with each other.
True
r close to 0
weak or no linear association
Example interpretation using regression results
• Mean birthweight is expected to increase by 180.5 grams for each additional week of gestation • 95% CI for b1 = (110.5, 250.4), which excludes 0 • t-statistic is large (5.49), so the p-value is small (<0.0001) • Reject the null hypothesis that slope b1 = 0 • Is the intercept interpretable? Here - no. It doesn't make sense to predict birth weight for a fetus at zero weeks of gestation! • The slope is of interest - birth weight is expected to be 180.46 grams higher for every 1 week increase in gestational age.
Limitations of Linear regression
•OLS regression just estimates the best line → association between Y and X •Regression alone does not address causation, even if the prediction is logical •Without more details (e.g., theory, study design), causal interpretation is not possible.
