Introduction to multiple linear regression
What are the two ways in which multicollinearity may be defined?
- A strong correlation between two predictors - A strong correlation of one predictor with a combination of multiple other predictors
Assumptions of multiple linear regression regarding the dependent and independent variables
- Dependent = continuous and quantitative - Independent = quantitative or categorical
In multiple linear regression, residuals must be... (?) (3)
- Independent - Random - Normally distributed
If there were no collinearity between predictors, what would this mean, and what would the results of a multiple regression be equal to?
- Each predictor would account for unique variability only - Running a separate simple linear regression for each predictor would output the same information as the multiple linear regression (see the sketch below)
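As an illustrative sketch (not part of the original cards): with perfectly uncorrelated (orthogonal) predictors, the slope from each simple regression matches the corresponding multiple-regression coefficient. The toy data below are hypothetical:

```python
import numpy as np

# Two exactly orthogonal (uncorrelated, zero-mean) toy predictors
x1 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
x2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
y = 2.0 * x1 + 3.0 * x2 + 0.5          # hypothetical outcome

# Multiple regression: y ~ x1 + x2 (with intercept)
X = np.column_stack([np.ones_like(x1), x1, x2])
b_multi, *_ = np.linalg.lstsq(X, y, rcond=None)

# Separate simple regressions: y ~ x1 and y ~ x2
b1 = np.polyfit(x1, y, 1)[0]           # slope from regressing y on x1 alone
b2 = np.polyfit(x2, y, 1)[0]           # slope from regressing y on x2 alone

# Because x1 and x2 share no variability, the slopes agree (2.0 and 3.0)
print(b_multi[1:], b1, b2)
```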
Rather than a line, what does a multiple linear regression (with two predictors) produce?
A 3D plane
The equation for multiple regression is what?
An extension of the equation of a straight line (i.e. of the equation of the simple linear regression), adding a term for each additional predictor (see below)
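In standard notation (a sketch consistent with the card above, where b_0 is the intercept, b_1 to b_n are the coefficients, and epsilon is the error term):

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
```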
Why is multiple linear regression used in the first place if we get collinearity?
Because, in practice, predictor factors are often related to one another
In multiple linear regression, why shouldn't there be a high degree of collinearity?
Because it makes you unable to assess the unique contribution of each predictor
The size of the sample considered in a multiple linear regression must...(?)
Be at least 10 times the number of predictors tested in total, even if the final model retains fewer of them (e.g. testing 5 predictors requires at least 50 cases)
In general, multicollinearity makes it difficult to...(?)
Determine the individual effect of each predictor
Why does R-squared become limited if more predictors keep getting added to the model?
Even if a predictor doesn't account for much unique variability, it adds another degree of freedom to the model; so although the F-ratio stays roughly the same, the significance of that ratio decreases (see the adjusted R-squared formula below)
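A commonly used correction (background, not stated on the card itself) is adjusted R-squared, which penalises each added predictor through the degrees of freedom, for a sample of size n with k predictors:

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
```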
Outline what multiple linear regression does
Fits the data to a model that defines y as a function of two or more predictor variables (see the fitting sketch below)
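A minimal sketch of fitting such a model in Python with statsmodels; the data frame and its column names (y, x1, x2) are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: outcome y modelled as a function of two predictors
df = pd.DataFrame({
    "y":  [3.1, 4.2, 5.0, 6.1, 7.3, 8.0],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.5, 3.5, 3.0, 4.5, 5.0],
})

# Fit y = b0 + b1*x1 + b2*x2 by ordinary least squares
model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.summary())  # coefficients, standard errors, R-squared, F-ratio
```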
Even though the VIF for each factor may be less than 10, why may there still be collinearity?
If the predictors in the model have an average VIF substantially greater than 1 (e.g. several VIFs approaching 10), this would still indicate overlap, with different factors accounting for the same variability
What value/s of tolerance are problematic, i.e. indicate a potential degree of collinearity?
Less than 0.1
What value/s of VIF are problematic, i.e. indicate a potential degree of collinearity?
Greater than 10 (an average VIF substantially greater than 1 across the predictors is also cause for concern); see the VIF-checking sketch below
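A minimal sketch of checking VIFs in Python with statsmodels (hypothetical column names); variance_inflation_factor expects the full design matrix, including the constant:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors; x2 is deliberately close to a multiple of x1
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
})
X = add_constant(X)  # include the intercept column when computing VIFs

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}")  # values above ~10 flag collinearity
```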
By taking into account more than one predictor variable in a multiple regression, compared to a simple linear regression, what does this allow you to do?
Judge more reliably what the effect on the outcome will be
Performing a multiple linear regression is a good alternative to what?
Performing separate correlations
What becomes limited each time a predictor is added to a model?
R-squared
Comparing two models, where the first considers only height as a predictor, and the second considers not only height but also weight and sex: what does an improvement of the first model by the second suggest?
That the coefficient for height is unstable
What do we mean if we say that a VIF or tolerance value is problematic?
That it indicates a potential degree of collinearity
In multiple linear regression, the relationship should be linear with non-zero variance. What is meant by the latter point?
That there is some variability within the model, otherwise there would be no need for the model
In a venn diagram that represents the variability in the dataset, what do the smaller circles that lie within the largest circle represent?
The amount of variability accounted for by a particular predictor
Homoscedasticity is an assumption of multiple linear regression. Define it
The assumption that the variance around the regression line is the same for all values of the predictor variable (see the residual-plot sketch below)
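One common informal check (a sketch with hypothetical, simulated data): plot residuals against fitted values and look for a roughly constant spread, rather than a funnel shape:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data with constant error variance by construction
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + rng.normal(0, 1, 100)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Homoscedasticity check: the vertical spread should not fan out
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```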
What does it mean when we say that the standard errors of the b coefficients increase as collinearity increases?
The b coefficients become unstable, which makes us less confident about those effects
Define what is meant by 'multicollinearity'
The phenomenon in which one predictor variable in a multiple linear regression model can be linearly predicted from the others with a substantial degree of accuracy (this is what the VIF quantifies; see below)
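This can be made precise (standard notation, not from the card): regress predictor j on all the other predictors and call the resulting R-squared R_j^2; then

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

so the better predictor j can be predicted from the others, the larger its VIF.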
What increases as collinearity increases?
The standard errors of the b coefficients, i.e. the coefficients become unstable, making us less confident about those effects
In a venn diagram that represents the variability in the dataset, what does the largest circle represent (it may enclose smaller circles)?
The total amount of variability in the dataset
In addition to the increase in the standard errors of the b coefficients and the limiting of the R-squared value as predictors are added to a model, what else becomes a problem?
There is a lack of clarity about the contribution of each individual predictor
If there is likely to be a degree of collinearity within a model, what can you do?
There is very little that statistics can do to help; however, if possible, it can be sensible to combine similar factors into a single factor, e.g. height and weight combined into BMI (see the sketch below)
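A minimal sketch of combining two collinear predictors into a single factor (hypothetical column names; height in metres, weight in kilograms):

```python
import pandas as pd

# Hypothetical data with strongly related predictors
df = pd.DataFrame({
    "height_m":  [1.60, 1.75, 1.82, 1.68],
    "weight_kg": [55.0, 72.0, 88.0, 63.0],
})

# Replace the collinear pair with one derived factor: BMI = kg / m^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df[["bmi"]])
```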
What is tolerance?
The tolerance statistic: the reciprocal of the VIF (variance inflation factor); see the formula below
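In symbols (standard notation, complementing the VIF formula above):

```latex
\mathrm{Tolerance}_j = \frac{1}{\mathrm{VIF}_j} = 1 - R_j^2
```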
In a venn diagram that represents the variability in the dataset, what does overlap of the smaller circles represent?
Variability explained by those factors combined (shared variance)
What does VIF stand for?
Variance inflation factor
Comparing two models, where the first considers only height as a predictor, and the second considers not only height but also weight and sex: there is an improvement of the first model by the second, suggesting that the coefficient for height is unstable. We could remove height as a predictor (ensuring that we note this in the final reporting) and then re-run the model. However, what argument might be posed against this?
That we cannot be sure height has no effect at all
Comparing two models, where the first considers only height as a predictor, and the second considers not only height but also weight and sex: there is an improvement of the first model by the second, suggesting that the coefficient for height is unstable. Should we exclude height from the data set?
We could; if we choose to do this, we must report that height is not a significant predictor and re-run the model