Linear Regression Assumptions and Statistical Inference
What should the difference between the actual and predicted values be in the case of linear regression?
Normal - The differences between the actual and predicted values are called residuals, and they should be normally distributed in the case of linear regression.
When computing the variance inflation factor of the kth predictor using the remaining k-1 predictors, which of the following is used?
R-squared - Variance inflation factor, or VIF, is computed using the R-squared of a linear model considering the kth predictor as the dependent and the remaining k-1 predictors as the independent variables.
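As an illustration, here is a minimal sketch of this computation using statsmodels on made-up predictor data (the column names and values are purely hypothetical): the kth predictor is regressed on the remaining predictors, and its VIF is 1 / (1 − R²).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical predictors with some correlation between x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

def vif(X, col):
    """VIF of `col`: regress it on the remaining predictors and use that R-squared."""
    y_k = X[col]
    X_rest = sm.add_constant(X.drop(columns=col))
    r2 = sm.OLS(y_k, X_rest).fit().rsquared
    return 1.0 / (1.0 - r2)

for col in X.columns:
    print(col, round(vif(X, col), 2))
```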
The similarity between a statistician and a machine learner is that they both want to learn from data.
True - Both a statistician and a machine learner want to learn from data by building predictive models. What they do after building the models is where they differ.
If the assumption of linearity is not satisfied, we can transform one or more variables and check if it introduces linearity.
True - If the assumption of linearity is not satisfied, we can transform one or more variables and check if the transformed variable(s) have a linear relationship with the target or not. We can also transform the target variables if needed.
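For example, here is a minimal sketch on synthetic data (assuming numpy and statsmodels; the variable names are illustrative) of applying a log transform to a predictor and comparing the fit before and after:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 100, size=300)
y = 3.0 * np.log(x) + rng.normal(scale=0.3, size=300)  # true relationship is in log(x)

fit_raw = sm.OLS(y, sm.add_constant(x)).fit()           # linear in x
fit_log = sm.OLS(y, sm.add_constant(np.log(x))).fit()   # linear in log(x)
print("R-squared with x:      ", round(fit_raw.rsquared, 3))
print("R-squared with log(x): ", round(fit_log.rsquared, 3))
```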
The best straight line in simple linear regression is the one for which the sum of squared differences between actual and predicted values is minimum.
True - In simple linear regression, the best-fit line is the one for which the sum of squared differences between actual and predicted values is minimum.
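As a sketch on synthetic data, the slope and intercept that minimise the sum of squared residuals can be computed in closed form (the data here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Closed-form OLS estimates that minimise the sum of squared residuals
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

residuals = y - (intercept + slope * x)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("sum of squared residuals:", round(np.sum(residuals ** 2), 3))
```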
Machine learners should not stop after checking only a linear model as they do not assume a data-generating model (DGM).
True - Machine learners do not assume a DGM, so they should not stop after checking only a linear model and should explore non-linear models too.
If the predictor variables have multicollinearity, we cannot trust the p-values of the model coefficients.
True - The p-values of the model coefficients cannot be trusted if the predictor variables have multicollinearity.
Dimensionality reduction techniques can help get rid of multicollinearity.
True - One of the advantages of dimensionality reduction techniques is that they can help get rid of multicollinearity.
One kind of statistical inference from linear regression is to determine whether a variable should be kept or dropped by running a hypothesis test on its coefficient.
True - Running a hypothesis test on the coefficient of a predictor variable used in linear regression can help us determine the statistical significance of the predictor on the target, thereby allowing us to decide whether it should be kept or dropped.
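A minimal sketch of this with statsmodels on synthetic data (the predictor names "useful" and "noise" are made up for illustration): the fitted OLS results expose a t-test p-value for each coefficient, and predictors whose p-value exceeds the significance level become candidates to drop.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = pd.DataFrame({
    "useful": rng.normal(size=200),
    "noise":  rng.normal(size=200),   # unrelated to the target
})
y = 4.0 * X["useful"] + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.pvalues)  # t-test p-value for each coefficient (H0: coefficient = 0)

# At a 5% significance level, predictors with p-value > 0.05 are candidates to drop
to_drop = [c for c, p in results.pvalues.items() if c != "const" and p > 0.05]
print("candidates to drop:", to_drop)
```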
A statistician assumes a data-generating model before beginning the analysis.
True - Statisticians assume a data-generating model before beginning the analysis, allowing them to make powerful statistical inferences.
Two distributions are said to be close to each other if their respective percentiles, when drawn on a Q-Q plot, lie on a ___.
diagonal 45° straight line - If the percentile points of two distributions lie on a diagonal 45° straight line, they are said to be close to each other.
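A minimal sketch of drawing such a plot with statsmodels on synthetic data (matplotlib assumed for display): the residuals of a fitted model are compared to a normal distribution, and points hugging the 45° line suggest near-normality.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.normal(size=300)
y = 1.5 * x + rng.normal(size=300)

results = sm.OLS(y, sm.add_constant(x)).fit()

# Q-Q plot of the residuals against a normal distribution;
# points close to the 45-degree line suggest approximately normal residuals
sm.qqplot(results.resid, line="45", fit=True)
plt.show()
```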
In case the residuals of linear regression form a funnel-shaped pattern, they are said to be
heteroscedastic - In the case of heteroscedasticity, the residuals of linear regression form a discernible pattern (funnel-shaped or otherwise) instead of having constant variance.
Statisticians are mostly interested in determining the performance of a model on the ___ data, while machine learners have to check the model performance on the ___ data too.
in-sample, out-of-sample - Statisticians are mostly interested in determining the performance of a model on the in-sample data, while machine learners have to check the model performance on the out-of-sample data as well, in order to compare multiple models and check for overfitting.
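As a sketch (synthetic data, numpy and statsmodels assumed), the data can be split into in-sample and out-of-sample portions and the model's error compared on each:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

# Manual split into in-sample (train) and out-of-sample (test) portions
idx = rng.permutation(500)
train, test = idx[:400], idx[400:]

X_train = sm.add_constant(x[train])
X_test = sm.add_constant(x[test])
results = sm.OLS(y[train], X_train).fit()

rmse_in = np.sqrt(np.mean((y[train] - results.predict(X_train)) ** 2))
rmse_out = np.sqrt(np.mean((y[test] - results.predict(X_test)) ** 2))
print("in-sample RMSE:     ", round(rmse_in, 3))
print("out-of-sample RMSE: ", round(rmse_out, 3))
```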
When doing multiple linear regression, the predictor variables should be ___ each other.
independent of - When doing multiple linear regression, the predictor variables should be independent of each other to ensure that there is no multicollinearity.
When two predictor variables are correlated, then it poses problems for ___ but not for ___.
interpretation, prediction - When two predictor variables are correlated, the model still captures their combined effect, so predictions are not affected. However, interpretation becomes a challenge because the individual coefficient estimates become unstable and may no longer make sense on their own.
For which of the following p-values given by the Goldfeld-Quandt test (at 5% significance) will we conclude that the residuals of linear regression are homoscedastic?
0.1 - If the p-value given by the Goldfeld-Quandt test, at 5% significance, is greater than 0.05, we can conclude that the residuals of linear regression are homoscedastic.
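A minimal sketch of running this test with statsmodels on synthetic, homoscedastic data (the 0.05 threshold matches the 5% significance level above):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 2.0 * x + rng.normal(size=300)   # constant error variance

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Goldfeld-Quandt test: H0 = homoscedastic residuals
f_stat, p_value, _ = het_goldfeldquandt(results.resid, results.model.exog)
print("p-value:", round(p_value, 3))
if p_value > 0.05:
    print("Fail to reject H0: residuals look homoscedastic")
else:
    print("Reject H0: residuals look heteroscedastic")
```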
Considering a significance level of 5%, for which of the following p-values will we consider the predictor variable insignificant and drop it?
0.1 - Since we have a 5% significance level, if the predictor variable has a p-value > 0.05, we will consider it insignificant and drop it.
Suppose the 95% confidence interval for the coefficient of a predictor variable is [30.58, 31.44]. What will be the best estimate of the coefficient?
31.01 - The best estimate for the coefficient is the midpoint of the 95% confidence interval, i.e., (30.58 + 31.44)/2 = 31.01
There are ___ assumptions of simple linear regression.
4 - Linearity, Independence, Normality, Homoscedasticity
Suppose the VIF of a predictor variable is 1.8. Then the variance of the coefficient corresponding to that predictor variable is ___% greater than what it would be if there was no multicollinearity.
80 - A VIF of 1.8 indicates that the variance of the coefficient corresponding to that predictor variable is inflated by 80% due to multicollinearity.
The OLS summary, by default, shows us the ___ confidence interval for the model coefficients.
95% - By default, the statsmodels OLS summary shows us the 95% confidence interval for the model coefficients.
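For completeness, a small sketch of pulling these intervals out of a fitted statsmodels OLS model on synthetic data; the alpha parameter of conf_int controls the level (alpha=0.05 gives the default 95% interval).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 3.0 * x + 5.0 + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.conf_int(alpha=0.05))  # the 95% interval shown in results.summary()
print(results.conf_int(alpha=0.10))  # a 90% interval, if a different level is needed
```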
When two or more variables have high VIF, we should
Check the effect of dropping each variable individually on model performance and choose which one to drop - When two or more variables have high VIF, a good approach is to check the effect of dropping each variable individually on model performance and then choose which one to drop. Generally, the one having the least impact on model performance is dropped.
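A minimal sketch of this approach on synthetic data (x1 and x2 are deliberately made collinear; the names are illustrative): drop each high-VIF candidate in turn, refit, and compare a performance metric such as adjusted R-squared.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # highly collinear with x1
x3 = rng.normal(size=300)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=300)

# Drop each high-VIF candidate in turn and compare adjusted R-squared
for col in ["x1", "x2"]:
    reduced = sm.add_constant(X.drop(columns=col))
    r2_adj = sm.OLS(y, reduced).fit().rsquared_adj
    print(f"adjusted R-squared without {col}: {r2_adj:.3f}")
```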
The assumption of linearity is said to be satisfied if the plot of residuals against the predicted values shows a parabolic pattern.
False - If the independent variables are linearly related to the target variable, then the plot of residuals against the predicted values will show no pattern at all.
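As a sketch on synthetic, deliberately non-linear data, the residuals-versus-fitted plot makes such a violation visible (matplotlib assumed for plotting):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, size=300)
y = x ** 2 + rng.normal(scale=0.5, size=300)   # truly non-linear relationship

results = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs fitted values: a parabolic pattern here signals that
# the linearity assumption is violated
plt.scatter(results.fittedvalues, results.resid, s=10)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```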
The assumption of homoscedasticity says that the residuals of linear regression should not have equal variance.
False - The assumption of homoscedasticity says that the residuals of linear regression should have equal variance.
The linear regression coefficients corresponding to the best fit straight line will give us the exact data-generating model.
False - The linear regression coefficients corresponding to the best fit straight line will only give us an approximation of the actual data-generating model, not the exact model.
A machine learner can make statistical inferences about the data.
False - As machine learners do not assume a DGM, they cannot make statistical inferences about the data.
