ISYE6414 Midterm 2
A multiple linear regression model has high explanatory power if the coefficient of determination is close to 1.
TRUE: If R^2 is close to 1, almost all of the variability in Y can be explained by the linear regression model; hence, the model has high explanatory power.
Interpret the coefficient for concCu.
A 1-unit increase in the concentration of copper decreases the log odds of botrytis blight surviving by 0.27483 when sulfur stays fixed.
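The same coefficient can also be read on the odds scale by exponentiating it. The lines below are a minimal sketch of that conversion; the variable names are illustrative, and only the coefficient value comes from the card above.

```python
import math

# Estimated coefficient for concCu, taken from the card above.
beta_conc_cu = -0.27483

# exp(beta) is the multiplicative change in the odds of survival
# for a 1-unit increase in copper, holding sulfur fixed.
odds_ratio = math.exp(beta_conc_cu)

# Percent change in the odds (a negative value means a decrease).
percent_change = (odds_ratio - 1.0) * 100.0

print(f"odds ratio: {odds_ratio:.4f}")
print(f"percent change in odds: {percent_change:.1f}%")
```

An odds ratio below 1 corresponds to the decrease in log odds stated in the answer.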
Cook's distance (Di) measures how much the fitted values in a multiple linear regression model change when the ith observation is removed.
TRUE: "This is the distance between the fitted values of the model with all the observations versus the fitted values of the model discarding the i-th observation from the data used to fit the model."
Residual analysis in Poisson regression can be used:
To evaluate goodness of fit of the model.
Poisson regression can be used:
To model count data. To model rate response data. To model response data with a Poisson distribution.
If the VIF for each predicting variable is smaller than a certain threshold, then we can say that there is not a problematic amount of multicollinearity in the multiple linear regression model.
TRUE
If we were to apply a standard normal regression to response data with a Poisson distribution, the constant variance assumption would not hold.
TRUE
In Multiple Linear Regression we standardize the residuals because unlike true errors, the estimated residuals don't have constant variance
TRUE
In Poisson regression we model the log of the expected response variable not the expected log response variable.
TRUE
In applying the deviance test for goodness of fit in logistic regression, we seek large p-values, that is, not reject the null hypothesis.
TRUE
In logistic regression, the sampling distribution of the residuals is approximately normal if the model is a good fit.
TRUE
In logistic regression, if your sample size is small you will underestimate the number of type 1 errors
TRUE
In logistic regression, the estimation of the regression coefficients is based on maximum likelihood estimation
TRUE
In logistic regression, the hypothesis test for subsets of coefficients is approximate: it relies on a large sample size
TRUE
In multiple linear regression, we can assess the assumption of constant-variance by plotting the standardized residuals against fitted values.
TRUE
In multiple linear regression, when using very large samples, relying on the p-values associated to the traditional hypothesis test with 𝐻𝑎:βj≠0 can lead to misleading conclusions on the statistical significance of the regression coefficients.
TRUE
In order to perform classification in logistic regression, we need to first define a classifier for the classification error rate.
TRUE
Linear regression uses the partial F-test; logistic/Poisson regression uses a chi-square test
TRUE
Linear regression uses the t-test; logistic/Poisson regression uses the Wald test
TRUE
Logistic/Poisson parameter estimates are approximate
TRUE
Logistic regression models the probability of a success given a set of predicting variables
TRUE
One common approach to evaluate the classification error is cross-validation.
TRUE
Prediction translates into classification of a future binary response in logistic regression.
TRUE
Residual analysis is used for goodness of fit assessment.
TRUE
Statistical inference is not reliable for small n for logistic/poisson
TRUE
The estimated regression coefficients and their standard deviations are approximate not exact in Poisson regression.
TRUE
The interpretation of the estimated Poisson regression coefficients is in terms of the ratio of the response rates.
TRUE
The link function for the Poisson regression is the log function.
TRUE
The prediction intervals are centered at the predicted value.
TRUE
The sampling distribution of the prediction of a new response is a t-distribution.
TRUE
The standard normal regression, the logistic regression and the Poisson regression all fall under the generalized linear model framework.
TRUE
We use a chi-square testing procedure to test whether a subset of regression coefficients are zero in Poisson regression.
TRUE
We use the glm() R command to fit a Poisson linear regression.
TRUE
In multiple linear regression, we could diagnose the normality assumption by using the normal probability plot.
TRUE: "For checking normality, we can use the quantile plot, or normal probability plot, on which data are plotted against a theoretical normal distribution in such a way that the points should form a straight line."
In multiple linear regression, the coefficient of determination is used to evaluate goodness-of-fit.
TRUE: "Goodness of fit refers to goodness of fit of the data with the model structure and assumptions."
Multicollinearity can lead to misleading conclusions on the statistical significance of the regression coefficients of a multiple linear regression model.
TRUE: "The higher the VIF, ... The less likely it is that a coefficient will be evaluated as statistically significant"
In Poisson regression
We DO NOT make inference using t-intervals for the regression coefficients. Statistical inference DOES NOT rely on the exact sampling distribution of the regression coefficients. Statistical inference is NOT reliable for small sample data.
Using the R statistical software to fit a logistic regression,
We can obtain both the estimates and the standard deviations of the estimates for the regression coefficients.
Suppose you wanted to test if the coefficient for concCu is equal to -0.3. What z-value would you use for this test?
z-value = (estimated coefficient - null value)/standard error of estimated coefficient = (-0.27483+0.3)/0.01784 = 1.411
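The arithmetic in this answer can be checked with a short script. This is a sketch using only the numbers quoted on the card; the two-sided p-value line is an addition for illustration and assumes the usual standard normal reference distribution for the Wald statistic.

```python
import math

beta_hat = -0.27483    # estimated coefficient for concCu
null_value = -0.3      # hypothesized value under H0
std_error = 0.01784    # standard error of the estimate

# Wald z-statistic: (estimate - null value) / standard error.
z_value = (beta_hat - null_value) / std_error

# Two-sided p-value under N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z_value) / math.sqrt(2.0))

print(f"z = {z_value:.3f}, two-sided p-value = {p_value:.3f}")
```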
DF: Simple Linear Regression
->hypothesis testing for the βs: n-2 DF (n = sample size) We lose 2 degrees of freedom from replacing β0 and β1 with β^0 and β^1 to obtain the residuals which replace the error terms.
DF: Multiple Linear Regression
->hypothesis testing for the βs: n-p-1 DF (n = sample size) (p = number of predicting variables) This is consistent with simple linear regression (p=1). If you add a categorical variable with c categories, it turns into c-1 dummy variables and reduces your DF by c-1. ->testing for overall regression: p DF_regression, n-p-1 DF_residual, and n-1 DF_total (p = number of predicting variables) (n = sample size) ->testing for subsets of coefficients: q DF_regression, n-p-q-1 DF_residual, and n-p-1 DF_total (p = number of predicting variables) (n = sample size) (q = number of additional variables added to the model) DF_total = DF_regression + DF_residual
DF: ANOVA
->hypothesis testing for the μs: N-k DF_errors, k-1 DF_treatment, and N-1 DF_total (N = total sample size) (k = number of groups) DF_errors: We have N-k DF because N is the sample size and the k population means are replaced with their sample means. DF_treatment: We have k-1 DF because k is the sample size (we have k sample means) and the overall population mean is replaced with the overall sample mean. DF_total: We have N-1 DF because N is the sample size and the overall population mean is replaced with the overall sample mean. DF_total = DF_errors + DF_treatment
DF: Generalized Linear Models
->testing for overall regression: p DF (p = number of predicting variables) ->testing for subsets of coefficients (Wald): q DF (q = number of regression coefficients discarded from the full model to get the reduced model) ->testing goodness of fit: n-p-1 DF (n = sample size) (p = number of predicting variables)
The p-value for a goodness-of-fit test using the deviance residuals for the regression can be obtained from which of the following?
1-pchisq(299.43,17). The goodness of fit test uses the residual deviance (299.43) and corresponding degrees of freedom (17) as the test statistic for the chi-squared test.
The p-value for testing the overall regression can be obtained from which of the following?
1-pchisq(419.33,2). The chi-square test statistic is the difference between the null deviance (718.76) and the residual deviance (299.43), which is 419.33. The degrees of freedom is the difference between the null deviance degrees of freedom (19) and the residual deviance degrees of freedom (17), which is 2 (the number of predicting variables in the model).
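The two deviance-based p-values above are written as R calls to pchisq. As a cross-check that needs no statistics package, the sketch below computes the same upper-tail chi-square probabilities in Python. The helper chi2_sf is a hypothetical name introduced here; it handles integer degrees of freedom only, using the closed-form recursion for the regularized upper incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: Q(m, y) = exp(-y) * sum_{i=0}^{m-1} y^i / i!
        m = df // 2
        term, total = 1.0, 1.0
        for i in range(1, m):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: start from Q(1/2, y) = erfc(sqrt(y)) and apply
    # Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1).
    q = math.erfc(math.sqrt(y))
    a = 0.5
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Goodness of fit: residual deviance on its residual degrees of freedom.
gof_p = chi2_sf(299.43, 17)

# Overall regression: null deviance minus residual deviance,
# on the difference in degrees of freedom.
overall_p = chi2_sf(718.76 - 299.43, 19 - 17)

print(f"goodness-of-fit p-value: {gof_p:.3g}")
print(f"overall regression p-value: {overall_p:.3g}")
```

Both p-values are extremely small, so the goodness-of-fit test rejects the null hypothesis of good fit, while the overall regression test rejects the null hypothesis that all slope coefficients are zero.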
Construct an approximate 95% confidence interval for the coefficient of concS.
95% confidence interval = (estimated coefficient - z critical point * standard error of estimated coefficient, estimated coefficient + z critical point * standard error of estimated coefficient) = (-4.32735 - 1.96*0.26518, -4.32735 + 1.96*0.26518) = (-4.847, -3.808)
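The interval can be reproduced in a few lines. This is a sketch using the estimate and standard error quoted in the answer; 1.96 is the usual approximate 97.5th percentile of the standard normal.

```python
beta_hat = -4.32735   # estimated coefficient for concS
std_error = 0.26518   # standard error of the estimate
z_crit = 1.96         # approximate z critical point for 95% confidence

margin = z_crit * std_error
lower = beta_hat - margin
upper = beta_hat + margin

print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Because the interval excludes 0, the coefficient would be judged statistically significant at the 5% level under the same normal approximation.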
If the p-value of the deviance test for goodness of fit is large, there is no need to improve the model fit
FALSE
Random subsampling is simpler than k-fold cross-validation; thus, it is more commonly used
FALSE
Parameter tuning is not recommended as part of the sub-sampling approach for addressing the p-value problem with large samples.
FALSE: Parameter tuning is recommended as part of the sub-sampling approach for addressing the p-value problem with large samples. "Other tuning parameters that need to be varied are the number of sub-samples, B in the description of the approach, and the percentage of data to sub-sample. I recommend a more thorough analysis evaluating the sensitivity to these parameters in detecting statistical significance in such studies."
The mean squared prediction error (MSPE) is a robust prediction accuracy measurement for an ordinary least squares (OLS) model regardless of the characteristics of the dataset.
FALSE: "MSPE is appropriate for evaluating prediction accuracy for a linear model estimated using least squares, but it depends on the scale of the response data, and thus is sensitive to outliers."
The logit link function is the best link function to model binary response data because it always fits the data better than other link functions.
FALSE: "The logit function is not the only function that yields the s-shaped kind of curve. There are other s-shaped functions that are used in modeling binary responses."
In multiple linear regression, the proportion of variability in the response variable that is explained by the predicting variables is called adjusted R^2
FALSE: "We interpret R^2 as the proportion of total variability in Y that can be explained by the linear regression model."
If a logistic regression model provides accurate classification, then we can conclude that it is a good fit for the data.
FALSE: "Goodness of fit doesn't guarantee good prediction." And conversely, good prediction doesn't guarantee that the model is a good fit. To evaluate whether the model is a good fit, or equivalently whether the assumptions hold, we can check whether the Pearson or deviance residuals are approximately normally distributed, using a histogram and a normal probability plot. If they are normally distributed, we conclude that the model is a good fit. Another approach to evaluating goodness of fit is hypothesis testing. In the goodness-of-fit test, the null hypothesis is that the model fits well, and the alternative is that the model does not fit well. The test statistic is the sum of squared deviance residuals. Under the null hypothesis of good fit, the test statistic has an approximate chi-square distribution with n-p-1 degrees of freedom. It is important to remember that if the p-value is small, we reject the null hypothesis of good fit and conclude that the model is not a good fit.
In multiple linear regression, a VIF value of 6 for a predictor means that 92% of the variation in that predictor can be modeled by the other predictors.
FALSE: A VIF value of 6 for a predictor means that 83.3% of the variation in that predictor can be modeled by the other predictors in the model. VIF_j = 1/(1 − R_j^2) ⟹ 6 = 1/(1 − R_j^2) ⟹ 1 − R_j^2 = 1/6 ⟹ R_j^2 = 1 − 1/6 = 5/6 ≈ 83.3%.
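The relationship between VIF and R_j^2 used in this answer is easy to invert in code. The function name below is hypothetical, and the sketch only restates the algebra on the card.

```python
def r_squared_from_vif(vif):
    """Invert VIF_j = 1 / (1 - R_j^2) to recover R_j^2."""
    if vif < 1.0:
        raise ValueError("VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif

for vif in (1.0, 6.0, 10.0):
    share = r_squared_from_vif(vif)
    print(f"VIF = {vif:>4}: R_j^2 = {share:.3f}")
```

A VIF of 1 gives R_j^2 = 0 (no collinearity), while a VIF of 6 gives 5/6, matching the 83.3% in the answer.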
In multiple linear regression, a VIFj of 10 means that there is no correlation among the jth predictor and the remaining predictor variables, and hence the variance of the estimated regression coefficient 𝛽hat𝑗 is not inflated.
FALSE: A VIFj of 1 means that there is no collinearity with respect to the jth predictor and the remaining predictor variables, and hence the variance of the estimated regression coefficient 𝛽hat𝑗 is not inflated at all. A VIFj of 10 means that the variance of the estimated regression coefficient corresponding to the jth predicting variable is 10 times more than what it should be if there's no collinearity.
The number of parameters that need to be estimated in a logistic regression model with 6 predicting variables and an intercept is the same as the number of parameters that need to be estimated in a standard linear regression model with an intercept and same predicting variables.
FALSE: As there is no error term in a logistic regression model, there is no additional parameter for the variance of the error terms. As a result, the number of parameters that need to be estimated in a logistic regression model with 6 predicting variables and an intercept is 7. The number of parameters that need to be estimated in a standard linear regression model with an intercept and same predicting variables is 8.
For a multiple linear regression model to be a good fit, we need the linearity assumption to hold for at least one of the predicting variables.
FALSE: For a multiple linear regression model to be a good fit, we need the linearity assumption to hold for all of the predicting variables. "Linearity assumption, meaning the relationship between Y and Xj is linear for all predicting variables."
In logistic regression, if the p-value of the deviance test for goodness-of-fit is smaller than the significance level 𝛼, then it is plausible that the model is a good fit.
FALSE: For logistic regression, if the p-value of the deviance test for goodness-of-fit is large, then it is an indication that the model is a good fit.
In Poisson regression, we assume a nonlinear relationship between the log rate and the predicting variables.
FALSE: In Poisson regression, we assume a linear relationship between the log rate and the predicting variables. Linearity Assumption: 𝑙𝑜𝑔(𝐸(𝑌|𝑥1,...,𝑥𝑝))=𝛽0+𝛽1𝑥1+...+𝛽𝑝𝑥𝑝
In multiple linear regression, the adjusted R^2 can be used to compare models, and its value will always be greater than or equal to that of R^2.
FALSE: In multiple linear regression, the adjusted R^2 can be used to compare models, and its value will always be less than or equal to that of R^2.
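The inequality follows from the standard definition of adjusted R^2, which penalizes R^2 by the ratio (n-1)/(n-p-1). The sketch below assumes that definition, which the card itself does not state; the inputs are made-up numbers used only for illustration.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n is the sample size and p is the number of predicting variables.
    """
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical values, not taken from any data set in these notes.
r2 = 0.80
adj = adjusted_r_squared(r2, n=50, p=5)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj:.3f}")
```

Since (n-1)/(n-p-1) is at least 1 whenever p is at least 0, the adjusted value can never exceed R^2.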
Multicollinearity in multiple linear regression means that the rows in the design matrix are (nearly) linearly dependent.
FALSE: Multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.
The presence of multicollinearity in a multiple linear regression model will not impact the standard errors of the estimated regression coefficients.
FALSE: Multicollinearity in the predicting variables can impact the standard errors of the estimated regression coefficients. "However, the bigger problem is that the standard errors will be artificially large"
Under logistic regression, the sampling distribution used for a coefficient estimator is a Chi-squared distribution when the sample size is large.
FALSE: The coefficient estimator follows an approximate normal distribution
The log-likelihood function is a linear function with a closed-form solution.
FALSE: The log-likelihood function is a non-linear function. A numerical algorithm is needed in order to maximize it.
In logistic regression, the error terms are assumed to follow a normal distribution.
FALSE: There are no error terms in logistic regression
A t-test is used for testing the statistical significance of a coefficient given all predicting variables in a Poisson regression model.
FALSE: We can use a Z-test to test for the statistical significance of a coefficient given all predicting variables in a Poisson regression model.
We cannot estimate the regression coefficients of a multiple linear regression model if the predicting variables are linearly independent.
FALSE: We cannot estimate the regression coefficients of a multiple linear regression model if the predicting variables are linearly dependent.
In logistic regression, the estimated value for a regression coefficient 𝛽𝑖 represents the estimated expected change in the response variable associated with one unit increase in the corresponding predicting variable, 𝑥𝑖 , holding all else in the model fixed.
FALSE: We interpret logistic regression coefficients with respect to the odds of success.
When testing a subset of coefficients, deviance follows a chi-square distribution with q degrees of freedom, where q is the number of regression coefficients in the reduced model.
FALSE: When testing a subset of coefficients, deviance follows a chi-square distribution with q degrees of freedom, where q is the number of regression coefficients discarded from the full model to get the reduced model.
When do we use transformation?
If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption. If the constant variance assumption does not hold, we transform the response variable.
If the residuals of a multiple linear regression model are not normally distributed, we can model the transformed response variable instead, where a common transformation for normality is the Box-Cox transformation.
If the normality assumption does not hold, we can use a transformation that normalizes the response variable such as Box-Cox transformation.
Logistic regression is different from standard linear regression in that:
It does not have an error term. The response variable is not normally distributed. It models probability of a response and not the expectation of the response.
An overdispersion parameter close to 1 indicates that the variability of the response is close to the variability estimated by the model.
TRUE
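The card above does not say how the overdispersion parameter is estimated. One common estimate, assumed here, divides the residual deviance by its degrees of freedom, n-p-1. The sketch reuses the residual deviance (299.43) and degrees of freedom (17) quoted in the goodness-of-fit card earlier in these notes, purely as an illustration.

```python
# Values quoted in the goodness-of-fit card; reused here as an example only.
residual_deviance = 299.43
residual_df = 17          # n - p - 1

# Assumed estimator: residual deviance divided by its degrees of freedom.
phi_hat = residual_deviance / residual_df

print(f"estimated overdispersion parameter: {phi_hat:.2f}")
```

A value this far above 1 would suggest that the response is more variable than the fitted model implies.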
Comparing cross-validation methods: In K-fold cross-validation, the larger K is, the higher the variability in the estimation of the classification error is.
TRUE
Logistic regression deals with the case where the dependent variable is binary, and the conditional distribution 𝑌𝑖|𝑿𝑖,1,⋯,𝑿𝑖,𝑝 is Binomial.
TRUE: Logistic regression is the generalization of the standard regression model that is used when the response variable y is binary or binomial.
The presence of certain types of outliers can impact the statistical significance of some of the regression coefficients of a multiple linear regression model.
TRUE: Outliers that are influential can impact the statistical significance of the beta parameters.
For both logistic and Poisson regression, the deviance residuals should approximately follow the standard normal distribution if the model is a good fit for the data.
TRUE: The deviance residuals are approximately N(0,1) if the model is a good fit
In logistic regression, the relationship between the probability of success and the predicting variables is nonlinear.
TRUE: The equation that links the predictors to the probability is: 𝑝(𝑥1,...,𝑥𝑝) = 𝑒𝑥𝑝(𝛽0+𝛽1𝑥1+...+𝛽𝑝𝑥𝑝) / (1 + 𝑒𝑥𝑝(𝛽0+𝛽1𝑥1+...+𝛽𝑝𝑥𝑝)). This relationship is not linear.
The estimated regression coefficients in Poisson regression are approximate.
TRUE: The estimated parameters and their standard errors are approximate estimates.
For a classification model, the training error rate tends to underestimate the true classification error rate of the model.
TRUE: The training error rate is a downward-biased (optimistic) estimate of the true classification error rate. Hence, the training error rate tends to underestimate the true classification error rate of the model.
Although there are no error terms in a logistic regression model using binary data with replications, we can still perform residual analysis.
TRUE: We can perform residual analysis on the Pearson residuals or the Deviance residuals
The logit function is the log of the ratio of the probability of success to the probability of failure. It is also known as the log odds function.
TRUE: 𝑔(𝑝) = ln(𝑝/(1−𝑝)). The logit link function is also known as the log odds function.
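A small sketch of the logit function and its inverse may help fix the definition. The function names are illustrative choices, not taken from the course material.

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

for p in (0.1, 0.5, 0.9):
    eta = logit(p)
    print(f"p = {p:.1f}: log odds = {eta:+.3f}, back to p = {inverse_logit(eta):.1f}")
```

A probability of 0.5 maps to log odds of 0, and probabilities p and 1-p map to log odds of equal size and opposite sign.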
In evaluating a multiple linear regression model
The F test is used to evaluate the overall regression. The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. Residual analysis is used for goodness of fit assessment.
None of these are correct:
The residuals have constant variance for the multiple linear regression model. The residuals vs fitted plot can be used to assess the assumption of independence. The residuals have a t-distribution if the error term is assumed to have a normal distribution.
Logistic regression is different from standard linear regression in that:
The sampling distribution of the regression coefficient is approximate. A large sample data is required for making accurate statistical inferences. A normal sampling distribution is used instead of a t-distribution for statistical inference.
When we do not have a good fit in generalized linear models, it may be that:
We need to transform some of the predicting variables or to include other variables. The variability of the expected rate is higher than estimated. There may be leverage points that need to be explored further.
Log odds: percent change vs. factor change
e^beta is the factor by which the odds change for a 1-unit increase in the predictor (exp(1.084) = 2.956 in this case). The percentage change in the odds is e^beta - 1 when beta is positive (an increase) or 1 - e^beta when beta is negative (a decrease). Here exp(1.084) - 1 = 1.956, that is, a 195.6% increase in the odds.
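The two readings of the same coefficient can be computed side by side. This sketch uses the value 1.084 quoted on the card; the variable names are illustrative.

```python
import math

beta = 1.084  # coefficient on the log-odds scale, as quoted on the card

# Factor change: the odds are multiplied by exp(beta) per 1-unit increase.
factor_change = math.exp(beta)

# Percent change: how much the odds go up (positive) or down (negative).
percent_change = (factor_change - 1.0) * 100.0

print(f"factor change in odds: {factor_change:.3f}")
print(f"percent change in odds: {percent_change:.1f}%")
```

Computing the percent change as exp(beta) - 1 in both cases and reading the sign gives the same information as the card's two separate formulas for positive and negative beta.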
What is the probability of survival for a botrytis blight sample exposed to a copper concentration of 0.6 and a sulfur concentration of 0.6?
p = e^(beta_0+beta_1 * x1+beta_2 * x2)/(1+e^(beta_0+beta_1 * x1+beta_2 * x2)) = e^(3.58770-4.32735 * 0.6-0.27483 * 0.6)/(1+e^(3.58770-4.32735 * 0.6-0.27483 * 0.6)) = 0.696
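The predicted probability can be verified numerically. This is a sketch using the intercept and the two coefficients quoted in these notes (-4.32735 for concS and -0.27483 for concCu); the variable names are illustrative.

```python
import math

beta_0 = 3.58770         # intercept
beta_conc_s = -4.32735   # coefficient for sulfur concentration (concS)
beta_conc_cu = -0.27483  # coefficient for copper concentration (concCu)

conc_s = 0.6
conc_cu = 0.6

# Linear predictor on the log-odds scale.
eta = beta_0 + beta_conc_s * conc_s + beta_conc_cu * conc_cu

# Inverse logit turns the log odds into a probability.
p_survival = math.exp(eta) / (1.0 + math.exp(eta))

print(f"log odds = {eta:.4f}, probability of survival = {p_survival:.3f}")
```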
