Regression Analysis - Modules 1-5
Which is correct? a) If we reject the test of equal means, we conclude that all treatment means are not equal. b) If we do not reject the test of equal means, we conclude that the means are definitely all equal. c) If we reject the test of equal means, we conclude that some treatment means are not equal. d) None of the above.
c) If we reject the test of equal means, we conclude that some treatment means are not equal.
Assumptions of Linear Regression
Linearity/Mean Zero Assumption, Constant Variance Assumption, Independence Assumption, Normality Assumption
Regression Analysis is used for:
Prediction of the response variable; modelling the relationship between the response variable and the explanatory variables; testing hypotheses of association relationships (the relationships between variables)
R Code for the Overall Regression P-Value
# For a fitted glm object `model`, the test statistic is the drop in deviance:
1 - pchisq(model$null.deviance - model$deviance, model$df.null - model$df.residual)
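As a minimal worked sketch (the model and predictors here are illustrative, using R's built-in mtcars data):
# Fit a logistic regression and test the overall significance of the model.
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
1 - pchisq(fit$null.deviance - fit$deviance, fit$df.null - fit$df.residual)
# A small p-value suggests at least one coefficient is nonzero.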
Assumptions of a Logistic Regression Model
1. Linearity Assumption: There is a linear relationship between the link function and the predictors 2. Independence Assumption: The response variables are independent random variables 3. The link function is the logit function.
Modeling a simple ______ relationship between two factors (X and Y) through a ______ statistical model (simple linear regression)
Deterministic, Non-Deterministic
(True/False) A logistic regression model has the same four model assumptions as a multiple linear regression model.
False
(True/False) It is good practice to perform variable selection based on the statistical significance of the regression coefficients.
False
(True/False) Mallows' Cp statistic penalizes for complexity of the model more than both leave-one-out CV and BIC.
False, BIC penalizes complexity more than other approaches.
(True/False) If a logistic regression model provides accurate classification, then we can conclude that it is a good fit for the data.
False, accurate classification (good prediction) does not guarantee that the model is a good fit, and vice versa.
(True/False) The normality assumption states that the response variable is normally distributed.
False, the normality assumption states that the error terms are normally distributed; the response variable itself may or may not be normally distributed.
(True/False) In a multiple linear regression model, when more predictors are added, R^2 can decrease if the added predictors are unrelated to the response variable.
False, R^2 never decreases as more predictors are added to a multiple linear regression model.
(True/False) Ridge regression is a regularized regression approach that can be used for variable selection.
False, Ridge regression does not perform variable selection.
(True/False) In multiple linear regression, a VIF value of 6 for a predictor means that 90% of the variation in that predictor can be modeled by the other predictors.
False, since VIF = 1/(1 − R^2), a VIF of 6 implies R^2 = 1 − 1/6 ≈ 0.833, so 83.3% of the variation in that predictor can be modeled by the other predictors in the model (a VIF of 10 would correspond to 90%).
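A minimal R sketch of this relationship (predictors chosen arbitrarily from the built-in mtcars data):
# VIF for one predictor = 1 / (1 - R^2) from regressing it on the others.
r2 <- summary(lm(disp ~ wt + hp, data = mtcars))$r.squared
1 / (1 - r2)  # the VIF implied by that R^2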
(True/False) In a multiple linear regression model with n observations, all observations with Cook's distance greater than 4/n should always be discarded from the model.
False, an observation should not be discarded just because it is found to be an outlier. We must investigate the nature of the outlier before deciding to discard it.
(True/False) In a simple linear regression model, any outlier has a significant influence on the estimated slope parameter.
False, an outlier does not necessarily have a large influence on model parameters. When it does, we call it an influential point.
(True/False) Backward stepwise regression is computationally preferable over forward stepwise regression.
False, backward stepwise regression is more computationally expensive than forward stepwise regression.
(True/False) You obtained a statistically significant F-statistic when testing for equal means across four groups. The number of unique pairwise comparisons that could be performed is seven.
False, for k = 4 treatments, there are k(k − 1)/2 = 6 unique pairs of treatments, so the number of unique pairwise comparisons that could be performed is six.
(True/False) Stepwise regression is a greedy algorithm searching through all possible combinations of the predicting variables to find the model with the best score.
False, forward stepwise regression does not check all possible combinations.
(True/False) Consider a multiple linear regression model with intercept. If two predicting variables are categorical and each variable has three categories, then we need to include five dummy variables in the model.
False, in a multiple linear regression model with intercept, if two predicting variables are categorical and both have 3 categories, then we need to include 4 dummy variables in the model.
(True/False) In a simple linear regression model, given a significance level α, the (1 − α)100% confidence interval for the mean response should be wider than the (1 − α)100% prediction interval for a new response at the predictor's value x*.
False, in a simple linear regression model, given a significance level α, the (1−α)100% confidence interval for the mean response should be narrower than the (1−α)100% prediction interval for a new response at the predictor's value x*.
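A quick illustrative check in R (the model and value of x* are arbitrary, using the built-in mtcars data):
# The prediction interval is wider than the confidence interval at the same x*.
fit <- lm(mpg ~ wt, data = mtcars)
x_star <- data.frame(wt = 3)
predict(fit, x_star, interval = "confidence", level = 0.95)
predict(fit, x_star, interval = "prediction", level = 0.95)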
(True/False) In multiple linear regression, the prediction of the response variable and the estimation of the mean response have the same interpretation.
False, in multiple linear regression, the prediction of the response variable and the estimation of the mean response do not have the same interpretation.
(True/False) For a multiple linear regression model to be a good fit, we need the linearity assumption to hold for only one of the predicting variables.
False, in multiple linear regression, the linearity assumption must hold for all of the predicting variables for the model to be a good fit. If linearity does not hold for one or more predicting variables, we can transform those predictors to improve the linearity assumption.
(True/False) It is good practice to create a multiple linear regression model using a linearly dependent set of predictor variables.
False, it is good practice to create a multiple linear regression model using a linearly independent set of predicting variables. X^T X is not invertible if the columns of X are linearly dependent, i.e. one predicting variable, corresponding to one column, is a linear combination of the others.
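A small simulated illustration of this point (all names hypothetical):
# With an exactly linearly dependent column, lm() cannot estimate its coefficient.
set.seed(1)
x1 <- rnorm(20); x2 <- rnorm(20)
x3 <- x1 + x2                  # linear combination of the other predictors
y  <- 1 + x1 - x2 + rnorm(20)
coef(lm(y ~ x1 + x2 + x3))     # the coefficient for x3 is reported as NA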
(True/False) With k-fold cross validation larger k values increase bias and reduce variance.
False, larger values of k decrease bias and increase variance.
(True/False) In ANOVA, the linearity assumption is assessed using a plot of the response against the predicting variable.
False, linearity is not an assumption of ANOVA.
(True/False) Complex models with many predictors are often extremely biased, but have low variance.
False, models with many predictors tend to have low bias and high variance.
(True/False) Multicollinearity in multiple linear regression means that the rows in the design matrix are (nearly) linearly dependent.
False, multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.
(True/False) Multicollinearity among the predicting variables will not impact the standard errors of the estimated regression coefficients.
False, multicollinearity among the predicting variables does impact the standard errors of the estimated coefficients: the standard errors become artificially large.
(True/False) The p-value is a measure of the probability of rejecting the null hypothesis.
False, the p-value measures the strength of the evidence against the null hypothesis. It is not the probability of rejecting the null hypothesis, nor the probability that the null hypothesis is true.
(True/False) When testing a subset of coefficients, deviance follows a chi-square distribution with q degrees of freedom, where q is the number of regression coefficients in the reduced model.
False, q is the difference between the number of coefficients in the full model and the reduced model.
(True/False) In a multiple linear regression model, the adjusted R^2 measures the goodness of fit of the model.
False, the adjusted R^2 is not a measure of Goodness of Fit. R^2 and adjusted R^2 measure the ability of the model and the predictor variables to explain the variation in the response. Goodness of Fit refers to having all model assumptions satisfied.
(True/False) In ANOVA, when testing for equal means across groups, the alternative hypothesis is that the means are not equal between two groups for all pairs of means/groups.
False, the alternative is that at least one pair of groups has unequal means.
(True/False) Under Poisson regression, the sampling distribution used for a coefficient estimator is a chi-squared distribution when the sample size is large.
False, the coefficient estimator follows an approximate normal distribution.
(True/False) The estimated variance of the error terms of a multiple linear regression model with intercept can be obtained by summing up the squared residuals and dividing the sum by n - p, where n is the sample size and p is the number of predictors.
False, the estimated variance of the error terms of a multiple linear regression model with intercept is obtained by summing the squared residuals and dividing by n - p - 1, where n is the sample size and p is the number of predictors, since we lose p + 1 degrees of freedom estimating the p coefficients and the intercept.
(True/False) In simple linear regression models, we lose three degrees of freedom when estimating the variance because of the estimation of the three model parameters β_0, β_1, σ^2.
False, the estimator σ^2 hat is the sum of the squared residuals divided by n - 2. We lose two degrees of freedom because σ^2 hat uses the estimates of β_0 and β_1 in its calculation.
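A minimal sanity check in R (illustrative model on the built-in mtcars data):
# sigma^2 hat = RSS / (n - 2) in simple linear regression.
fit <- lm(mpg ~ wt, data = mtcars)
sum(residuals(fit)^2) / (nrow(mtcars) - 2)
summary(fit)$sigma^2   # same value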
(True/False) In regularized regression, the penalization is generally applied to all regression coefficients (β_0, ..., β_p), where p = number of predictors.
False, the intercept (β0) is not used in the penalization.
(True/False) The log-likelihood function is a linear function with a closed-form solution.
False, the log-likelihood function is non-linear in the coefficients; maximizing it has no closed-form solution, so numerical methods must be used.
(True/False) A multiple linear regression model contains 6 quantitative predicting variables and an intercept. The number of parameters to estimate in this model is 7.
False, the number of parameters to estimate in a multiple linear regression model containing 6 quantitative predicting variables and an intercept is 8: 7 regression coefficients (β_0, β_1, ..., β_6) and the variance of the error terms (σ^2).
(True/False) The sampling distribution for the variance estimator in simple linear regression is χ2 (chi-squared) regardless of the assumptions of the data.
False, the sampling distribution of the estimator of the variance is chi-squared, with n - 2 degrees of freedom, under the assumption of normality of the error terms.
(True/False) The number of parameters that need to be estimated in a logistic regression model with 5 predicting variables and an intercept is the same as the number of parameters that need to be estimated in a standard linear regression model with an intercept and same predicting variables.
False, there are no error terms in logistic regression, so we have one less parameter than linear regression.
(True/False) The causation of a predicting variable to the response variable can be captured using multiple linear regression on observational data, conditional on other predicting variables in the model.
False, regression on observational data captures association, not causation. Causal statements can only be made in a controlled setting such as a randomized trial or experiment.
(True/False) Variable selection is a simple and completely solved statistical problem since we can implement it using the R statistical software.
False, variable selection is an "unsolved" problem and needs to be tailored to the problem at hand.
(True/False) It is good practice to perform a goodness-of-fit test on logistic regression models without replications.
False, we can only define residuals for binary data with replications and residuals are needed for a goodness of fit test.
(True/False) Conducting t-tests on each β parameter in a multiple linear regression model is preferable to an F-test when testing the overall significance of the model.
False, we cannot and should not select the combination of predicting variables that most explains the variability in the response based on the t-tests for statistical significance because the statistical significance depends on what other variables are in the model.
(True/False) Given a quantitative predicting variable and a qualitative predicting variable with 7 categories in a linear regression model with intercept, 7 dummy variables need to be included in the model.
False, when we have qualitative variables with k levels, we only include k-1 dummy variables if the regression model has an intercept.
(True/False) With the Box-Cox transformation, when λ = 0 we do not transform the response.
False, when λ = 0, we transform the response using the natural log.
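A minimal sketch using the MASS package (the model is illustrative):
# Profile the Box-Cox log-likelihood; a lambda near 0 suggests a log transform.
library(MASS)
fit <- lm(mpg ~ wt, data = mtcars)
bc <- boxcox(fit, plotit = FALSE)
bc$x[which.max(bc$y)]   # lambda maximizing the likelihood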
(True/False) The simple linear regression coefficient, β_0 hat, is used to measure the linear relationship between the predicting and response variables.
False, β_0 hat is the intercept and does not tell us about the relationship between the predicting and response variables.
(True/False) β_1 hat is an unbiased estimator for β_0.
False, β_1 hat is an unbiased estimator for β_1.
The predicting variable is a _____ variable. It does not change with the response but is fixed before the response is measured.
Fixed
Assumptions: 1. ______ 2. ______ 3. ______ 4. ______
Linearity, Constant Variance, Independence, and Normality Assumptions
Regression Analysis is used for: 1. ______ of the response variable 2. ______ the relationship between the response variable and the explanatory variables 3. ______ hypotheses of association relationships (the relationships between variables)
Prediction, Modelling, Testing
The response variable is a ______ variable. It varies with changes in the predictor(s) along with other random changes.
Random
(True/False) A binary response variable with replications in logistic regression has a Binomial distribution.
True
(True/False) An overdispersion parameter close to 1 indicates that the variability of the response is close to the variability estimated by the model.
True
(True/False) Elastic net regression uses both penalties of ridge and lasso regression and hence combines the benefits of both.
True
(True/False) For Generalized Linear Models, the null hypothesis of the Goodness of Fit hypothesis test is that the model is a good fit.
True
(True/False) If there are specific variables that are required to control the bias selection in the model, they should be forced into the model and not be part of the variable selection process.
True
(True/False) In a Poisson regression model, we use a chi-squared test to test the overall regression.
True
(True/False) In a multiple linear regression model, the R^2 measures the proportion of total variability in the response variable that is captured by the regression model.
True
(True/False) In ridge regression, when the penalty constant lambda (λ) equals zero, the corresponding ridge coefficient estimates are the same as the ordinary least squares estimates.
True
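An illustrative check using MASS::lm.ridge (the model is chosen arbitrarily):
# At lambda = 0, ridge estimates reduce to ordinary least squares.
library(MASS)
coef(lm.ridge(mpg ~ wt + hp, data = mtcars, lambda = 0))
coef(lm(mpg ~ wt + hp, data = mtcars))   # should match, up to numerical precision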
(True/False) It is required to standardize or rescale the predicting variables when performing regularized regression.
True
(True/False) Ridge regression can be used to deal with problems caused by high correlation among the predictors.
True
(True/False) Akaike Information Criterion (AIC) is an estimate for the prediction risk.
True, AIC is an estimate for the prediction risk that includes a complexity penalty term.
(True/False) The L1 penalty measures the sparsity of a vector and forces regression coefficients to be zero.
True, LASSO performs variable selection by measuring sparsity and forcing some regression coefficients to zero.
(True/False) The lasso regression requires a numerical algorithm to minimize the penalized sum of least squares.
True, LASSO regression does not have a closed form solution to find the regression coefficients.
(True/False) The mean sum of squared errors in ANOVA measures variability within groups.
True, MSE = within-group variability.
(True/False) Simpson's Paradox occurs when a coefficient reverses its sign when used in a marginal versus a conditional model.
True, Simpson's Paradox: Reversal of an association when looking at a marginal relationship versus a conditional relationship.
(True/False) With Poisson regression, the variance of the response is not constant.
True, V(Y|x_1, ..., x_p) = exp(β_0 + β_1 x_1 + ... + β_p x_p).
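A small illustrative fit (treating a built-in count variable as the response; the model is hypothetical):
# In Poisson regression the fitted variance equals the fitted mean,
# so it varies with the predictors rather than staying constant.
fit <- glm(carb ~ hp, data = mtcars, family = poisson)
head(fitted(fit))   # estimated E(Y|x), which is also the estimated Var(Y|x)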
(True/False) A negative value of β_1 is consistent with an inverse relationship between the predictor variable and the response variable.
True, a negative value of β_1 is consistent with an inverse relationship.
(True/False) The penalty constant lambda (λ) in penalized regression controls the trade-off between lack of fit and model complexity.
True, as λ increases, model complexity is penalized more heavily, and a simpler model (that potentially does not fit the data as well) is chosen.
(True/False) In the case of multiple linear regression, controlling variables are used to control for sample bias.
True, controlling variables can be used to control for bias selection in a sample.
(True/False) When conducting ANOVA, the larger the between-group variability is relative to the within-group variability, the larger the value of the F-statistic will tend to be.
True, given the formula of the F-statistic, a larger numerator (the between-group variability) relative to the denominator (the within-group variability) results in a larger F-statistic.
(True/False) A linear regression model has high explanatory power if the coefficient of determination is close to 1.
True, if R^2 is close to 1, almost all of the variability in Y can be explained by the linear regression model; hence, the model has high explanatory power.
(True/False) If the pairwise comparison interval between groups in an ANOVA model includes zero, we conclude that the two means are plausibly equal.
True, if the comparison interval includes zero, then the two means are not statistically significantly different and are thus plausibly equal.
(True/False) In a simple linear regression model, we need the normality assumption to hold for deriving a reliable prediction interval for a new response.
True, if the model assumptions are violated, we cannot reliably use the model for statistical inference.
(True/False) If the residuals are not normally distributed, we can model the transformed response variable instead, where a common transformation for normality is the Box-Cox transformation.
True, if the normality assumption does not hold, we can use a transformation that normalizes the response variable such as Box-Cox transformation.
(True/False) An example of a multiple linear regression model is Analysis of Variance (ANOVA).
True, multiple linear regression is a generalization of both ANOVA and simple linear regression models.
(True/False) The presence of certain types of outliers, such as influential points, can impact the statistical significance of some of the regression coefficients.
True, outliers that are influential can impact the statistical significance of the beta parameters.
(True/False) In ANOVA, to test the null hypothesis of equal means across groups, the variance of the response variable must be the same across all groups.
True, that is one of the assumptions of ANOVA.
(True/False) Generalized linear models, like logistic regression, use a Wald test to determine the statistical significance of the coefficients.
True, the coefficient estimates follow an approximate normal distribution and a z-test, also known as a Wald test, is used to determine their statistical significance.
(True/False) The prediction interval of one member of the population will always be larger than the confidence interval of the mean response for all members of the population when using the same predicting values.
True, the confidence intervals for the mean response are narrower than the prediction intervals because the prediction intervals include additional variance from the variation of a new measurement.
(True/False) For Generalized Linear Models, including Poisson regression, the deviance residuals should approximately follow the standard normal distribution if the model is a good fit for the data.
True, the deviance residuals are approximately N(0,1) if the model is a good fit.
(True/False) The estimated regression coefficients in Poisson regression are approximate.
True, the estimated parameters and their standard errors are approximate estimates.
(True/False) For a given predicting variable, the corresponding estimated regression coefficient will likely be different in a conditional model versus a marginal model.
True, the estimated regression coefficients for the conditional and marginal relationships can be different, not only in magnitude but also in sign or direction of the relationship.
(True/False) In multiple linear regression, the estimated regression coefficient corresponding to a quantitative predicting variable is interpreted as the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable holding all other predictors fixed.
True, the estimated value for one of the regression coefficient β_i represents the estimated expected change in y associated with one unit of change in the corresponding predicting variable, X_i, holding all else in the model fixed.
(True/False) The logit function is the log of the ratio of the probability of success to the probability of failure. It is also known as the log odds function.
True, the logit function is f(p) = ln( p / (1 - p) ).
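A quick numeric illustration using base R's built-in logit functions:
p <- 0.8
qlogis(p)          # logit: log(p / (1 - p)) = log(4) ≈ 1.386
log(p / (1 - p))   # same value computed directly
plogis(qlogis(p))  # the inverse logit recovers p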
(True/False) For logistic regression, if the p-value of the deviance test for goodness-of-fit is large, then it suggests that the model is a good fit.
True, the null hypothesis is that the model fits the data, so large p-values suggest that the model is a good fit.
(True/False) The pooled variance estimator, s_(pooled)^2, in ANOVA is synonymous with the variance estimator, σ^2 hat, in simple linear regression because they both use mean squared error (MSE) for their calculations.
True, the pooled variance estimator is, in fact, the variance estimator.
(True/False) Suppose that we have a multiple linear regression model with k quantitative predictors, a qualitative predictor with l categories and an intercept. Consider the estimated variance of error terms based on n observations. The estimator should follow a chi-square distribution with n − k − l degrees of freedom.
True, the variance estimator follows the chi-squared distribution with n - p degrees of freedom, where p = k + (l - 1) + 1 = k + l is the number of regression coefficients, giving n - k - l degrees of freedom.
(True/False) If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable.
True, this is important since without a good fit, we cannot rely on the statistical inference. Only when the model is a good fit, i.e. all model assumptions hold, can we rely on the statistical inference.
(True/False) Cook's distance (D_i) measures how much the fitted values in a multiple linear regression model change when the ith observation is removed.
True, this is the distance between the fitted values of the model with all the observations versus the fitted values of the model discarding the i-th observation from the data used to fit the model.
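A minimal sketch (illustrative model on the built-in mtcars data):
# Cook's distance for each observation; flag candidates, then investigate them.
fit <- lm(mpg ~ wt + hp, data = mtcars)
d <- cooks.distance(fit)
which(d > 4 / nrow(mtcars))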
(True/False) The training risk is not an unbiased estimator of the prediction risk.
True, training risk is a biased estimator of prediction risk.
(True/False) Under the normality assumption, the estimator for β_1 is a linear combination of normally distributed random variables.
True, under the normality assumption, β_1 hat is a linear combination of normally distributed random variables.
(True/False) If the model assumptions hold, then the estimator for the variance, σ^2, is a random variable.
True, we assume that the error terms are independent random variables, so the residuals are independent random variables. Since σ^2 hat is a function of the residuals, it is also a random variable.
(True/False) In a simple linear regression model, we can assess if the residuals are correlated by plotting them against fitted values.
True, we can assess whether residuals are correlated or not, but we can't directly assess whether they are independent or not.
(True/False) In a simple linear regression model, given a significance level α, if the (1 − α)100% confidence interval for a regression coefficient does not include zero, we conclude that the coefficient is statistically significant at the α level.
True, we can conclude that the coefficient is statistically significant if 0 is not included in the confidence interval.
(True/False) Although there are no error terms in a logistic regression model using binary data with replications, we can still perform residual analysis.
True, we can perform residual analysis on the Pearson residuals or the Deviance residuals.
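A minimal sketch with hypothetical grouped (replicated) binary data:
# Successes out of 50 trials at each of five dose levels (made-up numbers).
dose <- c(1, 2, 3, 4, 5)
succ <- c(5, 12, 25, 38, 46)
fit <- glm(cbind(succ, 50 - succ) ~ dose, family = binomial)
residuals(fit, type = "pearson")    # Pearson residuals
residuals(fit, type = "deviance")   # deviance residuals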
(True/False) An ANOVA model with a single qualitative predicting variable containing k groups will have k + 1 parameters to estimate.
True, we have to estimate the means of the k groups and the pooled variance estimator, s_(pooled)^2.
(True/False) In logistic regression, the relationship between the probability of success and the predicting variables is nonlinear.
True, we model the probability of success given the predictors by linking the probability to the predicting variables through a nonlinear link function.
(True/False) A partial F-Test can be used to test whether the regression coefficients associated with a subset of the predicting variables in a multiple linear regression model are all equal to zero.
True, we use the partial F-test to test the null hypothesis that the regression coefficients associated with a subset of the predicting variables are all equal to zero. The alternative hypothesis is that at least one of these regression coefficients is not zero.
(True/False) When selecting variables for explanatory purpose, one might consider including predicting variables which are correlated if it would help answer your research hypothesis.
True, when the objective is to explain the relationship to the response, one might consider including the predicting variables even if they are correlated.
Variable selection methods are performed by: a) Balancing the bias-variance trade-off. b) Minimizing the training risk. c) Excluding the predicting variables for which the regression coefficients are not statistically significant. d) All of the above.
a) Balancing the bias-variance trade-off.
Assuming that the data are normally distributed, the estimated variance has the following sampling distribution under the simple linear model: a) Chi-square with n-2 degrees of freedom b) T-distribution with n-2 degrees of freedom c) Chi-square with n degrees of freedom d) T-distribution with n degrees of freedom
a) Chi-square with n-2 degrees of freedom
Which one is correct? a) Forward stepwise regression adds one variable to the model at a time starting with a minimum model. b) Backward stepwise regression adds one variable to the model at a time starting with the minimum model. c) It is always feasible to apply the backward model search for all possible combinations of the predicting variables. d) None of the above.
a) Forward stepwise regression adds one variable to the model at a time starting with a minimum model.
The estimators of the linear regression model are derived by: a) Minimizing the sum of squared differences between observed and expected values of the response variable. b) Maximizing the sum of squared differences between observed and expected values of the response variable. c) Minimizing the sum of absolute differences between observed and expected values of the response variable. d) Maximizing the sum of absolute differences between observed and expected values of the response variable.
a) Minimizing the sum of squared differences between observed and expected values of the response variable.
In logistic regression: a) The estimation of the regression coefficients is based on maximum likelihood estimation. b) We can derive exact (closed-form expression) estimates for the regression coefficients. c) The estimation of the regression coefficients is based on minimizing the sum of least squares. d) All of the above.
a) The estimation of the regression coefficients is based on maximum likelihood estimation.
The pooled variance estimator is: a) The variance estimator assuming equal variances. b) The variance estimator assuming equal means and equal variances. c) The sample variance estimator assuming equal means. d) None of the above.
a) The variance estimator assuming equal variances.
The mean squared errors (MSE) measures: a) The within-treatment variability. b) The between-treatment variability. c) The sum of the within-treatment and between-treatment variability. d) None of the above.
a) The within-treatment variability.
The objective of the residual analysis is a) To evaluate goodness of fit b) To evaluate whether the means are equal. c) To evaluate whether only the normality assumption holds. d) None of the above.
a) To evaluate goodness of fit
Residual analysis in Poisson regression can be used: a) To evaluate goodness of fit of the model. b) To evaluate whether the error terms are uncorrelated. c) To evaluate whether the error terms are Poisson distributed. d) All of the above.
a) To evaluate goodness of fit of the model.
In Poisson regression: a) We model the log of the expected response variable, not the expected log response variable. b) We use the ordinary least squares to fit the model. c) There is an error term. d) None of the above.
a) We model the log of the expected response variable, not the expected log response variable.
Which one is correct? a) We use a chi-square testing procedure to test whether a subset of regression coefficients are zero in Poisson regression. b) The test for subsets of regression coefficients is a goodness of fit test. c) The test for subsets of regression coefficients is reliable for small sample data in Poisson regression. d) None of the above.
a) We use a chi-square testing procedure to test whether a subset of regression coefficients are zero in Poisson regression.
We detect departure from the assumption of constant variance a) When the residuals increase as the fitted values increase. b) When the residuals vs fitted are scattered randomly around the zero line. c) When the histogram does not have a symmetric shape. d) All of the above.
a) When the residuals increase as the fitted values increase.
The sampling distribution of β_0 hat is a: a) t-distribution b) chi-squared distribution c) normal distribution d) None of the above
a) t-distribution. The estimator β_0 hat is normally distributed, but because the error variance is estimated from the sample, the standardized β_0 hat follows a t-distribution.
The estimated versus predicted regression line for a given x*: a) Have the same variance b) Have the same expectation c) Have the same variance and expectation d) None of the above
b) Have the same expectation
Comparing cross-validation methods: a) The random sampling approach is more computationally efficient than leave-one-out cross validation. b) In K-fold cross-validation, the larger K is, the higher the variability in the estimation of the classification error is. c) Leave-one-out cross validation is a particular case of the random sampling cross-validation. d) None of the above.
b) In K-fold cross-validation, the larger K is, the higher the variability in the estimation of the classification error is.
Which one is correct? a) We can evaluate the goodness of fit a model using the testing procedure of the overall regression. b) In applying the deviance test for goodness of fit in logistic regression, we seek large p-values, that is, not reject the null hypothesis. c) There is no error term in logistic regression and thus we cannot perform a goodness of fit assessment. d) None of the above.
b) In applying the deviance test for goodness of fit in logistic regression, we seek large p-values, that is, not reject the null hypothesis.
Which one is correct? a) The logit link function is the only link function that can be used for modeling binary response data. b) Logistic regression models the probability of a success given a set of predicting variables. c) The interpretation of the regression coefficients in logistic regression is the same as for standard linear regression assuming normality. d) None of the above.
b) Logistic regression models the probability of a success given a set of predicting variables.
Which of the following is not an application of regression? a) Testing hypotheses b) Proving causation c) Predicting outcomes d) Modeling data
b) Proving causation
In logistic regression: a) The hypothesis test for subsets of coefficients is a goodness of fit test. b) The hypothesis test for subsets of coefficients is approximate; it relies on a large sample size. c) We can use the partial F test for testing whether a subset of coefficients are all zero. d) None of the above.
b) The hypothesis test for subsets of coefficients is approximate; it relies on a large sample size.
The fitted values are defined as: a) The difference between observed and expected responses. b) The regression line with parameters replaced with the estimated regression coefficients. c) The regression line. d) The response values.
b) The regression line with parameters replaced with the estimated regression coefficients.
The total sum of squares divided by N-1 is a) The mean sum of squared errors b) The sample variance estimator assuming equal means and equal variances c) The sample variance estimator assuming equal variances. d) None of the above.
b) The sample variance estimator assuming equal means and equal variances
The penalty constant 𝜆 in regularized regression has the role: a) To reduce the number of predicting variables in the model. b) To control the trade-off between lack of fit and model complexity. c) To select the best subset of predicting variables. d) All of the above.
b) To control the trade-off between lack of fit and model complexity.
The objective of the pairwise comparison is a) To find which means are equal. b) To identify the statistically significantly different means. c) To find the estimated means which are greater or lower than others. d) None of the above.
b) To identify the statistically significantly different means.
To test if a coefficient is less than a critical value, C, we conduct a one-sided test on the _________ tail of a ___________ distribution. a) left, normal b) left, t c) right, normal d) right, t e) None of the above
b) left, t. To test whether a coefficient is less than C, we look at the left tail of the t-distribution of the estimator; to test whether it is greater than C, we would instead use the right tail.
The F-test is a _________ tailed test with ______ and ______ degrees of freedom. a) one, k, N-1 b) one, k-1, N-k c) two, k-1, N-k d) two, k, N-1 e) None of the above.
b) one, k-1, N-k. The F-test is a one-tailed test with two degrees-of-freedom parameters, namely k − 1 and N − k.
Stepwise regression is: a) A model selection method that provides the best subset of regression coefficients given a model selection criteria. b) A model selection method that selects larger models than other approaches. c) A model selection method that can be performed in the R statistical software given a set of controlling factors that form the minimum or starting model. d) None of the above.
c) A model selection method that can be performed in the R statistical software given a set of controlling factors that form the minimum or starting model.
The assumption of normality: a) It is needed for deriving the estimators of the regression coefficients. b) It is not needed for linear regression modeling and inference. c) It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference. d) It is needed for deriving the expectation and variance of the estimators of the regression coefficients.
c) It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference.
Which one is correct? a) A multiple linear regression model with p predicting variables but no intercept has p model parameters. b) The interpretation of the regression coefficients is the same whether or not interaction terms are included in the model. c) Multiple linear regression is a general model encompassing both ANOVA and simple linear regression. d) None of the above.
c) Multiple linear regression is a general model encompassing both ANOVA and simple linear regression.
Which one is correct? a) Independence assumption can be assessed using the residuals vs fitted values. b) Independence assumption can be assessed using the normal probability plot. c) Residual analysis can be used to assess uncorrelated errors. d) None of the above
c) Residual analysis can be used to assess uncorrelated errors.
In ANOVA, for which of the following purposes is the Tukey method used? a) Test for homogeneity of variance b) Test for normality c) Test for differences in pairwise means d) Test for independence of errors
c) Test for differences in pairwise means
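A minimal sketch using R's built-in PlantGrowth data:
# Tukey pairwise comparisons after a one-way ANOVA.
fit <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(fit)   # intervals containing zero: means plausibly equal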
In logistic regression: a) We can perform residual analysis whether the binary response data are with or without replications. b) Residuals are derived as the fitted values minus the observed responses. c) The sampling distribution of the residuals is approximately normal if the model is a good fit. d) All of the above.
c) The sampling distribution of the residuals is approximately normal if the model is a good fit.
The variability in the prediction comes from: a) The variability due to a new measurement. b) The variability due to estimation c) The variability due to a new measurement and due to estimation. d) None of the above.
c) The variability due to a new measurement and due to estimation.
Which one is correct? a) Ridge regression is used for variable selection. b) The penalty for lasso regression is not a sparsity penalty. c) We can combine ridge and lasso regression into what we call the elastic net regression. d) All regularized regression approaches will select the same model.
c) We can combine ridge and lasso regression into what we call the elastic net regression.
Using the R statistical software to fit a logistic regression: a) We can use the lm() command. b) The input of the response variable is exactly the same if the binary response data are with or without replications. c) We can obtain both the estimates and the standard deviations of the estimates for the regression coefficients. d) None of the above.
c) We can obtain both the estimates and the standard deviations of the estimates for the regression coefficients.
Which one is correct? a) The selected variables using variable selection approaches are the only variables that explain the response variable. b) If a predicting variable is selected to be in the model, we conclude that the predicting variable has a causal relationship with the response variable. c) When selecting variables, we need to first establish which variables are used for controlling bias selection in the sample and which are explanatory. d) Variable selection always provides the best model in terms of goodness of fit, particularly for models with a large number of predictors.
c) When selecting variables, we need to first establish which variables are used for controlling bias selection in the sample and which are explanatory.
The alternative hypothesis of ANOVA can be stated as: a) the means of all pairs of groups are different b) the means of all groups are equal c) the means of at least one pair of groups is different d) None of the above
c) the means of at least one pair of groups is different. Using the hypothesis testing procedure for equal means, we test the null hypothesis that the means are all equal (μ_1 = μ_2 = ... = μ_k) against the alternative that some means are different. Not all means have to be different for the alternative hypothesis to be true; at least one pair of means needs to differ.
In evaluating a multiple linear model, a) The F test is used to evaluate the overall regression. b) The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. c) Residual analysis is used for goodness of fit assessment. d) All of the above.
d) All of the above.
In evaluating a simple linear model a) There is a direct relationship between the coefficient of determination and the correlation between the predicting and response variables. b) The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. c) Residual analysis is used for goodness of fit assessment. d) All of the above.
d) All of the above.
In variable selection: a) The advantage of having a biased model with fewer predicting variables is the reduction in uncertainty of predictions of future responses. b) Complexity is equivalent to a large model with many predicting variables. c) The goal is to balance the trade-off between bias (fewer predictors) and variance (more predictors). d) All of the above.
d) All of the above.
Logistic regression is different from standard linear regression in that: a) It does not have an error term b) The response variable is not normally distributed. c) It models probability of a response and not the expectation of the response. d) All of the above.
d) All of the above.
Logistic regression is different from standard linear regression in that: a) The sampling distribution of the regression coefficient is approximate. b) A large sample data is required for making accurate statistical inferences. c) A normal sampling distribution is used instead of a t-distribution for statistical inference. d) All of the above.
d) All of the above.
Poisson regression can be used: a) To model count data. b) To model rate response data. c) To model response data with a Poisson distribution. d) All of the above.
d) All of the above.
The objective of multiple linear regression is a) To predict future new responses b) To model the association of explanatory variables to a response variable accounting for controlling factors. c) To test hypothesis using statistical inference on the model. d) All of the above.
d) All of the above.
The sampling distribution of the estimated regression coefficients is a) Centered at the true regression parameters. b) The t-distribution assuming that the variance of the error term is unknown and replaced by its estimate. c) Dependent on the design matrix. d) All of the above.
d) All of the above.
Variable selection can be used: a) To deal with multicollinearity in multiple regression. b) To select among a large number of predicting variables. c) To fit a model when there are more predicting variables than observations. d) All of the above.
d) All of the above.
When do we use transformations? a) If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption. b) If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation. c) If the constant variance assumption does not hold, we transform the response variable. d) All of the above.
d) All of the above.
When we do not have a good fit in generalized linear models, it may be that: a) We need to transform some of the predicting variables or to include other variables. b) The variability of the expected rate is higher than estimated. c) There may be leverage points that need to be explored further. d) All of the above.
d) All of the above.
Which is correct? a) Prediction translates into classification of a future binary response in logistic regression. b) In order to perform classification in logistic regression, we need to first define a classifier for the classification error rate. c) One common approach to evaluate the classification error is cross-validation. d) All of the above.
d) All of the above.
Which one is correct? a) The estimated regression coefficients in ridge regression are obtained using an exact or closed-form expression. b) The estimated regression coefficients in lasso regression are obtained using a numeric algorithm. c) The estimated regression coefficients from lasso regression are less efficient than those from the ordinary least squares estimation approach. d) All of the above.
d) All of the above.
Which one is correct? a) The estimated regression coefficients and their standard deviations are approximate not exact in Poisson regression. b) We use the glm() R command to fit a Poisson linear regression. c) The interpretation of the estimated regression coefficients is in terms of the ratio of the response rates. d) All of the above.
d) All of the above.
Which one is correct? a) The prediction intervals need to be corrected for simultaneous inference when multiple predictions are made jointly. b) The prediction intervals are centered at the predicted value. c) The sampling distribution of the prediction of a new response is a t-distribution. d) All of the above.
d) All of the above.
Which one is correct? a) The standard normal regression, the logistic regression and the Poisson regression all fall under the generalized linear model framework. b) If we were to apply a standard normal regression to response data with a Poisson distribution, the constant variance assumption would not hold. c) The link function for the Poisson regression is the log function. d) All of the above.
d) All of the above.
Which one is correct? a) If a departure from normality is detected, we transform the predicting variable to improve upon the normality assumption. b) If a departure from the independence assumption is detected, we transform the response variable to improve upon this assumption. c) The Box-Cox transformation is commonly used to improve upon the linearity assumption. d) None of the above
d) None of the above
In Poisson regression: a) We make inference using t-intervals for the regression coefficients. b) Statistical inference relies on exact sampling distribution of the regression coefficients. c) Statistical inference is reliable for small sample data. d) None of the above.
d) None of the above.
In the presence of near multicollinearity, a) The coefficient of determination decreases. b) The regression coefficients will tend to be identified as statistically significant even if they are not. c) The prediction will not be impacted. d) None of the above.
d) None of the above.
We can test for a subset of regression coefficients a) Using the F statistic test of the overall regression. b) Only if we are interested whether additional explanatory variables should be considered in addition to the controlling variables. c) To evaluate whether all regression coefficients corresponding to the predicting variables excluded from the reduced model are statistically significant. d) None of the above.
d) None of the above.
Which are all the model parameters in ANOVA? a) The means of the k populations. b) The sample means of the k populations. c) The sample means of the k samples. d) None of the above.
d) None of the above.
Which one correctly characterizes the sampling distribution of the estimated variance? a) The estimated variance of the error term has a χ2 distribution regardless of the distribution assumption of the error terms. b) The number of degrees of freedom for the χ2 distribution of the estimated variance is n-p-1 for a model without intercept. c) The sampling distribution of the mean squared error is different from that of the estimated variance. d) None of the above.
d) None of the above.
Which one is correct? a) The residuals have constant variance for the multiple linear regression model. b) The residuals vs fitted can be used to assess the assumption of independence. c) The residuals have a t-distribution if the error term is assumed to have a normal distribution. d) None of the above.
d) None of the above.
Which one is correct? a) There are multiple model selection criteria that can be used and all provide the same penalization of the model complexity. b) The AIC is commonly used for prediction models since it penalizes the model complexity the most. c) The AIC and BIC cannot be used in selecting variables for generalized linear models. d) None of the above.
d) None of the above.
The estimators for the regression coefficients are: a) Biased but with small variance b) Biased with large variance c) Unbiased under normality assumptions but biased otherwise. d) Unbiased regardless of the distribution of the data.
d) Unbiased regardless of the distribution of the data.
The estimators for the regression coefficients are: a) Biased but with small variance b) Unbiased under normality assumptions but biased otherwise. c) Biased regardless of the distribution of the data. d) Unbiased regardless of the distribution of the data.
d) Unbiased regardless of the distribution of the data.
R Code for the Goodness of Fit P-Value
# For a fitted glm object `model`, test H0: the model is a good fit.
pchisq(model$deviance, model$df.residual, lower.tail = FALSE)
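A minimal worked sketch with hypothetical grouped binomial data:
# A large p-value means we do not reject H0 that the model is a good fit.
dose <- c(1, 2, 3, 4, 5)
succ <- c(5, 12, 25, 38, 46)
model <- glm(cbind(succ, 50 - succ) ~ dose, family = binomial)
pchisq(model$deviance, model$df.residual, lower.tail = FALSE)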