STA 363 2nd Semester Content
Residuals outside of +/- 3
outliers
What type of unusual observation can this statistic be used to identify? Standardized residuals
outliers
Which function is used to find best subsets regression?
regsubsets()
  Obesity   Deaths NonDeaths
1 Obese         16      2045
2 NotObese       7      1044
When calculating odds ratios to compare obese vs non-obese, the value is _____.
(16/2045)/(7/1044)
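The odds and odds ratio for this table can be checked by hand. The sketch below is plain Python arithmetic rather than the course's R workflow, and the variable names are illustrative only.

```python
# Counts copied from the obesity table in the card above.
obese_deaths, obese_nondeaths = 16, 2045
notobese_deaths, notobese_nondeaths = 7, 1044

# Odds = (number of "successes") / (number of "failures") within a group.
odds_obese = obese_deaths / obese_nondeaths
odds_notobese = notobese_deaths / notobese_nondeaths

# Odds ratio = ratio of the two odds (obese vs. non-obese).
odds_ratio = odds_obese / odds_notobese

# Probability uses the group total in the denominator instead.
prob_obese = obese_deaths / (obese_deaths + obese_nondeaths)

print(f"odds (obese)     = {odds_obese:.5f}")
print(f"odds (not obese) = {odds_notobese:.5f}")
print(f"odds ratio       = {odds_ratio:.3f}")
print(f"P(death | obese) = {prob_obese:.5f}")
```

Note that the odds use non-deaths in the denominator (16/2045), while the probability uses the row total (16/2061).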
What percentage of the data should be used for the training set? a) 75% b) 80% c) 90%
All are valid
What is the family for a logistic regression model?
Binomial
These are observations whose predictor values are far from the center of the predictor space; i.e., observations that are "extreme" in the Xs. (What are they called?)
High-leverage points
What type of unusual observation can this statistic be used to identify? Hat Values
High-leverage points
p/(1-p)
odds
Select the proper model for each scenario. A numerical response with categorical predictors
ANOVA
In R, how do we obtain plots displaying the influence of an observation?
Specify plots 4 and 6 in autoplot
What does the number of folds in cross-validation control?
How many groups we create from the data
These are relatively isolated observations that are poorly predicted by the fitted model; i.e., observations that are "extreme" in the Ys. (What are they called?)
Outliers
Put the following in the typical order for checking and fixing assumptions: constant variance, linearity, normality, independence, unusual observations
independence, linearity, constant variance, normality, unusual observations
If there is an interaction between two terms, you cannot interpret the main effects. (True or false)
True
Match each regression assumption violation (or other problem) with the most appropriate solution to fix it. Normality assumption violation
Transformation of the response
In the caret package in R, which function is used to run the cross-validation?
train
Select the proper model for each scenario. A numerical response with numerical predictors
(Multiple) Linear regression
How could you estimate the percentage of variability explained by a logistic regression model?
(Null deviance - Residual deviance)/Null deviance
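As a small worked example of this formula, the sketch below uses made-up deviance values (they are hypothetical, not taken from any model in this set) to show the calculation.

```python
# Hypothetical deviances, chosen only to illustrate the arithmetic.
null_deviance = 120.0      # deviance of the intercept-only model
residual_deviance = 90.0   # deviance of the fitted model

# Proportion of deviance explained, analogous to R-squared.
deviance_explained = (null_deviance - residual_deviance) / null_deviance

print(f"Proportion of deviance explained: {deviance_explained:.2f}")
```

With these illustrative numbers the model accounts for a quarter of the null deviance.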
What VIF value represents issues for multicollinearity?
>10
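The cutoff of 10 can be tied back to the standard VIF definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The helper name `vif` below is illustrative, and the R-squared values are hypothetical.

```python
def vif(r_squared: float) -> float:
    """Variance inflation factor for a predictor whose regression
    on the other predictors has the given R-squared."""
    return 1.0 / (1.0 - r_squared)

# Hypothetical R-squared values for illustration.
for r2 in (0.50, 0.80, 0.90, 0.95):
    print(f"R^2 = {r2:.2f} -> VIF = {vif(r2):.1f}")
```

Under this definition, a VIF above 10 corresponds to more than 90% of a predictor's variation being explained by the other predictors.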
_____ is basically total variation. _____ is basically error variation (the variation remaining after the model is fitted).
null deviance, residual deviance
exp(confint(fit))
##                    2.5 %      97.5 %
## (Intercept) 0.0005951928 0.001288398
## gendermales 0.3745835638 1.212538866
We are 95% confident that the _____ of concussion for _____ is/are between about 0.0006 and 0.0013.
odds, females
In the caret package in R, which function allows us to combine multiple validation studies to compare across different models?
resamples
What is the variation in Y that remains unexplained after we fit the regression model?
residual sum of squares
Consider the following two models. Model 1: Y = β0 + β1X1 + ϵ. Model 2: Y = β0 + β1X1 + β2X2 + β3X3 + ϵ. A reduced F-test is conducted to compare these models, and the F-statistic is 5.5 on 2 and 30 degrees of freedom (p-value is 0.00922701). Which of the following statements is true? a) Model 1 is preferred since p-value is small and we don't have evidence that adding X2 and X3 will improve the model. b) Model 1 significantly predicts the response variable Y. c) X2 and X3 are significant predictors. d) Model 2 is preferred since the p-value is small and we have evidence that adding X2 and X3 will improve the model.
d
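The quoted p-value can be reproduced without statistical software. When the numerator degrees of freedom equal 2, the upper tail of the F distribution has the closed form P(F > f) = (1 + 2f/df2)^(-df2/2). The sketch below relies on that special case only, so it is not a general F-test routine, and the function name is illustrative.

```python
def f_upper_tail_df1_equals_2(f_stat: float, df2: float) -> float:
    """Upper-tail probability of an F(2, df2) distribution.
    Valid only when the numerator degrees of freedom are exactly 2."""
    return (1.0 + 2.0 * f_stat / df2) ** (-df2 / 2.0)

# Values from the card: F = 5.5 on 2 and 30 degrees of freedom.
p_value = f_upper_tail_df1_equals_2(5.5, 30)
print(f"p-value = {p_value:.8f}")
```

In R the course would obtain this from `anova()` on the two fitted models; the closed form is only a check on the reported number.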
Which of the following best describes the relationship between leverage points and influential points? a) All high leverage points are influential. b) All influential points have high leverage. c) High leverage points cannot be influential. d) High leverage indicates that a point has the potential to be influential.
d
Which of the following is an advantage of a generalized linear model (GLM) compared to a linear model (LM)? a) The GLM always produces a smaller adjusted R-squared. b) The GLM is always a more accurate model. c) The GLM performs better than the LM when the errors are normally distributed. d) The GLM allows the errors to follow distributions that are non-normal.
d
Which of the following model validation methods does not depend on random selection of the training and testing sets? a) Single validation study b) None (All of these depend on random selection of the training and testing sets) c) 5-fold CV d) LOOCV
d
Look for natural gaps in the x-axis
high leverage points
A study was conducted on an experimental treatment for coronavirus. Patients were assigned to either the experimental treatment (treatment=1) or a placebo (treatment=0). Then they were followed until they either recovered (y=0) or died (y=1) from the disease. Age was included in the model as a covariate. The coefficients from the logistic regression model are given below. β_intercept = 0.011, β_treatment = −0.126, β_age = 0.019. Based on this model, what are the predicted odds of death for a 40-year-old subject on the treatment? (Round to 3 decimals.)
1.906
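The predicted odds come from exponentiating the linear predictor (the log odds). A minimal check of the arithmetic, using the coefficients given in the card and illustrative variable names:

```python
import math

# Coefficients from the card (on the log-odds scale).
b_intercept = 0.011
b_treatment = -0.126
b_age = 0.019

# Subject: 40 years old, on the experimental treatment (treatment = 1).
treatment, age = 1, 40

# Linear predictor = log odds of death.
log_odds = b_intercept + b_treatment * treatment + b_age * age

# Exponentiate to move from log odds to odds.
odds = math.exp(log_odds)

print(f"log odds = {log_odds:.3f}")
print(f"odds     = {odds:.3f}")
```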
  Obesity   Deaths NonDeaths
1 Obese         16      2045
2 NotObese       7      1044
What are the odds that an obese woman died from cardiovascular disease?
16/2045
  Obesity   Deaths NonDeaths
1 Obese         16      2045
2 NotObese       7      1044
What is the estimate of the probability that an obese woman died from cardiovascular disease?
16/2061
For a categorical variable with four categories, how many dummy variables will be used in a regression model?
3
In a 5-fold cross-validation study with 60 subjects, how many subjects will be in each training set?
48
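The training-set size follows from holding out one fold at a time. A short sketch of the arithmetic, assuming the sample divides evenly into folds as it does here (the function name is illustrative):

```python
def fold_sizes(n_subjects: int, k_folds: int) -> tuple[int, int]:
    """Return (training size, testing size) for each split of k-fold CV,
    assuming n_subjects is evenly divisible by k_folds."""
    test_size = n_subjects // k_folds      # one fold is held out
    train_size = n_subjects - test_size    # the other k-1 folds train the model
    return train_size, test_size

train_size, test_size = fold_sizes(60, 5)
print(f"5-fold CV, n=60: train on {train_size}, test on {test_size}")

# LOOCV is the special case where the number of folds equals n.
loo_train, loo_test = fold_sizes(60, 60)
print(f"LOOCV, n=60:     train on {loo_train}, test on {loo_test}")
```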
  Obesity   Deaths NonDeaths
1 Obese         16      2045
2 NotObese       7      1044
What are the odds that a non-obese woman died from cardiovascular disease?
7/1044
Consider the following two models. Model 1: Y = β0 + β1X1 + ϵ. Model 2: Y = β0 + β1X1 + β2X2 + β3X3 + ϵ. Which of the following represents the null hypothesis of a reduced F-test used to compare these models?
β2 = β3 = 0
How is best subsets regression different from stepwise selection?
Checks every combination of predictors
Match each regression assumption violation (or other problem) with the most appropriate solution to fix it. Unusual observations
Fit two models, one with and one without certain observations
What type of unusual observation can this statistic be used to identify? Cook's Distance (Cook's D)
Influential points
What type of unusual observation can this statistic be used to identify? DF-Betas
Influential points
Choose whether each of the following statements best describes k-fold CV or LOOCV. More heavily influenced by outliers
LOOCV
Choose whether each of the following statements best describes k-fold CV or LOOCV. RMSE and MAE estimates have higher variance
LOOCV
What is the typical link function for a logistic regression model?
Logit
Which of the following metrics can be used to measure the prediction error for classification problem?
Misclassification rate
Select the proper model for each scenario. A numerical response with numerical and categorical predictors
Multiple linear regression (if no interaction) or ANCOVA (if interaction)
exp(confint(fit))
##                    2.5 %      97.5 %
## (Intercept) 0.0005951928 0.001288398
## gendermales 0.3745835638 1.212538866
Based on the confidence interval for gender, is there a significant difference between the odds of concussion between males and females? Why?
No, confidence interval includes 1
  Obesity   Deaths NonDeaths
1 Obese         16      2045
2 NotObese       7      1044
Since the value of the odds ratio is about 1.17, this indicates that _____, because the ratio is close to 1.
Obesity may not have an effect on whether women died from cardiovascular disease
Which metrics are typically used to compare models in a validation study?
RMSE and MAE
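Both metrics are summaries of the prediction errors on the testing set. The sketch below computes them by hand for a tiny set of hypothetical observed and predicted values (the numbers are invented for illustration).

```python
import math

# Hypothetical testing-set values, for illustration only.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

errors = [obs - pred for obs, pred in zip(observed, predicted)]

# MAE: average absolute error.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error (penalizes large misses more).
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE  = {mae:.3f}")
print(f"RMSE = {rmse:.3f}")
```

Smaller values of either metric indicate better out-of-sample prediction; RMSE is never smaller than MAE.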
Match each regression assumption violation (or other problem) with the most appropriate solution to fix it. Multicollinearity
Remove one of the variables
Match the regression assumption violation (or other problem) with the most appropriate solution to fix it. Linearity assumption violation
Transformation of the predictors
Match each regression assumption violation (or other problem) with the most appropriate solution to fix it. Non-constant error variance
Transformation of the response
To test the full model in logistic regression, you should use what test?
Whole model likelihood ratio test (whole model LRT)
What statement best describes multicollinearity?
Two or more predictors highly correlated with each other
Which of the following metrics can be used to check for multicollinearity?
Variance inflation factors(VIFs)
What is an advantage of using model validation compared to using model-based metrics such as adjusted R-squared and AIC? a) Model validation removes the bias that comes from using the same data for both fitting and evaluating the models. b) Model validation is computationally faster. c) Model validation is always guaranteed to produce the same result. d) Model validation allows you to make predictions using the model.
a
Which of the following statements are true about backward selection? a) In backward selection, a model with all predictors is the starting model. b) In backward selection, one predictor is removed at a time. c) In backward selection, a model with a single predictor is the starting model.
a and b
Which of the following statements are true about odds? a) Odds are defined as the ratio of the probability of "success" and the probability of "failure". b) Odds are defined when the response variable has two possible outcomes, typically called "success" and "failure". c) Odds can only range between 0 and 1
a and b
Under which of the following situations does it make sense to delete an influential point from the analysis? a) The point is not representative of the study population. b) The model becomes significant if you remove it. c) It can be verified that the point is a data entry error. d) The point is an outlier.
a and c
Which of the following statements are true about model selection? a) The model selected based on best subsets regression can be the same as forward selection. b) Backward selection and forward selection always give the same model. c) AIC and BIC can pick different models.
a and c
Why is a linear regression model inappropriate for data with binary response? a) In a linear regression model, the response is a measured variable on a continuous scale. b) Linear regression has linearity assumption. c) Linear regression will generate meaningless predictions if it is used to model data with binary response.
a and c
Y = β0 + β1X + β2Z + β3(X⋅Z) + ϵ. If a model with only main effects is considered, that is, Y = β0 + β1X + β2Z + ϵ, which of the following statements are true? a) The regression models for Z=0 and Z=1 will have the same slope. b) The regression models for Z=0 and Z=1 will have the same intercept. c) The regression models for Z=0 and Z=1 will have different intercepts. d) The regression models for Z=0 and Z=1 will have different slopes.
a and c
When a categorical predictor is used for regression, which of the following are true? a) Unless the factor levels are specified, R will treat the categories based on their alphanumeric ordering. b) Categorical predictors cannot be used in regression settings. c) The interpretation of categorical variables is very difficult in regression settings. d) The intercept term of the regression model will correspond to one of the categories.
a and d
Which of the following statements are true about AIC? a) Models with smaller AIC are preferred. b) It is a function of residual sum of squares (RSS), sample size 𝑛 and the number of parameters 𝑝. c) AIC() is the function in R to extract AIC values.
a, b, and c
Which of the following statements are true about BIC? a) BIC() is the function in R to extract BIC values. b) Models with smaller BIC are preferred. c) It is a function of residual sum of squares (RSS), sample size 𝑛 and the number of parameters 𝑝.
a, b, and c
Which of the following are limitations of a validation study with a single holdout sample? a) A different split of the data could favor a different model. b) It may be biased because it does not use all of the information. c) You need at least 6000 observations. d) If the response has been transformed, the best model for the transformed response may not be the same as the original data.
a, b, and d
Which of the following statements are true about forward selection? a) In forward selection, each predictor is added one at a time. b) In forward selection, a model with intercept is the starting model. c) The selected models of the forward selection and backward selection always have the same set of predictors. d) You must specify the scope in forward selection.
a, b, and d
A repeated k-fold CV study has what advantages over LOOCV, k-fold CV studies, and single validation studies? a) It has less variability than a LOOCV study. b) It is more computationally costly than a LOOCV. c) It mitigates some of the inherent bias in a single validation study or k-fold CV study. d) It provides the practitioner with a distribution of RMSE and MAE values.
a, c, and d
Which of the following statements are true about model selection criteria? a) Models with larger adjusted R squared values are preferred. b) Models with larger R squared values are preferred. c) Models with smaller AIC values are preferred. d) Models with smaller BIC values are preferred.
a, c, and d
Which of the following statements is NOT true about best subsets regression? a) Different model sizes could be tried and compared to select the final model. b) Best subsets regression can provide subset models of different sizes. c) Different criteria can be used to select best subsets regression models.
all are true
Regression problems that have a mixture of both quantitative and qualitative predictors. (intro to dummy variables)
ANCOVA
Which of the following R functions is used to conduct a reduced F-test?
anova()
Consider the following two models. Model 1: Y = β0 + β1X1 + ϵ. Model 2: Y = β0 + β1X1 + β2X2 + β3X3 + ϵ. Which of the following represents the alternative hypothesis of a reduced F-test used to compare these models?
At least one of β2 and β3 is not 0
A study was conducted on an experimental treatment for coronavirus. Patients were assigned to either the experimental treatment (treatment=1) or a placebo (treatment=0). Then they were followed until they either recovered (y=0) or died (y=1) from the disease. Age was included in the model as a covariate. The coefficients from the logistic regression model are given below. β_intercept = 0.011, β_treatment = −0.126, β_age = 0.019. Based on the above scenario, what is the interpretation for the treatment coefficient? a) Adjusting for age, the odds of death are 0.882 lower in the treatment group than the control group. b) Adjusting for age, the odds of death are 0.882 times as high in the treatment group compared to the control group. c) Adjusting for age, the odds of death are 0.126 times as high in the treatment group compared to the control group. d) Adjusting for age, for every one unit increase in treatment, the odds of death decrease by 0.126.
b
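The 0.882 in the correct option is the exponentiated treatment coefficient, which is an odds ratio. A quick check of that number, with an illustrative variable name:

```python
import math

# Treatment coefficient from the card (log odds ratio, adjusted for age).
b_treatment = -0.126

# Exponentiating a logistic regression coefficient gives an odds ratio.
odds_ratio_treatment = math.exp(b_treatment)

print(f"odds ratio (treatment vs. placebo) = {odds_ratio_treatment:.3f}")
```

Because the result is a ratio, it is read multiplicatively ("0.882 times as high"), not as an additive decrease in the odds.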
Which of the following statements is not true about odds ratio? a) Odds Ratios can be used to compare factors for binary outcomes. b) Odds ratios are defined as a ratio between probabilities. c) Odds ratios are defined as a ratio between odds.
b
Y = β0 + β1X + β2Z + β3(X⋅Z) + ϵ. How can you interpret β1? a) The change in the mean of the response variable for one unit increase of X when Z is 1. b) The change in the mean of the response variable for one unit increase of X when Z is 0. c) The change in the mean of the response variable for one unit increase of X.
b
Which of the following are appropriate measures for handling unusual observations? a) Always delete the unusual observations b) Add an indicator variable to the model that indicates the unusual observation c) Fit the model with and without the unusual observations to see if the conclusion is the same d) Try a different model form (e.g. use a non-linear model or add an interaction)
b, c, and d
Which of the following statements are true about the value of odds ratio? a) Odds ratios can only range between 0 and 1 b) If the value of odds ratio of two levels of a factor is around 1, this means that the odds of these two levels are similar. c) A very large or a very small value of odds ratio implies that the predictor could have an effect on the response. d) If the value of odds ratio is around 1, then the predictor may not have an effect on the response.
b, c, and d
Choose a "reference" category and set up k-1 variables, where k is the number of factor levels (for more than 2 factor levels).
dummy variables
Below is an output of _____ selection. In the first step, _____ is fitted. The first variable removed is _____. The selected model has an AIC value of _____.
Start:  AIC=61.77
Y ~ X1 + X2 + X3 + X4
       Df Sum of Sq    RSS    AIC
- X1    1     1.073 162.44 59.986
- X3    1     1.796 163.16 60.128
<none>              161.37 61.774
- X4    1   102.558 263.92 75.518
- X2    1   133.216 294.58 79.034
Step:  AIC=59.99
Y ~ X2 + X3 + X4
       Df Sum of Sq    RSS    AIC
- X3    1     0.768 163.21 58.137
<none>              162.44 59.986
- X4    1   107.986 270.43 74.296
- X2    1   139.481 301.92 77.822
Step:  AIC=58.14
Y ~ X2 + X4
       Df Sum of Sq    RSS    AIC
<none>              163.21 58.137
- X2    1    162.29 325.50 78.228
- X4    1    351.08 514.28 92.865
backward, a full model with all predictors, X1, 58.14
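The AIC column in this output can be reproduced from the RSS values, which illustrates that AIC is a function of RSS, the sample size n, and the number of parameters p. The sketch below assumes the form n·ln(RSS/n) + 2p and a sample size of n = 32. The sample size is not stated in the card; it is inferred here because it reproduces the printed values.

```python
import math

def aic_from_rss(rss: float, n: int, p: int) -> float:
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * p,
    where p counts the regression coefficients including the intercept."""
    return n * math.log(rss / n) + 2 * p

n = 32  # assumption: inferred from the output, not given in the card

# (model, RSS from the <none> rows, number of coefficients incl. intercept)
steps = [
    ("Y ~ X1 + X2 + X3 + X4", 161.37, 5),
    ("Y ~ X2 + X3 + X4", 162.44, 4),
    ("Y ~ X2 + X4", 163.21, 3),
]

aic_values = [aic_from_rss(rss, n, p) for _, rss, p in steps]
for (model, _, _), aic in zip(steps, aic_values):
    print(f"{model:<24} AIC = {aic:.3f}")
```

Each removal barely changes the RSS while dropping a parameter, so the penalty term shrinks and the AIC falls until no further removal helps.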
A study was conducted on an experimental treatment for coronavirus. Patients were assigned to either the experimental treatment (treatment=1) or a placebo (treatment=0). Then they were followed until they either recovered (y=0) or died (y=1) from the disease. Age was included in the model as a covariate. The coefficients from the logistic regression model are given below. β_intercept = 0.011, β_treatment = −0.126, β_age = 0.019. What is the interpretation of the intercept? a) Holding age constant, the odds of death from coronavirus for a patient taking placebo are 0.011. b) The odds ratio of death from coronavirus is 1.011. c) The odds of death from coronavirus for an age 0 patient taking placebo are 1.011. d) The odds of death from coronavirus for an age 0 patient taking placebo are 0.011.
c
Consider a model with two predictors X and Z, where X is a numerical variable and Z is a categorical variable with values 0 and 1. A linear model with both main effects and interaction of X and Z is fitted of the form Y = β0 + β1X + β2Z + β3(X⋅Z) + ϵ. What is the corresponding model when the categorical variable's value is 1? a) Y = β0 + β2 + β1X + ϵ b) Y = β0 + β1X + ϵ c) Y = β0 + β2 + (β1 + β3)X + ϵ d) Y = β0 + (β1 + β3)X + ϵ
c
What is the advantage of cross-validation (CV) compared to single validation studies? a) CV is always unbiased b) CV is faster computationally c) CV uses all of the data for both testing and training d) CV can provide an estimate of RMSE, but single validation studies cannot
c
Which of the following statements is not true about Analysis of Covariance? a) The predictors have both numerical variables and categorical variables. b) The interaction term between the categorical variable and the numerical variable needs to be considered. c) The predictors can only include categorical variables. d) A reduced model with main effects only can be fit if the interaction term is not significant.
c
Which of the following methods can be used to fix multicollinearity problems? a) Remove one or more observations b) Standardize all dummy variables c) Standardize all numeric predictors d) Remove one or more predictors
c and d
Choose whether each of the following statements best describes k-fold CV or LOOCV. Faster computation time
k-fold CV
Choose whether each of the following statements best describes k-fold CV or LOOCV. RMSE and MAE estimates have higher bias
k-fold CV
Y = β0 + β1X + β2Z + β3(X⋅Z) + ϵ. What statement is used in R to fit this model?
lm(Y ~ X + Z + X:Z, data=data)
If fit is a logistic regression model object, then predict(fit) will give the predictions in terms of _____ and predict(fit, type="response") will give the predictions in terms of _____.
log odds, probabilities
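The two scales are linked by the inverse logit. The sketch below converts an illustrative log-odds value (0.645, chosen arbitrarily for the example) to odds and then to a probability; it mirrors the relationship between the two kinds of prediction but does not call R's `predict()`.

```python
import math

def log_odds_to_probability(log_odds: float) -> float:
    """Inverse logit: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative value on the log-odds (link) scale.
log_odds = 0.645

odds = math.exp(log_odds)                        # odds = p / (1 - p)
probability = log_odds_to_probability(log_odds)  # response scale

print(f"log odds    = {log_odds:.3f}")
print(f"odds        = {odds:.3f}")
print(f"probability = {probability:.3f}")
```

The same probability is obtained as odds / (1 + odds), which is a handy check when moving between the two scales by hand.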
_____ studies involve determining how well a model predicts a response variable.
model validation
In the caret package in R, which function is used to define the type of cross-validation to be done?
trainControl
In a validation study, the _____ is used to fit the model, then the _____ is used to evaluate the model.
training set, testing set