6414 FINAL MIX (15)
MSE=
Bias^2+Var
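A worked form of this identity, for an estimator θ̂ of a quantity θ (a standard decomposition, stated here for reference): MSE(θ̂) = E[(θ̂ − θ)²] = (E[θ̂] − θ)² + Var(θ̂) = Bias² + Variance.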
How will the procedure change if we test whether the coefficient is equal to a constant?
1) Compute the t-value as the estimated coefficient minus the constant, divided by its standard error; reject the null hypothesis when the absolute value of the t-value is larger than the critical point t(alpha/2, n-p-1). 2) OR we can check whether the p-value is smaller than the significance level alpha.
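A minimal R sketch of this test on simulated placeholder data (the data, coefficient name x1, and constant c0 are illustrative, not from the course materials):
# test H0: coefficient of x1 equals c0 in a multiple linear regression
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 2 + 1.3 * dat$x1 - 0.5 * dat$x2 + rnorm(50)
fit <- lm(y ~ x1 + x2, data = dat)
c0  <- 1                                          # hypothesized constant
est <- coef(summary(fit))["x1", "Estimate"]
se  <- coef(summary(fit))["x1", "Std. Error"]
tval <- (est - c0) / se                           # subtract the constant in the numerator
df   <- fit$df.residual                           # n - p - 1
alpha <- 0.05
abs(tval) > qt(1 - alpha / 2, df)                 # reject H0 if TRUE
2 * (1 - pt(abs(tval), df))                       # equivalently, compare this p-value to alpha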
what are three things that a regression analysis is used for?
1. Prediction of the response variable, 2. Modeling the relationship between the response and explanatory variables, 3. Testing hypotheses of association relationships
3 options for splitting the data
1. Random sampling 2. K-fold cross-validation 3. Leave-one-out cross-validation
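A minimal R sketch of K-fold cross-validation on simulated placeholder data (the model and fold assignment are illustrative; setting K equal to the sample size gives leave-one-out CV):
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(dat)))    # random fold label for each observation
cv_err <- sapply(1:K, function(k) {
  fit  <- lm(y ~ x, data = dat[folds != k, ])        # train on K-1 folds
  pred <- predict(fit, newdata = dat[folds == k, ])  # predict the held-out fold
  mean((dat$y[folds == k] - pred)^2)                 # held-out squared error
})
mean(cv_err)                                         # cross-validated estimate of prediction error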
How do we compute classification error? (2 ways)
1. Training error 2. Cross validation
Which one is correct? A. The prediction intervals need to be corrected for simultaneous inference when multiple predictions are made jointly. B. The prediction intervals are centered at the predicted value. C. The sampling distribution of the prediction of a new response is a t-distribution. D. All of the above.
D. All of the above. (3.2 - Knowledge Check 3)
To correct for complexity for GLM, what can we use?
AIC and BIC
Outlier
Any data point that is far from the majority of the data in both x and y
conditional relationship
Capturing the association of a predicting variable to the response variable, conditional on the other predicting variables in the model.
What do you add to prediction risk to get training risk?
Complexity penalty
L2 does not account for multicollinearity.
False
What is the rule of thumb for Cook's Distance?
If Di > 4/n, Di > 1, or any Di is "large" relative to the others, the observation should be investigated
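A short R sketch of this rule of thumb on simulated placeholder data (cooks.distance() is the base-R helper; flagged points should be investigated, not automatically discarded):
set.seed(1)
dat <- data.frame(x = rnorm(50))
dat$y <- 1 + 0.5 * dat$x + rnorm(50)
fit <- lm(y ~ x, data = dat)
d <- cooks.distance(fit)
n <- nrow(dat)
which(d > 4 / n)                                  # rule-of-thumb flag
plot(d, type = "h"); abline(h = 4 / n, lty = 2)   # visualize against the 4/n threshold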
When do we reject the null hypothesis of the overall F-test, and what does this mean?
We reject when the F-statistic is larger than the critical point, with alpha being the significance level of the test. This means that at least one of the coefficients is different from 0 at the alpha significance level.
In multiple linear regression, the model can be written in...?
Matrix form.
What method do we use to estimate the model parameters?
Maximum Likelihood Estimation approach
Do the estimated residuals have constant variance?
No
Mallows CP is an example of what?
Prediction risk
How do we interpret R^2?
Proportion of total variability in Y that can be explained by the regression (that uses X)
SST = ?
SSE + SSTR
Linearity assumption for a Logistic Model
Similar to the regression models from the previous lectures, the relationship we now assume between the link g of the probability of success and the predicting variables is a linear function.
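Written out for the logit link (a standard form of the model, not the lecture's exact wording): g(p_i) = log(p_i / (1 − p_i)) = β0 + β1·x_i1 + … + βp·x_ip, so it is the log odds, not the probability itself, that is linear in the predictors.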
In the case of complete separation, what should you do?
Simplify the model
T/F: The statistical inference for logistic regression relies on large size of the sample data.
T
Null-deviance
The deviance of the model that includes only the intercept; it shows how well the response variable is predicted by the intercept-only model and is used in the test for the overall regression.
The fitted values are defined as
The regression line with parameters replaced with the estimated regression coefficients.
If the scatter plot of the residuals (epsilon ij) for the ANOVA is NOT random: (2)
The sample responses are not independent, or the variances of responses are not equal
The total sum of squares divided by N-1 is
The sample variance estimator assuming equal means and equal variances
The pooled variance estimator is:
The sample variance estimator assuming equal variances.
AIC uses estimated variance based on what?
The submodel
The objective of the pairwise comparison is
To identify the statistically significantly different means.
The L1 penalty results in lasso regression.
True
In the case of multiple linear regression, controlling variables are used to control for sample bias. (T/F)
True. See Lesson 3.4: Model Interpretation "Controlling variables can be used to control for bias selection in a sample."
The estimator of the mean response is:
Unbiased
The estimators for the regression coefficients are:
Unbiased regardless of the distribution of the data.
predicting or explanatory (independent) variables
a set of other variables that might be useful in predicting or modeling the response variable (x1, x2)
MSSTr measures...
between-group variability
for our linear model where: Y = B0 + B1*X + EPSILON (E), what does the epsilon represent?
the deviation of the data from the linear model (the error term)
The test of subset of coefficients tests the null hypothesis that
discarded variables have coefficients equal to zero.
An important aspect in prediction is....
how it performs in new settings.
B0 = ?
intercept
If we have a negative value for B1,....
is consistent with an inverse relationship between x and y
L2 penalty (ridge)...
is easy to implement, but it does not measure sparsity and does not perform variable selection
Models with many predictors have...
low bias, high variance
We'd like to have prediction with...
low uncertainty for new settings.
Goodness of fit
means that the model assumptions hold and fits the data well.
The sampling distribution of MLEs can be approximated by a...
normal distribution
We can make association statements for...
observational studies
response (dependent) variables
one particular variable that we are interested in understanding or modeling (y)
What inference does 'testing a subset of coefficients' provide?
provides inferences on the predictive power of the model
B1 = ?
slope
If there is a coefficient that does NOT equal zero, what does that mean?
that at least one of these predictors included in the model has predictive power.
In the pairwise comparison, if the confidence interval only contains positive values, then we conclude...
that the difference in means is statistically positive
Rate parameter
the expectation of the response Yi, given the predicting variables, which is modeled as the exponential of the linear combination of the predicting variables since the link function between expectation and the predicting variables is the log function
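In symbols (a restatement of the card above, with illustrative notation): λ_i = E(Y_i | x_i) = exp(β0 + β1·x_i1 + … + βp·x_ip), or equivalently log λ_i = β0 + β1·x_i1 + … + βp·x_ip.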
The larger the number of variables is for training risk....
the larger the training risk
ANOVA is a linear regression model where...
the predicting factor is a (one) categorical variable.
Predictive factors
to best predict variability in the response regardless of their explanatory power
what is the objective of ANOVA?
to compare the means across the k populations, are they equal?
In a multiple linear regression model, the R^2 measures the proportion of total variability in the response variable that is captured by the regression model.
true
How will we diagnose the assumptions for ANOVA?
we are going to diagnose the assumptions on the residuals because the error terms are not observed (we do not know the true means)
How do we estimate prediction risk?
we can use an approach called Training Risk
What can we use to test if Betaj is = 0?
z test (wald test)
When the p-value of the slope estimate in the SLR is small the r-squared becomes smaller too.
False - When the p-value is small, the model fit is more significant and the R-squared tends to be larger.
LOOCV is approximately AIC when the true variance is replaced by the estimate of the variance from the S submodel.
True
Lasso regression does not have a closed-form expression for the estimated coefficients.
True
MLE is used for GLMs to handle the complicated link function modeling the X-Y relationship.
True
The estimated regression coefficients from lasso regression are less efficient than those from the ordinary least squares estimation approach.
True
The first degree of freedom in the F distribution for any of the three procedures in stepwise is always equal to one.
True
The goal is to balance the trade-off between bias (reduced with more predictors) and variance (reduced with fewer predictors).
True
To get BIC in R, we use AIC with a k value of log(n).
True
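A short R sketch on simulated placeholder data showing that AIC() with k = log(n) reproduces BIC():
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)
fit <- lm(y ~ x, data = dat)
n <- nrow(dat)
AIC(fit)               # default penalty, k = 2
AIC(fit, k = log(n))   # equals BIC(fit)
BIC(fit)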
Typically, confounding variables should be included in a model.
True
Variable selection is an art by itself.
True
We can combine ridge and lasso regression into what we call the elastic net regression.
True
When selecting variables, we need to first establish which variables are used for controlling bias selection in the sample and which are explanatory.
True
When the objective of a model is prediction, correlated variables should be avoided.
True
When the objective is to explain the relationship to the response, one might consider including predicting variables which are correlated.
True
In multiple linear regression with iid data and equal variance, the least squares estimators of the regression coefficients are always unbiased.
True - the least squares estimates are BLUE (Best Linear Unbiased Estimates) in multiple linear regression.
Akaike Information Criterion (AIC)
A more general approach; for linear regression under normality this becomes training risk + a penalty that looks like Mallows' Cp, EXCEPT that the variance is the true variance, not the estimate.
The following output was captured from the summary output of a simple linear regression model that relates the duration of an eruption with the waiting time since the previous eruption.
Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept) -1.374016            A    -1.70  0.045141 *
waiting      0.043714     0.011098        B  0.000052 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
Using the table above, what is the standard error of the intercept, labeled A, rounded to three decimal places? 2.336 / 0.808 / 0.806 / -0.806 / None of the above
0.808 - See 1.4 Statistical Inference. Std. Error = Estimate / t-value = -1.374016 / -1.70 = 0.808
VIF =
1 / (1 - Rj^2), where Rj^2 is the R^2 from regressing the j-th predictor on the remaining predictors
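A short R sketch computing a VIF directly from this definition on simulated placeholder predictors (the vif() function from the car package, if installed, gives the same values):
set.seed(1)
x1 <- rnorm(100); x2 <- 0.8 * x1 + rnorm(100); x3 <- rnorm(100)
y <- 1 + x1 + x2 + x3 + rnorm(100)
r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared   # R^2 from regressing x1 on the other predictors
1 / (1 - r2_x1)                                # VIF for x1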
The objective of multiple linear regression is
1. To predict future new responses 2. To model the association of explanatory variables to a response variable accounting for controlling factors. 3. To test hypothesis using statistical inference on the model.
Objectives of Variable Selection
- High Dimensionality - Multicollinearity - Prediction vs Explanatory Objective
What are 4 reasons why the logistic model might not be a good fit?
1. There may be other variables that should be included in the model, and/or the relationship between the logit of the expected probability and the predictors might be multiplicative rather than additive. 2. Outliers, leverage points, and influential observations are also still an issue for this model; the model should be fitted with and without these outliers. 3. The binomial distribution isn't appropriate (overdispersion). 4. The logit function does not fit the data well (other S-shaped functions might work better).
Plotting the residuals versus fitted values checks for which assumption?
Constant variance & Independence
The 3 assumptions of ANOVA with respect to the error term:
Constant variance, independence, normality
3 ways Predicting Variables can be distinguished as:
Controlling, Explanatory, Predictive
What can we use to identify outliers?
Cook's Distance
How do we choose lambda?
Cross validation!
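A minimal R sketch of choosing lambda by cross-validation with the glmnet package (assuming it is installed), on simulated placeholder data; alpha = 1 gives lasso and alpha = 0 gives ridge:
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- as.vector(X %*% c(2, 0, 0, 1, 0) + rnorm(100))
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)   # cross-validation over a grid of lambda values
cvfit$lambda.min                                   # lambda with the smallest cross-validated error
coef(cvfit, s = "lambda.min")                      # coefficients at the selected lambda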
In Poisson regression: A) We make inference using t-intervals for the regression coefficients. B) Statistical inference relies on exact sampling distribution of the regression coefficients. C) Statistical inference is reliable for small sample data. D) None of the above.
D
The log-likelihood function is a linear function with a closed-form solution.
False - Maximizing the log-likelihood function with respect to the coefficients in closed form is not possible because the log-likelihood function is non-linear.
If one confidence interval in the pairwise comparison does not include zero, we conclude that the two means are plausibly equal.
False
If the confidence interval for a regression coefficient contains the value zero, we interpret that the regression coefficient is definitely equal to zero.
False
If the non-constant variance assumption does not hold in multiple linear regression, we apply a transformation to the predicting variables.
False
If the p-value of the overall F-test is close to 0, we can conclude all the predicting variable coefficients are significantly nonzero.
False
If there are a group of correlated variables, lasso tends to select all of them.
False
If we do not reject the test of equal means, we conclude that means are definitely all equal
False
If we reject the test of equal means, we conclude that all treatment means are not equal.
False
In MLR, a VIFj of 10 means that there is no correlation among the jth predictor and the remaining predictors, hence the variance of the estimated regression coefficient Bj is not inflated
False
In a multiple linear regression model with quantitative predictors, the coefficient corresponding to one predictor is interpreted as the estimated expected change in the response variable when there is a one unit change in that predictor.
False
In linear regression, outliers do not impact the estimation of the regression coefficients.
False
In ridge regression, when the penalty constant lambda equals 1, the corresponding ridge coefficient estimates are the same as the ordinary least squares estimates
False
In the ANOVA, the number of degrees of freedom of the chi-squared distribution for the variance estimator is N-k-1 where k is the number of groups.
False
In the presence of near multicollinearity, the prediction will not be impacted.
False
In the presence of near multicollinearity, the regression coefficients will tend to be identified as statistically significant even if they are not.
False
In the regression model, the variable of interest for study is the predicting variable.
False
In the simple linear regression model, we lose three degrees of freedom because of the estimation of the three model parameters β0, β1, σ².
False
Independence assumption can be assessed using the normal probability plot.
False
Independence assumption can be assessed using the residuals vs fitted values.
False
β1 is an unbiased estimator for β0.
False
Observational studies allow us to make causal inference.
False
One-way ANOVA is a linear regression model with more than one qualitative predicting variables.
False
Only the log-transformation of the response variable can be used when the normality assumption does not hold.
False
Only the log-transformation of the response variable should be used when the normality assumption does not hold.
False
Prediction is the only objective of multiple linear regression.
False
Ridge regression does not have a closed-form expression for the estimated coefficients.
False
Suppose x1 was not found to be significant in the model specified with lm(y ~ x1 + x2 + x3). Then x1 will also not be significant in the model lm(y ~ x1 + x2).
False
T/F: Backward stepwise regression is preferable over forward stepwise regression because it starts with larger models.
False
T/F: Complex models with many predictors are often extremely biased, but have low variance.
False
T/F: Conducting t-tests on each β parameter in a multiple regression model is the best way for testing the overall significance of the model.
False
T/F: Given a qualitative predicting variable with 7 categories in a linear regression model with intercept, 7 dummy variables need to be included in the model.
False
T/F: If a predicting variable is categorical with 5 categories in a linear regression model with intercept, we will include 5 dummy variables in the model.
False
T/F: In regularized regression, the penalization is generally applied to all regression coefficients (β0, ... ,βp), where p = number of predictors.
False
T/F: In the case of a multiple linear regression model containing 6 quantitative predicting variables and an intercept, the number of parameters to estimate is 7.
False
T/F: In the case of a multiple regression model with 10 predictors, the error term variance estimator follows a χ 2 (chi-squared) distribution with n - 10 degrees of freedom.
False
T/F: It is good practice to create a multiple linear regression model using a linearly dependent set of predictor variables.
False
T/F: It is good practice to perform variable selection based on the statistical significance of the regression coefficients.
False
T/F: It is not required to standardize or rescale the predicting variables when performing regularized regression.
False
T/F: Mallow's Cp statistic penalizes for complexity of the model more than both leave-one-out CV and Bayesian information criterion (BIC).
False
T/F: Multiple linear regression captures the causation of a predicting variable to the response variable, conditional of other predicting variables in the model.
False
T/F: Predicting values of the response variable for values of the predictors that are within the data range is known as extrapolation.
False
T/F: Ridge regression is a regularized regression approach that can be used for variable selection.
False
T/F: Stepwise regression is a greedy algorithm searching through all possible combinations of the predicting variables to find the model with the best score.
False
T/F: The causation of a predicting variable to the response variable can be captured using multiple linear regression, conditional of other predicting variables in the model.
False
T/F: The equation to find the estimated variance of the error terms of a multiple linear regression model with intercept can be obtained by summing up the squared residuals and dividing that by n - p , where n is the sample size and p is the number of predictors.
False
T/F: The only objective of multiple linear regression is prediction.
False
T/F: The sampling distribution for estimating confidence intervals for the regression coefficients is a normal distribution.
False
T/F: The sampling distribution used for estimating confidence intervals for the regression coefficients is the normal distribution.
False
T/F: The training risk is an unbiased estimator of the prediction risk.
False
T/F: Variable selection is a simple and solved statistical problem since we can implement it using the R statistical software.
False
T/F: We can make causal inference in observational studies.
False
T/F: We can use the normal test to test whether a regression coefficient is equal to zero.
False
T/F: We interpret the coefficient corresponding to one predictor in a regression with multiple predictors as the estimated expected change in the response variable associated with one unit of change in the corresponding predicting variable.
False
T/F: When the number of predicting variables is large, both backward and forward stepwise regressions will always select the same set of variables.
False
The AIC is commonly used for prediction models since it penalizes the model complexity the most.
False
The F-test can be used to evaluate the relationship between two qualitative variables.
False
The causation effect of a predicting variable to the response variable can be captured using multiple linear regression, conditional of other predicting variables in the model.
False
The causation of a predicting variable to the response variable can be captured using Multiple linear regression, conditional of other predicting variables in the model.
False
The constant variance assumption is diagnosed by plotting the predicting variable vs. the response variable.
False
The constant variance is diagnosed using the quantile-quantile normal plot.
False
The estimated regression coefficient β̂i is interpreted as the change in the response variable associated with one unit of change in the i-th predicting variable.
False
The estimated regression coefficients will be the same under marginal and conditional model, only their interpretation is not.
False
The estimated variance of the error term has a χ² distribution regardless of the distribution assumption of the error terms.
False
The estimator σ̂² is a fixed variable.
False
The interpretation of the regression coefficients is the same whether or not interaction terms are included in the model.
False
The larger the number of variables, the lower the training risk is.
False
The sample means of the k populations is a model parameter in ANOVA.
False
The number of degrees of freedom for the χ² distribution of the estimated variance is n-p-1 for a model without intercept.
False
The number of degrees of freedom of the χ² (chi-square) distribution for the pooled variance estimator is N − k + 1 where k is the number of samples.
False
The number of degrees of freedom of the χ 2 (chi-square) distribution for the variance estimator is N − k + 1 where k is the number of samples.
False
The only assumptions for a linear regression model are linearity, constant variance, and normality.
False
The only assumptions for a simple linear regression model are linearity, constant variance, and normality.
False
The penalty constant 𝜆 has the role to select the best subset of predicting variables.
False
The regression coefficient corresponding to one predictor is interpreted in a multiple regression in terms of the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable.
False
The regression coefficient is used to measure the linear dependence between two variables.
False
The residuals have a t-distribution distribution if the error term is assumed to have a normal distribution.
False
The residuals have constant variance for the multiple linear regression model.
False
The residuals vs fitted can be used to assess the assumption of independence.
False
The sample means of the k populations is a model parameter in ANOVA.
False
The sample means of the k samples is a model parameter in ANOVA.
False
The sampling distribution for the variance estimator in ANOVA is χ² (chi-square) with N - k degrees of freedom.
False
The sampling distribution for the variance estimator in ANOVA is χ² (chi-square) regardless of the assumptions of the data.
False
The sampling distribution of the mean squared error is different from that of the estimated variance.
False
The statistical inference for linear regression under normality relies on large size of sample data.
False
The variables chosen for prediction and the variables chosen for explanatory objectives will be the same.
False
There are four assumptions needed for estimation with multiple linear regression: mean zero, constant variance, independence, and normality.
False
There are multiple model selection criteria that can be used and all provide the same penalization of the model complexity.
False
To get AIC in R, we use AIC with a value of k=1
False
Variable selection is a solved statistical problem, particularly for models with a large number of predictors.
False
We can test for a subset of regression coefficients only if we are interested whether additional explanatory variables should be considered in addition to the controlling variables.
False
We can test for a subset of regression coefficients to evaluate whether all regression coefficients corresponding to the predicting variables excluded from the reduced model are statistically significant.
False
We can test for a subset of regression coefficients using the F statistic test of the overall regression.
False
We cannot estimate a multiple linear regression model if the predicting variables are linearly independent.
False
We do not need to assume normality of the response variable for making inference on the regression coefficients.
False
the L2 penalty results in lasso regression.
False
β1 is an unbiased estimator for β0.
False
Maximum Likelihood Estimation is not applicable for simple linear regression and multiple linear regression.
False - In SLR and MLR, the LSE and MLE are the same with normal iid data.
The interpretation of the regression coefficients is the same for logistic regression and Poisson regression
False - Interpretation of the regression coefficients in Poisson regression is in terms of the log ratio of the rate, while in logistic regression it is in terms of log odds
In GLMs, the link function cannot be non-linear.
False - It can be linear, non-linear, or parametric
It is a good practice to perform variable selection based on the statistical significance of the regression coefficients
False - It is not good practice to perform variable selection based on the statistical significance of the regression coefficients.
A Poisson regression model with p predictors and the intercept will have p + 2 parameters to estimate.
False - It'll have p + 1 parameters to estimate
Like standard linear regression we can use the F test to test for overall regression in logistic regression.
False - We use a chi-squared test instead: 1 - pchisq(null deviance - residual deviance, DF_null - DF_residual)
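A short R sketch of this chi-squared test for the overall regression, on simulated placeholder binary data:
set.seed(1)
x <- rnorm(200)
p <- 1 / (1 + exp(-(0.5 + 1.2 * x)))
y <- rbinom(200, 1, p)
fit <- glm(y ~ x, family = binomial)
1 - pchisq(fit$null.deviance - fit$deviance, fit$df.null - fit$df.residual)  # small p-value: reject H0 that all slopes are zero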
Leave-one-out cross validation is preferred
False - K fold is preferred.
L2 penalty term measures sparsity
False - The L1 penalty measures sparsity; the L2 penalty does not perform variable selection (when combined with L1 in elastic net, it removes the limitation on the number of selected variables)
The L2 penalty measures the sparsity of a vector and forces regression coefficients to be zero
False - L1, Lasso, penalty measures sparsity of a vector and forces regression coefficients to be zero
The interpretation of the regression coefficients is the same for both Logistic and Poisson regressions.
False - Logistic : in terms of log odds. Poisson: in terms of log ratio of the rate.
The regression coefficients for the Poisson regression model can be estimated in exact/closed form.
False - MLE is NOT closed form.
If a Logistic Regression model provides accurate classification, then we can conclude that it is a good fit for the data
False - accuracy is not the same as good fit; GOF tests determine good fit. The two do not have to co-exist.
It is good practice to apply variable selection without understanding the problem at hand to reduce bias.
False - always understand the problem at hand to better select variables for the model.
When the number of predicting variables is large, both backward and forward stepwise regressions will always select the same set of variables
False - backward and forward regression will not always select the same set of variables
Backward stepwise regression is preferable over forward stepwise regression because it starts with larger models.
False - backward stepwise regression is more computationally expensive than forward stepwise regression and generally selects a larger model.
We cannot use the training error rate as an estimate of the true error classification error rate because it is biased upward.
False - biased downward
For logistic regression we can define residuals for evaluating model goodness of fit for models with and without replication.
False - residuals can be defined only with replications, i.e., when the binary responses are grouped so that each ni is greater than 1
It is good practice to perform a GOF test of Logistic Regression models without replications
False - can only do with replication on Logistic Regression
Complex models with many predictors are often extremely biased, but have low variance.
False - complex models with many predictors often have low bias but high variance
Trying all 3 link functions for a logistic regression (complementary log-log, probit, logit) will produce models with the same GOF for a dataset
False - different link produce different output, thus the GOF would be different
In Logistic Regression, the estimated value for a regression coefficient Bi represents the estimated expected change in the response variable associated with a one unit increase in the corresponding predicting variable xi, holding the rest constant
False - expected change with respect to the odds of success
In MLR, the proportion of variability in the response variable that is explained by the predicting variables is called adjusted R2
False - explained by the model (not predicting variables)
In Logistic Regression, if the p-value of the deviance test for GOF is smaller than the significance level alpha, then it is plausible that the model is a good fit
False - for the GOF test, a small p-value means the model is not a good fit (the reverse of how we read the significance of regression coefficients)
After fitting a logistic regression model, a plot of residuals versus fitted values is useful for checking if model assumptions are violated.
False - for logistic regression use deviance residuals.
The presence of multicollinearity in a MLR model will not impact the standard errors of the estimated regression coefficients.
False - it can impact the standard errors of the estimated regression coefficients.
Stepwise regression is a greedy algorithm searching through all possible combinations of the predicting variables to find the model with the best score
False - it does not guarantee the best score and not all possible combinations are checked
In Logistic regression, the error terms are assumed to follow a normal distribution
False - it doesn't have an error term
Least Squares Estimation (LSE) cannot be applied to GLM models.
False - it is applicable but does not use data distribution information fully.
When a statistically insignificant variable is discarded from the model, there is little change in the other predictors statistical significance.
False - it is possible that when a predictor is discarded, the statistical significance of other variables will change.
For the testing procedure for subsets of coefficients, we compare the likelihood of a reduced model versus a full model. This is a goodness of fit test
False - it provides inference of the predictive power of the model
Forward stepwise will select larger models than backward.
False - it will typically select smaller models especially if p is large
The log-likelihood function is a linear function with a closed-form solution
False - log likelihood is non linear. A numerical algorithm is needed in order to maximize it.
Regression models are only appropriate for continuous response variables.
False - logistic and Poisson regression model a probability and a rate, respectively
The logit link function is the best link function to model binary response data because it always fit the data better than other link functions.
False - logit function is not the only function that yields the s-shaped kind of curve.
Models with many predictors have high bias but low variance.
False - low bias and high variance
If the Cook's distance for any particular observation is greater than one, that data point is definitely a record error and thus needs to be discarded.
False - we must compare it with the other data points and investigate; a Di greater than one does not by itself mean the observation is a record error that must be discarded
If data on (Y, X) are available at only two values of X, then the model Y = β1X + β2X² + ε provides a better fit than Y = β0 + β1X + ε.
False - with data at only two values of X, there is nothing to determine whether a quadratic model is necessary or better.
In a greenhouse experiment with several predictors, the response variable is the number of seeds that germinate out of 60 that are planted with different treatment combinations. A Poisson regression model is most appropriate for modeling this data
False - the response is the number of successes out of a fixed total (60 seeds), so a logistic (Binomial) regression model is more appropriate; Poisson regression models count or rate data without a fixed upper bound.
If the confidence interval for a regression coefficient contains the value zero, we interpret that the regression coefficient is definitely equal to zero.
false
If the non-constant variance assumption does not hold in multiple linear regression, we apply a transformation to the predicting variables.
false
In ANOVA, the number of degrees of freedom of the chi-squared distribution for the variance estimator is N-k-1 where k is the number of groups.
false
In logistic regression, R^2 could be used as a measure of explained variation in the response variable.
false
In simple linear regression, the confidence interval of the response increases as the distance between the predictor value and the mean value of the predictors decreases.
false
T/F: Prediction is the only objective of multiple linear regression.
false
T/F: The proportion of variability in the response variable that is explained by the predicting variables is called correlation.
false
The F-test can be used to test for the overall regression in Poisson regression.
false
The interpretation of the regression coefficients is the same for both Logistic and Poisson regression.
false
The only assumptions for a simple linear regression model are linearity, constant variance, and normality.
false
We cannot estimate a multiple linear regression model if the predicting variables are linearly independent.
false
We do not need to assume independence between data points for making inference on the regression coefficients.
false
If a logistic regression model provides accurate classification, then we can conclude that it is a good fit for the data.
false "Goodness of fit doesn't guarantee good prediction." And conversely, good prediction doesn't guarantee the model is a good fit.
In a multiple linear regression model with n observations, all observations with Cook's distance greater than 4/n should always be discarded from the model.
false An observation should not be discarded just because it is found to be an outlier. We must investigate the nature of the outlier before deciding to discard it.
In a simple linear regression model, any outlier has a significant influence on the estimated slope parameter.
false An outlier does not necessarily have a large influence on model parameters. When it does, we call it an influential point.
Mallow's Cp statistic penalizes for complexity of the model more than both leave-one-out CV and Bayesian information criterion (BIC).
false BIC penalizes complexity more than the other approaches.
Backward stepwise regression is computationally preferable over forward stepwise regression.
false Backward stepwise regression is more computational expensive than forward stepwise regression and generally selects a larger model.
Complex models with many predictors are often extremely biased, but have low variance.
false Complex models with many predictors have often low bias but high variance.
You obtained a statistically significant F-statistic when testing for equal means across four groups. The number of unique pairwise comparisons that could be performed is seven
false For k=4 treatments, there are k(k-1)/2 = 4(4-1)/2 = 6 unique pairs of treatments. The number of unique pairwise comparisons that could be performed is six.
Consider a multiple linear regression model with intercept. If two predicting variables are categorical and each variable has three categories, then we need to include five dummy variables in the model
false In a multiple linear regression model with intercept, if two predicting variables are categorical and both have k=3 categories, then we need to include 2*(k-1) = 2*(3-1) = 4 dummy variables in the model.
what are two ways to transform data?
power and log transformation
We do not interpret beta with respect to the response variable for a Poisson model but with....
respect to the ratio of the rate.
When conducting ANOVA, the larger the between-group variability is relative to the within-group variability, the larger the value of the F-statistic will tend to be.
true Given the formula of the F-statistic, a larger numerator (between-group variability) relative to the denominator (within-group variability) results in a larger F-statistic; hence, the larger MSSTr is relative to MSE, the larger the value of the F-statistic.
log rate
the log function of the expected value of the response
What is the interpretation of coefficient Beta in terms of logistic regression?
the log of the odds ratio for an increase of one unit in the predicting variable, holding all other variables constant
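In symbols, for a one-unit increase in x_j holding the other predictors fixed: β_j = log(odds(x_j + 1)) − log(odds(x_j)), so the odds of success are multiplied by exp(β_j).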
If one confidence interval in the pairwise comparison includes zero under ANOVA, we conclude that the two corresponding means are plausibly equal.
true
If there are specific variables that are required to control the bias selection in the model, they should be forced into the model and not be part of the variable selection process.
true
In ANOVA, to test the null hypothesis of equal means across groups, the variance of the response variable must be the same across all groups.
true
In a simple linear regression model, we can assess if the residuals are correlated by plotting them against fitted values.
true
In a simple linear regression model, we need the normality assumption to hold for deriving a reliable prediction interval for a new response.
true
In ridge regression, when the penalty constant lambda (λ) equals zero, the corresponding ridge coefficient estimates are the same as the ordinary least squares estimates.
true
Multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.
true
Ridge regression can be used to deal with problems caused by high correlation among the predictors.
true
T/F: A partial F-test can be used to test whether a subset of regression coefficients are all equal to zero.
true
T/F: Analysis of variance (ANOVA) is a multiple regression model.
true
T/F: Before making statistical inference on regression coefficients, estimation of the variance of the error terms is necessary.
true
T/F: In the case of multiple linear regression, controlling variables are used to control for sample bias.
true
T/F: The estimated coefficients obtained by using the method of least squares are unbiased estimators of the true coefficients.
true
T/F: The regression coefficient corresponding to one predictor in multiple linear regression is interpreted in terms of the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable holding all other predictors fixed.
true
The L1 penalty measures the sparsity of a vector and forces regression coefficients to be zero.
true
The estimators of the error term variance and of the regression coefficients are random variables.
true
The larger the coefficient of determination or R-squared, the higher the variability explained by the simple linear regression model.
true
The lasso regression requires a numerical algorithm to minimize the penalized sum of least squares.
true
The one-way ANOVA is a linear regression model with one qualitative predicting variable.
true
The penalty constant lambda (λ) in penalized regression controls the trade-off between lack of fit and model complexity.
true
We can assess the assumption of constant-variance in multiple linear regression by plotting the standardized residuals against fitted values.
true
We estimate the regression coefficients in Poisson regression using the maximum likelihood estimation approach.
true
A binary response variable with replications in logistic regression has a Binomial distribution.
true A binary response variable with replications does follow a Binomial distribution.
Suppose that we have a multiple linear regression model with k quantitative predictors, a qualitative predictor with l categories and an intercept. Consider the estimated variance of error terms based on n observations. The estimator should follow a chi-square distribution with n − k − l degrees of freedom.
true For this example, we use k + l df to estimate the following parameters: k regression coefficients associated to the k quantitative predictors, ( l − 1 ) regression coefficients associated to the ( l − 1 ) dummy variables and 1 regression coefficient associated to the intercept. This leaves n − k − l degrees of freedom for the estimation of the error variance.
Lambda
λ is a constant that balances the tradeoff between the lack of fit (measured by the SSE) and the complexity (measured by the penalty) which depends on the regression coefficients. The bigger λ, the bigger the penalty for model complexity
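As a sketch of the penalized least squares objective being described (standard forms; the intercept is typically not penalized): ridge minimizes Σᵢ(yᵢ − xᵢᵀβ)² + λΣⱼβⱼ², while lasso minimizes Σᵢ(yᵢ − xᵢᵀβ)² + λΣⱼ|βⱼ|; the larger λ is, the more heavily model complexity is penalized.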
T/F: Cook's distance measures how much the fitted values (response) in the multiple linear regression model change when the ith observation is removed.
T
T/F: If the residuals are not normally distributed, then we can model instead the transformed response variable where the common transformation for normality is the Box-Cox transformation.
T
T/F: In Poisson regression, there is a linear relationship between the log rate and the predicting variables.
T
T/F: Influential points in multiple linear regression are outliers.
T
T/F: Multicollinearity can lead to less accurate statistical significance of some of the regression coefficients.
T
T/F: Multicollinearity in the predicting variables will impact the standard deviations of the estimated coefficients.
T
T/F: The assumptions to diagnose with a linear regression model are independence, linearity, constant variance, and normality.
T
T/F: The estimated regression coefficients in Poisson regression are approximate.
T
T/F: The estimator σ̂² is a random variable.
T
T/F: The linear regression model under normality is also a generalized linear model with link function the identity link function.
T
T/F: The linear regression model with a qualitative predicting variable with k levels/classes will have k + 1 parameters to estimate
T
T/F: The logit function is the log of the ratio of the probability of success to the probability of failure. It is also known as the log odds function.
T
T/F: The logit link function is not the only S-shape function that can be used to model binary response data.
T
T/F: The mean sum of square errors in ANOVA measures variability within groups.
T
T/F: The prediction interval will never be smaller than the confidence interval for data points with identical predictor values.
T
T/F: The prediction of the response variable has higher uncertainty than the estimation of the mean response.
T
T/F: The presence of certain types of outliers can impact the statistical significance of some of the regression coefficients.
T
T/F: Under the normality assumption, the estimator for β 1 is a linear combination of normally distributed random variables.
T
T/F: We can use a Z-test to test for the statistical significance of a coefficient given all predicting variables in a Poisson regression model.
T
T/F: We can use a t-test to test for the statistical significance of a coefficient given all predicting variables in a multiple linear regression model.
T
T/F: We could diagnose the normality assumption using the normal probability plot.
T
T/F: When a Poisson regression does not fit the data well, it may be that there is more variability in the data than provided by the model.
T
T/F: When making a prediction for predicting variables on the "edge" of the space of predicting variables, then its uncertainty level is high.
T
T/F: For both logistic and Poisson regression, both the Pearson and deviance residuals should approximately follow the standard normal distribution if the model is a good fit for the data.
T
T/F: The estimator of the mean response is unbiased.
T
T/F: In logistic regression, there is not a linear relationship between the probability of success and the predicting variables.
T - the probability of success is not linear in the predictors; it is the logit (link) of the probability that is a linear function of the predictors.
A measure of the bias-variance tradeoff is the prediction risk
TRUE
Explanatory variable is one that explains changes in the response variable
TRUE
Mallow's CP is useful when there are no control variables.
TRUE
Bayesian Information Criterion
The complexity penalty is (# of predictors in the submodel × the true variance of the full model × log(n)) / n. BIC penalizes complexity more than the other approaches and is thus preferred in model selection for prediction. BIC is similar to AIC except that the AIC complexity penalty is further multiplied by log(n)/2. To get BIC in R, we need to specify k = log(n). Select the model with the smallest BIC.
Characteristics of Spatial Regression
Trend: a long-distance increase/decrease in the data over space. Periodicity/Seasonality: not common for a spatial process. Heteroskedasticity: the variability varies with space.
If the constant variance assumption does not hold, we transform the response variable.
True
If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable.
True
If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption.
True
If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation.
True
If the residuals of a MLR model are not normally distributed, we can model the transformed response variable instead, where a common transformation for normality is the Box-Cox transformation
True
If there are specific variables that are required to control the bias selection in the model, they should be forced into the model and not be part of the variable selection process
True
If we reject the test of equal means, we conclude that some treatment means are not equal.
True
In Logistic Regression, we can perform residual analysis for binary data with replications
True
In Logistic regression, the relationship between the probability of success and predicting variables is non linear
True
In MLR, we can assess the assumption of constant variance by plotting the standardized residuals against fitted values
True
In Poisson Regression we do not interpret beta with respect to the response variable but with respect to the ratio of the rate.
True
In Poisson regression, the underlying assumption is that the response variable has a Poisson distribution; alternatively, responses could be waiting times, which follow an exponential distribution
True
In a Poisson regression model, we use a chi-squared test to test the overall regression.
True
Mean square error is commonly used in statistics to obtain estimators that may be biased but are less uncertain than unbiased ones, which is often preferred.
True
Multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.
True
Multicollinearity can lead to misleading conclusions on the statistical significance of the regression coefficients of a MLR model
True
Multicollinearity in MLR means that the columns in the design matrix are (nearly) linearly dependent.
True
Multiple linear regression is a general model encompassing both ANOVA and simple linear regression.
True
One problem with fitting a normal regression model to Poisson data is the departure from the assumption of constant variance
True
One reason why the logistic model may not fit is the relationship between logit of the expected probability and predictors might be multiplicative, rather than additive
True
Overdispersion is when the variability of the response variable is larger than estimated by the model
True
Partial F-Test can also be defined as the hypothesis test for the scenario where a subset of regression coefficients are all equal to zero.
True
In the Poisson distribution, the variance is equal to the expectation; thus, the variance is not constant
True
Predicting variable is used in regression to predict the outcome of another variable.
True
Predictive power means that the predicting variables predict the data even if one or more of the assumptions do not hold.
True
Random sampling is computationally more expensive than the K-fold cross validation, with no clear advantage in terms of the accuracy of the estimation classification error rate.
True
Residual analysis can only be used to assess uncorrelated errors.
True
Ridge regression has a closed form expression for the estimated coefficients.
True
Simpson's Paradox - the reversal of association when looking at marginal vs conditional relationships
True
Stepwise regression is a greedy algorithm.
True
Studying the relationship between a single response variable and more than one predicting quantitative and/or qualitative variable is termed as Multiple linear regression.
True
T/F: Akaike Information Criterion (AIC) is an estimate for the prediction risk.
True
T/F: Controlling variables used in multiple linear regression are used to control for bias in the sample.
True
T/F: Elastic net regression uses both penalties of ridge and lasso regression and hence combines the benefits of both.
True
T/F: For a given predicting variable, the estimated coefficient of regression associated with it will likely be different in a model with other predicting variables or in the model with only the predicting variable alone.
True
T/F: If there are specific variables that are required to control the bias selection in the model, they should be forced into the model and not be part of the variable selection process.
True
T/F: In a multiple linear regression model with 6 predicting variables but without intercept, there are 7 parameters to estimate.
True
T/F: In multiple linear regression we study the relationship between a single response variable and several predicting quantitative and/or qualitative variables.
True
T/F: In multiple linear regression, we study the relationship between one response variable and both predicting quantitative and qualitative variables.
True
T/F: In order to make statistical inference on the regression coefficients, we need to estimate the variance of the error terms.
True
T/F: In ridge regression, when the penalty constant lambda (λ) equals zero, the corresponding ridge coefficient estimates are the same as the ordinary least squares estimates.
True
T/F: Ridge regression can be used to deal with problems caused by high correlation among the predictors.
True
T/F: The L1 penalty measures the sparsity of a vector and forces regression coefficients to be zero.
True
T/F: The error terms in the multiple linear regression cannot be correlated.
True
T/F: The error term variance estimator has a χ² (chi-squared) distribution with n - 10 - 1 degrees of freedom for a multiple regression model with 10 predictors.
True
T/F: The estimated regression coefficient corresponding to a predicting variable will likely be different in the model with only one predicting variable alone versus in a model with multiple predicting variables.
True
T/F: The estimated regression coefficients are unbiased estimators.
True
T/F: The estimated variance of the error terms is the sum of squared residuals divided by the sample size minus the number of predictors minus one.
True
T/F: The hypothesis test for whether a subset of regression coefficients are all equal to zero is a partial F-test.
True
T/F: The lasso regression requires a numerical algorithm to minimize the penalized sum of least squares.
True
T/F: The penalty constant lambda (λ) in penalized regression controls the trade-off between lack of fit and model complexity.
True
T/F: We cannot estimate a multiple linear regression model if the predicting variables are linearly dependent.
True
T/F: We need to assume normality of the response variable for making inference on the regression coefficients.
True
T/F: When selecting variables for explanatory purpose, one might consider including predicting variables which are correlated if it would help answer your research hypothesis.
True
The ANOVA is a linear regression model with one or more qualitative predicting variables.
True
The ANOVA model with a qualitative predicting variable with k levels/classes will have k + 1 parameters to estimate.
True
The L1 penalty results in a sparse matrix, and therefore does variable selection.
True
The L2 penalty does not result in a sparse matrix.
True
The LASSO regression requires a numerical algorithm to minimize the penalized sum of least squares
True
The R² value represents the percentage of variability in the response that can be explained by the linear regression on the predictors. Models with higher R² are always preferred over models with lower R².
False - adding predictors always increases R², so a higher R² does not always mean a better model; use adjusted R² or other criteria to compare models.
The Mallow's CP complexity penalty is two times the size of the model (the number of variables in the submodel) times the estimated variance divided by n.
True
The Partial F-Test can test whether a subset of regression coefficients are all equal to zero.
True
The Sum of Squares Regression (SSR) measures the explained variability captured by the regression model given the explanatory variables used in the model.
True
The advantage of having a biased model with less predicting variables is the reduction in uncertainty of predictions of future responses.
True
The decision in using the ANOVA table for testing whether a model is significant depends on the normal distribution of the response variable
True
The deviance residuals are the signed square root of the log-likelihood evaluated at the saturated model
True
The equation to find the estimated variance of the error terms can be obtained by summing up the squared residuals and dividing that by n - p - 1, where n is the sample size and p is the number of predictors.
True
The estimated regression coefficients in lasso regression are obtained using a numerical algorithm.
True
The estimated regression coefficients in ridge regression are obtained using an exact or closed-form expression.
True
The estimated regression coefficients from Lasso are less efficient than those provided by the ordinary least squares
True
The estimators for the regression coefficients are unbiased regardless of the distribution of the data.
True
The estimators of the error term variance and of the regression coefficients are random variables.
True
The gam() function is a non-parametric test to determine what transformation is best.
True
The hypothesis testing procedures for subsets of regression coefficients is not used for GOF assessment in Logistic Regression
True
The L2 penalty results in ridge regression.
True
The larger K is (the larger the number of folds), the less biased the estimate of the classification error is, but the higher its variability.
True
The larger the coefficient of determination or R-squared, the higher the variability explained by the simple linear regression model.
True
The lasso minimization problem is convex.
True
The least square estimation for the standard regression model is equivalent with Maximum Likelihood Estimation, under the assumption of normality.
True
The linear regression model with a qualitative predicting variable with k levels/classes will have k + 1 parameters to estimate
True
The log odds function, also called the logit function, is the log of the ratio between the probability of a success and the probability of a failure
True
The logit function is the log of the ratio of the probability of success to the probability of failure. It is also known as log odds function
True
The mean sum of square errors in ANOVA measures variability within groups.
True
The number of parameters to estimate in the case of a multiple linear regression model containing 5 predicting variables and no intercept is 6.
True
The one-way ANOVA is a linear regression model with one qualitative predicting variable.
True
The penalty constant lambda in penalized regression control the trade-off between lack of fit and model complexity
True
The penalty constant 𝜆 has the role to control the trade-off between lack of fit and model complexity.
True
The prediction intervals are centered at the predicted value.
True
The prediction intervals need to be corrected for simultaneous inference when multiple predictions are made jointly.
True
The prediction of the response variable has higher uncertainty than the estimation of the mean response.
True
The prediction risk is the sum of the irreducible error and the mean squared error
True
The regression coefficients can be estimated only if the predicting variables are not linearly dependent.
True
The regression coefficients that are estimated serve as unbiased estimators.
True
The sampling distribution of the estimated regression coefficients is centered at the true regression parameters.
True
The sampling distribution of the estimated regression coefficients is dependent on the design matrix.
True
The sampling distribution of the estimated regression coefficients is the t-distribution, assuming that the variance of the error term is unknown and replaced by its estimate.
True
The sampling distribution of the prediction of a new response is a t-distribution.
True
The training risk is a biased estimator of the prediction risk
True
The variability of a submodel is smaller than the full model.
True
The variance of the response is equivalent to the expected value of the response in Poisson regression with no overdispersion.
True
To get AIC in R, we use AIC with a value of k=2
True
Under the normality assumption, the estimator for β 1 is a linear combination of normally distributed random variables.
True
Using leave-one-out cross-validation is equivalent to K-fold cross-validation where the number of folds is equal to the sample size of the training set.
True
Using stepwise regression, we can force variables into the model.
True
Variable selection can be used to deal with multicollinearity, reduce predictors and fit a model with more variables than observations.
True
We assess the assumption of constant-variance by plotting the residuals against fitted values.
True
We can assess the assumption of constant-variance in multiple linear regression by plotting the standardized residuals against fitted values.
True
We can do a partial F test to determine if variable selection is necessary.
True
We would like to have a prediction with low uncertainty for new settings. This means that we're willing to give up some bias to reduce the variability in the prediction.
True
When considering generalized linear models, it is important to consider the impact of Simpson's paradox when interpreting the relationship between explanatory variables and the response. This paradox refers to the reversal of these associations when looking at the marginal relationship compared to the conditional one.
True
When estimating confidence values for the mean response for all instances of the predicting variables, we should use a critical point based on the F-distribution to correct for the simultaneous inference.
True
When selecting variables for a model, one also needs to consider the research hypothesis, as well as any potential confounding variables to control for
True
When the data may not be normally distributed, AIC is more appropriate for variable selection than adjusted R-squared
True
If the VIF for each predicting variable is smaller than a certain threshold, we can say that there is not a problematic amount of multicollinearity in the MLR model
True
In MLR, we can diagnose the normality assumption by using the normal probability plot
True
In MLR, when using very large samples, relying on the p-values associated with the traditional hypothesis test with Ha: βj not equal to 0 can lead to misleading conclusions on the statistical significance of the regression coefficients
True
A negative value of β1 is consistent with an inverse relationship between the predictor variable and the response variable. (T/F)
True See 1.2 Estimation Method "A negative value of β1 is consistent with an inverse relationship"
L0 penalty, which is the number of nonzero regression coefficients
True - not feasible for a large number of predicting variables, as it requires fitting all possible models
Poisson Assumptions - log transformation of the rate is a linear combination of the predicting variables, the response variables are independently observed, the link function g is the log function
True - remember, NO ERROR TERM
Standard linear regression could be used to model Poisson count data using the variance-stabilizing transformation sqrt(Y + 3/8) if the number of counts is large
True - when the counts are small, use Poisson regression instead
To estimate prediction risk we compute the prediction risk for the observed data and take the sum of squared differences between fitted values for sub model S and the observed values.
True - this is called training risk and it is a biased estimate of prediction risk
The g link function is also called the canonical link function.
True - which means that parameter estimates under logistic regression are fully efficient and tests on those parameters are better behaved for small samples.
We can use the z value to determine if a coefficient is equal to zero in logistic regression.
True - z value = (Beta-0)/(SE of Beta)
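A sketch of this calculation in R, using a toy logistic model on the built-in mtcars data (the model is illustrative only, not from the course):
    # z value = estimate / standard error, compared against the standard normal
    fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
    coefs <- summary(fit)$coefficients
    z <- coefs[, "Estimate"] / coefs[, "Std. Error"]   # matches the reported z values
    pval <- 2 * pnorm(-abs(z))                         # two-sided p-values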
When p, the number of predictors, is larger than n, the number of observations, the lasso selects at most n variables
True - when p is greater than n, the lasso will select at most n variables
The estimated regression coefficients in Poisson regression are approximate
True.
The pooled variance estimator, s_pooled^2, in ANOVA is synonymous with the variance estimator, σ̂^2, in simple linear regression because they both use mean squared error (MSE) for their calculations. (T/F)
True. See 1.2 Estimation Method for simple linear regression See 2.2 Estimation Method for ANOVA The pooled variance estimator is, in fact, the variance estimator.
A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 What is the probability of survival for a botrytis blight sample exposed to a sulfer concentration of 0.7 and a copper concentration of 0.9? 0.826 0.674 0.311 0.577
0.577 exp(3.58770 - 4.32735*0.7 - 0.27483*0.9) / (1 + exp(3.58770 - 4.32735*0.7 - 0.27483*0.9)) = 0.577
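The same calculation can be reproduced in R with plogis(), the inverse logit (a small sketch of the hand computation above):
    eta <- 3.58770 - 4.32735 * 0.7 - 0.27483 * 0.9   # linear predictor
    plogis(eta)                                      # about 0.577
    # with a fitted glm object, predict(fit, newdata, type = "response") gives the same quantity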
A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 The p-value for testing the overall regression can be obtained from which of the following? 1-pchisq(718.76,19) 1-pchisq(419.33,2) 1-pchisq(363.53,3) 1-pchisq(299.43,17
1-pchisq(419.33,2) The chi-squared test statistic is the difference between the null deviance (718.76) and the residual deviance (299.43), which is 419.33. The degrees of freedom is the difference between the null deviance degrees of freedom (19) and the residual deviance degrees of freedom (17), which is 2.
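The same test written out in R, using the deviances quoted in the output above:
    # chi-squared statistic: null deviance minus residual deviance, df = 19 - 17
    test_stat <- 718.76 - 299.43
    1 - pchisq(test_stat, df = 19 - 17)   # p-value for the overall regression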
Akaike Information Criterion (AIC) is an estimate for the prediction risk
AIC indeed measures prediction risk.
When would we use the 'Comparing pairs of Means' method?
After we reject the null hypothesis of equal means
How can we diagnose multicollinearity?
One approach to diagnosing collinearity is through the computation of the Variance Inflation Factor (VIF), which you compute for EACH predicting variable.
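A minimal sketch assuming the car package is available (the model and data below are illustrative, not from the course):
    library(car)
    model <- lm(mpg ~ wt + hp + disp, data = mtcars)
    vif(model)   # one VIF per predictor; flag values above a chosen threshold, e.g. 10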
What is the hypothesis testing procedure for overall regression and what is it testing?
Analysis of variance for multiple regression. We use analysis of variance (ANOVA) to test the hypothesis that all of the regression coefficients (excluding the intercept) are zero.
The estimated versus predicted regression line for a given x*: A. Have the same variance B. Have the same expectation C. Have the same variance and expectation D. None of the above
B. Have the same expectation 1.2 - Knowledge Check 3
Which one is correct? A. Independence assumption can be assessed using the residuals vs fitted values. B. Independence assumption can be assessed using the normal probability plot. C. Residual analysis can be used to assess uncorrelated errors. D. None of the above
C. Residual analysis can be used to assess uncorrelated errors. 1.3 - Knowledge Check 4
Classification Error Rate
The probability that a new response is not equal to the classification produced by the classifier. The classifier uses a threshold R between 0 and 1; the most common value for R is 0.5, although a different R can be used to improve the prediction accuracy.
Classification
Classification is prediction of binary responses. If the predicted probability is large, then classify y star as a success
Logistic Regression
Commonly used for modeling binary response data. The response variable is a binary variable, and thus, not normally distributed. In logistic regression, we model the probability of a success, not the response variable. In this model, we do not have an error term
A variable that is related to both a predictor and a response. For example outdoor temperature as it relates to ice cream sales and home invasions.
Confounding variable
Constant Variance Assumption
Constant variance assumption means that it cannot be true that the model is more accurate for some parts of the population and less accurate for other parts. A violation of this assumption means that the estimates are not as efficient as they could be in estimating the true parameters. It also results in poorly calibrated prediction intervals.
Used to control for selection bias in a sample.
Controlling variable.
How else can we estimate the classification error without the need of observing new data?
Cross validation
The estimators for the regression coefficients are: A. Biased but with small variance B. Biased with large variance C. Unbiased under normality assumptions but biased otherwise. D. Unbiased regardless of the distribution of the data.
D. Unbiased regardless of the distribution of the data. 1.2 - Knowledge Check 2
The estimators for the regression coefficients are: A. Biased but with small variance B. Unbiased under normality assumptions but biased otherwise. C. Biased regardless of the distribution of the data. D. Unbiased regardless of the distribution of the data.
D. Unbiased regardless of the distribution of the data. 3.2 - Knowledge Check 2
Leverage points
Data points that are far from the mean of the x's
leverage points
Data points that are far from the mean of the x's
Data w/ replications vs Data w/o replications
Data with replications: we can observe binary data for repeated trials, that is, a binomial distribution with more than one trial (ni greater than 1). Data without replications: for each unique set of the observed predicting variables, we observe binary data with no repeated trials, that is, a binomial distribution with one trial (ni = 1).
A data point far from the mean of the x's and y's is always: A. an influential point and an outlier B. a leverage point but not an outlier C. an outlier and a leverage point D. an outlier but not a leverage point E. None of the above
E. None of the Above. See 1.9 Outliers and Model Evaluation. We only know that the data point is far from the means of the x's and y's. Being far from the mean of the x's makes it a leverage point, so we can eliminate the answers that do not include a leverage point. That leaves "a leverage point but not an outlier" and "an outlier and a leverage point", but we do not have enough information to know whether it is an outlier, so neither option is always true. Hence none of the answers above is always correct.
T/F: A goodness-of-fit test should always be conducted after fitting a logistic regression model without repetition.
F
T/F: For assessing the normality assumption of the ANOVA model, we can only use the quantile-quantile normal plot of the residuals.
F
T/F: If a logistic regression model provides accurate classification, then we can conclude that it is a good fit for the data.
F
T/F: If one confidence interval in the pairwise comparison does not include zero, we conclude that the two means are plausibly equal.
F
T/F: If the VIF for each predicting variable is smaller than a certain threshold, then we can say that multicollinearity does not exist in this model.
F
T/F: If the constant variance assumption does not hold in multiple linear regression, we apply a transformation to the predicting variables.
F
T/F: Other link functions for the Poisson regression model are c-log-log and probit.
F
A multiple linear regression model with p predicting variables but no intercept has p model parameters.
False
BIC penalizes complexity less than other approaches.
False
Backward stepwise regression adds one variable to the model at a time starting with the full model.
False
Backward stepwise regression can be performed if p>n.
False
Backward stepwise regression is preferable over forward stepwise regression.
False
For testing if a regression coefficient is zero, the normal test can be used.
False
In the presence of near multicollinearity, the coefficient of variation decreases.
False
T/F: Observational studies allow us to make causal inference.
False
The Box-Cox transformation is commonly used to improve upon the linearity assumption.
False
Under logistic regression, the sampling distribution used for a coefficient estimator is a chi-squared distribution when the sample size is large.
False - the sampling distribution of the coefficient estimator is approximately normal
The slope of a linear regression equation is an example of a correlation coefficient.
False - the correlation coefficient is the r value. It will have the same sign (+ or -) as the slope.
Observational Studies
For observational studies, causality is rarely implied
When do we use transformations?
If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption. If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation. If the constant variance assumption does not hold, we transform the response variable.
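A minimal sketch of the Box-Cox step, assuming the MASS package (the model and data below are illustrative only):
    library(MASS)
    model <- lm(mpg ~ wt + hp, data = mtcars)
    bc <- boxcox(model, lambda = seq(-2, 2, by = 0.1))   # profiles the power transformation of the response
    lambda_hat <- bc$x[which.max(bc$y)]                  # lambda maximizing the profile likelihood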
What does it mean when we reject the H0 of the F test?
If we reject the null hypothesis, we will conclude that at least one of the predicting variables has explanatory power for the variability in the response.
OLS vs Robust Regression
In OLS, we minimize the sum of squared errors to estimate the expectation of Y given the predictor variables. In robust regression, we minimize the sum of absolute errors to estimate the median of Y given the predictor variables. Both the expectation and the median are measures of centrality of the distribution; however, the median is robust to outliers, whereas the mean (expectation) is not. The estimated confidence intervals are wider for robust regression than for ordinary least squares, and thus the statistical inference is more conservative.
Regularized Regression (Variable Selection)
In regularized regression for variable selection, we need to first re-scale all the predicting variables in order to be comparable on the same scale.
What is the difference between using Poisson regression versus the standard regression with, say, with the log transformation of the response variable?
In standard regression, the variance is assumed constant. In Poisson, the variance of the response is assumed to be equal to the expectation, since for the Poisson distribution, the variance is equal to the expectation. Thus the variance is not constant.
Stepwise regression is a greedy algorithm. What does that mean?
It is not guaranteed to find the model with the best score.
How do we interpret ^B0?
It is the estimated expected value of the response variable, when the predicting variable equals zero.
Ridge Regression
It is used to correct for the impact of multicollinearity. It is commonly used to fit a model under multicollinearity; it is not used for model selection and does not force any betas to 0. For this model, the penalty is the sum of squared regression coefficients times the constant lambda. Minimizing the penalized sum of squared errors provides a closed-form expression for the estimated regression coefficients. When λ = 0 we recover the least squares estimates (low bias, high variance); as λ grows large, the betas shrink toward 0 (high bias, low variance). Some of the regression coefficients may cross the 0 line for large effective degrees of freedom, while for others the coefficients increase. Alpha = 0 (in glmnet).
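A minimal sketch of ridge regression with a cross-validated lambda, assuming the glmnet package (variable names are illustrative, not from the course):
    library(glmnet)
    x <- as.matrix(mtcars[, c("wt", "hp", "disp")])   # glmnet standardizes predictors by default
    y <- mtcars$mpg
    cvfit <- cv.glmnet(x, y, alpha = 0)               # alpha = 0 selects the L2 (ridge) penalty
    coef(cvfit, s = "lambda.min")                     # coefficients at the CV-chosen lambda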
If our ANOVA model has an intercept, then how many dummy variables and why?
K-1 because of linear dependence between the X's
In regularized regression, what balances the bias variance tradeoff?
Lambda
Plotting the residuals versus each predictor checks for which assumption?
Linearity
Linearity Assumption - Poisson
Linearity can be evaluated by plotting the log of the event rate versus the predicting variables. We can also evaluate linearity and the assumption of uncorrelated responses using scatter plots of the residuals versus the predicting variables.
What are the 4 assumptions of MLR?
Linearity, Constant Variance, Independence, Normality
Variable Selection Approaches
Mallows Cp AIC BIC Cross Validation
Parameter estimation
Maximizing the log-likelihood function with respect to beta0, beta1, etc. in closed (exact) form is not possible, because the log-likelihood function is non-linear in the model parameters; i.e., we cannot derive the estimated regression coefficients in exact form. We use a numerical algorithm to estimate the betas (maximize the log-likelihood function). The estimated parameters and their standard errors are approximate estimates.
If we reject the null hypothesis for overall regression, what does that mean
Meaning that the overall regression has statistically significant power in explaining the response variable.
Constant variance assumption
Means that it cannot be true that the model is more accurate for some parts of the population and less accurate for other parts of the population. A violation can result in less accurate parameter estimates and poorly calibrated prediction intervals.
Using MLE, can we derive estimated coefficients/parameters in exact form?
No, they are approximate estimated parameters
What does the QQ plot and histogram check for?
Normality
Regression Analysis
Overview of the following models: - Linear Regression - ANOVA Regression - Multiple Linear Regression - Logistic Regression - Poisson Regression
Pearson Residuals - Poisson Regression
Pearson residuals are the standardized differences between the ith observed response and the estimated expected rate of events, lambda-i hat, divided by the square root of the variance, where the variance is equal to lambda-i hat. We need to standardize the differences between the observed and expected responses since the responses have different variances. If the model is a good fit, the Pearson residuals are approximately normally distributed.
Biased Regression Penalties
Penalize large values of betas jointly. This should lead to multivariate shrinkage of the vector of regression coefficients
What are three ways we can transform the predicting variables?
Power, Log, Polynomial transformations
ANOVA
Relationship between the response variable y and one or more qualitative/categorical variables. We can write the response as a sum of the mean of the category from which the response is observed plus an error term epsilon. Assumptions: same as linear regression, except that we do not have the linearity assumption, since we're not considering a relationship with a quantitative variable. The model parameters are the group mean parameters, along with the variance of the error terms. The mean parameters are estimated as the group sample means.
Logistic Regression with replications
Residuals: We can only define residuals for binary data with replications. Goodness of fit: we perform goodness-of-fit testing only for logistic regression with replications, under the assumption that Yi is binomial with ni greater than 1.
what is the relationship between s^2 and sigma^2?
S^2 estimates sigma^2
Elastic Net Regression
Simultaneously performs variable selection and continuous shrinkage, and can select groups of correlated variables. Elastic net often outperforms the lasso in terms of prediction accuracy. The L1 penalty generates a sparse model that enforces some of the regression coefficients to be 0. The L2 penalty removes the limitation on the number of selected variables, encourages a grouping effect, and stabilizes the L1 regularization path. Alpha = 0.5 (between 0 and 1)
Elastic Net
Simultaneously performs variable selection and continuous shrinkage, and can select groups of correlated variables. Elastic net often outperforms the lasso in terms of prediction accuracy.
Spatial Process Assumptions
Stationarity: This means that the probability distribution of the spatial process does not change when shifted in space. Isotropic: This means that the distribution of Ys is the same in all orientations.
How can we find whether a transformation of the predicting variable will improve the fit?
Such a transformation can be identified by comparing the logit of the success rate versus the predicting variables.
T/F: A negative value of β 1 is consistent with an inverse relationship between x and y.
T
T/F: Cross-validation using random sampling is less computationally efficient (more computationally expensive) in estimating the model error rate than the K-fold cross validation.
T
T/F: For a classification model, training error tends to underestimate the true classification error rate of the model.
T
T/F: For logistic regression, if the p-value of the deviance test for goodness-of-fit is large, then it is an indication that the model is a good fit.
T
T/F: If one confidence interval in the pairwise comparison includes only positive values, we conclude that the difference in means is statistically significantly positive.
T
T/F: If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable.
T
T/F: We can perform goodness-of-fit analysis for a Poisson regression.
T
Variable selection addresses multicolinearity, high dimensionaltiy, and prediction vs explanatory prediction
TRUE
Variable selection is not special, it is affected by highly correlated variables
TRUE
Statistical Inference for Logistic Regression is not reliable for small sample size
TRUE - statistical inference for logistic regression is reliable only for large sample sizes
The R-squared and adjusted R-squared are not appropriate model comparisons for non linear regression but are for linear regression models.
TRUE - The underlying assumption of R-squared calculations is that you are fitting a linear model.
If p is larger than n, stepwise is feasible
TRUE - for forward, but not backward
The deviance and Pearson residuals are normally distributed
TRUE - they are approximately N(0,1) when the model is a good fit; the residual deviance (their sum of squares) is chi-square distributed
Stepwise is a heuristic search
TRUE it is also a greedy search that does not guarantee to find the best score
Linearity Assumption
The logit transformation of the probability of success is a linear combination of the predicting variables. The relationship may not be linear, however, and a transformation may improve the fit. The linearity assumption can be evaluated by plotting the logit of the success rate versus the predicting variables. If there is curvature or some non-linear pattern, it may be an indication that the lack of fit is due to non-linearity with respect to some of the predicting variables.
Assumptions
The assumption that the errors are normally distributed is needed if we want to construct confidence or prediction intervals or perform hypothesis tests. If this assumption is violated, hypothesis tests and confidence and prediction intervals can be very misleading.
Regression Coefficients
The estimated regression coefficients still have a closed-form expression; however, we correct for the variance-covariance matrix of the error terms. The estimated regression coefficients remain unbiased, but the variance of the coefficients changes. The sampling distribution of the beta estimators also remains normal under the normality assumption.
Statistical Inference for Poisson Regression
The estimators for the regression coefficients in the Poisson regression are approximately unbiased. The mean of the approximate normal distribution is beta. This approximation relies on the large sample data
What is the f-statistic?
The F-statistic is the ratio between the mean sum of squares due to regression and the mean sum of squared errors.
Goodness of Fit Test
The null hypothesis is that the model fits well; the alternative is that the model does not fit well. The test statistic for the goodness-of-fit test is the sum of squared deviance residuals, which has a chi-square distribution with n - p - 1 degrees of freedom. If the p-value is small, we reject the null hypothesis of good fit and conclude that the model is not a good fit; we want LARGE p-values, which indicate that the model may be a good fit. For the goodness-of-fit test, we compare the likelihoods of the saturated model versus the fitted model.
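A minimal sketch of this deviance test in R, using the built-in warpbreaks data purely for illustration (not course data):
    fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
    1 - pchisq(deviance(fit), df.residual(fit))   # small p-value: reject the hypothesis of good fit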
L0 penalty
The number of nonzero regression coefficients. This is equivalent to searching over all models and not viable for a large p.
Classification
The prediction of binary responses. Classification is nothing more than a prediction of the class of your response, y* (y star), given the predictor variable, x* (x star). If the predicted probability is large, then classify y* as a success.
The fitted values are defined as:
The regression line with parameters replaced with the estimated regression coefficients.
Response variable
The response data are Bernoulli or binomial with one trial with probability of success
MLE
The resulting log-likelihood function to be maximized is very complicated and is non-linear in the regression coefficients beta0, beta1, ..., betap. MLE has good statistical properties under the assumption of a large sample size (large N). For large N, the sampling distribution of the MLEs can be approximated by a normal distribution. The least squares estimation for the standard regression model is equivalent to MLE under the assumption of normality. MLE is the most widely applied estimation approach.
Binomial Data
This is binary data with repetitions
Log Rate
This is the log function of the expected value of the response ln(λ(x)) = β0 + β1x
The objective of the residual analysis is
To evaluate departures from the model assumptions
Generalized Linear Model
Generalizes the standard regression model to response data that do not have a normal distribution, i.e., to response data coming from other distributions.
A negative value of β1 is consistent with an inverse relationship between x and y.
True
A no-intercept model with one qualitative predicting variable with 3 levels will use 3 dummy variables.
True
In a multiple regression problem, a quantitative input variable x is replaced by x − mean(x). The R-squared for the fitted model will be the same
True
In a simple linear regression model, the variable of interest is the response variable.
True
In all the regression models we have considered (including multiple linear, logistic, and Poisson), the response variable is assumed to have a distribution from the exponential family of distribution.
True
Logistic Regression deals with the case where the dependent variable is binary, the conditional distribution is Binomial
True
Under the normality assumption, the estimator for β1 is a linear combination of normally distributed random variables.
True
A Poisson regression model fit to a dataset with a small sample size will have a hypothesis testing procedure with more type I errors than expected
True -
The mean sum of squared errors in ANOVA measures variability within groups. (T/F)
True. See 2.4 Test for Equal Means. MSE = within-group variability
Cook's distance (Di) measures how much the fitted values in a multiple linear regression model change when the ith observation is removed. (T/F)
True. See Lesson 3.11: Assumptions and Diagnostics "This is the distance between the fitted values of the model with all the observations versus the fitted values of the model discarding the i-th observation from the data used to fit the model. "
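A small R sketch of the rule of thumb for flagging large Cook's distances (the model and data are illustrative only):
    model <- lm(mpg ~ wt + hp, data = mtcars)
    d <- cooks.distance(model)
    which(d > 4 / nrow(mtcars))   # flag points by the 4/n rule; also inspect any D_i > 1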
The presence of certain types of outliers, such as influential points, can impact the statistical significance of some of the regression coefficients. (T/F)
True. See Lesson 3.11: Assumptions and diagnostics Outliers that are influential can impact the statistical significance of the beta parameters.
An example of a multiple linear regression model is Analysis of Variance (ANOVA). (T/F)
True. See Lesson 3.2 Basic Concepts "Earlier, we contrasted the simple linear regression model with the ANOVA model... Multiple linear regression is a generalization of both models."
Curse of Dimensionality
Using non-parametric regression with an increasing number of predicting variables p, there are many, many possible regression functions F. To maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension p.
If p is large, what can we do instead?
We can perform a heuristic search, like stepwise regression
Backward Stepwise Regression
We start with all predictors in the full model and drop one predictor at a time. Backward stepwise regression cannot be performed if p > n. It is also more computationally expensive than forward stepwise regression, and it selects larger models if p is large.
Estimation of WLS
We use a least squares approach, but we need to account for the variance-covariance matrix sigma by weighting the error terms by their standard errors. When there is correlation (sigma is not diagonal), we need to decorrelate the errors; thus, we add the inverse of the sigma matrix to the least squares objective function.
What can we do if the Normality or Constant Variance assumption does not hold?
We would transform the response variable. A common transformation is the Box-Cox transformation.
Prediction vs Explanatory Objective
When the objective is explanatory, predicting variables which are correlated may still be included. For prediction, this should be avoided.
Multicollinearity
When the predicting variables are correlated, it is important to select variables in such a way that the impact of multicollinearity is minimized
When is Mallow's Cp useful?
When there are no control variables
Regularized Regression
Without penalization: estimate the betas by minimizing the SSE. With penalization: estimate the betas by minimizing the penalized objective SSE + λ*Penalty. The bigger λ, the bigger the penalty for model complexity.
Do the error terms have constant variance?
Yes
Will R^2 always increase when we add predicting variables?
Yes
Is the predicted regression line the same as the estimated regression line at x*? How does it affect confidence intervals?
Yes, but the prediction confidence interval is wider than the estimation confidence interval because of the higher variability in the prediction.
Is training risk biased? why or why not?
Yes, the training risk is a biased estimate of the prediction risk because we use the data twice: once for fitting the model S and once for estimating the prediction risk. Thus, the training risk is biased downward; it underestimates the prediction risk.
What is the distribution of binary data WITHOUT replications?
a binomial distribution with one trial where ni = 1
lambda
a constant that has the role of balancing the tradeoff between lack of fit and model complexity
What is B^ in MLR?
a linear combination of Y's and is normally distributed.
design matrix
a matrix consisting of columns of predicting variables including the column of ones corresponding to the intercept:
prediction risk
a measure of the bias-variance tradeoff
linear relationship
a simple deterministic relationship between 2 factors, x and y
correlation coefficient
a statistic that efficiently summarizes how well the X's are linearly related to Y
R^2 or coefficient of determination
a statistic that efficiently summarizes how well the x can be used to predict the response variable.
A linear regression model was fitted to estimate the response variable Height for black cherry trees using just the Diameter. The data frame has 31 observations. Here is the model summary, with some parts missing. Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 62.0313 A 14.152 1.49e-14 *** Diameter 1.0544 0.3222 3.272 0.00276 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 5.538 on B degrees of freedom Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445 What is the MSE for this model? a. 30.669 b. 11.201 c. 20.534 d. None of the above
a. 30.669 MSE is the square of the Residual standard error: MSE = 5.538^2 = 30.669
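Reproducing the arithmetic in R (the summary()$sigma reference assumes a generic fitted lm object, used here only for illustration):
    5.538^2   # 30.669
    # for a fitted model, summary(model)$sigma^2 returns the same quantity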
Adjusted R^2
adjusted for the number of predicting variables, so it does not necessarily increase as we add more predicting variables
If our ANOVA model does not have an intercept, then how many dummy variables?
all k dummy variables
3 primary objectives of ANOVA
analyze the variability of data, test whether the means are equal, estimate confidence intervals
Pearson Residuals
as the standardized difference between the ith observed response and estimated expected response, which is ni times the probability of success.
A linear regression model was fitted to estimate the response variable Height for black cherry trees using just the Diameter. The data frame has 31 observations. Here is the model summary, with some parts missing. Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 62.0313 A 14.152 1.49e-14 *** Diameter 1.0544 0.3222 3.272 0.00276 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 5.538 on B degrees of freedom Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445 What is the value of the correlation coefficient between Height and Diameter? a. 0.2697 b. 0.5193 c. 0.3222 d. None of the above
b. 0.5193 In simple linear regression, the correlation coefficient between the response and predictor variables is ρ = √(R^2). Since R^2 = 0.2697, ρ = √0.2697 = 0.5193.
conditional model (MLR)
captures the association of a predictor variable to the response variable, conditional of other predicting variables in the model.
Trying all three link functions for a logistic regression model (C-ln-ln, probit, logit) will produce models with the same goodness of fit for a dataset.
false
For Generalized Linear Models, including Poisson regression, the deviance residuals should approximately follow the standard normal distribution if the model is a good fit for the data.
false The deviance residuals are approximately N(0,1) if the model is a good fit.
If p is small,....
fit all submodels
what does residual analysis NOT check for? (for SLR assumptions)
independence
To test if a coefficient is less than a critical value, C, we conduct a one-sided test on the _________ tail of a ___________ distribution. left, normal left, t right, normal right, t None of the above
left, t See 1.4 Statistical Inference "For β1 greater than zero we're interested in the right tail of the distribution of β̂1" - by the same logic, testing whether a coefficient is less than a constant uses the left tail of the t-distribution.
ANOVA
linear regression with one or more qualitative predicting variables
simple linear regression
linear regression with one quantitative predicting variable
g link function
links the probability of success to the predicting variables
Forward-Backward stepwise regression
meaning adding and discarding one variable at a time iteratively.
Predictive power
means that the predicting variables predict the data even if one or more of the assumptions do not hold.
Multiple linear regression
multiple quantitative and qualitative predicting variables
The F-test is a _________ tailed test with ______ and ______ degrees of freedom. one, k, N-1 one, k-1, N-k two, k-1, N-k two, k, N-1 None of the above.
one, k-1, N-k See 2.4 Test for Equal Means The F-test is a one tailed test that has two degrees of freedom, namely k − 1 and N − k.
What is 'X*'?
predictor
In logistic regression, we model the__________________, not the response variable, given the predicting variables.
probability of a success
What inference does 'saturated vs fitted model' provide?
provides inferences on the goodness of the model
What kind of variable is a response variable and why?
random, because it varies with changes in the predictor/s along with other random changes.
For Poisson regression, the variance = ?
rate lambda
F-test measures...
ratio of between-group variability and within-group variability
Simpson's paradox
refers to reversal of an association when looking at a marginal relationship versus a partial or conditional one. This is a situation where the marginal relationship has a wrong sign.
Do we evaluate normality using residuals or the response variable?
residuals
What is the sampling distribution of ^B1?
t distribution with N-2 DF
The L0 penalty provides....
the best model given a selection criteria, but it requires fitting all submodels.
The bigger the lambda,.....
the bigger the penalty for model complexity.
The overall regression F-statistic tests the null hypothesis that
all of the regression coefficients (excluding the intercept) are equal to zero
residuals
the difference between the observed response and the fitted responses
Cook's Distance
the distance between the fitted values of the model with all the data versus the fitted values of the model discarding the ith observation:
classification error rate
the probability that the new response is not equal to the classification produced by the classifier.
How will we assess the normality assumption for the ANOVA model?
the quantile-quantile normal plot and the histogram of the residuals.
Deviance residuals
the signed square root of the difference in log-likelihood between the saturated model, in which the estimated expected response is taken to be the observed response, and the fitted model.
Multicollinearity inflates...?
the standard error of the estimated coefficients
sum of square errors
the sum of square differences between the observations and the individual sample means
sum of square treatments
the sum of the square difference between the sample means of the individual samples minus the overall mean
Controlling factors
to control for selection bias in the sample. They are used as 'default' variables in order to capture more meaningful relationships.
Explanatory factors
to explain variability in the response variable; they may be included in the model even if other "similar" variables are in the model
what is the confidence interval used for?
to provide an interval estimate for the true average value of y for all members of the population with a particular value of x*
We interpret the beta in a logistic regression model with respect to?
to the odds of success
A Poisson regression model fit to a dataset with a small sample size will have a hypothesis testing procedure with more Type I errors than expected.
true
A logistic regression model may not be a good fit if the responses are correlated or if there is heterogeneity in the success that hasn't been modeled.
true
Akaike Information Criterion (AIC) is an estimate for the prediction risk.
true
An overdispersion parameter close to 1 indicates that the variability of the response is close to the variability estimated by the model.
true
The estimated regression coefficients in Poisson regression are approximate.
true The estimated parameters and their standard errors are approximate estimates.
To evaluate whether the model is a good fit or whether the assumptions hold, what should we use?
use the Pearson or deviance residuals to evaluate whether they are normally distributed, and conclude whether the model is a good fit via hypothesis testing.
what is the prediction interval used for?
used to provide an interval estimate for a prediction of y for one member of the population with a particular value of x*
ANOVA measures...
variability between samples to the variability within a sample.
pooled variance estimator (MSE)
we compare the means, assuming the variances are the same, and estimate the variance across all samples
Overdispersion
where the variability of the response variable is larger than estimated by the model.
outliers
which are data points far from the majority of the data in both x and y or just x
canonical link function
which means that parameter estimates under logistic regression are fully efficient and tests on those parameters are better behaved for small samples
MSE measures..
within-group variability
An example of a multiple regression model is Analysis of Variance (ANOVA).
True
An indication that a higher order non linear relationship better fits the data is that the dummy variables are all, or nearly all, statistically significant
True
An overdispersion parameter close to 1 indicates that the variability of the response is close to the variability estimated by the model.
True
Analysis of Variance (ANOVA) is an example of a multiple regression model.
True
Assuming the model is a good fit, the residuals in simple linear regression have constant variance.
True
Before making statistical inference on regression coefficients, estimation of the variance of the error terms is necessary.
True
Both LASSO and ridge regression always provide greater residual sum of squares than that of simple multiple linear regression.
True
Classification is nothing else than prediction of binary responses.
True
A confounding variable is a variable that influences both the dependent variable and the independent variable
True
Cook's distance (Di) measures how much the fitted values in a MLR model change when the ith observation is removed
True
Event rates can be calculated as events per units of varying size; this unit of size is called the exposure
True
For Poisson regression, we can reduce type I errors of identifying statistical significance in the regression coefficients by increasing the sample size.
True
For a MLR model to be a good fit, we need the linearity assumption to hold for at least one of the predicting variables
False - the linearity assumption needs to hold for all of the predicting variables, not just one
For a given predicting variable, the estimated coefficient of regression associated with it will likely be different in a model with other predicting variables or in the model with only the predicting variable alone.
True
For a linearly dependent set of predictor variables, we should not estimate a multiple linear regression model.
True
For assessing the normality assumption of the ANOVA model, we can use the quantile-quantile normal plot and the historgram of the residuals.
True
For both Logistic and Poisson Regression, the deviance residuals should approximately follow the standard normal distribution if the model is a good fit for the data
True
For large sample size data, the distribution of the test statistic, assuming the null hypothesis, is a chi-squared distribution
True
If a predicting variable is categorical with 5 categories in a linear regression model without intercept, we will include 5 dummy variables in the model.
True
If one confidence interval in the pairwise comparison includes only positive values, we conclude that the difference in means is positive, and statistically significant.
True
If one confidence interval in the pairwise comparison includes only positive values, we conclude that the difference in means is statistically significantly positive.
True
If one confidence interval in the pairwise comparison includes zero under ANOVA, we conclude that the two corresponding means are plausibly equal.
True
If response variable Y has a quadratic relationship with a predictor variable X, it is possible to model the relationship using multiple linear regression.
True
To what distribution can we derive the confidence interval from?
t-distribution
The sampling distribution of β̂0 is: a t-distribution, a chi-squared distribution, a normal distribution, or None of the above
t-distribution See 1.4 Statistical Inference. The distribution of β̂0 is normal, but because the error variance is estimated from the sample, the sampling distribution used for inference on β̂0 is the t-distribution.
What is the sampling distribution for individual β hat?
t-distribution with n-p-1 DF
What can we use to test for statistical significance?
t-test
If there is a group of variables among which the correlation are very high, then the Lasso...
tends to select only one variable from that group and does not care which one is selected.
What is the F-test for?
test for overall regression
T/F: If you apply linear regression under normality to count data, the assumption of constant variance still holds.
F
T/F: In Poisson regression, we also need to check for the assumption of constant variance of the error terms.
F
T/F: In multiple linear regression, a VIF value of 6 for a predictor means that 80% of the variation in that predictor can be modeled by the other predictors.
F
T/F: In multiple linear regression, if the coefficient of a quantitative predicting variable is negative, that means the response variable will decrease as this predicting variable increases.
F
T/F: In multiple linear regression, we need the linearity assumption to hold for at least one of the predicting variables
F
T/F: In simple linear regression models, we lose three degrees of freedom because of the estimation of the three model parameters β0, β1, σ².
F
T/F: Multicollinearity in multiple linear regression means that the rows in the design matrix are (nearly) linearly dependent.
F
T/F: Only the log-transformation of the response variable can be used when the normality assumption does not hold.
F
T/F: The binary response variable in logistic regression has a Bernoulli distribution.
F
T/F: The canonical link function for Poisson regression is the logit link function.
F
T/F: The coefficient of variation is used to evaluate goodness-of-fit.
F
T/F: The constant variance assumption is diagnosed using the histogram.
F
T/F: The error term in logistic regression has a normal distribution.
F
T/F: The log-likelihood function is a linear function with a closed-form solution.
F
T/F: The logit link function is the best link function to model binary response data because the models produced always fit the data better than other link functions.
F
T/F: The mean sum of square errors in ANOVA measures variability between groups.
F
T/F: The number of degrees of freedom of the χ 2 (chi-square) distribution for the variance estimator is N − k + 1 where k is the number of samples.
F
T/F: The number of parameters that need to be estimated in a logistic regression model with 5 predicting variables and an intercept is the same as the number of parameters that need to be estimated in a standard linear regression model with an intercept and same predicting variables.
F
T/F: The prediction of the response variable and the estimation of the mean response have the same interpretation.
F
T/F: The prediction of the response variable has the same levels of uncertainty compared with the estimation of the mean response.
F
T/F: The regression coefficients are used to measure the linear dependence between two variables.
F
T/F: The regression coefficients for the Poisson regression can be estimated in exact/closed form.
F
T/F: The sampling distribution for the estimated regression coefficients under logistic regression is approximately t-distribution.
F
T/F: The sampling distribution for the variance estimator in ANOVA is χ 2 (chi-square) regardless of the assumptions of the data.
F
T/F: The sampling distribution of the predicted response variable used in statistical inference is normal in multiple linear regression under the normality assumption.
F
T/F: The sampling distribution of the prediction of the response variable is a χ 2(chi-squared) distribution.
F
T/F: The statistical inference for linear regression under normality relies on large size of sample data.
F
T/F: We assess the assumption of constant-variance by plotting the response variable against fitted values.
F
T/F: We can perform goodness-of-fit analysis through residual diagnosis for a logistic regression without replications.
F
T/F: We interpret logistic regression coefficients with respect to the response variable.
F
T/F: When testing a subset of coefficients, deviance follows a chi-square distribution with q degrees of freedom, where q is the number of regression coefficients in the reduced model.
F
T/F: β̂1 is an unbiased estimator for β0.
F
T/F; A linear regression model has high predictive power if the coefficient of determination is close to 1.
F
T/F: Under logistic regression, the sampling distribution used for a coefficient estimator is a chi-square distribution.
F - the estimator is derived using maximum likelihood, so its sampling distribution is approximately normal.
A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 Construct an approximate 95% confidence interval for the coefficient of concCu. (-0.322, -0.249) (-4.931, -3.724) (-4.847, -3.808) (-0.310,-0.240)
(-0.310,-0.240) [-0.27483-1.96*0.01784, -0.27483+1.96*0.01784]
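The same Wald interval in R (a sketch of the hand computation; confint.default() on the fitted glm gives the equivalent interval):
    -0.27483 + c(-1, 1) * qnorm(0.975) * 0.01784   # approximately (-0.310, -0.240)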
Penalized or regularized regression
When we perform variable selection and estimation simultaneously.
An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed in cells. A multiple linear regression model was fitted to estimate the effect of the number of cells, amount of the radiation dose (Grays), and the rate of the radiation dose (Grays/hour) on the number of chromosomal abnormalities observed. The data frame has 27 observations. Here is the model summary and Cook's Distance plot. Coefficient Estimate SE t-value Pr(>|t|) (Intercept) -74.15392 42.24544 -1.755 0.092518 cells 0.06871 0.02196 3.129 0.004709** doseamt 41.33160 9.13907 4.523 0.000153*** doserate 20.28402 8.29071 2.447 0.022482* ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 54.05 on X degrees of freedom Multiple R-squared: 0.5213, Adjusted R-squared: 0.4588 F-statistic: 8.348 on X and Y DF, p-value: 0.0006183 Suppose you wanted to test if the coefficient for doseamt is equal to 50. What t-value would you use for this test? 1.54 -0.948 0.692 -0.882
-0.948 t-value = (41.33160−50)/ 9.13907 = −8.6684/ 9.13907 = -0.9484991
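A quick R check of this arithmetic (with n - p - 1 = 27 - 3 - 1 = 23 degrees of freedom for the reference t-distribution):
    t_val <- (41.33160 - 50) / 9.13907   # about -0.948
    2 * pt(-abs(t_val), df = 23)         # two-sided p-value for H0: beta_doseamt = 50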
A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 Suppose you wanted to test if the coefficient for concCu is equal to -0.2. What z-value would you use for this test? 0.095 -0.073 -4.195 1.411
-4.195 (-0.27483-(-0.2))/0.01784 = -4.195
You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. For students with average math and language arts scores, how many more days on average are female students absent compared to their male counterparts? 4.8545 3.5729 2.2525 0.6697
2.2525 λ̂(Xmath, Xlangarts, Xmale) = e^(2.687666 − 0.003523*Xmath − 0.012152*Xlangarts − 0.400921*Xmale). λ̂(45.5, 50, 0) = e^(2.687666 − 0.003523*45.5 − 0.012152*50 − 0.400921*0) = 6.819386. λ̂(45.5, 50, 1) = e^(2.687666 − 0.003523*45.5 − 0.012152*50 − 0.400921*1) = 4.566963. Difference: 6.819386 − 4.566963 = 2.252423
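Reproducing this calculation in R (the eta() helper below is an illustrative function, not course code):
    eta <- function(math, langarts, male)
      2.687666 - 0.003523 * math - 0.012152 * langarts - 0.400921 * male
    exp(eta(45.5, 50, 0)) - exp(eta(45.5, 50, 1))   # about 2.2525 more days for female students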
Box-cox transformation
A common transformation is the power transformation y^λ, used to improve the normality and/or constant variance assumption.
Influential points
A data point that is far from the mean of both x and y and that changes the value of the estimated parameters significantly. It can change the statistical significance, the magnitude, and even the sign.
An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed in cells. A multiple linear regression model was fitted to estimate the effect of the number of cells, amount of the radiation dose (Grays), and the rate of the radiation dose (Grays/hour) on the number of chromosomal abnormalities observed. The data frame has 27 observations. Here is the model summary and Cook's Distance plot. Coefficient Estimate SE t-value Pr(>|t|) (Intercept) -74.15392 42.24544 -1.755 0.092518 cells 0.06871 0.02196 3.129 0.004709** doseamt 41.33160 9.13907 4.523 0.000153*** doserate 20.28402 8.29071 2.447 0.022482* ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 54.05 on X degrees of freedom Multiple R-squared: 0.5213, Adjusted R-squared: 0.4588 F-statistic: 8.348 on X and Y DF, p-value: 0.0006183 For an F-test of overall significance of the regression model, what degrees of freedom would be used? 3 , 24 2, 27 3, 23 2, 23
3, 23 The numerator degrees of freedom (ndf) is equal to p and the denominator degrees of freedom (ddf) is equal to n-p-1, where n: number of observations and p: number of predictors. Hence, ndf = 3 and ddf = 27-3-1 = 23
The following output was captured from the summary output of a simple linear regression model that relates the duration of an eruption with the waiting time since the previous eruption. Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.374016 A -1.70 0.045141 * waiting 0.043714 0.011098 B 0.000052 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.4965 on 270 degrees of freedom Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108 F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16 Using the table above, what is the t-value of the coefficient for waiting, labeled B, and rounded to three decimal places? 3.939 3.931 3.935 None of the above
3.939 See 1.4 Statistical Inference t-value = Estimate /Std.Err= 0.043714/0.011098 = 3.939
You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. What is the expected number of days missed for a female student with a langarts of 48 and a math score of 50 based on the model? 6.8773 1.9106 6.6363 4.5251
6.8773 λ(Xmath, Xlangarts, Xmale) = e^( 2.687666−0.003523Xmath−0.012152Xlangarts−0.400921Xmale) λ(Xmath = 50, Xlangarts = 48, Xmale = 0) = e^( 2.687666−0.003523∗50−0.012152∗48−0.400921∗0) = 6.877258
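The same value can be verified with a one-line arithmetic check in R using the coefficients quoted above:
exp(2.687666 - 0.003523*50 - 0.012152*48 - 0.400921*0)   # 6.877258 expected days missed for this female student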
influential points
A data point that is far from the mean of both the x's and the y's; such points matter because they influence the fit of the regression.
An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed in cells. A multiple linear regression model was fitted to estimate the effect of the number of cells, amount of the radiation dose (Grays), and the rate of the radiation dose (Grays/hour) on the number of chromosomal abnormalities observed. The data frame has 27 observations. Here is the model summary and Cook's Distance plot. Coefficient Estimate SE t-value Pr(>|t|) (Intercept) -74.15392 42.24544 -1.755 0.092518 cells 0.06871 0.02196 3.129 0.004709** doseamt 41.33160 9.13907 4.523 0.000153*** doserate 20.28402 8.29071 2.447 0.022482* ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 54.05 on X degrees of freedom Multiple R-squared: 0.5213, Adjusted R-squared: 0.4588 F-statistic: 8.348 on X and Y DF, p-value: 0.0006183 Calculate the Sum of Squared Regression from the model summary. 17,484.25 73,163.60 67,181.18 55,284.40
73,163.60 You could calculate this value in several ways. This is one possible way: Fstat = MSReg/MSE = (SSReg/p)/MSE, so SSReg = Fstat × MSE × p = (8.348)(54.05²)(3) = 73,163.60
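An arithmetic check in R, using the rounded values from the summary (so the result matches the quoted answer only up to rounding):
f_stat <- 8.348     # F-statistic from the summary
mse <- 54.05^2      # residual standard error squared
p <- 3              # number of predictors
f_stat * mse * p    # roughly 73,170, i.e. the 73,163.60 answer up to rounding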
In Poisson regression, A) We model the log of the expected response variable not the expected log response variable. B) We use the ordinary least squares to fit the model. C) There is an error term. D) None of the above.
A
In logistic regression, A) The estimation of the regression coefficients is based on maximum likelihood estimation. B) We can derive exact (close form expression) estimates for the regression coefficients. C) The estimations of the regression coefficients is based on minimizing the sum of least squares. D) All of the above.
A
The mean squared errors (MSE) measures: A) The within-treatment variability. B) The between-treatment variability. C) The sum of the within-treatment and between-treatment variability. D) None of the above.
A
The objective of the residual analysis is: A) To evaluate departures from the model assumptions B) To evaluate whether the means are equal. C) To evaluate whether only the normality assumptions holds. D) None of the above.
A
The pooled variance estimator is: A) The sample variance estimator assuming equal variances. B) The variance estimator assuming equal means and equal variances. C) The sample variance estimator assuming equal means. D) None of the above.
A
Which is correct? A) The regression coefficients can be estimated only if the predicting variables are not linearly dependent. B) The estimated regression coefficient beta hat i is interpreted as the change in the response variable associated with one unit of change in the i-th predicting variable. C) The estimated regression coefficients will be the same under marginal and conditional model; only their interpretation is not. D) Causality is the same as association in interpreting the relationship between the response and predicting variables.
A
Which one is correct? A) We use a chi-square testing procedure to test whether a subset of regression coefficients are zero in Poisson regression. B) The test for subsets of regression coefficients is a goodness of fit test. C) The test for subsets of regression coefficients is reliable for small sample data in Poisson regression. D) None of the above.
A
A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of sulfur in the fungicide and concCu represents the concentration of copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 Interpret the coefficient for concCu. A 1-unit increase in the concentration of copper decreases the odds of botrytis blight surviving by 0.27483 holding sulfur constant. A 1-unit increase in the concentration of copper decreases the number of samples of botrytis blight surviving by 0.27483 holding sulfur constant. A 1-unit increase in the concentration of copper decreases the log odds of botrytis blight surviving by 0.27483 holding sulfur constant. A 1-unit increase in the concentration of copper decreases the probability of botrytis blight surviving by 0.27483 holding sulfur constant.
A 1-unit increase in the concentration of copper decreases the log odds of botrytis blight surviving by 0.27483, holding sulfur constant.
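If an odds-scale interpretation is preferred, the quoted estimate can simply be exponentiated (a small check using only the coefficient above):
exp(-0.27483)       # about 0.76: each 1-unit increase in copper multiplies the odds of survival by ~0.76
1 - exp(-0.27483)   # about a 24% decrease in the odds of survival, holding sulfur constant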
Assuming that the data are normally distributed, the estimated variance has the following sampling distribution under the simple linear model: A. Chi-square with n-2 degrees of freedom B. T-distribution with n-2 degrees of freedom C. Chi-square with n degrees of freedom D. T-distribution with n degrees of freedom
A. Chi-square with n-2 degrees of freedom 1.1 - Knowledge Check 1
The estimators of the linear regression model are derived by: A. Minimizing the sum of squared differences between observed and expected values of the response variable. B. Maximizing the sum of squared differences between observed and expected values of the response variable. C. Minimizing the sum of absolute differences between observed and expected values of the response variable. D. Maximizing the sum of absolute differences between observed and expected values of the response variable.
A. Minimizing the sum of squared differences between observed and expected values of the response variable. 1.1 - Knowledge Check 1
Which one is correct? A. The regression coefficients can be estimated only if the predicting variables are not linearly dependent. B. The estimated regression coefficient 𝛽∧𝑖 is interpreted as the change in the response variable associated with one unit of change in the i-th predicting variable . C. The estimated regression coefficients will be the same under marginal and conditional model, only their interpretation is not. D. Causality is the same as association in interpreting the relationship between the response and the predicting variables.
A. The regression coefficients can be estimated only if the predicting variables are not linearly dependent. 3.1 - Knowledge Check 1
The pooled variance estimator is: A. The variance estimator assuming equal variances. B. The variance estimator assuming equal means and equal variances. C. The sample variance estimator assuming equal means. D. None of the above.
A. The variance estimator assuming equal variances. 2.1 - Knowledge Check 1
The mean squared errors (MSE) measures: A. The within-treatment variability. B. The between-treatment variability. C. The sum of the within-treatment and between-treatment variability. D. None of the above.
A. The within-treatment variability. 2.1 - Knowledge Check 2
The objective of the residual analysis is A. To evaluate goodness of fit B. To evaluate whether the means are equal. C. To evaluate whether only the normality assumptions holds. D. None of the above.
A. To evaluate goodness of fit 2.2 - Knowledge Check 3
We detect departure from the assumption of constant variance A. When the residuals increase as the fitted values increase also. B. When the residuals vs fitted are scattered randomly around the zero line. C. When the histogram does not have a symmetric shape. D. All of the above.
A. When the residuals increase as the fitted values increase also. 1.3 - Knowledge Check 4
The sampling distribution of β ^ 0 is a: A. t-distribution B. chi-squared distribution C. normal distribution D. None of the above
A. t-distribution See 1.4 Statistical Inference The distribution of β̂0 is normal; since we work with a sample and must replace the variance with its estimate, the sampling distribution of β̂0 is the t-distribution.
Multiple Linear Regression
Assumptions: - The deviances or error terms have 0 mean - Constant variance - They are independent. Statistical Inference: - We also need to assume that the error terms are normally distributed
Multiple Linear Regression
Assumptions: We assume that the error terms epsilon_i have 0 mean and constant variance and that they are independent. For statistical inference we also assume that the error terms are normally distributed. The parameters defining the regression line are beta 0, beta 1, ..., beta p; we also have the additional parameter sigma squared, the variance of the error terms. These parameters are unknown and are estimated from the observed data using the ordinary least squares approach. Statistical inference on the regression coefficients uses the sampling distribution of the estimated regression coefficients, which is the t-distribution with n-p-1 degrees of freedom.
Comparing cross-validation methods, A) The random sampling approach is more computational efficient that leave-one-out cross validation. B) In K-fold cross-validation, the larger K is, the higher the variability in the estimation of the classification error is. C) Leave-one-out cross validation is a particular case of the random sampling cross-validation. D) None of the above.
B
In logistic regression, A) The hypothesis test for subsets of coefficients is a goodness of fit test. B) The hypothesis test for subsets of coefficients is approximate; it relies on large sample size. C) We can use the partial F test for testing whether a subset of coefficients are all zero. D) None of the above.
B
The estimated versus predicted regression line for a given x*: A) Have the same variance B) Have the same expectation C) Have the same variance and expectation D) None of the above
B
The objective of the pairwise comparison is: A) To find which means are equal. B) To identify the statistically significantly different means. C) To find the estimated means which are greater or lower than other. D) None of the above.
B
The total sum of squares divided by N-1 is: A) The mean sum of squared errors B) The sample variance estimator assuming equal means and equal variances C) The sample variance estimator assuming equal variances. D) None of the above.
B
Which one is correct? A) The logit link function is the only link function that can be used for modeling binary response data. B) Logistic regression models the probability of a success given a set of predicting variables. C) The interpretation of the regression coefficients in logistic regression is the same as for standard linear regression assuming normality. D) None of the above.
B
Which one is correct? A) We can evaluate the goodness of fit a model using the testing procedure of the overall regression. B) In applying the deviance test for goodness of fit in logistic regression, we seek large p-values, that is, not reject the null hypothesis. C) There is no error term in logistic regression and thus we cannot perform a goodness of fit assessment. D) None of the above.
B
The fitted values are defined as: A. The difference between observed and expected responses. B. The regression line with parameters replaced with the estimated regression coefficients. C. The regression line. D. The response values.
B. The regression line with parameters replaced with the estimated regression coefficients. 1.1 - Knowledge Check 1
The total sum of squares divided by N-1 is A. The mean sum of squared errors B. The sample variance estimator assuming equal means and equal variances C. The sample variance estimator assuming equal variances. D. None of the above.
B. The sample variance estimator assuming equal means and equal variances 2.1 - Knowledge Check 2
The objective of the pairwise comparison is A. To find which means are equal. B. To identify the statistically significantly different means. C. To find the estimated means which are greater or lower than other. D. None of the above.
B. To identify the statistically significantly different means. 2.2 - Knowledge Check 3
To test if a coefficient is less than a critical value, C, we conduct a one-sided test on the _________ tail of a ___________ distribution. A. left, normal B. left, t C. right, normal D. right, t E. None of the above
B. left, t See 1.4 Statistical Inference "For β1 greater than zero we're interested in the right tail of the distribution of β̂1."
The F-test is a _________ tailed test with ______ and ______ degrees of freedom. A. one, k, N-1 B. one, k-1, N-k C. two, k-1, N-k D. two, k, N-1 E. None of the above.
B. one, k-1, N-k See 2.4 Test for Equal Means The F-test is a one tailed test that has two degrees of freedom, namely k − 1 and N − k.
what are the model parameters to be estimated in MLR?
B0 (intercept), B1-Bp, and sigma squared
what are the 3 parameters we estimated in regression?
B0, B1, sigma squared (variance of the one pop.)
In SLR, we are interested in the behavior of which parameter?
B1
what does 'statistical significance' mean?
B1 is statistically different from zero.
AIC vs BIC
BIC is similar to AIC except that the complexity is penalized by log(n)/2
Which is more computationally expensive, forward or backward?
Backward
2. Which is correct? A) A multiple linear regression model with p predicting variables but no intercept has p model parameters. B) The interpretation of the regression coefficients is the same whether or not interaction terms are included in the model. C) Multiple linear regression is a general model encompassing both ANOVA and simple linear regression. D) None of the above.
C
In logistic regression: A) We can perform residual analysis for response data with or without replications. B) Residuals are derived as the fitted values minus the observed responses. C) The sampling distribution of the residual is approximately normal distribution if the model is a good fit. D) All of the above.
C
The assumption of normality: A) It is needed for deriving the estimators of the regression coefficients. B) It is not needed for linear regression modeling and inference. C) It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference. D) It is needed for deriving the expectation and variance of the estimators of the regression coefficients.
C
The estimators for the regression coefficients are: A) Biased but with small variance B) Unbiased under normality assumptions but biased otherwise. C) Unbiased regardless of the distribution of the data.
C
The variability in the prediction comes from: A) The variability due to a new measurement. B) The variability due to estimation. C) The variability due to a new measurement and due to estimation. D) None of the above.
C
Using the R statistical software to fit a logistic regression, A) We can use the lm() command. B) The input of the response variable is exactly the same if the binary response data are with or without replications. C) We can obtain both the estimates and the standard deviations of the estimates for the regression coefficients. D) None of the above.
C
Which is correct? A) If we reject the test of equal means, we conclude that all treatment means are not equal. B) If we do not reject the test of equal means, we conclude that means are definitely all equal C) If we reject the test of equal means, we conclude that some treatment means are not equal. D) None of the above.
C
Link Functions
Complementary log-log (c-log-log) function: This has very long tails, meaning that it works best for extremely skewed distributions.
Probit function: This is the inverse of the CDF of a standard normal distribution. It has the least-heavy tails among the three S-shaped functions and works well when the probabilities are all concentrated within a small range.
Logit function: This is the canonical link function, which means that parameter estimates under logistic regression are fully efficient and tests on those parameters are better behaved for small samples. Interpreting the regression coefficients in terms of log odds is possible with the logit function but not with the other S-shaped functions.
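A sketch of how the three links can be compared in R by refitting the same binomial GLM with different link arguments (assuming a data frame like the fungicide data quoted elsewhere in this deck, with Survived, Died, concS, and concCu columns):
fit_logit   <- glm(cbind(Survived, Died) ~ concS + concCu, family = binomial(link = "logit"), data = data)
fit_probit  <- glm(cbind(Survived, Died) ~ concS + concCu, family = binomial(link = "probit"), data = data)
fit_cloglog <- glm(cbind(Survived, Died) ~ concS + concCu, family = binomial(link = "cloglog"), data = data)
# Only the logit fit supports the log-odds interpretation of the coefficients.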
Which is correct? A. If we reject the test of equal means, we conclude that all treatment means are not equal. B. If we do not reject the test of equal means, we conclude that means are definitely all equal C. If we reject the test of equal means, we conclude that some treatment means are not equal. D. None of the above.
C. If we reject the test of equal means, we conclude that some treatment means are not equal. 2.1 - Knowledge Check 2
The assumption of normality: A. It is needed for deriving the estimators of the regression coefficients. B. It is not needed for linear regression modeling and inference. C. It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference. D. It is needed for deriving the expectation and variance of the estimators of the regression coefficients.
C. It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference. 1.2 - Knowledge Check 2
Which one is correct? A. A multiple linear regression model with p predicting variables but no intercept has p model parameters. B. The interpretation of the regression coefficients is the same whether or not interaction terms are included in the model. C. Multiple linear regression is a general model encompassing both ANOVA and simple linear regression. D. None of the above.
C. Multiple linear regression is a general model encompassing both ANOVA and simple linear regression. 3.1 - Knowledge Check 1
The variability in the prediction comes from: A. The variability due to a new measurement. B. The variability due to estimation C. The variability due to a new measurement and due to estimation. D. None of the above.
C. The variability due to a new measurement and due to estimation. 1.2 - Knowledge Check 3
The alternative hypothesis of ANOVA can be stated as, A. the means of all pairs of groups are different B. the means of all groups are equal C. the means of at least one pair of groups is different D. None of the above
C. the means of at least one pair of groups is different See 2.4 Test for Equal Means "Using the hypothesis testing procedure for equal means, we test: The null hypothesis, which that the means are all equal (mu 1 = mu 2...=mu k) versus the alternative hypothesis, that some means are different. Not all means have to be different for the alternative hypothesis to be true -- at least one pair of the means needs to be different."
marginal relationship
Capturing the association of a predicting variable to the response variable marginally, i.e. without consideration of other factors.
Marginal Relationship
Capturing the association of a predicting variable to the response variable without consideration of other factors
Under the null hypothesis of good fit, the test statistic's (sum of squared deviances) distribution and DOF is...?
Chi square with n-p-1 DF
Assuming that the data are normally distributed, under the simple linear model, the estimated variance has the following sampling distribution:
Chi-square with n-2 degrees of freedom
You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. The approximated distribution of the residual deviance is ____ with ____ degrees of freedom. Normal, 315 Chi-squared, 312 Chi-squared, 315 t, 312
Chi-squared, 312 The approximated distribution of the residual deviance is Chi-square with n-p-1 degrees of freedom. In this example n = 316 and p = 3 ; Hence df= 312.
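For reference, the deviance goodness-of-fit test can be run in R on any fitted glm object (model is a placeholder for the fitted Poisson regression):
pchisq(deviance(model), df.residual(model), lower.tail = FALSE)   # large p-value: no evidence of lack of fit
# In this example df.residual(model) would be 316 - 3 - 1 = 312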
In evaluating a multiple linear model: A) The F test is used to evaluate the overall regression. B) The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. C) Residual analysis is used for goodness of fit assessment. D) All of the above.
D
In the presence of near multicollinearity: A) The coefficient of determination decreases. B) The regression coefficients will tend to be identified as statistically significant even if they are not. C) The prediction will not be impacted. D) None of the above.
D
Logistic regression is different from standard linear regression in that: A) It does not have an error term B) The response variable is not normally distributed. C) It models probability of a response and not the expectation of the response. D) All of the above.
D
Logistic regression is different from standard linear regression in that: A) The sampling distribution of the regression coefficient is approximate. B) A large sample of data is required for making accurate statistical inferences. C) A normal sampling distribution is used instead of a t-distribution for statistical inference. D) All of the above.
D
Poisson regression can be used: A) To model count data. B) To model rate response data. C) To model response data with a Poisson distribution. D) All of the above.
D
Residual analysis in Poisson regression can be used: A) To evaluate goodness of fit of the model. B) To evaluate whether the relationship between the log of the expected response and the predicting variables is linear. C) To evaluate whether the data are uncorrelated. D) All of the above.
D
The estimators for the regression coefficients are: A) Biased but with small variance B) Unbiased under normality assumptions but biased otherwise C) Biased regardless of the distribution of the data. D) Unbiased regardless of the distribution of the data.
D
The objective of multiple linear regression is: A) To predict future new responses B) To model the association of explanatory variables to a response variable accounting for controlling factors. C) To test hypotheses using statistical inference on the model. D) All of the above.
D
The sampling distribution of the estimated regression coefficients is: A) Centered at the true regression parameters. B) The t-distribution assuming that the variance of the error term is unknown and replaced by its estimate. C) Dependent on the design matrix. D) All of the above.
D
We can test for a subset of regression coefficients: A) Using the F-statistic test of the overall regression. B) Only if we are interested in whether additional explanatory variables should be considered in addition to the controlling variables. C) To evaluate whether all regression coefficients corresponding to the predicting variables excluded from the reduced model are statistically significant. D) None of the above.
D
When do we use transformations? A) If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption. B) If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation. C) If the constant variance assumption does not hold, we transform the response variable. D) All of the above.
D
When we do not have a good fit in generalized linear models, it may be that: A) We need to transform some of the predicting variables or to include other variables. B) The variability of the expected rate is higher than estimated. C) There may be leverage points that need to be explored further. D) All of the above.
D
Which are all the model parameters in ANOVA? A) The means of the k populations. B) The sample means of the k populations. C) The sample means of the k samples. D) None of the above.
D
Which is correct? A) Prediction translates into classification of a future binary response in logistic regression. B) In order to perform classification in logistic regression, we need to first define a classifier for the classification error rate. C) One common approach to estimate the classification error is cross-validation. D) All of the above.
D
Which one correctly characterizes the sampling distribution of the estimated variance? A) The estimated variance of the error term has a chi-squared distribution regardless of the distribution assumption of the error terms. B) The number of degrees of freedom for the chi-squared distribution of the estimated variance is n - p - 1 for a model without an intercept. C) The sampling distribution of the mean squared error is different of that of the estimated variance. D) None of the above.
D
Which one is correct? A) The prediction intervals need to be corrected for simultaneous inference when multiple predictions are made jointly. B) The prediction intervals are centered at the predicted value. C) The sampling distribution of the prediction of a new response is a t-distribution. D) All of the above.
D
Which one is correct? A) The estimated regression coefficients and their standard deviations are approximate not exact in Poisson regression. B) We use the glm() R command to fit a Poisson linear regression. C) The interpretation of the estimated regression coefficients is in terms of the ratio of the response rates. D) All of the above.
D
Which one is correct? A) The residuals have constant variance for the multiple linear regression model. B) The residuals vs. fitted can be used to assess the assumption of independence. C) The residuals have a t-distribution if the error term is assumed to have a normal distribution. D) None of the above.
D
Which one is correct? A) The standard normal regression, the logistic regression and the Poisson regression are all falling under the generalized linear model framework. B) If we were to apply a standard normal regression to response data with a Poisson distribution, the constant variance assumption would not hold. C) The link function for the Poisson regression is the log function. D) All of the above.
D
What is the difference between LASSO and Elastic Net?
Elastic Net has an additional penalty just like the one used in ridge regression.
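A sketch with the glmnet package (assuming glmnet is the tool in use; x is a numeric predictor matrix and y the response, both hypothetical):
library(glmnet)
fit_lasso   <- cv.glmnet(x, y, alpha = 1)     # pure L1 penalty (LASSO)
fit_ridge   <- cv.glmnet(x, y, alpha = 0)     # pure L2 penalty (ridge)
fit_elastic <- cv.glmnet(x, y, alpha = 0.5)   # mix of L1 and L2 penalties (elastic net)
# Predictors are standardized internally by default (standardize = TRUE).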
In evaluating a simple linear model A. There is a direct relationship between the coefficient of determination and the correlation between the predicting and response variables. B. The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. C. Residual analysis is used for goodness of fit assessment. D. All of the above.
D. All of the Above 1.3 - Knowledge Check 4
In evaluating a multiple linear model, A. The F test is used to evaluate the overall regression. B. The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. C. Residual analysis is used for goodness of fit assessment. D. All of the above.
D. All of the Above 3.3 - Knowledge Check 4
When do we use transformations? A. If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption. B. If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation. C. If the constant variance assumption does not hold, we transform the response variable. D. All of the above.
D. All of the Above 3.3 - Knowledge Check 4
The objective of multiple linear regression is A. To predict future new responses B. To model the association of explanatory variables to a response variable accounting for controlling factors. C. To test hypotheses using statistical inference on the model. D. All of the above.
D. All of the above. 3.1 - Knowledge Check 1
The sampling distribution of the estimated regression coefficients is A. Centered at the true regression parameters. B. The t-distribution assuming that the variance of the error term is unknown and replaced by its estimate. C. Dependent on the design matrix. D. All of the above.
D. All of the above. 3.2 - Knowledge Check 2
In the presence of near multicollinearity, A. The coefficient of determination decreases. B. The regression coefficients will tend to be identified as statistically significant even if they are not. C. The prediction will not be impacted. D. None of the above.
D. None of the Above 3.3 - Knowledge Check 4
Which one is correct? A. The residuals have constant variance for the multiple linear regression model. B. The residuals vs fitted can be used to assess the assumption of independence. C. The residuals have a t-distribution distribution if the error term is assumed to have a normal distribution. D. None of the above.
D. None of the Above 3.3 - Knowledge Check 4
Which one is correct? A. If a departure from normality is detected, we transform the predicting variable to improve upon the normality assumption. B. If a departure from the independence assumption is detected, we transform the response variable to improve upon this assumption. C. The Box-Cox transformation is commonly used to improve upon the linearity assumption. D. None of the above
D. None of the above. 1.3 - Knowledge Check 4
Which are all the model parameters in ANOVA? A. The means of the k populations. B. The sample means of the k populations. C. The sample means of the k samples. D. None of the above.
D. None of the above. 2.1 - Knowledge Check 1
Which one correctly characterizes the sampling distribution of the estimated variance? A. The estimated variance of the error term has a 𝜒2distribution regardless of the distribution assumption of the error terms. B. The number of degrees of freedom for the 𝜒2 distribution of the estimated variance is n-p-1 for a model without intercept. C. The sampling distribution of the mean squared error is different of that of the estimated variance. D. None of the above.
D. None of the above. 3.1 - Knowledge Check 1
We can test for a subset of regression coefficients A. Using the F statistic test of the overall regression. B. Only if we are interested whether additional explanatory variables should be considered in addition to the controlling variables. C. To evaluate whether all regression coefficients corresponding to the predicting variables excluded from the reduced model are statistically significant. D. None of the above.
D. None of the above. 3.2 - Knowledge Check 2
You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. How does an increase in 1 unit in langarts affect the expected number of days missed, given that the other predictors in the model are held constant? Increase by 0.012152 days Increase by 0.9879 days Increase by 1.22% Decrease by 1.21%
Decrease by 1.21% The estimated coefficient for langarts is -0.012152. A one unit increase in langarts gives us e^(−0.012152) = 0.9879215. In terms of percentages, this should be interpreted as the expected number of days missed decreasing by 1.21% (1 − 0.9879215). Hence, a one unit increase in langarts results in the expected number of days missed decreasing by 1.21%, holding the other predictors in the model constant.
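The percentage can be reproduced with a quick arithmetic check in R:
exp(-0.012152)       # 0.9879215, the multiplicative change in expected days missed per unit of langarts
1 - exp(-0.012152)   # about 0.0121, i.e. a 1.21% decrease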
Deviance Residuals - Poisson Regression
Deviance residuals are the signed square root of the likelihood contribution comparing the saturated model, in which the estimated expected response is taken to be the observed response, to the fitted model. If the model is a good fit, their distribution is approximately standard normal.
Causality is the same as association in interpreting the relationship between the response and the predicting variables.
False
For a multiple regression model, both the true errors ε and the estimated residuals ε̂ have a constant mean and a constant variance.
False
For estimating confidence intervals for the regression coefficients, the sampling distribution used is a normal distribution.
False
For the model y = β0 + β1x1 + ... + βpxp + ε, where ε ∼ N(0, σ²), there are p+1 parameters to be estimated
False
Given a categorical predictor with 4 categories in a linear regression model with intercept, 4 dummy variables need to be included in the model.
False
If a departure from normality is detected, we transform the predicting variable to improve upon the normality assumption.
False
If a departure from the independence assumption is detected, we transform the response variable to improve upon this assumption.
False
The sampling distribution for the variance estimator in simple linear regression is χ 2 (chi-squared) regardless of the assumptions of the data. (T/F)
False See 1.2 Estimation Method "The sampling distribution of the estimator of the variance is chi-squared, with n - 2 degrees of freedom (more on this in a moment). This is under the assumption of normality of the error terms."
We assess the constant variance assumption by plotting the error terms, εi, against fitted values. (T/F)
False See 1.2 Estimation Method "We use ε̂i as proxies for the deviances or the error terms. We don't have the deviances because we don't have β0 and β1."
The simple linear regression coefficient, β̂0, is used to measure the linear relationship between the predicting and response variables. (T/F)
False See 1.2 Estimation Method β̂0 is the intercept and does not tell us about the relationship between the predicting and response variables.
β̂1 is an unbiased estimator for β0. (T/F)
False See 1.4 Statistical Inference "What that means is that β̂1 is an unbiased estimator for β1." It is not an unbiased estimator for β0.
The p-value is a measure of the probability of rejecting the null hypothesis. (T/F)
False See 1.5 Statistical Inference Data Example "p-value is a measure of how rejectable the null hypothesis is... It's not the probability of rejecting the null hypothesis, nor is it the probability that the null hypothesis is true."
For a multiple linear regression model to be a good fit, we need the linearity assumption to hold for only one of the predicting variables. (T/F)
False See Lesson 3.11: Assumptions and diagnostics In multiple linear regression, we need the linearity assumption to hold for all of the predicting variables, for the model to be a good fit. "For example, if the linearity does not hold with one or more predicting variables, then we could transform the predicting variables to improve the linearity assumption."
Given a quantitative predicting variable and a qualitative predicting variable with 7 categories in a linear regression model with intercept, 7 dummy variables need to be included in the model.
False See Lesson 3.2: Basic Concepts We only need 6 dummy variables. "When we have qualitative variables with k levels, we only include k-1 dummy variables if the regression model has an intercept."
In MLR, the adjusted R2 can be used to compare models, and its value will be always greater than or equal to that of R2
False - Adjusted R2 will always be less than or equal to R2
The threshold to calculate the classification error rate of a logistic regression model should always be set at 0.5
False - Although 0.5 is a common value, the threshold is problem-dependent and its value should be tuned.
In a MLR model, an observation should always be discarded when its Cook's distance is greater than 4/n (n=sample size)
False - An observation should not be discarded just because it is found to be an outlier. We must investigate the nature of the outlier before deciding to discard it.
If the constant variance assumption does not hold in MLR, we apply Box-Cox transformation to the predicting variables
False - Apply Box-Cox to the response (y) not predicting variables.
Hypothesis testing for Poisson regression can be done on small sample sizes
False - Approximation of normal distribution needs large sample sizes, so does hypothesis testing.
Bayesian information criterion (BIC) penalizes for complexity of the model more than both leave one out CV and Mallow's Cp statistic
True - BIC penalizes complexity more than the other approaches; its penalty grows with log(n) rather than the constant 2 used by AIC-like criteria.
LASSO regression will always select the same number or more predicting variables than Ridge and Elastic Net regression
False - Because LASSO can eliminate predicting variables using the penalty while Ridge and Elastic Net retain coefficients, LASSO will have the same number or fewer predicting variables.
The number of parameters that need to be estimated in a Logistic Regression model with 6 predicting variables and an intercept is the same as the number of parameters that need to be estimated in a standard Linear Regression model with an intercept and same predicting variable
False - Logistic regression has no error term, so there is no σ² to estimate; it has one fewer parameter than the corresponding linear regression model.
Multicollinearity in MLR means that the rows in the design matrix are linearly dependent.
False - Columns
Multicollinearity in MLR means that the columns in the design matrix are linearly independent.
False - Columns are actually linearly DEPENDENT
Elastic net often underperforms LASSO regression in terms of prediction accuracy because it considers both L1 and L2 penalties together
False - Elastic Net often outperforms LASSO in terms of accuracy. The difference between LASSO and Elastic Net is the addition of penalty just like the one used in Ridge Regression. By considering both, L1 and L2, we have the advantages of both LASSO and Ridge Regression.
Since there are no error terms in a Poisson model, we cannot perform residual analysis for evaluating the model's GOF.
False - For Poisson Regression, we can perform residual analysis although there is not an error term.
In MLR, the coefficient of determination is used to evaluate GOF
False - GOF in MLR is assessed through the model structure and assumptions (e.g., residual analysis), not through the coefficient of determination.
Multiplying a variable by 10 in LASSO regression, decreases the chance that the coefficient of this variable is nonzero.
False - Multiplying a predictor by 10 shrinks its coefficient by a factor of 10, so the L1 penalty affects it less and the coefficient is more likely, not less likely, to remain nonzero; this is one reason predictors should be standardized before LASSO.
In Poisson Regression, the expectation of the response variable given the predictors is equal to the linear combination of the predicting variables.
False - In Poisson Regression, the expectation of the response variable given the predictors is equal to the exponential of the linear combination of the predicting variables.
In Poisson regression, we assume a nonlinear relationship between the log rate and the predicting variables.
False - In Poisson Regression, we assume a linear relationship between the log rate and the predicting variables.
The mean square prediction error(MSPE) is a robust prediction accuracy measurement for a OLS model regardless of the characteristics of the dataset.
False - MSPE is appropriate for evaluating prediction accuracy for a linear model estimated using least squares, but it depends on the scale of the response data, and thus is sensitive to outliers.
If the outcome variable is quantitative and all explanatory variables take values 0 or 1, a logistic regression model is most appropriate.
False - More research is necessary to determine the correct model.
From the binomial approximation with a normal distribution using the central limit theorem, the Pearson residuals have an approximately standard chi-squared distribution.
False - Normal distribution
The assumption of constant variance will always hold for standard Linear Regression models with Poisson distributed response data.
False - One common problem with fitting a linear regression model to Poisson data is the departure from the assumption of Constant variance.
If a Poisson regression model does not have a good fit, the relationship between the log of the expected rate and the predicting variables might not be linear
False - The implication only runs one way: a nonlinear relationship causes a poor Poisson fit, but a poor fit does not necessarily mean the relationship is nonlinear.
The presence of certain types of outliers can impact the statistical significance of some of the regression coefficients of a MLR model
True - Outliers that are influential can impact the statistical significance of the beta parameters.
Parameter tuning is not recommended as part of the sub-sampling approach for addressing the p-value problem with large samples.
False - Parameter tuning is recommended as part of the sub-sampling approach.
The F-test can be used to test for the overall regression in Poisson regression.
False - Poisson uses Chi-squared test to test for overall regression.
Another criteria for variable selection is cross validation which is a direct measure of explanatory power.
False - Predictive power
Forward stepwise variable selection starts with the simpler model and select the predicting variable that increases the R2 the most, unless the R2 cannot be increased any further by adding variables
False - R2 is not compared during stepwise variable selection. Variables are selected if they reduce the AIC or BIC of a model.
In Logistic Regression, R2 could be used as a measure of explained variation in the response variable.
False - In logistic regression the response variable is binary, so R2 does not measure explained variation in the response.
R2 decreases as more predictors are added to a MLR model, given that the predictors added are unrelated to the response variable
False - R2 never decreases as more predictors are added to MLR.
Random sampling is computationally less expensive than the K-fold cross validation
False - Random sampling is computationally more expensive than the K-fold CV, with no clear advantage in terms of accuracy of the estimation classification error. K fold CV is preferred at least from a computation standpoint.
It is not required to standardize or rescale the predicting variables when performing regularized regression
False - Regularized regression requires standardization or scaling of the predicting variables.
The p-value of the test computed as the left tail of the chi-squared distribution
False - Right tail
When assessing GOF for a Logistic Regression model on a binary data with replications, the assumption is that the response variables(Yi) come from a normal distribution.
False - The assumption is that the response variable comes from a binomial distribution
The null hypothesis for the GOF of a Logistic Regression model is that the model does not fit the data.
False - The null hypothesis is the model fits the data
In regularized regression, the penalization is generally applied to all regression coefficient (Bo, ...., Bp) where p = number of predictors
False - The shrinkage penalty is applied to B1, ...., Bp but not to the intercept Bo
We can diagnose the constant variance assumption in Poisson regression using the normal probability plot.
False - There is not a constant variance assumption in Poisson Regression
If a Poisson regression model is found to be overdispersed, there is an indication that the variability of the response variable implied by the model is larger than the variability present in the observed response variable
False - This indicates the variability of the response variable implied by the model is smaller than the variability present in the observed response variable.
IN MLR, a VIF value of 6 for a predictor means that 92% of the variance in that predictor can be modeled by other predictors
False - VIF = 1/(1 - Rj^2), so Rj^2 = 1 - 1/VIF = 1 - 1/6 ≈ 0.833. A VIF of 6 means about 83.3% of the variance in that predictor can be modeled by the other predictors, not 92%.
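A quick check of that relationship, plus the usual way VIFs are obtained in practice (the car package and the fitted model object are assumptions, not part of the card):
1 - 1/6                      # 0.833: the R_j^2 implied by a VIF of 6
# library(car); vif(model)   # a common way to compute VIFs for a fitted lm object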
Variable selection is a simple and solved statistical problem since we can implement it using the R software
False - Variable selection for a large number of predicting variables is an "unsolved" problem, and variable selection approaches should be tailored to the problem at hand.
If the constant variance assumption does not hold in MLR, we apply Box-Cox transformation to the predicting variables.
False - We apply a Box-Cox transformation to the response variable.
In stepwise regression, we accept/remove variables that produce larger AICs or BICs
False - We keep the model that has the smallest AIC or BIC
When testing a subset of coefficients, deviance follows a Chi-square distribution with q Degree of Freedom, where q is the number of regression coefficients in the reduced model
False - q is number of regression coefficients discarded from the full model to get the reduced model
The estimated coefficients of a regression line is positive, when the coefficient of determination is positive.
False - r squared is always positive.
In multiple linear regression, as the value of R-squared increases, the relationship between predictors becomes stronger
False - r squared measures how much variability is explained by the model, NOT how strong the predictors are.
Ridge regression is a regularized regression approach that can be used for variable selection
False - Ridge is a regularized regression approach, but it does not perform variable selection
In Logistic Regression, R2 can be used as a measure of explained variation in the response variable
False - response in logistic regression is binary. R2 is only used to explain variability in the dependent variable that is explained by the model.
Ridge regression cannot be used to deal with problems caused by high correlation among the predictors
False - ridge regression is used to deal with this problem actually
When dealing with a multiple linear regression model, an adjusted R-squared can be greater than the corresponding unadjusted R-Squared value.
False - The adjusted R-squared takes the number of predictors into account and is always less than or equal to the unadjusted R-squared.
The larger the number of variables in the model, the larger the training risk.
False - the larger the number of variables in a model the lower the training risk.
There is never a situation where a complex model is best.
False - there are situations where a complex model is best
In Poisson regression we model the error term
False - there is no error term
In logistic regression we have an additional error term to estimate.
False - there is no error term in logistic regression.
The estimators for the regression coefficients in the Poisson regression are biased.
False - they are unbiased
Generally, models with more covariates have high bias but low variance
False - they have low bias but high variance.
The variance estimator in logistic regression has a closed form expression.
False - use statistical software to obtain the variance-covariance matrix
A t-test is used for testing the statistical significance of a coefficient given all predicting variables in a Poisson regression model.
False - use z-test for Poisson
For a linear regression under normality, the variance used in the Mallow's Cp penalty is the true variance, not an estimate of variance
False - variance used in Mallow's Cp penalty is the estimated variance from the full model
We can use residual analysis to conclusively determine the assumption of independence
False - we can only determine uncorrelated errors.
In logistic regression for goodness of fit, we can only use the Pearson residuals.
False - we can use Pearson or Deviance.
We cannot estimate the Regression coefficients of a MLR if predicting variables are linearly independent
False - We cannot estimate the coefficients when the predictors are linearly dependent; linear independence is exactly what is required (watch the wording).
For Poisson regression we estimate the expectation of the log response variable.
False - we estimate the log of the expectation of the response variable.
In logistic regression we interpret the Betas in terms of the response variable.
False - we interpret it in terms of the odds of success or the log odds of success
If there is a high correlation between variables, Lasso will select both.
False - LASSO will tend to select only one of the two.
A linear Regression model is a good fit to the data set if the Adjusted R2 is above 0.85
False - Adjusted R2 is not a measure of GOF. GOF refers to all model assumptions holding.
In regression inference, the 99% confidence interval of coefficient \beta_0 is always wider than the 95% confidence interval of \beta_1.
False - A 99% confidence interval is wider than a 95% interval only for the same coefficient; intervals for β0 and β1 cannot be compared in this way.
A correlation coefficient close to 1 is evidence of a cause-and-effect relationship between the two variables.
False - Cause and effect can only be determined by a well-designed experiment.
In simple linear regression models, we lose three degrees of freedom when estimating the variance because of the estimation of the three model parameters β 0 , β 1 , σ^2. (T/F)
False. See 1.2 Estimation Method "The estimator for σ² is σ̂², and is the sum of the squared residuals, divided by n - 2." We lose two degrees of freedom because the variance estimator, σ̂², uses only the estimates for β0 and β1 in its calculation.
With the Box-Cox transformation, when λ = 0 we do not transform the response. (T/F)
False. See 1.8 Diagnostics When λ = 0, we transform using the natural log.
In ANOVA, the linearity assumption is assessed using a plot of the response against the predicting variable. (T/F)
False. See 2.2 - Estimation Method Linearity is not an assumption of Anova.
What are the assumptions for multiple linear regression?
Linearity/Mean zero assumption, Constant Variance, Independence and Normality (for statistical inference)
In multiple linear regression, a VIF value of 6 for a predictor means that 90% of the variation in that predictor can be modeled by the other predictors. (T/F)
False. See Lesson 3.13: Model Evaluation and Multicollinearity A VIF value of 6 for a predictor means that 83.3% of the variation in that predictor can be modeled by the other predictors in the model.
Multicollinearity in multiple linear regression means that the rows in the design matrix are (nearly) linearly dependent. (T/F)
False. See Lesson 3.13: Model Evaluation and Multicollinearity Multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.
Multicollinearity among the predicting variables will not impact the standard errors of the estimated regression coefficients. (T/F)
False. See Lesson 3.13: Multicollinearity Multicollinearity in the predicting variables can impact the standard errors of the estimated coefficients. "However, the bigger problem is that the standard errors will be artificially large."
In multiple linear regression, the prediction of the response variable and the estimation of the mean response have the same interpretation. (T/F)
False. See Lesson 3.2.9: Regression Line and Predicting a New Response. In multiple linear regression, the prediction of the response variable and the estimation of the mean response do not have the same interpretation.
A multiple linear regression model contains 6 quantitative predicting variables and an intercept. The number of parameters to estimate in this model is 7. (T/F)
False. See Lesson 3.2: Basic Concepts The number of parameters to estimate in a multiple linear regression model containing 6 quantitative predicting variables and an intercept is 8: 7 regression coefficients (β0,β1,...,β6) and the variance of the error terms (σ2).
Given a quantitative predicting variable and a qualitative predicting variable with 7 categories in a linear regression model with intercept, 7 dummy variables need to be included in the model. (T/F)
False. See Lesson 3.2: Basic Concepts We only need 6 dummy variables. "When we have qualitative variables with k levels, we only include k-1 dummy variables if the regression model has an intercept."
The estimated variance of the error terms of a multiple linear regression model with intercept can be obtained by summing up the squared residuals and dividing the sum by n - p , where n is the sample size and p is the number of predictors. (T/F)
False. See Lesson 3.3: Regression Parameter Estimation The estimated variance of the error terms of a multiple linear regression model with intercept should be obtained by summing up the squared residuals and dividing that by n-p-1, where n is the sample size and p is the number of predictors as we lose p+1 degrees of freedom when we estimate the p coefficients and 1 intercept.
The causation of a predicting variable to the response variable can be captured using multiple linear regression on observational data, conditional of other predicting variables in the model. (T/F)
False. See Lesson 3.4 Model Interpretation "This is particularly prevalent in a context of making causal statements when the setup of the regression does not allow so. Causality statements can only be made in a controlled environment such as randomized trials or experiments. "
Conducting t-tests on each β parameter in a multiple linear regression model is preferable to an F-test when testing the overall significance of the model. (T/F)
False. See Lesson 3.7: Testing for Subsets of Coefficients "We cannot and should not select the combination of predicting variables that most explains the variability in the response based on the t-tests for statistical significance because the statistical significance depends on what other variables are in the model."
Under testing a subset of coefficients, what is the distribution and degrees of freedom for the deviance?
For large sample size data, the distribution of this test statistic, assuming the null hypothesis is true, is a chi-square distribution with q degrees of freedom, where q is the number of regression coefficients discarded from the full model to get the reduced model, i.e., the number of additional (Z) predicting variables being tested.
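In R this partial deviance test is usually carried out by comparing the reduced and full glm fits (a sketch; reduced_model and full_model are placeholders):
anova(reduced_model, full_model, test = "Chisq")   # chi-square test on q degrees of freedom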
Akaike Information Criterion
For linear regression under normality, this is the training risk plus a complexity penalty. The complexity penalty is (2 × number of predictors in the submodel × estimated variance from the full model)/n. For AIC, we specify k = 2 (e.g., in R's step() function). Select the model with the smallest AIC.
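In R's step() function, AIC and BIC differ only in the penalty argument k (a sketch; full_model is a placeholder for the fitted lm and n for the sample size):
step(full_model, direction = "both", k = 2)        # AIC: penalty factor 2
step(full_model, direction = "both", k = log(n))   # BIC: penalty factor log(n)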
Using the Tukey method to find the confidence interval of the means, what does having a '0' in the CI mean?
For the confidence intervals that include zero, it's plausible that the difference between means is zero.
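A minimal sketch of how these intervals are obtained in R (response, group, and df are hypothetical names):
fit <- aov(response ~ group, data = df)
TukeyHSD(fit, conf.level = 0.95)   # pairwise differences in means; intervals containing 0 suggest the difference may be 0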
Forward Stepwise Regression
Forward stepwise regression starts with no predictors in the model and usually tends to select smaller models. As a result, it is preferable to backward stepwise regression. The selected model is not necessarily the same as the one selected using backward stepwise regression.
Which is preferred, forward or backward?
Forward, because it selects smaller models versus backwards which starts with a full model.
distribution of pearson residuals?
Approximately standard normal, obtained from the binomial approximation with a normal distribution using the central limit theorem.
distribution of deviance residuals?
Approximately standard normal, from the properties of the likelihood function, if the model assumptions hold. That is, if the model is a good fit.
Multicollinearity
Generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant info, skewing the results in a regression model.
Combinations of Variables
Given p predicting variables, we can have 2^p combinations of the variables, and thus, 2^p models to choose from
When we are testing for overall regression for a Logistic model, what is the H0 and HA?
H0: all regression coefficients except intercept are 0 HA: at least one is not 0.
What are the null and alternative hypotheses of the F test?
H0: all the regression coefficients except the intercept are 0. HA: at least one is not 0.
Another GOF test is hypotheses testing, what is the H0 and HA?
H0: is that the model fits well. HA: the alternative is that the model does not fit well.
What is the null and alternative hypothesis for MLR?
H0: all regression coefficients (except the intercept) are equal to 0. HA: at least one coefficient is not equal to 0.
H0 and HA for test of equal means?
H0: the means are equal HA: some means are different
What are the null and alternative hypotheses for ANOVA for MLR?
H0: the regression coefficients all equal zero. HA: At least one of the regression coefficients is not equal to zero.
The estimated versus predicted regression line for a given x*
Have the same expectation
Forward-Backward Regression
Here we add and discard one variable at a time iteratively.
Generalized Linear Models (GLM)
Here, the response Y is assumed to have a distribution from the exponential family of distributions (Normal, Binomial, Poisson, Gamma, etc.). Under this model, we model a transformation g of the expectation of Y, given the predicting variables, as a linear combination of the predicting variables. We can write the expectation as the inverse of the g transformation of the linear combination of the predicting variables. **Include table w/ link function & regression function pg 67
Hypothesis testing (statistical significance: +/-)
Here, the z-value is the same but the P-value will change. Positive: P-value = P(Z > z-value). Negative: P-value = P(Z < z-value). **Applies for Logistic & Poisson Regression
Robust Regression
Here, we can estimate the regression coefficients in the presence of outliers When there are outliers in the distribution, it has heavy tails and we thus have departures from the normality assumption
Maximum Likelihood Estimate
Here, we maximize the likelihood function with respect to the model parameters, or in this case, the regression coefficients. The log-likelihood that needs to be maximized is highly non-linear; thus we cannot derive a closed-form (exact) expression for the estimates. The estimated parameters and their standard errors are approximate estimates.
high dimensionality
In linear regression, when the number of predicting variables P is large, we might get better predictions by omitting some of the predicting variables.
Nonparametric Regression
In non-parametric regression, the regression function has an unknown structure given the predicting variables, and the regression function does not depend on any parameters
Poisson Regression Assumptions
Linearity: The log transformation of the rate is a linear combination of the predicting variables. Independence: The response variables are independently observed. Link: The link function g is the log function; the log link function is almost always used.
Logistic Regression Assumptions
Linearity: The relationship between g of the probability of success and the predicting variables is a linear function. Independence: The response binary variables are independently observed. Logit: The logistic regression model assumes that the link function g is a logit function.
An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed in cells. A multiple linear regression model was fitted to estimate the effect of the number of cells, amount of the radiation dose (Grays), and the rate of the radiation dose (Grays/hour) on the number of chromosomal abnormalities observed. The data frame has 27 observations. Here is the model summary and Cook's Distance plot.
Coefficient    Estimate    SE        t-value  Pr(>|t|)
(Intercept)   -74.15392   42.24544  -1.755   0.092518
cells           0.06871    0.02196   3.129   0.004709 **
doseamt        41.33160    9.13907   4.523   0.000153 ***
doserate       20.28402    8.29071   2.447   0.022482 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54.05 on X degrees of freedom
Multiple R-squared: 0.5213, Adjusted R-squared: 0.4588
F-statistic: 8.348 on X and Y DF, p-value: 0.0006183
How does an increase in 1 unit in doserate affect the expected number of chromosome abnormalities, given that the other predictors in the model are held constant?
Increase of 8.291
Decrease of 41.331
Increase of 20.284
Decrease of 9.134
Increase of 20.284 The estimated coefficient for doserate is 20.284. If we fix all other predictors, for each 1 unit increase in doserate, the expected number of chromosome abnormalities increases 20.284 units.
The assumption of normality:
It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference.
How do we interpret B1?
It is the estimated expected change in the response variable associated with one unit of change in the predicting variable.
What is the purpose of testing a subset of coefficients?
It simply compares two models and decides whether the larger model is statistically significantly better than the reduced model.
Weighted Least Squares (WLS)
It's a multiple regression model, but the difference is that we assume that the variance of the errors is not constant. The vector of errors can be assumed to have a variance-covariance matrix sigma, thus allowing for correlated errors. We have the independence assumption only when the sigma matrix is a diagonal matrix.
Penalties
L0: This is the number of nonzero regression coefficients. Maximizing Q(betas) means searching through all submodels.
L1: This penalty applied to the vector of regression coefficients is equal to the sum of the absolute values of the regression coefficients to be penalized. Maximizing Q forces many betas to be zeros (Lasso).
L2: This penalty applied to the vector of regression coefficients is equal to the sum of the squared regression coefficients to be penalized. Maximizing Q accounts for multicollinearity (Ridge).
LOOCV
LOOCV can be approximated by the sum between the training risk + the complexity penalty. The complexity penalty is (2 * # of predictors in submodel * estimated_variance of submodel)/n The variability of the submodel is smaller than that of the full model, thus LOOCV penalizes complexity less than Mallow's Cp LOOCV is ~ AIC when the true variance is replaced by the estimate of the variance from the submodel
Does the statistical inference for logistic regression rely on a small or large sample size?
Large; if the sample size is small, the statistical inference is not reliable.
Lasso Regression
Lasso performs estimation and variable selection simultaneously. The estimated regression coefficients from Lasso are less efficient than those provided by ordinary least squares. The penalty is the sum of absolute values of the regression coefficients except for the intercept, which forces coefficients to be zero. We can apply lasso to standard linear regression, logistic regression, Poisson regression, and other linear models. We do not have a closed-form expression for the estimated regression coefficients under this model. Once a coefficient is non-zero, it does not go back to zero. In the case where p, the number of predictors, is larger than n, the number of observations (that is, more variables than observations), the Lasso selects at most n variables because of the nature of the convex optimization problem. If there is high correlation among predictors, the prediction performance of the Lasso is dominated by ridge regression. If there is a group of variables among which the correlation is very high, then the Lasso tends to select only one variable from that group. Alpha = 1
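A minimal glmnet sketch (illustrative; assumes the glmnet package and a hypothetical data frame dat with numeric response y):
library(glmnet)
X <- model.matrix(y ~ ., data = dat)[, -1]   # predictor matrix, dropping the intercept column
cvfit <- cv.glmnet(X, dat$y, alpha = 1)      # alpha = 1 is lasso; alpha = 0 is ridge; values in between give elastic net
coef(cvfit, s = "lambda.min")                # coefficients at the CV-selected lambda; many are exactly zero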
3 assumptions of the logistic regression model
Linearity, Independence, Logit link function
what are the 4 assumptions of linear regression?
Linearity/Mean Zero, Constant Variance, Independence, Normality
Statistical Inference
Logistic Regression: Normal distribution. The statistical inference based on the normal distribution applies only under large sample data. If the sample size n is small, the statistical inference is not reliable, i.e. warn on the lack of reliability of the results. Standard Regression: t-distribution. The statistical inference relies on the t-distribution, which applies under both small and large samples. **Applies for Logistic & Poisson Regression
we estimate the Poisson model parameters using...
MLE
Introducing some bias yields a decrease in....
MSE
Mean Squared Error
MSE is commonly used to obtain estimators that may be biased, but less uncertain than unbiased ones. The MSE can be controlled It is possible to find a model with lower MSE than the unbiased/full model Introducing some bias yields a decrease in MSE followed by a later increase
4 types of Variable Selection Criteria
Mallow's Cp, AIC, BIC, LOOCV
Assumption of Independence
Means that the deviances (the error terms), or in fact the response variables y, are independently drawn from the data-generating process. Violations of this assumption most often occur in time series data and can result in very misleading assessments of the strength of the regression.
Linearity/Mean zero assumption
Means that the expected value of the errors (deviances) is zero. A violation leads to difficulties in estimating B0 and means that our model does not include a necessary systematic component.
L2 penalty - Ridge
Minimizing the penalized least squares using this penalty accounts for multicollinearity, but does not perform variable selection. The result in regularized regression is a so-called ridge regression
The estimators of the linear regression model are derived by:
Minimizing the sum of squared differences between observed and expected values of the response variable.
What do we mean by model parameters in statistics?
Model parameters are unknown quantities, and they stay unknown regardless how much data are observed. We estimate those parameters given the model assumptions and the data, but through estimation, we're not identifying the true parameters. We're just estimating approximations of those parameters.
Poisson Regression
Model response data in the form of counts. We model the expectation of the response variable as the exponential of the linear combination of the predicting variables.
Assumptions:
- Log transformation of the rate is a linear combination of the predicting variables
- The response count data are independently observed
- The link function g is the log function
The only model parameters are the regression coefficients, estimated using the maximum likelihood approach. Statistical inference on the regression coefficients uses an approximation of the sampling distribution, the normal distribution.
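An illustrative R sketch, assuming a hypothetical data frame dat with a count response counts:
pfit <- glm(counts ~ x1 + x2, data = dat, family = poisson)
summary(pfit)      # approximate (normal-based) inference on the coefficients
exp(coef(pfit))    # estimated rate ratios for a one-unit increase in each predictor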
Variable Selection
Models with many predictors/covariates have low bias but high variance. Models with few predictors/covariates have high bias but low variance
Pooled variance estimator - N vs k?
N = combined size of samples, k = # of samples
If p(predictors) is large, is it feasible to fit a large number of submodels?
No
Is testing a subset of coefficients a GOF test?
No
Do the three stepwise approaches select the same model?
No, especially when p is large.
A data point far from the mean of the x's and y's is always:
an influential point and an outlier
a leverage point but not an outlier
an outlier and a leverage point
an outlier but not a leverage point
None of the above
None of the above. See 1.9 Outliers and Model Evaluation. We only know that the data point is far from the mean of the x's and y's. Because it is far from the mean of the x's, it fits the definition of a leverage point, so we can eliminate the answers that do not include a leverage point. That leaves "a leverage point but not an outlier" and "an outlier and a leverage point", both of which we can also eliminate: we do not have enough information to know whether the point is or is not an outlier, so neither claim is always true. Hence none of the answers is always correct.
what is the distribution of B1?
Normal
Normal Distribution
Normal distribution relies on a large sample of data. Using this approximate normal distribution we can further derive confidence intervals. Since the distribution is normal, the confidence interval is the z-interval **Applies for Logistic & Poisson Regression
what can we use to check for normality?
QQ plot and histogram
T/F: An overdispersion parameter close to 1 indicates that the variability of the response is close to the variability estimated by the model.
T
What would we do if the T value is large?
Reject the null hypothesis that β1 is equal to zero. If the null hypothesis is rejected, we interpret this that β1 is statistically significant.
Time Series
Response data are correlated. This correlation results in a much smaller number of degrees of freedom than otherwise assumed under independence. Moreover, because of the correlation, the data are concentrated into a smaller part of the probability space. Ignoring dependence leads to inefficient estimates of the regression parameters and to poor predictions; standard errors are unrealistically small, i.e. confidence intervals are too narrow and statistical inferences are improper.
What are the two most common approaches to regularized regression?
Ridge and LASSO
For the G-o-F tests, do we reject the null hypotheses when the p-value is SMALL or BIG?
Small and we conclude the model is not a good fit
Cross validation
Split the data into two parts: the training data and the testing/validation data. The training data is used to fit the model and thus get the estimated regression coefficients. The testing or validation data is used to predict or classify the responses for that portion of the data, which are then compared to the observed responses to estimate the classification error. One can repeat the process several times.
Poisson Regression vs Log transformed Linear Regression
Standard Regression: We estimate the expectation of the log of the response - E(log(Y)). The variance under the standard regression is assumed constant.
Poisson Regression: We estimate the log of the expectation of the response - log(E(Y)). The variance of the response is assumed to be equal to the expectation; thus, the variance is not constant.
**Use Poisson regression especially when the response data are small counts
**Using standard linear regression with a log transformation instead of Poisson regression will result in violations of the assumption of constant variance
**Standard linear regression could be used when the counts are large, together with the variance-stabilizing transformation √(Y + 3/8), i.e. the square root of the response plus 3/8
Overall Regression
Standard Regression: We use the F test to test for the overall regression Logistic Regression: We use the difference between the log likelihood function of the model under the null hypothesis (also called the null-deviance), and the log likelihood of the full model (residual deviance) i.e. the difference between the null deviance and the residual deviance
T/F: Although there are no error terms in a logistic regression model using binary data with replications, we can still perform residual analysis.
T
T/F: An approximate test can be used to test for the overall regression in Poisson regression.
T
Assumptions - Mixed Effects
The error term is normally distributed with zero mean and constant variance. The group effect is normally distributed with mean 0 and constant variance, where the variances of the error terms and of the group effect are two different parameters. In a random effects model, the observations are no longer independent, even if the error terms are independent.
What is the partial F-test?
The hypothesis test for whether a subset of regression coefficients are all equal to zero.
Coefficient Test (Deviance)
The hypothesis testing procedure is testing the null hypothesis that all alpha coefficients are zero, versus the alternative that at least one alpha coefficient is not zero For the testing procedure for subsets of coefficients, we compare the likelihood of a reduced model versus a full model. This test provides inferences on the predictive power of the model. Predictive power means that the predicting variables predict the data even if one or more of the assumptions do not hold
Logit link function assumption
The logistic regression model assumes that the link function is a so-called logit function. This is an assumption since the logit function is not the only function that yields s-shaped curves. And it would seem that there is no reason to prefer the logit to other possible choices.
Log odds function
The logit function which is the log of the ratio between the probability of a success and the probability of a failure
Model parameters
The model parameters are the regression coefficients. There is no additional parameter to model the variance since there's no error term. For P predictors, we have P + 1 regression coefficients for a model with intercept (beta 0). We estimate the model parameters using the maximum likelihood estimation approach
Linear Regression
The predicting variable is assumed to be fixed, whereas the response variable y is random. We model their relationship as a linear function in x plus an error term epsilon.
Assumptions:
- Linearity: The relationship between the response and the predicting variable is linear, or the expectation of the error term is 0
- Constant Variance: The variance of the error term is the same across all observations
- Independence: Uncorrelated errors
- Normality: Normality of the error terms
We assume that the error terms are independent and identically distributed from a normal distribution with mean 0 and variance sigma squared. The parameters are unknown but estimated based on the observed data. Statistical inference on the regression coefficients is performed using the sampling distribution of the estimated regression coefficients: the t-distribution with n-2 degrees of freedom.
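For context, a minimal simple linear regression sketch in R, assuming a hypothetical data frame dat with columns y and x:
fit <- lm(y ~ x, data = dat)
summary(fit)    # t-tests on the coefficients with n - 2 degrees of freedom
confint(fit)    # confidence intervals for beta0 and beta1
plot(fit)       # residual diagnostics (linearity, constant variance, normality)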
Regression Coefficient
The regression coefficient is interpreted as the log of the ratio of the rates for an increase of one unit in the predicting variable. For p predictors, we have p + 1 regression coefficients for a model with intercept. We estimate the model parameters using maximum likelihood estimation.
Poisson Regression
The response Y in Poisson regression is assumed to have a Poisson distribution, and this is commonly used for modeling count or rate data. We assume that the i-th response Yi has a Poisson distribution, with rate lambda i. Alternatively, log of the rate lambda i is equal to the linear combination of the predicting variables We do not interpret beta with respect to the response variable but with respect to the ratio of the rate There is no error term
Generalized Linear Models
The response Y is assumed to have a distribution from the exponential family of distributions. Example of distributions in the exponential family of distributions are normal, binomial, Poisson, gamma We model a transformation g of the expectation of Y as a linear combination of the predicting variables The transformation g is called the link function, since it links the expectation of the response data to the predicting variables. The transformations g depends on the distribution of the response data
Overall Regression (Logistic)
The test statistic is a chi-squared distribution with p degrees of freedom where p is the number of predicting variables. We reject the null hypothesis when the P-value is small, indicating that the overall regression has explanatory power.
G transformation
The transformation g is called a link function since it links the expectation of the response to the predicting variables
The variability in the prediction comes from
The variability due to a new measurement and due to estimation.
The mean squared errors (MSE) measures:
The within-treatment variability.
Wald Test (Z-test)
The z-test value is the ratio between the estimated coefficient minus 0 (the null value) and its standard error. We reject the null hypothesis that the regression coefficient is 0 if the z-value is larger in absolute value than the z critical point (the 1 - alpha/2 quantile of the normal distribution), and we interpret that the coefficient is statistically significant. **Applies for Logistic & Poisson Regression
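A hedged sketch of computing the Wald z-test by hand in R (hypothetical variable names; matches what summary() reports):
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))   # standard errors of the estimated coefficients
z   <- (est - 0) / se          # z-values against the null value 0
2 * pnorm(-abs(z))             # two-sided p-values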
Reasons why a model may not be a good fit
There may be other variables that should be included in the model. The relationship between the logit of the expected probability and the predictors might be multiplicative rather than additive. There may be a departure from the linearity assumption. Outliers and leverage points are also still an issue for this model. The logit function may not fit the data well. The binomial distribution may not be appropriate, for example if there is correlation among the responses or there is heterogeneity in the success probability that hasn't been modeled; both of these violations can lead to what we call overdispersion.
Nonlinear & Nonparametric Regression
These approaches are commonly used to deal with nonlinearity
Regularized (Penalized) Regression
These approaches perform variable selection and estimation simultaneously.
Deviance Residuals
These are the signed square roots of the log-likelihood evaluated at the saturated model (where the estimated expected response is taken to be the observed response) versus the fitted model. Deviance residuals have a standard normal distribution if the model is a good fit, i.e. if the model assumptions hold.
Spatial Regression
This approach also deals with correlated data If the assumption of uncorrelated errors does not hold, it can lead to misleading assessments of the strength of the regression Spatial processes can be observed over regular grid such as images. A regular grid is also called the lattice. Spatial processes can also be observed on irregular grids
Mixed Effects Models
This approach deals with replications in the response data Generally, a group effect is random if we can think of the responses we observe in that group to be samples from a larger population
Mallows Cp
This approach is useful when there are no control variables. It assumes we can estimate the variance from the full model. **This is not the case when p > n The complexity penalty is (2 * # of predictors in submodel * estimated_variance of full model)/n Select the model with the smallest Cp score
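A rough sketch of the Cp score as defined on this card (training risk plus penalty), with hypothetical models sub and full fitted by lm on a data frame dat; whether the intercept is counted in the submodel size follows the course convention, so treat the count below as illustrative:
n <- nrow(dat)
full <- lm(y ~ ., data = dat)
sub  <- lm(y ~ x1 + x2, data = dat)
sigma2_full <- summary(full)$sigma^2   # variance estimated from the full model
cp <- sum(resid(sub)^2) / n + 2 * length(coef(sub)) * sigma2_full / n
cp   # compare across submodels; smaller is better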
Type I Error
This happens if the sample size, or n is small. The hypothesis testing procedure will have a probability of Type I error larger than the significance level (i.e. more Type I errors than expected) **Applies for Logistic & Poisson Regression
Cross Validation
This is a direct measure of predictive power. Random sampling is computationally more expensive than K-fold cross validation, with no clear advantage in terms of the accuracy of the estimated classification error rate. The rule of thumb for choosing K is about K = 10. LOOCV is K-fold cross validation with K = n. The larger K is (the larger the number of folds), the less biased the estimate of the classification error is, but the higher its variability.
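A minimal base-R sketch of 10-fold cross validation for a classification error rate, assuming a hypothetical data frame dat with a 0/1 response y:
set.seed(1)
K <- 10
fold <- sample(rep(1:K, length.out = nrow(dat)))
err <- numeric(K)
for (k in 1:K) {
  fit  <- glm(y ~ x1 + x2, data = dat[fold != k, ], family = binomial)
  phat <- predict(fit, newdata = dat[fold == k, ], type = "response")
  err[k] <- mean((phat > 0.5) != dat$y[fold == k])   # misclassification rate on the held-out fold
}
mean(err)   # cross-validated estimate of the classification error rate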
Leave-One-Out Cross Validation (LOOCV)
This is a direct measure of predictive power. This is just like Mallow's, except the variance is for the S submodel, not the full model. The LOOCV penalizes complexity less than Mallow's Cp.
Stepwise Regression
This is a heuristic search used when p is large. It's also useful when there are control variables. The three stepwise regression approaches do not necessarily select the same model, especially when p is large This is a greedy algorithm; it does not guarantee to find the model with the best score Greedy means we always take the biggest jump (up or down) in the selected criterion
General Additive Regression
This is a non-parametric model used if the linearity assumption does not hold and/or it's difficult to identify transformations that improve the fit. Here the relationship of each predicting variable to the response is assumed unknown. The estimation of the parameters does not have a closed-form expression.
Normality assumption
This is needed if we want to do any confidence or prediction intervals, or hypothesis tests, which we usually do. If this assumption is violated, hypothesis tests and confidence and prediction intervals can be very misleading.
Deviance
This is the difference between the log likelihood from a reduced model and the log likelihood from a full model. For large sample size data, the distribution (assuming the null hypothesis is true) is a chi-square distribution with Q degrees of freedom. Q = number of Z predicting variables (controlling variables for bias selection), i.e. the number of regression coefficients discarded from the full model to get the reduced model. The P-value of the test is computed as the right tail of the chi-square distribution with Q degrees of freedom beyond the test value (deviance). **This test is NOT a goodness of fit test. It simply compares two models and decides whether the larger model is statistically significantly better than the reduced model.
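In R this comparison can be sketched with anova() on nested glm fits (illustrative, hypothetical variable names):
reduced <- glm(y ~ x1, data = dat, family = binomial)
full    <- glm(y ~ x1 + z1 + z2, data = dat, family = binomial)
anova(reduced, full, test = "Chisq")   # chi-square test on q = 2 degrees of freedom here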
Odds of a success
This is the exponential of the Logit function
Prediction Risk
This is the measure of the bias-variance tradeoff It is the sum of expected squared differences between fitted values by the model S and future observations Prediction Risk = variance(future observation) + bias^2 + variance(prediction) MSE = bias^2 + variance(prediction) The variance of the future observation (sigma squared) is an irreducible error and thus cannot be controlled
Logistic Regression
This is the model where the link function g is the logit function: the log of the ratio of p and 1-p. We model the probability of a success given the predicting variables using the g link function, in such a way that the g function of the probability of success is a linear model in the predicting variables. The g function is the S-shaped function that models the probability of success with respect to the predicting variables.
Assumptions:
- Linearity in the predicting variables
- Independence of the observed response data
- The link function is the logit function
The parameters for logistic regression are beta 0, beta 1, ..., beta p. These parameters are unknown but estimated based on the observed data using the maximum likelihood approach. Statistical inference on the regression coefficients uses an approximate sampling distribution of the estimated regression coefficients, where the approximate distribution is the normal distribution.
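An illustrative logistic regression sketch, assuming a hypothetical data frame dat with a binary response y:
lfit <- glm(y ~ x1 + x2, data = dat, family = binomial)
summary(lfit)     # Wald z-tests; reliable only with large samples
exp(coef(lfit))   # estimated odds ratios for a one-unit increase in each predictor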
Exposure of the response variable
This is the number of units when modeling rate data with Poisson regression. Exposure is accounted for using an offset, which is the log of the exposure; this variable is added as a predictor to the Poisson regression model.
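A hedged one-line sketch of adding an exposure offset in R (hypothetical variable names):
rfit <- glm(counts ~ x1 + x2 + offset(log(exposure)), data = dat, family = poisson)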
Mallow's Cp
This is the oldest approach to variable selection. This assumes that we can estimate the variance from the full model, however this is NOT the case when p is larger than n.
Training Error
This is the proportion of the responses that are misclassified. We cannot use the training error rate as an estimate of the true classification error rate because it is biased downward. The bias comes from the fact that we use the data twice: once for fitting the model and a second time to estimate the classification error rate.
Pearson Residuals
This is the standardized difference between the ith observed response and estimated expected response, which is ni times the probability of success We need to standardize the difference between observed and expected response, as the responses have different variances Pearson residuals have an approximately standard normal distribution
Training Risk
This is the sum of squared differences between the fitted values for submodel S and the observed values. The training risk is a biased (downward) estimate of the prediction risk because the data are used twice, once to fit the model and once to evaluate the risk. The larger the number of variables in the model, the larger this bias is.
Simpson's paradox
This is when the addition of a predicting variable reverses the sign of the coefficient of an existing predictor. It refers to the reversal of an association when looking at a marginal relationship versus a partial or conditional one; the marginal relationship shows the wrong sign. This happens when the two predicting variables are correlated.
Complete Separation
This is when the model fits perfectly (p-value = 1) after an outlier is removed It indicates that the possibility of a simpler model being good enough should be explored
Nonlinear Regression
This is when the relationship between the response variable and the predicting variables is known but it cannot be expressed as a sum of transformed predicting variables. The regression function has a known structure given the predicting variables. It depends on a series of parameters In nonlinear regression, we know the function F except for the theta. We estimate the model by minimizing the sum of squared errors. We minimize this with respect to theta
Overdispersion
This is where the variability of the probability estimates is larger than would be implied by a binomial random variable.
ɸ = D/(n-p-1), where D is the deviance (sum of squared deviance residuals).
If ɸ > 2 then the model is overdispersed; an over-dispersed model will fit better.
Overdispersion impacts the estimated variance and statistical inference. If overdispersion is not accounted for, statistical inference will not be as reliable.
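A sketch of this overdispersion check in R, assuming a hypothetical fitted Poisson model:
pfit <- glm(counts ~ x1 + x2, data = dat, family = poisson)
phi  <- deviance(pfit) / df.residual(pfit)   # D / (n - p - 1)
phi   # values well above 1 (rule of thumb: > 2) suggest overdispersion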
You were hired to consult on a study of the attendance behavior of high school students at two different schools. The data set you were given contains, for each of 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows:
Coefficient    Estimate    SE         z value   Pr(>|z|)
(Intercept)    2.687666    0.072651   36.994    <2e-16
math          -0.003523    0.001821   -1.934    0.0531
langarts      -0.012152    0.001835   -6.623    3.52e-11
male          -0.400921    0.048412   -8.281    <2e-16
Also, assume the average language arts score (across all students) is 50, and the average math score (across all students) is 45.5.
How many regression coefficients, including the intercept, are statistically significant at the significance level 0.05?
All
Three
Two
None
Three As the summary output above shows, the coefficients associated to the intercept, langarts and male are statistically significant at α = 0.05. Their associated p-values (<2e-16, 3.52e-11, <2e-16) are smaller than 0.05
Hypothesis Testing (coefficient == 0)
To perform hypothesis testing, we can use the approximate normal sampling distribution. The resulting hypothesis test is also called the Wald test since it relies on the large sample normal approximation of MLEs To test whether the coefficient betaj = 0 or not, we can use the z- value **Applies for Logistic & Poisson Regression
Hypothesis Testing (coefficient == constant)
To test if the regression coefficient is equal to a constant b, the z-value changes: we subtract b from the estimated coefficient in the numerator. We decide whether to reject using the P-value, which is two times the tail of the standard normal distribution beyond the absolute value of the z-value: P-value = 2P(Z > |z-value|). **Applies for Logistic & Poisson Regression
Alpha is only used in elastic net.
True
Time Series Regression Characteristics
Trends: Can be a long-term increase/decrease in the data over time, or the data can fluctuate over time.
Seasonality: The data are influenced by seasonal factors.
Periodicity: Seasonality repeats at exactly the same time with exactly the same regular pattern.
Cyclical Trends: The data exhibit rises and falls that are not of a fixed period.
Heteroskedasticity: The variability varies with time.
Correlation: Successive observations are correlated, positively or negatively.
To account for trend and seasonality or periodicity, we decompose a time series into three components: mt (trend), st (seasonality), and Xt (a stationary process, i.e. its probability distribution does not change when shifted in time).
1. The means of the k populations 2. The sample means of the k populations 3. The sample means of the k samples are NOT all the model parameters in ANOVA
True. Only the means of the k populations (together with the common variance) are model parameters; the sample means are estimates computed from the data, not parameters.
A MLR model has high explanatory power if the coefficient of determination is close to 1
True
A high Cook's distance for a particular observation suggests that the observation could be an influential point.
True
A logistic regression model may not be a good fit to the data if the responses are correlated or if there is heterogeneity in the success that hasn't been modeled
True
A negative value of β1 is consistent with an inverse relationship between x and y.
True
AIC looks just like the Mallow's Cp except that the variance is the true variance and not its estimate.
True
In a full model F test, a low p-value indicates the model has predictive power.
True
In a multiple regression model with 7 predicting variables, the sampling distribution of the estimated variance of the error terms is a chi-squared distribution with n-8 degrees of freedom.
True
In case of multiple linear regression, controlling variables are used to control for sample bias.
True
In evaluating a multiple linear model residual analysis is used for goodness of fit assessment.
True
In evaluating a multiple linear model the F test is used to evaluate the overall regression.
True
In evaluating a multiple linear model the coefficient of variation is interpreted as the percentage of variability in the response variable explained by the model.
True
In evaluating a simple linear model residual analysis is used for goodness of fit assessment.
True
In evaluating a simple linear model the coefficient of variation is interpreted as the percentage of variability in the response variable explained by the model.
True
In evaluating a simple linear model there is a direct relationship between coefficient of variation and the correlation between the predicting and response variables.
True
In simple linear regression, we can diagnose the assumption of constant-variance by plotting the residuals against fitted values.
True
In testing for a subset of coefficients in logistic regression the null hypothesis is that the coefficient is equal to zero
True
It is possible to apply logistic regression when the response variable Y has 3 classes.
True
It is possible to produce a model where the overall F-statistic is significant but all the regression coefficients have insignificant t-statistics.
True
Let Y* be the predicted response at x*. The variance of Y* given x* depends on both the value of x* and the design matrix.
True
Under the normality assumption, the estimator for β1 is a linear combination of normally distributed random variables. (T/F)
True See 1.4 Statistical Inference "Under the normality assumption, β̂1 is thus a linear combination of normally distributed random variables... β̂0 is also a linear combination of random variables"
If the model assumptions hold, then the estimator for the variance, σ̂², is a random variable. (T/F)
True See 1.8 Statistical Inference We assume that the error terms are independent random variables. Therefore, the residuals are independent random variables. Since σ̂² is a combination of the residuals, it is also a random variable.
An ANOVA model with a single qualitative predicting variable containing k groups will have k + 1 parameters to estimate. (T/F)
True See 2.2 Estimation Method We have to estimate the means of the k groups and the pooled variance estimator, s²pooled.
If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable. (T/F)
True See 2.8 Data Example "This is important since without a good fit, we cannot rely on the statistical inference." Only when the model is a good fit, i.e. all model assumptions hold, can we rely on the statistical inference.
If the pairwise comparison interval between groups in an ANOVA model includes zero, we conclude that the two means are plausibly equal. (T/F)
True See 2.8 Data Example If the comparison interval includes zero, then the two means are not statistically significantly different, and are thus, plausibly equal.
The logit function is the log of the ratio of the probability of success to the probability of failure. It is also known as the log odds function.
True The logit link function is also known as the log odds function.
In logistic regression, the relationship between the probability of success and the predicting variables is nonlinear.
True We model the probability of success given the predictors by linking the probability to the predicting variables through a nonlinear link function.
Visual analytics for logistic regression: normal probability plot of residuals; residuals vs predictors; logit of success rate vs predictors
True
Normal probability plot of residuals - Normality
Residuals vs predictors - Linearity/Independence
Logit of success rate vs predictors - Linearity
Elastic net regression uses both penalties of ridge and lasso regression and hence combines the benefits of both
True -
For a classification model, the training error rate tends to underestimate the true classification error rate of the model.
True -
In the balance of Bias-Variance tradeoff, adding variables to our model tends to increase our variance and decrease our bias
True - Adding more variables will increase the variability and possibly induce multicollinearity. Adding more variables also reduces the bias in the model since it has an additional predictor to conform to which keeps the model from favoring one of the original predictors.
BIC variable selection criteria favors simpler models
True - BIC penalize complexity more than other approaches.
When the objective is to explain the relationship to the response, one might consider including predicting variables which are correlated
True - But this should be avoided for prediction
Ridge regression corrects for the impact of multicollinearity by reweighting the regression coefficients
True - Ridge regression has been developed to correct for the impact of multicollinearity. If there is multicollinearity in the model, all predicting variables are considered to be included in the model but ridge regression will allow for re-weighting the regression coefficients in a way that those corresponding to correlated predictor variables share their explanatory power and thus minimizing the impact of multicollinearity on the estimation and statistical inference of the regression coefficients.
We can estimate the regression coefficients in Poisson regression using the maximum likelihood estimation approach
True - Use MLE for Poisson
To perform hypothesis testing for Poisson, we can use again the approximate normal sampling distribution, also called the Wald test
True - Wald Test also used with logistic regression
Although there are no error terms in a Logistic Regression model using binary data with replications, we can still perform residual analysis.
True - We can perform residual analysis on the Pearson residual or the Deviance residual.
When selecting variables for explanatory purpose, one might consider including predicting variables which are correlated if it would help answer your research hypothesis
True - When the objective is to explain the relationship to the response, one might consider including the predicting variables even they are correlated
Under the null hypothesis of good fit for logistic regression, the test statistic has a Chi-Square distribution with n- p- 1 degrees of freedom
True - don't forget, we want large P values
A Poisson regression model fit to a dataset with a small sample size will have a hypothesis testing procedure with more Type I errors than expected
True - if sample size is small, the statistical inference is not reliable. Thus, the hypothesis testing procedure will have a probability of type I error larger than the significant level
L1 penalty will force many betas, many regression coefficients to be 0s
True - is equal to the sum of the absolute values of the regression coefficients to be penalized
L2 does not perform variable selection
True - is equal to the sum of the squared regression coefficients to be penalized and does not do variable selection
Stepwise regression is a greedy search algorithm that is not guaranteed to find the model with the best score
True - it does not guarantee to find the model with the best score
The assumptions in logistic regression are - Linearity, Independence of response variable, and the link function is the logit function.
True - linearity is measured through the link, the g of the probability of success, and the predicting variables.
The prediction interval of one member of the population will always be larger than the confidence interval of the mean response for all members of the population when using the same predicting values. (T/F)
True. See 1.7 Regression Line: Estimation & Prediction Examples "Just to wrap up the comparison, the confidence intervals under estimation are narrower than the prediction intervals because the prediction intervals have additional variance from the variation of a new measurement."
The normality assumption states that the response variable is normally distributed. (T/F)
False. See 1.8 Diagnostics "Normality assumption: the error terms are normally distributed." The response may or may not be normally distributed, but the error terms are assumed to be normally distributed.
It is good practice to create a multiple linear regression model using a linearly dependent set of predictor variables. (T/F)
False. See Lesson 3.13: Model Evaluation and Multicollinearity It is good practice to create a multiple linear regression model using a linearly independent set of predicting variables. "XTX is not invertible if the columns of X are linearly dependent, i.e. one predicting variable, corresponding to one column, is a linear combination of the others."
If the residuals are not normally distributed, we can model the transformed response variable instead, where a common transformation for normality is the Box-Cox transformation. (T/F)
True. See Lesson 3.3.11: Assumptions and Diagnostics If the normality assumption does not hold, we can use a transformation that normalizes the response variable such as Box-Cox transformation.
A linear regression model has high explanatory power if the coefficient of determination is close to 1. (T/F)
True. See Lesson 3.3.13: Model Evaluation and Multicollinearity If R2 is close to 1, almost all of the variability in Y can be explained by the linear regression model; hence, the model has high explanatory power.
For a given predicting variable, the corresponding estimated regression coefficient will likely be different in a conditional model versus a marginal model. (T/F)
True. See Lesson 3.4: Model Interpretation "Importantly, the estimated regression coefficients for the conditional and marginal relationships can be different, not only in magnitude but also in sign or direction of the relationship."
In multiple linear regression, the estimated regression coefficient corresponding to a quantitative predicting variable is interpreted as the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable holding all other predictors fixed. (T/F)
True. See Lesson 3.4: Model Interpretation "The estimated value for one of the regression coefficient βi represents the estimated expected change in y associated with one unit of change in the corresponding predicting variable, Xi, holding all else in the model fixed."
A partial F-Test can be used to test whether the regression coefficients associated with a subset of the predicting variables in a multiple linear regression model are all equal to zero. (T/F)
True. See Lesson 3.7: Testing for Subsets of Regression Parameters We use the Partial F-test to test the null hypothesis that the regression coefficients associated to a subset of the predicting variables are all equal to zero. The alternative hypothesis is that at least one of these regression coefficients is not zero.
Goodness of Fit (Poisson)
Use the Pearson or deviance residuals to evaluate whether they are normally distributed; if they are, we conclude that the model is a good fit. In a goodness-of-fit test, the null hypothesis is that the model fits well and the alternative is that the model does not fit well. The test statistic for the goodness-of-fit test is the sum of squared deviance residuals, which has a chi-squared distribution with n-p-1 degrees of freedom. If the p-value is small, we reject the null hypothesis of good fit and thus conclude that the model is not a good fit.
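A sketch of this deviance goodness-of-fit test in R (hypothetical fit):
pfit <- glm(counts ~ x1 + x2, data = dat, family = poisson)
1 - pchisq(deviance(pfit), df.residual(pfit))   # small p-value: reject the null hypothesis of good fit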
Goodness of Fit (binary data with no replications)
Use the deviances from the aggregated model for goodness of fit, not based on the individual level data
L2 Penalty
Using this penalty accounts for multicollinearity, but does not perform variable selection. The resulting regularized regression is called ridge regression. This is easy to implement, but it does not measure sparsity nor perform variable selection.
L1 Penalty
Using this penalty will force many betas to be 0s. The resulting regularized regression is called the LASSO regression L1 penalty measures sparsity
L0 Penalty
Using this penalty, the penalized least squares is equivalent to searching over all models and thus not feasible for a large number of predictive variables L0 penalty provides the best model given selection criteria, but it requires fitting all submodels
How do we interpret the VIF?
VIF measures the proportional increase in the variance of beta hat j compared to what it would have been if the predictive variables had been completely uncorrelated
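An illustrative check using the vif() function from the car package (assumes that package and a hypothetical model):
library(car)
fit <- lm(y ~ x1 + x2 + x3, data = dat)
vif(fit)   # values near 1 indicate little multicollinearity; large values (a common rule of thumb is 10 or more) flag a problem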
Uncorrelated Errors Assumption
Violations of this assumption can lead to misleading assessment of the strength of the regression. This is because the degrees of freedom are not equal to the sample size. In fact, there are less degrees of freedom due to the correlation. Moreover, not accounting for correlation will result in higher variability or uncertainty in the estimate, thus less reliable statistical inference
Goodness of Fit
We can use the Pearson or Deviance residuals to evaluate whether they are normally distributed. If they're normally distributed, we conclude that the model is a good fit If the model is not a good fit, it means the linearity assumption may not hold
Logistic Regression Coefficient
We interpret the regression coefficient beta as the log of the odds ratio for an increase of one unit in the predicting variable We do not interpret beta with respect to the response variable but with respect to the odds of success The estimators for the regression coefficients in logistic regression are unbiased and thus the mean of the approximate normal distribution is beta. The variance of the estimator does not have a closed form expression
g-function
We link the probability of success to the predicting variables using the g link function. The g function is the s-shape function that models the probability of success with respect to the predicting variables The link function g is the log of the ratio of p over one minus p, where p again is the probability of success Logit function (log odds function) of the probability of success is a linear model in the predicting variables The probability of success is equal to the ratio between the exponential of the linear combination of the predicting variables over 1 plus this same exponential
What can we do since the training risk is biased?
We need to correct for this bias by penalizing the training risk by adding a complexity penalty.
To use the estimated residuals for assessing model assumptions, what do we need to do first?
We need to standardize them
When would we reject the null hypothesis for a z test?
We reject the null hypothesis that the regression coefficient is 0 if the z value is larger in absolute value than the z critical point. Or the 1- alpha over 2 normal quanta. We interpret this that the coefficient is statistically significant.
Regression Coefficients - Robust Regression
We replace the sum of squared errors with the sum of absolute errors, thus estimating the regression coefficients by minimizing the sum of absolute errors. We cannot obtain closed or exact expressions for the estimated regression coefficients. The estimated coefficients are approximate estimates
We detect departure from the assumption of constant variance
When the residuals vs fitted values are larger in the ends but smaller in the middle.
Robust Regression II
When we have a heavy-tailed distribution that has symmetry, we replace the normal distribution with a double-exponential (Laplace) PDF. The main difference between this distribution and the normal distribution is that we have the absolute value of the difference between y and the centrality parameter mu, whereas for the normal distribution we had (y minus mu) squared. This distribution has heavier tails than the normal. The parameter mu is not the mean or expectation anymore, but the median of the distribution. The estimated mu using this approach is the sample median.
High Dimensionality
When we have a very large number of predicting variables to consider, it can be difficult to interpret and work with the fitted model
training risk
compute the prediction risk for the observed data and take the sum of squared differences between fitted values for sub model S and the observed values
In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters. The data frame has 30 observations and the following variables:
taste - a subjective taste score
Acetic - concentration of acetic acid (log scale)
H2S - concentration of hydrogen sulfide (log scale)
Lactic - concentration of lactic acid
Using the following R output from a fitted multiple linear regression model, answer the following multiple-choice questions.
Call: lm(formula = taste ~ ., data = chedder)
Residuals: Min -17.390, 1Q -6.612, Median -1.009, 3Q 4.908, Max 25.449
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -28.8768   19.7354     -1.463   0.15540
Acetic        0.3277    4.4598      0.073   0.94198
H2S           3.9118    1.2484      3.133   0.00425 **
Lactic       19.6705    8.6291      2.280   0.03108 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06
Calculate the sum of squared errors (SSE) from the given R output. Select the choice that most closely approximates your calculation.
a. 102.617
b. 2668.039
c. 2533.081
d. 2786.025
b. 2668.039 MSE = SSE/(n−p−1) = SSE/DF. Hence, SSE = MSE × DF = 10.13² × (30−3−1) = 2668.039
A linear regression model was fitted to estimate the response variable Height for black cherry trees using just the Diameter. The data frame has 31 observations. Here is the model summary, with some parts missing.
Coefficients:
             Estimate  Std. Error  t-value  Pr(>|t|)
(Intercept)  62.0313   A           14.152   1.49e-14 ***
Diameter      1.0544   0.3222       3.272   0.00276 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.538 on B degrees of freedom
Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
What is the value of A (standard error for the estimated intercept)?
a. 877.9
b. 4.383
c. 0.2281
d. None of the above
b. 4.383 Since t-value = (estimated intercept - 0)/estimated std, we have, estimated std = estimated intercept/tvalue = 62.0313/14.152 = 4.383
Which of the following is not an application of regression? a. Testing hypotheses b. Proving causation c. Predicting outcomes d. Modeling data
b. Proving causation
We can make causality statements for...
experimental designs
Why can we not use the training error rate as an estimate of the true classification error rate?
because it is biased downward. And the bias comes from the fact that we use the data twice. First, we used it for fitting the model and the second time to estimate the classification error rate.
In logistic regression, how do we define residuals for evaluating g-o-f?
Residuals (Pearson or deviance) are defined only for binary data with replications.
What is the distribution of binary data WITH replications?
binomial distribution with more than one trial or ni greater than 1
In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters. The data frame has 30 observations and the following variables:
taste - a subjective taste score
Acetic - concentration of acetic acid (log scale)
H2S - concentration of hydrogen sulfide (log scale)
Lactic - concentration of lactic acid
Using the following R output from a fitted multiple linear regression model, answer the following multiple-choice questions.
Call: lm(formula = taste ~ ., data = chedder)
Residuals: Min -17.390, 1Q -6.612, Median -1.009, 3Q 4.908, Max 25.449
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -28.8768   19.7354     -1.463   0.15540
Acetic        0.3277    4.4598      0.073   0.94198
H2S           3.9118    1.2484      3.133   0.00425 **
Lactic       19.6705    8.6291      2.280   0.03108 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06
Calculate the sum of squares total (SST) from the given R output. Select the choice that most closely approximates your calculation.
a. 4994.48
b. 3147.54
c. 7662.38
d. 8655.21
c. 7662.38 Since R² = 1 - SSE/SST, we have SST = SSE/(1-R²) = 2668.039/(1-0.6518) = 7662.38
You have measured the systolic blood pressure of a random sample of 50 employees of a company, and have fitted a linear regression model to estimate the response variable systolic blood pressure using the sex of the employees. The 95% confidence interval for the mean systolic blood pressure for the female employees is computed to be (122, 138). Which of the following statements gives a valid frequentist interpretation of this interval? a. 95% of the sample of female employees has a systolic blood pressure between 122 and 138. b. 95 % of the employees in the company have a systolic blood pressure between 122 and 138. c. If the sampling procedure were repeated 100 times, then approximately 95 of the resulting 100 confidence intervals would contain the true mean systolic blood pressure for all female employees of the company. d. We are 95% confident the sample mean is between 122 and 138
c. If the sampling procedure were repeated 100 times, then approximately 95 of the resulting 100 confidence intervals would contain the true mean systolic blood pressure for all female employees of the company.
In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters. The data frame has 30 observations and the following variables:
taste - a subjective taste score
Acetic - concentration of acetic acid (log scale)
H2S - concentration of hydrogen sulfide (log scale)
Lactic - concentration of lactic acid
Using the following R output from a fitted multiple linear regression model, answer the following multiple-choice questions.
Call: lm(formula = taste ~ ., data = chedder)
Residuals: Min -17.390, 1Q -6.612, Median -1.009, 3Q 4.908, Max 25.449
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -28.8768   19.7354     -1.463   0.15540
Acetic        0.3277    4.4598      0.073   0.94198
H2S           3.9118    1.2484      3.133   0.00425 **
Lactic       19.6705    8.6291      2.280   0.03108 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06
Given the R output, an increase in the concentration of lactic acid by one unit results in a(n) ___________ in the given taste score by ___________ points, holding all other variables constant.
a. Decrease, 19.6705
b. Increase, 8.6291
c. Increase, 19.6705
d. Decrease, 8.6291
c. Increase, 19.6705 The estimated coefficient for Lactic is 19.6705. If we fix all other predictors, for each one-unit increase in Lactic, the given taste score increases by 19.6705 points.
In ANOVA, for which of the following purposes is the Tukey method used? a. Test for homogeneity of variance b. Test for normality c. Test for differences in pairwise means d. Test for independence of errors
c. Test for differences in pairwise means
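A minimal R sketch of the Tukey procedure, assuming a data frame dat with a numeric response y and a factor group (hypothetical names):
fit <- aov(y ~ group, data = dat)   # one-way ANOVA fit
TukeyHSD(fit, conf.level = 0.95)    # simultaneous CIs for all pairwise mean differences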
marginal model (SLR)
captures the association of one predicting variable to the response variable marginally, that is, without accounting for other factors.
What is the sampling distribution for the pooled variance estimator?
chi-square distribution with N - K degrees of freedom
What is the estimated sampling distribution of s^2?
chi-square with n-1 DF
What is the sampling distribution of σ̂^2 in simple linear regression?
chi-square with n-2 DF (σ̂^2 is the MSE)
In MLR, the sampling distribution for σ̂^2 = MSE is...
chi-square with n-p-1 DF
σ^2 hat distribution is?
chi-square, n-p-1 DF
What is the distribution and DOF of overall regression test statistic?
chi-squared with p degrees of freedom where p is the number of predicting variables.
Poisson regression
commonly used for modeling count or rate data.
A linear regression model was fitted to estimate the response variable Height for black cherry trees using just the Diameter. The data frame has 31 observations. Here is the model summary, with some parts missing.
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 62.0313 A 14.152 1.49e-14 ***
Diameter 1.0544 0.3222 3.272 0.00276 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.538 on B degrees of freedom
Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
What is the value of B (degrees of freedom of the estimated error variance)?
a. 32
b. 31
c. 30
d. 29
d. 29 The degrees of freedom of the estimated error variance are calculated as df = n - k - 1 = 31 - 1 - 1 = 29
Models with few predictors have...
high bias but low variance
In GLM or generalized linear models, the response Y is assumed to have what kind of distribution?
distribution from the exponential family of distributions
L1 penalty - LASSO
equal to the sum of the absolute values of the regression coefficients to be penalized. Minimizing the penalized least squares with this penalty forces many regression coefficients (betas) to be exactly 0 (LASSO).
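A minimal lasso sketch with the glmnet package, assuming a numeric predictor matrix x and response vector y (hypothetical objects); alpha = 1 selects the L1 penalty:
library(glmnet)
fit <- glmnet(x, y, alpha = 1)   # lasso path over a grid of lambda values
coef(fit, s = 0.1)               # coefficients at lambda = 0.1; many are exactly zero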
In a simple linear regression model, given a significance level α, the (1 − α)100% confidence interval for the mean response should be wider than the (1 − α)100% prediction interval for a new response at the predictor's value x*.
false In a simple linear regression model, given a significance level α, the (1−α)100% confidence interval for the mean response should be narrower than the (1−α)100% prediction interval for a new response at the predictor's value x* .
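In R, both intervals come from predict() on a fitted lm object; for the same x* the prediction interval is the wider one. A sketch with hypothetical names:
fit <- lm(y ~ x, data = dat)                             # simple linear regression fit
newpt <- data.frame(x = 5)                               # the value x* of interest
predict(fit, newdata = newpt, interval = "confidence")   # CI for the mean response
predict(fit, newdata = newpt, interval = "prediction")   # PI for a new response (wider)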
With k-fold cross validation larger k values increase bias and reduce variance.
false Larger values of k decrease bias and increase variance.
Stepwise regression is a greedy algorithm searching through all possible combinations of the predicting variables to find the model with the best score.
false Not all possible combinations are checked.
In a multiple linear regression model, when more predictors are added, R^2 can decrease if the added predictors are unrelated to the response variable.
false R^2 never decreases as more predictors are added to a multiple linear regression model.
Ridge regression is a regularized regression approach that can be used for variable selection.
false Ridge regression is a regularized regression approach but does not perform variable selection.
In simple linear regression models, we lose three degrees of freedom when estimating the variance because of the estimation of the three model parameters β0, β1, σ^2.
false See 1.2 Estimation Method "The estimator for σ^2 is σ̂^2, and is the sum of the squared residuals, divided by n - 2."
The sampling distribution for the variance estimator in simple linear regression is χ^2 (chi-squared) regardless of the assumptions of the data.
false See 1.2 Estimation Method "The sampling distribution of the estimator of the variance is chi-squared, with n - 2 degrees of freedom (more on this in a moment). This is under the assumption of normality of the error terms."
We assess the constant variance assumption by plotting the error terms, ϵ_i, against fitted values.
false See 1.2 Estimation Method "We use ϵ̂_i as proxies for the deviances or the error terms. We don't have the deviances because we don't have β0 and β1."
The simple linear regression coefficient, β̂0, is used to measure the linear relationship between the predicting and response variables.
false See 1.2 Estimation Method β̂0 is the intercept and does not tell us about the relationship between the predicting and response variables.
The p-value is a measure of the probability of rejecting the null hypothesis.
false See 1.5 Statistical Inference Data Example "p-value is a measure of how rejectable the null hypothesis is... It's not the probability of rejecting the null hypothesis, nor is it the probability that the null hypothesis is true."
The normality assumption states that the response variable is normally distributed.
false See 1.8 Diagnostics "Normality assumption: the error terms are normally distributed." The response may or may not be normally distributed, but the error terms are assumed to be normally distributed.
With the Box-Cox transformation, when λ = 0 we do not transform the response.
false See 1.8 Diagnostics When λ = 0, we transform using the natural log.
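A sketch of choosing λ with MASS::boxcox, assuming fit is a fitted lm object (hypothetical name):
library(MASS)
bc <- boxcox(fit, lambda = seq(-2, 2, 0.1))   # profile log-likelihood over candidate lambdas
bc$x[which.max(bc$y)]                         # lambda maximizing the likelihood; 0 means log(y)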
In ANOVA, the linearity assumption is assessed using a plot of the response against the predicting variable.
false See 2.2. Estimation Method Linearity is not an assumption of ANOVA.
For a multiple linear regression model to be a good fit, we need the linearity assumption to hold for only one of the predicting variables.
false See Lesson 3.11: Assumptions and diagnostics In multiple linear regression, we need the linearity assumption to hold for all of the predicting variables, for the model to be a good fit. "For example, if the linearity does not hold with one or more predicting variables, then we could transform the predicting variables to improve the linearity assumption."
In multiple linear regression, a VIF value of 6 for a predictor means that 90% of the variation in that predictor can be modeled by the other predictors.
false See Lesson 3.13: Model Evaluation and Multicollinearity A VIF value of 6 for a predictor means that 83.3% of the variation in that predictor can be modeled by the other predictors in the model.
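The conversion between VIF and the R^2 of a predictor regressed on the other predictors is 1 - 1/VIF; a sketch with the car package, assuming model is a fitted lm object (hypothetical name):
library(car)
v <- vif(model)   # variance inflation factor for each predictor
1 - 1/v           # proportion of each predictor's variation explained by the others
# e.g., VIF = 6 gives 1 - 1/6 ≈ 0.833, i.e., about 83.3%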
It is good practice to create a multiple linear regression model using a linearly dependent set of predictor variables.
false See Lesson 3.13: Model Evaluation and Multicollinearity It is good practice to create a multiple linear regression model using a linearly independent set of predicting variables. "X^T X is not invertible if the columns of X are linearly dependent, i.e. one predicting variable, corresponding to one column, is a linear combination of the others."
Multicollinearity in multiple linear regression means that the rows in the design matrix are (nearly) linearly dependent.
false See Lesson 3.13: Model Evaluation and Multicollinearity Multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.
Multicollinearity among the predicting variables will not impact the standard errors of the estimated regression coefficients.
false See Lesson 3.13: Multicollinearity Multicollinearity in the predicting variables can impact the standard errors of the estimated coefficients. "However, the bigger problem is that the standard errors will be artificially large."
In multiple linear regression, the prediction of the response variable and the estimation of the mean response have the same interpretation.
false See Lesson 3.2.9: Regression Line and Predicting a New Response. In multiple linear regression, the prediction of the response variable and the estimation of the mean response do not have the same interpretation.
A multiple linear regression model contains 6 quantitative predicting variables and an intercept. The number of parameters to estimate in this model is 7.
false See Lesson 3.2: Basic Concepts The number of parameters to estimate in a multiple linear regression model containing 6 quantitative predicting variables and an intercept is 8: 7 regression coefficients (β0,β1,...,β6) and the variance of the error terms (σ2).
What are three problems that variable selection tries to minimize?
high dimensionality, multicollinearity, prediction vs explanatory
The estimated variance of the error terms of a multiple linear regression model with intercept can be obtained by summing up the squared residuals and dividing the sum by n - p , where n is the sample size and p is the number of predictors.
false See Lesson 3.3: Regression Parameter Estimation The estimated variance of the error terms of a multiple linear regression model with intercept should be obtained by summing up the squared residuals and dividing that by n-p-1, where n is the sample size and p is the number of predictors as we lose p+1 degrees of freedom when we estimate the p coefficients and 1 intercept.
The causation of a predicting variable to the response variable can be captured using multiple linear regression on observational data, conditional of other predicting variables in the model.
false See Lesson 3.4 Model Interpretation "This is particularly prevalent in a context of making causal statements when the setup of the regression does not allow so. Causality statements can only be made in a controlled environment such as randomized trials or experiments. "
Conducting t-tests on each β parameter in a multiple linear regression model is preferable to an F-test when testing the overall significance of the model.
false See Lesson 3.7: Testing for Subsets of Coefficients "We cannot and should not select the combination of predicting variables that most explains the variability in the response based on the t-tests for statistical significance because the statistical significance depends on what other variables are in the model."
In a multiple linear regression model, the adjusted R^2 measures the goodness of fit of the model
false The adjusted R^2 is not a measure of goodness of fit. R^2 and adjusted R^2 measure the ability of the model and the predicting variables to explain the variation in the response variable. Goodness of fit refers to having all model assumptions satisfied.
In ANOVA, when testing for equal means across groups, the alternative hypothesis is that the means are not equal between two groups for all pairs of means/groups.
false The alternative is that at least one pair of groups have unequal means
Under Poisson regression, the sampling distribution used for a coefficient estimator is a chi-squared distribution when the sample size is large.
false The coefficient estimator follows an approximate normal distribution.
In regularized regression, the penalization is generally applied to all regression coefficients (β0, ... ,βp), where p = number of predictors.
false The shrinkage penalty is applied to β1, . . . , βp, but not to the intercept β0.
The number of parameters that need to be estimated in a logistic regression model with 5 predicting variables and an intercept is the same as the number of parameters that need to be estimated in a standard linear regression model with an intercept and same predicting variables.
false There are no error terms in logistic regression, so we only have parameters for the 6 coefficients in the model. With linear regression, we have parameters for the 6 coefficients in the model as well as the variance of the error terms.
Variable selection is a simple and completely solved statistical problem since we can implement it using the R statistical software.
false Variable selection for a large number of predicting variables is an "unsolved" problem, and variable selection approaches should be tailored to the problem at hand.
It is good practice to perform a goodness-of-fit test on logistic regression models without replications.
false We can only define residuals for binary data with replications, and residuals are needed for a goodness-of-fit test.
When testing a subset of coefficients, deviance follows a chi-square distribution with q degrees of freedom, where q is the number of regression coefficients in the reduced model.
false q is the difference between the number of regression coefficients in the full model and the reduced model.
When the number of predicting variables is large, both backward and forward stepwise regressions will always select the same set of variables.
false Backward and forward stepwise regressions will not always select the same set of variables.
It is good practice to perform variable selection based on the statistical significance of the regression coefficients.
false It is not good practice to perform variable selection based on the statistical significance of the regression coefficients.
A logistic regression model has the same four model assumptions as a multiple linear regression model.
false The assumptions of a logistic regression model are: 1. Linearity Assumption: There is a linear relationship between the link function and the predictors 2. Independence assumption: The response variables are independent random variables 3. The link function is the logit function
What kind of variable is a predicting variable and why?
fixed, because it does not change with the response but it is fixed before the response is measured.
Where does uncertainty from estimation come from?
from estimation alone
Where does uncertainty from prediction come from?
from the estimation of regression parameters and from the newness of the observation itself
Cross-validation
Leave out some of the data when fitting the model, that is, split the data into two parts. One part, called the training data, is used to fit the model given a specific lambda, yielding the estimated regression coefficients for that lambda; the held-out part is then used to evaluate the prediction error.
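For regularized regression this split-and-evaluate procedure is usually run as k-fold cross-validation over a grid of lambdas; a sketch with glmnet (x and y are hypothetical objects):
library(glmnet)
cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # 10-fold CV across the lambda grid
cvfit$lambda.min                                   # lambda with the smallest CV error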
what is the coefficient interpretation of a GLM (poisson)?
the log of the ratio of the expected rates associated with a one-unit increase in the predicting variable, holding all other predictors fixed.
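Equivalently, exponentiating a coefficient gives the multiplicative change in the expected rate per one-unit increase; a sketch with hypothetical names:
fit <- glm(count ~ x1 + x2, family = poisson, data = dat)   # Poisson regression with log link
exp(coef(fit))                                              # rate ratios for a one-unit increase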
A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of sulfur in the fungicide and concCu represents the concentration of copper in the fungicide. Use it to answer the following multiple-choice questions.
Call: glm(formula = cbind(Survived, Died) ~ concS + concCu, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-9.5366 -2.4594 0.1223 3.9710 6.3566
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.58770 0.22958 15.63 <2e-16 ***
concS -4.32735 0.26518 16.32 <2e-16 ***
concCu -0.27483 0.01784 15.40 <2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 718.76 on 19 degrees of freedom
Residual deviance: 299.43 on 17 degrees of freedom
AIC: 363.53
The p-value for a goodness-of-fit test using the deviance residuals for the regression can be obtained from which of the following?
pchisq(419.33, 2, lower.tail = FALSE)
pchisq(363.53, 3, lower.tail = FALSE)
pchisq(299.43, 17, lower.tail = FALSE)
pchisq(718.76, 19, lower.tail = FALSE)
pchisq(299.43,17, lower.tail =FALSE) The goodness of fit test uses the residual deviance (299.43) and corresponding degrees of freedom (17) as the test statistic for the chi-squared test.
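The same quantities can be pulled directly from the fitted glm object rather than typed by hand (a sketch; fit is a hypothetical glm object):
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)   # p-value of the deviance goodness-of-fit test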
For goodness of fit test, we compare the likelihoods of the...?
saturated model versus the fitted model
Ridge regression does not perform variable selection. It only...
shrinks the coefficients toward zero but does not FORCE coefficients to be exactly zero, as needed for variable selection.
training error
Simply use the data to fit the model, compute the classification for each response in that same data, and take the proportion of responses that were misclassified.
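A sketch for a logistic classifier with a 0.5 cutoff, assuming a data frame dat with a 0/1 response y (hypothetical names):
fit <- glm(y ~ ., family = binomial, data = dat)   # fit on all the data
phat <- predict(fit, type = "response")            # fitted probabilities on the same data
yhat <- as.numeric(phat > 0.5)                     # classify with a 0.5 threshold
mean(yhat != dat$y)                                # training (resubstitution) error rate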
Forward stepwise tends to select...
smaller models
L1 penalty measures...
sparsity
The test statistic for the goodness of fit test is?
the sum of the squared deviance residuals (the residual deviance)
If we replace the unknown variance with its estimator, sigma^2=MSE, for PREDICTION, the sampling distribution becomes...
t distribution with n-p-1 DF
The alternative hypothesis of ANOVA can be stated as:
the means of all pairs of groups are different
the means of all groups are equal
the means of at least one pair of groups is different
None of the above
the means of at least one pair of groups is different See 2.4 Test for Equal Means "Using the hypothesis testing procedure for equal means, we test: The null hypothesis, which is that the means are all equal (mu1 = mu2 = ... = muk), versus the alternative hypothesis, that some means are different. Not all means have to be different for the alternative hypothesis to be true -- at least one pair of the means needs to be different."
Goodness of fit tests the null hypothesis that
the model fits the data
Deviance
For testing a subset of coefficients, the test statistic is twice the difference between the log-likelihood under the full model and the log-likelihood under the reduced model.
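In R this subset test is the analysis-of-deviance comparison between the reduced and full GLM fits (a sketch; reduced_fit and full_fit are hypothetical glm objects):
anova(reduced_fit, full_fit, test = "Chisq")   # change in deviance compared to a chi-squared with q DF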
If we have a positive value for B1,....
then that's consistent with a direct relationship between the predicting variable x and the response variable y.
Assuming the model is a good fit, the residuals in simple linear regression have constant variance.
true
Elastic net regression uses both penalties of ridge and lasso regression and hence combines the benefits of both.
true
If a Poisson regression model does not have a good fit, the relationship between the log of the expected rate and the predicting variables might be not linear.
true
If a predicting variable is categorical with 5 categories in a linear regression model without intercept, we will include 5 dummy variables in the model.
true
In a simple linear regression model, given a significance level α, if the (1 − α)100% confidence interval for a regression coefficient does not include zero, we conclude that the coefficient is statistically significant at the α level.
true In a simple linear regression model, given a significance level α, if the (1 − α)100% confidence interval for a regression coefficient does not include zero, we conclude that the coefficient is statistically significant at the α level.
It is required to standardize or rescale the predicting variables when performing regularized regression.
true Regularized regression requires standardization or scaling of the predicting variables.
A negative value of β 1 is consistent with an inverse relationship between the predictor variable and the response variable.
true See 1.2 Estimation Method "A negative value of β 1 is consistent with an inverse relationship"
The pooled variance estimator, s^2_pooled, in ANOVA is synonymous with the variance estimator, σ̂^2, in simple linear regression because they both use mean squared error (MSE) for their calculations.
true See 1.2 Estimation Method for simple linear regression; See 2.2 Estimation Method for ANOVA. The pooled variance estimator is, in fact, the variance estimator.
Under the normality assumption, the estimator for β1 is a linear combination of normally distributed random variables.
true See 1.4 Statistical Inference "Under the normality assumption, β̂1 is thus a linear combination of normally distributed random variables... β̂0 is also a linear combination of random variables"
The prediction interval of one member of the population will always be larger than the confidence interval of the mean response for all members of the population when using the same predicting values.
true See 1.7 Regression Line: Estimation & Prediction Examples "Just to wrap up the comparison, the confidence intervals under estimation are narrower than the prediction intervals because the prediction intervals have additional variance from the variation of a new measurement."
If the model assumptions hold, then the estimator for the variance, σ̂^2, is a random variable.
true See 1.8 Statistical Inference We assume that the error terms are independent random variables. Therefore, the residuals are independent random variables. Since σ̂^2 is a combination of the residuals, it is also a random variable.
An ANOVA model with a single qualitative predicting variable containing k groups will have k + 1 parameters to estimate.
true See 2.2 Estimation Method We have to estimate the means of the k groups and the pooled variance estimator, s^2_pooled.
The mean sum of squared errors in ANOVA measures variability within groups.
true See 2.4 Test for Equal Means MSE = within-group variability
If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable.
true See 2.8 Data Example "This is important since without a good fit, we cannot rely on the statistical inference." Only when the model is a good fit, i.e. all model assumptions hold, can we rely on the statistical inference.
If the pairwise comparison interval between groups in an ANOVA model includes zero, we conclude that the two means are plausibly equal.
true See 2.8 Data Example If the comparison interval includes zero, then the two means are not statistically significantly different, and are thus, plausibly equal.
Cook's distance (Di) measures how much the fitted values in a multiple linear regression model change when the ith observation is removed.
true See Lesson 3.11: Assumptions and Diagnostics "This is the distance between the fitted values of the model with all the observations versus the fitted values of the model discarding the i-th observation from the data used to fit the model. "
The presence of certain types of outliers, such as influential points, can impact the statistical significance of some of the regression coefficients.
true See Lesson 3.11: Assumptions and diagnostics Outliers that are influential can impact the statistical significance of the beta parameters.
An example of a multiple linear regression model is Analysis of Variance (ANOVA).
true See Lesson 3.2 Basic Concepts "Earlier, we contrasted the simple linear regression model with the ANOVA model... Multiple linear regression is a generalization of both models."
If the residuals are not normally distributed, we can model the transformed response variable instead, where a common transformation for normality is the Box-Cox transformation.
true See Lesson 3.3.11: Assumptions and Diagnostics If the normality assumption does not hold, we can use a transformation that normalizes the response variable such as Box-Cox transformation.
A linear regression model has high explanatory power if the coefficient of determination is close to 1.
true See Lesson 3.3.13: Model Evaluation and Multicollinearity If R^2 is close to 1, almost all of the variability in Y can be explained by the linear regression model; hence, the model has high explanatory power.
In the case of multiple linear regression, controlling variables are used to control for sample bias.
true See Lesson 3.4: Model Interpretation "Controlling variables can be used to control for bias selection in a sample."
For a given predicting variable, the corresponding estimated regression coefficient will likely be different in a conditional model versus a marginal model.
true See Lesson 3.4: Model Interpretation "Importantly, the estimated regression coefficients for the conditional and marginal relationships can be different, not only in magnitude but also in sign or direction of the relationship."
In multiple linear regression, the estimated regression coefficient corresponding to a quantitative predicting variable is interpreted as the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable holding all other predictors fixed.
true See Lesson 3.4: Model Interpretation "The estimated value for one of the regression coefficient βi represents the estimated expected change in y associated with one unit of change in the corresponding predicting variable, Xi, holding all else in the model fixed."
A partial F-Test can be used to test whether the regression coefficients associated with a subset of the predicting variables in a multiple linear regression model are all equal to zero.
true See Lesson 3.7: Testing for Subsets of Regression Parameters We use the Partial F-test to test the null hypothesis that the regression coefficients associated to a subset of the predicting variables are all equal to zero. The alternative hypothesis is that at least one of these regression coefficients is not zero.
Simpson's Paradox occurs when a coefficient reverses its sign when used in a marginal versus a conditional model.
true Simpson's paradox: Reversal of an association when looking at a marginal relationship versus a conditional relationship.
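A small simulation sketch of such a reversal (all names and numbers are illustrative, not from the course):
set.seed(1)
z <- rnorm(200)              # lurking/controlling variable
x <- 2*z + rnorm(200)        # predictor correlated with z
y <- -x + 4*z + rnorm(200)   # conditional effect of x on y is negative
coef(lm(y ~ x))["x"]         # marginal slope: positive
coef(lm(y ~ x + z))["x"]     # conditional slope: negative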
Generalized linear models, like logistic regression, use a Wald test to determine the statistical significance of the coefficients.
true The coefficient estimates follow an approximate normal distribution and a z-test, also known as a Wald test, is used to determine their statistical significance.
For logistic regression, if the p-value of the deviance test for goodness-of-fit is large, then it suggests that the model is a good fit.
true The null hypothesis is that the model fits the data. So large p-values suggests that the model is a good fit.
With Poisson regression, the variance of the response is not constant.
true V(Y|x_1,...x_p)=exp(beta_0 + beta_1 x_1 + ... + beta_p x_p)
Although there are no error terms in a logistic regression model using binary data with replications, we can still perform residual analysis.
true We can perform residual analysis on the Pearson residuals or the Deviance residuals.
The training risk is not an unbiased estimator of the prediction risk.
true The training risk is a biased estimator of the prediction risk.
When selecting variables for explanatory purpose, one might consider including predicting variables which are correlated if it would help answer your research hypothesis.
true When the objective is to explain the relationship to the response, one might consider including the predicting variables even if they are correlated.
What does it mean if 0 is NOT included in the CI?
we conclude that Bj IS statistically significant
What does it mean if 0 is included in the CI?
we conclude that Bj is NOT statistically significant
pairwise comparison
we estimate the difference between a pair of group means (for example, mean_i and mean_j) using the difference between the corresponding sample means
If we add the bias squared and the variance,
we get Mean Squared Error (MSE)
What does a cluster of residuals mean on a residual plot?
we have correlated errors
When B1 is close to zero...
we interpret that there is not a significant association between the predicting variable x and the response variable y.
If the t-value is large...
we reject the null hypothesis and conclude that the coefficient is statistically significant.
Why do we lose 1 DF for s^2?
we replace mu with zbar
Why do we lose 2 DF for sigma^2?
we replaced two parameters, B0 and B1
Backward stepwise regression
we start with all predictors, the full model and drop one predictor at a time.
Forward stepwise regression
we start with no predictor or with a minimum model, and add one predictor at a time
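Both directions can be run with R's step() function (a sketch; dat and the model formulas are hypothetical):
null_fit <- lm(y ~ 1, data = dat)   # minimum (intercept-only) model
full_fit <- lm(y ~ ., data = dat)   # full model with all predictors
step(full_fit, direction = "backward")                             # drop one predictor at a time
step(null_fit, scope = formula(full_fit), direction = "forward")   # add one predictor at a time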
When do we reject the null hypothesis for the overall regression test in regards to the p value?
when the P-value is small, indicating that the overall regression has explanatory power.
A model selection method that can be performed in the R statistical software given a set of controlling factors that form the minimum or starting model.
Stepwise regression
Forward stepwise tends to result in smaller models than backward stepwise regression.
True
Given a model of p predicting variables, there are 2^p models to choose from.
True
If there is a group of correlated variables, lasso tends to select only one of the variables in the group.
True
If you had a model with multicollinearity that you wanted to preserve, you'd use ridge regression over lasso.
True
In cases where p>n, lasso selects only up to n variables.
True
In regularized regression, different lambdas result in different models.
True
In regularized regression, lambda is a constant.
True
The L1 penalty results in ridge regression.
False
The penalty for lasso regression is not a sparsity penalty.
False
The selected variables using variable selection approaches are the only variables that explain the response variable.
False
We need not be wary of over interpretation in MLR.
False
Variable selection is performed by
Balancing the bias-variance trade-off.
Why is training risk a biased estimate of prediction risk?
Because we use the data twice.
What does training risk correct for?
Bias
In GLMs, the main reason one does not use LSE to estimate model parameters is the potential constraints on the parameters.
False - The potential constraint in the parameters of GLMs is handled by the link function.
Backward elimination requires a pre-set probability of Type II error.
False - Type I error
Can be used to explain variability in the response variable.
Explanatory variable
All regularized regression approaches will select the same model.
False
Forward stepwise is more computationally expensive than backward.
False
If a predicting variables is selected to be in the model, we conclude that the predicting variable has a causal relationship with the response variable.
False
In regularized regression, the bigger the lambda the smaller the complexity penalty.
False
It is always feasible to apply a model search for all possible combinations of the predicting variables.
False
LOOCV penalizes complexity more than Mallow's CP
False
Lasso regression has a closed-form expression for the estimated coefficients.
False
Mallows Cp works even when p>n.
False
Once lasso has identified the predicting variables, it is not necessary to use OLS to obtain the regression coefficients.
False
Ridge regression is used for variable selection.
False
The AIC and BIC cannot be used in selecting variables for generalized linear models.
False
It is possible to select a model to include variables that are not statistically significant, even though that model will provide the best prediction.
True
Models with many predictors have what?
Low bias, high variance.
L0 penalty is equivalent to searching through all the models.
True
The minus expected log likelihood function is also known as what?
Prediction risk
Can be used to predict the response variable regardless of their explanatory power.
Predictive variable
Mallows CP uses estimated variance based on what?
The full model
Forward stepwise regression adds one variable to the model at a time starting with a minimum model.
True
Alpha balances between ridge and lasso in elasticnet.
True
Complexity is equivalent to a large model with many predicting variables.
True
Cross validation is used to find the best lambda for regularized regression problems.
True
For BIC, we need to replace sigma^2 with an estimate from the submodel S.
True
For logistic regression and Poisson regression, the training risk is the sum of square deviances for the submodel S.
True
We cannot obtain prediction risk because of what?
We don't have future observations at the time of prediction.