6414 FINAL MIX (15)

MSE=

Bias^2+Var

How will the procedure change if we test whether the coefficient is equal to a constant?

1) We reject the null hypothesis if the absolute value of the t-value is larger than the critical point t_(alpha/2, n-p-1). 2) OR we can check whether the p-value is smaller than the significance level alpha (e.g., 0.01).

what are three things that a regression analysis is used for?

1. Prediction of the response variable, 2. Modeling the relationship between the response and explanatory variables, 3. Testing hypotheses of association relationships

3 options for splitting the data

1. Random sampling 2. K-fold cross-validation 3. Leave-one-out cross-validation
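As a rough illustration of K-fold cross-validation in R (the data frame dat and the variables y and x are hypothetical placeholders):

set.seed(1)
K <- 10
# randomly assign each observation to one of K folds
folds <- sample(rep(1:K, length.out = nrow(dat)))
cv_mse <- sapply(1:K, function(k) {
  fit  <- lm(y ~ x, data = dat[folds != k, ])          # fit on the K-1 training folds
  pred <- predict(fit, newdata = dat[folds == k, ])    # predict the held-out fold
  mean((dat$y[folds == k] - pred)^2)                   # held-out squared error
})
mean(cv_mse)   # cross-validated estimate of the prediction error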

How do we compute classification error? (2 ways)

1. Training error 2. Cross validation

Which one is correct? A. The prediction intervals need to be corrected for simultaneous inference when multiple predictions are made jointly. B. The prediction intervals are centered at the predicted value. C. The sampling distribution of the prediction of a new response is a t-distribution. D. All of the above.

D. All of the above. (See 3.2 - Knowledge Check 3.)

To correct for complexity for GLM, what can we use?

AIC and BIC

Outlier

Any data point that is far from the majority of the data in x, y, or both

conditional relationship

Capturing the association of a predicting variable with the response variable, conditional on the other predicting variables in the model.

What do you add to the training risk to get an estimate of the prediction risk?

Complexity penalty

L2 does not account for multicollinearity.

False

What is the rule of thumb for Cook's Distance?

If Di > 4/n, Di > 1, or any other "large" Di, the observation should be investigated
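A minimal R sketch of applying this rule of thumb, assuming fit is an already-fitted lm model (hypothetical):

d <- cooks.distance(fit)   # Cook's distance for each observation
n <- length(d)
which(d > 4 / n)           # points flagged by the 4/n rule, worth investigating
which(d > 1)               # points with very large influence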

When do we reject the f-statistic? and what does this mean?

We reject when the F-statistic is larger than the critical point, where alpha is the significance level of the test. Rejection means that at least one of the coefficients is different from 0 at the alpha significance level.

In multiple linear regression, the model can be written in...?

Matrix form.

What method do we use to estimate the model parameters?

Maximum Likelihood Estimation approach

Do the estimated residuals have constant variance?

No

Mallows CP is an example of what?

Prediction risk

How do we interpret R^2?

Proportion of total variability in Y that can be explained by the regression (that uses X)

SST = ?

SSE + SSTR

Linearity assumption for a Logistic Model

Similar to the regression models we have learned in the previous lectures, we assume that the relationship between the link g of the probability of success and the predicting variables is a linear function.

In the case of complete separation, what should you do?

Simplify the model

T/F: The statistical inference for logistic regression relies on large size of the sample data.

T

Null-deviance

Test statistic for the overall regression; it shows how well the response variable is predicted by a model that includes only the intercept.

The fitted values are defined as

The regression line with parameters replaced with the estimated regression coefficients.

If the scatter plot of the residuals (epsilon ij) for the ANOVA is NOT random: (2)

The sample responses are not independent, or the variances of responses are not equal

The total sum of squares divided by N-1 is

The sample variance estimator assuming equal means and equal variances

The pooled variance estimator is:

The sample variance estimator assuming equal variances.

AIC uses estimated variance based on what?

The submodel

The objective of the pairwise comparison is

To identify the statistically significantly different means.

The L1 penalty results in lasso regression.

True

In the case of multiple linear regression, controlling variables are used to control for sample bias. (T/F)

True. See Lesson 3.4: Model Interpretation "Controlling variables can be used to control for bias selection in a sample."

The expectation of the mean response is:

UNBIASED

The estimators for the regression coefficients are:

Unbiased regardless of the distribution of the data.

predicting or explanatory (independent) variables

a set of other variables that might be useful in predicting or modeling the response variable (x1, x2)

MSSTr measures...

between-group variability

for our linear model where: Y = B0 + B1*X + EPSILON (E), what does the epsilon represent?

deviation of the data from the linear model (error term)

The test of subset of coefficients tests the null hypothesis that

discarded variables have coefficients equal to zero.

An important aspect in prediction is....

how it performs in new settings.

B0 = ?

intercept

If we have a negative value for B1,....

is consistent with an inverse relationship between x and y

L2 penalty (ridge)...

is easy to implement, but it does not measure sparsity and does not perform variable selection

Models with many predictors have...

low bias, high variance

We'd like to have prediction with...

low uncertainty for new settings.

Goodness of fit

means that the model assumptions hold and fits the data well.

The sampling distribution of MLEs can be approximated by a...

normal distribution

We can make association statements for...

observational studies

response (dependent) variables

one particular variable that we are interested in understanding or modeling (y)

What inference does 'testing a subset of coefficients' provide?

provides inferences on the predictive power of the model

B1 = ?

slope

If there is a coefficient that does NOT equal zero, what does that mean?

that at least one of these predictors included in the model has predictive power.

In the pairwise comparison, if the confidence interval only contains positive values, then we conclude...

that the difference in means is statistically positive

Rate parameter

the expectation of the response Yi, given the predicting variables, which is modeled as the exponential of the linear combination of the predicting variables since the link function between expectation and the predicting variables is the log function

The larger the number of variables included in the model,...

the lower the training risk (training risk always decreases as variables are added, which is why it underestimates the prediction risk)

ANOVA is a linear regression model where...

the predicting factor is a (one) categorical variable.

Predictive factors

to best predict variability in the response regardless of their explanatory power

what is the objective of ANOVA?

to compare the means across the k populations, are they equal?

In a multiple linear regression model, the R^2 measures the proportion of total variability in the response variable that is captured by the regression model.

true

How will we diagnose the assumptions for ANOVA?

we are going to diagnose the assumptions on the residuals, because the error terms cannot be observed directly (the true means are unknown)

How do we estimate prediction risk?

we can use an approach called Training Risk

What can we use to test if Betaj is = 0?

z test (wald test)

When the p-value of the slope estimate in the SLR is small the r-squared becomes smaller too.

False - When the p-value is small, the model fit is more significant, and R-squared tends to be larger.

LOOCV is approximately AIC when the true variance is replaced by the estimate of the variance from the S submodel.

True

Lasso regression does not have a closed-form expression for the estimated coefficients.

True

MLE is used for GLMs to handle the complicated (non-linear) link function modeling the X-Y relationship.

True

The estimated regression coefficients from lasso regression are less efficient than those from the ordinary least squares estimation approach.

True

The first degree of freedom in the F distribution for any of the three procedures in stepwise is always equal to one.

True

The goal is to balance the trade-off between bias (reduced with more predictors) and variance (reduced with fewer predictors).

True

To get BIC in R, we use AIC with a k value of log(n).

True
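A quick illustration in R, assuming fit is a fitted lm model (hypothetical):

n <- nrow(model.frame(fit))
AIC(fit)               # AIC: penalty constant k = 2 by default
AIC(fit, k = log(n))   # same value as BIC(fit)
BIC(fit)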

Typically, confounding variables should be included in a model.

True

Variable selection is an art by itself.

True

We can combine ridge and lasso regression into what we call the elastic net regression.

True

When selecting variables, we need to first establish which variables are used for controlling bias selection in the sample and which are explanatory.

True

When the objective of a model is prediction, correlated variables should be avoided.

True

When the objective is to explain the relationship to the response, one might consider including predicting variables which are correlated.

True

In multiple linear regression with iid errors and equal variance, the least squares estimates of the regression coefficients are always unbiased.

True - the least squares estimates are BLUE (Best Linear Unbiased Estimates) in multiple linear regression.

Akaike Information Criterion (AIC)

A more general approach; for linear regression under normality, it becomes training risk plus a penalty that looks like Mallow's Cp, except that the variance is the true variance rather than the estimate.

The following output was captured from the summary output of a simple linear regression model that relates the duration of an eruption with the waiting time since the previous eruption.

Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  -1.374016  A           -1.70    0.045141 *
waiting       0.043714  0.011098    B        0.000052 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16

Using the table above, what is the standard error of the intercept, labeled A, rounded to three decimal places? 2.336 / 0.808 / 0.806 / -0.806 / None of the above

0.808. See 1.4 Statistical Inference: Std. Error = Estimate / t-value = -1.374016 / -1.70 = 0.808

VIF =

VIFj = 1 / (1 - Rj^2), where Rj^2 is the R^2 from regressing the j-th predictor on all the other predictors
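As a sketch, the VIF for one predictor can be computed directly from this formula in R; x1, x2, x3, and dat are hypothetical, and the car package's vif() returns the same values:

r2_x1  <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared   # R^2 of x1 regressed on the other predictors
vif_x1 <- 1 / (1 - r2_x1)
# equivalently: library(car); vif(lm(y ~ x1 + x2 + x3, data = dat))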

The objective of multiple linear regression is

1. To predict future new responses 2. To model the association of explanatory variables to a response variable accounting for controlling factors. 3. To test hypothesis using statistical inference on the model.

Objectives of Variable Selection

- High Dimensionality - Multicollinearity - Prediction vs Explanatory Objective

What are 4 reasons why the logistic model might not be a good fit?

1. there may be other variables that should be included in the model and/or the relationship between logit of the expected probability and predictors might be multiplicative rather than additive. 2. Initial observation outliers and leverage points are also still an issue for this model. The model should be fitted with and without these outliers. 3. the binomial distribution isn't appropriate (overdispersion) 4. the logit function does not fit well with the data. (there could be other s shaped functions that would work better)

Plotting the residuals versus fitted values checks for which assumption?

Constant variance & Independence

The 3 assumptions of ANOVA with respect to the error term:

Constant variance, independence, normality

3 ways Predicting Variables can be distinguished as:

Controlling, Explanatory, Predictive

What can we use to identify outliers?

Cook's Distance

How do we choose lambda?

Cross validation!
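A minimal sketch with the glmnet package (X is a hypothetical predictor matrix, y a response vector):

library(glmnet)
cvfit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 is lasso, alpha = 0 is ridge
cvfit$lambda.min                      # lambda with the smallest cross-validated error
coef(cvfit, s = "lambda.min")         # coefficients at that lambda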

In Poisson regression: A) We make inference using t-intervals for the regression coefficients. B) Statistical inference relies on exact sampling distribution of the regression coefficients. C) Statistical inference is reliable for small sample data. D) None of the above.

D

The log-likelihood function is a linear function with a closed-form solution.

False Maximizing the log-likelihood function with respect to the coefficients in closed form expression is not possible because the log-likelihood function is non-linear.

If one confidence interval in the pairwise comparison does not include zero, we conclude that the two means are plausibly equal.

False

If the confidence interval for a regression coefficient contains the value zero, we interpret that the regression coefficient is definitely equal to zero.

False

If the non-constant variance assumption does not hold in multiple linear regression, we apply a transformation to the predicting variables.

False

If the p-value of the overall F-test is close to 0, we can conclude all the predicting variable coefficients are significantly nonzero.

False

If there are a group of correlated variables, lasso tends to select all of them.

False

If we do not reject the test of equal means, we conclude that means are definitely all equal

False

If we reject the test of equal means, we conclude that all treatment means are not equal.

False

In MLR, a VIFj of 10 means that there is no correlation among the jth predictor and the remaining predictors, hence the variance of the estimated regression coefficient Bj is not inflated

False

In a multiple linear regression model with quantitative predictors, the coefficient corresponding to one predictor is interpreted as the estimated expected change in the response variable when there is a one unit change in that predictor.

False

In linear regression, outliers do not impact the estimation of the regression coefficients.

False

In ridge regression, when the penalty constant lambda equals to 1, the corresponding ridge coefficient estimates are the same as the ordinary least squares estimates

False

In the ANOVA, the number of degrees of freedom of the chi-squared distribution for the variance estimator is N-k-1 where k is the number of groups.

False

In the presence of near multicollinearity, the prediction will not be impacted.

False

In the presence of near multicollinearity, the regression coefficients will tend to be identified as statistically significant even if they are not.

False

In the regression model, the variable of interest for study is the predicting variable.

False

In the simple linear regression model, we lose three degrees of freedom because of the estimation of the three model parameters β0, β1, σ².

False

Independence assumption can be assessed using the normal probability plot.

False

Independence assumption can be assessed using the residuals vs fitted values.

False

Observational studies allow us to make causal inference.

False

One-way ANOVA is a linear regression model with more than one qualitative predicting variables.

False

Only the log-transformation of the response variable can be used when the normality assumption does not hold.

False

Only the log-transformation of the response variable should be used when the normality assumption does not hold.

False

Prediction is the only objective of multiple linear regression.

False

Ridge regression does not have a closed-form expression for the estimated coefficients.

False

Suppose x1 was not found to be significant in the model specified with lm(y ~ x1 + x2 + x3). Then x1 will also not be significant in the model lm(y ~ x1 + x2).

False

T/F: Backward stepwise regression is preferable over forward stepwise regression because it starts with larger models.

False

T/F: Complex models with many predictors are often extremely biased, but have low variance.

False

T/F: Conducting t-tests on each β parameter in a multiple regression model is the best way for testing the overall significance of the model.

False

T/F: Given a qualitative predicting variable with 7 categories in a linear regression model with intercept, 7 dummy variables need to be included in the model.

False

T/F: If a predicting variable is categorical with 5 categories in a linear regression model with intercept, we will include 5 dummy variables in the model.

False

T/F: In regularized regression, the penalization is generally applied to all regression coefficients (β0, ... ,βp), where p = number of predictors.

False

T/F: In the case of a multiple linear regression model containing 6 quantitative predicting variables and an intercept, the number of parameters to estimate is 7.

False

T/F: In the case of a multiple regression model with 10 predictors, the error term variance estimator follows a χ² (chi-squared) distribution with n - 10 degrees of freedom.

False

T/F: It is good practice to create a multiple linear regression model using a linearly dependent set of predictor variables.

False

T/F: It is good practice to perform variable selection based on the statistical significance of the regression coefficients.

False

T/F: It is not required to standardize or rescale the predicting variables when performing regularized regression.

False

T/F: Mallow's Cp statistic penalizes for complexity of the model more than both leave-one-out CV and Bayesian information criterion (BIC).

False

T/F: Multiple linear regression captures the causation of a predicting variable to the response variable, conditional of other predicting variables in the model.

False

T/F: Predicting values of the response variable for values of the predictors that are within the data range is known as extrapolation.

False

T/F: Ridge regression is a regularized regression approach that can be used for variable selection.

False

T/F: Stepwise regression is a greedy algorithm searching through all possible combinations of the predicting variables to find the model with the best score.

False

T/F: The causation of a predicting variable to the response variable can be captured using multiple linear regression, conditional of other predicting variables in the model.

False

T/F: The equation to find the estimated variance of the error terms of a multiple linear regression model with intercept can be obtained by summing up the squared residuals and dividing that by n - p , where n is the sample size and p is the number of predictors.

False

T/F: The only objective of multiple linear regression is prediction.

False

T/F: The sampling distribution for estimating confidence intervals for the regression coefficients is a normal distribution.

False

T/F: The training risk is an unbiased estimator of the prediction risk.

False

T/F: Variable selection is a simple and solved statistical problem since we can implement it using the R statistical software.

False

T/F: We can make causal inference in observational studies.

False

T/F: We can use the normal test to test whether a regression coefficient is equal to zero.

False

T/F: We interpret the coefficient corresponding to one predictor in a regression with multiple predictors as the estimated expected change in the response variable associated with one unit of change in the corresponding predicting variable.

False

T/F: When the number of predicting variables is large, both backward and forward stepwise regressions will always select the same set of variables.

False

The AIC is commonly used for prediction models since it penalizes the model complexity the most.

False

The F-test can be used to evaluate the relationship between two qualitative variables.

False

The causation effect of a predicting variable to the response variable can be captured using multiple linear regression, conditional of other predicting variables in the model.

False

The constant variance assumption is diagnosed by plotting the predicting variable vs. the response variable.

False

The constant variance is diagnosed using the quantile-quantile normal plot.

False

The estimated regression coefficient β̂_i is interpreted as the change in the response variable associated with one unit of change in the i-th predicting variable.

False

The estimated regression coefficients will be the same under marginal and conditional model, only their interpretation is not.

False

The estimated variance of the error term has a χ² distribution regardless of the distribution assumption of the error terms.

False

The estimator σ̂² is a fixed variable.

False

The interpretation of the regression coefficients is the same whether or not interaction terms are included in the model.

False

The larger the number of variables, the lower the training risk is.

True (training risk always decreases as variables are added, which is why it is a downward-biased estimate of the prediction risk)

The means of the k populations is a model parameter in ANOVA.

False

The number of degrees of freedom for the χ² distribution of the estimated variance is n-p-1 for a model without intercept.

False

The number of degrees of freedom of the χ² (chi-square) distribution for the pooled variance estimator is N - k + 1, where k is the number of samples.

False

The number of degrees of freedom of the χ² (chi-square) distribution for the variance estimator is N - k + 1, where k is the number of samples.

False

The only assumptions for a linear regression model are linearity, constant variance, and normality.

False

The only assumptions for a simple linear regression model are linearity, constant variance, and normality.

False

The role of the penalty constant λ is to select the best subset of predicting variables.

False

The regression coefficient corresponding to one predictor is interpreted in a multiple regression in terms of the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable.

False

The regression coefficient is used to measure the linear dependence between two variables.

False

The residuals have a t-distribution distribution if the error term is assumed to have a normal distribution.

False

The residuals have constant variance for the multiple linear regression model.

False

The residuals vs fitted can be used to assess the assumption of independence.

False

The sample means of the k populations is a model parameter in ANOVA.

False

The sample means of the k samples is a model parameter in ANOVA.

False

The sampling distribution for the variance estimator in ANOVA is χ² (chi-square) with N - k degrees of freedom.

False

The sampling distribution for the variance estimator in ANOVA is χ² (chi-square) regardless of the assumptions of the data.

False

The sampling distribution of the mean squared error is different from that of the estimated variance.

False

The statistical inference for linear regression under normality relies on large size of sample data.

False

The variables chosen for prediction and the variables chosen for explanatory objectives will be the same.

False

There are four assumptions needed for estimation with multiple linear regression: mean zero, constant variance, independence, and normality.

False

There are multiple model selection criteria that can be used and all provide the same penalization of the model complexity.

False

To get AIC in R, we use AIC with a value of k=1

False

Variable selection is a solved statistical problem, particularly for models with a large number of predictors.

False

We can test for a subset of regression coefficients only if we are interested whether additional explanatory variables should be considered in addition to the controlling variables.

False

We can test for a subset of regression coefficients to evaluate whether all regression coefficients corresponding to the predicting variables excluded from the reduced model are statistically significant.

False

We can test for a subset of regression coefficients using the F statistic test of the overall regression.

False

We cannot estimate a multiple linear regression model if the predicting variables are linearly independent.

False

We do not need to assume normality of the response variable for making inference on the regression coefficients.

False

the L2 penalty results in lasso regression.

False

β1 is an unbiased estimator for β0.

False

Maximum Likelihood Estimation is not applicable for simple linear regression and multiple linear regression.

False - In SLR and MLR, the LSE and MLE are the same with normal iid data.

The interpretation of the regression coefficients is the same for logistic regression and Poisson regression

False - Interpretation of the regression coefficients in Poisson regression is in terms of the log ratio of the rate, while in logistic regression it is in terms of the log odds

In GLMs, the link function cannot be a non-linear function.

False - It can be linear, non-linear, or parametric

It is a good practice to perform variable selection based on the statistical significance of the regression coefficients

False - It is not good practice to perform variable selection based on the statistical significance of the regression coefficients.

A Poisson regression model with p predictors and the intercept will have p + 2 parameters to estimate.

False - It'll have p + 1 parameters to estimate

Like standard linear regression we can use the F test to test for overall regression in logistic regression.

False - Use the chi-squared test of deviances: 1 - pchisq(null deviance - residual deviance, DF_null - DF_residual)
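Spelled out in R for a hypothetical fitted logistic model (dat, y, x1, x2 are placeholders):

fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
1 - pchisq(fit$null.deviance - fit$deviance,
           fit$df.null - fit$df.residual)   # small p-value: the overall regression is significant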

Leave-one-out cross-validation is preferred

False - K fold is preferred.

L2 penalty term measures sparsity

False - The L1 penalty measures sparsity; the L2 penalty does not measure sparsity and does not perform variable selection

The L2 penalty measures the sparsity of a vector and forces regression coefficients to be zero

False - L1, Lasso, penalty measures sparsity of a vector and forces regression coefficients to be zero

The interpretation of the regression coefficients is the same for both Logistic and Poisson regressions.

False - Logistic : in terms of log odds. Poisson: in terms of log ratio of the rate.

The regression coefficients for the Poisson regression model can be estimated in exact/closed form.

False - MLE is NOT closed form.

If a Logistic Regression model provides accurate classification, then we can conclude that it is a good fit for the data

False - accuracy is not the same as good fit; a GOF test determines good fit, and the two do not have to co-exist.

It is good practice to apply variable selection without understanding the problem at hand to reduce bias.

False - always understand the problem at hand to better select variables for the model.

When the number of predicting variables is large, both backward and forward stepwise regressions will always select the same set of variables

False - backward and forward regression will not always select the same set of variables

Backward stepwise regression is preferable over forward stepwise regression because it starts with larger models.

False - backward stepwise regression is more computational expensive than forward stepwise regression and generally selects a larger model.

We cannot use the training error rate as an estimate of the true error classification error rate because it is biased upward.

False - biased downward

For logistic regression we can define residuals for evaluating model goodness of fit for models with and without replication.

False - residuals can only be defined with replications, under the assumption that Yi is binomial with ni greater than 1

It is good practice to perform a GOF test of Logistic Regression models without replications

False - can only do with replication on Logistic Regression

Complex models with many predictors are often extremely biased, but have low variance.

False - complex models with many predictors often have low bias but high variance

Trying all 3 link functions for a logistic regression (complementary log-log, probit, logit) will produce models with the same GOF for a dataset

False - different links produce different fits, thus the GOF would differ

In Logistic Regression, the estimated value for a regression coefficient Bi represents the estimated expected change in the response variable associated with one unit increase in the corresponding predicting variable xi holding the rest constant

False - it represents the expected change with respect to the log odds of success

In MLR, the proportion of variability in the response variable that is explained by the predicting variables is called adjusted R2

False - explained by the model (not predicting variables)

In Logistic Regression, if the p-value of the deviance test for GOF is smaller than the significance level alpha, then it is plausible that the model is a good fit

False - for GOF, a small p-value means the model is not a good fit (the reverse of how p-values are read for the significance of regression coefficients)

After fitting a logistic regression model, a plot of residuals versus fitted values is useful for checking if model assumptions are violated.

False - for logistic regression use deviance residuals.

The presence of multicollinearity in a MLR model will not impact the standard errors of the estimated regression coefficients.

False - it can impact the standard errors of the estimated regression coefficients.

Stepwise regression is a greedy algorithm searching through all possible combinations of the predicting variables to find the model with the best score

False - it does not guarantee the best score and not all possible combinations are checked

In Logistic regression, the error terms are assumed to follow a normal distribution

False - it doesn't have an error term

Least Squares Estimation (LSE) cannot be applied to GLM models.

False - it is applicable but does not use data distribution information fully.

When a statistically insignificant variable is discarded from the model, there is little change in the other predictors' statistical significance.

False - it is possible that when a predictor is discarded, the statistical significance of other variables will change.

For the testing procedure for subsets of coefficients, we compare the likelihood of a reduced model versus a full model. This is a goodness of fit test

False - it provides inference of the predictive power of the model

Forward stepwise will select larger models than backward.

False - it will typically select smaller models especially if p is large

The log-likelihood function is a linear function with a closed-form solution

False - log likelihood is non linear. A numerical algorithm is needed in order to maximize it.

Regression models are only appropriate for continuous response variables.

False - logistic and poisson model probability and rate

The logit link function is the best link function to model binary response data because it always fits the data better than other link functions.

False - logit function is not the only function that yields the s-shaped kind of curve.

Models with many predictors have high bias but low variance.

False - low bias and high variance

If the Cook's distance for any particular observation is greater than one, that data point is definitely a record error and thus needs to be discarded.

False - we must compare it against the other data points and investigate the observation; is 1 really too large for this data?

If data on (Y, X) are available at only two values of X, then the model Y = β1X + β2X² + ε provides a better fit than Y = β0 + β1X + ε.

False - with data at only two values of X, there is nothing to determine whether a quadratic model is necessary or required.

In a greenhouse experiment with several predictors, the response variable is the number of seeds that germinate out of 60 that are planted with different treatment combinations. A Poisson regression model is most appropriate for modeling this data

False - the response is the number of successes out of a fixed total (60 seeds), so a logistic (binomial) regression is more appropriate; Poisson regression models unbounded count or rate data.

In logistic regression, R^2 could be used as a measure of explained variation in the response variable.

false

In simple linear regression, the confidence interval of the response increases as the distance between the predictor value and the mean value of the predictors decreases.

false

T/F: The proportion of variability in the response variable that is explained by the predicting variables is called correlation.

false

The F-test can be used to test for the overall regression in Poisson regression.

false

The interpretation of the regression coefficients is the same for both Logistic and Poisson regression.

false

We do not need to assume independence between data points for making inference on the regression coefficients.

false

If a logistic regression model provides accurate classification, then we can conclude that it is a good fit for the data.

false "Goodness of fit doesn't guarantee good prediction." And conversely, good prediction doesn't guarantee the model is a good fit.

In a multiple linear regression model with n observations, all observations with Cook's distance greater than 4/n should always be discarded from the model.

false An observation should not be discarded just because it is found to be an outlier. We must investigate the nature of the outlier before deciding to discard it.

In a simple linear regression model, any outlier has a significant influence on the estimated slope parameter.

false An outlier does not necessarily have a large influence on model parameters. When it does, we call it an influential point.

Mallow's Cp statistic penalizes for complexity of the model more than both leave-one-out CV and Bayesian information criterion (BIC).

false BIC penalizes complexity more than the other approaches.

Backward stepwise regression is computationally preferable over forward stepwise regression.

false Backward stepwise regression is more computational expensive than forward stepwise regression and generally selects a larger model.

You obtained a statistically significant F-statistic when testing for equal means across four groups. The number of unique pairwise comparisons that could be performed is seven

false For k=4 treatments, there are k(k-1)/2 = 4(4-1)/2 = 6 unique pairs of treatments. The number of unique pairwise comparisons that could be performed is six.

Consider a multiple linear regression model with intercept. If two predicting variables are categorical and each variable has three categories, then we need to include five dummy variables in the model

false In a multiple linear regression model with intercept, if two predicting variables are categorical and both have k=3 categories, then we need to include 2*(k-1) = 2*(3-1) = 4 dummy variables in the model.
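This count can be checked directly in R with model.matrix(); f1 and f2 below are hypothetical factors with three levels each:

f1 <- factor(c("a", "b", "c", "a", "b", "c"))
f2 <- factor(c("x", "x", "y", "y", "z", "z"))
colnames(model.matrix(~ f1 + f2))   # intercept plus 2 dummies for f1 and 2 dummies for f2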

what are two ways to transform data?

power and log transformation

We do not interpret beta with respect to the response variable for a Poisson model but with....

respect to the ratio of the rate.

When conducting ANOVA, the larger the between-group variability is relative to the within-group variability, the larger the value of the F-statistic will tend to be.

true Given the formula of the F-statistic a larger increase in the numerator (between-group variability) compared to the denominator will result in a larger F-statistic ; hence, the larger MSSTr is relative to MSE, the larger the value of F-stat.

log rate

the log function of the expected value of the response

What is the interpretation of coefficient Beta in terms of logistic regression?

the log of the odds ratio for an increase of one unit in the predicting variable, holding all other variables constant
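For example, in R, exponentiating the coefficients of a hypothetical fitted logistic model gives the estimated odds ratios per one-unit increase:

fit <- glm(y ~ x1 + x2, family = binomial, data = dat)   # hypothetical data
coef(fit)        # log odds ratios
exp(coef(fit))   # multiplicative change in the odds per one-unit increase, holding other predictors constant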

If one confidence interval in the pairwise comparison includes zero under ANOVA, we conclude that the two corresponding means are plausibly equal.

true

If there are specific variables that are required to control the bias selection in the model, they should be forced into the model and not be part of the variable selection process.

true

In ANOVA, to test the null hypothesis of equal means across groups, the variance of the response variable must be the same across all groups.

true

In a simple linear regression model, we can assess if the residuals are correlated by plotting them against fitted values.

true

In a simple linear regression model, we need the normality assumption to hold for deriving a reliable prediction interval for a new response.

true

In ridge regression, when the penalty constant lambda (λ) equals zero, the corresponding ridge coefficient estimates are the same as the ordinary least squares estimates.

true

Multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.

true

Ridge regression can be used to deal with problems caused by high correlation among the predictors.

true

T/F: A partial F-test can be used to test whether a subset of regression coefficients are all equal to zero.

true
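A minimal R sketch of a partial F-test comparing nested models (variable names hypothetical):

reduced <- lm(y ~ x1, data = dat)
full    <- lm(y ~ x1 + x2 + x3, data = dat)
anova(reduced, full)   # F-test of H0: the coefficients of x2 and x3 are both zero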

T/F: Analysis of variance (ANOVA) is a multiple regression model.

true

T/F: Before making statistical inference on regression coefficients, estimation of the variance of the error terms is necessary.

true

T/F: In the case of multiple linear regression, controlling variables are used to control for sample bias.

true

T/F: The estimated coefficients obtained by using the method of least squares are unbiased estimators of the true coefficients.

true

T/F: The regression coefficient corresponding to one predictor in multiple linear regression is interpreted in terms of the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable holding all other predictors fixed.

true

The L1 penalty measures the sparsity of a vector and forces regression coefficients to be zero.

true

The estimators of the error term variance and of the regression coefficients are random variables.

true

The larger the coefficient of determination or R-squared, the higher the variability explained by the simple linear regression model.

true

The lasso regression requires a numerical algorithm to minimize the penalized sum of least squares.

true

The one-way ANOVA is a linear regression model with one qualitative predicting variable.

true

The penalty constant lambda (λ) in penalized regression controls the trade-off between lack of fit and model complexity.

true

We can assess the assumption of constant-variance in multiple linear regression by plotting the standardized residuals against fitted values.

true

We estimate the regression coefficients in Poisson regression using the maximum likelihood estimation approach.

true

A binary response variable with replications in logistic regression has a Binomial distribution.

true A binary response variable with replications does follow a Binomial distribution.

Suppose that we have a multiple linear regression model with k quantitative predictors, a qualitative predictor with l categories and an intercept. Consider the estimated variance of error terms based on n observations. The estimator should follow a chi-square distribution with n − k − l degrees of freedom.

true For this example, we use k + l df to estimate the following parameters: k regression coefficients associated to the k quantitative predictors, ( l − 1 ) regression coefficients associated to the ( l − 1 ) dummy variables and 1 regression coefficient associated to the intercept. This leaves n − k − l degrees of freedom for the estimation of the error variance.

Lambda

λ is a constant that balances the tradeoff between the lack of fit (measured by the SSE) and the complexity (measured by the penalty) which depends on the regression coefficients. The bigger λ, the bigger the penalty for model complexity

T/F: Cook's distance measures how much the fitted values (response) in the multiple linear regression model change when the ith observation is removed.

T

T/F: If the residuals are not normally distributed, then we can model instead the transformed response variable where the common transformation for normality is the Box-Cox transformation.

T
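One common way to choose such a transformation in R is MASS::boxcox() applied to a fitted lm object (model and data hypothetical):

library(MASS)
fit <- lm(y ~ x1 + x2, data = dat)
bc <- boxcox(fit)                 # profile log-likelihood over a grid of lambda values
lambda <- bc$x[which.max(bc$y)]   # lambda with the highest likelihood; lambda = 0 corresponds to the log transformation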

T/F: In Poisson regression, there is a linear relationship between the log rate and the predicting variables.

T

T/F: Influential points in multiple linear regression are outliers.

T

T/F: Multicollinearity can lead to less accurate statistical significance of some of the regression coefficients.

T

T/F: Multicollinearity in the predicting variables will impact the standard deviations of the estimated coefficients.

T

T/F: The assumptions to diagnose with a linear regression model are independence, linearity, constant variance, and normality.

T

T/F: The estimated regression coefficients in Poisson regression are approximate.

T

T/F: The estimator σ̂² is a random variable.

T

T/F: The linear regression model under normality is also a generalized linear model with link function the identity link function.

T

T/F: The linear regression model with a qualitative predicting variable with k levels/classes will have k + 1 parameters to estimate

T

T/F: The logit function is the log of the ratio of the probability of success to the probability of failure. It is also known as the log odds function.

T

T/F: The logit link function is not the only S-shape function that can be used to model binary response data.

T

T/F: The mean sum of square errors in ANOVA measures variability within groups.

T

T/F: The prediction interval will never be smaller than the confidence interval for data points with identical predictor values.

T

T/F: The prediction of the response variable has higher uncertainty than the estimation of the mean response.

T

T/F: The presence of certain types of outliers can impact the statistical significance of some of the regression coefficients.

T

T/F: Under the normality assumption, the estimator for β 1 is a linear combination of normally distributed random variables.

T

T/F: We can use a Z-test to test for the statistical significance of a coefficient given all predicting variables in a Poisson regression model.

T

T/F: We can use a t-test to test for the statistical significance of a coefficient given all predicting variables in a multiple linear regression model.

T

T/F: We could diagnose the normality assumption using the normal probability plot.

T

T/F: When a Poisson regression does not fit the data well, it may be that there is more variability in the response than provided by the model.

T

T/F: When making a prediction for predicting variables on the "edge" of the space of predicting variables, then its uncertainty level is high.

T

T/F; For both logistic and Poisson regression, both the Pearson and deviance residuals should approximately follow the standard normal distribution if the model is a good fit for the data.

T

T/F; The estimator of the mean response is unbiased.

T

T/F: In logistic regression, there is not a linear relationship between the probability of success and the predicting variables.

T - the probability of success is not a linear function of the predictors; it is the link (logit) of the probability that is a linear function of them.

A measure of the bias-variance tradeoff is the prediction risk

TRUE

Explanatory variable is one that explains changes in the response variable

TRUE

Mallow's CP is useful when there are no control variables.

TRUE

Bayesian Information Criterion

The complexity penalty is (# of predictors in submodel × true variance of full model × log(n)) / n. BIC penalizes complexity more than the other approaches and is thus preferred in model selection for prediction. BIC is similar to AIC except that the AIC complexity penalty is further multiplied by log(n)/2. For BIC in R, we need to specify k = log(n). Select the model with the smallest BIC.

Characteristics of Spatial Regression

Trend: a long-distance increase/decrease in the data over space. Periodicity/Seasonality: not common for a spatial process. Heteroskedasticity: the variability varies with space.

If the constant variance assumption does not hold, we transform the response variable.

True

If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable.

True

If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption.

True

If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation.

True

If the residuals of a MLR model are not normally distributed, we can model the transformed response variable instead, where a common transformation for normality is the Box-Cox transformation

True

If we reject the test of equal means, we conclude that some treatment means are not equal.

True

In Logistic Regression, we can perform residual analysis for binary data with replications

True

In Logistic regression, the relationship between the probability of success and predicting variables is non linear

True

In MLR, we can assess the assumption of constant variance by plotting the standardized residuals against fitted values

True

In Poisson Regression we do not interpret beta with respect to the response variable but with respect to the ratio of the rate.

True

In Poisson regression, the underlying assumption is that the response variable has a Poisson distribution; alternatively, responses could be waiting times, with an exponential distribution

True

In a Poisson regression model, we use a chi-squared test to test the overall regression.

True

Mean square error is commonly used in statistics to obtain estimators that may be biased but are less uncertain than unbiased ones, which is often preferred.

True

Multicollinearity can lead to misleading conclusions on the statistical significance of the regression coefficients of a MLR model

True

Multicollinearity in MLR means that the columns in the design matrix are (nearly) linearly dependent.

True

Multiple linear regression is a general model encompassing both ANOVA and simple linear regression.

True

One problem with fitting a normal regression model to Poisson data is the departure from the assumption of constant variance

True

One reason why the logistic model may not fit is the relationship between logit of the expected probability and predictors might be multiplicative, rather than additive

True

Overdispersion is when the variability of the response variable is larger than estimated by the model

True

Partial F-Test can also be defined as the hypothesis test for the scenario where a subset of regression coefficients are all equal to zero.

True

In the Poisson distribution, the variance is equal to the expectation. Thus, the variance is not constant

True

Predicting variable is used in regression to predict the outcome of another variable.

True

Predictive power means that the predicting variables predict the data even if one or more of the assumptions do not hold.

True

Random sampling is computationally more expensive than the K-fold cross validation, with no clear advantage in terms of the accuracy of the estimation classification error rate.

True

Residual analysis can only be used to assess uncorrelated errors.

True

Ridge regression has a closed form expression for the estimated coefficients.

True

Simpson's Paradox - the reversal of association when looking at marginal vs conditional relationships

True

Stepwise regression is a greedy algorithm.

True

Studying the relationship between a single response variable and more than one predicting quantitative and/or qualitative variable is termed as Multiple linear regression.

True

T/F: Akaike Information Criterion (AIC) is an estimate for the prediction risk.

True

T/F: Controlling variables used in multiple linear regression are used to control for bias in the sample.

True

T/F: Elastic net regression uses both penalties of ridge and lasso regression and hence combines the benefits of both.

True

T/F: For a given predicting variable, the estimated coefficient of regression associated with it will likely be different in a model with other predicting variables or in the model with only the predicting variable alone.

True

T/F: If there are specific variables that are required to control the bias selection in the model, they should be forced into the model and not be part of the variable selection process.

True

T/F: In a multiple linear regression model with 6 predicting variables but without intercept, there are 7 parameters to estimate.

True

T/F: In multiple linear regression we study the relationship between a single response variable and several predicting quantitative and/or qualitative variables.

True

T/F: In multiple linear regression, we study the relationship between one response variable and both predicting quantitative and qualitative variables.

True

T/F: In order to make statistical inference on the regression coefficients, we need to estimate the variance of the error terms.

True

T/F: In ridge regression, when the penalty constant lambda (λ) equals zero, the corresponding ridge coefficient estimates are the same as the ordinary least squares estimates.

True

T/F: Ridge regression can be used to deal with problems caused by high correlation among the predictors.

True

T/F: The L1 penalty measures the sparsity of a vector and forces regression coefficients to be zero.

True

T/F: The error term in the multiple linear regression cannot be correlated.

True

T/F: The error term variance estimator has a χ² (chi-squared) distribution with n - 11 degrees of freedom for a multiple regression model with 10 predictors and an intercept.

True

T/F: The estimated regression coefficient corresponding to a predicting variable will likely be different in the model with only one predicting variable alone versus in a model with multiple predicting variables.

True

T/F: The estimated regression coefficients are unbiased estimators.

True

T/F: The estimated variance of the error terms is the sum of squared residuals divided by the sample size minus the number of predictors minus one.

True

T/F: The hypothesis test for whether a subset of regression coefficients are all equal to zero is a partial F-test.

True

T/F: The lasso regression requires a numerical algorithm to minimize the penalized sum of least squares.

True

T/F: The penalty constant lambda (λ) in penalized regression controls the trade-off between lack of fit and model complexity.

True

T/F: We cannot estimate a multiple linear regression model if the predicting variables are linearly dependent.

True

T/F: We need to assume normality of the response variable for making inference on the regression coefficients.

True

T/F: When selecting variables for explanatory purpose, one might consider including predicting variables which are correlated if it would help answer your research hypothesis.

True

The ANOVA is a linear regression model with one or more qualitative predicting variables.

True

The ANOVA model with a qualitative predicting variable with k levels/classes will have k+1 parameters to estimate.

True

The L1 penalty results in a sparse matrix, and therefore does variable selection.

True

The L2 penalty does not result in a sparse matrix.

True

The R² value represents the percentage of variability in the response that can be explained by the linear regression on the predictors. Models with higher R² are always preferred over models with lower R².

True

The Mallow's CP complexity penalty is two times the size of the model (the number of variables in the submodel) times the estimated variance divided by n.

True

The Partial F-Test can test whether a subset of regression coefficients are all equal to zero.

True

The Sum of Squares Regression (SSR) measures the explained variability captured by the regression model given the explanatory variables used in the model.

True

The advantage of having a biased model with less predicting variables is the reduction in uncertainty of predictions of future responses.

True

The decision in using ANOVA table for testing whether a model is significant depends on the normal distribution of the response variable

True

The deviance residuals are the signed square root of the log-likelihood evaluated at the saturated model

True

The equation to find the estimated variance of the error terms can be obtained by summing up the squared residuals and dividing that by n - p - 1, where n is the sample size and p is the number of predictors.

True

The estimated regression coefficients in lasso regression are obtained using a numerical algorithm.

True

The estimated regression coefficients in ridge regression are obtained using an exact, closed-form expression.

True

The estimated regression coefficients from Lasso are less efficient than those provided by the ordinary least squares

True

The gam() function is a non-parametric test to determine what transformation is best.

True
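A minimal sketch using the mgcv package (data and variable names hypothetical); the estimated smooths suggest whether a transformation of a predictor is needed:

library(mgcv)
fit <- gam(y ~ s(x1) + s(x2), data = dat)
plot(fit)      # estimated smooth functions of the predictors
summary(fit)   # an edf near 1 suggests an approximately linear relationship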

The hypothesis testing procedures for subsets of regression coefficients is not used for GOF assessment in Logistic Regression

True

The L2 penalty results in ridge regression.

True

The larger K is (the larger the number of folds), the less biased the estimate of the classification error is, but the higher its variability.

True

The lasso minimization problem is convex.

True

The least square estimation for the standard regression model is equivalent with Maximum Likelihood Estimation, under the assumption of normality.

True

The linear regression model with a qualitative predicting variable with k levels/classes will have k + 1 parameters to estimate

True

The log odds function, also called the logit function, is the log of the ratio between the probability of a success and the probability of a failure

True

The logit function is the log of the ratio of the probability of success to the probability of failure. It is also known as log odds function

True

The mean sum of square errors in ANOVA measures variability within groups.

True

The number of parameters to estimate in the case of a multiple linear regression model containing 5 predicting variables and no intercept is 6.

True

The penalty constant lambda in penalized regression controls the trade-off between lack of fit and model complexity

True

The role of the penalty constant λ is to control the trade-off between lack of fit and model complexity.

True

The prediction intervals are centered at the predicted value.

True

The prediction intervals need to be corrected for simultaneous inference when multiple predictions are made jointly.

True

The prediction of the response variable has higher uncertainty than the estimation of the mean response.

True

The prediction risk is the sum between the irreducible error and the mean square error

True

The regression coefficients can be estimated only if the predicting variables are not linearly dependent.

True

The regression coefficients that are estimated serve as unbiased estimators.

True

The sampling distribution of the estimated regression coefficients is centered at the true regression parameters.

True

The sampling distribution of the estimated regression coefficients is dependent on the design matrix.

True

The sampling distribution of the estimated regression coefficients is the t-distribution assuming that the variance of the error term is unknown and replaced by its estimate.

True

The sampling distribution of the prediction of a new response is a t-distribution.

True

The training risk is a biased estimator of the prediction risk

True

The variability of a submodel is smaller than the full model.

True

The variance of the response is equivalent to the expected value of the response in Poisson regression with no overdispersion.

True

To get AIC in R, we use AIC with a value of k=2

True
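
As a rough check in R, assuming a hypothetical fitted lm or glm object named model:
n <- nobs(model)          # number of observations used in the fit
AIC(model, k = 2)         # AIC; k = 2 is the default penalty per parameter
AIC(model, k = log(n))    # using k = log(n) gives BIC instead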

Under the normality assumption, the estimator for β 1 is a linear combination of normally distributed random variables.

True

Using leave-one-out cross-validation is equivalent to K-fold cross-validation where the number of folds is equal to the sample size of the training set.

True

Using stepwise regression, we can force variables into the model.

True

Variable selection can be used to deal with multicollinearity, reduce predictors and fit a model with more variables than observations.

True

We assess the assumption of constant-variance by plotting the residuals against fitted values.

True

We can assess the assumption of constant-variance in multiple linear regression by plotting the standardized residuals against fitted values.

True

We can do a partial F test to determine if variable selection is necessary.

True

We would like to have a prediction with low uncertainty for new settings. This means that we're willing to give up some bias to reduce the variability in the prediction.

True

When considering using generalized linear models, it's important to consider the impact of Simpson's paradox when interpreting relationships between explanatory variables and the response. This paradox refers to the reversal of these associations when looking at the marginal relationship compared to a conditional one.

True

When estimating confidence values for the mean response for all instances of the predicting variables, we should use a critical point based on the F-distribution to correct for the simultaneous inference.

True

When selecting variables for a model, one needs also to consider the research hypothesis, as well as any potential confounding variables to control for

True

When the data may not be normally distributed, AIC is more appropriate for variable selection than adjusted R-squared

True

If the VIF for each predicting variable is smaller than a certain threshold, we can say that there is not a problematic amount of multicollinearity in the MLR model

True

In MLR, we can diagnose the normality assumption by using the normal probability plot

True

In MLR, when using very large samples, relying on the p-values associated with the traditional hypothesis test with Ha: Bj not equal to 0 can lead to misleading conclusions on the statistical significance of the regression coefficients

True

A negative value of β 1 is consistent with an inverse relationship between the predictor variable and the response variable. (T/F)

True See 1.2 Estimation Method "A negative value of β 1 is consistent with an inverse relationship"

L0 penalty, which is the number of nonzero regression coefficients

True - not feasible for a large number of predicting variables, as it requires fitting all models

Poisson Assumptions - log transformation of the rate is a linear combination of the predicting variables, the response variables are independently observed, the link function g is the log function

True - remember, NO ERROR TERM

Standard linear regression could be used in place of Poisson regression, using the variance stabilizing transformation sqrt(Y + 3/8), if the number of counts is large

True - the number of counts can be small - then use Poisson

To estimate prediction risk we compute the prediction risk for the observed data and take the sum of squared differences between fitted values for sub model S and the observed values.

True - this is called training risk and it is a biased estimate of prediction risk

The g link function is also called the canonical link function.

True - which means that parameter estimates under logistic regression are fully efficient and tests on those parameters are better behaved for small samples.

We can use the z value to determine if a coefficient is equal to zero in logistic regression.

True - z value = (Beta-0)/(SE of Beta)

When p, the number of predictors, is larger than n, the number of observations, the Lasso selects, at most, n variables

True - when p is greater than n, the lasso will select at most n variables

The estimated regression coefficients in Poisson regression are approximate

True.

The pooled variance estimator, s pooled^2, in ANOVA is synonymous with the variance estimator, σ ^ 2, in simple linear regression because they both use mean squared error (MSE) for their calculations. (T/F)

True. See 1.2 Estimation Method for simple linear regression See 2.2 Estimation Method for ANOVA The pooled variance estimator is, in fact, the variance estimator.

A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 What is the probability of survival for a botrytis blight sample exposed to a sulfur concentration of 0.7 and a copper concentration of 0.9? 0.826 0.674 0.311 0.577

0.577 exp(3.58770 - 4.32735*0.7 - 0.27483*0.9) / (1 + exp(3.58770 - 4.32735*0.7 - 0.27483*0.9)) = 0.577
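
A quick R check of this calculation, using the estimated coefficients reported above:
eta <- 3.58770 - 4.32735 * 0.7 - 0.27483 * 0.9   # estimated log odds at concS = 0.7, concCu = 0.9
exp(eta) / (1 + exp(eta))                        # inverse logit, approximately 0.577
plogis(eta)                                      # same result with the built-in inverse logit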

A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 The p-value for testing the overall regression can be obtained from which of the following? 1-pchisq(718.76,19) 1-pchisq(419.33,2) 1-pchisq(363.53,3) 1-pchisq(299.43,17

1-pchisq(419.33,2) The chi-squared test statistic is the difference between the null deviance (718.76) and the residual deviance (299.43), which is 419.33. The degrees of freedom is the difference between the null deviance degrees of freedom (19) and the residual deviance degrees of freedom (17), which is 2.
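
A minimal R sketch; the first line plugs in the deviances reported above, and the second shows the equivalent computation from a fitted glm object (model is a hypothetical name):
1 - pchisq(718.76 - 299.43, df = 19 - 17)   # p-value for the test of overall regression
with(model, 1 - pchisq(null.deviance - deviance, df.null - df.residual))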

Akaike Information Criterion (AIC) is an estimate for the prediction risk

AIC indeed measures prediction risk.

When would we use the 'Comparing pairs of Means' method?

After we reject the null hypothesis of equal means

How can we diagnose multicollinearity?

An approach to diagnose collinearities through the computation of the Variance Inflation Factor, which you will compute for EACH predicting variable.
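
A small R sketch, assuming the car package and a hypothetical multiple linear regression fitted to a data frame dat:
library(car)                               # provides vif()
model <- lm(y ~ x1 + x2 + x3, data = dat)  # hypothetical model and data
vif(model)                                 # one VIF per predicting variable; values above ~10 are commonly flagged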

What is the hypothesis testing procedure for overall regression and what is it testing?

Analysis of Variance for multiple regression. We will use analysis of variance (ANOVA) to test the hypothesis that the regression coefficients are zero.

The estimated versus predicted regression line for a given x*: A. Have the same variance B. Have the same expectation C. Have the same variance and expectation D. None of the above

B. Have the same expectation 1.2 - Knowledge Check 3

Which one is correct? A. Independence assumption can be assessed using the residuals vs fitted values. B. Independence assumption can be assessed using the normal probability plot. C. Residual analysis can be used to assess uncorrelated errors. D. None of the above

C. Residual analysis can be used to assess uncorrelated errors. 1.3 - Knowledge Check 4

Classification Error Rate

Classification error rate is the probability that the new response is not equal to its classification under the classifier with threshold R; R is between 0 and 1. The most common value for R is 0.5; however, a different R can be used to improve the prediction accuracy

Classification

Classification is prediction of binary responses. If the predicted probability is large, then classify y star as a success

Logistic Regression

Commonly used for modeling binary response data. The response variable is a binary variable, and thus, not normally distributed. In logistic regression, we model the probability of a success, not the response variable. In this model, we do not have an error term

A variable that is related to both a predictor and a response. For example outdoor temperature as it relates to ice cream sales and home invasions.

Confounding variable

Constant Variance Assumption

Constant variance assumption means that it cannot be true that the model is more accurate for some parts of the population, and less accurate for other parts A violation of this assumption means that estimates are not as efficient as they could be in estimating the true parameters. It also results in poorly calibrated prediction intervals

Used to control for selection bias in a sample.

Controlling variable.

How else can we estimate the classification error without the need of observing new data?

Cross validation
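
A K-fold cross-validation sketch for the classification error of a logistic regression, assuming a hypothetical data frame dat with a 0/1 response y and a 0.5 classification threshold:
set.seed(1)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(dat)))           # randomly assign each row to a fold
cv_errs <- sapply(1:K, function(k) {
  fit  <- glm(y ~ ., family = "binomial", data = dat[folds != k, ])
  prob <- predict(fit, newdata = dat[folds == k, ], type = "response")
  mean((prob > 0.5) != dat$y[folds == k])                   # misclassification rate on the held-out fold
})
mean(cv_errs)                                               # cross-validated classification error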

The estimators for the regression coefficients are: A. Biased but with small variance B. Biased with large variance C. Unbiased under normality assumptions but biased otherwise. D. Unbiased regardless of the distribution of the data.

D. Unbiased regardless of the distribution of the data. 1.2 - Knowledge Check 2

The estimators for the regression coefficients are: A. Biased but with small variance B. Unbiased under normality assumptions but biased otherwise. C. Biased regardless of the distribution of the data. D. Unbiased regardless of the distribution of the data.

D. Unbiased regardless of the distribution of the data. 3.2 - Knowledge Check 2

Leverage points

Data points that are far from the mean of the x's

Data w/ replications vs Data w/o replications

Data with replications: We can observe binary data for repeated trials. That is a binomial distribution with more than one trial or ni greater than 1 Data without replications: For each unique set of the observed predicting variables, we can observe binary data with no repeated trials. That is a binomial distribution with one trial where ni = 1

A data point far from the mean of the x's and y's is always: A. an influential point and an outlier B. a leverage point but not an outlier C. an outlier and a leverage point D. an outlier but not a leverage point E. None of the above

E. None of the Above. See 1.9 Outliers and Model Evaluation. We only know that the data point is far from the mean of the x's and y's, so it fits the definition of a leverage point (which only requires being far from the mean of the x's); that eliminates the answers that do not include a leverage point. The remaining possibilities, "a leverage point but not an outlier" and "an outlier and a leverage point", can also be eliminated: we do not have enough information to know whether or not it is an outlier, so neither is always true. Hence none of the answers always holds.

T/F: A goodness-of-fit test should always be conducted after fitting a logistic regression model without repetition.

F

T/F: For assessing the normality assumption of the ANOVA model, we can only use the quantile-quantile normal plot of the residuals.

F

T/F: If a logistic regression model provides accurate classification, then we can conclude that it is a good fit for the data.

F

T/F: If one confidence interval in the pairwise comparison does not include zero, we conclude that the two means are plausibly equal.

F

T/F: If the VIF for each predicting variable is smaller than a certain threshold, then we can say that multicollinearity does not exist in this model.

F

T/F: If the constant variance assumption does not hold in multiple linear regression, we apply a transformation to the predicting variables.

F

T/F: Other link functions for the Poisson regression model are c-log-log and probit.

F

A multiple linear regression model with p predicting variables but no intercept has p model parameters.

False

BIC penalizes complexity less than other approaches.

False

Backward stepwise regression adds one variable to the model at a time starting with the full model.

False

Backward stepwise regression can be performed if p>n.

False

Backward stepwise regression is preferable over forward stepwise regression.

False

For testing if a regression coefficient is zero, the normal test can be used.

False

In the presence of near multicollinearity, the coefficient of variation decreases.

False

T/F: Observational studies allow us to make causal inference.

False

The Box-Cox transformation is commonly used to improve upon the linearity assumption.

False

Under Logistics Regression, the sampling distribution used for a coefficient estimator is a Chi-squared distribution when the sample size is large.

False - the sampling distribution of the coefficient estimator is approximately normal

The slope of a linear regression equation is an example of a correlation coefficient.

False - the correlation coefficient is the r value. Will have the same + or - sign as the slope.

Observational Studies

For observational studies, causality is rarely implied

When do we use transformations?

If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption. If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation. If the constant variance assumption does not hold, we transform the response variable.

What does it mean when we reject the H0 of the F test?

If we reject the null hypothesis, we will conclude that at least one of the predicting variables has explanatory power for the variability in the response.

OLS vs Robust Regression

In OLS, we minimize the sum of squared errors to estimate the expectation of Y given the predictor variables. In robust regression, we minimize the sum of absolute errors to estimate the median of Y given the predictor variables. Both the expectation and the median are measures of centrality of the distribution; however, the median is robust to outliers, whereas the mean (expectation) is not. The estimated confidence intervals are wider for robust regression than for ordinary least squares, and thus the statistical inference is more conservative.

Regularized Regression (Variable Selection)

In regularized regression for variable selection, we need to first re-scale all the predicting variables in order to be comparable on the same scale.

What is the difference between using Poisson regression versus the standard regression with, say, with the log transformation of the response variable?

In standard regression, the variance is assumed constant. In Poisson, the variance of the response is assumed to be equal to the expectation, since for the Poisson distribution, the variance is equal to the expectation. Thus the variance is not constant.

Stepwise regression is a greedy algorithm, what does that mean?

It does not guarantee to find the model with the best score.

How do we interpret ^B0?

It is the estimated expected value of the response variable, when the predicting variable equals zero.

Ridge Regression

It is used to correct for the impact of multicollinearity. It is commonly used to fit a model under multicollinearity, not for model selection, and it does not force any betas to 0. For this model, the penalty is the sum of squared regression coefficients times the lambda constant. Minimizing the penalized least squares criterion provides a closed-form expression for the estimated regression coefficients. When lambda = 0 we get the least squares estimates (low bias, high variance); as lambda increases, the betas shrink toward 0 (higher bias, lower variance). Some of the regression coefficients may cross the 0 line for large effective degrees of freedom, while others increase. In glmnet, ridge corresponds to alpha = 0.
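
A minimal R sketch, assuming the glmnet package and a hypothetical data frame dat with response y; alpha = 0 selects the ridge (L2) penalty, and glmnet standardizes the predictors by default:
library(glmnet)
X  <- model.matrix(y ~ ., data = dat)[, -1]   # predictor matrix without the intercept column
cv <- cv.glmnet(X, dat$y, alpha = 0)          # cross-validation to choose lambda
coef(cv, s = "lambda.min")                    # shrunken, but generally nonzero, coefficients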

If our ANOVA model has an intercept, then how many dummy variables and why?

K-1 because of linear dependence between the X's

In regularized regression, what balances the bias variance tradeoff?

Lambda

Plotting the residuals versus each predictor checks for which assumption?

Linearity

Linearity Assumption - Poisson

Linearity can be evaluated by plotting the log of the event rate versus the predicting variables. We can also evaluate linearity and the assumption of uncorrelated responses using the scatter plot of the residuals versus the predicting variables

What are the 4 assumptions of MLR?

Linearity, Constant Variance, Independence, Normality

Variable Selection Approaches

Mallows' Cp, AIC, BIC, cross-validation

Parameter estimation

Maximizing the log-likelihood function with respect to beta0, beta1, etc. in closed (exact) form is not possible because the log-likelihood function is a non-linear function of the model parameters, i.e., we cannot derive the estimated regression coefficients in an exact form. We use a numerical algorithm to estimate the betas (maximize the log-likelihood function). The estimated parameters and their standard errors are approximate estimates.

If we reject the null hypothesis for overall regression, what does that mean

Meaning that the overall regression has statistically significant power in explaining the response variable.

Constant variance assumption

Means that it cannot be true that the model is more accurate for some parts of the population, and less accurate for other parts of the populations. This can result in less accurate parameters and poorly-calibrated prediction intervals.

Using MLE, can we derive estimated coefficients/parameters in exact form?

No, they are approximate estimated parameters

What does the QQ plot and histogram check for?

Normality

Regression Analysis

Overview of the following models: - Linear Regression - ANOVA Regression - Multiple Linear Regression - Logistic Regression - Poisson Regression

Pearson Residuals - Poisson Regression

The Pearson residual is the standardized difference between the ith observed response and the estimated expected rate of events, lambda i hat, divided by the square root of the variance, where the variance equals lambda i hat. We need to standardize the difference between observed and expected responses since the responses have different variances. If the model is a good fit, the Pearson residuals are approximately normally distributed.

Biased Regression Penalties

Penalize large values of betas jointly. This should lead to multivariate shrinkage of the vector of regression coefficients

What are three ways we can transform the predicting variables?

Power, Log, Polynomial transformations

ANOVA

Relationship between the response variable y and one or more qualitative/categorical variables. We can write the response as the sum of the mean of the category from which the response is observed and an error term epsilon. Assumptions: same as linear regression, except that we do not have the linearity assumption, since we're not considering a relationship with a quantitative variable. The model parameters are the group mean parameters, along with the variance of the error terms. The mean parameters are estimated as the group sample means.

Logistic Regression with replications

Residuals: We can only define residuals for binary data with replications Goodness of Fit: We perform goodness of fit only for logistic regression with replications under the assumption that Yi is binomial with ni greater than 1

what is the relationship between s^2 and sigma^2?

S^2 estimates sigma^2

Elastic Net Regression

Simultaneously performs variable selection and continuous shrinkage, and can select groups of correlated variables. Elastic net often outperforms the lasso in terms of prediction accuracy. The L1 penalty generates a sparse model that enforces some of the regression coefficients to be 0. The L2 penalty removes the limitation on the number of selected variables, encourages a grouping effect, and stabilizes the L1 regularization path. In glmnet, alpha is between 0 and 1 (e.g., 0.5).
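
A short R sketch with the glmnet package, assuming a hypothetical data frame dat with response y; an alpha strictly between 0 and 1 (here 0.5) mixes the L1 and L2 penalties:
library(glmnet)
X  <- model.matrix(y ~ ., data = dat)[, -1]   # predictor matrix without the intercept column
cv <- cv.glmnet(X, dat$y, alpha = 0.5)        # elastic net
coef(cv, s = "lambda.min")                    # some coefficients are shrunk exactly to zero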

Elastic Net

Simultaneously performs variable selection and continuous shrinkage, and can select groups of correlated variables. Elastic net often outperforms the lasso in terms of prediction accuracy.

Spatial Process Assumptions

Stationarity: This means that the probability distribution of the spatial process does not change when shifted in space Isotropic: This means that the distribution of Ys is the same in all orientations

How can we find whether a transformation of the predicting variable will improve the fit?

Such a transformation can be identified by comparing the logit of the success rate versus the predicting variables.

T/F: A negative value of β 1 is consistent with an inverse relationship between x and y.

T

T/F: Cross-validation using random sampling is less computationally efficient (more computationally expensive) in estimating the model error rate than the K-fold cross validation.

T

T/F: For a classification model, training error tends to underestimate the true classification error rate of the model.

T

T/F: For logistic regression, if the p-value of the deviance test for goodness-of-fit is large, then it is an indication that the model is a good fit.

T

T/F: If one confidence interval in the pairwise comparison includes only positive values, we conclude that the difference in means is statistically significantly positive.

T

T/F: If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable.

T

T/F: We can perform goodness-of-fit analysis for a Poisson regression.

T

Variable selection addresses multicollinearity, high dimensionality, and the prediction vs. explanatory objective

TRUE

Variable selection is not special, it is affected by highly correlated variables

TRUE

Statistical Inference for Logistic Regression is not reliable for small sample size

TRUE - the inference is reliable only for large sample sizes

The R-squared and adjusted R-squared are not appropriate model comparisons for non linear regression but are for linear regression models.

TRUE - The underlying assumption of R-squared calculations is that you are fitting a linear model.

If p is larger than n, stepwise is feasible

TRUE - for forward, but not backward

The deviance and Pearson residuals are normally distributed

TRUE - the residual deviance is chi-square distributed

Stepwise is a heuristic search

TRUE it is also a greedy search that does not guarantee to find the best score

Linearity Assumption

The Logit transformation of the probability of success is a linear combination of the predicting variables. The relationship may not be linear, however, and transformation may improve the fit The linearity assumption can be evaluated by plotting the logit of the success rate versus the predicting variables. If there's a curvature or some non-linear pattern, it may be an indication that the lack of fit may be due to the non-linearity with respect to some of the predicting variables

Assumptions

The assumption that the errors are normally distributed is needed if we want to do any confidence or prediction intervals or hypothesis tests If this assumption is violated, hypothesis test and confidence and prediction intervals can be very misleading

Regression Coefficients

The estimated regression coefficients still have a closed form expression. However, we correct for the variance-covariance matrix of the error terms. The estimated regression coefficients remain unbiased, but the variance of the coefficients changes. The sampling distribution of the beta estimators also remains normal under the normality assumption.

Statistical Inference for Poisson Regression

The estimators for the regression coefficients in the Poisson regression are approximately unbiased. The mean of the approximate normal distribution is beta. This approximation relies on the large sample data

What is the f-statistic?

The f statistic is going to be the ratio between the mean sum of square regression and mean sum of square of error.

Goodness of Fit Test

The null hypothesis is that the model fits well. The alternative is that the model does not fit well The test statistic for the goodness of fit test is the sum of squared deviances which has a Chi-Square distribution with n- p- 1 degrees of freedom If the p-value is small, we reject the null hypothesis of good fit, and thus we conclude that the model is not a good fit. We want LARGE values of P. Large values of P indicate that the model may be a good fit For goodness of fit test, we compare the likelihoods of the saturated model versus the fitted model.
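
A minimal R sketch of the deviance goodness-of-fit test, assuming a hypothetical fitted logistic regression with replications named model:
1 - pchisq(deviance(model), df.residual(model))   # large p-value: no evidence of lack of fit
# the analogous test based on Pearson residuals:
1 - pchisq(sum(residuals(model, type = "pearson")^2), df.residual(model))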

L0 penalty

The number of nonzero regression coefficients. This is equivalent to searching over all models and not viable for a large p.

Classification

The prediction of binary responses. Classification is nothing more than a prediction of the class of your response, y* (y star), given the predictor variable, x* (x star). If the predicted probability is large, then classify y* as a success.

Response variable

The response data are Bernoulli, i.e., binomial with one trial, with a given probability of success

MLE

The resulting log-likelihood function to be maximized is very complicated and non-linear in the regression coefficients beta 0, beta 1, ..., beta p. MLE has good statistical properties under the assumption of a large sample size, i.e., large N. For large N, the sampling distribution of the MLEs can be approximated by a normal distribution. The least squares estimation for the standard regression model is equivalent to MLE under the assumption of normality. MLE is the most widely applied estimation approach.

Binomial Data

This is binary data with repetitions

Log Rate

This is the log function of the expected value of the response ln(λ(x)) = β0 + β1x

The objective of the residual analysis is

To evaluate departures from the model assumptions

Generalized Linear Model

To generalize the standard regression model to response data that do not have a normal distribution, this generalizes the linear model to response data coming from other distributions.

A negative value of β 1 is consistent with an inverse relationship between x and y .

True

A no-intercept model with one qualitative predicting variable with 3 levels will use 3 dummy variables.

True

In a multiple regression problem, a quantitative input variable x is replaced by x − mean(x). The R-squared for the fitted model will be the same

True

In a simple linear regression model, the variable of interest is the response variable.

True

In all the regression models we have considered (including multiple linear, logistic, and Poisson), the response variable is assumed to have a distribution from the exponential family of distribution.

True

Logistic Regression deals with the case where the dependent variable is binary, the conditional distribution is Binomial

True

Under the normality assumption, the estimator for β 1 is a linear combination of normally distributed random variables.

True

The mean sum of squared errors in ANOVA measures variability within groups. (T/F)

True. See 2.4 Test for Equal Means. MSE = within-group variability

Cook's distance (Di) measures how much the fitted values in a multiple linear regression model change when the ith observation is removed. (T/F)

True. See Lesson 3.11: Assumptions and Diagnostics "This is the distance between the fitted values of the model with all the observations versus the fitted values of the model discarding the i-th observation from the data used to fit the model. "

The presence of certain types of outliers, such as influential points, can impact the statistical significance of some of the regression coefficients. (T/F)

True. See Lesson 3.11: Assumptions and diagnostics Outliers that are influential can impact the statistical significance of the beta parameters.

An example of a multiple linear regression model is Analysis of Variance (ANOVA). (T/F)

True. See Lesson 3.2 Basic Concepts "Earlier, we contrasted the simple linear regression model with the ANOVA model... Multiple linear regression is a generalization of both models."

Curse of Dimensionality

Using non-parametric regression with an increasing number of predicting variables p, there are many, many possible regression functions F To maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension p

If p is large, what can we do instead?

We can perform a heuristic search, like stepwise regression

Backward Stepwise Regression

We start with all predictors in the full model and drop one predictor at a time Backward stepwise regression cannot be performed if p > n This is also more computationally expensive than forward stepwise regression This selects larger models if p is large
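
One way this might look in R, assuming a hypothetical full model fitted to a data frame dat; step() drops one variable at a time and uses AIC by default (k = log(n) would use BIC instead):
full <- lm(y ~ ., data = dat)             # hypothetical full model
step(full, direction = "backward")        # backward stepwise selection
step(full, direction = "both")            # forward-backward: add and drop one variable at a time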

Estimation of WLS

We use a least squares approach, but we need to account for the variance-covariance matrix sigma by weighting the error terms by their standard errors. When there is correlation (sigma is not diagonal), we need to decorrelate the errors; thus, we incorporate the inverse of the sigma matrix into the least squares objective function.

What can we do if the Normality or Constant Variance assumption does not hold?

We would transform the response variable. A common transformation is the Box-Cox transformation.

Prediction vs Explanatory Objective

When the objective is explanatory, we may include predicting variables that are correlated. For prediction, this should be avoided.

Multicollinearity

When the predicting variables are correlated, it is important to select variables in such a way that the impact of multicollinearity is minimized

When is Mallow's Cp useful?

When there are no control variables

Regularized Regression

Without penalization: estimate the betas by minimizing the SSE. With penalization: estimate the betas by minimizing the penalized SSE, i.e., SSE + λ*Penalty. The bigger λ, the bigger the penalty for model complexity

Do the error terms have constant variance?

Yes

Will R^2 always increase when we add predicting variables?

Yes

Is the predicted regression line the same as the estimated regression line at x*? How does it affect confidence intervals?

Yes, but the prediction confidence interval is wider than the estimation confidence interval because of the higher variability in the prediction.

Is training risk biased? why or why not?

Yes, the training risk is a biased estimate of prediction risk because we use the data twice. Once for fitting the model S and once for estimating the prediction risk. Thus, training risk is biased upward.

What is the distribution of binary data WITHOUT replications?

a binomial distribution with one trial where ni = 1

lambda

a constant that has the role of balancing the tradeoff between lack of fit and model complexity

What is B^ in MLR?

a linear combination of Y's and is normally distributed.

design matrix

a matrix consisting of columns of predicting variables including the column of ones corresponding to the intercept:

prediction risk

a measure of the bias-variance tradeoff

linear relationship

a simple deterministic relationship between 2 factors, x and y

correlation coefficient

a statistic that efficiently summarizes how well the X's are linearly related to Y

R^2 or coefficient of determination

a statistic that efficiently summarizes how well the x can be used to predict the response variable.

A linear regression model was fitted to estimate the response variable Height for black cherry trees using just the Diameter. The data frame has 31 observations. Here is the model summary, with some parts missing. Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 62.0313 A 14.152 1.49e-14 *** Diameter 1.0544 0.3222 3.272 0.00276 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 5.538 on B degrees of freedom Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445 What is the MSE for this model? a. 30.669 b. 11.201 c. 20.534 d. None of the above

a. 30.669 MSE is the square of the residual standard error: MSE = 5.538^2 = 30.669

Adjusted R^2

adjusted for the number of predicting variables, so it does not necessarily increase as we add more predicting variables

If our ANOVA model does not have an intercept, then how many dummy variables?

all k dummy variables

3 primary objectives of ANOVA

analyze the variability of data, test whether the means are equal, estimate confidence intervals

Pearson Residuals

The standardized difference between the ith observed response and the estimated expected response, which is ni times the estimated probability of success.

A linear regression model was fitted to estimate the response variable Height for black cherry trees using just the Diameter. The data frame has 31 observations. Here is the model summary, with some parts missing. Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 62.0313 A 14.152 1.49e-14 *** Diameter 1.0544 0.3222 3.272 0.00276 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 5.538 on B degrees of freedom Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445 What is the value of the correlation coefficient between Height and Diameter? a. 0.2697 b. 0.5193 c. 0.3222 d. None of the above

b. 0.5193 In simple linear regression, the correlation coefficient between the response and predictor variables is ρ = √(R^2). Since R^2 = 0.2697, ρ = √0.2697 = 0.5193.

conditional model (MLR)

captures the association of a predictor variable to the response variable, conditional of other predicting variables in the model.

Trying all three link functions for a logistic regression model (C-ln-ln, probit, logit) will produce models with the same goodness of fit for a dataset.

false

For Generalized Linear Models, including Poisson regression, the deviance residuals should approximately follow the standard normal distribution if the model is a good fit for the data.

true The deviance residuals are approximately N(0,1) if the model is a good fit.

If p is small,....

fit all submodels

what does residual analysis NOT check for? (for SLR assumptions)

independence

To test if a coefficient is less than a critical value, C, we conduct a one-sided test on the _________ tail of a ___________ distribution. left, normal left, t right, normal right, t None of the above

left, t See 1.4 Statistical Inference "For β 1 greater than zero we're interested on the right tail of the distribution of the β ^ 1."

ANOVA

linear regression with one or more qualitative predicting variables

simple linear regression

linear regression with one quantitative predicting variable

g link function

link the probability of success to the predicting variables

Forward-Backward stepwise regression

meaning adding and discarding one variable at a time iteratively.

Predictive power

means that the predicting variables predict the data even if one or more of the assumptions do not hold.

Multiple linear regression

multiple quantitative and qualitative predicting variables

The F-test is a _________ tailed test with ______ and ______ degrees of freedom. one, k, N-1 one, k-1, N-k two, k-1, N-k two, k, N-1 None of the above.

one, k-1, N-k See 2.4 Test for Equal Means The F-test is a one tailed test that has two degrees of freedom, namely k − 1 and N − k.

What is 'X*'?

predictor

In logistic regression, we model the__________________, not the response variable, given the predicting variables.

probability of a success

What inference does 'saturated vs fitted model' provide?

provides inferences on the goodness of the model

What kind of variable is a response variable and why?

random, because it varies with changes in the predictor/s along with other random changes.

For Poisson regression, the variance = ?

rate lambda

F-test measures...

ratio of between-group variability and within-group variability

Simpson's paradox

refers to reversal of an association when looking at a marginal relationship versus a partial or conditional one. This is a situation where the marginal relationship has a wrong sign.

Do we evaluate normality using residuals or the response variable?

residuals

What is the sampling distribution of ^B1?

t distribution with N-2 DF

The L0 penalty provides....

the best model given a selection criteria, but it requires fitting all submodels.

The bigger the lambda,.....

the bigger the penalty for model complexity.

The overall regression F-statistic tests the null hypothesis that

all the regression coefficients (excluding the intercept) are equal to zero

residuals

the difference between the observed response and the fitted responses

Cook's Distance

the distance between the fitted values of the model with all the data versus the fitted values of the model discarding the ith observation.
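
A short R sketch, assuming a hypothetical fitted model and using the common 4/n rule of thumb:
d <- cooks.distance(model)    # one Cook's distance per observation
n <- nobs(model)
which(d > 4 / n)              # observations worth investigating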

classification error rate

the probability that the new response is not equal to its classification.

How will we assess the normality assumption for the ANOVA model?

the quantile-quantile normal plot and the histogram of the residuals.

Deviance residuals

the signed square root of the log-likelihood evaluated at the saturated model, where we assume that the estimated expected response is the observed response, versus the fitted model.

Multicollinearity inflates...?

the standard error of the estimated coefficients

sum of square errors

the sum of square differences between the observations and the individual sample means

sum of square treatments

the sum of the squared differences between the sample means of the individual samples and the overall mean

Controlling factors

to control for selection bias in the sample. They are used as 'default' variables in order to capture more meaningful relationships.

Explanatory factors

to explain variability in the response variable; they may be included in the model even if other "similar" variables are in the model

what is the confidence interval used for?

to provide an interval estimate for the true average value of y for all members of the population with a particular value of x*

We interpret the beta in a logistic regression model in respect to?

to the odds of success

A Poisson regression model fit to a dataset with a small sample size will have a hypothesis testing procedure with more Type I errors than expected.

true

A logistic regression model may not be a good fit if the responses are correlated or if there is heterogeneity in the success that hasn't been modeled.

true

Akaike Information Criterion (AIC) is an estimate for the prediction risk.

true

An overdispersion parameter close to 1 indicates that the variability of the response is close to the variability estimated by the model.

true

The estimated regression coefficients in Poisson regression are approximate.

true The estimated parameters and their standard errors are approximate estimates.

To evaluate whether the model is a good fit or whether the assumptions hold, what should we use?

Use the Pearson or deviance residuals to evaluate whether they are normally distributed, and conclude whether the model is a good fit via hypothesis testing.

what is the prediction interval used for?

used to provide an interval estimate for a prediction of y for one member of the population with a particular value of x*

ANOVA measures...

the ratio of the variability between samples to the variability within samples.

pooled variance estimator (MSE)

we compare the means, assuming the variances are the same, and estimate the variance across all samples

Overdispersion

where the variability of the response variable is larger than estimated by the model.

outliers

which are data points far from the majority of the data in both x and y or just x

canonical link function

which means that parameter estimates under logistic regression are fully efficient and tests on those parameters are better behaved for small samples

MSE measures..

within-group variability

An example of a multiple regression model is Analysis of Variance (ANOVA).

True

An indication that a higher-order nonlinear relationship better fits the data is that the dummy variables are all, or nearly all, statistically significant

True

An overdispersion parameter close to 1 indicates that the variability of the response is close to the variability estimated by the model.

True

Analysis of Variance (ANOVA) is an example of a multiple regression model.

True

Assuming the model is a good fit, the residuals in simple linear regression have constant variance.

True

Before making statistical inference on regression coefficients, estimation of the variance of the error terms is necessary.

True

Both LASSO and ridge regression always provide greater residual sum of squares than that of simple multiple linear regression.

True

Classification is nothing else than prediction of binary responses.

True

Confounding variable is a variable that influences both the dependent variable and independent variable

True

Cook's distance (Di) measures how much the fitted values in a MLR model change when the ith observation is removed

True

Event rates can be calculated as events per units of varying size, this unit of size is called exposure

True

For Poisson regression, we can reduce type I errors of identifying statistical significance in the regression coefficients by increasing the sample size.

True

For a MLR model to be a good fit, we need the linearity assumption to hold for at least one of the predicting variables

False

For a given predicting variable, the estimated coefficient of regression associated with it will likely be different in a model with other predicting variables or in the model with only the predicting variable alone.

True

For a linearly dependent set of predictor variables, we should not estimate a multiple linear regression model.

True

For assessing the normality assumption of the ANOVA model, we can use the quantile-quantile normal plot and the histogram of the residuals.

True

For both Logistic and Poisson Regression, the deviance residuals should approximately follow the standard normal distribution if the model is a good fit for the data

True

For large sample size data, the distribution of the test statistic, assuming the null hypothesis, is a chi-squared distribution

True

If a predicting variable is categorical with 5 categories in a linear regression model without intercept, we will include 5 dummy variables in the model.

True

If one confidence interval in the pairwise comparison includes only positive values, we conclude that the difference in means is positive, and statistically significant.

True

If one confidence interval in the pairwise comparison includes only positive values, we conclude that the difference in means is statistically significantly positive.

True

If one confidence interval in the pairwise comparison includes zero under ANOVA, we conclude that the two corresponding means are plausibly equal.

True

If response variable Y has a quadratic relationship with a predictor variable X, it is possible to model the relationship using multiple linear regression.

True

To what distribution can we derive the confidence interval from?

t-distribution

The sampling distribution of β ^ 0 is a t-distribution chi-squared distribution normal distribution None of the above

t-distribution See 1.4 Statistical Inference The distribution of β ^ 0 is normal; since the error variance is unknown and replaced by its estimate, the sampling distribution of β ^ 0 used for inference is the t-distribution.

What is the sampling distribution for individual β hat?

t-distribution with n-p-1 DF

What can we use to test for statistical significance?

t-test

If there is a group of variables among which the correlation are very high, then the Lasso...

tends to select only one variable from that group and does not care which one is selected.

What is the F-test for?

test for overall regression

T/F: If you apply linear regression under normality to count data, the assumption of constant variance still holds.

F

T/F: In Poisson regression, we also need to check for the assumption of constant variance of the error terms.

F

T/F: In multiple linear regression, a VIF value of 6 for a predictor means that 80% of the variation in that predictor can be modeled by the other predictors.

F

T/F: In multiple linear regression, if the coefficient of a quantitative predicting variable is negative, that means the response variable will decrease as this predicting variable increases.

F

T/F: In multiple linear regression, we need the linearity assumption to hold for at least one of the predicting variables

F

T/F: In simple linear regression models, we lose three degrees of freedom because of the estimation of the three model parameters β 0 , β 1 , σ 2.

F

T/F: Multicollinearity in multiple linear regression means that the rows in the design matrix are (nearly) linearly dependent.

F

T/F: Only the log-transformation of the response variable can be used when the normality assumption does not hold.

F

T/F: The binary response variable in logistic regression has a Bernoulli distribution.

F

T/F: The canonical link function for Poisson regression is the logit link function.

F

T/F: The coefficient of variation is used to evaluate goodness-of-fit.

F

T/F: The constant variance assumption is diagnosed using the histogram.

F

T/F: The error term in logistic regression has a normal distribution.

F

T/F: The log-likelihood function is a linear function with a closed-form solution.

F

T/F: The logit link function is the best link function to model binary response data because the models produced always fit the data better than other link functions.

F

T/F: The mean sum of square errors in ANOVA measures variability between groups.

F

T/F: The number of degrees of freedom of the χ 2 (chi-square) distribution for the variance estimator is N − k + 1 where k is the number of samples.

F

T/F: The number of parameters that need to be estimated in a logistic regression model with 5 predicting variables and an intercept is the same as the number of parameters that need to be estimated in a standard linear regression model with an intercept and same predicting variables.

F

T/F: The prediction of the response variable and the estimation of the mean response have the same interpretation.

F

T/F: The prediction of the response variable has the same levels of uncertainty compared with the estimation of the mean response.

F

T/F: The regression coefficients are used to measure the linear dependence between two variables.

F

T/F: The regression coefficients for the Poisson regression can be estimated in exact/closed form.

F

T/F: The sampling distribution for the estimated regression coefficients under logistic regression is approximately t-distribution.

F

T/F: The sampling distribution for the variance estimator in ANOVA is χ 2 (chi-square) regardless of the assumptions of the data.

F

T/F: The sampling distribution of the predicted response variable used in statistical inference is normal in multiple linear regression under the normality assumption.

F

T/F: The sampling distribution of the prediction of the response variable is a χ 2(chi-squared) distribution.

F

T/F: The statistical inference for linear regression under normality relies on large size of sample data.

F

T/F: We assess the assumption of constant-variance by plotting the response variable against fitted values.

F

T/F: We can perform goodness-of-fit analysis through residual diagnosis for a logistic regression without replications.

F

T/F: We interpret logistic regression coefficients with respect to the response variable.

F

T/F: When testing a subset of coefficients, deviance follows a chi-square distribution with q degrees of freedom, where q is the number of regression coefficients in the reduced model.

F

T/F: β ^ 1 is an unbiased estimator for β0.

F

T/F; A linear regression model has high predictive power if the coefficient of determination is close to 1.

F

T/F: Under logistic regression, the sampling distribution used for a coefficient estimator is a chi-square distribution.

F, derived using MLE which is a normal distribution.

A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 Construct an approximate 95% confidence interval for the coefficient of concCu. (-0.322, -0.249) (-4.931, -3.724) (-4.847, -3.808) (-0.310,-0.240)

(-0.310,-0.240) [-0.27483-1.96*0.01784, -0.27483+1.96*0.01784]
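
A quick R check of this interval using the normal approximation, estimate plus or minus z(0.975) times the SE (confint.default() on a fitted glm gives the same Wald-type intervals):
-0.27483 + c(-1, 1) * qnorm(0.975) * 0.01784   # approximately (-0.310, -0.240)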

Penalized or regularized regression

When we perform variable selection and estimation simultaneously.

An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed in cells. A multiple linear regression model was fitted to estimate the effect of the number of cells, amount of the radiation dose (Grays), and the rate of the radiation dose (Grays/hour) on the number of chromosomal abnormalities observed. The data frame has 27 observations. Here is the model summary and Cook's Distance plot. Coefficient Estimate SE t-value Pr(>|t|) (Intercept) -74.15392 42.24544 -1.755 0.092518 cells 0.06871 0.02196 3.129 0.004709** doseamt 41.33160 9.13907 4.523 0.000153*** doserate 20.28402 8.29071 2.447 0.022482* ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 54.05 on X degrees of freedom Multiple R-squared: 0.5213, Adjusted R-squared: 0.4588 F-statistic: 8.348 on X and Y DF, p-value: 0.0006183 Suppose you wanted to test if the coefficient for doseamt is equal to 50. What t-value would you use for this test? 1.54 -0.948 0.692 -0.882

-0.948 t-value = (41.33160−50)/ 9.13907 = −8.6684/ 9.13907 = -0.9484991
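
A small R sketch of this test; the degrees of freedom are n - p - 1 = 27 - 3 - 1 = 23:
tval <- (41.33160 - 50) / 9.13907   # approximately -0.948
2 * pt(-abs(tval), df = 23)         # two-sided p-value for H0: coefficient = 50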

A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 Suppose you wanted to test if the coefficient for concCu is equal to -0.2. What z-value would you use for this test? 0.095 -0.073 -4.195 1.411

-4.195 (-0.27483-(-0.2))/0.01784 = -4.195

You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. For students with average math and language arts scores, how many more days on average are female students absent compared to their male counterparts? 4.8545 3.5729 2.2525 0.6697

2.2525 λ(Xmath, Xlangarts, Xmale) = exp(2.687666 - 0.003523*Xmath - 0.012152*Xlangarts - 0.400921*Xmale). λ̂(Xmath = 45.5, Xlangarts = 50, Xmale = 0) = exp(2.687666 - 0.003523*(45.5) - 0.012152*(50)) = 6.819386. λ̂(Xmath = 45.5, Xlangarts = 50, Xmale = 1) = exp(2.687666 - 0.003523*(45.5) - 0.012152*(50) - 0.400921) = 4.566963. The difference is 6.819386 - 4.566963 = 2.252423.
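
A quick R check of this calculation, using the estimated coefficients reported above:
rate <- function(math, langarts, male)
  exp(2.687666 - 0.003523 * math - 0.012152 * langarts - 0.400921 * male)
rate(45.5, 50, 0) - rate(45.5, 50, 1)   # approximately 2.2525 more days for female students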

Box-cox transformation

A common transformation is the power transformation y to the lambda used to improve the normality and/or constant variance assumption.
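
A minimal R sketch, assuming the MASS package and a hypothetical linear model fitted to a positive response; the lambda that maximizes the profile log-likelihood suggests the power transformation of the response:
library(MASS)                  # provides boxcox()
bc <- boxcox(model)            # plots the profile log-likelihood over a grid of lambda values
bc$x[which.max(bc$y)]          # lambda with the highest log-likelihood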

Influential points

A data point that is far from the mean of both x and y and that changes the value of the estimated parameters significantly. It can change the statistical significance, the magnitude, and even the sign of the estimates.

An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed in cells. A multiple linear regression model was fitted to estimate the effect of the number of cells, amount of the radiation dose (Grays), and the rate of the radiation dose (Grays/hour) on the number of chromosomal abnormalities observed. The data frame has 27 observations. Here is the model summary and Cook's Distance plot. Coefficient Estimate SE t-value Pr(>|t|) (Intercept) -74.15392 42.24544 -1.755 0.092518 cells 0.06871 0.02196 3.129 0.004709** doseamt 41.33160 9.13907 4.523 0.000153*** doserate 20.28402 8.29071 2.447 0.022482* ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 54.05 on X degrees of freedom Multiple R-squared: 0.5213, Adjusted R-squared: 0.4588 F-statistic: 8.348 on X and Y DF, p-value: 0.0006183 For an F-test of overall significance of the regression model, what degrees of freedom would be used? 3 , 24 2, 27 3, 23 2, 23

3, 23 The numerator degrees of freedom (ndf) is equal to p and the denominator degrees of freedom (ddf) is equal to n-p-1, where n: number of observations and p: number of predictors. Hence, ndf = 3 and ddf = 27-3-1 = 23
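A hedged sketch of how the corresponding p-value follows from these degrees of freedom in R:
f_stat <- 8.348; p <- 3; n <- 27
1 - pf(f_stat, df1 = p, df2 = n - p - 1)   # about 0.0006, matching the p-value in the summary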

The following output was captured from the summary output of a simple linear regression model that relates the duration of an eruption with the waiting time since the previous eruption. Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.374016 A -1.70 0.045141 * waiting 0.043714 0.011098 B 0.000052 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.4965 on 270 degrees of freedom Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108 F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16 Using the table above, what is the t-value of the coefficient for waiting, labeled B, and rounded to three decimal places? 3.939 3.931 3.935 None of the above

3.939 See 1.4 Statistical Inference t-value = Estimate /Std.Err= 0.043714/0.011098 = 3.939

You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. What is the expected number of days missed for a female student with a langarts of 48 and a math score of 50 based on the model? 6.8773 1.9106 6.6363 4.5251

6.8773 λ(Xmath, Xlangarts, Xmale) = e^( 2.687666−0.003523Xmath−0.012152Xlangarts−0.400921Xmale) λ(Xmath = 50, Xlangarts = 48, Xmale = 0) = e^( 2.687666−0.003523∗50−0.012152∗48−0.400921∗0) = 6.877258

influential points

A data point that is far from the mean of both the x's and the y's and that influences the fit of the regression.

An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed in cells. A multiple linear regression model was fitted to estimate the effect of the number of cells, amount of the radiation dose (Grays), and the rate of the radiation dose (Grays/hour) on the number of chromosomal abnormalities observed. The data frame has 27 observations. Here is the model summary and Cook's Distance plot. Coefficient Estimate SE t-value Pr(>|t|) (Intercept) -74.15392 42.24544 -1.755 0.092518 cells 0.06871 0.02196 3.129 0.004709** doseamt 41.33160 9.13907 4.523 0.000153*** doserate 20.28402 8.29071 2.447 0.022482* ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 54.05 on X degrees of freedom Multiple R-squared: 0.5213, Adjusted R-squared: 0.4588 F-statistic: 8.348 on X and Y DF, p-value: 0.0006183 Calculate the Sum of Squared Regression from the model summary. 17,484.25 73,163.60 67,181.18 55,284.40

73,163.60 You could calculate this value in several ways; this is one possible way. Fstat = MSReg/MSE = (SSReg/p)/MSE, so SSReg = Fstat * MSE * p = (8.348)(54.05^2)(3) = 73,163.60
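The same arithmetic in R (values copied from the summary; the residual standard error is 54.05, so MSE = 54.05^2):
f_stat <- 8.348; sigma <- 54.05; p <- 3
ssreg <- f_stat * sigma^2 * p   # SSReg = F * MSE * p
ssreg                           # about 73,164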

In Poisson regression, A) We model the log of the expected response variable not the expected log response variable. B) We use the ordinary least squares to fit the model. C) There is an error term. D) None of the above.

A

In logistic regression, A) The estimation of the regression coefficients is based on maximum likelihood estimation. B) We can derive exact (close form expression) estimates for the regression coefficients. C) The estimations of the regression coefficients is based on minimizing the sum of least squares. D) All of the above.

A

The mean squared errors (MSE) measures: A) The within-treatment variability. B) The between-treatment variability. C) The sum of the within-treatment and between-treatment variability. D) None of the above.

A

The objective of the residual analysis is: A) To evaluate departures from the model assumptions B) To evaluate whether the means are equal. C) To evaluate whether only the normality assumptions holds. D) None of the above.

A

The pooled variance estimator is: A) The sample variance estimator assuming equal variances. B) The variance estimator assuming equal means and equal variances. C) The sample variance estimator assuming equal means. D) None of the above.

A

Which is correct? A) The regression coefficients can be estimated only if the predicting variables are not linearly dependent. B) The estimated regression coefficient beta hat i is interpreted as the change in the response variable associated with one unit of change in the i-th predicting variable. C) The estimated regression coefficients will be the same under marginal and conditional model; only their interpretation is not. D) Causality is the same as association in interpreting the relationship between the response and predicting variables.

A

Which one is correct? A) We use a chi-square testing procedure to test whether a subset of regression coefficients are zero in Poisson regression. B) The test for subsets of regression coefficients is a goodness of fit test. C) The test for subsets of regression coefficients is reliable for small sample data in Poisson regression. D) None of the above.

A

A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of sulfur in the fungicide and concCu represents the concentration of copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 Interpret the coefficient for concCu. A 1-unit increase in the concentration of copper decreases the odds of botrytis blight surviving by 0.27483 holding sulfur constant. A 1-unit increase in the concentration of copper decreases the number of samples of botrytis blight surviving by 0.27483 holding sulfur constant. A 1-unit increase in the concentration of copper decreases the log odds of botrytis blight surviving by 0.27483 holding sulfur constant. A 1-unit increase in the concentration of copper decreases the probability of botrytis blight surviving by 0.27483 holding sulfur constant.

A 1-unit increase in the concentration of copper decreases the log odds of botrytis blight surviving by 0.27483, holding sulfur constant.

Assuming that the data are normally distributed, the estimated variance has the following sampling distribution under the simple linear model: A. Chi-square with n-2 degrees of freedom B. T-distribution with n-2 degrees of freedom C. Chi-square with n degrees of freedom D. T-distribution with n degrees of freedom

A. Chi-square with n-2 degrees of freedom 1.1 - Knowledge Check 1

The estimators of the linear regression model are derived by: A. Minimizing the sum of squared differences between observed and expected values of the response variable. B. Maximizing the sum of squared differences between observed and expected values of the response variable. C. Minimizing the sum of absolute differences between observed and expected values of the response variable. D. Maximizing the sum of absolute differences between observed and expected values of the response variable.

A. Minimizing the sum of squared differences between observed and expected values of the response variable. 1.1 - Knowledge Check 1

Which one is correct? A. The regression coefficients can be estimated only if the predicting variables are not linearly dependent. B. The estimated regression coefficient 𝛽∧𝑖 is interpreted as the change in the response variable associated with one unit of change in the i-th predicting variable . C. The estimated regression coefficients will be the same under marginal and conditional model, only their interpretation is not. D. Causality is the same as association in interpreting the relationship between the response and the predicting variables.

A. The regression coefficients can be estimated only if the predicting variables are not linearly dependent. 3.1 - Knowledge Check 1

The pooled variance estimator is: A. The variance estimator assuming equal variances. B. The variance estimator assuming equal means and equal variances. C. The sample variance estimator assuming equal means. D. None of the above.

A. The variance estimator assuming equal variances. 2.1 - Knowledge Check 1

The mean squared errors (MSE) measures: A. The within-treatment variability. B. The between-treatment variability. C. The sum of the within-treatment and between-treatment variability. D. None of the above.

A. The within-treatment variability. 2.1 - Knowledge Check 2

The objective of the residual analysis is A. To evaluate goodness of fit B. To evaluate whether the means are equal. C. To evaluate whether only the normality assumptions holds. D. None of the above.

A. To evaluate goodness of fit 2.2 - Knowledge Check 3

We detect departure from the assumption of constant variance A. When the residuals increase as the fitted values increase also. B. When the residuals vs fitted are scattered randomly around the zero line. C. When the histogram does not have a symmetric shape. D. All of the above.

A. When the residuals increase as the fitted values increase also. 1.3 - Knowledge Check 4

The sampling distribution of β ^ 0 is a: A. t-distribution B. chi-squared distribution C. normal distribution D. None of the above

A. t-distribution See 1.4 Statistical Inference The distribution of β 0 is normal. Since we are using a sample and not the full population, the sampling distribution of β ^ 0 is the t-distribution.

Multiple Linear Regression

Assumptions: - The deviances or error terms have 0 mean - Constant variance - They are independent. Statistical Inference: - We also need to assume that the error terms are normally distributed

Multiple Linear Regression

Assumptions: We assume that the error terms epsilon_i have zero mean and constant variance and that they are independent. For statistical inference we also assume that the error terms are normally distributed. The parameters defining the regression line are beta 0, beta 1, ..., beta p; there is also the additional parameter sigma squared, the variance of the error terms. These parameters are unknown and are estimated from the observed data using the ordinary least squares approach. Statistical inference on the regression coefficients uses the sampling distribution of the estimated regression coefficients, which is the t-distribution with n-p-1 degrees of freedom.

Comparing cross-validation methods, A) The random sampling approach is more computationally efficient than leave-one-out cross validation. B) In K-fold cross-validation, the larger K is, the higher the variability in the estimation of the classification error is. C) Leave-one-out cross validation is a particular case of the random sampling cross-validation. D) None of the above.

B

In logistic regression, A) The hypothesis test for subsets of coefficients is a goodness of fit test. B) The hypothesis test for subsets of coefficients is approximate; it relies on large sample size. C) We can use the partial F test for testing whether a subset of coefficients are all zero. D) None of the above.

B

The estimated versus predicted regression line for a given x*: A) Have the same variance B) Have the same expectation C) Have the same variance and expectation D) None of the above

B

The objective of the pairwise comparison is: A) To find which means are equal. B) To identify the statistically significantly different means. C) To find the estimated means which are greater or lower than other. D) None of the above.

B

The total sum of squares divided by N-1 is: A) The mean sum of squared errors B) The sample variance estimator assuming equal means and equal variances C) The sample variance estimator assuming equal variances. D) None of the above.

B

Which one is correct? A) The logit link function is the only link function that can be used for modeling binary response data. B) Logistic regression models the probability of a success given a set of predicting variables. C) The interpretation of the regression coefficients in logistic regression is the same as for standard linear regression assuming normality. D) None of the above.

B

Which one is correct? A) We can evaluate the goodness of fit a model using the testing procedure of the overall regression. B) In applying the deviance test for goodness of fit in logistic regression, we seek large p-values, that is, not reject the null hypothesis. C) There is no error term in logistic regression and thus we cannot perform a goodness of fit assessment. D) None of the above.

B

The fitted values are defined as: A. The difference between observed and expected responses. B. The regression line with parameters replaced with the estimated regression coefficients. C. The regression line. D. The response values.

B. The regression line with parameters replaced with the estimated regression coefficients. 1.1 - Knowledge Check 1

The total sum of squares divided by N-1 is A. The mean sum of squared errors B. The sample variance estimator assuming equal means and equal variances C. The sample variance estimator assuming equal variances. D. None of the above.

B. The sample variance estimator assuming equal means and equal variances 2.1 - Knowledge Check 2

The objective of the pairwise comparison is A. To find which means are equal. B. To identify the statistically significantly different means. C. To find the estimated means which are greater or lower than other. D. None of the above.

B. To identify the statistically significantly different means. 2.2 - Knowledge Check 3

To test if a coefficient is less than a critical value, C, we conduct a one-sided test on the _________ tail of a ___________ distribution. A. left, normal B. left, t C. right, normal D. right, t E. None of the above

B. left, t See 1.4 Statistical Inference "For β 1 greater than zero we're interested on the right tail of the distribution of the β ^ 1."

The F-test is a _________ tailed test with ______ and ______ degrees of freedom. A. one, k, N-1 B. one, k-1, N-k C. two, k-1, N-k D. two, k, N-1 E. None of the above.

B. one, k-1, N-k See 2.4 Test for Equal Means The F-test is a one tailed test that has two degrees of freedom, namely k − 1 and N − k.

what are the model parameters to be estimated in MLR?

B0 (intercept), B1-Bp, and sigma squared

what are the 3 parameters we estimated in regression?

B0, B1, sigma squared (variance of the one pop.)

In SLR, we are interested in the behavior of which parameter?

B1

what does 'statistical significance' mean?

B1 is statistically different from zero.

AIC vs BIC

BIC is similar to AIC except that the complexity penalty uses log(n) in place of 2, so BIC penalizes complexity more heavily (by a factor of log(n)/2).
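For reference, the standard definitions (log L is the maximized log-likelihood, p the number of parameters, n the sample size):
AIC = -2 log L + 2p
BIC = -2 log L + p log(n)
Since log(n) exceeds 2 once n > 7, BIC applies the heavier complexity penalty.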

Which is more computationally expensive, forward or backward?

Backward

Which is correct? A) A multiple linear regression model with p predicting variables but no intercept has p model parameters. B) The interpretation of the regression coefficients is the same whether or not interaction terms are included in the model. C) Multiple linear regression is a general model encompassing both ANOVA and simple linear regression. D) None of the above.

C

In logistic regression: A) We can perform residual analysis for response data with or without replications. B) Residuals are derived as the fitted values minus the observed responses. C) The sampling distribution of the residual is approximately normal distribution if the model is a good fit. D) All of the above.

C

The assumption of normality: A) It is needed for deriving the estimators of the regression coefficients. B) It is not needed for linear regression modeling and inference. C) It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference. D) It is needed for deriving the expectation and variance of the estimators of the regression coefficients.

C

The estimators for the regression coefficients are: A) Biased but with small variance B) Unbiased under normality assumptions but biased otherwise. C) Unbiased regardless of the distribution of the data.

C

The variability in the prediction comes from: A) The variability due to a new measurement. B) The variability due to estimation. C) The variability due to a new measurement and due to estimation. D) None of the above.

C

Using the R statistical software to fit a logistic regression, A) We can use the lm() command. B) The input of the response variable is exactly the same if the binary response data are with or without replications. C) We can obtain both the estimates and the standard deviations of the estimates for the regression coefficients. D) None of the above.

C

Which is correct? A) If we reject the test of equal means, we conclude that all treatment means are not equal. B) If we do not reject the test of equal means, we conclude that means are definitely all equal C) If we reject the test of equal means, we conclude that some treatment means are not equal. D) None of the above.

C

Link Functions

Complementary log-log (c-log-log) function: has very long tails, meaning it works best for extremely skewed distributions. Probit function: the inverse of the CDF of the standard normal distribution; it has the least-heavy tails among the three S-shaped functions and works well when the probabilities are all concentrated within a small range. Logit function: the canonical link function, which means that parameter estimates under logistic regression are fully efficient and tests on those parameters are better behaved for small samples. The interpretation of regression coefficients in terms of log odds is possible with the logit function but not with the other S-shaped functions.
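A minimal sketch of fitting the three links with glm() on toy grouped binary data (the data are simulated here purely for illustration):
set.seed(42)
dose <- seq(0, 5, length.out = 20)
trials <- rep(100, 20)
p_true <- plogis(2 - 1.2 * dose)                       # true survival probabilities
died <- rbinom(20, size = trials, prob = 1 - p_true)
survived <- trials - died
fit_logit   <- glm(cbind(survived, died) ~ dose, family = binomial(link = "logit"))
fit_probit  <- glm(cbind(survived, died) ~ dose, family = binomial(link = "probit"))
fit_cloglog <- glm(cbind(survived, died) ~ dose, family = binomial(link = "cloglog"))
# Only the logit fit gives coefficients directly interpretable as changes in log odds.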

Which is correct? A. If we reject the test of equal means, we conclude that all treatment means are not equal. B. If we do not reject the test of equal means, we conclude that means are definitely all equal C. If we reject the test of equal means, we conclude that some treatment means are not equal. D. None of the above.

C. If we reject the test of equal means, we conclude that some treatment means are not equal. 2.1 - Knowledge Check 2

The assumption of normality: A. It is needed for deriving the estimators of the regression coefficients. B. It is not needed for linear regression modeling and inference. C. It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference. D. It is needed for deriving the expectation and variance of the estimators of the regression coefficients.

C. It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference. 1.2 - Knowledge Check 2

Which one is correct? A. A multiple linear regression model with p predicting variables but no intercept has p model parameters. B. The interpretation of the regression coefficients is the same whether or not interaction terms are included in the model. C. Multiple linear regression is a general model encompassing both ANOVA and simple linear regression. D. None of the above.

C. Multiple linear regression is a general model encompassing both ANOVA and simple linear regression. 3.1 - Knowledge Check 1

The variability in the prediction comes from: A. The variability due to a new measurement. B. The variability due to estimation C. The variability due to a new measurement and due to estimation. D. None of the above.

C. The variability due to a new measurement and due to estimation. 1.2 - Knowledge Check 3

The alternative hypothesis of ANOVA can be stated as, A. the means of all pairs of groups are different B. the means of all groups are equal C. the means of at least one pair of groups is different D. None of the above

C. the means of at least one pair of groups is different See 2.4 Test for Equal Means "Using the hypothesis testing procedure for equal means, we test: The null hypothesis, which that the means are all equal (mu 1 = mu 2...=mu k) versus the alternative hypothesis, that some means are different. Not all means have to be different for the alternative hypothesis to be true -- at least one pair of the means needs to be different."

marginal relationship

Capturing the association of a predicting variable to the response variable marginally, i.e. without consideration of other factors.

Marginal Relationship

Capturing the association of a predicting variable to the response variable without consideration of other factors

Under the null hypothesis of good fit, the test statistic's (sum of squared deviances) distribution and DOF is...?

Chi square with n-p-1 DF
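A hedged sketch of the deviance goodness-of-fit test in R (simulated grouped binary data, purely for illustration):
set.seed(7)
x <- runif(30)
m <- rep(50, 30)
y <- rbinom(30, m, plogis(-1 + 2 * x))
fit <- glm(cbind(y, m - y) ~ x, family = "binomial")
1 - pchisq(deviance(fit), df.residual(fit))   # df.residual = n - p - 1 = 28; a large p-value indicates no evidence of lack of fit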

Assuming that the data are normally distributed, under the simple linear model, the estimated variance has the following sampling distribution:

Chi-square with n-2 degrees of freedom

You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. The approximated distribution of the residual deviance is ____ with ____ degrees of freedom. Normal, 315 Chi-squared, 312 Chi-squared, 315 t, 312

Chi-squared, 312 The approximated distribution of the residual deviance is Chi-square with n-p-1 degrees of freedom. In this example n = 316 and p = 3 ; Hence df= 312.

In evaluating a multiple linear model: A) The F test is used to evaluate the overall regression. B) The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. C) Residual analysis is used for goodness of fit assessment. D) All of the above.

D

In the presence of near multicollinearity: A) The coefficient of determination decreases. B) The regression coefficients will tend to be identified as statistically significant even if they are not. C) The prediction will not be impacted. D) None of the above.

D

Logistic regression is different from standard linear regression in that: A) It does not have an error term B) The response variable is not normally distributed. C) It models probability of a response and not the expectation of the response. D) All of the above.

D

Logistic regression is different from standard linear regression in that: A) The sampling distribution of the regression coefficient is approximate. B) A large sample is required for making accurate statistical inferences. C) A normal sampling distribution is used instead of a t-distribution for statistical inference. D) All of the above.

D

Poisson regression can be used: A) To model count data. B) To model rate response data. C) To model response data with a Poisson distribution. D) All of the above.

D

Residual analysis in Poisson regression can be used: A) To evaluate goodness of fit of the model. B) To evaluate whether the relationship between the log of the expected response and the predicting variables is linear. C) To evaluate whether the data are uncorrelated. D) All of the above.

D

The estimators for the regression coefficients are: A) Biased but with small variance B) Unbiased under normality assumptions but biased otherwise C) Biased regardless of the distribution of the data. D) Unbiased regardless of the distribution of the data.

D

The objective of multiple linear regression is: A) To predict future new responses B) To model the association of explanatory variables to a response variable accounting for controlling factors. C) To test hypotheses using statistical inference on the model. D) All of the above.

D

The sampling distribution of the estimated regression coefficients is: A) Centered at the true regression parameters. B) The t-distribution assuming that the variance of the error term is unknown and replaced by its estimate. C) Dependent on the design matrix. D) All of the above.

D

We can test for a subset of regression coefficients: A) Using the F-statistic test of the overall regression. B) Only if we are interested in whether additional explanatory variables should be considered in addition to the controlling variables. C) To evaluate whether all regression coefficients corresponding to the predicting variables excluded from the reduced model are statistically significant. D) None of the above.

D

When do we use transformations? A) If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption. B) If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation. C) If the constant variance assumption does not hold, we transform the response variable. D) All of the above.

D

When we do not have a good fit in generalized linear models, it may be that: A) We need to transform some of the predicting variables or to include other variables. B) The variability of the expected rate is higher than estimated. C) There may be leverage points that need to be explored further. D) All of the above.

D

Which are all the model parameters in ANOVA? A) The means of the k populations. B) The sample means of the k populations. C) The sample means of the k samples. D) None of the above.

D

Which is correct? A) Prediction translates into classification of a future binary response in logistic regression. B) In order to perform classification in logistic regression, we need to first define a classifier for the classification error rate. C) One common approach to estimate the classification error is cross-validation. D) All of the above.

D

Which one correctly characterizes the sampling distribution of the estimated variance? A) The estimated variance of the error term has a chi-squared distribution regardless of the distribution assumption of the error terms. B) The number of degrees of freedom for the chi-squared distribution of the estimated variance is n - p - 1 for a model without an intercept. C) The sampling distribution of the mean squared error is different of that of the estimated variance. D) None of the above.

D

Which one is correct? A) The prediction intervals need to be corrected for simultaneous inference when multiple predictions are made jointly. B) The prediction intervals are centered at the predicted value. C) The sampling distribution of the prediction of a new response is a t-distribution. D) All of the above.

D

Which one is correct? A) The estimated regression coefficients and their standard deviations are approximate not exact in Poisson regression. B) We use the glm() R command to fit a Poisson linear regression. C) The interpretation of the estimated regression coefficients is in terms of the ratio of the response rates. D) All of the above.

D

Which one is correct? A) The residuals have constant variance for the multiple linear regression model. B) The residuals vs. fitted can be used to assess the assumption of independence. C) The residuals have a t-distribution if the error term is assumed to have a normal distribution. D) None of the above.

D

Which one is correct? A) The standard normal regression, the logistic regression and the Poisson regression are all falling under the generalized linear model framework. B) If we were to apply a standard normal regression to response data with a Poisson distribution, the constant variance assumption would not hold. C) The link function for the Poisson regression is the log function. D) All of the above.

D

What is the difference between LASSO and Elastic Net?

Elastic Net has an additional penalty just like the one used in ridge regression.
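A minimal sketch with the glmnet package, where alpha = 1 gives LASSO and intermediate alpha values give Elastic Net (the data are simulated for illustration only):
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)
fit_lasso <- cv.glmnet(X, y, alpha = 1)      # pure L1 penalty
fit_enet  <- cv.glmnet(X, y, alpha = 0.5)    # mix of L1 and L2 penalties
coef(fit_lasso, s = "lambda.min")
coef(fit_enet,  s = "lambda.min")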

In evaluating a simple linear model A. There is a direct relationship between the coefficient of determination and the correlation between the predicting and response variables. B. The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. C. Residual analysis is used for goodness of fit assessment. D. All of the above.

D. All of the Above 1.3 - Knowledge Check 4

In evaluating a multiple linear model, A. The F test is used to evaluate the overall regression. B. The coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model. C. Residual analysis is used for goodness of fit assessment. D. All of the above.

D. All of the Above 3.3 - Knowledge Check 4

When do we use transformations? A. If the linearity assumption with respect to one or more predictors does not hold, then we use transformations of the corresponding predictors to improve on this assumption. B. If the normality assumption does not hold, we transform the response variable, commonly using the Box-Cox transformation. C. If the constant variance assumption does not hold, we transform the response variable. D. All of the above.

D. All of the Above 3.3 - Knowledge Check 4

The objective of multiple linear regression is A. To predict future new responses B. To model the association of explanatory variables to a response variable accounting for controlling factors. C. To test hypotheses using statistical inference on the model. D. All of the above.

D. All of the above. 3.1 - Knowledge Check 1

The sampling distribution of the estimated regression coefficients is A. Centered at the true regression parameters. B. The t-distribution assuming that the variance of the error term is unknown and replaced by its estimate. C. Dependent on the design matrix. D. All of the above.

D. All of the above. 3.2 - Knowledge Check 2

In the presence of near multicollinearity, A. The coefficient of determination decreases. B. The regression coefficients will tend to be identified as statistically significant even if they are not. C. The prediction will not be impacted. D. None of the above.

D. None of the Above 3.3 - Knowledge Check 4

Which one is correct? A. The residuals have constant variance for the multiple linear regression model. B. The residuals vs fitted can be used to assess the assumption of independence. C. The residuals have a t-distribution distribution if the error term is assumed to have a normal distribution. D. None of the above.

D. None of the Above 3.3 - Knowledge Check 4

Which one is correct? A. If a departure from normality is detected, we transform the predicting variable to improve upon the normality assumption. B. If a departure from the independence assumption is detected, we transform the response variable to improve upon this assumption. C. The Box-Cox transformation is commonly used to improve upon the linearity assumption. D. None of the above

D. None of the above. 1.3 - Knowledge Check 4

Which are all the model parameters in ANOVA? A. The means of the k populations. B. The sample means of the k populations. C. The sample means of the k samples. D. None of the above.

D. None of the above. 2.1 - Knowledge Check 1

Which one correctly characterizes the sampling distribution of the estimated variance? A. The estimated variance of the error term has a 𝜒2distribution regardless of the distribution assumption of the error terms. B. The number of degrees of freedom for the 𝜒2 distribution of the estimated variance is n-p-1 for a model without intercept. C. The sampling distribution of the mean squared error is different of that of the estimated variance. D. None of the above.

D. None of the above. 3.1 - Knowledge Check 1

We can test for a subset of regression coefficients A. Using the F statistic test of the overall regression. B. Only if we are interested whether additional explanatory variables should be considered in addition to the controlling variables. C. To evaluate whether all regression coefficients corresponding to the predicting variables excluded from the reduced model are statistically significant. D. None of the above.

D. None of the above. 3.2 - Knowledge Check 2

You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. How does an increase in 1 unit in langarts affect the expected number of days missed, given that the other predictors in the model are held constant? Increase by 0.012152 days Increase by 0.9879 days Increase by 1.22% Decrease by 1.21%

Decrease by 1.21% The estimated coefficient for langarts is -0.012152. A one-unit increase in langarts gives e^(-0.012152) = 0.9879215, i.e., the expected number of days missed decreases by 1.21% (1 - 0.9879215), holding all other predictors constant.
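The percentage interpretation in R (coefficient copied from the summary above):
exp(-0.012152)               # 0.98792: multiplicative change in expected days per 1-unit increase in langarts
100 * (1 - exp(-0.012152))   # about 1.21 percent decrease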

Deviance Residuals - Poisson Regression

Deviance residuals are the signed square roots of each observation's contribution to the deviance, i.e., of the likelihood comparison between the saturated model (in which the estimated expected response equals the observed response) and the fitted model. If the model fits well, their approximate distribution is the standard normal distribution.

Causality is the same as association in interpreting the relationship between the response and the predicting variables.

False

For a multiple regression model, both the true errors ε and the estimated residuals ε̂ have a constant mean and a constant variance.

False

For estimating confidence intervals for the regression coefficients, the sampling distribution used is a normal distribution.

False

For the model y = β0 + β1x1 + ... + βpxp + ε, where ε ~ N(0, σ²), there are p+1 parameters to be estimated

False

Given a categorial predictor with 4 categories in a linear regression model with intercept, 4 dummy variables need to be included in the model.

False

If a departure from normality is detected, we transform the predicting variable to improve upon the normality assumption.

False

If a departure from the independence assumption is detected, we transform the response variable to improve upon this assumption.

False

The sampling distribution for the variance estimator in simple linear regression is χ 2 (chi-squared) regardless of the assumptions of the data. (T/F)

False See 1.2 Estimation Method "The sampling distribution of the estimator of the variance is chi-squared, with n - 2 degrees of freedom (more on this in a moment). This is under the assumption of normality of the error terms."

We assess the constant variance assumption by plotting the error terms, ϵ i, against fitted values. (T/F)

False See 1.2 Estimation Method "We use ϵ ^ i as proxies for the deviances or the error terms. We don't have the deviances because we don't have β 0 and β 1.

The simple linear regression coefficient, β ^ 0, is used to measure the linear relationship between the predicting and response variables. (T/F)

False See 1.2 Estimation Method β ^ 0 is the intercept and does not tell us about the relationship between the predicting and response variables.

β ^ 1 is an unbiased estimator for β 0.

False See 1.4 Statistical Inference "What that means is that β ^ 1 is an unbiased estimator for β 1." It is not an unbiased estimator for β 0.

β ^ 1 is an unbiased estimator for β 0. (T/F)

False See 1.4 Statistical Inference "What that means is that β ^ 1 is an unbiased estimator for β 1." It is not an unbiased estimator for β 0.

The p-value is a measure of the probability of rejecting the null hypothesis. (T/F)

False See 1.5 Statistical Inference Data Example "p-value is a measure of how rejectable the null hypothesis is... It's not the probability of rejecting the null hypothesis, nor is it the probability that the null hypothesis is true."

For a multiple linear regression model to be a good fit, we need the linearity assumption to hold for only one of the predicting variables. (T/F)

False See Lesson 3.11: Assumptions and diagnostics In multiple linear regression, we need the linearity assumption to hold for all of the predicting variables, for the model to be a good fit. "For example, if the linearity does not hold with one or more predicting variables, then we could transform the predicting variables to improve the linearity assumption."

Given a quantitative predicting variable and a qualitative predicting variable with 7 categories in a linear regression model with intercept, 7 dummy variables need to be included in the model.

False See Lesson 3.2: Basic Concepts We only need 6 dummy variables. "When we have qualitative variables with k levels, we only include k-1 dummy variables if the regression model has an intercept."

In MLR, the adjusted R2 can be used to compare models, and its value will be always greater than or equal to that of R2

False - Adjusted R2 will always be less than or equal to R2

The threshold to calculate the classification error rate of a logistic regression model should always be set at 0.5

False - Although 0.5 is a common value, the threshold is problem-dependent and its value should be tuned.

In a MLR model, an observation should always be discarded when its Cook's distance is greater than 4/n (n=sample size)

False - An observation should not be discarded just because it is found to be an outlier. We must investigate the nature of the outlier before deciding to discard it.

If the constant variance assumption does not hold in MLR, we apply Box-Cox transformation to the predicting variables

False - Apply Box-Cox to the response (y) not predicting variables.

Hypothesis testing for Poisson regression can be done on small sample sizes

False - The normal approximation requires a large sample size, and so does hypothesis testing.

Bayesian information criterion (BIC) penalizes for complexity of the model more than both leave one out CV and Mallow's Cp statistic

True - BIC penalizes complexity more heavily than the other approaches.

LASSO regression will always select the same number or more predicting variables than Ridge and Elastic Net regression

False - Because LASSO can eliminate predicting variables through its penalty while Ridge and Elastic Net retain coefficients, LASSO will select the same number of or FEWER predicting variables

The number of parameters that need to be estimated in a Logistic Regression model with 6 predicting variables and an intercept is the same as the number of parameters that need to be estimated in a standard Linear Regression model with an intercept and same predicting variable

False - Logistic regression has no error term, so there is no error variance to estimate; the standard linear regression model therefore has one more parameter.

Multicollinearity in MLR means that the rows in the design matrix are linearly dependent.

False - Columns

Multicollinearity in MLR means that the columns in the design matrix are linearly independent.

False - Columns are actually linearly DEPENDENT

Elastic net often underperforms LASSO regression in terms of prediction accuracy because it considers both L1 and L2 penalties together

False - Elastic Net often outperforms LASSO in terms of accuracy. The difference between LASSO and Elastic Net is the addition of penalty just like the one used in Ridge Regression. By considering both, L1 and L2, we have the advantages of both LASSO and Ridge Regression.

Since there are no error terms in a Poisson model, we cannot perform residual analysis for evaluating the model's GOF.

False - For Poisson Regression, we can perform residual analysis although there is not an error term.

In MLR, the coefficient of determination is used to evaluate GOF

False - Goodness of fit in MLR is evaluated through the model structure and assumptions (e.g., residual analysis), not through the coefficient of determination.

Multiplying a variable by 10 in LASSO regression, decreases the chance that the coefficient of this variable is nonzero.

False - Rescaling a variable by 10 rescales its coefficient by 1/10, so the coefficient incurs a smaller L1 penalty; if anything this makes it more likely (not less) that the coefficient stays nonzero, and if the predictors are standardized the rescaling has no effect at all.

In Poisson Regression, the expectation of the response variable given the predictors is equal to the linear combination of the predicting variables.

False - In Poisson Regression, the expectation of the response variable given the predictors is equal to the exponential of the linear combination of the predicting variables.

In Poisson regression, we assume a nonlinear relationship between the log rate and the predicting variables.

False - In Poisson Regression, we assume a linear relationship between the log rate and the predicting variables.

The mean square prediction error(MSPE) is a robust prediction accuracy measurement for a OLS model regardless of the characteristics of the dataset.

False - MSPE is appropriate for evaluating prediction accuracy for a linear model estimated using least squares, but it depends on the scale of the response data and thus is sensitive to outliers.

If the outcome variable is quantitative and all explanatory variables take values 0 or 1, a logistic regression model is most appropriate.

False - Logistic regression requires a binary response; with a quantitative outcome, more consideration is needed to determine the correct model (for example, a linear regression or ANOVA model, given the binary predictors).

From the binomial approximation with a normal distribution using the central limit theorem, the Pearson residuals have an approximately standard chi-squared distribution.

False - Normal distribution

The assumption of constant variance will always hold for standard Linear Regression models with Poisson distributed response data.

False - One common problem with fitting a linear regression model to Poisson data is the departure from the assumption of Constant variance.

If a Poisson regression model does not have a good fit, the relationship between the log of the expected rate and the predicting variables might not be linear

False - The implication goes only one way: a nonlinear relationship between the log rate and the predictors can cause a poor Poisson fit, not the other way around.

The presence of certain types of outliers can impact the statistical significance of some of the regression coefficients of a MLR model

True - Outliers that are influential can impact the statistical significance of the beta parameters.

Parameter tuning is not recommended as part of the sub-sampling approach for addressing the p-value problem with large samples.

False - Parameter tuning is recommended as part of the sub-sampling approach

The F-test can be used to test for the overall regression in Poisson regression.

False - Poisson regression uses a chi-squared test to test for overall regression.

Another criteria for variable selection is cross validation which is a direct measure of explanatory power.

False - Predictive power

Forward stepwise variable selection starts with the simpler model and select the predicting variable that increases the R2 the most, unless the R2 cannot be increased any further by adding variables

False - R2 is not compared during stepwise variable selection. Variables are selected if they reduce the AIC or BIC of the model.

In Logistic Regression, R2 could be used as a measure of explained variation in the response variable.

False - R2 measures explained variability (not variation). In logistic regression the response variable is binary, so R2 does not measure explained variation in the response.

R2 decreases as more predictors are added to a MLR model, given that the predictors added are unrelated to the response variable

False - R2 never decreases as more predictors are added to MLR.

Random sampling is computationally less expensive than the K-fold cross validation

False - Random sampling is computationally more expensive than the K-fold CV, with no clear advantage in terms of accuracy of the estimation classification error. K fold CV is preferred at least from a computation standpoint.

It is not required to standardize or rescale the predicting variables when performing regularized regression

False - Regularized regression requires standardization or scaling of the predicting variables.

The p-value of the test computed as the left tail of the chi-squared distribution

False - Right tail

When assessing GOF for a Logistic Regression model on a binary data with replications, the assumption is that the response variables(Yi) come from a normal distribution.

False - The assumption is that the response variable comes from a binomial distribution

The null hypothesis for the GOF of a Logistic Regression model is that the model does not fit the data.

False - The null hypothesis is the model fits the data

In regularized regression, the penalization is generally applied to all regression coefficient (Bo, ...., Bp) where p = number of predictors

False - The shrinkage penalty is applied to B1, ..., Bp but not to the intercept B0

We can diagnose the constant variance assumption in Poisson regression using the normal probability plot.

False - There is not a constant variance assumption in Poisson Regression

If a Poisson regression model is found to be overdispersed, there is an indication that the variability of the response variable implied by the model is larger than the variability present in the observed response variable

False - This indicates the variability of the response variable implied by the model is smaller than the variability present in the observed response variable.

IN MLR, a VIF value of 6 for a predictor means that 92% of the variance in that predictor can be modeled by other predictors

False - Use VIF = 1/(1 - Rj^2) to solve for Rj^2: with VIF = 6, Rj^2 = 1 - 1/6 ≈ 0.833, i.e., about 83.3%, not 92%.
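The arithmetic, as a quick sketch:
vif <- 6
r2_j <- 1 - 1 / vif   # share of the variation in predictor j explained by the other predictors
r2_j                  # about 0.833, i.e. 83.3%, not 92%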

Variable selection is a simple and solved statistical problem since we can implement it using the R software

False - Variable selection for a large number of predicting variables is an "unsolved" problem, and variable selection approaches should be tailored to the problem at hand.

If the constant variance assumption does not hold in MLR, we apply Box-Cox transformation to the predicting variables.

False - We apply a Box-Cox transformation to the response variable.

In stepwise regression, we accept/remove variables that produce larger AICs or BICs

False - We desire the model with the smallest AIC or BIC

When testing a subset of coefficients, deviance follows a Chi-square distribution with q Degree of Freedom, where q is the number of regression coefficients in the reduced model

False - q is number of regression coefficients discarded from the full model to get the reduced model

The estimated coefficients of a regression line is positive, when the coefficient of determination is positive.

False - R-squared is always nonnegative, so it says nothing about the sign of the estimated coefficients.

In multiple linear regression, as the value of R-squared increases, the relationship between predictors becomes stronger

False - R-squared measures how much variability in the response is explained by the model, NOT how strong the relationships among the predictors are.

Ridge regression is a regularized regression approach that can be used for variable selection

False - Ridge regression is a regularized regression approach, but it does not perform variable selection (coefficients are shrunk toward zero but not set exactly to zero).

In Logistic Regression, R2 can be used as a measure of explained variation in the response variable

False - The response in logistic regression is binary; R2 measures the proportion of variability in a continuous response explained by the model, so it does not apply here.

Ridge regression cannot be used to deal with problems caused by high correlation among the predictors

False - Ridge regression is in fact used to deal with problems caused by high correlation among the predictors.

When dealing with a multiple linear regression model, an adjusted R-squared can be greater than the corresponding unadjusted R-Squared value.

False - The adjusted R-squared value takes the number of predictors into account; it is never greater than the unadjusted R-squared value.

The larger the number of variables in the model, the larger the training risk.

False - the larger the number of variables in a model the lower the training risk.

There is never a situation where a complex model is best.

False - there are situations where a complex model is best

In Poisson regression we model the error term

False - there is no error term

In logistic regression we have an additional error term to estimate.

False - There is no error term in logistic regression.

The estimators for the regression coefficients in the Poisson regression are biased.

False - they are unbiased

Generally, models with more covariates have high bias but low variance

False - they have low bias but high variance.

The variance estimator in logistic regression has a closed form expression.

False - There is no closed-form expression; we use statistical software to obtain the variance-covariance matrix.

A t-test is used for testing the statistical significance of a coefficient given all predicting variables in a Poisson regression model.

False - use z-test for Poisson

For a linear regression under normality, the variance used in the Mallow's Cp penalty is the true variance, not an estimate of variance

False - variance used in Mallow's Cp penalty is the estimated variance from the full model

We can use residual analysis to conclusively determine the assumption of independence

False - we can only determine uncorrelated errors.

In logistic regression for goodness of fit, we can only use the Pearson residuals.

False - we can use Pearson or Deviance.

We cannot estimate the Regression coefficients of a MLR if predicting variables are linearly independent

False - Estimation fails when the predicting variables are linearly DEPENDENT; if they are linearly independent, the coefficients can be estimated (note the wording of the question).

For Poisson regression we estimate the expectation of the log response variable.

False - we estimate the log of the expectation of the response variable.

In logistic regression we interpret the Betas in terms of the response variable.

False - we interpret it in terms of the odds of success or the log odds of success

If there is a high correlation between variables, Lasso will select both.

False - If two variables are highly correlated, LASSO will tend to select only one of them.

A linear Regression model is a good fit to the data set if the Adjusted R2 is above 0.85

False - Adjusted R2 is not a measure of goodness of fit; goodness of fit refers to all of the model assumptions holding.

In regression inference, the 99% confidence interval of coefficient \beta_0 is always wider than the 95% confidence interval of \beta_1.

False - You can only compare intervals for the same coefficient (beta0 with beta0, beta1 with beta1); a 99% interval is wider than a 95% interval only for the same parameter.

A correlation coefficient close to 1 is evidence of a cause-and-effect relationship between the two variables.

False - Cause and effect can only be determined by a well-designed experiment.

In simple linear regression models, we lose three degrees of freedom when estimating the variance because of the estimation of the three model parameters β 0 , β 1 , σ^2. (T/F)

False. See 1.2 Estimation Method "The estimator for σ 2 is σ ^ 2, and is the sum of the squared residuals, divided by n - 2." We lose two degrees of freedom because the variance estimator, σ ^ 2, uses only the estimates for β 0, and β 1 in its calculation.

With the Box-Cox transformation, when λ = 0 we do not transform the response. (T/F)

False. See 1.8 Diagnostics When λ = 0, we transform using the natural log.

In ANOVA, the linearity assumption is assessed using a plot of the response against the predicting variable. (T/F)

False. See 2.2 - Estimation Method Linearity is not an assumption of Anova.

What are the assumptions for multiple linear regression?

Linearity/Mean zero assumption, Constant Variance, Independence and Normality (for statistical inference)

In multiple linear regression, a VIF value of 6 for a predictor means that 90% of the variation in that predictor can be modeled by the other predictors. (T/F)

False. See Lesson 3.13: Model Evaluation and Multicollinearity A VIF value of 6 for a predictor means that 83.3% of the variation in that predictor can be modeled by the other predictors in the model.

Multicollinearity in multiple linear regression means that the rows in the design matrix are (nearly) linearly dependent. (T/F)

False. See Lesson 3.13: Model Evaluation and Multicollinearity Multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.

Multicollinearity among the predicting variables will not impact the standard errors of the estimated regression coefficients. (T/F)

False. See Lesson 3.13: Multicollinearity Multicollinearity in the predicting variables can impact the standard errors of the estimated coefficients. "However, the bigger problem is that the standard errors will be artificially large."

In multiple linear regression, the prediction of the response variable and the estimation of the mean response have the same interpretation. (T/F)

False. See Lesson 3.2.9: Regression Line and Predicting a New Response. In multiple linear regression, the prediction of the response variable and the estimation of the mean response do not have the same interpretation.

A multiple linear regression model contains 6 quantitative predicting variables and an intercept. The number of parameters to estimate in this model is 7. (T/F)

False. See Lesson 3.2: Basic Concepts The number of parameters to estimate in a multiple linear regression model containing 6 quantitative predicting variables and an intercept is 8: 7 regression coefficients (β0,β1,...,β6) and the variance of the error terms (σ2).

Given a a quantitative predicting variable and a qualitative predicting variable with 7 categories in a linear regression model with intercept, 7 dummy variables need to be included in the model. (T/F)

False. See Lesson 3.2: Basic Concepts We only need 6 dummy variables. "When we have qualitative variables with k levels, we only include k-1 dummy variables if the regression model has an intercept."

The estimated variance of the error terms of a multiple linear regression model with intercept can be obtained by summing up the squared residuals and dividing the sum by n - p , where n is the sample size and p is the number of predictors. (T/F)

False. See Lesson 3.3: Regression Parameter Estimation The estimated variance of the error terms of a multiple linear regression model with intercept should be obtained by summing up the squared residuals and dividing that by n-p-1, where n is the sample size and p is the number of predictors as we lose p+1 degrees of freedom when we estimate the p coefficients and 1 intercept.

The causation of a predicting variable to the response variable can be captured using multiple linear regression on observational data, conditional of other predicting variables in the model. (T/F)

False. See Lesson 3.4 Model Interpretation "This is particularly prevalent in a context of making causal statements when the setup of the regression does not allow so. Causality statements can only be made in a controlled environment such as randomized trials or experiments. "

Conducting t-tests on each β parameter in a multiple linear regression model is preferable to an F-test when testing the overall significance of the model. (T/F)

False. See Lesson 3.7: Testing for Subsets of Coefficients "We cannot and should not select the combination of predicting variables that most explains the variability in the response based on the t-tests for statistical significance because the statistical significance depends on what other variables are in the model."

Under testing a subset of coefficients, what is the distribution and degrees of freedom for the deviance?

For large sample size data, the distribution of this test statistic, assuming the null hypothesis is true, is a chi-squared distribution with Q degrees of freedom, where Q is the number of regression coefficients discarded from the full model to get the reduced model (i.e., the number of Z predicting variables).

Akaike Information Criterion

For linear regression under normality, this is the training risk plus a complexity penalty. The complexity penalty is (2 × number of predictors in the submodel × true variance of the full model)/n. For AIC, we need to specify k = 2. Select the model with the smallest AIC.
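A minimal R sketch of AIC-based comparison and stepwise search, assuming the built-in mtcars data with mpg as the response (illustrative only; not the course's example):

    full <- lm(mpg ~ ., data = mtcars)
    null <- lm(mpg ~ 1, data = mtcars)
    AIC(full)                               # AIC of the full model
    step(null, scope = formula(full),       # stepwise search scored by AIC (k = 2)
         direction = "forward", k = 2)      # k = log(nrow(mtcars)) would give BIC instead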

Using the Tukey method to find the confidence interval of the means, what does having a '0' in the CI mean?

For the confidence intervals that include zero, it's plausible that the difference between means is zero.
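A minimal R sketch of the Tukey pairwise comparisons, assuming the built-in InsectSprays data (illustrative only):

    fit <- aov(count ~ spray, data = InsectSprays)
    TukeyHSD(fit, conf.level = 0.95)   # pairwise mean-difference CIs; intervals containing 0 => plausibly equal means
    plot(TukeyHSD(fit))                # visual display of the comparison intervals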

Forward Stepwise Regression

Forward stepwise regression starts with no predictors in the model and usually tends to select smaller models; as a result, it is often preferable to backward stepwise regression. The model it selects is not necessarily the same as the one selected using backward stepwise regression.

Which is preferred, forward or backward?

Forward, because it tends to select smaller models; backward starts from the full model.

distribution of pearson residuals?

From the binomial approximation with a normal distribution using the central limit theorem

distribution of deviance residuals?

From the properties of the likelihood function, standard normal distribution if the model assumptions hold. That is, if the model is a good fit.

Multicollinearity

Generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant info, skewing the results in a regression model.

Combinations of Variables

Given p predicting variables, we can have 2^p combinations of the variables, and thus, 2^p models to choose from

When we are testing for overall regression for a Logistic model, what is the H0 and HA?

H0: all regression coefficients except intercept are 0 HA: at least one is not 0.

What are the null and alternative hypotheses of the F test?

H0: all the regression coefficients except the intercept are 0. HA: at least one is not 0.

Another GOF test is hypotheses testing, what is the H0 and HA?

H0: the model fits well. HA: the model does not fit well.

What is the null and alternative hypothesis for MLR?

H0: all the regression coefficients (except the intercept) are equal to 0. HA: at least one coefficient is not equal to 0.

H0 and HA for test of equal means?

H0: the means are equal. HA: some means are different.

What are the null and alternative hypotheses for ANOVA for MLR?

H0: the regression coefficients all equal zero. HA: At least one of the regression coefficients is not equal to zero.

The estimated versus predicted regression line for a given x*

Have the same expectation

Forward-Backward Regression

Here we add and discard one variable at a time iteratively.

Generalized Linear Models (GLM)

Here, the response Y is assumed to have a distribution from the exponential family of distributions (normal, binomial, Poisson, gamma, etc.). Under this model, we model a transformation g of the expectation of Y, given the predicting variables, as a linear combination of the predicting variables. Equivalently, the expectation of Y is the inverse of the g transformation applied to the linear combination of the predicting variables. **Include table w/ link function & regression function pg 67
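A minimal R sketch of glm() fits for several exponential-family responses, using simulated data (the data and variable names are assumptions for illustration, not the course's example):

    set.seed(1)
    x      <- rnorm(100)
    y_norm <- 1 + 2 * x + rnorm(100)               # normal response, identity link
    y_bin  <- rbinom(100, 1, plogis(-0.5 + x))     # binary response, logit link
    y_pois <- rpois(100, exp(0.5 + 0.3 * x))       # count response, log link
    glm(y_norm ~ x, family = gaussian(link = "identity"))
    glm(y_bin  ~ x, family = binomial(link = "logit"))
    glm(y_pois ~ x, family = poisson(link = "log"))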

Hypothesis testing (statistical significance: +/-)

Here, the z-value is the same, but the p-value will change. Positive: p-value = P(Z > z-value). Negative: p-value = P(Z < z-value). **Applies for Logistic & Poisson Regression

Robust Regression

Here, we can estimate the regression coefficients in the presence of outliers. When there are outliers, the error distribution has heavy tails, and we thus have departures from the normality assumption.

Maximum Likelihood Estimate

Here, we maximize the likelihood function with respect to the model parameters, in this case the regression coefficients. The log-likelihood that needs to be maximized is highly nonlinear, so we cannot derive a closed-form (exact) expression for the estimates. The estimated parameters and their standard errors are approximate estimates.

high dimensionality

In linear regression, when the number of predicting variables P is large, we might get better predictions by omitting some of the predicting variables.

Nonparametric Regression

In non-parametric regression, the regression function has an unknown structure given the predicting variables, and the regression function does not depend on any parameters

Poisson Regression Assumptions

Linearity: The log transformation of the rate is a linear combination of the predicting variables. Independence: The response variables are independently observed. Link function: The link function g is the log function; the log link is almost always used.

Logistic Regression Assumptions

Linearity: The relationship between g of the probability of success and the predicting variables is a linear function. Independence: The response binary variables are independently observed. Logit: The logistic regression model assumes that the link function g is a logit function.

An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed in cells. A multiple linear regression model was fitted to estimate the effect of the number of cells, amount of the radiation dose (Grays), and the rate of the radiation dose (Grays/hour) on the number of chromosomal abnormalities observed. The data frame has 27 observations. Here is the model summary and Cook's Distance plot. Coefficient Estimate SE t-value Pr(>|t|) (Intercept) -74.15392 42.24544 -1.755 0.092518 cells 0.06871 0.02196 3.129 0.004709** doseamt 41.33160 9.13907 4.523 0.000153*** doserate 20.28402 8.29071 2.447 0.022482* ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 54.05 on X degrees of freedom Multiple R-squared: 0.5213, Adjusted R-squared: 0.4588 F-statistic: 8.348 on X and Y DF, p-value: 0.0006183 How does an increase in 1 unit in doserate affect the expected number of chromosome abnormalities, given that the other predictors in the model are held constant? Increase of 8.291 Decrease of 41.331 Increase of 20.284 Decrease of 9.134

Increase of 20.284 The estimated coefficient for doserate is 20.284. If we fix all other predictors, for each 1 unit increase in doserate, the expected number of chromosome abnormalities increases 20.284 units.

The assumption of normality:

It is needed for the sampling distribution of the estimators of the regression coefficients and hence for inference.

How do we interpret B1?

It is the estimated expected change in the response variable associated with one unit of change in the predicting variable.

What is the purpose of testing a subset of coefficients?

It simply compares two models and decides whether the larger model is statistically significantly better than the reduced model.

Weighted Least Squares (WLS)

It's a multiple regression model, but the difference is that we do not assume the variance of the errors is constant. The vector of errors can be assumed to have a variance-covariance matrix Σ, thus allowing for correlated errors. We have the independence assumption only when the Σ matrix is a diagonal matrix.

Penalties

L0: This is the number of nonzero regression coefficients. Maximizing Q(betas) means searching through all submodels. L1: This penalty applied to the vector of regression coefficients is equal to the sum of the absolute values of the regression coefficients to be penalized. Maximizing Q forces many betas to be zeros (lasso). L2: This penalty applied to the vector of regression coefficients is equal to the sum of the squared regression coefficients to be penalized. Maximizing Q accounts for multicollinearity (ridge).

LOOCV

LOOCV can be approximated by the sum of the training risk and a complexity penalty. The complexity penalty is (2 × number of predictors in the submodel × estimated variance of the submodel)/n. The variability of the submodel is smaller than that of the full model, thus LOOCV penalizes complexity less than Mallow's Cp. LOOCV ≈ AIC when the true variance is replaced by the estimate of the variance from the submodel.

Does the statistical inference for logistic regression rely on a small or large sample size?

Large; if the sample is small, then the statistical inference is not reliable.

Lasso Regression

Lasso performs estimation and variable selection simultaneously. The estimated regression coefficients from lasso are less efficient than those provided by ordinary least squares. The penalty is the sum of absolute values of the regression coefficients except for the intercept, which forces coefficients to be zero. We can apply lasso to standard linear regression, logistic regression, Poisson regression, and other linear models. We do not have a closed-form expression for the estimated regression coefficients under this model. Once a coefficient is non-zero, it does not go back to zero. In the case where p, the number of predictors, is larger than n, the number of observations (that is, more variables than observations), the lasso selects at most n variables because of the nature of the convex optimization problem. If there is high correlation among predictors, the prediction performance of the lasso is dominated by ridge regression. If there is a group of variables among which the correlation is very high, then the lasso tends to select only one variable from that group. Alpha = 1
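A minimal glmnet sketch on simulated data (assumes the glmnet package is installed; the data are illustrative, and alpha = 1 selects the lasso penalty):

    library(glmnet)
    set.seed(1)
    X <- matrix(rnorm(100 * 10), nrow = 100)   # 10 candidate predictors
    y <- X[, 1] - 2 * X[, 2] + rnorm(100)
    cvfit <- cv.glmnet(X, y, alpha = 1)        # lambda chosen by cross-validation
    coef(cvfit, s = "lambda.min")              # many coefficients are exactly 0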

3 assumptions of the logistic regression model

Linearity, Independence, Logit link function

what are the 4 assumptions of linear regression?

Linearity/Mean Zero, Constant Variance, Independence, Normality

Statistical Inference

Logistic Regression: Normal distribution. The statistical inference based on the normal distribution applies only under large sample data; if the sample size n is small, the statistical inference is not reliable (i.e., warn about the lack of reliability of the results). Standard Regression: t-distribution. The statistical inference relies on a distribution that applies under both small and large samples. **Applies for Logistic & Poisson Regression

we estimate the Poisson model parameters using...

MLE

Introducing some bias yields a decrease in....

MSE

Mean Squared Error

MSE is commonly used to obtain estimators that may be biased, but less uncertain than unbiased ones. The MSE can be controlled. It is possible to find a model with lower MSE than the unbiased/full model. Introducing some bias yields a decrease in MSE followed by a later increase.

4 types of Variable Selection Criteria

Mallow's Cp, AIC, BIC, LOOCV

Assumption of Independence

Means that the deviations (error terms), and in fact the response variables y, are independently drawn from the data-generating process. Violations most often occur in time series data and can result in very misleading assessments of the strength of the regression.

Linearity/Mean zero assumption

Means that the expected value of the errors (deviations) is zero. If this assumption does not hold, it leads to difficulties in estimating β0 and means that our model does not include a necessary systematic component.

L2 penalty - Ridge

Minimizing the penalized least squares using this penalty accounts for multicollinearity, but does not perform variable selection. The resulting regularized regression is the so-called ridge regression.
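A minimal ridge sketch with glmnet (alpha = 0), on simulated data (glmnet package and data are assumptions for illustration):

    library(glmnet)
    set.seed(1)
    X <- matrix(rnorm(100 * 10), nrow = 100)
    y <- X[, 1] - 2 * X[, 2] + rnorm(100)
    ridge <- cv.glmnet(X, y, alpha = 0)        # alpha = 0 => L2 (ridge) penalty
    coef(ridge, s = "lambda.min")              # coefficients shrunk toward 0, none exactly 0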

The estimators of the linear regression model are derived by:

Minimizing the sum of squared differences between observed and expected values of the response variable.

What do we mean by model parameters in statistics?

Model parameters are unknown quantities, and they stay unknown regardless how much data are observed. We estimate those parameters given the model assumptions and the data, but through estimation, we're not identifying the true parameters. We're just estimating approximations of those parameters.

Poisson Regression

Models response data in the form of counts: we model the expectation of the response variable as the exponential of a linear combination of the predicting variables. Assumptions: - The log transformation of the rate is a linear combination of the predicting variables - The response count data are independently observed - The link function is the log function. The only model parameters are the regression coefficients, estimated using the maximum likelihood approach. Statistical inference on the regression coefficients uses an approximation of the sampling distribution, the normal distribution.
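A minimal R sketch of a Poisson fit on simulated count data (the data and variable names are assumptions for illustration):

    set.seed(1)
    x   <- rnorm(200)
    y   <- rpois(200, lambda = exp(0.5 + 0.8 * x))
    fit <- glm(y ~ x, family = poisson)
    summary(fit)          # Wald z-tests on the regression coefficients
    exp(coef(fit)["x"])   # estimated ratio of rates for a one-unit increase in x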

Variable Selection

Models with many predictors/covariates have low bias but high variance. Models with few predictors/covariates have high bias but low variance

Pooled variance estimator - N vs k?

N = combined size of samples, k = # of samples

If p(predictors) is large, is it feasible to fit a large number of submodels?

No

Is testing a subset of coefficients a GOF test?

No

Do the three stepwise approaches select the same model?

No, especially when p is large.

A data point far from the mean of the x's and y's is always: an influential point and an outlier a leverage point but not an outlier an outlier and a leverage point an outlier but not a leverage point None of the above

None of the above. See 1.9 Outliers and Model Evaluation. We only know that the data point is far from the mean of the x's and y's. It fits the definition of a leverage point, because it is far from the mean of the x's, so we can eliminate the answers that do not include a leverage point. That leaves "a leverage point but not an outlier" and "an outlier and a leverage point", but we do not have enough information to know whether it is or is not an outlier, so neither of those is always true. Therefore none of the answers always holds.

what is the distribution of B1?

Normal

Normal Distribution

Normal distribution relies on a large sample of data. Using this approximate normal distribution we can further derive confidence intervals. Since the distribution is normal, the confidence interval is the z-interval **Applies for Logistic & Poisson Regression

what can we use to check for normality?

QQ plot and histogram

T/F: An overdispersion parameter close to 1 indicates that the variability of the response is close to the variability estimated by the model.

T

What would we do if the T value is large?

Reject the null hypothesis that β1 is equal to zero. If the null hypothesis is rejected, we interpret this that β1 is statistically significant.

Time Series

Response data are correlated. This correlation results in a much smaller number of degrees of freedom than otherwise assumed under independence. Moreover, because of the correlation, the data are concentrated in a smaller part of the probability space. Ignoring dependence leads to inefficient estimates of the regression parameters and to poor predictions. Standard errors are unrealistically small, i.e., too-narrow confidence intervals and thus improper statistical inferences.

What are the two most common approaches to regularized regression?

Ridge and LASSO

For the G-o-F tests, do we reject the null hypotheses when the p-value is SMALL or BIG?

Small and we conclude the model is not a good fit

Cross validation

Split the data into two parts: the first part, called the training data, and the testing/validation data. The training data are used to fit the model and thus get the estimated regression coefficients. The testing or validation data are used to predict or classify the responses for this portion of the data, which are then compared to the observed responses to estimate the classification error. One can repeat the process several times.

Poisson Regression vs Log transformed Linear Regression

Standard Regression: We estimate the expectation of the log of the response, E(log(Y)); the variance under the standard regression is assumed constant. Poisson Regression: We estimate the log of the expectation of the response, log(E(Y)); the variance of the response is assumed to be equal to the expectation, thus the variance is not constant. **Use Poisson regression especially when the response data are small counts. **Using standard linear regression with a log transformation instead of Poisson regression will result in violations of the assumption of constant variance. **Standard linear regression could be used when the counts are large, together with the variance-stabilizing transformation √(Y + 3/8), i.e., the square root of the response plus 3/8; this transformation works well when the response data are large counts.

Overall Regression

Standard Regression: We use the F test to test for the overall regression Logistic Regression: We use the difference between the log likelihood function of the model under the null hypothesis (also called the null-deviance), and the log likelihood of the full model (residual deviance) i.e. the difference between the null deviance and the residual deviance
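A minimal R sketch of the logistic overall-regression (deviance difference) test, on simulated data (the data are an assumption for illustration):

    set.seed(1)
    x1  <- rnorm(300); x2 <- rnorm(300)
    y   <- rbinom(300, 1, plogis(-0.3 + 0.8 * x1))
    fit <- glm(y ~ x1 + x2, family = binomial)
    stat <- fit$null.deviance - fit$deviance       # null deviance minus residual deviance
    pchisq(stat, df = 2, lower.tail = FALSE)       # chi-squared test with p = 2 predictors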

T/F: Although there are no error terms in a logistic regression model using binary data with replications, we can still perform residual analysis.

T

T/F: An approximate test can be used to test for the overall regression in Poisson regression.

T

Assumptions - Mixed Effects

The error term is normally distributed with zero mean and constant variance. The group effect is normally distributed with mean 0 and constant variance, where the variances of the error terms and of the group effect are two different parameters. In a random effects model, the observations are no longer independent, even if the error terms are independent.

What is the partial F-test?

The hypothesis test for whether a subset of regression coefficients are all equal to zero.
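A minimal R sketch of a partial F-test, assuming the built-in mtcars data with mpg as the response (illustrative only):

    full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
    reduced <- lm(mpg ~ wt, data = mtcars)
    anova(reduced, full)   # tests H0: the coefficients of hp and qsec are both 0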

Coefficient Test (Deviance)

The hypothesis testing procedure is testing the null hypothesis that all alpha coefficients are zero, versus the alternative that at least one alpha coefficient is not zero For the testing procedure for subsets of coefficients, we compare the likelihood of a reduced model versus a full model. This test provides inferences on the predictive power of the model. Predictive power means that the predicting variables predict the data even if one or more of the assumptions do not hold

Logit link function assumption

The logistic regression model assumes that the link function is a so-called logit function. This is an assumption since the logit function is not the only function that yields s-shaped curves. And it would seem that there is no reason to prefer the logit to other possible choices.

Log odds function

The logit function which is the log of the ratio between the probability of a success and the probability of a failure

Model parameters

The model parameters are the regression coefficients. There is no additional parameter to model the variance since there's no error term. For P predictors, we have P + 1 regression coefficients for a model with intercept (beta 0). We estimate the model parameters using the maximum likelihood estimation approach

Linear Regression

The predicting variable is assumed to be fixed, whereas the response variable y is random. We model their relationship as a linear function in x plus an error term epsilon. Assumptions: Linearity: the relationship between the response and the predicting variable is linear, or the expectation of the error term is 0. Constant variance: the variance of the error term is the same across all observations. Independence: uncorrelated errors. Normality: normality of the error terms. We assume that the error terms are independent and identically distributed from a normal distribution with mean 0 and variance sigma squared. The parameters are unknown but estimated based on the observed data. Statistical inference on the regression coefficients is performed using the sampling distribution of the estimated regression coefficients: the t-distribution with n-2 degrees of freedom.

Regression Coefficient

The regression coefficient is interpreted as the log of the ratio of the rates for an increase of one unit in the predicting variable. For p predictors, we have p + 1 regression coefficients for a model with intercept. We estimate the model parameters using maximum likelihood estimation.

Poisson Regression

The response Y in Poisson regression is assumed to have a Poisson distribution, and this is commonly used for modeling count or rate data. We assume that the i-th response Yi has a Poisson distribution, with rate lambda i. Alternatively, log of the rate lambda i is equal to the linear combination of the predicting variables We do not interpret beta with respect to the response variable but with respect to the ratio of the rate There is no error term

Generalized Linear Models

The response Y is assumed to have a distribution from the exponential family of distributions. Example of distributions in the exponential family of distributions are normal, binomial, Poisson, gamma We model a transformation g of the expectation of Y as a linear combination of the predicting variables The transformation g is called the link function, since it links the expectation of the response data to the predicting variables. The transformations g depends on the distribution of the response data

Overall Regression (Logistic)

The test statistic is a chi-squared distribution with p degrees of freedom where p is the number of predicting variables. We reject the null hypothesis when the P-value is small, indicating that the overall regression has explanatory power.

G transformation

The transformation g is called a link function since it links the expectation of the response to the predicting variables

The variability in the prediction comes from

The variability due to a new measurement and due to estimation.

The mean squared errors (MSE) measures:

The within-treatment variability.

Wald Test (Z-test)

The z-test value is the ratio between the estimated coefficient minus 0 (the null value) and its standard error. We reject the null hypothesis that the regression coefficient is 0 if the z-value is larger in absolute value than the z critical point, i.e., the 1 − α/2 quantile of the normal distribution. We then interpret the coefficient as statistically significant. **Applies for Logistic & Poisson Regression
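A minimal R sketch of the Wald z-test on simulated logistic data (the data are an assumption for illustration):

    set.seed(1)
    x   <- rnorm(200)
    y   <- rbinom(200, 1, plogis(0.7 * x))
    fit <- glm(y ~ x, family = binomial)
    z   <- coef(summary(fit))["x", "Estimate"] / coef(summary(fit))["x", "Std. Error"]
    2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value; matches summary(fit)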

Reasons why a model may not be a good fit

There may be other variables that should be included in the model. The relationship between the logit of the expected probability and the predictors might be multiplicative rather than additive (a departure from the linearity assumption). Outliers and leverage points are also still an issue for this model. The logit function may not fit the data well. The binomial distribution may not be appropriate, for example if there is correlation among the responses or there is heterogeneity in the success probability that hasn't been modeled; both of these violations can lead to what we call overdispersion.

Nonlinear & Nonparametric Regression

These approaches are commonly used to deal with nonlinearity

Regularized (Penalized) Regression

These approaches perform variable selection and estimation simultaneously.

Deviance Residuals

These are the signed square roots of each observation's contribution to the deviance, which compares the log-likelihood evaluated at the saturated model (where the estimated expected response is the observed response) versus the fitted model. Deviance residuals have an approximately standard normal distribution if the model is a good fit (i.e., the model assumptions hold).

Spatial Regression

This approach also deals with correlated data If the assumption of uncorrelated errors does not hold, it can lead to misleading assessments of the strength of the regression Spatial processes can be observed over regular grid such as images. A regular grid is also called the lattice. Spatial processes can also be observed on irregular grids

Mixed Effects Models

This approach deals with replications in the response data Generally, a group effect is random if we can think of the responses we observe in that group to be samples from a larger population

Mallows Cp

This approach is useful when there are no control variables. It assumes we can estimate the variance from the full model. **This is not the case when p > n. The complexity penalty is (2 × number of predictors in the submodel × estimated variance of the full model)/n. Select the model with the smallest Cp score.
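A minimal sketch of this Cp formula in R, assuming the built-in mtcars data and counting the submodel's estimated coefficients as its number of predictors (an assumption made for illustration):

    full <- lm(mpg ~ ., data = mtcars)
    sub  <- lm(mpg ~ wt + hp, data = mtcars)
    n           <- nrow(mtcars)
    sigma2_full <- summary(full)$sigma^2                  # variance estimated from the full model
    train_risk  <- mean(resid(sub)^2)                     # training risk of the submodel
    cp <- train_risk + 2 * length(coef(sub)) * sigma2_full / n
    cp                                                    # compare across submodels; pick the smallest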

Type I Error

This happens if the sample size, or n is small. The hypothesis testing procedure will have a probability of Type I error larger than the significance level (i.e. more Type I errors than expected) **Applies for Logistic & Poisson Regression

Cross Validation

This is a direct measure of predictive power. Random sampling is computationally more expensive than K-fold cross-validation, with no clear advantage in terms of the accuracy of the estimated classification error rate. The rule of thumb for choosing K is about K = 10. LOOCV is a K-fold cross-validation with K = n. The larger K is (the larger the number of folds), the less biased the estimate of the classification error is, but the higher its variability.
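A minimal K-fold cross-validation sketch in R (K = 10) for a classification error rate, on simulated data (the data are an assumption for illustration):

    set.seed(1)
    dat   <- data.frame(x = rnorm(200))
    dat$y <- rbinom(200, 1, plogis(dat$x))
    K     <- 10
    folds <- sample(rep(1:K, length.out = nrow(dat)))
    errs  <- sapply(1:K, function(k) {
      fit  <- glm(y ~ x, family = binomial, data = dat[folds != k, ])
      phat <- predict(fit, newdata = dat[folds == k, ], type = "response")
      mean((phat > 0.5) != dat$y[folds == k])   # misclassification rate on the held-out fold
    })
    mean(errs)                                   # estimated classification error rate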

Leave-One-Out Cross Validation (LOOCV)

This is a direct measure of predictive power. This is just like Mallow's, except the variance is for the S submodel, not the full model. The LOOCV penalizes complexity less than Mallow's Cp.

Stepwise Regression

This is a heuristic search used when p is large. It's also useful when there are control variables. The three stepwise regression approaches do not necessarily select the same model, especially when p is large This is a greedy algorithm; it does not guarantee to find the model with the best score Greedy means we always take the biggest jump (up or down) in the selected criterion

General Additive Regression

This is a non-parametric model used if the linearity assumption does not hold and/or it's difficult to identify transformations that improve the fit Here the relationship of predicting variable to the response is assumed unknown The estimation of the parameters does not have a closed form expression

Normality assumption

This is needed if we want to compute confidence or prediction intervals or run hypothesis tests, which we usually do. If this assumption is violated, hypothesis tests and confidence and prediction intervals can be very misleading.

Deviance

This is the difference between the log likelihood from a reduced model and the log likelihood from a full model For large sample size data, the distribution (assuming the null hypothesis is true), is a chi square distribution with Q degrees of freedom Q = Number of Z predicting variables (controlling variables for bias selection) i.e. the number of regression coefficients discarded from the full model to get the reduced model The P-value of the test is computed as the right tail of the chi-square distribution with Q degrees of freedom of the test value (Deviance) **This test is NOT a goodness of fit test. It simply compares two models and decides whether the larger model is statistically significantly better than the reduced model.

Odds of a success

This is the exponential of the Logit function

Prediction Risk

This is the measure of the bias-variance tradeoff It is the sum of expected squared differences between fitted values by the model S and future observations Prediction Risk = variance(future observation) + bias^2 + variance(prediction) MSE = bias^2 + variance(prediction) The variance of the future observation (sigma squared) is an irreducible error and thus cannot be controlled

Logistic Regression

This is the model where the link function g is the logit function: the log of the ratio of p and 1-p. We model the probability of a success given the predicting variables using the g link function, in such a way that the g function of the probability of success is a linear model of the predicting variables. The corresponding probability-of-success curve is the S-shaped function of the predicting variables. Assumptions: - Linearity in the predicting variables - Independence of the observed responses - The link function is the logit function. The parameters for logistic regression are beta 0, beta 1, ..., beta p. These parameters are unknown but estimated based on the observed data using the maximum likelihood approach. Statistical inference on the regression coefficients uses an approximate sampling distribution of the estimated regression coefficients, where the approximate distribution is the normal distribution.

Exposure of the response variable

This is the number of units when modeling rate data with Poisson Regression Exposure is accounted for using an offset which is the log of the exposure. This variable is added as a predictor to the Poisson regression model
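A minimal R sketch of a Poisson rate model with an exposure offset, on simulated data (the data are an assumption for illustration):

    set.seed(1)
    exposure <- sample(50:500, 100, replace = TRUE)        # e.g. number of units observed
    x        <- rnorm(100)
    y        <- rpois(100, lambda = exposure * exp(-3 + 0.5 * x))
    glm(y ~ x + offset(log(exposure)), family = poisson)   # offset = log(exposure)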

Mallow's Cp

This is the oldest approach to variable selection. This assumes that we can estimate the variance from the full model, however this is NOT the case when p is larger than n.

Training Error

This is the proportion of the responses that are misclassified. We cannot use the training error rate as an estimate of the true classification error rate because it is biased downward. The bias comes from the fact that we use the data twice: once for fitting the model and a second time to estimate the classification error rate.

Pearson Residuals

This is the standardized difference between the ith observed response and estimated expected response, which is ni times the probability of success We need to standardize the difference between observed and expected response, as the responses have different variances Pearson residuals have an approximately standard normal distribution
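A minimal R sketch extracting Pearson and deviance residuals from a logistic fit to binary data with replications (simulated; 10 trials per observation is an assumption for illustration):

    set.seed(1)
    x    <- rnorm(100)
    succ <- rbinom(100, size = 10, prob = plogis(0.8 * x))    # successes out of ni = 10 trials
    fit  <- glm(cbind(succ, 10 - succ) ~ x, family = binomial)
    head(residuals(fit, type = "pearson"))
    head(residuals(fit, type = "deviance"))
    qqnorm(residuals(fit, type = "deviance"))                 # roughly standard normal if the fit is good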

Training Risk

This is the sum of squared differences between fitted values for the submodel S and the observed values. The training risk is biased downward (it underestimates the prediction risk) because the data are used twice. The larger the number of variables in the model, the smaller the training risk is.

Simpson's paradox

This is when the addition of a predicting variable reverses the sign of the coefficient of an existing predictor. It refers to the reversal of an association when looking at a marginal relationship versus a partial or conditional one; the marginal relationship shows the wrong sign. This happens when the two variables are correlated.

Complete Separation

This is when the model fits perfectly (p-value = 1) after an outlier is removed It indicates that the possibility of a simpler model being good enough should be explored

Nonlinear Regression

This is when the relationship between the response variable and the predicting variables is known but it cannot be expressed as a sum of transformed predicting variables. The regression function has a known structure given the predicting variables. It depends on a series of parameters In nonlinear regression, we know the function F except for the theta. We estimate the model by minimizing the sum of squared errors. We minimize this with respect to theta

Overdispersion

This is when the variability of the probability estimates is larger than would be implied by a binomial random variable. ɸ = D/(n-p-1), where D is the deviance (the sum of squared deviance residuals). If ɸ > 2, the model is overdispersed, and a model that accounts for the overdispersion will fit the data better. Overdispersion impacts the estimated variance and the statistical inference; if overdispersion is not accounted for, the statistical inference will not be as reliable.
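A minimal R sketch of the overdispersion check, using simulated binomial data with replications (the data are an assumption for illustration):

    set.seed(1)
    x    <- rnorm(150)
    succ <- rbinom(150, size = 20, prob = plogis(0.3 * x))
    fit  <- glm(cbind(succ, 20 - succ) ~ x, family = binomial)
    phi  <- fit$deviance / fit$df.residual    # D / (n - p - 1); df.residual equals n - p - 1
    phi                                        # values well above ~2 suggest overdispersion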

You were hired to consult on a study for the attendance behavior of high school students at two different schools. The data set you were given contains for each 316 students: the number of days he/she was absent in an academic year (daysabs), his/her math scores (math), his/her language arts scores (langarts), and whether the student is male or female (1 = male, 0 = female). A Poisson regression model was fitted to evaluate the relationship between the number of days of absence in an academic year and all the predictors. The R output for the model summary is as follows: Coefficient Estimate SE z value Pr(>|z|) (Intercept) 2.687666 0.072651 36.994 <2e-16 math -0.003523 0.001821 -1.934 0.0531 langarts -0.012152 0.001835 -6.623 3.52e-11 male -0.400921 0.048412 -8.281 <2e-16 Also, assume the average language arts scores (across all students) is 50, and the average math scores (across all students) is 45.5. How many regression coefficients including the intercept are statistically significant at the significance level 0.05? All Three Two None

Three As the summary output above shows, the coefficients associated to the intercept, langarts and male are statistically significant at α = 0.05. Their associated p-values (<2e-16, 3.52e-11, <2e-16) are smaller than 0.05

Hypothesis Testing (coefficient == 0)

To perform hypothesis testing, we can use the approximate normal sampling distribution. The resulting hypothesis test is also called the Wald test since it relies on the large sample normal approximation of MLEs To test whether the coefficient betaj = 0 or not, we can use the z- value **Applies for Logistic & Poisson Regression

Hypothesis Testing (coefficient == constant)

To test whether the regression coefficient is equal to a constant b, the z-value changes: we subtract b from the estimated coefficient in the numerator. We decide to reject or not using the p-value. The p-value is 2 times the upper tail of the standard normal beyond the absolute value of the z-value: p-value = 2P(Z > |z-value|). **Applies for Logistic & Poisson Regression

Alpha is only used in elastic net.

True

Time Series Regression Characteristics

Trends: long-term increase/decrease in the data over time, or fluctuations over time. Seasonality: the series is influenced by seasonal factors. Periodicity: seasonality repeats at exactly the same time with exactly the same regular pattern. Cyclical trends: the data exhibit rises and falls that are not of a fixed period. Heteroskedasticity: the variability varies with time. Correlation: successive observations are correlated, positively or negatively. To account for trend and seasonality or periodicity, we decompose a time series into three components: mt (trend), st (seasonality), and Xt (a stationary process, i.e., its probability distribution does not change when shifted in time).

The means of the k populations, the sample means of the k populations, and the sample means of the k samples are NOT all the model parameters in ANOVA. (T/F)

True. The model parameters in ANOVA are the k population means and the variance of the error terms; the sample means are estimators, not parameters.

A MLR model has high explanatory power if the coefficient of determination is close to 1

True

A high Cook's distance for a particular observation suggests that the observation could be an influential point.

True

A logistic regression model may not be a good fit to the data if the responses are correlated or if there is heterogeneity in the success that hasn't been modeled

True

A negative value of β1 is consistent with an inverse relationship between x and y.

True

AIC looks just like the Mallow's Cp except that the variance is the true variance and not its estimate.

True

In a full model F test, a low p-value indicates the model has predictive power.

True

In a multiple regression model with 7 predicting variables, the sampling distribution of the estimated variance of the error terms is a chi-squared distribution with n-8 degrees of freedom.

True

In case of multiple linear regression, controlling variables are used to control for sample bias.

True

In evaluating a multiple linear model residual analysis is used for goodness of fit assessment.

True

In evaluating a multiple linear model the F test is used to evaluate the overall regression.

True

In evaluating a multiple linear model the coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model.

True

In evaluating a simple linear model residual analysis is used for goodness of fit assessment.

True

In evaluating a simple linear model the coefficient of determination is interpreted as the percentage of variability in the response variable explained by the model.

True

In evaluating a simple linear model there is a direct relationship between the coefficient of determination and the correlation between the predicting and response variables.

True

In simple linear regression, we can diagnose the assumption of constant-variance by plotting the residuals against fitted values.

True

In testing for a subset of coefficients in logistic regression, the null hypothesis is that the coefficients in the subset are all equal to zero.

True

It is possible to apply logistic regression when the response variable Y has 3 classes.

True

It is possible to produce a model where the overall F-statistic is significant but all the regression coefficients have insignificant t-statistics.

True

Let Y* be the predicted response at x*. The variance of Y* given x* depends on both the value of x* and the design matrix.

True

Under the normality assumption, the estimator for β1 is a linear combination of normally distributed random variables. (T/F)

True. See 1.4 Statistical Inference: "Under the normality assumption, β1 is thus a linear combination of normally distributed random variables... β̂0 is also a linear combination of random variables."

If the model assumptions hold, then the estimator for the variance, σ̂², is a random variable. (T/F)

True. See 1.8 Statistical Inference. We assume that the error terms are independent random variables. Therefore, the residuals are independent random variables. Since σ̂² is a combination of the residuals, it is also a random variable.

An ANOVA model with a single qualitative predicting variable containing k groups will have k + 1 parameters to estimate. (T/F)

True See 2.2 Estimation Method We have to estimate the means of the k groups and the pooled variance estimator, s pooled ^2.

If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable. (T/F)

True See 2.8 Data Example "This is important since without a good fit, we cannot rely on the statistical inference." Only when the model is a good fit, i.e. all model assumptions hold, can we rely on the statistical inference.

If the pairwise comparison interval between groups in an ANOVA model includes zero, we conclude that the two means are plausibly equal. (T/F)

True See 2.8 Data Example If the comparison interval includes zero, then the two means are not statistically significantly different, and are thus, plausibly equal.

The logit function is the log of the ratio of the probability of success to the probability of failure. It is also known as the log odds function.

True The logit link function is also known as the log odds function.

In logistic regression, the relationship between the probability of success and the predicting variables is nonlinear.

True We model the probability of success given the predictors by linking the probability to the predicting variables through a nonlinear link function.

Visual analytics for logistic regression: normal probability plot of residuals, residuals vs predictors, logit of success rate vs predictors.

True. Normal probability plot of residuals - normality; residuals vs predictors - linearity/independence; logit of success rate vs predictors - linearity.

Elastic net regression uses both penalties of ridge and lasso regression and hence combines the benefits of both

True -

For a classification model, the training error rate tends to underestimate the true classification error rate of the model.

True -

In the balance of Bias-Variance tradeoff, adding variables to our model tends to increase our variance and decrease our bias

True - Adding more variables will increase the variability and possibly induce multicollinearity. Adding more variables also reduces the bias in the model since it has an additional predictor to conform to which keeps the model from favoring one of the original predictors.

BIC variable selection criteria favors simpler models

True - BIC penalizes complexity more than other approaches.

When the objective is to explain the relationship to the response, one might consider including predicting variables which are correlated

True - But this should be avoided for prediction

Ridge regression corrects for the impact of multicollinearity by reweighting the regression coefficients

True - Ridge regression has been developed to correct for the impact of multicollinearity. If there is multicollinearity in the model, all predicting variables are considered to be included in the model but ridge regression will allow for re-weighting the regression coefficients in a way that those corresponding to correlated predictor variables share their explanatory power and thus minimizing the impact of multicollinearity on the estimation and statistical inference of the regression coefficients.

We can estimate the regression coefficients in Poisson regression using the maximum likelihood estimation approach

True - Use MLE for Poisson

To perform hypothesis testing for Poisson, we can use again the approximate normal sampling distribution, also called the Wald test

True - Wald Test also used with logistic regression

Although there are no error terms in a Logistic Regression model using binary data with replications, we can still perform residual analysis.

True - We can perform residual analysis on the Pearson residual or the Deviance residual.

When selecting variables for explanatory purpose, one might consider including predicting variables which are correlated if it would help answer your research hypothesis

True - When the objective is to explain the relationship to the response, one might consider including the predicting variables even if they are correlated.

Under the null hypothesis of good fit for logistic regression, the test statistic has a Chi-Square distribution with n- p- 1 degrees of freedom

True - don't forget, we want large P values

A Poisson regression model fit to a dataset with a small sample size will have a hypothesis testing procedure with more Type I errors than expected

True - if the sample size is small, the statistical inference is not reliable. Thus, the hypothesis testing procedure will have a probability of Type I error larger than the significance level.

L1 penalty will force many betas, many regression coefficients to be 0s

True - is equal to the sum of the absolute values of the regression coefficients to be penalized

L2 does not perform variable selection

True - is equal to the sum of the squared regression coefficients to be penalized and does not do variable selection

Stepwise regression is a greedy search algorithm that is not guaranteed to find the model with the best score

True - it does not guarantee to find the model with the best score

The assumptions in logistic regression are - Linearity, Independence of response variable, and the link function is the logit function.

True - linearity is assessed through the link: the g of the probability of success is linear in the predicting variables.

The prediction interval of one member of the population will always be larger than the confidence interval of the mean response for all members of the population when using the same predicting values. (T/F)

True. See 1.7 Regression Line: Estimation & Prediction Examples "Just to wrap up the comparison, the confidence intervals under estimation are narrower than the prediction intervals because the prediction intervals have additional variance from the variation of a new measurement."

The normality assumption states that the response variable is normally distributed. (T/F)

False. See 1.8 Diagnostics: "Normality assumption: the error terms are normally distributed." The response may or may not be normally distributed, but the error terms are assumed to be normally distributed.

It is good practice to create a multiple linear regression model using a linearly dependent set of predictor variables. (T/F)

False. See Lesson 3.13: Model Evaluation and Multicollinearity. It is good practice to create a multiple linear regression model using a linearly independent set of predicting variables. "XTX is not invertible if the columns of X are linearly dependent, i.e. one predicting variable, corresponding to one column, is a linear combination of the others."

If the residuals are not normally distributed, we can model the transformed response variable instead, where a common transformation for normality is the Box-Cox transformation. (T/F)

True. See Lesson 3.3.11: Assumptions and Diagnostics If the normality assumption does not hold, we can use a transformation that normalizes the response variable such as Box-Cox transformation.
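A minimal Box-Cox sketch, assuming the MASS package and the built-in mtcars data (illustrative only):

    library(MASS)
    fit <- lm(mpg ~ wt + hp, data = mtcars)
    bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))   # profile log-likelihood over lambda
    bc$x[which.max(bc$y)]                                # lambda that maximizes the likelihood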

A linear regression model has high explanatory power if the coefficient of determination is close to 1. (T/F)

True. See Lesson 3.3.13: Model Evaluation and Multicollinearity If R2 is close to 1, almost all of the variability in Y can be explained by the linear regression model; hence, the model has high explanatory power.

For a given predicting variable, the corresponding estimated regression coefficient will likely be different in a conditional model versus a marginal model. (T/F)

True. See Lesson 3.4: Model Interpretation "Importantly, the estimated regression coefficients for the conditional and marginal relationships can be different, not only in magnitude but also in sign or direction of the relationship."

In multiple linear regression, the estimated regression coefficient corresponding to a quantitative predicting variable is interpreted as the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable holding all other predictors fixed. (T/F)

True. See Lesson 3.4: Model Interpretation "The estimated value for one of the regression coefficient βi represents the estimated expected change in y associated with one unit of change in the corresponding predicting variable, Xi, holding all else in the model fixed."

A partial F-Test can be used to test whether the regression coefficients associated with a subset of the predicting variables in a multiple linear regression model are all equal to zero. (T/F)

True. See Lesson 3.7: Testing for Subsets of Regression Parameters We use the Partial F-test to test the null hypothesis that the regression coefficients associated to a subset of the predicting variables are all equal to zero. The alternative hypothesis is that at least one of these regression coefficients is not zero.

Goodness of Fit (Poisson)

Use the Pearson or deviance residuals to evaluate whether they are normally distributed; if they are, we conclude that the model is a good fit. In a goodness-of-fit test, the null hypothesis is that the model fits well and the alternative is that the model does not fit well. The test statistic for the goodness-of-fit test is the sum of squared deviance residuals (the deviance), which has a chi-squared distribution with n-p-1 degrees of freedom. If the p-value is small, we reject the null hypothesis of good fit and conclude that the model is not a good fit.
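A minimal R sketch of the deviance goodness-of-fit test for a Poisson model, on simulated data (the data are an assumption for illustration):

    set.seed(1)
    x   <- rnorm(200)
    y   <- rpois(200, exp(0.4 + 0.6 * x))
    fit <- glm(y ~ x, family = poisson)
    pchisq(fit$deviance, df = fit$df.residual, lower.tail = FALSE)   # small p-value => reject good fit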

Goodness of Fit (binary data with no replications)

Use the deviances from the aggregated model for goodness of fit, not based on the individual level data

L2 Penalty

Using this penalty accounts for multicollinearity, but does not perform variable selection. The resulting regularized regression is called ridge regression. This is easy to implement, but it does not measure sparsity nor perform variable selection.

L1 Penalty

Using this penalty will force many betas to be 0s. The resulting regularized regression is called the LASSO regression L1 penalty measures sparsity

L0 Penalty

Using this penalty, the penalized least squares is equivalent to searching over all models and thus not feasible for a large number of predictive variables L0 penalty provides the best model given selection criteria, but it requires fitting all submodels

How do we interpret the VIF?

VIF measures the proportional increase in the variance of beta hat j compared to what it would have been if the predictive variables had been completely uncorrelated
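A minimal sketch, assuming the car package and the built-in mtcars data (illustrative only):

    library(car)
    fit <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
    vif(fit)    # VIF_j = 1 / (1 - R_j^2); e.g. VIF = 6 corresponds to R_j^2 ≈ 0.833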

Uncorrelated Errors Assumption

Violations of this assumption can lead to misleading assessment of the strength of the regression. This is because the degrees of freedom are not equal to the sample size. In fact, there are less degrees of freedom due to the correlation. Moreover, not accounting for correlation will result in higher variability or uncertainty in the estimate, thus less reliable statistical inference

Goodness of Fit

We can use the Pearson or Deviance residuals to evaluate whether they are normally distributed. If they're normally distributed, we conclude that the model is a good fit If the model is not a good fit, it means the linearity assumption may not hold

Logistic Regression Coefficient

We interpret the regression coefficient beta as the log of the odds ratio for an increase of one unit in the predicting variable We do not interpret beta with respect to the response variable but with respect to the odds of success The estimators for the regression coefficients in logistic regression are unbiased and thus the mean of the approximate normal distribution is beta. The variance of the estimator does not have a closed form expression

g-function

We link the probability of success to the predicting variables using the g link function. The g function is the s-shape function that models the probability of success with respect to the predicting variables The link function g is the log of the ratio of p over one minus p, where p again is the probability of success Logit function (log odds function) of the probability of success is a linear model in the predicting variables The probability of success is equal to the ratio between the exponential of the linear combination of the predicting variables over 1 plus this same exponential

What can we do since the training risk is biased?

We need to correct for this bias by penalizing the training risk by adding a complexity penalty.

To use the estimated residuals for assessing model assumptions, what do we need to do first?

We need to standardize them

When would we reject the null hypothesis for a z test?

We reject the null hypothesis that the regression coefficient is 0 if the z-value is larger in absolute value than the z critical point, i.e., the 1 − α/2 normal quantile. We interpret this as the coefficient being statistically significant.

Regression Coefficients - Robust Regression

We replace the sum of squared errors with the sum of absolute errors, thus estimating the regression coefficients by minimizing the sum of absolute errors. We cannot obtain closed or exact expressions for the estimated regression coefficients. The estimated coefficients are approximate estimates

We detect departure from the assumption of constant variance

When the spread of the residuals versus fitted values is larger at the ends but smaller in the middle (i.e., not constant across fitted values).

Robust Regression II

When we have a heavy-tailed distribution that is symmetric, we replace the normal distribution with a double-exponential (Laplace) PDF. The main difference between this distribution and the normal distribution is that we have the absolute value of the difference between y and the centrality parameter mu, whereas for the normal distribution we had y minus mu squared. This distribution has heavier tails than the normal. The parameter mu is not the mean or the expectation anymore, but the median of the distribution; the estimated mu using this approach is the sample median.

High Dimensionality

When we have a very large number of predicting variables to consider, it can be difficult to interpret and work with the fitted model

training risk

compute the prediction risk for the observed data and take the sum of squared differences between fitted values for sub model S and the observed values

In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters. The data frame has 30 observations and the following variables: taste - a subjective taste score Acetic - concentration of acetic acid (log scale) H2S - concentration of hydrogen sulfide (log scale) Lactic - concentration of lactic acid Using the following R output from a fitted multiple linear regression model, answer the following multiple-choice questions. Call: lm(formula = taste ~., data = chedder) Residuals: Min 1Q Median 3Q Max -17.390 -6.612 -1.009 4.908 25.449 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -28.8768 19.7354 -1.463 0.15540 Acetic 0.3277 4.4598 0.073 0.94198 H2S 3.9118 1.2484 3.133 0.00425 ** Lactic 19.6705 8.6291 2.280 0.03108 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' Residual standard error: 10.13 on 26 degrees of freedom Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116 F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06 Calculate the sum of squared errors (SSE) from the given R output. Select the choice that most closely approximates your calculation. a. 102.617 b. 2668.039 c. 2533.081 d. 2786.025

b. 2668.039 MSE = SSE/(n−p−1) = SSE/DF. Hence, SSE = MSE × DF = 10.13² × (30 − 3 − 1) = 2668.04

A linear regression model was fitted to estimate the response variable Height for black cherry trees using just the Diameter. The data frame has 31 observations. Here is the model summary, with some parts missing. Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 62.0313 A 14.152 1.49e-14 *** Diameter 1.0544 0.3222 3.272 0.00276 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 5.538 on B degrees of freedom Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445 What is the value of A (standard error for the estimated intercept)? a. 877.9 b. 4.383 c. 0.2281 d. None of the above

b. 4.383 Since t-value = (estimated intercept - 0)/estimated std, we have, estimated std = estimated intercept/tvalue = 62.0313/14.152 = 4.383

Which of the following is not an application of regression? a. Testing hypotheses b. Proving causation c. Predicting outcomes d. Modeling data

b. Proving causation

We can make causality statements for...

experimental designs

Why can we not use the training error rate as an estimate of the true error classification error rate?

because it is biased downward. And the bias comes from the fact that we use the data twice. First, we used it for fitting the model and the second time to estimate the classification error rate.

In logistic regression, how do we define residuals for evaluating g-o-f?

Residuals (Pearson or deviance residuals) are defined for binary data with replications.

What is the distribution of binary data WITH replications?

binomial distribution with more than one trial or ni greater than 1

In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters. The data frame has 30 observations and the following variables: taste - a subjective taste score Acetic - concentration of acetic acid (log scale) H2S - concentration of hydrogen sulfide (log scale) Lactic - concentration of lactic acid Using the following R output from a fitted multiple linear regression model, answer the following multiple-choice questions. Call: lm(formula = taste ~., data = chedder) Residuals: Min 1Q Median 3Q Max -17.390 -6.612 -1.009 4.908 25.449 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -28.8768 19.7354 -1.463 0.15540 Acetic 0.3277 4.4598 0.073 0.94198 H2S 3.9118 1.2484 3.133 0.00425 ** Lactic 19.6705 8.6291 2.280 0.03108 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' Residual standard error: 10.13 on 26 degrees of freedom Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116 F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06 Calculate the sum of squares total (SST) from the given R output. Select the choice that most closely approximates your calculation. a. 4994.48 b. 3147.54 c. 7662.38 d. 8655.21

c. 7662.38 Since R^2 = 1 − SSE/SST, we have SST = SSE/(1 − R^2) = 2668.039/(1 − 0.6518) = 7662.38
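In R, reusing the SSE computed earlier:
sse <- 2668.039
r2  <- 0.6518
sst <- sse / (1 - r2)   # SST = SSE / (1 - R^2), approximately 7662.38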

You have measured the systolic blood pressure of a random sample of 50 employees of a company, and have fitted a linear regression model to estimate the response variable systolic blood pressure using the sex of the employees. The 95% confidence interval for the mean systolic blood pressure for the female employees is computed to be (122, 138). Which of the following statements gives a valid frequentist interpretation of this interval? a. 95% of the sample of female employees has a systolic blood pressure between 122 and 138. b. 95 % of the employees in the company have a systolic blood pressure between 122 and 138. c. If the sampling procedure were repeated 100 times, then approximately 95 of the resulting 100 confidence intervals would contain the true mean systolic blood pressure for all female employees of the company. d. We are 95% confident the sample mean is between 122 and 138

c. If the sampling procedure were repeated 100 times, then approximately 95 of the resulting 100 confidence intervals would contain the true mean systolic blood pressure for all female employees of the company.

In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters. The data frame has 30 observations and the following variables: taste - a subjective taste score Acetic - concentration of acetic acid (log scale) H2S - concentration of hydrogen sulfide (log scale) Lactic - concentration of lactic acid Using the following R output from a fitted multiple linear regression model, answer the following multiple-choice questions. Call: lm(formula = taste ~., data = chedder) Residuals: Min 1Q Median 3Q Max -17.390 -6.612 -1.009 4.908 25.449 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -28.8768 19.7354 -1.463 0.15540 Acetic 0.3277 4.4598 0.073 0.94198 H2S 3.9118 1.2484 3.133 0.00425 ** Lactic 19.6705 8.6291 2.280 0.03108 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' Residual standard error: 10.13 on 26 degrees of freedom Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116 F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06 Given the R output, an increase in the concentration of lactic acid by one unit results in a(n) ___________ in the given taste score by ___________ points, holding all other variables constant. a. Decrease, 19.6705 b. Increase, 8.6291 c. Increase, 19.6705 d. Decrease, 8.6291

c. Increase, 19.6705 The estimated coefficient for Lactic is 19.6705. If we fix all other predictors, for each 1-unit increase in Lactic, the given taste score increases by 19.6705 points.

In ANOVA, for which of the following purposes is the Tukey method used? a. Test for homogeneity of variance b. Test for normality c. Test for differences in pairwise means d. Test for independence of errors

c. Test for differences in pairwise means
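A minimal R sketch, assuming a hypothetical data frame dat with a numeric response and a factor group:
fit <- aov(response ~ group, data = dat)   # one-way ANOVA fit
TukeyHSD(fit, conf.level = 0.95)           # simultaneous CIs for all pairwise mean differences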

marginal model (SLR)

captures the association of one predicting variable with the response variable marginally, that is, without accounting for other factors.

What is the sampling distribution for the pooled variance estimator?

chi-square distribution with N - K degrees of freedom

What is the estimated sampling distribution of s^2?

chi-square with n-1 DF

What is the estimated sampling distribution of sigma^2?

chi-square with n-2 DF (~ equivalent to MSE)

in MLR, the sampling distribution for σ̂^2 (the MSE) is .....

chi-square with n-p-1 DF

σ^2 hat distribution is?

chi-square, n-p-1 DF

What is the distribution and DOF of overall regression test statistic?

chi-squared with p degrees of freedom where p is the number of predicting variables.

Poisson regression

commonly used for modeling count or rate data.

A linear regression model was fitted to estimate the response variable Height for black cherry trees using just the Diameter. The data frame has 31 observations. Here is the model summary, with some parts missing. Coefficients: Estimate Std. Error t-value Pr(>|t|) (Intercept) 62.0313 A 14.152 1.49e-14 *** Diameter 1.0544 0.3222 3.272 0.00276 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 5.538 on B degrees of freedom Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445 What is the value of B (degrees of freedom of the estimated error variance)? a. 32 b. 31 c. 30 d. 29

d. 29 The degrees of freedom of the estimated error variance are calculated as df = n − k − 1 = 31 − 1 − 1 = 29

Models with few predictors have...

high bias but low variance

In GLM or generalized linear models, the response Y is assumed to have what kind of distribution?

distribution from the exponential family of distributions

L1 penalty - LASSO

equal to the sum of the absolute values of the regression coefficients being penalized. Minimizing the penalized least squares with this penalty forces many of the regression coefficients to be exactly 0 (LASSO).
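A minimal R sketch using the glmnet package; the predictor matrix X and response y are hypothetical placeholders:
library(glmnet)
# X: numeric predictor matrix, y: response vector (hypothetical)
cv_fit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 selects the L1 (lasso) penalty
coef(cv_fit, s = "lambda.min")         # many coefficients are shrunk to exactly zero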

In a simple linear regression model, given a significance level α, the (1 − α)100% confidence interval for the mean response should be wider than the (1 − α)100% prediction interval for a new response at the predictor's value x*.

false In a simple linear regression model, given a significance level α, the (1−α)100% confidence interval for the mean response should be narrower than the (1−α)100% prediction interval for a new response at the predictor's value x* .
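A minimal R sketch, assuming a hypothetical lm fit fit and new data point newdat, showing that the prediction interval is the wider of the two:
predict(fit, newdata = newdat, interval = "confidence")   # CI for the mean response (narrower)
predict(fit, newdata = newdat, interval = "prediction")   # PI for a new response (wider)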

With k-fold cross validation larger k values increase bias and reduce variance.

false Larger values of k decrease bias and increase variance.

Stepwise regression is a greedy algorithm searching through all possible combinations of the predicting variables to find the model with the best score.

false Not all possible combinations are checked.

In a multiple linear regression model, when more predictors are added, R^2 can decrease if the added predictors are unrelated to the response variable.

false R^2 never decreases as more predictors are added to a multiple linear regression model.

Ridge regression is a regularized regression approach that can be used for variable selection.

false Ridge regression is a regularized regression approach but does not perform variable selection.

In simple linear regression models, we lose three degrees of freedom when estimating the variance because of the estimation of the three model parameters β0, β1, σ^2.

false See 1.2 Estimation Method "The estimator for σ 2 is σ ^ 2, and is the sum of the squared residuals, divided by n - 2."

The sampling distribution for the variance estimator in simple linear regression is χ^2 (chi-squared) regardless of the assumptions of the data.

false See 1.2 Estimation Method "The sampling distribution of the estimator of the variance is chi-squared, with n - 2 degrees of freedom (more on this in a moment). This is under the assumption of normality of the error terms."

We assess the constant variance assumption by plotting the error terms, ϵ_i, against fitted values.

false See 1.2 Estimation Method "We use ϵ ^ i as proxies for the deviances or the error terms. We don't have the deviances because we don't have β 0 and β 1.

The simple linear regression coefficient, β̂0, is used to measure the linear relationship between the predicting and response variables.

false See 1.2 Estimation Method β̂0 is the intercept and does not tell us about the relationship between the predicting and response variables.

The p-value is a measure of the probability of rejecting the null hypothesis.

false See 1.5 Statistical Inference Data Example "p-value is a measure of how rejectable the null hypothesis is... It's not the probability of rejecting the null hypothesis, nor is it the probability that the null hypothesis is true."

The normality assumption states that the response variable is normally distributed.

false See 1.8 Diagnostics "Normality assumption: the error terms are normally distributed." The response may or may not be normally distributed, but the error terms are assumed to be normally distributed.

With the Box-Cox transformation, when λ = 0 we do not transform the response.

false See 1.8 Diagnostics When λ = 0, we transform the response using the natural log.

In ANOVA, the linearity assumption is assessed using a plot of the response against the predicting variable.

false See 2.2. Estimation Method Linearity is not an assumption of ANOVA.

For a multiple linear regression model to be a good fit, we need the linearity assumption to hold for only one of the predicting variables.

false See Lesson 3.11: Assumptions and diagnostics In multiple linear regression, we need the linearity assumption to hold for all of the predicting variables, for the model to be a good fit. "For example, if the linearity does not hold with one or more predicting variables, then we could transform the predicting variables to improve the linearity assumption."

In multiple linear regression, a VIF value of 6 for a predictor means that 90% of the variation in that predictor can be modeled by the other predictors.

false See Lesson 3.13: Model Evaluation and Multicollinearity A VIF value of 6 for a predictor means that 83.3% of the variation in that predictor can be modeled by the other predictors in the model.
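The arithmetic behind the 83.3% figure, since VIF_j = 1/(1 − R_j^2):
vif <- 6
1 - 1 / vif   # = 0.833: proportion of that predictor's variation explained by the other predictors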

It is good practice to create a multiple linear regression model using a linearly dependent set of predictor variables.

false See Lesson 3.13: Model Evaluation and Multicollinearity It is good practice to create a multiple linear regression model using a linearly independent set of predicting variables. "X^T X is not invertible if the columns of X are linearly dependent, i.e. one predicting variable, corresponding to one column, is a linear combination of the others."

Multicollinearity in multiple linear regression means that the rows in the design matrix are (nearly) linearly dependent.

false See Lesson 3.13: Model Evaluation and Multicollinearity Multicollinearity in multiple linear regression means that the columns in the design matrix are (nearly) linearly dependent.

Multicollinearity among the predicting variables will not impact the standard errors of the estimated regression coefficients.

false See Lesson 3.13: Multicollinearity Multicollinearity in the predicting variables can impact the standard errors of the estimated coefficients. "However, the bigger problem is that the standard errors will be artificially large."

In multiple linear regression, the prediction of the response variable and the estimation of the mean response have the same interpretation.

false See Lesson 3.2.9: Regression Line and Predicting a New Response. In multiple linear regression, the prediction of the response variable and the estimation of the mean response do not have the same interpretation.

A multiple linear regression model contains 6 quantitative predicting variables and an intercept. The number of parameters to estimate in this model is 7.

false See Lesson 3.2: Basic Concepts The number of parameters to estimate in a multiple linear regression model containing 6 quantitative predicting variables and an intercept is 8: 7 regression coefficients (β0, β1, ..., β6) and the variance of the error terms (σ^2).

What are three problems that variable selection tries to minimize?

high dimensionality, multicollinearity, and balancing prediction versus explanatory objectives

The estimated variance of the error terms of a multiple linear regression model with intercept can be obtained by summing up the squared residuals and dividing the sum by n - p , where n is the sample size and p is the number of predictors.

false See Lesson 3.3: Regression Parameter Estimation The estimated variance of the error terms of a multiple linear regression model with intercept should be obtained by summing up the squared residuals and dividing that by n-p-1, where n is the sample size and p is the number of predictors as we lose p+1 degrees of freedom when we estimate the p coefficients and 1 intercept.

The causation of a predicting variable to the response variable can be captured using multiple linear regression on observational data, conditional of other predicting variables in the model.

false See Lesson 3.4 Model Interpretation "This is particularly prevalent in a context of making causal statements when the setup of the regression does not allow so. Causality statements can only be made in a controlled environment such as randomized trials or experiments. "

Conducting t-tests on each β parameter in a multiple linear regression model is preferable to an F-test when testing the overall significance of the model.

false See Lesson 3.7: Testing for Subsets of Coefficients "We cannot and should not select the combination of predicting variables that most explains the variability in the response based on the t-tests for statistical significance because the statistical significance depends on what other variables are in the model."

In a multiple linear regression model, the adjusted R^2 measures the goodness of fit of the model

false The adjusted R^2 is not a measure of goodness of fit. R^2 and adjusted R^2 measure the ability of the model and the predicting variables to explain the variation in the response variable. Goodness of fit refers to having all model assumptions satisfied.

In ANOVA, when testing for equal means across groups, the alternative hypothesis is that the means are not equal between two groups for all pairs of means/groups.

false The alternative is that at least one pair of groups has unequal means.

Under Poisson regression, the sampling distribution used for a coefficient estimator is a chi-squared distribution when the sample size is large.

false The coefficient estimator follows an approximate normal distribution.

In regularized regression, the penalization is generally applied to all regression coefficients (β0, ... ,βp), where p = number of predictors.

false The shrinkage penalty is applied to β1, . . . , βp, but not to the intercept β0.

The number of parameters that need to be estimated in a logistic regression model with 5 predicting variables and an intercept is the same as the number of parameters that need to be estimated in a standard linear regression model with an intercept and same predicting variables.

false There are no error terms in logistic regression, so we only have parameters for the 6 coefficients in the model. With linear regression, we have parameters for the 6 coefficients in the model as well as the variance of the error terms.

Variable selection is a simple and completely solved statistical problem since we can implement it using the R statistical software.

false Variable selection for a large number of predicting variables is an "unsolved" problem, and variable selection approaches should be tailored to the problem at hand.

It is good practice to perform a goodness-of-fit test on logistic regression models without replications.

false We can only define residuals for binary data with replications, and residuals are needed for a goodness-of-fit test.

When testing a subset of coefficients, deviance follows a chi-square distribution with q degrees of freedom, where q is the number of regression coefficients in the reduced model.

false q is the difference between the number of regression coefficients in the full model and the reduced model.

When the number of predicting variables is large, both backward and forward stepwise regressions will always select the same set of variables.

false Backward and forward stepwise regressions will not always select the same set of variables.

It is good practice to perform variable selection based on the statistical significance of the regression coefficients.

false It is not good practice to perform variable selection based on the statistical significance of the regression coefficients.

A logistic regression model has the same four model assumptions as a multiple linear regression model.

false The assumptions of a logistic regression model are: 1. Linearity Assumption: There is a linear relationship between the link function and the predictors 2. Independence assumption: The response variables are independent random variables 3. The link function is the logit function

What kind of variable is a predicting variable and why?

fixed, because it does not change with the response but it is fixed before the response is measured.

Where does uncertainty from estimation come from?

from estimation alone

Where does uncertainty from prediction come from?

from the estimation of regression parameters and from the newness of the observation itself

Cross-validation

leave out some of the data when fitting a model; that is, split the data into two parts. One part, called the training data, is used to fit the model for a specific lambda, yielding the estimated regression coefficients for that value of lambda.

what is the coefficient interpretation of a GLM (poisson)?

the log of the ratio of the expected rates associated with a one-unit increase in the predicting variable, holding the other predictors fixed.
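A minimal R sketch of this interpretation, assuming a hypothetical data frame dat with a count response y and predictor x:
fit <- glm(y ~ x, family = poisson, data = dat)
exp(coef(fit)["x"])   # rate ratio: multiplicative change in the expected rate per one-unit increase in x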

A study was conducted to measure the effect of a fungicide treatment on the survival rate of botrytis blight. Botrytis blight samples were divided into 20 groups, each consisting of about 100 samples and exposed to different levels of chemicals in a fungicide. The output of a logistic regression model is below, where concS represents the concentration of a sulfur in the fungicide and concCu represents the concentration of a copper in the fungicide. Use it to answer the following multiple-choice questions. Call: glm(formula = cbind(Survived, Died) ~ concS + concCu,family = "binomial", data = data) Deviance Residuals: Min 1Q Median 3Q Max -9.5366 -2.4594 0.1223 3.9710 6.3566 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.58770 0.22958 15.63 <2e-16 *** concS -4.32735 0.26518 16.32 <2e-16 *** concCu -0.27483 0.01784 15.40 <2e-16 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 718.76 on 19 degrees of freedom Residual deviance: 299.43 on 17 degrees of freedom AIC: 363.53 The p-value for a goodness-of-fit test using the deviance residuals for the regression can be obtained from which of the following? pchisq(419.33,2, lower.tail =FALSE) pchisq(363.53,3, lower.tail =FALSE) pchisq(299.43,17, lower.tail =FALSE) pchisq(718.76,19, lower.tail =FALSE)

pchisq(299.43,17, lower.tail =FALSE) The goodness of fit test uses the residual deviance (299.43) and corresponding degrees of freedom (17) as the test statistic for the chi-squared test.

For goodness of fit test, we compare the likelihoods of the...?

saturated model versus the fitted model

Ridge regression does not perform variable selection. It only...

shrinks the coefficients toward zero but does not FORCE coefficients to be exactly zero, as needed for variable selection.

training error

simply use the data to fit the model, then apply the fitted classifier to each response in the data and take the proportion of responses that are misclassified

Forward stepwise tends to select...

smaller models

HW3 Multi Regression

spacing

L1 penalty measures...

sparsity

The test statistic for the goodness of fit test is?

the sum of the squared deviance residuals (the residual deviance)

If we replace the unknown variance with its estimator, sigma^2=MSE, for PREDICTION, the sampling distribution becomes...

t distribution with n-p-1 DF

The alternative hypothesis of ANOVA can be stated as, the means of all pairs of groups are different the means of all groups are equal the means of at least one pair of groups is different None of the above

the means of at least one pair of groups is different See 2.4 Test for Equal Means "Using the hypothesis testing procedure for equal means, we test: The null hypothesis, which is that the means are all equal (μ1 = μ2 = ... = μk), versus the alternative hypothesis, that some means are different. Not all means have to be different for the alternative hypothesis to be true -- at least one pair of the means needs to be different."

Goodness of fit tests the null hypothesis that

the model fits the data

Deviance

for testing a subset of coefficients, the test statistic is minus twice the difference between the log-likelihood under the reduced model and the log-likelihood under the full model

If we have a positive value for B1,....

then that's consistent with a direct relationship between the predicting variable x and the response variable y.

Assuming the model is a good fit, the residuals in simple linear regression have constant variance.

true

Elastic net regression uses both penalties of ridge and lasso regression and hence combines the benefits of both.

true

If a Poisson regression model does not have a good fit, the relationship between the log of the expected rate and the predicting variables might be not linear.

true

If a predicting variable is categorical with 5 categories in a linear regression model without intercept, we will include 5 dummy variables in the model.

true

In a simple linear regression model, given a significance level α, if the (1 − α)100% confidence interval for a regression coefficient does not include zero, we conclude that the coefficient is statistically significant at the α level.

true In a simple linear regression model, given a significance level α, if the (1 − α)100% confidence interval for a regression coefficient does not include zero, we conclude that the coefficient is statistically significant at the α level.

It is required to standardize or rescale the predicting variables when performing regularized regression.

true Regularized regression requires standardization or scaling of the predicting variables.

A negative value of β 1 is consistent with an inverse relationship between the predictor variable and the response variable.

true See 1.2 Estimation Method "A negative value of β 1 is consistent with an inverse relationship"

The pooled variance estimator, s_pooled^2, in ANOVA is synonymous with the variance estimator, σ̂^2, in simple linear regression because they both use mean squared error (MSE) for their calculations.

true See 1.2 Estimation Method for simple linear regression See 2.2 Estimation Method for ANOVA The pooled variance estimator is, in fact, the variance estimator.

Under the normality assumption, the estimator for β1 is a linear combination of normally distributed random variables.

true See 1.4 Statistical Inference "Under the normality assumption, β̂1 is thus a linear combination of normally distributed random variables... β̂0 is also a linear combination of random variables"

The prediction interval of one member of the population will always be larger than the confidence interval of the mean response for all members of the population when using the same predicting values.

true See 1.7 Regression Line: Estimation & Prediction Examples "Just to wrap up the comparison, the confidence intervals under estimation are narrower than the prediction intervals because the prediction intervals have additional variance from the variation of a new measurement."

If the model assumptions hold, then the estimator for the variance, σ̂^2, is a random variable.

true See 1.8 Statistical Inference We assume that the error terms are independent random variables. Therefore, the residuals are independent random variables. Since σ̂^2 is a combination of the residuals, it is also a random variable.

An ANOVA model with a single qualitative predicting variable containing k groups will have k + 1 parameters to estimate.

true See 2.2 Estimation Method We have to estimate the means of the k groups and the pooled variance estimator, s_pooled^2.

The mean sum of squared errors in ANOVA measures variability within groups.

true See 2.4 Test for Equal Means MSE = within-group variability

If the constant variance assumption in ANOVA does not hold, the inference on the equality of the means will not be reliable.

true See 2.8 Data Example "This is important since without a good fit, we cannot rely on the statistical inference." Only when the model is a good fit, i.e. all model assumptions hold, can we rely on the statistical inference.

If the pairwise comparison interval between groups in an ANOVA model includes zero, we conclude that the two means are plausibly equal.

true See 2.8 Data Example If the comparison interval includes zero, then the two means are not statistically significantly different, and are thus, plausibly equal.

Cook's distance (Di) measures how much the fitted values in a multiple linear regression model change when the ith observation is removed.

true See Lesson 3.11: Assumptions and Diagnostics "This is the distance between the fitted values of the model with all the observations versus the fitted values of the model discarding the i-th observation from the data used to fit the model. "
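A minimal R sketch, assuming a hypothetical lm fit fit:
d <- cooks.distance(fit)
which(d > 4 / length(d))   # rule-of-thumb flag: investigate observations with D_i > 4/n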

The presence of certain types of outliers, such as influential points, can impact the statistical significance of some of the regression coefficients.

true See Lesson 3.11: Assumptions and diagnostics Outliers that are influential can impact the statistical significance of the beta parameters.

An example of a multiple linear regression model is Analysis of Variance (ANOVA).

true See Lesson 3.2 Basic Concepts "Earlier, we contrasted the simple linear regression model with the ANOVA model... Multiple linear regression is a generalization of both models."

If the residuals are not normally distributed, we can model the transformed response variable instead, where a common transformation for normality is the Box-Cox transformation.

true See Lesson 3.3.11: Assumptions and Diagnostics If the normality assumption does not hold, we can use a transformation that normalizes the response variable such as Box-Cox transformation.
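A minimal R sketch using MASS::boxcox, assuming a hypothetical lm fit fit:
library(MASS)
bc <- boxcox(fit)        # profiles the log-likelihood over a grid of lambda values
bc$x[which.max(bc$y)]    # lambda maximizing the likelihood; lambda = 0 corresponds to log(y)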

A linear regression model has high explanatory power if the coefficient of determination is close to 1.

true See Lesson 3.3.13: Model Evaluation and Multicollinearity If R^2 is close to 1, almost all of the variability in Y can be explained by the linear regression model; hence, the model has high explanatory power.

In the case of multiple linear regression, controlling variables are used to control for sample bias.

true See Lesson 3.4: Model Interpretation "Controlling variables can be used to control for bias selection in a sample."

For a given predicting variable, the corresponding estimated regression coefficient will likely be different in a conditional model versus a marginal model.

true See Lesson 3.4: Model Interpretation "Importantly, the estimated regression coefficients for the conditional and marginal relationships can be different, not only in magnitude but also in sign or direction of the relationship."

In multiple linear regression, the estimated regression coefficient corresponding to a quantitative predicting variable is interpreted as the estimated expected change in the response variable when there is a change of one unit in the corresponding predicting variable holding all other predictors fixed.

true See Lesson 3.4: Model Interpretation "The estimated value for one of the regression coefficient βi represents the estimated expected change in y associated with one unit of change in the corresponding predicting variable, Xi, holding all else in the model fixed."

A partial F-Test can be used to test whether the regression coefficients associated with a subset of the predicting variables in a multiple linear regression model are all equal to zero.

true See Lesson 3.7: Testing for Subsets of Regression Parameters We use the Partial F-test to test the null hypothesis that the regression coefficients associated to a subset of the predicting variables are all equal to zero. The alternative hypothesis is that at least one of these regression coefficients is not zero.

Simpson's Paradox occurs when a coefficient reverses its sign when used in a marginal versus a conditional model.

true Simpson's paradox: Reversal of an association when looking at a marginal relationship versus a conditional relationship.

Generalized linear models, like logistic regression, use a Wald test to determine the statistical significance of the coefficients.

true The coefficient estimates follow an approximate normal distribution and a z-test, also known as a Wald test, is used to determine their statistical significance.

For logistic regression, if the p-value of the deviance test for goodness-of-fit is large, then it suggests that the model is a good fit.

true The null hypothesis is that the model fits the data, so a large p-value suggests that the model is a good fit.

With Poisson regression, the variance of the response is not constant.

true V(Y|x_1,...x_p)=exp(beta_0 + beta_1 x_1 + ... + beta_p x_p)

Although there are no error terms in a logistic regression model using binary data with replications, we can still perform residual analysis.

true We can perform residual analysis on the Pearson residuals or the Deviance residuals.

The training risk is not an unbiased estimator of the prediction risk.

true The training risk is a biased estimator of the prediction risk.

When selecting variables for explanatory purpose, one might consider including predicting variables which are correlated if it would help answer your research hypothesis.

true When the objective is to explain the relationship to the response, one might consider including the predicting variables even if they are correlated.

What does it mean if 0 is NOT included in the CI?

we conclude that Bj IS statistically significant

What does it mean if 0 is included in the CI?

we conclude that Bj is NOT statistically significant

pairwise comparison

we estimate the difference in the means for a pair of groups (for example, mean_i and mean_j) as the difference between their corresponding sample means

If we add the bias squared and the variance,

we get Mean Squared Error (MSE)

What does a cluster of residuals mean on a residual plot?

we have correlated errors

When B1 is close to zero...

we interpret that there is not a significant association between the predicting variable x and the response variable y.

If the t-value is large...

we reject the null hypothesis and conclude that the coefficient is statistically significant.

Why do we lose 1 DF for s^2?

we replace mu with zbar

Why do we lose 2 DF for sigma^2?

we replaced two parameters, B0 and B1

Backward stepwise regression

we start with all predictors (the full model) and drop one predictor at a time.

Forward stepwise regression

we start with no predictors (or with a minimum model) and add one predictor at a time
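A minimal R sketch of both directions using step(), assuming hypothetical full and intercept-only models on a data frame dat:
full <- lm(y ~ ., data = dat)
min  <- lm(y ~ 1, data = dat)
step(full, direction = "backward")                        # drop one predictor at a time
step(min, scope = formula(full), direction = "forward")   # add one predictor at a time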

When do we reject the null hypothesis for the overall regression test in regards to the p value?

when the P-value is small, indicating that the overall regression has explanatory power.

A model selection method that can be performed in the R statistical software given a set of controlling factors that form the minimum or starting model.

Stepwise regression

Forward stepwise tends to result in smaller models than backward stepwise regression.

True

Given a model with p predicting variables, there are 2^p possible models to choose from.

True

If there is a group of correlated variables, lasso tends to select only one of the variables in the group.

True

If you had a model with multicollinearity that you wanted to preserve, you'd use ridge regression over lasso.

True

In cases where p>n, lasso selects only up to n variables.

True

In regularized regression, different lambdas result in different models.

True

In regularized regression, lambda is a constant.

True

The L1 penalty results in ridge regression.

False

The penalty for lasso regression is not a sparsity penalty.

False

The selected variables using variable selection approaches are the only variables that explain the response variable.

False

We need not be wary of over interpretation in MLR.

False

Variable selection methods is performed by

Balancing the bias-variance trade-off.

Why is training risk a biased estimate of prediction risk?

Because we use the data twice.

What does training risk correct for?

Bias

In GLMs the main reason one does not use LSE to estimate model parameters is the potential constraint on the parameters.

False - The potential constraint in the parameters of GLMs is handled by the link function.

The backward elimination requires a pre-set probability of type II error

False - Type I error

Can be used to explain variability in the response variable.

Explanatory variable

All regularized regression approaches will select the same model.

False

Forward stepwise is more computationally expensive than backward.

False

If a predicting variables is selected to be in the model, we conclude that the predicting variable has a causal relationship with the response variable.

False

In regularized regression, the bigger the lambda the smaller the complexity penalty.

False

It is always feasible to apply a model search for all possible combinations of the predicting variables.

False

LOOCV penalizes complexity more than Mallow's CP

False

Lasso regression has a closed-form expression for the estimated coefficients.

False

Mallows Cp works even when p>n.

False

Once lasso has identified the predicting variables, it is not necessary to use OLS to obtain the regression coefficients.

False

Ridge regression is used for variable selection.

False

The AIC and BIC cannot be used in selecting variables for generalized linear models.

False

It is possible for a selected model to include variables that are not statistically significant, even though that model provides the best prediction.

True

Models with many predictors have what?

Low bias, high variance.

L0 penalty is equivalent to searching through all the models.

True

The negative expected log-likelihood function is also known as what?

Prediction risk

Can be used to predict variability in the response regardless of their explanatory power.

Predictive variable

Mallows CP uses estimated variance based on what?

The full model

Forward stepwise regression adds one variable to the model at a time starting with a minimum model.

True

Alpha balances between ridge and lasso in elasticnet.

True

Complexity is equivalent to a large model with many predicting variables.

True

Cross validation is used to find the best lambda for regularized regression problems.

True

For BIC, we need to replace sigma^2 with an estimate from the submodel S.

True

For logistic regression and Poisson regression, the training risk is the sum of square deviances for the submodel S.

True

We cannot obtain prediction risk because of what?

We don't have future observations at the time of prediction.

