Ch.6 and 7 Quizzes MATH 457


(b) Ridge regression, relative to least squares, is: [more | less] flexible and hence will give improved prediction accuracy when its [increase | decrease] in bias is [more | less] than its [increase | decrease] in variance.

(b) Ridge regression, relative to least squares, is: less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

(c) Non-linear methods, relative to least squares, are: [more | less] flexible and hence will give improved prediction accuracy when their [increase | decrease] in bias is [more | less] than their [increase | decrease] in variance.

(c) Non-linear methods, relative to least squares, are: more flexible and hence will give improved prediction accuracy when their decrease in bias is more than their increase in variance.

(d) Repeat (a) for (squared) bias.

(iv) Steadily decrease: When s = 0, the model effectively predicts a constant, so the predictions are far from the actual values and the bias is high. As s increases, more β's become non-zero and the model fits the training data better and better, so the bias decreases.

(e) Repeat (a) for the irreducible error

(v) Remains constant: By definition, the irreducible error is model independent and hence remains constant irrespective of the choice of s.

(b) Does f have "knots"? If so, how many, and where are they?

1 knot at x=0

Let 1{𝑥≤𝑡} denote a function which is 1 if 𝑥≤𝑡 and 0 otherwise. Which of the following is a basis for linear splines with a knot at t? Select all that apply:

{1, x, (x−t)·1{x>t}}; {1, x, (x−t)·1{x≤t}}; {1, (x−t)·1{x≤t}, (x−t)·1{x>t}}. All three of these sets are valid bases.
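
To see why a set like {1, (x−t)·1{x≤t}, (x−t)·1{x>t}} works, note that \( a + b(x-t)1\{x \le t\} + c(x-t)1\{x > t\} \) is linear with slope b for x ≤ t, linear with slope c for x > t, and continuous at x = t (both pieces equal a there). Since the space of linear splines with one knot is 3-dimensional, these three independent functions form a basis; a similar check works for the other two sets.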

Examine the plot on pg 23. Assume that we wanted to select a model using the one-standard-error rule on the Cross-Validated error. What tree size would end up being selected?:

2

In order to perform Boosting, we need to select 3 parameters: number of samples B, tree depth d, and step size 𝜆. How many parameters do we need to select in order to perform Random Forests?:

2 (the number of trees B and the number of predictors m considered at each split)

You are trying to fit a model and are given p=30 predictor variables to choose from. Ultimately, you want your model to be interpretable, so you decide to use Best Subset Selection. How many models will you need to fit?

2^30

Using the decision tree on page 5 of the notes, what would you predict for the log salary of a player who has played for 4 years and has 150 hits?:

5.11. The player has played fewer than 4.5 years, so at the first split we follow the left branch. There are no further splits, so we predict 5.11.

For the model y ~ 1+x+x^2, what is the coefficient of x (to within 10%)?

77.7

3. What is the difference between a cubic spline and a natural cubic spline?

A natural cubic spline is required to be linear in the end-regions or "at the boundary". A cubic spline is not.

In terms of model complexity, which is more similar to a smoothing spline with 100 knots and 5 effective degrees of freedom?

A natural cubic spline with 5 knots

Why are natural cubic splines preferred to polynomials with the same number of degrees of freedom for most applications?

Because high-degree polynomials tend to display wild, rapidly changing behavior near the boundaries of the data, and this behavior is not desirable for most applications.

We perform best subset and forward stepwise selection on a single dataset. For both approaches, we obtain 𝑝+1 models, containing 0,1,2,...,𝑝 predictors. Which of the two models with 𝑘 predictors is guaranteed to have training RSS no larger than the other model?

Best Subset

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers: (a) Which of the three models with k predictors has the smallest training RSS?

Best subset selection has the smallest training RSS, because the other two methods build their models with a path dependence on which predictors are picked first as they iterate toward the kth model.

(b) Which of the three models with k predictors has the smallest test RSS?

Best subset selection may have the smallest test RSS because it considers more models than the other methods. However, the other methods might get lucky and pick a model that fits the test data better.

What is the value of the Cross-Entropy? Give your answer to the nearest hundredth (using log base e, as in R):

Cross-Entropy = -.64*ln(.64) - .36*ln(.36) = .6534, i.e. 0.65 to the nearest hundredth.
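
A quick check in R, using the class proportions 0.64 and 0.36 from the answer above:

p <- c(0.64, 0.36)
-sum(p * log(p))   # 0.6534, i.e. 0.65 to the nearest hundredth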

1. The textbook and EdX course spend a lot of time on "cubic splines". In principle, we could fit a regression model with a "quadratic spline" or a "quartic spline". Why do the authors choose cubic splines?

Cubic splines are popular because most human eyes cannot detect the discontinuity (in the third derivative) at the knots. Cubic (degree 3) is the smallest degree for which this is true.

True or False: In the GAM 𝑦∼𝑓1(𝑋1)+𝑓2(𝑋2)+𝑒, as we make 𝑓1 and 𝑓2 more and more complex we can approximate any regression function to arbitrary precision.

False. A GAM is additive, so no matter how complex f1 and f2 become it cannot capture interactions between X1 and X2 (e.g., a function like X1*X2).

We compute the principal components of our p predictor variables. The RSS in a simple linear regression of Y onto the largest principal component will always be no larger than the RSS in a simple regression of Y onto the second largest principal component. True or False? (You may want to watch 6.10 as well before answering - sorry!)

False. Principal components are found independently of Y, so we can't know the relationship with Y a priori.

How many would you fit using Forward Selection?

For Forward Selection, you fit the null model and then (p−k) models for each k = 0, ..., p−1. The expression for the total number of models fit is on pg 15 of the notes: p(p+1)/2 + 1.
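
A quick sanity check in R for p = 30, the setting of the Best Subset question above:

p <- 30
2^p                  # 1,073,741,824 models for Best Subset Selection
p * (p + 1) / 2 + 1  # 466 models for Forward Selection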

The tree building algorithm given on pg 13 is described as a Greedy Algorithm. Which of the following is also an example of a Greedy Algorithm?:

Forward Stepwise Selection is a Greedy Algorithm because at each step it selects the variable that improves the current model the most. There is no guarantee that the final result will be optimal.

What is the final classification under the average probability method?:

Green. The average of the probabilities is 0.45, so the average probability method will select green.

EdX Chapter 6: Which of the following modeling techniques performs Feature Selection?

Linear Regression with Forward Selection. Forward Selection chooses a subset of the predictor variables for the final model; the other three methods end up using all of the predictor variables.

Why are natural cubic splines typically preferred over global polynomials of degree d?

Polynomials tend to extrapolate very badly

Which of the following would be the worst metric to use to select \(\lambda\) in the Lasso?

RSS. The training RSS always decreases as λ decreases, so it would simply select the unpenalized fit at λ = 0.

Chapter 7: Which of the following can we add to linear models to capture nonlinear effects?

Spline terms, polynomial terms, interactions, and step functions (all of the above).

Imagine that you are doing cost complexity pruning as defined on page 18 of the notes. You fit two trees to the same data: 𝑇1 is fit at 𝛼=1 and 𝑇2 is fit at 𝛼=2. Which of the following is true?

T1 will have at least as many nodes as 𝑇2

You are doing a simulation in order to compare the effect of using Cross-Validation or a Validation set. For each iteration of the simulation, you generate new data and then use both Cross-Validation and a Validation set in order to determine the optimal number of predictors. Which of the following is most likely?

The Validation set method will result in a higher variance of optimal number of predictors

For parts (a) through (c), circle the right words to form a correct statement. (a) The lasso, relative to least squares, is: [more | less] flexible and hence will give improved prediction accuracy when its [increase | decrease] in bias is [more | less] than its [increase | decrease] in variance.

The lasso, relative to least squares, is: less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

You are trying to reproduce the results of the R labs, so you run the following command in R: > library(tree) As a response, you see the following error message: Error in library(tree) : there is no package called 'tree' What went wrong?

The tree package is not installed on your computer
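
The standard fix is to install the package once and then load it:

install.packages("tree")  # download and install the package from CRAN
library(tree)             # load it into the current session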

Suppose I have two qualitative predictor variables, each with three levels, and a quantitative response. I am considering fitting either a tree or an additive model. For the additive model, I will use a piecewise-constant function for each variable, with a separate constant for each level. Which model is capable of fitting a richer class of functions:

Tree. A sufficiently deep tree can assign a different prediction to each of the nine combinations of levels (i.e., it can capture interactions), while the additive model is restricted to a sum of one-variable functions.

Which of the following is NOT a benefit of the sparsity imposed by the Lasso?

Using the Lasso penalty helps to decrease the bias of the fits. Restricting ourselves to simpler models by including a Lasso penalty will generally decrease the variance of the fits at the cost of higher bias.

You have a bag of marbles with 64 red marbles and 36 blue marbles. What is the value of the Gini Index for that bag? Give your answer to the nearest hundredth:

Using the formula from pgs 25-26: Gini Index = .64*(1-.64) + .36*(1-.36) = .4608, i.e. 0.46 to the nearest hundredth.
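
The same computation in R:

p <- c(0.64, 0.36)
sum(p * (1 - p))   # 0.4608, i.e. 0.46 to the nearest hundredth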

You are working on a regression problem with many variables, so you decide to do Principal Components Analysis first and then fit the regression to the first 2 principal components. Which of the following would you expect to happen?:

Variance of fitted values will decrease relative to the full least squares model, since regressing on only the first 2 principal components is a more constrained (less flexible) fit.

Which of the two models with 𝑘 predictors has the smallest test RSS?

We know that Best Subset selection will always have the lowest training RSS (that is how it is defined). That said, we don't know which model will perform better on a test set.

6. In a GAM, does it make sense to include interaction terms? Why or why not?

Yes. There is no good way to approximate a function like X1*X2 with a function of the form f1(X1) + f2(X2).

3. Suppose we estimate the regression coefficients in a linear regression model by minimizing \( \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \) subject to \( \sum_{j=1}^{p} |\beta_j| \le s \) for a particular value of s. For parts (a) through (e), indicate which of the completions i. through v. of the statement is correct. (a) As we increase s from 0, the training RSS will:

a. (iv) Steadily decrease: As we increase s from 0, all of the β's grow in magnitude from 0 toward their least squares estimates. The training RSS is largest when all β's are 0 and it steadily decreases to the ordinary least squares RSS.

(b) Repeat (a) for test RSS.

b. (ii) Decrease initially, and then eventually start increasing in a U shape: When s = 0, all β's are 0, the model is extremely simple, and the test RSS is high. As we increase s, the β's take non-zero values and the model starts fitting the test data well, so the test RSS decreases. Eventually, as the β's approach their full OLS values, the model starts overfitting the training data and the test RSS increases.

(c) Repeat (a) for variance.

c. (iii) Steadily increase: When s = 0, the model effectively predicts a constant and has almost no variance. As we increase s, the model includes more non-zero β's and their values grow, becoming more and more dependent on the particular training data, so the variance increases.

One of the functions in the glmnet package is cv.glmnet(). This function, like many functions in R, will return a list object that contains various outputs of interest. What is the name of the component that contains a vector of the mean cross-validated errors?

cvm
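
A minimal sketch (the simulated x and y below are made up just to illustrate where cvm lives):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)  # simulated predictor matrix
y <- rnorm(100)                        # simulated response
cvfit <- cv.glmnet(x, y)               # cross-validated lasso path
cvfit$cvm                              # mean cross-validated error for each lambda
cvfit$lambda.min                       # lambda with the smallest cvm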

2. Suppose we define a function by \( f(x) = x^2 \) for \( x \ge 0 \) and \( f(x) = -x^2 \) for \( x < 0 \). (a) Is f a natural cubic spline? Why or why not?

No, because f'' is discontinuous at x = 0 (it jumps from −2 to 2), and a cubic spline is a piecewise degree-3 polynomial with continuous derivatives up to order 2. Moreover, a natural cubic spline is required to be linear at the boundary (in the end regions); here both regions are end regions, and x^2 and −x^2 are not linear on them.

Suppose we want to fit a generalized additive model (with a continuous response) for 𝑦 against 𝑋1 and 𝑋2. Suppose that we are using a cubic spline with four knots for each variable (so our model can be expressed as a linear regression after the right basis expansion). Suppose that we fit our model by the following three steps: 1) First fit our cubic spline model for 𝑦 against 𝑋1, obtaining the fit 𝑓̂ 1(𝑥) and residuals 𝑟𝑖=𝑦𝑖−𝑓̂ 1(𝑋𝑖,1). 2) Then, fit a cubic spline model for 𝑟 against 𝑋2 to obtain 𝑓̂ 2(𝑥). 3) Finally construct fitted values 𝑦̂ 𝑖=𝑓̂ 1(𝑋𝑖,1)+𝑓̂ 2(𝑋𝑖,2). Will we get the same fitted values as we would if we fit the additive model for 𝑦 against 𝑋1 and 𝑋2 jointly?

Not necessarily, even if X1 and X2 are uncorrelated. A single pass of this backfitting-style procedure is not guaranteed to reproduce the joint least squares fit, since the spline basis expansions of X1 and X2 need not be orthogonal.

Suppose we produce ten bootstrap samples from a data set containing red and green classes. We then apply a classification tree to each bootstrap sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 0.1,0.15,0.2,0.2,0.55,0.6,0.6,0.65,0.7, and 0.75 There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in the notes. The second approach is to classify based on the average probability. What is the final classification under the majority vote method?:

Red. Six of the 10 probabilities are greater than 1/2, so the majority vote method will select red.
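
Both combining rules can be checked directly in R:

probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
sum(probs > 0.5)  # 6 of 10 trees vote Red, so majority vote says Red
mean(probs)       # 0.45 < 0.5, so the average-probability rule says Green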

(b) For λ = 0, will ˆg1 or ˆg2 have the smaller training RSS?

For λ = 0 the penalty terms vanish, so the two loss functions are identical and the training RSS will be the same.

Consider two curves, \( \hat{g}_1 \) and \( \hat{g}_2 \), defined by \( \hat{g}_1 = \arg\min_g \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int \left[ g^{(3)}(x) \right]^2 dx \) and \( \hat{g}_2 = \arg\min_g \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int \left[ g^{(4)}(x) \right]^2 dx \). (a) As λ → ∞, will \( \hat{g}_1 \) or \( \hat{g}_2 \) have the smaller training RSS?

As λ → ∞, the penalty forces \( \hat{g}_1 \) to be quadratic (zero third derivative) and \( \hat{g}_2 \) to be cubic (zero fourth derivative). \( \hat{g}_2 \) is therefore more flexible and will have the smaller training RSS.

\( \sqrt{\sum_{j=1}^{p} \beta_j^2} \) is equivalent to

\( \|\beta\|_2 \)

Load the data from the file 7.R.RData, and plot it using plot(x,y). What is the slope coefficient in a linear regression of y on x (to within 10%)?

−0.6748. The slope is negative for most of the data, and the coefficient reflects that.
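
A sketch of the commands, assuming 7.R.RData is in the working directory and defines x and y as in the quiz (and assuming the quadratic-model question above refers to the same data):

load("7.R.RData")
plot(x, y)
coef(lm(y ~ x))           # slope of the linear fit, roughly -0.67
coef(lm(y ~ x + I(x^2)))  # coefficient of x in the y ~ 1 + x + x^2 model, roughly 77.7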

You are fitting a linear model to data assumed to have Gaussian errors. The model has up to p=5 predictors and n=100 observations. Which of the following is most likely true of the relationship between Cp and AIC in terms of using the statistic to select a number of predictors to include?

𝐶𝑝 will select the same model as 𝐴𝐼𝐶
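
One way to see this is from the forms of the two criteria for a least squares model with d predictors (as given in the textbook), where \( \hat\sigma^2 \) is the estimated error variance:

\( C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat\sigma^2\right), \qquad \mathrm{AIC} = \frac{1}{n\hat\sigma^2}\left(\mathrm{RSS} + 2d\hat\sigma^2\right) \)

Since AIC is just Cp divided by the constant \( \hat\sigma^2 \), the two criteria are minimized by the same model.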

You perform ridge regression on a problem where your third predictor, x3, is measured in dollars. You decide to refit the model after changing x3 to be measured in cents. Which of the following is true?:

𝛽̂3 and 𝑦̂ will both change. Unlike least squares, ridge regression is not equivariant under rescaling of a predictor: measuring x3 in cents instead of dollars changes how strongly its coefficient is penalized, so both the coefficient and the fitted values change. (This is why predictors are usually standardized before ridge regression.)

