Math 457 Final Study


True or False: In the GAM 𝑦∼𝑓1(𝑋1)+𝑓2(𝑋2)+𝑒, as we make 𝑓1 and 𝑓2 more and more complex we can approximate any regression function to arbitrary precision.

False. A GAM is additive in X1 and X2; no matter how flexible f1 and f2 are, it cannot capture an interaction between the predictors, so it cannot approximate an arbitrary regression function to arbitrary precision.

Complete

Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

The predict() function can be used for this purpose

in order to properly evaluate the performance of a classification tree on these data, we must estimate the test error rather than simply computing the training error. We split the observations into a training set and a test set, build the tree using the training set, and evaluate its performance on the test data.

Logistic regression involves directly modeling

Pr(Y = k|X = x) using the logistic function; for the case of two response classes the model is p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

dimension reduction

class of approaches that transform the predictors and then fit a least squares model using the transformed variables

poly()

command allows us to avoid having to write out a long formula with powers of age
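For example (a minimal sketch, assuming the ISLR Wage data used in the text's lab):

library(ISLR)
fit <- lm(wage ~ poly(age, 4), data = Wage)  # degree-4 polynomial in age, no need to write age + I(age^2) + ...
coef(summary(fit))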

points()

command works like the plot() command, except that it puts points on a plot that has already been created, instead of creating a new plot

The four most common types of linkage

complete, average, single, and centroid

cumsum()

computes the cumulative sum of the elements of a numeric vector
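For example:

cumsum(c(1, 4, 2, 5))  # returns 1 5 7 12

In the PCA lab this is used to compute the cumulative proportion of variance explained.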

The output from prcomp()

contains a number of useful quantities.
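A minimal sketch of inspecting that output, assuming the USArrests data from the text's PCA lab:

pr.out <- prcomp(USArrests, scale = TRUE)
names(pr.out)                              # "sdev" "rotation" "center" "scale" "x"
pve <- pr.out$sdev^2 / sum(pr.out$sdev^2)  # proportion of variance explained
cumsum(pve)                                # cumulative PVE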

The tuning parameter λ

controls the roughness of the smoothing spline, and hence the effective degrees of freedom. It is possible to show that as λ increases from 0 to ∞, the effective degrees of freedom, which we write df_λ, decrease from n to 2.

Instead of using the anova() function, we could have obtained these p-values more succinctly by exploiting the fact that poly()

creates orthogonal polynomials.

Step functions

cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function

One of the functions in the glmnet package is cv.glmnet(). This function, like many functions in R, will return a list object that contains various outputs of interest. What is the name of the component that contains a vector of the mean cross-validated errors?

cvm
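A minimal sketch, assuming x is a numeric predictor matrix and y a response vector (both hypothetical here):

library(glmnet)
cv.out <- cv.glmnet(x, y, alpha = 0)  # 10-fold cross-validation for ridge regression
cv.out$cvm                            # mean cross-validated error, one value per lambda
cv.out$lambda.min                     # the lambda with the smallest cvm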

Suppose we a data set where each data point represents a single student's scores on a math test, a physics test, a reading comprehension test, and a vocabulary test. We find the first two principal components, which capture 90% of the variability in the data, and interpret their loadings. We conclude that the first principal component represents overall academic ability, and the second represents a contrast between quantitative ability and verbal ability. What loadings below would be consistent with that interpretation? Choose all that apply. Explain why the answers you do not choose are incorrect. (a) (0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0) (b) (0.5, 0.5, 0.5, 0.5) and (0, 0, 0.71, 0.71) (c) (0.71, 0.71, 0, 0) and (−0.71, 0.71, 0, 0) (d) (0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, −0.5, −0.5) (e) (0.71, 0, −0.71, 0) and (0.71, 0, 0.71, 0) (f) (0.5, 0.5, 0.5, 0.5) and (−0.5, −0.5, 0.5, 0.5)

d and f, because they are the same up to a change of sign and are consistent with the stated interpretation. In c and e the first loading vector involves only two of the four tests, so it cannot represent overall academic ability; in a and b the second loading vector does not represent a contrast between quantitative and verbal scores. In d and f the second PC does represent such a contrast.

The linear discriminant analysis (LDA) method

approximates the Bayes classifier by plugging estimates for π_k, μ_k, and σ² into the Bayes classifier

You are analyzing a dataset where each observation is an age, height, length, and width of a particular turtle. You want to know if the data can be well described by fewer than four dimensions (maybe for plotting), so you decide to do Principal Component Analysis. Which of the following is most likely to be the loadings of the first Principal Component? Make sure to explain why the answers you do not choose are not likely to be correct. (a) (1, −1, 1, −1) (b) (0.71, −0.71, 0, 0) (c) (1, 1, 1, 1) (d) (0.5, 0.5, 0.5, 0.5)

d. A loading vector must have squared loadings that sum to 1, which rules out a and c, and since a turtle's age, height, length, and width should all be positively correlated, the first principal component should weight all four variables with the same sign, which rules out b (which also involves only two of the variables).

The linear assumption states that the change in the response Y

due to a one-unit change in Xj is constant, regardless of the value of Xj .

The text, on p. 318, remarks that the probability of an out-of-bag data point is about 1/3. Why is this so?

Each bagged tree is fit to a bootstrap sample drawn with replacement, and the probability that a given observation never appears in n draws is (1 − 1/n)^n ≈ e^(−1) ≈ 1/3. So each bagged tree makes use of around two-thirds of the observations, and the remaining one-third are that tree's out-of-bag (OOB) observations.

average of many least squares lines

each estimated from a separate data set, is pretty close to the true population regression line.

residual—

e_i = y_i − ŷ_i

scale=0 argument to biplot()

ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.

Finding the values of j and s that minimize (8.3) can be done quite quickly,

especially when the number of features p is not too large

Polynomial regression

extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses three variables, X, X², and X³, as predictors. This approach provides a simple way to provide a non-linear fit to data

lm()

fit linear models

Remedy when a piecewise polynomial is too flexible

fit the piecewise polynomial under the constraint that the fitted curve must be continuous

ROC curves are useful for comparing different classifiers, since they take into account all possible thresholds. It turns out that the ROC curve for the logistic regression model

fit to these data is virtually indistinguishable from this one for the LDA model, so we do not display it here

density function

f_k(x) ≡ Pr(X = x|Y = k) denotes the density function of X for an observation that comes from the kth class. Technically this definition is only correct if X is a discrete random variable. If X is continuous, then f_k(x)dx corresponds to the probability of X falling in a small region dx around x

The AIC criterion is defined

for a large class of models fit by maximum likelihood.

The ROC curve (receiver operating characteristics)

is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.

In fact, using fourth-order polynomials would likely lead to good test set performance, as the true test error rate is approximately the same

for third, fourth, fifth, and sixth-order polynomials

Consequently, best subset selection becomes computationally infeasible

for values of p greater than around 40

the validation set and cross-validation methods have an advantage relative to AIC, BIC, Cp, and adjusted R2,

in that it provides a direct estimate of the test error, and makes fewer assumptions about the true underlying model.

Since we plan to assign an observation in a given region to the most commonly occurring class of training observations

in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class: E = 1 − max_k(p̂_mk).

sample()

function to randomly select 100 observations from the range 1 to 100, with replacement.

sample()

function to split the set of observations into two halves,
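A minimal sketch of both uses, assuming the ISLR Auto data for the split:

library(ISLR)
set.seed(1)
sample(1:100, 100, replace = TRUE)             # 100 draws from 1:100, with replacement
train <- sample(1:nrow(Auto), nrow(Auto) / 2)  # indices for a 50/50 train/test split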

image()

function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value

boot()

function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.
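A minimal sketch, bootstrapping the mean of a hypothetical numeric vector v:

library(boot)
v <- rnorm(100)                                     # stand-in data
mean.fn <- function(data, index) mean(data[index])  # statistic recomputed on each bootstrap sample
boot(v, mean.fn, R = 1000)                          # 1000 bootstrap replicates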

The idea behind K-means clustering is that a good clustering

is one for which the within-cluster variation is as small as possible

When λ → ∞,

g will be perfectly smooth: it will just be a straight line that passes as closely as possible to the training points. In fact, in this case, g will be the linear least squares line, since the loss function then amounts to minimizing the residual sum of squares.

glmnet()

The glmnet package provides functions for ridge regression and the lasso. The main function in this package is glmnet(), which can be used to fit ridge regression models, lasso models, and more. This function has slightly different syntax from other model-fitting functions that we have encountered thus far. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ∼ x syntax
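A minimal sketch of that syntax, assuming the ISLR Hitters data from the lab:

library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]  # predictor matrix, dropping the intercept column
y <- Hitters$Salary
ridge.mod <- glmnet(x, y, alpha = 0)          # alpha = 0 gives ridge regression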

linear spline

general definition of a degree-d spline is that it is a piecewise degree-d polynomial, with continuity in derivatives up to degree d − 1 at each knot.

Cost complexity pruning, also known as weakest link pruning, helps select subtrees for consideration

Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α.

summary(lm.fit)$r.sq

gives us the R2,

A truncated power basis function

h(x, ξ) = (x − ξ)³₊, which equals (x − ξ)³ if x > ξ and 0 otherwise, where ξ is the knot
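In R this basis function could be written as (a small sketch):

h <- function(x, xi) pmax(x - xi, 0)^3  # equals (x - xi)^3 when x > xi, 0 otherwise
h(c(1, 2, 3), 2)                        # returns 0 0 1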

Setting scale=TRUE

has the effect of standardizing each predictor,

Drawback of LOOCV

has the potential to be expensive to implement, since the model has to be fit n times. This can be very time consuming if n is large, and if each individual model is slow to fit

We interpret βj as the average effect on Y of a one unit increase in Xj

holding all other predictors fixed. In the advertising example,

As more principal components are used

in the regression model, the bias decreases, but the variance increases.

This is in contrast to some other supervised and unsupervised learning techniques, such as linear regression,

in which scaling the variables has no effect.

hybrid versions of forward and backward stepwise selection are available,

in which variables are added to the model sequentially. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection

glm() function using family="binomial"

in order to fit a polynomial logistic regression model.
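A minimal sketch, assuming the Wage data, where the response is whether wage exceeds 250 (as in the text's lab):

library(ISLR)
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = "binomial")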

Consequently, the MSE drops considerably as λ

increases from 0 to 10. Beyond this point, the decrease in variance due to increasing λ slows, and the shrinkage on the coefficients causes them to be significantly underestimated, resulting in a large increase in the bias.

-train

index below selects only the observations that are not in the training set.

The test error

is the average error that results from using a statistical learning method to predict the response on a new observation— that is, a measurement that was not used in training the method.

where ˆyR1

is the mean response for the training observations in R1(j, s),

yR2

is the mean response for the training observations in R2(j, s)

The estimated pointwise standard error of ˆf(x0)

is the square-root of this variance.

lm(y∼x1+x2+x3)

is used to fit a model with three predictors, x1, x2, and x3.

Classification Tree

is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.

Ridge regression

is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity.

While the supervised dimension reduction of PLS can reduce bias,

it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash

GAMs for classification

log( p(X) / (1 − p(X)) ) = β0 + f1(X1) + f2(X2) + · · · + fp(Xp)

log-odds or logit

log( p(X) / (1 − p(X)) ) = β0 + β1X

Rather than modeling this response Y directly

logistic regression models the probability that Y belongs to a particular category

Textbook Exercises This problem involves hyperplanes in two dimensions. (a) Sketch the hyperplane 1 + 3X1 − X2 = 0. Indicate the set of points for which 1+3X1 −X2 > 0, as well as the set of points for which 1 + 3X1 − X2 < 0. (b) On the same plot, sketch the hyperplane −2 + X1 + 2X2 = 0. Indicate the set of points for which −2 + X1 + 2X2 > 0, as well as the set of points for which −2 + X1 + 2X2 < 0.

The hyperplane 1 + 3X1 − X2 = 0 is the line X2 = 1 + 3X1. Points below the line (X2 < 1 + 3X1) satisfy 1 + 3X1 − X2 > 0, and points above it satisfy 1 + 3X1 − X2 < 0. The hyperplane −2 + X1 + 2X2 = 0 is the line X2 = (2 − X1)/2; points above it satisfy −2 + X1 + 2X2 > 0, and points below it satisfy −2 + X1 + 2X2 < 0.

We have seen that in p = 2 dimensions, a linear decision boundary takes the form β0 +β1X1 +β2X2 = 0. We now investigate a non-linear decision boundary. (a) Sketch the curve (1 + X1)2 + (2 − X2)2 = 4. (b) On your sketch, indicate the set of points for which (1+X1)2 +(2−X2)2 >4, as well as the set of points for which (1+X1)2 +(2−X2)2 ≤4. (c) Suppose that a classifier assigns an observation to the blue class if (1+X1)2 +(2−X2)2 >4,and to the red class otherwise. To what class is the observation (0, 0) classified? (−1, 1)? (2, 2)? (3, 8)? (d) Argue that while the decision boundary in (c) is not linear in terms of X1 and X2, it is linear in terms of X1, X12, X2, and 2 X2.

(a, b) The curve (1 + X1)² + (2 − X2)² = 4 is a circle of radius 2 centered at (−1, 2); points outside the circle satisfy (1 + X1)² + (2 − X2)² > 4, and points inside or on it satisfy ≤ 4. (c) (0, 0) gives 1 + 4 = 5 > 4, so blue; (−1, 1) gives 0 + 1 = 1 ≤ 4, so red; (2, 2) gives 9 + 0 = 9 > 4, so blue; (3, 8) gives 16 + 36 = 52 > 4, so blue. (d) Expanding the boundary gives 1 + 2X1 + X1² + 4 − 4X2 + X2² = 4, which is linear in the variables X1, X1², X2, and X2², even though it is not linear in X1 and X2.

PCA

looks to find a low-dimensional representation of the observations that explain a good fraction of the variance;

Clustering

looks to find homogeneous subgroups among the observations.

However, an alternative interpretation for principal components can also be useful: principal components provide

low-dimensional linear surfaces that are closest to the observations. We expand upon that interpretation here

bias

refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model

Principal component analysis (PCA)

refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features X1, X2,...,Xp, and no associated response Y .

We see that the bagging test error rate is slightly lower in this case

than the test error rate obtained from a single tree.

In this sense, PCR is more closely related to ridge regression

than to the lasso.

The PCR approach that we just described involves identifying linear combinations, or directions,

that best represent the predictors X1,...,Xp

suggest that the model that includes the interaction term is superior to the model

that contains only main effects

The goal is to find boxes R1,...,RJ

that minimize the RSS

Therefore, while PCR often performs quite well in many practical settings, it does not result in the development of a model

that relies upon a small set of the original features

The comparative simplicity of selecting the number of principal components for a supervised analysis is one manifestation of the fact

that supervised analyses tend to be more clearly defined and more objectively evaluated than unsupervised analyses

In the case of p > 1 predictors, the LDA classifier assumes

that the observations in the kth class are drawn from a multivariate Gaussian distribution

center and scale components correspond to the means and standard deviations of the variables

that were used for scaling prior to implementing PCA.

As a result, the Bayes decision boundary is linear and is accurately approximated by

the LDA decision boundary. The QDA decision boundary is inferior, because it suffers from higher variance without a corresponding decrease in bias.

if ˆyi is very far from yi for one or more observations

the RSE may be quite large, indicating that the model doesn't fit the data well

I()

wrapper is needed since the ^ symbol has a special meaning in formulas
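A minimal sketch, assuming the Boston data from the MASS package:

library(MASS)
fit <- lm(medv ~ lstat + I(lstat^2), data = Boston)  # I() is needed so ^ means squaring rather than a formula operator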

cv.glmnet()

the built-in cross-validation function

The main difference between bagging and random forests is

the choice of predictor subset size m

K-means clustering derives its name from

the fact that the cluster centroids are computed as the mean of the observations assigned to each cluster.

However, often the choice of where to cut

the dendrogram is not so clear.

PCA is a technique for reducing

the dimension of a n × p data matrix X

Dimension reduction serves to constrain

the estimated βj coefficients,

We could do the same for LDA. If we added all possible quadratic terms and cross-products to LDA,

the form of the model would be the same as the QDA model, although the parameter estimates would be different. This device allows us to move somewhere between an LDA and a QDA model

relationship between the response and the predictors is close to linear,

the least squares estimates will have low bias but may have high variance.

(b) For λ = 0, will ˆg1 or ˆg2 have the smaller training RSS?

the loss functions are identical so the training RSS will be the same

curse of dimensionality

the more attributes there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor

The coefficients β0 and β1 in (4.2) are unknown

the more general method of maximum likelihood is preferred, since it has better statistical properties.

The higher the ratio of parameters p to number of samples n,

the more we expect this overfitting to play a role

One potential disadvantage of K-means clustering is that it requires us to pre-specify

the number of clusters K.

Note that as the extent of non-linearity increases,

there is little change in the test set MSE for the non-parametric KNN method, but there is a large increase in the test set MSE of linear regression.

PCR suffers from a drawback:

there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response

varying the classifier threshold changes its true positive and false positive rate

these are called sensitivity and one minus the specificity of our classifier

If we use the sample mean ˆμ to estimate μ,

this estimate is unbiased

In order to fit a lasso model, we once again use the glmnet() function; however,

this time we use the argument alpha=1.
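A minimal sketch, assuming the Hitters data set up as in the ridge example:

library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
lasso.mod <- glmnet(x, y, alpha = 1)                              # alpha = 1 gives the lasso
cv.out <- cv.glmnet(x, y, alpha = 1)                              # choose lambda by cross-validation
predict(lasso.mod, s = cv.out$lambda.min, type = "coefficients")  # many coefficients are exactly zero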

However, it can still be challenging to determine what is a good R2 value

this will depend on the application

The mathematical definition of a hyperplane is quite simple. In two dimensions, a hyperplane is defined by the equation

β0 + β1X1 + β2X2 = 0 can be easily extended to the p-dimensional setting: β0 + β1X1 + β2X2 + ... + βpXp = 0

The resulting validation set error rate—

typically assessed using MSE in the case of a quantitative response—provides an estimate of the test error rate

√( ∑_{j=1}^{p} β_j² ) is equivalent to

‖𝛽‖2

2. Suppose we define a function by f(x) = ( x^2 x ≥ 0 −x^2 x < 0} (a) Is f a natural cubic spline? Why or why not?

No. A cubic spline is a piecewise degree-3 polynomial with continuous derivatives up to order 2, but here f″ jumps from −2 to 2 at x = 0, so f″ is discontinuous there. In addition, a natural cubic spline is required to be linear at the boundary (the end regions); both of the two regions here are end regions, and x² and −x² are not linear.

(b) Ridge regression, relative to least squares, is: [more | less] flexible and hence will give improved prediction accuracy when its [increase | decrease] in bias is [more | less] than its [increase | decrease] in variance.

(b) Ridge regression, relative to least squares, is: less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

(c) Non-linear methods, relative to least squares, are: [more | less] flexible and hence will give improved prediction accuracy when their [increase | decrease] in bias is [more | less] than their [increase | decrease] in variance.

(c) Non-linear methods, relative to least squares, are: more flexible and hence will give improved prediction accuracy when their decrease in bias is more than their increase in variance.

(d) Repeat (a) for (squared) bias.

(iv) Steadily decrease: when s = 0, the model effectively predicts a constant, so the predictions are far from the actual values and the bias is high. As s increases, more of the β's become non-zero and the model fits the training data better, so the bias decreases.

Backward selection requires that the number of samples n is larger than the number of variables p

(so that the full model can be fit)

(e) Repeat (a) for the irreducible error

(v) Remains constant: by definition, the irreducible error is independent of the model, and hence remains constant irrespective of the choice of s.

fix()

function can be used to view a data set in a spreadsheet-like window.

as.factor()

function converts quantitative variables into qualitative variables.

When a qualitative predictor has more than two levels

, a single dummy variable cannot represent all possible values

usually compute the training MSE with relative ease

, but estimating test MSE is considerably more difficult because usually no test data are available

R2 statistic has an interpretational advantage over the RSE

, it always lies between 0 and 1

While it somewhat underestimates the error rate

, it reaches a minimum when fourth-order polynomials are used, which is very close to the minimum of the test curve, which occurs when third-order polynomials are used

bs() also has a degree argument,

so we can fit splines of any degree, rather than the default degree of 3 (which yields a cubic spline).
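A minimal sketch, assuming the Wage data:

library(ISLR); library(splines)
fit.cubic  <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)  # default degree = 3: cubic spline
fit.linear <- lm(wage ~ bs(age, degree = 1, df = 4), data = Wage)     # degree = 1: linear spline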

On the other hand, if we compute principal components for use in a supervised analysis

, such as the principal components regression presented in Section 6.3.1, then there is a simple and objective way to determine how many principal components to use: we can treat the number of principal component score vectors to be used in the regression as a tuning parameter to be selected via cross-validation or a related approach

though it is possible to perfectly fit the training data in the high-dimensional setting,

, the resulting linear model will perform extremely poorly on an independent test set, and therefore does not constitute a useful model

Function 8.4

For each value of α there corresponds a subtree T ⊂ T0 such that the quantity ∑_{m=1}^{|T|} ∑_{i: x_i ∈ R_m} (y_i − ŷ_Rm)² + α|T| is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, R_m is the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node, and ŷ_Rm is the predicted response associated with R_m—that is, the mean of the training observations in R_m. The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data. When α = 0, the subtree T will simply equal T0, because then (8.4) just measures the training error. However, as α increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for a smaller subtree. Equation (8.4) is reminiscent of the lasso (6.7) from Chapter 6, in which a similar formulation was used in order to control the complexity of a linear model.

we can find the value of λ that makes the cross-validated RSS as small as possible

It turns out that the leave-one-out cross-validation (LOOCV) error can be computed very efficiently for smoothing splines, with essentially the same cost as computing a single fit,

Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results.

LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvements over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.

There are two common ways to randomly split a data set.

The first is to produce a random vector of TRUE, FALSE elements and select the observations corresponding to TRUE for the training data. The second is to randomly choose a subset of numbers between 1 and n; these can then be used as the indices for the training observations.

Recall that we obtain the ROC curve by classifying test points based on whether 𝑓̂ (𝑥)>𝑡, and varying t. How large is the AUC (area under the ROC curve) for a classifier based on a completely random function 𝑓̂ (𝑥) (that is, one for which the orderings of the 𝑓̂ (𝑥𝑖) are completely random)?

0.5. If f̂(x) is completely random, then f̂(x_i) (and therefore the prediction for y_i) has nothing to do with y_i. Thus, the true positive rate and the false positive rate are both equal to the overall positive rate, and the ROC curve hugs the 45-degree line.

Suppose we a data set where each data point represents a single student's scores on a math test, a physics test, a reading comprehension test, and a vocabulary test. We find the first two principal components, which capture 90% of the variability in the data, and interpret their loadings. We conclude that the first principal component represents overall academic ability, and the second represents a contrast between quantitative ability and verbal ability. What loadings would be consistent with that interpretation? Choose all that apply.

(0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, -0.5, -0.5); (0.5, 0.5, 0.5, 0.5) and (-0.5, -0.5, 0.5, 0.5). For the first two choices, the two loading vectors are not orthogonal. For the fifth and sixth choices, the first set of loadings only has to do with two specific tests. For the third and fourth pairs of loadings, the first component is proportional to the average score, and the second component measures the difference between the first pair of scores and the second pair of scores.

(b) Does f have "knots"? If so, how many, and where are they?

1 knot at x=0

Let 1{𝑥≤𝑡} denote a function which is 1 if 𝑥≤𝑡 and 0 otherwise. Which of the following is a basis for linear splines with a knot at t? Select all that apply:

{1, 𝑥, (𝑥−𝑡)1{𝑥>𝑡}}; {1, 𝑥, (𝑥−𝑡)1{𝑥≤𝑡}}; {1, (𝑥−𝑡)1{𝑥≤𝑡}, (𝑥−𝑡)1{𝑥>𝑡}}

Parametric methods involve a two-step model-based approach.

1. First, we make an assumption about the functional form, or shape, of f. 2. After a model has been selected, we need a procedure that uses the training data to fit or train the model. The most common approach to fitting the model (2.4) is referred to as (ordinary) least squares. The model-based approach just described is referred to as parametric; it reduces the problem of estimating f down to a set of parameters.

Boosting has three tuning parameters:

1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B. 2. The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small λ can require using a very large value of B in order to achieve good performance. 3. The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, in which case each tree is a stump, consisting of a single split. In this case, the boosted stump ensemble is fitting an additive model, since each term involves only a single variable. More generally d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables

In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches:

1. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting. 2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach

Process of building a regression tree

1. We divide the predictor space—that is, the set of possible values for X1, X2,...,Xp—into J distinct and non-overlapping regions, R1, R2,...,RJ . 2. For every observation that falls into the region Rj , we make the same prediction, which is simply the mean of the response values for the training observations in Rj .

For the model y ~ 1+x+x^2, what is the coefficient of x (to within 10%)?

77.7

Examine the plot on pg 23. Assume that we wanted to select a model using the one-standard-error rule on the Cross-Validated error. What tree size would end up being selected?:

2

In order to perform Boosting, we need to select 3 parameters: number of samples B, tree depth d, and step size 𝜆. How many parameters do we need to select in order to perform Random Forests?:

2

You are trying to fit a model and are given p=30 predictor variables to choose from. Ultimately, you want your model to be interpretable, so you decide to use Best Subset Selection.

2^30

Using the decision tree on page 5 of the notes, what would you predict for the log salary of a player who has played for 4 years and has 150 hits?:

5.11 The player has played less than 4.5 years, so at the first split we follow the left branch. There are no further splits, so we predict 5.11.

Model Interpretability

It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model. By removing these variables—that is, by setting the corresponding coefficient estimates to zero—we can obtain a model that is more easily interpreted

the second derivative of a function is a measure of its roughness:

it is large in absolute value if g(t) is very wiggly near t, and it is close to zero otherwise.

yi = β0 + β1xi1 + β2xi2 + ··· + βpxip + ei

= β0 + f1(xi1) + f2(xi2) + ··· + fp(xip) + ei.

We now show that one can view ridge regression and the lasso through a Bayesian lens.

A Bayesian viewpoint for regression assumes that the coefficient vector β has some prior distribution, say p(β), where β = (β0, β1,...,βp)T . The likelihood of the data can be written as f(Y |X, β), where X = (X1,...,Xp).

we replace it with a generalization of the inner product of the form K(x_i, x_i′) (9.20), where K is some function that we will refer to as a kernel

A kernel is a function that quantifies the similarity of two observations.

3. What is the difference between a cubic spline and a natural cubic spline?

A natural cubic spline is required to be linear in the end-regions or "at the boundary". A cubic spline is not.

In terms of model complexity, which is more similar to a smoothing spline with 100 knots and 5 effective degrees of freedom?

A natural cubic spline with 5 knots

In general, the optimal value for K will depend on the bias-variance tradeoff,

A small value for K provides the most flexible fit, which will have low bias but high variance. This variance is due to the fact that the prediction in a given region is entirely dependent on just one observation. In contrast, larger values of K provide a smoother and less variable fit; the prediction in a region is an average of several points, and so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in f(X).

AIC

AIC = (1 / (n σ̂²)) (RSS + 2 d σ̂²)

one might conclude that a fair amount of variance is explained by the first two principal components, and that there is an elbow after the second component.

After all, the third principal component explains less than ten percent of the variance in the data, and the fourth principal component explains less than half that and so is essentially worthless.

the area under the (ROC) curve (AUC)

An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier

Outliers

An outlier is a point for which y_i is far from the value predicted by the model. Solution: if we believe that an outlier has occurred due to an error in data collection or recording, then one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a deficiency with the model, such as a missing predictor.

Bias-Variance Trade-Off for k-Fold Cross-Validation

Performing k-fold CV for, say, k = 5 or k = 10 leads to an intermediate level of bias, since each training set contains (k − 1)n/k observations—fewer than in the LOOCV approach, but substantially more than in the validation set approach. Therefore, from the perspective of bias reduction, LOOCV is to be preferred to k-fold CV. However, the test error estimate resulting from LOOCV tends to have higher variance than the estimate resulting from k-fold CV, so there is a bias-variance trade-off associated with the choice of k. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

Non-constant Variance of Error Terms

Another important assumption of the linear regression model is that the error terms have a constant variance, Var(ε_i) = σ². The standard errors, confidence intervals, and hypothesis tests associated with the linear model rely upon this assumption. Unfortunately, it is often the case that the variances of the error terms are non-constant; this is known as heteroscedasticity. Solution: a simple remedy is to fit the model by weighted least squares, with weights proportional to the inverse variances (e.g. w_i = n_i when the ith response is an average of n_i raw observations). Most linear regression software allows for observation weights.

In general, the phenomenon seen in Figure 4.3 is known as confounding

As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors.

Why might we want to use another fitting procedure instead of least squares?

As we will see, alternative fitting procedures can yield better prediction accuracy and model interpretability.

Average, complete, and single linkage are most popular among statisticians

Average and complete linkage are generally preferred over single linkage, as they tend to yield more balanced dendrograms.

BIC is derived from a Bayesian point of view, but ends up looking similar to Cp (and AIC) as well.

BIC = (1 / (n σ̂²)) (RSS + log(n) d σ̂²).

It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive model: that is, a model of the form f(X) = p sum j=1 fj (Xj ). Explain why this is the case. You can begin with (8.12) in Algorithm 8.2

Based on Algorithm 8.2, each stump consists of a single split on a single variable, so each fitted tree f̂^b added to the model is a function of just one predictor. Boosting fits each new stump to the residuals of the current model, and the final boosted model is the sum of these B single-variable functions. Collecting the stumps that split on the same variable X_j into a single function f_j(X_j) shows that the model has the additive form f(X) = ∑_{j=1}^{p} f_j(X_j).

Why are natural cubic splines preferred to polynomials with the same number of degrees of freedom for most applications?

Because polynomials of high degree are going to display radical changes at the ends. This wild behavior is not desirable for most applications

What is "boosting" and how is it applied to bagged trees? That is, what is the difference between the standard bagged trees algorithm and the boosting version?

Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set

"Bagging" in the statistical learning context is a portmanteau of two words. What are the two words?

Bootstrap and aggregating

We perform best subset and forward stepwise selection on a single dataset. For both approaches, we obtain 𝑝+1 models, containing 0,1,2,...,𝑝 predictors. Which of the two models with 𝑘 predictors is guaranteed to have training RSS no larger than the other model?

Best Subset

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers: (a) Which of the three models with k predictors has the smallest training RSS?

Best subset selection has the smallest training RSS because the other two methods determine models with a path dependency on which predictors they pick first as they iterate to the k'th model.

(b) Which of the three models with k predictors has the smallest test RSS?

Best subset selection may have the smallest test RSS because it considers more models than the other methods. However, the other methods might have better luck picking a model that fits the test data better.

At the least squares coefficient estimates, which correspond to ridge regression with λ = 0, the variance is high but there is no bias.

But as λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a slight increase in bias.

We can try to address this problem by choosing flexible models that can fit many different possible functional forms flexible for f.

But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely

MSE

But in general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.

In a real life situation in which the true relationship is unknown, one might draw the conclusion that KNN should be favored over linear regression

But in reality, even when the true relationship is highly non-linear, KNN may still provide inferior results to linear regression. That comparison used only p = 1 predictor; in higher dimensions, KNN often performs worse than linear regression.

the linear regression model assumes a linear relationship between the response and predictors.

But in some cases, the true relationship between the response and the predictors may be nonlinear. Here we present a very simple way to directly extend the linear model to accommodate non-linear relationships, using polynomial regression

In contrast, the loss function for logistic regression shown in Figure 9.12 is not exactly zero anywhere

But it is very small for observations that are far from the decision boundary. Due to the similarities between their loss functions, logistic regression and the support vector classifier often give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred.

Disadvantage of non-parametric methods

But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.

the examples in this chapter have used Euclidean distance as the dissimilarity measure

But sometimes other dissimilarity measures might be preferred. For example, correlation-based distance considers two observations to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.

Any time clustering is performed on a data set we will find clusters

But we really want to know whether the clusters that have been found represent true subgroups in the data, or whether they are simply a result of clustering the noise.

Prediction Accuracy

By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias. This can lead to substantial improvements in the accuracy with which we can predict the response for observations not used in model training

tuning parameter C

C bounds the sum of the slack variables ε_i, and so it determines the number and severity of the violations to the margin (and to the hyperplane) that we will tolerate. We can think of C as a budget for the amount that the margin can be violated by the n observations. If C = 0 then there is no budget for violations to the margin, and it must be the case that ε_1 = ... = ε_n = 0

This process results in k estimates of the test error, MSE1, MSE2,..., MSEk. The k-fold CV estimate is computed by averaging these values

CV_(k) = (1/k) ∑_{i=1}^{k} MSE_i
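A minimal sketch of k-fold CV in R, assuming the Auto data as in the cross-validation lab:

library(ISLR); library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, glm.fit, K = 10)$delta[1]  # 10-fold CV estimate of the test MSE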

1. The textbook and EdX course spend a lot of time on "cubic splines". In principle, we could fit a regression model with a "quadratic spline" or a "quartic spline". Why do the authors choose cubic splines?

Cubic splines are popular because most human eyes cannot detect the discontinuity at the knots. Cubic (degree 3) is the smallest degree for which this is true

GAMs details

Coefficients not that interesting; fitted functions are.

Collinearity

Collinearity refers to the situation in which two or more predictor variables collinearity are closely related to one another. Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for βˆj to grow Solution: The first is to drop one of the problematic variables from the regression. The second solution is to combine the collinear variables together into a single predictor.

why would one prefer LDA to QDA?

Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance. But there is a trade-off: if LDA's assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias. Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable

hclust()

Correlation-based distance can be computed using the as.dist() function, which converts an arbitrary square symmetric matrix into a form that the hclust() function recognizes as a distance matrix.
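A minimal sketch with a small simulated data matrix (hypothetical, observations in rows):

x <- matrix(rnorm(20 * 5), ncol = 5)
dd <- as.dist(1 - cor(t(x)))           # correlation-based distance between observations
plot(hclust(dd, method = "complete"))  # complete-linkage dendrogram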

Cp estimate of test MSE

Cp = (1/n) (RSS + 2 d σ̂²), where σ̂² is an estimate of the variance of the error

What is the value of the Cross-Entropy? Give your answer to the nearest hundredth (using log base e, as in R):

Cross-entropy = −0.64·ln(0.64) − 0.36·ln(0.36) = 0.6534 ≈ 0.65
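The same computation in R:

-(0.64 * log(0.64) + 0.36 * log(0.36))  # log() is base e; about 0.65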

Selecting the Tuning Parameter

Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values, and compute the cross-validation error for each value of λ, as described in Chapter 5. We then select the tuning parameter value for which the cross-validation error is smallest. Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter

high-dimensional.

Data sets containing more features than observations

Basics of Decision Trees

Decision trees can be applied to both regression and classification problems.

If we increase C (the error budget) in an SVM, do you expect the standard error of 𝛽 to increase or decrease?

Decrease Increasing C makes the margin "softer," so that the orientation of the separating hyperplane is influenced by more points.

Centroid

Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.

multiple regression uses

the F-statistic; when there is no relationship between the response and the predictors, the F-statistic takes on a value close to 1

True or False: If no linear boundary can perfectly classify all the training data, this means we need to use a feature expansion

False As in any statistical problem, we will always do better on the training data if we use a feature expansion, but that doesn't mean we will improve the test error. Not all regression lines should perfectly interpolate all the training points, and not all classifiers should perfectly classify all the training data.

We compute the principal components of our p predictor variables. The RSS in a simple linear regression of Y onto the largest principal component will always be no larger than the RSS in a simple regression of Y onto the second largest principal component. True or False? (You may want to watch 6.10 as well before answering - sorry!)

False Principal components are found independently of Y, so we can't know the relationship with Y a priori.

True or False: The computational effort required to solve a kernel support vector machine becomes greater and greater as the dimension of the basis increases.

False The beauty of the "kernel trick" is that, even if there is an infinite-dimensional basis, we need only look at the n^2 inner products between training data points

True or False: If we use k-means clustering, will we get the same cluster assignments for each point, whether or not we standardize the variables.

False The points are assigned to centroids using Euclidean distance. If we change the scaling of one variable, e.g. by dividing it by 10, then that variable will matter less in determining Euclidean distance.

How many would you fit using Forward Selection?

For Forward Selection, you fit (p-k) models for each k=0,...p-1. The expression for the total number of models fit is on pg 15 of the notes: p(p+1)/2+1.

How can bagging be extended to a classification problem where Y is qualitative?

For a given test observation, we can record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.

Stepwise Selection

For computational reasons, best subset selection cannot be applied with very large p. stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.

Supervised Learning

For each observation of the predictor measurement(s) xi, i = 1,...,n there is an associated response measurement yi. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).

in the regression setting we can accommodate a non-linear relationship between the predictors and the response by performing regression using transformations of the predictors.

For instance, we could create a more flexible version of logistic regression by including X2, X3, and even X4 as predictors. This may or may not improve logistic regression's performance, depending on whether the increase in variance due to the added flexibility is offset by a sufficiently large reduction in bias

is no guarantee that the method with the lowest training MSE will also have the lowest test MSE

For these methods, the training set MSE can be quite small, but the test MSE is often much larger.

However, the dummy variable approach cannot be easily extended to accommodate qualitative responses with more than two levels

For these reasons, it is preferable to use a classification method that is truly suited for qualitative response values, such as the ones presented next.

The tree building algorithm given on pg 13 is described as a Greedy Algorithm. Which of the following is also an example of a Greedy Algorithm?:

Forward Stepwise Selection is a Greedy Algorithm because at each step it selects the variable that improves the current model the most. There is no guarantee that the final result will be optimal.

Therefore, unless p is very small, we cannot consider all 2p models, and instead we need an automated and efficient approach to choose a smaller set of models to consider.

Forward selection, Backward selection, Mixed selection

Forward Stepwise Selection

Forward stepwise selection is a computationally efficient alternative to best forward stepwise selection subset selection. While the best subset selection procedure considers all 2^p possible models containing subsets of the p predictors, forward stepwise considers a much smaller set of models. Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time until all of the predictors are in the model. In particular, at each step, the variable that gives the greatest additional improvement to the fit is added to the model.

Pros of GAMs

GAMs allow us to fit a non-linear fj to each Xj , so that we can automatically model non-linear relationships that standard linear regression will miss. This means that we do not need to manually try out many different transformations on each variable individually. ▲ The non-linear fits can potentially make more accurate predictions for the response Y . ▲ Because the model is additive, we can still examine the effect of each Xj on Y individually while holding all of the other variables fixed. Hence if we are interested in inference, GAMs provide a useful representation. ▲ The smoothness of the function fj for the variable Xj can be summarized via degrees of freedom.
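A minimal sketch, assuming the Wage data and the gam package as in the text's lab:

library(ISLR); library(gam)
gam.fit <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)  # smoothing-spline terms plus a qualitative term
plot(gam.fit, se = TRUE)                                                # one fitted function per predictor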

Polynomial function and d

Generally speaking, it is unusual to use d greater than 3 or 4 because for large values of d, the polynomial curve can become overly flexible and can take on some very strange shapes. This is especially true near the boundary of the X variable.

What is the final classification under the average probability method?:

Green The average of the probabilities is 0.45, so the average probability method will select green.

multiple regression setting with p predictors,

H0 : β1 = β2 = ··· = βp = 0 vs. Ha : at least one βj is non-zero

maximal margin hyperplane!

Hence, M represents the margin of our hyperplane, and the optimization problem chooses β0, β1,...,βp to maximize M.

the F-statistic does not suffer from this problem because it adjusts for the number of predictors.

Hence, if H0 is true, there is only a 5% chance that the F-statistic will result in a p-value below 0.05, regardless of the number of predictors or the number of observations

When we fit a spline, where should we place the knots?

Hence, one option is to place more knots in places where we feel the function might vary most rapidly, and to place fewer knots where it seems more stable. While this option can work well, in practice it is common to place knots in a uniform fashion. One way to do this is to specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.

Once again, we apply the predict() function

How well does this pruned tree perform on the test data set?

All three methods offer a significant improvement over least squares.

However, PCR and ridge regression slightly outperform the lasso

Cp, Akaike information criterion (AIC), Bayesian information (BIC), and adjusted R^2

However, a number of techniques for adjusting the training error for the model size are available. These approaches can be used to select among a set of models with different numbers of variables.

test MSE for linear regression is still superior to that of KNN for low values of K.

However, for K ≥ 4, KNN outperforms linear regression.

This constraint on the form of the coefficients has the potential to bias the coefficient estimates

However, in situations where p is large relative to n, selecting a value of M ≪ p can significantly reduce the variance of the fitted coefficients

The term hierarchical refers to the fact that clusters obtained by cutting the dendrogram at a given height are necessarily nested within the clusters obtained by cutting the dendrogram at any greater height.

However, on an arbitrary data set, this assumption of hierarchical structure might be unrealistic.

K-means and hierarchical clustering will assign each observation to a cluster.

However, sometimes this might not be appropriate. For instance, suppose that most of the observations truly belong to a small number of (unknown) subgroups, and a small subset of the observations are quite different from each other and from all other observations. Then since K-means and hierarchical clustering force every observation into a cluster, the clusters found may be heavily distorted due to the presence of outliers that do not belong to any cluster. Mixture models are an attractive approach for accommodating the presence of such outliers; these amount to a soft version of K-means clustering.

in which a given observation has no nearby neighbors—this is the so-called curse of dimensionality

However, spreading 100 observations over p = 20 dimensions results in a phenomenon

Shrinkage.

However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection

what we really want to do is find some function, say g(x), that fits the observed data well: that is, we want RSS = ∑_{i=1}^{n} (y_i − g(x_i))² to be small.

However, there is a problem with this approach. If we don't put any constraints on g(xi), then we can always make RSS zero simply by choosing g such that it interpolates all of the yi. Such a function would woefully overfit the data—it would be far too flexible

ad hoc

However, this type of visual analysis is inherently ad hoc. Unfortunately, there is no well-accepted objective way to decide how many principal components are enough

What type of dissimilarity measure should be used to cluster the shoppers?

If Euclidean distance is used, then shoppers who have bought very few items overall (i.e. infrequent users of the online shopping site) will be clustered together. This may not be desirable. On the other hand, if correlation-based distance is used, then shoppers with similar preferences will be clustered together, even if some shoppers with these preferences are higher-volume shoppers than others. Therefore, for this application, correlation-based distance may be a better choice.

Which model is better? (Regresion and classification trees)

If the relationship between the features and the response is well approximated by a linear model as in (8.8), then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure. If instead there is a highly non-linear and complex relationship between the features and the response as indicated by model (8.9), then decision trees may outperform classical approaches.

We have stated that linear regression is not appropriate in the case of a qualitative response. Why not?

If the response variable's values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression

Note that in this case, we can compute the Bayes classifier because we know that X is drawn from a Gaussian distribution within each class, and we know all of the parameters involved.

In a real-life situation, we are not able to calculate the Bayes classifier.

When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0

In contrast, a larger F-statistic is needed to reject H0 if n is small.

unsupervised learning

In contrast, unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1,...,n, we observe a vector of measurements xi but no associated response yi. It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting, we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis. We can seek to understand the relationships between the variables or between the observations

r = Cor(X, Y ) instead of R2 in order to assess the fit of the linear model

In fact, it can be shown that in the simple linear regression setting, R^2 = r^2.

expect to see approximately five small p-values even in the absence of any true association between the predictors and the response

In fact, we are almost guaranteed that we will observe at least one p-value below 0.05 by chance! Hence, if we use the individual t-statistics and associated p-values in order to decide whether or not there is any association between the variables and the response, there is a very high chance that we will incorrectly conclude that there is a relationship.

cubic spline.

In general, a cubic spline with K knots uses a total of 4 + K degrees of freedom

curse of dimensionality

In general, adding additional signal features that are truly associated with the response will improve the fitted model, in the sense of leading to a reduction in test set error. However, adding noise features that are not truly associated with the response will lead to a deterioration in the fitted model, and consequently an increased test set error.

maximal margin classifier definition

In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinite possible separating hyperplanes to use. A natural choice is the maximal margin hyperplane.

cut()

In order to fit a step function, as discussed in Section 7.2, we use the cut() function
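
A minimal sketch of the cut() approach, assuming the Wage data from the ISLR package (age and wage columns):
> library(ISLR)
> table(cut(Wage$age, 4))                      # cut() picks four equal-width age intervals
> fit <- lm(wage ~ cut(age, 4), data = Wage)   # piecewise-constant (step function) fit
> coef(summary(fit))                           # one coefficient per interval; the first interval is the baseline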

ns()

In order to instead fit a natural spline, we use the ns() function from the splines package.
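
A brief sketch, again assuming the ISLR Wage data:
> library(ISLR); library(splines)
> fit <- lm(wage ~ ns(age, df = 4), data = Wage)   # natural cubic spline with 4 degrees of freedom
> summary(fit)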

loess()

In order to perform local regression, we use the loess() function.
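
A hedged sketch, assuming the same Wage data; the span values are illustrative:
> library(ISLR)
> fit1 <- loess(wage ~ age, span = 0.2, data = Wage)   # small span: very local, wiggly fit
> fit2 <- loess(wage ~ age, span = 0.5, data = Wage)   # larger span: smoother fit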

Each of these decisions can have a strong impact on the results obtained

In practice, we try several different choices, and look for the one with the most useful or interpretable solution. With these methods, there is no single right answer—any solution that exposes some interesting aspects of the data should be considered.

K-nearest neighbor (K-NN)

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0

We now elaborate on Step 1 above. How do we construct the regions R1,...,RJ ?

In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model.

We can imagine a scenario in which the best division into two groups might split these people by gender, and the best division into three groups might split them by nationality

In this case, the true clusters are not nested, in the sense that the best division into three groups does not result from taking the best division into two groups and splitting up one of those groups.

In certain settings, however, the variables may be measured in the same units.

In this case, we might not wish to scale the variables to have standard deviation one before performing PCA

Polynomial and piecewise-constant regression models are in fact special cases of a basis function approach.

Instead of fitting a linear model in X, we fit the model yi = β0 + β1b1(xi) + β2b2(xi) + β3b3(xi) + ... + βKbK(xi) + εi, where the basis functions b1(·), b2(·), ..., bK(·) are fixed and known.

Of course, this is not practical because we generally do not have access to multiple training sets.

Instead, we can bootstrap, by taking repeated samples from the (single) training data set.

Leave-one-out cross-validation (LOOCV)

LOOCV involves splitting the set of observations into two parts. However, instead of creating two subsets of comparable size, a single observation (x1, y1) is used for the validation set, and the remaining observations {(x2, y2),...,(xn, yn)} make up the training set. The statistical learning method is fit on the n − 1 training observations, and a prediction ŷ1 is made for the excluded observation, using its value x1. Repeating this for every observation means the model must be refit n times, which can be computationally expensive.
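
A minimal sketch using cv.glm() from the boot package, assuming the Auto data from ISLR; with no K argument, cv.glm() performs LOOCV:
> library(ISLR); library(boot)
> glm.fit <- glm(mpg ~ horsepower, data = Auto)   # glm() with the default family is ordinary linear regression
> cv.err <- cv.glm(Auto, glm.fit)                 # default K = n, i.e. leave-one-out CV
> cv.err$delta                                    # cross-validated estimate of the test MSE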

PCA provides a tool to do just this.

It finds a low-dimensional representation of a data set that contains as much as possible of the variation. The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally interesting.

However, in practice we are sometimes faced with non-linear class boundaries.

It is clear that a support vector classifier or any linear classifier will perform poorly here

What exactly does this mean?

It means that on the basis of one particular set of observations y1,...,yn, ˆμ might overestimate μ, and on the basis of another set of observations, ˆμ might underestimate μ. But if we could average a huge number of estimates of μ obtained from a huge number of sets of observations, then this average would exactly equal μ. Hence, an unbiased estimator does not systematically over- or under-estimate the true parameter.

When the value of K is large, then KNN performs only a little worse than least squares regression in terms of MSE

It performs far worse when K is small.

how can we fit a piecewise degree-d polynomial under the constraint that it (and possibly its first d − 1 derivatives) be continuous?

It turns out that we can use the basis model (7.7) to represent a regression spline. A cubic spline with K knots can be modeled as yi = β0 + β1b1(xi) + β2b2(xi) + ··· + βK+3bK+3(xi) + εi.

Because clustering is popular in many fields, there exist a great number of clustering methods.

K-means clustering and hierarchical clustering

hatvalues()

Leverage statistics can be computed for any number of predictors using the hatvalues() function.

Quadratic Discriminant Analysis

Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes' theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. As a result, the quantity x appears as a quadratic function in the discriminant, which is where QDA gets its name.

partial least squares (PLS), a supervised alternative to PCR.

Like PCR, PLS is a dimension reduction method, which first identifies a new set of features Z1,...,ZM that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. But unlike PCR, PLS identifies these new features in a supervised way—that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response. Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors

boosting

Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Here we restrict our discussion of boosting to the context of decision trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set

EdX Chapter 6 Which of the following modeling techniques performs Feature Selection?

Linear Regression with Forward Selection. Forward Selection chooses a subset of the predictor variables for the final model; the other three methods end up using all of the predictor variables.

In the regression setting, the most commonly-used measure is the mean squared error (MSE)

MSE = (1/n) sum_{i=1}^{n} (yi − f̂(xi))^2

Average

Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Single

Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time.

We can now ask a natural question: how much of the information in a given data set is lost by projecting the observations onto the first few principal components? That is, how much of the variance in the data is not contained in the first few principal components?

More generally, we are interested in knowing the proportion of variance explained (PVE) by each principal component.

With two-dimensional data, such as in our advertising example, we can construct at most two principal components

Any additional components would capture no further variance, since the first two directions already span the entire two-dimensional space.

Suppose that after our computer works for an hour to fit an SVM on a large data set, we notice that 𝑥4, the feature vector for the fourth example, was recorded incorrectly (say, one of the decimal points is obviously in the wrong place). However, your co-worker notices that the pair (𝑥4,𝑦4) did not turn out to be a support point in the original fit. He says there is no need to re-fit the SVM on the corrected data set, because changing the value of a non-support point can't possibly change the fit. Is your co-worker correct?

No. When we change x4, the fourth example might become a support point; if so, the fit may change. However, we could check whether (x4, y4) is still not a support point even after correcting the value. If so, then we really don't need to re-fit the model.

Is principal components regression a good feature selection method? Why or why not?

No, because PCR chooses the directions Z1, ..., ZM along which the Xs vary the most, and each such direction is a linear combination of all p features. No features are dropped, so PCR is not a feature selection method.

5. Suppose that after our computer works for an hour to fit an SVM on a large data set, we notice that x4, the feature vector for the fourth example, was recorded incorrectly (say, one of the decimal points is obviously in the wrong place). However, your co-worker notices that the pair (x4, y4) did not turn out to be a support point in the original fit. He says there is no need to re-fit the SVM on the corrected data set, because changing the value of a non-support point can't possibly change the fit. Is your co-worker correct? Why or why not?

No, because when we change the value of x4, a non-support point can become a support point, in which case the fit would change. However, if x4 changes in such a way that we can be sure it is still a non-support point, then our coworker is correct.

6. Imagine that you are trying to solve a classification problem using a maximal margin classifier and your training data are as in the figure below. Assume you have two predictors X1 and X2 and that your data give yi = +1 when the plotted character is x and yi = −1 when the plotted character is o. Is the mathematical optimization problem stated in 4 above solvable for this data? If it is, give approximate values for the solving parameters β0, β1, β2, M. If it is not, explain what you would do about that.

No, the optimization problem is not solvable. The optimization problem is trying to find a separating hyperplane, which in this case is a separating line, and there is no line in R^2 which separates the Xs from the Os. A natural remedy is to allow some margin violations, i.e. use a support vector (soft margin) classifier instead.

Why are natural cubic splines typically preferred over global polynomials of degree d?

Polynomials tend to extrapolate very badly near the boundaries of the data, whereas natural splines are constrained to be linear beyond the boundary knots, which gives more stable estimates there.

type="response" option in the predict() function.

Note that we could have directly computed the probabilities by selecting the
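
For example, assuming glm.fit is a fitted logistic regression model (hypothetical name):
> glm.probs <- predict(glm.fit, type = "response")   # predicted probabilities Pr(Y = 1 | X) rather than the logit
> glm.pred <- ifelse(glm.probs > 0.5, "Yes", "No")   # convert probabilities into class predictions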

lda() function

Now we will perform LDA
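
A minimal sketch using the MASS package; the data frames and variable names are placeholders:
> library(MASS)
> lda.fit <- lda(y ~ x1 + x2, data = train.df)      # estimates class priors, class means, and a common covariance
> lda.pred <- predict(lda.fit, newdata = test.df)   # $class gives labels, $posterior gives class probabilities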

support vectors

Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors; they are the only observations that affect the support vector classifier.

When C is small, we seek narrow margins that are rarely violated; this amounts to a classifier that is highly fit to the data, which may have low bias but high variance

On the other hand, when C is larger, the margin is wider and we allow more violations to it; this amounts to fitting the data less hard and obtaining a classifier that is potentially more biased but may have lower variance.

Note that in (4.19) δk(x) is a linear function of x; that is, the LDA decision rule depends on x only through a linear combination of its elements

Once again, this is the reason for the word linear in LDA.

What is the advantage of using a kernel rather than simply enlarging the feature space using functions of the original features, as in (9.16)?

One advantage is computational: using kernels, one need only compute K(xi, xi′) for all n(n − 1)/2 distinct pairs i, i′. This can be done without explicitly working in the enlarged feature space. This is important because in many applications of SVMs, the enlarged feature space is so large that computations are intractable. For some kernels, such as the radial kernel (9.24), the feature space is implicit and infinite-dimensional, so we could never do the computations there anyway!

validationplot()

One can also plot the cross-validation scores

How many knots should we use, or equivalently how many degrees of freedom should our spline contain?

One option is to try out different numbers of knots and see which produces the best looking curve. A somewhat more objective approach is to use cross-validation. we remove a portion of the data (say 10 %), fit a spline with a certain number of knots to the remaining data, and then use the spline to make predictions for the held-out portion.

suppose that we cluster n observations, and then cluster the observations again after removing a subset of the n observations at random.

One would hope that the two sets of clusters obtained would be quite similar, but often this is not the case!

If we try to use ordinary least squares linear regression when p, the number of predictors, is greater than n, the number of data points, what happens and why?

Least squares will overfit the data and will not yield accurate estimates of the response: with more coefficients than observations the fit is not even unique and can interpolate the training data, giving a very small training RSE but a large test MSE.

What is the "penalty function" for ridge regression?

P(β) = sum_{j=1}^{p} βj^2 for ridge regression and P(β) = sum_{j=1}^{p} |βj| for the lasso.
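
A hedged sketch with the glmnet package; x is assumed to be a numeric predictor matrix and y a response vector:
> library(glmnet)
> ridge.mod <- glmnet(x, y, alpha = 0)   # alpha = 0: ridge (squared) penalty
> lasso.mod <- glmnet(x, y, alpha = 1)   # alpha = 1: lasso (absolute value) penalty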

varImpPlot()

Plots of these importance measures can be produced using the varImpPlot() function

qda() function

QDA is implemented in R

When the boundaries are moderately non-linear

QDA may give better results

The concept of correlation between the predictors and the response does not extend automatically to this setting, since correlation quantifies the association between a single pair of variables rather than between a larger number of variables

R2 fills this role.

How can we decide which threshold value is best?

Use the ROC curve, which displays the two types of error simultaneously over all possible thresholds; the AUC summarizes performance across thresholds.

Draw the ROC curve for a "perfect" or near-perfect classifier. Label the axes appropriately. (Hint: x and y are not appropriate labels.) What is the approximate value of the AUC metric for a "perfect" or near-perfect classifier?

For a near-perfect classifier the ROC curve hugs the top-left corner: the true positive rate is essentially 1 at every false positive rate. The horizontal axis is the false positive rate and the vertical axis is the true positive rate, and the AUC is approximately 1.

summary(lm.fit)$sigma

RSE.

Which of the following would be the worst metric to use to select λ in the Lasso?

RSS (the training RSS always decreases as λ shrinks, so minimizing it would simply select λ = 0, i.e. no shrinkage at all)

residual sum of squares (RSS)

RSS = e1^2 + e2^2 + ··· + en^2

However, in the classification setting

RSS cannot be used as a criterion for making the binary splits. A natural alternative to RSS is the classification error rate.

What is the difference between bagged trees and random forests?

Random forests provide an improvement over bagged trees because they decorrelate the trees, which lowers the variance of the averaged prediction. Bagged trees produce predictions that are highly correlated, so averaging them does not reduce the variance nearly as much as averaging the less correlated trees of a random forest. (Know the definitions.)

bootstrap method

Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set We randomly select n observations from the data set in order to produce a bootstrap data set, Z∗1. The sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set
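
A minimal sketch of the bootstrap with boot(); the data frame dat and its column x are assumptions for illustration:
> library(boot)
> mean.fn <- function(data, index) mean(data$x[index])   # statistic computed on each bootstrap sample
> boot(dat, mean.fn, R = 1000)                            # 1000 resamples, drawn with replacement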

support vector classifier, sometimes called a soft margin classifier, does exactly this.

Rather than seeking the largest possible margin so that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, we instead allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane. (The margin is soft because it can be violated by some of the training observations.)

3. In words, describe the results that you would expect if you performed K-means clustering of the eight shoppers in Figure 10.14 in the text, on the basis of their sock and computer purchases, with K = 2. Give three answers, one for each of the variable scalings displayed.

Right: 2 clusters, "shoppers who purchased a computer" and "shoppers who did not purchase a computer." Center: the variation in the computer purchase data overwhelms the variation in the sock purchase data, so we get the same clusters as on the right. Left: "shoppers who purchased a large number of items" and "shoppers who purchased a relatively small number of items."

values βˆ0, βˆ1,..., βˆp that minimize RSS

are the multiple least squares regression coefficient estimates.

It turns out that there is a very straightforward way to estimate the test error of a bagged model,

Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations.

The text discusses (in Chapters 4 and 9) the ROC curve and the AUC metric. What do R, O, and C stand for? What about A, U, and C?

Receiver Operating Characteristics Area Under Curve

Comparison to Polynomial Regression

Regression splines often give superior results to polynomial regression. This is because unlike polynomials, which must use a high degree (exponent in the highest monomial term, e.g. X15) to produce flexible fits, splines introduce flexibility by increasing the number of knots but keeping the degree fixed. Generally, this approach produces more stable estimates. Splines also allow us to place more knots, and hence flexibility, over regions where the function f seems to be changing rapidly, and fewer knots where f appears more stable.

How far off will that single estimate of ˆμ be?

We answer this using the standard error of μ̂, written SE(μ̂). We have the well-known formula Var(μ̂) = SE(μ̂)^2 = σ^2/n.

Suppose that we would like to perform classification using SVMs, and there are K > 2 classes. A one-versus-one or all-pairs approach constructs K(K − 1)/2

SVMs, each of which compares a pair of classes

QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches.

Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of problems than can the linear methods. Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary.

adjusted R2 statistic is another popular approach for selecting among a set of models that contain different numbers of variables.

Since RSS always decreases as more variables are added to the model, the R2 always increases as more variables are added.

Given a n × p data set X, how do we compute the first principal component?

Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero).
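
A short sketch, assuming X is a numeric matrix or data frame:
> pr.out <- prcomp(X, center = TRUE, scale. = TRUE)   # centers (and here also scales) each column
> pr.out$rotation[, 1]   # loading vector of the first principal component
> pr.out$x[, 1]          # scores of the observations on the first principal component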

Clustering can be a very useful tool for data analysis in the unsupervised setting. However, there are a number of issues that arise in performing clustering. We describe some of these issues here.

Small Decisions with Big Consequences:
- Should the observations or features first be standardized in some way? For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one.
- In the case of hierarchical clustering: What dissimilarity measure should be used? What type of linkage should be used? Where should we cut the dendrogram in order to obtain clusters?
- In the case of K-means clustering: how many clusters should we look for in the data?

(a) A school district wants to predict high school drop-outs based on 7 features measured from 1500 students. It is known that different features have similar scales. Moreover, it is highly unlikely that the difference between the two groups (drop-out and non-drop-out) can be captured by a linear boundary. Lastly, the school district has no one who understands optimization so a simple and easy to understand approach is the best. What method do you recommend to the school district?

A small classification tree (simple to explain and able to capture a non-linear boundary); the other reasonable choice is KNN.

What does it mean for a function to have 2 1/2 degrees of freedom? Give a mathematical definition

If the penalty coefficient λ is 0, the penalty term vanishes and we allow quadratic functions, which have 3 degrees of freedom; when λ is very large we are forced down to 2 degrees of freedom. An intermediate λ gives an effective degrees of freedom between 2 and 3, such as 2 1/2. Mathematically, the effective degrees of freedom is the trace (sum of the diagonal elements) of the "hat" matrix S_λ that maps the observed responses y onto the fitted values.

Chapter 7 Which of the following can we add to linear models to capture nonlinear effects?

Spline terms Polynomial terms Interactions Step functions

In this chapter, we discuss three important classes of methods.

Subset Selection, Shrinkage, Dimension Reduction

Advantage of non-parametric methods

Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f. Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of f is made.

Imagine that you are doing cost complexity pruning as defined on page 18 of the notes. You fit two trees to the same data: 𝑇1 is fit at 𝛼=1 and 𝑇2 is fit at 𝛼=2. Which of the following is true?

T1 will have at least as many nodes as 𝑇2

posterior probability

That is, it is the probability that the observation belongs to the kth class, given the predictor value for that observation

However, it turns out that classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable.

The Gini index is defined by G = sum_{k=1}^{K} p̂mk(1 − p̂mk), a measure of total variance across the K classes. For this reason, the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class. An alternative to the Gini index is entropy, D = − sum_{k=1}^{K} p̂mk log p̂mk.
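
A small sketch computing both impurity measures from a vector of class proportions (summing to one):
> gini <- function(p) sum(p * (1 - p))
> entropy <- function(p) -sum(p * log(p))
> gini(c(0.64, 0.36))      # 0.4608, matching the marble example later in these notes
> entropy(c(0.64, 0.36))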

R^2 = (TSS − RSS)/TSS = 1 − RSS/TSS

The R2 statistic provides an alternative measure of fit. It takes the form of a proportion—the proportion of variance explained—and so it always takes on a value between 0 and 1, and is independent of the scale of Y. An R2 statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. A number near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error σ2 is high, or both.

You are doing a simulation in order to compare the effect of using Cross-Validation or a Validation set. For each iteration of the simulation, you generate new data and then use both Cross-Validation and a Validation set in order to determine the optimal number of predictors. Which of the following is most likely?

The Validation set method will result in a higher variance of optimal number of predictors

Is the SVM unique in its use of kernels to enlarge the feature space to accommodate non-linear class boundaries?

The answer to this question is "no". We could just as well perform logistic regression or many of the other classification methods seen in this book using non-linear kernels; this is closely related to some of the non-linear approaches seen in Chapter 7.

as polynomial regression, since we have included polynomial functions of the predictors in the regression model.

The approach that we have just described for extending the linear model to accommodate non-linear relationships

the support vector classifier and SVM tuning parameter C is important because

The choice of tuning parameter is very important and determines the extent to which the model underfits or overfits the data,

However, there are three sorts of uncertainty associated with this prediction. (for the multiple regression model)

The coefficient estimates β̂0, β̂1, ..., β̂p are estimates for β0, β1, ..., βp. That is, the least squares plane Ŷ = β̂0 + β̂1X1 + ··· + β̂pXp is only an estimate for the true population regression plane f(X) = β0 + β1X1 + ··· + βpXp. The inaccuracy in the coefficient estimates is related to the reducible error. Of course, in practice assuming a linear model for f(X) is almost always an approximation of reality, so there is an additional source of potentially reducible error, which we call model bias. Even if we knew f(X), that is, even if we knew the true values for β0, β1, ..., βp, the response value could not be predicted perfectly because of the random error ε; we referred to this as the irreducible error.

Bagging

The decision trees discussed in Section 8.1 suffer from high variance. In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets; linear regression tends to have low variance, if the ratio of n to p is moderately large. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the bagging variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.

This intuition can be formalized using a mathematical equation called a likelihood function:

The estimates βˆ0 and βˆ1 are chosen to maximize this likelihood function
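
For reference, the two-class logistic regression likelihood being maximized has the form ℓ(β0, β1) = prod_{i: yi = 1} p(xi) × prod_{i': yi' = 0} (1 − p(xi')), where p(x) = Pr(Y = 1 | X = x).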

plot.gam()

The generic plot() function recognizes that gam.m3 is an object of class gam

However, it is sometimes the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.

The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

Lasso

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. Like ridge regression, the lasso shrinks the coefficient estimates towards zero; however, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection.

For parts (a) through (c), circle the right words to form a correct statement. (a) The lasso, relative to least squares, is: [more | less] flexible and hence will give improved prediction accuracy when its [increase | decrease] in bias is [more | less] than its [increase | decrease] in variance.

The lasso, relative to least squares, is: less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Problems: Non-linearity of the Data

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced. Solution: If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as log X, √X, and X2, in the regression model.

In the mathematical optimization problem to complete a classification task found in the text: maximize M over β0, β1, ..., βp, M subject to sum_{j=1}^{p} βj^2 = 1 and yi(β0 + β1xi1 + ··· + βpxip) ≥ M for all i = 1, 2, ..., n, the variable M is called the margin. What is the geometric meaning of the "margin"? What, if any, is the connection between the "margin" and the "support vectors"?

The margin is the distance from the separating hyperplane (obtained by solving the optimization problem) to the closest training observations. The training observations that lie on the margin, i.e. closest to the hyperplane, are called support vectors.

The choice of dissimilarity measure is very important

as it has a strong effect on the resulting dendrogram.

What is the advantage of using k = 5 or k = 10 rather than k = n?

The most obvious advantage is computational. LOOCV requires fitting the statistical learning method n times, which has the potential to be computationally expensive; some statistical learning methods have computationally intensive fitting procedures, so performing LOOCV may pose computational problems, especially if n is extremely large. With k = 10, CV requires fitting the learning procedure only ten times, which may be much more feasible.

In the heat map for breast cancer data, which of the following depended on the output of hierarchical clustering?

The ordering of the rows The ordering of the columns The dendrograms obtained from hierarchical clustering were used to order the rows and columns. The coloring of the cells was based on gene expression, but without the hierarchical clustering step, the heat map would not have looked like anything meaningful.

Tree Pruning

The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. This is because the resulting tree might be too complex.

model assessment,

The process of evaluating a model's performance

one-standard-error rule (to find out which model is best)

The rationale here is that if a set of models appear to be more or less equally good, then we might as well choose the simplest model—that is, the model with the smallest number of predictors.

Regression Trees

The regression tree shown in Figure 8.1 is likely an over-simplification of the true relationship. However, it has advantages over other types of regression models (such as those seen in Chapters 3 and 6): it is easier to interpret, and has a nice graphical representation.

, as we use more flexible methods, the variance will increase and the bias will decrease.

The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases

on average, each bagged tree makes use of around two-thirds of the observations.

The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations

In fact, even if a separating hyperplane does exist, then there are instances in which a classifier based on a separating hyperplane might not be desirable. A classifier based on a separating hyperplane will necessarily perfectly classify all of the training observations; this can lead to sensitivity to individual observations

The resulting maximal margin hyperplane is not satisfactory—for one thing, it has only a tiny margin. Moreover, the fact that the maximal margin hyperplane is extremely sensitive to a change in a single observation suggests that it may have overfit the training data In this case, we might be willing to consider a classifier based on a hyperplane that does not perfectly separate the two classes

span s

The span plays a role like that of the tuning parameter λ in smoothing splines: it controls the flexibility of the non-linear fit. The smaller the value of s, the more local and wiggly will be our fit; alternatively, a very large value of s will lead to a global fit to the data using all of the training observations.

maximal margin classifier

The support vector machine is a generalization of a simple and intuitive classifier

linear regression is at the most restrictive end, with two degrees of freedom.

The training MSE declines monotonically as flexibility increases.

You are trying to reproduce the results of the R labs, so you run the following command in R: > library(tree) As a response, you see the following error message: Error in library(tree) : there is no package called 'tree' What went wrong?

The tree package is not installed on your computer
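
The fix is simply to install the package first (this assumes an internet connection):
> install.packages("tree")
> library(tree)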

Starting out at the bottom of the dendrogram, each of the n observations is treated as its own cluster

The two clusters that are most similar to each other are then fused so that there now are n−1 clusters. Next the two clusters that are most similar to each other are fused again, so that there now are n − 2 clusters. The algorithm proceeds in this fashion until all of the observations belong to one single cluster, and the dendrogram is complete.

> A[-c(1,3) ,]

The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index.

na.omit()

There are various ways to deal with the missing data. In this case, only five of the rows contain missing observations, and so we choose to use the na.omit() function to simply remove these rows.

H0 : β1 = 0

There is no relationship between X and Y

Ha : β1 =/=0

There is some relationship between X and Y

KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary.

Therefore, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important

tree based methods

These involve stratifying or segmenting the predictor space into a number of simple regions. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods. Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches

(b) Tom, a business analyst in a big pharmaceutical company is trying to predict number of prescriptions for a drug over the next 12 months. He has 90 features of interest measured from 500 health care providers and decides to use ridge regression for his prediction. To select the optimal regularization parameters, he uses five-fold cross-validation. After he selects the regularization parameters, his boss wants an estimate of the prediction error. Tom is aware of the danger of possible under-estimation of the prediction error by using the same data and same procedure for both model training/tuning and model assessment. So he deliberately chooses to not to use the five-fold cross-validation again. Instead, he uses leave-one-out cross validation: for each data point, he fits the ridge regression with the previously selected regularization parameter value to all but that data point, and then measures the prediction error for that point. He averages the prediction error over all the points and reports the result to his boss. Has Tom successfully made this estimate of the prediction error unbiased? If so, why? If not, why not and how would you alter the procedure to obtain an unbiased estimate?

No. Tom still uses the same data both to tune the regularization parameter and to estimate the prediction error, so the estimate remains optimistically biased. To obtain an unbiased estimate he should either hold out a separate test set, or nest the tuning (the five-fold CV over the regularization parameter) inside each fold of the error-estimation cross-validation.

Subset Selection

This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

Dimension Reduction

This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables, which are then used as predictors in a least squares fit.

However, we notice that the p-value for the dummy variable is very high.

This indicates that there is no statistical evidence of a difference in average credit card balance between the genders.

Mixed selection

This is a combination of forward and backward selection. We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit.

This is in fact a very difficult problem to solve precisely, since there are almost K^n ways to partition n observations into K clusters.

This is a huge number unless K and n are tiny! Fortunately, a very simple algorithm can be shown to provide a local optimum—a pretty good solution—to the K-means optimization problem

Inference

We are often interested in understanding the way that Y is affected as X1,...,Xp change. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y. We instead want to understand the relationship between X and Y , or more specifically, to understand how Y changes as a function of X1,...,Xp.

validation="CV"

causes pcr() to compute the ten-fold cross-validation error for each possible value of M, the number of principal components used
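
A hedged sketch with the pls package; the formula and data frame are placeholders:
> library(pls)
> pcr.fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")   # 10-fold CV error for each number of components M
> validationplot(pcr.fit, val.type = "MSEP")                           # plot the cross-validation MSE against M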

This is often referred to as an elbow in the scree plot.

This is done by eyeballing the scree plot, and looking for a point at which the proportion of variance explained by each subsequent principal component drops off.

bottom-up or agglomerative clustering

This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram (generally depicted as an upside-down tree) is built starting from the leaves and combining clusters up to the trunk.

The default prediction type for a glm() model is type="link"

This means we get predictions for the logit: that is, we have fit a model of the form

We repeat this process multiple times until each observation has been left out once, and then compute the overall cross-validated RSS.

This procedure can be repeated for different numbers of knots K. Then the value of K giving the smallest RSS is chosen

When performing PCR, we generally recommend standardizing each predictor

This standardization ensures that all variables are on the same scale. In the absence of standardization, the high-variance variables will tend to play a larger role in the principal components obtained,

So far, our discussion has been limited to the case of binary classification: that is, classification in the two-class setting. How can we extend SVMs to the more general case where we have some arbitrary number of classes?

Though a number of proposals for extending SVMs to the K-class case have been made, the two most popular are the one-versus-one and one-versus-all approaches. We briefly discuss those two approaches here

Forward stepwise selection's computational advantage over best subset selection is clear

Though forward stepwise tends to do well in practice, it is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors.

In general RSE is defined as RSE = sqrt(RSS/(n − p − 1)), which simplifies to RSE = sqrt(RSS/(n − 2)) for a simple linear regression.

Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p

While bagging can improve predictions for many regression methods, it is particularly useful for decision trees.

To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions.

People often loosely refer to the maximal margin classifier, the support vector classifier, and the support vector machine as "support vector machines".

To avoid confusion, we will carefully distinguish between these three notions in this chapter

cutree()

To determine the cluster labels for each observation associated with a given cut of the dendrogram
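
A minimal sketch, assuming x is a numeric data matrix:
> hc.complete <- hclust(dist(x), method = "complete")   # hierarchical clustering with complete linkage
> cutree(hc.complete, 2)                                # cluster labels after cutting the dendrogram into 2 clusters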

Linear Discriminant Analysis for p >1

To do this, we will assume that X = (X1, X2,...,Xp) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific multivariate mean vector and a common covariance matrix

an SVM using a non-linear kernel, we once again use the svm() function.

To fit an SVM with a polynomial kernel we use kernel="polynomial", and to fit an SVM with a radial kernel we use kernel="radial".
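
A hedged sketch with the e1071 package; dat is assumed to be a data frame whose response y is a factor:
> library(e1071)
> svm.rad  <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)
> svm.poly <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 3, cost = 1)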

scale() function

To scale the variables before performing hierarchical clustering of the observations,

Suppose I have two qualitative predictor variables, each with three levels, and a quantitative response. I am considering fitting either a tree or an additive model. For the additive model, I will use a piecewise-constant function for each variable, with a separate constant for each level. Which model is capable of fitting a richer class of functions:

Tree

True or False: If we cut the dendrogram at a lower point, we will tend to get more clusters, and cannot get fewer clusters (assuming complete, single or average linkage).

True After cutting the dendrogram at threshold t, we keep all the joins with linkage distance less than t and discard the joins with larger linkage distance. Thus, decreasing the threshold gives us fewer joins, and thus more clusters. If, in decreasing the threshold, we don't cross a junction of the dendrogram, the number of clusters will remain the same.

bias-variance trade-off

Typically as the flexibility of ˆf increases, its variance increases, and its bias decreases

recursive binary splitting

Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into J boxes. For this reason, we take a top-down, greedy approach that is known as

Ridge regression does have one obvious disadvantage

Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. This can create a challenge in model interpretation in settings in which the number of variables p is quite large.

The hierarchical clustering dendrogram is obtained via an extremely simple algorithm

We begin by defining some sort of dissimilarity measure between each pair of observations. Most often, Euclidean distance is used; we will discuss the choice of dissimilarity measure later in this chapter.

The relationship between bias, variance, and test set MSE

is referred to as the bias-variance trade-off. Good test set performance of a statistical learning method requires low variance as well as low squared bias.

GAMs can also be used in situations where Y is qualitative.

Fit a logistic regression GAM: model the logit (log odds) of Pr(Y = 1 | X) as a sum of smooth functions of the predictors.

In practice we use a value of B sufficiently large that the error has settled down

Using B = 100 is sufficient to achieve good performance in this example

Which of the following is NOT a benefit of the sparsity imposed by the Lasso?

Using the Lasso penalty helps to decrease the bias of the fits. Restricting ourselves to simpler models by including a Lasso penalty will generally decrease the variance of the fits at the cost of higher bias.

You have a bag of marbles with 64 red marbles and 36 blue marbles. What is the value of the Gini Index for that bag? Give your answer to the nearest hundredth:

Using the formula from pgs 25-26: Gini Index = 0.64*(1 − 0.64) + 0.36*(1 − 0.36) = 0.4608 ≈ 0.46

why do we discuss effective degrees of freedom instead of degrees of freedom?

Usually degrees of freedom refer to the number of free parameters, such as the number of coefficients fit in a polynomial or cubic spline. Although a smoothing spline has n parameters and hence n nominal degrees of freedom, these n parameters are heavily constrained or shrunk down. Hence dfλ is a measure of the flexibility of the smoothing spline—the higher it is, the more flexible (and the lower-bias but higher-variance) the smoothing spline

You are working on a regression problem with many variables, so you decide to do Principal Components Analysis first and then fit the regression to the first 2 principal components. Which of the following would you expect to happen?:

Variance of fitted values will decrease relative to the full least squares model

Forward selection

We begin with the null model—a model that contains an intercept but no predictors. We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied.

This is known as the maximal margin classifier.

We can then classify a test observation based on which side of the maximal margin hyperplane it lies

ntree

We could change the number of trees grown by randomForest()
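
A brief sketch with the randomForest package; the training data frame and the mtry value are assumptions:
> library(randomForest)
> rf.fit <- randomForest(y ~ ., data = train, mtry = 6, ntree = 25, importance = TRUE)
> varImpPlot(rf.fit)   # variable importance plot, as noted in the varImpPlot() entry above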

Support Vector Machines

We first discuss a general mechanism for converting a linear classifier into one that produces non-linear decision boundaries. We then introduce the support vector machine, which does this in an automatic way.

The one-versus-all approach is an alternative procedure for applying SVMs in the case of K > 2 classes.

We fit K SVMs, each time comparing one of the K classes to the remaining K − 1 classes.

High Leverage Points

We just saw that outliers are observations for which the response yi is unusual given the predictor xi. In contrast, observations with high leverage have an unusual value for xi. Removing a high leverage observation has a much more substantial impact on the least squares line than removing an outlier; in fact, high leverage observations tend to have a sizable impact on the estimated regression line. The leverage statistic can be used to identify such observations.

Which of the two models with 𝑘 predictors has the smallest test RSS?

We know that Best Subset selection will always have the lowest training RSS (that is how it is defined). That said, we don't know which model will perform better on a test set.

If a qualitative predictor (also known as a factor) only has two levels, or possible values, then incorporating it into a regression model is very simple

We simply create an indicator or dummy variable that takes on two possible numerical values.

However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions.

We now have three regions. Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.

prcomp()

We now perform principal components analysis

small p-value

We reject the null hypothesis—that is, we declare a relationship to exist between X and Y —if the p-value is small enough.

Backward selection

We start with all variables in the model and remove the variable with the largest p-value—that is, the variable that is the least statistically significant. The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.

scree plot

We typically decide on the number of principal components required to visualize the data by examining a scree plot. We choose the smallest number of principal components that are required in order to explain a sizable amount of the variation in the data.

knn()

We will now perform KNN
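
A minimal sketch with the class package; the training/test matrices and label vector are placeholders:
> library(class)
> set.seed(1)                                        # knn() breaks ties at random
> knn.pred <- knn(train.X, test.X, train.Y, k = 3)   # predicted class for each test observation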

knn var bias

When K = 1, the decision boundary is overly flexible and finds patterns in the data that don't correspond to the Bayes decision boundary. This corresponds to a classifier that has low bias but very high variance. As K grows, the method becomes less flexible and produces a decision boundary that is close to linear. This corresponds to a low-variance but high-bias classifier. Just as in the regression setting, there is not a strong relationship between the training error rate and the test error rate.

Principal components

When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set. The principal component directions are presented in Section 6.3.1 as directions in feature space along which the original data are highly variable. These directions also define lines and subspaces that are as close as possible to the data cloud

It involves randomly dividing the available set of observations into two parts,

a training set and a validation set or hold-out set

What Goes Wrong in High Dimensions?

When the number of features p is as large as, or larger than, the number of observations n, ordinary least squares cannot be performed: there is no longer a unique coefficient estimate, and the fit would be far too flexible and badly overfit.

This means that as the algorithm is run, the clustering obtained will continually improve until the result no longer changes;

When the result no longer changes, a local optimum has been reached

Chapter 10


5.5. It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw two dendrograms which have the same meaning, but in which two or more of the leaves are repositioned in the change from one to the other. Your dendrogram should have at least four leaves.

Yes

6. In a GAM, does it make sense to include interaction terms? Why or why not?

Yes. There is no good way to approximate the function X1·X2 with a function of the form f1(X1) + f2(X2), so an interaction term can capture structure that an additive model cannot.

We can select a value of α using a

validation set, or using cross-validation.

which.min()

can plot the Cp and BIC statistics, and indicate the models with the smallest statistic

prediction

Ŷ = f̂(X), where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y. The accuracy of Ŷ as a prediction for Y depends on two quantities, which we will call the reducible error and the irreducible error. The first is reducible because we can potentially improve the accuracy of f̂ by using the most appropriate statistical learning technique to estimate f. The second is known as the irreducible error because, no matter how well we estimate f, we cannot reduce the error introduced by ε. The quantity ε may also contain unmeasurable variation. E(Y − Ŷ)^2 represents the average, or expected value, of the squared difference between the predicted and actual value of Y, and Var(ε) represents the variance associated with the error term.

3. Suppose we estimate the regression coefficients in a linear regression model by minimizing sum_{i=1}^{n} (yi − β0 − sum_{j=1}^{p} βjxij)^2 subject to sum_{j=1}^{p} |βj| ≤ s for a particular value of s. For parts (a) through (e), indicate which of the completions i. through v. of the statement is correct. (a) As we increase s from 0, the training RSS will:

(a) (iv) Steadily decrease: as we increase s from 0, all the β's move from 0 toward their least squares estimates. The training RSS is largest when all β's are 0 and steadily decreases to the ordinary least squares RSS.

Maximal Margin Classifier

a hyperplane is a flat affine subspace of dimension p − 1. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace; in other words, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace; that is, a plane. In p > 3 dimensions, it can be hard to visualize a hyperplane, but the notion of a (p − 1)-dimensional flat subspace still applies.

Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error,

a large value of adjusted R2 indicates a model with a small test error

However, since the true relationship is linear, it is hard for a non-parametric approach to compete with linear regression:

a non-parametric approach incurs a cost in variance that is not offset by a reduction in bias

Finally, for much more complicated decision boundaries,

a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully

, the LDA classifier results from assuming that the observations within each class come from

a normal distribution with a class-specific mean vector and a common variance σ2, and plugging estimates for these parameters into the Bayes classifier.

constructing the maximal margin hyperplane based on

a set of n training observations x1,...,xn ∈ Rp and associated class labels y1,...,yn ∈ {−1, 1}.

unsupervised learning,

a set of statistical tools intended for the setting in which we have only a set of features X1, X2,...,Xp measured on n observations.

4. Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms. (a) At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram, the clusters {1, 2, 3} and {4, 5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell? Explain. (b) At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell? Explain.

a. Not enough information. The fusion height is the cluster dissimilarity, which is the maximum (complete linkage) or minimum (single linkage) of the pairwise dissimilarities between elements of the two clusters. If all dissimilarities between the elements of the clusters are equal, the min and max coincide and the fusions occur at the same height; if not, the max is greater than the min and the fusion under complete linkage occurs higher. b. Same height. Each cluster has only one element, so the min and max are the same.


Thus, bagging improves prediction

accuracy at the expense of interpretability

The intuition behind the adjusted R2 is that once all of the correct variables have been included in the model,

additional noise variables will lead to only a very small decrease in RSS.

Consequently

all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting

Generalized additive models

allow us to extend the methods above to deal with multiple predictors.

In a dendrogram, we cannot draw conclusions about the similarity of two observations based on their proximity

along the horizontal axis

In the previous section, we describe the principal component loading vectors as the directions in feature space

along which the data vary the most, and the principal component scores as projections along these directions.

This same connection between LDA and logistic regression

also holds for multidimensional data with p > 1.

summary() function

also provides the percentage of variance explained in the predictors and in the response using different numbers of components

the first and second derivatives of the piecewise polynomials are continuous

also very smooth

support vector machine (SVM)

an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then. SVMs have been shown to perform well in a variety of settings, and are often considered one of the best "out of the box" classifiers.

support vector classifier

an extension of the maximal margin classifier that can be applied in a broader range of cases.

contour()

produces a contour plot to represent three-dimensional data, like a topographical map: an outline representing or bounding the shape or form of something

training error tends to be quite a bit smaller than test error,

and a low training error by no means guarantees a low test error.)

Regression splines

are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.

Classical approaches such as least squares linear regression

are not appropriate in this setting.

all variables in our linear regression model

are quantitative; but in practice this is not necessarily the case, and some predictors are qualitative.

data.frame()

function to merge High with the rest of the Carseats data.

Next, we repeat the process, looking for the best predictor

and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions

The estimate of σ is known as the residual standard error,

and is given by the formula RSE = sqrt(RSS/(n − 2)).

This is admittedly a subjective approach,

and is reflective of the fact that PCA is generally used as a tool for exploratory data analysis

Therefore, we recommend performing clustering with different choices of these parameters

and looking at the full set of results in order to see what patterns consistently emerge.

Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard

and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model.

Now the Bayes decision boundary is quadratic

and so QDA more accurately approximates this boundary than does LDA.

Again the training error rate declines as the method becomes more flexible,

and so we see that the training error rate cannot be used to select the optimal value for K

a linear kernel was used with cost=10

and that there were seven support vectors, four in one class and three in the other.

In the validation set (hold-out set) approach, the model is fit on the training set

and the fitted model is used to predict the responses for the observations in the validation set

In fact, the question of how many principal components are enough is inherently ill-defined,

and will depend on the specific area of application and the specific data set.

3. Here we explore the maximal margin classifier on a toy data set. (a) We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class label. Sketch the observations. (b) Sketch the optimal separating hyperplane, and provide the equa- tion for this hyperplane (of the form (9.1)). (c) Describe the classification rule for the maximal margin classifier. It should be something along the lines of "Classify to Red if β0 + β1X1 + β2X2 > 0, and classify to Blue otherwise." Provide the values for β0, β1, and β2. (d) On your sketch, indicate the margin for the maximal margin hyperplane. (e) Indicate the support vectors for the maximal margin classifier. (f) Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane. (g) Sketch a hyperplane that is not the optimal separating hyper- plane, and provide the equation for this hyperplane. (h) Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.

answer online

In the high-dimensional setting, the multicollinearity problem is extreme

any variable in the model can be written as a linear combination of all of the other variables in the model. Essentially, this means that we can never know exactly which variables (if any) truly are predictive of the outcome, and we can never identify the best coefficients for use in the regression

apply() apply(USArrests , 2, mean)

apply a function—in this case, the mean() function—to each row or column of the data set.The second input here denotes whether we wish to compute the mean of the rows, 1, or the columns, 2.
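
For instance, a minimal sketch on the USArrests data (built into R):
apply(USArrests, 2, mean)   # mean of each column (2 = columns)
apply(USArrests, 2, var)    # variance of each column
apply(USArrests, 1, mean)   # mean of each row (1 = rows)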

The principal components regression (PCR)

approach involves constructing the first M principal components, Z1,...,ZM, and then using these components as the predictors in a linear regression model that is fit using least squares.

The minimum MSE is achieved at

approximately λ = 30.

Two of the most important assumptions state that the relationship between the predictors and response

are additive and linear.

Resampling methods

are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model

Smoothing splines

are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty

R2 increases to 1

as the number of features included in the model increases, and correspondingly the training set MSE decreases to 0 as the number of features increases, even though the features are completely unrelated to the response.

We refer to z11,...,zn1

as the scores of the first principal component

splines can have high variance

at the outer range of the predictors—that is, when X takes on either a very small or very large value

(b) Repeat (a) for test RSS.

b (ii) Decrease initially, and then eventually start increasing in a U shape: When s = 0, all the β's are 0, so the model is extremely simple and has a high test RSS. As we increase s, the β's take on non-zero values and the model starts fitting the test data well, so test RSS decreases. Eventually, as the β's approach their full OLS values, they start overfitting the training data, increasing test RSS.

Also like forward stepwise selection,

backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors

Claims of causality should

be avoided for observational data in interpreting regression coefficients.

It is greedy

because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

the Cp, AIC, and BIC approaches are not appropriate in the high-dimensional setting,

because estimating ˆσ2 is problematic.

For C > 0 no more than C observations can be on the wrong side of the hyperplane

because if an observation is on the wrong side of the hyperplane then εi > 1, and (9.15) requires that sum_{i=1}^n εi ≤ C. As the budget C increases, we become more tolerant of violations to the margin, and so the margin will widen. Conversely, as C decreases, we become less tolerant of violations to the margin and so the margin narrows.

On the other hand, the MSE on an independent test set becomes extremely large as the number of features included in the model increases,

because including the additional predictors leads to a vast increase in the variance of the coefficient estimates.

Local regression is sometimes referred to as a memory-based procedure

because like nearest-neighbors, we need all the training data each time we wish to compute a prediction.

Note that there are three lines representing the Bayes decision boundaries

because there are three pairs of classes among the three classes.

However, local regression can perform poorly if p is much larger than about 3 or 4

because there will generally be very few training observations close to x0.

It is called an additive model

because we calculate a separate fj for each Xj , and then add together all of their contributions.

We note that the issue of whether or not to scale the variables

before performing clustering applies to K-means clustering as well.

Further cuts can be made as one descends the dendrogram in order to obtain any number of clusters,

between 1 (corresponding to no cut) and n (corresponding to a cut at height 0, so that each observation is in its own cluster). In other words, the height of the cut to the dendrogram serves the same role as the K in K-means clustering: it controls the number of clusters obtained.

The concept of dissimilarity

between a pair of observations needs to be extended to a pair of groups of observations.

synergy or interaction effect

between the advertising media, whereby combining the media together results in a bigger boost to sales than using any single medium.

Hence, both logistic regression and LDA produce linear decision boundaries. The only difference

between the two approaches lies in the fact that β0 and β1 are estimated using maximum likelihood, whereas c0 and c1 are computed using the estimated mean and variance from a normal distribution

λ controls the

bias-variance trade-off of the smoothing spline

Ridge regression's advantage over least squares

bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias

In the case of bagging regression trees,

we can record the total amount that the RSS (8.1) is decreased due to splits over a given predictor, averaged over all B trees

This strategy will result in smaller trees,

but is too short-sighted since a seemingly worthless split early on in the tree might be followed by a very good split—that is, a split that leads to a large reduction in RSS later on.

Hence each individual tree has high variance,

but low bias. Averaging these B trees reduces the variance. Bagging has been demonstrated to give impressive improvements in accuracy by combining together hundreds or even thousands of trees into a single procedure

Centroid linkage is often used in genomics,

but suffers from a major drawback in that an inversion can occur, whereby two clusters are fused at a height below either of the individual clusters in the dendrogram. This can lead to difficulties in visualization as well as in interpretation of the dendrogram.

What we really want is a function g that makes RSS small,

but that is also smooth.

Any of these three approaches might be used when pruning the tree,

but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.

As model flexibility increases, training MSE will decrease

but the test MSE may not.

With K = 1, the KNN training error rate is 0,

but the test error rate may be quite high. In general, as we use more flexible classification methods, the training error rate will decline but the test error rate may not.

(c) Repeat (a) for variance.

c (iii) Steadily increase: When s = 0, the model effectively predicts a constant and has almost no variance. As we increase s, the model includes more β's and their values start increasing. At this point, the values of the β's become highly dependent on the training data, thus increasing the variance.

cv.glm() function

can also be used to implement k-fold CV. Below we use k = 10, a common choice for k,

training error

can be easily calculated by applying the statistical learning method to the observations used in its training

Principal components regression (PCR)

can be performed using the pcr()
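
A minimal sketch, assuming the Hitters data with response Salary (rows with missing Salary already removed via na.omit, as in the ISLR labs):
library(pls)
pcr.fit = pcr(Salary ~ ., data=Hitters, scale=TRUE, validation="CV")
summary(pcr.fit)                           # % variance explained and CV error by number of components
validationplot(pcr.fit, val.type="MSEP")   # cross-validated MSE versus number of components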

persp()

can be used to produce a three-dimensional plot.

! symbol

can be used to reverse all of the elements of a Boolean vector.

The latter property is important because glmnet()

can only take numerical, quantitative inputs.

The term dimension reduction comes

from the fact that this approach reduces the problem of estimating the p + 1 coefficients β0, β1,...,βp to the simpler problem of estimating the M + 1 coefficients θ0, θ1,...,θM, where M < p.

lm() number 2

function can also accommodate non-linear transformations of the predictors

boot.fn()

function can also be used in order to create bootstrap estimates for the intercept and slope terms by randomly sampling from among the observations with replacement

svm()

function can be used to fit a support vector classifier when the argument kernel="linear" is used
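
A minimal sketch, assuming dat is a data frame with a two-level factor response y and numeric predictors:
library(e1071)
svmfit = svm(y ~ ., data=dat, kernel="linear", cost=10, scale=FALSE)
summary(svmfit)   # reports the cost used and the number of support vectors in each class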

which.max()

function can be used to identify the location of the maximum point of a vector.

is.na()

function can be used to identify the missing observations.

randomForest()

function can be used to perform both random forests and bagging.
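
A minimal sketch of bagging on the Boston data (MASS package), where train is an assumed vector of training-row indices; setting mtry to all 13 predictors makes the random forest a bagged model:
library(randomForest)
library(MASS)
bag.boston = randomForest(medv ~ ., data=Boston, subset=train, mtry=13, importance=TRUE)
importance(bag.boston)   # variable-importance measures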

table()

function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

predict()

function can be used to produce confidence intervals and prediction intervals for the prediction

title()

function creates a figure title that spans both subplots.

glm()

function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
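
A minimal sketch, assuming the Smarket data from the ISLR package:
library(ISLR)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Volume, data=Smarket, family=binomial)
summary(glm.fit)                                # coefficients, standard errors, p-values
glm.probs = predict(glm.fit, type="response")   # fitted probabilities P(Direction = "Up" | X)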

cbind()

function for building a matrix from a collection of vectors; any function call such as cbind() inside a formula also serves as a wrapper

bs()

function generates the entire matrix of basis functions for splines with the specified set of knots.
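
A minimal sketch, assuming the Wage data from the ISLR package, with knots at ages 25, 40, and 60:
library(splines)
library(ISLR)
spline.fit = lm(wage ~ bs(age, knots=c(25, 40, 60)), data=Wage)   # cubic spline basis by default
summary(spline.fit)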

regsubsets()

function has a built-in plot() command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to the BIC, Cp, adjusted R2, or AIC.

which.max()

function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.

hclust()

function implements hierarchical clustering in R.
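
A minimal sketch, assuming x is a numeric data matrix and Euclidean distance is the dissimilarity measure:
hc.complete = hclust(dist(x), method="complete")
hc.single   = hclust(dist(x), method="single")
plot(hc.complete)        # draw the dendrogram
cutree(hc.complete, 2)   # cut the tree to obtain two clusters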

gam()

function in order to fit a GAM using these components.
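
A minimal sketch, assuming the Wage data from the ISLR package, as in the lab:
library(gam)
library(ISLR)
gam.m3 = gam(wage ~ s(year, 4) + s(age, 5) + education, data=Wage)
summary(gam.m3)
plot(gam.m3, se=TRUE)   # one panel per term, with standard-error bands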

prune.misclass()

function in order to prune the tree to obtain the nine-node tree.

attach()

function in order to tell R to make the variables in this data frame available by name

model.matrix()

function is particularly useful for creating x; not only does it produce a matrix corresponding to the 19 predictors but it also automatically transforms any qualitative variables into dummy variables.

model.matrix()

function is used in many regression packages for building an "X" matrix from data.

cv.glm()

function produces a list with several components. The two numbers in the delta vector contain the cross-validation results. In this case the numbers are identical (up to two decimal places) and correspond to the LOOCV statistic given in (5.1). Below, we discuss a situation in which the two numbers differ
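
A minimal sketch, assuming the Auto data from the ISLR package:
library(boot)
library(ISLR)
glm.fit = glm(mpg ~ horsepower, data=Auto)   # glm() without a family argument fits by least squares
cv.glm(Auto, glm.fit)$delta                  # LOOCV estimate (both delta values agree here)
cv.glm(Auto, glm.fit, K=10)$delta            # 10-fold CV estimate (the two values can differ slightly)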

cor()

function produces a matrix that contains all of the pairwise correlations among the predictors in a data set.

na.omit()

function removes all of the rows that have missing values in any variable.

contrasts()

function returns the coding that R uses for the dummy variables.

apply()

function to average over the columns of this matrix in order to obtain a vector for which the jth element is the cross-validation error for the j-variable model.

We use the ifelse()

function to create a variable, called High, which takes on a value of Yes if the Sales variable exceeds 8, and takes on a value of No otherwise
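
A minimal sketch, assuming the Carseats data from the ISLR package:
library(ISLR)
High = ifelse(Carseats$Sales <= 8, "No", "Yes")
Carseats = data.frame(Carseats, High)   # merge High with the rest of the Carseats data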

text()

function to display the node labels.

anova()

function to further quantify the extent to which the quadratic fit is superior to the linear fit. performs a hypothesis test comparing the two models.

jitter()

function to jitter the age values a bit so that observations with the same age value do not cover each other up. This is often called a rug plot.

Forward stepwise selection can be applied even in the high-dimensional setting where n < p;

however, in this case, it is possible to construct submodels M0,...,Mn−1 only, since each submodel is fit using least squares, which will not yield a unique solution if p ≥ n

Standard errors

can be used to perform hypothesis tests on the coefficients.

In greater detail, for any j and s, we define the pair of half-planes R1(j, s) = {X|Xj < s} and R2(j, s) = {X|Xj ≥ s}, and we seek the value of j and s that minimize the equation

sum_{i: xi ∈ R1(j,s)} (yi − ŷR1)^2 + sum_{i: xi ∈ R2(j,s)} (yi − ŷR2)^2

Correlation of Error Terms

if the errors are uncorrelated, then the fact that εi is positive provides little or no information about the sign of εi+1. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. Solution: the assumption of uncorrelated errors is extremely important for linear regression as well as for other statistical methods, and good experimental design is crucial in order to mitigate the risk of such correlations.

the parametric approach will outperform the nonparametric approach

if the parametric form that has been selected is close to the true form of f

prune.tree()

if we wish to prune the tree,

1. In the course of studying the SVM we discussed the hyperplane defined by β0 + β1X1 + β2X2 + · · · + βpXp = 0 Answer the following questions about this hyperplane. (a) Is this hyperplane always a vector subspace of R p ? Why or why not? (b) Give a vector n which is normal to this hyperplane. (c) Suppose we have a point X = (X1, . . . , Xp) which is not on this hyperplane. What is the (perpendicular) distance between this point and the hyperplane. (Note: you may not assume that the coefficients βi have been normalized.)

(a) If β0 ≠ 0, the hyperplane does not contain the origin and so is not a vector subspace of R^p. (b) n = (β1, β2, ..., βp) is normal to the hyperplane. (c) The perpendicular distance is |β0 + β1X1 + ... + βpXp| / (sum_{i=1}^p βi^2)^(1/2).

β1 is positive then increasing X will be associated with increasing p(X),

if β1 is negative then increasing X will be associated with decreasing p(X).

plsr() function

implements partial least squares (PLS).

read.table()

importing a data set into R.

The figure indicates that performing PCR with an appropriate choice of M can result

in a substantial improvement over least squares, especially in the left-hand panel

This highlights one difference between boosting and random forests:

in boosting, because the growth of a particular tree takes into account the other trees that have already been grown, smaller trees are typically sufficient. Using smaller trees can aid in interpretability as well; for instance, using stumps leads to an additive model.

This highlights a very important point

in interpreting dendrograms that is often misunderstood.

In practice, we tend to look at the first few principal components

in order to find interesting patterns in the data. If no interesting patterns are found in the first few principal components, then further principal components are unlikely to be of interest.

A large value

indicates an important predictor. Similarly, in the context of bagging classification trees, we can add up the total amount that the Gini index (8.6) is decreased by splits over a given predictor, averaged over all B trees.

mtry=13

indicates that all 13 predictors should be considered for each split of the tree—in other words, that bagging should be done.

The p-value for the interaction term, TV×radio, is extremely low,

indicating that there is strong evidence for Ha: β3 ≠ 0. In other words, it is clear that the true relationship is not additive

k-Fold Cross-Validation

involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times;

Bayes Classifier

is a conditional probability: it is the probability that Y = j, given the observed predictor vector x0. This very simple classifier is called the Bayes classifier. The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate.

Local regression is

a different approach for fitting flexible non-linear functions, which involves computing the fit at a target point x0 using only the nearby training observations

This is because each of the M principal components used in the regression

is a linear combination of all p of the original features

function g(x) that minimizes (7.11)

is a natural cubic spline with knots at x1,...,xn! However, it is not the same natural cubic spline that one would get if one applied the basis function approach described in Section 7.4.3 with knots at x1,...,xn—rather, it is a shrunken version of such a natural cubic spline, where the value of the tuning parameter λ in (7.11) controls the level of shrinkage

Principal components analysis (PCA)

is a popular approach for deriving a low-dimensional set of features from a large set of variables.

A natural spline

is a regression spline with additional boundary constraints: the function is required to be linear at the boundary (in the region where X is smaller than the smallest knot, or larger than the largest knot). This additional constraint means that natural splines generally produce more stable estimates at the boundaries.

K-means clustering

is a simple and elegant approach for partitioning a data set into K distinct, non-overlapping clusters.

Simple linear regression

is a useful approach for predicting a response on the basis of a single predictor variable

the MSE associated with the least squares fit, when λ = 0,

is almost as high as that of the null model for which all coefficient estimates are zero, when λ = ∞.

RSE

is an estimate of the standard deviation of e.

The support vector machine (SVM)

is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels.

The test error rate associated with a set of test observations of the form (x0, y0)

is given by Ave(I(y0 ≠ ŷ0))

In the linear regression setting, the least squares approach is

in fact a special case of maximum likelihood

The level with no dummy variable

is known as the baseline

The function I()

is needed since the ^ has a special meaning in a formula; wrapping as we do allows the standard usage in R, which is to raise X to the power 2

A good classifier

is one for which the test error Ave(I(y0 ≠ ŷ0)) is smallest.

Discriminant analysis

is popular for multiple-class classification

Local regression

is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.

The kernel approach that we describe here

is simply an efficient computational approach for enacting this idea.

The first principal component direction of the data

is that along which the observations vary the most.

The potential disadvantage of a parametric approach

is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor.

The recursive binary splitting approach is top-down because

it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.

backward stepwise

it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time. It can be applied in settings where p is too large to apply best subset selection.

in the interest of • Greater robustness to individual observations, and • Better classification of most of the training observations.

it could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations

LOOCV has a couple of major advantages over the validation set approach.

First, it has far less bias. Consequently, the LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does. Second, in contrast to the validation approach, which will yield different results when applied repeatedly due to randomness in the training/validation set splits, performing LOOCV multiple times will always yield the same results: there is no randomness in the training/validation set splits.

because there is no predict()

it is a little tedious

simple linear regression

it is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X: Y ≈ β0 + β1X, with fitted values yˆ = βˆ0 + βˆ1x.

when p>n,

it is easy to obtain a useless model that has zero residuals. Therefore, one should never use sum of squared errors, p-values, R2 statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting.

In marketing, this is known as a synergy effect, and in statistics

it is referred to as an interaction effect. Since β˜1 changes with X2, the effect of X1 on Y is no longer constant: adjusting X2 will change the impact of X1 on Y

The first principal component loading vector has a very special property:

it is the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness).

Though the cross-validation error curve slightly underestimates the test error rate,

it takes on a minimum very close to the best value for K

The notion of principal components as the dimensions that are closest to the n observations extends beyond

just the first principal component

K-Means is a seemingly complicated clustering algorithms. Here is a simpler one: Given k, the number of clusters, and n, the number of observations, try all possible assignments of the n observations into k clusters. Then, select one of the assignments that minimizes Within-Cluster Variation as defined on page 30. Assume that you implemented the most naive version of the above algorithm. Here, by naive we mean that you try all possible assignments even though some of them might be redundant (for example, the algorithm tries assigning all of the observations to cluster 1 and it also tries to assign them all to cluster 2 even though those are effectively the same solution). In terms of n and k, how many potential solutions will your algorithm try?

k^n

appropriate tuning parameter selection is crucial

for good predictive performance.

The MSE will be small

if the predicted responses are very close to the true responses, and will be large if for some of the observations the predicted and true responses differ substantially.

RSE is considered a measure of the

lack of fit of the model

Each of the dimensions found by PCA is a

linear combination of the p features

K(xi, xi′) = sum_{j=1}^p xij xi′j

linear kernel

Another application of clustering arises in marketing

market segmentation by identifying subgroups of people who might be more receptive to a particular form of advertising, or more likely to purchase a particular product. The task of performing market segmentation amounts to clustering the people in the data set.

To fit logistic function we use

maximum likelihood

The additive assumption

means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors

A smaller tree with fewer splits (that is, fewer regions R1,...,RJ )

might lead to lower variance and better interpretation at the cost of a little bias. One possible alternative to the process described above is to build the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold.

β0 and β1 are unknown

minimizing the least squares criterion

Using more knots leads to a

more flexible piecewise polynomial.

The bootstrap is used in several contexts,

most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method

Then in the collection of bagged trees,

most or all of the trees will use this strong predictor in the top split.

R2 statistic, which is also computed on the training data,

must increase

A natural approach is to find the function g that minimizes

sum_{i=1}^n (yi − g(xi))^2 + λ ∫ g''(t)^2 dt    (7.11)
The term sum_{i=1}^n (yi − g(xi))^2 is a loss function that encourages g to fit the data well, and the term λ ∫ g''(t)^2 dt is a penalty term that penalizes the variability in g. The notation g''(t) indicates the second derivative of the function g. The first derivative g'(t) measures the slope of a function at t, and the second derivative corresponds to the amount by which the slope is changing.

RSS and TSS

RSS = sum_{i=1}^n (yi − ŷi)^2 is the residual sum of squares, and TSS = sum_{i=1}^n (yi − ȳ)^2 is the total sum of squares.

The most direct way to represent a cubic spline using (7.9) is to start off with a basis for a cubic polynomial

namely, x, x^2, x^3—and then add one truncated power basis function per knot

However, which method leads to better prediction accuracy?

neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set

Suppose we want to fit a generalized additive model (with a continuous response) for 𝑦 against 𝑋1 and 𝑋2. Suppose that we are using a cubic spline with four knots for each variable (so our model can be expressed as a linear regression after the right basis expansion). Suppose that we fit our model by the following three steps: 1) First fit our cubic spline model for 𝑦 against 𝑋1, obtaining the fit 𝑓̂ 1(𝑥) and residuals 𝑟𝑖=𝑦𝑖−𝑓̂ 1(𝑋𝑖,1). 2) Then, fit a cubic spline model for 𝑟 against 𝑋2 to obtain 𝑓̂ 2(𝑥). 3) Finally construct fitted values 𝑦̂ 𝑖=𝑓̂ 1(𝑋𝑖,1)+𝑓̂ 2(𝑋𝑖,2). Will we get the same fitted values as we would if we fit the additive model for 𝑦 against 𝑋1 and 𝑋2 jointly?

not necessarily, even if 𝑋1 and 𝑋2 are uncorrelated.

!=

notation means not equal to, and so the last command computes the test set error rate.

The first principal component

of a set of features X1, X2,...,Xp is the normalized linear combination of the features Z1 = φ11X1 + φ21X2 + ... + φp1Xp that has the largest variance

R2 statistic is a measure

of the linear relationship between X and Y

The results obtained when we perform PCA will also depend

on whether the variables have been individually scaled (each multiplied by a different constant).

Although the collection of bagged trees is much more difficult to interpret than a single tree,

one can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees).

therefore highlights a very attractive aspect of hierarchical clustering:

one single dendrogram can be used to obtain any number of clusters. In practice, people often look at the dendrogram and select by eye a sensible number of clusters, based on the heights of the fusion and the number of clusters desired

That the maximal margin hyperplane depends directly on

only a small subset of the observations is an important property that will arise later in this chapter when we discuss the support vector classifier and support vector machines

regsubsets()

only reports results up to the best eight-variable model.

constrains or regularizes the coefficient estimates,

or equivalently, that shrinks the coefficient estimates towards zero.

Variance

refers to the amount by which ˆf would change if we estimated it using a different training data set. However, if a method has high variance then small changes in the training data can result in large changes in ˆf. In general, more flexible statistical methods have higher variance. In contrast, the orange least-squares line is relatively inflexible and has low variance, because moving any single observation will likely cause only a small shift in the position of the line

We can again use cross-validation to choose s,

or we can specify it directly

A natural way to extend the multiple linear regression model

in order to allow for non-linear relationships between each feature and the response is to replace each linear component βjxij with a (smooth) non-linear function fj(xij)

We might also want to scale the variables to have standard deviation one if they are measured on different scales;

otherwise, the choice of units (e.g. centimeters versus kilometers) for a particular variable will greatly affect the dissimilarity measure obtained. It should come as no surprise that whether or not it is a good decision to scale the variables before computing the dissimilarity measure depends on the application at hand.

prcomp() function also

outputs the standard deviation of each principal component

Hierarchical clustering has an added advantage

over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.

This null hypothesis implies that

p(X) = e^β0 / (1 + e^β0)

odds

p(X)/(1 − p(X)) = e^(β0+β1X)

Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form

p(β|X, Y ) ∝ f(Y |X, β)p(β|X) = f(Y |X, β)p(β). where the proportionality above follows from Bayes' theorem, and the equality above follows from the assumption that X is fixed.

randomForest() uses

p/3 variables when building a random forest of regression trees, and √p variables when building a random forest of classification trees. Here we use mtry = 6: rf.boston = randomForest(medv ~ ., data=Boston, subset=train, mtry=6, importance=TRUE)

most statistical learning methods for this task can be characterized as either

parametric or non-parametric

kmeans()

performs K-means clustering in R.

cv.tree()

performs cross-validation in order to determine the optimal level of tree complexity; cost complexity pruning is used in order to select a sequence of trees for consideration
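
A minimal sketch, assuming tree.carseats is a classification tree already fit with the tree package:
library(tree)
set.seed(3)
cv.carseats = cv.tree(tree.carseats, FUN=prune.misclass)   # use classification error to guide CV
cv.carseats$size                                           # terminal nodes of each tree considered
cv.carseats$dev                                            # corresponding CV error
prune.carseats = prune.misclass(tree.carseats, best=9)     # prune to the nine-node tree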

nfolds

by default, cv.glmnet() performs ten-fold cross-validation, though this can be changed using this argument.

Piecewise Polynomials (part of Regression Splines)

piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of X. The points where the coefficients change are called knots. A piecewise cubic polynomial with a single knot at a point c takes the form yi = β01 + β11 xi + β21 xi^2 + β31 xi^3 + ei if xi < c, and yi = β02 + β12 xi + β22 xi^2 + β32 xi^3 + ei if xi ≥ c.

regularization or shrinkage

plays a key role in high-dimensional problems,

K(xi, xi′) = (1 + sum_{j=1}^p xij xi′j)^d

polynomial kernel

Polynomial functions of the predictors can be used in a logistic regression to

predict this binary response

There are two main reasons that we may wish to estimate f:

prediction and inference

Of course, other considerations beyond simply test error may come into play in selecting a statistical learning method; for instance, in certain settings,

prediction using a tree may be preferred for the sake of interpretability and visualization

However, whether the

predictors are qualitative or quantitative is generally considered less important.

lasso has a major advantage over ridge regression

produces simpler and more interpretable models that involve only a subset of the predictors

Generalized additive models (GAMs)

provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity. Just like linear models, GAMs can be applied with both quantitative and qualitative responses.

Random forests

provide an improvement over bagged trees by way of a small tweak that decorrelates the trees.

identify()

provides a useful interactive method for identifying the value of a particular variable for points on a plot.

The fact that adding newspaper advertising to the model containing only TV and radio advertising leads to just a tiny increase in R2

provides additional evidence that newspaper can be dropped from the model.

The standard linear regression model

provides interpretable results and works quite well on many real-world problems

Suppose we produce ten bootstrap samples from a data set containing red and green classes. We then apply a classification tree to each bootstrap sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 0.1,0.15,0.2,0.2,0.55,0.6,0.6,0.65,0.7, and 0.75 There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in the notes. The second approach is to classify based on the average probability. What is the final classification under the majority vote method?:

Red: 6 of the 10 probabilities are greater than 1/2, so the majority vote method will select red.

averaging a set of observations

reduces variance.

Each constraint that we impose on the piecewise cubic polynomials effectively frees up one degree of freedom,

reducing the complexity of the resulting piecewise polynomial fit.

Clustering

refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. Of course, to make this concrete, we must define what it means for two or more observations to be similar or different. Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.

The reason is

regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are zero.

cv.tree() function

reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used (k, which corresponds to α in (8.4)).

a linear spline is obtained by fitting a line in each region of the predictor space defined by the knots

requiring continuity at each knot.

hierarchical clustering can sometimes yield worse (i.e. less accurate)

results than K-means clustering for a given number of clusters.

The two best-known techniques for shrinking the regression coefficients towards zero

ridge regression and the lasso.

Non-parametric methods do not make explicit assumptions about the functional form of f.

seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.

Both clustering and PCA

seek to simplify the data via a small number of summaries, but their mechanisms are different:

PCA

seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension

cbind() function

short for column bind

In other words, we assume that the directions in which X1,...,Xp

show the most variation are the directions that are associated with Y .

has the lowest training MSE of all three methods,

since it corresponds to the most flexible

Similarly, problems arise in the application of adjusted R2 in the high-dimensional setting

since one can easily obtain a model with an adjusted R2 value of 1. Clearly, alternative approaches that are better-suited to the high-dimensional setting are required.

We constrain the loadings so that their sum of squares is equal to one,

since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance

The appeal of this interpretation is clear: we seek a single dimension of the data that lies as close as possible to all of the data points,

since such a line will likely provide a good summary of the data

lm() function

since that function provides more useful outputs, such as standard errors and p-values for the coefficients.

These directions are identified in an unsupervised way,

since the response Y is not used to help determine the principal component directions

gbm() with the option distribution="gaussian"

since this is a regression problem; if it were a binary classification problem, we would use distribution="bernoulli". The argument n.trees=5000 indicates that we want 5000 trees, and the option interaction.depth=4 limits the depth of each tree.
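
A minimal sketch on the Boston data (MASS package), where train is an assumed vector of training-row indices:
library(gbm)
library(MASS)
boost.boston = gbm(medv ~ ., data=Boston[train, ], distribution="gaussian",
                   n.trees=5000, interaction.depth=4)
summary(boost.boston)   # relative influence of each predictor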

p-value

a small p-value means that we can infer that there is an association between the predictor and the response.

In order to fit a smoothing spline

we use the smooth.spline() function, e.g. fit = smooth.spline(age, wage, cv=TRUE)

As with bagging, random forests will not overfit if we increase B,

so in practice we use a value of B sufficiently large for the error rate to have settled down

βˆ0 and βˆ1 are very large relative to their standard errors

so the t-statistics are also large;

Just as in the regression setting, we use recursive binary

splitting to grow a classification tree.

scale()

standardize the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale

We strongly recommend always running K-means clustering with a large value of nstart

such as 20 or 50, since otherwise an undesirable local optimum may be obtained.

supervised learning methods

such as regression and classification. In the supervised learning setting, we typically have access to a set of p features X1, X2,...,Xp, measured on n observations, and a response Y also measured on those same n observations. The goal is then to predict Y using X1, X2,...,Xp.

the p-values associated with the coefficient estimates for the two dummy variables are very large

suggesting no statistical evidence of a real difference in credit card balance between the ethnicities.

na.strings

tells R that any time it sees a particular character or set of characters (such as a question mark), it should be treated as a missing element of the data matrix

regression problems

tend to refer to problems with a quantitative response

As a consequence, the Cp statistic

tends to take on a small value for models with a low test error, so when determining which of a set of models is best, we choose the model with the lowest Cp value

Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?

the number of red predictions is greater than the number of green predictions based on a 50% threshold, thus RED. The average of the probabilities is less than the 50% threshold, thus GREEN.

Note that the two arguments to the plot.svm() function are

the output of the call to svm(), as well as the data used in the call to svm()

The bootstrap

the power of the bootstrap lies in the fact that it can be easily applied to a wide range of statistical learning methods, including some for which a measure of variability is otherwise difficult to obtain and is not automatically output by statistical software.

using methods like bagging, random forests, and boosting,

the predictive performance of trees can be substantially improved. We introduce these concepts in the next section.

We generally standardize

the predictors and response before performing PLS

model selection.

the process of selecting the proper level of flexibility for a model

(The notation {X|Xj < s} means

the region of predictor space in which Xj takes on a value less than s.) That is, we consider all predictors X1,...,Xp, and all possible values of the cutpoint s for each of the predictors, and then choose the predictor and cutpoint such that the resulting tree has the lowest RSS.

Consider two curves, ˆg1 and ˆg2, defined by gˆ1 = arg min g Xn i=1 (yi − g(xi))2 + λ Z g (3)2 dx! gˆ2 = arg min g Xn i=1 (yi − g(xi))2 + λ Z g (4)2 dx! (a) As λ → ∞, will ˆg1 or ˆg2 have the smaller training RSS?

As λ → ∞, the penalty forces the third derivative of ĝ1 to be zero (so ĝ1 is at most quadratic), while it forces the fourth derivative of ĝ2 to be zero (so ĝ2 is at most cubic). ĝ2 is therefore more flexible and will have the smaller training RSS.

Cross-validation can be used to estimate

the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility

As the number of features increases

the test set error increases.

As we have seen previously,

the training error tends to decrease as the flexibility of the fit increases

The validation set approach is conceptually simple and is easy to implement. But it has two potential drawbacks:

the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set. In the validation approach, only a subset of the observations—those that are included in the training set rather than in the validation set—are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

we can directly estimate the test error using

the validation set and cross-validation methods

Now, suppose that X does not satisfy (9.2); rather, β0 + β1X1 + β2X2 + ... + βpXp > 0. (9.3) Then this tells us that X lies to one side of the hyperplane. On the other hand, if β0 + β1X1 + β2X2 + ... + βpXp < 0

then X lies on the other side of the hyperplane. So we can think of the hyperplane as dividing p-dimensional space into two halves

If the variables are scaled to have standard deviation one before the inter-observation dissimilarities are computed,

then each variable will in effect be given equal importance in the hierarchical clustering performed.

If g is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of λ,

then it follows that the posterior mode for β is the lasso solution. (However, the lasso solution is not the posterior mean, and in fact, the posterior mean does not yield a sparse coefficient vector.)

• If g is a Gaussian distribution with mean zero and standard deviation a function of λ,

then it follows that the posterior mode for β—that is, the most likely value for β, given the data—is given by the ridge regression solution

glm() to fit a model without passing in the family argument

then it performs linear regression,

If the predictions obtained using the model are very close to the true outcome values

then the RSE will be small and we can conclude that the model fits the data very well.

To perform K-means clustering, we must first specify the desired number of clusters K;

then the K-means algorithm will assign each observation to exactly one of the K clusters.

When the true decision boundaries are linear

then the LDA and logistic regression approaches will tend to perform well

Consequently, if we perform PCA on the unscaled variables,

then the first principal component loading vector will have very large loadings for the variables with the highest variance.

When the tuning parameter C is large,

then the margin is wide, many observations violate the margin, and so there are many support vectors. This classifier has low variance (since many observations are support vectors) but potentially high bias

When λ = 0,

then the penalty term in (7.11) has no effect, and so the function g will be very jumpy and will exactly interpolate the training observations.

In contrast, if C is small

then there will be fewer support vectors and hence the resulting classifier will have low bias but high variance.

PCR will tend to do well in cases when the first few principal components are sufficient

to capture most of the variation in the predictors as well as the relationship with the response.

names()

to check the variable names.

Random forests overcome this problem by forcing each split

to consider only a subset of the predictors. Therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. (This is decorrelating)

If p>n then there are more coefficients βj

to estimate than observations from which to estimate them.

write.table()

to export data.

tree() function

to fit a classification tree in order to predict High using all variables but Sales. The syntax of the tree() function is quite similar to that of the lm() function

gbm() function

to fit boosted regression trees to the Boston data set

In addition to carefully selecting the dissimilarity measure used, one must also consider whether or not the variables should be scaled

to have standard deviation one before the dissimilarity between the observations is computed.

pretty=0 instructs R

to include the category names for any qualitative predictors, rather than simply displaying a letter for each category

FUN=prune.misclass

to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance.

β0, β1,...,βp

to minimize the sum of squared residuals

confint()

to obtain a confidence interval for the coefficient estimates, we can use the confint() command

Although the maximal margin classifier is often successful, it can also lead

to overfitting when p is large.

tune()

to perform cross- validation. By default, tune() performs ten-fold cross-validation on a set of models of interest. In order to use this function, we pass in relevant information about the set of models that are under consideration
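
A minimal sketch, assuming dat is a data frame with a factor response y, searching over a grid of cost values for a linear support vector classifier:
library(e1071)
tune.out = tune(svm, y ~ ., data=dat, kernel="linear",
                ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)               # cross-validation error for each cost
bestmod = tune.out$best.model   # the model with the lowest CV error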

regsubsets() function

to perform forward stepwise or backward stepwise selection, using the argument method="forward" or method="backward". For example: regfit.fwd = regsubsets(Salary ~ ., data=Hitters, nvmax=19, method="forward")

predict() function can be used

to predict the class label on a set of test observations, at any given value of the cost parameter. We begin by generating a test data set.

scale=FALSE tells the svm() function not

to scale each feature to have mean zero or standard deviation one; depending on the application, one might prefer to use scale=TRUE.

cv.tree() function

to see whether pruning the tree will improve performance

In the former case we also use the degree

to specify a degree for the polynomial kernel (this is d),

and in the latter case we use gamma

to specify a value of γ for the radial basis kernel (9.24).

Hence a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is

to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.

In contrast, for a classification tree, we predict that each observation belongs

to the most commonly occurring class of training observations in the region to which it belongs. In interpreting the results of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.

s()

which is part of the gam library, is used to indicate that we would like to use a smoothing spline. We specify that the function of year should have 4 degrees of freedom, and that the function of age will have 5 degrees of freedom: ~ s(year, 4) + s(age, 5)

We saw in Section 6.3.1 that we can perform regression using the principal component score vectors as features. In fact, many statistical techniques, such as regression, classification, and clustering, can be easily adapted

to use the n × M matrix whose columns are the first M ≪ p principal component score vectors, rather than using the full n × p data matrix. This can lead to less noisy results, since it is often the case that the signal (as opposed to the noise) in a data set is concentrated in its first few principal components.

Therefore, from a Bayesian viewpoint, ridge regression and the lasso follow directly from assuming the usual linear model with normal errors,

together with a simple prior distribution for β

If 𝛽 is not a unit vector but instead has length 2, then ∑𝑝𝑗=1𝛽𝑗𝑋𝑗 is

twice the signed Euclidean distance from the separating hyperplane sum_{j=1}^p βj Xj = 0. Reason: β′ = (1/2)β has length 1, so it is a unit vector in the same direction as β. Therefore sum_{j=1}^p βj Xj = 2 sum_{j=1}^p β′j Xj, where sum_{j=1}^p β′j Xj is the signed Euclidean distance from the hyperplane.

the test error tends to increase as the dimensionality of the problem (i.e. the number of features or predictors) increases,

unless the additional features are truly associated with the response.

Each principal component loading vector is unique,

up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ.

As with PCR, the number M of partial least squares directions

used in PLS is a tuning parameter that is typically chosen by cross-validation

We then use least squares to fit a linear model

using C1(X), C2(X),...,CK(X) as predictors: yi = β0 + β1C1(xi) + β2C2(xi) + ... + βKCK(xi) + ei

The number of trees B is not a critical parameter with bagging;

using a very large value of B will not lead to overfitting

The resulting OOB error is a

valid estimate of the test error for the bagged model

cylinders

variable is stored as a numeric vector, so R has treated it as quantitative

In addition, clustering methods generally are not

very robust to perturbations to the data.

why would we ever choose to use a more restrictive method instead of a very flexible approach?

If we are mainly interested in inference, then restrictive models are much more interpretable; very flexible approaches can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.

When a given method, yields a small training MSE but a large test MSE,

we are said to be overfitting the data.

raw=TRUE argument to the poly() function.

we can also use poly() to obtain age, age^2, age^3 and age^4 directly, if we prefer.

In general, we can cluster observations on the basis of the features in order to identify subgroups among the observations, or

we can cluster features on the basis of the observations in order to discover subgroups among the features. In what follows, for simplicity we will discuss clustering observations on the basis of the features, though the converse can be performed by simply transposing the data matrix.

The maximal margin classifier is a very natural way to perform classification, if a separating hyperplane exists. However, as we have hinted, in many cases no separating hyperplane exists, and so there is no maximal margin classifier.

we can extend the concept of a separating hyperplane in order to develop a hyperplane that almost separates the classes, using a so-called soft margin. The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier.

importance()

we can view the importance of each variable.

On the other hand, in hierarchical clustering

we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram.

In order to perform recursive binary splitting, we first select the predictor Xj and the cutpoint s such that splitting

we first select the predictor Xj and the cutpoint s such that splitting the predictor space into the regions {X|Xj < s} and {X|Xj ≥ s} leads to the greatest possible reduction in RSS.

To perform best subset selection

we fit a separate least squares regression for each possible combination of the p predictors: all p models that contain exactly one predictor, and so on, up through the model containing all p predictors. Then, in order to select a single best model, we choose among the p + 1 best models of each size. This task must be performed with care, because the RSS of these p + 1 models decreases monotonically, and the R2 increases monotonically, as the number of features included in the models increases.

In the case of logistic regression, instead of ordering models by RSS

we instead use the deviance, a measure deviance that plays the role of RSS for a broader class of models.

If the test MSE of KNN is only slightly lower than that of linear regression,

we might be willing to forego a little bit of prediction accuracy for the sake of a simple model that can be described in terms of just a few coefficients, and for which p-values are available.

Even in problems in which the dimension is small,

we might prefer linear regression to KNN from an interpretability standpoint.

Note that in order for the svm() function to perform classification (as opposed to SVM-based regression),

we must encode the response as a factor variable. We now create a data frame with the response coded as a factor.
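A minimal sketch with the e1071 package, where x is a hypothetical numeric matrix and y a vector of class labels:
  library(e1071)
  dat <- data.frame(x = x, y = as.factor(y))   # the response must be a factor for classification
  svm.fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
  summary(svm.fit)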

p(X) = e^(β0+β1X) / (1 + e^(β0+β1X))

we must model p(X) using a function that gives outputs between 0 and 1 for all values of X. We use the logistic function
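In R this model is fit with glm() and family = binomial; a minimal sketch on the Default data (ISLR package assumed):
  library(ISLR)
  glm.fit <- glm(default ~ balance, data = Default, family = binomial)
  coef(glm.fit)                              # estimates of beta0 and beta1
  head(predict(glm.fit, type = "response"))  # fitted probabilities p(X), always between 0 and 1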

How do we determine the best way to prune the tree?

we need a way to select a small set of subtrees for consideration
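Cost-complexity pruning can be done with the tree package; a sketch, where tree.fit is a hypothetical classification tree fit with tree():
  library(tree)
  set.seed(2)
  cv.out <- cv.tree(tree.fit, FUN = prune.misclass)      # CV error over a sequence of subtree sizes
  best.size <- cv.out$size[which.min(cv.out$dev)]
  pruned.fit <- prune.misclass(tree.fit, best = best.size)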

In the multiple regression setting with p predictors,

we need to ask whether all of the regression coefficients are zero, i.e. whether β1 = β2 = ··· = βp = 0

Once the regions R1,...,RJ have been created,

we predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.

Since clustering can be non-robust,

we recommend clustering subsets of the data in order to get a sense of the robustness of the clusters obtained. Most importantly, we must be careful about how the results of a clustering analysis are reported. These results should not be taken as the absolute truth about a data set

the prcomp() function centers the variables to have mean zero. By using the option scale=TRUE,

we scale the variables to have standard deviation one.
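For example, on the built-in USArrests data:
  pr.out <- prcomp(USArrests, scale = TRUE)  # variables centered by default, scaled to sd 1 here
  pr.out$rotation                            # loadings
  summary(pr.out)                            # proportion of variance explained by each component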

However, by examining the ridge regression and lasso results

we see that PCR does not perform as well as the two shrinkage methods in this example

In K-means clustering

we seek to partition the observations into a pre-specified number of clusters.

smooth.spline()

we select the smoothness level by cross-validation
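Sketch on the Wage data (ISLR package assumed); cv = TRUE selects λ by leave-one-out cross-validation:
  library(ISLR)
  ss.fit <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
  ss.fit$df   # effective degrees of freedom chosen by cross-validation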

To perform principal components regression (PCR)

we simply use principal components as predictors in a regression model in place of the original larger set of variables

Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of X

we can use step functions in order to avoid imposing such a global structure.

Because it is undesirable for the principal components obtained to depend on an arbitrary choice of scaling,

we typically scale each variable to have standard deviation one before we perform PCA

In order to obtain the fitted values for a given SVM model fit,

we use decision.values=TRUE when fitting svm().
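Sketch, continuing a hypothetical data frame dat with a factor response y (e1071 package assumed):
  library(e1071)
  svm.fit <- svm(y ~ ., data = dat, kernel = "radial", decision.values = TRUE)
  pred <- predict(svm.fit, newdata = dat, decision.values = TRUE)
  attributes(pred)$decision.values   # the fitted decision values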

To run the kmeans() function in R with multiple initial cluster assignments,

we use the nstart argument. If a value of nstart greater than one is used, then K-means clustering will be performed using multiple random assignments in Step 1 of Algorithm 10.1, and the kmeans() function will report only the best results.
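Sketch on a hypothetical numeric matrix x:
  set.seed(4)
  km.out <- kmeans(x, centers = 3, nstart = 20)  # 20 random starts; only the best result is reported
  km.out$cluster                                 # cluster assignment for each observation
  km.out$tot.withinss                            # total within-cluster sum of squares of the best run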

In general, an n × p data matrix X has min(n − 1, p) distinct principal components. However, we usually are not interested in all of them;

we would like to use just the first few principal components in order to visualize or interpret the data. In fact, we would like to use the smallest number of principal components required to get a good understanding of the data

Given estimates β̂0, β̂1, ..., β̂p, we can make predictions using the formula

ŷ = β̂0 + β̂1x1 + β̂2x2 + ··· + β̂pxp.

The OOB approach for estimating the test error is particularly convenient

when performing bagging on large data sets for which cross-validation would be computationally onerous.

As a general rule, parametric methods will tend to outperform non-parametric approaches

when there is a small number of observations per predictor

Local regression also generalizes very naturally

when we want to fit models that are local in a pair of variables X1 and X2, rather than one

lm(y∼x,data)

where y is the response, x is the predictor, and data is the data set in which these two variables are kept.

In particular, the ridge regression coefficient estimates β̂^R are the values that minimize RSS + λ Σ_{j=1}^{p} βj^2,

where λ ≥ 0 is a tuning parameter, to be determined separately When λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.
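A sketch with the glmnet package, where alpha = 0 gives the ridge penalty; x is a hypothetical predictor matrix (e.g. from model.matrix()) and y the response:
  library(glmnet)
  grid <- 10^seq(10, -2, length = 100)            # lambda values from very large to very small
  ridge.fit <- glmnet(x, y, alpha = 0, lambda = grid)
  cv.out <- cv.glmnet(x, y, alpha = 0)            # choose lambda by cross-validation
  cv.out$lambda.min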

Thus, observations that fuse at the very bottom of the tree are quite similar to each other,

whereas observations that fuse close to the top of the tree will tend to be quite different.

If p > n, then the least squares estimates do not even have a unique solution,

whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance

The problem is that a low RSS or a high R2 indicates a model with a low training error,

whereas we wish to choose a model that has a low test error.

in the simple linear regression setting

to determine whether there is a relationship between the response and the predictor, we can simply check whether β1 = 0

First of all, training error rates will usually be lower than test error rates

which are the real quantity of interest

This extension is achieved by developing the notion of linkage,

which defines the dissimilarity between two groups of observations

Hierarchical clustering is an alternative approach

which does not require that we commit to a particular choice of K.

LDA is trying to approximate the Bayes classifier

which has the lowest total error rate out of all classifiers (if the Gaussian model is correct). That is, the Bayes classifier will yield the smallest possible total number of misclassified observations, irrespective of which class the errors come from

support vector machine,

which is a further extension of the support vector classifier in order to accommodate non-linear class boundaries.

maximal margin hyperplane

which is the separating hyperplane that is farthest from the training observations. That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin.

t-statistic

which measures the number of standard deviations that β̂1 is away from 0; it is computed as t = β̂1 / SE(β̂1).

anova() function

which performs an analysis of variance (ANOVA, using an F-test) in order to test the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypothesis that a more complex model M2 is required. In order to use the anova() function, M1 and M2 must be nested models: the predictors in M1 must be a subset of the predictors in M2.
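For example, comparing nested polynomial fits on the Wage data (ISLR package assumed):
  library(ISLR)
  fit.1 <- lm(wage ~ age, data = Wage)
  fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
  fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
  anova(fit.1, fit.2, fit.3)   # F-tests of each model against the next, more complex one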

Although the sample size for this data set is substantial (n = 3,000), there are only 79 high earners,

which results in a high variance in the estimated coefficients and consequently wide confidence intervals.

alpha.fn()

which takes as input the (X, Y ) data as well as a vector indicating which observations should be used to estimate α
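A sketch consistent with the ISLR bootstrap lab, using the Portfolio data (columns X and Y) and the boot package:
  library(ISLR)
  library(boot)
  alpha.fn <- function(data, index) {
    X <- data$X[index]
    Y <- data$Y[index]
    (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
  }
  set.seed(1)
  boot(Portfolio, alpha.fn, R = 1000)   # bootstrap estimate of SE(alpha-hat)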

The support vector machine,

which we present next, allows us to enlarge the feature space used by the support vector classifier in a way that leads to efficient computations.

The extra flexibility in the polynomial produces undesirable results at the boundaries,

while the natural cubic spline still provides a reasonable fit to the data

classification problems

problems involving a qualitative response, as opposed to regression problems, which involve a quantitative response

The model containing all of the predictors

will always have the smallest RSS and the largest R2, since these quantities are related to the training error. Instead, we wish to choose a model with a low test error. The training error can be a poor estimate of the test error. Therefore, RSS and R2 are not suitable for selecting the best model among a collection of models with different numbers of predictors.

val.type="MSEP"

will cause the cross-validation MSE to be plotted: validationplot(pcr.fit, val.type = "MSEP")

Therefore, in theory, the model with the largest adjusted R2

will have only correct variables and no noise variables. Unlike the R2 statistic, the adjusted R2 statistic pays a price for the inclusion of unnecessary variables in the model.

If the assumption underlying PCR holds, then fitting a least squares model to Z1,...,ZM

will lead to better results than fitting a least squares model to X1,...,Xp, since most or all of the information in the data that relates to the response is contained in Z1,...,ZM, and by estimating only M ≪ p coefficients we can mitigate overfitting.

rstudent()

will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.
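Sketch, where lm.fit is a hypothetical linear model fit:
  plot(predict(lm.fit), rstudent(lm.fit))   # studentized residuals against fitted values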

Overfitting refers specifically to the case in which a less flexible model

would have yielded a smaller test MSE.

Smarket[train,]

would pick out a submatrix of the stock market data set, corresponding only to the dates before 2005, since those are the ones for which the elements of train are TRUE.

Chapter 8 Questions

yes

EdX Quizzes First

yes

Why might we prefer linear discriminant analysis over logistic regression?

• When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem. • If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model. • As mentioned in Section 4.3.5, linear discriminant analysis is popular when we have more than two response classes.

Load the data from the file 7.R.RData, and plot it using plot(x,y). What is the slope coefficient in a linear regression of y on x (to within 10%)?

−0.6748. Explanation: The slope is negative for most of the data, and the coefficient reflects that.

Advantages of Trees

▲ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression! ▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters. ▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small). ▲ Trees can easily handle qualitative predictors without the need to create dummy variables

What are some advantages and disadvantages of classification trees compared to logistic regression?

▲ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression! ▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters. ▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small). ▲ Trees can easily handle qualitative predictors without the need to create dummy variables. ▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book. ▼ Additionally, trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree. Compared with a classification tree, the main practical challenge of logistic regression is interpreting its output: even an experienced statistician cannot glance at a table of fitted coefficients and immediately see what drives the predictions, whereas a small tree can be read directly. This makes trees easier to communicate, though not necessarily statistically superior.

Disadvantages of Trees

▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book. ▼ Additionally, trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree.

Cons for GAMs

◆ The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed. However, as with linear regression, we can manually add interaction terms to the GAM model by including additional predictors of the form Xj × Xk. In addition we can add low-dimensional interaction functions of the form fjk(Xj , Xk) into the model; such terms can be fit using two-dimensional smoothers such as local regression, or two-dimensional splines (not covered here).

You are fitting a linear model to data assumed to have Gaussian errors. The model has up to p=5 predictors and n=100 observations. Which of the following is most likely true of the relationship between Cp and AIC in terms of using the statistic to select a number of predictors to include?

𝐶𝑝 will select the same model as 𝐴𝐼𝐶

You perform ridge regression on a problem where your third predictor, x3, is measured in dollars. You decide to refit the model after changing x3 to be measured in cents. Which of the following is true?:

𝛽̂3 and 𝑦̂ will both change.

