4510 Test 2


Support vectors

observations that lie closest to the maximal margin hyperplane and are equidistant from it. They "support" the maximal margin hyperplane in the sense that if they were moved, the maximal margin hyperplane would move as well

formulation of regression trees

1. Divide the predictor space into J distinct, non-overlapping regions R1, R2, ..., RJ. 2. For every observation that falls into region Rj, predict the average of the training responses in that region. The goal is to find the set of regions that minimizes the RSS (criterion below).
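
Written out, the criterion from step 2 is to choose the regions to minimize

    \min_{R_1,\dots,R_J} \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2,

where \hat{y}_{R_j} is the mean response of the training observations in region R_j.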

Notes on forward/backward selection

1.) Both fit only 1 + p(p+1)/2 models, far fewer than best subset selection, which fits 2^p models. 2.) Backward selection REQUIRES n > p (n = observations, p = predictors) so that the full model can be estimated; forward selection can be used even when n < p. 3.) Neither is guaranteed to find the best model. (A sketch of forward selection follows.)
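
A minimal sketch of forward stepwise selection, assuming NumPy and scikit-learn are available; the function name and structure are illustrative, not from the course notes.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_stepwise(X, y):
        # Greedy forward selection: at each step add the predictor that most reduces RSS.
        p = X.shape[1]
        remaining = list(range(p))
        selected = []
        path = []  # best model of each size, later compared via Cp, AIC, BIC, or CV
        while remaining:
            best_rss, best_j = np.inf, None
            for j in remaining:
                cols = selected + [j]
                fit = LinearRegression().fit(X[:, cols], y)
                rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
                if rss < best_rss:
                    best_rss, best_j = rss, j
            selected.append(best_j)
            remaining.remove(best_j)
            path.append(list(selected))
        return path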

Best subset selection issues

1.) Cannot be applied with very large p 2.) Enormous search space can lead to overfitting and high variance of the coefficient estimates

Rank the following statistical methods in terms of their flexibility from least (1) to most (3) flexible: CART Subset selection Boosting

1.) Subset selection 2.) CART 3.) Boosting http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.7.pdf

Rank the following statistical methods in terms of their interpretability from least (1) to most (3) interpretable: GAMs, lasso, bagging

1.) bagging 2.) GAMs 3.) lasso http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.7.pdf

Steps of dimension reduction

1.) Obtain the transformed variables Z1, ..., ZM. 2.) Fit a least squares model using these M transformed variables. The transformed variables in step 1 are obtained using principal components or partial least squares.
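
A hedged sketch of these two steps as a principal components regression (PCR) pipeline in scikit-learn; the simulated data and the choice M = 3 are purely illustrative.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

    M = 3  # number of transformed variables Z1..ZM (would be chosen by cross-validation)
    pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
    pcr.fit(X, y)  # step 1: compute Z1..ZM; step 2: least squares on them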

Advantages of lasso function

1.) Shrinks the coefficient estimates towards zero, as in ridge regression. 2.) When the tuning parameter lambda is large enough, some of the coefficient estimates are shrunk to exactly 0. *This helps with variable/model selection! *Since many coefficients are shrunk to 0, we call this class of models sparse. Lambda is selected by cross-validation.
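
A minimal sketch of the lasso with lambda chosen by cross-validation (scikit-learn calls the tuning parameter alpha); the simulated data are illustrative.

    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 20))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only 2 truly relevant predictors

    fit = LassoCV(cv=5).fit(X, y)
    print("chosen lambda:", fit.alpha_)
    print("nonzero coefficients:", np.sum(fit.coef_ != 0))  # sparse solution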

Boosting

A procedure that begins with the original data set but grows the trees sequentially, i.e., using information from the previously grown trees. This fits the data slowly, in an iterative fashion. A shrinkage parameter lambda controls the rate at which boosting learns: the smaller the lambda (the slower the learning), the larger B (the number of trees) needs to be.
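
A hedged sketch of boosted regression trees in scikit-learn; the particular values of B, lambda, and the tree depth are illustrative, and in practice B would be chosen by cross-validation.

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_friedman1(n_samples=500, random_state=0)
    boost = GradientBoostingRegressor(
        n_estimators=2000,   # B: number of trees
        learning_rate=0.01,  # lambda: shrinkage, slows the learning
        max_depth=2,         # d: interaction depth (splits per tree)
        random_state=0,
    ).fit(X, y)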

Support vector classifier

A separating hyperplane rarely exists, and even when it does the data may be noisy and a separating hyperplane can overfit, so we relax the constraint to one that almost separates the classes. This method maximizes a soft margin: we still want to maximize the margin M, but we are willing to allow some observations to be inside the margin or even on the wrong side of the hyperplane. This introduces a little bias but drastically cuts the variance.
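
A minimal sketch of a soft margin support vector classifier with a linear kernel on simulated data; note that scikit-learn's C penalizes margin violations, so it acts roughly as the inverse of the "budget" C described in these notes, and the value used here is illustrative.

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print("number of support vectors:", clf.support_vectors_.shape[0])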

Interpretation: comparing decision tree, bagging, random forests or boosting

A single decision tree is extremely easy to interpret. Bagging, random forests, and boosting all average many trees, which makes them great for prediction but poor for interpretation, since the individual trees are typically quite different from one another. As a remedy, if the same variables keep appearing across the trees and their splits yield large decreases in the RSS, those variables are important. Variable importance plots display this graphically.

Classification error rate

A way to measure error in classification trees. It measures the fraction of the training observations in a region that do not belong to the most common class: E = 1 - max_k(p̂_mk), where p̂_mk is the proportion of training observations in the m-th region that are from the k-th class.

comments on GAMS

Allows us to fit a non-linear fj to each Xj, so that we can automatically model non-linear relationships that standard linear regression would miss. The non-linear fits can potentially give better predictions for the response Y. Because the model is additive, we can still examine the effect of each Xj on Y individually while holding the other variables fixed; this additivity is what makes the model interpretable.

local regression

Allows us to fit a non-linear function to the data by computing a localized fit at each value of x. At each target point, conduct a weighted least squares regression, weighting the nearby observations by how far they are from that point (closer points get more weight).

support vector machines

Attempt to find a hyperplane that separates the two classes; kernels allow the resulting decision boundary to be non-linear in the original feature space.

Ridge regression

Attempts to minimize the residual sum of squares (RSS) plus a shrinkage penalty, which is controlled by a tuning parameter lambda (criterion below).
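
Written out, ridge regression chooses the coefficients to minimize

    \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \;=\; \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2 .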

Drawback of boosting

Can overfit the data (though typically slowly), so be careful not to use an indiscriminately large number of trees B.

Trees vs. linear models

Classification and Regression Tree methods are known as CART. If the relationship between the predictors and the response is linear, classical linear models such as linear regression will outperform regression trees. If the relationship is non-linear, decision trees will tend to outperform the classical approaches.

Ways to measure error in classification trees

Classification error rate, Gini index, and cross-entropy (deviance). All three can be used to build a tree and to prune it; the classification error rate is often preferred for pruning if prediction accuracy of the pruned tree is the goal.

Main issue with ridge regression

Coefficients will not be shrunk all the way to zero. All variables will therefore be included in the final model, hindering interpretation

Ways to select the number of variables in step 3 of forward/backward selection

Cp, AIC, BIC, and adjusted R^2. Lower values are better for Cp, AIC, and BIC; larger values are better for adjusted R^2.

Bagging

Decision trees have high variance. Bagging (short for bootstrap aggregation) uses the idea that averaging reduces variance: it averages the results of trees grown on B separate bootstrapped training data sets. Because the same variables are used to build every tree, the trees are correlated with one another (this increases the variance of the average relative to averaging independent trees). For classification, the procedure is the same except that instead of averaging the trees we use majority voting to classify each observation. (A sketch follows.)
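
A hedged sketch of bagging regression trees in scikit-learn (the default base learner of BaggingRegressor is a regression tree); the simulated data and B = 500 are illustrative.

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import BaggingRegressor

    X, y = make_friedman1(n_samples=500, random_state=0)
    # each of the B trees is grown on its own bootstrap sample; predictions are averaged
    bag = BaggingRegressor(n_estimators=500, random_state=0).fit(X, y)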

True or False: When using ridge regression, it is not important to scale the variables because the ridge regression coefficients are scale equivariant.

False

True or False: Ridge regression is scale equivariant

False! Ridge regression is NOT scale equivariant. Therefore when you're going to use ridge regression you want to scale the predictor variables

True or False: Basis functions, bk(X) can be used to represent the behavior of a function across the domain of X.

True. Basis functions bk(X) are used to represent the behavior of a function across the domain of X (see the "Basis functions" card).

True or False: In bagging, trees are built sequentially

False. In bagging, each tree is grown independently on its own bootstrap sample. In boosting, unlike in bagging, the construction of each tree depends strongly on the trees that have already been grown.

True or False: The penalty term for lasso is lambda times the summation from j=1 to p of beta_j squared

False, this is the penalty term for ridge regression. The lasso penalty is the same except that beta_j squared is replaced by the absolute value of beta_j.

True or False: Random forests provide a set of highly correlated trees because at each split in the building process we select the splitting rule from a randomly selected subset of the original input variables.

False. "Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable."

True or False: CART will work better than linear regression/classification even when the underlying relationship is linear

False. If the underlying relationship is linear, linear regression will work better. If the underlying relationship is non-linear and complex then Classification and Regression Trees will likely outperform linear regression.

True or False: The first principal component is a linear combination of the X variables that explains the least variation in X.

False. It explains the greatest variation in X. "The idea is that out of every possible linear combination of pop and ad such that φ_11^2 + φ_21^2 = 1, this particular linear combination yields the highest variance"

True or False: Support vectors in minimal margin classification are the vectors that lie on or within the margin

False. Maximal margin classification. "These three observations are known as support vectors, since they are vectors in p-dimensional space and they 'support' the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well."

True or False: Best subset selection preferred over stepwise approaches when p is large

False. When p is large, best subset selection becomes too computationally intensive. Best subset selection can also suffer from statistical problems when p is large.

True or False: The smoothness penalty term in smoothing splines is a function of the first derivative of the function g.

False. The smoothing penalty penalizes variability in g, and the notation g''(t) indicates the second derivative of the function g. In other words, the penalty is a function of the second derivative of g, not the first.

Subset selection definition

Fit the model with a subset of p predictors that we believe to be related to the response Y. Attempts to trade some bias for variance reduction by removing variables.

Cost complexity pruning

Grow a very large tree T0, then consider a sequence of subtrees indexed by a tuning parameter alpha >= 0. Alpha controls the trade-off between the subtree's complexity and how well it fits the training data, and behaves similarly to the lasso penalty: as alpha grows, there is a larger penalty for a more complex tree (criterion below).
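
For each alpha, the subtree chosen is the one minimizing

    \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha\,|T|,

where |T| is the number of terminal nodes of the subtree T and \hat{y}_{R_m} is the mean training response in the m-th terminal node region.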

Pros of CART

Highly interpretable May mirror human decision making Easy to make graphics which explain results

Maximal margin hyperplane

Hyperplane furthest from the training data

Separating Hyperplane

If even one separating hyperplane exists, then infinitely many exist, so how do we choose between them? Since we want to use the hyperplane to classify test observations, we pick the hyperplane that is furthest from the training data on either side; this should yield the fewest mistakes on new data.

Computation advantage of ridge regression

If p is large, best subset selection requires evaluating an enormous number of models, whereas ridge regression only needs to fit one model and the computations are not overly complicated. Ridge regression can be used even when p > n.

Cons of CART

Lower predictive accuracy than other regression and classification approaches. It can be difficult to implement CART with categorical predictors. CART has high variance (a small change in the data can lead to very different splits).

Effect of lambda on shrinkage penalty term in ridge regression

Lambda = 0: the penalty term is 0, so the criterion reduces to the ordinary RSS. As lambda gets larger and larger, the only way to keep the penalized criterion small is to force the Bj's closer to 0. Cross-validation is used to select lambda!

comparison of ridge regression and lasso

Neither will ever be universally better than the other. The lasso tends to perform better when the response is a function of only a relatively small number of predictors; however, the number of predictors truly related to the response is never known in practice. Cross-validation can be used to choose between the methods.

knot point

A place where we split the range of X so that different sections of the model can be fit separately; c is known as the knot point. More difficult patterns can be captured this way: the more knots and splits you have over the region, the more flexible the model. Ex. with K knots in the range of X, fit K + 1 different cubic polynomials (one standard representation below).
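
One standard representation: a cubic spline with knots ξ_1, ..., ξ_K can be written with the truncated power basis

    f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k=1}^{K}\beta_{3+k}\,(x-\xi_k)_+^3, \qquad (x-\xi)_+^3 = \max\bigl(0,\,(x-\xi)^3\bigr),

which uses K + 4 coefficients, matching the K + 4 degrees of freedom noted in a later card.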

Note on R^2

R^2 will always favor the biggest model, because it can never decrease when a variable is added. For this reason, R^2 should only be used to compare models with the same number of predictors (as in the within-size step of best subset selection), not to choose the final model.

Regularization methods

Ridge regression Lasso

smoothing spline function relation to cubic basis

The smoothing spline solution corresponds to a natural cubic spline with knots at each unique value of x1, ..., xn.

Lasso function

Solves the main issue with ridge regression: it will shrink some coefficients all the way to 0. The only difference is that in the penalty term it is the summation of the absolute values of the Bj's, where before it was Bj^2 (criterion below).
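
Written out, the lasso chooses the coefficients to minimize

    \mathrm{RSS} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert .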

Notes on splines

Splines have high variance at the boundary A natural spline constrains the function to be linear in the boundary regions(outside the knots) Use cross validation to decide the number of knots

decision trees

Splitting rules can be collectively displayed in a figure resembling a tree. The resulting terminal nodes, or "leaves," of the tree are used to predict the response. Decision trees can be applied to both classification and regression and are very easy to interpret.

Backward stepwise selection

Start with the full model and progress backwards, removing one variable at a time, picking the best model at each step and comparing them

Forward stepwise selection definition

Start with the null model and progress forward, one variable at a time, picking the best model at each step and comparing them

Hyperplane

A subspace whose dimension is one less than that of the ambient space. Ex. in 2 dimensions a hyperplane is a straight line; in 3 dimensions a hyperplane is a flat sheet (a plane); etc. The hyperplane therefore divides the space into 2 parts: observations on one side are assigned to one class and those on the other side to the other class. If a hyperplane separates the classes, it is called a separating hyperplane.

Shrinkage penalty

The term added to the residual sum of squares in ridge regression and the lasso; it contains the tuning parameter lambda and shrinks the coefficient estimates toward zero.

main limitation of GAMS

The model is additive. This means we are not able to use interaction effects. This can be very limiting for many problems

Smoothing splines

To fit a smooth curve to a set of data, find a function g(x) that fits the observed data well. Goal: find a smooth function g(x) that makes the RSS really small without being too wiggly. Smoothing splines avoid the problem of choosing the number of knots, leaving a single lambda to be chosen.

True or False: Principle component regression is a dimension reduction method.

True

True or False: Random forests are a bootstrap based procedure that attempt to reduce correlation among trees.

True

True or False: Support vector machines are used for binary classification

True

True or False: The learning rate parameter in some iterative procedures can help limit overfitting by controlling the step size.

True. What is the idea behind this procedure? "Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. By fitting small trees to the residuals, we slowly improve f̂ in areas where it does not perform well. The shrinkage parameter λ slows the process down even further, allowing more and different shaped trees to attack the residuals. In general, statistical learning approaches that learn slowly tend to perform well."

True or false: The tuning parameter, C, in support vector classifiers can help balance bias-variance tradeoff.

True. "...plays a similar role in controlling the bias-variance trade-off for the support vector classifier... In contrast, if C is small, then there will be fewer support vectors and hence the resulting classifier will have low bias but high variance"

True or False: Natural splines are regression splines with extra constraints at the boundaries.

True. "A natural spline is a regression spline with additional boundary constraints"

True or False: When the tuning parameter is large enough with lasso, some coefficients are shrunk to 0 exactly.

True. "As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the 1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large."

True or False: Bootstrap aggregation uses the idea that averaging reduces variance

True. "Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the bagging variance of a statistical learning method"

True or False: Binary partitions in regression trees follow a greedy approach because at each step the best split is made without consideration of splits that could lead to a better split at a later step.

True. "For this reason, we take a top-down, greedy approach that is known as recursive binary splitting...It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step."

True or False: In general, having fewer number of support vectors in support vector classification reduces bias

True. "In contrast, if C is small, then there will be fewer support vectors and hence the resulting classifier will have low bias but high variance. The bottom right panel in Figure 9.7 illustrates this setting, with only eight support vectors."

True or False: A cubic spline with K knots has K + 4 degrees of freedom

True. "In general, a cubic spline with K knots uses cubic spline a total of 4 + K degrees of freedom."

True or False: p-values are typically not useful in high dimensional regression.

True. "Therefore, one should never use sum of squared errors, p-values, R2 statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting."

True or False: Polynomial functions impose global structure on the non-linear function of X while step functions focus on more local behavior.

True. "Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of X. We can instead use step functions in order to avoid imposing such a global structure"

True or False: Partial least squares is a supervised alternative to PCR where new features are chosen that account for covariability with the response Y

True. "We now present partial least squares (PLS), a supervised alternative to partial least PCR. Like PCR, PLS is a dimension reduction method, which first identifies squares a new set of features Z1,...,ZM that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. But unlike PCR, PLS identifies these new features in a supervised way—that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response."

True or False: The Gini index is a measure of node purity for growing classification trees

True. For this reason the Gini index is referred to as a measure of node purity—a small value indicates that a node contains predominantly observations from a single class.

Basis functions

Used to represent the behavior of a function across its domain. The choice of bk(x) allows us to capture increasingly complex behavior. Ex. standard linear model: b1(x) = x; polynomial regression: bk(x) = x^k; step functions: bk(x) = I(ck <= x < ck+1). (A small sketch follows.)
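
An illustrative NumPy sketch (not from the notes) of building the corresponding design matrices for a polynomial basis and a step-function basis; the cutpoints are made up.

    import numpy as np

    x = np.linspace(0, 10, 200)

    # polynomial basis: b_k(x) = x**k for k = 1, 2, 3
    X_poly = np.column_stack([x ** k for k in (1, 2, 3)])

    # step-function basis: indicators I(c_j <= x < c_{j+1}) for cutpoints c_j
    cuts = np.array([-np.inf, 2.5, 5.0, 7.5, np.inf])
    X_step = np.column_stack(
        [(x >= cuts[j]) & (x < cuts[j + 1]) for j in range(len(cuts) - 1)]
    ).astype(float)
    # either matrix can be passed to least squares, e.g. np.linalg.lstsq(X_poly, y, rcond=None)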

Gini Index

Way to measure error in classification trees Measure of the total variance across the K classes

Cross Entropy or Deviance

Way to measure error in classification trees Related to an information measure and also the deviance

Comparing Support Vector machines to other methods

When the classes are nearly separable, SVM does better than logistic regression. When they're not, logistic regression with a ridge penalty and SVM perform similarly, and logistic regression is much more interpretable, since it gives probabilities of class membership. For non-linear boundaries, kernel SVMs are popular and easy to implement.

Which of the following are true about GAMs [select all that apply]? (a) GAMs allow us to fit a nonlinear fj to each Xj . (b) Since the model is additive, we are not able to use interaction effects. (c) Nonlinear fits can potentially make more accurate predictions. (d) Because the model is additive, we can still examine the effect of each Xj on Y

a(true) b(true? "The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed.") c(true) d(true)

Which TWO of the following are not true about regression trees? (a) They always perform better than multiple linear regression models. (b) The cost complexity pruning penalty is similar to the lasso. (c) They lead to a smooth prediction surface. (d) They can be applied to both classification and regression problems. (e) They cannot accommodate additive structure.

The two statements that are not true are (a) and (c). (a) is not true because if the underlying relationship is linear, a multiple linear regression model can do better. (c) is not true because trees produce a piecewise-constant, not smooth, prediction surface. (b), (d), and (e) are true: the cost complexity penalty is similar to the lasso, trees can be applied to both classification and regression, and a simple tree has difficulty capturing additive structure.

Which of the following is not a tuning parameter in boosting? (a) The number of variables to consider at each split. (b) The interaction depth (number of splits in each tree). (c) The shrinkage parameter. (d) The number of trees.

(a) - the number of variables to consider at each split. That is a random forest tuning parameter; boosting's tuning parameters are the number of trees B, the shrinkage parameter lambda, and the interaction depth d.

Random forests

Bagging uses the same variables to build all of the trees, so the trees are correlated with one another (this increases the variance of the mean). Random forests are a bootstrap-based procedure that addresses this: for each bootstrapped sample we build a tree as before, but at each split of the tree we only consider a random subset of m < p of the variables. In theory we can pick anything for m; in practice m = square root of p works well. (Sketch below.)
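
A hedged sketch of a random forest classifier in scikit-learn with m = sqrt(p) variables considered at each split; the simulated data and B = 500 are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=16, random_state=0)
    rf = RandomForestClassifier(
        n_estimators=500,     # number of bootstrapped trees
        max_features="sqrt",  # m = sqrt(p) variables considered at each split
        random_state=0,
    ).fit(X, y)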

regression splines

Build on polynomial regression: they fit piecewise polynomial regressions, adding continuity and smoothness constraints at the knots.

Tuning parameter

The support vector classifier's constraints allow individual observations to be inside the margin: each observation is required to be at least M(1 - epsilon_i) from the hyperplane, with the summation of the epsilon_i's <= C. C is a tuning parameter that controls how much we "budget" for these violations; C = 0 gives the maximal margin classifier. C controls the bias-variance trade-off and the number of support vectors: a larger C budgets for more violations of the margin, i.e., more support vectors (higher bias, lower variance), while a smaller C results in fewer support vectors (low bias, high variance).

Pruning

If we continue the process of splitting until we reach a predetermined number of terminal nodes, the tree produces good predictions on the training data but typically overfits the test data. To improve test performance, prune the tree (remove leaves). Use cost complexity pruning, which examines how much the RSS is reduced when additional splits are included.

Regularization

Fit a model using all p predictors, but shrink the coefficient estimates towards zero. This biases the estimates, but to the benefit of drastically reducing the variance.

Generalized Additive Models(GAMS)

The formula is B0 plus the sum of a separate function fj of each predictor X1, ..., Xp, plus an error term (written out below). GAMs can be used in a classification setting as well, via logistic regression.
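
Written out, the GAM for regression is

    Y = \beta_0 + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p) + \epsilon .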

classification tree fitting

Grown in the same manner as a regression tree, using recursive binary splitting, but the RSS cannot be the basis for making the binary splits. Instead, classification error (an indicator function inside the summation), the Gini index, or cross-entropy is used.

How smoothing spline changes with changes of lambda

If lambda = 0, the penalty has no effect, so g(x) would interpolate all the points and overfit the data. As lambda goes to infinity, g(x) is forced to have no roughness at all, i.e., a straight line. In general, lambda balances fitting the data well (reducing bias) against being too flexible (too much variance).

Roughness penalty(in smoothing splines)

Lambda times the integral of the squared second derivative of g. The more "wiggly" the function is, the higher the penalty. Lambda is the tuning parameter, as before; the squared second derivative g''(t)^2 gives a measure of the roughness of the function at the point t (full criterion below).
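
Putting the pieces together, the smoothing spline minimizes

    \sum_{i=1}^{n}\bigl(y_i - g(x_i)\bigr)^2 + \lambda\int g''(t)^2\,dt .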

classification trees

similar to regression tree, but used to predict a qualitative response For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs

Recursive binary splitting

the "top-down" or "greedy" method of splitting to select Rj in regression trees. Top down because it starts at the top of the tree, splits and moves down Greedy because at each split pick best split at that time rather than looking ahead and picking a split that will lead to a better tree in the future

Dimension reduction methods

Transform the original predictors and perform least squares on the transformed variables; the transformed variables are then used in the linear regression. The number of transformed variables M is much smaller than the number of original variables p, hence "dimension reduction" methods.

