4510 Test 2
Support vectors
observations which lie closest to (and are equidistant from) the maximal margin hyperplane. They "support" the maximal margin hyperplane in that if they were moved, the maximal margin hyperplane would move as well
formulation of regression trees
1. Divide the predictor space into J distinct regions R1, R2, ..., RJ 2. Within each region Rj, take the average of the responses of the training observations; this is the prediction of the response in that region. The goal is to find the set of regions that minimizes the RSS; see the equation below
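As a worked equation (in ISL's notation), the regions are chosen to minimize

```latex
\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2
```

where \hat{y}_{R_j} is the mean response of the training observations in region Rj.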
Notes on forward/backward selection
1.) Both fit only 1 + p(p+1)/2 models, which is much better than best subset selection's 2^p models. 2.) Backward selection REQUIRES n > p (n = observations, p = predictors) so that the full model can be estimated; forward selection can be used even when n < p. 3.) Neither is guaranteed to find the best model
Best subset selection issues
1.) Cannot be applied with very large p 2.) Enormous search space can lead to overfitting and high variance of the coefficient estimates
Rank the following statistical methods in terms of their flexibility from least (1) to most (3) flexible: CART Subset selection Boosting
1.) Subset selection 2.) CART 3.) Boosting http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.7.pdf
Rank the following statistical methods in terms of their interpretability from least (1) to most (3) interpretable: GAMs, lasso, bagging
1.) bagging 2.) GAMs 3.) lasso http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.7.pdf
Steps of dimension reduction
1.) Obtain the transformed variables Z1, ..., ZM 2.) Fit a least squares model using these transformed variables. The transformed variables in step 1 are obtained using principal components or partial least squares
Advantages of lasso function
1.) Shrinks coefficient estimates towards zero, as in ridge regression 2.) When the tuning parameter lambda is large enough, some of the coefficient estimates are shrunk to exactly 0 *This helps with variable/model selection! *Since many coefficients are shrunk to 0, this class of models is called sparse. Lambda is selected by cross-validation
Boosting
A procedure that begins with the original data set but grows the trees sequentially, i.e. each tree uses information from the previously grown trees. This fits the data slowly, in an iterative fashion. A shrinkage parameter lambda controls the rate at which boosting learns: the smaller lambda (i.e. the slower the learning), the greater B (the number of trees) generally needs to be. A minimal code sketch follows.
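A minimal sketch, assuming scikit-learn is available; its GradientBoostingRegressor is used here as a stand-in for the boosting procedure described above, with learning_rate playing the role of the shrinkage lambda and n_estimators the role of B.

```python
# Boosting sketch: small learning_rate (lambda) usually needs a larger n_estimators (B).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

# max_depth controls the size of each tree (the interaction depth d).
boost = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01, max_depth=2)
boost.fit(X, y)
print(boost.predict(X[:5]))
```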
Support vector classifier
A separating hyperplane rarely exists, so we relax this constraint to one that almost separates the classes (sometimes the data are noisy, or a separating hyperplane would overfit). This method maximizes a soft margin: we still want to maximize the margin M, but we are willing to allow some observations to be inside the margin or even on the wrong side of the hyperplane. This introduces a little bias but drastically cuts down on variance
Interpretation: comparing decision tree, bagging, random forests or boosting
A single decision tree is extremely easy to interpret. Bagging, random forests and boosting all take the average of many trees, which makes them great for prediction but not great for interpretation, since the individual trees are typically quite different from one another. To remedy this: if, across the many trees, the same variables keep appearing and their splits yield a large decrease in the RSS, those variables are important. Variable importance plots show this in a graphical way
Classification error rate
A way to measure error in classification trees. It measures the fraction of the training observations in that region that do not belong to the most common class: E = 1 − max_k(p̂_mk), where p̂_mk is the proportion of training observations in region m that are from class k
comments on GAMS
Allows us to fit a non-linear fj to each Xj, so that we can automatically model non-linear relationships that standard linear regression will miss. The non-linear fits can potentially make better predictions for the response Y. Because the model is additive, we can still examine the effect of each Xj on Y individually while holding all the other variables fixed; this additivity makes the model interpretable
local regression
Allows us to fit a non-linear function to the data by computing a localized fit at each location xi. At each point we conduct a weighted least squares regression, weighting the other nearby observations by how far away they are from that point (closer points get more weight)
support vector machines
Attempt to find a hyperplane that separates the 2 classes, after enlarging the feature space using kernels; the enlarged feature space allows boundaries that are non-linear in the original predictors
Ridge regression
Attempts to minimize the residual sum of squares (RSS) plus a constraint (shrinkage) term, which is driven by a tuning parameter lambda; see the equation below
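As a worked equation (ISL's notation), ridge regression chooses the coefficients to minimize

```latex
\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2
  \;=\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 .
```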
Drawback of boosting
Can overfit the data (though typically slowly), so we need to be careful not to use an indiscriminately large number of trees
Trees vs. linear models
Classification and Regression Tree methods are known as CART. If the relationship between the predictors and the response is linear, then classical linear models such as linear regression will outperform regression trees. If the relationship between the predictors and the response is non-linear, then decision trees tend to outperform the classical approaches
Ways to measure error in classification trees
Classification error rate, Gini index, and cross-entropy (deviance). All 3 can be used to build a tree and to prune it; the classification error rate is often preferred for pruning if prediction accuracy of the pruned tree is the goal
Main issue with ridge regression
Coefficients will not be shrunk all the way to zero. All variables will therefore be included in the final model, hindering interpretation
Ways to select the number of variables in step 3 of forward/backward selection
Cp, AIC, BIC, adjusted R^2. Lower values are better for Cp, AIC and BIC; larger values are better for adjusted R^2
Bagging
Decision trees have high variance. Bagging (short for bootstrap aggregation) uses the idea that averaging reduces variance: it averages the results from B separate bootstrapped training data sets. Because the same variables are used to build every tree, the trees are correlated with one another (this correlation limits how much the variance of the averaged prediction can be reduced). For classification: same procedure, except instead of taking an average of the trees we use majority voting to classify each observation. A minimal sketch of the regression case follows.
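A minimal sketch, assuming scikit-learn; the bootstrap loop and the final averaging mirror the description above.

```python
# Bagging sketch: fit B trees on bootstrap samples, then average their predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

B = 100
trees = []
for b in range(B):
    Xb, yb = resample(X, y, random_state=b)   # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor().fit(Xb, yb))

# Bagged prediction = average of the B individual tree predictions.
y_hat = np.mean([t.predict(X) for t in trees], axis=0)
```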
True or False: When using ridge regression, it is not important to scale the variables because the ridge regression coefficients are scale equivariant.
False
True or False: Ridge regression is scale equivariant
False! Ridge regression is NOT scale equivariant. Therefore when you're going to use ridge regression you want to scale the predictor variables
True or False: Basis functions, bk(X) can be used to represent the behavior of a function across the domain of X.
True. Basis functions bk(X) can be used to represent the behavior of a function across the domain of X (see the "Basis functions" card in this deck)
True or False: In bagging, trees are built sequentially
False. In bagging, trees are built independently on separate bootstrap samples. Note that in boosting, unlike in bagging, the construction of each tree depends strongly on the trees that have already been grown
True or False: The penalty term for lasso is lambda times the summation from j=1 to p of Beta_j squared
False, this is the penalty term for ridge regression. The penalty term for lasso is the same except that Beta_j squared is replaced by the absolute value of Beta_j.
True or False: Random forests provide a set of highly correlated trees because at each split in the building process we select the splitting rule from a randomly selected subset of the original input variables.
False. "Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable."
True or False: CART will work better than linear regression/classification even when the underlying relationship is linear
False. If the underlying relationship is linear, linear regression will work better. If the underlying relationship is non-linear and complex then Classification and Regression Trees will likely outperform linear regression.
True or False: The first principal component is a linear combination of the X variables that explains the least variation in X.
False. It explains the greatest variation in X. "The idea is that out of every possible linear combination of pop and ad such that φ₁₁² + φ₂₁² = 1, this particular linear combination yields the highest variance"
True or False: Support vectors in minimal margin classification are the vectors that lie on or within the margin
False. Maximum margin classification. "These three observations are known as support vectors, since they are vectors in p-dimensional space and they "support" the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well."
True or False: Best subset selection preferred over stepwise approaches when p is large
False. When p is large, best subset selection becomes too computationally intensive. Best subset selection can also suffer from statistical problems when p is large.
True or False: The smoothness penalty term in smoothing splines is a function of the first derivative of the function g.
False. It is a penalty term that penalizes variability in g, and the notation g''(t) indicates the second derivative of the function g. That is, it's a function of the second derivative of g, not the first.
Subset selection definition
Fit the model with a subset of p predictors that we believe to be related to the response Y. Attempts to trade some bias for variance reduction by removing variables.
Cost complexity pruning
Grow a very large tree, then consider a sequence of subtrees indexed by alpha ≥ 0. Alpha controls the trade-off between the subtree's complexity and how well it fits the training data, and behaves similarly to the lasso penalty: as alpha grows, there is a larger penalty for a more complex tree
Pros of CART
Highly interpretable May mirror human decision making Easy to make graphics which explain results
Maximal margin hyperplane
Hyperplane furthest from the training data
Separating Hyperplane
If even one separating hyperplane exists, then infinitely many exist. How do we choose between them? We want to use the hyperplane to classify test data, so we pick the hyperplane that is furthest from the training data on either side; this will tend to yield the fewest mistakes on new data
Computation advantage of ridge regression
If p is large, best subset selection requires evaluating a huge number of models; ridge regression only needs to fit one model, and the computations are not overly complicated. Ridge regression can even be used when p > n
Cons of CART
Lack predictive accuracy compared to other regression and classification approaches. It can be difficult to implement CART with categorical predictors. CART has high variance (a small change in the data can lead to different splits)
Effect of lambda on shrinkage penalty term in ridge regression
Lambda = 0: the constraint term is 0, so the criterion reduces to simply the RSS. As lambda gets greater and greater, the only way to keep the constraint term small is by forcing the βj's closer to 0. Cross-validation is used to select lambda!
comparison of ridge regression and lasso
Neither will ever be universally better than the other. Lasso tends to perform better when the response is a function of only a relatively small number of predictors. However, the number of predictors truly related to the response is never known in practice, so cross-validation can be used to choose between the two methods
knot point
Place where we split the range of the predictor so that different sections of the model can be analyzed separately; c is known as the knot point. More difficult patterns can be captured this way, and the more knots and splits you have over the region, the more flexible the model is. Ex: if you have K knots in the range of X, you fit K + 1 different cubic polynomials
Note on R^2
R^2 will always favor the biggest model, since it never decreases as variables are added. For this reason, do not use R^2 in the final step of the best subset selection process (comparing models with different numbers of predictors); it is only useful for comparing models of the same size
Regularization methods
Ridge regression Lasso
smoothing spline function relation to cubic basis
The smoothing spline function corresponds to a (shrunken) natural cubic spline with knots at each unique value of x1, ..., xn
Lasso function
Solves the issue with ridge regression: it will shrink some coefficients all the way to 0. The only difference in the formula is that the constraint term is the summation of the absolute values of the Bj's rather than the Bj^2's; see the equation below
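As a worked equation, the lasso minimizes

```latex
\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| ,
```

compared with the ridge penalty \lambda \sum_{j=1}^{p} \beta_j^2.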
Notes on splines
Splines have high variance at the boundary. A natural spline constrains the function to be linear in the boundary regions (outside the outermost knots). Use cross-validation to decide the number of knots
decision trees
Splitting rules can be collectively displayed in a figure resembling a tree. The resulting nodes, or "leaves," of the tree are used to predict the response. Decision trees can be applied to both classification and regression, and are very easy to interpret
Backward stepwise selection
Start with the full model and progress backwards, removing one variable at a time, picking the best model at each step and comparing them
Forward stepwise selection definition
Start with the null model and progress forward, one variable at a time, picking the best model at each step and comparing them
Hyperplane
Subspace which is one dimension below the dimension of the space. Ex: in 2 dimensions a hyperplane is a straight line; in 3 dimensions a hyperplane is a flat plane (like a sheet of paper); etc. The hyperplane therefore divides the space into 2 parts (observations on one side are assigned to one class and those on the other side to the other class). If a hyperplane that separates the classes exists, it is called a separating hyperplane; see the equation below
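As a worked equation, a hyperplane in p dimensions is the set of points satisfying

```latex
\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0 ,
```

and the class of an observation is determined by the sign of the left-hand side.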
Shrinkage penalty
The term added to the residual sum of squares criterion that contains the constraint (and therefore lambda); it shrinks the coefficient estimates toward zero
main limitation of GAMS
The model is additive. This means we are not able to use interaction effects. This can be very limiting for many problems
Smoothing splines
To fit a smooth curve to a set of data, find a function g(x) that fits the observed data well. Goal: find a smooth function g(x) that makes the RSS really small! Smoothing splines avoid the problem of choosing the number of knots, leaving a single lambda to be chosen
True or False: Principal component regression is a dimension reduction method.
True
True or False: Random forests are a bootstrap based procedure that attempt to reduce correlation among trees.
True
True or False: Support vector machines are used for binary classification
True
True or False: The learning rate parameter in some iterative procedures can help limit overfitting by controlling the step size.
True. From ISL's discussion of boosting: "What is the idea behind this procedure? Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. By fitting small trees to the residuals, we slowly improve f̂ in areas where it does not perform well. The shrinkage parameter λ slows the process down even further, allowing more and different shaped trees to attack the residuals. In general, statistical learning approaches that learn slowly tend to perform well"
True or false: The tuning parameter, C, in support vector classifiers can help balance bias-variance tradeoff.
True. "...plays a similar role in controlling the bias-variance trade-off for the support vector classifier... In contrast, if C is small, then there will be fewer support vectors and hence the resulting classifier will have low bias but high variance"
True or False: Natural splines are regression splines with extra constraints at the boundaries.
True. "A natural spline is a regression spline with additional boundary constraints"
True or False: When the tuning parameter is large enough with lasso, some coefficients are shrunk to 0 exactly.
True. "As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the 1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large."
True or False: Bootstrap aggregation uses the idea that averaging reduces variance
True. "Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the bagging variance of a statistical learning method"
True or False: Binary partitions in regression trees follow a greedy approach because at each step the best split is made without consideration of splits that could lead to a better split at a later step.
True. "For this reason, we take a top-down, greedy approach that is known as recursive binary splitting...It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step."
True or False: In general, having fewer number of support vectors in support vector classification reduces bias
True. "In contrast, if C is small, then there will be fewer support vectors and hence the resulting classifier will have low bias but high variance. The bottom right panel in Figure 9.7 illustrates this setting, with only eight support vectors."
True or False: A cubic spline with K knots has K + 4 degrees of freedom
True. "In general, a cubic spline with K knots uses cubic spline a total of 4 + K degrees of freedom."
True or False: p-values are typically not useful in high dimensional regression.
True. "Therefore, one should never use sum of squared errors, p-values, R2 statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting."
True or False: Polynomial functions impose global structure on the non-linear function of X while step functions focus on more local behavior.
True. "Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of X. We can instead use step functions in order to avoid imposing such a global structure"
True or False: Partial least squares is a supervised alternative to PCR where new features are chosen that account for covariability with the response Y
True. "We now present partial least squares (PLS), a supervised alternative to partial least PCR. Like PCR, PLS is a dimension reduction method, which first identifies squares a new set of features Z1,...,ZM that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. But unlike PCR, PLS identifies these new features in a supervised way—that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response."
True or False: The Gini index is a measure of node purity for growing classification trees
True. For this reason the Gini index is referred to as a measure of node purity—a small value indicates that a node contains predominantly observations from a single class.
Basis functions
Used to represent the behavior of a function across its domain. The choice of bk(x) allows us to capture increasingly complex behavior. Ex: standard linear model: b1(x) = x; polynomial regression: bk(x) = x^k; step functions: bk(x) = I(ck ≤ x < ck+1)
Gini Index
Way to measure error in classification trees Measure of the total variance across the K classes
Cross Entropy or Deviance
Way to measure error in classification trees Related to an information measure and also the deviance
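As worked equations for the two measures above (ISL's notation, where p̂_mk is the proportion of training observations in region m from class k), the Gini index G and the cross-entropy D are

```latex
G = \sum_{k=1}^{K} \hat{p}_{mk}\,\bigl(1 - \hat{p}_{mk}\bigr),
\qquad
D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} .
```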
Comparing Support Vector machines to other methods
When classes are nearly separable, SVM does better than logistic regression. When they're not, logistic regression with a ridge penalty and SVM perform similarly. Logistic regression is much more interpretable, giving probabilities of class membership. For non-linear boundaries, kernel SVMs are popular and easy to implement
Which of the following are true about GAMs [select all that apply]? (a) GAMs allow us to fit a nonlinear fj to each Xj . (b) Since the model is additive, we are not able to use interaction effects. (c) Nonlinear fits can potentially make more accurate predictions. (d) Because the model is additive, we can still examine the effect of each Xj on Y
a (true); b (true: "The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed."); c (true); d (true)
Which TWO of the following are not true about regression trees? (a) They always perform better than multiple linear regression models. (b) The cost complexity pruning penalty is similar to the lasso. (c) They lead to a smooth prediction surface. (d) They can be applied to both classification and regression problems. (e) They cannot accommodate additive structure.
The two that are not true are (a) and (c): if the underlying relationship is linear, multiple linear regression can do better, and regression trees produce a piecewise-constant rather than smooth prediction surface. (b), (d) and (e) are true; in particular, trees can be applied to both classification and regression problems, and the cost complexity penalty behaves like the lasso
Which of the following is not a tuning parameter in boosting? (a) The number of variables to consider at each split. (b) The interaction depth (number of splits in each tree). (c) The shrinkage parameter. (d) The number of trees.
a- the number of variables to consider at each split
Random forests
Bagging uses the same variables to build all the trees, so they're correlated with one another (this keeps the variance of the averaged prediction high). Random forests are a bootstrap-based procedure that decorrelates the trees: for each bootstrapped sample, build a tree as before, but at each split of the tree consider only a random subset of m < p variables. In theory we can pick anything for m; in practice m = square root of p works well. A minimal sketch follows.
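A minimal sketch, assuming scikit-learn; max_features="sqrt" corresponds to considering m = √p randomly chosen predictors at each split.

```python
# Random forest sketch: decorrelated trees via a random subset of predictors per split.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=16, noise=1.0, random_state=0)

rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.feature_importances_)   # variable importance, as in the importance plots above
```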
regression splines
build on polynomial regression they fit piecewise polynomial regressions adding some continuity and smoothness constraints
Tuning parameter
A condition added to the support vector classifier: observations are allowed to violate the margin M by slack amounts εi (via M(1 − εi)), subject to the summation of the εi being at most C. C is a tuning parameter that controls how much we "budget" for these violations; C = 0 gives the maximal margin classifier. It controls the bias-variance trade-off and the number of support vectors: a larger C budgets for more violations of the margin, i.e. more support vectors (higher bias, lower variance), while a smaller C results in fewer support vectors (low bias, high variance). A code sketch follows.
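A minimal sketch, assuming scikit-learn. Note that scikit-learn's C parameter is roughly the inverse of the "budget" C described above: a small sklearn C tolerates more margin violations and so yields more support vectors.

```python
# Support vector classifier sketch: vary the regularization and count support vectors.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for c in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=c).fit(X, y)
    # Smaller sklearn C -> softer margin -> more support vectors.
    print(c, "support vectors:", clf.n_support_.sum())
```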
Pruning
Continue the process of splitting the tree until we reach a predetermined number of terminal nodes. This produces good predictions on the training data but typically overfits, so test performance suffers. To improve test performance, use pruning (removing leaves of the tree). Cost complexity pruning is used, which examines how much the RSS is reduced when additional splits are included, weighed against a penalty on tree complexity
Regularization
Fit a model using all p predictors, but shrink the coefficients towards zero. This biases the estimates, but to the benefit of drastically reducing the variance
Generalized Additive Models(GAMS)
The formula is B0 plus the sum of a (possibly non-linear) function fj of each of x1, ..., xp, plus an error term. GAMs can be used in a classification setting as well, by using logistic regression
classification tree fitting
Grown in the same manner as the regression tree, using recursive binary splitting, but RSS cannot be the basis for making the binary splits; instead a classification error measure is used (the error rate can be written with an indicator function inside the summation)
How smoothing spline changes with changes of lambda
If lambda = 0, we minimize the RSS to 0, such that g(x) goes through all the points and overfits the data. As lambda goes to infinity, g(x) has NO roughness (aka a straight line). In general, lambda balances fitting the data well (reducing bias) against being too flexible (too much variance)
Roughness penalty(in smoothing splines)
Lambda times the integral of the squared second derivative of g. The more "wiggly" the function is, the higher the penalty. Lambda is the tuning parameter, as before; the squared second derivative g''(t)^2 measures the roughness of the function at the point t. The complete smoothing spline criterion is shown below
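Putting the RSS and the roughness penalty together (ISL's notation), the smoothing spline is the function g minimizing

```latex
\sum_{i=1}^{n} \bigl( y_i - g(x_i) \bigr)^2 + \lambda \int g''(t)^2 \, dt .
```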
classification trees
similar to regression tree, but used to predict a qualitative response For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs
Recursive binary splitting
the "top-down" or "greedy" method of splitting to select Rj in regression trees. Top down because it starts at the top of the tree, splits and moves down Greedy because at each split pick best split at that time rather than looking ahead and picking a split that will lead to a better tree in the future
Dimension reduction methods
Transform the original predictors and perform least squares on the transformed variables, which are then used in the linear regression. The number of transformed variables is much smaller than the number of original variables, hence "dimension reduction methods." A minimal PCR sketch follows.
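A minimal principal components regression sketch, assuming scikit-learn; M, the number of transformed variables, would be chosen by cross-validation in practice.

```python
# PCR sketch: standardize, transform the predictors to M principal components (Z1...ZM),
# then fit least squares on the transformed variables.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)

M = 5  # number of transformed variables
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
```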