ITEC 621 -- Quiz 4, 5, 6
error measures in Classification Models
2LL, Classification Error, Impurity Error, AIC, BIC
how are all of the polynomial models evaluated?
ANOVA or cross-validation
what is the main goal in shrinkage?
find the tuning factor that yields the optimal model with the best CV results
Penalized or Shrinkage Methods
fit a model with all P predictors, but shrink (or "regularize") the coefficient estimates towards 0, though not always exactly to 0.
what happens when models increase in size and complexity?
fit and error statistics improve, but so does model variance
when is ridge regression specifically useful?
for large models with lots of predictors and dimensionality issues
when you are removing a variable from a model what is this the equivalent of doing?
forcing the beta coefficient to be 0
How does CV help tuning parameter selection?
several modeling methods have tuning parameters (e.g., polynomial power, number of variables, number of tree leaves), that analysts can change to tune a model. CV helps evaluate the optimal value of model tuning parameters.
test subset
part of the data not in the train subset used to evaluate the trained models
train subset
part of the data selected randomly to fit models and estimate parameters
cross-validation used in predictive accuracy
partitioning data into train and test subsets and computing predictive accuracy scores
what areas are included in ML branch of AI?
pattern recognition, computational modeling and artificial intelligence
in regard to alpha, what is the Ridge and LASSO penalty?
ridge penalty --> 1 - alpha; LASSO penalty --> alpha
what is the result of Ridge Regression coefficients usually?
ridge regression coefficients are biased, but the resulting model has lower variance
in PLS, what happens when the M is increased?
same as with OLS and PCR: bias decreases, variance increases
Polynomial Splines
same as linear splines, but using the degree=n attribute to specify a polynomial degree
how is predictive accuracy evaluation measured?
unstable models don't yield consistent results when making predictions with new data. CV helps evaluate the predictive accuracy and stability of models.
what must you NOT do with a dataset in ML?
use all data to fit (train) a model and then use that same data set to test the model -- the model will be over-fitted
what is the amount of shrinkage in Ridge and LASSO Regressions controlled by?
"tuning" parameter
summary for Shrinkage and Regularization, when do you use them?
(1) removing predictors is not an option.
summary for Dimension Reduction Methods, when do you use them?
(1) the number of predictors is very large relative to the number of data observations available; (2) you must use all predictors; and (3) interpretation is NOT a goal, but predictive accuracy is.
when is it problematic when models NEED many important variables in Dimension Reductions?
(1) the ratio of predictors to observations is large (low degrees of freedom) and when (2) predictors are correlated.
what does a ridge regression coefficient minimize?
SSE_R = SSE + shrinkage penalty (the tuning parameter applied over the sum of the squared coefficients)
LASSO Regression finds coefficients that minimize...
SSE_L = SSE + shrinkage penalty; the penalty tuning parameter is applied over the sum of the absolute values of the coefficients, rather than over the sum of their squared values (unlike Ridge)
in class, how will we view ML?
the development of models that can learn patterns from data (i.e., train) and the ability to test these models for predictive accuracy
what are spline regressions good for?
not great for interpretation, but good for prediction
Natural Splines
help correct for the high variance at the tail ends ("tail wagging") of polynomial splines to some extent by forcing the very first and last segments to a linear fit.
if the MSE is low, the R-squared will be..
high
what is ML about?
maximizing predictive accuracy
what does dimension reduction do to a model?
may increase bias but substantially reduce variance of the coefficients, particularly when P is large relative to N
two distinct advantages of CV
1) Regardless of modeling assumptions/issues, cross-validation will always identify the most accurate model method and specification among the models tested; 2) Variable selection methods require nested models of the same type, but cross-validation can compare not only nested models, but also dissimilar models
when is using PCR beneficial?
1. Good for prediction, less so for interpretation; 2. when dimensionality is high --> low ratio of observations to predictors, high multi-collinearity among predictors; 3. when the first few PC's explain a large proportion of the variance in the data
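For illustration, a minimal PCR sketch with the pls package; the data frame my.data and the outcome y are hypothetical names:
  library(pls)
  pcr.fit <- pcr(y ~ ., data = my.data, scale = TRUE, validation = "CV")   # standardize x's; CV over components
  validationplot(pcr.fit, val.type = "MSEP")                               # CV error vs. number of components M
  pcr.pred <- predict(pcr.fit, newdata = my.data, ncomp = 3)               # predict with the chosen M (e.g., 3)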
process of LASSO regression
1. Like Ridge, all variables (y and x's) are standardized in LASSO; 2. the β coefficients are un-standardized back to their normal scales when reported; 3. a tuning parameter controls the shrinkage, just like in Ridge
Typical ML process using CV to select the best model
1. Select various models and specifications to evaluate
2. Split the data into train and test subsets
3. Fit (train) the models with the train subset
4. Make predictions with the test subset using the fitted (trained) models
5. Compute the MSE or deviance statistics of the test predictions
6. Re-sample, re-train and re-test the models several times
7. Select the model with the smallest test MSE or deviance
8. Once you select a model, re-fit it using ALL the data (for statistical power) for actual predictions
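As a concrete illustration, a minimal holdout sketch of this process in R; the data frame my.data and the outcome y are hypothetical names:
  set.seed(1)
  train.idx <- sample(1:nrow(my.data), 0.7 * nrow(my.data))   # random 70% train subset
  train <- my.data[train.idx, ]
  test  <- my.data[-train.idx, ]
  fit <- lm(y ~ ., data = train)                              # train the model on the train subset
  pred <- predict(fit, newdata = test)                        # predict with the test subset
  test.mse <- mean((test$y - pred)^2)                         # test MSE, used to compare models
  # repeat with re-sampled splits, then re-fit the chosen model on ALL the data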
what is the most popular # for K in KFCV?
10
Spline Regression
A spline regression is very similar in concept to a piecewise regression, but in each segment you can fit not just linear models, but polynomials too.
what happens in Ridge and LASSO Regressions?
All predictors that matter for business reasons are included, but predictors are shrunk (i.e., penalized rather than removed) so that small coefficients have a very low weight in the prediction.
types of non-linear models
B x C Interactions, Polynomials, Step and Piecewise, Splines, C x C Interactions, Smoothing Splines
what is a big trade off in CV?
the Bias vs. Variance tradeoff
what is the biggest AI business application?
Customer Engagement Application
What are the functions of Piecewise and Splines?
Divide the data into various segments; fit a different regression in each segment; a point where 2 segments connect is called a "knot"
the math of a spline regression?
For a given variable, start at the origin with a polynomial. Then add a "truncated" power function for values beyond the 1st knot. Repeat this procedure at every knot. This ensures that the various polynomial segments connect at the knots, thus yielding a continuous curve.
what are the most popular partitioning approaches?
Hold-out or random splitting (set percentage, e.g. 70%); K-Fold CV - KFCV (e.g. 10-Fold - 10FCV); Leave One Out CV (or Leave P Out) - LOOCV; Bootstrapping
how do you know how to combine predictors in dimension reduction?
If we have P somewhat correlated predictors, we can combine them into M components, with M << P --> fewer variables. Use the correlation matrix to figure out how to group predictors into components.
what does a step function do with scatterplot?
If we note patterns in scatterplots showing that relationships shift at various ranges of the data, a step function breaking up the data into such sections may provide better predictions
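A minimal step-function sketch in R using cut(); my.data, y and x are hypothetical names:
  step.fit <- lm(y ~ cut(x, breaks = 4), data = my.data)   # 4 sections; predicts a flat mean in each
  summary(step.fit)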
difference in LASSO regression and Ridge Regression?
In contrast, the LASSO method does shrink small coefficients all the way to 0 with large values of the tuning parameter
if it is not important to retain ALL available variables in a model, should you use LASSO or Ridge?
LASSO
types of Spline Models
Linear Splines, Polynomial Splines, Cubic Splines, B-Splines (Basis Splines), N-Splines (Natural Splines), Smoothing Splines
how to determine the polynomial degree?
Low degree polynomials are preferred and more interpretable - i.e., no higher than cubic is recommended
what are some predictive accuracy scores?
MSE, Deviance, Classification Error
How to determine number of segments / knots to use in tuning spline regressions?
More segments/knots add complexity to the model and make it harder to interpret. More knots will yield a tighter training fit, but will not necessarily improve the test MSE. Each knot uses up 1 degree of freedom in the model, so be mindful before adding more knots. Knots can be evenly spaced or specifically selected based on business knowledge or observations of plots.
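For example, a cubic spline with hand-picked knots can be sketched with splines::bs(); the knot locations and variable names below are hypothetical:
  library(splines)
  spline.fit <- lm(y ~ bs(x, knots = c(25, 50, 75), degree = 3), data = my.data)
  # more knots -> tighter training fit, but compare test MSE via CV before adding them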
Spline (MARS) Models -- what does MARS stand for?
Multivariate Adaptive Regression Spline
what are 2LL AIC and BIC good for?
NOT useful for evaluating individual models; good for comparing models; smaller deviance is better
Ridge Regression Intuition
OLS minimizes SSE. Ridge minimizes SSE plus a penalty. We can vary the penalty λ, thus controlling the shrinkage. If we set the shrinkage factor λ = 0, Ridge minimizes SSE --> same as OLS. If the tuning parameter λ = ∞, Ridge yields the null model y = β0. If we set λ very large, then the resulting β's have to be very small --> i.e., we shrink the coefficients.
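A small Ridge sketch with the glmnet package (alpha = 0 is Ridge); x, y and my.data are hypothetical names:
  library(glmnet)
  x <- model.matrix(y ~ ., data = my.data)[, -1]   # predictor matrix (drop the intercept column)
  y <- my.data$y
  ridge.fit <- glmnet(x, y, alpha = 0)             # alpha = 0 -> Ridge; fits a whole grid of lambda values
  plot(ridge.fit, xvar = "lambda")                 # coefficients shrink toward 0 as lambda grows
  coef(ridge.fit, s = 0.1)                         # coefficients at a specific lambda (0.1 is arbitrary)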
what is the difference between PCR and Partial Least Squares (PLS)?
PCR is an unsupervised method of finding the M components; PLS is a supervised method of finding the M components
what makes PLS a supervised method in finding M?
PLS is a dimension reduction method, but unlike PCR, PLS does further rotation of the dimensions to maximize the correlation of the PC's with Y. PLS attempts to find directions that not only explain the predictors, but also the outcome variable, by placing stronger weight on variables that are more strongly correlated with Y.
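A matching PLS sketch with the pls package (same hypothetical my.data and y):
  library(pls)
  pls.fit <- plsr(y ~ ., data = my.data, scale = TRUE, validation = "CV")
  summary(pls.fit)                                             # CV error by number of directions M
  pls.pred <- predict(pls.fit, newdata = my.data, ncomp = 2)   # predict with the chosen M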
when is a quadratic model appropriate to use?
(1) Positive (negative) but diminishing and, at a point, the effect becomes negative (positive) -- βx (+); βx² (-). (2) Positive (negative) but augmenting (diminishing) with x -- βx (+); βx² (+).
what are the two popular dimension reduction methods?
Principal Components Regression, Partial Least Squares
two types of Dimension Reduction Methods
Principal Components Regression (PCR), Partial Least Squares Regression (PLS)
What polynomial degree is optimal?
Quadratic is best for interpretation. Higher polynomials produce better training model fit and less bias --> but generate more variance and overfitting, especially at both ends of the curve -> "wagging the tail". The best polynomial should be selected using cross-validation.
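A sketch of selecting the degree by 10-fold CV with boot::cv.glm(); my.data, y and x are hypothetical names:
  library(boot)
  cv.err <- sapply(1:4, function(d) {
    poly.fit <- glm(y ~ poly(x, d), data = my.data)    # polynomial of degree d
    cv.glm(my.data, poly.fit, K = 10)$delta[1]         # 10-fold CV estimate of the MSE
  })
  which.min(cv.err)   # degree with the smallest CV MSE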
Holdout Sampling aka?
Random Splitting Cross Validation (RSCV)
Polynomial Models (Squared, Cubic, etc.)
Relationship between Y and the X's is not linear
Binary x Continuous Interaction Models
Relationship between Y and the X's is not linear; very popular in predictive analytics; multiplicative effect
2 regularization methods?
Ridge Regression and LASSO; Elastic Net uses a weighted average (WA) of the Ridge and LASSO penalties
what are the most common penalized shrinkage / regularization methods?
Ridge and LASSO Regressions
summary for Shrinkage and Regularization, when is it better to use one over the other?
Ridge is better when all predictors must be included. LASSO is better when some variable coefficients can shrink to 0.
Bootstrapping process
Suppose you have a dataset with N observations. Draw a random sample of N observations with replacement. Because you sampled with replacement, a few observations may be selected more than once and some observations will be left out. You now have a bootstrapped sample, which is equal in size to the original dataset. You can now train the model with the bootstrapped sample of N observations and test the model with the observations left out. Repeat this process hundreds or thousands of times; there is no limit.
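A minimal sketch of one bootstrap iteration; my.data and y are hypothetical names:
  n <- nrow(my.data)
  boot.idx <- sample(1:n, size = n, replace = TRUE)   # N observations drawn with replacement
  boot.train <- my.data[boot.idx, ]                   # bootstrapped sample, same size as the original
  boot.test  <- my.data[-unique(boot.idx), ]          # observations left out of the resample
  boot.fit <- lm(y ~ ., data = boot.train)            # train on the bootstrapped sample
  boot.mse <- mean((boot.test$y - predict(boot.fit, boot.test))^2)   # test on the left-out rows
  # repeat hundreds or thousands of times and average the test MSEs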
how to find MSE in KFCV?
The resulting MSE is an average of each of the K MSE calculations
what are important / useful properties of PCs?
They are perpendicular (i.e., "orthogonal") to each other, thus they are independent with 0 correlation (--> no multicollinearity). They are sorted from highest to lowest variance. If the predictors are highly correlated, the first few PCs will account for a large proportion of the variance in the data.
summary of Variable Selection, what do you do with predictors?
Use Subset or Step methods to find the predictor set that maximizes predictive accuracy while keeping multicollinearity at tolerable levels.
what are the two approaches to decide how far to grow a model?
(1) Use model fit statistics that adjust for (i.e., penalize) model size and complexity, such as the Adjusted R2 and the Akaike Information Criterion (AIC); (2) use CV testing --> train and test with different subsamples
how do you specify a piecewise linear model?
We do this by creating binary & interaction variables and specifying the model accordingly
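For example, with one knot the binary-and-interaction specification can be sketched as follows (the knot value 50 and the variable names are hypothetical):
  my.data$d <- ifelse(my.data$x > 50, 1, 0)        # binary indicator for the second segment
  piecewise.fit <- lm(y ~ x * d, data = my.data)   # expands to x + d + x:d -> different intercept and slope per segment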
example of unsupervised learning
We learn many things in life (e.g., how to walk) without even thinking or without specific goals in mind. Likewise, when we apply machine learning methods to learn from the data without a specific goal, this is called "unsupervised" learning.
after finding the 1st PC, what do you do?
We then evaluate all linear combinations that are perpendicular (i.e., "orthogonal") to the 1st PC; the one with the highest variance is called the 2nd PC; and so on.
When do you use polynomial functions?
When there is a business reason to suspect a non-linear relationship
there are various type of piecewise functions but what are they dependent on?
(1) which kind of model is fit in each segment; (2) how the segments are connected at the knots
how to shrink a model in variable selection?
With variable selection methods the final model is somewhere in between the Null model (no predictors) and the Full model; we add and/or remove variables until we find an optimal set of predictors to retain in the model.
what is the different between models with and without an interaction term?
Without the interaction term, the equation is like having two parallel regression lines -- additive. With the interaction term, we also get two regression lines, but the lines are no longer parallel -- multiplicative.
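A sketch of the two specifications in R; y, x and the binary predictor d are hypothetical names:
  add.fit <- lm(y ~ x + d, data = my.data)   # additive: two parallel lines
  int.fit <- lm(y ~ x * d, data = my.data)   # multiplicative: the x:d term lets the slopes differ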
when do you use Ridge Regression?
XI - X's are not independent (i.e., are highly correlated --> high dimensionality)
when do you use LASSO regression?
XI - X's are not independent (i.e., are highly correlated--> high dimensionality)
what does a nonlinear model mean?
Y does not have a linear relationship with all of the X's
Step Function (Piecewise Function)
a flat average (no slope) in each section
In PLS, are the X's standardized?
Yes
What do higher polynomials yield?
Yield more peaks, valleys and inflections (i.e., curvier functions); produce more overfitting and other dimensionality issues; yield lower bias and higher variance estimators; harder to interpret
caret package
You can do model training (i.e., fitting) with a wide selection of modeling methods, and testing with a wide selection of cross-validation methods, with a single command train(), just by changing a few attributes.
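A minimal caret sketch; the formula, data frame and method choice below are illustrative:
  library(caret)
  ctrl <- trainControl(method = "cv", number = 10)                    # 10-fold CV
  caret.fit <- train(y ~ ., data = my.data, method = "lm", trControl = ctrl)
  caret.fit$results                                                   # CV RMSE and R-squared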
relationship of size of datasets and N%
You can use higher percentages with small data sets and lower percentages with large data sets.
Smoothing Spline (Piecewise Function)
a combination of Ridge (i.e., shrinkage) and Spline, which has the effect of smoothing the curve at the knots; how much smoothing is determined by a tuning parameter
Piecewise Linear(Piecewise Function)
a different linear regression in each section
Piecewise Polynomial (Piecewise Function)
a different polynomial in each section
Because LASSO coefficients CAN become 0 when the tuning parameter is sufficiently large, what is LASSO thought off as?
a hybrid between variable selection and shrinkage
what does a step function predict?
a mean value for each section
Mallow′s Cp
a measure that makes adjustments for model size and complexity; it adjusts the MSE by applying a penalty for the number of variables "p" (thus the name Cp) and for the variance of the errors (same idea as the Adjusted R^2)
Spline (Piecewise Function)
a piecewise function, but connected at the "knot" (i.e., no jumps at the knot)
Cubic Splines
a popular polynomial spline of degree=3, which is the default degree in most spline functions.
similar to PCR, what is M for the PLS regression?
a tuning parameter
unlike ridge coefficients, the LASSO methods how small will it shrink coefficients?
all the way to 0 with large tuning parameter
in a linear model, what are the effects of the x's?
additive
Akaike Information Criterion (AIC)
adjusts 2LL for model size and complexity (k = number of parameters estimated): AIC = 2LL + 2k
Bayesian Information Criterion (BIC)
adjusts for model size and complexity; similar to AIC, but it also factors in the sample size n: BIC = 2LL + k*log(n)
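Both are reported directly in R for a fitted model; the model below is a hypothetical example:
  glm.fit <- glm(y ~ x1 + x2, data = my.data, family = binomial)   # a hypothetical logistic model
  AIC(glm.fit)   # 2LL + 2k
  BIC(glm.fit)   # 2LL + k*log(n)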
what happens with LASSO regressions when the tuning parameter grows?
all coefficients shrink proportionally; the L1 Norm gets smaller, providing a measure of total shrinkage
process of Ridge Regression?
all y and x's are standardized; the beta coefficients are un-standardized back to their normal scales when reported; the best tuning parameter is the one that minimizes the CV MSE
why use piecewise linear models?
an attractive alternative to polynomials: fit a different linear regression model in each section; they retain the simplicity of linear models and are easier to interpret. And because the slope of the regression can change at any selected point, they can provide a better training fit with less overfitting.
what does it mean when a model is over-fitted with high variance?
such models are notorious for having high predictive accuracy when evaluated with the same train subset, but tend to perform poorly when tested with different data.
what happens to bias and variance in PCR?
as with regular OLS, as M (the number of components combining the P predictors) increases, bias decreases and variance increases
Cross Validation (CV)
at the heart of machine learning -- refers to fitting (training) models with one dataset sub-sample and evaluating (testing) them with a different sub-sample
Logic Based AI
based on "what-if" decision algorithms
how to find the penalty in ridge regression?
based on the sum of the squared coefficients --> re-scaling the variables changes the results disproportionately (hence the standardization)
what is low variance good for?
best for hypothesis testing, inference and interpretation
what is low bias good for?
best for predictive accuracy
common value of N in random splitting?
between 70 and 80 %
what are similar about LASSO and Ridge coefficients?
biased; low variance; scale variant
what type of learning does ML include?
both supervised and unsupervised
what is the best selector in finding the optimal M in PLS and PCR?
cross-validation
How does CV help model estimation?
certain methods like neural networks fit models through repeated iterations. CV is used in each iteration to adjust the model recursively and find the best model fit.
how do you come to a conclusion on the best tuning parameter in Ridge and LASSO regression?
compare several tuning parameters and select the one with the best cross validation results
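In practice this is what a CV grid search over lambda looks like with cv.glmnet() (alpha = 1 for LASSO, 0 for Ridge); x and y are the hypothetical matrix and vector from the Ridge sketch above:
  library(glmnet)
  cv.lasso <- cv.glmnet(x, y, alpha = 1)    # 10-fold CV over a grid of lambda values
  cv.lasso$lambda.min                       # the lambda with the smallest CV MSE
  coef(cv.lasso, s = "lambda.min")          # LASSO: some coefficients may be exactly 0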
what does Machine Learning help with?
comparing models and model specifications to evaluate the predictive accuracy of each alternative and select best one
Expert Systems
contain large databases with rules acquired from experts and make decisions based on those rules
when tuning parameter increases what happens to ridge coefficients?
decrease
goal of trade off between Bias and Variance?
develop predictive models with low bias and low variance; this is not always possible --> variance vs. bias is a tradeoff
Splines
effect is different across sections of the data
Step and Piecewise
effect is different across sections of the data
Smoothing Splines
effect is different across sections of the data with a "smoothing" parameter
Polynomial
effect is quadratic, cubic
types of AI
expert systems; logic-based; machine learning (includes deep learning)
unsupervised learning methods
exploring descriptive statistics and correlations; no specific goal
what is the trade off in dimension reduction?
gain predictive accuracy at the expense of interpretability
impurity error
how mixed the actual values are after classification
LOOCV vs KFCV
identical to KFCV when K is equal to the number of observations N (K=N)
example of dimension reduction?
if a vehicle's volume, horsepower, and weight are negatively correlated with its gas mileage, but we know that these 3 predictors are highly correlated with each other, we can take advantage of this correlation and combine them into a new variable called something like "car size" composed of some percentage of volume, plus some of horsepower, plus some of weight, reducing the model variables from 3 to 1
when is a higher polynomial appropriate?
if the relationship between y and x is more wavy
the MSE and RMSE decrease as model complexity and size...
increase (over fitting)
what happens to a shrinking regression?
increases bias -- because the coefficients are no longer the true coefficients but are artificially shrunk; reduces variance; improves predictive accuracy compared to variable selection methods
tuning parameter
something we can change to refine and improve a model; most models that allow for tuning parameters use the Greek letter λ (lambda) to denote them
what happens if you use a large K in KFCV?
large K is computationally expensive and may over fit the data --> small bias, high variance
supervised learning
learning things in life with a specific purpose or goal
what are piecewise linear model similar to?
linear spline
how do Piecewise and Splines help polynomials?
lower variance and make a simpler alternative
Deep Learning
machine learning with several layers of learning nodes; tradeoff: computational power vs. accuracy
what does regularized, penalized, and shrinkage methods all aim to do?
methods aim at shrinking (and thus biasing) the regression coefficients to avoid issues of multi-collinearity
what is the objective of error measures with classification models?
minimize 2LL (deviance), minimize classification error, maximize classification purity
what is the objective of error measures with quantitative models?
minimize MSE and maximize R-Squared
"interaction" effect
multiplicative; one predictor enhances/offsets the effect of the other
two branches of deep learning in ML
neural networks; TensorFlow
how small will ridge coefficients go?
never all the way to 0 except when the tuning parameter = infinity
validation subset
new data used to further evaluate and validate the models with new data
why do we use Dimension Reduction?
it is not feasible to use variable subsets or shrinkage when there are hundreds or thousands of predictors; in these cases you must explore linear relationships among predictors -- so you use the observed correlations to create new variables as linear combinations
what are the two most important tuning parameters?
(1) the number of segments or knots to use; (2) the polynomial degree or function to use in each segment
why are Piecewise Functions and Splines used?
polynomials can be complex, with over-identification, high dimensionality and high variance ("wagging the tail"); piecewise functions and splines offer a simpler alternative to complex polynomials
Bootstrapping
popular and conceptually simple statistical approach, applicable to just about any machine learning method
L1 norm
popular measure of the actual overall shrinkage of LASSO regression coefficients
L2 norm
popular measure of the actual overall shrinkage of the Ridge regression coefficients
main uses of CV
predictive accuracy evaluation; model comparison; tuning parameter selection; model estimation
what is the main goal of Principal Components Regression (PCR)?
reduce the dimensionality (i.e., number of variables) of a model without removing predictors -- since the PC's are sorted from highest to lowest variance -> possible that first few PC's may explain a high portion of the variance in the data
what is dimension reduction about?
reducing the estimation of P + 1 coefficients (β0, β1, β2, ...) to M + 1 coefficients (α0, α1, α2, ...)
Artificial Intelligence
refers to computer systems programmed to respond and behave like humans; what-if logic; includes learning, reasoning, and self-correction
two branches of statistical learning in ML
regression models; tree models; natural language processing
N-Splines - or Natural Splines
similar to polynomial splines, except that the first and last segments are fitted with linear models. This often improves the predictive accuracy because the standard errors of the linear splines at the tail ends are typically lower than those of polynomial splines (less "tail wagging").
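A natural-spline sketch with splines::ns(); the variable names and df value are hypothetical:
  library(splines)
  ns.fit <- lm(y ~ ns(x, df = 4), data = my.data)   # natural cubic spline; linear in the first and last segments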
what happens if you use a small K in KFCV?
small K yields small training samples -> not great with small datasets -> high bias, small variance
What is different in LASSO regression than Ridge?
some LASSO coefficients do eventually become exactly 0 as the tuning parameter increases
Smoothing Splines
splines with further smoothing at the knots. The degree of smoothing can be tuned
two branches of machine learning
statistical learning; deep learning
what happens if M << P?
substantial dimension reduction; otherwise, not useful -- Since each PC is a linear transformation of all predictors --> all P predictors are represented
what type of learning are most predictive analytics?
supervised learning -- there is a specific prediction goal
Machine Learning
systems that learn mostly from historical data without explicit programming
when is CV Test of Classification Error used?
testing predictive accuracy of classification models
when is CV Test of Deviance used?
testing predictive accuracy of classification models
when is CV Test of MSE used?
testing predictive accuracy of quantitative models
"shaking the tree"
testing the robustness of your model by dropping a few random data points and seeing if your model parameter estimates change substantially
train error
the MSE or deviance calculated with the same training data used to develop the model --> NOT very useful !!
why is Bias important?
the ability of an estimator or model to predict the correct value
In a LASSO regression, if you set the tuning parameter to be very large, what happens to the betas?
the betas have to be very small -- shrinking the coefficients
what does it mean for the coefficients in PLS if they are supervised?
the coefficients are less biased but have more variance than PCR
data mining
the computational process of discovering trends in data which were previously unknown
components
new variables created as linear combinations of correlated predictors; the correlation matrix is used to figure out how to group the predictors
Continuous x Continuous Interactions
the effect of one continuous predictor is affected by another continuous predictor
Binary x Continuous Interactions
the effect of one predictor is affected by another; one predictor is binary and the other is continuous
how to shrink a model with shrinkage methods?
the final model is also somewhere in between the Null model and the Full model. But rather than removing variables, we start with the Full model and shrink the coefficients by a factor λ, varying λ from 0 (no shrinkage -- i.e., Full model) to ∞ (all coefficients shrunk to 0 -- i.e., Null model).
what does PLS regression do with M1?
the first direction M1 starts by placing more weight on the variable that has the strongest correlation with Y
why is cross validation helpful in machine learning?
the increase in variance introduced by multi-collinearity, heteroskedasticity, serial correlation, or dimensionality becomes evident with CV
Leave-One-Out Cross Validation (LOOCV)
the model is trained with N-1 data points and the test MSE or other fit statistics are calculated with the data point left out; this is repeated N times, leaving each data point out, one at a time
what makes PCR an unsupervised method in finding M?
the predictor variable dimensions are rotated to find directions in which the data exhibits the highest variance; it is an unsupervised method because the outcome variable Y is not taken into account when doing PCA, so there is no guarantee that the first few M components will be the best directions to predict Y
in a nonlinear model, what are the effects of the x's?
the predictors may be multiplicative
what happens if N% is too big in partitioning?
the resulting training models may be over-fitted.
B-Splines (Basis Splines)
the same thing as polynomial splines, but there are some nuances associated with the name, which differentiate them from N-Splines
what happens when models are over-identified?
the training error tends to be small
linear model
the truth is never linear, but the linearity assumption is often good enough: y=β_0+β_1 x_1+β_2 x_2+etc.
why are quadratic models popular?
they are intuitive while providing a curvilinear fit
why are the PLS components no loner PC's (Partial Compnents)?
they deviate from the directions of higher variance, so they are simply called first, second, etc. "directions"
1st Principal Component (PC)
evaluate every possible linear combination of the p predictors and identify the one with the highest variance -- the linear combination along which the data has the most variance
what does dimension reduction do?
use observed correlations to create new variables as linear combinations; a few of the new components may explain a large portion of the variance in the data, thus helping reduce the model dimension without losing much predictive power
summary of Variable Selection, when do you use?
use when: (1) exploring variables and adding/ removing predictors is OK; and (2) when the goal is interpretation.
how can dimensionality sometimes be resolved?
through variable selection: if predictors can be added or removed as needed and multi-collinearity is at tolerable levels with the variables selected, then dimensionality is taken care of
what is the difference between variable selection and shrinkage methods in a full model?
variable selection: some predictors are retained and others are dropped. Shrinkage methods: all predictors are retained, but the betas are shrunk coefficients; the x's are standardized, but the betas are un-standardized back to their raw scales when reported
when is CV necessary?
when comparing dissimilar predictive models (OLS reg vs reg tree)
what is the effect of the tuning parameter in LASSO Regression?
when the tuning parameter = 0 --> LASSO minimizes the SSE --> LASSO = OLS; when the tuning parameter is infinity --> LASSO = Null Model
when is caret package useful?
when you need to fit many models and test them in different ways, because it uses the same syntax for all model types and cross-validation methods
K-Fold Cross Validation (KFCV)
widely used approach for cross validation testing. The idea is to randomly divide the data into K (roughly) equal parts. The model is trained on K-1 folds and tested on the remaining fold. Repeat K times, rotating each of the K partitions as the test fold (e.g., if K = 5: Train, Train, Train, Train, Test). There is no rule of thumb for the number of folds K.
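A minimal manual K-fold loop (K = 10) for illustration; my.data and y are hypothetical names:
  set.seed(1)
  K <- 10
  folds <- sample(rep(1:K, length.out = nrow(my.data)))       # random fold assignment
  fold.mse <- numeric(K)
  for (k in 1:K) {
    cv.fit <- lm(y ~ ., data = my.data[folds != k, ])         # train on K-1 folds
    cv.pred <- predict(cv.fit, newdata = my.data[folds == k, ])
    fold.mse[k] <- mean((my.data$y[folds == k] - cv.pred)^2)  # test MSE on the held-out fold
  }
  mean(fold.mse)   # the K-fold CV MSE is the average of the K fold MSEs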
when is CV particularly important?
with over-fitted models
Linear Splines
work just like the piecewise linear models, but using R spline( ) functions with degree=1
when do you use Dimension Reduction Models?
x's are not independent -- highly correlated --> high dimensionality
when is PCR not useful?
x1 and x2 are uncorrelated
what happens if N % is too small in partitioning?
you lose statistical power with the reduced training set, which is problematic with small samples