MAS II: Statistical Learning

does scaling have a significant effect on the result of PCA?

YES

for KNN, should predictor variables be scaled before performing?

YES

in a boosted model, is a small or large shrinkage parameter (learning rate) desirable?

a SMALL shrinkage parameter is often desirable. a slow learning rate prevents the model from fitting the training data too quickly and too closely, which allows the model to capture the essential patterns rather than the noise in the data

what is a reversal in a simple quantile plot?

a reversal occurs when the average actual response decreases as quantile increases

what kind of plot can be used to determine the number of principal components?

a scree plot

principal components features

all principal components are uncorrelated with one another
a dataset has min(n-1, p) distinct principal components
the first k principal component scores and loadings approximate the original dataset
principal components are low-dimensional surfaces in p-dimensional space that are closest to the observations
each principal component loading vector is unique up to a sign flip

dense layer

another name for a fully connected hidden layer placed after all convolution and pooling layers
parameters: (Ki+1)*Ki₊₁

bayes classifier

assigns each observation to the most likely class given its predictor values
f(x₁,...,xp) = arg max over c of Pr(Y=c | X₁=x₁,...,Xp=xp)

which linkage is preferred over single linkage?

average and complete linkage are generally preferred over single linkage, as they tend to produce more balanced dendrograms
single linkage tends to produce skewed dendrograms, while centroid linkage could produce dendrograms with inversions

what methods can use out-of-bag error to estimate test error?

bagging AND random forests

how does bagging reduce variance?

bagging reduces variance by averaging the predictions of all the (unpruned) bagged trees

as the number of trees increases, which method affects the error rate the least and the most out of bagging, boosting and random forests?

boosting error rate decreases the most, and bagging error rate decreases the least. random forest is in between. the test errors for bagging and random forest tend to be similar. this is because bagging is a special case of random forests. random forest tends to perform better than bagging, thus resulting in a lower test error. this is because random forests have the ability to decorrelate trees, which can lead to a larger variance reduction.

the result of clustering depends on many parameters, such as:

choice of k in k-means clustering
choice of number of clusters, linkage, and dissimilarity measure in hierarchical clustering
choice to standardize variables

decision trees - recursive binary splitting classification

classification: minimize [1/n]∑ nm*Im, where Im is the impurity measure of node m and p^m,c = nm,c/nm is the proportion of node m's training observations in class c

decision trees - cost complexity pruning classification

classification: minimize [1/n]∑ nm*Im + λ|T|, where |T| is the number of terminal nodes

neural networks parameters

coefficients are also called weights (this includes intercepts)
intercepts are also called biases

input layer

consists of the p features
parameters: 0

neural networks are estimated to minimize what for a classification problem?

cross entropy: -∑∑ yi,c*ln[fc(xi)]

advantages of decision trees

easy to interpret and explain
can be presented visually
manage categorical variables without the need for dummy variables
mimic human decision-making

output layers

single output: f(x) = β₀ + ∑ βk*Ak
multiple outputs: Ol = βl,₀ + ∑ βl,k*Ak
parameters: (K+1)*w, where w is the number of output variables

what is the complement of specificity/true negative rate?

false positive rate

clustering methods seek to

find subgroups of homogeneous observations (true for all clustering methods)

steps to calculate the proportion of variance explained by the first principal component

1. calculate the first principal component score for each observation: zi,₁ = ∑ ∅j,₁*xi,j
2. calculate the sample variance of the first principal component scores: s²z₁ = [1/n]∑ z²i,₁
3. divide the sample variance of the first principal component scores by the total variance in the data set: s²z₁ / ∑ s²xj
if the data is centered and scaled, the total variance in the data set equals the number of variables
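
A minimal NumPy sketch of these steps, assuming a small standardized toy data matrix X (all data below is illustrative, not from the source):

```python
import numpy as np

# Hypothetical standardized data matrix: n observations, p features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # center and scale

# Loadings are the eigenvectors of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by variance explained
phi1 = eigvecs[:, order[0]]                # first loading vector

z1 = X @ phi1                              # first principal component scores
pve1 = z1.var() / X.var(axis=0).sum()      # variance of the scores / total variance
print(round(pve1, 4))
```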

for k-means clustering, the algorithm needs to be repeated...

for each k

rectified linear unit (ReLU) activation function

g(x) = 0 for x < 0
g(x) = x for x ≥ 0

sigmoid activation function

g(x) = e^x / [1+e^x]

softmax activation function

g(xc)=e^xc / ∑e^xm
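
A quick sketch of the three activation functions above (ReLU, sigmoid, softmax); the test values are illustrative only:

```python
import numpy as np

def relu(x):
    # 0 for x < 0, x for x >= 0
    return np.maximum(x, 0.0)

def sigmoid(x):
    # e^x / (1 + e^x), equivalently 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # e^{x_c} / sum_m e^{x_m}; subtracting the max is a standard numerical-stability trick
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(relu(np.array([-2.0, 3.0])))               # [0. 3.]
print(sigmoid(0.0))                              # 0.5
print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0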

stochastic gradient descent

gradient descent, but instead of all n observations contributing to the calculation of gradient, only a sampled minibatch does

detector layer

hidden layer that applies the ReLU function to the convolved image
parameters: 0

pooling layer

hidden layer that reduces the size of the detected patterns
parameters: 0

convolution layer

hidden layer used to detect small patterns in an image
parameters: (Ki+1)*Ki₊₁

embedding layer

hidden layer used to reduce p dimensions to m dimensions
parameters: m*p

knn bias and flexibility

high k → low flexibility → high bias → low variance → high training error/MSE
for test data, the error rate is minimized at an intermediate value of k

discrimination threshold

if the response is greater than or equal to the threshold, we predict the event will occur
if the response is less than the threshold, we predict the event will not occur

bagging properties

increasing b does not cause overfitting
bagging reduces variance
out-of-bag error is a valid estimate of test error
makes the model difficult to interpret
a special case of random forests

random forests properties

increasing b does not cause overfitting
decreasing k reduces the correlation between predictions
every tree is constructed independently of every other tree

boosting properties

increasing b, the number of cycles, can cause overfitting
boosting reduces bias
d, the number of splits for each tree, controls the complexity of the boosted model
λ, the proportion of the response that is added in at each cycle, controls the rate at which boosting learns

how does knn compare to linear regression?

knn regression will outperform linear regression when the chosen functional form poorly approximates the true relationship between the response and explanatory variables

boosting algorithm

let z₁ be the actual response variable, y
1. for k = 1, 2, ..., b:
- use recursive binary splitting to fit a tree f^k with d splits to the data, with zk as the response
- update zk by subtracting λ*f^k(x), i.e., let zk₊₁ = zk - λ*f^k(x)
2. calculate the boosted model prediction as f^(x) = ∑ λ*f^k(x)
the prediction is obtained by fitting successive trees to what remains of the response
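
A minimal sketch of this loop using scikit-learn trees; the data, and the choices b = 100, d = 2, λ = 0.1, are illustrative (a tree with d binary splits is approximated here via max_leaf_nodes = d + 1):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # toy response

b, d, lam = 100, 2, 0.1          # number of trees, splits per tree, shrinkage
z = y.copy()                     # z_1 = actual response
trees = []
for _ in range(b):
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, z)
    trees.append(tree)
    z = z - lam * tree.predict(X)          # z_{k+1} = z_k - lambda * f^k(x)

def boosted_predict(X_new):
    # f(x) = sum_k lambda * f^k(x)
    return lam * sum(t.predict(X_new) for t in trees)

print(np.mean((boosted_predict(X) - y) ** 2))   # training MSE
```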

bayesian additive regression trees (BART) properties

like bagging and random forests, BART incorporates randomness
like boosting, BART sequentially builds trees to capture information not captured by previous trees

max pooling

max pooling collapses each non-overlapping 2×2 submatrix in a matrix by choosing the maximum value
this divides the height and width of the matrix by 2
example: a 1,920×1,080 matrix condenses into a 960×540 matrix
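
A small NumPy sketch of 2×2 max pooling on a toy 4×4 matrix (illustrative only):

```python
import numpy as np

def max_pool_2x2(mat):
    # Collapse each non-overlapping 2x2 submatrix to its maximum value.
    h, w = mat.shape
    assert h % 2 == 0 and w % 2 == 0, "dimensions must be divisible by 2"
    return mat.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(16).reshape(4, 4)
print(max_pool_2x2(img))   # 4x4 -> 2x2; e.g. the top-left block {0, 1, 4, 5} -> 5
```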

lift

measures a model's ability to avoid adverse selection by accurately determining an actuarially fair premium rate for each insured

PCA is most useful when...

multicollinearity is present in the features

for KNN, the number of training observations n must be ____ to produce good predictions

n must be LARGE

disadvantages of decision trees

not robust
do not have the same degree of predictive accuracy as other statistical methods

for hierarchical clustering, the algorithm only needs to be performed...

once for any number of clusters

actual vs predicted plots

plots the actual response variable against the predicted response variable for each model
the better model is closer to the diagonal line

simple quantile plots

plots the average actual response and the average predicted response for each quantile, for each model
the better model is better at predicting the actual response in each quantile, has fewer reversals, and has a larger vertical distance between the first and last actual quantiles (big lift)
the predicted line will always be monotonically increasing, while the actual line may not be

lorenz curves

plots the cumulative percentage of actual response against the cumulative percentage of exposures
the Gini index is twice the area between the Lorenz curve and the line of equality (the diagonal line)
the better model has a larger Gini index
1. sort the data in ascending order of predictions
2. calculate the cumulative % of exposures/observations (x-axis)
3. calculate the cumulative % of actual response (y-axis)
4. plot
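
A minimal sketch of these steps and the Gini calculation, assuming equal exposures and a hypothetical set of predictions and actual losses (all numbers illustrative):

```python
import numpy as np

# Hypothetical predictions and actual losses for 6 policies (equal exposure each).
pred = np.array([300., 120., 500., 80., 250., 410.])
actual = np.array([280., 100., 620., 60., 300., 450.])

order = np.argsort(pred)                                  # 1. sort ascending by prediction
cum_expo = np.arange(1, len(pred) + 1) / len(pred)        # 2. cumulative % of exposures
cum_actual = np.cumsum(actual[order]) / actual.sum()      # 3. cumulative % of actual response

# Gini index = twice the area between the line of equality and the Lorenz curve,
# approximated here with the trapezoidal rule (prepending the origin).
x = np.concatenate(([0.0], cum_expo))
y = np.concatenate(([0.0], cum_actual))
gini = 2 * np.trapz(x - y, x)
print(round(gini, 4))
```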

receiver operating characteristics (ROC) curves

plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different values of the discrimination threshold
AUROC is the area under the ROC curve
the better model has a larger AUROC
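
A small sketch of building an ROC curve by sweeping the threshold and approximating the AUROC; the probabilities and classes are hypothetical:

```python
import numpy as np

# Hypothetical predicted probabilities and actual classes (1 = positive).
prob = np.array([0.10, 0.35, 0.40, 0.55, 0.70, 0.90])
y = np.array([0, 0, 1, 0, 1, 1])

points = []
for t in np.unique(np.concatenate(([0.0], prob, [1.0 + 1e-9]))):
    pred = (prob >= t).astype(int)            # predict "event" when prob >= threshold
    tpr = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)   # sensitivity
    fpr = np.sum((pred == 1) & (y == 0)) / np.sum(y == 0)   # 1 - specificity
    points.append((fpr, tpr))

points.sort()                                 # order by FPR for the area calculation
fpr, tpr = zip(*points)
auroc = np.trapz(tpr, fpr)                    # area under the ROC curve
print(round(auroc, 3))
```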

bayes error rate

the error rate of the Bayes classifier: the probability of not selecting the most likely class, averaged over the observations
bayes error rate = 1 - E[max over c of Pr(Y=c|X)]
for a binary response, this can be estimated as 1 - average(max(p, 1-p))

decision trees - recursive binary splitting regression

regression: minimize ∑∑ (yi-ybarRm)²

decision trees - cost complexity pruning regression

regression: minimize ∑∑(yi-ybarRm)²+ λ|T|

what mitigates the risk of overfitting for deep learning?

regularization such as lasso, ridge, early stopping, and dropout learning

confusion matrix

sensitivity = the actual observations that were correctly predicted as positive, i.e., true positive/actual positive
specificity = the actual observations that were correctly predicted as negative, i.e., true negative/actual negative

loss ratio charts

sort by predicted loss ratio, group into quantiles, and plot the actual loss ratio for each quantile
used to examine the efficacy of the current rating plan
if the new rating plan can distinguish between policies with low loss ratios and those with higher loss ratios, the current rating plan is poor
if the chart shows monotonically increasing actual loss ratios from left to right, the model is doing a good job

double lift charts

sorts data by sort ratio = predicted (model A) / predicted (model B)
plots the average actual response and the average predicted response for each quantile, for each model, in one chart
each index is calculated by dividing the average response for that decile by the overall average response
the better model is better at predicting the actual response in each quantile
directly compares two models

neural networks are estimated to minimize what for a regression problem?

squared-error loss ∑[yi-f(xi)]²

k-mean clustering within-cluster variation

version 1: sum the squared euclidean distances between all pairs of observations in the cluster, counting each pair twice (i.e. both (x₁,x₂) and (x₂,x₁)), then divide by the number of observations in the cluster
version 2: calculate the centroid, sum the squared euclidean distances from the centroid to each observation, then multiply by 2
both versions give the same within-cluster variation (see the sketch below)
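
A small NumPy check that the two versions agree, using a hypothetical cluster of three observations:

```python
import numpy as np

# Hypothetical cluster of 3 observations with 2 features each.
C = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0]])
n_k = len(C)

# Version 1: (1/|C_k|) * sum of squared euclidean distances over all ordered pairs.
pairwise = sum(np.sum((C[i] - C[j]) ** 2) for i in range(n_k) for j in range(n_k))
w1 = pairwise / n_k

# Version 2: 2 * sum of squared euclidean distances from each observation to the centroid.
centroid = C.mean(axis=0)
w2 = 2 * np.sum((C - centroid) ** 2)

print(w1, w2)   # the two formulas agree (32.0 and 32.0 here)
```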

decision trees structure

terminal nodes or leaves represent the partitions of the predictor space
internal nodes are points along the tree where splits occur
terminal nodes do not have child nodes, but internal nodes do
branches are lines that connect any two nodes
a decision tree with only one internal node is called a stump

test error rate for classification problems

test error rate = E[I(Y ≠ Y^)], which can be estimated using [∑ I(yi ≠ y^i)]/n
the proportion of observations that are incorrectly predicted (predicted ≠ actual)

average linkage

the arithmetic mean of all pairwise dissimilarities between observations in the two clusters
dendrograms created using average linkage tend to be balanced

how to determine the difference between a complete linkage dendrogram and an average linkage dendrogram

the complete linkage uses the largest pairwise dissimilarities as the inter-cluster dissimilarity between two clusters. therefore, the last fusion would most likely occur at a higher height in complete linkage dendrograms versus average linkage dendrograms

centroid linkage

the dissimilarity between the cluster centroids
dendrograms created using centroid linkage can result in inversions

complete linkage

the largest pairwise dissimilarity between observations in the two clusters
dendrograms created using complete linkage tend to be balanced

what are the three tuning parameters that boosting has?

the number of trees, the shrinkage parameter (learning rate), and the number of splits in each tree (interaction depth)

single linkage

the smallest pairwise dissimilarity between observations in the two clusters
dendrograms created using single linkage can result in extended, trailing clusters in which single observations are fused one at a time, i.e. a skewed dendrogram

are predictors categorical/numerical for KNN?

they are ALL numerical, since euclidean distance is used to measure how close observations are to one another

specificity

true negative rate
the percentage of negative observations with correct predictions
a non-decreasing function of the discrimination threshold: as the threshold increases, specificity increases from 0 to 1

sensitivity

true positive rate / hit rate
the percentage of positive observations with correct predictions
a non-increasing function of the discrimination threshold: as the threshold increases, sensitivity decreases from 1 to 0

out-of-bag (OOB) validation

using the observations that were not used to build a given tree to compute the test MSE
for b sufficiently large, the OOB error is virtually equivalent to the leave-one-out cross-validation error

what is the probability that an observation is not selected for a bootstrap sample? what is the expected # of times the observation will be in the out of bag samples?

Pr(not selected) = (1 - 1/n)ⁿ, which approaches 1/e ≈ 0.368 as n grows
multiply this probability by the number of trees to get the expected number of out-of-bag appearances
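
A quick numeric check of the formula and the 1/e limit, plus a small simulation; n = 100 and 500 trees are illustrative choices:

```python
import numpy as np

n, n_trees = 100, 500
p_oob = (1 - 1 / n) ** n                       # P(a given observation is not in one bootstrap sample)
print(round(p_oob, 4), round(np.exp(-1), 4))   # ~0.3660 vs limit 1/e ~0.3679
print(round(n_trees * p_oob, 1))               # expected number of trees for which it is out-of-bag

# Check by simulation: draw bootstrap samples and count how often observation 0 is left out.
rng = np.random.default_rng(0)
left_out = sum(0 not in rng.integers(0, n, size=n) for _ in range(10_000))
print(left_out / 10_000)
```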

Calculate, on average, the proportion of the number of splits that will not consider the first/strongest/weakest/second/etc. predictor.

(p - m)/p
p → total # of predictors
m → # of predictors considered at each split

The following information about a three-layer neural network is given: There are 24 predictors. There are 18 activations in the first layer. There are 5 output variables. The number of parameters needed for the second layer is 1,026. The number of parameters needed for the output layer is 60. Determine the number of parameters needed for the third layer.

Let Ki be the number of activations in the ith layer, and w be the number of output variables.
The number of parameters needed for the second layer is K₂(K₁+1) = K₂(18+1) = 1,026. So, K₂ = 54.
The number of parameters needed for the output layer is w(K₃+1) = 5(K₃+1) = 60. So, K₃ = 11.
The number of parameters needed for the third layer is K₃(K₂+1) = 11(54+1) = 605.
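
A tiny arithmetic sketch of the same parameter counting (values taken from the problem statement above):

```python
# Layer sizes from the problem: 24 predictors, K1 = 18 activations, w = 5 outputs.
p, K1, w = 24, 18, 5

K2 = 1026 // (K1 + 1)          # second layer: K2*(K1+1) = 1026  ->  K2 = 54
K3 = 60 // w - 1               # output layer: w*(K3+1) = 60     ->  K3 = 11
third_layer_params = K3 * (K2 + 1)
print(K2, K3, third_layer_params)   # 54 11 605
```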

does boosting involve bootstrap sampling?

NO. boosting does not involve bootstrap sampling; in boosting, trees are grown sequentially, i.e. each tree is grown using information from previously grown trees.

does KNN perform well in high dimensions?

NO. high dimensions means there are many predictors (in the extreme, n ≪ p); with many predictors, the k nearest neighbors tend to be far from the target observation, which degrades predictions

can a random forest be easily visualized with a single tree diagram?

NO. since a random forest averages the predictions across many trees, it cannot be easily visualized with a single tree diagram

true or false: recursive binary splitting can lead to overfitting the data

TRUE

principal components

zm = ∑ ∅j,m*xj
zi,m = ∑ ∅j,m*xi,j
∑ ∅²j,m = 1 (the loadings of each principal component have sum of squares equal to 1)
∑ ∅j,m*∅j,u = 0 for m ≠ u (loading vectors of different principal components have dot product 0)
example with 2 variables:
Z₁ = ∅₁₁X₁ + ∅₂₁X₂
Z₂ = ∅₁₂X₁ + ∅₂₂X₂
sum of squares of the loadings for each Z equals 1: ∅₁₁² + ∅₂₁² = 1 and ∅₁₂² + ∅₂₂² = 1
the first and second principal loading vectors have dot product 0: ∅₁₁∅₁₂ + ∅₂₁∅₂₂ = 0
if the observations xj are not centered, the score formula becomes zm = ∑ ∅j,m*(xj - x̄j)

principal components - proportion of variance explained (PEV)

total variance: ∑ s²xj, where s²xj = [1/n]∑ x²i,j
variance of the mth principal component: s²zm = [1/n]∑ z²i,m
PVE = s²zm / ∑ s²xj = variance of the component / total variance
the variance explained by each subsequent principal component is always less than the variance explained by the previous principal component
total variance is the sum of the variance explained by the first k principal components and the MSE of the k-dimensional approximation

in BART models, the burn-in samples are...

...initial iterations that are discarded. these initial samples are considered part of the model's warm-up phase and are not used for inference or analysis

for a neural network, the columns in B represent...

...the number of output variables. for example, 5 columns in B mean the network predicts 5 output variables

for a neural network, 3 W matrices would mean...

...there are 3 hidden layers

decision trees algorithm

1. construct a large tree with g terminal nodes using recursive binary splitting
2. obtain a sequence of best subtrees, as a function of λ, using cost complexity pruning
3. choose λ by applying k-fold cross-validation. select the λ that results in the lowest cross-validation error
4. the best subtree is the subtree created in step 2 with the selected λ value

bagging algorithm

1. create b bootstrap samples from the original training dataset
2. construct a decision tree for each bootstrap sample using recursive binary splitting
3. predict the response of a new observation by averaging the predictions (regression) or by using the most frequent category (classification) across all b trees

random forests algorithm

1. create b bootstrap samples from the original training dataset
2. construct a decision tree for each bootstrap sample using recursive binary splitting. at each split, a random subset of k variables is considered
3. predict the response of a new observation by averaging the predictions (regression) or by using the most frequent category (classification) across all b trees
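
A minimal scikit-learn sketch covering both of the ensemble algorithms above: the random forest considers k = 2 of the 6 predictors at each split, while bagging is the special case where all predictors are considered (max_features=None). The data and settings are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)   # toy response

# Random forest: consider a random subset of k of the p predictors at each split.
rf = RandomForestRegressor(n_estimators=500, max_features=2, oob_score=True,
                           random_state=0).fit(X, y)

# Bagging is the special case k = p (all predictors considered at every split).
bag = RandomForestRegressor(n_estimators=500, max_features=None, oob_score=True,
                            random_state=0).fit(X, y)

print(rf.oob_score_, bag.oob_score_)   # out-of-bag R^2 as an estimate of test performance
```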

k-nearest neighbors (KNN)

1. let the observation having inputs x₁,...,xp be the center of the neighborhood
2. starting from the center of the neighborhood, identify the k nearest training observations
3. for classification, y^ is the most frequent category among the k training observations; for regression, y^ is the average of the response among the k training observations
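
A minimal NumPy sketch of these three steps; the training data and the query point are hypothetical:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k, classify=False):
    # steps 1-2: find the k training observations closest (euclidean distance) to x_new
    dist = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    # step 3: majority vote for classification, average for regression
    if classify:
        values, counts = np.unique(y_train[nearest], return_counts=True)
        return values[np.argmax(counts)]
    return y_train[nearest].mean()

X_train = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
y_train = np.array([10.0, 12.0, 30.0, 34.0])
print(knn_predict(X_train, y_train, np.array([1.5, 1.0]), k=2))   # average of 10 and 12
```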

k-means clustering algorithm

1. randomly assign a cluster to each observation. this serves as the initial cluster assignment
2. calculate the centroid (average) of each cluster
3. for each observation, identify the closest centroid and reassign the observation to that cluster
4. repeat steps 2 and 3 until the cluster assignments stop changing
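
A bare-bones sketch of this algorithm on toy data (it assumes no cluster becomes empty along the way, which is fine for this well-separated example):

```python
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    clusters = rng.integers(0, k, size=len(X))        # 1. random initial assignment
    while True:
        # 2. compute the centroid of each cluster (assumes every cluster is non-empty)
        centroids = np.array([X[clusters == c].mean(axis=0) for c in range(k)])
        # 3. reassign each observation to the closest centroid
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_clusters = dist.argmin(axis=1)
        # 4. stop when the assignments no longer change
        if np.array_equal(new_clusters, clusters):
            return clusters, centroids
        clusters = new_clusters

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, cents = k_means(X, k=2)
print(cents)
```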

hierarchical clustering algorithm

1. select the dissimilarity measure and linkage to be used. treat each observation as its own cluster
2. for k = n, n-1, ..., 2:
- compute the inter-cluster dissimilarity between all k clusters
- examine all kC2 pairwise dissimilarities. the two clusters with the lowest inter-cluster dissimilarity are fused. the dissimilarity indicates the height in the dendrogram at which these two clusters join
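
A short SciPy sketch of the same procedure; the data, the euclidean dissimilarity, and complete linkage are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

d = pdist(X, metric="euclidean")      # 1. choose the dissimilarity measure
Z = linkage(d, method="complete")     # 2. fuse the two least-dissimilar clusters at each step
                                      #    (method could also be "single", "average", "centroid", ...)
print(Z[-1, 2])                                  # height of the final fusion in the dendrogram
print(fcluster(Z, t=2, criterion="maxclust"))    # cut the dendrogram to obtain 2 clusters
```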

slow learning, using gradient descent

1. start with an initial estimate θ⁰ for θ and set t = 0
2. iterate until the objective R(θ) fails to decrease:
a) set θ^(t+1) ← θ^t - ρ*∇R(θ^t), where ρ is the learning rate
b) set t ← t + 1
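
A minimal sketch of this update rule on a toy objective (the objective, learning rate, and tolerance are illustrative):

```python
import numpy as np

def gradient_descent(grad_R, R, theta0, rho=0.1, tol=1e-8, max_iter=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        new_theta = theta - rho * grad_R(theta)   # theta^(t+1) <- theta^t - rho * grad R(theta^t)
        if R(new_theta) >= R(theta) - tol:        # stop once the objective fails to decrease
            return theta
        theta = new_theta
    return theta

# Toy objective R(theta) = (theta1 - 3)^2 + (theta2 + 1)^2, minimized at (3, -1).
R = lambda th: (th[0] - 3) ** 2 + (th[1] + 1) ** 2
grad_R = lambda th: np.array([2 * (th[0] - 3), 2 * (th[1] + 1)])
print(gradient_descent(grad_R, R, theta0=[0.0, 0.0]))
```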

Given the following three statements about tree-based methods for regression and classification: I. The main difference between bagging and random forests is the number of predictors considered at each split when building trees. II. Single decision tree models generally have higher variance than random forest models. III. Random forests provide an improvement over bagging because trees in a random forest are less correlated than those in bagged trees. Determine which of the statements I, II, and III are true.

ALL ARE TRUE Bagging is a special case of random forests. At each split, random forests consider a random subset of predictors. Random forest models average the result of each individual tree to obtain a single prediction for each observation. The action of averaging reduces the variance. Random forests consider a random subset of predictors at each split to decorrelate trees.

when the model is able to separate the negative observations from the positive observations, and the negative observations all have a lower probability than the positive observations, what is the AUROC?

AUROC = 1

hidden layers

first hidden layer: Ak = g(wk,₀ + ∑ wk,j*xj)
parameters: (p+1)*K
hidden layer of a recurrent network (sequence step l): Al,k = g(wk,₀ + ∑ wk,j*xl,j + ∑ uk,s*Al-₁,s)
parameters: (p+1)*K for the W weights and K*K for the U weights

A modeler creates a regression tree model using recursive binary splitting. Looking at the results, the decision tree appears to be too large, which causes overfitting. The modeler would like to adjust the model to make it more interpretable. Determine which of the following actions does NOT help solve the interpretability issue. A. Apply cost complexity pruning to the large tree, and choose the best subtree using cross-validation B. Increase the minimum number of observations required in the terminal nodes C. Decrease the number of splits allowed in the model D. Split tree nodes only if the decrease in the residual sum of squares of that split exceeds a threshold E. Apply bagging in constructing the decision tree model

Bagging does not make the model more interpretable; in fact, it makes it less interpretable. All the other answer choices result in a smaller tree, which make the model more interpretable.

Determine which of the following statements about decision trees are true. A. They generally have better predictive accuracy compared to other statistical methods. B. Like in linear regression, categorical variables should be handled using dummy variables. C. Pruning helps to reduce variance and leads to a smoother fit. D. Constructing a tree using a subset of a data set could produce a completely different tree from one constructed using the entire data set. E. Decision trees are better at mimicking human decision-making than linear regression.

C, D, and E are true A is false because decision trees typically have poorer predictive accuracy compared to other statistical methods such as bagging and boosting. B is false because dummy variables are not necessary since every possible grouping of the classes into two groups is considered when splitting.

cross entropy

Dm=-∑ p^m,c*ln(p^m,c)

classification error rate

Em = 1 - max over c of p^m,c
the proportion of training observations in region m that do not belong to the most common class

true or false: a tree with more splits tends to have lower variance

FALSE a tree with more splits tends to have higher variance

gini index

Gm = ∑ p^m,c*(1 - p^m,c), summed over the classes c
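
A small sketch computing the three node impurity measures (classification error, Gini index, cross entropy) for a single node's class proportions; the proportions are illustrative:

```python
import numpy as np

def node_impurities(p_hat):
    """p_hat: estimated class proportions p^_{m,c} in one node/region m."""
    p_hat = np.asarray(p_hat, dtype=float)
    error = 1 - p_hat.max()                     # classification error rate E_m
    gini = np.sum(p_hat * (1 - p_hat))          # Gini index G_m
    entropy = -np.sum(p_hat[p_hat > 0] * np.log(p_hat[p_hat > 0]))   # cross entropy D_m
    return error, gini, entropy

print(node_impurities([0.8, 0.2]))   # a fairly pure node: all three measures are small
print(node_impurities([0.5, 0.5]))   # a maximally impure node for two classes
```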

determine which of the following activation functions are nonlinear
i. g(z) = z
ii. g(z) = 1/(1+e⁻ᶻ)
iii. g(z) = z₊

I is a linear activation function. It is known as the identity activation function. II is a nonlinear activation function. It is the sigmoid activation function. III is a nonlinear activation function. It is the ReLU (rectified linear unit) activation function.

