SOA PA Exam 2


What is one drawback for GLM Models and one drawback from Decision Trees?

- Limited ability to extract complex relationships from the data (GLM) - Sensitivity to noise and a tendency to overfit (Decision trees)

Define some limitations of Feature Selection using Regularization

- Performing feature selection as part of the optimization of a model is an automatic method and may not always result in the most interpretable model - It is a model-dependent technique, so the features selected as part of linear regression using the lasso method have been optimized for linear regression and may not necessarily translate to other model forms. (We can perform feature selection using regularization, but due to the complexity of some objective functions this can be suboptimal.)

Define Pruning a Tree

- Reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances. - Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

When should you remove a column (variable) in a data set?

- if data is >50% missing then remove column - If there are an overwhelming amount of Factor levels (50 for example) - if one factor level has the overwhelming amount of observations (and other levels are not credible)

What are three main uses of PCA?

Feature Transformation: the new principal components are linear combinations of the original variables, which should have a unifying theme and usually are correlated. Feature Extraction: new variables, the principal components, are produced. Feature Selection: hopefully, and in most cases, fewer variables are needed to capture most of the information.

Define forward Selection

Involves starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.

Define Error Rate

(FP+FN)/N = 1 - Accuracy

Define accuracy and impact if data is unbalanced

(TP+TN)/N - Accuracy is a weighted average of specificity and sensitivity, weighted by the amount of data in each class - If classes are unbalanced, this can skew accuracy

Define confusion matrix

- A convenient summary of the model predictions from which several performance measures are derived
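
A minimal base-R sketch of how the rates on these cards fall out of a confusion matrix; the vectors actual and pred_prob are hypothetical stand-ins for observed 0/1 outcomes and model-predicted probabilities:

# Hypothetical actual 0/1 outcomes and predicted probabilities
actual    <- c(0, 0, 1, 1, 0, 1, 0, 1)
pred_prob <- c(0.2, 0.6, 0.8, 0.4, 0.1, 0.9, 0.3, 0.7)
pred_class <- ifelse(pred_prob > 0.5, 1, 0)              # apply a 0.5 cutoff
conf <- table(Predicted = pred_class, Actual = actual)   # confusion matrix
TP <- conf["1", "1"]; TN <- conf["0", "0"]; FP <- conf["1", "0"]; FN <- conf["0", "1"]
N  <- sum(conf)
(TP + TN) / N      # accuracy
(FP + FN) / N      # error rate = 1 - accuracy
TP / (TP + FN)     # sensitivity (TPR)
TN / (TN + FP)     # specificity (TNR)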

What to examine when assessing the bivariate relationship between a Continuous predictor variable and a binary target variable?

- A graph with separate histograms for a continuous variable, one for those with target binary = 0 and one for those with binary = 1; - Box plots summarized based on binary target; - Tables summarizing the mean, median, and count of the predictor based on each binary target
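
A short ggplot2/dplyr sketch of these three checks, using the built-in mtcars data as a stand-in (mpg as the continuous predictor, am as the binary target); the variables in an exam data set would differ:

library(ggplot2)
library(dplyr)
ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 15) + facet_wrap(~ am)   # separate histograms by target level
ggplot(mtcars, aes(x = factor(am), y = mpg)) + geom_boxplot()                 # box plots split by the binary target
mtcars %>% group_by(am) %>% summarise(mean = mean(mpg), median = median(mpg), n = n())   # summary table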

Define Stratified Random Sampling (Stratification)

- A method of sampling that involves the division of a population into smaller sub-groups known as strata. - the strata are formed based on members' shared attributes or characteristics such as income or educational attainment. (dividing into homogeneous groups called strata)

Define Partial dependence plots

- A method that visualizes the structure of the model's dependence on a feature or pair of features - Allows us to get an understanding of the relationship between the features and the target - Used in random forests and boosting

What are the benefits of Ensemble methods

- Increases the ability to reflect complex relationships - Reduces the variance in our model's output by taking the average over all of the component models - Most noise introduced by fitting to a specific subset of data will be canceled out because the results of all component models are combined

Define Principal Component Analysis

- An unsupervised learning technique which linearly combines the initial variables in a data set to create new orthogonal principal components which then can be used to assess the correlation between the initial variables. - Layman's Terms: "Transforming a large number of possibly correlated variables into smaller, much more manageable sets of representative variables that capture most of the information in the original high-dimension data."

Define how Ensemble Methods of model building are built

- Build many models on random subsets of the data and take the answer in aggregate.

Define CP:

- Decision tree control parameter - Complexity Parameter - : if any split does not increase the overall R^2 of the model by at least cp (where R^2 is the usual linear-models definition) then that split is decreed to be, a priori, not worth pursuing - Higher value indicates less complex tree

Define maxdepth:

- Decision tree control parameter - The maximum depth of any node of the final tree, with the root node counted as depth 0.

Define minbucket

- Decision tree control parameter - The minimum number of observations in any terminal node - (minimum bucket size)

Define minsplit:

- Decision tree control parameter - minimum # of observations that must exist in a node for a split to be attempted. - A good choice for this value is <= 5% of the total number of rows in data (larger for more interpretable models) - (.05*nrow(data))
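
A minimal rpart sketch tying the four control parameters above together; the values shown are illustrative only, and the built-in iris data stands in for an exam data set:

library(rpart)
library(rpart.plot)
tree_fit <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(cp = 0.01,
                                          maxdepth = 4,
                                          minbucket = 5,
                                          minsplit = ceiling(0.05 * nrow(iris))))
rpart.plot(tree_fit)
tree_fit$cptable   # cross-validated error at each cp, used when pruning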

What underlying assumptions are made about the error of a ols model?

- Errors are independent - Errors have a mean of zero - Errors have a constant variance - Errors have a gaussian (normal) distribution

Define Decision Tree

- It is a non-parametric supervised learning method used for classification and regression. - The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. - A tree can be seen as a model with a simple, easy-to-interpret structure based on a set of if/then rules that clearly highlight key factors and interactions (a piecewise constant approximation)

Define Information Gain (in Decision Tree Analysis)

- It is the reduction in entropy or surprise by transforming a dataset and is often used in training decision trees. - Information gain is calculated by comparing the entropy of the dataset before and after a transformation.

What is an advantage of Stratification from simple random sampling?

- Stratified random sampling allows researchers to obtain a sample population that best represents the entire population being studied. - Stratified random sampling differs from simple random sampling, which involves the random selection of data from an entire population, so each possible sample is equally likely to occur.

What are Key assumptions of OLS Regression?

- The conditional distribution of the response variable is normal. - It is symmetric about the mean, continuous, and can assume all positive and negative values.

Define Residuals vs Leverage graph

- This graph shows us where the influential data points are. - Influential data points are points with high leverage and high residual. These will have high cook's distance and will be above the red dotted "0.5" line - (Leverage is a measurement of a data point's distance from the mean of the x values and residuals measure distance from predicted value) - Graphs Standardized Residuals vs Leverage

Define Regularization Regression

- This is a form of regression, that constrains/ regularizes or shrinks the coefficient estimates towards zero. - In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. - The reduction in the number of dimensions can be only indirectly chosen via the lambda hyperparameter but is also applicable to categorical variables after binarization.

Describe advantages of LASSO regression from forward or backward selection.

- Through cross-validation it selects the best hyperparameter using the same criterion (RMSE) that will ultimately be used to judge the model against unseen data. - Regularization methods require binarization of categorical variables, so unlike stepAIC, which treats all factor levels of one variable as a single object to remove or retain in the model, the LASSO removes individual factor levels if they are not significant with respect to the base level.

When would someone prefer to use hierarchical clustering vs k-means clustering?

- Typically, one would use hierarchical clustering to better understand the data when we expect there to be hierarchical structure. - Hierarchical clustering ends up with a 'dendrogram', or a sort of connectivity plot. You can use that plot to decide after the fact how many clusters your data has, by cutting the dendrogram at different heights. - With k-means clustering, you need to have a sense ahead of time what your desired number of clusters is (this is the 'k' value), so k-means is preferable if the business problem makes this clear. - Hierarchical clustering can be more computationally expensive but usually produces more intuitive results.

Why should factor level reduction be considered?

- Variables with a large number of dimensions can dilute predictive power (Extremely large indicators may be dropped from the analysis if not specified) - For interpretability - Low Exposure within a factor level may also be an issue as a result of the potential for noise to overwhelm the signal

Define AIC

- Akaike Information Criterion; an estimate of the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection. - AIC = 2*k - 2*ln(L) [Lower = better]; under this criterion, adding a variable is justified only if it increases the loglikelihood by more than one per parameter added. - Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. - The penalty discourages overfitting, which is desired because increasing the number of parameters in the model almost always improves the goodness of the fit.
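
A small illustration of AIC-based comparison and stepwise selection on the built-in mtcars data; the model formulas are arbitrary examples:

mod_small <- lm(mpg ~ wt, data = mtcars)
mod_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)
AIC(mod_small, mod_large)          # lower AIC is preferred

library(MASS)
step_fit <- stepAIC(mod_large, direction = "backward", trace = FALSE)   # automated selection by AIC
summary(step_fit)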

Define ROC Curve

- Receiver Operating Characteristic Curve - A plot of TPR (Sensitivity) vs FPR (1-Specificity) - The ROC curve shows the sensitivity and FPR at various probability cutoffs. - Curves that bend to the upper left of the square represent greater accuracy, and hence the area under the curve (AUC) is an overall measure of accuracy.
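
A sketch using the pROC package (a package choice assumed here, not prescribed) on a simple logistic model fit to built-in data:

library(pROC)
fit   <- glm(am ~ mpg + wt, data = mtcars, family = binomial(link = "logit"))
probs <- predict(fit, type = "response")
roc_obj <- roc(response = mtcars$am, predictor = probs)
plot(roc_obj)   # TPR vs FPR across probability cutoffs
auc(roc_obj)    # area under the curve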

After finding the optimal cp value in a decision tree, what are options to move forward with the model?

1. Build a new tree from scratch using the optimal complexity parameter value 2. Use the overly complex tree we just built and prune back (remove) any splits that do not satisfy the impurity reduction required by the new cp value.

List three impurity measures and compare each of them

1. Entropy 2. Gini 3. Classification Error - Entropy and Gini are more sensitive to changes in the node probabilities than the classification error, so either Gini or entropy should be used when growing a tree. - Studies have shown that the choice of impurity measure has little effect on the performance of a decision tree because all three are consistent with each other.

After Building a GLM Regression Model, what are some ways (steps) to validate the model?

1. Evaluate the RMSE against an OLS model or other candidate models to determine the best or most interpretable model 2. Interpret diagnostic plots to assess the model assumptions: the Residuals vs Fitted graph, the Normal Q-Q plot, the Scale-Location graph, and the Residuals vs Leverage graph (Note: be aware if the question asks you to re-run the model on the full data set) 3. Understand how to interpret the predictors, which predictors are boolean, and what the coefficients represent.
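
A minimal sketch of steps 1 and 2 for an OLS fit on built-in data (the formula is illustrative); plot.lm produces the four diagnostic graphs referenced above:

fit <- lm(mpg ~ wt + hp, data = mtcars)
plot(fit, which = 1)   # Residuals vs Fitted
plot(fit, which = 2)   # Normal Q-Q
plot(fit, which = 3)   # Scale-Location
plot(fit, which = 5)   # Residuals vs Leverage
preds <- predict(fit, newdata = mtcars)   # use holdout data in practice
sqrt(mean((mtcars$mpg - preds)^2))        # RMSE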

List 3 Commonly Used Link Functions

1. Identity: g(n) = n, g^(-1)(n) = n; for every unit increase in x_j the predicted target changes by b_j 2. Log: g(n) = log(n), g^(-1)(n) = exp(n); for every unit increase in x_j the predicted target is multiplied by e^(b_j) 3. Logit: g(n) = log[n/(1-n)], g^(-1)(n) = e^n / (1 + e^n); for every unit increase in x_j the predicted odds are multiplied by e^(b_j)
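
A quick illustration of the multiplicative interpretation under the log and logit links, using built-in data (the variable choices are arbitrary):

log_fit <- glm(mpg ~ wt, data = mtcars, family = Gamma(link = "log"))
exp(coef(log_fit))     # per unit of wt, the predicted target is multiplied by this factor

logit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
exp(coef(logit_fit))   # per unit of wt, the predicted odds are multiplied by this factor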

Define 4 Random Forest Parameters

1. Number of Trees 2. Proportion of Observations 3. Proportion of features 4. Decision Tree Parameters

Define undersampling and oversampling

1. Undersampling: keeps all instances of the minority class and samples from the majority class 2. Oversampling: keeps all instances of the majority class and samples with replacement from the minority class 3. A combination of the two - Resampling should be performed only on the training data, because resampled minority class observations that get duplicated could otherwise end up in both the train and test data - These introduce bias in the data to represent one class more or less than the others; both will cause a GLM to pick up the signal of the minority class more reliably
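
A base-R sketch of both schemes applied to a toy unbalanced training set (names and sizes are made up); note the resampling is done on the training data only:

set.seed(42)
train <- data.frame(x = rnorm(1000), target = rbinom(1000, 1, 0.10))   # roughly 10% minority class
minority <- train[train$target == 1, ]
majority <- train[train$target == 0, ]
under <- rbind(minority, majority[sample(nrow(majority), nrow(minority)), ])                   # undersampling
over  <- rbind(majority, minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])   # oversampling
table(under$target); table(over$target)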

Define the 4 Control Parameters in a decision tree.

1. cp 2. minbucket 3. maxdepth 4. minsplit

Define a Balanced binary tree

A binary tree in which the left and right subtrees of any node differ in depth by at most one.

Define an unbalanced binary tree

A binary tree in which the left and right subtrees of some nodes differ in depth by more than one.

Define a Loss Function

A function of two variables, the prediction and a single observed new value, that measures the error. (e.g. the squared error loss function: [y_new - g(x_new, B_hat)]^2)

Define Feature Importance

A method that analyzes the structure of the model and ranks the contribution of each feature

What to examine when assessing the bivariate relationship between a Factor predictor variable and a binary target variable?

A table (with rows as factor levels) to assess the mean probabilities, the count of observations in each factor level, and the count of each binary target value within each level.

Define Pruning

A technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little predictive power. Pruning reduces the complexity of the final model and, hence, improves predictive accuracy by reducing overfitting.

Define and provide examples of Supervised learning

A type of machine learning task of learning a function that maps an input to an output based on example input-output pairs. Includes a target variable. Examples: Generalized Linear Models (GLM), regularization (lasso & ridge), decision trees.

What are some Advantages and Disadvantages of Cross Validation?

Advantages: - Reduces Overfitting - Hyperparameter Tuning (helps in finding the optimal value of hyperparameters) Disadvantages: - Needs Expensive Computation: it can be computationally very expensive (k times as much computation to make an evaluation)

Define some advantages and disadvantages of bagging

Advantages: - they reduce the expected loss of models by reducing variance without affecting the bias - They retain all the advantages of the decision tree model, (e.g the ability to handle categorical and numerical variables as well as missing values) Disadvantages: - We lose interpretability

Discuss the advantages and disadvantages of using binarization on the factor variables to allow dropping individual levels

Advantages: - Allows the stepAIC procedure to drop individual factor levels if they are not significant compared to the base level. Disadvantages: - There will be more steps in the stepAIC procedure. - Nonsensical results can be obtained if only a handful of factor levels are chosen (e.g. if only one factor level is significant out of 20).

Define and describe the two types of Hierarchical Clustering

Agglomerative: starts by considering each observation as its own cluster, then gradually grouping them with nearby clusters at each stage until only one cluster is left (bottom-up) Divisive: starts by considering all observations as a single cluster and then progressively splitting it into subclusters recursively (top-down)
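
A minimal agglomerative example with base R's hclust on scaled built-in data; the linkage method and the cut at k = 3 are arbitrary choices:

scaled <- scale(mtcars)
d  <- dist(scaled, method = "euclidean")   # n x n distance matrix
hc <- hclust(d, method = "complete")       # bottom-up (agglomerative) clustering
plot(hc)                                   # dendrogram; cut at different heights to choose k after the fact
clusters <- cutree(hc, k = 3)
table(clusters)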

Define Hierarchical Clustering

A technique that allows you to build tree structures from data similarities. You can use it to see how different sub-clusters relate to each other and how far apart data points are.

What to examine when assessing (univariate analysis) a Factor predictor variable?

Assess Bar chart. (Count of observations per factor level)

What to examine when assessing (univariate analysis) a Continuous predictor variable?

Assess the histogram of the distribution. Check the skewness (does it need to have a log transformation). - Check for extreme (unreasonable) outliers - Check for obvious errors in data - Check for obvious duplicates

Define BIC

Bayesian information criterion; a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. = k * ln(n) - 2 * ln(L), where n is the number of observations and k the number of parameters. - It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.

What is your Candidate ID Number?

Better just to memorize this

What type of data to use a Logit transformation?

Binary (boolean) Target variable

What to examine when assessing the bivariate relationship between a Factor predictor variable and a Continuous target variable?

Box Plots and tables summarizing the mean, median, and count of the target based on each factor

Define Normal QQ Plot

Checks for the normality of the distribution by graphing the standardized residuals vs the theoretical quantiles of the normal distribution.

Define the Alpha parameter in Regularization

Controls the weighted distribution between the L1 (LASSO) and L2 (ridge regression) penalties in elastic net regression.

Select a distribution and link function for Binary Target (GLM)

Distribution: binomial Link: Logit - Note: probit, cauchit, and cloglog are also acceptable selections; however, the logit function is the canonical link function and therefore more likely to converge.

Define False Positive Rate

FPR = 1 - Specificity = FP/(FP+TN) - The proportion of False positives among all Negative cases

How would one justify using forward selection vs backward selection, or vice versa?

Forward Selection is more conservative, so if you want a simpler (more easily explainable) model, choose forward. Though it has its pitfalls (adding additional terms to the model may make previously added terms no longer significant).

When should you remove rows with missing data?

Generally, if <5% of the data missing then remove Rows

Select a distribution and link function for Continuous Positive Target (GLM)

In many cases it is sufficient to do a log transformation of the variable. The following distributions work: - gamma, - lognormal - inverse gaussian Additionally, generally use a log link function to ensure positive predictions.

What data questions should be considered while reading the project statement?

Is the project statement more interested in interpretable models or more accurate complicated models? What type of variable is the target variable? What type of variable are the predictor variables? Are there any outliers that need to be removed? Are there any Factor variables that could be combined?

Why do we use Principal Component Analysis?

It is commonly used for dimensionality reduction by projecting each data point onto only the first few PCs. - This substitution, specifically the score produced by the linear combination of original variables that each component represents, can reduce dimensionality and improve the predictive power of the resulting model while preserving as much of the data's variation as possible - Uses: Feature Transformation, Feature Extraction, Feature Selection - PCA does not reference and cannot include the target variable
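
A minimal prcomp sketch on a few numeric columns of built-in data (the column choices are arbitrary and the target is excluded):

pca <- prcomp(mtcars[, c("disp", "hp", "wt", "qsec")], center = TRUE, scale. = TRUE)
summary(pca)     # proportion of variance explained by each principal component
pca$rotation     # loadings: each PC is a linear combination of the original variables
head(pca$x)      # scores: the new features that can replace the original columns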

Define the Lambda parameter in Regularization

It represents the extent to which you want your model to prefer simplicity. When lambda = 0 there is no penalty added (and all parameters are free to contribute); as lambda approaches infinity the penalty is so great that all parameters must be 0.
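
A glmnet sketch showing how lambda is chosen by cross-validation and how alpha switches between ridge, LASSO, and elastic net; built-in data stands in for an exam data set:

library(glmnet)
X <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # binarized design matrix, intercept column dropped
y <- mtcars$mpg
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 5)  # alpha = 1 LASSO, 0 ridge, in between elastic net
cv_fit$lambda.min                                 # lambda minimizing cross-validated error
coef(cv_fit, s = "lambda.min")                    # some coefficients shrunk exactly to zero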

What are some advantages and disadvantages of Hierarchical Clustering compared to K-Means Clustering?

K-Means Clustering Advantages: - Specialized to clusters of different sizes and shapes. - Convergence is guaranteed. K-Means Clustering Disadvantages: - K-Value is difficult to predict - Doesn't work well with global cluster. Hierarchical clustering Advantages: - Ease of handling of any forms of similarity or distance. - Consequently, applicability to any attributes types. Hierarchical clustering Disadvantages: - Requires the computation and storage of an n×n distance matrix. For very large datasets, this can be expensive and slow

Define Lasso Regression

Least Absolute Shrinkage and Selection Operator (L1; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. Assesses the fit of the coefficients in predicting the target and the number of dimensions in the model. Note: the penalty is added to the SSE and is proportional to the sum of the absolute values of the estimated coefficients. All of the coefficients are reduced and some may be reduced to zero, effectively removing that variable.

Why would one choose the logit link function over the probit, cauchit, and cloglog functions?

Note: These all require a binary target variable, so it is essential that the prediction be in the zero to one range since we need to predict a probability. - The logit link is the canonical link, which results in faster processing and more likely convergence.

When would one choose AIC over BIC or vice versa?

BIC is often a more conservative model selection approach since it induces a higher penalization for models with an intricate parameterization than the AIC criterion does. When desiring a model with greater interpretability (a smaller one), consider BIC.

Define Ridge Regression

Ridge (L2) is a method of estimating the coefficients of multiple-regression models in scenarios where independent variables are highly correlated. Note; The penalty is added to the SSE and proportional to the sum of the squares of the estimated coefficients. All of the coefficients are reduced but none are reduced to zero. Hence, all variables are retained.

What type of data to use a log transformation?

Right Skewed (common with variables of Time, Distance, or Money which have a lower boundary of 0)

What to examine when assessing the bivariate relationship between a Continuous predictor variable and a Continuous target variable?

Scatter plots. Correlation between each variable [cor() in R].

Describe random forests and a boosted tree and explain the similarities and differences

Similarities: - Both random forests and boosted trees are ensemble methods, which means that rather than creating a single tree, many trees are produced. - Neither method is transparent, so they require variable importance measures to understand the extent to which each input variable is used. For bagged/RF regression trees, we record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor. Differences: - Random forests do not use the residuals to build successive trees, so there is less risk of overfitting as a result of building too many trees - Random forests will typically reduce the variance of predictions when compared to boosted trees, while boosted trees will result in a lower bias - The best boosted trees learn slowly (use lots of trees) and thus can take longer than a random forest to train.

Compare and contrast Ridge and Lasso Regression

Similarities: In both cases there is a hyperparameter to estimate that controls the extent of the reduction; this is normally selected through cross-validation. Differences: - Lasso regression not only helps in reducing overfitting but can also help us with feature selection - (LASSO provides an alternative to forward and backward selection for variable selection)

Define Specificity

TNR = TN/(TN+FP) - the proportion of true negatives among all negatives cases

Define Precision

TP/(TP+FP)

Define Sensitivity

TPR = TP/(TP+FN) - the proportion of true positives among all positive cases

Define Variance

The Expected Loss arising from the model being too complex and overfitting to specific instance of the data. - High variance means that the model won't be accurate because it overfit the data it was trained on and won't generalize well to new data

Define Bias

The Expected Loss arising from the model not being complex/flexible enough to capture the underlying signal. - High bias means the model won't be accurate because it doesn't have the capacity to capture the signal in the data

What Principal Component should I mostly reference? (AKA Which Principal Component explains the most variation?)

The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data.

How are Boosted Trees built?

These are built sequentially. A first tree is built in the usual way. Then another tree is fit to the residuals (errors) and is added to the first tree. This allows the second tree to focus on observations on which the first tree performed poorly. Additional trees are then added with cross-validation used to decide when to stop (and prevent overfitting).

How are Random Forests built?

These use bagging to produce many independent trees. When a split is to be considered, only a randomly selected subset of variables is considered. In addition, each tree is built from a bootstrap sample from the data. This approach is designed to reduce overfitting, which is likely when a single tree is used.
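
A sketch of both ensembles on built-in data using the randomForest and gbm packages (the package choices and parameter values are assumptions, not prescriptions):

library(randomForest)
library(gbm)
set.seed(123)
rf_fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500, mtry = 3, importance = TRUE)
importance(rf_fit)                 # variable importance for the bagged trees
gbm_fit <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 2, shrinkage = 0.01, cv.folds = 5)
gbm.perf(gbm_fit, method = "cv")   # cross-validation picks the number of boosted trees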

Define Cross Validation

This approach involves dividing the training data up into folds, training the model on all but one of the folds, and measuring the performance on the remaining fold.

Define Scale-Location graph

This graph assesses the homoscedasticity assumption based on whether the data points get wider along the fitted values. - It can also be used to assess the linearity assumption based on whether the mean value of the residuals (the red line) remains approximately at 0. - Plots sqrt(standardized residuals) vs Fitted Values

Define Residuals Versus Fitted Graph

This graph is used to check the homogeneity of the variance (heterogeneity is bad) and the linearity of the relationship (biased vs unbiased).

Define K-Means Clustering

This is a method of partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. (Recall this uses a Heuristic Algorithm so the user provides the number of "k" clusters)
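
A short kmeans sketch on scaled built-in data; k = 3 and the column choices are arbitrary:

set.seed(1)
scaled <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])
km <- kmeans(scaled, centers = 3, nstart = 20)   # user supplies k; nstart repeats the heuristic algorithm
km$centers                                       # cluster centroids
table(km$cluster)                                # cluster sizes
km$tot.withinss                                  # total within-cluster sum of squares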

Define Elastic Net Regression

This method linearly combines the L1 and L2 penalties of the lasso and ridge methods.

Define Gini

[Impurity Measure] A measure of how often a randomly chosen element from the [segment] would be incorrectly classified if it were randomly classified according to the distribution of targets in the subset = 1 - sum (p^2) - The Lower the value, the better the node is at correctly splitting the data

Define Entropy

[Impurity Measure] A measure of impurity of each node in a decision tree (classification only, although there are regression equivalents). - A measure of randomness; the higher the entropy the greater the disorder = - sum[ p_i ln( p_i) ]

Define Classification Error

[Impurity Measure] = 1- max(p) Either a false negative response or a false positive response

Define Backward Selection

Involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.

Define and provide examples of Unsupervised learning

A type of machine learning algorithm used to draw inferences from datasets without human intervention. Does not include a target variable. Examples: Hierarchical clustering, k-means clustering, Principal Component Analysis (PCA).

R-Code; Remove all observations in entire data set of a variable greater than or equal to 50

data <- data[data$variable < 50, ]   # keep only rows where variable < 50

R-Code; Remove all observations of a factor variable = "value"

data2 <- data1[data1$variable != "value", ]

R-Code; remove a variable from the dataframe.

df$variable <- NULL

R-Code; Histogram Continuous Variable

ggplot(df, aes(x = variable)) + geom_histogram(bins = 30) + labs(x = "variable")

R-Code; Plot a Decision tree and choose optimal CP

rpart.plot(tree1); plotcp(tree1); tree1$cptable   # pick the cp with the lowest cross-validated error

R-Code: RMSE

sqrt(sum((test$Y_Variable - preds)^2)/nrow(test))   # preds = vector of predictions on the test set

R-Code; Combine factor levels into new factors.

library(plyr); var.levels <- levels(df$variable); df$variable_comb <- mapvalues(df$variable, var.levels, c("Group12", ... , "GroupNA"))

