SOA Exam PA

AIC and BIC formulas in a GLM sense

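AIC = D + 2(p+1) and BIC = D + ln(n)(p+1), where D is the deviance, p is the number of predictors and n is the number of training observations. This works because the deviance equals -2 * (maximized log-likelihood) plus a term from the saturated model that does not depend on the fitted model, so these expressions rank models in the same order as the usual AIC and BIC.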

Hyperparameters

Also known as tuning parameters: parameters that control some aspect of the fitting process itself. We specify them ourselves before fitting rather than estimating them from the data, and they shape the model that is produced.

Define Regularization (aka penalization and shrinkage)

An alternative to stepwise selection for reducing the complexity of a linear model. We consider a single model hosting every potentially useful feature, but instead of removing a feature entirely, we shrink the coefficient estimates towards zero, which reduces their variance and impact. In some cases the regularization is strong enough that a coefficient estimate becomes exactly zero.

R command for creating decision trees

rpart() (recursive partitioning). Key arguments: method ("anova" for a regression tree, "class" for a classification tree); control, set through rpart.control(), which takes minsplit (minimum number of observations a node must have before a split is attempted), minbucket (minimum number of observations in any terminal node, i.e. AFTER a split), cp (complexity parameter), maxdepth (maximum depth of the tree) and xval (number of folds used in cross-validation to assess the fit of the tree); parms, which sets the node impurity measure used for splits through split = "gini" (Gini index) or split = "information" (entropy).

What is arguably the most important parameter in random forests?

"mtry", which specifies the number of features considered at each split. As mtry increases, the variance reduction is not as great because the trees become more similar to one another. If mtry is too small, each split does not have enough freedom in choosing the split variable.

Total Sum of Squares

Sum over all observations of (observed value - mean value)^2. It measures the goodness of fit of the intercept-only model (it is that model's residual sum of squares).

Classification Error rate in confusion matrix

(false negative + false positive)/n

Best Subset Selection

Fit a separate linear model for each possible combination of the available features and select the subset whose model fares best according to a pre-specified criterion (AIC, BIC, etc.). Requires us to analyze 2^p models, where p is the number of features, since each feature is either in the model or not.
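
As a rough illustration of the 2^p count, the sketch below enumerates every feature subset with Python's standard itertools module. The feature names are hypothetical and no model is actually fitted; it only shows how quickly the number of candidate models grows.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Return every subset of the features, including the empty (intercept-only) one."""
    subsets = []
    for size in range(len(features) + 1):
        subsets.extend(combinations(features, size))
    return subsets

# Hypothetical feature names, used only for illustration.
features = ["age", "income", "region"]
subsets = all_feature_subsets(features)
print(len(subsets))  # 2^3 candidate models
```

With 20 features the same enumeration would already produce more than a million candidate models, which is why stepwise selection is often used instead.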

Define a cutoff in binary classifiers

It is used to translate the predicted probabilities of a binary classifier into predicted classes. If the predicted probability of the event of interest is higher than the cutoff, the event is predicted to occur. If the predicted probability of the event of interest is lower than the cutoff, the event is predicted not to occur.

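How do you improve the chance of arriving at a global optimum for k-means clustering?

Run the k-means algorithm many times with different random initial cluster assignments and choose the best result (the one with the lowest total within-cluster variation). This makes finding the global optimum more likely but does not guarantee it.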

Define sampling (and random/stratified sampling/Systematic Sampling)

Sampling is the process of taking a subset of observations from the data source to generate the dataset. Random sampling: Randomly draw observations from the underlying population without replacement until we have our desired number of observations; each observation is equally likely to be chosen (an example is a survey or questionnaire). Stratified sampling: Divide the underlying population into a number of non-overlapping "strata" or groups in a non-random fashion, then randomly sample a set number of observations from each stratum. Pro: ensures every stratum is properly represented in the collected data. Systematic sampling: Draw observations according to a set pattern, with no randomness (e.g. arrange according to height and pick every fourth observation, pick every worker from Iowa, etc.).

Linear Model Equation

Target variable = B0 + B1X1 + B2X2 + ... + error term

Scree plot

Plots the proportion of variance explained by each principal component. It gives a simple visual inspection method for selecting the number of PCs to use: look for the point where the proportion of variance explained drops off sharply and the remaining PCs explain very little. That point is known as the elbow. When choosing based on a scree plot, eyeball the elbow and justify the decision.

Define Terminal Node

Terminal nodes (leaves): Constitute an end point, they are not split further

F-test

Tests the joint significance of the entire set of predictors, excluding the intercept. The null hypothesis is that none of the predictors is valuable (all slope coefficients are zero). The test is joint only: it does not say which variable is important, only that at least one of them is.

Methods for Performing Feature Selection

1. Best Subset Selection 2. Backward Stepwise Selection 3. Forward Stepwise Selection

Data Quality Issues to Check for

1. Check for reasonableness (does an observation make sense?) 2. Check for consistency (make sure all variables are formatted and designed consistently throughout, don't change in the middle of a study) 3. Also make sure the dataset is sufficiently documented so other users can understand it (include description of the dataset overall, each variable, any updates or irregularities, statement of accountability for the correctness, and description of governance processes)

Steps in the Model Building Process

1. Define the Business Problem 2. Data Collection 3. Exploratory Data Analysis 4. Model Construction and Evaluation 5. Model Validation 6. Model Maintenance

Solutions to deal with collinearity

1. Delete one of the problematic predictors 2. Pre-process the data using dimension reduction techniques

Can PCA be applied to categorical variables?

Not directly. A categorical variable must first be binarized (converted to dummy variables); PCA can then be applied to the dummy variables.

Partial dependence plots

Variable importance plots only tell us which variables are important, not the direction of their effect. Partial dependence plots show the association between a given variable and the model prediction after averaging over the values or levels of the other variables.

Variables vs. Features

Variable: A raw measurement that is recorded and constitutes the original dataset before any transformations are applied Features: Derivations from the original variables that provide an alternative, more useful view of the information in a dataset. Features serve as final inputs into a predictive model

Interpretation of GLM coefficients using log link

(If numeric predictor): When all other variables are held fixed, a unit increase in X is associated with a multiplicative change in the target mean by a factor of e^b. (If dummy variable): The target mean when the categorical predictor is at the non-baseline level is e^b times the target mean at the baseline level, holding all other predictors fixed. This is because the target mean is e^(linear predictor), which factors into a product of e^(each individual term).

Accuracy in confusion matrix

(true negative + true positive)/n

Akaike Information Criterion (AIC)

-2 * maximized log-likelihood + 2(p + 1), where p is the number of predictors (the +1 counts the intercept). The first term measures goodness of fit; the second term penalizes complexity. A smaller AIC means a better model: it fits reasonably well without overfitting.

Bayesian Information Criterion (BIC)

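-2 * maximized log-likelihood + ln(n)(p + 1), where p is the number of predictors and n is the number of observations. Lower values are better: it rewards a high likelihood but penalizes complexity. BIC is more conservative than AIC because its penalty per parameter, ln(n), exceeds AIC's penalty of 2 whenever n is at least 8.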
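
A minimal sketch of the two criteria, following the formulas on the AIC and BIC cards. The log-likelihood, p and n below are made-up numbers for illustration, not output from a real model.

```python
import math

def aic(log_likelihood, p):
    """AIC for a model with p predictors plus an intercept."""
    return -2.0 * log_likelihood + 2.0 * (p + 1)

def bic(log_likelihood, p, n):
    """BIC for a model with p predictors plus an intercept, fitted on n observations."""
    return -2.0 * log_likelihood + math.log(n) * (p + 1)

# Made-up values for illustration only.
loglik, p, n = -120.0, 3, 100
print(aic(loglik, p))      # 248.0
print(bic(loglik, p, n))   # larger than the AIC, since ln(100) > 2
```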

Three main categories of predictive modeling problems

1. Descriptive - Focuses on what happened in the past and aims to "describe" or explain the observed trends by identifying the relationships between different variables 2. Predictive - Focuses on what will happen in the future and is concerned with making accurate predictions 3. Prescriptive - Uses a combination of optimization and simulation to investigate and quantify the impact of different prescribed actions in different scenarios (if we increase premium, how will this affect the lapse rate?)

Regularization trades off two desirable characteristics of coefficient estimates:

1. Goodness of fit to training data (we want our model to effectively model the training data) 2. Model Complexity (Larger the estimates, the heavier the penalty because we want to reduce the variance)

Steps for interpreting GLM coefficients in the exam

1. Interpret the precise value of the estimated coefficients 2. Comment on whether the sign of the estimated coefficients conforms to intuition 3. Relate the findings to the big picture and how they can help the client make a better decision

Why is unsupervised learning often more challenging than supervised learning?

1. Objectives are not as clear, since there isn't a simple goal like prediction 2. It's hard to assess results, since we can't just test predictive power, nothing is supervising our work so we can't really test it

What are the three ways in linear regression to use nonlinear regressors?

1. Polynomial Regression 2. Binning - Using piecewise constant functions 3. Piecewise Linear Functions

Process for Calculating Partial Correlation (between sales and newspaper)

1. Regress sales on everything other than newspaper 2. Regress newspaper on the other variables 3. Find the correlation between the two sets of residuals from steps 1 and 2. If the partial correlation is very low, this suggests that newspaper is virtually uncorrelated with sales after accounting for the effects of the other variables.
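
The three steps can be sketched in plain Python for the simplest case of a single control variable. The helper names and the data below are hypothetical; with several control variables the two regressions would be multiple regressions, which this sketch does not cover.

```python
def mean(xs):
    return sum(xs) / len(xs)

def residuals(y, x):
    """Residuals from a least-squares regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def partial_correlation(y, x, control):
    """Correlation between y and x after removing the linear effect of control."""
    return correlation(residuals(y, control), residuals(x, control))

# Made-up data for illustration only.
tv = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
newspaper = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
sales = [3.2, 4.9, 7.1, 9.0, 11.2, 12.8]
print(partial_correlation(sales, newspaper, tv))
```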

Options for penalty function in regularization

1. Ridge regression (sum of squares of slope coefficients) 2. Lasso (sum of absolute values of slope coefficients) 3. Elastic net: (1-alpha)(ridge penalty) + alpha(lasso penalty), where alpha is the mixing coefficient that determines the weights given to the ridge and lasso penalties (alpha = 0 is ridge, alpha = 1 is lasso)
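
The penalties can be written out directly. This is a sketch that follows the card's weighting exactly; some software scales the ridge term differently (for example by one half), so treat the elastic net function as illustrative rather than as any library's definition. The coefficient values are made up.

```python
def ridge_penalty(slopes):
    """Sum of squared slope coefficients (intercept excluded)."""
    return sum(b ** 2 for b in slopes)

def lasso_penalty(slopes):
    """Sum of absolute slope coefficients (intercept excluded)."""
    return sum(abs(b) for b in slopes)

def elastic_net_penalty(slopes, alpha):
    """Weighted mix of the two penalties: alpha = 0 is ridge, alpha = 1 is lasso."""
    return (1 - alpha) * ridge_penalty(slopes) + alpha * lasso_penalty(slopes)

# Made-up slope coefficients for illustration only.
slopes = [0.5, -2.0, 0.0]
print(ridge_penalty(slopes))             # 4.25
print(lasso_penalty(slopes))             # 2.5
print(elastic_net_penalty(slopes, 0.5))  # 3.375
```

In the fitting objective, the chosen penalty is multiplied by lambda and added to the goodness-of-fit term.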

What are the differences between backward and forward stepwise selection?

1. Which model to start with (forward starts with the intercept-only model, backward starts with the full model) 2. Whether variables are added (forward) or dropped (backward) in each step 3. Which one tends to produce a simpler model (forward, because its starting model is simpler). When applying forward or backward selection, remember that the models are nested from step to step: in forward selection, once predictor 2 has been added, every later model MUST also include predictor 2.

Formula for the deviance of the GLM

2*(log likelihood of saturated - log likelihood of our model)

What is a logistic regression model?

A GLM with a binary target variable that uses the logit link

Hierarchical Clustering

A clustering method but it does not require the choice of K in advance and the clusters can be displayed on a dendrogram, which is a tree-based visualization of the "hierarchy" of the clusters formed

Describe Cost-complexity pruning

A strategy for controlling tree complexity. We first grow a very large tree and then prune it back to get a smaller subtree. Goodness of fit is traded off against complexity by minimizing the objective function: relative training error + tree complexity penalty. The relative training error is RSS/TSS, and the penalty is (number of terminal nodes) * (complexity parameter). The complexity parameter is a hyperparameter that is typically tuned by cross-validation, which is very important.
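
The objective function can be sketched as follows. The candidate subtrees, their RSS values and the TSS are invented numbers; the point is only to show how a larger complexity parameter pushes the choice towards a smaller subtree.

```python
def cost_complexity(rss, tss, n_leaves, cp):
    """Relative training error plus the complexity penalty."""
    return rss / tss + cp * n_leaves

def best_subtree(candidates, tss, cp):
    """Pick the (n_leaves, rss) pair with the smallest penalized objective."""
    return min(candidates, key=lambda t: cost_complexity(t[1], tss, t[0], cp))

# Hypothetical nested subtrees as (number of terminal nodes, training RSS).
candidates = [(1, 100.0), (3, 60.0), (6, 45.0), (12, 40.0)]
tss = 100.0

for cp in (0.0, 0.01, 0.1, 0.5):
    n_leaves, rss = best_subtree(candidates, tss, cp)
    print(cp, n_leaves)
```

With cp = 0 the largest tree wins because there is no penalty; as cp grows, the selected subtree shrinks until only the root node remains.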

What is the end result of K-means clustering?

A local optimum, meaning that the algorithm can't reduce the within-cluster variation any further, but a different starting point could potentially lead to different results. The initial cluster assignments are random, and this impacts the results.

Define node

A point on a decision tree that corresponds to a subset of the training data

What is the interpretation of the regression coefficient on a numeric predictor? (B1 in y=B0+B1X1)

A unit increase in X is associated with an increase of B1 in Y on average, holding all other predictors fixed.

Interpretation of AUC = 1 and AUC = 0.5 for ROC curves

AUC = 1: classifier has perfect discriminatory power, as every observation is separated correctly AUC = 0.5: Classifies the observations randomly without using the information contained in the predictors, which is similar to an intercept-only GLM (represented by a 45 degree line in the ROC curve). Based on this, any reasonable classifier should have an AUC above 0.5, as it must be better than a random classification

Describe the relationship between accuracy, sensitivity, and specificity

Accuracy is a weighted average of specificity and sensitivity, where the weights are the proportions of observations belonging to the two classes
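
A small check of this identity, using made-up confusion matrix counts. The function and key names are illustrative only.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Common metrics from the four cells of a binary confusion matrix."""
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "sensitivity": tp / (tp + fn),   # share of actual positives caught
        "specificity": tn / (tn + fp),   # share of actual negatives caught
        "positive_share": (tp + fn) / n, # proportion of actual positives
    }

# Made-up counts for illustration only.
m = confusion_metrics(tp=30, fp=10, tn=50, fn=10)
weighted = (m["positive_share"] * m["sensitivity"]
            + (1 - m["positive_share"]) * m["specificity"])
print(m["accuracy"], weighted)  # the two values agree up to rounding
```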

Polynomial Regression Definition

Adding X^2, X^3, etc. to account for non-linear relationships in linear regression models

What is a dendrogram?

An upside-down tree in hierarchical clustering that shows the sequence of fusions and the inter-cluster dissimilarity when each fusion takes place. Clusters that are joined towards the bottom are similar to each other. The height of each fusion shows the inter-cluster dissimilarity.

Differences in Granularity and Dimensionality

Applicability: Dimensionality is specific to categorical variables, while granularity applies to both numerical and categorical variables. Comparability: Categorical variables can always be ordered by dimension (count the factor levels of each), but two variables can be ordered by granularity only when each value of one maps into a single value of the other (for example, actual age can always be grouped into age bands, so actual age is more granular).

Considerations informing the choice of the link function in GLM:

Appropriateness of predictions (the range of values of the target mean implied by the GLM should be consistent with the range of values of the target mean) Interpretability (can we easily say how much the target mean will change if a continuous predictor increases by 1 unit)

Properties of deviance residuals in GLM

Approximately normally distributed for most target distributions in the linear exponential family. No systematic patterns. Constant variance when standardized. Use standardized deviance residuals with Q-Q plots to assess the goodness of fit of a model!

How can recursive binary splitting be described? (it is the whole tree-growing process)

As a top-down, greedy algorithm. Top-down: we start at the top of the tree and work our way down. Greedy: at each split, we adopt the binary split that is currently the best without looking ahead to see what will be best in the long run.

What trade-off exists when we determine a cutoff in GLM?

As the cutoff increases, we classify more observations as negative, so sensitivity falls and specificity rises; lowering the cutoff does the opposite. We must therefore weigh the benefits of correct classifications against the costs of misclassifications when determining a cutoff.
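
The trade-off can be seen on a toy example. The predicted probabilities and actual classes below are invented, and the helper names are illustrative.

```python
def classify(probabilities, cutoff):
    """Predict class 1 when the predicted probability exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in probabilities]

def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up predicted probabilities and actual classes.
probs = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
actual = [0, 0, 1, 0, 1, 1]

for cutoff in (0.3, 0.5, 0.8):
    sens, spec = sensitivity_specificity(actual, classify(probs, cutoff))
    print(cutoff, round(sens, 3), round(spec, 3))
```

As the cutoff moves from 0.3 to 0.8 in this example, sensitivity falls from 1 to 1/3 while specificity rises from 1/3 to 1.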

Define K-Means Clustering

Assigns each observation in the dataset into one of K clusters, where the number of clusters is specified beforehand and the observations in each cluster will be relatively similar

Homoscedastic assumption

Assumes that in linear models the error term has a common (constant) variance. If the variance appears to change across observations, the errors are heteroscedastic (not homoscedastic).

How do average linkage and centroid linkage differ?

Average: we compute all pairwise distances between the two clusters, then average them. Centroid: we average the observations in each cluster first to get its center, then compute the distance between the two centers.

Balanced/unbalanced decision trees

In a balanced tree, the left and right subtrees coming out of any node differ in depth by at most one. A tree that violates this is unbalanced.

How do we handle categorical variables in a model?

Binarization: giving each factor level a dummy variable that equals 1 if the observation is at that level and 0 if it is not. Common practice is to make the level with the most observations the baseline level, since the coefficients and p-values of the other levels are measured relative to the baseline. Thus, we don't want a level that is sparse and unreliable to be the baseline. When eliminating levels using feature selection, it is important to have the right baseline level, since it can affect which levels we drop.

Define Binary Tree

Binary tree: Each node only has two children (this is what will be shown in PA)

Bootstrapping

A bootstrapped sample is obtained by randomly sampling the original training observations with replacement.

Describe the two ways to classify predictive analytic variables (by role and by nature)

By role: Their intended use. We are interested in predicting the target variable using the predictors (also called explanatory variables or independent variables). We want to filter out the noise (anything interfering with the explanatory variables) to find out as much as we can about the signal function (the function that helps us predict the target variable). By nature: Numeric (discrete or continuous) or categorical variables. Binary variables are categorical variables with only two possible levels (usually yes or no).

How to do the training/test split?

Can be done randomly according to pre-specified proportions, or with special statistical techniques that ensure the distributions of the target in the two sets are comparable. If the data are ordered by time, we could use the older data points for training and test on the newer data points.

Interpreting a biplot

Can interpret the PCs based on the direction and magnitude of the variable arrows (loadings) along each axis. Can interpret the observations based on the relative size of their PC scores. Can see which variables have their arrows close to each other to judge correlation.

Define centering in PCA

Centering shifts each variable to have a mean of zero. It does not affect the PC loadings, because the loadings are chosen to maximize variance and centering does not change variances.

How to choose the target distribution in a GLM?

Choose a member of the exponential family that best aligns with the characteristics of a given target variable

Classification measures of node impurity in decision trees

Classification error rate (easily found because the prediction for each node is the modal class, so it is the proportion of observations in the node that do not belong to that class). Gini index (measures variance across all K classes of the target, as opposed to just looking at the most common class). Entropy (takes a similar approach to the Gini index). All three measures are best when minimized, since they are measures of error/impurity.
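
A minimal sketch of the three impurity measures for the class labels in a single node. The labels are made up, and the entropy here uses log base 2, which is an assumption; other bases only rescale the measure.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of observations in the node belonging to each class."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def classification_error(labels):
    """One minus the proportion of the most common class."""
    return 1 - max(class_proportions(labels))

def gini(labels):
    return sum(p * (1 - p) for p in class_proportions(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_proportions(labels))

# Made-up node contents for illustration only.
pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "no", "yes", "no"]
for node in (pure_node, mixed_node):
    print(classification_error(node), gini(node), entropy(node))
```

All three measures are zero for the pure node and reach their maximum for the evenly mixed two-class node.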

2 Key characteristics of hierarchical clustering

Clusters are nested: the clusters obtained by cutting the dendrogram at a lower height sit inside the clusters obtained by cutting at a greater height (which creates the hierarchy). One dendrogram suffices: the height of the cut controls the number of clusters, so to change the number of clusters we only need to change the height of the cut, not the dendrogram itself.

Rank-deficient linear model

Collinearity is so large that the coefficient estimates of the model cannot be determined uniquely (be able to use this term)

Types of linkage for hierarchical clustering

Complete: distance between the furthest pair of observations, one from each cluster. Single: distance between the closest pair of observations. Average: average of all pairwise distances between observations in one cluster and observations in the other. Centroid: distance between the centers of the two clusters.
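
The four linkages can be compared on two tiny clusters. This sketch assumes Euclidean distance and uses made-up two-dimensional points; the function names are illustrative.

```python
import math
from itertools import product

def pairwise_distances(a, b):
    """Euclidean distance between every point in a and every point in b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def complete_linkage(a, b):
    return max(pairwise_distances(a, b))

def single_linkage(a, b):
    return min(pairwise_distances(a, b))

def average_linkage(a, b):
    d = pairwise_distances(a, b)
    return sum(d) / len(d)

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_linkage(a, b):
    return math.dist(centroid(a), centroid(b))

# Made-up clusters for illustration only.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
for linkage in (single_linkage, average_linkage, complete_linkage, centroid_linkage):
    print(linkage.__name__, round(linkage(cluster_a, cluster_b), 4))
```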

A complex model does not necessarily make a good predictive model!

Complexity does not guarantee good prediction performance! Don't make the model too simple or too complex; aim for the bottom of the U-shaped test MSE curve (expected test MSE = bias^2 + variance + irreducible error).

Difference between correlation and interactions

Correlation describes how two variables move together. An interaction means that the effect of one predictor on the target variable depends on the value of another predictor, so it involves at least three variables. Just because two predictors are correlated does not mean that they interact in their effect on the target variable!

Differences in the number of possible predictions for decision tree and GLM

Decision tree: The number of distinct predictions equals the number of leaves, so it is finite. GLM: With a numeric predictor, infinitely many predicted values are possible.

Define linkage

Defines the dissimilarity between groups of observations in hierarchical clustering

Define depth in decision tree

Depth: The number of branches needed to go from the tree's root node to the furthest terminal node, this is a measure of complexity of the decision tree. Larger depth means more complex.

List of Discrete and Continuous models for GLM

Discrete: Binomial, Poisson Continuous: Normal, gamma, inverse Gaussian Mixture: Tweedie

T-Test

t-statistic = estimate of beta / standard error of beta. It tests the effect of adding a variable to the model after accounting for the effects of the other variables in the model. The null hypothesis is that the coefficient equals 0, meaning we can eliminate the variable. A test statistic that is larger in absolute value indicates stronger evidence of an association between the predictor and the target variable.

Key purposes of PCA in PA

Exploratory data analysis (data exploration and visualization): reducing the dimension of a dataset to a smaller set of principal components that retain most of the information in the original dataset. Feature generation: the PCs are uncorrelated and fewer in number, which removes correlation among predictors and reduces model complexity, helping to optimize the bias-variance trade-off and improve predictive performance; the features created can be used in a supervised problem. We can use principal components in our models, but make sure to remove the original variables first (otherwise we will have perfect collinearity)!

What are two important applications of unsupervised learning

Exploratory data analysis (great for datasets with huge numbers of variables) Feature generation that can assist in supervised learning

Feature Selection

Feature removal: dropping features/variables with limited predictive power to reduce the dimension of the data. It controls model complexity and prevents overfitting (reducing the variance). For categorical variables, the dimension can also be reduced by combining levels: combine sparse categories with other categories (ensure each category has enough observations while keeping meaningful distinctions), combine similar categories (similar in the mean, median, etc. of the target), or use prior knowledge of the variable (instead of each hour, classify them all as morning). ALWAYS state the reasons and justify the answer. Always consider the behavior of the levels being combined: don't combine sparse levels if their means/medians are very different!

How to interpret a regressor on interaction terms of numeric values (B3 on X1X2)

For every unit increase in X1, the change in the expected value of the target variable associated with a unit increase in X2 increases by B3.

Difference between GLMs and Decision Trees (Collinearity)

GLM: Vulnerable to highly correlated predictors, which can inflate the variance of coefficient estimates and the model predictions. Also may reduce interpretability. Trees: Collinearity won't damage the model, but in perfect collinearity it may not matter which variable is used for a split

Difference between GLMs and Decision Trees (Numeric Predictors)

GLM: in standard form, assumes effect of a numeric predictor on target is monotonic and it gets a single regression coefficient, non-linear effects need to be added manually with polynomial terms Trees: No monotonic structure imposed, trees excel in handling non-linear relationships, great for modeling interactions by just altering the splits

Difference between GLMs and Decision Trees (output)

GLM: Produces a closed-form equation, with the coefficients indicating the magnitude and direction of the effects of the features on the target. Trees: The end product is a set of if-else splitting rules showing how the target mean varies with the feature values; there is no equation.

Difference between GLMs and Decision Trees (Categorical Predictors)

GLMs: Categorical predictors need to be converted to dummy variables, which are numeric. Trees: Splits are made by directly separating the levels into two groups, with no binarization needed (binarizing first actually restricts each split to one level versus the rest, whereas an unbinarized factor can send several levels to each side at once).

Difference between GLMs and Decision Trees (Interaction)

GLMs: Must be manually inserted into the model equation Trees: Splits are based on different predictors, which automatically incorporates any interaction effects. Any interactions are inherent in the observations and this will be accounted for during the splits

Which GLM target distribution should be used for positive, continuous, and right-skewed data?

Gamma or inverse Gaussian, or alternatively a linear model fitted to the log-transformed target variable. For the gamma distribution, the mean and variance are positively related, and it accommodates right skewness. This makes gamma very widely used.

Random forests

Generate multiple bootstrapped samples of the training set and fit a base tree to each sample independently. In addition, at each split only a random subset of the predictors is considered for the split (the size of the subset is a hyperparameter); if every split considered all the predictors, the trees would be highly correlated. Each base tree is an unpruned decision tree trained on its own bootstrapped sample, and the results are combined to form an overall prediction. For a numeric target variable, the prediction is the average of the trees' predictions; for a categorical target, it is the mode of the predictions. "Random forest": numerous trees (so a forest) built on random subsets of the data.

Node Impurity

How dissimilar the observations in a node are. Our goal is to have terminal nodes whose observations are similar to one another, so with each split we want to reduce the impurity as much as possible.

Things to point out when interpreting a decision tree

How many splits or terminal nodes the tree has, which variables appear to be the most informative as shown in the first few splits and whether that makes sense, which terminal nodes have the most observations and therefore are most reliable, for classification trees point out interesting combinations of feature values that lead to the event of interest, etc.

Granularity in Data Analysis

How precisely a variable in a dataset is measured (more granular means more specific, e.g. ZIP code is more granular than state). Start with the more granular version, because we can always get state from ZIP code but not the other way around. Be careful: increasing granularity increases the dimension and leaves fewer observations in each level. The best level of granularity is the one that optimizes the bias-variance trade-off.

Considerations for using RSS and R^2 as model comparison criterion

If the models have the same number of predictors, we prefer the smaller RSS or larger R^2. If the models have different numbers of predictors, RSS and R^2 tend to favor the more complex model because it explains at least as much of the training set, so they are not a reliable basis for comparison and we must be careful of overfitting.

When does a GLM reduce to a linear model?

If the target variable is normally distributed and the link function is the identity function g(mean) = mean

Options for the link function in a GLM

If the target mean is positive and unbounded from above, use the log link (it ensures the mean of the GLM is always positive). If the target mean is unit-valued (a binary target, whose mean is the probability that something occurs and so lies between 0 and 1), use the logit link: ln(mean/(1 - mean)). The logit link restricts the target mean to the range zero to one, so it is the usual choice for a binomial target.

When to convert a numerical variable to a factor?

If its values have no meaningful numerical order and it would be more useful to allow different levels to have different relationships with the target variable (increasing flexibility)

Drawbacks of Principal Component Analysis

Interpretability (the linear combinations created are difficult to interpret). PCA depends on linear relationships and may struggle with non-linear relationships. Operational efficiency is not achieved (every original variable is still needed to compute the PCs). The target variable is ignored.

Key difference between gamma and inverse gaussian distributions?

Inverse Gaussian generally has a fatter right tail. Both are generally used with the log link because it is easier to interpret.

Differences between k-means clustering and hierarchical clustering

Is randomization needed? k-means needs it for the initial clusters; hierarchical does not. Is the number of clusters pre-specified? For k-means yes; for hierarchical no. Are the clusters nested? For k-means no; for hierarchical yes.

What's the impact that transformations have on decision trees?

It can change the splits that are made. In a regression tree, splits are chosen to reduce the residual sum of squares; for a right-skewed target, large outliers can have a disproportionate effect on the RSS and hence on the split chosen. Transforming the target limits this effect. Clarification: transforming the target variable changes the RSS and the resulting splits, whereas a monotone transformation of a numeric predictor has no impact, because splits depend only on the rank order of the predictor's values.

What is the drawback of changing years into a factor instead of keeping it numerical?

It can make it difficult to predict the value in future years, since future years won't fit into any of the categories

Limitations of using the likelihood ratio test for GLMs

It can only compare one pair of GLMs at a time, and the simpler GLM must be nested within (a special case of) the more complex GLM. If the two GLMs are not nested, the LRT can't be used.

What is the one-standard-error rule in selecting a value of lambda for cross validation?

It selects the greatest value of lambda (i.e. the simplest model) whose cross-validation error is within one standard error of the minimum cross-validation error. Models within that range have more or less the same predictive performance, so we opt for the simplest among them.
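
The rule can be sketched as a small selection function. The lambda grid, cross-validation errors and standard errors below are invented; in practice they would come from a cross-validation run.

```python
def one_se_lambda(lambdas, cv_errors, cv_ses):
    """Largest lambda whose CV error is within one SE of the minimum CV error."""
    best = min(range(len(lambdas)), key=lambda i: cv_errors[i])
    threshold = cv_errors[best] + cv_ses[best]
    eligible = [lam for lam, err in zip(lambdas, cv_errors) if err <= threshold]
    return max(eligible)

# Made-up cross-validation results for illustration only.
lambdas = [0.001, 0.01, 0.1, 1.0]
cv_errors = [1.05, 1.00, 1.08, 1.40]
cv_ses = [0.10, 0.10, 0.10, 0.12]

print(one_se_lambda(lambdas, cv_errors, cv_ses))  # 0.1
```

Here the minimum error occurs at lambda = 0.01, but lambda = 0.1 is still within one standard error of it and gives a simpler model.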

Issues with using the deviance of a GLM?

It is a goodness-of-fit measure on the training set and always decreases as the model becomes more complex. This can lead to overfitting

What does cluster analysis do?

It partitions the observations in a dataset into distinct, non-overlapping subgroups known as clusters, with the goal of revealing patterns in the data. Observations in the same cluster are very similar to each other and distinct from those in the other clusters.

How does the regularization penalty in regularization relate to the bias-variance trade-off (lambda)

It penalizes for a complex model, so a larger regularization penalty makes it more likely to have a simplified model, which reduces variance but increases bias, and vice-versa Lambda measures the tradeoff between model fit and model complexity As lambda increases, it more heavily penalizes the complex models and we end with a simpler model, with a lower variance but larger bias

How does hierarchical clustering work?

It's a bottom-up method. It starts with individual observations and fuses the closest pairs of clusters one pair at a time. The process goes on until every cluster is fused together into one cluster with every observation

Interpreting a PC (strongly negative loadings for rape, assault, and murder, neutral for urban population)

It's mainly a measure of overall crime rate, and the more negative it is the higher the crime rate Be able to interpret different loadings! Generally whichever is strongest we say that it's a measure looking primarily at that, and the sign indicates which values are greatest for that PC

Define the business problem step in the model building process

Know the business problem to which predictive analytics will be applied: what is our objective? Most projects can be classified as prediction-focused (predict the target variable based on the other variables) or interpretation-focused (understand the relationship between the target variable and the predictors). Also consider any constraints (availability of high-quality data, implementation issues such as IT restrictions, time constraints, etc.).

Between lasso and ridge regression, which method may remove variables (and thus shows only the key factors)

Lasso. It can shrink coefficient estimates to exactly zero, which removes those variables, whereas ridge shrinks the coefficients towards zero without making them exactly zero (so none are removed).

In a decision tree, which way do "yes" observations go

Left!

Model Selection Methods for GLM

Likelihood ratio test: test statistic = 2(log-likelihood of larger model - log-likelihood of smaller model) = deviance of smaller model - deviance of larger model. If the test statistic is very large, the larger model does a significantly better job of fitting the data. Otherwise, use AIC, BIC or regularization.

List of models that suit continuous/discrete target variables

Linear models are only good for continuous target variables, while GLMs and decision trees can work for continuous or discrete target variables.

Describe principal component loadings

Loadings are the weights that each original variable receives in the linear combination defining a PC. The loadings of the first PC are chosen to maximize the sample variance of the PC scores, since we want to capture as much of the variation in the dataset as possible; they define the direction along which the data vary the most. The second set of loadings defines a direction perpendicular to the first that explains as much of the remaining variation as possible. Each additional PC must be uncorrelated with the previous ones (it explains something the others haven't), which means its direction is perpendicular to theirs.

What are good options in a GLM for positive, continuous, right-skewed data

Log-link gamma GLM (the log link ensures that all predictions are positive and makes interpretation easier), log-link normal GLM, or inverse Gaussian GLM

Interpretation of GLM coefficients using logit link

Logit always applies to odds!! A unit increase in a numeric predictor with coefficient B is associated with a multiplicative change of e^(B) in the odds
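
A minimal sketch of the odds interpretation, written in Python purely for illustration. The coefficients b0 and b1 are made-up values, not taken from any fitted model. Raising x by one unit multiplies the odds by e^(b1), whatever the starting value of x.

```python
import math

def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)

def predict_prob(b0, b1, x):
    """Predicted probability from a logit-link GLM with one predictor."""
    eta = b0 + b1 * x  # linear predictor, equal to the log-odds
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -2.0, 0.7

odds_at_3 = odds(predict_prob(b0, b1, 3.0))
odds_at_4 = odds(predict_prob(b0, b1, 4.0))
odds_ratio = odds_at_4 / odds_at_3  # matches exp(b1) up to rounding

print(odds_ratio, math.exp(b1))
```

Note that the probability itself does not change by a fixed factor; only the odds do.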

Common Performance Metrics for Classification Problems

Look at classification error rates for the training or test sets (the proportion of observations that are misclassified). An observation is labeled 1 if it's misclassified and 0 otherwise, since we are looking for the error rate. The smaller the error rate, the better. Short Answer - Classification Error Rate

How do you choose the number of principal components to use?

Look at the "Proportion of Variance Explained" by each of the PCs in comparison to the total variance. The first PC explains the most, and each following PC explains no more than the one before it because of the extra uncorrelatedness restriction. Once we know the proportion of variance explained for each PC, we can use a scree plot and look for the point where additional PCs add little
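
A small sketch of the arithmetic behind a scree plot, in Python for illustration only. The PC variances below are hypothetical numbers, not output from a real PCA.

```python
# Hypothetical variances of four PCs (illustrative values only).
pc_variances = [2.48, 0.99, 0.36, 0.17]

total_variance = sum(pc_variances)

# Proportion of variance explained (PVE) by each PC.
pve = [v / total_variance for v in pc_variances]

# Cumulative PVE: how much the first m PCs explain together.
cumulative_pve = []
running = 0.0
for share in pve:
    running += share
    cumulative_pve.append(running)

for m, (share, cum) in enumerate(zip(pve, cumulative_pve), start=1):
    print(f"PC{m}: PVE = {share:.3f}, cumulative = {cum:.3f}")
```

A scree plot is just pve plotted against the PC number; the cumulative column helps when a target such as "explain most of the variance" is given.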

What is a drawback of using the Poisson distribution in a GLM?

Mean and variance must be equal

What is the Tweedie Distribution in GLM?

Mixture of discrete (poisson) and continuous (gamma) components. It has a discrete probability mass at zero and a probability density function on the positive real line, making it a great bet for modeling aggregate claim losses

Bias-Variance Tradeoff

More flexible models have lower bias but higher variance, and less flexible models have higher bias but lower variance. We need to consider the trade-offs when selecting a model. The expected test MSE is bias^2 + variance + irreducible error, so if we underfit, the MSE is high because of the large bias, and if we overfit, the MSE is high because of the large variance. We must find a middle ground: the test MSE is U-shaped in model flexibility.
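
For reference, the decomposition can be written out as follows. The notation is an assumption made here, not from the card: \hat{f} is the fitted signal function, (x_0, Y_0) is a new test observation, and \varepsilon is the noise term.

$$
\mathrm{E}\big[(Y_0 - \hat{f}(x_0))^2\big]
= \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
+ \mathrm{Var}\big(\hat{f}(x_0)\big)
+ \mathrm{Var}(\varepsilon)
$$

The first two terms are reducible and move in opposite directions as flexibility grows; the last term is the irreducible error.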

Trade-off of determining the size for training and test sets

More training data means a more robust predictive model that can learn patterns and is less susceptible to noise, but if too little data is in the test set we can't assess the predictive performance on new, unseen observations

Define residual

Observed value minus expected value Measures how well the fitted model matches a training observation

Normal Q-Q Plot

Plots the quantiles of the standardized residuals against the theoretical standard normal quantiles. If the standardized residuals are normally distributed, the points will follow the 45-degree line closely. If they don't follow the line, we can conclude that the standardized residuals are not normally distributed

Residuals vs. Fitted Plot

Plots residuals of the model against the fitted values on the training set. Can easily see on here if there is a trend in the error variance. We want homoskedasticity (a constant variance), so any trend in variance is a huge red flag. We want the residuals to scatter randomly around zero, with the same magnitude on either side

Describe ROC Curves

Plots sensitivity against specificity (equivalently, against the false positive rate, 1 - specificity) of a given classifier for each cutoff ranging from 0 to 1. ROC curves begin and end at the bottom-left and top-right corners, but a better classifier has an ROC curve that bends toward the point where both sensitivity and specificity equal 1. The closer the curve gets to the top-left, the better the predictive ability. Computing the AUC (area under the curve) summarizes the predictive performance of the model in a single number
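
A rough Python sketch (for illustration only) of how the ROC points and the AUC could be computed by hand. The labels and predicted probabilities are invented, and the function names roc_points and auc are hypothetical helpers, not from any library.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, sensitivity) pairs, one per cutoff."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # A cutoff above every score predicts all negative; lower cutoffs add positives.
    cutoffs = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for c in cutoffs:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= c and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= c and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented example: 1 = positive class, scores = predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

roc = roc_points(y_true, scores)
area = auc(roc)
print(roc, area)
```

An AUC of 0.5 corresponds to random guessing and 1 to a perfect classifier.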

Define biplot

Plots the scores of the first two PCs against each other, together with the loadings of each variable on those two PCs, to help us visualize the data in a scatterplot and see which variables are similar

Considerations for selecting the best model

Prediction performance (smallest test RMSE or test classification error rate), interpretability, ease of implementation. For the exam, consider whether they want a model that's more focused on prediction ability or on interpretability and choose the model accordingly (models built for prediction will be more complex than models built for interpretation). Always focus on how it performs with the TEST SET; this is what we are analyzing

What are the two types of unsupervised learning

Principal Components Analysis and Cluster Analysis

Pros and cons of using piecewise linear functions in regression models

Pros: They are a simple but powerful way to handle non-linearity, and interpretation is relatively simple (just a change in slope at the break points). Cons: The break points need to be user-specified in advance

Pros and cons of using a log transformation when modeling?

Pro: Can help symmetrize a right-skewed variable. Cons: Can't be used if any values are 0 or negative! (a square root transformation can handle zeros, but not negative values)

Sensitivity in confusion matrix

Proportion of positive observations that are correctly classified true positive/(true positive + false negative) How many new "positives" will be correctly classified as positive

Precision in confusion matrix

Proportion of positive predictions that truly are positive true positive / (false positive + true positive)

Pros and cons of boosting relative to random forests

Pros: Boosted trees perform better in terms of prediction accuracy. Cons: More prone to overfitting due to multiple base trees trying to capture the signal in the original data

Polynomial Regression Pros and Cons

Pros: Can account for more complex relationships between the target variable and predictors than we could with a linear relationship. The more polynomial terms that are included, the more flexible the fit Cons: Interpretation is very tough, the choice of the powers to include is very subjective

Pros and Cons of Best Subset Selection

Pros: Conceptually straightforward and gives the best model Cons: May not be feasible with even a moderate amount of potential features

Pros and cons of boosting relative to base trees

Pros: Gain in prediction performance Cons: Loss of interpretability and computational efficiency

Pros and Cons of Decision Trees

Pros: Easy to interpret for non-technical audiences (as long as there's a reasonable number of buckets), non-linear relationships are handled very well by the splits, skewed variables don't need to be transformed (splits are based only on the rank order of the values), interactions are recognized automatically, categorical variables don't need dummy variables, variables are automatically selected by trees (if not used in a split they are eliminated). Cons: Decision trees are more prone to overfitting (a bad initial split can throw off the whole tree), may need to split numeric variables repeatedly (which reduces interpretability), trees may split based on levels in a categorical predictor that aren't important

Pros and Cons of binning-using piecewise constant functions

Pros: It allows the regression function to avoid picking a shape. This means the target mean can vary irregularly over the bins, since each bin is completely different. The larger the number of bins, the wider the variety of relationships we can fit between variables and the more flexible the model Cons: Selecting the number of bins and range of values is very arbitrary, binning results in a loss of information (turning exact values into ranges), an excessive number of bins may lead to sparse categories and unstable coefficient estimates

Pros and cons of stepwise selection

Pros: Much less computationally intensive than best subset selection (only requires us to fit 1+(p(p+1))/2 models) Cons: It does not consider all possible combinations of features and therefore may not choose the best combination We can only add or drop one feature at a time, not multiple ONLY EVER DROP ONE VARIABLE AT A TIME because interactions may make one significant once another is dropped

Pros and Cons of random forests relative to a single tree

Pros: Much more robust than a single tree, the averaging of results reduces the variance greatly and produces much more precise predictions Cons: interpretability, computational power

Pros and Cons of Ensemble Methods

Pros: Predictions are often far superior because of having multiple base models Con: Computationally intensive and difficult to interpret

Pros and Cons of regularization techniques for feature selection

Pros: R makes it extremely easy to use regularization for categorical predictors, and regularized models can be tuned by cross-validation using the same criterion that will ultimately judge the model. Cons: May not apply to all model forms, may be tough to interpret

Advantages and Disadvantages of binarizing factor variables on our own

Pros: R may treat factor variables as a single feature and either retain all of the levels or none, which ignores the possibility that individual levels of a factor can be insignificant. With binarization, we can add or drop individual factor levels, and thus have a better representation when performing stepwise selection. Can eliminate just insignificant levels of a factor and keep significant levels, as opposed to having to drop the entire factor at once Cons: May take much longer to perform stepwise selection with more features, and interpretation may be non-intuitive if only some levels of a factor remain

Pros and Cons of Random Sampling

Pros: it is the easiest sampling method Cons: Voluntary surveys are vulnerable to respondent bias (only those interested will respond), which may not represent the true population, surveys also have low response rates

Pros and cons of treating values as factors instead of numeric variables in GLM

Pros: Treating the values as numeric restricts them to a global monotonic association with the target (e.g., claim occurrence across all age categories). Treating them as a factor removes that restriction by creating dummy variables with different coefficients, providing more flexibility. Cons: Converting numbers into factors inflates the dimension of the data and may result in overfitting

Cross-Validation (k-fold cross-validation)

Provides a convenient means to assess the prediction performance of a model without additional test data; it also helps to select the values of hyperparameters. K-fold cross-validation performs a series of splits repeatedly across the dataset. Randomly split the data into k folds of equal size; one fold is left out, the predictive model is fitted to the remaining folds and tested on the left-out fold, and this is repeated for each fold. The average of the errors across the folds is the cross-validation error. Cross-validation lets us compare candidate hyperparameter values using only the training data, so the test set stays unseen
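
The steps above can be sketched in a few lines of Python (illustration only). To keep it short, the "model" here is an intercept-only model that predicts the training mean, and the data are made up; k_fold_indices and cross_val_mse are hypothetical helper names.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly split row indices 0..n-1 into k folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(y, k, seed=0):
    """k-fold cross-validation error (MSE) of an intercept-only model."""
    fold_errors = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        fitted_mean = sum(train) / len(train)  # fit on the other k-1 folds
        mse = sum((y[i] - fitted_mean) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)                # test on the left-out fold
    return sum(fold_errors) / k                # average over the k folds

# Made-up target values, for illustration only.
y = [3.0, 5.0, 4.0, 6.0, 5.5, 4.5, 7.0, 3.5, 5.0, 6.5]
cv_error = cross_val_mse(y, k=5)
print(cv_error)
```

To tune a hyperparameter, the same loop would be run once per candidate value, keeping the value with the lowest cross-validation error.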

Types of Ensemble Methods

Random forests and boosting

Variance

Quantifies the amount by which the signal function would change if it was calculated using a different training set. Ideally, the signal function should be stable across training sets. The more flexible a model, the higher the variance because a new training set would have very different results, it's very sensitive to the training data. It is the part of the test error from having too complex of a model PRECISION OF SIGNAL FUNCTION Small variance means precise, since we'd have almost the same model no matter what

Define Coefficient of Determination

R^2 The proportion of the variance of the target variable that can be explained by the fitted linear model R^2 = 1 - RSS/TSS Always between 0 and 1 The higher the R^2, the better it fits to the training set
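
A quick Python check of the formula R^2 = 1 - RSS/TSS (illustration only). The observed and fitted values are invented and do not come from a real regression.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(observed) / len(observed)
    rss = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # residual SS
    tss = sum((y - mean_y) ** 2 for y in observed)             # total SS
    return 1.0 - rss / tss

# Invented values, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
fitted = [2.5, 3.5, 6.5, 7.5]

r2 = r_squared(observed, fitted)
print(r2)
```

Here TSS is 20 and RSS is 1, so the fitted values account for 95% of the variation in the target.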

Bagged Model

Random forest where every predictor is considered at each split (mtry equals the number of predictors)

Differences between random forests and boosting:

Random forests fit trees independently (in parallel), while boosting fits trees sequentially, with each tree depending on the previous ones. Random forests decrease variance through averaging, while boosting decreases bias through improved predictions

Three main qualities of good data

Reasonable, consistent, and have sufficient documentation

Piecewise linear functions in regression models definition

Regression function is linear over different intervals in a range of the numeric variable, with the straight lines connected continuously. The slope changes in each interval, but it is a continuous line
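
One common way to write this, assuming a single break point c (the symbols below are notation chosen here, not from the card):

$$
f(x) = \beta_0 + \beta_1 x + \beta_2 (x - c)_+,
\qquad (x - c)_+ = \max(0,\; x - c)
$$

The slope is \beta_1 below c and \beta_1 + \beta_2 above c, and the two lines meet at x = c, so the function stays continuous.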

Measures of Node Impurity in Decision trees for regression

Regression: Residual sum of squares. The lower the residual sum of squares, the more accurate the predictions are

Regression vs. Classification Problems

Regression problems generally are supervised learning problems with a numeric target variable. Classification problems have a target variable that is categorical

How does regularization work?

Regularization uses RSS as the starting point and then incorporates a penalty term that reflects the complexity of the model. More complex models are more penalized. (lambda is our regularization parameter)
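
Written out for the two penalties named elsewhere in this set (ridge and lasso), with p features, coefficients \beta_j, and regularization parameter \lambda \ge 0 as assumed notation:

$$
\text{Ridge:}\quad \min_{\beta_0,\dots,\beta_p}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2
\qquad\qquad
\text{Lasso:}\quad \min_{\beta_0,\dots,\beta_p}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
$$

The intercept \beta_0 is not penalized. With \lambda = 0 both reduce to ordinary least squares; as \lambda grows, the coefficients are shrunk more heavily.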

Common Performance metrics for Regression Problems

Residual - discrepancy in the training set. Prediction Error - discrepancy in the test set. Root Mean Squared Error - sqrt(sum of squared residuals or prediction errors / n). Smaller RMSE means the model fits the data better. Mean squared error ranks models the same way. RMSE is the size of a typical residual/prediction error in absolute value. RMSE is easier to interpret than MSE because it's in the same unit as the target (the square root undoes the squaring; MSE is in squared units). Short Answer - Root of mean squared error
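
A short Python sketch of the RMSE calculation (illustration only; the numbers are made up).

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: square root of the average squared error."""
    n = len(observed)
    squared_errors = [(y - p) ** 2 for y, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / n)

# Made-up target values and predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

error = rmse(observed, predicted)
print(error)
```

Computed on the training set this uses residuals; computed on the test set it uses prediction errors.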

Root node vs. leaves in decision trees vs. branches

Root node is at the top, leaves are all of the possible prediction values, branches are the interior splits

Define root node

Root node: At top of the decision tree representing the full dataset

Simple linear regression model vs. multiple linear regression model

Simple only has one predictor, multiple has more

Define weights in GLM

Sometimes we are working with grouped data that use averages over different number of observations, so we can apply a greater weight to the groups that have more observations to account for the larger credibility, this can help with the reliability of the fit

What does a Q-Q plot in GLMs look at?

Standardized Deviance Residuals

Naive Pruning Method:

Start at the root node and only make a split if it satisfies a pre-specified threshold Cons: Some "bad" splits at the start may lead to really good splits, which we will never get to if we don't make that initial split

Backward Stepwise Solution Process

Start with the model with every feature and work backwards. In each step, drop the feature whose removal improves the model the most according to AIC or BIC. Continue until no more features can be dropped to improve the model

Forward Stepwise Solution Process

Start with the simplest model (only the intercept) and go forward. In each step, among all the features not already in the model, we add the feature that results in the greatest improvement in the model. When no more features can be added to improve the model, you have your model

Structured vs. Unstructured Data

Structured: Data that fits into a tabular arrangement Unstructured: Doesn't fit into a tabular arrangement (text, video, recordings)

Residual Sum of Squares

Sum of squares of residuals (Actual-expected)^2 The smaller the RSS, the better the linear model fits the set

Supervised vs. Unsupervised Problems

Supervised: Those for which there is a target variable "supervising" our analysis, and we want to understand the relationship between the target variable and predictors and/or make accurate predictions for the target based on the predictors. Unsupervised: No target variable supervising our analysis; we are interested in extracting relationships and structures between different variables in the data

Describe the center of the means in a cluster

The algorithm iteratively recalculates the K cluster centers (means). It starts with K initial centers and assigns every observation to the cluster with the closest center, then recalculates each center as the mean of its cluster. Some observations then move to a different, closer center; the centers are recalculated again, and this repeats until no observations move
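
The assign-then-recalculate loop can be sketched in Python for one-dimensional data (illustration only). The data and starting centers are made up, and k_means_1d is a hypothetical helper, not a library function.

```python
def k_means_1d(values, centers, max_iter=100):
    """Basic K-means on 1-D data: assign to nearest center, then update centers."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the closest center.
        labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        # Update step: each center becomes the mean of its assigned observations.
        new_centers = []
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:  # nothing moved, so the algorithm has converged
            break
        centers = new_centers
    return labels, centers

# Made-up data and deliberately poor starting centers.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
labels, centers = k_means_1d(values, centers=[1.0, 2.0])
print(labels, centers)
```

Different starting centers can lead to different final clusters, which is why the algorithm is usually run from several random starts.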

Define interactions in predictive modeling (SUPER IMPORTANT)

The association between one predictor and the target variable depends on the value (or level) of another predictor

What impacts the interpretation of GLM coefficients

The choice of the link function because it determines the relationship between the target mean and the features NOT the target distribution

Interpretation of the slopes on a qualitative variable in a linear regression model

The difference between the expected value of a target variable when the categorical predictor takes on a specific value rather than the base value, holding all other predictors fixed ALWAYS COMPARE TO BASE VALUE WHEN INTERPRETING "We expect it to be 17.3 higher in the south than the north, on average, holding all other predictors fixed." if coefficient is 17.3 on south and base is the north

Bias

The difference between the expected value of the signal function and the true value of the signal function. Expected value - true value. Measures the accuracy of the signal function prediction. The more complex the model, the lower the bias! It has a higher ability to capture the signal in the data. Bias is caused by the model not being flexible enough to capture the underlying signal. ACCURACY OF SIGNAL FUNCTION Small bias means accurate, since it's centered around the true mean

How do you determine how many clusters to include in K-means clustering?

The elbow method. Keep adding clusters and tracking the proportion of total variation explained (between-cluster sum of squares divided by total sum of squares). When adding another cluster results in only a minimal increase in the variation explained, you have reached an elbow and should use that value as the number of clusters

Define intercept term

The expected value of Y when all X's are zero

Interpretation of the intercept in a linear regression model with only categorical variables

The expected value of the target variable when the base value for the categorical variables is assumed for each variable

How does the training and test error relate to the complexity\flexibility of a model?

The more complex a model, the lower its training error because it captures a lot in the training data However, the test error (our main focus) exhibits a U-shape. As model complexity increases, the error initially decreases but then the model is too complex and error starts to increase We need to select the right level of flexibility when constructing an effective model, don't overcomplicate it

What do variable importance plots tell us?

The most important variables in ensemble trees, since they are very hard to determine by looking at trees (there are hundreds of them), higher importance score means it's more important. Specifically, it measures the drop in node impurity due to that predictor

Dimensionality of a categorical variable

The number of possible levels a variable has (can inflate the dimensions of the predictors)

Target Leakage (Important!)

The phenomenon that some predictors in a model include ("leak") information about the target variable that won't be available when the model is applied in practice. They may be strongly associated with the target variable but won't be known until after we know the target variable's value. (Don't include these, because we want to predict the target variable BEFORE we know it, and including them will make our model look deceptively good)

Considerations to make when making a decision tree split

The predictor to split and the cutoff value
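
For a regression tree with one numeric predictor, choosing the cutoff can be sketched as below (Python, illustration only). The sketch assumes the impurity measure is the residual sum of squares and tries the midpoints between consecutive distinct values; the data are invented and best_split is a hypothetical helper. With several predictors, the same search would be repeated for each one.

```python
def rss(values):
    """Residual sum of squares around the mean (0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (cutoff, total RSS) for the split of x that minimizes RSS of y."""
    distinct = sorted(set(x))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        cutoff = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi < cutoff]    # "yes" branch
        right = [yi for xi, yi in zip(x, y) if xi >= cutoff]  # "no" branch
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (cutoff, total)
    return best

# Invented data with an obvious jump in the target between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.0, 20.0, 21.0, 19.0]

cutoff, total_rss = best_split(x, y)
print(cutoff, total_rss)
```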

Feature Generation

The process of generating new features based on existing variables in the data. This is very useful for transforming unstructured data so that it can be applied to our analysis. Feature generation tries to increase the flexibility of the model, since it is adding more features. Only add features that are more likely to have a direct relationship with the target variable than the raw variable *Think of creating helper columns!

What is the one key drawback of linear models?

The target variable needs to be normally distributed, so it can't account for target variables that are binary, highly skewed, or "count" variables (count how many times something occurs)

Irreducible Error

The variance of the noise, which is independent of the predictive model but inherent in the random nature of the target variable. It can't be reduced no matter how good the model is, so it's not in our interest. Bias and variance are reducible errors.

How are the groupings for a cluster decided?

They are chosen such that the variation of the observations within each cluster is relatively small, while the variation between clusters is relatively large Minimize the within-cluster variation summed over all K clusters

Define offsets in GLM

They are commonly used with count data when the number of claims is aggregated over multiple exposure units in each row of data. We can use the number of policies as an "offset" to account for the different exposures behind the claim counts in different rows. With the log link, include ln(exposure for the ith observation) in the linear predictor: ln(E) is technically a regressor whose coefficient is fixed at 1 rather than estimated. It is all about adjusting the target variable for exposure. If the number of exposures varies across observations, it could serve as a good offset (and always take the natural log of that variable!). ALWAYS TAKE THE LOG IN THE OFFSET and don't include the exposure variable in the equation other than in the offset
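
In symbols, assuming a log link, exposure E_i for the ith observation, and target mean \mu_i (notation chosen here for illustration):

$$
\ln(\mu_i) = \ln(E_i) + \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
\quad\Longleftrightarrow\quad
\mu_i = E_i \, \exp\!\big(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\big)
$$

The second form shows why the log is needed: only with ln(E_i) as the offset does the mean scale in direct proportion to the exposure.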

How does a GLM compare to a linear model?

They are much more flexible. Distribution of target variable: It can account for target variables that take on any distribution in the exponential family of distributions. GLMs can model binary, discrete, and continuous target variables with mean-variance relationships, and can do both regression and classification. Relationship between target mean and linear predictors: Allows us to relate a function of the target mean to the predictors, as opposed to just the mean itself (we can use identity, log, inverse, etc.)

Why are the log and logit by far the most popular for GLM link functions?

They are the most interpretable!

How do decision trees work?

They divide the feature space (all combinations of feature values) into a finite set of non-overlapping regions containing relatively homogenous observations more amenable to analysis and prediction To make a prediction, simply locate the region to which it belongs and use the mean (if numerical) or mode (if categorical) of the target variable in that region as the prediction

In a linear model, describe the error term

They follow a normal distribution with mean zero and a common variance

Which GLM target distribution should be used for binary data?

This is a classification problem that measures whether something happens, so binomial or Bernoulli are the only options!

Which GLM target distribution should be used for count data?

This looks at how many times an event happens over a reference time period, are non-negative integer values possibly right skewed, so Poisson is a good option. Negative binomial is also an option.

Why prune a tree?

To optimize the bias-variance trade-off and maximize predictive performance on test data

Binning-Using piecewise Constant Functions Definition

Treat a numeric variable as categorical by "binning" ranges of values into categorical variables, and applying each range of values (each bin) a dummy variable, then treat as a categorical variable Piecewise Constant - the regression function is constant over pieces in the range of the numeric variable

Define Principal Components Analysis

Transforms a high-dimensional dataset into a smaller, much more manageable set of representative variables that capture most of the information in the original dataset. The principal components are mutually uncorrelated and attempt to simplify the dataset, reducing the dimensions and making it better for data exploration and visualization. It works very well with correlated data too, because it needs fewer variables to describe everything. The principal components are linear combinations of the existing features, not a subset of them

Describe Ensemble Methods

Try to improve on the instability of decision trees. They do this by using multiple base models, which can capture complex relationships in the data (bias) and improve stability via averaging (variance)

Collinearity (aka multicollinearity)

Two or more features are closely (or exactly) linearly related In this scenario, some variables don't bring much additional information because their values can be deduced from the values of other variables Collinearity can lead to inflated variances (this is the big one!!) since the estimates vary widely or the estimates can be very weird values, it makes it very hard to interpret as well because holding other values constant is not appropriate

Test Set

Typically 20-30% of the full data. Once the model is fitted using the training set, we will use it to make a prediction for each observation in the test set and assess the predictive performance of the model. Simulates new, "future" data to see how good the model is. The model has not seen these data points yet, so DON'T USE THESE TO FIT THE MODEL. The test set will determine which model is the best

Training set

Typically 70-80% of the full data. This is used to fit the predictive model to estimate the signal function and model parameters.

What is the issue with unbalanced data in GLMs?

Unbalanced data implicitly places more weight on the majority class and tries to match the training observations in that class, without paying much attention to the minority class. This is very problematic if the minority class is the class of interest (ex. positive tests)

Two solutions for dealing with unbalanced training data

Undersampling: draws fewer observations from the negative class and retains all of the positive class observations. This improves the ability for the classifier to pick up the signal leading to the positive class, as there are relatively more data points. Drawback to undersampling: The classifier is now based on less data and we have less information about the negative class, so it could be less robust and prone to overfitting Oversampling: Keeps all original data, but oversamples (with replacement) the positive class to reduce the imbalance. Some positive class observations will appear more than once. Must do this after the training/test split and only to the training split, or you run the risk of the test split not being truly unseen. What to do in practice: a mixture, oversample so that you retain all information but also undersample so it's not too computationally intensive
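
A minimal Python sketch of oversampling (illustration only). It assumes labels are coded 1 for the positive (minority) class and 0 for the negative class, and it should be applied to the training split only. The function name oversample_positive is hypothetical.

```python
import random

def oversample_positive(labels, seed=0):
    """Return row indices with the positive class resampled up to the negative count."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Draw extra positive rows with replacement until the classes are balanced.
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return neg + pos + extra

# Made-up training labels: 8 negatives and 2 positives.
train_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
rows = oversample_positive(train_labels)

n_pos = sum(train_labels[i] for i in rows)
print(len(rows), n_pos)
```

Every original row is kept, and some positive rows appear more than once. Balancing the classes exactly is a choice made for this sketch; a milder ratio is also possible.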

How do you measure the distance between clusters in hierarchical clustering?

Use the Euclidean distance between observations. The distance between two clusters then depends on the type of linkage used (e.g., complete, single, average, or centroid)

How to fit a GLM

Use maximum likelihood estimation to estimate the unknown coefficients of a GLM. Choose the parameter estimates that maximize the likelihood of observing the given data.

Confusion Matrices

Used for binary classifiers in GLM Sets up a 2x2 matrix with the predicted value and actual value. First letter for each datapoint is either T or F (if prediction is true or not) and second letter is N or P (was prediction negative or positive) Results in TP, TN, FP, FN
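
A small Python sketch (illustration only) that tallies the four cells and the metrics defined elsewhere in this set. The actual and predicted classes are made up; 1 is the positive class.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels coded 0/1."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

# Made-up actual classes and predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
n = len(actual)

sensitivity = tp / (tp + fn)   # positives correctly classified
specificity = tn / (tn + fp)   # negatives correctly classified
precision   = tp / (tp + fp)   # positive predictions that are right
error_rate  = (fp + fn) / n    # classification error rate

print(tp, tn, fp, fn, sensitivity, specificity, precision, error_rate)
```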

Ordinary Least Squares Approach

Used in linear models, chooses the estimates of beta that will make the sum of the squared differences between the predicted target values and observed target values the least These beta values are called ordinary least squares estimators

Define boosting

Uses a sequence of interdependent trees using information from previously grown trees. We fit a tree to the residuals of the preceding tree to form new residuals. We keep repeating this, improving on poor predictions from the past tree, progressively moving in the right direction.

Goodness of Fit Measures for GLMs on the training data

We can't use RSS or R^2 because we didn't use least squares for the parameter estimates, we used MLE Training deviance measures the extent to which the GLM departs from the most saturated model (perfectly fits each training observation). The lower the deviance, the closer the GLM fits the training set Deviance residuals tell us the square root contribution of the ith observation to the deviance

Training/test set split

We don't want to use all of the data available to us to create our models, since then we can't test it and we want the model to represent the future, not the past. 70% generally training, 30% test

Define Scaling in PCA

We should standardize the variables (divide each by its standard deviation so that it has unit standard deviation) to get them all on the same scale. Otherwise the variables with a large variance on their scale will dominate the PCA, even if they are unimportant variables. Scale the variables ahead of time because we don't want PCA to depend on an arbitrary choice of units. Always scale the variables if they are on relatively different scales, as we can't perform reliable PCA if they have different magnitudes/scales
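
A short Python sketch of standardization (illustration only). It also centers each variable at its mean, which is an assumption added here, and uses the sample standard deviation. The income and age columns are invented.

```python
import statistics

def standardize(column):
    """Center a variable at its mean and scale it to unit standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation
    return [(x - mean) / sd for x in column]

# Invented variables on very different scales.
income = [30000.0, 45000.0, 60000.0, 75000.0]
age = [25.0, 35.0, 45.0, 55.0]

income_scaled = standardize(income)
age_scaled = standardize(age)

print(statistics.variance(income), statistics.variance(age))
print(income_scaled)
print(age_scaled)
```

Before scaling, the variance of income is several orders of magnitude larger than that of age and would dominate the first PC. After scaling, both have unit standard deviation.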

Data collection step in the model building process

We will spend a lot of time finding data for a good model Data Design: Consider relevance (the data we collect must represent the true population we are studying, otherwise bias occurs) and time frame (choose the time period which best represents the business environment of the model, recent history is generally better)

Differences between weights and offsets

Weights: The observations of the target variable should be averaged by exposure, variance of each observation is inversely related to the size of the exposure but mean is not impacted Offsets: Observations are values aggregated over exposure units. Exposure is in direct proportion to the mean, but variance is not affected If the mean is directly proportional to the number of exposures, use an offset. If the variance is inversely related to the number of exposures, use the weight!

Define Hierarchical Principle in Modeling

Whenever a higher-order variable is retained in a model, we need to include all lower-order variables even if they aren't statistically significant

How does the height of the cut determine the number of clusters that are formed?

The number of clusters equals the number of branches that the horizontal cut line crosses. All observations joined together below the cut line on the same branch form one cluster

Do you scale variables in clustering?

Yes, otherwise the variables on a much larger scale will have a huge effect on the final clusters, scaling it gives equal weight to each variable

Can cluster analysis generate features?

Yes, the cluster groups can become a qualitative variable or we can use the cluster centers as a new feature

State the equation of a PC

Z1 = loading 1 * standardized variable 1 + loading 2 * standardized variable 2 + ...

Best way to measure predictive performance of decision trees

accuracy/sensitivity/specificity on test set OR ROC and AUC Also accounts for interpretability

R command to split data into training and test sets

createDataPartition() Note: This automatically creates stratified samples for the training and test sets, which is very appealing!

Regularization Formula for GLM

deviance + regularization penalty Deviance measures goodness of fit, while regularization penalty measures model complexity. All other key concepts regarding regularization are consistent

Possible combinations of target distribution/link function for binary target variables in a GLM

distribution must be binomial link function can be logit (this is most common because it deals directly with odds and is interpretable), or probit

formula for a log link GLM

ln(u) = b0 + b1x1 + ...

formula for a logit link GLM

ln(u/(1-u)) = b0 + b1x1 + ...

How do you measure complexity in a decision tree and how is it managed?

the complexity is measured by the number of splits/terminal nodes it has, and can be managed through pruning. An overblown tree is difficult to interpret and vulnerable to overfitting the training data, will have a high variance. Also, the final splits of a large tree are often based on a very small number of observations, which are vulnerable to noise

Specificity in confusion matrix

true negative/(true negative + false positive) how many new "negatives" will be classified correctly How many negatives were predicted to be negative

formula for an identity link GLM

u = b0 + b1x1 + ...

