Exam PA


Explain why it is not a good idea to add or drop multiple features at a time when doing stepwise selection

It is not a good idea to add/drop multiple features at a time because the significance of a feature can be strongly affected by the presence or absence of other features due to their correlations. A feature can be significant on its own but become highly insignificant in the presence of another feature.

Alt distance measures: Correlation

Motivation: Focuses on shapes of feature values rather than their exact magnitudes. Limitation: Only makes sense when the number of features p >= 3, for otherwise the correlation between two observations always equals +/- 1

Describe how principal components analysis can provide an alternative method for determining whether a student will pass or not based on all of G1, G2 and G3

Note: General description - specific structure is generally a good strategy for framing responses to many subtasks in PA Principal components analysis is an analytic technique for summarizing high-dimensional data. It relies on the use of composite variables known as principal components (PCs), which are linear combinations of the existing variables generated such that they are mutually uncorrelated and reduce the dimension of the data while retaining as much info as possible. It is particularly useful when the variables in the data are highly correlated, in which case a few PCs are enough to represent most of the info in the data. Instead of basing pass/fail only on G3, we can apply PCA to G1, G2, and G3, which are all numeric, and take the first PC as an overall grade variable and the new target variable. Due to the strong correlations between the three grade variables, the first PC will retain most of the information contained in the three variables. A student will be deemed to pass if the overall grade is higher than a pre-determined threshold.
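A minimal sketch of this idea, assuming a hypothetical data frame grades with numeric columns G1, G2, and G3:
pca <- prcomp(grades[, c("G1", "G2", "G3")], center = TRUE, scale. = TRUE)
summary(pca)                 # proportion of variance explained by each PC
overall_grade <- pca$x[, 1]  # scores on the first PC serve as the composite grade
# the sign of PC1 is arbitrary; flip it if higher scores correspond to lower grades
pass <- as.integer(overall_grade > 0)  # 0 is an illustrative threshold only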

PC Scores

On bottom and left axes of biplots... Deduce characteristics of observations (based on meanings of PCs) First, they can be used to visualize the data in a lower-dimensional space defined by the principal components. For example, the first two or three principal components can be plotted against each other to create a scatter plot that summarizes the variation in the data. Second, PC scores can be used as input variables in downstream analyses, such as regression or clustering, to reduce the dimensionality of the data and remove correlated variables. Finally, PC scores can be used to identify outliers or unusual observations that have extreme values on one or more principal components.

PC loadings

On top and right axes of biplots... The PC loadings indicate how much weight each original variable has on each principal component: they are the coefficients of the original variables in the linear combination that defines each principal component. The loadings are often presented in a matrix format, where each row corresponds to an original variable and each column corresponds to a principal component. PC loadings are useful in interpreting the results of PCA, as they can help identify which original variables are most strongly associated with each principal component. They can also be used to create linear combinations of the original variables that approximate the principal components, which can be useful in certain applications such as dimensionality reduction.

Explain the problem with RSS and R^2 as model selection measures

One problem with RSS and R^2 is that they are merely goodness-of-fit measures of a linear model with no explicit regard to its complexity or prediction performance. As we add more predictors to a linear model and make it more complex, RSS will always decrease and R^2 will always increase.

Explain why oversampling must be performed after splitting the full data into training and test data:

Oversampling should be only performed on the training data, following the training/test split. If we do this before the training/test split on the full data, then we will have duplicate observations in the test set and the test set will not be truly unseen to the training classifier, defeating the purpose of the training/test split.

Explain the rationale behind and the difference between AIC and BIC

Both are performance metrics based on the idea of penalized likelihood, used to rank competing models. AIC and BIC are indirect measures of prediction performance in the sense that they adjust a goodness-of-fit measure, like the RSS or training log-likelihood, to account for model complexity. The difference between AIC and BIC is that the penalty per parameter is higher for the BIC than for the AIC for almost all reasonable values of n. The BIC is more stringent than the AIC for complex models and tends to result in a simpler final model when used as the model selection criterion.
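For reference, with l the maximized training log-likelihood, p the number of estimated parameters, and n the number of training observations: AIC = -2*l + 2*p and BIC = -2*l + p*ln(n). The BIC penalty per parameter, ln(n), exceeds the AIC penalty of 2 whenever n is greater than about 7 (e^2 is about 7.4), which is why the BIC is more stringent for almost all reasonable values of n.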

Regularized GLMS Pro's and Con's

Pro: 1) Categorical predictors: Binarization of cat variables is done automatically and each factor level treated as a separate feature to be removed. 2) Tuning: Can be tuned by CV using the same criterion (MSE, accuracy) ultimately used to judge the model against unseen test data. 3) Variable selection: For lasso and elastic nets, variable selection can be done by making lambda large enough. Cons: 1) Target distribution: Limited/restricted model forms allowed by glmnet 2) Categorical predictors: Possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected. 3) Interpretability: Weaker point, coefficient estimates are more difficult to interpret if the variables are standardized.

Advantages and disadvantages of polynomial regression

Pro: We are able to take care of substantially more complex relationships between target variable and predictors than linear ones. The more polynomial terms included, the more flexible the fit that can be achieved. Con: On the downside the regression coefficients are much more difficult to interpret. Con: There is no simple rule as to how to choose the value of m, which is a hyperparameter.

Ensemble trees Pros and Cons

Pros: 1) Much more robust and predictive than base trees by combining the results of multiple trees. Cons: 1) Opaque/difficult to interpret. Reason: Many base trees are used, but variable importance plots or partial dependence plots can help. 2) Difficult to implement: Huge computational burden with fitting multiple base trees

Your assistant suggested creating a new variable that flags any previous class failures and including this flag variable in your models, in addition to variables that already exist in the data. The value of the variable is 1 if failures is higher than or equal to 1, and 0 if failures is 0. Your assistant thinks that this variable may be a useful feature for predicting pass. (The original failures ranges from 0-3) 1) Describe the modeling impacts of adding the new flag variable when running a GLM 2) Describe the modeling impacts of adding the new flag variable when running a decision tree.

1) The new binary variable, 1 {failures}, is an extra feature generated from failures. While the new variable does not contribute additional info (its value can be deduced from that of failures), it provides a GLM with more flexibility to capture the effect of failures on the pass rate. Without the variable, the effect of moving from 0 to 1 failure on the linear predictor of the GLM is the same as any other unit increase (e.g. 1 to 2 and 2 to 3). With the variable, the GLM will be given an extra "boost" and be able to differentiate students with no failure from those with at least one failure more effectively. Downside- intro of new feature will increase the complexity of the GLM (# of coefficients increases by 1) and worsens the extent of overfitting. The overall impact on prediction performance depends on the bias-variance trade-off and is uncertain. 2) The new feature will have no impact on a decision tree. This is because making a tree split based on the new feature is identical to making a split based directly on failures with 1 as the split point, i.e. the two resulting branches are defined by failures >=1 and failures <1. There is no change in the flexibility of the tree, which will have the same set of tree splits and leaves, although there is one more (redundant) variable in the data.

Functions:
sumDiff <- function(x, y) {
  s <- x + y
  d <- x - y
  return(list(sum = s, diff = d))
}
What do the following return?
1) sumDiff(1, 2)$sum
2) sumDiff(1:3, 1:3)
Note: outside the function, s and d are not defined; the results are accessed through the returned list names sum and diff.

1) 3 (sum= 1+2) 2) $sum: 2 4 6 (1+1, 2+2, 3+3) $diff: 0 0 0 (1-1, 2-2, 3-3)

State three reasons to convert a numeric variable to a factor variable when fitting for a GLM.

1) A factor variable may be more interpretable than numeric. For example, it may be simpler to convey discrete age group comparisons than a general linear effect of age. 2) The variable having a small number of distinct values 3) The variable values merely being numeric labels with no sense of numeric order 4) Converting to factor allows each level to be examined separately. For example, A regularized regression can distinguish which levels of the factor are significant.

Why is OLS not a good choice to model turnout time (Given four plots in task 6 of Tempe test)

1) The constant variance assumption for residuals is not met. The fitted vs. residuals plot shows that variability increases as the fitted values increase, particularly for the negative residuals. An additional observation is that the positive residuals exhibit much greater variation than the negative residuals; this asymmetric spread is an indication of the residuals' right-skewed nature. 2) The Q-Q plot of residuals indicates that they are not normally distributed, violating another OLS assumption; there are too many extreme high values.

Describe two ways impurity measures are used in a classification tree

1) Construction: To decide which split in the decision tree should be made next. 2) Pruning: To decide which branches of the tree to prune back after building a decision tree, by removing branches that don't achieve a defined threshold of impurity reduction through cost-complexity pruning. Future exam task: Which impurity measures are commonly used for the two above? Construction: Gini index and entropy are often used as the splitting criterion because they are more sensitive to node impurity than the classification error rate. Pruning: The classification error rate is often used because of its direct connection to prediction accuracy.

1) ggplot(airplane, aes(x = flight time, y = altitude)) + geom_point(color = "blue") 2) ggplot(airplane, aes(x = flight time, y = altitude, color = "blue")) + geom_point() 3) ggplot(airplane, aes(x = flight time, y = altitude, color = airline)) + geom_point() 4) ggplot(airplane, aes(x = flight time)) + geom_bar(fill = airline) 5) ggplot(airplane, aes(x = flight time)) + geom_bar(aes(fill = airline))

1) Creates a scatterplot with blue points. 2) Creates a scatterplot whose points take the first default ggplot2 hue (a reddish/salmon color) rather than blue: because color = "blue" appears inside aes(), ggplot2 treats "blue" as a constant variable (which it is not), maps it to the color scale, and adds it to the legend. 3) Creates a scatterplot with one color per level of airline shown in the legend (e.g. American and United - made-up example). 4) Does not work! fill = airline outside aes() is evaluated as an object named airline, which does not exist. 5) Does work: the fill aesthetic of geom_bar() is mapped to airline.

1) nrow(df) 2) ncol(df) 3) dim(df)

1) Displays the number of rows 2) Displays the number of columns 3) Displays the number of rows then the number of columns

What are the two applications of PCA?

1) EDA (incl. data visualization) Plot the scores of the 1st PC vs. the scores of the 2nd PC to gain a 2D view of the data in a scatterplot 2) Feature Generation: Replace the original variables by PCs to reduce overfitting and improve prediction performance.

Gini and entropy index calculations 1) Gini 2) Entropy

1) Gini: for a node with class proportions x and 1 - x, G = 1 - x^2 - (1-x)^2. For a split, the overall Gini is the weighted average of the child-node values: G = (n1*G1 + n2*G2) / (total number of observations). After the split: G1 = 1 - 0.96^2 - 0.04^2 = 0.077, G2 = 1 - 0.898^2 - 0.1019^2 = 0.183, G = (4491*0.077 + 2992*0.183) / 7483 = 0.119. 2) Entropy: E = -x*log2(x) - (1-x)*log2(1-x). E1 = -0.96*log2(0.96) - 0.04*log2(0.04) = 0.242
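A quick base-R check of the arithmetic above (node proportions and counts taken from the example):
gini    <- function(p) 1 - p^2 - (1 - p)^2
entropy <- function(p) -p * log2(p) - (1 - p) * log2(1 - p)
g1 <- gini(0.96)                      # ~0.077
g2 <- gini(0.898)                     # ~0.183
g  <- (4491 * g1 + 2992 * g2) / 7483  # weighted Gini after the split, ~0.119
e1 <- entropy(0.96)                   # ~0.242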

Recommend a distinct improvement on each of two high dimensional variables in the ALS data to reduce granularity and likely improve predictive power.

1) I recommend generating a feature from the hour variable that groups hours into interpretable levels, such as morning, afternoon, and evening. Small differences between adjacent hours are not likely to be predictive. 2) I recommend transforming the day-of-month variable into one that contains levels for weekdays, weekends, and holidays. Day of the month would likely lead to overfitting without any increase in predictive power. This reduces the variable from 31 levels down to 3, reducing the potential for OVERFITTING.

Drawbacks of PCA

1) Loss of interpretability: PC's as composite variables can be hard to interpret 2) Not good for non-linearly related variables: PCs rely on linear transformations of variables 3) PCA does dimension reduction, but not feature selection: PCs are constructed from all original features 4) Target variable is ignored: PC is unsupervised

Selecting the value of K by the elbow method

1) Make a plot of the proportion of variance explained = between-cluster variation / total variation against K 2) Choose the "elbow", beyond which the additional proportion of variance explained by increasing K is marginal

Insights from a dendrogram:

1) Similarities between clusters: Clusters joined towards the bottom of the dendrogram are rather similar to one another, while those fused towards the top are rather far apart. 2) Considerations when choosing the no. of clusters. Try to cut the dendrogram at a height such that: The resulting clusters have similar number of observations (balanced) The difference between the height and the next threshold should be large enough -> observations in different clusters have materially different characteristics.

The predicted and actual results used are log(stay + 1).
Model formula                  | Mean Variance | Mean Squared Bias
stay ~ age                     | 0.0010        | 1.5805
stay ~ animal                  | 0.0011        | 1.4856
stay ~ age + animal            | 0.0016        | 1.4786
stay ~ age + animal + mf       | 0.0021        | 1.4792
stay ~ age + animal + black    | 0.0019        | 1.4767
1) Explain what the variance and bias values indicate about the relative quality of predictions when comparing models. 2) Calculate, for the first model listed, the typical errors up or down from the true value due separately to variance and bias for predictions of stay + 1. 3) State two reasons why bias, as calculated here, may not always decrease with additional degrees of freedom, as seen with the model that adds mf. 4) State which predictors should be selected based on the above data, putting them in order from most to least predictive. Explain your selection and ranking.

1) The variance figures indicate how much the predictions vary depending on the training data used. As more predictors are used, the variance increases because the model more precisely fits the training data for each trial and becomes less generalized. The bias figures indicate how close the expected predictions are to the true value of the signal function at the test observations. Generally, as more predictors are used, the bias decreases as more accurate predictions are made. 2) The variance and squared bias figures above are calculated at the level of the linear predictor. For predictions of stay + 1 itself, using the results of the first model, the typical error due to variance is a factor of exp(sqrt(0.0010)) = 1.03 (103%) up or down from the predicted value, and the typical error due to bias is a factor of exp(sqrt(1.5805)) = 3.52 (352%) up or down. 3) The bias-variance phenomena are theoretical, long-term results that hold when the bias and variance are averaged over infinitely many training and test sets. In this task there is only one training set (a 20% random sample of adopted pets that came into AAC before 2020) and one test set with finitely many observations, so the long-term results about bias and variance may not apply exactly. Another reason: in the case of fitting a Poisson GLM, minimizing the squared bias given the training data is not the same as minimizing the objective function. 4) Variance + squared bias: stay ~ age = 1.5815; stay ~ animal = 1.4867; stay ~ age + animal = 1.4802; stay ~ age + animal + mf = 1.4813; stay ~ age + animal + black = 1.4786. To select the best model, the sum of the variance and squared bias should be used. Where this sum is lowered by including a variable, the variable should be included, and the greater the reduction, the more predictive the variable. On its own, animal has a lower sum than age (1.4867 for stay ~ animal versus 1.5815 for stay ~ age). Adding age to animal reduces the sum by 1.4867 - 1.4802 = 0.0065. Adding black reduces the sum by a further 1.4802 - 1.4786 = 0.0016, so black is less important than age. Adding mf increases the sum by 1.4813 - 1.4802 = 0.0011. So from most predictive to least predictive: animal, age, black, mf.

Idea of Principal Components Analysis

1) To transform a set of numeric variables into a smaller set of representative variables (PCs) -> reduce dimension of data 2) Especially useful for highly correlated data -> A few PCs are enough to capture most information

Propose 3 questions for the City of Tempe that will help clarify the business objective - to help get 90% of total responses under 6 minutes

1) What is the practical significance of 90%, which is the goal set by Tempe? How is it determined? 2) Do you have any initial hypothesis or intuition that might explain potential variation in response times? 3) Are there any subject matter experts you would recommend talking to prior to performing the analysis? 4) Over the span of data collection period, were there any notable changes or events that we should know about?

What is a limitation of the drop1 function?

A limitation of the drop1 function is that it drops all levels of the factor variables at once. The drop1 function will not tell us whether levels of vehicle have predictive power if considered individually. This is unless we manually binarize the factor variable in advance.

Describe both the challenge of interpreting a random forest model and a method to identify which predictors from a random forest model we should focus on.

A random forest is difficult to interpret because, unlike a decision tree where the splits and the impact of those splits can be observed, a random forest is made up of the aggregated results of hundreds or thousands of decision trees. Directly observing the component decision trees is generally uninterpretable. Variable importance is a measure of how much a predictor contributes to the overall fit of the model. This can be used to rank which predictors are most important. It is calculated by aggregating across all trees in the random forest the reductions in error that all splits on a selected variable produce. Variable importance cannot be used to draw inference as to what is causing model results, but can identify which variables cause the largest reduction in model error on the training data.
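A minimal sketch, assuming a hypothetical training data frame train with a factor target pass, using the randomForest package:
library(randomForest)
set.seed(1000)
rf <- randomForest(pass ~ ., data = train, ntree = 500, importance = TRUE)
importance(rf)   # importance scores for each predictor
varImpPlot(rf)   # plot ranking the predictors by importance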

Explain for a general audience, what AUC represents

AUC, which stands for "area under the curve" is a number that ranges from 0 to 1 and represents how well a particular classification model distinguishes one outcome from another. The number combines the efficacy of many possible interpretations of the classification model, each interpretation leaning more heavily towards one outcome or another. When comparing the AUC of two models, a higher AUC indicates a better predictive model. As a baseline, a model that randomly guessed one outcome or another based on how common each one is, akin to a coin flip, has an AUC of .5, so only models with AUC above .5 are desirable.

Explain the difference between accuracy and AUC in terms of overall model assessment

Accuracy is measured by the ratio of correct number of predictions to total number of predictions made. It doesn't directly use the modeled probabilities, but rather the classifications based on a fixed cutoff point. AUC measures the area under the ROC curve. It assesses the overall model performance by measuring how true positive rate (TPR) and false positive rate (FPR) trade off across a range of possible classification thresholds. AUC measures performance across the full range of thresholds while accuracy measures performance only at the selected threshold.
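A short illustration, assuming hypothetical vectors actual (0/1 outcomes) and prob (predicted probabilities); roc() and auc() are from the pROC package:
library(pROC)
pred_class <- ifelse(prob > 0.5, 1, 0)    # classifications at one fixed cutoff
accuracy   <- mean(pred_class == actual)  # depends on the chosen cutoff
roc_obj <- roc(actual, prob)              # evaluates TPR/FPR across all cutoffs
auc(roc_obj)                              # area under the ROC curve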

How to justify dropping a predictor when given DF/Deviance/AIC table and AIC is less when removing a variable vs. original model but the Deviance is lower in the original model vs. the model with the dropped variable.

Adding a variable always decreases the deviance, but in the case of predictor x, that decrease is not enough to justify the addition of x degrees of freedom

What advantages does stratified sampling have over simple random sampling?

Advantages: A stratified sample can provide greater precision than a simple random sample of the same size. Because it provides greater precision, a stratified sample often requires a smaller sample, which saves money. A stratified sample can also guard against an "unrepresentative" sample (e.g. an all-male sample from a mixed-gender population). Two disadvantages: It may require more administrative effort than a simple random sample, and the analysis is computationally more complex.

Binarization Advantage/Disadvantage

Advantage: Many feature selection functions treat factor variables as a single feature and either retain the variable with all of its levels or remove the variable completely. Binarization makes it possible to drop individual factor levels that are not significant with respect to the baseline. Disadvantage: Each factor level is considered a separate feature to add or drop, so stepwise selection may take considerably longer to complete.

K-means vs. hierarchical clustering

Is randomization needed: Hierarchical: No K-means: Yes (for initial cluster centers) Is the number of clusters pre-specified? Hierarchical: No (specify the height of the dendrogram later) K-means: Yes (K needs to be specified) Are the clusters nested? Hierarchical: Yes (A hierarchy of clusters) K-means: No

Discuss one advantage and one disadvantage of modeling a numeric target vs. a categorical target (relating to G3 scores vs. pass/fail)

Advantage: Predictive models for G3 retain all info about G3 and will yield an entire estimated distribution for G3, which contains a lot more information than mere pass or fail. A predicted G3 score automatically translates into a pass or fail but not the other way around. Such regression models could be useful to the school if info beyond pass or fail is of interest (such as how certain demographics affect grades, or if the pass mark is changed in the future). Disadvantage: The distributional peculiarities of G3 revealed by the histogram may make it difficult to choose a distribution (in the linear exponential family) that matches the characteristics of G3 in a GLM framework. From a modeling perspective, pass is a more tractable target variable than G3. For example, we can use a logistic regression model for pass.

Applicability and Comparability of: Dimensionality Granularity

Applicability of Dimensionality: Specific to categorical variables. Comparability of Dimensionality: Two categorical variables can always be ordered by dimension. Applicability of Granularity: Applies to both numeric and categorical variables. Comparability of Granularity: Not always possible to order two variables by granularity. Ex (Dimensionality): If one age variable Age1 has 4 bands of age and another Age2 has 3 bands of age, then Age2 has lower dimensionality than Age1. Ex (Granularity): If one age variable Age1 has the exact value of age and another age variable Age2 has four age bins, then Age1 is more granular than Age2. Age1 can be mapped to Age2 but Age2 cannot be mapped to Age1.

Describe the characteristics of the prediction produced by the intercept only GLM model and its ROC curve

Because of the absence of predictors, the intercept-only GLM will assign all observations (whether training or test observations) the same predicted pass probability, namely the pass rate on the training set, regardless of students' characteristics. The predicted class is determined as follows: Case 1) If the common predicted pass probability is greater than the cutoff, then all students will be predicted to pass. With "pass" treated as the positive class, the sensitivity and specificity of the model will be 1 and 0, respectively. Case 2) If the common predicted pass probability is less than the cutoff, then all students will be predicted to fail. The sensitivity and specificity of the model will be 0 and 1, respectively. As a result, the ROC curve of this model (false positive rate on the horizontal axis, sensitivity on the vertical axis) effectively consists of only the two points (1, 1) and (0, 0).

Contrast best subset and stepwise selection for selecting predictors

Best subset: Performed by fitting all p models, where p is the total number of predictors being considered, that contain exactly one predictor and picking the model with smallest deviance, fitting all p choose 2 models that contain exactly 2 predictors and picking the model with lowest deviance and so forth. Then a single best model is selected from the models picked, using a metric such as AIC. In general there are 2^p models that are fit, which is quite a large search as p increases. Stepwise: Alternative to best subset, computationally more efficient, considers a much smaller set of models. For example, forward selection begins with a model containing no predictors, and then adds predictors to the model, one at a time until adding a predictor that leads to a worse model by a measure such as AIC. At each step the predictor that gives the greatest additional improvement to the fit is added to the model. The best model is the one fit just before adding a variable that leads to a decrease in performance. It is not guaranteed that stepwise will find the best possible model out of all 2^p models.

Explain the difference between bias and variance in a predictive analytic context

Bias: The difference between the expected value of fhat(x0) (the prediction at a test point x0, averaged over training sets) and the true value of the signal function f at x0, i.e. Bias = E[fhat(x0)] - f(x0). The more complex a model, the lower the bias. The bias is the part of the test error caused by the model not being flexible enough to capture the underlying signal. The variance of fhat(x0) quantifies the amount by which fhat(x0) would change if we estimated it using a different training set. A more flexible model has higher variance. The variance is the part of the test error caused by the use of an overly complex model. Bias vs. variance is accuracy vs. precision: if the predictions on average lie in the middle, hitting the true signal value approximately, the model has small bias and makes accurate predictions; if the predictions are concentrated in a small region and are very close to one another, the model has small variance and makes precise predictions.
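For reference, the standard decomposition of the expected test error at x0, written in the same plain-text notation: E[(y0 - fhat(x0))^2] = Var(fhat(x0)) + [Bias(fhat(x0))]^2 + Var(irreducible error). Increasing model flexibility lowers the bias term but raises the variance term, which is the trade-off described above.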

Binary (0/1) Variable type.. what is common dist and common link

Binomial and logit link

Describe one limitation of the partial dependence plot

By keeping the variable of interest at a fixed value while averaging the model predictions over the different observed values for the other variables in the data, a partial dependence plot IGNORES THE RELATIONSHIP between the variable of interest and the other variables.

Why changing the random seed affects tree constructed using cost-complexity pruning

Changing the random seed changes the training and validation folds used in cross-validation, which in turn changes the trained models and the calculation of model performance. Changing the seed can result in different pruned trees.

Explain how cluster analysis can be used to develop features for a supervised predictive model?

Cluster analysis can be used to create a factor variable identifying the cluster groups that different observations in the data are assigned to. Using this variable as a predictor in place of the original variables can sidestep issues arising from complicated distributions (extreme skew) and complicated relationships (strong correlations or co-dependence) and may improve prediction performance.

Cluster analysis. Two feature generation methods

Cluster groups: As a new factor variable Cluster means: As a new numeric variable

What does this signal: [ , 12 ]

Column 12

Each type of linkage look in clustering (how does the dendrogram look)

Complete: balanced dendrogram (fusions usually occur at the greatest heights).
Average: balanced (not quite as high as complete).
Single: extended, trailing clusters in which single observations are fused one at a time.
Centroid: may lead to inversions (some later fusions occur at a lower height than an earlier fusion).
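A minimal sketch, assuming a hypothetical numeric data frame df, using base R functions:
d  <- dist(scale(df))                 # Euclidean distances on standardized variables
hc <- hclust(d, method = "complete")  # also try "average", "single", "centroid"
plot(hc)                              # dendrogram; compare shapes across linkages
clusters <- cutree(hc, k = 3)         # cut the dendrogram into 3 clusters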

Determine the test AUC of the GLM intercept only model

When the two points of the ROC curve are connected, we get a 45 degree diagonal line and the AUC is the area of the triangle below this diagonal line, which equals .5. Not related to this subtask but another baseline model that yields an AUC of .5 is the purely random classifier.

Explain how cost-complexity pruning works, including how complexity is optimized.

Cost-complexity pruning involves growing a large tree and then pruning it back by dropping splits that do not reduce the model error by a threshold determined by the complexity parameter. We can use cross-validation to optimize the complexity parameter, which is the process of repeatedly training and testing models on different folds of the data. This is done for different values of the complexity parameter, and the one with the lowest cross-validation error is selected as the optimal choice. We then prune back our trained tree using the complexity parameter selected from cross-validation.
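A minimal sketch, assuming a hypothetical training data frame train with target pass, using the rpart package (which runs the cross-validation over cp internally):
library(rpart)
set.seed(1000)
big_tree <- rpart(pass ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0.001, xval = 10))  # grow a large tree
printcp(big_tree)  # cross-validation error (xerror) at each cp value
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)  # prune back using the optimal cp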

Random forest to predict whether post alarm time is in the highest 10% of the original observations using 3 adjusted datasets: outlining advantages of 3 training sets from April 12th task 11.
DF1: Full dataset that includes hour and day variables (even though month/weekday exist). No oversampling. 11 total predictors. 7,800 obs.
DF2: Excludes hour and day variables. No oversampling. 9 total predictors. 7,800 obs.
DF3: Excludes hour and day variables. Yes oversampling. 9 total predictors. 12,000 obs.

DF1: Includes all predictor variables and observations from the original set. This dataset allows the random forest to look at all potential variables so we can assess which variables are most important for our predictions. DF2: The day/hour variables are factor variables with many levels and may only have slightly more info than the weekday/month variables. Given the random forest model's GREEDINESS with many-level factor variables and the lack of large potential info gain from these variables, removing them will help our random forest avoid OVERFITTING. DF3: Uses oversampling to create a more even balance for our target classes. This should allow the model to focus on accurately predicting observations in the highest 10%, which is the group we are most interested in. The size of the dataset is still relatively small, so computing issues should not impact the oversampled dataset.

The variable day, when used as a categorical variable, is deemed important by the tree-based model. However, day is not actually a significant variable. Explain why a decision tree model may emphasize day, when used as cat variable, despite not being an important variable.

Day of month is coded as a categorical variable with 31 levels. This means the number of ways to split day of month into two groups is very large, making it likely that the tree will find splits that happen to produce info gain for that particular training data. Decision trees tend to create splits on categorical variables with many levels because it is easier to find a split where the info gain is large. However, splitting on these variables will likely lead to OVERFITTING.

Explain descriptive and predictive modeling objectives. Write for a general audience and include an example of how each type of objective could be applied to this business problem

Descriptive modeling project: The primary goal is to understand the relationship between conditions and outcomes. The TEMPE ALS project is primarily a descriptive analysis project since reducing longer response time requires Tempe to take action on the key factors that impact response time. Tempe must understand the relationship between various factors and components of response time to decide how to manage them. Predictive modeling objective: Focuses on the future and addresses what might happen next. In the Tempe project interpretability may not be as important since we are interested in results. Implementing a model to predict response time at the time of the initial ALS call could possibly be used by the dispatch team to help manage communication with the emergency caller or consider alternative stations for response.

Stratified sampling

Divide the underlying population into a number of non-overlapping strata non-randomly, and randomly sample a set number of observations from each stratum. Special cases: oversampling and undersampling.

Systematic sampling

Draw observations according to a set pattern; no random mechanism controlling which observations are sampled. For a population with 100,000 units, we may order them in a well-defined way and sample every tenth observation to get a smaller, more manageable sample of 10,000 units.

Contrast the two methods drop 1 and LASSO, for selecting predictor variables

Drop1 shows the AIC impact from individually removing each predictor variable. The modeler removes the predictor that produces the largest drop in AIC, and then iterates until no more predictors should be removed. Lasso uses a penalty in the optimization function that penalizes large coefficients in the model. As a result, the coefficients are pushed towards zero, and can be set to zero, effectively removing the predictor. The differences include: Drop1 requires the modeler to manually remove the predictor; Lasso automatically removes predictors. Drop1 removes the entire categorical variable; Lasso binarizes categorical variables and can remove individual levels. Drop1 removes one predictor at a time; LASSO assesses all predictors in a single model fitting.
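A minimal sketch of the two approaches, assuming a hypothetical fitted GLM fit on a data frame train with target y; cv.glmnet() is from the glmnet package:
drop1(fit)  # AIC from removing each predictor (whole factor variables) one at a time
library(glmnet)
X <- model.matrix(y ~ ., data = train)[, -1]  # binarizes factors; drop intercept column
cv_fit <- cv.glmnet(X, train$y, alpha = 1)    # alpha = 1 gives the lasso penalty
coef(cv_fit, s = "lambda.min")                # coefficients set to zero = removed features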

Class error rate In general: Ref: - Ref + Pred (-): TN FN Pred (+): FP TP

(FP + FN) / total number of observations = (falsely predicting target class + falsely predicting non-target class) / total number of observations

Explain how binarizing factor variables in advance affects a decision tree

For decision trees, advance binarization is unnecessary because they handle factor variables directly without the use of dummy variables. Manually binarizing factor variables, however, does affect a fitted decision tree because of some unnecessary restrictions imposed on the way tree splits can be made based on factor variables. If the factor variable is left intact, then we can make tree splits by directly separating the levels of the variable into two arbitrary groups. When the variable is transformed into a set of dummy variables and a tree split is made according to these variables, we are only allowed to separate the observations in the data into two groups according to one and only one level of the factor variable. For high-dimensional factor variables, such restrictions may make the fitted tree much less flexible, with a larger bias but a smaller variance. Because the fitted tree may differ, so may the final tree if we do cost-complexity pruning to reduce tree size.

For these two common distributions, target variables have to be strictly positive. Values of zero are not allowed

Gamma and Inverse Gaussian

State domain of distribution function for Gaussian (Normal), Poisson and Gamma Distributions. State the target variable that is appropriate for distribution (FROM TEMPE DATA).

Gaussian: Domain of all real values. A target variable that is appropriate would be the response time Poisson: Domain is non-negative integers. A target variable that is appropriate would be the number of calls in an hour (this could include 0) Gamma: Domain is all positive values. An appropriate target variable could be turnout time plus one second since it is continuous and positive. Note: If non-pos values are removed prior, we could just say turnout time.
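Illustrative glm() family choices matching these distributions (the formulas and the data frame tempe are hypothetical placeholders):
glm(response_time ~ ., data = tempe, family = gaussian())                  # normal target
glm(calls_per_hour ~ ., data = tempe, family = poisson(link = "log"))      # count target
glm(I(turnout_time + 1) ~ ., data = tempe, family = Gamma(link = "log"))   # positive target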

Explain how the Pearson statistic can be used to rank competing models

Given a list of competing models, we can fit each one on the training set, which is a subset of the full data for model construction (or fitting) and calculate its Pearson statistic on the test set, which is the left-out set for model evaluation. This resembles the way our model will be predicting the number of claims in the future for new policyholders not involved in model development. The smaller the Pearson statistic, the more predictive the model. Using the Pearson statistic on the test set as the selection criterion, we will choose the model with the lowest statistic. Note: it is important to point out that the Pearson statistic should be computed on the test set.

Your client is more interested in a more accurate tree-based model but is concerned about the model variance. Recommend whether to use a random forest or a gradient boosting machine given client's concern.

Given variance concerns, I recommend a random forest, which tends to do well in reducing variance while having a similar bias to that of a basic tree model. The variance reduction arises from combining many trees and from sampling of the data (bagging). Both practices hinder overfitting to the idiosyncrasies of the training data and hence keep variance low. Gradient boosting machines use the same underlying training data at each step. This is very effective at reducing bias but is very sensitive to the training data (high variance).

(Given four plots in task 6 of Tempe test) Recommend a transformation to turnout time that will improve the residuals when fitting an OLS model.

I recommend a log transformation to turnout time. The transformation will shrink the large values relative to the smaller values. This should reduce the phenomenon where the residuals grew in variability as the fitted values increased in the OLS using the transformed target compared to those of the OLS on the untransformed target variable. In doing this, a small positive value should be added to turnout time to make the log operation feasible.

All models have mtry value of 3. Nodesize value of 1. Recommend an adjustment to either mtry or nodesize to address the decline in AUC between train and test datasets..

I recommend increasing nodesize, the minimum number of observations contained in a terminal node. It had been set to 1, which is the default for classification trees. Increasing nodesize will reduce complexity by limiting tree growth when the data on which a split is based becomes sparse. Reducing complexity will reduce the overfitting that is causing the decline in AUC. FYI: mtry has a less direct relationship with tree complexity and you would have a harder time justifying why reducing mtry limits the tree-growing process.

Regularization Idea: How it works: Common forms of penalty term: Two hyperparameters:

Idea: An alternative approach to reducing the complexity of a GLM and preventing overfitting.
How it works: Add a penalty term that reflects the size of the coefficients to the training deviance of the GLM and minimize this penalized objective function to get the coefficient estimates. The objective function balances the goodness of fit of the GLM on the training data with the complexity of the model. The regularization penalty has the effect of shrinking the magnitude of the coefficient estimates of features, especially those with limited predictive power. Regularization has the following benefits: Predictability - the reduction in model complexity increases the squared bias but decreases the variance of the predictions; provided that the drop in variance outweighs the rise in bias, there will be an improvement in prediction performance. Interpretability - in some cases (e.g. lasso) the effect of regularization is so strong that the coefficient estimates of some features are forced to be exactly zero, effectively removing those features and making the resulting model simpler and easier to interpret.
Common forms of penalty term:
Lasso: lambda * sum(j = 1 to p) of |Bj| - some coefficients may become exactly zero.
Ridge regression: lambda * sum(j = 1 to p) of Bj^2 - no coefficients are reduced exactly to zero.
Elastic net: alpha * (lasso penalty) + (1 - alpha) * (ridge penalty) - some coefficients may become exactly zero when alpha > 0.
Two hyperparameters:
1) Lambda: the regularization (a.k.a. shrinkage) parameter. Controls the amount of regularization; as lambda increases, complexity decreases, so bias^2 increases and variance decreases. Typically tuned by CV: choose the lambda with the smallest CV error. Feature selection: for lasso and elastic net with alpha > 0, some coefficient estimates become exactly zero when lambda is large enough.
2) Alpha: the mixing parameter. Controls the mix between ridge (alpha = 0) and lasso (alpha = 1). Cannot be tuned by cv.glmnet(); may use trial and error.
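A minimal sketch of the tuning, assuming a design matrix X from model.matrix() and a numeric target y (hypothetical, as in the lasso sketch earlier): lambda is tuned inside cv.glmnet(), while alpha is tried by hand.
library(glmnet)
for (a in c(0, 0.25, 0.5, 0.75, 1)) {            # trial-and-error values of alpha
  cv <- cv.glmnet(X, y, alpha = a, nfolds = 10)  # CV over a path of lambda values
  cat("alpha =", a,
      "lambda.min =", round(cv$lambda.min, 4),
      "CV error =", round(min(cv$cvm), 4), "\n")
}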

Boosting: What is the Idea? What do Key parameters do? eta: nrounds:

Idea: In each iteration, fit a tree to the residuals of the preceding tree and a scaled-down version of the current tree's predictions is added to previous predictions. Each tree focuses on observations the previous tree predicted poorly. Eta: Learning rate (shrinkage) parameter. Higher eta, algorithm converges faster but more prone to overfitting. Rule of thumb is to set relatively small value. Nrounds: Max # rounds in the tree construction process
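A minimal sketch of where eta and nrounds enter, assuming a hypothetical numeric predictor matrix X and target y, using the xgboost package:
library(xgboost)
dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)
bst <- xgb.train(params = list(objective = "reg:squarederror",
                               eta = 0.05,   # small learning rate (shrinkage)
                               max_depth = 3),
                 data = dtrain,
                 nrounds = 500)               # maximum number of boosting rounds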

Describe the modeling impacts of the factor conversions when running a decision tree

If the variables remain as numeric, then tree splits will be made in a way that respects the ordered nature of the variable values. In other words, they must be in the form Variable <= Cutoff in one branch and Variable > Cutoff in another branch. If the variables are converted into factors, then the restriction above will be lifted and tree splits can be made in whatever way to separate the factor levels into two groups (e.g. a split with NCD = 20, 40, or 50 in one branch and NCD = 0, 10, or 30 in another branch is permissible). This has the potential of improving prediction performance, at the cost of a higher risk of overfitting and heavier computational burden (the # of possible splits to make will increase substantially)

Explain how measures of impurity are related to info gain in a decision tree.

Info gain is the decrease in impurity created by a decision tree split. At each node of the decision tree, the model selects the split that results in the most info gain. Therefore the choice of impurity measure (Gini or entropy) directly impacts the info gain calculations.

K-means. How it works:

Initialization: For a fixed K, randomly select K points in the feature space as initial cluster centers Iteration: 1) Assign each observation to the cluster with the closest center 2) Recalculate the K cluster centers (hence K-means) 3) Repeat until the cluster assignments no longer change
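A minimal sketch, assuming a hypothetical numeric data frame df, using base R's kmeans():
set.seed(1000)
km <- kmeans(scale(df), centers = 4, nstart = 25)  # 25 random starts; the best is kept
km$cluster               # cluster assignment for each observation
km$centers               # the K cluster centers
km$betweenss / km$totss  # proportion of variance explained (used in the elbow method)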

GLMs Pros and Cons

Pros: 1) Target distribution: Excel in accommodating a wide variety of distributions for the target variable. 2) Interpretability: The model equation clearly shows how the target mean depends on features; coefficients = interpretable measure of directional effect of features 3) Implementation: Simple to implement. Cons: 1) Complex relationships: Unable to capture non-linear (e.g. polynomial) or non-additive relationships (e.g. interaction) unless additional features are manually incorporated 2) Interpretability: For some link functions (e.g. the inverse link) the coefficients may be difficult to interpret.

Random sampling

Randomly draw observations from the underlying population without replacement. Each record is equally likely to be sampled.

Describe the curse of dimensionality and how it can lead to problems in a GLM

Refers to so-called "small n, large p" settings: the number of explanatory variables, or the number of levels in explanatory factor variables, is large compared to the number of observations. Having many variables or variable levels can result in model complexity that is greater than that of the underlying process being modeled.

What does this signal: [ 182, ]

Row 182

If the correlation matrix shows correlations above 0.8 (or below -0.8) among, say, 3 predictors, explain why basing the result on just one of those predictors is sensible.

Since they are all strongly correlated, they move together in the same direction. Using just one of these variables is acceptable because it will not result in much loss of information.

Density Plot

Smoothed and scaled versions of histograms, displaying density rather than counts on the vertical axis; the area under the density curve is always 1.

Why should we set nstart to a large integer for K-means?

Solution of the algorithm depends on the randomly selected initial cluster centers. Running the algorithm multiple times improves the chances of finding a better local optimum.

What does scale() do? (Hierarchical clustering question)

Standardizes the variables so that they all have unit standard deviation before clustering is done. This is important as (hierarchical) clustering in its default form relies on the calculations of Euclidean distances, which depend very much on the scale of the variable values. Variables of a much larger order of magnitude will dominate the distance calculations and exert a disproportionate impact on the cluster arrangements. This is not the case with standardization

Describe strengths and weaknesses of a correlation matrix as a bivariate data exploration tool.

Strength: A correlation matrix provides a convenient way to summarize the strength of the linear relationship between two numeric variables, one pair at a time, by a set of metrics (the correlations), ranging from -1 to +1. Entries of the matrix that are close to +1 or -1 indicate strongly linearly related variables. Weakness: A correlation matrix only captures linear relationships. Two numeric variables that have a nearly zero correlation can be related in many other regular ways (quadratic) A correlation matrix only captures the linear relationship between two numeric variables at a time. It may fail to represent relationships that exist among a group of numeric variables. A correlation matrix only works for numeric variables, not categorical (factor) variables

Specificity In general: Ref: - Ref + Pred (-): TN FN Pred (+): FP TP

TN / (TN + FP) = correctly predicting non-target class / (correctly predicting non-target class + falsely predicting target class)

Accuracy In general (Class of interest is positive class): Ref: - Ref + Pred (-): TN FN Pred (+): FP TP

(TN + TP) / total number of observations = (correctly predicting non-target class + correctly predicting target class) / total number of observations

Sensitivity In general: Ref: - Ref + Pred (-): TN FN Pred (+): FP TP

TP / (TP + FN) = correctly predicting target class / (correctly predicting target class + falsely predicting non-target class)

April 12 Exam Task 11 (RANDOM FOREST BASED): Explain what led to large difference in AUC values between the train and test datasets. (Train was .9 and Test was .55)

The AUC values show a significant decline in predictive power from the training dataset to the test dataset. This indicates that the models are overfit to noise in the training datasets. The likely cause is that the chosen hyperparameters create very deep trees, and therefore the hyperparameters need to be adjusted to create simpler trees which are less prone to overfitting. It is also possible that the predictors used to construct the random forests have weak predictive power.

Explain the rationale behind the Pearson statistic

The Pearson statistic is a scaled measure of how far the observed values of the target variable differ from the predicted values on a set of data (training or test sets). It is defined as the sum of the squared discrepancy between the observed and predicted values scaled by the predicted value for each observation. The smaller the squared discrepancies, the smaller the value of the statistic. The division by the predicted values serves to reduce the impact of the disproportionately large values of the target variable, which commonly arise for right-skewed target variables, on the statistic.
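In symbols, matching the description above (with y_i the observed value and mu_i the predicted value for observation i): Pearson statistic = sum over i of (y_i - mu_i)^2 / mu_i, computed on the test set when used for model comparison.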

Explain what is meant by "Variable Importance" in the plot above, and what the zero variable importance of AgeCat indicates

The importance of a variable in the random forest is the total drop in node impurity (Gini index here for class trees) due to splits over the variable, averaged over all the base trees. The fact that AgeCat has an exactly zero variable importance score means that it is not used in any of the splits in any of the base trees. This shows that AgeCat is of negligible importance in the presence of other predictors in the data.

Scaling of variables matters for both PCA and clustering

Without scaling: Variables with a large order of magnitude will dominate variance and distance calculations -> have a disproportionate effect on PC loadings & cluster groups With scaling (generally recommended): All variables are on the same scale and share the same degree of importance.

Explain how the intercept-only GLM can be used to assess the prediction performance of other models?

The intercept-only Generalized Linear Model (GLM) is a simple model that uses only the intercept term, without any predictors, to make predictions. It serves as a baseline model that can be used to assess the prediction performance of other more complex models. One common approach to assess the prediction performance of other models is to compare their performance to that of the intercept-only GLM. The idea is that if a more complex model cannot outperform the simple intercept-only GLM, then it may not be adding any significant predictive value beyond what can be achieved with just the intercept term.

Explain how changing the link function in the GLM impacts the model fitting and how this can impact predictor significance

The link function specifies a functional relationship between the linear predictor and the mean of the distribution of the outcome conditional on the predictor variables. Different link functions have different shapes and can therefore fit to different nonlinear relationships between the predictors and the target variable. For example: If predictor variables have very linear relationships to the mean, a link function that preserves that linearity (like the identity link function) should provide a better model fit than a link function that creates a more nonlinear, curved relationship to the mean. When the link function matches the relationship of a predictor variable, the mean of the outcome distribution (the prediction) will generally be closer to the actual values for the target variable, resulting in smaller residuals and more significant p-values.

Your assistant is curious about how the tree would change if Exp_weights was replaced by its logarithmic counterpart log(Exp_weights). Explain how, if at all, the tree would change.

The tree would not change. Tree splits based on Exp weights depend on its observed values only through their ranks but not their exact values. As a strictly increasing function, the log transformation will preserve the ranks of the observed values of Exp weights so the resulting tree would remain unchanged. (This tests the idea that monotone transformations of a numeric predictor will not alter a decision tree)

Number of PCs (M) to use

Trade off: As M increases... 1) Cumulative PVE (proportion of variance explained) increases 2) Dimension increases 3) (If y exists) model complexity increase We can choose M by: Scree plot: Choose # such that the cumulative PVE is high enough CV: Treat M as a hyperparameter to be tuned if y exists

Describe handling of cat variables in linear models and tree-based models.

Tree-based: Decision trees split the levels of categorical variables into groups. The more levels, the more potential splits. A decision tree identifies which variables to split, and into which groups, based on maximizing info gain. Decision trees naturally allow for interactions between categorical variables based on how the tree is created. For instance, a leaf node could have two or more parent nodes that split based on categorical variables, which would represent interactions of those variables. A tree may also split on the same variable more than once. Linear models: Linear models fit a coefficient for each level of a categorical variable except the base level. The coefficient for each level represents its impact relative to the base level. Categorical variables are automatically binarized to form dummy variables representing each non-baseline level. These dummy variables, equal to either 0 or 1, are numeric variables, which can then enter the model equation.

Positive, continuous with a large mass at zero (what is common dist and common link)

Tweedie distribution with log link

Explain how under sampling and oversampling work to make an unbalanced dataset more balanced.

Under sampling- produces roughly balanced data by drawing fewer observations from the negative class (which is under sampled) and retaining all of the positive class observations. With relatively more data from the positive class (in proportion), this will improve the classifier's ability to pick up the signal leading to the positive class. The drawback is that the classifier, now based on less data and therefore less information about the negative class, could be less robust and more prone to overfitting. Oversampling: An alternative to under sampling, keeps all the original data, but oversamples (with replacement) the positive class to reduce the imbalance between the two classes. Some of the positive class observations will appear more than once in the balanced data.
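A minimal sketch, assuming a hypothetical predictor data frame X_train and factor target y_train, using the caret package's helpers (applied to the training data only, per the earlier card):
library(caret)
up_train   <- upSample(x = X_train, y = y_train)    # oversample the minority class with replacement
down_train <- downSample(x = X_train, y = y_train)  # drop observations from the majority class
# both return a data frame of the predictors plus a balanced target column named "Class"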

Explain one reason for oversampling opposed to under sampling and vice versa.

Use oversampling to retain all of the original information (no observations are discarded); use undersampling to ease the computational burden and reduce run time when the training data is excessively large.

Two interpretation tools for Random Forest/boosted trees: Variable importance plot, Partial dependence plot

VIP: The total drop in node impurity (RSS for regression trees and Gini index for classification trees) due to splits over a given predictor, averaged over all base trees. Variables with larger importance scores are more important (but it is unclear how they affect the target variable; the score doesn't tell us whether that contribution is positive, negative, or follows a more complex relationship). PDP: The model prediction obtained after averaging over the values/levels of the variables not of interest. For a categorical target variable, the predictions are on a logit scale, meaning that what is shown on the vertical axis of a PDP is ln(p/(1-p)). In the study manual it shows a downward parabolic pattern for age: the odds that a randomly selected worker is a low earner decrease up to about age 50, beyond which they increase.

Explain how to use a time variable to make the training/test split and the advantage of doing so

We can make the training/test split using a time variable by putting the older observations in the training set and the most recent observations in the test set. The advantage is that this mimics how the model will actually be used: it is built on past data and deployed to predict future, unseen data, so the test-set evaluation gives a more realistic estimate of prediction performance. We need to set aside enough recent data in the test set so that the evaluation on unseen observations is reliable.

Explain why the option sampling = "down" became less essential if a gradient boosting machine was used in place of a random forest

Working on the principle of sequential learning, a GBM addresses the unbalanced nature of Count by iteratively adjusting the training data to place more emphasis on the data points that it is predicting poorly (those with at least one claim here) in each iteration of the algorithm. With a focus on improving the bias of predictions, a GBM will do well even without the use of under sampling.

Explain the process of how to tune a hyperparameter What about for eta?

You can tune a hyperparameter by cross-validation: vary the hyperparameter across a range of possible values and perform cross-validation at each value. Performance is then determined based on a cross-validation metric such as AUC, and the hyperparameter value with the best performance on this metric is selected. For eta, a series of reasonable values would be chosen beforehand. Then, for each value of eta, cross-validation would be performed with each model fitting run using the same eta. The result is one average test metric from each cross-validation for each value of eta. The value of eta with the superior test metric, such as AUC, would be chosen for subsequent predictive modeling work.

Remove NA's (use dat as the data frame and remove rows with NA in Q1). Note: complete.cases() keeps only rows with no NA in any column; na.omit() does the same thing.

dat<-dat[!is.na(dat$Q1) , ]
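For comparison, on the same hypothetical data frame dat, the two alternatives mentioned in the note drop rows with an NA in any column rather than in Q1 only:
dat_q1   <- dat[!is.na(dat$Q1), ]      # drops rows with NA in Q1 only
dat_all  <- dat[complete.cases(dat), ] # drops rows with NA in any column
dat_all2 <- na.omit(dat)               # same result as the complete.cases() filter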

Lists in R:
l <- list(p = "This is a list", q = 4:6, r = matrix(1:9, nrow = 3), s = data.frame(x = c(1, 4, 9)))
Output:
$p: "This is a list"
$q: 4 5 6
$r: a 3x3 matrix with rows 1 4 7 / 2 5 8 / 3 6 9
$s: a data frame with column x = 1, 4, 9

l[[1]] returns "This is a list" (the first component). l[["r"]] returns the third component (named r) of the list by name; shows the same result as l$r. l[[2]][3] returns 6 (the third element of the second component of the list, from $q). l$s returns the same as l[["s"]].

Single Trees Pros and Cons

pros: 1) Interpretability: If there are not too many buckets, tree easy to interpret because of the if/then nature of the classification rules and their graphical representation 2) Complex relationships: Excel in handling non-linear and non-additive relationships without the need to insert extra features manually 3) Categorical variables: Cat predictors are automatically handled without the need for binarization or selecting the baseline level. 4) Variable selection: Variables are automatically selected as part of the model building process. Variables that do not appear in the tree are filtered out and the most important variables show up at the top of the tree. Cons: 1) Overfitting: Strongly dependent on training data (prone to overfitting) -> predictions unstable with a high variance -> lower user confidence. 2) Numeric variables: Usually need to split a numeric predictor repeatedly to capture its effect effectively -> tree becomes large, difficult to interpret 3) Categorical variables: Tend to favor cat predictors with large number of levels. Reason: Too many ways to split -> easy to find an artificial split that looks good on training data, but doesn't really exist in the signal.

What does this signal: df[ 3, 2:3]

third row, columns 2 through 3

