Exam PA - Vocab and responses

How do trees make splits? (Answer generally, to encompass both regression and classification)

"Impurity" is a general enough term similar to "error" (at least according to the SOA). Minimizing impurity = maximizing information gain Regression trees make splits that produce the highest reduction of SSE, and classification trees for the highest reduction of node impurity (highest reduction of Gini index or entropy). The regression tree makes a best split to minimize the SSE; for classification tree, the best split is targeted to maximize the information gain in Gini index and Entropy. To be general, trees choose splits that result in the highest reduction of error or highest reduction of node impurity

What are the 2 additional MLR concerns? (not the 5 assumptions) What is the impact?

1. Outliers - inflate SSE and RSE, and can sometimes undermine the credibility of hypothesis tests 2. High dimensions - the model will likely overfit

Provide 2 limitations of a decision tree

1. Trees use recursive binary splitting. They make greedy splits based on the largest information gain at each immediate step, not necessarily producing the best-fitting overall model. 2. Trees are not robust: they are highly sensitive to the training data and can change a lot if the data changes.

For GLM: What happens if you choose a clearly incorrect distribution? What happens if you choose a bad link function?

Bad distribution --> the holistic view of what the model claims to represent is not consistent with reality. Bad link --> the model permits predictions that don't make sense, such as negative predictions for a count target.

Define / explain bootstrapping

Bootstrapping creates many bootstrap samples. The training set has n observations; sample from the training set WITH replacement until you have n observations, and that is one bootstrap sample. Repeat for as many bootstrap samples as needed.
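
A minimal R sketch of building one bootstrap sample; the training set here is a made-up placeholder:
set.seed(42)
train <- data.frame(x = rnorm(10), y = rnorm(10))              # placeholder training set
n <- nrow(train)
boot_sample <- train[sample(n, size = n, replace = TRUE), ]    # one bootstrap sample of size n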

Definition of dimensionality and granularity

Dimensionality - the number of variables in a dataset or the number of levels in a factor. Granularity - how specifically a variable is recorded (can mean more specific levels rather than more levels).

How to count number of predictors in a regression

For glm(), the difference between the null df and the residual df values in summary() gives the number of predictors (the intercept is not included). For glmnet(), the header of coef() shows the number of rows; e.g., 74 rows = 73 predictors + 1 intercept.
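
A hedged R sketch using an illustrative synthetic dataset (dat, x1, x2 are made-up names):
set.seed(1)
dat <- data.frame(y = rpois(100, 2), x1 = rnorm(100), x2 = rnorm(100))
fit <- glm(y ~ x1 + x2, data = dat, family = poisson())
fit$df.null - fit$df.residual          # 2 predictors (intercept not counted)
library(glmnet)
X <- model.matrix(y ~ x1 + x2, data = dat)[, -1]
fit_net <- glmnet(X, dat$y, family = "poisson")
nrow(coef(fit_net))                    # 3 rows = 2 predictors + 1 intercept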

Another word for dummy variables Remember what?

Indicator variables. Each factor level coefficient represents the difference in the prediction relative to a base level of the factor variable

Relationship between probability and odds

Predicted probability: mu = odds / (1 + odds); odds = mu / (1 - mu)
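
A quick R check of the relationship (the values are illustrative):
mu   <- 0.8
odds <- mu / (1 - mu)    # 4
odds / (1 + odds)        # back to 0.8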

What is an OLS regression?

Same as MLR

What is the regularization term in shrinkage methods?

Same as penalty term

Define the 3 data types: Structured, unstructured, semi-structured

Structured data - suitable for tabular form. Unstructured - not suitable for tabular form (e.g., an audio recording). Semi-structured - elements of both; e.g., survey data can have multiple-choice demographic questions and open-answer prompts.

(a) (2 points) Explain the purpose of the hyperparameter interaction.depth and whether it is typically large or small in value.

The hyperparameter controls the maximum depth of each tree built for the boosted model. Its value is typically small since each tree is intended to be small, as they represent weak learners to be aggregated to obtain the boosted model's prediction.

If you fit a Gamma or Inverse Gaussian distribution, what do you need to remember?

These distributions only have a domain > 0. You must add 1 (or another small positive constant) to the target variable if the target has values of 0.

How to interpret a coefficient in a GLM with weights?

To interpret a coefficient in a model with weights: "For every unit increase in x1, the predicted average price of a house per room is expected to increase by b1"

Key phrase "for this project" means what?

You need to consider what is valued or has priority for this project, e.g., interpretability vs. accuracy.

Outlier vs leverage vs influence point

- An outlier is a data point whose response y does not follow the general trend of the rest of the data.
- A leverage point is an observation that has an unusual predictor value (very different from the bulk of the observations).
- An influence point is an observation whose removal from the dataset would cause a large change in the estimated regression coefficients.
- A leverage point may have no influence if the observation lies close to the regression line.
- A point has to have at least some leverage in order to be influential.

Reasons to combine levels

- Similar distributions (mostly the central tendency, like the median/mean. Do not combine based on sums/frequencies; those are not helpful for impact on the target)
- By necessity: similar distributions AND low frequency count. Remember that a low count prevents us from knowing the distribution reliably enough to trust its mean/median, so it is up to your judgment
- Wanting to change the interpretation of the reference level
- Some levels aren't significant (so potentially combine these into the reference level if the interpretation makes sense when combined)
- Similarly defined levels. For example, it might be less helpful/intuitive to keep crimson and maroon as separate levels of a color factor.

How can you tell this tree may not be the best model for this dataset? Based on cp table or cp plot or tree diagram

1. CV chooses the smallest cp out of all the ones tested --> we can speculate that an even larger, more flexible tree might produce better predictions, but then there are concerns with a tree that is too big: it can't be visualized well, loses interpretability, and may overfit.
2. Consistent but slow decrease in CV error across the cp values tested (a very flat cp plot) --> there may be a continuous relationship with a key numeric predictor(s), since the tree keeps making more splits that lead to only marginal improvements.
3. We see a continuous relationship with a predictor in the tree (multiple splits on the same numeric predictor; this can appear horizontally across the tree as well).
4. The tree uses too few variables for a problem that focuses on accuracy. This leads us to question the reasonableness of a prediction that relies on only a small handful of predictors; prediction accuracy is likely jeopardized.

What are the 5 MLR assumptions and what happens if violated? Which ones are residual related?

1. Mean of the errors = zero
2. Variance of the errors is constant (homoskedasticity)
3. Independent errors
4. Normally distributed errors
5. x_j is not a linear combination of the other x's (no collinearity)
(The first 4 are the residual-related concerns.) If violated, MLR outputs are not valid/reliable in any sense.

Benefits of shrinkage or regularization generally

1. Overall it is good when the model has too much flexibility (is overfitted).
2. Can handle high dimensions (although ridge is the weakest of the 3 at this because it is unable to drop predictors; many of the predictors would be 'practically dropped' when their coefficients have near-0 estimates). Regularization will 'sift out' the many predictors that lack predictive information by making their coefficient estimates 0 or near 0; only some of the predictors should have meaningful estimates that are far enough from 0.
3. Can help with collinearity (although ridge is the weakest of the 3 at this because it is unable to drop predictors).

Benefits of each type of regularization (SOA)

1. Ridge or elastic net: reduces variance by shrinking coefficients.
2. Elastic net: reduces variance by shrinking coefficients, can also be used to perform variable selection, and is helpful where there is high-dimensional data with few data points. Maggie addition: it is a combination of lasso and ridge and is less likely to drop predictors than lasso when alpha < 1.
3. Lasso or elastic net: reduces variance by shrinking coefficients and can also be used to perform variable selection and remove nonpredictive variables.

Define these key hyperparameters for xgboost() 1. nrounds 2. max_depth 3. eta 4. gamma 5. subsample 6. colsample_bytree

1. nrounds: number of trees
2. max_depth: maximum depth of each tree
3. eta: shrinkage parameter (learning rate)
4. gamma: minimum reduction of the splitting measure required to split, like cp
5. subsample: portion of observations used for each tree (uses only a sample, e.g. 60% of training observations, at each iteration)
6. colsample_bytree: portion of predictors used for each tree (not each split)
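
A hedged sketch of how these arguments might appear in an xgboost() call; the data and hyperparameter values are placeholders, not a recommendation:
library(xgboost)
X_train <- matrix(rnorm(200), ncol = 2)   # placeholder numeric predictor matrix
y_train <- rnorm(100)                     # placeholder numeric target
fit_xgb <- xgboost(
  data = X_train, label = y_train,
  nrounds = 200,              # number of trees
  max_depth = 3,              # maximum depth of each tree
  eta = 0.05,                 # shrinkage / learning rate
  gamma = 0,                  # minimum reduction of the splitting measure to split
  subsample = 0.6,            # fraction of training observations per tree
  colsample_bytree = 0.8,     # fraction of predictors per tree
  objective = "reg:squarederror",
  verbose = 0
)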

Define these key hyperparameters for randomForest(): 1. ntree 2. mtry 3. nodesize 4. maxnodes

1. ntree: number of trees to fit
2. mtry: number of candidate predictors considered for each split
3. nodesize: minimum number of observations permitted in a terminal node
4. maxnodes: maximum number of terminal nodes
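
A hedged sketch of a randomForest() call; the data frame and hyperparameter values are placeholders:
library(randomForest)
set.seed(100)
train <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))  # placeholder data
fit_rf <- randomForest(
  y ~ ., data = train,
  ntree = 500,       # number of trees to fit
  mtry = 2,          # candidate predictors considered at each split
  nodesize = 5,      # minimum observations in a terminal node
  maxnodes = 50      # maximum number of terminal nodes
)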

What does median mean?

50th percentile. Half of the data is at or below that value, and half of the data is above that value.

Def of high dimensions. What happens in HD? What do you avoid

A dataset is considered high-dimensional when it has too many predictors relative to the number of observations. Overfitting is likely in this situation (a very high p makes a very flexible model). It is best if n >> p. The issues of high dimensions are rooted in the curse of dimensionality: as the number of predictors increases, more observations are needed to retain the same wealth of information, or else that wealth becomes more and more diluted. Data becomes sparse in high dimensions. Avoid models with high flexibility caused by many possible predictors.

When to log transform the target and when not

A log transformation is about narrowing the range of the original values, thus changing the distribution of the values. Depending on the original distribution, the transformation's impact/usefulness will vary. The mere presence of 0's is normally not going to matter much, because that is easily circumvented by adding a constant first. What matters more is the frequency of specific values, including 0. If the original distribution is particularly wonky in some way (e.g. consists of a majority of 0's), then what a log transformation can do for it is probably limited.

OOB - Out of Bag Predictions for Random Forest

A random forest does bootstrapping to create many samples and fits a tree to each one. An OOB observation is an observation from the original dataset that was not in the bootstrap sample used to train a particular tree. So when we call predict() on the fitted forest without new data, it looks at all the trees for which each observation is OOB and averages those trees' predictions.
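
A hedged sketch of the two prediction modes; fit_rf and test are assumed to exist already:
library(randomForest)
predict(fit_rf)                    # no newdata supplied: returns OOB predictions
predict(fit_rf, newdata = test)    # ordinary predictions that use all trees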

Random forest definition. What are the individual trees typically like?

A random forest fits a decision tree to each bootstrap sample (the trees can be fit in parallel). To produce a prediction, the model aggregates the predictions from each tree: for regression trees, "aggregate" means taking their average; for classification trees, it means selecting the most frequent class. The individual trees are typically large with no pruning, having high variance and low bias. This is permissible since aggregating the trees reduces the high variance; even so, it does not mean a random forest is unable to overfit. A random forest considers a random subset of predictors at each split to minimize the risk of correlated predictions, reducing the variance a step further. It is a special case of bagging. A random forest/bagging generally has smaller variance than a single decision tree (and a random forest will have smaller variance than bagging). (Technically this is relative to one of its individual trees, which are typically large, but you can just say single decision tree and get away with it.)

What is the impact of a right skewed predictor on the model?

A single coefficient for Mortgage will simply enforce a consistent pattern/relation across its entire range of values, which may not make as much sense when very high values are far from the bulk of the data. The large values of a right-skewed variable would have high leverage. Coefficient estimates are usually sensitive to high leverage points, so these estimates are unreliable (or misleading)

For x1 = mortgage, why is a single coefficient not sufficient in a GLM?

A single coefficient for Mortgage will simply enforce a consistent pattern/relation across its entire range of values. The degree that we think that rigidity is not very reasonable (which could happen, in severe cases of right skewness) is the degree we think there might be issues. Acknowledging that Mortgage might have two distinct effects as mentioned (having a mortgage vs mortgage value) is in support of that.

Watch out for this phrasing: "Recommend two different adjustments to the dataset prior to modeling that involve the new variable."

According to CA, adjustments to the dataset means to use the most recent dataset. If FinanceSavvy was just added to the dataset, then keeping it is not an "adjustment to the dataset" b/c that's leaving the dataset as is. Dropping it is an adjustment to the dataset.

Question asks for "adequacy of the variables" vs "adequacy of the scope of the variables"

Adequacy of the variables is about the variables themselves, e.g. does the variable have issues such that it should be modified. Adequacy of the scope of the variables is about the variety of the variables, e.g. do we have enough variables and the appropriate ones for the project.

(a) (3 points) Explain what a cutoff is and how it is involved in calculating the AUC.

After a model produces predicted probabilities, we choose a cutoff to obtain predictions of positive and negative: all predicted probabilities above the cutoff are predicted as positive, while those below the cutoff are predicted as negative. If the cutoff is very high, most observations will be predicted negative, likely producing a high specificity but low sensitivity; setting the cutoff low likely produces high sensitivity but low specificity. In short, the cutoff dictates the values of sensitivity and specificity. By plotting every possible pair of sensitivity and specificity values obtained by changing the cutoff, the result is an ROC curve. The points are connected from the bottom-left of the plot, where sensitivity is 0 and specificity is 1 (predicting all observations negative), to the top-right of the plot, where sensitivity is 1 and specificity is 0 (predicting all observations positive). The area under the ROC curve is the AUC.
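
A hedged sketch with the pROC package; the probability and class vectors are placeholders:
library(pROC)
probs  <- c(0.9, 0.2, 0.7, 0.4, 0.1, 0.8)   # placeholder predicted probabilities
actual <- c(1, 0, 1, 1, 0, 0)               # placeholder observed classes
pred_class <- ifelse(probs > 0.5, 1, 0)     # apply a single cutoff of 0.5
roc_obj <- roc(actual, probs)               # ROC curve sweeps over all possible cutoffs
auc(roc_obj)                                # area under the ROC curve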

Coefficients - include what

Also includes the INTERCEPT

(a) (1 point) Explain what an AUC near the value of 0.5 would imply.

An AUC near 0.5 results from an ROC curve that follows closely along the diagonal line connecting the bottom-left starting point to the top-right ending point. This occurs when a model generates probabilities with similar distributions for both the positive and negative observations; clearly, such performance does not describe a good model. Such a model alternates frequently between increasing sensitivity (the ROC curve going up) and decreasing specificity (going right) as the cutoff decreases, which happens when 0's and 1's frequently alternate when sorted by their predicted probabilities, hence similarly distributed predicted probabilities. An old SOA exam emphasized that an AUC of 0.5 does not simply mean the model is correct about 50% of the time; but essentially, yes, it denotes a weak (though not abject) model.

What is an advantage of backward stepwise selection?

An advantage of backward selection is that it maximizes the potential of finding complementing predictors. Especially when used with AIC, it tends to result in more predictors. Remember to note that backward selection starts from the given model with all predictors.

(a) (2 points) State one advantage and one disadvantage that decision trees have regarding interactions.

An advantage of decision trees is their ability to detect interactions without specifying them in advance. A disadvantage of decision trees is that interactions are not always easy to find in the tree. (Especially when a tree is very 'balanced' in shape; you have to pay attention to where the splits occur (with which predictors) and how that trickles down to the predictions below. Almost none of the tree can be ignored.)

What is an advantage of forward stepwise selection?

An advantage of forward selection is that it works well in high-dimensional settings. Especially when used with BIC, it tends to result in fewer predictors. Remember to note that forward selection starts from the null model with no predictors.

Define an interaction

An interaction is when the effect of a predictor on the target depends on the value of another predictor. An interaction introduces a source of dependence between predictors in a model, so that predictors may have a joint influence on the target. An interaction effect is when the target variable has a relationship with a combination of input variables in addition to potentially having a relationship with those variables on their own.

Def of outlier (in context of MLR violations) and how it impacts model

An outlier is an observation with an extreme residual. Inflates SSE

(a) (3 points) Explain the impact of alpha on the model as it increases and as it decreases for an elastic net regularization.

As alpha increases, the elastic net performs regularization more and more like lasso, until it BECOMES lasso at a value of 1. As alpha decreases, the elastic net performs regularization more and more like ridge, until it becomes ridge at a value of 0. Therefore, increasing alpha makes it easier for the model to produce coefficient estimates that equal 0, thus encouraging variable selection, whereas the opposite occurs when decreasing alpha.

Can elastic net regression perform variable selection?

Assuming lambda is at some reasonable (moderate) flexibility, elastic net can perform variable selection for any alpha > 0, and the larger the alpha, the more variables are likely to be removed, up to alpha = 1, which is lasso regression. Elastic net is less likely to drop variables than lasso for alpha < 1.

Say mortgage (x1) has a median = 0. What does this mean and why is it a big deal?

At least half of the observations have mortgage = 0. This can be a big deal to have so many obs = 0. When you see lots of zero values, immediately ask yourself if there is a difference between the effect of having a mortgage, and the impact of mortgage value. Using Mortgage with no change will treat these impacts as inseparable.

Explain boosting

Boosting grows decision trees sequentially and then aggregates them. The first tree is fit regularly on the training set. Before fitting the second tree, the training set is updated by removing the information explained by the first tree; the information that remains unexplained is referred to as the "residuals". Before fitting the third tree, the training set is updated again, and so on, for as many trees as desired. The individual trees are typically small (usually from specifying a low depth), having high bias and low variance; aggregating the trees reduces the high bias. The shrinkage parameter is a value between 0 and 1 that controls the speed of learning, or the amount of information gained from each tree. The smaller the shrinkage parameter, the less information is gained from each tree and the slower boosting learns; in turn, a large number of trees may be required to obtain a good model. Conversely, a large shrinkage parameter means few trees would be required, but those few trees in aggregate will tend to be similar to a single decision tree. The boosting prediction is the shrinkage parameter times a tree's prediction, summed over all the trees. The number of trees is a flexibility measure (in tandem with the shrinkage parameter) and should be tuned by cross-validation. A boosting method generally has less bias than a single decision tree (technically relative to one of its individual trees, which are typically small, but you can just say single decision tree and get away with it).

How to justify you're not dropping too many outliers?

Can look for the sample mean and the sample median being close together. Because means are far more sensitive to extreme values than medians, it's helpful to compare both to detect the sway. Ex: Since the sample mean and the sample median are not far apart, there should be just a few observations with values exceeding 10. Can look for the mean not being pulled too high/low. Ex: Since the variable's sample mean is low, the dataset should not have a prominent number of obs with values that high.

Describe dist / link function for: 1. Normal 2. Gamma 3. Inverse Gaussian 4. Binomial 5. Poisson

Canonical links:
1. Normal --> symmetric, continuous, all real numbers. Identity link
2. Gamma --> right-skewed, continuous, > 0. Inverse link
3. Inverse Gaussian --> right-skewed, continuous, > 0. Inverse squared link
4. Binomial --> integers >= 0. Logit link
5. Poisson --> right-skewed, integers >= 0. Logarithmic link
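
A hedged sketch of how the corresponding family/link pairs might be specified in glm(); dat, y, and x are placeholder names (each family would need a target with the appropriate support):
glm(y ~ x, data = dat, family = gaussian(link = "identity"))
glm(y ~ x, data = dat, family = Gamma(link = "inverse"))
glm(y ~ x, data = dat, family = inverse.gaussian(link = "1/mu^2"))
glm(y ~ x, data = dat, family = binomial(link = "logit"))
glm(y ~ x, data = dat, family = poisson(link = "log"))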

Define descriptive analytics

Descriptive analytics - focuses on studying the past to identify relationships and patterns between / among the variables. this is exploring the RAW data. This is past data b/c we're using data that has already been collected. We are allowed to create graphs and find things like summary means and make observations like "there are more houses with central air conditioning than without", and "overall_quality has a positive correlation with sale_price", variable looks "predictive" from a graph. Unsupervised learning falls in this category. Data cleaning falls in this category if it does not relate to making something "more predictive" or to fit a model (i.e. removing outliers or erroneous data is descriptive data cleaning). Anything that helps to understand relationships between variables counts as descriptive analytics, even if it uses model outputs to do so.

Watch out for factor vars with lopsided levels

Especially for binary vars. Personally I think a binary factor with 10% vs. 90% is still pretty lopsided but it probably isn't bad enough to cause problems (assuming a sufficiently large dataset).

Factors vs numeric

Factor variables just need the target to differ by level; the relationship does not need to be monotonic. Numeric predictors in a GLM have only one coefficient, so only monotonic relationships between predictor and target can be captured (not necessarily linear). Factor variables fit a different coefficient for each level (less limited to monotonic relationships, but at the expense of more complexity). Also, if a predictor's relationship is monotonic but irregular (not fit well by a simple/nice formula), a factor may be better.

Feature improvement definition

Feature improvement is any sort of level combining, transformations, combining variables, even dropping a variable (if a variable is better off not being a feature/predictor)

When can you compare flexibility of different shrinkage models?

For elastic net with changing alpha, there is no flexibility measure that ties the models together; you cannot make a general statement that lasso is less flexible than ridge or vice versa. For a fixed alpha, lambda is inversely related to flexibility. Alternatively, you can comment on flexibility if you can compare the training RMSEs for specific models. Regularization lowers the model flexibility without necessarily changing the number of predictors, so the number of predictors is no longer how flexibility should be measured; the flexibility measure now hinges on the shrinkage parameter instead. Lambda for ridge is not comparable 1:1 to lambda for lasso.

How to use PCA to tell if original vars are correlated

For the first PC, look at the loadings and notice which variables have similar loadings. This only applies to PC1: PC1 is free to find whichever loadings maximize its variance, but all subsequent PCs are restricted to also be uncorrelated with prior PCs, so PC1's loadings are the most sensitive to the correlations of the original variables. Ex: For PC1, notice that the loadings for sepal length, petal length, and petal width are similar. This indicates that they are more correlated with one another than with sepal width.

What is a good range for specificity and sensitivity?

Generally we care that sensitivity and specificity are close to each other to be considered "good", and also the overall goal of both being high so we get as high of an AUC as possible. If one of them is really low (like 0.2) and the other is really high, this is not great. If both are really low, this is not great. AUC can't be high unless both are high (picture the top left of the ROC curve)

Why should you always scale variables before PCA? Why would you not scale

Generally you should always standardize first. Loadings come from maximizing the variance explained by each PC, so variables with a large variance would get loadings large in magnitude; scaling the variables puts them on the same scale so the loadings are not incorrectly swayed (scaling makes them unitless). You might not scale when the variables are already measured in the same unit: having the same unit would likely (though not always) mean their ranges are already more united; the phrase you might see is "having the same scale".

Are interactions important for this project?

Given how hesitant Millennial Money appears with models that lack interpretability, my judgement is that interactions are not very important in this project, due to their tendency to complicate models. **This effect is less obvious for a non-parametric model since they look to capture whatever is in the data, complex or not, which tend to make it more complex already

Why do trees prefer to split on variables that have many levels?

Having many levels generates so many splitting possibilities that it is easier to find a split where the information gain is large. For numeric variables, or factor variables with many levels, a tree favors splits on these variables because there are many ways to split (not many splits). Having more levels or a wider range allows the tree to examine more ways to split the observations; because of that, it is possible for the granular predictor to end up with the split that results in the greatest information gain. A predictor that provides many candidate splits (e.g. a high-dimensional factor) can 'game the system' and by sheer chance supply the best split when it is actually spurious, a mere coincidence due to the specific training data at hand. So if that best split isn't 'genuine', any interaction based on it is also suspect. Combining levels so there are fewer of them reduces the opportunity for overfitting and helps ensure that the model's splits reflect meaningful differences in predictor values.

The firm is looking to expand their business beyond providing homes to buyers and is now extending their operations into purchasing undervalued homes to then sell for a profit. They have data of homes sold in the Ames area from 2016 - 2020 for your analysis. The firm will rely on the predictions of your model, compared to the listing price of the home, to identify profitable homes which may be listed for a discount. (a) (2 points) Explain why year is not appropriate as a factor in this situation.

If year is a predictor and we set it to e.g. 2016 for a prediction, we most definitely are predicting the sale price of a house sold (or to be sold) in 2016; year as a predictor has no other interpretation. They want to examine house prices 'today', but the data provided is simply limited: a better dataset would include houses sold more recently, but they are working with what they have. A model trained on a certain dataset implies that we 'accept' that when we make a prediction, it is for something that is part of the same population as the data. To the degree we can't bring ourselves to 'accept' that, we should use different, more relevant data. Express Realty is most definitely interested in the population of houses they can purchase 'today', none of which would be identified as "sold in 2016-2020". Without different data, they will have to 'accept' that the population is the same.

When would you choose lognormal in MLR vs Gamma GLM vs Inverse Gaussian GLM ?

If you have good reason (they'd tell you) to believe the target follows a lognormal distribution, you'd not use GLM. There's a tendency for lognormal and inverse gaussian to be more right-skewed than gamma, so might fit better for severely right-skewed targets

Discuss sample mean and sample median, what it means when they're close vs one higher than other

In general, if the sample mean and sample median are close together, the data is relatively symmetrical or balanced. If the mean is much higher than the median, then there are some very large outliers or the data is right-skewed. If the mean is much lower than the median, then there are some very small outliers or the data is left-skewed.

Is number of trees a measure of flexibility in ensemble methods?

It is not a measure of flexibility in random forests. It is a measure of flexibility in boosting. More trees means more of the target has been explained, causing the predictions to get closer and closer to the actual target values.

What is important about a canonical link?

It makes it more likely for the GLM estimates to converge

Residuals have no autocorrelations. What does this mean

It means the correlations of any two error terms (the ε's) are 0. It is technically a weaker claim than them being independent.

Def of latent variables

It refers to a variable that quantifies a property/attribute that cannot be literally measured or observed. (like a PC usually, it's a measurement but a weird amalgamation)

How does MLR estimate coefficients? How does GLM estimate coefficients? When are they the same?

MLR uses OLS to estimate coefficients. GLM uses MLE to estimate coefficients. These are called estimation procedures. MLR = GLM with the normal distribution and identity link.

What is overfitting? How to determine overfit

The model fits too closely to the training data. This is bad because it captures noise specific to the training data rather than the general patterns, and it will not necessarily have good predictive accuracy on the test set. Must use the training and test RMSE!!! (MSE is okay too, but will naturally be larger since it is squared.) If training RMSE << test RMSE, the model is likely overfitted. The difference between training and test metrics alone is not sufficient to judge a black-and-white overfit; the training metric needs to show the model is 'highly accurate' on the training data. We can say Model 1 is overfitting compared to Model 2 if Model 1 is more flexible but Model 2 performs better on test data, though this is not necessarily a black-and-white overfit if neither is very accurate on the training data. Remember that RMSE is in the unit of the target, which helps us determine whether a difference is large or small.
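
A minimal sketch of the train/test RMSE comparison, assuming a fitted model fit and data frames train and test with target column y:
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(train$y, predict(fit, newdata = train))   # training RMSE
rmse(test$y,  predict(fit, newdata = test))    # test RMSE; much larger than training suggests overfitting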

When you change the cutoff, does increasing sensitivity always mean decreasing specificity and vice versa?

Not always; it is uncommon for both to change at the same time as per the ROC curve (inching from left to right). But over a substantial decrease in the cutoff, both sensitivity and specificity would change in that manner (technically it's non-decreasing for the former; non-increasing for the latter).

What are potential problems of large decision trees?

Not easy to visualize (losing a huge advantage of decision trees). Can cause loss of interpretability and potential overfitting (check whether it is overfitted, or state this as a speculative concern).

Be careful with count target variables because

Not enough observations = low frequency Graphs will often show count or sum, not mean or median

Does a numeric predictor need a monotonic or linear relationship with the target in a GLM?

Numeric predictors always have a monotonic relationship (not necessarily linear) with the target in a GLM, assuming no manipulation of the predictor such as a squared term. Because of the link function, a GLM can have a non-linear relationship between the target and a predictor, but it will always be monotonic (the predictors have a linear relationship with the link of the target mean). A unit increase in the predictor does NOT generalize to produce the "same magnitude of effect" for all possible link functions; every link function has its own quirk.

Monotonic / linear in MLR

Numeric predictors in MLR look for a linear (hence monotonic) relationship between the predictor and the target, unless there is a polynomial term or another transformation of a predictor; then it looks for a linear relationship between the target and the transformed predictor (or between the transformed target and the predictor, depending on which was transformed).

Oversampling / Undersampling applies to

Only for balancing a binary target. Not predictors. Only do this to the training set to better train the model's predictions.

Definition of overdispersion / underdispersion (only for Poisson and Binomial for this exam) How to check for overdispersion / underdispersion

Overdispersion: the variability of the data is greater than the model's estimated variance. Addressing it with the quasi-likelihood method aims to make certain hypothesis tests more reliable, such as changing z tests into t tests. However, nothing about the systematic component changes (predictions SAME, coefficient estimates SAME, deviance and df same, etc.); it makes standard errors more accurate and improves estimates for hypothesis testing, so hypothesis tests are more reliable. Check the (residual) deviance against its df: if the deviance is much greater than the df, there is overdispersion; if deviance << df, there is underdispersion (less common). This can be difficult to judge with very few training observations. Both the regular and quasi models can be used for predictions, but for any further analysis such as confidence intervals or hypothesis tests, you should use the quasi model. (Because the distribution is quasi / fake, the log-likelihood doesn't make sense, so the quasi model output will show AIC = NA.)
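
A hedged sketch of the deviance-vs-df check and a quasi-Poisson refit; counts and dat are placeholder names:
fit <- glm(counts ~ ., data = dat, family = poisson())
fit$deviance       # residual deviance
fit$df.residual    # residual degrees of freedom
# deviance much greater than df suggests overdispersion; refit with the quasi-likelihood family
fit_q <- glm(counts ~ ., data = dat, family = quasipoisson())
summary(fit_q)     # same coefficient estimates, adjusted standard errors, AIC = NA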

Explain how PCA is typically used

PCA is an unsupervised learning technique that creates new uncorrelated variables that maximize variance. Often the first few principal components explain most of the variability in the original variables. These principal components can be used in place of the original variables to reduce dimensionality and create a simpler model. (Feature generation is the use. Data exploration is the most common alternative use being awarded credit) 1st sentence means: variance in the dataset explained by each PC is maximized

Parametric is better for which business focus? Non-parametric is better for which business focus?

Parametric = better for interpretability Non-parametric = better for accuracy, these are generally more flexible than parametric methods

Define predictive analytics

Predictive analytics - a study that emphasizes obtaining a model that accurately predicts a future event. We take recorded data and look for a relationship of other variables with the target, in a way that looks predictive. We can then use that predictive relationship to make predictions for a target mean. We work on refining the model to produce the best predictions. Parametric and non parametric supervised learning methods both fall in this category. Data cleaning and transformation falls in this category if the data cleaning is to help us get a more predictive feature or to transform a variable to better fit a model. Future = unseen, making a prediction before the target is known or realized. Not literally in the future like next year

Define prescriptive analytics

Prescriptive analytics - a study that emphasizes the outcomes or consequences of decisions made in modeling or implementation. This is the recommendation to help meet the business goals after we have completed the predictive modeling. "Your contact mentions that it is your responsibility to provide recommendations if the goal is not met." "Your contact mentions that it is your responsibility to recommend metrics to track in order to successfully meet the goal." For example, if our predictive model predicts that there will be high ferry boarding rates in the early mornings, then our prescriptive recommendation is to increase staffing during those hours to best handle the influx of boardings. We think of options to meet the business need after predictive analytics. Prescriptive analytics can also intertwine with modeling: maybe specific predictors are chosen for the purpose of quantifying a goal set out by prescriptive analytics; it all depends on how 'integrated' we want the pieces to fit together.

How do shrinkage methods work?

Regularization aims to lower the flexibility of a model by restricting the coefficient estimates to be closer to 0. Variance is reduced at the expense of some bias. We optimize a loss function that includes a penalty term that penalizes large coefficients; this shrinks the model coefficients toward zero and reduces variance/overfitting. Lambda controls the shrinkage of the estimates: at lambda = 0 (no shrinkage), the estimates match the OLS estimates; as lambda --> infinity (maximum shrinkage), the estimates excluding the intercept equal 0 (i.e. no predictors). Lambda is inversely related to flexibility. Loss function = SSE + penalty term: minimize SSE subject to constraints for MLR; minimize the negative log-likelihood subject to constraints for GLM. Regularization ignores the hierarchical principle.
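
As a sketch, the penalized MLR loss in elastic-net form (glmnet's exact scaling of the terms differs slightly; lambda >= 0 and 0 <= alpha <= 1, with alpha = 0 giving ridge and alpha = 1 giving lasso):
\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \;+\; \lambda \sum_{j=1}^{p}\Big[\alpha\,\lvert\beta_j\rvert + (1-\alpha)\,\beta_j^2\Big]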

When would you choose ridge over lasso or elastic net? And vice versa

Ridge would be chosen over lasso when we want maximum accuracy, because it keeps all predictors in the model but still reduces variance through a penalty that shrinks statistically insignificant coefficients closer to zero. It keeps all predictors, even those with near-zero coefficients; ridge is also incapable of violating the hierarchical principle. Lasso and elastic net can perform variable selection, so they may be preferable in terms of interpretability; they are also preferable when there is collinearity or high dimensionality (better at handling these than ridge).

State the difference between dissimilarity and linkage.

SOA: Dissimilarity measures the proximity of two observations in the dataset, while linkage measures the proximity of two clusters of observations. CA: To be precise, the closeness between clusters (derived from the closeness between observations) is still based on dissimilarity; linkage tells us how that dissimilarity should be measured. When at least one cluster has multiple observations, how do we know which dissimilarities between observations determine the closeness of two clusters? Complete linkage says we find the two observations (one from each cluster) that produce the largest distance and define that value as the two clusters' dissimilarity. Inter-cluster dissimilarity is different from dissimilarity: the complete linkage method uses the maximum inter-cluster dissimilarity and the single linkage method uses the minimum inter-cluster dissimilarity.

What does scaling do? What does standardizing do?

Scaling makes a variable unitless (while forcing its sample variance to be 1). Means dividing each var by its standard deviation Standardizing centers and scales. Means subtracting the mean and dividing by standard deviation

What does it mean for a coefficient to be significant? What would insignificance look like on a graph?

Significant means the coefficient is meaningfully different from zero (remember hypothesis tests, beta = 0 or beta <> 0) For a numeric variable, if it is a very flat line, the coefficient / slope will be close to zero (not significant)

Binarizing for stepwise selection vs shrinkage. Hierarchical principle? Consideration?

Stepwise - a model will drop or keep the whole factor by default. If you want the ability to keep only certain levels, you will need to binarize the factor. Then individual levels will be left out if they do not contribute significantly to the model (abides by hierarchical principle) Shrinkage - Regularization uses binarized dummy variables. "Starts" with binarizing factors, it will automatically evaluate individual levels of factors to drop / keep (does not abide by hierarchical principle) However, including only some of the dummy vars from a factor will result in the merging of certain levels. Since the resulting merger is purely performance-based, it can complicate the model's interpretability when unintuitive levels are combined

Explain stepwise selection in contrast to best subset selection

Stepwise selection is an alternative to best subset selection, and is computationally more efficient, since it considers a much smaller set of models. For example, forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until adding a predictor leads to a worse model by a measure such as AIC. At each step the predictor that gives the greatest additional improvement to the fit is added to the model. The best model is the one fit just before adding a variable that leads to a decrease in performance. It is not guaranteed that stepwise selection will find the best possible model out of all 2^p models. Stepwise selection is iterative and a greedy algorithm, will find a local minimum. Will treat factor levels all together UNLESS you binarize to consider individual levels.

Define stratified sampling (a) (2 points) Discuss the benefits of stratified sampling.

Stratification is the act of creating/forming distinct groups called strata. Stratified sampling is the act of randomly sampling from each stratum. Stratified sampling results in test and training datasets that are similar with respect to the stratification variables. To the extent that the stratification variables are related to the target variable, stratification allows for more precise train and test estimates (estimates of the target variable). Not stratifying on important predictors would add variance to the model because it would be fit to the segmentation of the training data, which is similar to overfitting to noise in the dataset; the test dataset would have a different segmentation, and therefore the model may not fit the test data as well as the training data (this means overfitting: patterns in the training data that aren't in the test data). The benefit of stratifying on the target variable is ensuring that the test and training datasets have similar distributions of the target. This improves predictive accuracy since the dataset trained on is similar to the dataset tested on (with respect to the target).
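
A hedged sketch using caret::createDataPartition(), which stratifies the split on the variable you supply; dat and target are placeholder names:
library(caret)
set.seed(100)
idx   <- createDataPartition(dat$target, p = 0.7, list = FALSE)  # stratified on dat$target
train <- dat[idx, ]
test  <- dat[-idx, ]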

List assumptions for OLS with respect to residuals. What is the impact of these if violated?

The 'proper' term here is actually "error" instead of "residual", symbolized as ε, but the solution treats them the same; the ε's are random variables.
1. Under MLR, the residuals are normally distributed (not standard normal).
2. The residuals have constant variance (homoskedasticity).
3. The expected value of the residuals equals zero.
4. The residuals have no autocorrelations.
If violated, it means there are deficiencies in the model.

When would PCA not be appropriate for a dataset?

The dataset has small dimensionality: while there are many requests within the data, as our assistant notes, there are only 9 variables. PCA is effective when there is high dimensionality (many variables), which can make univariate and bivariate data exploration and visualization techniques less effective; PCA is used to summarize high-dimensional data into fewer composite variables while retaining as much information as possible. As we do not have high dimensionality, any information loss from feature transformation will not be outweighed by improvements in model performance or capture of latent variables. Also, the dataset includes a significant number of factor variables, which would require conversion to numeric values prior to applying PCA; PCA attempts to maximize the variance or spread in our data distribution by linearly combining the original variables (it cannot linearly combine a non-numeric variable).

How is variable importance determined? Define IncMSE and IncNodePurity

The importance of a predictor is determined by how much it reduces the splitting measure, like SSE, across all the trees; for classifiers, it is the reduction in the Gini index specifically. %IncMSE = mean increase in error when the values of that predictor are randomly permuted ("jumbled up"). IncNodePurity = total decrease in RSS due to splits on that variable, averaged across all the trees in the random forest. Variable importance is a measure of how much a predictor contributes to the overall fit of the model and can be used to rank which predictors are most important. It is calculated by aggregating, across all trees in the random forest, the reductions in error that all splits on a selected variable produce. Variable importance cannot be used to draw inference about what is causing model results, but it can identify which variables cause the largest reduction in model error on the training data.
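
A hedged sketch of extracting variable importance from a fitted random forest; fit_rf is assumed to exist and to have been fit with importance = TRUE so that %IncMSE is reported:
library(randomForest)
importance(fit_rf)    # %IncMSE and IncNodePurity for each predictor
varImpPlot(fit_rf)    # plots the same measures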

Which link is the most interpretable for a binary target in GLM?

The logit link is more easily interpretable than either the probit link or the complementary log-log. The logit link sets the predictors equal to the natural log of the odds of having a personal loan, so it is relatively easy to identify and interpret the effect of a predictor on the odds of having a personal loan.

Perform PCA on entire dataset. Why?

The problem with performing PCA on only the training set is that you would have to replicate the PCA on the test set as well, and it may create PCs differently between the test and training sets since the variation between the datasets is not necessarily the same.

Define bagging (the model). What issue may occur?

This is an ensemble method that fits a decision tree to each bootstrap sample. To produce a prediction, the model aggregates the predictions from each tree. For regression trees, "aggregate" means to take their average; for classification trees, it means to select the most frequent one. When all predictors are considered for every split, the model is referred to as bagging. When there is a really significant or dominant predictor, it will likely appear as the first split in many of the fitted trees due to the greedy nature of recursive binary splitting. This is undesirable, resulting in similar/correlated predictions across trees. This means the resulting reduction in variance from aggregating the tree predictions won't be as great, so more susceptible to overfitting.

Define best subset selection

This procedure compares all possible models from the available predictors, so it will find the best set of predictors (the optimal model) based on the chosen criterion. It is time- and computationally intensive, and potentially more susceptible to overfitting than stepwise selection methods. Best subset selection is performed by fitting all p models (where p is the total number of predictors being considered) that contain exactly one predictor and picking the one with the smallest deviance, then fitting all p-choose-2 models that contain exactly 2 predictors and picking the one with the lowest deviance, and so forth. Then a single best model is selected from the models picked, using a metric such as AIC. In general, 2^p models are fit, which can be quite a large search space as p increases. It is not iterative and will find the optimal set of predictors (a global minimum). It treats factor levels all together UNLESS you binarize them to consider individual levels.

What does it signify if you use cross-validation and it chooses a cp of zero?

This suggests the original tree was not flexible enough to provide good predictions. If the tree gets too big and is still not flexible enough, maybe the tree isn't the best model b/c you lose one of the biggest advantages of the tree being easy to visualize / interpret. (Make sure no other params are limiting the tree)
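
A hedged sketch of inspecting the cp table and plot for a fitted rpart tree; fit_tree is assumed to exist:
library(rpart)
printcp(fit_tree)   # cp table: rel error and cross-validation error (xerror) by cp
plotcp(fit_tree)    # cp plot; watch whether xerror keeps falling at the smallest cp tested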

Why must you always include less than k PC's? (where k = number of vars used in the PCA)

Using all the PCs ends up having the exact same predictive information as the original variables: there is no dimension reduction, and PCs are harder to interpret than the original variables. Using PCs instead of the original variables hopes that using fewer predictors may result in predictive improvement, similar to how stepwise selection may find an improvement with fewer predictors.

What are weights in a GLM? List 3 benefits. Explain the math. Explain how to model in R

Weights refer to a variable of known positive constants that gives certain observations more importance (more reliability). The GLM estimates coefficients in a way that places higher priority on observations with larger weights. Benefits:
- measure the relative importance of each observation
- scale the variances of the target for all observations; more weight causes the variance to decrease
- are intuitive as integers, making the target interpretable as an average --> the y_i's are the average [target] per [weight] (ex: target = average price per room)
Mathematically, the weights are the w_i's that come from replacing ϕ with ϕ/w_i in the log-likelihood function l(β). Consequently, the weights affect how β is estimated, attributing more importance to the observations with greater weight. If the target is recorded on a total basis, it should be transformed to a per-weight basis before you can use weights in the model.
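
A hedged sketch of a weighted GLM fit; dat, avg_price_per_room, and rooms are placeholder names, with the target already expressed per weight unit:
fit_w <- glm(avg_price_per_room ~ ., data = dat,
             family = Gamma(link = "log"),
             weights = rooms)    # observations with more rooms get more influence on the estimates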

When to scale histograms (use density) and when not?

Without scaling, you can see difference in counts but can't properly compare distribution shapes. When scaled, distribution / shape comparison is better but can't tell the difference in counts. Generally want to see both.

Do you need to binarize factors for PCA?

Yes - need to include numeric dummy vars b/c PCA attempts to maximize variance in our data distribution by linearly combining variables (you can only do linear combo if everything is numeric). Use fullRank = F to include all k levels (not k-1)
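
A hedged sketch with caret::dummyVars(); dat is a placeholder data frame of factor and numeric columns:
library(caret)
dummies <- dummyVars(~ ., data = dat, fullRank = FALSE)   # keep all k levels per factor
dat_num <- predict(dummies, newdata = dat)                # numeric matrix of dummy variables
pca <- prcomp(dat_num, center = TRUE, scale. = TRUE)      # PCA on the binarized, standardized data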

glmnet() key hyperparameters

alpha = resemblance towards lasso vs ridge (mixing coefficient) lambda = shrinkage parameter
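
A hedged sketch; X is an assumed numeric design matrix and y the target, and alpha = 0.5 is only illustrative:
library(glmnet)
cv_fit <- cv.glmnet(X, y, alpha = 0.5)                           # alpha mixes ridge (0) and lasso (1)
fit    <- glmnet(X, y, alpha = 0.5, lambda = cv_fit$lambda.min)  # lambda = shrinkage parameter
coef(fit)                                                        # some coefficients may be exactly 0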

Define these key hyperparameters for gbm() 1. n.trees 2. interaction.depth 3. shrinkage 4. n.minobsinnode 5. bag.fraction

gbm() = BOOSTING
1. n.trees: number of trees
2. interaction.depth: maximum depth of each tree
3. shrinkage: shrinkage parameter (learning rate)
4. n.minobsinnode: minimum number of observations permitted in a terminal node
5. bag.fraction: portion of observations used for each tree (uses only a sample, e.g. 60% of training observations, at each iteration)
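
A hedged sketch of a gbm() call; the data frame and hyperparameter values are placeholders:
library(gbm)
set.seed(100)
train <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))  # placeholder data
fit_gbm <- gbm(
  y ~ ., data = train,
  distribution = "gaussian",
  n.trees = 1000,           # number of trees
  interaction.depth = 2,    # maximum depth of each tree
  shrinkage = 0.01,         # shrinkage / learning rate
  n.minobsinnode = 10,      # minimum observations in a terminal node
  bag.fraction = 0.6        # fraction of training observations per tree
)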

Explain an offset in a Poisson regression

w_i = the exposure for the ith observation; λ_i = the rate component (i.e. mean target per exposure unit) for the ith observation. The offset is the term ln(w_i) added to the linear predictor x_i^T β.
- the predictors only explain the rate (mean PER exposure)
- the offset's impact on the target is known (its coefficient is fixed at 1)
- count = number of exposures * rate per exposure
- the exposure is chosen based on preference / interpretability
- Poisson counts within a given interval = one exposure unit
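
A hedged sketch of a Poisson GLM with a log-exposure offset; dat, counts, exposure, age, and region are placeholder names:
fit <- glm(counts ~ age + region, data = dat,
           family = poisson(link = "log"),
           offset = log(exposure))   # ln(exposure) enters the linear predictor with coefficient 1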

