Exam PA

Problem with unbalanced data

A classifier implicitly places more weight on the majority class and tries to match the training observations in that class without paying enough attention to the minority class.

Advantages of polynomial regression

Able to take care of more complex relationships

Accuracy, sensitivity, specificity relationship

Accuracy is a weighted average of sensitivity and specificity, with weights equal to the proportions of positive and negative observations.

Bias

Accuracy. The difference between the expected value of the model's prediction and the true value. A more complex model has lower bias due to its greater ability to capture the signal in the data.

Best subset selection limitations

Computationally prohibitive

Hierarchical clustering

Does not require the choice of K in advance. The groupings it produces can be displayed in a dendrogram. A bottom-up clustering method that starts with individual observations, each treated as a separate cluster, and fuses the closest pair of clusters one pair at a time.
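
A minimal sketch of the bottom-up idea using SciPy on synthetic data (illustrative only; the linkage choice and cluster count used here are assumptions):

```python
# Agglomerative clustering: repeatedly fuse the closest pair of clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="complete")   # full fusion history; no K chosen in advance
dendrogram(Z)                        # display the fusion sequence as a dendrogram
plt.show()
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters afterwards
print(labels)
```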

Drawbacks/limitations of PCA

Does not lead to interpretable results; not good for non-linear relationships because it relies on linear transformations; does not perform feature selection, because each principal component is a combination of all original features, so all original variables must still be collected for model construction; the target variable is ignored, because loadings and scores are generated independently of the target.

As lambda increases

There is increasing pressure for the coefficient estimates to be closer and closer to zero; the flexibility of the model drops, decreasing variance and increasing bias.
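
A minimal scikit-learn sketch of this shrinkage on synthetic data (note scikit-learn calls the regularization parameter "alpha", not lambda):

```python
# Coefficient estimates move toward zero as the regularization parameter grows.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

for lam in [0.01, 1, 10, 100, 1000]:
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(lam, np.round(coefs, 3))   # coefficients shrink as lam increases
```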

What K-means refers to

Iteratively calculating the averages, aka means, of K clusters

High variance

Our model won't be accurate because it overfit to the data it was trained on and thus won't generalize well to new, unseen data

Supervised Learning

There is a target variable, and our goal is to understand the relationship between the target and the predictors and/or make accurate predictions for the target based on the predictors.

Why unsupervised learning is more challenging than supervised

There is no simple goal like prediction. Methods for assessing model quality are generally not applicable, which makes it hard to evaluate results.

Why it is not a good idea to grow a tree by specifying an impurity reduction threshold and making a split only when the reduction in impurity exceeds the threshold

A seemingly poor split early on could be followed by a good split that we would never get to see, resulting in a model that is not as good as it could be.

Granularity

Applies to both numeric and categorical variables. It is not always possible to order variables by granularity: for variable one to be at least as granular as variable two, each distinct level of variable one should correspond to a single level of variable two (if we know the level of variable one, then we can deduce the level of variable two).

Decision trees

Applies to both regression and classification problems. Divides the feature space into non-overlapping regions containing relatively homogeneous data. To make a prediction, locate the region to which the observation belongs and use the average target value (or most common class) in that region.
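
A minimal scikit-learn sketch of a regression tree on synthetic data (tree depth and data are illustrative assumptions):

```python
# Regression tree: split the feature space into regions, predict each region's mean.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 10, 2) + rng.normal(size=200)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree))           # the non-overlapping regions and their mean predictions
print(tree.predict([[7.0, 3.0]]))  # prediction = average target in the matching region
```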

Interaction

Arises if the association between one predictor and the target variable depends on the value of another predictor

K-means clustering

Assigns each observation to one and only one of K clusters, within which the observations are relatively homogeneous. K is specified up front. Clusters are chosen such that the within-cluster variation is small.
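
A minimal scikit-learn sketch on synthetic data (K = 2 is an assumption for illustration):

```python
# K-means: alternate between assigning points to clusters and recomputing the K means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(km.labels_[:10])        # each observation assigned to exactly one cluster
print(km.cluster_centers_)    # the K cluster means
print(km.inertia_)            # total within-cluster variation (kept small)
```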

Partial dependence plot

Attempts to visualize the association between a given variable and the model prediction after averaging over the values or levels of the other variables.
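
A minimal sketch using scikit-learn's partial dependence tooling on synthetic data (assumes a reasonably recent scikit-learn; the fitted model and features shown are illustrative):

```python
# Partial dependence: average the model prediction over the other features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=300)

model = GradientBoostingRegressor().fit(X, y)
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()
```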

AIC & BIC Differences

The BIC penalty is more stringent than the AIC penalty, resulting in a heavier penalty for complexity and a simpler final model.
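
Concretely (standard definitions, with maximized log-likelihood l, p estimated parameters, and n observations): AIC = -2l + 2p and BIC = -2l + p*ln(n). Because ln(n) > 2 whenever n exceeds about 7, BIC charges more per additional parameter at any realistic sample size.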

Boosted tree

Builds a sequence of interdependent trees using information from previously grown trees. Fit a tree to the residuals of the preceding tree and subtract a scaled-down version of its predictions to form new residuals.
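
A hand-rolled sketch of this residual-fitting idea on synthetic data (the learning rate, tree depth, and number of trees are illustrative assumptions, not prescribed values):

```python
# Boosting: each small tree is fit to the current residuals, then a scaled-down
# version of its predictions is subtracted to form new residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

eta, n_trees, trees = 0.1, 100, []     # eta is the shrinkage (learning rate)
residuals = y.copy()
for _ in range(n_trees):
    t = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(t)
    residuals -= eta * t.predict(X)    # each tree corrects the previous trees' errors

pred = sum(eta * t.predict(X) for t in trees)
print(np.mean((y - pred) ** 2))        # training MSE of the boosted ensemble
```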

How are lambda and alpha selected

By cross-validation. Choose the pair that produces the lowest cross-validation error.
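
A minimal scikit-learn sketch on synthetic data (note the naming swap: scikit-learn's l1_ratio plays the role of alpha, and its alpha plays the role of lambda; the candidate grid is an assumption):

```python
# Cross-validation searches over the mixing parameter and regularization strength
# and keeps the pair with the lowest CV error.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print(model.l1_ratio_, model.alpha_)   # pair chosen by cross-validation
```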

Problem with deviance as a model selection criterion

Can only be used to compare GLMs having the same target distribution

Limitations of LRT as a model selection method

Can only be used to compare one pair of GLMs at a time. Simpler GLM must be a special case/nested within the complex GLM.

Prescriptive Modeling

Combination of optimization and simulation to investigate and quantify the impact of different prescribed actions in different scenarios

Regularization parameter (lambda)

Controls the extent of regularization and quantifies our preference for simple models

Features

Derived from the original variables; they provide an alternative, more useful view of the information contained in the dataset.

Why desirable to run K-means multiple times

Different initial cluster assignments lead to different final results.

Stepwise selection

Do not search through all possible combinations but determine the best model from a carefully restricted list of candidate models by sequentially adding/dropping features one at a time.

Reason to use undersampling instead of oversampling

Eases computational burden and reduces run time when training data is excessively large

Variance

Expected loss arising from the model being too complex and overfitting to a specific instance of the data

Best subset selection

Fitting a separate linear model for each possible combination of available features and selecting the best. Analyzes 2^p models.
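
A brute-force sketch on synthetic data with p = 4 (so 2^4 = 16 candidate models); the selection criterion shown is training MSE purely for illustration, which is why a penalized criterion would be needed in practice:

```python
# Best subset selection: fit one linear model per feature combination.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] + X[:, 2] + rng.normal(size=100)

best = None
for k in range(X.shape[1] + 1):
    for subset in itertools.combinations(range(X.shape[1]), k):
        cols = list(subset)
        pred = (LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
                if cols else np.full_like(y, y.mean()))
        mse = mean_squared_error(y, pred)
        if best is None or mse < best[0]:
            best = (mse, subset)
print(best)   # training MSE always favors the full model; use AIC/BIC/CV instead
```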

Descriptive Modeling

Focuses on what happened in the past and aims to describe or explain observed trends by identifying the relationships between different variables

Predictive Modeling

Focuses on what will happen in the future and is concerned with making accurate predictions

Difference between average and centroid linkage

For average linkage, first compute all pairwise distances between observations in the two clusters, then take the average. For centroid linkage, average the features within each cluster to get the two centroids, then take the distance between the centroids.

Random forest

Generating multiple bootstrapped (with replacement) samples and fitting base trees in parallel, independently on each of the bootstrapped samples. Then combine the results from the base trees by taking an average for a numeric target or the most commonly occurring class for a categorical target.
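
A hand-rolled sketch of the bootstrap-and-average idea on synthetic data (the number of trees and max_features choice are illustrative assumptions; max_features mimics the random forest's per-split feature sampling):

```python
# Fit independent trees on bootstrap samples and average their predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = X[:, 0] ** 2 + rng.normal(size=300)

trees = []
for _ in range(50):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap: sample with replacement
    trees.append(DecisionTreeRegressor(max_features="sqrt").fit(X[idx], y[idx]))

pred = np.mean([t.predict(X) for t in trees], axis=0)      # average across the base trees
print(np.mean((y - pred) ** 2))
```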

Why classification error rate is less sensitive to node impurity than Gini index and entropy

Gini and entropy are differentiable with respect to the class proportions and therefore are more amenable to numerical optimization

How clustering can be used to generate features

Group assignments can be used as a factor variable in place of original variables. Cluster centers can replace original variables and retain numeric features to be used for interpretation and prediction.
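
A minimal sketch of both options on synthetic data (column names and K = 3 are illustrative assumptions):

```python
# Cluster label as a factor feature, or cluster centers as replacement numeric features.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x1", "x2"])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
X["cluster"] = pd.Categorical(km.labels_)     # group assignment as a factor variable
centers = km.cluster_centers_[km.labels_]     # each row replaced by its cluster center
X["c1"] = centers[:, 0]
X["c2"] = centers[:, 1]
print(X.head())
```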

Cost-complexity pruning

Grow a very large tree, then prune it back to get a smaller subtree, trading off the goodness of fit of the tree to the training data against the size of the tree using a penalized objective function.

Complexity parameter's effect on the quality of a decision tree

If the complexity parameter is 0, the fitted tree is identical to the large, overblown tree. As it increases, branches are snipped off retroactively, forming new, larger nodes; bias increases and variance decreases. If it is 1, the tree is pruned all the way back to the root node.
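
A minimal scikit-learn sketch on synthetic data (scikit-learn's ccp_alpha plays the role of the complexity parameter; the data and slicing are illustrative):

```python
# Cost-complexity pruning: larger ccp_alpha snips more branches off the large tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.where(X[:, 0] > 5, 10, 2) + rng.normal(size=300)

big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
path = big_tree.cost_complexity_pruning_path(X, y)        # candidate alphas from the large tree

for a in path.ccp_alphas[::10]:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X, y)
    print(round(a, 4), pruned.get_n_leaves())              # fewer leaves as alpha grows
```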

Using a time variable to split the data set

If the behavior of the target variable over time is of interest, then split the data by time, with older observations forming the training set and newer observations forming the test set.

*Why decision trees tend to favor categorical predictors with many levels

It is easier to find a seemingly good split based on a multi-level categorical predictor, because its many levels offer far more candidate partitions of the data.

Differences between K-means and hierarchical

K-means needs randomization, K prespecified, and clusters are not nested. Hierarchical does not need randomization, does not need K prespecified, and clusters are nested.

Using oversampling to balance data

Keeps all original data but samples with replacement from the positive class to reduce the imbalance between the two classes.

Considerations for choice of mtry parameter of a random forest

Larger values mean less variance reduction, as the base trees will be more similar to one another. Too small a value means each split may have too little freedom in choosing the split variables.

Statistical method typically used to estimate the parameters of a GLM

Maximum likelihood estimation. Choose parameter estimates in such a way to maximize the likelihood of observing the given data
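
A minimal statsmodels sketch on synthetic data (a Poisson target with a log link is assumed purely for illustration):

```python
# GLM coefficients are estimated by maximizing the likelihood of the observed data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # iteratively maximizes the likelihood
print(fit.params)   # maximum likelihood estimates of the coefficients
print(fit.llf)      # maximized log-likelihood
```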

How centering and scaling affects the results of PCA

Mean centering often does not have any effect, but scaling has a great effect. If no scaling is done and the variables are of vastly different orders of magnitude, then those with unusually large variances will dominate.
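
A minimal scikit-learn sketch on synthetic data where one variable is measured on a far larger scale (the factor of 1000 is an assumption chosen to make the effect obvious):

```python
# Without scaling, the large-scale variable dominates the principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200),           # small-scale variable
                     rng.normal(size=200) * 1000])   # large-scale variable

print(PCA().fit(X).explained_variance_ratio_)        # first PC is essentially the large-scale variable
print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)  # balanced after scaling
```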

Problem with RSS and R2 as model selection measures

Merely goodness-of-fit measures with no regard to model complexity or prediction performance. Adding more predictors will always decrease RSS and increase R2, respectively.

High bias

Model won't be accurate because it doesn't have the capacity to capture the signal in the data

Disadvantages of polynomial regression

More difficult to interpret. No simple way to choose the degree of the polynomial.

Importance of setting a cutoff for a binary classifier

Must be specified in order to translate predicted probabilities into the predicted classes.

Unsupervised Learning

No target variable; we are interested in extracting relationships and structures between different variables in the data.

Hyperparameter

Parameters that control some aspect of the fitting process. Also known as tuning parameters. Their values must be supplied in advance.

Importance of hyperparameters

Play a role in the mathematical expression or in the set of constraints defining the optimization problem.

Elbow method

Plot of the proportion of variation explained as each new cluster is added. Once the plot plateaus, the elbow is reached, and the corresponding value of K provides an appropriate number of clusters to segment the data.
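
A minimal sketch on synthetic data with three true groups (K range and data are illustrative assumptions); the proportion of variation explained is computed as 1 minus within-cluster SS over total SS:

```python
# Elbow method: the explained-variation curve levels off around the true K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2)),
               rng.normal(10, 1, (50, 2))])

total_ss = ((X - X.mean(axis=0)) ** 2).sum()
for k in range(1, 8):
    within_ss = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
    print(k, round(1 - within_ss / total_ss, 3))   # plateaus (the elbow) near K = 3
```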

Variance

Precision. Quantifies the amount by which the fitted function would change if it were estimated using a different training set. A more flexible model has higher variance because it matches the training observations more closely and is therefore more sensitive to the training data.

How a monotone transformation on a numeric predictor and on the target variable affect a decision tree

A monotone transformation of a numeric predictor produces exactly the same tree, because the split points are based on the ranks of the feature values, not their actual values. A monotone transformation of the target variable, however, can change the fitted tree, because node averages and impurity calculations depend on the actual target values.

Specificity

Proportion of negative observations that are correctly classified as negative

Sensitivity

Proportion of positive observations that are correctly classified as positive
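
A minimal sketch computing both metrics from a confusion matrix (the actual and predicted classes below are hypothetical values for illustration):

```python
# Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
import numpy as np
from sklearn.metrics import confusion_matrix

actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
predicted = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("sensitivity:", tp / (tp + fn))   # correctly classified positives
print("specificity:", tn / (tn + fp))   # correctly classified negatives
```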

Variables

Raw measurement that is recorded and constitutes the original dataset before any transformations are applied

Data quality issues to examine

Reasonableness: do key metrics make sense; consistency: the same basis and rules are applied to all values; sufficient documentation: so others can easily gain an accurate understanding of different aspects of the data.

Reasons for reducing granularity

Reduces the complexity of the model, makes model results more intuitive, and increases the number of observations per level of the variable, which smooths out trends and reduces the likelihood of overfitting to noise.

Granularity

Refers to how precisely a variable is measured. Examples in decreasing order of granularity: address, postal code, state, country.

Lambda approaches infinity

The regularization penalty dominates and becomes so great that the estimates of the slope coefficients have no choice but to be 0.

When lambda =0

The regularization penalty vanishes and the coefficient estimates are identical to the ordinary least squares estimates.

Reason to use oversampling instead of undersampling

Retains more information about the positive class

Difference between random forest and boosted trees

Random forest base trees are fitted independently (in parallel) while boosted trees are fitted in series; random forests primarily reduce variance while boosting primarily reduces bias.

How cutoff of a binary classifier affects sensitivity and specificity

Selection involves a trade-off between high sensitivity and high specificity. When the cutoff is 0, every observation is classified as positive, so sensitivity is 1 and specificity is 0; when the cutoff is 1, every observation is classified as negative, so specificity is 1 and sensitivity is 0.
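
A minimal sketch of this trade-off using hypothetical predicted probabilities and actual classes (values chosen only to show the pattern):

```python
# Sweeping the cutoff: lower cutoffs raise sensitivity, higher cutoffs raise specificity.
import numpy as np

probs  = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9, 0.2, 0.8, 0.5, 0.35])
actual = np.array([0,   0,   1,   1,   1,   1,   0,   1,   0,   0])

for cutoff in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pred = (probs >= cutoff).astype(int)
    sens = (pred[actual == 1] == 1).mean()
    spec = (pred[actual == 0] == 0).mean()
    print(cutoff, round(sens, 2), round(spec, 2))
```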

How a partial dependence plot improves a variable importance plot

Shows the directional relationship between a predictor and the target variable.

Why oversampling must be performed after splitting set

Since oversampling samples with replacement, some positive-class observations could appear in both the training and test sets, making the test set not truly unseen to the trained classifier.
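
A minimal sketch of the correct order of operations on synthetic data (column names, class imbalance, and split fraction are illustrative assumptions):

```python
# Split first, then oversample the positive class within the training set only.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000),
                   "target": rng.binomial(1, 0.05, size=1000)})   # imbalanced target

train, test = train_test_split(df, test_size=0.25, stratify=df["target"], random_state=1)

pos = train[train["target"] == 1]
neg = train[train["target"] == 0]
extra = pos.sample(n=len(neg) - len(pos), replace=True, random_state=1)  # resample positives
train_balanced = pd.concat([train, extra])   # test set is untouched and remains truly unseen
print(train_balanced["target"].value_counts())
```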

Target leakage

Some predictors in a model include (leak) information about the target variable that will not be available when the model is applied in practice. They are typically strongly related to the target, but their values are not known until the target is observed.

Dimensionality

Specific to categorical variables. Variables can always be ordered by counting the number of factor levels to see which variable has more levels.

Top-down, greedy algorithm

Start from the top of the tree and work down, sequentially partitioning the feature space and adopting the split that is currently the best (the one leading to the greatest reduction in impurity at that point) instead of looking ahead.

Differences b/n stepwise selection and regularization

Stepwise selection goes through a list of candidate models fitted by least squares and decides on a model with respect to a certain selection criterion. Regularization considers a single model hosting all potentially useful features, fit using unconventional methods that shrink coefficients towards zero.

Stratified compared to random sampling

Stratified sampling divides the population into a number of groups (strata) in a non-random fashion and then randomly samples a set number of observations from each group, ensuring that every group is represented. Random sampling randomly draws the desired number of observations from the whole population without replacement.
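
A minimal scikit-learn sketch on synthetic data of one common use of stratification, splitting on a rare binary target so both sets keep the same class proportions (the target and split fraction are illustrative assumptions):

```python
# Stratified split: class proportions are preserved in the training and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = rng.binomial(1, 0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
print(y_tr.mean(), y_te.mean())   # nearly identical positive-class proportions
```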

Bias

The expected loss arising from the model not being complex/flexible enough to capture the underlying signal

Why can't add/drop multiple features at a time in stepwise selection

The significance of a feature can be substantially affected by the presence or absence of other features due to their correlations.

Why are lambda and alpha hyperparameters

They are pre-specified inputs that are not determined as part of the optimization procedure.

Principal components analysis

Transforms a high-dimensional data set into a much smaller, more manageable set of representative variables that are mutually uncorrelated.

Using undersampling to balance data

Undersampling draws fewer observations from the negative class and retains all of the positive class, improving the classifier's ability to pick up the signal leading to the positive class. The drawback is that the classifier is now based on less data, and therefore less information about the negative class, which could lead to overfitting.

Advantage of using time variable to split set

Useful for evaluating how well a model extrapolates time trends observed in the past to future, unseen years.

Difference between weights and offsets

Weights are used when the target variable is an average over the exposure units; offsets are used when the target variable aggregates values over the exposure units. Weights do not affect the mean of the target; offsets do not affect its variance.
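
A minimal statsmodels sketch on synthetic data contrasting the two setups (the Poisson frequency example, variable names, and coefficients are illustrative assumptions, not a prescribed exam setup):

```python
# Offset of log(exposure) for aggregate counts; var_weights of exposure for a
# per-exposure average target.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
exposure = rng.uniform(0.5, 3.0, size=500)                      # e.g. policy years
X = sm.add_constant(rng.normal(size=(500, 1)))
counts = rng.poisson(exposure * np.exp(0.2 + 0.5 * X[:, 1]))    # aggregate claim counts

# Offset: the target is the aggregate count over the exposure units
offset_fit = sm.GLM(counts, X, family=sm.families.Poisson(),
                    offset=np.log(exposure)).fit()

# Weights: the target is the average frequency per exposure unit
weight_fit = sm.GLM(counts / exposure, X, family=sm.families.Poisson(),
                    var_weights=exposure).fit()

print(offset_fit.params, weight_fit.params)   # the two fits recover similar coefficients
```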

How scaling affects hierarchical

Without scaling, if one variable has a much larger magnitude, it will dominate the distance calculations and exert a disproportionate impact on the cluster assignments.

