Exam PA
Problem with unbalanced data
A classifier implicitly places more weight on the majority class and tries to match the training observations in that class without paying enough attention to the minority class.
Advantages of polynomial regression
Able to handle more complex, non-linear relationships between the target and predictors.
Accuracy, sensitivity, specificity relationship
Accuracy is a weighted average of sensitivity and specificity, with weights equal to the proportions of positive and negative observations.
Bias
Accuracy. The difference between the expected value of the prediction and the true value. More complex model = lower bias due to its greater ability to capture the signal of the data.
Best subset selection limitations
Computationally prohibitive
Hierarchical clustering
Does not require the choice of K in advance. The groupings it produces can be displayed in a dendrogram. A bottom-up clustering method that starts with individual observations, each treated as a separate cluster, and fuses the closest pair of clusters one pair at a time.
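A minimal sketch of the bottom-up idea using scipy on assumed data (the simulated X and the choice of complete linkage are illustrative only):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))                      # 20 observations, 2 features

    Z = linkage(X, method="complete")                 # repeatedly fuse the closest pair of clusters
    dendrogram(Z)                                     # display the full fusion history
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters afterwards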
Drawbacks/limitations of PCA
Does not lead to easily interpretable results; not good for non-linear relationships because it uses only linear transformations; does not perform feature selection because each PC is a combination of all original features, so all original variables must still be collected for model construction; the target variable is ignored because loadings and scores are generated independently of the target.
As lambda increases
Increasing pressure for the coefficient estimates to move closer and closer to zero; the flexibility of the model drops, decreasing variance and increasing bias.
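A minimal sketch of this effect with scikit-learn's Ridge on simulated data (scikit-learn calls lambda "alpha"):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge

    X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

    for lam in [0.01, 1, 10, 100, 1000]:
        coefs = Ridge(alpha=lam).fit(X, y).coef_
        print(lam, np.round(coefs, 2))   # estimates move closer to zero as lambda grows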
What K-means refers to
Iteratively calculating the averages, aka means, of K clusters
High variance
Our model won't be accurate because it overfit to the data it was trained on and thus won't generalize well to new, unseen data
Supervised Learning
There is a target variable, and our goal is to understand the relationship between the target and the predictors and/or make accurate predictions for the target based on the predictors.
Why unsupervised learning is more challenging than supervised
There is no simple goal like prediction, and methods for assessing model quality are generally not applicable, which makes it hard to evaluate results.
Not a good idea to grow a tree by specifying an impurity reduction threshold and making a split only when the reduction in impurity exceeds the threshold
A seemingly poor split early on could be followed by a good split that we never get to see, resulting in model results that are not as good as they could possibly be.
Granularity
Applies to both numeric and categorical variables. It is not always possible to order variables by granularity; for one variable to be at least as granular as another, each distinct level of variable one should correspond to a single level of variable two (if we know the level of variable one, then we can deduce the level of variable two).
Decision trees
Applies to both regression and classification problems. Divides the feature space into non-overlapping regions containing relatively homogeneous data. To make a prediction, locate the region to which the observation belongs and use the average target value (or most common class) in that region.
Interaction
Arises if the association between one predictor and the target variable depends on the value of another predictor.
K-means clustering
Assigns each observation to one and only one of K clusters, within which the observations are relatively homogeneous. K is specified up front. Clusters are chosen so that the within-cluster variation is small.
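A minimal sketch with scikit-learn on assumed data:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 2))

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # K = 3 specified up front
    print(km.labels_[:10])   # each observation is assigned to exactly one cluster
    print(km.inertia_)       # total within-cluster variation the algorithm tries to keep small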
Partial dependence plot
Attempts to visualize the association between a given variable and the model prediction after averaging over the values or levels of the other variables.
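A minimal sketch using scikit-learn's partial dependence tooling; the simulated dataset and fitted model below are placeholders:

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import PartialDependenceDisplay

    X, y = make_friedman1(random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X, y)

    # Predictions are averaged over the other features to isolate features 0 and 1.
    PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])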
AIC & BIC Differences
The BIC penalty per parameter (ln(n)) is more stringent than the AIC penalty (2) whenever n is at least 8, resulting in a heavier overall penalty and a simpler final model.
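A minimal sketch of the two formulas, assuming loglik is the maximized log-likelihood, p the number of estimated parameters, and n the number of observations:

    import numpy as np

    def aic(loglik, p):
        return -2 * loglik + 2 * p           # penalty of 2 per parameter

    def bic(loglik, p, n):
        return -2 * loglik + np.log(n) * p   # penalty of ln(n) per parameter, which exceeds 2 once n >= 8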
Boosted tree
Builds a sequence of interdependent trees using information from previously grown trees. Fit a tree to the residuals of the preceding tree and subtract a scaled-down version of its predictions to form new residuals.
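A minimal sketch of the residual-fitting idea on simulated data (the tree depth, learning rate, and number of rounds are arbitrary illustrative choices):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 1))
    y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.2, size=200)

    eta, trees = 0.1, []          # eta is the shrinkage (learning rate)
    resid = y.copy()
    for _ in range(100):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, resid)   # fit a tree to the current residuals
        trees.append(tree)
        resid -= eta * tree.predict(X)                            # subtract a scaled-down version

    pred = sum(eta * t.predict(X) for t in trees)                 # ensemble prediction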
How are lambda and alpha selected
By cross-validation. Choose the pair that produces the lowest cross-validation error.
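A minimal sketch with scikit-learn on simulated data; note that its alphas parameter plays the role of lambda and l1_ratio plays the role of alpha in the exam's notation:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNetCV

    X, y = make_regression(n_samples=200, n_features=10, noise=5, random_state=0)

    model = ElasticNetCV(alphas=np.logspace(-3, 2, 50),
                         l1_ratio=[0.1, 0.5, 0.9, 1.0],
                         cv=5).fit(X, y)
    print(model.alpha_, model.l1_ratio_)   # the pair with the lowest cross-validation error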
Problem with deviance as a model selection criterion
Can only be used to compare GLMs having the same target distribution
Limitations of LRT as a model selection method
Can only be used to compare one pair of GLMs at a time. The simpler GLM must be a special case of (nested within) the more complex GLM.
Prescriptive Modeling
Combination of optimization and simulation to investigate and quantify the impact of different prescribed actions in different scenarios
Regularization parameter (lambda)
Controls the extent of regularization and quantifies our preference for simple models.
Features
Derived from the original variables; they provide an alternative, more useful view of the information contained in the dataset.
Why desirable to run K-means multiple times
Different initial cluster assignments can lead to different final results.
Stepwise selection
Do not search through all possible combinations but determine the best model from a carefully restricted list of candidate models by sequentially adding/dropping features one at a time.
Reason to use undersampling instead of oversampling
Eases computational burden and reduces run time when training data is excessively large
Variance
Expected loss arising from the model being too complex and overfitting to a specific instance of the data
Best subset selection
Fitting a separate linear model for each possible combination of available features and selecting the best one. This requires analyzing 2^p models.
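A minimal sketch of the idea on simulated data with p = 4; in practice the comparison would use AIC/BIC or validation error rather than training R^2, and the loop is what becomes prohibitive for large p:

    from itertools import combinations
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=100, n_features=4, noise=5, random_state=0)
    p = X.shape[1]

    results = []
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):     # every non-empty combination of features
            cols = list(subset)
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            results.append((subset, r2))
    print(len(results))                              # 2^4 - 1 = 15 models fitted for p = 4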
Descriptive Modeling
Focuses on what happened in the past and aims to describe or explain observed trends by identifying the relationships between different variables
Predictive Modeling
Focuses on what will happen in the future and is concerned with making accurate predictions
Difference between average and centroid linkage
For average linkage, first compute all pairwise distances between observations in the two clusters, then take the average. For centroid linkage, average the features within each cluster to get the two centroids, then take the distance between the centroids.
Random forest
Generating multiple bootstrapped (with replacement) samples and fitting base trees in parallel, independently on each of the bootstrapped samples. Then combine the results from the base trees by taking an average for a numeric target or the most commonly occurring class for a categorical target.
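A minimal sketch of the bagging step by hand on simulated data (50 trees and square-root feature sampling are arbitrary illustrative choices):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(300, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=300)

    trees = []
    for _ in range(50):
        idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample (with replacement)
        trees.append(DecisionTreeRegressor(max_features="sqrt").fit(X[idx], y[idx]))

    pred = np.mean([t.predict(X) for t in trees], axis=0)     # average the base trees (numeric target)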
Why classification error rate is less sensitive to node impurity than Gini index and entropy
Gini and entropy are differentiable with respect to the class proportions and therefore are more amenable to numerical optimization
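A minimal sketch of the three impurity measures for a binary target, written as functions of the positive-class proportion p:

    import numpy as np

    def gini(p):                 # smooth and differentiable in p
        return 2 * p * (1 - p)

    def entropy(p):              # also smooth and differentiable
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def class_error(p):          # piecewise linear, hence less sensitive to changes in p
        return np.minimum(p, 1 - p)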
How clustering can be used to generate features
Group assignments can be used as a factor variable in place of original variables. Cluster centers can replace original variables and retain numeric features to be used for interpretation and prediction.
Cost-complexity pruning
Grow a very large tree, then prune it back to get a smaller subtree, trading off the goodness of fit of the tree to the training data against the size of the tree using the penalized objective function.
Complexity parameter affect on the quality of a decision tree
If cp = 0, the fitted tree is identical to the large, overgrown tree. As cp increases, branches are snipped off retroactively, forming larger terminal nodes; bias increases and variance decreases. If cp = 1, the tree is pruned back to the root node only.
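A minimal sketch with scikit-learn on simulated data; its ccp_alpha plays the role of the complexity parameter, although it is not rescaled to [0, 1] the way rpart's cp is:

    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=200, n_features=3, noise=5, random_state=0)

    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    for cp in path.ccp_alphas[::10]:                          # a few values along the pruning path
        tree = DecisionTreeRegressor(ccp_alpha=cp, random_state=0).fit(X, y)
        print(round(cp, 2), tree.get_n_leaves())              # larger cp => smaller tree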
Time variable on set split
If the behavior of the target variable over time is of interest, then split the data by time, with older observations forming the training set and newer observations forming the test set.
*Why decision trees tend to favor categorical predictors with many levels
With many levels there are far more possible ways to split the data, so it is easier to find a seemingly good split based on a multi-level categorical predictor, possibly by chance.
Differences between K-means and hierarchical
K-means needs randomization, K prespecified, and clusters are not nested. Hierarchical does not need randomization, does not need K prespecified, and clusters are nested.
Using oversampling to balance data
Keeps all original data but samples with replacement from the positive class to reduce the imbalance between the two classes.
Considerations for choice of mtry parameter of a random forest
Larger mtry = less variance reduction, as the base trees will be more similar to one another. Too small = each split may have too little freedom in choosing the split variable.
Statistical method typically used to estimate the parameters of a GLM
Maximum likelihood estimation. Choose parameter estimates in such a way to maximize the likelihood of observing the given data
How centering and scaling affects the results of PCA
Mean centering often does not have any effect, but scaling has a great effect. If no scaling is done and the variables are of vastly different orders of magnitude, then those with unusually large variance will dominate the principal components.
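A minimal sketch on simulated data in which one variable has a much larger scale than the others:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = np.column_stack([rng.normal(scale=1000, size=100),    # variable with unusually large variance
                         rng.normal(scale=1, size=100),
                         rng.normal(scale=1, size=100)])

    print(PCA().fit(X).explained_variance_ratio_)             # PC1 is dominated by the large-variance variable
    print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)   # far more balanced after scaling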
Problem with RSS and R2 as model selection measures
Merely goodness-of-fit measures with no regard to model complexity or prediction performance. Adding more predictors will always decrease RSS and increase R2, respectively.
High bias
Model won't be accurate because it doesn't have the capacity to capture the signal in the data
Disadvantages of polynomial regression
More difficult to interpret. No simple way to choose the degree of the polynomial.
Importance of setting a cutoff for a binary classifier
Must be specified in order to translate predicted probabilities into the predicted classes.
Unsupervised Learning
No target variable; we are interested in extracting relationships and structures between different variables in the data.
Hyperparameter
Parameters that control some aspect of the fitting process (aka tuning parameters). Their values must be supplied in advance.
Importance of hyperparameters
They play a role in the mathematical expression of the objective function or in the set of constraints defining the optimization problem.
Elbow method
Plot of the proportion of variation explained as a new cluster is added. Once the plot plateaus, the elbow is reached and the corresponding value of K provides an appropriate number of clusters to segment the data.
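A minimal sketch on simulated data with three true clusters:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    for k in range(1, 8):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        print(k, round(1 - inertia / total_ss, 3))   # proportion of variation explained plateaus after K = 3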
Variance
Precision. Quantifies the amount by which the fitted function would change if it were estimated using a different training set. More flexible model = higher variance because it matches the training observations more closely and is therefore more sensitive to the training data.
How a monotone transformation on a numeric predictor and on the target variable affect a decision tree
For a numeric predictor, it produces the exact same tree because the split points are based on the ranks of the feature values, not their actual values. A monotone transformation of the target variable, however, generally changes the fitted tree because the splits and node predictions depend on the actual values of the target.
Specificity
Proportion of negative observations that are correctly classified as negative
Sensitivity
Proportion of positive observations that are correctly classified as positive
Variables
Raw measurement that is recorded and constitutes the original dataset before any transformations are applied
Data quality issues to examine
Reasonableness: do key metrics make sense; consistency: the same basis and rules are applied to all values; sufficient documentation: so others can easily gain an accurate understanding of different aspects of the data
Reasons for reducing granularity
Reduces the complexity of the model, makes model results more intuitive, and increases the number of observations per level of the variable, which smooths out trends and reduces the likelihood of overfitting to noise.
Granularity
Refers to how precisely a variable is measured. Examples in decreasing order of granularity: address, postal code, state, country.
Lambda approaches infinity
Regularization penalty dominates and becomes so great that the estimates of the slope coefficients have no choice but to be 0.
When lambda =0
Regularization penalty vanishes and the coefficient estimates are identical to the ordinary least squares estimates.
Reason to use oversampling instead of undersampling
Retains more information about the positive class
Difference between random forest and boosted trees
RF base trees are fitted independently (in parallel) while boosted trees are fitted in series; RF primarily reduces variance while boosting primarily reduces bias.
How cutoff of a binary classifier affects sensitivity and specificity
Selection involves a trade-off between high sensitivity and high specificity. When the cutoff is 0, everything is classified as positive, so sensitivity is 1; when the cutoff is 1, essentially everything is classified as negative, so specificity is 1.
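A minimal sketch with made-up labels and predicted probabilities:

    import numpy as np

    y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
    y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.7, 0.2, 0.55, 0.9, 0.05])

    for cutoff in [0.0, 0.25, 0.5, 0.75, 1.0]:
        y_pred = (y_prob >= cutoff).astype(int)
        sens = (y_pred[y_true == 1] == 1).mean()   # sensitivity: positives correctly classified
        spec = (y_pred[y_true == 0] == 0).mean()   # specificity: negatives correctly classified
        print(cutoff, sens, spec)                  # raising the cutoff trades sensitivity for specificity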
How a partial dependence plot improves a variable importance plot
Shows the directional relationship between predictors and target variables
Why oversampling must be performed after splitting set
Since oversampling samples with replacement, some positive-class observations could appear in both the training and test sets, so the test set would not be truly unseen to the trained classifier.
Target leakage
Some predictors in a model include (leak) information about the target variable that will not be available when the model is applied in practice. They are typically strongly related to the target, but their values are not known until the target is observed.
Dimensionality
Specific to categorical variables. Variables can always be ordered by counting the number of factor levels to see which variable has more levels.
Top-down, greedy algorithm
Start from the top of the tree and go down, sequentially partitioning the data by adopting the split that is currently the best, i.e., the one leading to the greatest reduction in impurity at that point, instead of looking ahead.
Differences b/n stepwise selection and regularization
Stepwise selection goes through a list of candidate models fitted by least squares and decides on a model with respect to a certain selection criterion. Regularization considers a single model hosting all potentially useful features, fit using an unconventional method that shrinks the coefficient estimates towards zero.
Stratified compared to random sampling
Stratified sampling divides the population into a number of groups (strata) in a non-random fashion and then randomly samples a set number of observations from each group, ensuring that each group is properly represented. Random sampling randomly draws the desired number of observations from the whole population without replacement.
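A minimal sketch contrasting the two approaches on simulated data with a rare positive class:

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = (rng.uniform(size=1000) < 0.1).astype(int)   # roughly 10% positives

    _, _, _, y_rand = train_test_split(X, y, test_size=0.25, random_state=0)               # purely random
    _, _, _, y_strat = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)  # stratified on y
    print(y_rand.mean(), y_strat.mean())   # the stratified split reproduces the 10% rate almost exactly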
Bias
The expected loss arising from the model not being complex/flexible enough to capture the underlying signal
Why can't add/drop multiple features at a time in stepwise selection
The significance of a feature can be greatly affected by the presence or absence of other features due to their correlations.
Why are lambda and alpha hyperparameters
They are pre-specified inputs that are not determined as part of the optimization procedure.
Principal components analysis
Transforms high dimensional data set into a much smaller, more manageable set of representative variables that are mutually uncorrelated.
Using under sampling to balance data
Undersampling draws fewer observations from the negative class and retains all of the positive class, improving the classifier's ability to pick up the signal leading to the positive class. The drawback is that the classifier is now based on less data and therefore less information about the negative class, which could lead to overfitting.
Advantage of using time variable to split set
Useful for evaluating how well a model extrapolates time trends observed in the past to future, unseen years.
Difference between weights and offsets
With weights, the target variable is the average over the exposure units; with offsets, the target variable is the aggregate value over the exposure units. Weights do not affect the mean of the target (they affect its variance); offsets do not affect the variance (they affect the mean).
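A minimal sketch with statsmodels of the two equivalent ways exposure can enter a Poisson GLM; the simulated claims and exposure data are placeholders:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    exposure = rng.uniform(0.5, 2.0, size=200)
    x = rng.normal(size=200)
    claims = rng.poisson(exposure * np.exp(0.3 * x))       # aggregate counts over the exposure

    X = sm.add_constant(x)
    # Offset: model the aggregate counts, adding log(exposure) to the linear predictor.
    fit_offset = sm.GLM(claims, X, family=sm.families.Poisson(), offset=np.log(exposure)).fit()
    # Weights: model the average (claims per unit of exposure) with exposure as the weight.
    fit_weights = sm.GLM(claims / exposure, X, family=sm.families.Poisson(), var_weights=exposure).fit()
    print(fit_offset.params, fit_weights.params)           # essentially identical coefficient estimates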
How scaling affects hierarchical
Without scaling, if one variable has a much larger magnitude, it will dominate the distance calculations and exert a disproportionate impact on the cluster assignments.