QUIN 499: Week 6
Reasons for overfitting
- Data used for training is not cleaned and contains noise
- model has high variance
- size of the training dataset is not large enough
- model is too complex
Reasons for underfitting
- Data used for training is not cleaned and contains noise (garbage values)
- model has high bias
- size of the training dataset is too small
- model is too simple
Tackle underfitting
- increase # of features in the dataset
- increase model complexity
- reduce noise in the data
- increase the duration of training
Tackle overfitting
- use K-fold cross-validation
- use regularization techniques such as Lasso and Ridge regression (see the sketch below)
- train the model w/ sufficient data
- adopt ensemble techniques
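A minimal sketch of the regularization idea, assuming scikit-learn and a synthetic dataset (the alpha values and data are illustrative, not from the notes):

    # Compare an unregularized linear model with Ridge and Lasso on held-out data
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.model_selection import train_test_split

    # Synthetic data with many features relative to samples, which invites overfitting
    X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for name, model in [("OLS", LinearRegression()),
                        ("Ridge", Ridge(alpha=1.0)),   # L2 penalty
                        ("Lasso", Lasso(alpha=0.1))]:  # L1 penalty
        model.fit(X_train, y_train)
        # R^2 on the test split; the regularized models typically hold up better
        print(name, round(model.score(X_test, y_test), 3))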
Automated hyperparameter tuning example
Grid & Random Search
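A minimal sketch of automated tuning via random search, assuming scikit-learn; the estimator, parameter range, and dataset are illustrative:

    # Random search: sample hyperparameter values from a distribution and keep the best
    from scipy.stats import uniform
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV

    X, y = load_iris(return_X_y=True)
    search = RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        param_distributions={"C": uniform(0.01, 10)},  # inverse regularization strength
        n_iter=20,      # number of random trials
        cv=5,           # 5-fold cross-validation per trial
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)

Grid search enumerates an explicit grid instead of sampling; the SVM example further below (under hyperparameter tuning examples) shows that variant.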
K-means clustering execution
Initialization → Assignment step → Update step → Repeat
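A from-scratch sketch of those four steps with NumPy (illustrative only; a library routine such as sklearn.cluster.KMeans would normally be used):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization: pick k data points as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: each point goes to the nearest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            # Repeat until convergence (centroids stop moving) or n_iters is reached
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids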
Advantages of K-means clustering: Simplicity & Efficiency
K-means is easy to understand & implement. Highly efficient in terms of computational cost, making it suitable for large datasets
Ensemble methods
ML technique that combines multiple models to improve overall performance, stability, & accuracy of predictive analytics
K-fold Cross-validation steps in model selection: step 3
MSE is computed on the observations in the held-out fold
Overfit model
ML model learns details and noise in the training data such that it negatively affects the performance of the model on test data
Advantages of K-means clustering
Simplicity & Efficiency, Adaptability, Scalability
K-means clustering
a method used to partition a set of observations into a predefined number of clusters, k; each observation belongs to the cluster w/ the nearest mean, which serves as a prototype of the cluster
Model selection: predictive error
all models have some predictive error, given statistical noise in data, incompleteness of data sample, and limitations of each model type
Bagging & boosting
are ensemble methods
K-means clustering execution: Assignment step
assign each data point to the nearest centroid; "nearest" is usually defined using Euclidean distance
What does bagging stand for
bootstrap aggregating
Advantages of K-means clustering: Adaptability
can adapt to new examples and is useful in a wide range of applications, from market segmentation to organizing computing clusters
K-means clustering execution: Initialization
choose 'k' initial centroids
What does bagging do
trains multiple models on bootstrap samples of the data and aggregates their predictions
Model selection steps
data filtering, data transformation, feature selection, feature engineering
Manual hyperparameter tuning
experimenting w/ different sets of hyperparameters manually using the trial & error method
K-fold Cross-validation steps in model selection: step 2
first fold is treated as validation set and method is fit on the remaining k - 1 folds
Model selection choice
- good enough, not best
- stakeholders may have specific requirements (maintainability & limited model complexity)
- a model that has lower skill but is simpler & easier to understand may be preferred
- if model skill is prized above all other concerns, then the ability of the model to perform well on out-of-sample data will be preferred regardless of the computational complexity involved
Clustering data
grouped into meaningful clusters of observations
Classification input data
has labels (x, y)
Clustering input data
has no labels
K-fold Cross-validation steps in model selection: step 1
involves randomly dividing the set of observations into k groups, or folds, of approximately equal size
What is being chosen during model selection
either just the algorithm used to fit the model or the entire data preparation and model fitting pipeline
K-fold Cross-validation results in model selection
k estimates of the test error; the k-fold CV estimate is computed by averaging these values
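A minimal sketch of using k-fold CV estimates to compare candidate models, assuming scikit-learn; the models and dataset are illustrative:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)
    for name, model in [("Linear", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
        # cv=5: each fold serves once as the held-out validation set
        scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
        # The k-fold CV estimate is the average of the k per-fold MSE values
        print(name, round(-scores.mean(), 2))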
Clustering importance
labeling data can be expensive/time consuming to undertake
Goal of k-means clustering
minimize the within-cluster variances (squared Euclidean distances), though it is not guaranteed to find the global optimum
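Written out, the objective is the following (a standard formulation; the notation is assumed here, not from the notes), where S_1, ..., S_k are the clusters and mu_i is the mean of cluster S_i:

    \arg\min_{S_1,\dots,S_k} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2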
Model selection criteria
model skill, how long the model takes to train, how easy it is to explain
Good enough model selection meaning
- model that meets the req'ts & constraints of project stakeholders
- model that is sufficiently skillful given the time and resources available
- model that is skillful compared to naive models
- model that is skillful relative to other tested models
- model that is skillful relative to state-of-the-art
How boosting works
models are not trained independently but sequentially, with each new model being added to the ensemble in a way that improves the overall performance
Automated hyperparameter tuning
the optimal set of hyperparameters is found by using an algorithm
Classification different from regression
output variable is categorical
Hyperparameter tuning examples
- penalty (regularization) of the Logistic Regression classifier
- learning rate for training a neural network
- C & sigma hyperparameters for support vector machines (see the sketch below)
- k in k-nearest neighbors
- linear regression doesn't have any hyperparameters
- variants of linear regression (Ridge & Lasso) have regularization
- decision trees have max depth and min number of observations in leaves
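A minimal sketch of grid search over the SVM's C and gamma (the kernel-width parameter related to sigma), assuming scikit-learn; the grid values and dataset are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    # In scikit-learn the RBF kernel is parameterized by gamma (roughly 1 / (2 * sigma^2))
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, round(grid.best_score_, 3))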
Underfit model
poor performance on training data and will result in unreliable predictions
K-fold Cross-validation steps in model selection: step 4
procedure is repeated k times; each time, a different group of observations is treated as the validation set
Model selection
process of choosing one of the models as the final model that best addresses the problem
K-means clustering execution: Update Step
recalculate the centroid as the mean of all data points assigned to that centroid's cluster
Boosting
reduces bias (& also variance) by sequentially training models, each compensating for the weaknesses of its predecessors, to improve overall model accuracy (ex: AdaBoost)
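A minimal sketch of boosting with AdaBoost, assuming scikit-learn; the dataset and settings are illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    # Weak learners (decision stumps by default) are added one at a time,
    # each weighted toward the examples its predecessors got wrong
    boost = AdaBoostClassifier(n_estimators=100, random_state=0)
    print(round(cross_val_score(boost, X, y, cv=5).mean(), 3))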
Bagging
reduces variance and avoids overfitting by training multiple models in parallel, each on a random subset of the data (ex: Random Forest)
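A minimal sketch of bagging with a Random Forest, assuming scikit-learn; the dataset and settings are illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    # Each tree is trained independently on a bootstrap sample of the data;
    # predictions are aggregated by majority vote
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    print(round(cross_val_score(forest, X, y, cv=5).mean(), 3))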
Bagging: aggregating meaning
refers to the process of combining multiple models
K-means clustering execution: repeat
repeat the assignment & update steps until convergence or set number of iterations is reached
Manual hyperparameter tuning results
results of each trial are tracked and used as feedback to obtain a combo of hyperparameters that yields the highest model performance
Advantages of K-means clustering: Scalability
scales well to large numbers of samples and has been extended by several methods for big-data scenarios
Bagging: bootstrap meaning
statistical method for sampling w/ replacement
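A minimal sketch of drawing a bootstrap sample (sampling with replacement) using NumPy; the data is illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.arange(10)
    # A bootstrap sample is the same size as the original set but may repeat items
    sample = rng.choice(data, size=len(data), replace=True)
    print(sample)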
Classification
supervised learning technique
Clustering result
supervised learning techniques can be applied on clustered data
Automated hyperparameter tuning method
technique that involves methods in which the user defines a set of hyperparameter combos or a range for each hyperparameter & the tuning algorithm runs the trials to find the optimal set of hyperparameters for the model
Clustering
unsupervised learning technique
Underfitting
when a model has not learned the patterns in the training data well and is unable to generalize well on new data
Overfitting
when model performs well on training data but has poor performance w/ test data