QUIN 499: Week 6


Reasons for overfitting

- Data used for training is not cleaned and contains noise
- Model has high variance
- Size of training dataset used is not enough
- Model is too complex

Reasons for underfitting

- Data used for training is not cleaned and contains noise (garbage values)
- Model has high bias
- Size of training dataset used is too small
- Model is too simple

Tackle underfitting

- Increase # of features in dataset
- Increase model complexity
- Reduce noise in data
- Increase duration of training

Tackle overfitting

- Use K-fold cross-validation
- Use regularization techniques such as Lasso and Ridge regression (see the sketch below)
- Train model with sufficient data
- Adopt ensemble techniques
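
A minimal sketch of two of these tactics, assuming scikit-learn and a synthetic dataset (not from the cards): Ridge and Lasso regularization, each scored with 5-fold cross-validation.

```python
# Illustrative only: regularized linear models evaluated with K-fold CV.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    # scikit-learn reports negative MSE, so flip the sign for readability
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(type(model).__name__, "mean CV MSE:", -scores.mean())
```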

Automated hyperparameter tuning example

Grid & Random Search
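
An illustrative sketch of both approaches using scikit-learn's GridSearchCV and RandomizedSearchCV; the SVM dataset and search ranges are placeholders, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: try every combination the user defines
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print("Grid search best:", grid.best_params_)

# Random search: sample a fixed number of trials from a range/distribution
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20,
    cv=5,
    random_state=0,
)
rand.fit(X, y)
print("Random search best:", rand.best_params_)
```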

K-means clustering execution

Initialization → Assignment Step → Update Step → Repeat
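
A minimal NumPy sketch of these four steps (illustrative only, not a production implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: choose k initial centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until convergence (centroids stop moving) or the iteration limit
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs, so k=2 recovers them
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```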

Advantages of K-means clustering: Simplicity & Efficiency

K-means is easy to understand & implement; it is highly efficient in terms of computational cost, making it suitable for large datasets

Ensemble methods

ML technique that combines multiple models to improve overall performance, stability, & accuracy of predictive analytics

K-fold Cross-validation steps in model selection: step 3

MSE is computed on the observations in the held-out fold

Overfit model

ML model learns the details and noise in the training data to the extent that it negatively affects the performance of the model on test data

Advantages of K-means clustering

Simplicity & Efficiency, Adaptability, Scalability

K-means clustering

a method used to partition a set of observations into a predefined number of clusters, k; each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster
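
A short usage sketch with scikit-learn's KMeans; k=3 and the blob dataset are illustrative choices, not from the cards.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # one prototype (mean) per cluster
print(km.labels_[:10])      # each observation assigned to the cluster with the nearest mean
```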

Model selection: predictive error

all models have some predictive error, given statistical noise in data, incompleteness of data sample, and limitations of each model type

Bagging & boosting

are ensemble methods

K-means clustering execution: Assignment step

assign each data point to the nearest centroid; "nearest" is usually defined using Euclidean distance

What does bagging stand for

bootstrap aggregating

Advantages of K-means clustering: Adaptability

can adapt to new examples and is useful in a wide range of applications, from market segmentation to organizing computing clusters

K-means clustering execution: Initialization

choose 'k' initial centroids

What does bagging do

trains multiple models on bootstrap samples and aggregates their predictions (the name combines the terms "bootstrap" and "aggregating")

Model selection steps

data filtering, data transformation, feature selection, feature engineering

Manual hyperparameter tuning

experimenting w/ different sets of hyperparameters manually using the trial & error method

K-fold Cross-validation steps in model selection: step 2

first fold is treated as validation set and method is fit on the remaining k - 1 folds

Model selection choice

Good enough, not best: stakeholders may have specific requirements (e.g., maintainability & limited model complexity), so a model that has lower skill but is simpler & easier to understand may be preferred. If model skill is prized above all other concerns, then the ability of the model to perform well on out-of-sample data will be preferred regardless of the computational complexity involved.

Clustering data

observations are grouped into meaningful clusters

Classification input data

has labels (x, y)

Clustering input data

has no labels

K-fold Cross-validation steps in model selection: step 1

involves randomly dividing the set of observations into k groups, or folds, of approximately equal size

What is being chosen during model selection

either just the algorithm used to fit the model or the entire data preparation and model fitting pipeline

K-fold Cross-validation results in model selection:

k estimates of the test error; the k-fold CV estimate is computed by averaging these values
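
A minimal sketch of steps 1-4 above, assuming scikit-learn, a linear regression model, and synthetic data: fit on k-1 folds, compute MSE on the held-out fold, repeat k times, then average the k estimates.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, val_idx in kf.split(X):
    # Fit on the k-1 training folds, score on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print("per-fold MSE:", np.round(fold_mse, 2))
print("k-fold CV estimate (mean):", np.mean(fold_mse))
```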

Clustering importance

labeling data can be expensive/time consuming to undertake

Goal of k-means clustering

minimize the within-cluster variances (squared Euclidean distances), though it doesn't guarantee finding the global optimum
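
A standard way to write that objective (notation added here, not from the cards): S_i are the clusters and mu_i their means.

```latex
% Within-cluster sum of squares that k-means heuristically minimizes
\operatorname*{arg\,min}_{S} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
```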

Model selection criteria

model skill, how long the model takes to train, and how easy it is to explain

Good enough model selection meaning

- model that meets requirements & constraints of project stakeholders
- model that is sufficiently skillful given the time and resources available
- model that is skillful compared to naive models
- model that is skillful relative to other tested models
- model that is skillful relative to the state-of-the-art

How boosting works

models are not trained independently but sequentially, with each new model being added to the ensemble in a way that improves the overall performance

Automated hyperparameter tuning

the optimal set of hyperparameters is found by using an algorithm

Classification different from regression

output variable is categorical

Hyperparameter tuning examples

- regularization penalty of the logistic regression classifier
- learning rate for training a neural network
- C & sigma hyperparameters for support vector machines
- k in k-nearest neighbors
- linear regression doesn't have any hyperparameters, but its variants (ridge & lasso) have regularization
- decision trees have max depth and min number of observations in leaves
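
A hedged illustration of where the hyperparameters listed above appear in scikit-learn constructors; all values are placeholders. Note that scikit-learn's RBF kernel exposes gamma, which plays the role of sigma on this card.

```python
from sklearn.linear_model import Lasso, LogisticRegression, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

logreg = LogisticRegression(penalty="l2", C=1.0)   # regularization penalty / strength
mlp = MLPClassifier(learning_rate_init=0.001)      # learning rate for a neural network
svm = SVC(C=1.0, gamma=0.1)                        # C & gamma (sigma analogue)
knn = KNeighborsClassifier(n_neighbors=5)          # k in k-nearest neighbors
ridge, lasso = Ridge(alpha=1.0), Lasso(alpha=0.1)  # regularized linear regression variants
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)  # depth / min observations per leaf
```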

Underfit model

poor performance on training data and will result in unreliable predictions

K-fold Cross-validation steps in model selection: step 4

procedure is repeated k times; each time, a different group of observations is treated as the validation set

Model selection

process of choosing one of the models as the final model that best addresses the problem

K-means clustering execution: Update Step

recalculate the centroid as the mean of all data points assigned to that centroid's cluster

Boosting

reduce bias (& also variance) by sequentially training models, each compensating for the weaknesses of the predecessors, to improve the overall model accuracy (AdaBoost)
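
A minimal scikit-learn sketch of boosting with AdaBoost (the example named on this card); the synthetic dataset and settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Weak learners are added sequentially, each reweighting the examples
# the previous ones got wrong
boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("AdaBoost test accuracy:", boost.score(X_te, y_te))
```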

Bagging

reduces variance and avoids overfitting by training multiple models in parallel, each on a random subset of the data (ex: Random Forest)
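
A minimal scikit-learn sketch of bagging via Random Forest (the example named on this card); the dataset and settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Many trees trained in parallel, each on a bootstrap sample of the data,
# with their predictions aggregated by voting
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("Random Forest test accuracy:", forest.score(X_te, y_te))
```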

Bagging: aggregating meaning

refers to the process of combining multiple models

K-means clustering execution: repeat

repeat the assignment & update steps until convergence or set number of iterations is reached

Manual hyperparameter tuning results

results of each trial are tracked and used as feedback to obtain a combo of hyperparameters that yield highest model performance

Advantages of K-means clustering: Scalability

scales well to large numbers of samples and has been extended by several methods for big data scenarios

Bagging: bootstrap meaning

statistical method for sampling w/ replacement
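
A tiny NumPy sketch of a bootstrap sample (sampling with replacement); the data values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # some values repeat, others are left out
```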

Classification

supervised learning technique

Clustering result

supervised learning techniques can be applied on clustered data

Automated hyperparameter tuning method

technique that involves methods in which the user defines a set of hyperparameter combos or a range for each hyperparameter & the tuning algorithm runs the trials to find the optimal set of hyperparameters for the model

Clustering

unsupervised learning technique

Underfitting

when a model has not learned the patterns in the training data well and is unable to generalize well on new data

Overfitting

when model performs well on training data but has poor performance w/ test data

