q5: Machine Learning; Supervised and Unsupervised Learning
highest avg AUC and lowest standard deviation in AUC
What's the selection criterion when comparing models?
asking the right question
defining the situation requires ______ which can be difficult
estimate cost-benefit
if your model is actionable consider the _____ of building automated process
true
imbalanced distribution cannot just use accuracy, also need precision / recall
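ridge regression
penalizes the squared magnitude of coefficients (L2 penalty); shrinks coefficients toward zero but rarely to exactly zero
lasso regression
penalizes the absolute magnitude of coefficients (L1 penalty); drives some coefficients to exactly zero, which reduces the num of non-zero coefficients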
regularization
provides some combo of fit and simplicity
Lasso regression (auto feature selection)
L1-norm + standard OLS
Ridge regression
L2-norm + standard OLS
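A minimal sketch of the two penalized objectives, assuming the usual formulation (sum of squared errors plus lambda times the penalty). The function name `penalized_loss`, the toy data, and the `lam` value are illustrative assumptions, not part of the original cards or of glmnet's API.

```python
def penalized_loss(X, y, coefs, intercept, lam, penalty):
    """Sum of squared errors plus an L1 (lasso) or L2 (ridge) penalty."""
    sse = 0.0
    for row, target in zip(X, y):
        pred = intercept + sum(b * x for b, x in zip(coefs, row))
        sse += (target - pred) ** 2
    if penalty == "lasso":
        reg = sum(abs(b) for b in coefs)   # L1 norm of the coefficients
    elif penalty == "ridge":
        reg = sum(b * b for b in coefs)    # squared L2 norm of the coefficients
    else:
        raise ValueError("penalty must be 'lasso' or 'ridge'")
    return sse + lam * reg


# Hypothetical toy data: 3 rows, 2 features.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 8.0]
coefs = [2.0, 1.0]

print(penalized_loss(X, y, coefs, 0.0, 0.5, "lasso"))  # 3.5
print(penalized_loss(X, y, coefs, 0.0, 0.5, "ridge"))  # 4.5
```

The fit term (2.0 here) is the same in both cases; only the penalty on the coefficients differs.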
similarity matching
Recommending products or services to customers based on their similarity to other customers (e.g., Netflix movie recommendation)
Cluster assignment
Step 2: ________. Go through each point and assign them to one of the 2 cluster centroids
Move centroid
Step 3: _________. Move the 2 cluster centroids to the average of the points colored in the same color
KNN Imputation
imputes based on "similar" non-missing rows
unsupervised learning
opportunity to understand problem better; type of exploratory analysis, which can lead to insightful discoveries
random forest
- slower to fit than glmnet - less interpretable - often more accurate than glmnet - easier to tune - require little preprocessing - capture threshold effects and variable interactions
elbow method
Choose k, such that there are diminishing returns beyond that point. look at the shape of the curve, cut off where it tapers
machine learning
Field of study that gives computers the ability to learn without being explicitly programmed; a computer program is said to learn from experience E with respect to task T and some performance measure P, if its performance on T, as measured by P, improves with experience E
overfitting
Finding chance occurrences in the training dataset that look like interesting patterns, but that do not generalize to the population of interest; the tendency of tailoring models to the training dataset, at the expense of generalization to previously unseen data points
co-occurrence grouping
Grouping things together based on their co-occurrences (e.g., Target placing products that are bought together next to each other on shelves)
we can be confident that the model is doing well even if we have very skewed classes
If the model has high precision and high recall, then
PCA - minimizes projection error (perpendicular distance from each point to the projection line); OLS - minimizes residuals (vertical distance between data and regression line)
PCA vs OLS
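A small two-dimensional sketch of the contrast, under stated assumptions: OLS fits the slope that minimizes vertical distances, while the first principal component is the direction that minimizes perpendicular distances. The function names and toy data are hypothetical, and the closed-form angle only applies to the 2-D case.

```python
import math


def _centered_sums(xs, ys):
    """Return the centered sums of squares and cross-products."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def ols_slope(xs, ys):
    """Slope that minimizes vertical (y-direction) squared distances."""
    sxx, _, sxy = _centered_sums(xs, ys)
    return sxy / sxx


def pca_slope(xs, ys):
    """Slope of the first principal axis (minimizes perpendicular distances)."""
    sxx, syy, sxy = _centered_sums(xs, ys)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the major axis
    return math.tan(theta)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
print(ols_slope(xs, ys))  # about 0.8
print(pca_slope(xs, ys))  # about 1.0
```

On this toy data the two lines differ because OLS treats y as the target, whereas PCA treats both variables symmetrically.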
centroids will not change any further and the color of the points will not change any further
Repeat steps 2 and 3 until
cluster centroids
Step 1: Given our choice of K = 2, randomly initialize 2 points called the ______
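The K-means steps on these cards (initialize centroids, assign points, move centroids, repeat until nothing changes) can be sketched roughly as follows. This is a plain-Python illustration, not a library implementation; picking the initial centroids from the data points and the names `kmeans` and `sq_dist` are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly initialize k centroids (here: k distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: cluster assignment, each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Stop when the assignments (and therefore the centroids) no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the average of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Different seeds can give different cluster numbering, and on less cleanly separated data they can also give different clusterings, which is why K-means is often run from several random starts.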
1. Classifying Twitter users as bots or not bots (2 is experience, 3 is performance)
Suppose your bot detection program watches which Twitter accounts you do or do not mark as bot and based on that learns how to better filter bots. What is the task T in this setting? 1. Classifying Twitter users as bots or not bots 2. Watching you label Twitter users as bots or not bots 3. The number of Twitter accounts correctly classified as bots/not bots 4. None of the above
Because otherwise you wouldn't be doing a fair comparison of your models and your results could be due to chance
What's the primary reason that train/test indices need to match when comparing two models?
anomaly detection
Identifying which observations fall outside the discovered "regular pattern"; the result can be used as an input in supervised learning (e.g., amount spent and fraud detection)
being able to accurately and precisely predict future events is valuable
Why ML?
- So you can use the same summaryFunction and tuning parameters for multiple models. - So you don't have to repeat code when fitting multiple models. - So you can compare models on the exact same training and test data.
Why reuse a trainControl?
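The idea of comparing models on identical train/test splits can be illustrated outside caret. The sketch below is a plain-Python analogue, not caret's `trainControl` API: `make_folds` is a hypothetical helper that builds the fold indices once from a fixed seed so every model is evaluated on the same rows.

```python
import random


def make_folds(n_rows, k, seed):
    """Shuffle row indices once and split them into k held-out folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Build the folds once, then pass the same object to every model being compared.
folds = make_folds(n_rows=10, k=5, seed=42)

for fold_number, test_idx in enumerate(folds, start=1):
    train_idx = [i for i in range(10) if i not in test_idx]
    # Each candidate model would be fit on train_idx and scored on test_idx here.
    print(fold_number, sorted(test_idx), len(train_idx))
```

Because both models see exactly the same held-out rows, a difference in their scores is less likely to be an artifact of a lucky or unlucky split.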
it gives you more fine-grained control over the tuning parameters that are explored
Why use a custom tuneGrid?
The default tuning grid is very small and there are many more potential glmnet models you want to explore.
Why use a custom tuning grid for a glmnet model?
- most flexible method for fitting caret models - complete control over how the model is fit
adv of custom tuning
- fits quickly - ignores noisy variables - provides interpretable coefficients
adv of glmnet
median imputation (replace missing values with medians, works well if data missing at random)
best way to deal with missing values
Situation, Opportunity, Action
busn requirements
principal components analysis
combines low-variance and correlated variables into a single set of high-variance, perpendicular predictors; prevents collinearity; 1st component has the highest variance
- ML first - not enough data (need 1,000 - 10,000) - target variable definition (how to measure observation? what are you trying to predict?) - late testing, no impact - feature selection
common mistakes
median imputation --> center --> scale --> fit glm
common recipe for linear models (order matters)
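A rough sketch of the first three steps of that recipe for a single numeric column, assuming missing values are represented as `None` and that scaling uses the sample standard deviation. The function name is hypothetical and the model-fitting step is left out.

```python
import statistics


def preprocess_column(values):
    """Median-impute None entries, then center and scale one numeric column."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    # 1. Median imputation: must happen first, since mean and sd need complete data.
    filled = [median if v is None else v for v in values]
    # 2. Center: subtract the column mean.
    mean = statistics.mean(filled)
    # 3. Scale: divide by the sample standard deviation.
    sd = statistics.stdev(filled)
    return [(v - mean) / sd for v in filled]


column = [1.0, None, 3.0, 5.0, 3.0]
scaled = preprocess_column(column)
print(scaled)
```

The order matters because centering and scaling statistics cannot be computed on a column that still contains missing values.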
- easy to get humans to label - draw on human intuition to improve model - use human-level performance to set a desired error rate
compare results to human-level performance
inference
concerned with understanding the drivers of a busn outcome (interpretable but less accurate)
- classical programming is rule based; we give data and rules to get ans - ML uses data and ans to learn the rules
difference bt classical programming and ML
- requires some knowledge of the model - can dramatically inc run time
disadv of custom tuning
glmnet
extension of glm models with built-in variable selection; helps deal with collinearity and small sample sizes; two primary forms: lasso and ridge regression; attempts to find a parsimonious model
Principal Components Analysis (PCA)
goal is to reduce from n-dimension to k-dimension, meaning you will need to find k vectors onto which to project the data, as to minimize the projection error
- decrease num of features: manually select which features to keep, or choose an algorithm that auto-selects features for you - regularization: keep all features, but decrease the magnitude of the parameters
how do you address overfitting
error analysis
manually look at the classifications or estimated values that your model gets wrong; do you see a pattern?
prediction
main goal; not easily interpretable but more accurate
- How well the model predicts new data (not how well it fits the data it was trained with) - Difference between actual outcome and predicted outcome (i.e., error) - RMSE: how much the predictions diverge from the actual values on avg (regression; numeric measure) - accuracy: (true positives + true negatives) / total (classification)
model evaluation
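Both metrics on this card are short enough to write out directly. The sketch below uses hypothetical toy values; in practice the actual and predicted values would come from a held-out test set, not the training data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for numeric predictions."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def accuracy(actual, predicted):
    """Share of correct classifications: (TP + TN) / total."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # sqrt(5/3), roughly 1.29
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))    # 0.75
```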
supervised learning [aka regression (price of houses)/classification (cancer malignant or benign) problem]
most common ML problem
K-means clustering
most popular, by far most widely used clustering algorithm
mtry
num randomly selected variables used at each split; lower value = more random, higher value = less random, hard to know best value in advance
1. Draw causal insights (what is causing ___?) 2. Predict future events (Which customers are likely to cancel subscription?) 3. Understand patterns in data (Are there groups of customers who are similar?)
objectives of ML
random forests
popular type of ML model, good for beginners, robust to overfitting, yield very accurate, non-linear models; have hyperparameters which require manual specification, can impact model fit and vary from dataset-to-dataset, default values often ok but may need adjustment
- focusing on a single metric helps - diagnose the nature of the error (underfitting: error high in both training and test sets vs overfitting: error high in test but low in training)
practical advice for (un)supervised learning
1. median impute 2. KNN impute if data are not missing at random 3. for linear models... - center and scale - try PCA and spatial sign 4. tree-based models don't need much preprocessing
preprocessing cheat sheet
Area under the curve (AUC)
single num summary of model accuracy, summarizes performance across all thresholds, rank diff models within same dataset, ranges from 0 to 1 (0.5 = random guessing, 1 = model always right), thought of as a letter grade
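One way to make the "across all thresholds" idea concrete: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs; the function name and toy data are illustrative, and labels are assumed to be coded 0/1.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

A model that scores every positive above every negative gets 1.0, and a model that assigns the same score to everything gets 0.5.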
unsupervised learning
start with this type of ML
familiarize yourself with the context; engage in research; start with a causal question; define the prediction question
things to consider to ask right question
look at historical data; run experiments; repeat experiments
things to consider when you still may not be able to affect the predicted outcome
- add more training data - Decrease model complexity - Add regularization (e.g., Lasso or Ridge) - Significantly reduce the number of features
to address overfitting you might
- inc complexity - dec regularization
to address underfitting you might
recall (what fraction did we correctly detect as being fraud?)
true pos / actual pos
precision (what fraction actually was fraud)
true pos / predicted pos
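A short sketch of why accuracy alone is misleading on skewed classes, using a hypothetical fraud dataset with 5 fraud cases out of 100. Returning 0.0 when a denominator is zero is a convention chosen for this example, not something stated on the cards.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # true pos / predicted pos
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # true pos / actual pos
    return precision, recall


# Hypothetical data: 5 fraud cases (1) followed by 95 legitimate cases (0).
actual = [1] * 5 + [0] * 95
# The model catches 2 of the 5 fraud cases and raises 1 false alarm.
predicted = [1, 1, 0, 0, 0] + [1] + [0] * 94

acc = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision, recall = precision_recall(actual, predicted)
print(acc, precision, recall)
```

Here accuracy is 0.96 even though the model misses most of the fraud (recall 0.4), which is the situation the precision and recall cards warn about.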
supervised and unsupervised
two types of ML
From variable character to numeric; Training and test sets; Manual or automated feature selection; Choosing an algorithm (LR, Logistic); RMSE or Accuracy, Precision, Recall; Different algorithm, regularization
typical supervised learning process
"zv" removes constant columns; "nzv" removes nearly constant columns
use caret to remove low/no variance
training set
use this to learn the rules which can then be used to predict ans from testing set
Randomized experiments
very useful for testing how actionable ML models are, and for helping to estimate the costs-benefits of deploying an automated system
unsupervised learning
we let the computer learn by itself; no specific purpose, no need to specify a target
supervised learning
we teach the computer how to learn something; specific purpose, requires specifying a target
- data compression: speeds up analysis and uses less computer memory (e.g., correlated features such as pilot enjoyment and pilot skill can be combined into a single "aptitude" feature) - data visualization: data with 5 dimensions (or variables) is easier to visualize than data with 50 dimensions
why dimensionality reduction?
