q5: Machine Learning; Supervised and Unsupervised Learning
highest avg AUC and lowest standard deviation in AUC
What are the selection criteria when comparing models?
asking the right question
defining the situation requires ______ which can be difficult
estimate cost-benefit
if your model is actionable, consider the _____ of building an automated process
true
with an imbalanced class distribution you cannot just use accuracy; you also need precision / recall
ridge regression
penalizes the squared magnitude of coefficients (the L2-norm penalty), shrinking them toward zero
lasso regression
penalizes the absolute magnitude of coefficients (the L1-norm penalty), shrinking many to exactly zero and thus effectively penalizing the number of non-zero coefficients
regularization
provides some combination of fit and simplicity
Lasso regression (auto feature selection)
L1-norm + standard OLS
Ridge regression
L2-norm + standard OLS
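A minimal sketch of both penalties with the glmnet package (the x matrix of predictors and the y response here are hypothetical placeholders):

    library(glmnet)

    ridge_fit <- glmnet(x, y, alpha = 0)  # alpha = 0: ridge (L2 penalty)
    lasso_fit <- glmnet(x, y, alpha = 1)  # alpha = 1: lasso (L1 penalty)

    # cross-validation picks the penalty strength (lambda)
    cv_lasso <- cv.glmnet(x, y, alpha = 1)
    coef(cv_lasso, s = "lambda.min")  # many coefficients are exactly zero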
similarity matching
Recommending products or services to customers based on their similarity to other customers (e.g., Netflix movie recommendation)
Cluster assignment
Step 2: ________. Go through each point and assign it to the closer of the 2 cluster centroids
Move centroid
Step 3: _________. Move each of the 2 cluster centroids to the average of the points assigned to it (the points colored the same color)
KNN Imputation
imputes missing values based on "similar" non-missing rows
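A sketch of KNN imputation with caret's preProcess; df is a hypothetical all-numeric data frame with some NAs:

    library(caret)

    pp <- preProcess(df, method = "knnImpute")  # caret also centers and scales here
    df_imputed <- predict(pp, df)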
unsupervised learning
opportunity to understand the problem better; a type of exploratory analysis, which can lead to insightful discoveries
random forest
- slower to fit than glmnet - less interpretable - often more accurate than glmnet - easier to tune - requires little preprocessing - captures threshold effects and variable interactions
elbow method
Choose k such that there are diminishing returns beyond that point: look at the shape of the curve (total within-cluster sum of squares vs. k) and cut off where it tapers (the "elbow")
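A sketch of the elbow method in base R; df is a hypothetical numeric data frame with no missing values:

    # total within-cluster sum of squares for k = 1..10
    wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 20)$tot.withinss)
    plot(1:10, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
    # pick the k where the curve tapers off (the "elbow")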
machine learning
Field of study that gives computers the ability to learn without being explicitly programmed; a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E
overfitting
Finding chance occurrences in the training dataset that look like interesting patterns but do not generalize to the population of interest; the tendency to tailor models to the training dataset at the expense of generalization to previously unseen data points
co-occurrence grouping
Grouping things together based on their co-occurrences (e.g., Target placing products that are bought together next to each other on shelves)
we can be confident that the model is doing well even if we have very skewed classes
If the model has high precision and high recall, then
PCA - minimizes the perpendicular (orthogonal) distance from the data points to the component line OLS - minimizes the residuals (vertical distance between the data and the regression line)
PCA vs OLS
centroids will not change any further and the color of the points will not change any further
Repeat steps 2 and 3 until
cluster centroids
Step 1: Given our choice of K = 2, randomly initialize 2 points called the ______
1. Classifying Twitter users as bots or not bots (2 is experience, 3 is performance)
Suppose your bot detection program watches which Twitter accounts you do or do not mark as bot and based on that learns how to better filter bots. What is the task T in this setting? 1. Classifying Twitter users as bots or not bots 2. Watching you label Twitter users as bots or not bots 3. The number of Twitter accounts correctly classified as bots/not bots 4. None of the above
Because otherwise you wouldn't be doing a fair comparison of your models and your results could be due to chance
What's the primary reason that train/test indices need to match when comparing two models?
anomaly detection
Identifying which observations fall outside the discovered "regular pattern"; the result can be used as an input to supervised learning (e.g., amount spent and fraud detection)
being able to accurately and precisely predict future events is valuable
Why ML?
- So you can use the same summaryFunction and tuning parameters for multiple models. - So you don't have to repeat code when fitting multiple models. - So you can compare models on the exact same training and test data.
Why reuse a trainControl?
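A sketch of a shared trainControl in caret; churn_x and churn_y are hypothetical predictors and a two-class target:

    library(caret)

    myFolds <- createFolds(churn_y, k = 5, returnTrain = TRUE)  # training-row indices
    myControl <- trainControl(
      summaryFunction = twoClassSummary,  # report ROC / Sens / Spec
      classProbs = TRUE,                  # needed for AUC
      savePredictions = TRUE,
      index = myFolds                     # same train/test split for every model
    )

    model_glmnet <- train(churn_x, churn_y, method = "glmnet",
                          metric = "ROC", trControl = myControl)
    model_rf     <- train(churn_x, churn_y, method = "rf",
                          metric = "ROC", trControl = myControl)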
it gives you more fine-grained control over the tuning parameters that are explored
Why use a custom tuneGrid?
The default tuning grid is very small and there are many more potential glmnet models you want to explore.
Why use a custom tuning grid for a glmnet model?
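A sketch of a custom tuneGrid for glmnet, reusing the hypothetical myControl and churn data from the trainControl sketch above:

    model <- train(
      churn_x, churn_y,
      method = "glmnet",
      tuneGrid = expand.grid(
        alpha  = 0:1,                         # 0 = ridge, 1 = lasso
        lambda = seq(0.0001, 1, length = 20)  # penalty strength
      ),
      metric = "ROC",
      trControl = myControl
    )
    plot(model)  # compare AUC across the whole grid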
- most flexible method for fitting caret models - complete control over how the model is fit
adv of custom tuning
- fits quickly - ignores noisy variables - provides interpretable coefficients
adv of glmnet
median imputation (replace missing values with medians; works well if data are missing at random)
best way to deal with missing values
Situation, Opportunity, Action
business requirements
principal components analysis
combines low-variance and correlated variables into a single set of high-variance, perpendicular predictors; prevents collinearity; the 1st component has the highest variance
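A sketch with base R's prcomp, using the built-in mtcars data as a stand-in:

    pr <- prcomp(mtcars, center = TRUE, scale. = TRUE)
    summary(pr)  # proportion of variance: PC1 has the highest
    head(pr$x)   # the new high-variance, perpendicular predictors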
- ML first - not enough data (need roughly 1,000 - 10,000 observations) - target variable definition (how do you measure an observation? what are you trying to predict?) - late testing, no impact - feature selection
common mistakes
median imputation --> center --> scale --> fit glm
common recipe for linear models (order matters)
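A sketch of this recipe inside caret::train, using the x/y interface so imputation happens before the glm fit; df_x and df_y are hypothetical:

    library(caret)

    model <- train(
      x = df_x, y = df_y,
      method = "glm",
      preProcess = c("medianImpute", "center", "scale"),
      trControl = trainControl(method = "cv", number = 10)
    )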
- easy to get humans to label - draw on human intuition to improve model - use human-level performance to set a desired error rate
compare results to human-level performance
inference
concerned with understanding the drivers of a business outcome (interpretable but less accurate)
- classical programming is rule-based: we give it data and rules to get answers - ML uses data and answers to learn the rules
difference between classical programming and ML
- requires some knowledge of the model - can dramatically increase run time
disadv of custom tuning
glmnet
extension of glm models with built-in variable selection; helps deal with collinearity and small sample sizes; two primary forms: lasso and ridge regression; attempts to find a parsimonious model
Principal Components Analysis (PCA)
goal is to reduce the data from n dimensions to k dimensions, meaning you need to find k vectors onto which to project the data so as to minimize the projection error
- decrease the number of features: choose an algorithm that automatically selects features for you, or manually select which features to keep - regularization: keep all features but decrease the magnitude of the parameters
how do you address overfitting
error analysis
manually look at the classifications or estimated values that your model gets wrong; do you see a pattern?
prediction
main goal; not easily interpretable but more accurate
- How well the model predicts new data (not how well it fits the data it was trained with) - Difference between actual outcome and predicted outcome (i.e., error) - RMSE: how much the predictions diverge from the actual values on average (regression; numeric measure) - accuracy: (true positives + true negatives) / total (classification)
model evaluation
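Minimal sketches of the two metrics in base R; p (predicted) and a (actual) are hypothetical vectors:

    rmse <- function(p, a) sqrt(mean((p - a)^2))  # regression
    accuracy <- function(p, a) mean(p == a)       # classification

    rmse(c(1.5, 2.0), c(1.0, 2.5))             # 0.5
    accuracy(c("yes", "no"), c("yes", "yes"))  # 0.5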
supervised learning [aka regression (price of houses)/classification (cancer malignant or benign) problem]
most common ML problem
K-means clustering
by far the most popular and widely used clustering algorithm
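A sketch with base R's kmeans on the built-in iris measurements; nstart reruns the random initialization (step 1) several times and keeps the best result:

    km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
    km$cluster  # final assignments (step 2)
    km$centers  # final centroids (step 3)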
mtry
number of randomly selected variables used at each split; lower value = more random, higher value = less random; hard to know the best value in advance
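A sketch of trying several mtry values via caret, since the best one is hard to know in advance; churn_x, churn_y, and myControl are the hypothetical objects from the trainControl sketch above:

    model_rf <- train(
      churn_x, churn_y,
      method = "rf",
      tuneGrid = data.frame(mtry = c(2, 4, 8)),
      metric = "ROC",
      trControl = myControl
    )
    plot(model_rf)  # pick the mtry with the best resampled AUC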
1. Draw causal insights (what is causing ___?) 2. Predict future events (which customers are likely to cancel their subscription?) 3. Understand patterns in data (are there groups of customers who are similar?)
objectives of ML
random forests
popular type of ML model, good for beginners, robust to overfitting, and yield very accurate, non-linear models; they have hyperparameters that require manual specification, which can impact model fit and vary from dataset to dataset; default values are often OK but may need adjustment
- focusing on a single metric helps - diagnose the nature of the error (underfitting: error is high in both training and test sets; overfitting: error is high in the test set but low in the training set)
practical advice for (un)supervised learning
1. median impute 2. KNN impute if data are missing not at random 3. for linear models... - center and scale - try PCA and spatial sign 4. tree-based models don't need much preprocessing
preprocessing cheat sheet
Area under the curve (AUC)
single-number summary of model accuracy; summarizes performance across all thresholds; ranks different models within the same dataset; ranges from 0 to 1 (0.5 = random guessing, 1 = model always right); can be thought of as a letter grade
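A sketch with the pROC package; actual (test-set labels) and predicted_probs (model scores) are hypothetical vectors:

    library(pROC)

    roc_obj <- roc(actual, predicted_probs)
    plot(roc_obj)  # performance across all thresholds
    auc(roc_obj)   # 0.5 = random guessing, 1 = always right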
unsupervised learning
start with this type of ML
familiarize yourself with the context; engage in research; start with a causal question; define the prediction question
things to consider to ask right question
look at historical data; run experiments; repeat experiments
things to consider when you still may not be able to affect the predicted outcome
- add more training data - Decrease model complexity - Add regularization (e.g., Lasso or Ridge) - Significantly reduce the number of features
to address overfitting you might
- increase complexity - decrease regularization
to address underfitting you might
recall (what fraction of the actual fraud did we correctly detect?)
true pos / actual pos
precision (what fraction of predicted fraud actually was fraud?)
true pos / predicted pos
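A sketch computing both from a confusion matrix; predicted and actual are hypothetical factor vectors with levels c("fraud", "ok"):

    cm <- table(predicted, actual)        # rows = predicted, cols = actual
    tp <- cm["fraud", "fraud"]
    precision <- tp / sum(cm["fraud", ])  # true pos / predicted pos
    recall    <- tp / sum(cm[, "fraud"])  # true pos / actual pos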
supervised and unsupervised
two types of ML
convert variables from character to numeric; split into training and test sets; manual or automated feature selection; choose an algorithm (linear regression, logistic regression); evaluate with RMSE or accuracy / precision / recall; try a different algorithm or regularization
typical supervised learning process
"zv" removes constance columns "nzv" removes nearly constant columns
use caret to remove low/no variance
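A sketch with caret's nearZeroVar, the function behind the "nzv" option; df is a hypothetical data frame:

    library(caret)

    drop_cols <- nearZeroVar(df)  # indices of near-constant columns
    if (length(drop_cols) > 0) df <- df[, -drop_cols]
    # or let train() handle it: preProcess = c("zv", ...) or c("nzv", ...)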
training set
use this to learn the rules, which can then be used to predict answers for the testing set
Randomized experiments
very useful for testing how actionable ML models are and for helping to estimate the costs and benefits of deploying an automated system
unsupervised learning
we let the computer learn by itself; no specific purpose, no need to specify a target
supervised learning
we teach the computer how to learn something; specific purpose, requires specifying a target
- data compression: speeds up analysis and uses less computer memory (e.g., pilot aptitude derived from pilot enjoyment and pilot skill) - data visualization: easier to visualize data with 5 dimensions (or variables) than with 50 dimensions
why dimensionality reduction?