Machine Learning Midterm

All pairs faster to train than OVR. OVR is faster in making predictions

-

Precision-recall curve (precision vs. recall); ROC curve (true positive rate vs. false positive rate)

-

if points within a cluster are still far apart, the cluster should probably be split into more clusters

-

Main differences between adaline and perceptron

- (yi - w^T xi) is a real value in Adaline, instead of a binary value (in the perceptron, a prediction is either correct or incorrect) - the perceptron minimizes the number of errors, while Adaline tries to make w^T xi close to the correct value (1 or -1) - the update in Adaline is based on the entire training set, instead of one instance at a time

Learning Curves

- A learning curve measures the performance of a model at different amounts of training data (has a curve for training accuracy and one for validation accuracy) - Primarily used to understand 2 things: 1. How much training data is needed 2. Bias/Variance Tradeoff - typical patterns: --- training accuracy decreases with more data --- validation accuracy increases with more data --- the two accuracies should converge to be similar - if gap between train/validation performances isn't closing, probably too much variance (overfitting) - if gap between train/validation performance closes quickly, might suggest high bias (underfitting)

Validation Curves

- A validation curve measures the performance of a model at different hyper parameter settings. Validation curve contains one curve for training accuracy and one for validation accuracy - validation curves help you understand the effect of hyperparameters, and also help you understand the bias/variance tradeoff --- want to find a setting where train/validation performance is similar (low variance) --- of the settings where train/validation performance are similar, pick the one with the highest accuracy (low bias)

Features: Text

- Bag of words: in this representation, the set of features is the set of unique words that appear in the dataset (the vocabulary) and their counts. Does not describe word position - n-grams: generalization of bag of words where features are sequences of words of length n. (n too short-may lose important detail, n too long-may not see enough instances) can include different size n-grams
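
The two representations above can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: the helper names `ngrams` and `bag_of_ngrams` are hypothetical, and splitting on whitespace is an assumption made for brevity.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every length-n sequence of consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, n=1):
    """Count n-grams in a text; n=1 gives a plain bag of words."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return Counter(ngrams(tokens, n))

unigrams = bag_of_ngrams("the cat sat on the mat")
bigrams = bag_of_ngrams("the cat sat on the mat", n=2)
```

Note that `unigrams` records that "the" occurs twice but not where, while `bigrams` keeps a little local word order.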

Features: Images

- Color histograms: analogous to bag of words. counts number of times color appears, not where they appear. Colors are binned within histograms - local context can be characterized in images (like n-grams): this may, for example, allow for detection of edges

K-fold cross validation

- Common technique for getting held-out estimates - split data into k partitions (folds) - use all but one for training, the last one for testing - repeat k times, so each fold gets used for testing once - average k held-out estimates
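
The procedure above can be sketched as follows. The function names and the `evaluate` callback are hypothetical, and the sketch assumes the instances have already been shuffled, since it assigns folds by index position.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_instances))
    # Fold f takes every k-th index starting at f, so fold sizes differ by at most 1.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

def cross_validate(n_instances, k, evaluate):
    """Average the k held-out scores returned by evaluate(train, test)."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_instances, k)]
    return sum(scores) / len(scores)
```

Here `evaluate` stands in for whatever trains a model on the training indices and returns its score on the test indices.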

Some Error Analysis Techniques

- Confusion matrix (error matrix): a table that counts the number of test instances with each true label vs. each predicted label. Different types of multiclass errors may have different importance, so you need to look at the confusions, not just at a summary statistic - examine learned parameters - look at a sample of misclassified instances - Error analysis can help inform feature engineering: --- if you observe that certain classes are easily confused, maybe create new features to distinguish these classes --- if you observe that certain features may be hurting performance, you could remove them

Selecting validation data

- Cross validation: --- use cross validation for model selection, then evaluate on a single held out test after tuning --- use nested cross validation, where a fold is used for testing and a different fold is used for validation

Decision Tree Benefits

- Decision trees can learn conjunctions (e.g., pattern="striped" AND Color="Orange"). This gives context to the feature values. - Naturally handle multiclass classification without modification - Classifier is interpretable - Can be used for regression (final prediction is the average value of all instances at the leaf)

Dimensionality reduction

- Dimensionality reduction refers to the process of reducing the number of features in your data; it can be done through feature selection and feature transformation - transformed feature vectors are called embeddings - intuitions behind dimensionality reduction: --- correlated features may be able to be mapped to the same feature without losing information --- it's possible to change dimensionality so that instances retain similarity to each other in the new feature space --- it is a form of lossy compression

Dimensionality

- Dimensionality: number of features/variables - Curse of dimensionality: --- training: the more features you have, the more data you need to learn --- distance: all points are far apart in a high-dimensional space, so it is harder to define close vs. far

Random Forest

- Ensemble learning with decision trees (a forest is a set of trees) - avoids overfitting better than individual decision trees

Feature engineering

- Features can be created from other features (function, etc) - features can be created from other classifiers

Unsupervised Learning

- Find interesting patterns in data - no training data - not trying to predict any particular variable - clustering: an unsupervised learning task that involves grouping data instances into categories. Similar to classification but do not know classes ahead of time

minima/maxima

- The global maximum is the highest point of the function. - A local maximum is a peak, where the slope changes from positive to negative - every global maximum is also a local maximum

Tradeoff between precision and recall fraud example

- Increase Prediction threshold -> increase precision - decrease prediction threshold -> increase recall - If a human is reviewing the transactions flagged as fraud, probably optimize for recall - If the classifications are taken as is, probably optimize for precision

Feature Selection Techniques

- L1 regularization - Sequential feature selection algorithms: pick features one by one (e.g., sequential backward selection) - statistical tests: general idea is to measure the statistical dependence or correlation between each feature and the labels (e.g., chi-squared test). Choose the top N features based on their test statistic, or all features whose test statistic is above a threshold (equivalently, whose p-value is below a threshold) - Unlike L1 regularization or sequential methods, feature selection through statistical tests does not explicitly try to help classifier performance - stop words: common words like "the" that aren't expected to be useful can be removed - infrequent features (those in the long tail, e.g., an n-gram that appears in only one document) can be removed with little effect on performance

Supervised Learning

- Learns how to predict output from a given input - two types of prediction - classification - discrete outputs - regression - continuous outputs

Logistic regression misnomers

- Logistic regression is classification, not regression. It is regression in that it is learning a function that outputs continuous values, BUT you are using those values to predict discrete classes - Considered a linear classifier because score, which determines the output, is linear

Averaging precision/recall/f1

- Macro average: averages the individual calculated scores of each class. Weights each class equally. PREmacro = (PRE1 + ...+ PREk)/k - Micro average: calculates metric by first pooling all instances of each class. Weights each instance equally. PREmicro = (TP1+...+TPk)/(TP1+...+ TPk + FP1+...+ FPk)
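
A small sketch of the two precision formulas above. The counts are made up for illustration, and the code assumes every class has at least one predicted positive (otherwise the per-class precision is undefined).

```python
def macro_precision(tp, fp):
    """Average the per-class precisions; every class counts equally."""
    per_class = [t / (t + f) for t, f in zip(tp, fp)]
    return sum(per_class) / len(per_class)

def micro_precision(tp, fp):
    """Pool the counts across classes first; every instance counts equally."""
    return sum(tp) / (sum(tp) + sum(fp))

# Hypothetical counts for two classes: one large and accurate, one small and noisy.
tp = [90, 1]
fp = [10, 9]
macro = macro_precision(tp, fp)  # (0.9 + 0.1) / 2
micro = micro_precision(tp, fp)  # 91 / 110
```

The macro average is pulled down by the small class, while the micro average is dominated by the large one.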

Dimensionality reduction in neural networks

- Neural Networks perform dimensionality reduction in the hidden layers - the first layer of a neural network often learns similar outputs, even when the data and task change - an increasingly common technique is to train the first layer once ("pretraining") and keep reusing it, doing most of the experimentation in the hidden layers

Generalization

- Overfitting results when your function matches the training data well but is not learning general rules that will work for new data - Inductive Bias: restrictions on what a classifier can learn

Size of testing data tradeoffs

- Smaller test set: less reliable performance estimate - smaller training set: less data for training, probably worse classifier (might underestimate performance)

Features: Text - Techniques

- Stemming: converting words to their "root" or "base" form - TF-IDF weighting: raw word counts can overemphasize uninteresting words like "the" and "and". tf-idf(t, d) = tf(t, d) * idf(t, d), with idf(t, d) = log(n / (1 + df(d, t))), where tf is term frequency, idf is inverse document frequency, n is the total number of documents, and df(d, t) is the number of documents that contain the term t
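
The TF-IDF weighting above can be sketched directly from its definition, reading the denominator as 1 + df. Documents are represented as token lists, tf is taken to be the raw count, and the function name `tf_idf` and the toy corpus are illustrative assumptions.

```python
import math

def tf_idf(term, doc, docs):
    """tf-idf(t, d) = tf(t, d) * log(n / (1 + df(t)))."""
    tf = doc.count(term)                    # raw count of the term in this document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / (1 + df))
    return tf * idf

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
    ["a", "bird", "sang"],
]
common = tf_idf("the", docs[2], docs)  # df = 3, so idf = log(4 / 4) = 0
rare = tf_idf("bird", docs[3], docs)   # df = 1, so idf = log(4 / 2)
```

With this particular smoothing, a term that appears in every document gets a slightly negative idf; library implementations often use a different smoothing variant.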

Supervised vs Unsupervised dimensionality reduction

- Supervised (LDA): changes feature space in a way that directly optimizes for the prediction task - unsupervised reduction (PCA): can take advantage of unlabeled data. Potentially advantageous when you have a small amount of training data but a large amount of unlabeled data from the same domain

Linear Discriminant Analysis (LDA)

- Supervised dimensionality reduction - Works similarly to PCA, except instead of choosing axes that have high variance, LDA chooses axes that best separate the class labels - a metric called "scatter" measures how separated class labels are on an axis

High Variance

- The learned function depends a lot on the specific data used to train - prone to overfitting

Variance

- The variance of an estimate refers to how much the estimate will vary from sample to sample - If you consistently get the same parameter estimate regardless of what training sample you see, this parameter has a low variance

Error

- True Error/Risk: a measure of how well a classifier will do on all data it might encounter - Usually, you can only measure the error or loss on the training data, called the training/empirical error/risk - Goal of machine learning is to learn a prediction function that minimizes true error

Bias and Variance Error

- Variance is error due to randomness in how your training data was selected - Bias is error due to something systematic, not random - Some amount of bias is needed to avoid overfitting - too much bias is bad, but too much variance is usually worse

Bias

- When you estimate a parameter from a sample, the estimator is biased if the expected value of the estimate is different from the true value of the parameter - Regularization adds bias because it systematically pushes your estimates in a certain direction (weights close to 0)

High Bias

- Will learn similar functions even if given different training examples - prone to underfitting

Features of a dataset

- attributes/covariates - input variables - independent variables - list of feature values for an instance is the feature vector

F1 score

- the harmonic mean of precision and recall: F1 = 2 * PRE * REC / (PRE + REC) - unlike the arithmetic mean, the harmonic mean is pulled toward the lower number, so both numbers must be high for F1 to be high
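
A quick sketch of why F1 punishes an imbalance between precision and recall. Returning 0 when both inputs are 0 is a convention assumed here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention (assumption): define F1 as 0 when both are 0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.5, 0.5)  # equals the arithmetic mean when both are equal
lopsided = f1_score(0.9, 0.1)  # arithmetic mean would be 0.5; F1 is much lower
```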

Principal component analysis (PCA)

- chooses new axes to project data onto. The new axes are called principal components - PCA is unsupervised dimensionality reduction: it does not use any information about class labels - PCA chooses axes so that values will have a high variance once projected onto them - Features must be standardized/normalized - the reduced dimensionality K is a hyperparameter

semi-supervised learning

- combines supervised and unsupervised learning - special case of supervised learning: you have a specific prediction task, but some of your data has unknown outputs

Validation data

- data that is held out for measuring performance, but is separate from the final test set - also called development data

problems with crowdsourcing

- you don't have the same people labeling every instance (consistency issues) - workers may lack expertise - workers may work too fast to do well

Incorrect values

- errors can arise (transcription error, human error, script output error) - outlier detection: a common rule of thumb treats a value as an outlier if it is more than 2 standard deviations above or below the mean (3 standard deviations is also widely used)

Error/Loss of a prediction function

- for classification, this is the probability that the classifier outputs an incorrect label - for regression, this is usually measured by how far the predicted value is from the true value

k in k-fold cross validation

- generally larger k is better, but limited by efficiency - smaller k means less training data is used, so your estimate may be an underestimate - when k is the number of instances, this is called leave-one-out cross validation. Useful with small datasets, when you want as much training data as possible

Missing Values

- if a small number of instances have missing values, maybe drop them - if a lot of values are missing for a feature, maybe drop the feature - impute missing values (e.g., mean if numerical, majority value if categorical) - can have a special "unknown" value
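
The imputation option above can be sketched for a single feature column. Representing a missing value as `None` and the helper name `impute` are assumptions for illustration, and the sketch assumes at least one value is observed.

```python
from collections import Counter
import statistics

def impute(values, categorical=False):
    """Replace None with the column mean (numerical) or majority value (categorical)."""
    observed = [v for v in values if v is not None]
    if categorical:
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

numeric = impute([1.0, None, 3.0])
categories = impute(["a", None, "a", "b"], categorical=True)
```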

k in k nearest neighbors

- if k is too small: prediction sensitive to noise - if k is too large: algorithm loses local context that makes it work

Implicit labels

- many companies use user engagement (e.g., clicks, "likes") as a type of label - often, this feedback is only a proxy for what you actually want

Benefits of multiple held-out estimates

- more robust final estimate; less sensitive to particular test split - multiple estimates also give you variance of the estimates; can be used to construct confidence intervals

K nearest neighbors

- a non-linear classifier - Classifies an instance as follows: 1. find the k labeled instances that have the lowest distance to the unlabeled instance 2. Return the majority class (most common label) in the set of k nearest instances - can also be used for regression: replace "majority class" in step 2 with "average value" - common variant of KNN: weight the nearest neighbors by their distances
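
The two classification steps above can be sketched as follows. Euclidean distance is an assumption (the card does not name a distance), and the toy training set is made up.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train is a list of (feature_vector, label) pairs."""
    # Step 1: find the k labeled instances closest to the query.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Step 2: return the most common label among them.
    return Counter(nearest_labels).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
near_origin = knn_classify(train, (1, 0), k=3)
near_corner = knn_classify(train, (5, 5), k=3)
```

For regression, step 2 would return the average of the k neighbors' values instead of the majority label.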

What to do if an instance is unclear and hard to assign a label

- one solution: exclude it from the training data (the classifier then cannot learn how to handle similar instances, but that may be better than teaching it something that's wrong) - another: label it with a special class label

Transforming categorical values

- one-hot encoding: one binary feature per category value (e.g., SexIsMale, SexIsFemale) - ordinal values (e.g., small, medium, large) may be able to be encoded as a single feature with increasing numerical values (e.g., 1, 2, 3)
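
Both encodings can be sketched in a couple of lines. The helper names and category lists are illustrative.

```python
def one_hot(value, categories):
    """One binary feature per category; exactly one of them is 1."""
    return [1 if value == c else 0 for c in categories]

def ordinal(value, ordered_levels):
    """Map an ordered category to 1, 2, 3, ... by its position in the ordering."""
    return ordered_levels.index(value) + 1

color = one_hot("green", ["red", "green", "blue"])
size = ordinal("medium", ["small", "medium", "large"])
```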

reinforcement learning

- setting: - an agent interacts with some environment - actions by the agent lead to different states in the environment - some states provide rewards - learning goal is to maximize reward -used to learn models of how to behave

In sample data

- the data that is available when building your model - training data

Classification with logistic regression

- typically you classify something as positive if Φ(z) >= 0.5, but you could create other rules. For example, if you don't want to classify something as positive unless you are really confident, use Φ(z) >= 0.99 as your rule (e.g., spam classification)
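
A minimal sketch of the decision rule, taking Φ to be the logistic sigmoid and z to be the linear score. The function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Predict positive (1) only if the estimated probability reaches the threshold."""
    return 1 if sigmoid(z) >= threshold else 0

default_rule = classify(2.0)                 # sigmoid(2.0) is about 0.88
strict_rule = classify(2.0, threshold=0.99)  # the same score fails the stricter rule
```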

label

- value of the output variable - dependent variable - in classification, values that a label can take on are classes

Feature normalization with regularization

- The scale of features matters when using regularization. - If one feature has values in the range [0, 1] and another has values in the range [0, 1000], the learned weights will tend to be on different scales, and whichever weights are "naturally" larger will be penalized more by the regularizer

Error

1 - Accuracy

K means clustering

1. Initialize cluster means 2. Repeat until assignments stop changing: a) assign each instance to the cluster whose mean is nearest to the instance b) update the cluster means based on the new cluster assignments: mean_i = (1/|si|) ∑(xj in si) xj, where si is the set of instances in cluster i and |si| is the number of instances in the cluster
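
The loop above can be sketched as follows. Euclidean distance, the iteration cap, and leaving an empty cluster's mean unchanged are assumptions of this sketch, and the initial means are passed in by the caller.

```python
import math

def k_means(points, initial_means, max_iters=100):
    """points and initial_means are lists of equal-length tuples."""
    means = list(initial_means)
    assignments = None
    for _ in range(max_iters):
        # Step a: assign each instance to the cluster with the nearest mean.
        new_assignments = [
            min(range(len(means)), key=lambda c: math.dist(p, means[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Step b: recompute each mean as the average of its assigned instances.
        for c in range(len(means)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old mean (assumption)
                means[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return means, assignments

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
means, assignments = k_means(points, [(0, 0), (10, 10)])
```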

General Idea for Learning a Decision Tree

1. Pick the feature that best distinguishes classes - If you group the instances based on their value for that feature, some classes should become more likely - The distribution of classes should have low entropy (high entropy means that classes are evenly distributed) 2. Recursively repeat for all groups of instances 3. When all instances in a group have the same label, set that class as the final node (If you stop creating a deeper tree before all instances at a node belong to the same class, use the majority class within that group as the final class of the leaf node)
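
Step 1 can be made concrete by scoring each candidate feature with the size-weighted entropy of the groups it produces, and picking the lowest. This is one common way to do it, not necessarily the exact criterion intended by the card; the data and helper names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class distribution in a group of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def split_entropy(instances, feature):
    """Size-weighted average entropy of the groups formed by a feature's values."""
    groups = {}
    for features, label in instances:
        groups.setdefault(features[feature], []).append(label)
    return sum(len(g) / len(instances) * entropy(g) for g in groups.values())

# Hypothetical toy data: each instance is (feature dict, class label).
data = [
    ({"pattern": "striped", "size": "big"}, "tiger"),
    ({"pattern": "striped", "size": "small"}, "tiger"),
    ({"pattern": "plain", "size": "big"}, "lion"),
    ({"pattern": "plain", "size": "small"}, "lion"),
]
best_feature = min(["pattern", "size"], key=lambda f: split_entropy(data, f))
```

Splitting on "pattern" yields pure groups (entropy 0), while splitting on "size" leaves the classes evenly mixed (entropy 1 bit).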

Objective Function

A function we want to minimize/maximize. - The loss/cost function, L(w), gives the training error when using parameters w - We want to minimize the loss function

Bias term in perceptron

Add an extra feature to every instance whose value is always 1

Features: Audio

Audio data is somewhat similar to image data. There are different intensities (amplitudes) at different positions (signal frequency and time)

Logistic regression in other disciplines

Can be used as a tool of understanding relationships between variables. Example: Build a model to "predict" if someone is a smoker or not. The parameters can tell you which variable increase or decrease the likelihood of smoking

Data annotation

Creating the labels for training data; also called coding or labeling. It often needs to be an iterative process to finalize the guidelines and the set of labels

Decision trees

Decision tree classifiers are structured as trees, where: - nodes are features - edges are feature values - leaves are classes To classify an instance, start at the root of the tree, and follow the branches based on the feature values in that instance. The final node is the prediction.

Overfitting in Decision Trees

Decision trees can easily overfit. Without any restrictions, a tree can encode all possible combinations of feature values. Techniques to avoid overfitting: - restrict maximum depth of the tree - restrict minimum number of instances that must remain before splitting further

Multilayer perceptron

Feed-forward neural network where the outputs of one layer of perceptrons are fed into another layer of perceptrons as inputs. - Each perceptron in a neural network is called a unit - Layers: input layer, hidden layers, output layer - MLP uses a logistic (sigmoid) function as the activation function - Hyperparameters: number of hidden layers, number of units in each layer

Stopping Criteria for perceptron

If the training instances are not linearly separable, the classifier will always get some predictions wrong. - need to implement some sort of stopping criteria - usually this is specified by running the algorithm for a maximum number of iterations or epochs
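
A minimal perceptron training loop with an epoch limit as the stopping criterion. Labels are assumed to be 1 or -1, the bias is handled by appending a constant feature of 1, and a score of exactly 0 is predicted as positive; these choices and the function names are assumptions of the sketch.

```python
def predict(w, x):
    """Predict 1 or -1 from the sign of the score (bias feature appended here)."""
    score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1

def train_perceptron(data, eta=1.0, max_epochs=100):
    """data is a list of (features, label) pairs with label in {1, -1}."""
    w = [0.0] * (len(data[0][0]) + 1)  # one extra weight for the bias feature
    for _ in range(max_epochs):        # stopping criterion: maximum number of epochs
        mistakes = 0
        for x, y in data:
            if predict(w, x) != y:
                # Move the weights toward the correct side for this instance.
                w = [wi + eta * y * xi for wi, xi in zip(w, list(x) + [1.0])]
                mistakes += 1
        if mistakes == 0:              # a full pass with no errors: stop early
            break
    return w

# Linearly separable toy data (logical AND).
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights = train_perceptron(and_data)
```

On data that is not linearly separable the early exit never triggers, so the epoch limit is what ends training.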

Summary of Kernel Methods

Kernel SVM is a reformulation of SVM that uses similarity between instances. To make a prediction for a new instance, need to calculate kernel function for the new instance and all training instances that are support vectors. Linear kernel svm == svm Kernels can be useful when your data has a small number of features and/or when the dataset is not linearly separable.

Goal of Support vector machines

Learn the boundaries to make the margins as large as possible while still classifying the instances correctly

Learning rate in perceptron

Learning rate, eta, also called step size. - If eta is too small, the algorithm will be too slow because the updates won't make much progress - If eta is too large, the algorithm will also be slow because it will overshoot and may cause previously correct classifications to become incorrect

Loss function of a multilayer perceptron

The loss function is based on the difference between classifications and true labels. - It can be minimized with gradient descent (or related methods) - back propagation makes the gradient descent calculations more efficient - The loss function is non-convex

Precision

Percentage of instances predicted to be positive that were actually positive PRE = TP/(TP+FP)

Recall

Percentage of positive instances that were predicted to be positive REC = TPR = TP/P = TP/(FN+TP)

Data Preprocessing

Preprocessing refers to the step of processing your raw data in a way that makes it suitable for use in a learning algorithm Main components: - Getting features out of raw data - setting values of the features (fixing incorrect values, converting categorical values to numeric, standardizing/normalizing values) - selecting which instances to include

Multiclass Classification

Refers to the setting when there are more than 2 possible class labels

Multi-label Classification

Refers to the setting where there is more than one label you want to predict. - one approach: train one classifier to predict one label, then use its output as a feature in the other (cons: uncertainty, only one informs the other) - another approach: treat combinations of classes as their own "classes", then do single-label classification (pros: allows the classifier to learn that certain combinations are more likely. cons: all combined classes are learned independently, e.g., tuxedo+male is completely separate from tuxedo+female)

Why does regularization help with generalization

Regularization helps with generalization because it won't give large weights to features unless there is sufficient evidence that they are useful. (The usefulness of a feature towards improving the loss has to outweigh the cost of having a large feature weight)

Stochastic Gradient Descent

SGD is a variant of gradient descent that makes updates using an approximation of the gradient that is based on only one instance at a time. The gradient of one instance's loss is an approximation to the true gradient
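
A sketch of per-instance updates on a one-feature linear regression with squared loss. The model, the loss, and the fixed visiting order are assumptions chosen to keep the example short and deterministic; in practice the instances are usually shuffled each epoch.

```python
def sgd_linear_regression(data, eta=0.01, epochs=1000):
    """Fit y = w*x + b by updating on one (x, y) instance at a time."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of this one instance's squared loss, (w*x + b - y)^2.
            w -= eta * 2 * error * x
            b -= eta * 2 * error
    return w, b

# Noise-free toy data generated from y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]
w, b = sgd_linear_regression(data)
```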

Types of Machine Learning

- Supervised (goal: prediction) - Unsupervised (goal: discovery) - Reinforcement

Regularization

The act of modifying a learning algorithm to favor "simpler" prediction rules to avoid overfitting. Most commonly, regularization refers to modifying the loss function to penalize certain values of the weights you are learning. Specifically, penalize large weights

Grid Search

The process of evaluating every combination of settings (from a specified set of potential values) on validation data
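
A sketch of the search loop. The `validation_score` callback is hypothetical: it stands in for training a model with the given setting and scoring it on validation data. The hyperparameter names in the example grid are also made up.

```python
from itertools import product

def grid_search(grid, validation_score):
    """grid maps each hyperparameter name to a list of candidate values."""
    names = list(grid)
    best_setting, best_score = None, float("-inf")
    # product() enumerates every combination of one value per hyperparameter.
    for values in product(*(grid[name] for name in names)):
        setting = dict(zip(names, values))
        score = validation_score(setting)
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting, best_score

# Hypothetical grid and a stand-in scoring function that peaks at eta=0.1, depth=3.
grid = {"eta": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
best, score = grid_search(grid, lambda s: -(s["eta"] - 0.1) ** 2 - (s["depth"] - 3) ** 2)
```

The number of evaluations is the product of the list lengths (9 here), which is why grids grow expensive quickly.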

Feature Extraction

The process of getting the values of features out of raw data (scripts to turn raw data into features)

initializing cluster means in k means clustering

Two approaches: 1. randomly assign each instance to a cluster and calculate means 2. pick k points at random and treat them as cluster means (generally better approach, leads to initial cluster means that are more spread out)

One vs Rest (One vs All)

Used for multiclass classification OVR classification involves training a binary classifier for each class. Each classifier predicts whether the instance belongs to the target class or not. Label instance with whichever classifier has the highest score

All Pairs

Used for multiclass classification. Trains a Binary Classifier for every pair of classes. Whichever class "wins" more pairwise classifications will be the final prediction
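
The prediction rules for One vs Rest and All Pairs can be sketched side by side. The trained binary classifiers are abstracted away: `scores` and `pairwise_winner` are hypothetical stand-ins for their outputs.

```python
from collections import Counter
from itertools import combinations

def one_vs_rest_predict(scores):
    """scores maps each class to its own binary classifier's score for the instance."""
    return max(scores, key=scores.get)

def all_pairs_predict(classes, pairwise_winner):
    """pairwise_winner(a, b) returns whichever of the two classes that pair's classifier picks."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

ovr = one_vs_rest_predict({"cat": 0.2, "dog": 0.7, "bird": 0.1})
# Stand-in pairwise classifiers: "dog" wins any pair it is in; otherwise the first class wins.
pairs = all_pairs_predict(["cat", "dog", "bird"], lambda a, b: "dog" if "dog" in (a, b) else a)
```

With k classes, One vs Rest consults k classifiers at prediction time, while All Pairs consults k(k-1)/2.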

TP, FP, TN, FN

With respect to a class c: - True Positive: classifier predicted c and label is c - False Positive (Type I error): classifier predicted c and label is not c - True Negative: classifier predicted not c and label is not c - False Negative (Type II error): classifier predicted not c and label is c
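
The four counts can be computed directly from these definitions, and precision and recall follow from them. The labels below are made up for illustration.

```python
def confusion_counts(true_labels, predicted_labels, c):
    """Return (TP, FP, TN, FN) with respect to class c."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == c and truth == c:
            tp += 1
        elif pred == c:
            fp += 1  # predicted c, but the label is not c
        elif truth == c:
            fn += 1  # predicted not c, but the label is c
        else:
            tn += 1
    return tp, fp, tn, fn

truth = ["cat", "cat", "dog", "dog", "bird"]
preds = ["cat", "dog", "dog", "cat", "cat"]
tp, fp, tn, fn = confusion_counts(truth, preds, "cat")
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```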

Out of sample data

Data that was not seen during training. - also called held-out data or a holdout set - usually assumed to be from the same distribution as the in-sample data

General Form of a Linear Function

f(x1, x2, ..., xk) = (∑(i = 1 to k) mi*xi) + b - one variable: line - two variables: plane - in general: hyperplane

Accuracy

number of correctly classified instances divided by total number of instances

recipe for supervised machine learning

pattern matching + generalization

Feature selection

refers to choosing a subset of specific features out of all of the features you have engineered and extracted. A form of dimensionality reduction - can reduce complexity of classifier (and therefore reduce overfitting) - can help with runtime/memory complexity

feature extraction

refers to the actual step of converting raw data into vectors of feature values. Consider efficiency when extracting features: you will need to extract features for every instance you classify

Feature engineering

the process of designing the types of features that will be used in a classifier or predictor

General form of a line

y = mx + b - m and b are parameters/coefficients (constant once specified) - x is the argument (input)
