Machine Learning Exam 1
Multi-label Classification System
Classification system that outputs multiple binary labels, e.g., a classifier trained to recognize 3 people (A, B, C) could return [1, 0, 1], meaning Yes to A, No to B, and Yes to C.
Receiver Operating Characteristic (ROC) Curve
Common tool used with binary classifiers. Very similar to the precision/recall curve, but instead of plotting precision vs. recall, the ROC curve plots the true positive rate (recall, also called sensitivity) against the false positive rate. Use it when you care more about the false negatives than the false positives.
Gradient Descent
Generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea is to tweak parameters iteratively in order to minimize a cost function. It measures the local gradient of the error function with regard to the parameter vector theta and steps in the direction of the descending gradient; once the gradient is 0, you have reached a minimum. The size of the steps is determined by the learning rate (a hyperparameter). Gradient descent searches in the model's parameter space: the more parameters a model has, the more dimensions this space has and the harder the search is.
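A minimal NumPy sketch of the idea on made-up linear data; the learning rate and iteration count are illustrative, not from the cards:

import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))                 # made-up feature
y = 4 + 3 * X + rng.normal(size=(100, 1))    # made-up target: y = 4 + 3x + noise

X_b = np.c_[np.ones((100, 1)), X]            # add x0 = 1 to each instance
eta = 0.1                                    # learning rate (hyperparameter)
theta = rng.normal(size=(2, 1))              # random initialization of the parameter vector

for _ in range(1000):
    gradients = 2 / len(X_b) * X_b.T @ (X_b @ theta - y)   # local gradient of the MSE w.r.t. theta
    theta = theta - eta * gradients                        # step in the direction of descending gradient
# theta ends up close to [[4], [3]]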
Support Vectors
The instances located on the edge of the street; the decision boundary is fully determined (or "supported") by them. SVMs are sensitive to feature scales, so feature scaling should be applied to the data.
Challenges of Machine Learning
Insufficient quantity of training data; non-representative training data; poor-quality data; irrelevant features; overfitting/underfitting the training data.
Feature Engineering
Process of coming up with a good set of features to train on. Involves feature selection, feature extraction, and creating new features by gathering new data.
Lasso Regression
Just like Ridge Regression, it adds a regularization term to the cost function, but it uses the l1 norm of the weight vector instead of half the square of the l2 norm. Tends to completely eliminate the weights of the least important features (sets them to zero). Automatically performs feature selection and outputs a sparse model (with few nonzero feature weights).
Lasso vs Ridge Regression
Lasso -- uses the l1 norm of the weight vector in the regularization term. Pros: leads to a sparse model, performs feature selection automatically, good if only a few features are actually useful. Ridge Regression -- regularized form of linear regression, uses half the square of the l2 norm in the regularization term. Pros: a good default.
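A small scikit-learn sketch contrasting the two on made-up data where only two of five features matter (the alpha values are illustrative):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                     # 5 features, only 2 are useful
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)    # l2 penalty: shrinks all weights, none become exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)    # l1 penalty: weights of the least important features go to zero

print(ridge.coef_)   # small but nonzero weights everywhere
print(lasso.coef_)   # sparse model: zero or near-zero weights for the irrelevant features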
Soft Margin Classification
Find a good balance between keeping the street as large as possible (margin as wide as possible) and limiting the number of margin violations. Can control this using the C hyperparameter in Scikit-Learn: small C -> wider street and more margin violations; large C -> narrower street and fewer margin violations. If the SVM is overfitting, you can regularize it by reducing C.
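A hedged scikit-learn sketch of tuning C with LinearSVC; the dataset and C values are illustrative:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris()
X = iris.data[:, 2:]                    # petal length and width
y = (iris.target == 2).astype(int)      # Iris virginica vs. the rest

# Small C -> wider street, more margin violations (stronger regularization)
svm_small_c = make_pipeline(StandardScaler(), LinearSVC(C=0.1)).fit(X, y)
# Large C -> narrower street, fewer margin violations (weaker regularization)
svm_large_c = make_pipeline(StandardScaler(), LinearSVC(C=100)).fit(X, y)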
value_counts() method
Find out what categories exist and how many instances there are in each category for a particular feature.
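A tiny pandas example on a made-up categorical column:

import pandas as pd

df = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})  # made-up data
print(df["ocean_proximity"].value_counts())
# INLAND      2
# NEAR BAY    1
# ISLAND      1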
(QUIZ 3) Suppose you use Batch Gradient Descent and plot the validation error at every iteration. If you notice that the validation error consistently goes up, what is likely going on?
Learning rate is too high
(QUIZ 4) In Ridge Regression, what may happen to the model when you decrease the hyper-parameter \lambda?
Lower bias, higher variance
What is Machine Learning
Machine Learning is the science (and art) of programming computers so they can learn from data.
Linear Regression
Makes a prediction by simply computing a weighted sum of the input features plus a constant called the bias term (intercept term): y-hat = theta0 + theta1*x1 + ... + thetaN*xN. Can be written compactly as y-hat = theta . x, where theta is the model's parameter vector containing the bias term theta0 and the feature weights theta1 to thetaN, and x is the instance's feature vector (x0 to xN, with x0 always = 1).
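A short scikit-learn sketch on made-up data following y = 4 + 3x plus noise:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

lin_reg = LinearRegression().fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)   # bias term (theta0) and feature weight (theta1)
print(lin_reg.predict([[1.5]]))            # y-hat = theta0 + theta1 * 1.5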
SVM Polynomial Kernel Trick
Makes it possible to get the same result as if you added many polynomial features, even with very high-degree polynomials, without actually having to add them.
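A hedged sketch using SVC with a polynomial kernel on the classic moons dataset; the degree, coef0, and C values are illustrative:

from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# kernel="poly" behaves as if degree-3 polynomial features had been added,
# without ever materializing them
poly_kernel_svm = make_pipeline(StandardScaler(),
                                SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm.fit(X, y)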
Normal Equation
Mathematical equation (closed-form solution) that finds the value of theta that minimizes the cost function: theta-hat = (X^T X)^-1 X^T y
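A minimal NumPy computation of the closed-form solution on made-up data (y = 4 + 3x plus noise):

import numpy as np

rng = np.random.default_rng(2)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]                    # add x0 = 1 to each instance
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # (X^T X)^-1 X^T y
print(theta_hat)                                     # close to [[4], [3]]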
Utility (fitness) function
Measures how good the model is
Learning Schedule
Function that determines the learning rate at each iteration. If the learning rate is reduced too quickly, you may get stuck in a local minimum or end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if training is halted too early.
split_train_test
Function to split data set into training and testing sets.
Bias Error
The part of the generalization error due to wrong assumptions (e.g., assuming the data is linear when it is actually quadratic). A high-bias model is most likely to underfit the training data.
Feature Extraction
Merging two or more features into one more useful feature (e.g., merging a car's mileage and age into a single wear-and-tear feature).
Elastic Net
Middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge's and Lasso's regularization terms, and you can control the mix ratio r: when r = 0, Elastic Net = Ridge Regression; when r = 1, Elastic Net = Lasso Regression.
Cross Entropy
The cost function minimized in Logistic Regression, leading to a model that estimates high probabilities for the target class. Cross entropy penalizes the model when it estimates a low probability for the target class. Measures the average number of bits you actually send per option. Frequently used to measure how well a set of estimated class probabilities matches the target classes.
Underfitting
Model is too simple to learn the underlying structure of the data. Can solve by selecting a more powerful model with more parameters, feeding it better features, or reducing the constraints on the model (e.g., reducing the regularization hyperparameter).
Overfitting
Model performs well on the training data but doesn't generalize well. Occurs when the model is too complex relative to the amount and noisiness of the training data. Can solve by simplifying the model, getting more data, or reducing the noise in the data (data clean-up).
Can logistic regression perfectly classify data when no line can be drawn between all points?
No
Polynomial Regression
Add powers of each feature as new features if the data is more complex than a simple straight line. PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n+d)!/(d!*n!) features.
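A short scikit-learn sketch on made-up quadratic data (y = 0.5x^2 + x + 2 plus noise):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                # adds x^2 as a new feature
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)      # roughly 2 and [1, 0.5]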
Reinforcement Learning
Agent observes the environment, select and perform actions, and get rewards or penalties in return. It must then learn by itself what is the best strategy (policy) to get the most reward over time.
(QUIZ 5) Suppose you have trained a SVM with linear decision boundary. After training SVM, you suspect that your SVM model is under-fitting. What is a good solution to fit this issue?
Allow fewer margin violations by increasing the value of C
Similarity Function
Another technique to tackle nonlinear problems: add features computed with a similarity function, which measures how much each instance resembles a particular landmark.
scatter_matrix() method
Another way to check for correlation between attributes. Plots every numerical attribute against every other numerical attribute (usually limited to the top 4-5 attributes that seem most correlated).
Mini-Batch Gradient Descent
At each step, instead of computing the gradients based on the full training set or based on just one instance, Mini-Batch GD computes the gradients on small random sets of instances called mini-batches. Advantage over Stochastic GD: a performance boost from hardware optimization of matrix operations.
(QUIZ 3) Which gradient descent algorithm will actually converge?
Batch Gradient Descent
(QUIZ 3) Suppose the features in your training set have very different scales (range of values). Which algorithms may suffer from this?
Batch Gradient Descent, Mini-batch Gradient Descent, Stochastic Gradient Descent
Model-based learning
Build a model from the set of examples and then use that model to make predictions.
Batch Gradient Descent
Calculate how much the cost function will change if we change a parameter just a bit (i.e., partial derivatives). Same as asking "what is the slope of the mountain if I take a step to the east?" and then asking the same question for the other directions. Uses all of the training data at every step, hence "batch".
Multiclass Classifiers
Can distinguish between more than 2 classes
Binary Classifier
Capable of distinguishing between just two classes, X and not X.
Batch learning
The system is incapable of learning incrementally: it must be trained using all the available data. Typically done offline. (offline learning)
Supervised Learning
The training data fed to the algorithm includes the desired solutions, called labels
Unsupervised Learning
The training data is unlabeled, so the system tries to learn without a "teacher" Can use visualization algorithms to better visualize data.
(QUIZ 3) Mathematical form of Normal Equation
theta-hat = (X^T X)^-1 X^T y, where X is the training set feature matrix, y is the training label vector, and theta-hat is the vector of optimal parameters.
One-versus-all (OvA) or (one-versus-the-rest)
To classify an image, create a binary classifier for each class, get the decision score from each classifier for that image, and select the class whose classifier outputs the highest score. Preferred for most binary classification algorithms.
(QUIZ 2) Why would you need to use a Pipeline to process a sequence of data transformations?
To ensure incoming data is transformed with the same sequence of operations and parameters, so results can be compared across different sets (training, test, validation).
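A hedged scikit-learn sketch: the pipeline learns its parameters (medians, means, scales) on the training set and reuses them on other sets. The data values are made up:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 260.0]])   # made-up training data
X_test = np.array([[1.5, 230.0]])                                 # made-up test data

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

X_train_prepared = num_pipeline.fit_transform(X_train)   # fit on the training data only
X_test_prepared = num_pipeline.transform(X_test)         # same operations and parameters reused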
One-versus-one (OvO)
Train a binary classifier for every pair of classes, so if there are N classes, you need to train N(N-1)/2 classifiers. Advantage: each classifier only needs to be trained on the part of the training set containing the two classes it must distinguish.
Categories of Machine Learning Systems
Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and reinforcement learning); whether or not they can learn incrementally on the fly (online vs. batch learning); whether they compare new data points to known data points or instead detect patterns in the training data and build a predictive model (instance-based vs. model-based learning).
Online Learning (Incremental learning)
Training the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. System can learn about new data on the fly since each learning step is fast and cheap. Is still usually done offline.
(QUIZ 4) You have been adding more training examples, but the performance of your regression model (on both training and validating data) does not get any better. From which cause is your algorithm most likely to suffer?
Under-fitting the model
Noise in data
Unexplained variability within a data sample (i.e., partly random)
Stochastic (random) Gradient Descent
Unlike (Batch) Gradient Descent, this just picks a random instance in the training set at every step and computes the gradients based only on that single instance. This makes the algorithm much faster since it has very little data to manipulate at every iteration. The cost function will bounce up and down, decreasing only on average; over time it will end up very close to the minimum but will continue to bounce around, never settling. The final parameter values are good, but not optimal.
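A hedged scikit-learn sketch using SGDRegressor on made-up linear data (hyperparameter values are illustrative):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(5)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

# trains on one instance at a time, for up to max_iter epochs over the shuffled training set
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1, random_state=5)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_)   # close to 4 and 3, but not exactly optimal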
Cross-validation
Used to avoid "wasting" too much training data in validation sets. The training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts.
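A minimal scikit-learn sketch with 5-fold cross-validation; the model and dataset are illustrative:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# train on 4 folds and validate on the remaining fold, 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())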
Linear SVM
Used to linearly separate different classes of data
info() method
Useful to get a quick description of the data, in particular the total number of rows, and each attribute's type and number of non-null values
Sampling Bias
Even very large samples can be non-representative if the sampling method is flawed.
one-versus-all strategy
A way to build a multiclass classifier from binary classifiers: train one binary classifier per class (e.g., one digit versus all the other digits). When classifying an image, you select the class whose classifier outputs the highest score. Usually preferred to OvO.
8 Steps of Machine Learning
1) Look at the big picture 2) Get the data 3) Discover and visualize the data to gain insights 4) Prepare the data for ML algorithms 5) Select a model and train it 6) Fine-tune your model 7) Present your solution 8) Launch, monitor, and maintain your system
(QUIZ 1) Name three common unsupervised tasks
1. Clustering 2. Visualization 3. dimensionality reduction
What are 4 types of problems that ML is perfect for
1. Problems that require a long list of hand-tuned rules 2. Problems too complex for a traditional approach 3. Problems that need to adapt to changing environments 4. Getting insights for human learning
(QUIZ 4) How many binary classifiers do you need to build a classification system for 9 unique classes using the "one-versus-one" strategy?
36
(QUIZ 2) OneHotEncoder can be used to encode categorical data into numerical data. If the categorical data have 6 distinct values, how many zeros would the OneHotEncoder need to encode each entry?
5
Semisupervised learning
A lot of unlabeled data with a little bit of labeled data. Tend to usually be combinations of unsupervised and supervised algorithms.
confusion matrix
A table showing, for each actual class, how often instances were predicted as each class. For a binary classifier (rows = actual, columns = predicted): top left = TN, top right = FP, bottom left = FN, bottom right = TP.
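A tiny scikit-learn example on made-up binary labels and predictions:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # made-up actual labels
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]   # made-up predictions

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]     rows = actual, columns = predicted:
#  [1 3]]    [[TN, FP], [FN, TP]]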
Root Mean Square Error (RMSE)
A typical performance measure for regression problems. Gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. m = number of instances in the dataset; x^(i) = vector of all the feature values (excluding the label) of the ith instance; y^(i) = the label (desired output value) for that instance; h = the hypothesis, the system's prediction function. RMSE(X, h) = sqrt((1/m) * sum over i of (h(x^(i)) - y^(i))^2), the cost function measured on the set of examples using the hypothesis h.
One-versus-one strategy
A way to build a multiclass classifier from binary classifiers: train one for every pair of digits. When classifying an image, you run the image through all the classifiers and see which class wins the most duels. If there are N classes, you train N(N-1)/2 classifiers.
Logistic Regression
Commonly used to estimate the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to the class; otherwise it predicts that it does not. This makes it a binary classifier. Computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like Linear Regression, it outputs the logistic (noted σ) of this result. σ = sigmoid function (S-shaped) that outputs a number between 0 and 1. The objective of training is to set the parameter vector theta so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
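A short scikit-learn sketch classifying Iris virginica from petal width; the 1.7/1.5 inputs are illustrative:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                       # petal width only
y = (iris.target == 2).astype(int)         # 1 if Iris virginica, else 0

log_reg = LogisticRegression().fit(X, y)
print(log_reg.predict_proba([[1.7], [1.5]]))   # estimated probabilities for each class
print(log_reg.predict([[1.7], [1.5]]))         # predicts class 1 when p-hat >= 0.5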
corr() method
Computes the standard correlation coefficient (also called Pearson's r) between every pair of attributes.
Regularization
Constraining a model to make it simpler and reduce the risk of overfitting. Controlled by a hyperparameter (a parameter of the learning algorithm, not of the model): a large value leads to an almost flat model (slope close to 0).
MSE Cost Function for Linear Regression
Convex function, meaning that if you pick any two points on the curve, the line segment joining them never crosses the curve; thus, implying that there are no local minima and just one global minimum. Also a continuous function with a slope that never changes abruptly.
OneHotEncoder
Creates one binary attribute per category: the attribute is 1 when the instance's category matches, and 0 otherwise. Only one attribute will be hot (1) at a time, while the others are cold (0). Used for changing categorical data into numbers.
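A tiny scikit-learn example on a made-up categorical column (sparse_output assumes a recent scikit-learn; older versions use sparse=False):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.array([["INLAND"], ["NEAR BAY"], ["ISLAND"], ["INLAND"]])   # made-up data

encoder = OneHotEncoder(sparse_output=False)   # sparse=False on older scikit-learn versions
one_hot = encoder.fit_transform(categories)
print(encoder.categories_)   # the 3 distinct categories found
print(one_hot)               # each row has a single 1 (hot) and two 0s (cold)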
Early Stopping
A different way to regularize iterative learning algorithms such as Gradient Descent: stop training as soon as the validation error reaches a minimum.
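A hedged sketch of the pattern with SGDRegressor and warm_start on made-up data; the epoch count and learning rate are illustrative:

import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=4)

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       learning_rate="constant", eta0=0.005, random_state=4)
best_val_error, best_model = float("inf"), None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)                                 # warm_start continues from previous weights
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:                                # keep the model at the validation-error minimum
        best_val_error = val_error
        best_model = deepcopy(sgd_reg)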
Association rule learning
Dig into large amounts of data and discover interesting relations b/w attributes.
Variance Error
The part of the generalization error due to the model's excessive sensitivity to small variations in the training data. High degrees of freedom -> high variance -> overfitting the training data.
Irreducible Error
The part of the generalization error due to the noisiness of the data itself. The only way to reduce it is to clean up the data.
Generalization Error (Out-of-sample error)
Error rate on new cases. Tells you how well your model will perform on instances it has never seen before. (if training error is low but generalization error is high -> model is overfitting training data)
Training set
Examples that the system uses to learn (each training example is called a training instance (sample))
Softmax Regression
Given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores. Softmax function: computes the exponential of every score, then normalizes them (dividing by the sum of all the exponentials). Predicts only one class at a time (multiclass but not multioutput), so it should be used only with mutually exclusive classes such as different types of plants. Can't use it to recognize multiple people in one picture.
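A minimal NumPy sketch of the softmax function itself, with made-up class scores:

import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exps / exps.sum()               # normalize by the sum of all the exponentials

scores = np.array([2.0, 1.0, 0.1])         # made-up scores s_k(x) for 3 classes
print(softmax(scores))                     # class probabilities, summing to 1
print(softmax(scores).argmax())            # predicted class = the one with the highest probability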
Bias vs Variance
High bias means that the model is very far from the target (underfit) High variance means that there is no consistency or pattern in the data (overfit). Regularizing the model can prevent overfitting
Learning rate
How fast an online learning system should adapt to changing data. High learning rate -> the system will rapidly adapt to new data but will also tend to quickly forget the old data. Low learning rate -> System will learn more slowly but will also be less sensitive to noise in the new data or to sequences of non-representative data points.
Recall (sensitivity or true positive rate (TPR))
How many relevant items are selected. Ratio of positive instances that are correctly detected by the classifier = TP / (TP + FN)
Precision
How many selected items are relevant. Accuracy of the positive predictions (does not account for false negatives) = TP / (TP + FP)
Feature Scaling
One of the most important transformations you need to apply to your data since ML algorithms don't perform well when the input numerical attributes have very different scales.
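A tiny scikit-learn example showing the two common options, standardization and min-max scaling, on made-up features with very different scales:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 20000.0],
              [2.0, 35000.0],
              [3.0, 50000.0]])   # made-up features on very different scales

print(StandardScaler().fit_transform(X))   # standardization: subtract the mean, divide by the std
print(MinMaxScaler().fit_transform(X))     # min-max scaling: rescale each feature to [0, 1]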
Learning Curves
Plots of the model's performance on the training set and the validation set as a function of the training set size (or training iteration) One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.
Support Vector Machine (SVM)
Powerful and versatile ML model capable of performing linear or nonlinear classification, regression, and even outlier detection. Particularly well suited for classification of complex but small/medium sized datasets. Called Large Margin Classification (fitting the widest possible street (represented by the parallel dashed lines) between the classes)
Regression
Predicting a numeric value given a set of features called predictors. (Despite its name, logistic regression is commonly used for classification.)
RMSE vs MAE
RMSE emphasizes larger residuals and is more sensitive to outliers
logistic regression
Regression technique used when the outcome is a binary, or dichotomous, variable; can be used for classification as well. The decision boundary is where p-hat = 0.5: y-hat = 0 if p-hat < 0.5, y-hat = 1 if p-hat >= 0.5.
Ridge Regression
Regularized version of Linear Regression: a regularization term equal to α * (1/2) * sum of theta_i^2 is added to the cost function (MSE). Forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. The regularization term should only be added during training; once trained, you evaluate performance using the unregularized performance measure. α = hyperparameter that controls how much you want to regularize the model: α = 0 -> just regular Linear Regression; α very large -> all weights end up very close to zero and the result is a flat line going through the data's mean. Increasing α reduces variance but increases bias; decreasing α increases variance and leads to less bias and a more complex model.
describe() method
Shows a summary of the numerical attributes (5 number summary basically for each feature + std and mean)
Mean Absolute Error
Similar to RMSE but uses the l1 norm (|x0| + |x1| + ...). Preferred when there are many outliers, since the l1 norm gives less weight to large errors than higher norm indices. RMSE is generally preferred when outliers are exponentially rare.
Dimensionality reduction
Simplify the data without losing too much information. (Merge several correlated features into one)
Multioutput-multiclass Classification (Multioutput classification)
Simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values)
(QUIZ 3) Which gradient descent algorithm will reach the vicinity of the optimal solution *the fastest*?
Stochastic Gradient Descent
(QUIZ 3) What Linear Regression algorithms should you use if you need to *quickly* train a dataset with thousands of features?
Stochastic Gradient Descent and Mini-batch Gradient Descent
Hard Margin Classification
Strictly impose that all instances be off the street and on the right side. 2 issues: 1) Only works if the data is linearly separable 2) Quite sensitive to outliers
Instance-based learning
System learns the examples by heart and then generalizes to new cases using a similarity measure.
(QUIZ 4) F-1 and its formula
Harmonic mean of precision and recall, which gives more weight to low values; both precision and recall must be high for F-1 to be high. F-1 = TP / (TP + (FN + FP)/2)
Margin Violations
instances that end up in the middle of the street or even on the wrong side.
Cost function
measures how bad the model is (commonly used for Linear Regression to measure the residuals between the model's predictions and the training examples)
True Negative Rate (TNR)
ratio of negative instances that are correctly classified as negative. Also called specificity.
False positive rate
ratio of negative instances that are incorrectly classified as positive = 1 - true negative rate