Machine Learning


Loss Function

A function that measures how far our prediction deviates from the actual value we know to be correct. Examples of common loss functions are Mean Squared Error and Log Loss.
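For illustration, a minimal NumPy sketch of both losses (the function names and arrays are only illustrative):

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average squared difference between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

def log_loss(y_true, p_pred, eps=1e-15):
    # Binary cross-entropy; p_pred holds predicted probabilities for the positive class
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))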

Regression vs. classification

A regression model predicts continuous values, such as a price or a probability. A classification model predicts discrete values, such as which of a fixed set of classes an example belongs to.

Learning Rate Annealing

Decreasing the learning rate as the loss is decreasing. A better approach than simply picking some sort of incremental rate is to use cosine annealing.
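A rough sketch of a cosine annealing schedule (NumPy assumed; lr_max, lr_min, and total_steps are illustrative hyperparameters):

import numpy as np

def cosine_annealed_lr(step, total_steps, lr_max=0.1, lr_min=0.001):
    # The learning rate follows half a cosine wave from lr_max down to lr_min
    return lr_min + (lr_max - lr_min) * (1 + np.cos(np.pi * step / total_steps)) / 2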

F1 score

F1 Score is the weighted average of Precision and Recall, so this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost; if the cost of false positives and false negatives is very different, it's better to look at both Precision and Recall. F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
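A minimal sketch of the calculation from raw confusion-matrix counts (tp, fp, and fn are assumed inputs):

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)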

Sparse representation

A representation of a tensor that only stores nonzero elements. For example, the English language consists of about a million words. Consider two ways to represent a count of the words used in one English sentence: a dense representation of this sentence must store an integer for all one million cells, placing a 0 in most of them and a small integer in a few of them; a sparse representation stores only those cells representing a word actually present in the sentence. So, if the sentence contained only 20 unique words, then the sparse representation for the sentence would store an integer in only 20 cells.
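A tiny sketch of the idea in Python (the sentence is made up; a real vocabulary would have roughly a million entries):

sentence = "the cat sat on the mat".split()

# Dense representation: one integer per vocabulary word (about a million cells, almost all 0)
# Sparse representation: store counts only for the words that actually occur
sparse_counts = {}
for word in sentence:
    sparse_counts[word] = sparse_counts.get(word, 0) + 1

print(sparse_counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}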

Mini-batch stochastic gradient descent (mini-batch SGD)

A compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random.
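A minimal sketch of one epoch of mini-batch SGD for a linear model (NumPy assumed; X, y, w, b, and the hyperparameters are placeholders):

import numpy as np

def minibatch_sgd_epoch(X, y, w, b, lr=0.01, batch_size=32):
    # Shuffle the examples, then take one gradient step per random mini-batch
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                    # prediction error on this mini-batch
        w = w - lr * (Xb.T @ err) / len(batch)   # gradient of 1/2 * mean squared error w.r.t. w
        b = b - lr * err.mean()                  # gradient of 1/2 * mean squared error w.r.t. b
    return w, b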

Feature cross

A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
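A tiny sketch of one common form of feature cross, combining two binned (categorical) features into a single synthetic feature (the features and bin counts are made up):

import numpy as np

latitude_bin = np.array([0, 1, 1, 2])    # hypothetical binned latitude per example
longitude_bin = np.array([1, 0, 2, 2])   # hypothetical binned longitude per example

# Cross: one combined category id per (latitude, longitude) pair
n_longitude_bins = 3
lat_x_lon = latitude_bin * n_longitude_bins + longitude_bin
print(lat_x_lon)  # [1 3 5 8]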

Deep learning - activation function

A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
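A minimal NumPy sketch of common activations (tanh is included for comparison with the Tanh vs Sigmoid card later in this set):

import numpy as np

def relu(z):
    # Passes positive inputs through unchanged, outputs 0 otherwise
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes any real input into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Squashes any real input into the range (-1, 1)
    return np.tanh(z)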

Iterative trial-and-error process to find lowest loss.

A model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until learning the weights and bias with the lowest possible loss.

A/B testing

A statistical way of comparing two (or more) techniques, typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measurements.

Active learning

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

Differential Learning Rates

Also known as discriminative fine-tuning. Using different learning rates for different layers (or layer groups) of the network, typically lower rates for the earlier layers and higher rates for the later layers.

Tanh vs Sigmoid

Both are sigmoidal (S-shaped), but the sigmoid function outputs values between 0 and 1, whereas the tanh function outputs values between -1 and +1.

Stochastic gradient descent

Choosing only one example at random from the data set to calculate the gradient for one iteration.

Reinforcement Learning

Reinforcement learning allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Simple reward feedback, known as the reinforcement signal, is required for the agent to learn its behavior.

kNN (k- Nearest Neighbors)

It can be used for both classification and regression problems. However, it is more widely used in classification problems in industry. k nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k neighbors: the case is assigned to the class most common among its k nearest neighbors, as measured by a distance function.
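A minimal sketch of kNN classification with Euclidean distance (NumPy assumed; X_train, y_train, and x_query are placeholder arrays):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Distance from the query point to every stored training case
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]      # indices of the k closest cases
    votes = Counter(y_train[nearest])    # majority vote among the neighbors
    return votes.most_common(1)[0][0]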

SVM (Support Vector Machine)

It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.
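One possible illustration using scikit-learn's SVC (assuming scikit-learn is available; the toy data is made up):

from sklearn.svm import SVC

# Toy 2-feature data: two groups of points
X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")   # finds the separating hyperplane with the largest margin
clf.fit(X, y)
print(clf.predict([[1.5, 1.5], [9.5, 9.5]]))  # -> [0 1]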

Naive Bayes

It is a classification technique based on Bayes' theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
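One possible illustration using scikit-learn's GaussianNB (assuming scikit-learn is available; the toy data is made up):

from sklearn.naive_bayes import GaussianNB

# Each feature is modeled independently given the class (the "naive" assumption)
X = [[1.0, 20.0], [1.2, 22.0], [0.9, 19.0], [3.0, 40.0], [3.2, 42.0], [2.9, 39.0]]
y = [0, 0, 0, 1, 1, 1]

clf = GaussianNB()
clf.fit(X, y)
print(clf.predict([[1.1, 21.0], [3.1, 41.0]]))  # -> [0 1]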

K-Means

It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to points in other clusters.
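One possible illustration using scikit-learn's KMeans (assuming scikit-learn is available; the toy points are made up):

from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)        # cluster assignment for each point
print(labels, km.cluster_centers_)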

Linear regression

Linear regression is a method for finding the straight line or hyperplane that best fits a set of points.

List of Common Algorithms

Nearest Neighbor, Naive Bayes, Decision Trees, Linear Regression, Support Vector Machines (SVM), Neural Networks

Mean Absolute Error, L1 Loss

Mean Absolute Error (MAE) is another loss function used for regression models. MAE is the average of the absolute differences between our target and predicted variables, so it measures the average magnitude of errors in a set of predictions without considering their direction. (If we also considered direction, that would be the Mean Bias Error (MBE), the mean of the residuals/errors.) The range is also 0 to ∞.

Mean Square Error, Quadratic loss, L2 Loss

Mean Square Error (MSE) is the most commonly used regression loss function. MSE is the average of the squared differences between our target variable and the predicted values.
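A small NumPy sketch comparing the two losses on the same (made-up) predictions:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # L1 loss: average absolute error
mse = np.mean((y_true - y_pred) ** 2)    # L2 loss: average squared error
print(mae, mse)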

Problem in using MAE loss

One big problem with MAE loss (especially for neural nets) is that the magnitude of its gradient is the same throughout, which means the gradient stays large even for small loss values. This isn't good for learning. To fix this, we can use a dynamic learning rate that decreases as we move closer to the minimum.

Data Augmentation

One way to mitigate overfitting is to effectively create more data, through data augmentation. This refers to randomly changing the images in ways that shouldn't impact their interpretation, such as horizontal flipping, zooming, and rotating.
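One possible sketch of an augmentation pipeline, assuming torchvision is available (the specific transforms and parameters are only illustrative):

from torchvision import transforms

# Random, label-preserving changes applied to each training image
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # random zoom/crop
    transforms.ToTensor(),
])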

Precision vs Recall

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.

Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all passengers labeled as survived, how many actually survived? High precision relates to a low false positive rate. We got a precision of 0.788, which is pretty good. Precision = TP / (TP + FP)

Random Forest

Random Forest is a trademark term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (hence the "forest"). To classify a new object based on its attributes, each tree produces a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
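One possible illustration using scikit-learn's RandomForestClassifier (assuming scikit-learn is available; the toy data is made up):

from sklearn.ensemble import RandomForestClassifier

# Each of the 100 trees votes; the majority class wins
X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[1.5, 1.5]]))  # -> [0]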

Recall

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The question recall answers is: of all the passengers that truly survived, how many did we label as survived? We got a recall of 0.631, which is good for this model as it is above 0.5. Recall = TP / (TP + FN)

ReLU

Rectified Linear Unit: ReLU(a) = max(0, a). Successor to the sigmoid activation function.

Scaling feature values

Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits: it helps gradient descent converge more quickly; it helps avoid the "NaN trap," in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training) and, through subsequent math operations, every other number in the model eventually becomes a NaN as well; and it helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.
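A minimal NumPy sketch of one common scaling scheme, min-max scaling into [0, 1] (the feature values are made up):

import numpy as np

feature = np.array([100.0, 250.0, 400.0, 900.0])   # natural range roughly 100 to 900

scaled = (feature - feature.min()) / (feature.max() - feature.min())
print(scaled)  # values now lie in the standard range [0, 1]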

Gradient

The gradient is the derivative of the loss function with respect to the weights. It is used to update the weights so as to minimize the loss function during backpropagation in neural networks.

Attention

The basic idea: each time the model predicts an output word, it uses only the parts of the input where the most relevant information is concentrated, instead of the entire sentence. In other words, it pays attention to only some of the input words.
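A rough NumPy sketch of scaled dot-product attention, one common way this idea is implemented (the query/key/value matrices here are random placeholders):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # The weights say how much each output position attends to each input position
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
out = attention(Q, K, V)   # one context vector per query position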

MSE vs. MAE (L2 loss vs L1 loss)

Using the squared error is easier to solve, but using the absolute error is more robust to outliers. Whenever we train a machine learning model, our goal is to find the point that minimizes loss function. Both functions reach the minimum when the prediction is exactly equal to the true value.

Vanishing Gradient

Vanishing gradients occur when the derivative (slope) gets smaller and smaller with each layer as we go backward during backpropagation. When the weight updates are very small, training takes much longer, and in the worst case the neural network may stop learning entirely. The vanishing gradient problem arises with the sigmoid and tanh activation functions because the derivative of the sigmoid lies between 0 and 0.25 and the derivative of tanh lies between 0 and 1. The weight updates are therefore small, and the new weight values stay very close to the old ones. We can mitigate this problem with the ReLU activation function, whose gradient is 0 for negative and zero inputs and 1 for positive inputs.

Stochastic Gradient Descent with Restarts

We reset the learning rate every so many iterations so that we may be able to pop out of a local minimum if we get stuck.

Reinforcement learning - Action

The mechanism by which the agent transitions between states of the environment. The agent chooses an action by using a policy.

Linear Regression - Calculation

For a model with one feature, the prediction is y′ = b + w1x1, where: y′ is the predicted label (the desired output); b is the bias (the y-intercept), sometimes referred to as w0; w1 is the weight of feature 1 (weight is the same concept as the "slope" m in the traditional equation of a line); and x1 is a feature (a known input). To infer (predict) the temperature y′ for a new chirps-per-minute value x1, just substitute the x1 value into this model. Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w1, w2, etc.). For example, a model that relies on three features might look as follows: y′ = b + w1x1 + w2x2 + w3x3.
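A tiny sketch of this calculation (the weights, bias, and feature values are made up):

# One-feature model: y' = b + w1 * x1
b, w1 = 2.0, 0.5
x1 = 40.0                # e.g., chirps per minute
y_pred = b + w1 * x1     # predicted label

# Three-feature model: y' = b + w1*x1 + w2*x2 + w3*x3
w = [0.5, -1.2, 0.3]
x = [40.0, 3.0, 7.0]
y_pred_3 = b + sum(wi * xi for wi, xi in zip(w, x))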

