Google Machine Learning Course

One-hot encoding

A sparse vector in which: One element is set to 1. All other elements are set to 0. One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a given botany data set chronicles 15,000 different species, each denoted with a unique string identifier. As part of feature engineering, you'll probably encode those string identifiers as one-hot vectors in which the vector has a size of 15,000.
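
A minimal sketch of one-hot encoding in plain Python, assuming a tiny hypothetical vocabulary of species identifiers (the data set in the example would have 15,000):

species = ["acer_rubrum", "quercus_alba", "pinus_strobus"]  # hypothetical identifiers
index = {name: i for i, name in enumerate(species)}

def one_hot(name):
    # Sparse vector: a single 1 at the species' index, 0 everywhere else.
    vector = [0] * len(species)
    vector[index[name]] = 1
    return vector

print(one_hot("quercus_alba"))  # [0, 1, 0]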

validation set

A subset of the data set—disjoint from the training set—that you use to adjust hyperparameters. Contrast with training set and test set.

Linear regression

A type of regression model that outputs a continuous value from a linear combination of input features.

structural risk minimization (SRM)

An algorithm that balances two goals: The desire to build the most predictive model (for example, lowest loss). The desire to keep the model as simple as possible (for example, strong regularization). For example, a function that minimizes loss+regularization on the training set is a structural risk minimization algorithm.

L2 regularization

Defines the regularization term as the sum of the squares of all the feature weights: w1^2 + w2^2 + ... + wn^2.
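
A minimal sketch of the L2 term in plain Python, using made-up weights purely for illustration:

weights = [0.2, -1.5, 0.05, 3.0]        # illustrative weights, not from any real model
l2_term = sum(w ** 2 for w in weights)  # 0.04 + 2.25 + 0.0025 + 9.0 = 11.2925
print(l2_term)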

hyperparameter

The "knobs" that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Log Loss

The loss function used in binary logistic regression.
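
A minimal sketch of Log Loss in plain Python, assuming labels are 0/1 and predictions are probabilities strictly between 0 and 1:

import math

def log_loss(labels, probabilities):
    # Average negative log-likelihood of the true labels under the predicted probabilities.
    n = len(labels)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probabilities)) / n

print(log_loss([1, 0, 1], [0.9, 0.2, 0.8]))  # roughly 0.184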

Regression

predicts continuous values. For example, regression models make predictions that answer questions like the following: What is the value of a house in California? What is the probability that a user will click on this ad?

Classification

predicts discrete values. For example, classification models make predictions that answer questions like the following: Is a given email message spam or not spam? Is this an image of a dog, a cat, or a hamster?

generalization curve

shows the loss for both the training set and validation set against the number of training iterations.

Collaborative filtering

the task of making predictions about the interests of a user based on interests of many other users. As an example, let's look at the task of movie recommendation. Suppose we have 1,000,000 users, and a list of the movies each user has watched (from a catalog of 500,000 movies). Our goal is to recommend movies to users.

Labels

the thing we're predicting—the y variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.

class-imbalanced data set

A binary classification problem in which the labels for the two classes have significantly different frequencies. For example, a disease data set in which 0.0001 of examples have positive labels and 0.9999 have negative labels is a class-imbalanced problem, but a football game predictor in which 0.51 of examples label one team winning and 0.49 label the other team winning is not a class-imbalanced problem.

embeddings

A categorical feature represented as a continuous-valued feature. Typically, an embedding is a translation of a high-dimensional vector into a low-dimensional space. For example, you can represent the words in an English sentence in either of the following two ways: As a million-element (high-dimensional) sparse vector in which all elements are integers. Each cell in the vector represents a separate English word; the value in a cell represents the number of times that word appears in a sentence. Since a single English sentence is unlikely to contain more than 50 words, nearly every cell in the vector will contain a 0. The few cells that aren't 0 will contain a low integer (usually 1) representing the number of times that word appeared in the sentence. As a several-hundred-element (low-dimensional) dense vector in which each element holds a floating-point value between 0 and 1. This is an embedding.

Weight

A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model.

scaling

A commonly used practice in feature engineering to tame a feature's range of values to match the range of other features in the data set. For example, suppose that you want all floating-point features in the data set to have a range of 0 to 1. Given a particular feature's range of 0 to 500, you could scale that feature by dividing each value by 500. See also normalization.
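
A minimal sketch of the 0-to-500 example in plain Python (the maximum value of 500 is an assumption taken from the text above):

def scale_to_unit(values, max_value=500.0):
    # Maps a feature whose natural range is 0 to max_value into the 0-to-1 range.
    return [v / max_value for v in values]

print(scale_to_unit([0.0, 125.0, 500.0]))  # [0.0, 0.25, 1.0]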

ROC (receiver operating characteristic) Curve

A curve of true positive rate vs. false positive rate at different classification thresholds. See also AUC.

synthetic feature

A feature not present among the input features, but created from one or more of them. Kinds of synthetic features include the following: Bucketing a continuous feature into range bins. Multiplying (or dividing) one feature value by other feature value(s) or by itself. Creating a feature cross. Features created by normalizing or scaling alone are not considered synthetic features.

discrete feature

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature. Contrast with continuous feature.

dropout regularization

A form of regularization useful in training neural networks. Dropout regularization works by removing a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

step

A forward and backward evaluation of one batch.

activation function

A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.

sigmoid function

A function that maps logistic or multinomial regression output (log odds) to a probability between 0 and 1. The sigmoid function has the following formula: y = 1 / (1 + e^(-z)), where z in logistic regression problems is simply the log odds: z = log(y / (1 - y)). In other words, the sigmoid function converts z into a probability between 0 and 1. In some neural networks, the sigmoid function acts as the activation function.
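
A minimal sketch of the sigmoid in plain Python:

import math

def sigmoid(z):
    # Maps the log odds z to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))  # 0.5
print(sigmoid(4.0))  # roughly 0.982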

softmax

A function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a dog is 0.9, a cat is 0.08, and a horse is 0.02. (Also called full softmax.)
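
A minimal sketch of softmax in plain Python; subtracting the maximum logit is a common numerical-stability detail, not part of the definition:

import math

def softmax(logits):
    m = max(logits)                           # for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, -1.0, 0.5])     # illustrative dog/cat/horse logits
print(probabilities, sum(probabilities))      # the probabilities add up to 1.0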

loss

A measure of how far a model's predictions are from its label. Or, to phrase it more pessimistically, a measure of how bad the model is. To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use Log Loss.

early stopping

A method for regularization that involves ending model training before training loss finishes decreasing. In early stopping, you end model training when the loss on a validation data set starts to increase, that is, when generalization performance worsens.

logistic regression

A model that generates a probability for each possible discrete label value in classification problems by applying a sigmoid function to a linear prediction. Although logistic regression is often used in binary classification problems, it can also be used in multi-class classification problems (where it is called multi-class logistic regression or multinomial regression).

neural network

A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities.

prediction

A model's output when provided with an input example.

neuron

A node in a neural network, typically taking in multiple input values and generating one output value. The neuron calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of input values.

calibration layer

A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.

machine learning

A program or system that builds (trains) a predictive model from input data. The system uses the learned model to make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model. Machine learning also refers to the field of study concerned with these programs or systems.

stationarity

A property of data in a data set, in which the data distribution stays constant across one or more dimensions. Most commonly, that dimension is time, meaning that data exhibiting stationarity doesn't change over time. For example, data that exhibits stationarity doesn't change from September to December.

sparse representation

A representation of a tensor that only stores nonzero elements. For example, the English language consists of about a million words. Consider two ways to represent a count of the words used in one English sentence: A dense representation of this sentence must set an integer for all one million cells, placing a 0 in most of them, and a low integer into a few of them. A sparse representation of this sentence stores only those cells symbolizing a word actually in the sentence. So, if the sentence contained only 20 unique words, then the sparse representation for the sentence would store an integer in only 20 cells.
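
A minimal sketch contrasting the dense and sparse representations in plain Python, with a tiny hypothetical vocabulary standing in for the roughly one-million-word one:

vocabulary = {"the": 0, "dog": 1, "sat": 2, "cat": 3, "mat": 4}  # hypothetical word indices
sentence = "the cat sat".split()

# Dense representation: one cell per vocabulary word, mostly zeros.
dense = [0] * len(vocabulary)
for word in sentence:
    dense[vocabulary[word]] += 1
print(dense)   # [1, 0, 1, 1, 0]

# Sparse representation: store only the nonzero cells (index -> count).
sparse = {i: count for i, count in enumerate(dense) if count != 0}
print(sparse)  # {0: 1, 2: 1, 3: 1}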

Learning Rate

A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.

Regularization rate

A scalar value, represented as lambda, specifying the relative importance of the regularization function. The following simplified loss equation shows the regularization rate's influence: minimize(loss function + lambda(regularization function)). Raising the regularization rate reduces overfitting but may make the model less accurate.

mini-batch

A small, randomly selected subset of the entire batch of examples run together in a single iteration of training or inference. The batch size of a mini-batch is usually between 10 and 1,000. It is much more efficient to calculate the loss on a mini-batch than on the full training data.

feature cross

A synthetic feature formed by crossing (multiplying or taking a Cartesian product of) individual features. Feature crosses help represent nonlinear relationships.
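
A minimal sketch in plain Python of two common kinds of feature cross, using illustrative feature values and bin names:

# Multiplicative cross of two numeric features produces a new synthetic feature.
def numeric_cross(x1, x2):
    return x1 * x2

print(numeric_cross(3.0, 0.5))  # 1.5

# Cartesian-product cross of two bucketized (categorical) features: the crossed
# feature is the pair of bins, which can then be one-hot encoded.
latitude_bin, rooms_bin = "lat_bin_3", "rooms_bin_2"  # hypothetical bin names
print((latitude_bin, rooms_bin))                       # ('lat_bin_3', 'rooms_bin_2')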

hidden layer

A synthetic layer in a neural network between the input layer (that is, the features) and the output layer (the prediction). A neural network contains one or more hidden layers.

gradient descent

A technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.
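
A minimal sketch of gradient descent in plain Python for a one-feature linear model with squared loss; the data, learning rate, and step count are made up for illustration:

# Made-up training data where the label is roughly 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(500):
    n = len(xs)
    # Gradients of mean squared error with respect to the weight and the bias.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Gradient step: the learning rate times the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0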

candidate sampling

A training-time optimization in which a probability is calculated for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For example, given an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog class outputs in addition to a random subset of the remaining classes (cat, lollipop, fence). The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a computational efficiency win from not computing predictions for all negatives.

binary classification

A type of classification task that outputs one of two mutually exclusive classes. For example, a machine learning model that evaluates email messages and outputs either "spam" or "not spam" is a binary classifier.

Classification model

A type of machine learning model for distinguishing among two or more discrete classes. For example, a natural language processing classification model could determine whether an input sentence was in French, Spanish, or Italian. Compare with regression model.

L1 regularization

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L2 regularization.

prediction bias

A value indicating how far apart the average of predictions is from the average of labels in the data set. Not to be confused with the bias term in machine learning models or with bias in ethics and fairness.

confusion matrix

An NxN table that summarizes how successful a classification model's predictions were; that is, the correlation between the label and the model's classification. One axis of a confusion matrix is the label that the model predicted, and the other axis is the actual label. N represents the number of classes. In a binary classification problem, N=2.

Rectified Linear Unit (ReLU)

An activation function with the following rules: If input is negative or zero, output is 0. If input is positive, output is equal to input
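
A minimal sketch of ReLU in plain Python:

def relu(x):
    # 0 for negative or zero input; the input itself otherwise.
    return max(0.0, x)

print(relu(-3.2), relu(0.0), relu(2.5))  # 0.0 0.0 2.5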

AUC (Area under the ROC Curve)

An evaluation metric that considers all possible classification thresholds. The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.

true negative (TN)

An example in which the model correctly predicted the negative class. For example, the model inferred that a particular email message was not spam, and that email message really was not spam.

true positive (TP)

An example in which the model correctly predicted the positive class. For example, the model inferred that a particular email message was spam, and that email message really was spam.

false negative (FN)

An example in which the model mistakenly predicted the negative class. For example, the model inferred that a particular email message was not spam (the negative class), but that email message actually was spam.

false positive (FP)

An example in which the model mistakenly predicted the positive class. For example, the model inferred that a particular email message was spam (the positive class), but that email message was actually not spam.

Estimator

An instance of the tf.Estimator class, which encapsulates logic that builds a TensorFlow graph and runs a TensorFlow session. You may create your own custom Estimators or instantiate premade Estimators created by others.

Bias (math)

An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine learning models. For example, bias is the b in the following formula: y' = b + w1x1 + w2x2 + ... + wnxn
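
A minimal sketch of that formula in plain Python, using illustrative weights, bias, and feature values:

def predict(features, weights, bias):
    # y' = b + w1*x1 + w2*x2 + ... + wn*xn
    return bias + sum(w * x for w, x in zip(weights, features))

print(predict([1.0, 2.0, 3.0], [0.5, -0.25, 1.0], bias=2.0))  # 5.0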

step size

Synonym for learning rate.

empirical risk minimization (ERM)

Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.

multi-class classification

Classification problems that distinguish among more than two classes. For example, there are approximately 128 species of maple trees, so a model that categorized maple tree species would be multi-class. Conversely, a model that divided emails into only two categories (spam and not spam) would be a binary classification model.

Unlabeled example

Contains features but not the label.

bucketing

Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
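
A minimal sketch of the temperature example in plain Python, returning a one-hot vector over the three bins described above:

def bucketize_temperature(t):
    # Bins from the example: 0.0-15.0, 15.1-30.0, and 30.1-50.0 degrees.
    bins = [0, 0, 0]
    if t <= 15.0:
        bins[0] = 1
    elif t <= 30.0:
        bins[1] = 1
    else:
        bins[2] = 1
    return bins

print(bucketize_temperature(12.3))  # [1, 0, 0]
print(bucketize_temperature(31.0))  # [0, 0, 1]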

overfitting

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

one-vs.-all

Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers—one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers: animal vs. not animal, vegetable vs. not vegetable, and mineral vs. not mineral.

NEVER TRAIN ON TEST DATA

If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. For example, high accuracy might indicate that test data has leaked into the training set.

graph

In TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are directed and represent passing the result of an operation (a Tensor) as an operand to another operation. Use TensorBoard to visualize a graph.

negative class

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing we're looking for and the negative class is the other possibility. For example, the negative class in a medical test might be "not tumor." The negative class in an email classifier might be "not spam." See also positive class.

positive class

In binary classification, the two possible classes are labeled as positive and negative. The positive outcome is the thing we're testing for. (Admittedly, we're simultaneously testing for both outcomes, but play along.) For example, the positive class in a medical test might be "tumor." The positive class in an email classifier might be "spam."

labeled example

Includes both feature(s) and the label.

convergence

Informally, often refers to a state reached during training in which training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve the model. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence.

Accuracy Relationships

Raising the classification threshold will probably increase precision. Raising the classification threshold will probably decrease recall. The higher recall and precision are together, the better the model.

generalization

Refers to your model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model.

L1 and L2 Relationships

Switching from L2 to L1 regularization dramatically reduces the delta between test loss and training loss. Switching from L2 to L1 regularization dampens all of the learned weights. Increasing the L1 regularization rate generally dampens the learned weights; however, if the regularization rate goes too high, the model can't converge and losses are very high.

lambda

Synonym for regularization rate.

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Mean Squared Error (MSE)

The average squared loss per example. MSE is calculated by dividing the total squared loss by the number of examples. The values that TensorFlow Playground displays for "Training loss" and "Test loss" are MSE.
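
A minimal sketch of MSE in plain Python, with made-up predictions and labels:

def mse(predictions, labels):
    # Average of the squared differences between predictions and labels.
    n = len(labels)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / n

print(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # roughly 0.167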

feature set

The group of features your machine learning model trains on. For example, postal code, property size, and property condition might comprise a simple feature set for a model that predicts housing prices.

squared loss

The loss function used in linear regression. (Also known as L2 Loss.) This function calculates the square of the difference between a model's predicted value for a labeled example and the actual value of the label. Due to squaring, this loss function amplifies the influence of bad predictions. That is, squared loss reacts more strongly to outliers than L1 loss.

batch size

The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes.

regularization

The penalty on a model's complexity. Regularization helps prevent overfitting.

backpropagation

The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.

Tensor

The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. The elements of a Tensor can hold integer, floating-point, or string values.

Feature engineering

The process of determining which features might be useful in training a model, and then converting raw data from log files and other sources into said features. In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform. Feature engineering is sometimes called feature extraction.

representation

The process of mapping data to useful features.

batch

The set of examples used in one iteration (that is, one gradient update) of model training.

test set

The subset of the data set that you use to test your model after the model has gone through initial vetting by the validation set. Contrast with training set and validation set.

training set

The subset of the data set used to train a model. Contrast with validation set and test set.

What do you use labeled examples for?

To train the model. In our spam detector example, the labeled examples would be individual emails that users have explicitly marked as "spam" or "not spam."

outliers

Values distant from most other values. In machine learning, any of the following are outliers: Weights with high absolute values. Predicted values relatively far away from the actual values. Input data whose values are more than roughly 3 standard deviations from the mean. Outliers often cause problems in model training.

Recall

What proportion of actual positives was identified correctly? Recall = TP / (TP + FN)

precision

What proportion of positive identifications was actually correct? Precision = TP / (TP + FP)
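
A minimal sketch in plain Python computing accuracy, precision, and recall from confusion-matrix counts; the counts are illustrative, not from any real classifier:

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, tn, fp, fn = 30, 50, 10, 10   # illustrative counts
print(accuracy(tp, tn, fp, fn))   # 0.8
print(precision(tp, fp))          # 0.75
print(recall(tp, fn))             # 0.75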

NaN trap

When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN. NaN is an abbreviation for "Not a Number."

Examples

a particular instance of data, x. (We put x in boldface to indicate that it is a vector.)

Features

an input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features, specified as: {x1, x2, ..., xn}

Inference

means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y'). For example, during inference, you can predict medianHouseValue for new unlabeled examples.

Training

means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.

