AI Cards

Polynomial Regression Model

A regression technique that adds powers of existing features as new features, so a linear model can fit non-linear data (much like a polynomial kernel lets an SVM handle non-linear data).

Semantic Segmentation

A computer vision task that assigns a class label to every pixel in an image. Unlike instance segmentation, it does not distinguish between different objects of the same class.

Recurrent Autoencoders

A variant of autoencoder that incorporates RNN layers. Can improve autoencoder performance on sequential data such as video.

How do you handle missing data?

Get rid of the data instance, get rid of the feature, fill in with the median, or fill in with 0 (see the sketch below).
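A minimal sketch (assuming pandas and scikit-learn) of those four options; the column names are made up for illustration.

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [22, None, 35], "income": [40000, 52000, None]})

df.dropna()                              # 1. get rid of the data instances
df.drop(columns=["income"])              # 2. get rid of the feature
df.fillna(df.median(numeric_only=True))  # 3. fill in with the median
df.fillna(0)                             # 4. fill in with 0

# SimpleImputer makes option 3 a reusable preprocessing step
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(df)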

Unsupervised Learning

Models train on data without labels and try to find patterns on their own. (Clustering, autoencoders)

Sparse Autoencoders

An autoencoder that uses a sparsity constraint to force the coding layer to learn useful features. Sparsity can be encouraged with techniques such as dropout, ReLU activations, or an added sparsity penalty.

After max pooling a 26x26 image with a 2x2 filter, how big will the output be?

13x13
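A quick Keras check of this arithmetic (assuming TensorFlow is available): a 2x2 max pool with its default stride of 2 halves each spatial dimension, so 26x26 becomes 13x13.

import tensorflow as tf

x = tf.random.normal((1, 26, 26, 3))                # batch, height, width, channels
pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(x)
print(pooled.shape)                                 # (1, 13, 13, 3)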

Ensemble Models

Models that work in a group to solve a classification or regression task.

Root node

Node at the top of a tree

Adam

Optimization algo that combines momentum and RMSProp. Generally one of the best algorithms.

AdaMax

Optimization algo that generally is more stable than Adam but is dependent on the dataset.

Learning Schedules

Techniques that modify the learning rate while training versus having a constant one.

Multicollinearity

Collinearity in which a feature exhibits a linear relationship with two or more other features.

XGBoost

A gradient boosting technique built on decision tree ensembles; an optimized, improved implementation of gradient boosting. Feature normalization is not required, and missing values can be handled automatically.

Fully Convolutional Network (FCN)

A form of CNN that does not use any dense layer for the output. I.E The YOLO architecture

t-Distributed Stochastic Neighbor Embedding (t-SNE)

A non-linear dimensionality reduction method that tries to keep similar instances close and dissimilar instances apart. It is mostly used for visualization.

Neural Network Computation

1. Takes in input at the first layer.
2. Forward pass: each layer computes its outputs and passes them to the next layer.
3. A loss function returns a measure of the error.
4. Backpropagation uses the chain rule to see how much each neuron contributed to the error.
5. Gradient descent updates the weights to reduce the error.
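A minimal sketch (assuming Keras) mirroring these five steps; the layer sizes and data are made up. compile() wires up the loss, and fit() runs the forward pass, backpropagation, and the gradient descent update.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),                          # 1. input layer
    tf.keras.layers.Dense(32, activation="relu"),        # 2. forward pass through a hidden layer
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",    # 3. loss measures the error
              optimizer="sgd")                           # 4-5. backprop + gradient descent
x = tf.random.normal((8, 4))
y = tf.random.uniform((8,), maxval=3, dtype=tf.int32)
model.fit(x, y, epochs=1, verbose=0)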

dependent variable

The label, or the answer to your prediction.

Softmax Regression Model

A logistic regression model generalized to multi-class classification.

Double DQN

A variant of Deep Q-Learning (Q-learning with neural networks) that uses two networks: the online network selects the best next action and the target network estimates that action's Q-Value, which reduces overestimation.

1D Conv RNN

An RNN that uses a convolutional layer as the input layer. With a 1-dimensional convolution, the sequence of vectors can be treated as a time series; filters then create a 1D feature map of the data.

LSTM RNN

An RNN variant whose cells learn what to store, what to forget, and what to read. This state is carried forward through the layers, so the network does not forget early input data. Almost always better than regular RNNs.

Beam Search

A search algorithm used in NLP tasks that keeps the top k candidate sequences at each step instead of only the single best one.

Random Subspace

A bagging technique that randomizes subsets of features instead of subsets of data instances.

Linear Support Vector Machines

A basic linear model used to classify things. Places a hyperplane in the best place in order to separate data points on a graph.

Logistic Regression Model

A basic linear model used to classify things. Applies a sigmoid activation to a weighted sum of the inputs to output a probability. Used for binary problems like dog/cat.

Linear Regression Model

A basic linear model used to estimate trends. Predicts continuous values such as housing prices.

K-Nearest Neighbor Model

A basic non-linear model that plots data on a graph and uses the distance between data points to tell how similar they are. K = number of data points to consider in the vicinity of a single data point. Good when using a small dataset with little noise.

Linear SVM (Support Vector Machines)

A binary/multi-class classification model. Plots a line on a graph to separate classes.

Non-linear SVM (Support Vector Machines)

A binary/multi-class classification model. Plots a line on a graph to separate classes. Can handle non-linear data by using a kernel like polynomial or RBF. Can also handle regression problems.

Stochastic Gradient Boosting

A boosting technique that uses the subsample hyperparameter to specify the fraction of training samples each tree is trained on. Speeds up training and trades higher bias for lower variance.

Voting Classifier

A classification method used by ensemble models. Each model comes to a conclusion and the answer that comes up the most is chosen.

Precision

A classification metric that tells you how many of your positive predictions were actually correct. A high-precision model makes positive predictions cautiously, so it is good when you only want the classifications you are most sure of.

BIRCH Algorithm

A clustering algorithm designed for very large datasets with a small number of features. Uses limited memory.

DBSCAN Algorithm

A clustering model that is good for odd-shaped or noisy clustering problems. Uses parameters epsilon (maximum distance between two points for them to count as neighbors) and minPts (minimum number of data points to define a cluster). Cannot make predictions on its own but can be paired with a classifier such as KNN trained on its clusters.

Gaussian Mixture Model

A clustering model that is great for elliptical clusters.

K-fold Cross Validation

A cross validation method that divides data into k sets. Each set then cycles through being part of the training set and validation set. Results get averaged. EX. Train [K1] [K2] [K4] [K5] Valid [K3]
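A minimal sketch (assuming scikit-learn) of 5-fold cross validation: each fold takes a turn as the validation set and the scores are averaged.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # one accuracy per fold, then the average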

Variance Threshold

A dimensionality reduction technique that drops features that do not vary much.

Factor Analysis

A dimensionality reduction technique that is used for linear data. Analyzes correlations among a large number of items and divides these items into smaller sets of factors or values. I.E Merging food temp and freshness into food quality.

Incremental PCA

A dimensionality reduction technique that is used for linear data. Incrementally feeds data to a PCA. For large datasets.

Linear Discriminant Analysis (LDA)

A dimensionality reduction technique that is used for linear data. Like PCA, but it projects the data onto the axes that best separate the classes rather than the axes of maximum variance.

Principal Component Analysis (PCA)

A dimensionality reduction technique that is used for linear data. Projects high-dimensional data into a lower-dimensional space along the axes (principal components) that preserve the most variance. Can also be used for compression.
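A minimal sketch (assuming scikit-learn): project the data onto the two principal components that preserve the most variance.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # 150 x 4  ->  150 x 2
print(pca.explained_variance_ratio_)    # variance preserved by each component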

Isomap

A dimensionality reduction technique that is used for non-linear data. Builds a graph connecting each instance to its nearest neighbors and reduces dimensions while preserving the distances along that graph.

Multi-dimensional Scaling

A dimensionality reduction technique that is used for non linear data. Reduces dimensionality while preserving distances between data points.

Locally Linear Embeddings (LLE)

A dimensionality reduction technique that is used for non-linear data. A manifold learning variant that measures how each instance relates linearly to its nearest neighbors and then finds a low-dimensional representation that preserves those local relationships.

Projection (Dim Reduction)

A dimensionality reduction technique that projects high dimensional data into a lower dimensional space like a flat surface.

Kernel PCA

A dimensionality reduction technique. Same as PCA but applies a kernel like RBF, poly, linear. This makes it so that it can work on non-linear data.

Uncertainty Sampling

A form of active learning where you label a small subset of data and train a model with it. The model then scores the unlabeled instances, and the instances it is least confident about are the ones a human labels first.

Manifold Learning

A form of dimensionality reduction that unrolls high-dimensional data, such as a Swiss roll, into a flat 2D representation.

Reinforcement Learning

A form of machine learning in which a software agent makes observations of its environment and takes the actions that reward it the most. Used in game AI.

Activation Function

A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer. Makes the model non-linear and acts as a threshold for neurons firing or not

Restricted Boltzmann Machine (RBM)

A generative neural network that can do classification, regression, and dimensionality reduction. Can solve many problems

CART Algorithm

A greedy tree algorithm that searches for the split with the lowest impurity at each node rather than the globally optimal tree. Models using this algorithm are not guaranteed to be optimal, but are usually decent.

Stride

A hyperparameter for CNNs that details how many pixels a receptive field moves when capturing features. Usually just 1, but can be higher if you want to reduce computation.

Time Series Model

A kind of RNN model that takes in data and uses it to predict a forecast of events.

Supervised Learning

A kind of model that uses labels to train and classify. (KNN, Lin-Reg, Log-Reg, SVM, Trees, NN).

Pooling Layer

A layer that acts as dimensionality reduction. By averaging or taking the max of pixel values in a convolutional layer's output, the feature maps become smaller. Useful for extracting dominant features as well as denoising images.

SGD Classifier

A linear classification model optimized using stochastic gradient descent. Great for handling large amounts of data as it updates its weights stochastically.

Multi-armed Bandit

A machine learning framework in which an agent has to select actions (arms) in order to maximize its cumulative reward in the long term. In each round, the agent receives some information about the current state (context), then it chooses an action based on this information and the experience gathered in previous rounds. At the end of each round, the agent receives the reward associated with the chosen action.

Silhouette Score

A method of finding the optimal number of centroids for k-means clustering. More precise than the elbow method but also more computationally expensive. Produces an exact score for each candidate value of k.

Elbow Method

A method of finding the optimal number of centroids for k-means clustering. Fits the model with a range of values for k; on a graph of inertia versus k, the elbow of the line is where the most optimal k lies.
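A minimal sketch (assuming scikit-learn) of the elbow method: fit k-means for a range of k values and look for the bend in the inertia curve.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 10)]
print(inertias)   # plot inertia vs k; the elbow suggests the best k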

Kernel

A method of using linear classifiers to solve non-linear problems. These include RBF, polynomial, sigmoid, or linear for SVM models.

Holdout Validation

A method of validation by holding out some of the training data for a validation set later on.

Kullback-Leibler (KL) divergence

A metric for machine learning. Provides a number that describes how different two probability distributions are. Used in generative models such as GANs.

Bleu Score

A metric used to evaluate NLP models, especially machine translation, by comparing generated text against reference text.

Cross Entropy or Log Loss

A model evaluation metric/loss func. Great for classification problems. Very popular loss function

Hinge Loss

A model evaluation metric/loss func. Leads to better accuracy and some sparsity at the cost of much less sensitivity regarding probabilities

Random Forest

A model that consists of multiple decision trees trained with bagging and random subspaces: each tree learns on a bootstrap sample of the instances and considers a random subset of features when splitting.

Model-based Learning

A model that learns by training on data and generalizing on it to make predictions

Instance-based Learning

A model that learns examples by heart and sees how similar they are to new examples.

Batch Learning

A model that is trained on the full dataset at once (offline) rather than incrementally; it must be retrained from scratch to learn from new data.

Online Learning

A model that trains on data incrementally as it is streamed in, or in mini-batches. Data is discarded after being trained on.

Decision Trees

A model that uses a series of nodes holding criteria to arrive at a prediction. Can be used for regression or classification.

Naïve Bayes Classifier

A classification model that is mostly used for NLP tasks. Uses Bayes' theorem to predict the probability of a class given the observed features. Weak in that it naively assumes the features are independent of one another.

Multi-layer perceptron (MLP)

A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. Consists of dense (fully connected) layers with one or more hidden layers.

Recurrent Neural Networks

A neural network used for sequential data. Uses a memory cell to hold information called a hidden state. This hidden state contains information from the previous forward pass. This information gets fed into the neuron along with the new input data.

Child Node

A node that asks a question or holds a criteria that filters out data into other nodes.

Leaf Node

A node that contains no children but a classification. The dead end of a tree.

Policy Gradients

An RL approach that optimizes the parameters of a policy by following the gradients toward higher rewards.

Markov Decision Process

A mathematical framework for RL problems with discrete states and actions. An optimal policy for an MDP maximizes rewards over time.

Variational Autoencoders

A powerful generative model that allows you to not only generate data but modify it as well. I.E Adding glasses to a face

Sampling Bias

A problem that occurs when a sample is not representative of the population from which it is drawn.

The Exploding Gradient Problem

A problem where large error gradients accumulate and result in very large updates to neural network model weights during training. This has the effect of your model being unstable and unable to learn from your training data.

Cross-validation

A process that repeatedly splits the training data into training and validation subsets to estimate how well a model generalizes.

Instance Segmentation

A computer vision task that detects multiple objects within an image and assigns a label to each pixel. Unlike semantic segmentation, each instance of the same object class is differentiated.

Object Detection

A computer vision task that detects multiple objects within an image, assigns a class to each, and regresses bounding boxes around them.

Localization

A computer vision task that locates an object within a picture. Typically the model finds the object, classifies it, and regresses a bounding box around it.

Shrinkage

A regularization technique for gradient-boosted trees that lowers the learning rate (each tree contributes less) while increasing the number of trees.

L2 Regularization (Ridge Regression)

A regularization technique that adds a regularization term (Estimates mean of data) to the cost function to keep model weights small. Good for predictive power

L1 Regularization (Lasso Regression)

A regularization technique that adds a regularization term (Estimates median of data) to the cost function to keep model weights small. Good for sparse models

Pooling Dropout

A regularization technique that applies dropout to convolutional layers after pooling.

Max-Norm

A regularization technique that is less aggressive than other kernel constraint techniques. Enforces an upper bound on the weights. Works well following a dropout layer.

Monte Carlo Dropout

A regularization technique similar to dropout, except neurons are dropped during both training and inference; predictions are averaged over multiple stochastic forward passes.

Warm Start

A regularization technique that makes a model start off training from the weights of a model that was previously trained. Transfer learning is an example of this.

Dropout Connect

A regularization technique that only drops the connections between neurons but not the neurons themselves.

Dropout

A regularization technique that randomly turns off certain neurons and their connections in a network only during training. This way, the model becomes sparse and constrained. Prevents overfitting. Mainly used in the last hidden layer.

Stratified Shuffle Split

A shuffle split method that splits categorical data in proportion to their representation.

Perceptron

A single, simple, artificial neuron. Activation(Input*weight+bias)

Nash Equilibrium

A situation in which agents interacting with one another each choose their best strategy given the strategies that all the other actors have chosen, so no agent can do better by changing its strategy alone. Used in ML game theory.

Genetic Algorithms

A policy search method used by many kinds of AI. Randomly creates, say, 100 policies and kills the worst 80; the survivors produce offspring, and the cycle continues.

Skip Connections (Residual Learning)

A technique that adds a direct connection from a layer's input to its output, so the signal skips one or more layers alongside the transformed data and the gradient is preserved. Often used in CNNs such as ResNet.

Transfer Learning

A technique used in ML and neural networks that transfers layers from one model to another for a different problem.

Intersection Over Union (IOU)

A threshold value used in object detection and classification models. Compares the predicted box with the ground-truth box (area of overlap divided by area of union) to decide whether a detection counts as a TP or an FP.

Extra Trees

A tree model that is similar to random forests but instead of calculating for an optimal split, it randomly chooses a split point on the data.

Character RNN

A type of NLP RNN designed to predict the next character in a sentence.

Stateful RNN

A type of NLP RNN that can learn longer patterns by preserving the hidden state. Only used when the whole sequence plays a part in forming the output. I.E Stock price prediction

Stateless RNN

A type of NLP RNN that learns on random portions of text and so the hidden state is thrown out after each iteration of training. I.E Sentence prediction

Soft Margin SVM

A type of SVM that is more forgiving: the margin allows for some violations, such as outliers inside the margin or on the wrong side of the hyperplane.

Hard Margin SVM

A type of SVM that strictly sets a hyperplane. Works if there are no outliers.

GAN (Generative Adversarial Network)

A type of neural network that consists of two neural networks: a generator that tries to generate data that looks similar to the training data, and a discriminator that tries to tell real data from fake data.

Autoencoder

A type of neural network that learns to copy its inputs to its outputs, so it can be trained without labels. Can be incorporated into another network when little data is labeled, and can extract features from data for use in another model.

Translation Encoder-Decoder

A type of neural network that translates languages by using an RNN encoder to take input and an RNN decoder that takes the label and output of the encoder to output the answer

Non-linear Regression Model

A type of polynomial regression that is used when data shows a curvy trend. Like predicting population.

Peephole LSTM cell

A variant of LSTM that gives cells a bit more context by letting them peek at the long-term state as well.

GRU LSTM cell

A variant of LSTM that is simpler but works just as well.

Convolutional Autoencoder

A variant of autoencoder built with convolutional layers. Can improve autoencoder tasks on images (image coloring, denoising). Used for dimensionality reduction or unsupervised pretraining.

Attention Architecture

A variant of sequence-to-sequence RNNs that produces great results. Pays attention to certain words in an input set that may be more important than others.

Confusion Matrix

A visualization tool used to measure accuracy of model on each class.
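A minimal sketch (assuming scikit-learn) with made-up labels: rows are the true classes, columns are the predicted classes.

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 2]]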

Grid Search

A way to try every combination of specified hyperparameter values for a model and keep the best-performing combination.

Greedy Search

A way that Sequence RNNs pick what the next character will be in a task. Done through predicting and then picking the letter with the highest score.

Beam Search Decoder

A way that sequence RNNs pick what the next character will be in a task. Predicts multiple high-scoring letters and combines each with candidates for the following letter, keeping a short list of the best partial sequences to choose from. Gets better results than greedy search.

Data Augmentation

A way to increase the accuracy of a convolutional network. Augments images in all types of ways so that the network is forced to learn the consistent features that are most important regardless of augmentation.

RNN Window

A way to process time series data. Cuts data into windows to be fed into an RNN. Must be flattened into a tensor before feeding however. I.E {{...},{...}} -> {[...],[...]}

ReLU Function

Activation function that is fast to compute and produces sparse activations, so it has become the default. Most importantly, the fact that it does not have a maximum output value helps reduce vanishing gradient problems. Should only be used in hidden layers.

Tanh Function

Activation function that is S-shaped, continuous, and differentiable, with output ranging from -1 to 1. Susceptible to the vanishing gradient problem but helps prevent gradients from exploding. Used in RNNs; better than sigmoid on average as it allows for larger weight changes.

Sigmoid Function

Activation function that is an S-shaped mathematical curve, best used in the last layer of a binary classifier. Widely used function that squeezes values between 0 and 1. Susceptible to the vanishing gradient problem.

Residuals (error)

Actual_value - Predicted_value Errors that are left over after being run through a model.

Batch Gradient Descent

All data is taken into consideration within a single step. Can be very slow, is intractable for datasets that don't fit in memory, and doesn't allow us to update our model online, i.e. with new examples on the fly.

Callbacks (Keras)

Allows you to fine tune and get a metric of your model. Can pass in arguments like checkpoints, early stopping, or graphs. I.E model.fit(dataset, epochs=10, callbacks=my_callbacks)
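A minimal sketch (assuming Keras) of the my_callbacks list used above, with early stopping and checkpointing; the file name is made up.

import tensorflow as tf

my_callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]
# model.fit(dataset, validation_data=val_dataset, epochs=10, callbacks=my_callbacks)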

Transformer Architecture

An attention architecture variant that is one of the most advanced neural networks in the world (Used by GPT-3).

Q-Learning

An RL algorithm that works by watching an agent play (e.g., randomly) and gradually improving its estimates of the Q-Values. Once it has accurate Q-Value estimates (or close enough), then the optimal policy is choosing the action that has the highest Q-Value (i.e., the greedy policy).

SWISH

An activation function made by Google Brain. Can replace relu function as it performs better at times

MISH

An activation function that may be the best around for CV tasks.

Isolation Forest

An anomaly detection algorithm. Each tree randomly picks a feature and a split value, splitting until instances are isolated from one another. Anomalies tend to be isolated in fewer splits.

Random Forests

An ensemble learning model that uses multiple trees trained with bagging and random subspaces. Can also be used to determine feature importance.

Stacking

An ensemble method of learning that pipes multiple model results into a merging module or blender for a final prediction.

One-class SVM

An unsupervised SVM model that can be used for outlier detection. Learns the boundaries of a single dataset and identifies what is outside those boundaries.

Temporal Difference Learning

An approach to RL for when the agent has only partial knowledge of the environment; it updates its value estimates from the transitions it actually experiences.

Label Propagation

A semi-supervised learning technique that assigns labels to unlabeled data based on the clusters that labeled data points fall into.

Mini-Batch K-Means

An unsupervised learning cluster technique. A version of k-means cluster that does not use the full dataset at each iteration but in batches to slowly move the centroid.

Accelerated K-Means

An unsupervised learning cluster technique. A version of k-means cluster that uses triangle inequality to speed up the calculation.

K-Means Clustering

An unsupervised learning cluster technique. The model identifies k centroids and allocates every data point to the nearest one, keeping the clusters as compact as possible. Weaknesses are that you must run it several times to find a good solution and that it struggles with odd-shaped clusters.
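A minimal sketch (assuming scikit-learn): fit k centroids on toy blob data and assign points to their nearest cluster.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the learned centroids
print(kmeans.predict(X[:5]))     # cluster assignments for new points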

Random Patches

Bagging technique that samples both random subsets of training instances and random subsets of features.

Dying ReLu Problem

Because ReLU outputs zero for all negative inputs, some neurons can end up always outputting zero and stop learning, effectively killing parts of the network. Especially common with a high learning rate.

Gradient Boosting

Boosting technique where the next model fits on the residual errors of the last model.

ADA Boosting

Boosting technique where the next model in a sequence pays more attention to the instances the previous model underfitted.

F1 Score

Combines precision and recall into a single measure (their harmonic mean). A good summary metric when both matter.

Feature Extraction

Combining existing features to create new ones. Also a form of dimensionality reduction.

Data Standardization

Data preprocessing method that rescales values to have zero mean and unit variance.

One-hot encoding

Data vectorization technique that turns categorical data into numbers. A data point with categories becomes a vector of 0's with 1's at the indexes where an attribute applies. Can be problematic for large datasets with many categories, since the resulting matrices are mostly zeros.
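A minimal sketch (assuming pandas) with a made-up column: each category becomes its own indicator column holding 1 (or True) where the attribute applies and 0 elsewhere.

import pandas as pd

df = pd.DataFrame({"animal": ["rat", "dog", "cat", "dog"]})
print(pd.get_dummies(df["animal"]))   # one indicator column per category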

Ordinal Encoding

Data vectorization technique that turns categorical data into numbers. Each category value is assigned its own number. Good for when data attributes are on a scale (easy: 1, medium: 2, hard: 3). Weakness is that the numeric spacing implies relationships between categories that may not exist.

Nominal Encoding

Data vectorization technique that turns categorical data into numbers. Each category value is assigned its own number, intended for categories that are not correlated (rat: 1, dog: 2, cat: 3). Weakness is that the numbering implies an order (such as dog > cat) that does not exist.

UMAP

Dimensionality reduction technique. Often one of the best-performing options for most data.

Random Projections

Dimensionality reduction technique that conducts a random linear projection. Gives surprisingly good results.

Stemming

Extracting the root word from a word. I.E adjustable -> adjust

independent variable

Features that lead you to a prediction (Data)

Feature Scaling

Form of data preprocessing where values across all attributes are scaled to a range such as 0 to 1 or -1 to 1, e.g. with a scaler.

Deep Learning

Form of machine learning that uses deep neural networks. Can do classification and regression tasks.

Loss Function

Function in ML that tells a model how bad it is doing when classifying or conducting regression. Examples include cross entropy, MSE, MAE

Style GAN

GANs that not only improve image generation but introduce noise layers. These layers add a source of randomness to images which allow you to control the style of what image is being generated.

Gradient Descent (Linear Regression)

GD optimization algorithm on a linear regression model. Good for large datasets.

Kernel/Filter

Generally a small matrix that slides over an input image; it is randomly initialized and trained to represent a feature. These filters create feature maps from input images.

RMSPROP

Gradient-based optimization algo similar to AdaGrad, but it introduces an additional decay term to counteract AdaGrad's rapid decrease in learning rate, which generally makes it better. Takes smaller weight steps.

PR Curve

Graph metric that shows the tradeoff between precision and recall. Higher recall, lower precision.

ROC Curve

Graph metric used for binary classification. Shows the tradeoff of a high true positive rate with the false positive rate.

Lemmatization

Reducing words to their base dictionary form (lemma), so that inflected forms are grouped together.

Agglomerative/Hierarchical Clustering

Hierarchical clustering procedure where each object starts out in a separate cluster; clusters are formed by grouping objects into bigger and bigger clusters depending on distance from each other. Good for data analysis

Learning Rate

How much the "ball" moves for gradient descent. How large the changes in weights are after a cycle. Can be scheduled or adaptive.

Libraries for Hyperparameter Optimization

Hyperopt, Hyperas, Kopt, Talos, Keras Tuner, Skopt, Spearmint, hyperband, Sklearn-Deap

Bias-variance tradeoff

Ideally, you want low variance and low bias, but this is not easy to achieve. A decrease in one will lead to an increase in the other. This is the tradeoff.

Tying weights

Improves the performance of language models by tying (sharing) the weights of the embedding and softmax layers. This method also massively reduces the total number of parameters in the language models that it is applied to.

Stratified K-fold Validation

K-fold validation but uses stratified sampling to keep representativeness

Bottleneck Layers

Layers in a neural network that are very narrow and sparse. The intent is to make the network focus on the important, information-rich aspects of the data. Similar in spirit to dropout.

Piecewise Scheduling

Learning rate schedule that lets you specify multiple constant learning rates for certain ranges of epochs.

Step Decay

Learning rate schedule that drops learning rate every few epochs.

Exponential Scheduling/Decay

Learning rate schedule that drops learning rates at an exponential rate.

Performance Scheduling

Learning rate schedule that drops the learning rate whenever the error stops improving.

Cyclic Scheduling

Learning rate schedule that increases the learning rate linearly and then decreases it halfway through training. The intuition is that this helps the model avoid getting stuck in a local minimum.

Decision Tree

Machine learning model that splits data attributes into a configuration that results in nodes branching off. These nodes ask questions and hold criteria that filter data into specific decisions.

Masking layers

Masking is a way to tell sequence-processing layers that certain timesteps in an input are missing, and thus should be skipped when processing the data. Padding is a special form of masking where the masked steps are at the start or the end of a sequence.

Chi-square

Method used for decision tree splitting. A statistical test used to compare observed results with expected results, determining whether a difference between observed and expected data is due to chance or to a relationship between the variables being studied.

Gini Index/impurity

Method used for decision tree splitting. A measure of the probability that a feature in a node pertains to one class or another. Slightly faster to compute than entropy.

Information Gain (Entropy)

Method used for decision tree splitting. A measurement of how optimal a split is. Slightly slower than Gini impurity. The lower the entropy, the higher the purity.

Reduction in Variance

Method used for decision tree splitting when solving a regression problem.

mean Average Precision (mAP)

Metric used by object detection but can be used for all models and is as good if not better than F1. Is like precision and recall scores except keeps in mind that we not only want the highest recall, but we also want a value that will offer the highest precision and vice versa. I.E: recall = 90 precision = 10 < recall = 89 precision = 20

Semisupervised Learning

Model that is trained on some labeled data and non-labeled data. Like google detecting faces and you tagging those faces.

Epoch

One forward pass and one backward pass of all the training examples

Nadam

Optimization algo that combines Adam and Nesterov. Converges slightly faster than Adam. Generally outperforms Adam but depends on dataset.

AdaGrad

Optimization algo that keeps track of the squared gradients over time and automatically adapts the learning rate parameter. It can be used instead of vanilla SGD and is particularly helpful for sparse data. Optimization slows down rather quickly however.

Momentum Optimization

Optimization hyperparameter for gradient descent that takes into account how steep an optimization was and so moves the ball further to find minima. Good for noisy data

Normal Equation

Optimization method used in linear regression problems. Not suited for large datasets.
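A minimal sketch (assuming NumPy) of the closed-form solution theta = (X^T X)^(-1) X^T y, with made-up data that follows y = 2x exactly.

import numpy as np

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
X_b = np.c_[np.ones((3, 1)), X]                  # add the bias column
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta)                                     # ~[0, 2]: intercept and slope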

Specificity

Or the true negative rate. A classification metric that tells you how many of the instances labeled negative were correctly predicted. Useful when you want no false positives.

Recall or Sensitivity

Or true positive rate. A classification metric that tells you how many were correctly predicted, out of your data that is labeled positive. A high recall means the model is more bent on making positive predictions even if it is not correct. Good for when you want to be as expansive as possible like detecting cancer.

Auxiliary Output

Output done at certain layers to see how effective a layer was, before going on to the next layer

Max Depth or split criteria

Parameter for decision trees. Controls how many nodes deep your tree can go with a feature.

Max Features

Parameter for decision trees. Number of features to consider when splitting a node.

Min Sample Split

Parameter for decision trees. The minimum number of samples a node must contain to warrant a split.

Kernel Initializer

Parameter for neural networks that specifies an initialization algorithm for the weights. Can be random or a specific mathematical distribution.

Softmax Function

Popular activation function used mainly for multi-class tasks. Usually used last in a layer for exclusive classes meaning one label per data point.

Word Embeddings

Preprocessing technique for text or categories. Maps words or categories to dense vectors in a continuous vector space, capturing semantic relationships between words. Can be trained or pre-trained.

The Vanishing Gradient Problem

Problem in DNNs and RNNs that use activation functions whose gradients tend to be small (in the range of 0 to 1). Because these small gradients are multiplied during backpropagation, they tend to "vanish" throughout the layers, preventing the network from learning. One way to counter this problem is to use activation functions like ReLU that do not suffer from small gradients.

Bidirectional RNN

RNN architecture that involves duplicating the first recurrent layer in the network so that there are now two layers side-by-side, then providing the input sequence as-is as input to the first layer and providing a reversed copy of the input sequence to the second.

Sequence to Sequence models

RNN encoder/decoder models trained to convert sequences from one domain to sequences in another domain (e.g. translation, chatbots, captioning).

RMSE (root mean squared error)

Regression model evaluation metric. Punishes large errors while reporting the error in the same units as the data. Best used when large outliers are rare. Tells you how concentrated the data is around the line of best fit.

MAE (mean absolute error)

Regression model evaluation metric/loss func. Measures the average absolute distance between the real data and the predicted data without squaring, so it does not heavily punish large errors. A good choice when the data contains many outliers.

MSE (mean squared error)

Regression model evaluation metric/loss func. Measures the average squared distance between the real data and the predicted data, so larger errors are penalized more heavily (unlike MAE). The disadvantage is that the error is reported in squared units, making it harder to interpret.

Elastic Net

Regularization technique that acts as a middle ground between Lasso and Ridge regression. Adds both regularization terms to the cost function to keep model weights small.

Ridge Regression

Regularization technique that adds a regularization term to the cost function to keep model weights small.

Lasso Regression

Regularization technique that adds a regularization term to the cost function to keep model weights small. Similar to Ridge regression.

Dropblock

Regularization technique that is like dropout but for CNN layers. Works very well.

Dimensionality Reduction

Removing unneeded features or transforming the dimensions of your data to be lower. Thus taking up less space and compute.

Feature Selection

Selecting the most useful features among the existing features.

Radial Basis Functions (RBF)

Similarity functions used in non-linear SVM kernels. The most popular kernel.

feature maps

Spatial representations of visual features, one map per filter. Generated by convolving filters over an image, then passed through an activation function to determine whether a feature is present in an area.

How to split Sequential data?

Split this kind of data across time for RNNs. For example, training data would contain 2001-2012, validation would have 2013-2015, and test data would have 2016-2021.

Early Stopping

Stopping right before a model starts to overfit.

Correlation Coefficient

Technique that shows how correlated certain features are with each other or with the label. Values are standardized.

Boosting

Technique that takes several weak models and trains them sequentially, each one trying to improve on the previous model.

OVR (One-Versus-the-Rest)

Technique used by binary classifiers to do multi-classification tasks. Each class goes up against the rest of the classes. Class with most wins is chosen.

OVO (One-Versus-One)

Technique used by binary classifiers to do multi-classification tasks. Every binary classification problem is done and the one that wins the most is chosen.

Pasting

Technique used for ensemble decision trees. Data is divided into random subsets that use up the entire pool of data so there are no duplicates or unused data points. Good for large datasets or real life predictions.

Bagging

Technique used for ensemble decision trees. Used to increase bias but lower variance. Data is divided into random subsets that can include duplicates and does not use all data points. Each tree gets their own subsets to train on. Good for robust training.

SMOTE

Technique (Synthetic Minority Over-sampling) used in machine learning to fight class imbalance by generating synthetic examples of the minority class.

Similarity Functions

Technique used on non-linear SVM classification problems. Adds features based on how much each instance resembles a particular landmark in a data graph.

Unsupervised Pretraining

Technique used to pretrain layers on unlabeled data by using an autoencoder to learn the data and then once finished, use those layers to train a labeled network.

Auxiliary Task Pretraining

Technique used to pretrain models by training a model on a different yet similar/easier task and then reusing that model's frozen lower layers for the main task.

Regularization

Techniques used in an attempt to solve the overfitting problem in models. These can include penalties on large weights, dropout, or making the model smaller.

TFRecord Format

Tensorflow's preferred format for storing large amounts of data and reading it efficiently.

Variance

The error from a model's overly complex assumptions. High variance leads to overfitting.

Bias

The error from a model's overly simplistic assumptions. High bias causes a model to underfit, as the fitted function is too simple (straighter).

Policy (RL)

The algorithm an RL agent uses to determine its actions. Can be almost any kind of algorithm.

Active learning

The process of prioritizing which data to label so that labeling has the highest impact on training a supervised model.

Batch Size

The number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.

Centroid

The real or imaginary location representing the center of a cluster. Used in K-means cluster.

Image Segmentation

The task of partitioning an image into multiple segments. Can be done with K-means clusters or CNNs.

Vanishing Gradient Problem

The vanishing gradient problem arises in very deep neural networks, typically recurrent neural networks, that use activation functions whose gradients tend to be small (in the range of 0 to 1). Because these small gradients are multiplied during backpropagation, they tend to "vanish" throughout the layers, preventing the network from learning long-range dependencies.

Time Step

Time increments of a sample. For example, a sample can contain 128-time steps, where each time steps could be a 30th of a second for signal processing.

Convolutional Layer

Uses filters to detect patterns within a pixel value image.

Progressive Growing GAN

Traditional GANs are limited to small/low-resolution datasets because the generator must learn both large-scale structure and fine detail at the same time. Progressive GANs slowly add layers during training to handle larger images.

Multi-Encoder Training

Training multiple autoencoder layers in separate instances and then combining all of them together.

Neural Networks

Uses neurons that compute "activation(x*w+b)" to arrive at an output.

Convolutional Neural Networks

Type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex.

Wide and Deep model

Type of neural network that has two inputs. One input goes directly to the output layer while the other passes through multiple hidden layers. The intuition is that untransformed data coupled with transformed data can give better results.

Clustering

Unsupervised learning model that clusters data in groups where they are most similar. Can also be used for dimensionality reduction.

Multi-output Neural Network

Use cases: (1) performing regression and classification at once, (2) multiple tasks on the same dataset, (3) auxiliary outputs.

Freezing a Layer

Used during transfer learning to train just the top-layer weights of a model while freezing the weights of the lower layers. The more data you have, the more layers you can unfreeze. Lowering the learning rate also helps.
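A minimal sketch (assuming Keras) of freezing a pretrained base; MobileNetV2 stands in for whatever pretrained model is being reused, and the 10-class head is made up.

import tensorflow as tf

base_model = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg")
base_model.trainable = False                             # freeze the pretrained lower layers

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Dense(10, activation="softmax"),     # new top layer to train
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),  # a lower learning rate helps
              loss="sparse_categorical_crossentropy")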

Mini-Batch Gradient Descent

Used for large datasets. Divides up data into batches and takes these mini batches into account when calculating gradients.

Stochastic Gradient Descent

Used for large datasets. Takes a single data point into consideration when calculating and adjusting gradients. Converges faster, but the end result is not as accurate.

Prioritized Experience Replay

Used in RL to replay important experiences (those the agent can learn the most from) more often, resulting in better models.

Positional Embeddings

Used in attention models to encode the position of each word in a sequence, so the model can make use of word order when learning semantic relationships.

Standard Scaler

Used in the sklearn library to standardize features to zero mean and unit variance.
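A minimal sketch (assuming scikit-learn) showing what StandardScaler does: each feature ends up with zero mean and unit variance.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(), X_scaled.std())   # ~0.0 and ~1.0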

Gradient Clipping

Used to combat exploding gradient issues, especially in RNNs, where batch normalization does not work well. Sets a threshold for gradient size and rescales gradients should they get too big.
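A minimal sketch (assuming Keras): clipnorm rescales the whole gradient if its norm exceeds the threshold, while clipvalue would cap each component instead.

import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
# model.compile(optimizer=optimizer, loss="mse")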

Randomized Leaky ReLu

Used to combat gradient issues and overfitting. Same as Leaky ReLU, but the leak threshold is randomized during training and averaged to an optimal value.

Exponential Linear Unit

Used to combat gradient issues but can explode. ReLu variant that is slower to compute but has a fast convergence rate and outperforms many ReLu variations.

Batch Normalization

Used to combat gradient issues. A layer used before or after every activation layer to normalize its inputs, learning the optimal scale and offset during training. Speeds up training, though the input data should still be normalized.

Leaky ReLu

Used to combat gradient issues. A variation of ReLu that tries to counter the dying ReLu problem. It prevents neurons from being stuck at zero by reactivating them even if they die for a bit during training.

Parametric ReLu

Used to combat gradient issues. A Leaky ReLU whose leak parameter is learned during training rather than fixed. Works well on large datasets prone to underfitting, but can overfit small ones.

Padding

Used to distinguish the border on an input image. Also allows for deeper networks so that images do not reduce too quickly when convolving.

point-biserial correlation

Used when you want to extract a correlation coefficient with categorical data that is binary

ANOVA (Analysis of Variance)

Used when you want to extract a correlation coefficient with categorical data with more than two categories

Out-Of-Bag

Validation method where you evaluate each predictor on the leftover training instances it never saw during bagging (since bagging samples with replacement, some instances are left out).

Nesterov Accelerated Gradient

Variant of momentum optimization that is almost always faster than vanilla momentum op.

Scaled ELU

Version of ELU that is great for deep neural networks as it is immune to vanishing/exploding gradients. Must be using sequential dense layers to use. Must also use LeCun initializer.

Hard Voting

Voting classifier method in which the class predicted most often (the majority vote) is chosen.

Soft Voting

Voting classifier method in which the largest sum of probabilities for a class is chosen as the answer.

Optimizer

Ways to optimize a model like SGD, Adam, etc

Tree Splits

What a decision tree does when creating itself. Splits itself into a tree that creates the best questions and criteria based on algorithm used.

Out of Vocabulary (OOV)

Words that are not in the training set but appear in the test set or real-world data. The embedding layer may not be able to resolve these, but you can assign a special value to unknown words so the model can handle them when it encounters them in the future.

Layer-wise Learning Rate Decay (LLRD)

A method that applies higher learning rates to top layers and lower learning rates to bottom layers. This is accomplished by setting the learning rate of the top layer and using a multiplicative decay rate to decrease the learning rate layer by layer from top to bottom.

MCC ( Matthews correlation coefficient )

generally considered one of the best measurements of performance for a classification model. This is largely because, unlike any of the previously mentioned metrics, it takes all possible prediction outcomes into account. If there are imbalances in the classes this will therefore be accounted for.

Collinearity

When two features are linearly associated (highly correlated) and both are used as predictors for the target.

