DL4G
Supervised Learning
- The algorithm is given labeled training data
- The algorithm learns to predict the labels of yet unseen examples
A form of deep learning where an output label exists for every input example. The labels are used to compare the output of a DNN to the ground-truth values and to minimize the cost function.
Unsupervised Learning
- The algorithm is given unlabeled data
- The algorithm detects and exploits the inherent structure of the data
A form of machine learning where the output class is not known. GANs or Variational Autoencoders are used in unsupervised deep learning tasks.
Concepts for Reinforcement Learning
A -> Action at time t
S -> State at time t
R -> Reward at time t
Pi -> Policy (decision-making rule)
Pi(A | S) -> Probability of taking action A in state S
G -> Return (cumulative discounted reward)
v(s) -> Value (expected return) in state s
Activation Function
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes on an output value. Activation functions should be non-linear. ReLU is primarily used as the activation function for hidden layers.
Softplus
Activation Function. f(x) = log(1+e^x)
Evaluate NNs
Evaluating a NN means calculating the best class or the class probabilities for a given input.
Recurrent Neural Networks
RNNs allow the neural network to understand the context in speech, text or music. The RNN allows information to loop through the network, thus persisting important features of the input between earlier and later layers.
Human-Level Performance
The best possible performance of a group of human experts. Algorithms can exceed human-level performance. A valuable metric against which to compare and improve a neural network.
Convolutional Neural Network
The convolutional filter scans through the layer. The same weights are shared across every position the filter visits.
Teaching NNs
w = weights, b = bias. Find w and b so that the network calculates the correct classes
Vector
A combination of values that are passed as inputs into an activation layer of a DNN.
Definition of ML (Tom M. Mitchell)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Learning Rate Decay
A concept to adjust the learning rate during training. Allows for flexible learning rate adjustments. In deep learning, the learning rate typically decays the longer the network is trained.
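A minimal sketch of one common schedule (inverse time decay); the names initial_lr and decay_rate are illustrative:

    # Inverse time decay: the learning rate shrinks as training progresses.
    def decayed_lr(initial_lr, decay_rate, epoch):
        return initial_lr / (1.0 + decay_rate * epoch)

    # Example: the rate falls from 0.1 towards zero over the epochs.
    for epoch in range(5):
        print(epoch, decayed_lr(0.1, 0.5, epoch))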
Forward Propagation
A forward pass in deep neural networks. The input travels through the activation functions of the hidden layers until it produces a result at the end. Forward propagation is also used to predict the result of an input example after the weights have been properly trained.
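A minimal NumPy sketch of a forward pass through one hidden layer; the layer sizes and random weights are illustrative:

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def forward(x, W1, b1, W2, b2):
        h = relu(x @ W1 + b1)   # hidden layer: affine transform + activation
        return h @ W2 + b2      # output layer: class scores (logits)

    x = np.random.randn(4)                      # one example with 4 features
    W1, b1 = np.random.randn(4, 8), np.zeros(8)
    W2, b2 = np.random.randn(8, 3), np.zeros(3)
    print(forward(x, W1, b1, W2, b2))           # 3 class scores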
Fully-Connected Layer
A fully-connected layer transforms an input with its weights and passes the result to the following layer. This layer has access to all inputs or activations from the previous layer.
Momentum
A gradient descent optimization algorithm to smooth the oscillations of stochastic gradient descent methods. Momentum calculates the average direction of the previously taken steps and adjusts the parameter update in this direction. Imagine a ball rolling downhill and using this momentum when adjusting to roll left or right. The ball rolling downhill is an analogy for gradient descent finding the local minimum.
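A minimal sketch of the classic momentum update on a toy function; the coefficient 0.9 is a common default, and grad stands in for the real gradient computation:

    # v accumulates a decaying average of past gradients, smoothing
    # the oscillations of plain stochastic gradient descent.
    grad = lambda w: 2 * w        # gradient of f(w) = w^2 (illustrative)
    w, v = 5.0, 0.0
    lr, beta = 0.1, 0.9
    for _ in range(100):
        v = beta * v + grad(w)
        w = w - lr * v
    print(w)                      # near the minimum at 0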
Neural Network
A machine learning model which transforms inputs. A vanilla neural network has an input, hidden, and output layer. Neural Networks have become the tool of choice for finding complex patterns in data.
Convolution
A mathematical operation which multiplies an input with a filter. Convolutions are the foundation of Convolutional Neural Networks, which excel at identifying edges and objects in images.
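A minimal NumPy sketch of a 2-D convolution (stride 1, no padding); like most deep learning libraries, it actually computes cross-correlation:

    import numpy as np

    def conv2d(image, kernel):
        # Slide the kernel over the image and take the elementwise
        # product-and-sum at every position.
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
        return out

    edge_filter = np.array([[1, 0, -1]] * 3)   # simple vertical edge detector
    print(conv2d(np.random.rand(5, 5), edge_filter).shape)  # (3, 3)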
AlexNet
A popular CNN architecture with eight layers. It is a more extensive network architecture than LeNet and takes longer to train. AlexNet won the 2012 ImageNet image classification challenge.
VGG-16
A popular network architecture for CNNs. It simplifies the architecture of AlexNet and has a total of 16 layers. There are many pretrained VGG models which can be applied to novel use cases through transfer learning.
Dropout
A regularization technique which randomly eliminates nodes and their connections in deep neural networks. Dropout reduces overfitting and enables faster training of deep neural networks. In each parameter update cycle, different nodes are dropped during training. This forces neighboring nodes to stop relying on each other too much and to figure out the correct representation themselves. It also improves the performance of certain classification tasks.
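A minimal sketch of inverted dropout, the variant most libraries implement; keep_prob is an illustrative name:

    import numpy as np

    def dropout(activations, keep_prob=0.8, training=True):
        if not training:
            return activations  # no nodes are dropped at test time
        # Randomly zero out nodes and scale up the survivors so the
        # expected activation stays the same (inverted dropout).
        mask = np.random.rand(*activations.shape) < keep_prob
        return activations * mask / keep_prob

    print(dropout(np.ones(10)))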
Layer
A set of activation functions which transform the input. Neural networks use multiple hidden layers to create output. Distinguish between input, hidden and output layers.
Long Short-Term Memory
A special form of RNN which is able to learn the context of an input. While regular RNNs suffer from vanishing gradients when corresponding inputs are located far away from each other, LSTMs can learn these long-term dependencies.
Strong Solution for Finite Sequential Games
A strong solution provides an algorithm that can produce perfect moves from any position, even if mistakes have already been made by any player
Transfer Learning
A technique to use the parameters from one neural network for a different task without retraining the entire network. Take the weights of a previously trained network and remove the output layer. Replace the last layer with your own softmax or logistic layer and train the network again. This works because lower layers often detect generic features like edges, which are useful for other image classification tasks.
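A minimal sketch of this recipe, assuming TensorFlow/Keras, a pretrained VGG16, and 10 target classes (all illustrative choices):

    import tensorflow as tf

    # Load a pretrained network and remove its output layer.
    base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                       pooling="avg", input_shape=(224, 224, 3))
    base.trainable = False  # freeze the lower layers (edge/texture detectors)

    # Replace the last layer with our own softmax and train again.
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")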
Weak Solution for Finite Sequential Games
A weak solution provides an algorithm that reveals a complete play of perfect moves from the initial position, assuming perfect play of the opponents.
Goodfellow Definition of AI
AI -> Machine Learning -> Representation Learning -> Deep Learning
Difference AI / ML / DL with Jass
AI -> Rule set, scoring, heuristic, algorithms
ML -> Needs domain know-how, features, a classifier, a data set
DL -> Needs no domain know-how; data set, features (created within the NN)
Support Vector Machine
Features act like fingerprints of the image. An SVM calculates features from the image (SIFT, descriptors, color distributions) and creates a function to separate the space into the classes.
ReLU Activation
Activation Function. f(x) = max(0,x). A Rectified Linear Unit is a simple linear transformation unit where the output is zero if the input is less than zero and equal to the input otherwise. ReLU is the activation function of choice because it allows neural networks to train faster and prevents information loss.
Sigmoid
Activation Function. A good choice for binary classification. f(x) = 1/(1+e^-x)
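A minimal NumPy sketch of the three activation functions defined in this glossary (ReLU, sigmoid, softplus):

    import numpy as np

    def relu(x):     return np.maximum(0, x)      # f(x) = max(0, x)
    def sigmoid(x):  return 1 / (1 + np.exp(-x))  # f(x) = 1/(1+e^-x)
    def softplus(x): return np.log(1 + np.exp(x)) # f(x) = log(1+e^x)

    x = np.array([-2.0, 0.0, 2.0])
    print(relu(x), sigmoid(x), softplus(x))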
Adaptive Gradient Algorithm
AdaGrad is a gradient descent optimization algorithm that features an adjustable learning rate for every parameter.
Non-Max Suppression
Algorithm used as part of YOLO. It helps detect the correct bounding box of an object by eliminating overlapping bounding boxes with a lower confidence of identifying the object.
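A minimal sketch of the idea, with boxes as (x1, y1, x2, y2) tuples and an illustrative overlap threshold:

    import numpy as np

    def iou(a, b):
        # Intersection over union of two boxes.
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def non_max_suppression(boxes, scores, threshold=0.5):
        # Repeatedly keep the most confident box and drop all boxes
        # that overlap it too much.
        order = list(np.argsort(scores)[::-1])
        keep = []
        while order:
            best = order.pop(0)
            keep.append(best)
            order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
        return keep

    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    print(non_max_suppression(boxes, [0.9, 0.8, 0.7]))  # keeps boxes 0 and 2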
Reward Hypothesis
All goals can be described by the maximization of the expected cumulative reward. A reward Rt is a scalar feedback signal indicating how well the agent is doing at time t.
Loss Function
Also called Cost Function. Function that the NN wants to minimize.
End-to-End Learning
An algorithm is able to solve the entire task by itself. Additional human intervention, like model switching or new data labeling, is not necessary. For example, end-to-end driving means that the neural network figures out how to adjust the steering command just by evaluating images.
Softmax
An extension of the logistic regression function which calculates the probability of the input belonging to every one of the existing classes. Softmax is often used in the final layer of a DNN. The class with the highest probability is chosen as the predicted class. It is well-suited for classification tasks with more than two output classes.
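A minimal, numerically stable NumPy sketch:

    import numpy as np

    def softmax(logits):
        # Subtracting the maximum changes nothing mathematically
        # but avoids overflow in the exponential.
        z = np.exp(logits - np.max(logits))
        return z / np.sum(z)

    probs = softmax(np.array([2.0, 1.0, 0.1]))
    print(probs, probs.argmax())  # probabilities sum to 1; class 0 is predicted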
Stochastic Gradient Descent
An optimization algorithm which performs a parameter update for every single training example. The algorithm usually converges much faster than batch gradient descent, which performs a parameter update only after calculating the gradients for the entire training set.
Mini-Batch Gradient Descent
An optimization algorithm which runs gradient descent on smaller subsets of the training data. The method enables parallelization as different workers separately iterate through different mini-batches. For every mini-batch, compute the cost and update the weights of the mini-batch. It's an efficient combination of batch and stochastic gradient descent.
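A minimal NumPy sketch on a synthetic linear regression problem; the batch size of 32 and the learning rate are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([1.0, -2.0, 0.5])           # synthetic targets
    w = np.zeros(3)
    lr, batch_size = 0.1, 32

    for epoch in range(20):
        idx = rng.permutation(len(X))            # shuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            err = X[batch] @ w - y[batch]
            grad = X[batch].T @ err / len(batch) # gradient of the squared error
            w -= lr * grad                       # one update per mini-batch
    print(w)                                     # close to [1, -2, 0.5]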
Ultra-Weak Solution for Finite Sequential Games
An ultra-weak solution proves whether the first player will win, lose or draw from the initial position, assuming perfect play of the opponents.
Average Pooling
Averages the results of a convolutional operation. Used in older CNNs; recent architectures favor max pooling.
Batch Normalization
Batch Normalization is a technique that normalizes layer inputs per mini-batch. It speeds up training, allows for the use of higher learning rates, and can act as a regularizer. Batch Normalization has been found to be very effective for convolutional and feedforward neural networks but hasn't been successfully applied to recurrent neural networks.
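A minimal NumPy sketch of the normalization step for one mini-batch; gamma and beta are the learned scale and shift parameters:

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # Normalize each feature over the mini-batch, then let the
        # network rescale and shift via the learned gamma and beta.
        x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
        return gamma * x_hat + beta

    x = np.random.randn(32, 4) * 10 + 5  # mini-batch of 32 examples, 4 features
    out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1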
ImageNet
Collection of millions of images and their annotated classes. Very useful resource for image classification tasks. Labels become more specific through subcategories, e.g. Bird -> Eagle.
Multi-Class Problem
Decide between one of k different classes. Don't use the label directly; use one-hot encoded or categorical vectors. The network should output a probability for each class, which can be achieved by using the softmax function as the activation function in the last layer (normalization).
Cost Function
Defines the difference between the calculated output and what it should be, and forms the basis for parameter updates: the output is compared to the ground truth and the weights are adjusted.
Hyperparameters
Hyperparameters determine the performance of your neural network. They are not to be confused with parameters or weights, which the DNN learns itself. Examples of hyperparameters:
- Learning rate and number of iterations of gradient descent
- In k-NN, the number of neighbors k
- Degree of the polynomial for regression models
- Regularization parameters
- Kernel for support vector machines
- Tree depth and variable selection policy in decision tree models
- Number of layers, number of neurons, and activation function in deep learning models
- Dropout, batch normalization, optimizer, etc. in deep learning models
Capacity
A model's ability to fit a wide variety of functions. Capacity governs the trade-off between bias (underfitting) and variance (overfitting).
Actions and rewards
Each action has an expected reward. Those values are estimated as Q(a)
Epoch
Encompasses a single forward and backward pass for every example in the training set. A single epoch touches every training example once.
Training error
Error on training set
Greedy action
Exploiting the action with the best known reward
Non-greedy action
Exploring for better rewards
Finite Sequential Game
- Finite set of players
- Finite set of possible actions
- Actions are chosen sequentially
- A finite set of turns is being played
- Later players observe the moves of earlier players (i.e. perfect recall)
- Each player acts according to a strategy
- The strategy profile is the selected strategy of each player
- A utility or payoff function determines the outcome for each action profile
Regression
Form of statistical learning where the output variable is a continuous instead of a categorical value. While classification assigns a class to the input variable, regression assigns a value that has an infinite number of possible values, typically a number. Examples are the prediction of house prices or customer age.
Generalization Error
Gap between error on the training and test set
GoogLeNet (Inception Architecture)
"Going Deeper with Convolutions": larger and deeper nets are necessary. A new architecture with filters of various sizes.
LeNet
Handwritten Digit Recognition, First CNN
Reinforcement Learning
Has an explicit goal. Can sense aspects of the environment. Can choose actions to influence the environment. It is usually assumed that there is some uncertainty about the environment. The network optimizes itself by getting rewards for actions: the agent performs an action on the environment; the environment transitions to a new state and gives a reward to the agent. IT IS NOT supervised learning or unsupervised learning.
Gradient Descent
Helps the neural network decide how to adjust its parameters to minimize the cost function. Repeatedly adjust the parameters until a minimum is found.
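A minimal sketch on a one-dimensional function with a known derivative:

    # Minimize f(w) = (w - 3)^2, whose derivative is 2*(w - 3).
    w, lr = 0.0, 0.1
    for _ in range(100):
        gradient = 2 * (w - 3)
        w -= lr * gradient  # step against the slope
    print(w)                # close to the minimum at w = 3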
Residual Network
If there are enough layers, adding more layers should not decrease performance, but layers do not learn the identity transform easily. The solution is residual connections: the input x (the identity) of an added layer is summed with the layer's output F(x), and the sum F(x) + x is used as the actual output.
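A minimal NumPy sketch of the residual connection; F stands in for the added layer's transformation:

    import numpy as np

    def residual_block(x, F):
        # The layer output F(x) is summed with its input x (the identity),
        # so the layer only has to learn the residual, not the identity.
        return F(x) + x

    F = lambda x: np.maximum(0, x @ np.random.randn(4, 4))  # illustrative layer
    print(residual_block(np.random.randn(4), F))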
SARSA
Instead of learning the value function, we often want to learn the expected total reward of an action A in state S. This is the Q function, which represents the value of a state-action pair. The action in each calculation is chosen according to a policy (for example epsilon-greedy): (State, Action) -> Reward -> (State, Action).
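A minimal sketch of the SARSA update on a tabular Q function; alpha is the step size and gamma the discount factor:

    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.9

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy: a_next is the action the policy (e.g. epsilon-greedy)
        # actually chose in the next state.
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])

    sarsa_update(s=0, a=1, r=1.0, s_next=2, a_next=0)
    print(Q[0, 1])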
Regularization
Limit the capacity of networks by adding a parameter norm penalty to the loss function. The penalty depends on the weights; L1 and L2 regularization are used for this.
Deep Neural Network
Neural Network with many hidden layers
Node
The neurons of a layer. Each node has inputs, internal parameters, and a function which calculates its output. The function depends on the parameters.
General update formula for RL-Action
NewEstimate <- OldEstimate + StepSize[Target - OldEstimate]
Feed Forward Network
The nodes in the network do not form cycles. A simple architecture.
Complexity Factors in Game Analysis
- Number of players: more players are harder
- Size of the search space: the number of possible turns
- Competitive vs. cooperative games
- Stochastic vs. deterministic games
- Perfect vs. imperfect information games
Variance (Overfitting)
Occurs when the DNN overfits to the training data. The DNN fails to distinguish noise from pattern and models every variance in the training data. A model with high variance usually fails to accurately generalize to new data.
Bias
Occurs when the model does not achieve a high accuracy on the training set. It is also called underfitting. When a model has a high bias, it will generally not yield high accuracy on the test set.
Cybernetics
Old term for AI/DL/NN
Maximum Pooling
Only selects the maximum values of a specific input area. It is often used in convolutional neural networks to reduce the size of the input.
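A minimal NumPy sketch of 2x2 max pooling with stride 2:

    import numpy as np

    def max_pool_2x2(x):
        # Keep only the maximum of each non-overlapping 2x2 area,
        # halving the height and width of the input.
        H, W = x.shape
        x = x[:H - H % 2, :W - W % 2]
        return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    x = np.arange(16).reshape(4, 4)
    print(max_pool_2x2(x))  # [[ 5  7] [13 15]]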
Cross Entropy
Optimal Loss Function for categorical problems.
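A minimal NumPy sketch, assuming one-hot encoded targets:

    import numpy as np

    def cross_entropy(probs, one_hot_target, eps=1e-12):
        # Penalizes assigning low probability to the true class.
        return -np.sum(one_hot_target * np.log(probs + eps))

    probs = np.array([0.7, 0.2, 0.1])    # e.g. a softmax output
    target = np.array([1.0, 0.0, 0.0])   # true class is class 0
    print(cross_entropy(probs, target))  # -log(0.7) ≈ 0.357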
Adam Optimization
Optimization function that can be used instead of SGD. It is computationally efficient, works well with large data sets, and requires little hyperparameter tuning.
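A minimal sketch of a single Adam step following the published update rule; the defaults shown are the usual ones:

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
        # Running averages of the gradient (m) and its square (v),
        # bias-corrected before taking an adaptive step.
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    w, m, v = 5.0, 0.0, 0.0
    for t in range(1, 201):
        g = 2 * (w - 3)                      # gradient of (w - 3)^2
        w, m, v = adam_step(w, g, m, v, t, lr=0.1)
    print(w)                                 # close to 3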
Likelihood of Data-Model
Probability that the data is matched by the model. Choose the model that maximizes this likelihood.
Root Mean Squared Propagation
RMSProp is an extension of the stochastic gradient descent optimization method. The algorithm features a separate learning rate for every parameter instead of a single learning rate for the whole network. RMSProp adjusts the learning rates based on how quickly the parameters changed in previous iterations.
Deeper Neural Networks
Representation learning: no need for handcrafted domain-knowledge features.
NN: Features -> Output
DNN: Raw input -> Output
Games Representation
Sequential games can be represented as game trees.
Nodes -> game states / positions
Edges -> actions / moves
Leaves -> payoffs
Derivative
Slope of a function at a specific point. Calculated to let the gradient descent algorithm adjust weights
Policy Iteration
Start with a random policy. Calculate the value function v using the current policy, then improve the policy by selecting actions greedily from v. Repeat until the policy remains stable.
Early stopping
Stop when error on validation set starts to increase again. Effectively learn the number of training steps needed (as a hyperparameter).
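A minimal sketch with a patience counter; the validation errors are illustrative numbers:

    # Stop once the validation error has not improved for `patience` steps.
    errors = [0.9, 0.7, 0.55, 0.5, 0.52, 0.53, 0.56]  # per training step
    best_err, best_step, patience, wait = float("inf"), 0, 3, 0
    for step, err in enumerate(errors):
        if err < best_err:
            best_err, best_step, wait = err, step, 0  # remember the best model
        else:
            wait += 1
            if wait >= patience:
                break                                 # error keeps rising: stop
    print(best_step, best_err)                        # step 3, error 0.5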
Bias
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
Vanishing Gradients
The problem arises when training very deep neural networks. In backpropagation, weights are adjusted based on their gradient, or derivative. In deep neural networks, the gradients of the earlier layers can become so vanishingly small that the weights are not updated at all. The ReLU activation function is suited to address this problem because it doesn't squash the input as much as other functions.
Validation Set
The validation set is used to find the optimal hyperparameters of a deep neural network. Generally, the DNN is trained with different combinations of hyperparameters, which are then tested on the validation set. The best performing set of hyperparameters is then used to make the final prediction on the test set. Pay attention to balancing the validation set.
State Space for Games
Tic-Tac-Toe -> 10^3
Connect-4 -> 10^13
Backgammon -> 10^20
Chess -> 10^47
Go 19x19 -> 10^170
Iteration
Total number of forward and backward passes of a neural network. Every batch counts as one pass. If your training set has 5 batches and trains 2 epochs, then it will run 10 iterations.
Ensemble method for Regularization
Train several models separately and then vote on the result.
Temporal Difference Learning (TD)
Update the value function v after looking only at the next step. Used with an approximation of the total reward.
AlphaGo Approach
Use CNNs to learn the policy and value function. Done with:
- Supervised policy networks
- Reinforcement-learned policy networks
- A fast policy network (patterns instead of a CNN)
- A reinforcement-learned value network
All of this is combined with a Monte Carlo Tree Search, which maximizes the probability to win. MCTS plays (simulates) games, then learns from those games; MCTS uses a rollout policy, a value function, and 2 DCNNs.
Upper Confidence Bound (UCB)
Used to balance exploitation and exploration
Minimizing a Function
Done using an optimization function like SGD, which changes the parameters of the layers to minimize the loss function. Clearer gradients are easier to minimize.
Types of RL Agents
Value based: no policy (implicit), value function
Policy based: policy, no value function
Actor critic: policy, value function
Model free: no model; policy and/or value function
Model based: policy and/or value function, model
VGG Net
Very Deep Convolutional Networks for Large-Scale Image Recognition. Convolutional layers only.
Q-Learning
Very similar to SARSA. The subtle difference is that it does not use the policy to update the Q value, but instead chooses the best action.
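A minimal sketch of the update, directly comparable to the SARSA sketch above; the max over the next state's actions replaces the policy's actual next action:

    import numpy as np

    Q = np.zeros((5, 2))
    alpha, gamma = 0.1, 0.9

    def q_learning_update(s, a, r, s_next):
        # Off-policy: bootstrap from the best next action, regardless
        # of which action the behavior policy will actually take.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])

    q_learning_update(s=0, a=1, r=1.0, s_next=2)
    print(Q[0, 1])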
Parameters
Weights of a DNN which transform the input before applying the activation function. Each layer has its own set of parameters. The parameters are adjusted through backpropagation to minimize the loss function.
Epsilon-Greedy
With Probability 1-Epsilon: Take action with maximal Action-Reward With probability Epsilon: Take any action with equal probability
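A minimal sketch over a table Q of estimated action rewards:

    import numpy as np

    def epsilon_greedy(Q, epsilon=0.1):
        if np.random.rand() < epsilon:
            return np.random.randint(len(Q))  # explore: any action, equal probability
        return int(np.argmax(Q))              # exploit: maximal estimated reward

    print(epsilon_greedy(np.array([0.2, 0.8, 0.5])))  # usually 1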
Xavier Initialization
Xavier initialization assigns the start weights in the first hidden layer so that the input signals reach deep into the neural network. It scales the weights based on the number of input and output neurons. This way, it prevents the signal from either becoming too small or too large later in the network.
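A minimal sketch of the normal-distribution variant; fan_in and fan_out are the layer's input and output counts:

    import numpy as np

    def xavier_init(fan_in, fan_out):
        # Scale the start weights by the layer size so the signal's
        # variance stays roughly constant from layer to layer.
        std = np.sqrt(2.0 / (fan_in + fan_out))
        return np.random.randn(fan_in, fan_out) * std

    W = xavier_init(256, 128)
    print(W.std())  # roughly sqrt(2/384) ≈ 0.072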
YOLO
You Only Look Once, is an algorithm to identify objects in an image. Convolutions are used to determine the probability of an object being in a part of an image. Non-max suppression and anchor boxes are then used to correctly locate the objects.
Backpropagation
A general framework used to adjust the network weights to minimize the loss function. It travels backward through the network and adjusts the weights with a form of gradient descent through each activation function.
Backward Induction
The process of analyzing a problem in reverse, starting with the last choice, then the second-to-last choice, and so on, to determine the optimal strategy. A solution strategy for finite games with perfect information. One run of backward induction from the start state reveals a weak solution. If the backward induction can be done dynamically (during the game) for any state, it is a strong solution.
PASCAL Visual Object Classes
Three challenges: classification (is the object in the image?), detection (where is the object?), segmentation (which pixels belong to which object?). Extra: action classification, person layout.