Deep Learning ANN
How many activation functions per layer in an ANN?
One; a single activation function is typically applied to all units in a layer.
What are some regularization functions to reduce overfitting?
1. Dropout 2. Batch normalization 3. L1 & L2 regularization
What is an ANN?
An ANN is a collection of perceptrons and activation functions. Perceptrons are connected to form hidden layers or units. An ANN is a map from input to output.
Where can the output of a perceptron go?
Activation function or transfer function
Sigmoid function
Squashes values into the range (0, 1); attenuates (reduces) values with high magnitudes.
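A minimal NumPy sketch of the sigmoid (function name is illustrative):

    import numpy as np

    def sigmoid(x):
        # squashes any real input into (0, 1); large magnitudes saturate
        return 1.0 / (1.0 + np.exp(-x))

    sigmoid(np.array([-10.0, 0.0, 10.0]))  # -> roughly [0.00005, 0.5, 0.99995]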
What is the procedure for updating weights?
Backpropagation
How to optimize softmax?
One option is to take the Euclidean distance between the softmax output and the one-hot encoded target, but there is a better cost function to optimize: cross entropy.
How does a CNN encode filters?
Through learned transformations. Learned filters detect features or patterns in images; the deeper the layer, the more abstract the pattern. Layers can detect edges, corners, and other patterns.
What's a convolutional neural network and why are they useful?
CNNs are similar to the neural networks described in earlier sections. CNNs have weights and biases, and their outputs pass through a nonlinear activation. Neurons are arranged in a volumetric fashion to take advantage of the input volume. Each layer transforms an input volume into an output volume. CNN filters encode features by transformation.
Why is the hyperbolic tangent function, tanh, useful?
tanh is a scaled version of sigmoid that maps inputs to (-1, 1). It is smooth and differentiable. Because its outputs are zero-centered, its gradients are more stable than sigmoid's, so it suffers less from vanishing gradients.
An image is considered a volume...
with height, width, and depth. Depth is the channel dimension of the image (e.g., RGB).
What is the hidden unit or layer?
Hidden units form the nonlinear basis that maps the input layer to the output layer, often in a lower-dimensional space; this layered mapping is what is called an ANN.
What does the Learning rate determine in Gradient Descent?
How big each update step should be. Networks with nonlinear activations have non-convex loss surfaces with local minima; SGD works better in practice for optimizing such non-convex functions.
If we use a regular ANN for images, why is this not scalable?
A regular fully connected network needs a huge number of neurons for images, because every pixel connects to every hidden unit. This makes the model very large and prone to overfitting, so it does not scale.
Explain how a perceptron works.
Inputs are weighted and summed, and the sum is then passed to a unit step function. A single perceptron can only learn simple functions from examples, but several perceptrons can be stacked to learn more complex functions.
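A minimal NumPy sketch of a perceptron (the AND-gate weights are a hypothetical example):

    import numpy as np

    def perceptron(x, w, b):
        # weighted sum of the inputs plus a bias, passed to a unit step
        return 1 if np.dot(w, x) + b > 0 else 0

    # hypothetical weights that realize an AND gate
    perceptron(np.array([1, 1]), np.array([0.5, 0.5]), -0.7)  # -> 1
    perceptron(np.array([1, 0]), np.array([0.5, 0.5]), -0.7)  # -> 0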
Why is training an ANN tricky?
It's tricky because the network contains many parameters to optimize.
What's a parameter used to convolve images?
A kernel (filter) is used to convolve images, sliding over the input and taking weighted sums.
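A minimal NumPy sketch of a 2D convolution with no padding (illustrative, not an optimized implementation):

    import numpy as np

    def convolve2d(image, kernel, stride=1):
        # slide the kernel over the image and take weighted sums
        kh, kw = kernel.shape
        oh = (image.shape[0] - kh) // stride + 1
        ow = (image.shape[1] - kw) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * kernel)
        return out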
What is L1 & L2 Regularization? Why is it useful?
L1 regularization penalizes the absolute value of each weight and tends to make weights exactly zero. L2 regularization penalizes the squared value of each weight and tends to make weights smaller during training. Both regularizations assume that models with smaller weights are better.
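A minimal sketch of the two penalty terms (lam is a hypothetical regularization strength, added to the data loss):

    import numpy as np

    def l1_penalty(w, lam):
        # absolute values: tends to drive weights to exactly zero
        return lam * np.sum(np.abs(w))

    def l2_penalty(w, lam):
        # squared values: shrinks weights toward zero without zeroing them
        return lam * np.sum(w ** 2)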
What does the training process do?
Learns features that form a better representation than raw pixels.
What is an activation function?
It makes the perceptron non-linear and decides whether or not a neuron should fire. During training, activation functions play an important role in adjusting gradients.
Draw or picture an ANN
A multi-layer perceptron: input layer, hidden layer, output layer. Several inputs x are passed to the hidden layer of perceptrons, and their weighted sums are combined into the output.
What gives the ability to learn complex functions in deep nets?
Non-linear behavior of an activation function
Why is Sigmoid useful?
Sigmoid is useful for converting values into probabilities and can be used for binary classification. It maps input values into the range (0, 1).
What is softmax and why is it useful?
Softmax is an activation function that forces the network's outputs to sum to 1, so they can be treated as a probability distribution. It is useful in multi-class classification. Each output is normalized by dividing its exponential by the sum of the exponentials of all outputs.
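A minimal NumPy sketch of softmax (subtracting the max is a standard numerical-stability trick):

    import numpy as np

    def softmax(z):
        # exponentiate, then divide by the sum so the outputs total 1
        e = np.exp(z - np.max(z))  # subtract the max for stability
        return e / np.sum(e)

    softmax(np.array([2.0, 1.0, 0.1]))  # -> [0.659, 0.242, 0.099], sums to 1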
What does the Universal Approximation Theorem state?
States that a multi-layer perceptron with enough hidden units can approximate any continuous function.
What are the parameters in a kernel?
Size and stride. The size can be any rectangular dimensions; the stride is the number of pixels the kernel moves at each step.
What's a stride?
Stride is the number of pixels the kernel moves at each step. A stride of 1 produces an output almost the same size as the input, and a stride of 2 produces an output about half the size. Padding the image lets the output keep the same size as the input.
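A minimal sketch of the usual output-size formula for a convolution or pooling window:

    def conv_output_size(n, k, stride, pad):
        # (input - kernel + 2 * padding) // stride + 1
        return (n - k + 2 * pad) // stride + 1

    conv_output_size(28, 3, 1, 1)  # -> 28: padding keeps the input size
    conv_output_size(28, 2, 2, 0)  # -> 14: stride 2 halves the size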
How is sampling done in max pooling? What are the benefits of pooling?
Max pooling takes the maximum value in each window of the image; average pooling averages over the window. Pooling acts as a regularizer and helps prevent overfitting. It is carried out on all channels of the feature map and can be performed with various strides.
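A minimal NumPy sketch of max pooling over one channel (illustrative only):

    import numpy as np

    def max_pool2d(x, size=2, stride=2):
        # take the maximum value in each window
        oh = (x.shape[0] - size) // stride + 1
        ow = (x.shape[1] - size) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
        return out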
Artificial Neuron
Takes several inputs and performs a weighted summation to produce an output.
What's a tool to visualize neural networks?
TensorFlow Playground - you can change the learning rate, activation functions, regularization, and hidden layers to see how they affect the training process. It builds intuition about neural networks.
What does a model consist of?
The values of the weights and bias values along with the architecture.
What are the cons of both sigmoid and tanh?
They fire all the time, making the ANN computationally heavy. The Rectified Linear Unit (ReLU) activation function avoids this pitfall by not firing all the time.
What determines the values of the weights and biases?
The training process determines the values of the weights and biases. They are usually initialized randomly at the beginning.
What is a multi-class classification problem?
Tries to discriminate between more than two categories (for example, 10 digit classes). Becomes possible with one-hot encoding and the softmax function.
What's the problem with sigmoid?
Vanishing gradients. For large-magnitude inputs, the change in y with respect to x is small, so gradients vanish.
What is one-hot encoding and why is it useful?
A way to represent the target variables in classification problems. Target variables are converted from string values ("dog") to one-hot encoded vectors. A one-hot encoded vector has a 1 at the index of the target class and 0 everywhere else. It makes no assumption about similarity between target classes and makes multi-class classification with softmax possible.
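A minimal sketch of one-hot encoding (the class index for "dog" is a hypothetical example):

    import numpy as np

    def one_hot(class_index, num_classes):
        # 1 at the target class index, 0 everywhere else
        v = np.zeros(num_classes)
        v[class_index] = 1.0
        return v

    one_hot(2, 4)  # e.g. "dog" mapped to index 2 -> [0., 0., 1., 0.]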
How is the weight of a perceptron determined?
Weight is determined through the training process and is based on the training data.
Backpropagation
Weights are updated backwards, layer by layer, based on the calculated error. Gradient descent can be used to compute the weight updates.
When does training stop?
When the error can no longer be reduced
Most of the activation function...
are continuous and differentiable, except the rectified linear unit at 0. A continuous function produces a small output change for every small input change.
What is batch normalization and why is it important?
Batch normalization increases the stability and performance of neural network training. It normalizes the output of a layer to zero mean and a standard deviation of 1. It is important because it reduces overfitting and makes the network train faster. Very useful for complex neural networks.
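A minimal sketch of the normalization step (real batch norm also learns a scale gamma and shift beta, omitted here):

    import numpy as np

    def batch_norm(x, eps=1e-5):
        # normalize each feature over the batch to zero mean, unit std
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        return (x - mean) / np.sqrt(var + eps)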
In order to train a neural Network, functions must be...
differentiable.
Maximizing a function is equivalent to
minimizing the negative of the same function.
How is the ANN map computed?
Computed by weighted additions of the input with biases.
What is cross entropy and why is it useful?
Cross entropy is a loss function whose error has to be minimized. It is the summation of negative logarithmic probabilities; logarithms are used for numerical stability. Cross entropy measures the distance between the softmax outputs and the one-hot encoded targets. Neural networks estimate the probability of the data for every class, and the probability of the correct target label has to be maximized.
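A minimal NumPy sketch of cross entropy against a one-hot target (the small epsilon is a stability guard):

    import numpy as np

    def cross_entropy(probs, target_one_hot):
        # sum of negative log probabilities at the target class
        return -np.sum(target_one_hot * np.log(probs + 1e-12))

    cross_entropy(np.array([0.7, 0.2, 0.1]),
                  np.array([1.0, 0.0, 0.0]))  # -> -log(0.7) ~= 0.357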
What can the hidden layer be also called?
Dense layer
How many perceptrons and hidden layers are needed?
Depends on the problem.
What is dropout and why is it important?
Dropout is an effective way of regularizing neural networks to prevent overfitting. During training, the dropout layer cripples the neural network by stochastically removing hidden units. Dropout is also an efficient way of combining several neural networks.
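A minimal sketch of inverted dropout, one common formulation (rescaling the survivors keeps the expected activation unchanged):

    import numpy as np

    def dropout(x, p=0.5, training=True):
        if not training:
            return x  # no units are dropped at inference time
        # zero out units with probability p, rescale the survivors
        mask = (np.random.rand(*x.shape) > p) / (1.0 - p)
        return x * mask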
What is an extreme case of bagging and model averaging?
Dropout. For each training case we randomly select a few hidden units to drop, so we end up with a different architecture for each case. Dropout should not be applied at inference time, as it's not necessary.
How is the error computed?
The error is computed using a loss function that contrasts the prediction with the ground truth. Based on the loss at every step, the weights are tuned.
What is the procedure to reduce error?
Optimization
Gradient Descent
Performs multidimensional optimization. The objective is to reach the global minimum, improving the model's predictions. Optimization involves calculating the error value and changing the weights to achieve a minimal error.
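A minimal sketch of the gradient descent loop on a toy one-dimensional error function:

    def gradient_descent(grad_fn, w, lr=0.1, steps=100):
        # repeatedly step against the gradient to reduce the error
        for _ in range(steps):
            w = w - lr * grad_fn(w)
        return w

    # minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
    gradient_descent(lambda w: 2 * (w - 3.0), w=0.0)  # -> approximately 3.0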
What's max pooling and why is it useful?
Pooling layers are placed between convolution layers. Pooling reduces the size of the image across the layers by downsampling.
What is Training?
The process of learning the weights, usually through gradient-based methods.
Why is the Rectified Linear Unit (ReLU) useful?
ReLU lets large positive values pass through unchanged and maps negative inputs to 0 (it maps x to max(0, x)), so some neurons stay inactive and don't fire. This increases sparsity, which is good. Since it doesn't fire all the time, the network can be trained faster. The function is simple, so it is computationally the least expensive, and it works well for a large number of applications.
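A minimal NumPy sketch of ReLU:

    import numpy as np

    def relu(x):
        # negative inputs map to 0; positive values pass through unchanged
        return np.maximum(0.0, x)

    relu(np.array([-2.0, 0.0, 3.0]))  # -> [0., 0., 3.]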
What is Stochastic Gradient Descent (SGD) and why is it important?
Same as gradient descent, but it uses only a subset (mini-batch) of the data for each update. The mini-batch size is a parameter. Theoretically, even one sample can be used per update.
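A minimal sketch of the mini-batch loop (grad_fn is a hypothetical function returning the gradient on one batch):

    import numpy as np

    def sgd(grad_fn, w, data, lr=0.01, batch_size=32, epochs=10):
        # update on one shuffled mini-batch at a time instead of the full set
        for _ in range(epochs):
            np.random.shuffle(data)
            for start in range(0, len(data), batch_size):
                batch = data[start:start + batch_size]
                w = w - lr * grad_fn(w, batch)
        return w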
