L2: MLP, backpropagation and activation functions
Which loss function should be used in a binary classification task?
(Binary) Cross-entropy
Name 5 activation functions
1) Linear 2) Sigmoid 3) ReLU 4) Hyperbolic tangent 5) Softmax
Give three examples of loss functions
1) Mean squared error (MSE) 2) Mean absolute error 3) Cross-entropy
What defines an MLP's output layer for a binary classification task?
A single neuron with sigmoid activation function.
What is Leaky-ReLU?
A variant of ReLU that uses a small positive slope (<1, e.g. 0.01) for negative inputs instead of outputting 0.
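A minimal sketch comparing ReLU and Leaky-ReLU. The slope value alpha = 0.01 is an assumption (a commonly used default), not something the lecture fixes.

```python
def relu(z: float) -> float:
    """Standard ReLU: max(z, 0)."""
    return max(z, 0.0)


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """Leaky-ReLU: identity for positive inputs, small slope alpha otherwise.

    alpha = 0.01 is an assumed default; any small positive value < 1 works.
    """
    return z if z > 0 else alpha * z


if __name__ == "__main__":
    for z in (-2.0, 0.0, 3.0):
        print(z, relu(z), leaky_relu(z))
```

For negative inputs ReLU returns 0, while Leaky-ReLU returns a small negative value, so its gradient there is alpha rather than 0.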
In a regression task, what should the output produce?
Any real value (in principle from -∞ to ∞)
What is a solution to the vanishing gradient?
An activation function with a constant or linear derivative.
At which stage does an MLP optimize its parameters?
Backpropagation
Which loss function should be used in a multiclass classification task?
Categorical cross-entropy
What are the layers in a multilayer perceptron or feed-forward network called?
Dense or Fully connected layers
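What is meant by the 'vanishing gradient'?
Due to the chain rule, the gradient in an early layer is a product of many derivatives. When these are smaller than 1 (as with sigmoid or tanh), the gradients get smaller layer after layer going back toward the input, so the early layers barely learn.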
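A rough illustration of why the gradient shrinks, assuming sigmoid activations, all weights equal to 1 and every pre-activation equal to the same value z (simplifying assumptions made only for this sketch). The sigmoid derivative is at most 0.25, so multiplying it across layers decays quickly.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 (reached at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_gradient(z: float, n_layers: int) -> float:
    """Product of n_layers sigmoid derivatives, all evaluated at z.

    Assumes weights of 1 and identical pre-activations in every layer.
    """
    return sigmoid_grad(z) ** n_layers


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, chained_gradient(0.0, n))
```

Even in the best case (z = 0), ten layers leave a factor of 0.25^10, roughly 1e-6.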
What is meant by 'dying ReLU'?
During training, a ReLU unit can fall into a state where its output is 0 for any input. It is very difficult to recover from such a state, as the gradient will also be 0.
What is another name for the multilayer perceptron (MLP)?
Feed-forward network
What defines the output layer of an MLP that is designed for a classification task of five different classes?
Five neurons with softmax activation function.
What is special about the multilayer perceptron (MLP) or feedforward network?
Formed by more than one layer.
At which stage does an MLP calculate the loss/error?
Forward propagation
To which activation function does this formula belong? tanh(z) = (e^(2z) - 1)/(e^(2z) + 1)
Hyperbolic tangent
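As a quick sanity check, the hyperbolic tangent written with exponentials, tanh(z) = (e^(2z) - 1)/(e^(2z) + 1), can be compared against Python's built-in math.tanh. The helper name tanh_from_exp is made up for this sketch.

```python
import math


def tanh_from_exp(z: float) -> float:
    """tanh(z) written with exponentials: (e^(2z) - 1) / (e^(2z) + 1)."""
    e2z = math.exp(2.0 * z)
    return (e2z - 1.0) / (e2z + 1.0)


if __name__ == "__main__":
    for z in (-1.0, 0.0, 0.5):
        print(z, tanh_from_exp(z), math.tanh(z))
```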
What does the chain rule say?
If L = f(a) and a = g(z), then dL/dz = dL/da × da/dz
When does a linear activation function work?
In the output layer of a regression task
Give the formula for the cross-entropy loss function that is used in binary classification.
L = -y log(p) - (1-y) log(1-p)
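A small sketch of the binary cross-entropy L = -y log(p) - (1-y) log(1-p) for a single example. The clipping constant eps is an assumption added here to avoid log(0); it is not part of the formula itself.

```python
import math


def binary_cross_entropy(y: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one example.

    y is the target (0 or 1), p is the predicted probability of class 1.
    p is clipped to [eps, 1 - eps] so that log(0) never occurs (assumed safeguard).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


if __name__ == "__main__":
    print(binary_cross_entropy(1, 0.9))  # confident and correct: small loss
    print(binary_cross_entropy(1, 0.1))  # confident and wrong: large loss
```

The loss is small when p is close to the target y and grows without bound as p approaches the wrong class.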
To which activation function does this formula belong? σ(z) = z
Linear
Which activation function should be used in the output layer of a regression task?
Linear
What is another name for the feed-forward network?
Multilayer perceptron (MLP)
What is special about the XOR?
No single straight line can separate the two classes.
Can the perceptron solve the XOR?
No, the perceptron can only separate the space linearly.
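To illustrate why one extra layer is enough, here is a sketch of a tiny two-layer network that computes XOR. The weights and thresholds are chosen by hand for illustration (they are not learned and not taken from the lecture), and a step activation is assumed.

```python
def step(v: float) -> int:
    """Threshold activation: 1 if v > 0, else 0."""
    return 1 if v > 0 else 0


def xor_mlp(x1: int, x2: int) -> int:
    """XOR built from two hidden units and one output unit (hand-set weights)."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR but not AND


if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor_mlp(x1, x2))
```

Each hidden unit draws one straight line; the output unit combines the two half-planes, which a single perceptron cannot do.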
In a multiclass classification task, what should the output produce?
One probability per class (summing to 1); the target is a one-hot encoding indicating the correct class
Give an example of an activation function with a constant or linear derivative.
ReLU
To which activation function does this formula belong? ReLU(z) = max(z, 0)
ReLU
To which activation function does this formula belong? σ(z) = 1/(1+e^(-z))
Sigmoid
Which activation function should be used in a binary classification task?
Sigmoid or Tanh
To which activation function does this formula belong? σ(z_i) = e^(z_i)/(∑_(j=1)^N e^(z_j))
Softmax
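A minimal softmax sketch: each output is e^(z_i) divided by the sum of e^(z_j) over all N inputs. Subtracting the maximum before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Softmax over a list of scores; the outputs are positive and sum to 1."""
    m = max(z)  # shift by the max to avoid overflow in exp (result is unchanged)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    probs = softmax([1.0, 2.0, 3.0])
    print(probs, sum(probs))
```

The largest score receives the largest probability, which is why softmax is paired with one neuron per class in the output layer.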
Which activation function should be used in a multiclass classification task?
Softmax
How do you know how each individual parameter contributes to the error?
Take the derivative of the error (cost function) with respect to each parameter.
What does the loss (error/cost) function calculate?
The "cost" or distance between the network's output and the expected one.
What is dL/dw_ij^l?
The derivative of the error (cost function) with respect to the weight w_ij in layer l, i.e. how much that parameter contributes to the error.
What calculates the "cost" or distance between the network's output and the expected one?
The loss (error/cost) function
What happens at the stage 'forward propagation'?
The loss/error is calculated.
What are the outputs of XOR?
The output is 1 if exactly one of x1 and x2 is 1, and 0 otherwise.
For an MLP with n layers, what is the output of the network (y)?
The output of the last layer
What happens at the stage 'backpropagation'?
The gradient of the loss with respect to each parameter is computed with the chain rule, and the parameters (weights and biases) are updated to minimize the loss (with gradient descent).
For what is a nonlinear activation function needed?
To learn more complex functions.
In a binary classification task, what should the output produce?
One of two different values (e.g. 0 or 1), one for each class
For an MLP with n layers, the output of each layer (a^l) is defined as: a^l = f(W^l × a^(l-1) + b^l). What is a^0? What is a^n?
a^0 = x (the input), a^n = y (the network output)
How is the output of each layer (a^l) defined, for an MLP with n layers?
a^l = f(W^l × a^(l-1) + b^l)
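The recursion a^l = f(W^l × a^(l-1) + b^l) can be sketched with plain Python lists as below. The function names dense and forward are made up for this sketch, and using the same sigmoid activation f in every layer is a simplifying assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dense(W: Matrix, a_prev: Vector, b: Vector, f: Callable[[float], float]) -> Vector:
    """One layer: a^l = f(W^l a^(l-1) + b^l), with one row of W per neuron."""
    return [
        f(sum(w * a for w, a in zip(row, a_prev)) + bias)
        for row, bias in zip(W, b)
    ]


def forward(x: Vector, layers: Sequence[Tuple[Matrix, Vector]],
            f: Callable[[float], float] = sigmoid) -> Vector:
    """Forward propagation: a^0 = x, then apply each layer; returns a^n = y."""
    a = x
    for W, b in layers:
        a = dense(W, a, b, f)
    return a


if __name__ == "__main__":
    # Hypothetical 2-2-1 network with arbitrary illustrative weights.
    layers = [
        ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
        ([[1.0, 1.0]], [0.0]),
    ]
    print(forward([1.0, 2.0], layers))
```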
Suppose an MLP starts with an input x and a weight W₁, so that u₁ = W₁x. There is one more hidden layer that leads to a₁, and this value is used to calculate the loss. Give the formula to calculate how much parameter W₁ contributes to the error.
dloss/dW₁ = dloss/da₁ × da₁/du₁ × du₁/dW₁
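The chain-rule product can be checked numerically. The sketch below assumes a sigmoid activation (a₁ = sigmoid(u₁)) and a squared-error loss (a₁ - y)²; both are illustrative choices, not fixed by the question, and the function names are made up.

```python
import math


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def loss(w1: float, x: float, y: float) -> float:
    """Assumed setup: u1 = w1 * x, a1 = sigmoid(u1), loss = (a1 - y)^2."""
    a1 = sigmoid(w1 * x)
    return (a1 - y) ** 2


def grad_chain_rule(w1: float, x: float, y: float) -> float:
    """dloss/dW1 = dloss/da1 * da1/du1 * du1/dW1."""
    a1 = sigmoid(w1 * x)
    dloss_da1 = 2.0 * (a1 - y)   # derivative of the squared error
    da1_du1 = a1 * (1.0 - a1)    # derivative of the sigmoid
    du1_dw1 = x                  # derivative of u1 = w1 * x
    return dloss_da1 * da1_du1 * du1_dw1


def grad_numeric(w1: float, x: float, y: float, h: float = 1e-6) -> float:
    """Central finite difference, used only to verify the analytic gradient."""
    return (loss(w1 + h, x, y) - loss(w1 - h, x, y)) / (2.0 * h)


if __name__ == "__main__":
    print(grad_chain_rule(0.5, 2.0, 1.0), grad_numeric(0.5, 2.0, 1.0))
```

The two values should agree to several decimal places, which confirms that multiplying the three local derivatives gives the full derivative.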
In the formula for the cross-entropy loss function that is used in binary classification ( L = -y log(p) - (1-y) log(1-p) ), what is p and what is y?
p is the output of the network and y is the target output.
Give the formula of the linear activation function
σ(z) = z