Case study - convolutional neural networks (CNNs)
RELU (rectified linear unit)
- applies an elementwise activation function, such as the max(0, x) thresholding at zero; this activation is commonly used in place of the sigmoid function because it is much cheaper to compute
- leaves the size of the volume unchanged
- implements a fixed function, so it introduces no additional hyperparameters (see the sketch below)
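A minimal NumPy sketch of the elementwise thresholding (the input array below is purely illustrative):

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): negative activations are set to zero,
    # positive activations pass through unchanged.
    return np.maximum(0, x)

x = np.array([[-1.5, 2.0],
              [ 0.3, -0.7]])
print(relu(x))        # [[0.  2. ]
                      #  [0.3 0. ]]
print(relu(x).shape)  # (2, 2) -- same size as the input
```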
POOL details
- its function is to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network and hence also helps control overfitting
- operates independently on every depth slice of the input and resizes it spatially, using the MAX operation
- required hyperparameters: the spatial extent F and the stride S
- it is not common to use zero-padding in pooling layers (see the sketch below)
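A minimal sketch of max pooling on a single depth slice with F = 2 and S = 2 (a hypothetical helper; it assumes the spatial size divides evenly by the stride):

```python
import numpy as np

def max_pool(x, F=2, S=2):
    # x: one depth slice of shape (H, W); each depth slice of the
    # input volume is pooled independently in the same way.
    H, W = x.shape
    out = np.zeros(((H - F) // S + 1, (W - F) // S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # MAX over each F x F window, stepping by the stride S.
            out[i, j] = np.max(x[i * S:i * S + F, j * S:j * S + F])
    return out

x = np.arange(16).reshape(4, 4)
print(max_pool(x))  # [[ 5.  7.]
                    #  [13. 15.]]
```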
Architecture of the CNN
- a list of layers that transform the image volume into an output volume (see the example sketch below)
- input layer: receives an input (a single vector)
- hidden layers: made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons within a single layer function completely independently and do not share any connections
- output layer: in classification settings, the last fully-connected layer represents the class probabilities
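An illustrative sketch (following the common CIFAR-10 example) of how such a list of layers transforms the input volume into class scores; the filter count of 12 is an assumption for illustration:

```python
# Each layer transforms one 3D volume (width x height x depth) into another.
pipeline = [
    ("INPUT", (32, 32, 3)),   # raw pixel values of a 32x32 RGB image
    ("CONV",  (32, 32, 12)),  # 12 filters, each looking at a local region
    ("RELU",  (32, 32, 12)),  # elementwise max(0, x), size unchanged
    ("POOL",  (16, 16, 12)),  # downsampling along width and height
    ("FC",    (1, 1, 10)),    # class scores, one per category
]

for name, (w, h, d) in pipeline:
    print(f"{name:5s} -> volume {w}x{h}x{d}")
```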
CONV
- computes the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume (see the sketch below)
- performs transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons)
- parameters will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image
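A minimal sketch of the dot product a single CONV neuron computes over one local region (shapes and the random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((32, 32, 3))  # input volume, e.g. a 32x32 RGB image
w = rng.standard_normal((5, 5, 3))    # one filter: 5x5 receptive field, full input depth
b = 0.1                               # bias for this filter

# One output activation: dot product between the filter weights and the
# 5x5x3 local region of the input it is connected to, plus the bias.
region = x[10:15, 10:15, :]
activation = np.sum(region * w) + b
print(activation)

# Sliding the same filter across all spatial positions (parameter sharing)
# produces one depth slice of the output volume.
```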
Converting FC layers to CONV layers
- difference: the neurons in the CONV layer are connected only to a local region in the input, and many of the neurons in a CONV volume share parameters
- similarity: neurons in both layers still compute dot products, so their functional form is identical
- for any CONV layer there is an FC layer that implements the same forward function; the weight matrix would be a large matrix that is mostly zero except for certain blocks (and where the weights in many of the blocks are equal); a small numerical check follows below
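A small numerical check of that equivalence, under illustrative sizes: an FC layer with 5 outputs looking at a 4x4x3 volume is re-expressed as a CONV layer with K = 5 filters of spatial extent F = 4 (stride 1, no padding), and both compute the same dot products:

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.standard_normal((4, 4, 3))           # input volume
W_fc = rng.standard_normal((5, 4 * 4 * 3))   # FC weight matrix: 5 outputs
b = rng.standard_normal(5)

# FC forward pass: flatten the volume, then matrix multiply plus bias.
fc_out = W_fc @ x.reshape(-1) + b

# Equivalent CONV layer: 5 filters covering the whole input, i.e. each
# filter has shape 4x4x3, producing an output volume of size [1x1x5].
filters = W_fc.reshape(5, 4, 4, 3)
conv_out = np.array([np.sum(f * x) + b[k] for k, f in enumerate(filters)])

print(np.allclose(fc_out, conv_out))  # True
```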
CONV details
- does most of the computational heavy lifting
- requires four hyperparameters: the number of filters K, their spatial extent F, the stride S, and the amount of zero padding P
- local connectivity: when dealing with high-dimensional inputs it is impractical to connect neurons to all neurons in the previous volume; instead, we connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently, the filter size).
- spatial arrangement: three hyperparameters control the size of the output volume
  - depth: corresponds to the number of filters we would like to use, each learning to look for a different feature in the input; we refer to a set of neurons that are all looking at the same region of the input as a depth column
  - stride: the step with which the filter is slid over the input (commonly 1 or 2)
  - zero-padding: allows us to control the spatial size of the output volumes
- parameter sharing: used to control the number of parameters (see the worked sketch below)
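A worked sketch of how these hyperparameters determine the output volume size and how parameter sharing cuts the parameter count; the sizes follow the widely cited first-layer example of the AlexNet architecture (input [227x227x3], K = 96, F = 11, S = 4, P = 0), used here only as an illustration:

```python
# Output spatial size of a CONV layer: (W - F + 2P) / S + 1
W_in, depth_in = 227, 3
K, F, S, P = 96, 11, 4, 0

W_out = (W_in - F + 2 * P) // S + 1        # 55 -> output volume [55x55x96]
neurons = W_out * W_out * K                # 290,400 neurons in that volume
weights_per_neuron = F * F * depth_in + 1  # 363 weights + 1 bias

# Without parameter sharing, every neuron would own its weights:
print(neurons * weights_per_neuron)        # 105,705,600 parameters

# With parameter sharing, there is one weight set (and bias) per filter:
print(K * F * F * depth_in + K)            # 34,944 parameters
```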
FC
- computes the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score
- as with ordinary Neural Networks, and as the name implies, each neuron in this layer is connected to all the numbers in the previous volume
- performs transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons)
- parameters will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image
POOL
- it will perform a downsampling operation along the spatial dimensions (width, height)
- will implement a fixed function
Technologies that support CNN learning and operation
- large-scale labelled data sets, e.g. MNIST (classified handwritten digits) and CIFAR-10 (classified images)
- the latest developments in GPUs: GPU-accelerated computing offloads compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU; from a user's perspective, applications simply run much faster
- end-to-end learning
The main features of CNNs
- made up of neurons that have learnable weights and biases
- each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity (see the sketch below)
- ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture; these make the forward function more efficient to implement and vastly reduce the number of parameters in the network
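A minimal sketch of a single neuron's computation (the values are illustrative):

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])  # inputs to the neuron
w = np.array([0.2, 0.4, -0.1])  # learnable weights
b = 0.05                        # learnable bias

# Dot product of weights and inputs plus bias, optionally followed by a
# non-linearity such as max(0, x).
pre_activation = np.dot(w, x) + b
activation = max(0.0, pre_activation)
print(pre_activation, activation)  # about -0.63, then 0.0 after thresholding
```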
Advantages of CNNs over ANNs
- shift invariance
- reduction in the processing required, due both to their basic design (that is, local connectivity instead of full connectivity between layers) and to the use of filter strides and pooling
- the reduced memory footprint owing to the use of the same parameters (weights) across each convolution layer
RELU details
- the activation is simply thresholded at zero, meaning that activation levels lower than 0 are set to zero
Pros:
- greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid/tanh functions; it is argued that this is due to its linear, non-saturating form
- compared to tanh/sigmoid neurons, which involve expensive operations, it can be implemented by simply thresholding a matrix of activations at zero
Cons:
- ReLU units can be fragile during training and can "die": a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again; if this happens, the gradient flowing through the unit will forever be zero from that point on (see the sketch below)
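A small sketch of why a "dead" unit stops learning: the local gradient of max(0, x) is zero wherever the pre-activation is negative, so no gradient reaches the weights of those units (the values are illustrative):

```python
import numpy as np

def relu_grad(pre_activation):
    # Local gradient of max(0, x): 1 where the input is positive, 0 otherwise.
    return (pre_activation > 0).astype(float)

pre = np.array([2.3, -0.7, -5.1])     # pre-activations of three units
upstream = np.array([0.4, 0.4, 0.4])  # gradient arriving from the layer above

# Gradient passed back towards each unit's weights: units whose
# pre-activation is negative receive exactly zero gradient.
print(upstream * relu_grad(pre))      # [0.4 0.  0. ]
```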
Layers used to build ConvNets
Convolutional Layer, Pooling Layer, and Fully-Connected Layer
FC details
Neurons have full connections to all activations in the previous layer. Their activations can hence be computed with a matrix multiplication followed by a bias offset.
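A minimal sketch of that computation, with illustrative sizes and 10 classes as in the [1x1x10] example above:

```python
import numpy as np

rng = np.random.default_rng(2)

prev_volume = rng.standard_normal((4, 4, 12))  # activations from the previous layer
W = rng.standard_normal((10, 4 * 4 * 12))      # one row of weights per class score
b = rng.standard_normal(10)                    # bias offset

# Full connectivity: every score depends on every activation in the
# previous volume, via a matrix multiplication plus the bias offset.
scores = W @ prev_volume.reshape(-1) + b
print(scores.shape)                            # (10,)
```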
3D volumes of neurons
width, height, depth
INPUT
will hold the raw pixel values of the image