CS 7643 Quiz 3
L1 Loss
Sum of Absolute Value of (true - predicted)
L2 Loss
Sum of (true - predicted)^2
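A minimal NumPy sketch of these two losses (variable names are my own; MSE below is just the L2 sum divided by the number of elements):

import numpy as np

def l1_loss(y_true, y_pred):
    # sum of absolute differences
    return np.sum(np.abs(y_true - y_pred))

def l2_loss(y_true, y_pred):
    # sum of squared differences
    return np.sum((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(l1_loss(y_true, y_pred))  # 1.5
print(l2_loss(y_true, y_pred))  # 1.25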
Mask R-CNN
Same as Faster R-CNN, but also learns a mask that says which pixels belong to the object, which helps deal with background pixels. Lots of hyperparameters. Slower than YOLO/SSD but more accurate in general.
Mean Squared Error (MSE)
Average of (true - predicted)^2
Focal Loss
-1 * (1- prediction of true class)^gamma * log(prediction of true class)
Balanced Cross-Entropy Loss
-1 * alpha * log(prediction of true class)
Class Balanced Focal Loss
-1 * alpha_t * (1 - prediction of true class)^gamma * log(prediction of true class)
Binary Cross-Entropy Loss
-1 * log(prediction of true class)
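A hedged NumPy sketch of how these cross-entropy variants relate, writing p_t for the predicted probability of the true class (the alpha and gamma values are only illustrative):

import numpy as np

def bce(p_t):
    # binary cross-entropy on the true class
    return -np.log(p_t)

def balanced_ce(p_t, alpha=0.25):
    # weight the loss by a class-dependent alpha
    return -alpha * np.log(p_t)

def focal(p_t, gamma=2.0):
    # down-weight well-classified examples (large p_t)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def class_balanced_focal(p_t, alpha_t=0.25, gamma=2.0):
    # combine the alpha weighting and the focal modulation
    return -alpha_t * ((1.0 - p_t) ** gamma) * np.log(p_t)

p_t = 0.9  # confident, correct prediction -> focal loss is tiny
print(bce(p_t), focal(p_t))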
VGGNet
2x(2xCONV=>POOL) => 3x(3xCONV=>POOL) => 3xFC. Repeated application of 3x3 Conv (stride 1, padding 1) and 2x2 Max Pooling (stride 2) blocks. Very large number of parameters (most in the FC layers); most memory in the Conv layers (you are storing the activations produced in the forward pass). Critical development: blocks of repeated structures.
AlexNet
2x(CONV=>MAXPOOL=>NORM) => 3xCONV => MAXPOOL => 3xFC. ReLU, specialized normalization layers, PCA-based data augmentation, Dropout, Ensembling (used 7 NNs with different random weights). Critical development: more depth and ReLU.
ResNet
Allows information from a layer to propagate to a future layer: a skip connection takes the input of the layer at depth x and adds it to the output of the layer at x+1, so the block only has to learn the residual. Global average pooling block at the end. Critical development: passing residuals of previous layers forward.
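A minimal PyTorch sketch of the skip-connection idea (layer sizes are arbitrary, not ResNet's actual configuration):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # add the input back in (skip connection)

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])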
Faster R-CNN
Uses deep learning to do everything: an RPN (region proposal network) generates proposals over a grid of anchors, outputs an objectness score and bounding box for each, and the top-K proposals are kept. Trained with 4 losses: RPN objectness score loss, RPN bounding box loss, classifier loss for each class, and bounding box regression loss.
Inception Net
Deeper and more complex than VGGNet. Average pooling before the FC layer. Repeated blocks stacked over and over to form the NN; blocks are made of simple layers (FC, Conv, MaxPool, softmax). Parallel filters of different sizes to get features at multiple scales. Critical development: blocks of parallel paths. Uses the Network-in-Network concept, i.e. 1x1 convolution as a sort of dimensionality reduction (see slide). Downside: increased computational work.
Estimation Error
Even if you find the best hypothesis (weights and parameters) that minimizes training error, it may not generalize to the test set.
Optimization Error
Even if your NN can perfectly model the world, your optimization algorithm may not find good weights that model the function. As model complexity increases, modeling error decreases, but optimization error increases.
R-CNN
Find regions of interest (ROIs) containing object-like things, then classify those regions and refine their bounding boxes. Slow: based on Selective Search; returns scores and bounding boxes; hundreds of crops per image to process, wasting a lot of computing resources on the same image portions.
Number of parameters for CNN
For each layer, sum ((Kernel_dim1 * Kernel_dim2 * InputChannels) + 1) * NumberOfFilters. Helpful for parameters: https://stackoverflow.com/questions/42786717/how-to-calculate-the-number-of-parameters-for-convolutional-neural-network
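A quick sanity check in Python (the layer shape here is made up):

# conv layer: 3x3 kernels, 3 input channels, 64 filters
kernel_h, kernel_w, in_channels, num_filters = 3, 3, 3, 64
conv_params = ((kernel_h * kernel_w * in_channels) + 1) * num_filters  # +1 for the bias per filter
print(conv_params)  # 1792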
Number of parameters for FC Layers
For each layer, sum NumHiddenUnits * (InputSize + 1). Helpful for parameters: https://stackoverflow.com/questions/42786717/how-to-calculate-the-number-of-parameters-for-convolutional-neural-network
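Same idea for a fully connected layer (sizes are illustrative):

# FC layer: 4096 hidden units fed by a 9216-dim input
num_hidden, input_size = 4096, 9216
fc_params = num_hidden * (input_size + 1)  # +1 for the bias per hidden unit
print(fc_params)  # 37752832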
Modeling Error
Given a particular NN architecture, the actual model that represents the real world may not be in that hypothesis space. As model complexity increases, modeling error decreases, but optimization error increases.
Equivariance
If the input changes, the output changes in the same way: f(g(x)) = g(f(x)). E.g. if the beak of a bird in a picture moves a bit, the output values will move in the same way. A change to the input causes an equal change to the output.
Invariance
If the input changes, the output stays the same; that is, f(g(x)) = f(x). E.g. rotating/scaling a digit will still result in it being classified the same. A change to the input does not affect the output. Useful if we care more about whether a feature is present than exactly where it is.
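A small NumPy toy example of both properties: 1-D convolution is equivariant to translation, while taking a global max over the feature map is invariant to it.

import numpy as np

kernel = np.array([1.0, -1.0])
x = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
x_shifted = np.roll(x, 2)  # translate the input

conv = lambda v: np.convolve(v, kernel, mode="same")

# Equivariance: shifting the input shifts the feature map the same way
print(np.allclose(np.roll(conv(x), 2), conv(x_shifted)))  # True (the bump stays away from the border)

# Invariance: the global max does not care where the bump is
print(conv(x).max() == conv(x_shifted).max())              # True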
Transpose convolution
In contrast to the regular convolution, which reduces input elements via the kernel, the transposed convolution broadcasts input elements via the kernel, thereby producing an output that is larger than the input. If we feed X into a convolutional layer f to output Y = f(X) and create a transposed convolutional layer g with the same hyperparameters as f except that the number of output channels equals the number of channels in X, then g(Y) will have the same shape as X. We can implement convolutions using matrix multiplications; the transposed convolutional layer simply exchanges the forward propagation function and the backpropagation function of the convolutional layer. Sometimes known as a deconvolution: the forward and backward passes are essentially reversed compared to a regular convolution layer. A normal convolution layer maps pixels -> features; a transpose/deconv layer maps features -> pixels. Source: https://machinelearningmastery.com/upsampling-and-transpose-convolution-layers-for-generative-adversarial-networks/
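A minimal PyTorch check of the shape claim above (channel and kernel sizes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 16)
f = nn.Conv2d(3, 8, kernel_size=4, stride=2, padding=1)           # downsamples
g = nn.ConvTranspose2d(8, 3, kernel_size=4, stride=2, padding=1)  # same hyperparameters, channels swapped

y = f(x)
print(y.shape)     # torch.Size([1, 8, 8, 8])
print(g(y).shape)  # torch.Size([1, 3, 16, 16]) -- same shape as x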
Adversarial examples
Inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence.
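One common way to construct such perturbations is the fast gradient sign method; this is a hedged PyTorch sketch assuming a hypothetical pretrained classifier `model`, input batch `x`, and labels `y`:

import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.01):
    # perturb the input in the direction that increases the loss the most
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()  # small, worst-case step per pixel
    return x_adv.detach()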
saliency maps
Instead of using deconvnets, we can take the gradient of the class score we are trying to visualize with respect to the image itself (the input of the network), rather than the error gradient with respect to the model parameters. This gives the degree to which each pixel contributed to that class score; take the absolute value because we care about degree, not direction. Helps us understand why the model gave the response it did. Another method to make saliency maps: the guided backpropagation algorithm (a combination of deconvnet and the gradient of the class score with respect to the network input). Measures the sensitivity of the loss to individual pixel changes; uses pre-softmax scores (gradient, then absolute value, then sum across channels).
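A hedged PyTorch sketch of this gradient-based saliency map, assuming a hypothetical `model` and an input image `img` of shape (1, 3, H, W):

import torch

def saliency_map(model, img, class_idx):
    img = img.clone().detach().requires_grad_(True)
    scores = model(img)              # pre-softmax class scores
    scores[0, class_idx].backward()  # gradient of the class score w.r.t. the input
    # absolute value (degree, not direction), then sum across channels
    return img.grad.abs().sum(dim=1).squeeze(0)  # (H, W) map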
Guided Backprop
Layer by layer (deconvolution is similar to backprop); from details to more abstracted representations.
Number of parameters for Pooling Layers
None
Memory per CNN layer (KB)
NumFilters * HeightOut * WidthOut * BytePerElement / 1024, where BytePerElement = 4 for 32-bit floating point
Memory per FC Layer (KB)
NumHiddenNodes * BytePerElement / 1024, where BytePerElement = 4 for 32-bit floating point
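A tiny worked example of both memory formulas (the layer sizes are made up):

bytes_per_element = 4  # 32-bit float

# conv layer producing 64 feature maps of size 112x112
conv_kb = 64 * 112 * 112 * bytes_per_element / 1024
print(conv_kb)  # 3136.0 KB

# FC layer with 4096 hidden nodes
fc_kb = 4096 * bytes_per_element / 1024
print(fc_kb)    # 16.0 KB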
Receptive fields
The Receptive Field (RF) is defined as the size of the region in the input that produces a feature: a measure of the association of an output feature (of any layer) with an input region (patch). When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume. Instead, we connect each neuron to only a local region of the input volume; the spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently, the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. Note the asymmetry in how the spatial dimensions (width and height) and the depth dimension are treated: the connections are local in 2D space (along width and height), but always full along the entire depth of the input volume. Easy to understand link: https://theaisummer.com/receptive-field/
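A small sketch of how the receptive field grows layer by layer, using the standard recurrence rf += (k - 1) * jump; jump *= stride (the layer list is illustrative):

# each layer: (kernel_size, stride)
layers = [(3, 1), (3, 1), (2, 2), (3, 1)]

rf, jump = 1, 1
for k, s in layers:
    rf += (k - 1) * jump  # extra input pixels this layer sees
    jump *= s             # spacing of adjacent output features, in input pixels
print(rf)  # 10: receptive field of the last layer's features, in input pixels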
Effectiveness of transfer learning under certain conditions
Remove the last FC layer of the CNN, initialize it randomly, then run the new data through the network to train only that layer. To train the NN for transfer learning, freeze the CNN layers (or the early layers) and learn the parameters in the FC layers. Performs very well with a very small amount of training data, if it is similar to the original data. Does not work very well if the target task's dataset is very different. If you have enough data in the target domain and it is different from the source, it is better to just train on the new data. Transfer learning = reuse features learned on a very large dataset for a completely new task. Steps: (1) train on a very large dataset; (2) take the custom dataset and initialize the network with the weights trained in step 1 (replace the last fully connected layer since the classes in the new network will be different); (3) continue training on the new dataset. You can either retrain all weights ("finetune") or freeze (i.e. not update) the weights in certain layers; freezing reduces the number of parameters you need to learn.
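A hedged PyTorch/torchvision sketch of the freeze-and-replace recipe (the backbone choice and the 10-class target task are assumptions):

import torch.nn as nn
from torchvision import models

# Step 1: start from weights trained on a very large dataset
# (the exact pretrained-weights argument varies by torchvision version)
model = models.resnet18(pretrained=True)

for p in model.parameters():  # freeze the pretrained layers
    p.requires_grad = False

# Step 2: replace the last FC layer (new task has different classes), randomly initialized
model.fc = nn.Linear(model.fc.in_features, 10)

# Step 3: continue training on the new dataset, updating only model.fc's parameters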
SSD (Single-Shot Detector)
Grid as anchors with different scales/aspect ratios. Based on a VGG model up to the conv5_3 layer.
Forwards and backwards computation across a convolution layer (i.e. know whether backwards with respect to the weights or input is a convolution or cross-correlation).
https://medium.com/@pavisj/convolutions-and-backpropagations-46026a8f5d2c https://glassboxmedicine.com/2019/07/26/convolution-vs-cross-correlation/
Convolutional layers and how they work (forward/backward)
https://www.youtube.com/watch?v=Lakz2MoHy6o&t=1299s (Don't have a good short summary)
Fast R-CNN
Map each ROI in the image to the corresponding region in the feature maps. Reuses computation by finding regions in the feature maps, so feature extraction happens once per image. Issue: variable input size to the FC layers, solved with ROI Pooling.
Style Transfer
Measures the difference in style between the synthesized image and the style image: a sum over layers of the squared difference between the Gram matrices of the style image and the prediction, where the Gram matrix G abstracts the correlations between the feature maps (channels) of a layer.
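A hedged PyTorch sketch of the Gram matrix and per-layer style loss (features assumed to have shape (N, C, H, W); the normalization by C*H*W is one common convention):

import torch

def gram(features):
    n, c, h, w = features.shape
    f = features.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # channel-by-channel correlations

def style_loss(pred_feats, style_feats):
    # summed over the chosen layers
    return sum(((gram(p) - gram(s)) ** 2).sum() for p, s in zip(pred_feats, style_feats))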
Grad-CAM
A more versatile version of CAM that can produce visual explanations for any arbitrary CNN, even if the network contains a stack of fully connected layers. Let the gradients of any target concept score flow into the final convolutional layer, then compute an importance score based on the gradients and produce a coarse localization map highlighting the important regions in the image for predicting that concept. Answers: what regions of the image is the model looking at to make its prediction? Which individual regions have the highest class activation as you extract a layer from the CNN? Uses the direction/magnitude of gradients to determine which are causing the most updates to the NN. Objective: inspect a given layer of the CNN and correlate it to the output. Task specific (if asked "what is a dog" -> dog pixels are more important).
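A hedged sketch of the Grad-CAM computation itself, assuming you already have the final conv feature maps `acts` of shape (1, C, H, W) and the gradients of the class score with respect to them in `grads` of the same shape (obtaining those usually requires forward/backward hooks):

import torch
import torch.nn.functional as F

def grad_cam(acts, grads):
    weights = grads.mean(dim=(2, 3), keepdim=True)  # importance of each feature map
    cam = (weights * acts).sum(dim=1)               # weighted sum of the maps
    return F.relu(cam)                              # keep only positively contributing regions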
YOLO (You Only Look Once)
Single-scale; faster for the same input size. Customized architecture, fully connected layers at the end. NMS (non-maximum suppression) before producing results.
Content Loss
The difference in content features between the synthesized image and the content image, measured via the squared loss function.
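A minimal sketch of this squared-error content loss over feature maps from one chosen layer (PyTorch tensors of shape (N, C, H, W) assumed):

def content_loss(pred_feat, content_feat):
    # squared error between the feature maps of one layer
    return ((pred_feat - content_feat) ** 2).sum()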
CAM = Class Activation Mapping
Use a Global Average Pooling layer as the final layer to average the activations of each feature map, then run through a softmax loss layer; highlight the important regions of the image by projecting the output weights back onto the convolutional feature maps.
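A hedged sketch of the CAM projection, assuming conv feature maps `acts` of shape (C, H, W) and the final linear layer's weights `fc_w` of shape (num_classes, C):

import torch

def cam(acts, fc_w, class_idx):
    # weight each feature map by the FC weight connecting it to the chosen class
    c, h, w = acts.shape
    return (fc_w[class_idx].view(c, 1, 1) * acts).sum(dim=0)  # (H, W) heatmap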