Deep Learning for Computer Vision

CNN architecture

image -> feature map -> convolution -> pooling -> more convolution -> more pooling -> fully connected/dense layers
CNNs have two main parts:
* A convolution/pooling mechanism that breaks the image up into features and analyzes them
* A fully connected layer that takes the output of convolution/pooling and predicts the best label to describe the image

Fully Connected Layers

FC input layer (flatten): takes the output of the previous layers, flattens it, and turns it into a single vector that can be the input for the next stage.
FC output layer: gives the final probabilities for each label.

Pooling Layer

Downsampling: reduces the amount of information in each feature obtained in the convolutional layer while keeping the most important information. There are usually several rounds of convolution and pooling.
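As a minimal sketch of one common pooling variant, the snippet below does non-overlapping max pooling on a small feature map. The function name `max_pool2d` and the 2x2 window size are assumptions chosen for illustration, not something the card specifies.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows.

    Trailing rows/columns that do not fill a whole window are dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))  # keep only the strongest response
        pooled.append(pooled_row)
    return pooled


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool2d(fmap))  # [[4, 2], [2, 8]]
```

The 4x4 map shrinks to 2x2, and each output keeps the largest value of its window.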

FC layer as convolution

A fully connected layer can be expressed as a convolution by either 1) choosing a convolutional kernel that has the same size as the input feature map, or 2) using 1x1 convolutions with multiple channels.
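Option 1 can be checked numerically. The sketch below assumes a single-channel 2x2 feature map and two output units; `fully_connected` and `full_size_conv` are hypothetical helper names. A kernel that covers the whole map has only one position to slide to, so each kernel produces one number, which is the same dot product the FC layer computes on the flattened map.

```python
import math


def fully_connected(x_flat, weights, bias):
    """Dense layer: one dot product per output unit."""
    return [sum(w * x for w, x in zip(row, x_flat)) + b
            for row, b in zip(weights, bias)]


def full_size_conv(fmap, kernels, bias):
    """'Valid' convolution (no kernel flip) with kernels as large as the map."""
    outputs = []
    for kernel, b in zip(kernels, bias):
        total = b
        for r, row in enumerate(fmap):
            for c, value in enumerate(row):
                total += kernel[r][c] * value
        outputs.append(total)
    return outputs


fmap = [[1.0, 2.0], [3.0, 4.0]]
kernels = [[[0.5, -1.0], [0.0, 2.0]],
           [[1.0, 1.0], [1.0, 1.0]]]
bias = [1.0, -2.0]

# Flatten the map and each kernel in the same row-major order.
flat = [v for row in fmap for v in row]
weights = [[v for row in k for v in row] for k in kernels]

fc_out = fully_connected(flat, weights, bias)
conv_out = full_size_conv(fmap, kernels, bias)
print(fc_out, conv_out)  # [7.5, 8.0] [7.5, 8.0]
```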

Bags of Visual Words

1. Extract features
2. Learn a visual vocabulary
3. Quantize features
4. Use clusters and hashes to group visual words
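The quantization step can be sketched as follows, assuming the vocabulary has already been learned (for example by clustering) and that the image is then summarized as a normalized histogram of word counts. The names `nearest_word` and `bovw_histogram` and the toy 2-D descriptors are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor (squared L2)."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((d - v) ** 2 for d, v in zip(descriptor, vocabulary[k])),
    )


def bovw_histogram(descriptors, vocabulary):
    """Count how often each visual word occurs, then normalize to sum to 1."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


vocabulary = [(0, 0), (10, 10)]                    # assumed: two learned words
descriptors = [(1, 0), (9, 9), (0, 2), (11, 10)]   # assumed: local features
print(bovw_histogram(descriptors, vocabulary))  # [0.5, 0.5]
```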

Canny Edge Detector

1. Filter the image with a Gaussian. A large sigma keeps overall features; a small sigma keeps details.
2. Find the magnitude and orientation of the gradient.
3. Non-maximum suppression: thin multi-pixel edges to single-pixel width.
4. Linking and thresholding (hysteresis): use a high threshold to start an edge and a low threshold to continue it.
Three convolutions are needed: one Gaussian and one derivative filter each in x and y. The detector works on grayscale images.

Viola-Jones Face Detector

1. Selecting Haar-like features: multiple image features are calculated from pixel intensities in rectangular regions.
2. Creating an integral image (also known as a summed-area table), which is used to calculate the Haar-like features quickly.
3. Running AdaBoost training: a set of weak classifiers is combined into a strong classifier.
4. Creating classifier cascades: classifiers are chained in stages so that non-faces are discarded quickly.
https://realpython.com/traditional-face-detection-python
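Step 2 is what makes the rectangular features cheap. As a minimal sketch: once the integral image is built, any rectangle sum needs only four lookups. The helper names and the simple left-minus-right feature below are illustrative assumptions, not the detector's actual feature set.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows [top, bottom) and columns [left, right)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Hypothetical two-rectangle feature: left half minus right half."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, top + height, left + half)
    right_sum = rect_sum(ii, top, left + half, top + height, left + 2 * half)
    return left_sum - right_sum


img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 3, 3))           # 34  (6 + 7 + 10 + 11)
print(two_rect_feature(ii, 0, 0, 3, 4))   # -12 (33 - 45)
```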

Convolution

2D convolution is performed by a matrix of weights sliding over the image, performing an elementwise multiplication with the part of the input it is currently on, and then summing up the results into a single output pixel.
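The sliding-window description can be sketched directly. This version assumes no padding and a stride of 1, and, like most deep learning libraries, it does not flip the kernel (strictly speaking a cross-correlation). The name `conv2d` is a placeholder.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; multiply elementwise and sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)  # one output pixel per kernel position
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d(image, kernel))  # [[-4, -4], [-4, -4]]
```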

Accuracy

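A description of how close a measurement is to the true value of the quantity measured. In classification, accuracy is the fraction of predictions that are correct: Accuracy = (TP + TN) / (TP + TN + FP + FN).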

Spatial Pyramid Pooling

Adds a new layer between the convolutional layers and the fully-connected layers. Its job is to map any size input down to a fixed size output.
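A minimal sketch of the idea, assuming a single-channel feature map, max pooling inside each bin, and pyramid levels of 1x1 and 2x2 bins (all illustrative choices). The output length depends only on the levels, not on the input size.

```python
import math


def spatial_pyramid_pool(fmap, levels=(1, 2)):
    """Max-pool the map into n x n bins for each level and concatenate."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = math.floor(i * rows / n)
            r1 = math.ceil((i + 1) * rows / n)
            for j in range(n):
                c0 = math.floor(j * cols / n)
                c1 = math.ceil((j + 1) * cols / n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


small = [[1, 3, 2, 0],
         [4, 2, 1, 1],
         [0, 1, 5, 6],
         [2, 2, 7, 8]]
wide = [[1, 2, 3, 4, 5],
        [6, 7, 8, 9, 10],
        [11, 12, 13, 14, 15]]
print(spatial_pyramid_pool(small))      # [8, 4, 2, 2, 8]
print(len(spatial_pyramid_pool(wide)))  # 5
```

Both inputs yield a vector of length 1 + 4 = 5, which is what lets the fully connected layers that follow keep a fixed input size.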

Anchor Boxes

Anchor boxes are a set of predefined bounding boxes of a certain height and width. ... The network does not directly predict bounding boxes, but rather predicts the probabilities and refinements that correspond to the tiled anchor boxes. The network returns a unique set of predictions for every anchor box defined.

AUC

Area under the ROC curve. Provides an aggregate measure of performance across all possible classification thresholds. AUC is desirable for two reasons:
* AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
* AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of which classification threshold is chosen.
However, scale invariance is not always desirable: sometimes we really do need well-calibrated probability outputs, and AUC won't tell us about that. Classification-threshold invariance is not always desirable either, for example when false positives and false negatives have very different costs.
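The ranking interpretation can be made concrete. AUC equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as half. The sketch below computes it that way; the function name `auc` and the toy labels and scores are assumptions. Rescaling the scores leaves the result unchanged, which illustrates scale invariance.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))                     # 0.75
print(auc(labels, [100 * s for s in scores]))  # 0.75 (same ranking)
```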

Semantic segmentation

Associate each pixel of an image with a categorical label

Spatial Pyramid Matching

The bag-of-words approach discards the spatial structure of the image. To recover some of it, divide the image into a small number of cells and concatenate the histogram of each cell, with a suitable weight, onto the histogram of the whole image.

ResNet

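Deep residual neural network framework for image classification. Skip (shortcut) connections let each block learn a residual on top of its input, which makes very deep networks trainable. It comes in several depths (for example ResNet-18, ResNet-50, ResNet-101).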

Pose Estimation

Detect human pose from images

Object Detection

Detect multiple objects with their bounding boxes in an image

Instance Segmentation

Detect objects and associate each pixel inside object area with an instance label

U-Net

Fully convolutional network for image segmentation. It works with fewer training images and yields more precise segmentations quickly. The network consists of a contracting path and an expansive path, which gives it the U-shaped architecture. The contracting path is a typical convolutional network that consists of repeated application of convolutions, each followed by a rectified linear unit (ReLU) and a max pooling operation. During the contraction, the spatial information is reduced while feature information is increased. The expansive path combines the feature and spatial information through a sequence of up-convolutions and concatenations with high-resolution features from the contracting path.
https://heartbeat.fritz.ai/deep-learning-for-image-segmentation-u-net-architecture-ff17f6e4c1cf

Filters

Gaussian filter: smoothing kernel

GAN

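Generative Adversarial Network. Generates visually deceptive (realistic-looking) images by training a generator against a discriminator.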

HoG

Histogram of Oriented Gradients

Object Localization

Identify location of one or more objects and draw a bounding box.

Transfer Learning

It is common to use a pretrained CNN because it is rare to have a dataset of sufficient size. Three scenarios:
1. ConvNet as fixed feature extractor: remove the last fully connected layer, keep the remaining weights frozen, and train a new classifier on the extracted features.
2. Fine-tune the CNN: not just retrain the classifier but also fine-tune the pretrained weights by continuing backpropagation.
3. Pretrained models: final ConvNet checkpoints are released so that others can use them for fine-tuning.
https://cs231n.github.io/transfer-learning

Non-max Suppression

Object detection may produce multiple bounding boxes around the same object. All overlapping bounding boxes are removed except for the one with the maximum probability.
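A minimal sketch of the usual greedy procedure, assuming boxes are given as (x1, y1, x2, y2) and that "overlapping" means intersection-over-union (IoU) above a threshold. The 0.5 threshold and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU of about 0.68, so it is suppressed; the third box is far away and survives.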

OverFeat

Object detection, localization and classification in one CNN. The main idea is to (i) classify the image at multiple locations and multiple scales in a sliding-window fashion, and (ii) predict the bounding box locations with a regressor trained on top of the same convolution layers. The architecture is similar to AlexNet.

YOLO

One-stage detector. Uses a simple Darknet backbone and frames detection as a single regression problem.

Non-Maximum Suppression

Picking out the maximum probability bounding box from various overlapping bounding boxes

Depth Prediction

Predict a depth map from images

RANSAC

Random sample consensus is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. Therefore, it also can be interpreted as an outlier detection method. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed.
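As a minimal sketch, the loop below fits a line y = slope * x + intercept to points that include gross outliers. The iteration count, inlier tolerance, fixed seed, and toy data are all assumptions made for illustration.

```python
import random


def ransac_line(points, iterations=200, tolerance=0.5, seed=0):
    """Return ((slope, intercept), inliers) for the best line found."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # 1. Fit a candidate model to a minimal random sample (two points).
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line; not representable as y = m*x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # 2. Count the points that agree with the candidate.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tolerance]
        # 3. Keep the candidate with the largest consensus set.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers


# Ten points on y = 2x + 1 plus two outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -25)]
model, inliers = ransac_line(points)
print(model, len(inliers))  # (2.0, 1.0) 10
```

The outliers never gather more than a couple of supporters, so they have no influence on the returned line. A common follow-up, omitted here, is to refit the model on all inliers with least squares.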

Person Re-ID

Re-identify humans across scenes

ROC curve

Receiver Operating Characteristic curve. Plots the false positive rate (x axis) against the true positive rate (y axis) for different classification thresholds. A curve that rises more steeply toward the top-left corner indicates a more accurate classifier.
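A minimal sketch of how the curve's points arise: sweep the threshold over the observed scores and record (FPR, TPR) at each one. The function name `roc_points`, the "score >= threshold means positive" convention, and the toy data are assumptions.

```python
def roc_points(labels, scores):
    """List of (false positive rate, true positive rate), one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_points(labels, scores))
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```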

Image Classification

Recognize an object in an image

Video Action Recognition

Recognize human actions in a video

RCNN

Region-based CNN.
1. Use selective search to get regions of interest (RoIs).
2. Extract CNN features from each region independently for classification.

Faster RCNN

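Region proposal is integrated into the CNN: a Region Proposal Network (RPN) shares the convolutional features and replaces the external selective search step.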

SIFT

Scale-Invariant Feature Transform
1. Keypoints of objects are extracted from reference images and saved in a database. These are generally on high-contrast regions such as edges.
2. In a new image, candidate matching features are found based on the Euclidean distance between feature vectors.
3. An efficient hash table is used to find clusters of consistent matches.

Sensitivity

Sensitivity = TP / (TP + FN)

Specificity

Specificity = TN / (TN + FP)
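Both ratios come straight from the confusion counts. A minimal sketch, assuming binary labels where 1 is the positive class; the helper names and the toy predictions are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Share of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual negatives that were correctly rejected."""
    return tn / (tn + fp)


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)                            # 3 1 3 1
print(sensitivity(tp, fn), specificity(tn, fp))  # 0.75 0.75
```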

Fast RCNN

The CNN feature extraction is done once per image instead of once for every RoI.

Precision

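In classification, the fraction of predicted positives that are truly positive: Precision = TP / (TP + FP). (In measurement, precision is the degree to which repeated measurements show the same result.)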

Deformable Parts Model

Models an object as a set of parts whose relative positions can vary (deform). This handles objects that look different in different images: a person can be sitting or standing, and car doors can be open or closed.

Object detection using CNN

There are generally two parts:
1. Backbone: extracts features; usually borrowed from ImageNet classification.
2. Detection branch

BBox Regressor

Used to refine or predict localization boxes. It is trained to regress from either region proposals or fixed anchor boxes to nearby bounding boxes of predefined target object classes.

F1 Score

Combines precision and recall (sensitivity) into a single measure, their harmonic mean: F1 = 2 * Precision * Recall / (Precision + Recall)
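A short worked example, assuming F1 is taken as the harmonic mean of precision and recall and that the counts below are made up for illustration. Denominators are assumed to be nonzero.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (assumes both are defined)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical counts: 6 true positives, 2 false positives, 4 false negatives.
print(precision(6, 2))              # 0.75
print(recall(6, 4))                 # 0.6
print(round(f1_score(6, 2, 4), 3))  # 0.667
```

The harmonic mean sits closer to the lower of the two values, so a model cannot get a high F1 by excelling at only one of them.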

SoftMax

Function that turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1, so that they can be interpreted as probabilities.
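A minimal sketch of the function. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; it is an implementation choice, not part of the definition.

```python
import math


def softmax(values):
    """Map K real numbers to K positive numbers that sum to 1."""
    m = max(values)                          # shift to avoid overflow in exp
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, -1.0, 0.0, 5.0])
print(probs)
print(sum(probs))                # 1.0 up to floating-point rounding
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```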

Recall

Also called sensitivity. Recall = TP / (TP + FN)

