Computer Vision


Appearance based recognition. Pros and Cons

Model is trained with a (hopefully) representative set of training images.
+ No explicit models needed (wireframe models, modelled light sources, surfaces etc.)
+ Different views of an object are learned automatically -> invariant representation
- Training data may not be representative
- Recognition is more than object identity: does it generalize?
-> Very robust approach overall. Can be used e.g. for grasp classification; range of applications beyond object identity.

SIFT: Invariant to what? How achieved?

- Rotation invariance: Achieved when the feature vector of length 128 is computed. The vector consists of gradient directions in a 16x16 environment around the keypoint, computed relative to the orientation of the IP itself.
- Illumination invariance: Partly obtained by normalizing and thresholding the feature vector (to suppress large gradient magnitudes): all values > 0.2 are clipped to 0.2 and the vector is re-normalized.
- Scale invariance: DoG extrema are searched on different scales (octaves) as well as blurrings. The feature vector (descriptor) is computed at the scale where the feature was found. Thus, the same feature occurring in another image at another scale is likely to be found again (and matched in the R^128 feature space).

Feature requirements

Without good features, not even the best classifier can recognize patterns. Good features are...
- discriminative (classes are well separated)
- stable against minor changes in the signal (viewpoint, illumination etc.)
The more samples the better.

What is a Feature descriptor?

Feature descriptors encode interesting information into a series of numbers and act as a sort of numerical "fingerprint" that can be used to differentiate one feature from another. They should be a unique and robust description of an image feature. Too short -> not discriminative (similar to other vectors -> high TP but also high FP). Ideally this information is invariant under image transformations, so we can find the feature again even if the image is transformed in some way. Examples: SIFT, SURF, HOG.

Classifier requirements

Features are more important than the classifier!
- Must separate the feature space as well as possible into different classes.
- Should generalize to novel views of the same classes.
- 2 types:
-- Separatrix-based: explicitly encode the boundary (e.g. by a polynomial function)
-- Prototype-based: no explicit boundary, but derive cluster centers (prototypes) from the data
*Important classifiers*: Nearest Neighbor, SVM, neural networks.

Explain the idea of rectangle features. How are they computed? What is an integral image?

Rectangle features are simple features that consist of black and white parts and can detect simple geometric structures. There are edge features, line features, point features and diagonal line features. *Feature value = sum of values in the dark region minus sum of values in the bright region.* Similar to Haar wavelets.
- Used in real-time object recognition (e.g. faces, Viola&Jones).
- Approximation of Gabor filters in V1 (Gaussian modulated by a sinusoid); work very locally.
The idea of rectangle features is to categorize subsections of images by features that detect common observations, e.g. that in faces the eye region is darker than the cheek region. To detect this, a rectangle feature considers adjacent rectangular regions at a specific location in a detection window, sums up the pixel intensities in each region and calculates the difference between these sums. This is implemented with integral images. Scaling invariance is easily obtained by different window sizes.
*Integral image:* How to quickly compute the sum of values in a rectangular grid? The integral image is an image/table that contains at position x,y the sum of all pixels above and to the left of x,y (inclusive x,y):
*I(x,y) = i(x,y) + I(x-1,y) + I(x,y-1) - I(x-1,y-1)*
-> Efficient, recursive computation in a single pass over the image -> O(n) if n = #elements in the array. The sum of values of any rectangle (independent of its size!) then takes O(1), by lookup of 4 values (corners A, B on top, C, D on the bottom): *I(region) = I(D) - I(B) - I(C) + I(A)*
-> Extendable to continuous functions (probability theory: CDF) and to more dimensions (fMRI, voxels).
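A minimal numpy sketch of the integral image and the O(1) rectangle sum (random data as a stand-in for a real image):

```python
import numpy as np

def integral_image(img):
    """I[y, x] = sum of img over all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(I, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from at most 4 lookups (O(1) per rectangle)."""
    total = I[bottom, right]
    if top > 0:
        total -= I[top - 1, right]
    if left > 0:
        total -= I[bottom, left - 1]
    if top > 0 and left > 0:
        total += I[top - 1, left - 1]
    return total

img = np.random.randint(0, 256, (6, 8))
I = integral_image(img)
assert rect_sum(I, 1, 2, 4, 5) == img[1:5, 2:6].sum()
```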

Explain edge based segmentation (vs Region based segmentation)

Region-based: Try to detect high-level concepts by applying homogeneity conditions on multiple scales (thus hoping for a correlation between local and global features). Requires homogeneity within each region and hopes for sharp boundaries.
Edge-based: The inverse of region-based: requires sharp boundaries and hopes for homogeneity within each region. Start with edge detection, then connect the edges, e.g. via edge linking: merge neighbouring edge pixels.

What is the histogram of an image? What is a gradient image and how is it computed? What is a histogram of gradients? Name some applications.

*Histogram*: Frequency of gray values in the image. Can be analyzed, e.g. to enhance contrast (histogram equalization).
*Gradient image*: Detects edges in images by emphasizing luminance changes. Obtained by applying an edge-detecting convolution kernel like the Sobel or Laplace filter. Used e.g. for detection of object shapes.
*Histogram of gradients*: Summarizes the distribution of gradient orientations -> feature descriptor.
Magnitude: sqrt(I_x² + I_y²), Orientation: arctan(I_y / I_x)
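A small sketch of the gradient image (magnitude and orientation from Sobel derivatives); the random array is a hypothetical stand-in for a grayscale image:

```python
import numpy as np
from scipy import ndimage

img = np.random.rand(64, 64)          # stand-in for a grayscale image
Ix = ndimage.sobel(img, axis=1)       # derivative in x (responds to vertical edges)
Iy = ndimage.sobel(img, axis=0)       # derivative in y
magnitude = np.sqrt(Ix**2 + Iy**2)    # gradient image / edge strength
orientation = np.arctan2(Iy, Ix)      # arctan2 instead of arctan(Iy/Ix) avoids division by zero
```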

Linear vs Nonlinear Operators

*Linear* = homogeneous + additive -> L(a·g) = a·L(g) and L(g1+g2) = L(g1)+L(g2). *Nonlinear*: the opposite, e.g. logarithmic brightening.

PCA: What are the assumptions? (4)

- *Linearity*: restricts the set of potential bases.
- Mean and variance are sufficient statistics to capture the probability distribution of the data, i.e. the data points x_i must follow an exponential-family distribution (Gaussian, beta, Poisson, gamma). Distributions like the binomial, multinomial or Student's t are not always in the exponential family.
- Variance is what is interesting: if there is large variance along a particular direction, that information is kept, otherwise discarded.
- Principal components are orthogonal.

PR: What are the problems on the way from the signal (image) to the symbols (classes)? What are factors hampering/simplifying this pathway?

- Degree of *abstraction* in the signal (generalize from a character in a typed font or from a handwritten character to the symbol M)
- Degree of *variation* (the more variation, the harder: e.g. illumination, rotation, size, occlusion, background)
- *Distance* between signal and symbol (e.g. the symbol "straight line" is easier to get from pixels than "cat")

PCA with Images

- Developed for faces when researchers searched for a compact representation of faces. Other setups are possible as well, for example an idealized visual input to the retina (e.g. an image containing a hand and an object, both slowly shifting over the retina due to microsaccades).
- Requires that the images were obtained from a highly static, calm setup. Every pixel of the image is one feature, therefore high-level features like eyes need to be aligned.
- E.g. faces: same size, same conditions (illumination, angle, rotation), same cropping.
- In contrast to PCA on non-image data, the PCs themselves are very interesting. Reshape the PCs to images -> eigenimages that show which parts of the image varied most across the training set. The first eigenimages (large eigenvalues) have low spatial frequencies compared to later PCs.
- To get a compact representation of an image, project it into the feature space to get a feature vector. Mostly, 10-20 PCs/values(!) are sufficient even in large databases.
- Reconstruct the image from these 10-20 feature values by multiplying with the PC matrix.
- With a subset of M out of the N-1 eigenimages/faces (eigenvalue tradeoff) the N images from the original dataset can be (approximately) reconstructed.

SURF: Differences to SIFT

- SIFT computes an image pyramid by repeated blurring and downsampling, then DoGs, and then approximates the determinant of the Hessian at a potentially interesting position (maximum in the DoG) and scale.
- SURF approximates the convolution of the image with the second derivative of a Gaussian (Laplacian) by box filters (Haar-like features) whose responses can be computed in constant runtime with the integral image. This is FAST, since there is no image downsampling and convolving! In SURF, scaling on different levels is obtained by computing the general Hessian matrix (dependent not only on x,y but also sigma), i.e. using *rectangle features of different sizes* (constant cost). The filter size is doubled between octaves; scale levels of adjacent octaves overlap, because 3D non-maximum suppression cannot detect maxima on the borders. This approximation is acceptable, despite the Gaussian filter being optimal for scale-space representation, since discretization has to be done anyway (and is slower, of course).

Feature extraction. What can you tell about it?

- Template Matching - PCA - SIFT, SURF, Viola Jones

Name two different approaches of CV and describe them.

1) The data-driven approach tries to extract content from the image (bottom-up). 2) The model-driven approach tries to find a predefined model in the image (top-down).

Explain the idea of SIFT and describe the main steps. How do these steps contribute to the properties of the features?

1. *Create the scale space / Gaussian pyramid*. Reason: real objects have structures on several scales; make all of them available. Take the image and blur it a few times with Gaussian kernels of increasing variance (-> more blurring). This is the first octave. Then halve the image size and repeat. Result: e.g. 4 octaves, each with 5 scales (blurrings), as recommended by Lowe.
2. *Scale space extrema detection -> IP candidates*. Approximate the derivative. Target: locate corners and edges with the second-order derivative -> Laplacian filter. (Lindeberg: automatic scale detection, obtained by the normalized Laplacian, multiplied by sigma² to get scale invariance.) But: sensitive to noise -> Laplacian of Gaussian (blur before) is better. Still computationally expensive; approximate by DoG (just subtract adjacent blurred images) -> gives one image less than the number of scales. Take each DoG pixel, compare it to its local 26-neighborhood (8 on the same level, 18 on the adjacent scales, i.e. neighbouring DoGs) and define an IP candidate to be a minimum or maximum of this environment. An IP candidate is defined by x, y, octave.
3. *Select stable IPs*. Use a Taylor expansion to refine the location to subpixel precision. Discard small local extrema (low-contrast IPs), i.e. IPs with absolute DoG values below some threshold. Also discard keypoints on edges, using the Harris corner criterion -> matrix S = ((Ix², Ixy), (Ixy, Iy²)), a 2x2 matrix for every location. Edge: high change along one principal curvature, low along the other -> one high eigenvalue, one low. Keep only corner-like points -> two high eigenvalues.
4. *Keypoint orientation detection*. Collect gradient direction and magnitude around each keypoint, build a HoG, and assign the dominant orientation to the IP. If the 2nd largest histogram bin is > 0.8 * largest bin, another IP is created at the same position with a different orientation. All future operations are performed relative to the assigned orientation -> rotation invariance (necessary: isotropic kernel, DoG).
5. *Keypoint descriptor (feature generation)*. Take every IP, cut out the 16x16 pixels around it (at the selected scale), divide them into 16 sets of 4x4 pixels and compute an 8-bin histogram of gradient orientations for each, to get a feature vector of length 8x16 = 128 per IP (histogram of oriented gradients!). Subtract the IP orientation from all gradient orientations -> rotation invariance. The contribution to a bin is the gradient magnitude of the pixel, weighted by a Gaussian (distance decay). Normalize to get illumination invariance.
6. *Compare feature vectors* (via Euclidean distance) to a database of features of known objects (nearest neighbor) to infer candidates. If the ratio of the distance to the nearest and to the 2nd nearest neighbor is > 0.8 -> ambiguous -> discard. Then use the generalized Hough transform to identify clusters of features that all belong to the same object, and verify the class estimate via a least-squares solution.
Intuitively: a 3D histogram of gradient values over position and orientation.
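A minimal usage sketch with OpenCV (assuming opencv-python >= 4.4, where SIFT is in the main module; the file names are hypothetical), including Lowe's ratio test from the last step:

```python
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)   # hypothetical query image
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)    # hypothetical scene image

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)            # keypoints + 128-dim descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Lowe's ratio test: keep a match only if it is clearly better than the 2nd best one
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.8 * n.distance]
```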

SURF procedure details

1. *Rectangle features*: Use the integral image to compute responses to rectangle filters that approximate the Hessian matrix, i.e. the second-derivative image normally obtained with the Laplacian. First octave: start with filter size 9x9, then increase the size for further scales. Next octave: do not downsample, but double the filter size (to 18x18) for the first scale.
2. *Detect IP candidates*: For each scale and each position, compute the determinant of the Hessian (the derivatives were computed with box filters) and check whether *Dxx·Dyy - (0.9·Dxy)²* is above a threshold. If yes -> IP candidate.
3. *Non-maximum suppression*: discards IP candidates that are not maxima in their 3D neighborhood of 26 pixels (neighboring scales, same octave).
4. *Orientation computation*: Take a circular window of radius 6·sigma around each IP and compute the Haar wavelet response of every pixel in x and y direction. Take these 2 values, weight them by a Gaussian according to their distance to the center, and plot them in 2D (xResp, yResp). To assign the final orientation, sum up all x,y values in a sliding window, which yields a vector; take the longest vector as the orientation.
5. *Descriptor*: Construct a square region of size 20s centered on the IP, oriented like the IP. Subdivide it into 4x4 squares and compute the gradient responses for all pixels in each square (like above, Gauss-weighted). Assign a vector with 4 values (sum x, sum y, sum |x|, sum |y|) to each of the 4x4 squares to get a feature vector of length 64. Normalize -> contrast invariance. Shorter than SIFT's 128, so faster to compare.
6. *Matching*: like in SIFT. OpenCV benchmark: SURF is roughly 3 times as fast as SIFT with comparable robustness.

Template Matching: 1. Is it a data or model based approach? 2. What are the advantages and disadvantages? 3. In what situations would you apply template matching?

1. It is clearly a model-based approach. The model is our predefined template that we search for in the images -> top-down. Requires the user to do something before the algorithm can be used.
2. + Relatively fast, O(mnMN). + Very reliable if the template occurs almost identically within the real image. - Very sensitive to rotation and size changes of the template. - Not flexible at all, a template is always required.
3. If I know that the template occurs almost identically within the image but I need to find the precise location of the object. For example, a static video camera in quality control where I want to track the movement of the template object through the scene.

Could you give examples for IP-Detectors?

1. Moravec IP detector. Idea: an area is salient if it is unique in its own local environment. Take a window of a few pixels and compare it to itself, shifted by one pixel in 4 (or 8) directions (sum of squared differences between the window and its shifted copy). Save the minimum of the 4 (8) values for that pixel (the procedure is performed for every pixel) -> saliency map. *+* Easy to implement, fast. *-* Not rotation invariant (anisotropic), precision only 1 pixel, hard window (like a box filter).
2. Harris corner detector. Improvement over Moravec. Idea: combine the gradients in the neighborhood of a pixel. Slide a Gaussian instead of a box window -> larger window, smoother, isotropic: rotation invariant and robust to illumination changes (not scale invariant). It measures how much variation is created by a window moving around its neighborhood. How to: 1. Compute derivatives in x and y direction. 2. Smooth the image -> both done together with the Sobel operator (Prewitt, i.e. an extension of the 3x1 gradient filter, overlaid with a Gaussian). 3. Combine into the symmetric (!) 2x2 matrix S = ((I_x², I_xy), (I_xy, I_y²)), where I_x² is the x-derivative pointwise multiplied with itself. Check the eigenvalues at a particular (x,y) (all are real). If
- both small -> flat area
- both large -> corner
- one large, one small -> edge
-> Eigenvalue computation for a 2x2 matrix requires sqrt() -> slow. Instead compute det(S) - k·trace(S)² and tune k (usually 0.04) -> if > threshold, it is a corner. Further improved by Tomasi (not in the lecture).
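A short sketch of the Harris response along these lines (numpy/scipy; sigma, k and the threshold are illustrative values, the data is random):

```python
import numpy as np
from scipy import ndimage

def harris_response(img, sigma=1.0, k=0.04):
    Ix = ndimage.sobel(img, axis=1)
    Iy = ndimage.sobel(img, axis=0)
    # entries of the structure matrix S, smoothed with a Gaussian window (the "soft" window)
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2     # large positive -> corner, negative -> edge, near 0 -> flat

img = np.random.rand(64, 64)
corners = np.argwhere(harris_response(img) > 0.01)   # hypothetical threshold
```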

How to find principal curves?

1. Start with a smooth curve, e.g. the first PC of the data. 2. Project the data onto the curve and evaluate the expected value of being at its real position conditional on the projected position. 3. If that value is low, the curve is slightly pushed (bulged) towards that point, and step 2 is repeated for all points. -> Cannot cope with clustered data.

How do I compute eigenfaces?

1. Subtract the mean face (the per-pixel average over the N training images).
2. Compute the covariance matrix. Problem: if the pictures are of size KxK, there are K² features, so the covariance matrix is of size K²xK². In most cases we have N < K² images, which implies that the rank of the covariance matrix is at most N; after subtracting the mean face it is at most N-1. Accordingly, the number of eigenvectors/faces/images different from the zero vector is N-1, so the matrix of PCs returned by our PCA function will not be square but will contain N-1 PCs of size K². *How do I get these N-1 PCs?* The covariance matrix can easily be of dimension 50000x50000, which makes regular eigenvector computation hardly feasible. C was defined as AA^T. Instead we compute the eigenvectors of A^TA, which is of dimension 300x300 if we have 300 images in our training set (if v is an eigenvector of A^TA, then A·v is an eigenvector of AA^T).
3. Compute eigenvectors/values of the small covariance matrix.
4. Show the eigenfaces.
5. Project the training data into the small feature space.
6. Project the test data into the feature space and extract e.g. identity by distance in feature space.
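A minimal numpy sketch of the small-matrix trick; the face matrix is random stand-in data, and keeping 20 eigenfaces is an illustrative choice:

```python
import numpy as np

# A: each column is one vectorized training face (K*K pixels), N columns (hypothetical data)
N, K = 300, 100
A = np.random.rand(K * K, N)

mean_face = A.mean(axis=1, keepdims=True)
A = A - mean_face                                   # 1. subtract the mean face

# 2./3. eigenvectors of A^T A (N x N) instead of A A^T (K^2 x K^2)
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]                   # sort by decreasing eigenvalue
eigvals, V = eigvals[order], V[:, order]

eigenfaces = A @ V[:, :20]                          # A v is an eigenvector of A A^T
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)    # 20 eigenfaces, each of length K*K

# 5./6. project a (mean-subtracted) face into the 20-dimensional feature space
weights = eigenfaces.T @ A[:, 0]
```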

What is a feature detector?

Algorithm which takes an image and outputs locations (i.e. pixel coordinates) of significant areas in your image. An example of this is a corner detector, which outputs the locations of corners in your image but does not tell you any other information about the features detected.

Local PCA. Details for first method

1. Use clustering (e.g. K-means) to get K centroids. 2. Assign to each datapoint the closest cluster center k_i (based on a predefined norm) -> yields K subsets of the data. 3. Compute the mean of all datapoints assigned to cluster T_i (not the same as the centroid k_i) and subtract it from all values in T_i (for PCA). 4. For each of the K subsets, perform PCA and keep the L PCs with the largest eigenvalues. 5. Project the data into their corresponding(!) eigenspace/feature space. 6. Label the feature-space representation of each datapoint according to the desired classes. 7. Train K classifiers, one for each of the feature-space sets (the only supervised step). Drawback: the norm used by the clustering optimizes something like compactness, not suitability for PCA.
How to classify unknown input? 1. Find the nearest cluster center. 2. Subtract the mean (transform to the cluster's coordinates). 3. Project into the corresponding feature space. 4. Classify the obtained feature vector.

SURF: What is non-maximum suppression?

Two methods of detecting maxima in images. Could be used in SIFT to detect maxima in the DoGs, but was introduced in SURF as 3D non-maximum suppression.
1. Naive: Start with the first non-boundary pixel and compare it to all its neighbors. If a maximum is found: mark it. Then proceed to the next pixel and repeat. -> about 1.5 comparisons per pixel.
2. *Dynamic block algorithm*: with memory. If 2 neighboring pixels are identical, don't check the second one. If a maximum has been found, skip all values in its environment. -> about 0.815 comparisons per pixel. Neighborhood sizes: 1D -> 3, 2D -> 9, 3D -> 27.
Sketch of the 1D scan: i = 1; while i+1 < W: if g(i) > g(i+1): if also g(i) > g(i-1), then i is a maximum; i += 2. Else: advance i until the signal stops rising (g(i) > g(i+1)); if i+1 < W, that i is a maximum; i += 2.
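A runnable Python version of that 1D scan (3-neighbourhood, skip-ahead after every maximum; plateaus are counted once):

```python
def nms_1d(g):
    """Indices i with g[i-1] <= g[i] > g[i+1], skipping ahead after each detected maximum."""
    maxima = []
    i = 1
    while i + 1 < len(g):
        if g[i] > g[i + 1]:
            if g[i] > g[i - 1]:
                maxima.append(i)
            i += 2
        else:
            i += 1
            while i + 1 < len(g) and g[i] <= g[i + 1]:   # climb while the signal rises
                i += 1
            if i + 1 < len(g):
                maxima.append(i)
            i += 2
    return maxima

print(nms_1d([0, 1, 0, 2, 3, 2, 1, 4, 1]))   # -> [1, 4, 7]
```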

Local PCA. How could that be done (rough methods)

Two procedures:
1: 1.1 Search for data clusters (describe the large data set by a few prototypic datapoints). 1.2 PCA for each cluster. 1.3 Classify. + Straightforward. - Which metric for clustering? The clusters are e.g. compact but not necessarily well suited for PCA. -> Mostly sufficient.
2: Try to find the global optimum that approximates the data by a fixed number of clusters and PCs. + Theoretically optimal. - Hardly possible to obtain -> local optima -> iterative approximation: choose cluster centers, do PCA, improve cluster centers, do PCA, ...

SIFT in general

Algorithm to extract (and describe) distinctive local features in images. Each feature is described by a descriptor vector of length 128. These descriptors are invariant against scaling (the scale space allows similar features to be detected even when their size in the image varies), rotation (keypoint orientation), translation and other transforms of the image. If the image is transformed somehow, the keypoints will likely be found again in the transformed image (distance in R^128). By default the algorithm works on grayscale images but can be extended to color images (longer descriptors, better recognition, but slower). Used for stitching, object tracking, robot navigation, face recognition. Robots: determine the position from characteristic points in the scene, obtained by SIFT. Egomotion gives implicit information about the likely position of features in the new image (if the environment is stable).

Name morphological operators and their effects

All operate on binary images.
Erosion: output 1 iff all neighbors are 1 -> shrinks segments, removes small/unimportant objects.
Dilation: output 1 iff any neighbor is 1 -> enlarges segments.
Opening: first erosion, then dilation -> small objects disappear.
Closing: first dilation, then erosion -> small holes get closed.
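A one-screen sketch with scipy (random binary mask as hypothetical input, default 3x3 structuring element):

```python
import numpy as np
from scipy import ndimage

mask = np.random.rand(32, 32) > 0.7            # hypothetical binary image
eroded  = ndimage.binary_erosion(mask)          # 1 only where all neighbours are 1
dilated = ndimage.binary_dilation(mask)         # 1 where any neighbour is 1
opened  = ndimage.binary_opening(mask)          # erosion then dilation: removes small objects
closed  = ndimage.binary_closing(mask)          # dilation then erosion: fills small holes
```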

Pattern Recognition. What is this? What are the goals of pattern recognition? How can they be achieved? What are the problems?

Automatic discovery of regularity within data using algorithms. Like ML, but PR is traditionally an engineering approach:
- in PR the extracted features may be hand-crafted, not learned from examples as in ML
- PR is a term originating from CV and aims to *formalize, explain and visualize the patterns* (unlike many ML algorithms such as black-box neural nets).
Inherent problem: the task is only partially defined (an incomplete set of training examples defines the class; later, an incomplete set of features defines the object). Objects occur in a huge variety of transforms. What is a car? Take away components step by step and the class boundary blurs.
3 stages: 1) Sensor to capture the real world 2) Feature extraction (unsupervised or hand-crafted) 3) Classification (supervised, e.g. SVM)

Give RGB and HSV values of blue, magenta, gray, white, yellow

RGB -> HSV:
BLUE: 0 0 1 -> 240° 1 1
MAGENTA: 1 0 1 -> 300° 1 1
GRAY: 0.5 0.5 0.5 -> any 0 0.5
WHITE: 1 1 1 -> any 0 1
YELLOW: 1 1 0 -> 60° 1 1
Hue order: RGB; value 0 = black.

Nonlinear alternatives to PCA

Drawbacks of plain PCA: the linearity assumption about the data; many large eigenvalues if the data forms many clusters.
1. PRINCIPAL CURVES: nonlinear extension of PCA; approximates the shape of the data in space with a smooth 1D curve that provides a nonlinear summary; nonparametric like PCA. PLUS: captures nonlinear data. MINUS: cannot cope with clustered data; the dimensionality needed for a real description is unknown; high computational effort (iterative approximation).
2. LOCAL PCA (LPCA): several local PCAs rather than one global one -> locally linear, globally nonlinear.

What is AdaBoost?

Best out-of-the-box classifier: linearly combine a set of binary, weak classifiers into one strong binary classifier. Initialize a set of weights over the training samples. In each round, classify the data and select the weak classifier that minimizes an exponential error measure. Add it with a logarithmic weight to the strong classifier, provided it does not correlate with an already selected weak classifier. The sample weights are updated such that the contribution of correctly classified samples is lowered. If another classifier has...
... also a good prediction and is uncorrelated to all other selected classifiers -> add
... a good prediction but is correlated to a selected classifier -> do not add
... a random prediction -> do not add
... a bad prediction -> change the sign, then add.
Used e.g. in the Viola&Jones algorithm: H(x) = sign(sum_i alpha_i * h_i(x)), where the alpha_i are coefficients and the h_i are the weak classifiers. The result is a strong classifier saying "face/noFace" based on a subset of all possible weak classifiers (each detecting a particular Haar feature).
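A compact sketch of discrete AdaBoost with 1D threshold stumps as the weak classifiers (labels y in {-1, +1}); this is a generic illustration, not the exact Viola&Jones variant:

```python
import numpy as np

def train_adaboost(X, y, n_rounds=10):
    """Discrete AdaBoost with brute-force threshold stumps. X: (n, d), y in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                         # sample weights
    strong = []                                     # list of (alpha, feature, threshold, sign)
    for _ in range(n_rounds):
        best = None
        for j in range(d):                          # pick the stump with the lowest weighted error
            for t in np.unique(X[:, j]):
                for s in (+1, -1):                  # "bad prediction -> change sign, then add"
                    pred = s * np.where(X[:, j] > t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s, pred)
        err, j, t, s, pred = best
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)       # logarithmic weight of the weak classifier
        strong.append((alpha, j, t, s))
        w *= np.exp(-alpha * y * pred)              # misclassified samples gain weight
        w /= w.sum()
    return strong

def adaboost_predict(strong, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in strong)
    return np.sign(score)                           # H(x) = sign(sum_i alpha_i * h_i(x))
```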

Explain region merging and split and merge? What are the differences between them?

Both are region-based segmentation methods.
Region merging: bottom-up approach. Each pixel is initially treated as its own segment. If neighbouring segments (in the beginning: pixels) satisfy the homogeneity condition, they get merged into a single segment. Stop when no more segments can be merged. The homogeneity condition needs to be defined by the user. Used to find connected objects in images.
*Split and merge*: first top-down, then bottom-up; better runtime. The entire image = 1 segment. Check whether the homogeneity condition (HC) is fulfilled. If not, add a new layer to the quad tree (divide into 4 squares of equal size). Check the HC for each new node, and finally merge neighbouring segments bottom-up that fulfill it.

Which concepts for Interesting Point (IP) detection exist? What are the requirements to IP?

CONCEPTS: IP detection is the first step in local feature analysis and should work properly, otherwise the game is over early. IPs should be robust against disruptions.
*Context-free detection:* maxima of a predefined saliency measure (the same for all images), like the maxima of the DoG in SIFT.
*Context-dependent detection:* global pre-analysis, e.g. a set of properties that are interesting for a particular image, such as a white spot.
REQUIREMENTS: should be salient and rare within the image/within all images; should be stable under disruptions -> characteristic and prominent.

K-Means - Pseudocode

Clustering groups a set of objects (e.g. pixels) such that objects of the same group are more similar to each other than to those of other groups.
1. Randomly initialize K centroids, i.e. cluster centers.
2. WHILE the position of any centroid moved more than a threshold:
2.1 Compute the closest centroid for each datapoint (in a given metric)
2.2 Update each centroid position; set it to the mean of all points belonging to that centroid.
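A direct numpy translation of that pseudocode (Euclidean metric, random initialization):

```python
import numpy as np

def kmeans(X, k, tol=1e-4, max_iter=100):
    """Plain K-means; X is (n_samples, n_features)."""
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # 1. random init
    for _ in range(max_iter):
        # 2.1 assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2.2 move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:       # stop when centroids settle
            break
        centroids = new_centroids
    return centroids, labels
```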

Local Features Motivation

Common problems of object recognition are: occlusion, illumination, background variation. But also object transforms: rotation, translation, scaling -> See object as set of local features that are relatively stable against transforms. Small, local features are less likely to be occluded. If a considerable subset of the local features of a given object is found in a reference image, it is reasonable to conclude that the object is the same -> Well suited for object recognition. General procedure: 3 steps: 1. Find potential keypoints in scale space and verify whether they are stable IPs 2. Compute descriptor (feature vector) based on local gradients but anchored to orientation of IP itself. 3. Sort descriptors to classes/objects and measure similarity (euclidean distance)

Template Matching: (Pearson) Correlation Coefficient. How does it work?

Compared to MAD, CC is a more sophisticated approach to use template matching to recognise patterns in images. Again we shift a small template over a larger image and compute the similarity between both at all positions. CC is a statistical concept: a dimensionless measure of the linear relation of two variables (at least interval-scaled), in the range [-1,1]: CC := cov(g,T) / (std(g)·std(T)). Result = 1 -> identical up to a linear transform with scaling > 0; 0 -> no correlation; -1 -> identical up to a linear transform with scaling < 0.
If the template T is big, the computation may become slow. Solution: project into Fourier space, apply the convolution theorem, and project back. The naive cost is O(mnMN): the m x n template is evaluated at roughly M·N image positions, and at each position the mean, standard deviation and covariance of the image crop each cost O(mn) (the statistics of the template itself are computed once beforehand). So CC is slightly slower than MAD, which needs only a single pass over the mn template pixels per position.
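A naive (O(mnMN)) sketch of the correlation-coefficient map; both inputs are assumed to be grayscale float arrays:

```python
import numpy as np

def correlation_map(image, template):
    """Pearson correlation of the template with every fully contained image window."""
    M, N = image.shape
    m, n = template.shape
    t = template - template.mean()
    t_std = template.std()
    out = np.zeros((M - m + 1, N - n + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            window = image[y:y + m, x:x + n]
            w = window - window.mean()
            denom = w.std() * t_std
            out[y, x] = (w * t).mean() / denom if denom > 0 else 0.0
    return out    # values in [-1, 1]; the peak marks the best match
```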

Convolution Theorem. How could it help in Template Matching?

Convolution in the spatial domain is the same as a pointwise multiplication in the frequency domain. This can be used to speed up the computation of a convolution: direct convolution is O(n²·m²) (n x n image, m x m kernel), while the FFT of the image is O(n² log n). Convolution can also be sped up by separable kernels, which reduce the per-pixel cost from m² to 2m. Procedure: project f and g into Fourier space, multiply pointwise, and project back.
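A small sketch of convolution via the convolution theorem (zero-padding the FFTs so the circular convolution equals the linear one), checked against direct convolution:

```python
import numpy as np
from scipy.signal import convolve2d

def fft_convolve2d(f, g):
    """Linear 2-D convolution via pointwise multiplication in the frequency domain."""
    s0 = f.shape[0] + g.shape[0] - 1
    s1 = f.shape[1] + g.shape[1] - 1
    F = np.fft.rfft2(f, s=(s0, s1))
    G = np.fft.rfft2(g, s=(s0, s1))
    return np.fft.irfft2(F * G, s=(s0, s1))     # multiply, then transform back

f, g = np.random.rand(64, 64), np.random.rand(9, 9)
assert np.allclose(fft_convolve2d(f, g), convolve2d(f, g, mode="full"))
```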

Approach of Lowe to Descriptor choice. How did he decide whether a keypoint gets an object identity label?

The database consists of tons of feature vectors of length 128, each labelled with an object/entity. Naive approach: take a feature vector, search for the closest vector (Euclidean distance) and assign its label/entity. -> Bad, because background IPs also get assigned identities. Better (Lowe): 1) Discard if the absolute distance is too high. 2) Search for the 2 closest vectors and build their ratio; if > 0.8 the IP gets no label, otherwise it gets the label of the closest one. Other variants are possible, e.g. the PCA-SIFT descriptor (also computed relative to the orientation) that uses patches of size 41x41 around every IP and later compresses them to length 20. To detect real objects, a cluster of IPs in one image should show similarity to a cluster in the other image.

How can I detect edges?

The derivative would be great, but the image is discrete. Approximate it by a convolution filter; the simplest version is the 1x3 filter [-1 0 1], which gives no response in constant regions but high responses at vertical edges. This gradient filter is usually used in x and y direction. Extending it to a 3x3 filter gives the Prewitt operator, which is less noise-sensitive than the 1x3 filter.
Sobel: Prewitt operator overlaid with a binomial (Gaussian) filter so that values further away from the center contribute less. Approximates the first derivative and is not direction invariant (exists in 4 directions): [1 0 -1; 2 0 -2; 1 0 -1]. Disadvantage: relatively high false negative rate.
Laplace: approximates the 2nd derivative: [0 1 0; 1 -4 1; 0 1 0]. Direction invariant! Better edge detection than Sobel, but the result is noisy -> preprocess by blurring (Laplacian of Gaussian).

Difference between local and global contrast

Global contrast: (Max-Min)/MaxPossible. =1 if entire range of values used across entire image. Local contrast: Average gray value difference of every pixel to every pixel in its N4 or N8. -> enhances structures

What are Haar-Wavelets?

The Haar wavelet was the first wavelet (Haar, 1909). What are wavelets? The wavelet transformation is a linear time-frequency transformation that uses a particular, new set of basis functions (beforehand only splines, piecewise polynomials and trigonometric functions were used as basis functions, as in the FT). It allows a target function over an interval to be represented by a set of orthonormal basis functions. Advantage: trigonometric functions like cos and sin are local only in the frequency domain, not in the time domain (due to their periodicity); most wavelets are local in the time domain. Disadvantage of the Haar wavelet: not continuous -> not differentiable.
*Haar wavelet*: the simplest wavelet, a combination of 2 rectangular/box functions; 2x2 transform matrix [1 1; 1 -1]. Haar wavelets/features can be used to detect faces, e.g. by overlaying the rectangular function over the eyes (eyes are darker than cheeks).

SIFT/SURF Harris Corner? Why Hessian Matrix

The Hessian matrix describes the 2nd derivatives in all directions (xx, xy, yy). Its eigenvalues describe the strengths of the principal curvatures of the image surface. If both eigenvalues are high, we are located at a corner/blob, whereas one high and one low principal curvature denotes an edge. SIFT (edge rejection): *det(A) - k·trace(A)²* with k = 0.04. SURF (detection): only the determinant, *Dxx·Dyy - (0.9·Dxy)²*.

PCA aim covariance matrix

High covariance means there is redundant information between features; we want to remove that, so the covariances should be around 0. High variance means many different values for a particular feature; we want to maximize that. *Therefore, PCA asks whether there is a linear combination of the original basis vectors such that the data, projected onto the new basis vectors, has a diagonal covariance matrix. After the projection the different features are entirely decorrelated, since the covariance of every pair is 0.*

Relation of Template Matching to biological vision

Human vision makes use of low-level templates such as simple and complex cells in V1 that detect edges and can be modelled via Gabor filters. Real-world input has many degrees of freedom per persistent object, i.e. the visual input of the same object varies as a function of viewpoint, illumination, rotation, contrast, size/distance etc. Therefore, grandmother cells would not only need to exist once per object but once per view of the object. Impossible due to combinatorial explosion.

Can you describe the pathway from an IP to a Region? And pathway from region to object/entity? What are requirements for a good descriptor

An IP is just a point -> by itself not very informative. IP -> Region:
1) Naive: associate a region of predefined size with the IP.
2) Scale-adaptive: IPs have been searched on different scalings of the image; the size of the region depends on the scale on which the feature was found.
3) BLOB (Binary Large OBject): a region of the image where all pixels are somewhat similar (share a property). Convolution filters like the Laplacian are also blob detectors.
Region -> Entity: an entity is recognized by a set of local features, called descriptors, usually vectors in feature space. Requirements: these descriptors should be *unique, allow fast comparison (Euclidean distance in feature space), and be robust against variation in the IPs as well as in the recording conditions*.

When is PCA used?

If the data is of too high dimensionality and there is a lot of similarity within the data. Then the redundancy potential is high and PCA will have a large effect. A single Hebbian neuron can be used to extract the first PC of a given input (its weight vector converges to the eigenvector with the largest eigenvalue).

What is the idea of Fourier Transform, and why is it useful for image processing? Can you provide a formula? Why is it called an orthogonal transformation? Which aspects of an image can be recognized in its Fourier transform?

Image can be seen as a superposition of a bunch of frequencies. High frequencies for fine structures (borders), low frequencies model constant colourings (sky). FT can decompose an image into a set of frequencies (linear) thus giving an overview of the spectrum of an image, i.e. how strong individual frequencies are. - It is called orthogonal because the basis functions form an orthogonal basis. Result of FT are 2 images: 1. Amplitude image: Information about orientation of borders/content 2. Phase image: Information about position of borders/content
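The question asks for a formula, which the answer above does not give; the standard discrete 2D Fourier transform of an M x N image f is
F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y) \, e^{-2\pi i (ux/M + vy/N)}.
The amplitude image is |F(u,v)| and the phase image is arg F(u,v).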

Image Processing. Define

Improving images to simplify analysis by humans: filtering irrelevant information, enhancing relevant information (medical scans). Also: compression; repair of bad acquisition conditions (contrast etc.).

relation of CC to convolution

For the comparison, assume the template has an average of 0. Then we do not have to subtract the image mean at every step while computing the sum (it cancels, because the zero-mean template sums to 0), so we end up with something like a convolution divided by the product of the standard deviations. But unlike convolution, CC is not associative. Because convolution is associative we can concatenate filters, as in the derivative-of-Gaussian filter (derivative and Gaussian kernel combined into one); with CC we cannot combine two templates into one.

Techniques to fasten Template Matching

In general, if the template is too big, the resolution of template and image can be reduced by the same factor and the search performed on the downsized image. The computation can also take place on a multiscale image pyramid such as the Gaussian/Laplacian pyramid. If CC is used, apply frequency-domain filtering: project image and template into Fourier space, compute the pointwise product and project back (convolution theorem).

Explain Histogram of oriented gradients (HoG)

HoG is a feature descriptor that can be used for object detection via classification. Idea: local features of objects can be described by the distribution of gradient directions and magnitudes in their vicinity. E.g. used in SIFT, where we assign an orientation to each IP; later computations (e.g. comparison in feature space) are done relative to that orientation, thus achieving rotation invariance.
*Procedure:* Take a set of cells of equal size (e.g. 10x10 pixels; in SIFT this is only done around IPs) and collect the gradient direction and magnitude of every pixel in the local environment (both obtained from differences in the original image). Create a histogram of gradient orientations with 36 bins (10° each); the amount added is proportional to the gradient magnitude. -> The maximum of the histogram is the orientation assigned to the keypoint. Histogram peaks > 80% of the main peak create new IPs at the same location but with a different orientation.
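A small sketch of the 36-bin, magnitude-weighted orientation histogram for one cell (random data stands in for a 10x10 image cell):

```python
import numpy as np
from scipy import ndimage

def cell_histogram(Ix, Iy, n_bins=36):
    """Orientation histogram of one cell; each pixel votes with its gradient magnitude."""
    mag = np.sqrt(Ix**2 + Iy**2)
    ang = np.degrees(np.arctan2(Iy, Ix)) % 360          # orientations in 0..360 degrees
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 360), weights=mag)
    return hist

cell = np.random.rand(10, 10)                            # hypothetical 10x10 cell
Ix = ndimage.sobel(cell, axis=1)
Iy = ndimage.sobel(cell, axis=0)
hist = cell_histogram(Ix, Iy)
dominant_orientation = hist.argmax() * 10                # bin width 10 degrees
```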

Convolution

Local Operator, Linear and homogeneous. Result at every position is scalar product of filter kernel with image patch of corresponding size

What are local features and what are they used for? Name some examples?

Local features are features extracted by computer vision methods that describe local image patches. They are used to characterize objects depicted in an image. Applications: object detection and recognition (e.g. face detection, Viola&Jones with rectangle/Haar features), SIFT/SURF.

Huffman Coding example

Lossless compression method. Problem: each pixel normally requires 8 bits of memory (256 gray values; a binary image needs only 1 bit). Idea: examine the histogram and assign frequent values to short codes.
Algorithm: order the values by frequency. Generate a binary tree by iteratively merging the 2 lowest-probability nodes into a parent node, until only the root remains. Then assign binary codes by walking down the tree, appending 0 for one branch and 1 for the other.
- We need the tree for decoding.
- Why not take binary codes of increasing length (0, 1, 10, ...)? That code is not prefix-free -> we would additionally need to store the gaps between the codes.
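A minimal Huffman-code sketch using a heap (the sample gray values are illustrative; ties are broken by insertion order):

```python
import heapq
from collections import Counter

def huffman_codes(values):
    """Build prefix-free codes from value frequencies (shorter codes for frequent values)."""
    freq = Counter(values)
    # heap entries: (frequency, tie-breaker, {value: code-so-far})
    heap = [(f, i, {v: ""}) for i, (v, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)          # the two least frequent nodes ...
        f2, _, c2 = heapq.heappop(heap)
        merged = {v: "0" + code for v, code in c1.items()}
        merged.update({v: "1" + code for v, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))   # ... are merged into a parent node
        tie += 1
    return heap[0][2]

codes = huffman_codes([0, 0, 0, 0, 255, 255, 128])
# -> {0: '1', 255: '01', 128: '00'}: the frequent gray value 0 gets the shortest code
```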

Template Matching: Please compare Mean Absolute Difference (MAD) with Correlation Coefficient (CC)

MAD is the simplest and most intuitive approach. It is much weaker at recognizing the target object in the image under realistic variation (viewpoint, illumination etc.), compared to CC. [In both approaches we shift the template over the image like a convolution kernel and compute the similarity between the template and the image at every position.] In MAD, the metric is the absolute difference between the template pixel and the underlying image pixel; by summing all values up we get a similarity measure where a low value indicates high similarity. In CC, the metric is the Pearson correlation coefficient, a dimensionless measure from statistics. CC is invariant to linear transforms of the pixel values, MAD is not (it only cares about absolute numbers); thus CC is partially robust against illumination and color intensity variations. The runtime (O-notation) is the same: O(mnMN).

How to find a proper template for Template Matching

Make use of eigenspaces. Collect a number of shots of the template under varying conditions (viewpoint, illumination, contrast etc.), then project this image set into the eigenspace to get the eigenimages/eigenvectors that contain the most invariant features of the template.

Monadic and Dyadic Operators

Monadic: takes one image into consideration, e.g. regular point/local/global operators. Dyadic: takes 2 images into consideration (e.g. overlay an image of an object with a text).

Moravec and Harris corner detection

Moravec: compare a pixel with its surroundings by using 4 shifted versions of the image; the Moravec value is the minimum of the 4 comparisons. If it is larger than a threshold -> corner.
Harris: uses a "soft" (Gaussian) window, uses the gradients to compute the structure matrix (second-moment matrix) and considers its eigenvalues, e.g. the ratio of the largest to the smallest. The eigenvalues describe the principal curvatures of the local autocorrelation surface.

Template Matching. What is that and when can it be applied?

Most straightforward approach to Pattern Recognition, relatively brute force. Match the similarity of a prototypic, predefined template (a ball) with the image by moving the template (like a convolution kernel) over the image. Similarity can be measured by different metrics, we learned 2: 1. Mean Absolute Difference (MAD) 2. Correlation Coefficient (CC) Application: In industry (quality control). Is a product packed in the desired way? In medical image processing (X-Ray) Very slow algorithm (in the range of O(mnMN), sum up m*n values for M*N times)

Rectangle features: How can the problem of not being rotation invariant be tackled?

Not rotation invariant, because simple rectangle features have directions and are oriented either clearly in x or in y direction. Lienhart: rotate the Haar features by 45° -> diagonal (rotated) integral image, built with the recursive formula I(x,y) = i(x,y) + i(x,y-1) + I(x-1,y-1) + I(x+1,y-1) - I(x,y-2). The sum of values in a selected, rotated rectangle (corners A, B, C, D from top to bottom) is again *I(region) = I(D) - I(B) - I(C) + I(A)*. -> Significantly improves detection performance.

PCA Interpolation. Why is it useful, how does it work and how do I compare a test datapoint now?

PCA can be used as an unsupervised feature extraction method whose features are later used by a classifier. E.g. trained on different objects, each presented in different views, the training images become points in the feature space. The images of a particular object hopefully form a cluster in feature space. -> IDEA: a set of discrete points in feature space is bad: *no center, no clear boundary, no information about size, the infinitely many intermediate views are not captured by the training set.* Use splines to interpolate the object manifold in feature space. Classify a new datapoint by searching for the closest spline. Drawback: in high-dimensional space the interpolation is unlikely to be smooth and to really approximate intermediate views.

What is a point operator, a local operator, a global operator? Provide examples for each of them. Which are linear, which are not? Give an example for a non-homogeneous operator. Describe application scenarios for different operators. What is a rank filter?

POINT operator: operates on single pixels independently, e.g. object/background separation based on a simple threshold (like the mean).
LOCAL operator: takes the local environment of a pixel into consideration, e.g. convolution kernels (box filter, Gaussian filter) or morphological operators (erosion).
GLOBAL operator: takes the whole image into consideration (e.g. contrast enhancement by histogram equalization).
HOMOGENEOUS: the mapping is independent of the location (translation invariant). NON-HOMOGENEOUS: the mapping may depend on the location (e.g. local threshold brightening).
Applications: denoising, edge detection, blurring.
Rank filter: rank filters are all those filters that *sort* the values from a given neighborhood and then select one, such as min/max filters or the median filter.

Eigenfaces Pros and Cons

PRO: - Training is completely unsupervised and nonparametric - Easy to implement - Very compact representation -> face recognition of training and test faces in real time. CON: - Very sensitive to lighting, translation, rotation (controlled setup required) - The most significant eigenfaces mostly capture illumination (not e.g. facial expression) -> a performance boost is possible if the first 3 eigenfaces are discarded.

What is the Viola & Jones algorithm?

Pattern recognition algorithm for real-time object detection, developed in 2001. Developed for faces but usable for other targets (e.g. pedestrians) as well. Stages:
*1. Haar/rectangle features.* All possible Haar features/filters are initially considered (or a subset is already discarded). The features shall make use of knowledge about the target object/face (eyes darker than cheeks, nose bridge brighter than eyes). Searched on multiple scales, starting with 24x24 windows.
*2. Integral image computation.* Allows computation of Haar features at any position in constant runtime. Ends with a set of positions where some Haar features were highly responsive.
*3. AdaBoost training.* There are about 160k Haar features but only very few are useful. In each training step the best weak classifier (obtained greedily) is added to a strong classifier with a weight reflecting its training performance. Strong classifier = linear combination of weak classifiers. Result: a strong classifier consisting of about 2500 weak ones.
*4. Cascading classifiers.* Problem: evaluating 2500 features at every location on all scales is still computationally hard. -> Use a cascade: *the set of weak classifiers obtained by AdaBoost is arranged in a cascade of strong classifiers.* The first 10 weak classifiers -> first-level strong classifier, the next 20 weak classifiers -> 2nd-level strong classifier, and so on. -> Each stage has high sensitivity (TPR) but low precision, i.e. high TP, low FN and high FP. E.g. the first layer consists of 2 simple Haar classifiers and has 100% TPR but 50% FPR -> it already sorts out 50% of all windows; the next layer only processes the remaining 50%, and so on. Strategy: for a sequence of 30 stages, the final TPR is the product of all stage TPRs (each 0.993 -> final TPR ≈ 0.8). The FPR is likewise the product of the stage FPRs and therefore drops rapidly towards 0. *Per stage, false positives are accepted in order to keep the detection rate high.* Speed: a sequence of 10 classifiers with 20 features each is about 10x faster than 1 classifier with 200 features.
PRO (4): fast, robust, not only for faces, location and scale invariant (no scale space; the features are scaled instead!). CON: requires frontal views, sensitive to illumination and rotation, multiple detections possible.

Computer Vision. Define

Recognition of parts of the image by computers. Understand what is depicted on images. Compared to image processing, it does not change the image. - Feature detection - Motion detection - Classification (shapes, objects)

robustness of Correlation Coefficient (Template Matching)

Robust to noise: YES (averages out).
Robust to color/illumination perturbations (gray value intensity changes): YES, BUT the ratio of gray values in the template needs to match the ratio in the image (a linear transform between the template and the template in the image is okay, but not e.g. if the background is identical in both while the template is bright and the image dark).
Robust to rotations: NO (humans are not either).
Robust to contrast changes: YES.
Robust to viewpoint: NO.

Compare SIFT and rectangle features. Name pros and cons.

SIFT creates a scale space for the given image (resolution, blurring) and finds IPs in this space. The IPs are described as rotation-invariant vectors in R^128 and compared by Euclidean distance (nearest neighbor) to the feature vectors of a database. Viola&Jones use simple rectangle features whose responses can be computed in constant runtime via integral images. To extract a subset of the mass of rectangle features, a cascade of classifiers trained by AdaBoost combines a set of simple, weak classifiers into one strong one.
PRO rectangle features: faster than SIFT; location invariant, scale invariance is easy (windows of different sizes rather than a Gaussian pyramid).
CONTRA rectangle features: rotation invariance is hard to obtain (Lienhart: rotate the Haar features), whereas it is easy in SIFT; multiple detections possible (not so in SIFT); most effective on frontal images (not so in SIFT).

SIFT + SURF Pros and Cons

SIFT: a breakthrough in CV. Initially intended for object recognition, nowadays it has a bunch of applications. From the perspective of SIFT:
+ Feature recognition that is relatively robust against scale, rotation, translation, illumination, viewpoint, partial occlusion etc.
+ Compared to SURF: when SURF's rectangle filters are upscaled, only certain scales can be computed.
- Relatively slow compared to SURF, due to downscaling and convolving.
- Only partly invariant against affine transforms (linear transform + translation), though it is invariant against each of them by itself; SURF claims to be somewhat more invariant against affine transforms.
- SURF is more robust against noise than SIFT, since it sums gradients in two directions rather than storing individual directions.

Provide Shannon sampling theorem and explain its applications with an example

Sampling theorem: if k is the highest frequency of a measured signal, then the sampling rate has to be higher than 2k to capture enough information to unambiguously reconstruct the signal (e.g. a 3 Hz component requires sampling at more than 6 Hz). Otherwise *aliasing* occurs: discretisation of a continuous signal with subsequent interpolation may cause an incorrect interpretation if the sampling frequency is lower than twice the highest signal frequency.

SURF invariance how?

Scale invariance: by rectangle filters of increasing size.
Rotation invariance: by aligning the descriptor window of size 20s according to the orientation of the IP.
Contrast/illumination invariance: by normalization of the final feature vector.
Translation invariance: the same computations are done for every pixel.

What is an object? And what is the motivation for dimensionality reduction methods?

An object is the set of all of its possible views, i.e. points in an MxNxC space (image size, color channels). The complexity of an object depends on its appearance and its visual degrees of freedom (rotation, translation, illumination); properties like constant colouring or symmetry reduce the complexity. Idea behind dimensionality reduction: the views of an object fill only a tiny part of the huge MxNxC pixel space, so a much more compact representation with few features is possible.

Template Matching. How does the Mean Absolute Difference approach work? Is it robust against xy? Why (not)?

We have a small template of the target object and a larger image which potentially contains the template. For every location of the small template on the large image: compute the absolute difference between every template pixel and the underlying image pixel and average. A small value -> high similarity.
*Robustness*
- Noise: yes, because of averaging over pixels
- Linear transforms (gray/color intensity change): no, the match scores get disturbed
- Since it only cares about differences of absolute values, it is NOT robust against rotation, contrast, size, viewpoint, illumination. Hardly useful for real-world applications.
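A naive numpy sketch of the MAD map (both inputs assumed to be grayscale float arrays):

```python
import numpy as np

def mad_map(image, template):
    """Mean absolute difference of the template against every image window (small = similar)."""
    M, N = image.shape
    m, n = template.shape
    out = np.empty((M - m + 1, N - n + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.abs(image[y:y + m, x:x + n] - template).mean()
    return out

# best match = position with the smallest MAD
scores = mad_map(np.random.rand(64, 64), np.random.rand(8, 8))
best_y, best_x = np.unravel_index(scores.argmin(), scores.shape)
```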

SURF. What is it and name main steps.

Speeded Up Robust Features - Local feature detector, extension of SIFT - Claims to be significantly faster than SIFT and to be more robust against affine transformations. - combines SIFT with idea of rectangle features - biologically motivated by saccades (?) Steps: 1. Rectangular features (integral image) 2. IP detection (Hessian matrix) 3. 3D Non-Max-Suppression 4. Orientation assignment 5. Feature vector description 6. Matching

Viola&Jones: How is the cascade built?

Successively. In each step we evaluate all features at all locations and all scales. Then we greedily select the feature with the best positive (or negative, then flip the sign!) separation, as in AdaBoost. Training samples rejected by the previous layers of classifiers are excluded from the training of the next classifier.

How do I do a PCA? Explain how principal components can be computed?

Suppose you have N data points with F features; the data matrix D is then of size NxF.
1) *Demeaning*: take each of the F feature vectors of size Nx1, compute its mean and subtract it.
2) Compute the *covariance matrix* C := (1/(N-1)) · D^T·D, i.e. correlate each feature with every other feature. This matrix contains the variance of each feature on the main diagonal and the covariance between all pairs of features off the diagonal. (The product only yields the covariance matrix if every feature was demeaned beforehand.) For PCA with images, C may be too large; then compute DD^T, the covariance matrix of the samples (same rank; the eigenvectors of C can easily be obtained from it). Since D^T·D is symmetric, there are no complex eigenvalues.
3) Compute the eigenvectors and eigenvalues of C. Each eigenvector has length F and is a principal component, a linear combination of all features.
4) Sort the eigenvalues in descending order and sort the eigenvectors accordingly (the larger the eigenvalue, the more important the PC).
5) Check a histogram of the eigenvalues to find a cutoff: which PCs are kept and which are discarded. -> Depends on the task; often projecting onto the first 2 or 3 PCs is useful (easy visualization). In the case of eigenimages, show all PCs by themselves.
6) Project the data by multiplying the data matrix with the matrix of the most important eigenvectors to get a compact representation in the feature space.
7) Hopefully the datapoints from different classes (e.g. male/female participants) form clusters, such that classification of new datapoints (projected into the same feature space) is simple.
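A direct numpy translation of steps 1-6 (random data as a hypothetical example, 2 components kept for visualization):

```python
import numpy as np

def pca(D, n_components):
    """PCA of a data matrix D (N samples x F features). Returns PCs, eigenvalues, projection."""
    D = D - D.mean(axis=0)                        # 1. demean every feature
    C = (D.T @ D) / (len(D) - 1)                  # 2. covariance matrix (F x F)
    eigvals, eigvecs = np.linalg.eigh(C)          # 3. symmetric -> real eigenvalues
    order = np.argsort(eigvals)[::-1]             # 4. sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    pcs = eigvecs[:, :n_components]               # 5. keep the leading PCs
    return pcs, eigvals, D @ pcs                  # 6. project data into the feature space

X = np.random.rand(100, 5)                        # hypothetical data: 100 samples, 5 features
pcs, eigvals, projected = pca(X, n_components=2)
```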

PCA: What happens if you multiply a covariance matrix C with one of its eigenvectors v? How are the eigenvalues e correlated with C?

The eigenvector of a matrix (mathematically speaking, of a linear transformation) is a vector whose direction does not change when the linear transformation is applied to it (multiplication with the matrix). The eigenvalue of the eigenvector describes how much the length of the vector changes. The equation is: C·v = lambda·v. That is, if you multiply C with an eigenvector, you get the same result as if you multiplied the eigenvector with its eigenvalue.

What is the idea of Hough transform? What is an accumulator space? How to determine its dimensionality? Can you interpret the linear Hough space?

Tries to find particular shapes in images. By default lines, but it can be adapted to circles and other parametric shapes; the dimensionality of the accumulator space equals the number of free parameters of the shape (2 for lines, 3 for circles).
1. Edge detection.
2. Binarization (thresholding for clear edges).
3. For every edge point in the binary image, vote for all possible lines through it (each edge point transforms to a curve in the accumulator space; in the slope-intercept parametrization it is a line, which explains the "linear Hough space").
4. A line that passes through multiple edge points produces a peak in the accumulator space. The accumulator space is spanned by the distance and angle of the line to the origin (Hesse normal form of the line).
5. From the positions of the peaks in the accumulator space, the positions of the lines in the image can be inferred (distance, angle).
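A minimal accumulator sketch for lines in Hesse normal form (rho = x·cos(theta) + y·sin(theta)); the diagonal edge image is a toy example:

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Vote accumulator over (rho, theta) for a binary edge image."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))                # 0..179 degrees
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)     # rows: rho in [-diag, diag]
    ys, xs = np.nonzero(edges)                             # edge points
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1          # one vote per theta per edge point
    return acc

edges = np.eye(50, dtype=bool)                             # hypothetical diagonal edge
acc = hough_lines(edges)
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)   # strongest line
```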

What is the idea of color segmentation and does it give any advantage? Provide an algorithm in pseudocode.

Try to find segments of (roughly) constant colour. The homogeneity condition is the distance in color space. E.g. K-means clustering, where the centroids are RGB triplets (see the sketch below).
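Since the question asks for pseudocode, here is a short Python sketch that reuses the K-means idea from above on RGB pixels (k and the iteration count are illustrative choices):

```python
import numpy as np

def color_segmentation(img, k=4, n_iter=20):
    """Assign each pixel to one of k colour clusters (K-means in RGB space)."""
    pixels = img.reshape(-1, 3).astype(float)              # every pixel is an RGB triplet
    rng = np.random.default_rng(0)
    centroids = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(pixels[:, None] - centroids[None], axis=2)   # colour distance
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = pixels[labels == j].mean(axis=0)
    return labels.reshape(img.shape[:2]), centroids         # segment map + cluster colours
```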

(Co)variance and standard deviation. Definitions

Variance = average squared deviation from the mean; a dispersion measure of a random variable: Var(X) := 1/(n-1) · Σ_i (x_i - x̄)².
Covariance: linear similarity measure between two random variables (like CC, but the covariance is not standardized!). It describes the intensity and the direction of the relation. If X and Y show similar behavior (X high when Y is high, X low when Y is low), the covariance is positive; opposite behavior -> negative. Covariance(X,X) = Variance(X).
Standard deviation(X) = sqrt(Variance(X)). Has the same unit as the data (unlike the (co)variance).

Edge detection before Template Matching (with Correlation Coefficient)

Very bad idea, since the area that needs to match becomes smaller (edges are small) --> harder to get matches. Also, if edges are occluded algorithm fails straight away.

PCA as iterative process (graphically)

What we do graphically: 1. Consider all possible lines through the mean data point and record the average squared distance from each datapoint to the line. 2. Choose the line with the minimal average squared distance; that is the first PC. 3. Since all datapoints get projected orthogonally onto that line, minimizing the distance of the projections maximizes the variance of the points along the new line (new axis, first PC). 4. To find the next PC, go back to step 1 with the restriction of choosing only lines that are orthogonal to all previous PCs.

Principal Component Analysis (PCA) = KL-Transformation/Expansion. What is it? What are its (dis)advantages?

A non-parametric tool for unsupervised feature extraction, based on dimensionality reduction. In PR it is usually used as a preprocessing method for a classifier when the set of features would otherwise be too large. Idea: map a set of high-dimensional, possibly correlated variables (pixels) to a set of linearly independent (uncorrelated), low-dimensional features, the so-called principal components. A linear dimensionality reduction method that projects the data onto the most meaningful set of basis vectors, the PCs; intuitively, the coordinate system is rotated such that the variance along the PCs is highest, while the PCs keep the property of being orthogonal to each other (pure rotation of the coordinate system).
*Advantages:* redundant information gets compressed; what remains is a compact representation that may even reveal hidden structure. PCs can be used as features (linear combinations of all original features) or as filter kernels.
*Disadvantages:* PCs are hard to interpret, as they are linear combinations of all features. Fails if the distribution is nonlinear.

