Robotic Vision

Consider Noisefree Uncalibrated SfM, where the camera model in each view is a perspective camera model with p_camera,i = p_camera = (c_x, c_y, f_x, f_y, s_θ). All five intrinsic parameters are unknown, but fixed. Give one example of a motion that is critical with respect to calibration.

All camera rotations are about the same axis (parallel rotation axes). A special case is pure translation, where there is no rotation at all.

Describe how you may detect edges using the Laplacian of Gaussian

Edge detection looks for locations where the image intensity changes most rapidly, which happens around edges. A first-derivative detector finds these as maxima of the image gradients in the x- and y-direction; the Laplacian of Gaussian (LoG) is based on the same principle, but finds them as zero-crossings of the second derivative. Noise, shadows and color changes cause problems, so the image is first converted to grayscale (so that color changes are less prominent) and smoothed with a Gaussian. The second derivatives of the smoothed image are then computed by convolution with derivative kernels. Edge points are detected as zero-crossings of the result: slide a 2x2 window over the filtered image one pixel at a time, and label a zero-crossing if + and - values are simultaneously present in the window and their difference exceeds a threshold. Zero-crossing = edge. Conclusion: first grayscale, Gaussian smoothing, then second derivatives, then thresholded zero-crossings. The threshold keeps only transitions that are sharp enough to count as edges.
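A minimal 1D sketch of the zero-crossing idea, assuming a synthetic step signal and hand-picked values for sigma, the kernel radius and the threshold (all of these are illustrative choices, not values from the course). A pair of neighboring samples plays the role of the 2x2 window.

```python
import math

def log_kernel_1d(sigma, radius):
    """Second derivative of a 1D Gaussian, sampled at integer offsets."""
    kernel = []
    for x in range(-radius, radius + 1):
        g = math.exp(-x * x / (2.0 * sigma ** 2))
        kernel.append((x * x / sigma ** 4 - 1.0 / sigma ** 2) * g)
    # Remove the mean so that a constant signal gives (numerically) zero response.
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]

def filter_1d(signal, kernel):
    """Filter the signal with a symmetric kernel, replicating the border samples."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for x in range(n):
        acc = 0.0
        for i, k in enumerate(kernel):
            j = min(max(x + i - radius, 0), n - 1)
            acc += k * signal[j]
        out.append(acc)
    return out

def zero_crossings(response, threshold):
    """Indices x where the response changes sign between x and x + 1
    and the difference between the two values exceeds the threshold."""
    edges = []
    for x in range(len(response) - 1):
        a, b = response[x], response[x + 1]
        if a * b < 0 and abs(a - b) > threshold:
            edges.append(x)
    return edges

# Step edge between samples 9 and 10.
signal = [0.0] * 10 + [1.0] * 10
response = filter_1d(signal, log_kernel_1d(sigma=1.5, radius=5))
edges = zero_crossings(response, threshold=0.05)
print(edges)
```

The threshold suppresses sign changes caused by tiny numerical values in the flat regions, which is the same role it plays for noise in a real image.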

True/false: FLANN has been shown to have no impact on accuracy in downstream relative pose estimation tasks over exhaustive nearest neighbor matching.

False

True/false: SIFT is often used together with a k-d tree to accelerate the exact nearest neighbor search.

False

True/false: The focal length of a lens depends on the distance between the sensor and the lens.

False. Focal length is an inherent property of the lens.

True/false: In a modern camera, adjusting focus always preserves the angle of view.

False. The angle of view of a compound lens can change as it adjusts focus - focus breathing.

How have we related the mean squared (reprojection) error (MSE) to a quality indicator?

If the errors are Gaussian, the MSE is a biased estimate of the variance factor.

Mathematically define the mutual nearest neighbor (MNN) test. Give an example of when you can likely expect the MNN test to correctly discard a mismatch.

MNN keeps a match only if it is the nearest neighbor in both directions. With b = match_1->2(a), discard the match (a, b) if match_2->1(b) ≠ a. The MNN test is likely to discard a mismatch if, for example, a corner in image 1 is matched to a corner in image 2, but that corner in image 2 has a different corner in image 1 as its nearest neighbor.
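A small sketch of the MNN test, assuming descriptors are plain lists of floats and the dissimilarity is the L2 distance. The function names and the toy descriptors are made up for illustration.

```python
import math

def l2(f, g):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))

def nearest(query, candidates):
    """Index of the candidate descriptor closest to the query."""
    return min(range(len(candidates)), key=lambda j: l2(query, candidates[j]))

def mutual_nearest_neighbors(desc1, desc2):
    """Keep (a, b) only if b is nearest to a AND a is nearest to b."""
    pairs = []
    for a, f in enumerate(desc1):
        b = nearest(f, desc2)                 # b = match_1->2(a)
        if nearest(desc2[b], desc1) == a:     # match_2->1(b) == a ?
            pairs.append((a, b))
    return pairs

# Descriptors 0 and 1 in image 1 both prefer descriptor 0 in image 2,
# but descriptor 0 in image 2 prefers descriptor 1 in image 1.
desc1 = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
desc2 = [[0.9, 0.1], [5.1, 4.9]]
pairs = mutual_nearest_neighbors(desc1, desc2)
print(pairs)
```

Here the one-directional match (0, 0) is discarded, because its partner in image 2 points back to a different descriptor.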

In Problem 26, is camera 2 located a positive distance along the optical axis of camera 1?

No

Repeat Problem 71 if J is the Jacobian of e(p) = (e_1(p)^T, ..., e_n(p)^T) at p_ML, and you want to instead find the covariance of the linear function f(p_ML) = Ap_ML, where A = Σ.

None of these. Since cov(Ap) = A cov(p) A^T, it will be Σ (J^T Σ^-1 J)^-1 Σ^T = Σ (J^T Σ^-1 J)^-1 Σ (Σ is diagonal, hence symmetric).

Consider the RANSAC strategy in the two-view reconstruction algorithm from the homework. True/false: There exists an error threshold such that all correspondences are counted as inliers.

Not always. But it can exist.

A planar surface is located 15m along and perpendicular to a camera's optical axis. It is desired that a 1×1 px^2 image area roughly corresponds to a 2×2 mm^2 surface area. Assuming the sensor has a pixel size of 3 μm, determine roughly an appropriate focal length for a rectilinear lens to ensure this.

Pixel size = 3•10^-6 m, Z = 15 m, X = 2•10^-3 m, X_s = 1 px • 3•10^-6 m/px = 3•10^-6 m. From similar triangles, X_s = f•(X/Z), so f = X_s•(Z/X) = 3•10^-6 • (15/(2•10^-3)) m = 22.5 mm.

Your colleague runs a two-view reconstruction algorithm on simulated noise-free correspondences between successive image pairs, (1, 2), (2, 3), ..., (l − 1, l). They compute the product of the estimated relative poses (T^l)_1 = (T^l)_l−1 · · · (T^3)_2(T^2)_1. The simulated camera moved in a loop, so that (T^l)_1 should be exactly equal to I_4×4. They find that this is far from the case. What have they likely forgotten?

Poses recovered by the two-view algorithm only determine the translation up to a scale factor, as any solution to the linear equation Ae = 0 can be scaled. Therefore they have most likely forgotten to normalize the scale of each pose translation so that it is consistent across the image pairs.

Consider the fundamental matrix F = [0 3 -175; -1 0 -976; 72 825 7612] (rows separated by semicolons), which satisfies ũ_2^T F ũ_1 = 0. Find ρ and θ in the equation u_2•cos θ + v_2•sin θ = ρ for the epipolar line passing through u_1, when u_2 = (50, 40).

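Insert the known point into ũ_2^T F ũ_1 = 0 to get a line equation of the form a•u + b•v + c = 0. Scale the equation by 1/sqrt(a^2 + b^2) so that (a, b) = (cos θ, sin θ) and ρ = -c, then solve for θ and ρ. Remember to scale!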
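The scaling step can be sketched as follows. The helper name and the example line (3, 4, -10) are hypothetical and are not the numbers from the problem; the sign convention ρ ≥ 0 is also an assumption.

```python
import math

def line_to_rho_theta(a, b, c):
    """Convert the line a*u + b*v + c = 0 to the form
    u*cos(theta) + v*sin(theta) = rho, with rho >= 0 (theta in radians)."""
    norm = math.hypot(a, b)
    a, b, c = a / norm, b / norm, c / norm   # now a^2 + b^2 = 1
    if c > 0:
        # Flip the sign of the whole equation so that rho = -c is non-negative.
        a, b, c = -a, -b, -c
    return -c, math.atan2(b, a)

rho, theta = line_to_rho_theta(3.0, 4.0, -10.0)
print(rho, math.degrees(theta))
```

For a line obtained from a fundamental matrix, (a, b, c) would be the three components of the product of F (or its transpose) with the known homogeneous point.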

List three aspects that often distinguish systems advertised to solve SfM versus SLAM.

SfM: unordered set of images with unknown overlap; images from potentially different cameras; relaxed time/compute constraints. SLAM: live stream with high overlap; same camera; strict compute constraints.

In the task with the scatter plot of image gradients, consider the auto-correlation matrix of the image gradients. Are the eigenvalues of the matrix in this case roughly identical, or is one significantly larger than the other (e.g. more than double)?

The eigenvalue along the x-direction, λ_1, should be more than double the value of λ_2, as there is more than double the variance in the x-direction than in the y-direction.

Consider an image I_1 showing a square 5 × 5 grid of dots. The dot grid is centered in the image and is exactly aligned with the horizontal and vertical axes of the image. Ignore discretization and sketch the transformed image I_2(u, v) = I_1(h(u, v)) where h(u, v) = [u + bv, v] with b = 1/10. Label your axes with u and v and indicate the positive axis directions.

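The rows will remain horizontal, because v is unchanged. Since I_2(u, v) = I_1(u + bv, v), a dot at (u_1, v_1) in I_1 appears in I_2 at u = u_1 - b•v_1. So the higher the v, the more the dots are shifted in the negative u-direction (by 1/10 times v).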
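A short check of where a dot ends up, assuming the convention stated in the question, I_2(u, v) = I_1(h(u, v)). The function names are made up for this sketch.

```python
def h(u, v, b):
    """The mapping from the question: where I_2 looks up its value in I_1."""
    return (u + b * v, v)

def dot_position_in_I2(u1, v1, b):
    """Solve h(u, v) = (u1, v1) for (u, v): the pixel of I_2 that shows
    the dot located at (u1, v1) in I_1."""
    v = v1
    u = u1 - b * v
    return (u, v)

# A dot at (50, 20) in I_1, with b = 1/10.
u, v = dot_position_in_I2(50.0, 20.0, 0.1)
print(u, v)
```

Because h is applied to the coordinates of the output image, the dots move with the inverse of h, which is why the shift has the opposite sign of b.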

Your colleague is evaluating a natural feature matching method. They measure the fraction of correspondences for which the average reprojection error of the triangulated point is less than ε, using a known relative pose and calibrated camera model. They find that some of the correspondences that are counted as inliers are clearly mismatches, and persist as ε is decreased. What have they likely forgotten?

They have most likely forgotten that a mismatch can have an error that is as small as (or smaller than) that of a true inlier, for example when the mismatched point lies on the epipolar line of the true point.

Your colleague is evaluating a RANSAC strategy for estimating T in Calibrated AbsPose. They run it multiple times, with the same number of trials and the same threshold. They find that it returns the same inlier set every time, but (undesirably) the returned pose varies slightly. What have they likely forgotten?

They have most likely forgotten to enforce re-estimation of pose using inliers after each run of RANSAC (Bundle-adjustment)

Figure that shows a scatter plot of image gradients ∇I(u) = (I_x(u), I_y(u)). The depicted clusters are consistent with Gaussian distributions. Describe an image that is consistent with the image gradients. (Image = three clusters lined up horizontally, with centers in (I_x(u), I_y(u)) ≈ (-0.5, 0), (0, 0), and (0.5, 0).)

This can be an image with vertical edges. For example it varies from gray, to white, to gray in the x-direction.

Find a perspective camera model for a camera with a rectilinear lens with: vertical angle of view 80° A sensor with physical size 4.8 mm horizontally and 3.6 mm vertically and pixel size 3 μm.

To find a camera model we need to find K = [f_x s_θ c_x; 0 f_y c_y; 0 0 1], with f_x = s_x•f and f_y = s_y•f. Pixel size = 3•10^-6 m -> s = 1/(pixel size) = s_x = s_y. Finding the focal length through trigonometry: 80°/2 = 40°, tan(40°) = (3.6/2) mm / f, so f = 1.8 mm / tan(40°) = 2.15 mm. Principal point at the image center: c_x = (4.8/2)•10^-3 / (3•10^-6) = 800 px, c_y = (3.6/2)•10^-3 / (3•10^-6) = 600 px. f_x = f_y = s•f = 2.15•10^-3 / (3•10^-6) ≈ 715 px. Skew s_θ = 0.
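The same computation as a sketch, assuming square pixels, zero skew and a principal point at the image center (the assumptions used in the answer above). The function names are illustrative.

```python
import math

def focal_length_from_aov(aov_deg, sensor_size):
    """Focal length (same unit as sensor_size) of a rectilinear lens with
    the given angle of view along a sensor side of length sensor_size."""
    return (sensor_size / 2.0) / math.tan(math.radians(aov_deg) / 2.0)

def intrinsic_matrix(f, pixel_size, width_px, height_px):
    """K for square pixels, zero skew, principal point at the image center."""
    fx = fy = f / pixel_size
    return [[fx, 0.0, width_px / 2.0],
            [0.0, fy, height_px / 2.0],
            [0.0, 0.0, 1.0]]

pixel_size = 3e-6                                  # meters
f = focal_length_from_aov(80.0, 3.6e-3)            # vertical AoV, vertical size
K = intrinsic_matrix(f, pixel_size, 4.8e-3 / pixel_size, 3.6e-3 / pixel_size)
for row in K:
    print(row)
```

The same intrinsic_matrix call with f = 24e-3 and a 1000 x 1000 px image reproduces the K with f_x = f_y = 8000 and (c_x, c_y) = (500, 500).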

Consider Noisefree Calibrated Triangulation. Give an example of a motion that is critical with respect to estimating the 3D coordinates of the point.

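Translating along the optical axis without rotation. Triangulation relies on disparity, i.e. the difference in 2D observations from different viewpoints, to estimate depth/3D coordinates. A point on the optical axis projects to the same pixel in both views, so its depth cannot be determined.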

True/false: If η_i,j ∼ N(0, σ^2_i,jI_2x2) in SfM, then it is sufficient to know σ^2_i,j up to a common scaling factor to compute a maximum likelihood estimate of the points and poses.

True

True/false: If η_i,j ∼ N(0, Σ_i,j) in SfM, then it is sufficient to know Σ_i,j up to a common scaling factor to compute a maximum likelihood estimate of the points and poses.

True

Let the four functions f̃_1, f̃_2, f_1, f_2 be defined by f̃_1(X) = KX, f̃_2(X) = K(X + t), with t = (1, 0, 0), and f_i(X) = dehomogenize(f̃_i(X)), where K is an intrinsic matrix of the form in the formula sheet. Let F = K^−T [t]_× K^−1. Also let u_1 and u_2 be two arbitrary vectors in R^2, and regard these as fixed in the following. Select the always-true statements. a) If ũ_2^T F ũ_1 = 0, then there exists at least one X so that f_1(X) = u_1 and f_2(X) = u_2. b) If ũ_2^T F ũ_1 = 0, then there exist infinitely many X so that f_1(X) = u_1 and f_2(X) = u_2. c) If there exists an X so that f_1(X) = u_1 and f_2(X) = u_2, then ũ_2^T F ũ_1 = 0. d) If there exists an X so that f_1(X) = u_1 and ũ_2^T F ũ_1 = 0, then f_2(X) = u_2.

True: a) If ũ_2^T F ũ_1 = 0, then there exists at least one X so that f_1(X) = u_1 and f_2(X) = u_2. c) If there exists an X so that f_1(X) = u_1 and f_2(X) = u_2, then ũ_2^T F ũ_1 = 0.

Consider a scatter plot of the residual reprojection error vectors from a camera calibration. a) True/false: The scatter plot depicting a clearly non-Gaussian distribution is always indicative of disagreement between model and data. b)True/false: The scatter plot depicting a clearly asymmetric Gaussian distribution is always indicative of disagreement between model and data.

a) False b) False

Fischler and Bolles proposed a formula to determine the number of RANSAC trials. Describe the inputs to this formula, what it guarantees, and two practical limitations. (You can omit the actual formula.)

Inputs: a = wanted success rate, b = inlier ratio, s = sample size. It guarantees that in a% of the cases, an all-inlier sample is found within the computed number of iterations. Two practical limitations: 1. The formula is not perfectly correct in practice, since the number of iterations needed can be much higher. 2. Not all outlier-free samples are equally good.
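For reference, a sketch of the commonly quoted form of the formula, N = log(1 − a) / log(1 − b^s), using the input names from the answer above. It assumes samples are drawn independently and that 0 < b < 1; the function name is made up.

```python
import math

def ransac_trials(success_rate, inlier_ratio, sample_size):
    """Number of trials N such that, with probability success_rate, at least
    one of N random samples consists of inliers only.
    Assumes 0 < success_rate < 1 and 0 < inlier_ratio < 1."""
    p_all_inliers = inlier_ratio ** sample_size     # one sample is outlier-free
    return math.ceil(math.log(1.0 - success_rate) / math.log(1.0 - p_all_inliers))

# 99 % success rate, 50 % inliers, 5 correspondences per sample.
n_trials = ransac_trials(0.99, 0.5, 5)
print(n_trials)
```

The count grows quickly with the sample size, which is one reason minimal solvers are attractive inside RANSAC.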

Consider an image I_1 showing a square 5 × 5 grid of dots. The dot grid is centered in the image and is exactly aligned with the horizontal and vertical axes of the image. Consider the transformation h(u, v) = [c_x + x, c_y + y], where (x̃, ỹ, z̃) = R_Y(a) • (u - c_x, v - c_y, 1), (c_x, c_y) is the center of the image, a is a parameter in the range [−π/2, +π/2], and R_Y is as defined in the formula sheet. The coordinates (x, y) are obtained by dehomogenizing (x̃, ỹ, z̃). Answer the following. a) True/false: The rows of the dot grid remain horizontal after the transformation. b) True/false: The columns of the dot grid remain vertical after the transformation.

a) False b) True

Let ũ_1,j = K_1[R_1 t_1]X̃_j and ũ_2,j = K_2[R_2 t_2]X̃_j, j = 1...n. For each case below, find an expression for a matrix H so that ũ_1,j = H ũ_2,j, ∀ j = 1...n. a) The camera motion is purely rotational. b) The 3D points are at infinity.

a) Only rotation means that t_1 = t_2 = 0. Then K_1[R_1 0]X̃_j = H K_2[R_2 0]X̃_j, i.e. K_1 R_1 X_j = H K_2 R_2 X_j for all j, which gives K_1 R_1 = H K_2 R_2 and H = K_1 R_1 (K_2 R_2)^-1. b) Points at infinity means that X̃_j = (x_j, y_j, z_j, 0). When this is inserted into K_1[R_1 t_1]X̃_j = H K_2[R_2 t_2]X̃_j, the translations are multiplied by zero and disappear, so we get the same answer as in a): H = K_1 R_1 (K_2 R_2)^-1.

Consider Noisefree Uncalibrated SfM, where the camera model in each view is a perspective camera model, with parameters that may be fixed or varying, and known or unknown. For each scenario listed below (referring to the assumptions on these parameters), state whether or not calibration is possible. (a) All fixed, all unknown b) All fixed, known skew c) All fixed, known skew and aspect ratio d) All varying, all unknown e) All varying except for skew, but still all unknown f) All varying, known skew g) All varying, known skew and aspect ratio

a) Possible b) Possible c) Possible d) Not possible - all unknown all varying. e) Possible f) Possible g) Possible The more constraints, the more we know.

Suppose Calibrated RelPose has been framed as the problem of minimizing the sum of squared reprojection errors, with respect to 6D parameterizations of the two poses, and 4D parameterizations of the 3D points. The vector of residuals r(p) is the vector of horizontal and vertical errors. a) How many residuals are there? b) How many rows and columns does the Jacobian of r have?

a) The number of residuals depends on the number of points of interest/markers. Each observed point gives two residuals per image, one horizontal and one vertical, so n points give 2n residuals per image -> 4n in total. b) The Jacobian of r has one row per residual and one column per parameter. # rows = # residuals = 4n. # columns = (2•6 = 12) + 4n = 12 + 4n (assuming neither of the poses is fixed).

Let {X̃_j}, j = 1...n, be a set of homogeneous 3D coordinates X̃_j = (X̃_j, Ỹ_j, Z̃_j, W̃_j). Let X̃_j, Ỹ_j, Z̃_j be finite. For each case below, decide if the points are either (i) all coplanar, (ii) all at infinity, or (iii) not necessarily any of these. a) ∀ j : W̃_j = 0 b) ∀ j : W̃_j = 1 c) ∀ j : Z̃_j = 1 d) ∀ j : Z̃_j = W̃_j e) ∀ j : X̃_j = Z̃_j

a) ii - all at infinity. We would divide by zero to find X, Y and Z. b) iii - not necessarily any of these. Dividing by one does not mean that all points lie on the same plane. c) iii - not necessarily any of these. d) i - all coplanar (Z = 1). e) i - all coplanar (X = Z).

Consider the nearest neighbor matching strategy (in the context of feature matching). (a) Mathematically define the strategy. b) True/false: It involves computing distances between image points. c) True/false: It can only be used with the L2 distance metric. d) True/false: It can be applied to any algorithm that directly produces corresponding image points. e) True/false: It is often used after FLANN. f) True/false: It is often used after RANSAC.

a) Let {f_1,i}, i = 1...n_1, and {f_2,j}, j = 1...n_2, be the descriptors in the two images, and let d be a dissimilarity measure (e.g. the L2 distance). For a = 1...n_1: b = match_1->2(a) = argmin_j d(f_1,a, f_2,j); IndexPairs <- IndexPairs ∪ {(a, b)}. Resulting correspondences: {u_1,a <-> u_2,b | ∀(a, b) ∈ IndexPairs}. b) False c) False d) False e) False f) False

Repeat Problem 47, but give an example for which H is invertible

a_X = a_Y = a_Z = 0, t = (0, 0, 1). This would give H = [f_x s_θ c_x; 0 f_y c_y; 0 0 1], with det(H) = f_x • f_y ≠ 0 -> invertible.

Consider a set of image points {u_j}, j = 1...n, where u_j = û_j(p̄) + η_j, with η_j ∼ N(0, σ_j^2 • I_2×2). Let r(p) = (r_1(p)^T, ..., r_n(p)^T), where r_j(p) = e_j(p)/σ_j and e_j(p) = u_j − û_j(p). The cost function || r(p) ||_2 is minimized with respect to p by a solver, which returns a maximum likelihood estimate p_ML and the Jacobian J of r at p_ML. Let Σ = diag(σ_1^2, ..., σ_n^2). A first-order approximation of the covariance of p_ML is: a) (J^T J) b) (J^T J)^−1 c) (J^T Σ^−1 J) d) (J^T Σ^−1 J)^−1 e) 1/2 (J^T Σ^−1 J) f) Σ (J^T J) g) Σ^−1 (J^T J)^−1 h) (J^T Σ J) i) (J^T Σ J)^−1 j) None of these.

b) (J^T J)^-1

In Problem 22, what is the distance in pixels between the resulting line and a point at (30, 60)?

The signed distance is d = ρ - (u•cos(θ) + v•sin(θ)), with ρ = 42.453, θ = 86.87°, u = 30 and v = 60: d = 42.453 - (30•cos(86.87°) + 60•sin(86.87°)) ≈ -19.1, so the distance is |d| ≈ 19 px.

Find a perspective camera model for a camera with a rectilinear lens with focal length 24 mm, and a sensor with pixel size 3 μm and image resolution 1000 × 1000 px^2.

f = 24•10^-3 m, pixel size = 3•10^-6 m, (c_x, c_y) = (500, 500), s = 1/(pixel size) = 1/(3•10^-6) m^-1 (s_x = s_y = s). f_x = f_y = s•f = 24•10^-3 / (3•10^-6) = 8000. K = [8000 0 500; 0 8000 500; 0 0 1]

Repeat Problem 71 if J is the Jacobian of e(p) = (e_1(p)^T, ..., e_n(p)^T) at p_ML. A first-order approximation of the covariance of p_ML is: a) (J^T J) b) (J^T J)^−1 c) (J^T Σ^−1 J) d) (J^T Σ^−1 J)^−1 e) 1/2 (J^T Σ^−1 J) f) Σ (J^T J) g) Σ^−1 (J^T J)^−1 h) (J^T Σ J) i) (J^T Σ J)^−1 j) None of these.

d) (J^T Σ^-1 J)^-1

Mathematically define the second-nearest neighbor distance ratio (SNNDR) test (Lowe's ratio test). Give an example of when you can likely expect the SNNDR to be close to 1 or 0, respectively.

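(a, b) is a matched pair, where f_2,b is the nearest neighbor of f_1,a, and c is the second-nearest neighbor of a. Discard the match if SNNDR = d(f_1,a, f_2,b) / d(f_1,a, f_2,c) > MaxRatio. If there are a lot of repeating textures, the best and second-best candidates are about equally good, and we would expect the SNNDR to be close to 1. If the feature is distinctive, so that the best match is much closer than any other candidate, we would expect the SNNDR to be close to 0.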
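A sketch of the ratio test on toy descriptors, assuming the L2 distance and a MaxRatio of 0.8 (an illustrative choice; the function names and data are made up).

```python
import math

def l2(f, g):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))

def ratio_test(desc1, desc2, max_ratio=0.8):
    """Keep (a, b) only if the nearest neighbor b is clearly closer than the
    second-nearest neighbor. Requires at least two descriptors in desc2."""
    matches = []
    for a, f in enumerate(desc1):
        dists = sorted((l2(f, g), j) for j, g in enumerate(desc2))
        (d_best, b), (d_second, _) = dists[0], dists[1]
        # SNNDR = d_best / d_second; a zero second distance means two exact
        # duplicates, which is maximally ambiguous, so the match is dropped.
        if d_second > 0 and d_best / d_second <= max_ratio:
            matches.append((a, b))
    return matches

desc1 = [[0.0, 0.0], [10.0, 10.0]]
desc2 = [[0.1, 0.0], [9.0, 9.0], [11.0, 11.0]]
matches = ratio_test(desc1, desc2)
print(matches)
```

The first descriptor has one clearly best candidate (ratio near 0) and is kept. The second has two equally good candidates (ratio 1), as with repeating texture, and is discarded.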

Let f : Z→R, g : Z→R, h : Z→R, be functions defined on the set of all integers, and define the operators ∗ and ⊗ so that (f∗g)(x)= Σ f(x−i)g(i) (1) and (f⊗g)(x)= Σ f(x+i)g(i) (2). (Both sums where i = -∞ to ∞) In the following equalities, the symbol ◦ is a placeholder for an operator. For each equality, state whether it is true when ◦ is ∗ or when ◦ is ⊗ or both or neither. 1. (f◦(ag))(x) = a(f◦g)(x), for any a∈R. 2. (f◦(a+g))(x) = a+(f◦g)(x), for any a∈R. 3. (g◦f)(x) = (f◦g)(x) 4. (f◦(g◦h))(x) = ((f◦g)◦h)(x) 5. (f◦(g+h))(x) = (f◦g)(x)+(f◦h)(x) 6. (f ◦ (gh))(x) = ((gf) ◦ h)(x)

1. Both (1) and (2) True 2. Both (1) and (2) False 3. (1) True: convolution is commutative. (2) False: cross-correlation is not. 4. (1) True: convolution is associative. (2) False: cross-correlation is not associative in general. 5. Both (1) and (2) True 6. Both (1) and (2) False

In a set of correspondences {u_j ↔ X_j}j=1...n, a subset (the inliers) are expected to be related by u_j = project(TX ̃ _j ) + η_j , where η_j is an additive noise vector. Describe the main loop of an inlier-counting RANSAC strategy to estimate T. Assume you have a function solveAbsPose that, given at least n_min correspondences, either returns at least one but no more than nsoln solutions for T, or it returns none if the configuration was critical. Count inliers using a user-specified threshold ε on reprojection errors.

1. In each iteration, randomly select a sample of n_min correspondences. 1.1 Call solveAbsPose on the sample. If it returns no solution (critical configuration), go to the next iteration. 1.2 For each returned solution T, compute the errors over all correspondences, || e_j ||_2 = || project(TX̃_j) - u_j ||_2. (The error is the reprojection error.) 1.3 Count the number of errors smaller than ε (the inlier count), and keep T and its inlier set if the count is the highest so far. 2. Return the T and the associated inlier set that had the highest inlier count.
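The loop can be sketched generically as below. The solver and the error function are passed in as callables; the toy 1D "shift" problem is only a runnable stand-in for solveAbsPose and the reprojection error, and all names are hypothetical.

```python
import random

def ransac(correspondences, solve, error, n_min, eps, num_trials, seed=0):
    """Inlier-counting RANSAC. solve(sample) returns a (possibly empty) list
    of candidate models; error(model, c) returns the error of c under model."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(num_trials):
        sample = rng.sample(correspondences, n_min)
        for model in solve(sample):          # no iterations if sample was critical
            inliers = [c for c in correspondences if error(model, c) < eps]
            if len(inliers) > len(best_inliers):
                best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy stand-in: correspondences (x, u) with u = x + t, minimal sample size 1.
def solve_shift(sample):
    (x, u), = sample
    return [u - x]

def shift_error(t, c):
    x, u = c
    return abs(x + t - u)

data = [(float(x), x + 2.0) for x in range(8)] + [(0.0, 9.0), (1.0, -4.0)]
model, inliers = ransac(data, solve_shift, shift_error,
                        n_min=1, eps=0.5, num_trials=50)
print(model, len(inliers))
```

Looping over every returned solution handles solvers that return up to n_soln candidates, and an empty list covers the critical case without special handling.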

Your colleague runs a two-view reconstruction algorithm on correspondences from real images. They find that the reconstruction contains several points that appear to float in the air and should clearly not be part of the reconstruction. They try to get rid of these by reducing the error threshold in RANSAC, but they find that some of them remain no matter how small they make the threshold. Describe how a third image can be used to better eliminate these points. Suggest how/where to take this image, if the first two images were taken while stepping sideways without rotating

A third image can be used to do three-view reconstruction, by first finding correspondences between the first two images and the third. We can use these correspondences to estimate the essential matrix between the first two images and the third, and use this essential matrix to triangulate the points. The third image gives us more geometric constraints and will make RANSAC more robust at rejecting outliers. The problem with only two translational images is that all points are probably not visible in both images. The third image should therefore have good overlap with the first two images, so that we have enough correspondences. Ideally the image should fill in the gaps between the first and second image, for example from a higher viewpoint, with some rotation of the camera.

What distinguishes an axial camera model from a general non-central camera model?

Axial camera model: A line intersects all back-projection rays. Rays with the same angle against the line can be associated with the same point of origin.

State the number of correspondences in the minimal instance of the problems below. Define the goal of the estimation problem. a) Noisefree Calibrated AbsPose, central camera model. b) Noisefree Calibrated AbsPose, non-central camera model. c) Noisefree Calibrated RelPose, central camera model in both views. d) Noisefree Calibrated RelPose, non-central camera model in both views.

Calibrated AbsPose: n = 3, the same for the central and the non-central camera model. Goal: estimate R and t (T). Calibrated RelPose: n = 5 (central), n = 6 (non-central). Goal: estimate T^1_2 (X^w_j, j = 1,...,n, T^1_w, T^2_w).

How do we recommend you treat rigid transformations in iterative optimization? How is it helpful if the implementation of the optimizer allows the user to specify custom parameter updates?

Can treat rigid transformations as Euler parameterizations. Custom parameter updates allow flexibility in incorporating domain-specific knowledge or constraints related to rigid transformations. (By utilizing custom parameter updates, you can tailor the optimization process to the specific requirements. Potentially leading to more accurate and robust results)

What distinguishes a non-central camera model from a central camera model?

Central camera: All back projection rays can be associated with a common point of origin. A non-central camera model does not have this property.

What distinguishes a feature detector from a feature descriptor?

Detector: Chooses points from image based on some criteria. E.g. "cornerness". Defines direction and scale which is necessary for making a descriptor. Descriptor: Vector of values which describes the image patch around a point detected by the detector.

Find the normalization constant k for the filter kernel, assuming that it is intended to perform a differentiation operation. k • [1 0 -1; 1 0 -1; 1 0 -1]

Differentiation operation: k = 1/(sum of absolute values). k = 1/(1 + 1 + 1 + 1 + 1 + 1) = 1/6
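One way to sanity-check k = 1/6 is to convolve the kernel with a ramp image of unit slope and confirm that the response is 1. This is a sketch that assumes the kernel is applied as a true convolution (kernel flipped); the helper name is made up.

```python
# Rows are separate list objects with the same values; the kernel is only read.
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

def convolve_at(image, kernel, u, v):
    """True convolution (kernel flipped) at column u, row v, for a 3x3 kernel.
    image[v][u] is the intensity at column u, row v."""
    acc = 0.0
    for dv in (-1, 0, 1):
        for du in (-1, 0, 1):
            acc += kernel[dv + 1][du + 1] * image[v - dv][u - du]
    return acc

# Ramp image I(u, v) = u, i.e. dI/du = 1 everywhere.
ramp = [[float(u) for u in range(5)] for _ in range(5)]

k = 1.0 / 6.0
response = k * convolve_at(ramp, kernel, 2, 2)
print(response)
```

Each row contributes I(u + 1) − I(u − 1) = 2 on the ramp, and there are three rows, which is where the factor 6 comes from.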

Consider two images from a perspective camera with K = [100 0 50; 0 100 50; 0 0 1]. The camera only translated, so that X_1 = X_2 + t, with t = (1, 0, −5). Decide for both images if the epipole (where the epipolar lines intersect) is inside the image, when both image domains are u ∈ [0, 100], v ∈ [0, 100].

Epipole in 1st image: ẽ_1 = K•t = (-150, -250, -5). Epipole in 2nd image: ẽ_2 = K•(-t) = (150, 250, 5). Both dehomogenize to u = (30, 50), which is inside the image domain in both images.
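The two epipoles can be checked numerically with a few lines. The helper name is made up; the values of K and t are the ones from the question.

```python
def project(K, X):
    """Apply K to a 3-vector and dehomogenize."""
    x = [sum(k * xi for k, xi in zip(row, X)) for row in K]
    return (x[0] / x[2], x[1] / x[2])

K = [[100.0, 0.0, 50.0],
     [0.0, 100.0, 50.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 0.0, -5.0]

e1 = project(K, t)                     # center of camera 2 seen in image 1
e2 = project(K, [-c for c in t])       # center of camera 1 seen in image 2

def inside(p):
    """True if p is inside the image domain [0, 100] x [0, 100]."""
    return 0.0 <= p[0] <= 100.0 and 0.0 <= p[1] <= 100.0

print(e1, inside(e1), e2, inside(e2))
```

The sign of t cancels in the dehomogenization, which is why both epipoles land on the same pixel for a pure translation.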

A linear method for Triangulation can be derived using Cartesian or homogeneous coordinates, resulting in a system of the form AX = b or AX ̃ = 0. Why may the latter be preferable?

The homogeneous system can be solved using the SVD, and it lets us represent points at infinity, which the Cartesian form AX = b cannot. It is also easy to transform between Cartesian and homogeneous coordinates. Overall, this simplifies the calculations.

Consider perspective camera models A and B with parameters (f_x, f_y, c_x, c_y): A = (7950, 7950, 480, 520) B = (8000, 8000, 500, 500) (with sθ = 0). Find the largest possible difference in the horizontal pixel coordinate of the projection of a 3D point under the two models. The image domain is u ∈ [0, 1000], v ∈ [0, 1000].

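Horizontal pixel coordinate u. We want to find max |u_A - u_B|, where u_i = c_x,i + f_x,i•(X/Z), so u_A - u_B = (480 - 500) + (7950 - 8000)•(X/Z). The difference is largest at the edge of the image where both terms have the same sign: under model B, u = 1000 gives X/Z = (1000 - c_x)/f_x = 500/8000 = 1/16. We get max|u_A - u_B| = |480 - 500 + (7950 - 8000)•(1/16)| = 23.125 px.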

You have a set of 2D points x_i = (x_i, y_i), i = 1, ..., n, and you want to detect straight lines among the points using the Hough transform. You only want to detect lines for which the angle θ is in the range [15°, 30°] in the line equation ρ = x cos(θ) + y sin(θ). To avoid trigonometric functions, you decide to parameterize lines using the slope-intercept form y = ax + b. Describe how you may use a Hough transform to detect lines represented in the slope-intercept form. Assume that votes are accumulated in an array with N_a and N_b bins, and that a and b are in the ranges a ∈ [a1,a2] and b ∈ [b1,b2].

Hough transform idea: each point votes for all hypothetical lines that could pass through it. Each vote is a point in parameter space. A strongly present line -> dense cluster of votes. We need to go from (x, y)-space to (a, b)-space. Rewrite ρ = x cos(θ) + y sin(θ) as y = -(cos(θ)/sin(θ))•x + ρ/sin(θ), so that a = -cos(θ)/sin(θ) and b = ρ/sin(θ). The angle range θ ∈ [15°, 30°] then corresponds to a ∈ [a1, a2] = [-cos(15°)/sin(15°), -cos(30°)/sin(30°)] ≈ [-3.73, -1.73], and b ∈ [b1, b2] is chosen to cover the intercepts of interest. Quantize (a, b) into N_a × N_b accumulator cells. For each point (x_i, y_i) and each quantized slope a, compute b = y_i - a•x_i and add a vote to the corresponding cell. Points in (x, y)-space that lie on a line form a cluster of votes in (a, b)-space, so the cells with the most votes correspond to the most likely lines in (x, y)-space.
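A compact sketch of the voting scheme, assuming slope bins are represented by their centers and that intercepts outside [b1, b2) are simply ignored. The function names, the bin counts and the test points are illustrative.

```python
import math

def hough_slope_intercept(points, a_range, b_range, n_a, n_b):
    """Accumulate votes in an n_a x n_b array for lines y = a*x + b."""
    (a1, a2), (b1, b2) = a_range, b_range
    acc = [[0] * n_b for _ in range(n_a)]
    for x, y in points:
        for ia in range(n_a):
            a = a1 + (ia + 0.5) * (a2 - a1) / n_a    # center of slope bin ia
            b = y - a * x                             # intercept implied by a
            ib = math.floor((b - b1) / (b2 - b1) * n_b)
            if 0 <= ib < n_b:
                acc[ia][ib] += 1
    return acc

def strongest_cell(acc):
    """Return (ia, ib, votes) of the cell with the most votes."""
    votes, ia, ib = max((v, ia, ib)
                        for ia, row in enumerate(acc)
                        for ib, v in enumerate(row))
    return ia, ib, votes

# theta in [15 deg, 30 deg]  ->  a = -cos(theta)/sin(theta)
a_range = (-1.0 / math.tan(math.radians(15.0)),
           -1.0 / math.tan(math.radians(30.0)))

# Five points on the line y = -2.5*x + 4.5.
points = [(x, -2.5 * x + 4.5) for x in (0.0, 5.0, 10.0, 15.0, 20.0)]
acc = hough_slope_intercept(points, a_range, (0.0, 10.0), n_a=20, n_b=10)
ia, ib, votes = strongest_cell(acc)
print(ia, ib, votes)
```

No trigonometric functions are needed inside the voting loop; they appear only once, when the angle range is converted to a slope range.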

Your colleague runs a two-view reconstruction algorithm on correspondences from real images. They find that the reconstruction contains several points that appear to float in the air and should clearly not be part of the reconstruction. They try to get rid of these by reducing the error threshold in RANSAC, but they find that some of them remain no matter how small they make the threshold. What is happening?

If backprojection rays to two points intersect, then the reprojection error will always be zero. All coplanar backprojection rays will intersect at some point, for example two points with same height from the ground if the cameras are at the same height.

How can the rank of the Jacobian of the vector of residuals be used to identify a critical configuration?

If the rank of the Jacobian is less than the number of parameters we estimate, we have rank deficiency. This means we don't have enough observations of the parameters. -> If rank(J) < #parameters

Your colleague is estimating a fundamental matrix using a RANSAC strategy. They find that, for any threshold, relatively fewer of the correspondences are counted as inliers for which one or both image points are close to the edge of the image, compared those for which both image points are close to the center. What have they likely forgotten?

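They have likely forgotten to account for lens distortion. Distortion is largest along the edges of the image, and the fundamental matrix assumes a rectilinear (straight-line preserving) camera, so points near the edges fit the epipolar constraint worse.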

What have we suggested regarding rotations about the optical axis in the context of calibration?

Include ± 90° and 180° rotations about the optical axis at each viewpoint.

In a pinhole camera with an infinitely small pinhole, and a sensor with finite physical size, which of the following will always increase the angle of view? - Increasing the principal distance - Increasing the pixel density - Increasing the sensor's physical size - Increasing the image resolution. - Decreasing the principal distance - Decreasing the pixel density - Decreasing the sensor's physical size. - Decreasing the image resolution.

Decreasing the principal distance and increasing the sensor's physical size will always increase the AoV. Increasing the principal distance moves the sensor away from the pinhole and decreases the AoV, as does decreasing the sensor's physical size. Changing the image resolution or pixel density does not, by itself, change the AoV.

What distinguishes a homogeneous linear system from an inhomogeneous linear system?

Inhomogeneous: AX = b where b ≠ 0. Homogeneous: AX̃ = 0.

In the context of camera calibration, why do we prefer to use a checkerboard mixed with uniquely identifiable markers (e.g. AprilTags), over a plain checkerboard?

It avoids the need for a fully visible pattern. For instance a checkerboard usually must be fully visible and asymmetric to uniquely identify its internal points. This can make it hard to properly make a distortion model for the edges.

How have we used a scatter plot of the aggregated image point observations as a quality indicator?

It can be used to inspect the concentration of point observations across the image, as very few observations along the edges give an indication that our radial distortion model may be wrong (poorly constrained) there.

What do we mean by the cheirality constraint in the context of the relative pose problem?

It is a geometric constraint to ensure that 3D-points reconstructed from relative pose estimation lie in front of both cameras.

How can knowing the relative pose and camera model parameters for a pair of images simplify the problem of finding point correspondences?

Knowledge of the relative pose and camera model parameters lets us compute, for a given point in image 1, the epipolar curve in image 2, and do a sliding window search along this curve, comparing pixel intensities to find the best match. In other words, it gives us the fundamental matrix, so that we can search along a curve instead of over the whole image.

Consider Noisefree Calibrated SfM. Give an example of a motion that is critical with respect to obtaining a similarity reconstruction.

Pure rotation. There is no change of perspective. A similarity reconstruction relies on differences in scale and relative positions of the 3D scene points, and a pure rotation provides too little information to accurately estimate the parameters needed.

What distinguishes RANSAC (as described by Fischler and Bolles) from MSAC?

RANSAC: Randomly select a subset of samples to fit the model to, and after the model with the most inliers is found, the inliers are used to re-estimate the parameters. MSAC: The discrete inlier counting is replaced by a robust loss function, which weighs the residuals of the samples and chooses the model with the best (lowest) total loss.

What have we suggested that a scatter plot of the radial reprojection errors can reveal about the camera model? Also, describe how a point in this scatter plot is computed from a given measured image point u, estimated principal point (cˆ_x, cˆ_y), and predicted image point uˆ.

Scatter-plotting the radial errors (r_i - rˆ_i) against r_i may reveal the presence (and magnitude) of radial lens distortion. r_i = || u_i - (cˆ_x, cˆ_y) || rˆ_i = || uˆ_i - (cˆ_x, cˆ_y) || If there is a trend among the radial errors then the model may be wrong (in not assuming distortion).
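A sketch of how one point of this scatter plot could be computed from u, û and (ĉ_x, ĉ_y), following the definitions above. The function name and the example numbers are made up.

```python
import math

def radial_error_point(u, u_hat, c_hat):
    """Return (r, r - r_hat): the x- and y-value of one scatter-plot point.
    u: measured image point, u_hat: predicted image point,
    c_hat: estimated principal point."""
    r = math.hypot(u[0] - c_hat[0], u[1] - c_hat[1])
    r_hat = math.hypot(u_hat[0] - c_hat[0], u_hat[1] - c_hat[1])
    return r, r - r_hat

r, radial_error = radial_error_point(u=(103.0, 104.0),
                                     u_hat=(106.0, 108.0),
                                     c_hat=(100.0, 100.0))
print(r, radial_error)
```

Repeating this for all observations and plotting the second value against the first gives the scatter plot; a visible trend with r suggests unmodeled radial distortion.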

Consider Noisefree Uncalibrated RelPose. Let S_1 = (K_1, K_2, (T^1)_w, (T^2)_w, {(X^w)_j }j=1...n) be one solution. Give a brief verbal description of what we mean by a similarity reconstruction, and define the solution set in a similarity reconstruction in terms of the elements of S_1.

Similarity reconstruction is reconstruction of an object while preserving shape and structure. (But allows for variations in size and position) S_1 = (K_1, K_2, T^-1(T^1)_w, T^-1(T^2)_w, {T(X^w)_j }j=1...n) Where T = [λR t 0 1]

Find the normalization constant k for the filter kernel, assuming that it is intended to perform a smoothing operation. k • [1 1 1; 1 10 1; 1 1 1]

Smoothing operation - k = 1/sum. k = 1/(1 + 1 + 1 + 1 + 10 + 1 + 1 + 1 + 1) = 1/18

You have detected point features in two images, and want to count for each feature in one image, how many features were detected within some radius of the true corresponding point in the other image. You have a fundamental matrix and a homography. Can either of these, or both, let you do this count?

The Homography matrix can be used by mapping all points in one image to the other image. After this, a search within the radius can decide how many points are inside the radius. The fundamental matrix can not be used because point maps to line with infinitely many solutions.

Figure that shows a simulation of light paths through a lens. Find the principal distance in a perspective camera model using the figure. Include intermediate measurements, e.g. "I judge from the figure that the real sensor plane is approximately at X = 105 mm"

To find the principal distance in a perspective camera model: 1. Select a line to work with. (Easy with the lower-to-upper one) 2. Draw a horizontal line crossing the top point. 3. Draw a horizontal line on y=0. 4. Draw a straight line from the start of the line chosen in (1) and elongate it so it crosses both horizontal lines. 5. Note the two x-values in the crossings of the lines. f = cross2-cross1. Here: f = 64mm - 32mm ≈ 32mm

Consider a perspective camera model with f_x = f_y = f and s_θ = 0. Estimates of the parameters, fˆ, cˆ_x, cˆ_y, represented by pˆ and are within ±3σ_pˆ of the true values, f ̄, c ̄_x, c ̄_y. (pˆ, σ_pˆ) = f: (2000, 3) c_x: (1100, 2) c_y: (900, 2) Let u be a measured image point, equal to the projection of some point X under the true values, and let uˆ be the projection of X under the estimated values. For each parameter, find the potential horizontal reprojection error |u − uˆ|. Treat each parameter in isolation, assuming σ_pˆ = 0 for the others. The image domain is u ∈ [0, 2500], v ∈ [0, 2000].

The ±3σ bounds from the table give the intervals in which the true values lie: f ̄ ∈ [1991, 2009], c ̄_x ∈ [1094, 1106], c ̄_y ∈ [894, 906]. With x = X/Z, the projections are u = c ̄_x + f ̄•x and uˆ = cˆ_x + fˆ•x, so |u − uˆ| = |c ̄_x − cˆ_x + (f ̄ − fˆ)•x|. For f: the error grows with |x|, so we evaluate it at the image borders. u = 0 -> x = -1100/2009, u = 2500 -> x = 1400/2009. max|u − uˆ| = |(2000 − 2009)•(1400/2009)| = 9•(1400/2009) ≈ 6.27 px. For c_x: max|u − uˆ| = |1100 − 1106| = 6 px, independent of x. For c_y: 0 px, because c_y does not affect the horizontal coordinate. -> The largest horizontal error is about 6.3 px, caused by f at u = 2500 (for any v ∈ [0, 2000]).
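The bounds can be reproduced numerically. This is a minimal sketch that follows the answer's choice of 2009 as the extreme focal length. The function name and its default arguments are illustrative only.

```python
def horizontal_error_bounds(f_hat=2000.0, cx_hat=1100.0,
                            sigma_f=3.0, sigma_cx=2.0, width=2500.0):
    """Worst-case |u - u_hat| for f and c_x, each treated in isolation."""
    delta_f = 3.0 * sigma_f      # focal length is off by at most 9
    delta_cx = 3.0 * sigma_cx    # principal point is off by at most 6
    f_extreme = f_hat + delta_f  # 2009, the value used in the answer

    # Normalized coordinate x = X/Z at the left and right image borders.
    x_left = (0.0 - cx_hat) / f_extreme
    x_right = (width - cx_hat) / f_extreme

    error_from_f = delta_f * max(abs(x_left), abs(x_right))
    error_from_cx = delta_cx     # a shift of c_x moves every pixel equally
    return error_from_f, error_from_cx

err_f, err_cx = horizontal_error_bounds()
print(round(err_f, 3), err_cx)  # 6.272 6.0
```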

Let g(x) = ∑f(i)w(x-i, f(x)-f(i))/∑w(x-i, f(x)-f(i)), where w(a, b) = exp(−a^2/(2σ_a^2) − b^2/(2σ_b^2)). The picture shows a signal f, and four instances of g computed by the above formula, where f has been zero-padded as necessary, for four choices of σ_a and σ_b (not necessarily in order): a) (σ_a, σ_b) = (0.2, 3) b) (σ_a, σ_b) = (3, 0.2) c) (σ_a, σ_b) = (3, 3) d) (σ_a, σ_b) = (1, 1) What does (σ_a, σ_b) do?

Version 1 = a) (σ_a, σ_b) = (0.2, 3). Version 2 = b) (σ_a, σ_b) = (3, 0.2). Version 3 = d) (σ_a, σ_b) = (1, 1). Version 4 = c) (σ_a, σ_b) = (3, 3). σ_a controls the spatial reach of the smoothing operation. σ_b controls how similar a value must be to the center value to be included in the smoothing.
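The formula for g is a 1D bilateral filter, and the roles of the two parameters are easy to see on a step signal. The sketch below is a simplification: it sums only over the samples of f and skips the zero-padding mentioned in the question. The function name and the test signal are made up for illustration.

```python
import math

def bilateral_1d(f, sigma_a, sigma_b):
    """Evaluate g(x) for every sample of f, summing over the samples of f only."""
    g = []
    for x in range(len(f)):
        numerator = 0.0
        denominator = 0.0
        for i in range(len(f)):
            spatial = (x - i) ** 2 / (2.0 * sigma_a ** 2)
            value = (f[x] - f[i]) ** 2 / (2.0 * sigma_b ** 2)
            w = math.exp(-spatial - value)
            numerator += f[i] * w
            denominator += w
        g.append(numerator / denominator)
    return g

step = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]

# Small sigma_b: samples across the step get (almost) zero weight, the edge survives.
edge_kept = bilateral_1d(step, sigma_a=3.0, sigma_b=0.2)

# Large sigma_b: the value term is nearly constant, so this acts like a Gaussian blur.
edge_blurred = bilateral_1d(step, sigma_a=3.0, sigma_b=100.0)

# Small sigma_a: only the center sample has noticeable weight, so g is close to f.
almost_identity = bilateral_1d(step, sigma_a=0.2, sigma_b=100.0)
```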

Consider the Quanser helicopter from the homework. Characterize the Jacobian in the cases below. a) Only the markers on the arm (not the rotor carriage) are observed. b) All markers are observed, but the helicopter is oriented to point nearly straight up (ignore the physical limitation of the hinge).

We have 7 markers = 14 residuals. J = [∂r_1/∂ψ ∂r_1/∂θ ∂r_1/∂φ; ...; ∂r_14/∂ψ ∂r_14/∂θ ∂r_14/∂φ] (14 × 3). a) All residuals related to the markers on the rotor carriage will be 0. The first 6 residuals relate to the arm, and their rows will be non-zero. The remaining rows will be zero. The arm markers do not move with the roll φ, so the φ column is zero in the arm rows as well and φ cannot be estimated. b) With the helicopter pointing straight up, changing ψ has the same effect as changing φ. The first and third columns of J are then (nearly) identical, so J is (nearly) rank-deficient.

A surface in the shape of a disk of radius 5 m is located 15 m along, and perpendicular to, a camera's optical axis. The disk is centered on the optical axis. What is the smallest image resolution that makes the disk fully visible in the image? Use the focal length from problem 29. (It is desired that a 1×1 px^2 image area roughly corresponds to a 2×2 mm^2 surface area. Assuming the sensor has a pixel size of 3 μm, determine roughly an appropriate focal length for a rectilinear lens to ensure this.)

From problem 29, f = 22.5•10^-3 m. X = 5 m, Z = 15 m. Because the disk is centered on the camera's optical axis, similar triangles give X/Z = x/f, where x is the distance on the sensor from the principal point to the edge of the disk's image: x = f•X/Z = 7.5•10^-3 m. In pixels: x/pixel size = 7.5•10^-3/(3•10^-6) = 2500 px. The disk is 10 m in diameter, so covering it takes 2•2500 px = 5000 px in both the x- and y-direction. Resolution = 5000 x 5000 px^2 = 25 MP
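A small sketch of the same calculation, assuming that problem 29 refers to the same 15 m working distance (this is consistent with f = 22.5 mm). The function names are illustrative.

```python
def focal_length_m(depth_m, pixel_size_m, footprint_m):
    """Focal length for which one pixel covers `footprint_m` at distance `depth_m`."""
    return depth_m * pixel_size_m / footprint_m

def min_side_px(radius_m, depth_m, focal_m, pixel_size_m):
    """Smallest square image side (in px) that contains a centered disk."""
    half_extent_m = focal_m * radius_m / depth_m   # similar triangles: x = f * X / Z
    return round(2.0 * half_extent_m / pixel_size_m)

f = focal_length_m(depth_m=15.0, pixel_size_m=3e-6, footprint_m=2e-3)
side = min_side_px(radius_m=5.0, depth_m=15.0, focal_m=f, pixel_size_m=3e-6)
print(side, side * side / 1e6)  # 5000 25.0
```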

In Problem 46, let R = R_X(a_X )R_Y(a_Y )R_Z(a_Z ), and give one example of numerical values for a_X , a_Y , a_Z , and t, for which H is not invertible. In your example, the 3D points should all be in front of the camera, assuming xj , yj ≥ 0, and also ||t||_2 = 1.

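From problem 46: H = K[r_1 r_2 t], where r_1 and r_2 are the first two columns of R. H is not invertible when t lies in the span of r_1 and r_2. Choosing a_X = a_Y = a_Z = 0 and t = (0, 1, 0) gives r_2 = t, so H = [f_x s_θ s_θ; 0 f_y f_y; 0 0 0] is singular, but every point then has depth Z = 0 and is not in front of the camera. An example that also satisfies the depth requirement: a_X = 90°, a_Y = a_Z = 0, t = (0, 0, 1). Then r_1 = (1, 0, 0) and r_2 = (0, 0, 1) = t, so two columns of H are equal and H^-1 does not exist. A point (x_j, y_j, 0) maps to (x_j, 0, y_j + 1) in the camera frame, so its depth y_j + 1 is positive, and ||t||_2 = 1.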

How can it be problematic to parameterize 3D points by 3D Cartesian vectors?

When a 3D point is very far away, all that matters is its direction. However, if its (X, Y, Z) components are very large, then they lose numerical precision, and correspondingly large displacements are necessary to affect the cost function.

In Problem 22, is the resulting line visible if the image domain is u ∈ [0, 100], v ∈ [0, 100]?

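Yes. With θ ≈ 86.87° and ρ ≈ 42.45, the line 55u + 1006v = 42772 is nearly horizontal: it runs from v ≈ 42.5 at u = 0 to v ≈ 37.0 at u = 100, which is inside the image.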

In Problem 26, is camera 1 located a positive distance along the optical axis of camera 2?

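Yes. X_1 = X_2 + t. Setting X_2 = (0, 0, 0), the origin of camera 2, gives X_1 = (0, 0, 0) + (1, 0, -5) = (1, 0, -5). Camera 1 therefore "sees" camera 2 at a negative distance along its own optical axis, so camera 2 is behind camera 1. Conversely, setting X_1 = (0, 0, 0) gives X_2 = -t = (-1, 0, 5), so camera 1 lies a positive distance along the optical axis of camera 2.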

Consider Noisefree Calibrated SfM with l = 3 views, all with a central camera model. Let the relative pose (T^2)_1 be known without scale ambiguity in the translation. For each case below, decide if (T^3)_2 can then generally (for some configuration) also be determined without scale ambiguity in the translation. a) Vj = {1, 2, 3} for all points b) Vj = {1, 2} for a subset of points, and Vj = {2, 3} for the remaining points c) Vj = {1, 2} for one subset, Vj = {2, 3} for another subset, and Vj = {1, 3} for the remaining points.

a) Yes. All points are visible in all views, so the points triangulated from views 1 and 2 fix the scale for view 3. b) No. No point is shared between the pair (1, 2) and the pair (2, 3), so nothing ties the scale of the second pair to the known scale of the first. c) Yes. The origin of the third view can be triangulated using the direction vectors from 1-3 and 2-3.

Answer the following regarding the Lucas-Kanade method, as described in the lectures. a) True/false: It is often used together with an edge detector. b) True/false: It is often used in a hierarchical manner, starting at the least blurry version of the input. c) True/false: It generally finds the parameters of a transformation, e.g. a 2D homography. d) True/false: It involves the epipolar geometry. e) True/false: It is typically used in a "wide baseline" setting.

a) False b) False c) True (It generally finds the parameters of a transformation, e.g. a 2D homography.) d) False e) False

The camera model u = c_x + (x+δ_x)f_x, δ_x = (k_1r^2 + k_2r^4 + k_3r^6 + ···)x + 2p_1xy + (r^2 + 2x^2)p_2, v = c_y + (y+δ_y)f_y, δ_y = (k_1r^2 + k_2r^4 + k_3r^6 + ···)y + (r^2 + 2y^2)p_1 + 2p_2xy, has been calibrated on images of width W and height H. How should the parameters be modified if: a) The images are resized to width aW and height aH by a factor a ∈ (0, 1]? b) The images are cropped by subtracting l, r, t, b ≥ 0 px from the left, right, top, and bottom?

a) c_x and c_y shrink with the image: c_x,after = a•c_x, c_y,after = a•c_y. For the same light ray to hit the same (scaled) image location, f_x and f_y must be scaled by a as well (in theory it is the pixel density s in f_i = s_i•f that is scaled): f_x,after = a•f_x, f_y,after = a•f_y. The distortion coefficients act on the normalized coordinates (x, y) and are unaffected by the resizing: ∆(k_1, k_2, ..., p_1, p_2) = 0. b) Cropping shifts the principal point by the amount removed from the left and the top: c_x,new = c_x - l, c_y,new = c_y - t. All other parameters are unchanged; r and b only change the image size.
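The two updates can be written as small helper functions. This is a sketch with hypothetical function names and illustrative numbers. The distortion coefficients are not passed in because they stay the same in both cases.

```python
def resize_intrinsics(fx, fy, cx, cy, a):
    """Intrinsics after resizing the image by a factor a (distortion unchanged)."""
    return a * fx, a * fy, a * cx, a * cy

def crop_intrinsics(fx, fy, cx, cy, left, top):
    """Intrinsics after cropping `left` px from the left and `top` px from the top.

    Pixels removed from the right or bottom do not move the principal point.
    """
    return fx, fy, cx - left, cy - top

# Illustrative values only.
resized = resize_intrinsics(2000.0, 2000.0, 1100.0, 900.0, a=0.5)
cropped = crop_intrinsics(2000.0, 2000.0, 1100.0, 900.0, left=100.0, top=50.0)
print(resized)  # (1000.0, 1000.0, 550.0, 450.0)
print(cropped)  # (2000.0, 2000.0, 1000.0, 850.0)
```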

Let (f (1), ..., f (4)) = (5, 3, 7, 2), and (g(−1), g(0), g(1)) = (1, 1, 1), and h(x) = Σf′(x − i)g(i) , i = -1, 0, 1 where f′ = (f (0), ..., f (5)) is a padded version of f. State the resulting values of (h(1), h(2), h(3), h(4)), for the following padding strategies: a) "zero" (extend the signal with 0) b) "clamp" (extend the signal with the nearest edge value).

a) f' = [0, 5, 3, 7, 2, 0] h(1) = 3 + 5 + 0 = 8 h(2) = 7 + 3 + 5 = 15 h(3) = 2 + 7 + 3 = 12 h(4) = 0 + 2 + 7 = 9 h = [8, 15, 12, 9] b) f' = [5, 5, 3, 7, 2, 2] h(1) = 3 + 5 + 5 = 13 h(2) = 7 + 3 + 5 = 15 h(3) = 2 + 7 + 3 = 12 h(4) = 2 + 2 + 7 = 11 h = [13, 15, 12, 11]
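Both padding strategies can be checked with a few lines of code. The function name `filter_with_padding` is made up for this sketch, and it only handles a three-tap g with offsets -1, 0, 1 as in the question.

```python
def filter_with_padding(f, g, mode):
    """Compute h(x) = sum_i f'(x - i) g(i), i = -1, 0, 1, for x = 1..len(f)."""
    if mode == "zero":
        padded = [0] + list(f) + [0]
    elif mode == "clamp":
        padded = [f[0]] + list(f) + [f[-1]]
    else:
        raise ValueError("unknown padding mode: " + mode)
    # padded[k] is f'(k) for k = 0..len(f)+1, and g[i + 1] is g(i).
    return [
        sum(padded[x - i] * g[i + 1] for i in (-1, 0, 1))
        for x in range(1, len(f) + 1)
    ]

f = [5, 3, 7, 2]
g = [1, 1, 1]
print(filter_with_padding(f, g, "zero"))   # [8, 15, 12, 9]
print(filter_with_padding(f, g, "clamp"))  # [13, 15, 12, 11]
```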

Let u ̃_j = K[R t]X ̃_j for j = 1...n. Let X_j = (x_j,y_j,0), and define x_j = (x_j, y_j). a) Find an expression for a matrix H so that u ̃_j = Hx ̃_j , ∀j = 1...n. b) Does the expression for H you found in (a) always represent a homography?

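a) H = K[r_1 r_2 t], where r_1 and r_2 are the first two columns of R (the third column drops out because Z = 0). b) No, only if H is full rank so that H^-1 exists.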

Let {u_1,j ↔ u_2,j}j=1...n be a set of corresponding perspective projections of a rigid set of 3D points: u ̃_i,j =KX^i_j .Let X^2_j = RX^1_j + t, and let t = (t_X, t_Y, t_Z), and let K, R, and t_Y be known. The equations from n correspondences then lead to a linear system Ah = b, where A is an n × 2 matrix, and only h involves the unknown variables. Give an expression for h and b and one row of A.

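The unknowns are t_X and t_Z, so h = (t_X, t_Z). With x_i,j = K^-1 u ̃_i,j, the epipolar constraint x_2,j^T [t]_× R x_1,j = 0 can be written t • c_j = 0, where c_j = (R x_1,j) × x_2,j depends only on K, R and the measured points. Moving the known t_Y term to the right-hand side gives row j of A: [c_j,X c_j,Z], and element j of b: b_j = -c_j,Y • t_Y.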

Repeat Problem 53 for Calibrated SfM. Likewise, the poses and 3D points are respectively parameterized by 6D vectors and 4D vectors. Tip: The number of elements in V_j is denoted by |V_j|. a) how many residuals are there b) number of rows and columns in the Jacobian.

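l = number of views, Q = total number of 3D points, n_j = number of points observed in image j. a) Number of residuals = 2•∑n_j (sum from j = 1 to l), which equals 2•∑|V_j| summed over all points. b) Rows = number of residuals = 2•∑n_j (sum from j = 1 to l). Columns = 6l + 4Q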

Consider a perspective camera model with a generic K matrix. For each 3D point below, find an expression for the Cartesian coordinates of its projection, or decide that the 3D point is behind the camera or that its projection is undefined: a) X ̃ = (0,0,0,0) b) X ̃ = (0,0,0,1) c) X ̃ = (0,0,1,0) d) X ̃ = (0,0,1,−1) e) X ̃ = (0,0,−1,−1) f) X ̃ = (1,1,1,1) g) X ̃ = (1,1,1,0) h) X ̃ = (2,2,2,1)

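u ̃ = K[I_(3x3) 0]X ̃, with s_i • f = f_i. a) Undefined (not a valid homogeneous point). b) Undefined (the camera center, u ̃ = 0). c) u = [c_x, c_y] (point at infinity along the optical axis). d) u = [c_x, c_y], but the point is behind the camera (W = -1, so Cartesian Z = -1). e) u = [c_x, c_y]. f) u = [f_x + s_θ + c_x, f_y + c_y]. g) u = [f_x + s_θ + c_x, f_y + c_y]. h) u = [f_x + s_θ + c_x, f_y + c_y].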
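The cases can be checked mechanically. The sketch below uses a made-up K (f_x = 800, s_θ = 2, c_x = 320, f_y = 820, c_y = 240) and a hypothetical helper `project` that also reports whether the point is in front of the camera, behind it, or at infinity.

```python
def project(K, X_h):
    """Project a homogeneous 3D point with u~ = K [I 0] X~.

    Returns (pixel, status). pixel is None when the projection is undefined.
    """
    X, Y, Z, W = X_h
    p = (X, Y, Z)  # [I 0] drops the fourth coordinate
    u_h = [sum(K[r][c] * p[c] for c in range(3)) for r in range(3)]
    if u_h[2] == 0:
        return None, "undefined"
    pixel = (u_h[0] / u_h[2], u_h[1] / u_h[2])
    if W == 0:
        return pixel, "at infinity"
    # The Cartesian depth is Z / W.
    return pixel, "in front" if Z / W > 0 else "behind"

# Hypothetical intrinsics, for illustration only.
K = [[800, 2, 320],
     [0, 820, 240],
     [0, 0, 1]]

cases = {
    "a": (0, 0, 0, 0), "b": (0, 0, 0, 1), "c": (0, 0, 1, 0), "d": (0, 0, 1, -1),
    "e": (0, 0, -1, -1), "f": (1, 1, 1, 1), "g": (1, 1, 1, 0), "h": (2, 2, 2, 1),
}
results = {name: project(K, X_h) for name, X_h in cases.items()}
for name in sorted(results):
    print(name, results[name])
```

With these numbers, c), d) and e) land on (c_x, c_y) = (320, 240), and f), g) and h) land on (f_x + s_θ + c_x, f_y + c_y) = (1122, 1060).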

Consider a perspective camera model with K = [100 0 200; 0 100 500; 0 0 1]. Find the Cartesian coordinates of the projection of X ̃ = (3, 5, 7, 2)

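u ̃ = K[I_(3x3) 0]X ̃ = (100•3 + 200•7, 100•5 + 500•7, 7) = (1700, 4000, 7), so u = [1700/7, 4000/7] ≈ [242.9, 571.4]. The fourth coordinate W = 2 does not affect the projection.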

Consider a homography represented by H = [2 0.5 -0.5; 0 2 0; 0 0 2]. Find the Cartesian coordinates u_2, when u ̃_2 = Hu ̃_1, and u_1 = (1, 1)

u ̃_1 = (1, 1, 1). u ̃_2 = Hu ̃_1 = (2 + 0.5 - 0.5, 2, 2) = (2, 2, 2). Dividing by the last coordinate gives u_2 = (1, 1)
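The same mapping in code, as a minimal sketch. The helper name `apply_homography` is an assumption for illustration.

```python
def apply_homography(H, u):
    """Map a Cartesian 2D point through a 3x3 homography."""
    x, y = u
    u_h = [row[0] * x + row[1] * y + row[2] for row in H]  # H times (x, y, 1)
    return (u_h[0] / u_h[2], u_h[1] / u_h[2])              # dehomogenize

H = [[2, 0.5, -0.5],
     [0, 2, 0],
     [0, 0, 2]]
u2 = apply_homography(H, (1, 1))
print(u2)  # (1.0, 1.0)
```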

Consider the fundamental matrix F = [0 3 -175; -1 0 -976; 72 825 7612]. Find ρ and θ in the equation u_2•cos θ + v_2•sin θ = ρ for the epipolar line passing through u_2, when u ̃_2^T F u ̃_1 = 0, and u_1 = (30, 40).

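u ̃_1 = (30, 40, 1), u ̃_2 = (u_2, v_2, 1). Set up the equation u ̃_2^T F u ̃_1 = 0: the line coefficients are F u ̃_1 = (-55, -1006, 42772), i.e. -55u_2 - 1006v_2 + 42772 = 0. Remember to scale! Dividing by -sqrt(55^2 + 1006^2) ≈ -1007.5 gives cos θ ≈ 0.0546 and sin θ ≈ 0.9985, so θ ≈ 86.87° and ρ ≈ 42.45.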
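The scaling step can be sketched as follows. The function name is made up, and the sign convention ρ ≥ 0 is an assumption of this sketch (the opposite sign with θ shifted by 180° describes the same line).

```python
import math

def epipolar_line_polar(F, u1):
    """Return (theta_deg, rho) of the epipolar line F u~_1 in the second image."""
    x = (u1[0], u1[1], 1.0)
    a, b, c = (sum(F[r][k] * x[k] for k in range(3)) for r in range(3))
    # The line is a*u + b*v + c = 0. Normalizing by the length of (a, b)
    # turns it into u*cos(theta) + v*sin(theta) = rho.
    norm = math.hypot(a, b)
    cos_t, sin_t, rho = a / norm, b / norm, -c / norm
    if rho < 0:  # flip the sign so that rho is non-negative
        cos_t, sin_t, rho = -cos_t, -sin_t, -rho
    return math.degrees(math.atan2(sin_t, cos_t)), rho

F = [[0, 3, -175],
     [-1, 0, -976],
     [72, 825, 7612]]
theta_deg, rho = epipolar_line_polar(F, (30, 40))
print(round(theta_deg, 2), round(rho, 2))  # 86.87 42.45
```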

Consider the problem of finding a maximum likelihood estimate of p ̄ in Calibrated AbsPose, where T = Tˆ(p ̄) and Tˆ:R^6 → SE(3), and η_j ∼N(0,σ^2_0•σ^2_j•I_2×2), and n=4. Define an expression for an unbiased estimate of σ^2_0, in terms of the reprojection error vectors at the maximum likelihood estimate, e1, ..., en, and any other variables of relevance. Specify the redundancy.

σˆ^2_0 = (1/R) ∑ (1/σ^2_j)||m_j − mˆ_j(p_ML)||^2 = (1/R) ∑ (1/σ^2_j)||e_j||^2, summed over j = 1...n. The redundancy is R = n_m − n_p, where n_m = 2n is the number of measurements and n_p = 6 is the number of parameters (R^6). R = 2n − 6 = 8 − 6 = 2
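As a sketch, the estimate can be computed like this. The function name and the residual values are invented for illustration and are not from the problem.

```python
def variance_factor(errors, sigmas, n_params=6):
    """Estimate sigma_0^2 from 2D reprojection errors e_j and known sigma_j.

    Returns (estimate, redundancy), with redundancy = 2n - n_params.
    """
    redundancy = 2 * len(errors) - n_params
    weighted_sum = sum(
        (ex * ex + ey * ey) / (s * s) for (ex, ey), s in zip(errors, sigmas)
    )
    return weighted_sum / redundancy, redundancy

# Made-up residuals for n = 4 points with sigma_j = 1.
errors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
sigmas = [1.0, 1.0, 1.0, 1.0]
sigma0_sq, redundancy = variance_factor(errors, sigmas)
print(sigma0_sq, redundancy)  # 2.0 2
```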

State the number of degrees of freedom of the following: • 2D homography • 2D affine transformation • 3D similarity transformation • 3D rigid transformation • Perspective projection.

• 2D homography - 8 • 2D affine transformation - 6 • 3D similarity transformation - 7 • 3D rigid transformation - 6 • Perspective projection - 11

