CS 189 Final Review
Select the correct statements about principal component analysis (PCA). A: PCA is a method of dimensionality reduction B: If we select only one direction (a one-dimensional subspace) to represent the data, the sample variance of the projected points is zero if and only if the original sample points are all identical C: The orthogonal projection of a point x onto a unit direction vector w is (x^T w)w D: If we select only one direction (a one-dimensional subspace) to represent the data, PCA chooses the eigenvector of the sample covariance matrix that corresponds to the least eigenvalue
a,b,c
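A minimal numpy sketch of the projection in option C and the choice in option D (the data matrix below is made up for illustration; PCA keeps the direction with the largest eigenvalue, not the least):

import numpy as np

# Hypothetical data: 5 sample points with 3 features each.
X = np.array([[ 2.0,  0.5, -1.0],
              [-1.0,  1.5,  0.0],
              [ 0.5, -2.0,  1.0],
              [-1.5,  0.5,  0.5],
              [ 0.0, -0.5, -0.5]])
Xc = X - X.mean(axis=0)                 # center the data

cov = Xc.T @ Xc / len(Xc)               # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
w = eigvecs[:, -1]                      # principal direction: LARGEST eigenvalue

proj = (Xc @ w)[:, None] * w            # orthogonal projection (x^T w) w of each point
print(np.var(Xc @ w))                   # sample variance of the projected coordinates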
Which of the following are advantages of using k-medoid clustering instead of k-means? A: k-medoids is less sensitive to outliers B: Medoids make more sense than means for non-Euclidean distance metrics C: Medoids are faster to compute than means D: The k-medoids algorithm with the Euclidean distance metric has no hyperparameters, unlike k-means
a,b
Which of the following are true for the k-nearest neighbor (k-NN) algorithm? A: k-NN can be used for both classification and regression. B: As k increases, the bias usually increases. C: The decision boundary looks smoother with smaller values of k. D: As k increases, the variance usually increases.
a,b
Which of the following are true in general for backpropagation? A: It is a dynamic programming algorithm B: Some of the derivatives cannot be fully computed until the backward pass C: The weights are initially set to zero D: Its running time grows exponentially in the number of layers
a,b
Recall that in certain cases, Newton's method can converge to the global optimum of an objective function in just one step. For which of the following objective functions and methods will Newton's method always converge in just one step? (Assume λ > 0 where regularization is used.) A: Dualized, kernelized logistic regression where the cost function is the mean of the logistic losses B: Ridge regression with an ℓ2-regularized mean squared error C: A neural network with ReLU activation functions on the hidden units and an ℓ2-regularized mean squared error D: Dualized, kernelized ridge regression with an ℓ2-regularized mean squared error
b,d
Select the correct statements about the Fiedler vector. A: The Fiedler vector is the eigenvector of the Laplacian matrix that is associated with the smallest eigenvalue B: The Fiedler vector always satisfies the balance constraint (as written as an equation) C: The Fiedler vector is a solution to the unrelaxed, NP-hard optimization problem D: The sweep cut is a spectral graph partitioning technique that tries n − 1 different cuts (in an n-vertex graph) and picks one of them.
b,d
Suppose we have a feature map Φ and a kernel function k(X_i, X_j) = Φ(X_i) · Φ(X_j). Select the true statements about kernels. A: If there are n sample points of dimension d, it takes O(nd) time to compute the kernel matrix B: The kernel trick implies we do not compute Φ(X_i) explicitly for any sample point X_i C: For every possible feature map Φ : R^d → R^D you could imagine, there is a way to compute k(X_i, X_j) in O(d) time D: Running times of kernel algorithms do not depend on the dimension D of the feature space Φ(·)
b,d
What is true of human neurology? A: The output of a unit in an artificial neural network is roughly analogous to the voltage of an axon at the synapse. B: A connection in an artificial neural network is roughly analogous to a synapse in the brain. C: The brain is made of general-purpose neurons, each of which could be trained to do any job that neurons do. D: The visual cortex has neurons primarily devoted to detecting lines or edges.
b,d
Which of the following are advantages to using AdaBoost with short trees (say, depth 4) over random forests with an equal number of tall trees (refined until the leaves are pure)? A: AdaBoost is more robust against overfitting outliers in the training data. B: AdaBoost is faster to train. C: AdaBoost is better at reducing variance than a random forest. D: AdaBoost is better at reducing bias than a random forest.
b,d
Which of the following are true of hierarchical clustering? A: The number k of clusters is a hyperparameter B: The greedy agglomerative clustering algorithm repeatedly fuses the two clusters that minimize the distance between clusters C: Complete linkage works only with the Euclidean distance metric D: During agglomerative clustering, single linkage is more sensitive to outliers than complete linkage
b,d
Which of the following classifiers are capable of achieving 100% training accuracy on the data below? The decision trees use only axis-aligned splits. A: Logistic regression B: A neural network with one hidden layer C: AdaBoost with depth-one decision trees D: AdaBoost with depth-two decision trees
b,d
Which of the following statements about AdaBoost is true? A: AdaBoost is a natural good fit with the 1-nearest neighbor classifier. B: AdaBoost can give you a rough estimate of the posterior probability that a test point is in the predicted class. C: AdaBoost trains multiple weak learners that classify test points by equal majority vote. D: AdaBoost with an ensemble of soft-margin linear SVM classifiers allows the metalearner to learn nonlinear decision boundaries.
b,d
You are training a neural network with sigmoid activation functions. You discover that you are suffering from the vanishing gradient problem: with most of the training points, most of the components of the gradients are close to zero. It is causing training to be very slow. How could you combat this problem? A: Make the network deeper (more layers). B: Make the network shallower (fewer layers). C: Initialize the weights with larger values. D: Use ReLU activations instead of sigmoids.
b,d
Consider an n × d design matrix X with labels y ∈ R^n. What is true of fitting this data with dual ridge regression with the polynomial kernel k(X_i, X_j) = (X_i^T X_j + 1)^p = Φ(X_i)^T Φ(X_j) and regularization parameter λ > 0? A: If the polynomial degree is high enough, the polynomial will fit the data exactly B: The algorithm computes Φ(X_i) and Φ(X_j) in O(d^p) time C: The algorithm solves an n × n linear system D: When n is very large, this dual algorithm is more likely to overfit than the primal algorithm with degree-p polynomial features
c
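A hedged sketch (random data, illustrative λ and degree) of why option C is correct: dual ridge regression solves an n × n system in the kernel matrix, (K + λI)a = y, and predicts with kernel evaluations alone, never forming Φ(X_i) explicitly:

import numpy as np

def poly_kernel(A, B, p=3):
    # k(x, z) = (x^T z + 1)^p, evaluated on the raw d-dimensional points
    return (A @ B.T + 1.0) ** p

n, d, lam = 6, 2, 0.1
rng = np.random.default_rng(0)
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

K = poly_kernel(X, X)                        # n x n kernel matrix
a = np.linalg.solve(K + lam * np.eye(n), y)  # dual weights from an n x n linear system

z = rng.normal(size=(1, d))                  # prediction for a test point z
y_hat = poly_kernel(z, X) @ a
print(y_hat)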
Select the correct statements about AdaBoost. A: When we go from iteration t to iteration t + 1, the weight of sample point X_i is increased if the majority of the t learners misclassify X_i. B: Unlike with decision trees, two data points that are identical (X_i = X_j) but have different labels (y_i ≠ y_j) can be classified correctly by AdaBoost. C: AdaBoost can benefit from a learner that has only 5% training accuracy. D: If you train enough learners and every learner achieves at least 51% validation accuracy, AdaBoost can always achieve 100% validation accuracy.
c
Suppose your training set for two-class classification in one dimension (d = 1; xi ∈ R) contains three sample points: point x1 = 3 with label y1 = 1, point x2 = 1 with label y2 = 1, and point x3 = −1 with label y3 = −1. What are the values of w and b given by a hard-margin SVM? A: w = 1, b = 1 B: w = 0, b = 1 C: w = 1, b = 0 D: w = ∞, b = 0
c
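One way to verify the answer: the support vectors are x_2 = 1 (label +1) and x_3 = −1 (label −1), and the hard-margin constraints are tight there, giving w·1 + b = 1 and w·(−1) + b = −1. A tiny numpy check of that 2 × 2 system:

import numpy as np

# Tight margin constraints at the support vectors x = 1 (y = +1) and x = -1 (y = -1):
#   w*( 1) + b = +1
#   w*(-1) + b = -1
A = np.array([[ 1.0, 1.0],
              [-1.0, 1.0]])
rhs = np.array([1.0, -1.0])
w, b = np.linalg.solve(A, rhs)
print(w, b)   # 1.0, 0.0 -- matches option C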
Which of the following conditions could serve as a sensible stopping condition while building a decision tree? A: Stop if you find the validation error is decreasing as the tree grows B: Don't split a treenode that has an equal number of sample points from each class C: Don't split a treenode whose depth exceeds a specified threshold D: Don't split a treenode if the split would cause a large reduction in the weighted average entropy
c
What is true of kernels? A: Kernel algorithms require you to solve a linear system with a D × D matrix, where D is the length of the lifted sample points Φ(Xi). B: Neural networks can be kernelized because there is always an optimal weight vector w that is a linear combination of the sample points. C: Fitting and evaluating a kernel ridge regression model requires access only to the kernel matrix K and not the training points Xi nor Φ(Xi). D: The kernel perceptron algorithm with a Gaussian kernel returns a classifier similar to a smoothed version of k-nearest neighbor classification.
d
What is true of regression algorithms? A: All regression problems can be solved by solving a system of linear equations. B: Least squares regression can be derived from maximum likelihood estimation if we assume that the sample points have a multivariate normal distribution. C: Adding a feature with no predictive value to least squares regression is unwise because it increases the bias of the method. D: Weighted least squares regression can be derived from maximum likelihood estimation if we assume that some points' labels are noisier than others.
d
While solving a classification problem, you use a pure, binary decision tree constructed by the standard greedy procedure we outlined in class. While your training accuracy is perfect, your validation accuracy is unexpectedly low. Which of the following, in isolation, is likely to improve your validation accuracy in most real-world applications? A: Lift your data into a quadratic feature space B: Select a random subset of the features and use only those in your tree C: Normalize each feature to have variance 1 D: Prune the tree, using validation to decide how to prune
d
What is true of k-nearest neighbor classifiers in a Euclidean space? A: A 1-NN classifier is more likely to overfit and less likely to underfit than a 10-NN classifier. B: In the special case of a two-dimensional feature space, it is possible to preprocess a data set so that nearest neighbor queries can be answered in O(log n) time. C: Sometimes finding k approximate nearest neighbors can be done much faster than finding the k exact nearest neighbors. D: The k-d tree algorithm for nearest neighbor search can speed up 1-NN, but it doesn't work well for k-NN.
a,b,c
What is true of the shatter function Π_H(n), where n is the number of training points? A: It either grows polynomially with n or is 2^n for all n, but it can never be in between those extremes. B: We use it to compute an estimate of how many training points we need to ensure we choose a hypothesis with nearly optimal risk with high probability. C: For a linear classifier in R^d, the shatter function grows polynomially with n. D: For a power set classifier, the shatter function grows polynomially with n.
a,b,c
Which of the following algorithms can learn nonlinear decision boundaries? The decision trees use only axis-aligned splits. A: A depth-five decision tree B: Quadratic discriminant analysis (QDA) C: AdaBoost with depth-one decision trees D: Perceptron
a,b,c
Which of the following are true for k-nearest neighbor classification? A: It is more likely to overfit with k = 1 (1-NN) than with k = 1,000 (1,000-NN) B: In very high dimensions, exhaustively checking every training point is often faster than any widely used competing exact k-NN query algorithm C: If you have enough training points drawn from the same distribution as the test points, k-NN can achieve accuracy almost as good as the Bayes decision rule D: The optimal running time to classify a point with k-NN grows linearly with k
a,b,c
Consider a centered design matrix X ∈ R^{n×d} and its singular value decomposition X = UDV^T. X has d principal components, which are found by principal components analysis (PCA). Which statements are correct? A: The row space of X (if not trivial) is spanned by some of the principal components B: The null space of X (if not trivial) is spanned by some of the principal components C: The principal components are all right singular vectors of X D: The matrix UD lists the principal coordinates of every sample point in X
a,b,c,d
Select the true statements about awards given for research related to this course. A: A Nobel Prize in Physiology was awarded for characterizing action potentials in squid axons. B: A Nobel Prize in Physiology was awarded for discoveries about how neurons in the visual cortex process images. C: A Turing Award was awarded for work on deep neural networks. D: A Gödel Prize was awarded for the paper on AdaBoost.
a,b,c,d
Consider the kernel perceptron algorithm on an n × d design matrix X. We choose a matrix M ∈ R^{D×d} and define the feature map Φ(x) = Mx ∈ R^D and the kernel k(x, z) = Φ(x) · Φ(z). Which of the following are always true? A: The kernel matrix is XM^T MX^T B: If the primal perceptron algorithm terminates, then the kernel perceptron algorithm terminates C: The kernel matrix is MX^T XM^T D: If the kernel perceptron algorithm terminates, then the primal perceptron algorithm terminates
a,b,d
What is true of convolutional neural networks (CNNs)? A: Learned convolutional masks can act as edge detectors or line detectors. B: A convolutional layer of connections from a layer of m hidden units to a layer of m′ hidden units has fewer weights than a fully-connected layer of connections from a layer of m hidden units to a layer of m′ hidden units. C: Let X be a 4 × 4 image that is fed through an average-pooling layer with 2 × 2 masks, producing a 2 × 2 output layer Y. Then for every unit/pixel X_ij and every unit/pixel Y_kℓ, the partial derivative ∂Y_kℓ/∂X_ij is equal to 1/4. D: Some research on CNNs made a strong enough impact to win the Alan M. Turing Award.
a,b,d
Which of the following are reasons one might choose latent factor analysis (LFA) over k-means clustering to group together n data points in R^d? A: LFA is not sensitive to how you initialize it, whereas Lloyd's algorithm is B: LFA allows us to consider points as belonging to multiple "overlapping" clusters, whereas in k-means, each point belongs to only one cluster C: In market research, LFA can distinguish different consumer types, whereas k-means cannot D: k-means requires you to guess k in advance, whereas LFA makes it easier to infer the right number of clusters after the computation
a,b,d
Which of the following are true about PCA? (Recall that the sample covariance matrix is symmetric.) A: Appending a 1 to the end of every sample point doesn't change the results of performing PCA (except that the useful PC vectors have an extra 0 at the end, and there's one extra useless component with eigenvalue zero). B: If you use PCA to project d-dimensional points down to j principal coordinates, and then you run PCA again to project those j-dimensional coordinates down to k principal coordinates, with d > j > k, you always get the same result as if you had just used PCA to project the d-dimensional points directly down to k principal coordinates. C: If you perform an arbitrary rigid rotation of the sample points as a group in feature space before performing PCA, the PC directions do not change. D: Under the same rotation as in C, the largest eigenvalue of the sample covariance matrix does not change.
a,b,d
Given the spectral graph clustering optimization problem "Find y that minimizes y^T Ly subject to y^T y = n and 1^T y = 0," which of the following optimization problems produce a vector y that leads to the same sweep cut as the optimization problem above? M is a diagonal mass matrix with different masses on the diagonal.
a,c
Select the correct statements about AdaBoost. A: "Ada" stands for "adaptive," as the metalearner adapts to the performance of its learners B: AdaBoost works best with support vector machines C: At test/classification time, AdaBoost computes a weighted sum of predictions D: AdaBoost can transform any set of classifiers to a classifier with better training accuracy
a,c
Select the true statements about the running time of k-means clustering of n sample points with d features each. A: The step that updates the cluster means µi , given fixed cluster assignments yj , can be implemented to run in at most O(nd) time. B: Increasing k always increases the running time. C: The step that updates the cluster assignments yj , given fixed cluster means µi , can be implemented to run in at most O(nkd) time. D: The k-means algorithm runs in at most O(nkd) time.
a,c
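A minimal Lloyd's-algorithm sketch (random data, purely illustrative) showing where the two correct bounds come from: the assignment step computes all n·k point-to-center distances in O(nkd) time, and the mean update is a single O(nd) pass over the points:

import numpy as np

def lloyd_step(X, mu):
    # Assignment step: O(nkd) -- distance from every point to every center.
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # n x k
    y = dists.argmin(axis=1)
    # Update step: O(nd) -- one pass over the points to average each cluster.
    new_mu = np.array([X[y == i].mean(axis=0) if np.any(y == i) else mu[i]
                       for i in range(len(mu))])
    return y, new_mu

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
mu = X[rng.choice(100, size=3, replace=False)]   # k = 3 initial centers
for _ in range(10):
    y, mu = lloyd_step(X, mu)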
We are given n training points having d features each. Select the true statements about applying a k-nearest neighbor algorithm to a test point. A: It is possible to implement the algorithm to use the ℓ1 metric as your distance function. B: The k-d tree k-nearest neighbor algorithm is fast because it computes the distance between the test point and a training point for only k training points. C: There is a k-nearest neighbor algorithm that classifies a test point in at most O(nd + n log k) time, even if k is much larger than d. D: The k-nearest neighbor algorithm can be used for classification, but not regression.
a,c
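A brute-force sketch in the spirit of options A and C (random data; np.argpartition selects the k smallest distances without a full sort, so the whole query fits comfortably within O(nd + n log k)):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, z, k):
    dists = np.abs(X_train - z).sum(axis=1)   # O(nd); the l1 metric works fine (option A)
    idx = np.argpartition(dists, k)[:k]       # k nearest indices, no O(n log n) sort
    return Counter(y_train[idx]).most_common(1)[0][0]

rng = np.random.default_rng(2)
X_train = rng.normal(size=(50, 4))
y_train = rng.integers(0, 2, size=50)
print(knn_classify(X_train, y_train, rng.normal(size=4), k=5))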
Which of the following are true of support vector machines? A: Increasing the hyperparameter C tends to decrease the training error B: The hard-margin SVM is a special case of the soft-margin SVM with the hyperparameter C set to zero C: Increasing the hyperparameter C tends to decrease the margin D: Increasing the hyperparameter C tends to decrease the sensitivity to outliers
a,c
Which of the following are true of the vanishing gradient problem for sigmoid units? A: Deeper neural networks tend to be more susceptible to vanishing gradients B: If a unit has the vanishing gradient problem for one training point, it has the problem for every training point C: Using ReLU units instead of sigmoid units can reduce this problem D: Networks with sigmoid units don't have this problem if they're trained with the cross-entropy loss function
a,c
Which of the following statements are true of the Rayleigh quotient Q(X, w) = w^T X^T X w / (w^T w) for an arbitrary matrix X ∈ R^{n×d} and vector w ∈ R^d? A: Q(X, w) ≥ 0 for all X and w. B: Q(X, w) ≤ σ_max(X)² for all X and w, where σ_max(X) is the greatest singular value of X. C: Q(X, w) is maximized when w is the eigenvector of X^T X corresponding to the greatest eigenvalue. D: Q(X, w) is maximized when w is the eigenvector of X^T X corresponding to the smallest eigenvalue.
a,c
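A quick numerical sanity check of options A–C on a random matrix: the Rayleigh quotient is nonnegative, never exceeds σ_max(X)², and attains that value at the eigenvector of X^T X with the greatest eigenvalue:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 5))

def rayleigh(X, w):
    return (w @ X.T @ X @ w) / (w @ w)

sigma_max = np.linalg.svd(X, compute_uv=False)[0]    # greatest singular value
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
w_top = eigvecs[:, -1]                               # eigenvector for the greatest eigenvalue

print(rayleigh(X, w_top), sigma_max ** 2)            # equal up to rounding
print(all(rayleigh(X, rng.normal(size=5)) <= sigma_max ** 2 + 1e-9 for _ in range(100)))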
For binary classification, which of the following statements are true of AdaBoost? A: It can be applied to neural networks B: It uses the majority vote of learners to predict the class of a data point C: The metalearner provides not just a classification, but also an estimate of the posterior probability D: The paper on AdaBoost won a Gödel Prize
a,c,d
Let X be a matrix with singular value decomposition X = UΣV^T. Which of the following are true for all X? A: rank(X) = rank(Σ). B: If all the singular values are unique, then the SVD is unique. C: The first column of V is an eigenvector of X^T X. D: The singular values and the eigenvalues of X^T X are the same.
a,c,d
Suppose we are doing ordinary least-squares linear regression with a fictitious dimension. Which of the following changes can never make the cost function's value on the training data smaller? A: Discard the fictitious dimension (i.e., don't append a 1 to every sample point). B: Append quadratic features to each sample point. C: Project the sample points onto a lower-dimensional subspace with PCA (without changing the labels) and perform regression on the projected points. D: Center the design matrix (so each feature has mean zero).
a,c,d
Which of the following statement(s) about kernels are true? A: The dimension of the lifted feature vectors Φ(·), whose inner products the kernel function computes, can be infinite. B: For any desired lifting Φ(x), we can design a kernel function k(x, z) that will evaluate Φ(x)^T Φ(z) more quickly than explicitly computing Φ(x) and Φ(z). C: The kernel trick, when it is applicable, speeds up a learning algorithm if the number of sample points is substantially less than the dimension of the (lifted) feature space. D: If the raw feature vectors x, y are of dimension 2, then k(x, y) = x_1² y_1² + x_2² y_2² is a valid kernel.
a,c,d
Write the SVD of an n × d design matrix X (with n ≥ d) as X = UDV^T. Which of the following are true? A: The components of D are all nonnegative B: If X is a real, symmetric matrix, the SVD is always the same as the eigendecomposition C: The columns of V all have unit length and are orthogonal to each other D: The columns of D are orthogonal to each other
a,c,d
For the sigmoid activation function and the ReLU activation function, which of the following are true in general? A: Both activation functions are monotonically nondecreasing B: Both functions have a monotonic first derivative C: Compared to the sigmoid, the ReLU is more computationally expensive D: The sigmoid derivative s′(γ) is quadratic in s(γ)
a,d
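For option D: the logistic sigmoid s(γ) = 1/(1 + e^(−γ)) satisfies s′(γ) = s(γ)(1 − s(γ)), which is quadratic in s(γ). A small numerical check (the evaluation points are arbitrary):

import numpy as np

def s(g):
    return 1.0 / (1.0 + np.exp(-g))

gamma = np.linspace(-4, 4, 9)
analytic = s(gamma) * (1 - s(gamma))                  # quadratic in s(gamma)
numeric = (s(gamma + 1e-6) - s(gamma - 1e-6)) / 2e-6  # central-difference derivative
print(np.allclose(analytic, numeric, atol=1e-6))      # True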
Lasso (with a fictitious dimension), random forests, and principal component analysis (PCA) all . . . A: can be used for dimensionality reduction or feature subset selection B: compute linear transformations of the input features C: are supervised learning techniques D: are translation invariant: changing the origin of the coordinate system (i.e., translating all the training and test data together) does not change the predictions or the principal component directions
a,d
Select the true statements about convolutional neural networks. A: Pooling layers (of edges) reduce the number of hidden units in the subsequent layer (of units). B: For a convolutional layer, increasing the number of filters decreases the number of hidden units in the subsequent layer. C: Each unit in a convolutional layer is connected to all units in the previous layer. D: For a convolutional layer, increasing the filter height and width decreases the number of hidden units in the subsequent layer. (Assume no padding.)
a,d
Select the true statements about the bias-variance tradeoff in random forests. A: Decreasing the number of randomly selected features we consider for splitting at each treenode tends to increase the bias. B: Increasing the number of decision trees tends to increase the variance. C: Decreasing the number of randomly selected features we consider for splitting at each treenode tends to decrease the bias. D: Increasing the number of decision trees tends to decrease the variance.
a,d
Suppose we use the k-d tree construction and query algorithms described in class to find the approximate nearest neighbor of a query point among n sample points. Select the true statements. A: It is possible to guarantee that the tree has O(log n) depth by our choice of splitting rule at each treenode B: Sometimes we permit the k-d tree to be unbalanced so we can choose splits with better information gain C: Querying the k-d tree is faster than querying a Voronoi diagram for sample points in R^2 D: Sometimes the query algorithm declines to search inside a box that's closer to the query point than the nearest neighbor it's found so far
a,d
Suppose you have a fixed dataset and want to train a decision tree for two-class classification. If you increase the maximum depth of the decision tree, which of the following are possible effects? A: The test accuracy goes up. B: The training accuracy goes down. C: The number of pure leaves is reduced. D: The time to classify a test point increases.
a,d
What is true of bagging? (Note: not in a random forest; just bagging alone.) A: Bagging without replacement is more likely to overfit than bagging with replacement. B: Bagging involves using different learning algorithms on different subsamples of the training set. C: Bagging is often used with decision trees because it helps increase their training accuracy. D: Even with bagging, sometimes decision trees still end up looking very similar.
a,d
Which of the following can lead to valid derivations of PCA? A: Fit the mean and covariance matrix of a Gaussian distribution to the sample data with maximum likelihood estimation B: Find the direction w that minimizes the sample variance of the projected data C: Find the direction w that minimizes the sum of projection distances D: Find the direction w that minimizes the sum of squares of projection distances
a,d
Which of the following would be a reasonable cost function (by the criteria discussed in lecture) for choosing splits in a decision tree for two-class classification, where p is the fraction of points in Class C in a specified treenode? A: −p log2 p − (1 − p) log2 (1 − p) B: p log2 p + (1 − p) log2 (1 − p) C: 0.5 − |p − 0.5| D: p(1 − p)
a,d
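Options A (entropy) and D (the Gini-style impurity p(1 − p)) are both strictly concave in p, zero at p ∈ {0, 1}, and maximized at p = 0.5, which is what the lecture's criteria for a split cost ask for; option C has the right endpoints but is only piecewise linear. A quick numerical check of the two accepted costs:

import numpy as np

p = np.linspace(0.01, 0.99, 99)
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # option A
gini = p * (1 - p)                                      # option D

for cost in (entropy, gini):
    # strictly concave on the grid (negative second differences), peak at p = 0.5
    print(np.all(np.diff(cost, 2) < 0), p[cost.argmax()])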
Below are some choices you might make while training a neural network. Select all of the options that will generally make it more difficult for your network to achieve high accuracy on the test data. A: Initializing the weights to all zeros B: Normalizing the training data but leaving the test data unchanged C: Using momentum D: Reshuffling the training data at the beginning of each epoch
a,b
For binary classification, which of the following statements are true of AdaBoost with decision trees? A: It usually has lower bias than a single decision tree B: It is popular because it usually works well even before any hyperparameter tuning C: To use the weight w_i of a sample point X_i when training a decision tree G, we scale the loss function L(G(X_i), y_i) by w_i D: It can train multiple decision trees in parallel
a,b
Select the correct statements about the k-nearest neighbor classifier. A: For exact nearest neighbor search in a very high-dimensional feature space, generally it is faster to use exhaustive search than to use a k-d tree. B: When using a k-d tree, approximate nearest neighbor search is sometimes substantially faster than exact nearest neighbor search. C: When using exhaustive search, approximate nearest neighbor search is sometimes substantially faster than exact nearest neighbor search. D: Order-k Voronoi diagrams are widely used in practice for k-nearest neighbor search (with k > 1) in a two-dimensional feature space.
a,b
We are fitting a linear function to data in which some features are known to be much noisier than others. We account for this by applying an asymmetric penalty in ridge regression: in the normal equations, we replace the identity matrix with a different diagonal matrix. Each entry on the diagonal is a hyperparameter. Although there are many hyperparameters, suppose that we magically find the best hyperparameter values for our validation set. How is the result likely to differ from the result of standard ridge regression? A: Lower validation error. B: It is equivalent to placing an anisotropic Gaussian prior probability on the regression weights, then finding the weights that maximize the likelihood. C: The number of weights equal to zero tends to be greater. D: The difference is usually small, as the hyperparameters will tend to have almost the same value.
a,b
Use the same training set as in the hard-margin SVM question above. What is the value of w and b given by logistic regression (with no regularization)? A: w = 1, b = 1 B: w = 0, b = 1 C: w = 1, b = 0 D: w = ∞, b = 0
d
Which of the following techniques tend to increase the likelihood that the decision trees in your random forest differ from one another? A: Using shorter decision trees B: Using deeper decision trees C: Considering only a subset of the features for splitting at a treenode D: Bagging
b,c,d
Recall that k-means clustering (Lloyd's algorithm) takes n sample points X_1, X_2, . . . , X_n and seeks to find a vector y of cluster assignments that minimizes ∑_{i=1}^{k} ∑_{j : y_j = i} ‖X_j − µ_i‖², where y_j ∈ {1, 2, . . . , k} and the cluster center µ_i = (1/n_i) ∑_{j : y_j = i} X_j is the average of the sample points assigned to cluster i. Select the true statements. A: k-means is guaranteed to find clusters that minimize its cost function, as the steps updating the cluster assignments y_j can be solved optimally, and so can the steps updating the cluster means µ_i. B: In the algorithm's output, any two clusters are separated by a linear decision boundary. C: It is not possible to kernelize the k-means algorithm, because the means µ_i are not accounted for in the kernel matrix. D: Statisticians justify k-means optimization by assuming a Gaussian prior on the means and applying maximum likelihood estimation.
b
Select the true statements about AdaBoost for two-class classification. A: We can train all T weak learners simultaneously in parallel. B: After a weak learner is trained, the weights associated with the training points it misclassifies are increased. C: The coefficient β_T assigned to weak learner G_T is 0 if the weighted error rate err_T of G_T is 1. D: AdaBoost makes no progress if it trains a weak learner only to discover that its weighted error rate is substantially greater than 0.5.
b
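A hedged sketch of the reweighting behind option B (the usual ±1-label formulation of AdaBoost, with β_t = ½ ln((1 − err_t)/err_t)): whenever the weak learner's weighted error is below 0.5, the misclassified points have their weights multiplied by a factor greater than 1 while correctly classified points are scaled down:

import numpy as np

def adaboost_reweight(w, y_true, y_pred):
    # One AdaBoost round: labels and predictions are in {-1, +1}.
    err = np.sum(w * (y_pred != y_true)) / np.sum(w)   # weighted error rate
    beta = 0.5 * np.log((1 - err) / err)               # learner coefficient
    w = w * np.exp(-beta * y_true * y_pred)            # up-weight mistakes, down-weight hits
    return w / w.sum(), beta

w = np.ones(6) / 6
y_true = np.array([+1, +1, -1, -1, +1, -1])
y_pred = np.array([+1, -1, -1, -1, +1, -1])            # one mistake (the second point)
w_new, beta = adaboost_reweight(w, y_true, y_pred)
print(w_new, beta)                                      # the misclassified point's weight rises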
We are using an ensemble of decision trees for a classification problem, with bagging. We notice that the decision trees look too similar. We would like to build a more diverse set of learners. What are possible ways to accomplish that? A: Increase the size of each random subsample B: Decrease the size of each random subsample C: Apply normalization to the design matrix first D: Sample without replacement
b
Which of the following methods will cluster the data in panel (a) of the figure below into the two clusters (red circle and blue horizontal line) shown in panel (b)? Every dot in the circle and the line is a data point. In all the options that involve hierarchical clustering, the algorithm is run until we obtain two clusters. A: Hierarchical agglomerative clustering with Euclidean distance and complete linkage B: Hierarchical agglomerative clustering with Euclidean distance and single linkage C: Hierarchical agglomerative clustering with Euclidean distance and centroid linkage D: k-means clustering with k = 2
b
Given X ∈ R^{n×d} and y ∈ R^n, consider solving the normal equations X^T X w = X^T y for w ∈ R^d. Let the SVD of X be X = UDV^T. Let X^+ denote the Moore-Penrose pseudoinverse of a matrix X. Which of the following statements are certain to be true, for any values of X and y? A: w = (X^T X)^{-1} X^T y is a solution of the normal equations. B: w = X^+ y is a solution of the normal equations. C: w = V D^+ U^T y is a solution of the normal equations. D: Every solution w to the normal equations is a solution to Xw = y.
b,c
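A numerical check of options B and C on a rank-deficient X, where (X^T X)^{-1} does not exist: the pseudoinverse solution w = X^+ y = V D^+ U^T y still satisfies the normal equations, even though Xw = y need not hold:

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))
X[:, 2] = X[:, 0] + X[:, 1]               # make X rank-deficient, so X^T X is singular
y = rng.normal(size=6)

w = np.linalg.pinv(X) @ y                  # w = X^+ y  (equivalently V D^+ U^T y)
print(np.allclose(X.T @ X @ w, X.T @ y))   # True: the normal equations are satisfied
print(np.allclose(X @ w, y))               # False in general: Xw = y may have no solution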
Select the true statements about principal components analysis (PCA). A: PCA is a clustering algorithm. B: PCA produces features (principal coordinates) that are linear combinations of the input features. C: The principal components are chosen to maximize the variance in the projected data. D: The principal coordinates are the eigenvalues of the sample covariance matrix.
b,c
Which of the following are benefits of using the backpropagation algorithm to compute gradients? A: Its running time is linear in the total number of units (neurons) in the network B: It can be applied to any arithmetic function (assuming the directed computation graph has no cycles and we evaluate the gradient at a point where the gradient exists) C: Compared to naive gradient computation, it improves the speed of each iteration of gradient descent by eliminating repeated computations of the same subproblem D: Compared to naive gradient computation, it reduces the number of iterations required to get close to a local minimum, by protecting against sigmoid unit saturation (vanishing gradients)
b,c
Which of the following are true of spectral clustering? A: The Fiedler vector is the eigenvector associated with the second largest eigenvalue of the Laplacian matrix B: Nobody knows how to find the sparsest cut in polynomial time C: The relaxed optimization problem for partitioning a graph involves minimizing the Rayleigh quotient of the Laplacian matrix and an indicator vector (subject to a constraint) D: The Laplacian matrix of a graph is invertible
b,c
Which of the following are typical benefits of ensemble learning in its basic form (that is, not AdaBoost and not with randomized decision boundaries), with all weak learners having the same learning algorithm and an equal vote? A: Ensemble learning tends to reduce the bias of your classification algorithm. B: Ensemble learning tends to reduce the variance of your classification algorithm. C: Ensemble learning can be used to avoid overfitting. D: Ensemble learning can be used to avoid underfitting.
b,c
Which of the following are typical benefits that might motivate you to preprocess data with principal components analysis (PCA) before training a classifier? A: PCA tends to reduce the bias of your classification algorithm. B: PCA can be used to avoid overfitting. C: PCA tends to reduce the variance of your classification algorithm. D: PCA can be used to avoid underfitting.
b,c
Which of the following is true about Lloyd's algorithm for k-means clustering? A: It is a supervised learning algorithm B: It never returns to a particular assignment of classes to sample points after changing to another one C: If run for long enough, it will always terminate D: No algorithm (Lloyd's or any other) can always find the optimal solution
b,c
Select the true statements about the singular value decomposition (SVD) and the eigendecomposition. A: The SVD applies only to square matrices. B: The eigendecomposition applies only to square matrices. C: The right singular vectors of a matrix X ∈ R^{n×d} are eigenvectors of X^T X. D: Consider a non-square matrix X ∈ R^{n×d} and the vector w ∈ R^d \ {0} that maximizes the Rayleigh quotient (w^T X^T X w)/(w^T w). The singular values of X are no greater than the (positive) square root of the maximum Rayleigh quotient.
b,c,d
Which of the following are true about principal components analysis (PCA)? A: The principal components are eigenvectors of the centered data matrix. B: The principal components are right singular vectors of the centered data matrix. C: The principal components are eigenvectors of the sample covariance matrix. D: The principal components are right singular vectors of the sample covariance matrix.
b,c,d
Facets of neural networks that have (reasonable, though not perfect) analogs in human brains include A: backpropagation B: linear combinations of input values C: convolutional masks applied to many patches D: edge detectors
b,d
You have n training points, each with d features. You try two different algorithms for training a decision tree. 1. The standard decision tree with single-feature (axis-aligned) splits, as covered in lecture. 2. A randomized decision tree—like a tree in a random forest, but there is only one tree and we don't use bagging. At each internal node of the tree, we randomly select m of the d features, and we choose the best split from among those m features. Suppose you train both trees permitting no treenode to have depth greater than h. Upon completion, you find that in each tree, at least half its leaves are at depth h. What are the running times for training the standard decision tree and the randomized one, respectively? (Select exactly one option.) A: Θ(dn2^h); Θ(mn2^h) B: Θ(dn2^h); Θ(dn2^h) C: Θ(dnh); Θ(mnh) D: Θ(dnh); Θ(dnh)
c
In which of the following cases should you prefer k-nearest neighbors over k-means clustering? For all four options, you have access to images X_1, X_2, . . . , X_n ∈ R^d. A: You do not have access to labels. You want to find out if any of the images are very different from the rest, i.e., are outliers. B: You have y_1, y_2, . . . , y_n telling us whether image i is a cat or a dog. You want to find out whether the distribution of cats is unimodal or bimodal. You already know that the distribution of cats has either one or two modes, but that's all you know. C: You have access to y_1, y_2, . . . , y_n telling us whether image i is a cat or a dog. You want to find out whether a new image z is a cat or a dog. D: You have y_1, y_2, . . . , y_n telling us whether image i is a cat or a dog. Given a new image z, you want to approximate the posterior probability of z being a cat and the posterior probability of z being a dog.
c,d
Let classifier A be a random forest. Let classifier B be an ensemble of decision trees with bagging—identical to classifier A except that we do not limit the splits in each treenode to a subset of the features; at every treenode, the very best split among all d features is chosen. Which statements are true? A: After training, all the trees in classifier B must be identical B: Classifier B will tend to have higher bias than Classifier A C: Classifier B will tend to have higher variance than Classifier A D: Classifier B will tend to have higher training accuracy than Classifier A
c,d
Select the true statements about decision trees. A: The information gain is always strictly positive at each split in the tree. B: Pruning is a technique used to reduce tree depth by removing nodes that don't reduce entropy enough. C: Decision trees with all their leaves pure are prone to overfitting. D: Calculating the best split among quantitative features for a treenode can be implemented so it takes asymptotically the same amount of time as calculating the best split among binary features.
c,d
Suppose our input is two-dimensional sample points, with ten non-exclusive classes those points may belong to (i.e., a point can belong to more than one class). To train a classifier, we build a fully-connected neural network (with bias terms) that has a single hidden layer of twenty units and an output layer of ten units (one for each class). Which statements apply? A: For the output units, softmax activations are more appropriate than sigmoid activations B: This network will have 240 trainable parameters C: For the hidden units, ReLU activations are more appropriate than linear activations D: This network will have 270 trainable parameters
c,d
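The 270 in option D is just weights plus biases counted layer by layer: (2 inputs + 1 bias) × 20 hidden units, plus (20 hidden units + 1 bias) × 10 outputs. A one-line check:

d_in, n_hidden, n_out = 2, 20, 10
print((d_in + 1) * n_hidden + (n_hidden + 1) * n_out)   # 60 + 210 = 270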
We want to use a decision tree to classify the training points depicted (a blue cluster surrounded by yellow points). Which of the following decision tree classifiers is capable of giving 100% accuracy on the training data with four splits or fewer? A: A standard decision tree with axis-aligned splits B: Using PCA to reduce the training data to one dimension, then applying a standard decision tree C: A decision tree with multivariate linear splits D: Appending a new feature |x1| + |x2| to each sample point x, then applying a standard decision tree
c,d
What is true of k-means clustering? A: k-means is a supervised learning algorithm. B: k-means clustering always converges to the same solution regardless of how clusters are initialized. C: Increasing k can never increase the optimal value of the k-means cost function. D: The k-medoids algorithm with the ℓ1 distance is less sensitive to outliers than standard k-means with the Euclidean distance.
c,d
Which of the following are benefits of using convolutional neural networks—as opposed to fully connected ones— for image recognition tasks? A: The ability to express a wider variety of more complicated functions of the input features B: Fewer model architecture hyperparameters for the designer to select C: Enables the network to more easily learn and recognize features regardless of their position in the image D: Typically requires less data to train well
c,d
Which of the following are true of decision trees? Assume splits are binary and are done so as to maximize the information gain. A: If there are at least two classes at a given node, there exists a split such that information gain is strictly positive B: As you go down any path from the root to a leaf, the information gain at each level is non-increasing C: The deeper the decision tree is, the more likely it is to overfit D: Random forests are less likely to overfit than decision trees
c,d
