AIL302m

Let u be a 3-dimensional vector, where specifically u = [2; 1; 8]. What is u^T?

u^T = [2 1 8] (a row vector)
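
A quick Octave check, using the vector from the answer above:

u = [2; 1; 8];   % the 3x1 column vector u
u'               % its transpose: the 1x3 row vector [2 1 8]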

Suppose you have a 1-D dataset and you want to detect outliers in the dataset. You first plot the dataset and it looks like this (plot not shown). Suppose you fit the Gaussian distribution parameters μ and σ² to this dataset. Which of the following values for μ and σ² might you get? A. μ = -3, σ² = 4 B. μ = -6, σ² = 4 C. μ = -3, σ² = 2 D. μ = -6, σ² = 2

A
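
For reference, a minimal Octave sketch of fitting the Gaussian parameters to a 1-D dataset; the data vector here is hypothetical, since the original plot is unavailable:

x = [-4 -2 -3 -3 -1 -5 -3];    % hypothetical 1-D dataset
mu = mean(x)                   % maximum-likelihood estimate of mu
sigma2 = mean((x - mu).^2)     % ML estimate of sigma^2 (divides by m, not m-1)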

Suppose I first execute the following Octave/Matlab commands: A = [1 2; 3 4; 5 6]; B = [1 2 3; 4 5 6]; Which of the following are then valid commands? Check all that apply. (Hint: A' denotes the transpose of A.) A. C = A * B; B. C = B' + A; C. C = A' * B; D. C = B + A;

AB
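
The answer follows from the matrix dimensions alone; a quick Octave check:

A = [1 2; 3 4; 5 6];   % 3x2
B = [1 2 3; 4 5 6];    % 2x3
C = A * B;             % valid: (3x2)*(2x3) gives a 3x3 result
C = B' + A;            % valid: B' is 3x2, the same size as A
% C = A' * B;          % error: inner dimensions of (2x3)*(2x3) do not agree
% C = B + A;           % error: a 2x3 and a 3x2 matrix cannot be added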

Let J(θ) = 2θ^3 + 2. Let θ = 1, and ε = 0.01. Use the formula (J(θ + ε) − J(θ − ε)) / (2ε) to numerically compute an approximation to the derivative at θ = 1. What value do you get? (When θ = 1, the true/exact derivative is dJ(θ)/dθ = 6.) A. 8 B. 6.0002 C. 6 D. 5.9998

B
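
A sketch of the computation in Octave, assuming the reconstructed J(θ) = 2θ^3 + 2 (inferred from the stated exact derivative of 6 at θ = 1):

J = @(theta) 2*theta^3 + 2;   % assumed function; dJ/dtheta = 6*theta^2
theta = 1; EPSILON = 0.01;
(J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON)   % prints 6.0002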

Suppose you are running a sliding window detector to find text in images. Your input images are 1000x1000 pixels. You will run your sliding windows detector at two scales, 10x10 and 20x20 (i.e., you will run your classifier on lots of 10x10 patches to decide if they contain text or not; and also on lots of 20x20 patches), and you will "step" your detector by 2 pixels each time. About how many times will you end up running your classifier on a single 1000x1000 test set image? A. 250,000 B. 500,000 C. 1,000,000 D. 100,000

B
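
The count is plain arithmetic over window positions; a quick Octave check:

step = 2; img = 1000;
n10 = floor((img - 10)/step) + 1;   % 10x10 window positions per dimension (496)
n20 = floor((img - 20)/step) + 1;   % 20x20 window positions per dimension (491)
n10^2 + n20^2                       % 487,097 -- about 500,000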

Suppose you have trained an anomaly detection system for fraud detection; your system flags anomalies when p(x) is less than ε, and you find on the cross-validation set that it is mis-flagging far too many good transactions as fraudulent. What should you do? A. Increase ε B. Decrease ε

B

What are the benefits of performing a ceiling analysis? Check all that apply. A. If we have a low-performing component, the ceiling analysis can tell us if that component has a high bias problem or a high variance problem. B. A ceiling analysis helps us to decide what is the most promising learning algorithm (e.g., logistic regression vs. a neural network vs. an SVM) to apply to a specific component of a machine learning pipeline. C. It gives us information about which components, if improved, are most likely to have a significant impact on the performance of the final system. D. It can help indicate that certain components of a system might not be worth a significant amount of work improving, because even if it had perfect performance its impact on the overall system may be small. E. It is a way of providing additional training data to the algorithm. F. It helps us decide on allocation of resources in terms of which component in a machine learning pipeline to spend more effort on.

CDF

Suppose you have implemented regularized logistic regression to classify what object is in an image (i.e., to do object recognition). However, when you test your hypothesis on a new set of images, you find that it makes unacceptably large errors with its predictions on the new images. However, your hypothesis performs well (has low error) on the training set. Which of the following are promising steps to take? Check all that apply. NOTE: Since the hypothesis performs well (has low error) on the training set, it is suffering from high variance (overfitting). A. Try adding polynomial features. B. Use fewer training examples. C. Try using a smaller set of features. D. Get more training examples. E. Try evaluating the hypothesis on a cross validation set rather than the test set. F. Try decreasing the regularization parameter λ. G. Try increasing the regularization parameter λ.

CDG

Which of the following are reasons for using feature scaling? A. It is necessary to prevent gradient descent from getting stuck in local optima. B. It speeds up solving for θ using the normal equation. C. It prevents the matrix X^T X (used in the normal equation) from being non-invertible (singular/degenerate). D. It speeds up gradient descent by making it require fewer iterations to get to a good solution.

D

Which of the following statements about regularization are true? Check all that apply. A. Using a very large value of λ cannot hurt the performance of your hypothesis; the only reason we do not set λ to be too large is to avoid numerical problems. B. Because logistic regression outputs values 0 ≤ h_θ(x) ≤ 1, its range of output values can only be "shrunk" slightly by regularization anyway, so regularization is generally not helpful for it. C. Because regularization causes J(θ) to no longer be convex, gradient descent may not always converge to the global minimum (when λ > 0, and when using an appropriate learning rate α). D. Using too large a value of λ can cause your hypothesis to underfit the data; this can be avoided by reducing λ.

D

Which of these is a reasonable definition of machine learning? A. Machine learning is the science of programming computers. B. Machine learning learns from labeled data. C. Machine learning is the field of allowing robots to act intelligently. D. Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

D

Let two matrices be A and B. What is A + B?

A + B = [1 -3; 1 -7]

Let two matrices be A and B. What is A - B?

A - B = [1 -7; -7 -7]

Consider the problem of predicting how well a student does in her second year of college/university, given how well she did in her first year. Specifically, let x be equal to the number of "A" grades (including A-, A and A+ grades) that a student receives in their first year of college (freshman year). We would like to predict the value of y, which we define as the number of "A" grades they get in their second year (sophomore year). A training set is given where each row is one training example (table not shown). Recall that in linear regression, our hypothesis is h_θ(x) = θ_0 + θ_1 x, and we use m to denote the number of training examples. For the training set given above (note that this training set may also be referenced in other questions in this quiz), what is the value of m? In the box below, please enter your answer (which should be a number between 0 and 10).

4

Suppose someone tells you that they ran PCA in such a way that "95% of the variance was retained." What is an equivalent statement to this?

The average squared projection error divided by the total variation in the data is at most 0.05: ((1/m) Σ ||x(i) − x_approx(i)||²) / ((1/m) Σ ||x(i)||²) ≤ 0.05
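
A hedged Octave sketch of the usual check via the singular values of the covariance matrix; the data matrix here is hypothetical:

X = randn(100, 5) * diag([5 3 2 0.5 0.1]);   % hypothetical, roughly zero-mean data
Sigma = (1/size(X, 1)) * (X' * X);           % covariance matrix
[U, S, V] = svd(Sigma);
s = diag(S);
k = find(cumsum(s)/sum(s) >= 0.95, 1)        % smallest k retaining >= 95% of variance
1 - sum(s(1:k))/sum(s)                       % the projection-error ratio, at most 0.05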

A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. In this setting, what is T? A. The weather prediction task. B. None of these. C. The probability of it correctly predicting a future date's weather. D. The process of the algorithm examining a large amount of historical weather data.

A

A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. What would be a reasonable choice for P? A. The probability of it correctly predicting a future date's weather. B. The weather prediction task. C. The process of the algorithm examining a large amount of historical weather data. D. None of these.

A

Let A be a 10x10 matrix and x be a 10-element vector. Your friend wants to compute the product Ax and writes the following code: v = zeros(10, 1); for i = 1:10 for j = 1:10 v(i) = v(i) + A(i, j) * x(j); end end How would you vectorize this code to run without any for loops? Check all that apply. A. v = A * x; B. v = Ax; C. v = x' * A; D. v = sum (A * x);

A
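
A quick Octave check that the loop and option A agree; the matrix and vector here are arbitrary examples:

A = magic(10); x = (1:10)';      % hypothetical 10x10 matrix and 10x1 vector
v = zeros(10, 1);
for i = 1:10
  for j = 1:10
    v(i) = v(i) + A(i, j) * x(j);
  end
end
isequal(v, A * x)                % prints 1: v = A * x; is the vectorization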

Suppose Theta1 is a 5x3 matrix, and Theta2 is a 4x6 matrix. You set thetaVec = [Theta1(:); Theta2(:)]. Which of the following correctly recovers Theta2? A. reshape(thetaVec(16 : 39), 4, 6) B. reshape(thetaVec(15 : 38), 4, 6) C. reshape(thetaVec(16 : 24), 4, 6) D. reshape(thetaVec(15 : 39), 4, 6) E. reshape(thetaVec(16 : 39), 6, 4)

A
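
Theta1(:) contributes elements 1-15 of thetaVec, so Theta2 occupies elements 16-39. A quick Octave check with arbitrary values:

Theta1 = reshape(1:15, 5, 3);       % 5x3, 15 elements
Theta2 = reshape(16:39, 4, 6);      % 4x6, 24 elements
thetaVec = [Theta1(:); Theta2(:)];  % 39x1 unrolled vector
isequal(reshape(thetaVec(16:39), 4, 6), Theta2)   % prints 1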

Suppose that you just joined a product team that has been developing a machine learning application, using m = 1,000 training examples. You discover that you have the option of hiring additional personnel to help collect and label data. You estimate that you would have to pay each of the labellers $10 per hour, and that each labeller can label 4 examples per minute. About how much will it cost to hire labellers to label 10,000 new training examples? A. $400 B. $600 C. $10,000 D. $250

A
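
The cost is straightforward arithmetic; an Octave check:

examples = 10000; per_minute = 4; wage = 10;   % $10/hour, 4 examples/minute
hours = examples / per_minute / 60;            % 2500 minutes, about 41.7 hours
hours * wage                                   % about $417 -- closest to $400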

Suppose we have three cluster centroids μ1, μ2 and μ3 (values not shown). Furthermore, we have a training example x(1). After a cluster assignment step, what will c(1) be? A. c(1) = 1 B. c(1) is not assigned C. c(1) = 2 D. c(1) = 3

A
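
Since the original centroid and example values are missing, here is a hedged Octave sketch of the cluster assignment step with hypothetical values:

mu = [1 2; -3 0; 4 2];     % hypothetical centroids mu_1, mu_2, mu_3 (one per row)
x1 = [0 2];                % hypothetical training example x(1)
d = sum((mu - x1).^2, 2);  % squared distance from x(1) to each centroid
[~, c1] = min(d)           % c(1) is the index of the closest centroid (here, 1)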

Suppose you are working on stock market prediction. Typically tens of millions of shares of Microsoft stock are traded (i.e., bought/sold) each day. You would like to predict the number of Microsoft shares that will be traded tomorrow. Would you treat this as a classification or a regression problem? A. Regression B. Classification

A

Suppose you are working on stock market prediction, and you would like to predict the price of a particular stock tomorrow (measured in dollars). You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem? A. Regression B. Classification

A

Suppose you are working on weather prediction, and use a learning algorithm to predict tomorrow's temperature (in degrees Centigrade/Fahrenheit).Would you treat this as a classification or a regression problem? A. Regression B. Classification

A

Suppose you have trained an SVM classifier with a Gaussian kernel, and it learned the following decision boundary on the training set (figure not shown): When you measure the SVM's performance on a cross validation set, it does poorly. Should you try increasing or decreasing C? Increasing or decreasing σ²? A. It would be reasonable to try decreasing C. It would also be reasonable to try increasing σ². B. It would be reasonable to try decreasing C. It would also be reasonable to try decreasing σ². C. It would be reasonable to try increasing C. It would also be reasonable to try decreasing σ². D. It would be reasonable to try increasing C. It would also be reasonable to try increasing σ².

A

Suppose you have trained an anomaly detection system for fraud detection; your system flags anomalies when p(x) is less than ε, and you find on the cross-validation set that it is missing many fraudulent transactions (i.e., failing to flag them as anomalies). What should you do? A. Increase ε B. Decrease ε

A

Which of the following is a reasonable way to select the number of principal components k? (Recall that n is the dimensionality of the input data and m is the number of input examples.) A. Choose k to be the smallest value so that at least 99% of the variance is retained. B. Choose k to be the smallest value so that at least 1% of the variance is retained. C. Choose k to be 99% of n (i.e., k = 0.99 ∗ n, rounded to the nearest integer). D. Choose the value of k that minimizes the approximation error. E. Choose k to be the largest value so that at least 99% of the variance is retained. F. Use the elbow method. G. Choose k to be 99% of m (i.e., k = 0.99 ∗ m, rounded to the nearest integer).

A

You are using the neural network pictured below (figure not shown) and have learned the parameters Θ(1) (used to compute a(2)) and Θ(2) (used to compute h_Θ(x) as a function of a(2)). Suppose you swap the parameters for the first hidden layer between its two units, and also swap the corresponding parameters of the output layer. How will this change the value of the output h_Θ(x)? A. It will stay the same. B. It will increase. C. It will decrease. D. Insufficient information to tell: it may increase or decrease.

A

You have the following neural network (figure not shown): You'd like to compute the activations of the hidden layer a(2). One way to do so is with a for loop in Octave (code not reproduced here). You want to have a vectorized implementation of this (i.e., one that does not use for loops). Which of the following implementations correctly compute a(2)? Check all that apply. A. z = Theta1 * x; a2 = sigmoid (z); B. a2 = sigmoid (x * Theta1); C. a2 = sigmoid (Theta2 * x); D. z = sigmoid(x); a2 = sigmoid (Theta1 * z);

A

You run gradient descent for 15 iterations with α = 0.3 and compute J(θ) after each iteration. You find that the value of J(θ) decreases slowly and is still decreasing after 15 iterations. Based on this, which of the following conclusions seems most plausible? A. Rather than use the current value of α, it'd be more promising to try a larger value of α (say α = 1.0). B. Rather than use the current value of α, it'd be more promising to try a smaller value of α (say α = 0.1). C. α = 0.3 is an effective choice of learning rate.

A

K-means is an iterative algorithm, and two of the following steps are repeatedly carried out in its inner-loop. Which two? A. Move the cluster centroids, where the centroids μ_k are updated. B. The cluster assignment step, where the parameters c(i) are updated. C. Using the elbow method to choose K. D. Feature scaling, to ensure each feature is on a comparable scale to the others. E. The cluster centroid assignment step, where each cluster centroid is assigned (by setting c(i)) to the closest training example x(i). F. Move each cluster centroid, by setting it to be equal to the closest training example x(i). G. Test on the cross-validation set. H. Randomly initialize the cluster centroids.

AB

Let A be a 4x4 matrix, and let B be the 4x2 matrix consisting of the first two columns of A. Which of the following indexing expressions gives B? Check all that apply. A. B = A(:, 1:2); B. B = A(1:4, 1:2); C. B = A(:, 0:2); D. B = A(0:4, 0:2);

AB

Say you have two column vectors v and w, each with 7 elements (i.e., they have dimensions 7x1). Consider the following code: z = 0; for i = 1:7 z = z + v(i) * w(i) end Which of the following vectorizations correctly compute z? Check all that apply. A. z = sum (v .* w); B. z = w' * v; C. z = v * w'; D. z = w * v';

AB
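
A quick Octave check of the two correct forms; the vectors here are arbitrary:

v = (1:7)'; w = (8:14)';   % hypothetical 7x1 column vectors
z1 = sum(v .* w)           % option A: element-wise multiply, then sum
z2 = w' * v                % option B: inner product, a scalar
% v * w' and w * v' instead produce 7x7 outer products, not the scalar z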

Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from. A. Given historical data of children's ages and heights, predict children's height as a function of their age. B. Given 50 articles written by male authors, and 50 articles written by female authors, learn to predict the gender of a new manuscript's author (when the identity of this author is unknown). C. Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow "similar" or "related". D. Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.

AB

Suppose you have the following training set (not reproduced here), and fit a logistic regression classifier h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2). Which of the following are true? Check all that apply. A. Adding polynomial features (e.g., instead using h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_1 x_2 + θ_5 x_2^2)) could increase how well we can fit the training data. B. At the optimal value of θ (e.g., found by fminunc), we will have J(θ) ≥ 0. C. Adding polynomial features (e.g., instead using h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_1 x_2 + θ_5 x_2^2)) would increase J(θ) because we are now summing over more terms. D. If we train gradient descent for enough iterations, for some examples x(i) in the training set it is possible to obtain h_θ(x(i)) > 1.

AB

In Octave/Matlab, many functions work on single numbers, vectors, and matrices. For example, the sin function when applied to a matrix will return a new matrix with the sin of each element. But you have to be careful, as certain functions have different behavior. Suppose you have a 7x7 matrix X. You want to compute the log of every element, the square of every element, add 1 to every element, and divide every element by 4. You will store the results in four matrices, A, B, C, D. One way to do so is the following code: for i = 1:7 for j = 1:7 A(i, j) = log(X(i, j)); B(i, j) = X(i, j) ^ 2; C(i, j) = X(i, j) + 1; D(i, j) = X(i, j) / 4; end end Which of the following correctly compute A, B, C or D? Check all that apply. A. C = X + 1; B. D = X / 4; C. A = log (X); D. B = X ^ 2;

ABC
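
Option D fails because ^ is matrix power in Octave/Matlab; a quick check:

X = magic(7);   % hypothetical 7x7 matrix
A = log(X);     % element-wise log, as in the loop
C = X + 1;      % element-wise addition
D = X / 4;      % element-wise division by a scalar
B = X .^ 2;     % element-wise square: what the loop computes
% B = X ^ 2 would instead compute the matrix product X * X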

Which of the following statements about map-reduce are true? Check all that apply. A. When using map-reduce with gradient descent, we usually use a single machine that accumulates the gradients from each of the map-reduce machines, in order to compute the parameter update for that iteration. B. Because of network latency and other overhead associated with map-reduce, if we run map-reduce using N computers, we might get less than an N-fold speedup compared to using 1 computer. C. If you have only 1 computer with 1 computing core, then map-reduce is unlikely to help. D. If we run map-reduce using N computers, then we will always get at least an N-fold speedup compared to using 1 computer. E. Running map-reduce over N computers requires that we split the training set into pieces. F. In order to parallelize a learning algorithm using map-reduce, the first step is to figure out how to express the main work done by the algorithm as computing sums of functions of training examples.

ABCF

Suppose you have a PhotoOCR system, where you have the following pipeline (figure not shown): You have decided to perform a ceiling analysis on this system, and find the following (table not shown): Which of the following statements are true? A. There is a large gain in performance possible in improving the character recognition system. B. Performing the ceiling analysis shown here requires that we have ground-truth labels for the text detection, character segmentation and the character recognition systems. C. The potential benefit to having a significantly improved text detection system is small, and thus it may not be worth significant effort trying to improve it. D. The least promising component to work on is the character recognition system, since it is already obtaining 100% accuracy. E. The most promising component to work on is the text detection system, since it has the lowest performance (72%) and thus the biggest potential gain. F. We should dedicate significant effort to collecting additional training data for the text detection system. G. If the text detection system was trained using gradient descent, running gradient descent for more iterations is unlikely to help much. H. If we conclude that the character recognition's errors are mostly due to the character recognition system having high variance, then it may be worth significant effort obtaining additional training data for character recognition.

ABCGH

Suppose you are working on a spam classifier, where spam emails are positive examples (y = 1) and non-spam emails are negative examples (y = 0). You have a training set of emails in which 99% of the emails are non-spam and the other 1% is spam. Which of the following statements are true? Check all that apply. A. A good classifier should have both a high precision and high recall on the cross validation set. B. If you always predict non-spam (output y=0), your classifier will have an accuracy of 99%. C. If you always predict non-spam (output y=0), your classifier will have 99% accuracy on the training set, but it will do much worse on the cross validation set because it has overfit the training data. D. If you always predict non-spam (output y=0), your classifier will have 99% accuracy on the training set, and it will likely perform similarly on the cross validation set.

ABD

Which of the following statements are true? Select all that apply. A. On every iteration of K-means, the cost function (the distortion function) should either stay the same or decrease; in particular, it should not increase. B. A good way to initialize K-means is to select K (distinct) examples from the training set and set the cluster centroids equal to these selected examples. C. Once an example has been assigned to a particular centroid, it will never be reassigned to another different centroid. D. For some datasets, the "right" or "correct" value of K (the number of clusters) can be ambiguous, and hard even for a human expert looking carefully at the data to decide. E. The standard way of initializing K-means is setting μ_1 = ... = μ_k to be equal to a vector of zeros. F. If we are worried about K-means getting stuck in bad local optima, one way to ameliorate (reduce) this problem is if we try using multiple random initializations. G. Since K-Means is an unsupervised learning algorithm, it cannot overfit the data, and thus it is always better to have as large a number of clusters as is computationally feasible.

ABDF

Suppose you have two matrices A and B, where A is 5x3 and B is 3x5. Their product is C = AB, a 5x5 matrix. Furthermore, you have a 5x5 matrix R where every entry is 0 or 1. You want to find the sum of all elements C(i, j) for which the corresponding R(i, j) is 1, and ignore all elements C(i, j) where R(i, j) = 0. One way to do so is the following code: total = 0; for i = 1:5 for j = 1:5 if R(i, j) == 1 total = total + C(i, j); end end end Which of the following pieces of Octave code will also correctly compute this total? Check all that apply. Assume all options are in code. A. total = sum(sum((A * B) .* R)) B. C = (A * B) .* R; total = sum(C(:)); C. total = sum(sum((A * B) * R)); D. C = (A * B) * R; total = sum(C(:)); E. C = A * B; total = sum(sum(C(R == 1)));

ABE
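
A quick Octave check that the correct forms agree; the matrices here are arbitrary:

A = randn(5, 3); B = randn(3, 5);
R = rand(5, 5) > 0.5;         % hypothetical 0/1 mask
C = A * B;
t1 = sum(sum(C .* R));        % options A and B
t2 = sum(sum(C(R == 1)));     % option E
abs(t1 - t2) < 1e-10          % prints 1: the totals agree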

Assuming that you have a very large training set, which of the following algorithms do you think can be parallelized using map-reduce and splitting the training set across different machines? Check all that apply. A. A neural network trained using batch gradient descent. B. Linear regression trained using batch gradient descent. C. An online learning setting, where you repeatedly get a single example (x, y), and want to learn from that single example before moving on. D. Logistic regression trained using stochastic gradient descent. E. Computing the average of all the features in your training set (say in order to perform mean normalization). F. Logistic regression trained using batch gradient descent.

ABEF

Suppose a massive dataset is available for training a learning algorithm. Training on a lot of data is likely to give good performance when two of the following conditions hold true. Which are the two? A. We train a learning algorithm with a large number of parameters (that is able to learn/represent fairly complex functions). B. The features x contain sufficient information to predict y accurately. (For example, one way to verify this is if a human expert on the domain can confidently predict y when given only x.) C. When we are willing to include high order polynomial features of x (such as x_1^2, x_2^2, x_1 x_2, etc.). D. We train a learning algorithm with a small number of parameters (that is thus unlikely to overfit). E. We train a model that does not use regularization. F. The classes are not too skewed. G. Our learning algorithm is able to represent fairly complex functions (for example, if we train a neural network or other model with a large number of parameters). H. A human expert on the application domain can confidently predict y when given only the features x (or more generally we have some way to be confident that x contains sufficient information to predict y accurately).

ABGH

Suppose you are working on stock market prediction. You would like to predict whether or not a certain company will declare bankruptcy within the next 7 days (by training on data of similar companies that had previously been at risk of bankruptcy). Would you treat this as a classification or a regression problem? A. Regression B. Classification

B

Which of the following statements are true? Check all that apply. A. For computational efficiency, after we have performed gradient checking to verify that our backpropagation code is correct, we usually disable gradient checking before using backpropagation to train the network. B. Computing the gradient of the cost function in a neural network has the same efficiency when we use backpropagation or when we numerically compute it using the method of gradient checking. C. Using gradient checking can help verify if one's implementation of backpropagation is bug-free. D. Gradient checking is useful if we are using one of the advanced optimization methods (such as in fminunc) as our optimization algorithm. However, it serves little purpose if we are using gradient descent.

AC

Which of the following statements are true? Check all that apply. A. If we are training a neural network using gradient descent, one reasonable "debugging" step to make sure it is working is to plot J(Θ) as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration. B. Suppose you have a three layer network with parameters Θ(1) (controlling the function mapping from the inputs to the hidden units) and Θ(2) (controlling the mapping from the hidden units to the outputs). If we set all the elements of Θ(1) to be 0, and all the elements of Θ(2) to be 1, then this suffices for symmetry breaking, since the neurons are no longer all computing the same function of the input. C. Suppose you are training a neural network using gradient descent. Depending on your random initialization, your algorithm may converge to different local optima (i.e., if you run the algorithm twice with different random initializations, gradient descent may converge to two different solutions). D. If we initialize all the parameters of a neural network to ones instead of zeros, this will suffice for the purpose of "symmetry breaking" because the parameters are no longer symmetrically equal to zero.

AC

Which of the following statements are true? Check all that apply. A. The one-vs-all technique allows you to use logistic regression for problems in which each y(i) comes from a fixed, discrete set of values. B. For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc). C. The cost function J(θ) for logistic regression trained with m ≥ 1 examples is always greater than or equal to zero. D. Since we train one classifier when there are two classes, we train two classifiers when there are three classes (and we do one-vs-all classification).

AC

Which of the following statements are true? Check all that apply. A. A model with more parameters is more prone to overfitting and typically has higher variance. B. If the training and test errors are about the same, adding more features will not help improve the results. C. If a learning algorithm is suffering from high bias, only adding more training examples may not improve the test error significantly. D. If a learning algorithm is suffering from high variance, adding more training examples is likely to improve the test error. E. When debugging learning algorithms, it is useful to plot a learning curve to understand if there is a high bias or high variance problem. F. If a neural network has much lower training error than test error, then adding more layers will help bring the test error down because we can fit the test set better.

ACDE

For which of the following tasks might K-means clustering be a suitable algorithm? Select all that apply. A. Given a set of news articles from many different news websites, find out what are the main topics covered. B. Given historical weather records, predict if tomorrow's weather will be sunny or rainy. C. From the user usage patterns on a website, figure out what different groups of users exist. D. Given many emails, you want to determine if they are Spam or Non-Spam emails. E. Given a database of information about your users, automatically group them into different market segments. F. Given sales data from a large number of products in a supermarket, figure out which products tend to form coherent groups (say are frequently purchased together) and thus should be put on the same shelf. G. Given sales data from a large number of products in a supermarket, estimate future sales for each of these products.

ACEF

Which of the following are true? Check all that apply. A. If you do not have any labeled data (or if all your data has label y = 0), then it is still possible to learn p(x), but it may be harder to evaluate the system or choose a good value of ϵ. B. If you are developing an anomaly detection system, there is no way to make use of labeled data to improve your system. C. When choosing features for an anomaly detection system, it is a good idea to look for features that take on unusually large or small values for (mainly the) anomalous examples. D. If you have a large labeled training set with many positive examples and many negative examples, the anomaly detection algorithm will likely perform just as well as a supervised learning algorithm such as an SVM. E. In a typical anomaly detection setting, we have a large number of anomalous examples, and a relatively small number of normal/non-anomalous examples. F. When developing an anomaly detection system, it is often useful to select an appropriate numerical performance metric to evaluate the effectiveness of the learning algorithm. G. In anomaly detection, we fit a model p(x) to a set of negative (y = 0) examples, without using any positive examples we may have collected of previously observed anomalies.

ACFG

Which of the following are true of collaborative filtering systems? Check all that apply. A. For collaborative filtering, it is possible to use one of the advanced optimization algorithms (L-BFGS/conjugate gradient/etc.) to solve for both the x(i)'s and θ(j)'s simultaneously. B. Suppose you are writing a recommender system to predict a user's book preferences. In order to build such a system, you need that user to rate all the other books in your training set. C. Even if each user has rated only a small fraction of all of your products (so r(i, j) = 0 for the vast majority of (i, j) pairs), you can still build a recommender system by using collaborative filtering. D. For collaborative filtering, the optimization algorithm you should use is gradient descent. In particular, you cannot use more advanced optimization algorithms (L-BFGS/conjugate gradient/etc.) for collaborative filtering, since you have to solve for both the x(i)'s and θ(j)'s simultaneously. E. To use collaborative filtering, you need to manually design a feature vector for every item (e.g., movie) in your dataset, that describes that item's most important properties. F. Recall that the cost function for the content-based recommendation system is J(θ) = (1/2) Σ_j Σ_{i: r(i,j)=1} ((θ(j))^T x(i) − y(i,j))^2 + (λ/2) Σ_j Σ_k (θ_k(j))^2. Suppose there is only one user and he has rated every movie in the training set. This implies that n_u = 1 and r(i, j) = 1 for every i, j. In this case, the cost function J(θ) is equivalent to the one used for regularized linear regression. G. When using gradient descent to train a collaborative filtering system, it is okay to initialize all the parameters (x(i) and θ(j)) to zero. H. If you have a dataset of users' ratings on some products, you can use these to predict one user's preferences on products he has not rated.

ACFH

Let f be some function so that f(θ_0, θ_1) outputs a number. For this problem, f is some arbitrary/unknown smooth function (not necessarily the cost function of linear regression, so f may have local optima). Suppose we use gradient descent to try to minimize f(θ_0, θ_1) as a function of θ_0 and θ_1. Which of the following statements are true? (Check all that apply.) A. If θ_0 and θ_1 are initialized at the global minimum, then one iteration will not change their values. B. Setting the learning rate α to be very small is not harmful, and can only speed up the convergence of gradient descent. C. No matter how θ_0 and θ_1 are initialized, so long as α is sufficiently small, we can safely expect gradient descent to converge to the same solution. D. If the first few iterations of gradient descent cause f(θ_0, θ_1) to increase rather than decrease, then the most likely cause is that we have set the learning rate α to too large a value.

AD

The SVM solves min_θ C Σ_i [y(i) cost_1(θ^T x(i)) + (1 − y(i)) cost_0(θ^T x(i))] + (1/2) Σ_j θ_j^2, where the functions cost_0(z) and cost_1(z) look like this (figure not shown). The first term in the objective is: C Σ_i [y(i) cost_1(θ^T x(i)) + (1 − y(i)) cost_0(θ^T x(i))]. This first term will be zero if two of the following four conditions hold true. Which are the two conditions that would guarantee that this term equals zero? A. For every example with y(i) = 0, we have that θ^T x(i) ≤ −1. B. For every example with y(i) = 1, we have that θ^T x(i) ≥ 0. C. For every example with y(i) = 0, we have that θ^T x(i) ≤ 0. D. For every example with y(i) = 1, we have that θ^T x(i) ≥ 1.

AD

Which of the following statements are true? Check all that apply. A. Any logical function over binary-valued (0 or 1) inputs x1 and x2 can be (approximately) represented using some neural network. B. Suppose you have a multi-class classification problem with three classes, trained with a 3 layer network. Let (h_Θ(x))_1 be the activation of the first output unit, and similarly (h_Θ(x))_2 and (h_Θ(x))_3. Then for any input x, it must be the case that (h_Θ(x))_1 + (h_Θ(x))_2 + (h_Θ(x))_3 = 1. C. A two layer (one input layer, one output layer; no hidden layer) neural network can represent the XOR function. D. The activation values of the hidden units in a neural network, with the sigmoid activation function applied at every layer, are always in the range (0, 1).

AD

For which of the following problems would anomaly detection be a suitable algorithm? A. From a large set of primary care patient records, identify individuals who might have unusual health conditions. B. Given data from credit card transactions, classify each transaction according to type of purchase (for example: food, transportation, clothing). C. Given an image of a face, determine whether or not it is the face of a particular famous individual. D. Given a dataset of credit card transactions, identify unusual transactions to flag them as possibly fraudulent. E. In a computer chip fabrication plant, identify microchips that might be defective.

ADE

Suppose you are building an object classifier, that takes as input an image, and recognizes that image as either containing a car (y = 1) or not (y = 0). For example, here are a positive example and a negative example: After carefully analyzing the performance of your algorithm, you conclude that you need more positive (y = 1) training examples. Which of the following might be a good way to get additional positive examples? A. Mirror your training images across the vertical axis (so that a left-facing car now becomes a right-facing one). B. Take a few images from your training set, and add random, Gaussian noise to every pixel. C. Take a training example and set a random subset of its pixel to 0 to generate a new example. D. Select two car images and average them to make a third example. E. Apply translations, distortions, and rotations to the images already in your training set. F. Make two copies of each image in the training set; this immediately doubles your training set size.

AE

You run a movie empire, and want to build a movie recommendation system based on collaborative filtering. There were three popular review websites (which we'll call A, B and C) where users go to rate movies, and you have just acquired all three companies that run these websites. You'd like to merge the three companies' datasets together to build a single/unified system. On website A, users rank a movie as having 1 through 5 stars. On website B, users rank on a scale of 1 - 10, and decimal values (e.g., 7.5) are allowed. On website C, the ratings are from 1 to 100. You also have enough information to identify users/movies on one website with users/movies on a different website. Which of the following statements is true? A. You can merge the three datasets into one, but you should first normalize each dataset's ratings (say rescale each dataset's ratings to a 0-1 range). B. You can combine all three training sets into one as long as you perform mean normalization and feature scaling after you merge the data. C. Assuming that there is at least one movie/user in one database that doesn't also appear in a second database, there is no sound way to merge the datasets, because of the missing data. D. It is not possible to combine these websites' data. You must build three separate recommendation systems. E. You can merge the three datasets into one, but you should first normalize each dataset separately by subtracting the mean and then dividing by (max - min), where the max and min are (5-1), (10-1) or (100-1) for the three websites respectively.

AE

Which of the following statements are true? Check all that apply. A. Using a very large training set makes it unlikely for the model to overfit the training data. B. After training a logistic regression classifier, you must use 0.5 as your threshold for predicting whether an example is positive or negative. C. If your model is underfitting the training set, then obtaining more data is likely to help. D. It is a good idea to spend a lot of time collecting a large amount of data before building your first version of a learning algorithm. E. On skewed datasets (e.g., when there are more positive examples than negative examples), accuracy is not a good measure of performance and you should instead use the F1 score based on the precision and recall. F. The "error analysis" process of manually examining the examples which your algorithm got wrong can help suggest what are good steps to take (e.g., developing new features) to improve your algorithm's performance.

AEF

Suppose you are training a logistic regression classifier using stochastic gradient descent. You find that the cost (say, cost(θ, (x(i), y(i))), averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. Which of the following changes are likely to help? A. Try using a smaller learning rate α. B. Try averaging the cost over a larger number of examples (say 1000 examples instead of 500) in the plot. C. This is not an issue, as we expect this to occur with stochastic gradient descent. D. Try using a larger learning rate α. E. Use fewer examples from your training set. F. Try halving (decreasing) the learning rate α, and see if that causes the cost to now consistently go down; and if not, keep halving it until it does.

AF

Suppose you are working on weather prediction, and your weather station makes one of three predictions for each day's weather: Sunny, Cloudy or Rainy. You'd like to use a learning algorithm to predict tomorrow's weather. Would you treat this as a classification or a regression problem? A. Regression B. Classification

B

Suppose you have a dataset with m = 1000000 examples and n = 200000 features for each example. You want to use multivariate linear regression to fit the parameters θ to our data. Should you prefer gradient descent or the normal equation? A. Gradient descent, since it will always converge to the optimal θ. B. Gradient descent, since (X^T X)^(-1) will be very slow to compute in the normal equation. C. The normal equation, since it provides an efficient way to directly find the solution. D. The normal equation, since gradient descent might be unable to find the optimal θ.

B

Suppose that you have trained a logistic regression classifier, and it outputs on a new example x a prediction h_θ(x) = 0.2. This means (check all that apply): A. Our estimate for P(y = 1|x; θ) is 0.8. B. Our estimate for P(y = 0|x; θ) is 0.8. C. Our estimate for P(y = 1|x; θ) is 0.2. D. Our estimate for P(y = 0|x; θ) is 0.2.

BC

Suppose you have a dataset with n = 10 features and m = 5000 examples. After training your logistic regression classifier with gradient descent, you find that it has underfit the training set and does not achieve the desired performance on the training or cross validation sets. Which of the following might be promising steps to take? Check all that apply. A. Increase the regularization parameter λ. B. Use an SVM with a Gaussian Kernel. C. Create / add new polynomial features. D. Use an SVM with a linear kernel, without introducing new features. E. Try using a neural network with a large number of hidden units. F. Reduce the number of examples in the training set.

BCE

Which of the following statements are true? Check all that apply. A. Given only z(i) and U_reduce, there is no way to reconstruct any reasonable approximation to x(i). B. Even if all the input features are on very similar scales, we should still perform mean normalization (so that each feature has zero mean) before running PCA. C. Given input data x ∈ ℝⁿ, it makes sense to run PCA only with values of k that satisfy k ≤ n. (In particular, running it with k = n is possible but not helpful, and k > n does not make sense.) D. PCA is susceptible to local optima; trying multiple random initializations may help. E. PCA can be used only to reduce the dimensionality of data by 1 (such as 3D to 2D, or 2D to 1D). F. Given an input x ∈ ℝⁿ, PCA compresses it to a lower-dimensional vector z ∈ ℝᵏ. G. If the input features are on very different scales, it is a good idea to perform feature scaling before applying PCA. H. Feature scaling is not useful for PCA, since the eigenvector calculation (such as using Octave's svd(Sigma) routine) takes care of this automatically.

BCFG

Let A and B be 3x3 (square) matrices. Which of the following must necessarily hold true? Check all that apply. A. A*B*A = B*A*B B. If A is the 3x3 identity matrix, then A*B = B*A C. A*B = B*A D. A+B = B+A

BD

Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from. A. Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow "similar" or "related". B. Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years. C. Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail. D. Examine the statistics of two football teams, and predict which team will win tomorrow's match (given historical data of teams' wins/losses to learn from).

BD

Which of the following are recommended applications of PCA? Select all that apply. A. To get more features to feed into a learning algorithm. B. Data compression: Reduce the dimension of your data, so that it takes up less memory / disk space. C. Preventing overfitting: Reduce the number of features (in a supervised learning problem), so that there are fewer parameters to learn. D. Data visualization: Reduce data to 2D (or 3D) so that it can be plotted. E. Data compression: Reduce the dimension of your input data x(i), which will be used in a supervised learning algorithm (i.e., use PCA so that your supervised learning algorithm runs faster). F. As a replacement for (or alternative to) linear regression: For most learning applications, PCA and linear regression give substantially similar results. G. Data visualization: To take 2D data, and find a different way of plotting it in 2D (using k = 2).

BDE

Which of the following statements about stochastic gradient descent are true? Check all that apply. A. Suppose you are using stochastic gradient descent to train a linear regression classifier. The cost function is guaranteed to decrease after every iteration of the stochastic gradient descent algorithm. B. One of the advantages of stochastic gradient descent is that it can start progress in improving the parameters θ after looking at just a single training example; in contrast, batch gradient descent needs to take a pass over the entire training set before it starts to make progress in improving the parameters' values. C. Stochastic gradient descent is particularly well suited to problems with small training set sizes; in these problems, stochastic gradient descent is often preferred to batch gradient descent. D. In each iteration of stochastic gradient descent, the algorithm needs to examine/use only one training example. E. Before running stochastic gradient descent, you should randomly shuffle (reorder) the training set. F. In order to make sure stochastic gradient descent is converging, we typically compute J_train(θ) after each iteration (and plot it) in order to make sure that the cost function is generally decreasing. G. You can use the method of numerical gradient checking to verify that your stochastic gradient descent implementation is bug-free. (One step of stochastic gradient descent computes the partial derivative ∂/∂θ_j cost(θ, (x(i), y(i))).) H. If you have a huge training set, then stochastic gradient descent may be much faster than batch gradient descent.

BDEGH

Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from. A. Given data on how 1000 medical patients respond to an experimental drug (such as effectiveness of the treatment, side effects, etc.), discover whether there are different categories or "types" of patients in terms of how they respond to the drug, and if so what these categories are. B. Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments. C. Have a computer examine an audio clip of a piece of music, and classify whether or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical instruments (and no vocals). D. Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years.

CD

Which of the following statements about online learning are true? Check all that apply. A. One of the disadvantages of online learning is that it requires a large amount of computer memory/disk space to store all the training examples we have seen. B. In the approach to online learning discussed in the lecture video, we repeatedly get a single training example, take one step of stochastic gradient descent using that example, and then move on to the next example. C. One of the advantages of online learning is that there is no need to pick a learning rate α. D. When using online learning, in each step we get a new example (x, y), perform one step of (essentially stochastic gradient descent) learning on that example, and then discard that example and move on to the next. E. When using online learning, you must save every new training example you get, as you will need to reuse past examples to re-train the model even after you get new training examples in the future. F. Online learning algorithms are most appropriate when we have a fixed training set of size m that we want to train on. G. One of the advantages of online learning is that if the function we're modeling changes over time (such as if we are modeling the probability of users clicking on different URLs, and user tastes/preferences are changing over time), the online learning algorithm will automatically adapt to these changes. H. Online learning algorithms are usually best suited to problems where we have a continuous/non-stop stream of data that we want to learn from.

BDGH

Suppose you have implemented regularized logistic regression to predict what items customers will purchase on a web shopping site. However, when you test your hypothesis on a new set of customers, you find that it makes unacceptably large errors in its predictions. Furthermore, the hypothesis performs poorly on the training set. Which of the following might be promising steps to take? Check all that apply. NOTE: Since the hypothesis performs poorly on the training set, it is suffering from high bias (underfitting). A. Try increasing the regularization parameter λ. B. Try decreasing the regularization parameter λ. C. Try evaluating the hypothesis on a cross validation set rather than the test set. D. Use fewer training examples. E. Try adding polynomial features. F. Try using a smaller set of features. G. Try to obtain and use additional features.

BEG

Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below (table not shown). In the column on the right, "kJ/mol" is the unit measuring the amount of energy released. You would like to use linear regression (h_θ(x) = θ_0 + θ_1 x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for θ_0 and θ_1? You should be able to select the right answer without actually implementing linear regression. A. θ_0 = −569.6, θ_1 = 530.9 B. θ_0 = −1780.0, θ_1 = −530.9 C. θ_0 = −569.6, θ_1 = −530.9 D. θ_0 = −1780.0, θ_1 = 530.9

C

Suppose you have m = 23 training examples with n = 5 features (excluding the additional all-ones feature for the intercept term, which you should add). The normal equation is θ = (X^T X)^(-1) X^T y. For the given values of m and n, what are the dimensions of θ, X, and y in this equation? A. X is 23 × 5, y is 23 × 1, θ is 5 × 5 B. X is 23 × 6, y is 23 × 6, θ is 6 × 6 C. X is 23 × 6, y is 23 × 1, θ is 6 × 1 D. X is 23 × 5, y is 23 × 1, θ is 5 × 1

C
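
A quick Octave check of the dimensions in answer C; the data here is random:

m = 23; n = 5;
X = [ones(m, 1), randn(m, n)];   % 23x6 after adding the all-ones intercept column
y = randn(m, 1);                 % 23x1
theta = pinv(X' * X) * X' * y;   % the normal equation
size(theta)                      % 6 1, i.e., theta is 6x1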

Suppose you have trained an SVM classifier with a Gaussian kernel, and it learned the following decision boundary on the training set (figure not shown): You suspect that the SVM is underfitting your dataset. Should you try increasing or decreasing C? Increasing or decreasing σ²? A. It would be reasonable to try decreasing C. It would also be reasonable to try increasing σ². B. It would be reasonable to try decreasing C. It would also be reasonable to try decreasing σ². C. It would be reasonable to try increasing C. It would also be reasonable to try decreasing σ². D. It would be reasonable to try increasing C. It would also be reasonable to try increasing σ².

C

Which of the following statements about regularization are true? Check all that apply. A. Using a very large value of λ cannot hurt the performance of your hypothesis; the only reason we do not set λ to be too large is to avoid numerical problems. B. Because logistic regression outputs values 0 ≤ h_θ(x) ≤ 1, its range of output values can only be "shrunk" slightly by regularization anyway, so regularization is generally not helpful for it. C. Consider a classification problem. Adding regularization may cause your classifier to incorrectly classify some training examples (which it had correctly classified when not using regularization, i.e. when λ = 0). D. Using too large a value of λ can cause your hypothesis to overfit the data; this can be avoided by reducing λ.

C

You are training a classification model with logistic regression. Which of the following statements are true? Check all that apply. A. Introducing regularization to the model always results in equal or better performance on the training set. B. Introducing regularization to the model always results in equal or better performance on examples not in the training set. C. Adding a new feature to the model always results in equal or better performance on the training set. D. Adding many new features to the model helps prevent overfitting on the training set.

C

You run gradient descent for 15 iterations with α = 0.3 and compute J(θ) after each iteration. You find that the value of J(θ) decreases quickly then levels off. Based on this, which of the following conclusions seems most plausible? A. Rather than use the current value of α, it'd be more promising to try a larger value of α (say α = 1.0). B. Rather than use the current value of α, it'd be more promising to try a smaller value of α (say α = 0.1). C. α = 0.3 is an effective choice of learning rate.

C

In the given figure (not reproduced here), the cost function J(θ_0, θ_1) has been plotted against θ_0 and θ_1, as shown in 'Plot 2'. The contour plot for the same cost function is given in 'Plot 1'. Based on the figure, choose the correct options (check all that apply). A. If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point A, as the value of the cost function is maximum at point A. B. If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point C, as the value of the cost function is minimum at point C. C. Point P (the global minimum of Plot 2) corresponds to point A of Plot 1. D. If we start from point B, gradient descent with a well-chosen learning rate will eventually help us reach at or near point A, as the value of the cost function is minimum at A. E. Point P (the global minimum of Plot 2) corresponds to point C of Plot 1.

CD

Which of the following statements are true? Check all that apply. A. Suppose you are using SVMs to do multi-class classification and would like to use the one-vs-all approach. If you have K different classes, you will train K-1 different SVMs. B. If the data are linearly separable, an SVM using a linear kernel will return the same parameters θ regardless of the chosen value of C (i.e., the resulting value of θ does not depend on C). C. It is important to perform feature normalization before using the Gaussian kernel. D. The maximum value of the Gaussian kernel (i.e., sim(x, l(1))) is 1. E. Suppose you have 2D input examples (i.e., x(i) ∈ ℝ²). The decision boundary of the SVM (with the linear kernel) is a straight line. F. If you are training multi-class SVMs with the one-vs-all method, it is not possible to use a kernel.

CDE

Which of the following statements are true? Check all that apply. A. Suppose you are training a regularized linear regression model. The recommended way to choose what value of regularization parameter λ to use is to choose the value of λ which gives the lowest test set error. B. Suppose you are training a regularized linear regression model. The recommended way to choose what value of regularization parameter λ to use is to choose the value of λ which gives the lowest training set error. C. The performance of a learning algorithm on the training set will typically be better than its performance on the test set. D. Suppose you are training a regularized linear regression model. The recommended way to choose what value of regularization parameter λ to use is to choose the value of λ which gives the lowest cross validation error. E. A typical split of a dataset into training, validation and test sets might be 60% training set, 20% validation set, and 20% test set. F. Suppose you are training a logistic regression classifier using polynomial features and want to select what degree polynomial (denoted d in the lecture videos) to use. After training the classifier on the entire training set, you decide to use a subset of the training examples as a validation set. This will work just as well as having a validation set that is separate (disjoint) from the training set. G. It is okay to use data from the test set to choose the regularization parameter λ, but not the model parameters (θ). H. Suppose you are using linear regression to predict housing prices, and your dataset comes sorted in order of increasing sizes of houses. It is then important to randomly shuffle the dataset before splitting it into training, validation and test sets, so that we don't have all the smallest houses going into the training set, and all the largest houses going into the test set.

CDEH

In which of the following situations will a collaborative filtering system be the most appropriate learning algorithm (compared to linear or logistic regression)? A. You manage an online bookstore and you have the book ratings from many users. You want to learn to predict the expected sales volume (number of books sold) as a function of the average rating of a book. B. You're an artist and hand-paint portraits for your clients. Each client gets a different portrait (of themselves) and gives you 1-5 star rating feedback, and each client purchases at most 1 portrait. You'd like to predict what rating your next customer will give you. C. You run an online bookstore and collect the ratings of many users. You want to use this to identify what books are "similar" to each other (i.e., if one user likes a certain book, what are other books that she might also like?) D. You own a clothing store that sells many styles and brands of jeans. You have collected reviews of the different styles and brands from frequent shoppers, and you want to use these reviews to offer those shoppers discounts on the jeans you think they are most likely to purchase. E. You've written a piece of software that has downloaded news articles from many news websites. In your system, you also keep track of which articles you personally like vs. dislike, and the system also stores away features of these articles (e.g., word counts, name of author). Using this information, you want to build a system to try to find additional new articles that you personally will like. F. You run an online news aggregator, and for every user, you know some subset of articles that the user likes and some different subset that the user dislikes. You'd want to use this to find other articles that the user likes. G. You manage an online bookstore and you have the book ratings from many users. For each user, you want to recommend other books she will enjoy, based on her own ratings and the ratings of other users.

CDFG

Suppose you have trained a logistic regression classifier which is outputting h_θ(x). Currently, you predict 1 if h_θ(x) ≥ threshold, and predict 0 if h_θ(x) < threshold, where currently the threshold is set to 0.5. Suppose you increase the threshold to 0.9. Which of the following are true? Check all that apply. A. The classifier is likely to have unchanged precision and recall, but higher accuracy. B. The classifier is likely to now have higher recall. C. The classifier is likely to now have higher precision. D. The classifier is likely to have unchanged precision and recall, and thus the same F1 score. E. The classifier is likely to now have lower recall. F. The classifier is likely to now have lower precision.

CE

Suppose you have trained a logistic regression classifier which is outputting h_θ(x). Currently, you predict 1 if h_θ(x) ≥ threshold, and predict 0 if h_θ(x) < threshold, where currently the threshold is set to 0.5. Suppose you decrease the threshold to 0.3. Which of the following are true? Check all that apply. A. The classifier is likely to have unchanged precision and recall, but higher accuracy. B. The classifier is likely to have unchanged precision and recall, but lower accuracy. C. The classifier is likely to now have higher recall. D. The classifier is likely to now have higher precision. E. The classifier is likely to have unchanged precision and recall, and thus the same F1 score. F. The classifier is likely to now have lower recall. G. The classifier is likely to now have lower precision.

CG

Suppose that for some linear regression problem (say, predicting housing prices as in the lecture), we have some training set, and for our training set we managed to find some θ_0, θ_1 such that J(θ_0, θ_1) = 0. Which of the statements below must then be true? (Check all that apply.) A. Gradient descent is likely to get stuck at a local minimum and fail to find the global minimum. B. For this to be true, we must have θ_0 = 0 and θ_1 = 0 so that h_θ(x) = 0. C. For this to be true, we must have y(i) = 0 for every value of i = 1, 2, ..., m. D. Our training set can be fit perfectly by a straight line, i.e., all of our training examples lie perfectly on some straight line.

D

Suppose you have an unlabeled dataset {x(1), ..., x(m)}. You run K-means with 50 different random initializations, and obtain 50 different clusterings of the data. What is the recommended way for choosing which one of these 50 clusterings to use? A. Use the elbow method. B. Plot the data and the cluster centroids, and pick the clustering that gives the most "coherent" cluster centroids. C. Manually examine the clusterings, and pick the best one. D. Compute the distortion function J(c(1), ..., c(m), μ_1, ..., μ_K), and pick the one that minimizes this. E. The only way to do so is if we also have labels for our data. F. Always pick the final (50th) clustering found, since by that time it is more likely to have converged to a good solution. G. The answer is ambiguous, and there is no good way of choosing. H. For each of the clusterings, compute J(c(1), ..., c(m), μ_1, ..., μ_K), and pick the one that minimizes this.

DH

