CS540 Midterm


Suppose we want to perform convolution on a RGB image of size 224 x 224 (no padding) with 64 kernels, each with height 3 and width 3. Stride = 1. The convolution layer has bias parameters. Which is a reasonable estimate of the total number of learnable parameters?

(3 x 3 x 3 + 1) x 64 = 1,792. Each kernel is a 3D kernel spanning the 3 input channels, so it has 3 x 3 x 3 = 27 weight parameters plus 1 bias parameter. In total: (3 x 3 x 3 + 1) x 64.
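A quick sanity check in Python; the PyTorch lines are an optional cross-check assuming torch is installed:

```python
# Hand count (pure Python):
in_ch, out_ch, k = 3, 64, 3
print((k * k * in_ch + 1) * out_ch)  # 1792

# Optional cross-check, assuming PyTorch is available:
import torch.nn as nn
conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=1, bias=True)
print(sum(p.numel() for p in conv.parameters()))  # 1792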

You need to search a randomly generated state space graph with one goal, uniform edge costs, d=2, and m=100 (d = depth of the goal, m = maximum depth). Considering worst-case behavior, do you select BFS or DFS for your search? 1. BFS 2. DFS

1. BFS. DFS may commit to a deep branch (up to depth m=100) and explore it exhaustively before finding the shallow goal at depth d=2.

You are running BFS on a finite tree-structured state space graph that does not have a goal state. What is the behavior of BFS? 1. Visit all N nodes, then return one at random 2. Visit all N nodes, then stop and return "failure" 3. Visit all N nodes, then return the node farthest from the initial state 4. Get stuck in an infinite loop

2. Visit all N nodes, then stop and return "failure"

Suppose we want to perform convolution on a single channel image of size 7x7 (no padding) with a kernel of size 3x3, and stride = 2. What is the dimension of the output?

3x3. Output size is ⌊(n_h - k_h + p_h + s_h)/s_h⌋ x ⌊(n_w - k_w + p_w + s_w)/s_w⌋ = ⌊(7 - 3 + 0 + 2)/2⌋ x ⌊(7 - 3 + 0 + 2)/2⌋ = 3 x 3.
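A small helper implementing the formula above (p is the total padding along an axis):

```python
def conv_out(n, k, p=0, s=1):
    """Output length along one axis: floor((n - k + p + s) / s)."""
    return (n - k + p + s) // s

print(conv_out(7, 3, p=0, s=2))    # 3 -> the output is 3x3
print(conv_out(22, 11, p=0, s=2))  # 6 -> matches the 6x6x16 question below
```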

Suppose you are given a 3-layer multi-layer perceptron (2 hidden layers h1 and h2 and 1 output layer). All activation functions are sigmoids, and the output layer uses a softmax function. Suppose h1 has 1024 units and h2 has 512 units. Given a dataset with 2 input features and 3 unique class labels, how many learnable parameters does the network have in total?

529,411. (1024 * 2 + 1024) + (512 * 1024 + 512) + (512 * 3 + 3) = 529,411.
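The same count as a short Python loop over (fan-in, units) pairs:

```python
layers = [(2, 1024), (1024, 512), (512, 3)]  # (fan-in, units) per layer
total = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(total)  # 3072 + 524800 + 1539 = 529411
```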

Suppose we want to perform convolution on an RGB image of size 224x224 (no padding) with 64 kernels, each with height 3 and width 3. Stride = 1. Which is a reasonable estimate of the total number of scalar multiplications involved in this operation (without considering any optimization in matrix multiplication)?

64 x 3 x 3 x 3 x 222 x 222. For each kernel, we slide the window to 222 x 222 different locations (224 - 3 + 1 = 222 per axis). For each location, the number of multiplications is 3 x 3 x 3. So in total: 64 x 3 x 3 x 3 x 222 x 222.

Consider a convolution layer with 16 filters. Each filter has a size of 11x11x3, a stride of 2x2. Given an input image of size 22x22x3, if we don't allow a filter to fall outside of the input, what is the output size?

6x6x16. Along each spatial axis: ⌊(22 - 11)/2⌋ + 1 = 6. The 16 filters give 16 output channels.

Consider the following dataset showing whether a person passed or failed the exam based on various factors. Suppose the factors are independent of each other. We want to classify a new instance with Confident=Yes, Studied=Yes, and Sick=No. A. Pass B. Fail

Confident  Studied  Sick  Result
Y          N        N     F
Y          N        Y     P
N          Y        Y     F
N          Y        N     P
Y          Y        Y     P

A. Pass
P(y = F | x1 = Y, x2 = Y, x3 = N) ∝ P(x1 = Y | F) P(x2 = Y | F) P(x3 = N | F) P(F) = (1/2)(1/2)(1/2)(2/5) = 1/20
P(y = P | x1 = Y, x2 = Y, x3 = N) ∝ P(x1 = Y | P) P(x2 = Y | P) P(x3 = N | P) P(P) = (2/3)(2/3)(1/3)(3/5) = 4/45
Since 4/45 > 1/20, Naive Bayes predicts Pass. (Both scores share the same denominator P(x1 = Y, x2 = Y, x3 = N), so it can be dropped.)
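A minimal sketch of the same computation in Python, with the conditional probabilities hard-coded from the table above:

```python
# Unnormalized Naive Bayes scores for Confident=Y, Studied=Y, Sick=N.
score_fail = (1/2) * (1/2) * (1/2) * (2/5)   # = 1/20
score_pass = (2/3) * (2/3) * (1/3) * (3/5)   # = 4/45
print("Pass" if score_pass > score_fail else "Fail")  # Pass
```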

There are exactly 3 candidates for a presidential election. We know X has a 30% chance of winning, B has a 35% chance. What's the probability that C wins? A. 0.35 B. 0.23 C. 0.333 D. 0.8

A. 0.35. The probabilities must sum to 1: 1 - 0.30 - 0.35 = 0.35.

What is the perplexity for a sequence of n digits 0-9? All occur independently with equal probability. PP(W) = P(w_1, w_2, w_3, ..., w_n)^(-1/n) A. 10 B. 1/10 C. 10n D. 0

A. 10. PP(W) = (P(w_1) P(w_2) ... P(w_n))^(-1/n) = ((1/10)^n)^(-1/n) = 10.
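A numerical check; the sequence length n below is arbitrary, since it cancels out:

```python
n = 8                     # any length works
p_seq = (1 / 10) ** n     # each digit independently has probability 1/10
print(p_seq ** (-1 / n))  # 10.0
```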

Suppose we are given a dataset with n=10000 samples with 100-dimensional binary feature vectors. Our storage device has a capacity of 50000 bits. What's the lowest compression ratio we can use? A. 20X B. 100X C. 5X D. 1X

A. 20X 50,000 bits / 10,000 samples means compressed version must have 5 bits / sample. Dataset has 100 bits / sample. Must compress 20x smaller to fit on device.

50% of emails are spam. Software has been applied to filter spam. A certain brand of software claims that it can detect 99% of spam emails, and the probability for a false positive (a non-spam email detected as spam) is 5%. Now if an email is detected as spam, then what is the probability that it is in fact a nonspam email? A. 5/104 B. 95/100 C. 1/100 D. 1/2

A. 5/104
S: Spam, NS: Not Spam, DS: Detected as Spam.
P(S) = 50% (spam email), P(NS) = 50% (not spam), P(DS|NS) = 5% (false positive: detected as spam but not spam), P(DS|S) = 99% (detected as spam and it is spam).
Applying Bayes' rule: P(NS|DS) = P(DS|NS)P(NS) / P(DS) = P(DS|NS)P(NS) / (P(DS|NS)P(NS) + P(DS|S)P(S)) = 0.025/0.52 = 5/104.
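The same Bayes computation as a Python sketch:

```python
p_s, p_ns = 0.5, 0.5          # priors: spam, not spam
p_ds_s, p_ds_ns = 0.99, 0.05  # detection rate, false-positive rate

p_ns_ds = (p_ds_ns * p_ns) / (p_ds_ns * p_ns + p_ds_s * p_s)
print(p_ns_ds)  # 0.04807... = 5/104
```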

Consider the linear perceptron with x as the input. Which function can the linear perceptron compute? 1. y = ax + b 2. y = ax^2 + bx + c A. 1 B. 2 C. 1 & 2 D. None

A. 1. All units in a linear perceptron are linear, so the model cannot represent non-linear functions such as y = ax^2 + bx + c.

You have seven 2-dimensional points. You run 3-means on them, with initial clusters C1 = {(2,2), (4,4), (6,6)}, C2 = {(0,4), (4,0)}, and C3 = {(5,5), (9,9)}. What are the cluster centroids updated to? A. C1: (4,4), C2: (2,2), C3: (7,7) B. C1: (6,6), C2: (4,4), C3: (9,9) C. C1: (2,2), C2: (0,0), C3: (5,5) D. C1: (2,6), C2: (0,4), C3: (5,9)

A. C1: (4,4), C2: (2,2), C3: (7,7) The average of points in C1 is (4,4). The average of points in C2 is (2,2). The average of points in C3 is (7,7)
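A minimal centroid-update sketch in Python:

```python
def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(points), sum(ys) / len(points))

print(centroid([(2, 2), (4, 4), (6, 6)]))  # (4.0, 4.0)
print(centroid([(0, 4), (4, 0)]))          # (2.0, 2.0)
print(centroid([(5, 5), (9, 9)]))          # (7.0, 7.0)
```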

When we train a model, we are A. Optimizing the parameters and keeping the features fixed. B. Optimizing the features and keeping the parameters fixed. C. Optimizing the parameters and the features. D. Keeping parameters and features fixed and changing the predictions.

A. Optimizing the parameters and keeping the features fixed.

When training a neural network, which one below indicates that the network has overfit the training data? A. Training loss is low and generalization loss is high. B. Training loss is low and generalization loss is low. C. Training loss is high and generalization loss is high. D. Training loss is high and generalization loss is low. E. None of these.

A. Training loss is low and generalization loss is high.

Adding more layers to a multi-layer perceptron may cause ______. A. Vanishing gradients during back propagation. B. A more complex decision boundary. C. Underfitting. D. Higher test loss. E. None of these.

A. Vanishing gradients during back propagation. B. A more complex decision boundary. D. Higher test loss.

For Q learning to converge to the true Q function, we must A. Visit every state and try every action B. Perform at least 20,000 iterations. C. Re-start with different random initial table values. D. Prioritize exploitation over exploration

A. Visit every state and try every action

You have trained a classifier, and you find there is significantly lower loss on the test set than the training set. What is likely the case? A. You have accidentally trained your classifier on the test set. B. Your classifier is generalizing well. C. Your classifier is generalizing poorly. D. Your classifier needs further training.

A. You have accidentally trained your classifier on the test set.

Consider two heuristics for the 8 puzzle problem. h1 is the number of tiles in wrong position. h2 is the l1/Manhattan distance between the tiles and the goal location. How do h1 and h2 relate? A. h2 dominates h1 B. h1 dominates h2 C. Neither dominates the other

A. h2 dominates h1

Smoothing is increasingly useful for n-grams when A. n gets larger B. n gets smaller C. always the same D. n larger than 10

A. n gets larger

Suppose we want to solve the following k-class classification problem with cross-entropy loss ℓ(y, ŷ) = −∑_j y_j log ŷ_j, where the ground truth and predicted probabilities y, ŷ ∈ R^k. Recall that the softmax function turns outputs into probabilities: ŷ_j = exp(f_j(x)) / ∑_i exp(f_i(x)). What is the partial derivative ∂ℓ(y, ŷ)/∂f_j? A. ŷ_j − y_j B. exp(y_j) − y_j C. y_j − ŷ_j

A. ŷ_j − y_j

Which of the following are TRUE about the vanishing gradient problem in neural networks? Multiple answers are possible. A. Deeper neural networks tend to be more susceptible to vanishing gradients. B. Using the ReLU function can reduce this problem. C. If a network has the vanishing gradient problem for one training point due to the sigmoid function, it will also have a vanishing gradient for every other training point. D. Networks with sigmoid functions don't suffer from the vanishing gradient problem if trained with the cross-entropy loss.

A. Deeper neural networks tend to be more susceptible to vanishing gradients. B. Using the ReLU function can reduce this problem.

Which of the following are true about AlexNet? Select all that apply. A. AlexNet contains 8 conv/fc layers; the first five are convolutional layers. B. The last three layers are fully connected layers. C. Some of the convolutional layers are followed by max-pooling layers. D. AlexNet achieved excellent performance in the 2012 ImageNet challenge.

All are true!

Which one of the following is a valid activation function? A. Step function B. Sigmoid function C. ReLU function D. All of the above

D. All of the above

We toss a biased coin. If P(heads) = 0.7, then P(tails) = ? A. 0.4 B. 0.3 C. 0.6 D. 0.5

B. 0.3

Consider a three-layer network with linear perceptrons for binary classification. The hidden layer has 3 neurons. Can the network represent an XOR problem? A. Yes B. No

B. No. A combination of linear perceptrons is still a linear function, which cannot represent XOR.

You see samples of X given by [0,1,1,2,2,0,1,2] Empirically estimate E[X^2] A. 9/8 B. 15/8 C. 1.5 D. There aren't enough samples to estimate

B. 15/8. (1/8)(0^2 + 1^2 + 1^2 + 2^2 + 2^2 + 0^2 + 1^2 + 2^2) = 15/8
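The empirical estimate in a couple of lines of Python:

```python
samples = [0, 1, 1, 2, 2, 0, 1, 2]
print(sum(x ** 2 for x in samples) / len(samples))  # 1.875 = 15/8
```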

Given the joint distribution table below, what is the probability the temperature is hot given the weather is cloudy?

      Sunny    Cloudy   Rainy
hot   150/365  40/365   5/365
cold  50/365   60/365   60/365

A. 40/365 B. 2/5 C. 3/5 D. 195/365

B. 2/5. P(hot | cloudy) = P(hot, cloudy) / P(cloudy) = (40/365) / (100/365) = 2/5.

We are playing a game where Player A goes first and has 4 moves. Player B goes next and has 3 moves. Player A goes next and has 2 moves. Player B then has one move. How many nodes are there in the minimax tree, including termination nodes (leaves)? A. 23 B. 65 C. 41 D. 2

B. 65. Count the root plus each level: 1 + 4 + 4*3 + 4*3*2 + 4*3*2*1 = 1 + 4 + 12 + 24 + 24 = 65. (Note the root and leaf nodes are included.)
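A small sketch that accumulates the node count level by level:

```python
branching = [4, 3, 2, 1]  # legal moves at each successive ply
nodes, level = 1, 1       # start with the root
for b in branching:
    level *= b            # nodes at this depth
    nodes += level
print(nodes)  # 1 + 4 + 12 + 24 + 24 = 65
```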

If we run K-means clustering twice with random starting cluster centers, are we guaranteed to get same clustering results? Does K-means always converge? A. Yes, Yes B. No, Yes C. Yes, No D. No, No

B. No, Yes. *The clustering from k-means depends on the initialization; different initializations can lead to different outcomes. K-means always converges on a finite set of data points: 1. There are finitely many possible partitions of the points. 2. The assignment and update steps of each iteration only decrease the sum of the distances from points to their corresponding centers. 3. If it ran forever without converging, it would revisit the same partition, contradicting item 2.

Consider finding the fastest driving route from one US city to another. Measure cost as the number of hours driven when driving at the speed limit. Let h(s) be the number of hours needed to ride a bike from city s to your destination. h(s) is A. An admissible heuristic B. Not an admissible Heuristic

B. Not an admissible heuristic. Biking takes longer than driving at the speed limit, so h(s) overestimates the true remaining cost.

Let's compare sigmoid with the rectified linear unit (ReLU). Which of the following statements is NOT true? A. Sigmoid function is more expensive to compute B. ReLU has non-zero gradient everywhere C. The gradient of Sigmoid is always less than 0.3 D. The gradient of ReLU is constant for positive input

B. ReLU has non-zero gradient everywhere

During minimax tree search, must we examine every node? A. Always B. Sometimes C. Never

B. Sometimes. With alpha-beta pruning, some subtrees never need to be examined.

What is the following matrix-vector product?
[1 2]
[3 1]  x  [0]
[1 1]     [1]
A. [-1 1 1]T B. [2 1 1]T C. [1 3 1]T D. [1.5 2 1]T

B. [2 1 1]T
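A quick verification, assuming NumPy is available:

```python
import numpy as np

A = np.array([[1, 2], [3, 1], [1, 1]])
x = np.array([0, 1])
print(A @ x)  # [2 1 1]
```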

Given matrices A ∈ R^(m x n), B ∈ R^(d x m), and C ∈ R^(p x n), what are the dimensions of B A C^T? A. n x p B. d x p C. d x n D. Undefined

B. d x p. *To rule out (D), check that for each pair of adjacent matrices XY, the number of columns of X equals the number of rows of Y. Then: B has d rows, so the product must have d rows; C^T has p columns, so the product has p columns.

We have two datasets: a social network dataset S1 which shows which individuals are friends with each other along with image dataset S2. What kind of clustering can we do? Assume we do not make additional data transformations. A. k-means on both S1 and S2 B. graph-based on S1 and k-means on S2 C. k-means on S1 and graph-based on S2 D. hierarchical on S1 and graph-based on S2

B. graph-based on S1 and k-means on S2

Which of the following is false? (i) Rock/paper/scissors has a dominant pure strategy (ii) There is a Nash equilibrium for rock/paper/scissors • A. Neither • B. (i) but not (ii) • C. (ii) but not (i) • D. Both

B. (i) but not (ii)

Let's do hierarchical clustering for two clusters with average linkage on the dataset below (points on a number line): 1, 2, 4, 5, 7.25. What are the clusters? A. {1}, {2, 4, 5, 7.25} B. {1, 2}, {4, 5, 7.25} C. {1, 2, 4}, {5, 7.25} D. {1, 2, 4, 5}, {7.25}

B. {1,2}, {4, 5, 7.25}

K-NN algorithms can be used for: A Only classification B Only regression C Both

C. Both

Perceptron can be used for representing: A. AND function B. OR function C. XOR function D. Both AND and OR function

D. Both AND and OR functions

Which of the following about Naive Bayes is incorrect? A Attributes can be nominal or numeric B Attributes are equally important C Attributes are statistically dependent of one another given the class value D Attributes are statistically independent of one another given the class value E All of above

C Attributes are statistically dependent of one another given the class value

Which has more rows: a truth table on n symbols, or a joint distribution table on n binary random variables? A. Truth table B. Distribution C. Same size D. It depends

C. Same size. Both have 2^n rows: one per assignment of truth values, or one per joint outcome of the binary variables.

Consider an MDP with 2 states {A, B} and 2 actions: "stay" at the current state and "move" to the other state. Let r be the reward function such that r(A) = 1, r(B) = 0. Let γ be the discounting factor. Let π: π(A) = π(B) = move (i.e., an "always move" policy). What is the value function V^π(A)? A. 0 B. 1/(1 - γ) C. 1/(1 - γ^2) D. 1

C. 1/(1 - γ^2). (Following π from A, the states are A, B, A, B, ..., so the discounted rewards are 1, 0, γ^2, 0, γ^4, 0, ..., which sum to 1/(1 - γ^2).)

You have a dataset for regression given by (x1, y1) = ([-1,0,1],2) and (x2,y2) = ([2,3,1], 4). We have the weights B0 = 0, B1 = 2, B2 = 1 and B3 = 1. Predict yhat for x = [1,10,1] A. 15 B. 9 C. 13 D. 21

C. 13. ŷ = 1*B0 + 1*B1 + 10*B2 + 1*B3 = 0 + 2 + 10 + 1 = 13. *Need to prepend a 1 to the feature vector for the bias term B0.

You have a dataset for regression given by (x1, y1) = ([-1,0,1],2) and (x2,y2) = ([2,3,1], 4). We have the weights B0 = 0, B1 = 2, B2 = 1 and B3 = 1. What is the mean squared error (MSE) on the training set? A. 9 B. 13/2 C. 25/2 D. 25

C. 25/2. Step 1: compute ŷ for both points: ŷ1 = -1*B1 + 0*B2 + 1*B3 = -1, so loss(ŷ1, y1) = (-1 - 2)^2 = 9; ŷ2 = 2*B1 + 3*B2 + 1*B3 = 8, so loss(ŷ2, y2) = (8 - 4)^2 = 16. MSE = (9 + 16)/2 = 25/2.
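The same computation as a short Python sketch:

```python
X = [[-1, 0, 1], [2, 3, 1]]
y = [2, 4]
B = [0, 2, 1, 1]  # B0 (bias), B1, B2, B3

def predict(x):
    return B[0] + sum(w * xi for w, xi in zip(B[1:], x))

mse = sum((predict(x) - t) ** 2 for x, t in zip(X, y)) / len(y)
print(mse)  # ((-1 - 2)**2 + (8 - 4)**2) / 2 = 12.5
```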

A fair coin is tossed three times. Find the probability of getting 2 heads and a tail A. 1/8 B. 2/8 C. 3/8 D. 5/8

C. 3/8. Three of the eight equally likely outcomes have exactly two heads: HHT, HTH, THH.

Which of the following is not true? A. Adding more layers can improve the performance of a neural network. B. Residual connections help deal with vanishing gradients. C. CNN architectures use no more than ~20 layers to avoid problems such as vanishing gradients. D. It is usually easier to learn a zero mapping than the identity mapping

C. CNN architectures use no more than ~20 layers to avoid problems such as vanishing gradients.

Suppose p is false, q is true, and r is true. Does this assignment satisfy (i) ¬(¬p → ¬q) ∧ r (ii) (¬p ∨ ¬q) → (p ∨ ¬r)? A. Both B. Neither C. Just (i) D. Just (ii)

C. Just (i). For (i): ¬p → ¬q is T → F = F, so ¬(¬p → ¬q) ∧ r = T ∧ T = T. For (ii): ¬p ∨ ¬q = T but p ∨ ¬r = F, so the implication is F.

Suppose you are given a dataset with 1,000,000 images to train with. Which of the following methods is most desirable if training resources are limited but adequate accuracy is needed? A. Gradient Descent B. Stochastic Gradient Descent C. Minibatch Stochastic Gradient Descent D. Computation Graph

C. Minibatch Stochastic Gradient Descent

Which one of the following is NOT true about perceptron? A. Perceptron only works if the data is linearly separable. B. Perceptron can learn AND function C. Perceptron can learn XOR function D. Perceptron is a supervised learning algorithm

C. Perceptron can learn XOR function

Which one of the following is NOT true? A. LeNet has two convolutional layers B. The first convolutional layer in LeNet has 5x5x6x3 parameters, in case of RGB input C. Pooling is performed right after convolution D. Pooling layer does not have learnable parameters

C. Pooling is performed right after convolution. Pooling is performed after the activation: conv -> ReLU -> pooling.

Which output function is often used for multi-class classification tasks? A. Sigmoid function B. Rectified Linear Unit (ReLU) C. Softmax D. Max Function

C. Softmax

Which of the following statements about MDPs is not true? A. The reward function must output a scalar value B. The policy maps states to actions C. The probability of the next state can depend on current and previous states D. The solution of an MDP is to find a policy that maximizes the cumulative rewards

C. The probability of next state can depend on current and previous states

Which is true about feature vectors? A. Feature vectors can have at most 10 dimensions B. Feature vectors have only numeric values C. The raw image can also be used as the feature vector D. Text data don't have feature vectors

C. The raw image can also be used as the feature vector *A. Feature vectors can be high dimensional B. Some feature vectors can have other types of values like strings D. Bag-of-words is a type of feature vector for text

You have trained a classifier, and you find there is significantly higher loss on the test set than the training set. What is likely the case? A. You have accidentally trained your classifier on the test set. B. Your classifier is generalizing well. C. Your classifier is generalizing poorly. D. Your classifier is ready for use.

C. Your classifier is generalizing poorly.

In standard dropout regularization, with dropout probability p, each intermediate activation h is replaced by a random variable h' as: A. h B. h/p C. h/(1-p) D. h(1-p)

C. h/(1-p)

Which of the following are admissible heuristics? (i) h(s) = h*(s) (ii) h(s) = max(2, h*(s)) (iii) h(s) = min(2, h*(s)) (iv) h(s) = h*(s) - 2 (v) h(s) = sqrt(h*(s)) A. All of the above B. (i), (iii), (iv) C. (i), (iii) D. (i), (iii), (v)

C. i & iii

How many entries does a truth table have for a FOL sentence with k variables where each variable can take on n values? A. Truth tables are not applicable to FOL B. 2^ k C. n^k D. It depends

C. n^k *Must have one entry for every possible assignment of values to variables, and that number is n^k.

Of a company's employees, 30% are women and 6% are married women. Suppose an employee is selected at random. If the employee selected is a woman, what is the probability that she is married? A. 0.3 B. 0.06 C. 0.24 D. 0.2

D. 0.2. P(married | woman) = P(married woman) / P(woman) = 0.06 / 0.30 = 0.2.

What's the probability of selecting a black card or a number 6 from a standard deck of 52 cards? A. 26/52 B. 4/52 C. 30/52 D. 28/52

D. 28/52. There are 26 black cards plus the two red 6s: 26 + 2 = 28.

Which of the below are bigrams from the sentence "It is cold outside today"? A. It is B. cold today C. is cold D. A & C

D. A & C

Which of the following statements is TRUE about the success of deep models? A. Better design of the neural networks B. Large-scale training datasets C. Available computing power D. All of the above

D. All of the above

The CIFAR-10 dataset contains 32x32 images labeled with one of 10 classes. What could we use it for? (i) Supervised learning (ii) PCA (iii) k-means clustering A. Only (i) B. Only (ii) and (iii) C. Only (i) and (ii) D. All of them

D. All of them. (i) Yes: train an image classifier; we have labels. (ii) Yes: run PCA on image vectors to reduce dimensionality. (iii) Yes: can cluster image vectors with k-means.

We are running 3-means again. We have 3 centers, C1 (0,1), C2 (2,1), C3 (-1,2). Which cluster assignment is possible for the points (1,1) and (-1,1), respectively? Ties are broken arbitrarily: (i) C1, C1 (ii) C2, C3 (iii) C1, C3 A. Only (i) B. Only (ii) and (iii) C. Only (i) and (iii) D. All of them

D. All of them. For the point (1,1), the squared Euclidean distance to C1 is 1, to C2 is 1, and to C3 is 5, so it can be assigned to C1 or C2. For the point (-1,1), the squared Euclidean distance to C1 is 1, to C2 is 9, and to C3 is 1, so it can be assigned to C1 or C3.
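A short sketch that prints the squared distances to each center:

```python
centers = {"C1": (0, 1), "C2": (2, 1), "C3": (-1, 2)}

def sq_dists(p):
    return {name: (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
            for name, c in centers.items()}

print(sq_dists((1, 1)))   # {'C1': 1, 'C2': 1, 'C3': 5} -> C1 or C2 (tie)
print(sq_dists((-1, 1)))  # {'C1': 1, 'C2': 9, 'C3': 1} -> C1 or C3 (tie)
```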

If we apply data augmentation blindly, we might (i) Change the label of the data point (ii) Produce a useless training point A. (i) but not (ii) B. (ii) but not (i) C. Neither D. Both

D. Both

What are some consequences of data augmentation? (i) We have to store a much bigger dataset in memory (ii) For a fixed batch size, there will be more batches per epoch A. (i) but not (ii) B. (ii) but not (i) C. Neither D. Both

D. Both

Which of the following is true? (i) Nash equilibria require each player to know other players' possible strategies (ii) Nash equilibria require rational play A. Neither B. (i) but not (ii) C. (ii) but not (i) D. Both

D. Both

Which of the following is not a common task of supervised learning? A. Object detection (predicting bounding box from raw images) B. Classification C. Regression D. Dimensionality reduction

D. Dimensionality reduction

Which is true about machine learning? A. The process doesn't involve human inputs B. The machine is given the training and test data for learning C. In clustering, the training data also have labels for learning D. Supervised learning involves labeled data

D. Supervised learning involves labeled data *A. The labels are human inputs B. The machine should not have test data for learning C. No labels available for clustering

Which is true about unsupervised learning? A. There are only 2 unsupervised learning algorithms B. Kmeans clustering is a type of hierarchical clustering C. Kmeans algorithm automatically determines the number of clusters k D. Unsupervised learning is widely used in many applications

D. Unsupervised learning is widely used in many applications

Let x = [x1, x2]^T. Which of the following functions is NOT an element-wise operation that can be used as an activation function? A. f(x) = [x1, x2]^T B. f(x) = [max(0,x1), max(0,x2)]^T C. f(x) = [exp(x1), exp(x2)]^T D. f(x) = [exp(x1 + x2), exp(x2)]^T

D. f(x) = [exp(x1 + x2), exp(x2)]^T. Its first component depends on both x1 and x2, so the operation is not element-wise.

If we do hierarchical clustering on n points, the maximum depth of the resulting tree is A. 2 B. log n C. n/2 D. n-1

D. n-1 *Like the worst-case binary search tree: all merges happen on one side, so each of the n-1 merges adds a level.

You are empirically estimating P(X) for some random variable X that takes on 100 values. You see 50 samples. How many of your P(X=a) estimates might be 0? A. None. B. Between 5 and 50, exclusive. C. Between 50 and 100, inclusive. D. Between 50 and 99, inclusive

D. Between 50 and 99, inclusive. For each value a, your estimate is P(X = a) = (# samples taking value a) / 50. *If you never see a value in the 50 samples, its estimated probability is 0. You can see at most 50 distinct values in 50 samples, so at least 50 estimates are 0. On the other hand, all 50 samples might share the same value, in which case 99 values were never seen.

Which of the following distance measure do we use in case of categorical variables in k-NN? A Hamming distance B Euclidean distance C Manhattan distance

Hamming Distance

What is k-means trying to optimize?

It aims to minimize the within-cluster sum of squares, which is the sum of the squared distances between each data point and its assigned centroid.

Will k-means find a global or local optimum?

K-means may only find a local optimum: the algorithm is sensitive to the initial placement of the centroids, so different initializations may converge to different (possibly suboptimal) solutions.

You have a dataset for regression given by (x1, y1) = ([-1,0,1],2) and (x2,y2) = ([2,3,1], 4). What are the labels, number of points (n), and dimension of the features (d)?

Labels are 2 and 4; n = 2 and d = 3

What is the partial derivative ∂f/∂w1 of f(x1, x2, w1, w2, y) = y log σ(w1x1 + w2x2) + (1 − y) log(1 − σ(w1x1 + w2x2)) when y = 1 and σ(z) = 1/(1 + e^(−z))? Hint: ∂σ/∂z = σ(z)(1 − σ(z)).

Let b = w1x1 + w2x2 and a = σ(b). By the chain rule:
∂f/∂w1 = (∂f/∂a)(∂a/∂b)(∂b/∂w1) = (y/a) · σ(b)(1 − σ(b)) · x1, and with y = 1 this is (1 − σ(w1x1 + w2x2)) x1.
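A finite-difference sanity check in Python; the particular values of w1, w2, x1, x2 below are arbitrary choices for illustration:

```python
import math

def sigma(z):
    return 1 / (1 + math.exp(-z))

x1, x2, w2 = 0.5, -0.3, 1.0  # arbitrary fixed values

def f(w1, y=1.0):
    z = w1 * x1 + w2 * x2
    return y * math.log(sigma(z)) + (1 - y) * math.log(1 - sigma(z))

w1 = 0.7
analytic = (1 - sigma(w1 * x1 + w2 * x2)) * x1
numeric = (f(w1 + 1e-6) - f(w1 - 1e-6)) / 2e-6
print(analytic, numeric)  # the two values agree to ~6 decimal places
```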

How to pick starting cluster centers for K-means?

Randomly select k data points from the dataset as the initial centroids.

What are the eigenvalues of
[2 0 0]
[0 5 0]
[0 0 1]
A. -1, 2, 4 B. 0.5, 0.2, 1.0 C. 0, 2, 5 D. 2, 5, 1

D. 2, 5, 1. The matrix is diagonal, so its eigenvalues are exactly the diagonal entries. *You can also multiply it by an n x 1 vector v, set the product equal to λv, and solve.
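A quick numerical confirmation, assuming NumPy is available:

```python
import numpy as np

D = np.diag([2, 5, 1])
print(np.linalg.eigvals(D))  # [2. 5. 1.]
```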

A and B are matrices, neither of which is the identity. Is AB = BA?

Sometimes. Matrix multiplication is not commutative in general, but AB = BA does hold for particular pairs (e.g., when B = A).

True or False Maximum likelihood estimation is the same regardless of whether we maximize the likelihood or log-likelihood function. A True B False

True

A Leaky ReLU is defined as f(x) = max(0.1x, x). Let f'(0) = 1. Does it have a non-zero gradient everywhere? A. Yes B. No

A. Yes. The gradient is 0.1 for x < 0, 1 for x > 0, and defined to be 1 at x = 0.

Will k-means always stop (converge)?

Yes, k-means will stop when either the centroids do not change significantly between iterations or when a specified number of iterations is reached.

What is the inverse of
[0 2]
[3 0]
*One 2x2 matrix

[0   1/3]
[1/2   0]
*One 2x2 matrix
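A quick check, assuming NumPy is available: multiplying the matrix by its inverse should give the identity.

```python
import numpy as np

M = np.array([[0, 2], [3, 0]])
M_inv = np.linalg.inv(M)
print(M_inv)      # [[0, 1/3], [1/2, 0]]
print(M @ M_inv)  # 2x2 identity
```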

Consider binary classification in 2D where the intended label of a point x = (x1, x2) is positive if x1>x2 and negative otherwise. Let the training set be all points of the form x = [4a, 3b] where a,b are integers. Each training item has the correct label that follows the rule above. With a 1NN classifier (Euclidean distance), which ones of the following points are labeled positive? Multiple answers. [5.52, 2.41] [8.47, 5.84] [7,8.17] [6.7,8.88]

[5.52, 2.41] and [8.47, 5.84]. *The nearest training points are [4,3] => positive, [8,6] => positive, [8,9] => negative, and [8,9] => negative, respectively.
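A small sketch of the nearest-neighbor lookup; because the training set is the grid [4a, 3b], the Euclidean nearest neighbor can be found by rounding each coordinate to its grid independently (the helper name is hypothetical):

```python
def nearest_train_point(x):
    # Training points are [4a, 3b], so snap each coordinate to its grid.
    return (4 * round(x[0] / 4), 3 * round(x[1] / 3))

for p in [(5.52, 2.41), (8.47, 5.84), (7, 8.17), (6.7, 8.88)]:
    nx, ny = nearest_train_point(p)
    print(p, "->", (nx, ny), "positive" if nx > ny else "negative")
```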

Let A ="Aldo is Italian" and B ="Bob is English". Formalize "Aldo is Italian or if Aldo isn't Italian then Bob is English". a. A ∨ (¬A → B) b. A ∨ B c. A ∨ (A → B) d. A → B

a. A ∨ (¬A → B), and also b. A ∨ B (the two are logically equivalent).

Gradient Descent in neural network training computes the ___ of a loss function with respect to the model __ until convergence A. gradients, parameters B. parameters, gradients C. loss, parameters D. Parameters, loss

A. gradients, parameters

How many clusters should we use?

There is no definitive answer; multiple methods may need to be employed to determine the most suitable number of clusters for a given dataset.

