True/false questions from all previous exams

The k-means algorithm for clustering is guaranteed to converge to a local optimum.

TRUE.
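
As an illustration (not from the exam), here is a minimal numpy sketch of k-means: both the assignment step and the mean-update step can only decrease the within-cluster sum of squared distances, so the algorithm converges, but only to a local optimum that depends on the initial centroids.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: alternates assignment and mean-update steps.

    Each step can only decrease (or keep) the within-cluster sum of squared
    distances, so the objective converges -- but only to a local optimum
    that depends on the initial centroids.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged (to a local optimum)
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```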

Bagging makes use of the bootstrap method.

TRUE. (Bootstrap AGGregatING)
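
A minimal sketch of the idea, using scikit-learn decision trees as base learners: each learner is trained on a bootstrap sample (drawn with replacement from the training set) and the predictions are aggregated by majority vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Bootstrap AGGregatING: each tree is trained on a bootstrap
    sample (drawn with replacement) of the original training set."""
    rng = np.random.default_rng(seed)
    trees = []
    n = len(X)
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    # Majority vote over the ensemble (assumes 0/1 labels).
    votes = np.array([t.predict(X) for t in trees])
    return np.round(votes.mean(axis=0)).astype(int)

X = np.random.randn(200, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)
trees = bagging_fit(X, y)
print((bagging_predict(trees, X) == y).mean())
```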

Vanishing Gradient problem is a problem related to the sparse connectivity for convolutional neural networks (CNN).

FALSE. The vanishing gradient problem is about gradients shrinking as they are propagated backwards through many layers (repeated multiplication by small derivatives), which makes the early layers learn very slowly; it has nothing to do with the sparse connectivity of CNNs.

We can get multiple local optimum solutions if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent.

FALSE. Gradient descent is in general only guaranteed to converge to a local optimum, but the sum of squared errors for linear regression is a convex function, so every local minimum is also a global minimum and gradient descent (with a suitable step size) converges to a globally optimal solution.
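
To illustrate, a small numpy sketch: gradient descent on the sum of squared errors for a toy linear regression problem ends up at (essentially) the same solution as the closed-form least-squares fit, because the objective is convex.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # bias + 2 features
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Gradient descent on the (convex) mean of squared errors.
theta = np.zeros(3)
lr = 0.01
for _ in range(5000):
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    theta -= lr * grad

# Closed-form least-squares solution for comparison.
theta_closed = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta, theta_closed)   # both end up at (essentially) the same minimum
```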

The ID3 algorithm is guaranteed to find the optimal decision tree given the training data.

FALSE. "Optimal" would typically mean the smallest (or most accurate) tree. ID3 is greedy: it picks the locally best attribute at each node and never reconsiders its choices, so it is not guaranteed to find an optimal tree, even though a sufficiently large tree can usually classify the training data perfectly.
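
For reference, a small sketch of the greedy criterion ID3 uses: at each node it computes the information gain of every attribute and splits on the one with the largest gain, without ever revisiting the choice.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Reduction in label entropy from splitting on one (discrete) feature."""
    total = entropy(labels)
    values, counts = np.unique(feature, return_counts=True)
    weighted = sum((c / len(labels)) * entropy(labels[feature == v])
                   for v, c in zip(values, counts))
    return total - weighted

# ID3 greedily splits on the attribute with the largest information gain
# and recurses; it never revisits an earlier split.
y = np.array([0, 0, 1, 1, 1, 0])
x1 = np.array([0, 0, 1, 1, 1, 1])
x2 = np.array([0, 1, 0, 1, 0, 1])
print(information_gain(x1, y), information_gain(x2, y))
```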

The EM algorithm consists of two steps that are repeated iteratively: The Exploration-step and the Maximization-step.

FALSE. It is Expectation Maximization.

Like many other machine learning methods, deep neural networks require the user to provide definitions of typical features, such as "fur" and "whiskers", so that the network can classify images, such as those of cats and dogs.

FALSE. It learns these things itself.

Random forest is a boosting algorithm.

FALSE. Random forests and boosting are related in that they can both be tree-based, but they are fundamentally different. Random forests are a variation on bagging, with the extra twist that each split only considers a random subset of the features.

Support Vector Machines (SVMs) are called minimum margin classifiers.

FALSE. Should be maximum. We want to construct the margin so that it is as wide as possible.

The function h_θ(x) = θ_0 + x_1^θ_1 + x_2^θ_2, where θ is the function parameter, can be learned using linear regression.

FALSE. To be learnable with linear regression, the model must be linear in the parameters θ. Here θ_1 and θ_2 appear as exponents of x_1 and x_2, so the model is non-linear in θ. (Non-linearity in x is not the problem: with known, fixed basis functions such as x^2 we could simply define new features and still fit a linear model, but here the exponents themselves are the unknown parameters.)

You are more likely to get overfitting when the hypothesis space is small.

FALSE. The hypothesis space is the set of possible models (hypotheses) we may choose. If we have few hypotheses, we have few choices, and are thus underfitting. If we have a flexible (large) hypothesis space, we may be overfitting if we do not choose wisely.

The ML hypothesis will always give at least as good classification results as the MAP hypothesis on training data.

FALSE. Maximum likelihood is the special case of maximum a posteriori with a uniform prior, so the two can coincide, and maximizing the likelihood of the training data is not the same thing as maximizing classification accuracy on it. No "always at least as good" guarantee can be given either way. (The prior in MAP uses more than just the observed data and should make it more robust against overfitting, but how much it helps depends on how well the prior matches reality.)

Bagging algorithms attach weights w1... wn to a set of N weak learners. They re-weight the learners and convert them into strong ones. Boosting algorithms draw N sample distributions (usually with replacement) from an original data set for learners to train on.

FALSE. The descriptions are swapped: it is boosting that attaches weights to (and re-weights) the weak learners to combine them into a strong one, while bagging draws bootstrap samples with replacement from the original data set for the learners to train on.

In Deep Learning, a Convolutional Neural Network has a temporal memory making it suitable for handling streaming data.

FALSE. That describes a recurrent neural network. CNNs are suited to data such as images, and other applications where you want to look at local neighbourhoods of features rather than single features in isolation.

Deep neural networks are excellent at classifying individual images (often performing better than humans on pattern-recognition tasks), but they consistently fail to produce good results on time-series of images or text.

FALSE. Using recurrent neural networks one can do quite well on time-series data and text data as well.

Support Vector Machines are often called narrow-margin classifiers.

FALSE. They are called maximum-margin (wide-margin) classifiers.

To learn the conditional probability tables of a Bayesian Network from complete data you need an iterative algorithm like the EM algorithm.

FALSE. We only need the EM algorithm if we have incomplete data.

Performing k-nearest neighbours with k = N (> 1) yields more complex decision boundaries than 1-nearest neighbour.

FALSE. With k = 1 we overfit and the decision boundary becomes very complex. With k = N (where N is the number of training points), every query is classified by the same majority vote over all points, so the prediction is constant everywhere. Not very complex!

Unsupervised learning means learning without training examples.

FALSE. You need training examples, but in unsupervised learning, they don't have a label.

Supervised learning means learning unlabelled data.

FALSE. _Un_supervised learning means learning unlabelled data.

Training a k-nearest-neighbors classifier takes more computational time than applying it.

FALSE. kNN is a lazy learner: "training" amounts to storing the data (there are no parameters to fit other than the choice of k), while prediction requires computing distances from the query point to all stored training points.
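
A minimal sketch of kNN as a lazy learner: fit() just stores the data, while predict() does all the distance computations.

```python
import numpy as np

class KNN:
    """k-nearest neighbours as a lazy learner: fit() only stores the data,
    all the distance computation happens at prediction time."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):          # no optimization at all, just store the data
        self.X, self.y = X, y
        return self

    def predict(self, X_new):     # O(n_train * n_query) distance computations
        dists = np.linalg.norm(X_new[:, None, :] - self.X[None, :, :], axis=2)
        nearest = np.argsort(dists, axis=1)[:, :self.k]
        votes = self.y[nearest]
        # Majority vote among the k nearest training points.
        return np.array([np.bincount(row).argmax() for row in votes])

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
print(KNN(k=1).fit(X, y).predict(np.array([[0.2, 0.1], [5.1, 5.0]])))  # [0 1]
```

With k equal to the number of training points, every query would get the same majority vote, which is the point made in the k = N question above.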

In decision tree learning with noise-free data, starting with a wrong attribute at the root can make it impossible to find a tree that fits the data exactly.

FALSE. With noise-free data, a tree that fits the data exactly can still be built by continuing to split below a poorly chosen root attribute; the resulting tree is just larger than necessary.

The computational complexity of the EM-algorithm is at worst exponential in the number of variables.

Unsure. (The E-step requires inference in the underlying model, which in the worst case is exponential in the number of variables, so TRUE seems plausible, but this should be checked.)

For logistic regression, with parameters optimized using a stochastic gradient method, setting parameters to 0 is an acceptable initialization.

TRUE. Unlike multi-layer neural networks, logistic regression has no hidden-unit symmetry that needs to be broken, and its loss is convex, so initializing all parameters to 0 is perfectly acceptable.
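
A small sketch to illustrate why zero initialization is acceptable here: stochastic gradient steps on the logistic loss move the parameters away from zero immediately, since the gradient is non-zero even at the all-zero starting point.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(2), 0.0          # all-zero initialization is fine here
lr = 0.1
for epoch in range(50):
    for i in rng.permutation(len(X)):          # stochastic gradient steps
        p = sigmoid(X[i] @ w + b)
        grad = p - y[i]                        # d(log-loss)/d(logit)
        w -= lr * grad * X[i]
        b -= lr * grad

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(acc)   # close to 1.0 on this separable toy problem
```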

The version space is guaranteed to contain the target concept if the target concept is within the hypothesis space and the training examples are consistent.

TRUE. Definition: The version space with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D. A hypothesis is consistent with the training examples if it correctly classifies them. So, the version space is the set of models that correctly classifies the training data.

Find-S finds the maximal (least general) consistent generalization of the observed training data.

TRUE. Find-S finds the most specific hypothesis consistent with the training examples.
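
A minimal sketch of Find-S over conjunctive hypotheses (attribute constraints, with '?' meaning "any value"): the hypothesis starts maximally specific and is generalized only as much as each positive example requires.

```python
def find_s(examples):
    """Find-S over conjunctive hypotheses.

    A hypothesis is a list of attribute constraints; None means 'no value
    allowed yet' (maximally specific) and '?' means 'any value'.
    Only positive examples are used; the hypothesis is generalized just
    enough to cover each one, so the result is the most specific
    consistent hypothesis.
    """
    n_attrs = len(examples[0][0])
    h = [None] * n_attrs                      # start maximally specific
    for x, label in examples:
        if not label:                         # negative examples are ignored
            continue
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value                  # first positive example
            elif h[i] != value:
                h[i] = '?'                    # generalize this attribute
    return h

data = [
    (('sunny', 'warm', 'normal', 'strong'), True),
    (('sunny', 'warm', 'high',   'strong'), True),
    (('rainy', 'cold', 'high',   'strong'), False),
]
print(find_s(data))   # ['sunny', 'warm', '?', 'strong']
```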

In linear SVMs, the optimal weight vector w is a linear combination of training data points.

TRUE. From the dual formulation, w = Σ_i α_i y_i x_i, where the α_i are the dual variables; only the support vectors have non-zero α_i, so w is a linear combination of (a subset of) the training data points.
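
This can be checked with scikit-learn's linear SVC: dual_coef_ holds the signed dual coefficients α_i·y_i of the support vectors, and summing them against the support vectors reproduces coef_ (the weight vector w). A small sketch:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(30, 2)) - 2, rng.normal(size=(30, 2)) + 2])
y = np.array([0] * 30 + [1] * 30)

clf = SVC(kernel='linear').fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only, so
# w = sum_i alpha_i y_i x_i is a linear combination of training points.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))   # True
```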

The size of decision trees is the crucial factor for learnability by ID3 and its variants.

TRUE. ID3 is just the standard algorithm for growing trees. The size (number of nodes) is important, as one can easily grow a large tree that overfits the training data but generalizes poorly.

You are more likely to get overfitting when the set of training data is small.

TRUE. With few training examples, a flexible model can fit them (including their noise) almost perfectly while generalizing poorly, so overfitting is more likely.

Gated recurrent units (GRU) have been introduced to tackle the long-term dependency problem.

TRUE. GRUs (like LSTMs) use gating to let gradients and information flow across many time steps, and were introduced precisely to tackle the long-term dependency (vanishing gradient) problem of plain RNNs.

Pre-pruning is a method to avoid overfitting in decision tree learning.

TRUE. Pre-pruning is basically the same as early stopping. We stop growing the tree earlier, before it reaches the point where it perfectly classifies all the training data. In practice this is not as effective as post-pruning, which is consequently more common.
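
A small scikit-learn sketch for comparison: parameters such as max_depth act as pre-pruning (stop growing early), while ccp_alpha performs cost-complexity post-pruning of the fully grown tree. The exact accuracy numbers will of course depend on the data and the chosen values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                 # grown until pure leaves
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)     # pre-pruned (early stopping)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr) # post-pruned

for name, tree in [('full', full), ('pre-pruned', pre), ('post-pruned', post)]:
    print(name, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```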

Reproducibility of artificial intelligence and machine learning research can be achieved by rerunning experiments.

TRUE. (Worth double-checking, though: simply rerunning an experiment, without the original code, data and environment, is arguably not sufficient on its own.)

CNNs have the property of translation invariance.

TRUE. A CNN looks at a neighbourhood of features at a time, for example a section of an image instead of a single pixel, and applies the same filters across the whole input. It will therefore recognize that section if it reappears, no matter where it is located in the image.
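
A toy 1D sketch of the idea: convolution (cross-correlation) is translation-equivariant, i.e. the feature map shifts along with the input, and adding a global max pooling on top makes the detection translation-invariant.

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid cross-correlation, as used in CNN layers."""
    n = len(signal) - len(kernel) + 1
    return np.array([signal[i:i + len(kernel)] @ kernel for i in range(n)])

kernel = np.array([1.0, -1.0, 1.0])          # a tiny "pattern detector"
pattern = np.array([1.0, -1.0, 1.0])

x = np.zeros(12); x[2:5] = pattern                     # pattern near the start
x_shifted = np.zeros(12); x_shifted[7:10] = pattern    # same pattern, shifted

# Equivariance: the feature map shifts along with the input ...
print(conv1d(x, kernel).argmax(), conv1d(x_shifted, kernel).argmax())   # 2 7
# ... and global max pooling makes the detection translation-invariant.
print(conv1d(x, kernel).max(), conv1d(x_shifted, kernel).max())         # 3.0 3.0
```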

After training a Support Vector Machine we can discard all examples which are not support vectors, and still correctly classify new examples.

TRUE. The classification depends _only_ on the support vectors.

The function h_θ(x) = θ_0 + θ_1x_1x_2^3 , where θ is the function parameter, can be learned using linear regression.

TRUE. The expression must be linear in the parameters θ, and it is: x_1·x_2^3 is just a fixed basis function (a derived feature), so θ_0 and θ_1 can be found by ordinary linear regression.
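
A small numpy sketch: since x_1·x_2^3 is a fixed basis function, we can precompute it as a feature and fit θ_0 and θ_1 with ordinary least squares (the synthetic data and coefficients below are just for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y = 2.0 + 0.5 * x1 * x2**3 + 0.1 * rng.normal(size=200)   # theta_0=2, theta_1=0.5

# Precompute the fixed basis function z = x1 * x2^3; the model
# h(x) = theta_0 + theta_1 * z is then linear in the parameters.
Z = np.column_stack([np.ones_like(x1), x1 * x2**3])
theta = np.linalg.lstsq(Z, y, rcond=None)[0]
print(theta)   # approximately [2.0, 0.5]
```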

The MAP hypothesis will always give at least as good classification results as the ML hypothesis on test data.

TRUE. Maximum likelihood is the special case of maximum a posteriori with a uniform prior. The prior acts as regularization based on more than just the observed data, which makes the MAP hypothesis more robust against overfitting and thus at least as good on test data. (This does assume a reasonable prior; one based on false assumptions could of course hurt.)

If P(A|B) = P(A) then P(A ∧ B) = P(A) ⋅ P(B)

TRUE. P(A|B) = P(A) means A and B are independent, and then P(A ∧ B) = P(A|B)·P(B) = P(A)·P(B); the two statements are equivalent formulations of independence.

One of the greatest advances in neural network research was a technique for training multilayered networks based on the chain rule for derivatives (from calculus).

TRUE. This is backpropagation.
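
A tiny numpy sketch of backpropagation in a two-layer network: the gradient with respect to the first-layer weights is obtained by chaining local derivatives backwards through the output layer and the tanh non-linearity, and it matches a numerical finite-difference check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
target = np.array([1.0])

def forward(W1, W2, x):
    h = np.tanh(W1 @ x)                  # hidden layer
    y = W2 @ h                           # output layer
    loss = 0.5 * np.sum((y - target) ** 2)
    return h, y, loss

h, y, loss = forward(W1, W2, x)

# Backpropagation = repeated application of the chain rule:
dL_dy = y - target                         # dL/dy
dL_dW2 = np.outer(dL_dy, h)                # dL/dW2 = dL/dy * dy/dW2
dL_dh = W2.T @ dL_dy                       # push the gradient back through W2
dL_dz = dL_dh * (1 - h ** 2)               # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
dL_dW1 = np.outer(dL_dz, x)                # dL/dW1 = dL/dz * dz/dW1

# Sanity check against a numerical derivative for one weight.
eps = 1e-6
W1_pert = W1.copy(); W1_pert[0, 0] += eps
num = (forward(W1_pert, W2, x)[2] - loss) / eps
print(dL_dW1[0, 0], num)                   # nearly identical
```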

Recurrent Neural Networks (RNN) have a temporal memory making it suitable for handling streaming data.

TRUE. This is exactly what RNNs are used for.

You are given m data points, and use half of them for training and the other half for testing. When m increases, the difference between the training error and test error decreases.

TRUE. As m grows, both halves get larger: the model overfits the larger training set less, and the test error estimate becomes more reliable, so the gap between training error and test error shrinks.

