Examination: DD2421/FDD3431 HT22 Machine Learning

Probabilistic Learning Suppose that for two events A and B we can write that P(A|B) = P(A). This means that: a) A is more likely than B b) A is deterministic c) A and B are independent

c) A and B are independent

Error back-propagation

Algorithm to train artificial neural networks

Simplify the following Neural Network (NN) as much as possible. The NN consists of only linear transfer units in all layers. Explain IN KEYWORDS how you justify your solution.

A (large) network of linear units can be collapsed into a single linear unit, as any linear function of a linear function (of a linear function ...) is still a linear function. Hence, a single linear unit is sufficient.
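As a minimal numpy sketch of this argument (layer sizes and values are purely illustrative), composing two linear layers gives exactly the same function as one linear layer whose weight matrix is the product of the two:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # input vector
W1 = rng.normal(size=(4, 3))     # first linear layer
W2 = rng.normal(size=(2, 4))     # second linear layer

two_layer = W2 @ (W1 @ x)        # output of the two-layer linear network
one_layer = (W2 @ W1) @ x        # single collapsed linear unit/layer

print(np.allclose(two_layer, one_layer))  # True: identical function
```

The same collapse works with bias terms, since an affine function of an affine function is still affine.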

The data is linearly separable, so a linear kernel should be sufficient. What would be the advantage of using a non-linear kernel anyway?

A non-linear kernel will result in much wider margins and therefore generalize better.

Subspace

A space spanned by a set of linearly independent vectors

If a quadratic kernel is used, state at least one of the data points which will surely not be a support vector. Motivate with a short argument why this point is unlikely to be a support vector. (1p)

(−1,2), (−1,−0.5), (2,2) will not be support vectors. All of these have another point of the same class much closer to the boundary between the classes.

Perceptron Learning

Method to find separating hyperplanes; an approach to train artificial neural networks. The perceptron is one of the simplest building blocks in machine learning and deep learning: it consists of a set of weights, input values (or scores), and a threshold.

The method described in b) is called LASSO and is known to yield sparse models. Briefly explain what property of it enables the sparsity in a short sentence. (1p)

The variable selection property.

Terminology (4p) For each term (a-h) in the left list, find the explanation from the right list which best describes how the term is used in machine learning.
a) Error backpropagation
b) Expectation Maximization
c) k-fold cross validation
d) The Lasso
e) k-means
f) RANSAC
g) Subspace
h) Fisher's criterion
1) An approach to find useful dimension for classification
2) Algorithm to learn with latent variables
3) A space spanned by a set of linearly independent vectors
4) Estimating expected value
5) An approach to train artificial neural networks
6) Random strategy for amplitude compensation
7) A strategy to generate k different models
8) The last solution
9) Method for estimating the mean of k observations
10) Algorithm to estimate errors
11) Robust method to fit a model to data with outliers
12) An approach to regression that results in feature selection
13) Clustering method based on centroids
14) A subportion of area defined by two sets of parallel lines
15) A technique for assessing a model while exploiting available data for training and testing

a-5, b-2, c-15, d-12, e-13, f-11, g-3, h-1

Entropy equation

H = -Σ p_i log2(p_i), summed over all possible outcomes i
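A small illustrative Python sketch (the function name is my own) that reproduces the coin-toss cards below:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))   # 1.0 bit   (fair coin)
print(entropy([0.9, 0.1]))   # ~0.47 bit (skewed coin, less than one bit)
```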

Shannon Entropy Consider a single toss of a fair coin. Regarding the uncertainty of the outcome {head, tail}: a) The entropy is equal to two bits. b) The entropy is equal to one bit. c) The entropy is not related to uncertainty.

b) The entropy is equal to one bit.

Ensemble Learning Which one below correctly describes the property of the Adaboost Algorithm for classification? a) Adaboost algorithm is more suited to multi-class classification than binary classification. b) Models to be combined are required to be as similar as possible to each other. c) A weight is given to each training sample, and it is iteratively updated.

c) A weight is given to each training sample, and it is iteratively updated.

Principal Component Analysis (PCA) All of the following statements about PCA are true except a) PCA serves for subspace methods to represent the data distribution in each class. b) PCA is useful for reducing the effective dimensionality of data. c) PCA is a supervised learning method that requires labeled data.

c) PCA is a supervised learning method that requires labeled data.

Perceptron Learning When does the perceptron learning algorithm stop modifying the weights? a) When the step size reaches zero. b) When the error gradient becomes zero. c) When all training data is correctly classified

c) When all training data is correctly classified

In regression, one way of performing regularization is to introduce an additional term, the so-called shrinkage penalty. Which one of the three methods includes the additional term? (1p) i. Logistic regression. ii. Ridge regression. iii. k-NN regression.

ii. Ridge regression.

RSS

Residual Sum of Squares

k-fold cross validation

A technique for assessing a model while exploiting available data for training and testing. Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations.
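A minimal sketch of how the k folds can be generated with plain numpy (function and variable names are illustrative; the model-fitting step is left as a placeholder):

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Every sample is used for testing exactly once and for training k-1 times.
for train_idx, test_idx in kfold_indices(n_samples=10, k=5):
    pass  # fit the model on train_idx, evaluate it on test_idx
```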

Projection length

A similarity measure in the subspace method

Bagging

An example of ensemble learning (bootstrap aggregating)

Naive Bayes Classifier What is the underlying assumption unique to a naive Bayes classifier? a) A Gaussian distribution is assumed for the feature values. b) All features are regarded as conditionally independent. c) The number of features (the dimension of feature space) is large.

b) All features are regarded as conditionally independent.

Regression and Classification Choose the most proper statement reflecting the output formats of regression and classification. a) They are both discrete. b) Discrete for classification and continuous-valued for regression. c) Discrete for regression and continuous-valued for classification.

b) Discrete for classification and continuous-valued for regression.

Occam's razor

A principle to choose the simplest explanation: a scientific and philosophical rule that entities should not be multiplied unnecessarily, interpreted as requiring that the simplest of competing theories be preferred to the more complex, or that explanations of unknown phenomena be sought first in terms of known quantities.

Ensemble Learning Which one below best describes the characteristics of Ensemble methods in machine learning? a) Ensemble methods are aimed to exploit a large number of training data. b) Diverse models are trained and combined. c) Ensemble learning is not well-suited to parallel computing.

b) Diverse models are trained and combined.

Expectation Maximization

Algorithm to learn with latent variables In statistics, an expectation-maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.

If a quadratic kernel and no slack is used, state at least one of the data points which will surely not be a support vector. Motivate with a short argument why this point is unlikely to be a support vector.

All of these have another point from the same class much closer to the boundary between the classes, preventing the margin from extending out to these points.

Fisher's criterion

An approach to find useful dimension for classification Fisher's linear discriminant can be used as a supervised learning classifier. Given labeled data, the classifier can find a set of weights to draw a decision boundary, classifying the data. Fisher's linear discriminant attempts to find the vector that maximizes the separation between classes of the projected data

The Lasso

An approach to regression that results in feature selection. In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.
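The sparsity comes from the L1 penalty: for an orthonormal design, the lasso estimate of each coefficient is the soft-thresholded least-squares estimate, which is set exactly to zero when it is small. A hedged sketch of that operator (the lambda value is arbitrary):

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution for a single coefficient under an orthonormal design:
    coefficients with |z| <= lam become exactly zero -> feature selection."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

ls_coefs = np.array([2.0, 0.3, -0.1, -1.5])    # least-squares estimates
print(soft_threshold(ls_coefs, lam=0.5))       # [ 1.5  0.  -0.  -1. ]
```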

Dropout

An approach to train artificial neural networks. The term "dropout" refers to dropping out nodes (input and hidden layer) in a neural network. All the forward and backward connections with a dropped node are temporarily removed, thus creating a new network architecture out of the parent network.

k-nearest neighbour

Class prediction by a majority vote

k-means

Clustering method based on centroids k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
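A compact sketch of Lloyd's algorithm, the standard way to compute k-means (initialisation and stopping rule are simplified, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points; repeat until the centroids settle."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```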

Posterior probability

Conditional probability taking into account the evidence. The posterior probability is a type of conditional probability that results from updating the prior probability with information summarized by the likelihood, through an application of Bayes' theorem.

Curse of dimensionality

Dimensionally cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. In order to obtain a reliable result, the amount of data needed often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

Regression In regression, regularization can be achieved by adding a term, the so-called shrinkage penalty. Which one of the methods below introduces the additional term? a) Least squares. b) Ridge regression. c) k-NN regression.

b) Ridge regression.

RANSAC

Robust method to fit a model to data with outliers Random sample consensus, or RANSAC, is an iterative method for estimating a mathematical model from a data set that contains outliers. The RANSAC algorithm works by identifying the outliers in a data set and estimating the desired model using data that does not contain outliers.
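A minimal RANSAC sketch for fitting a line y = a*x + b to data with outliers (iteration count and inlier threshold are illustrative choices):

```python
import numpy as np

def ransac_line(x, y, n_iter=200, thresh=1.0, seed=0):
    """Fit a line to random minimal samples (2 points), count inliers,
    and keep the model with the largest consensus set."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = np.zeros(len(x), dtype=bool), (0.0, 0.0)
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                              # degenerate sample, skip
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        inliers = np.abs(y - (a * x + b)) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (a, b)
    return best_model, best_inliers               # optionally refit on the inliers
```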

Subspace method

The concept of the subspace method is derived from the observation that patterns belonging to a class form a compact cluster in high-dimensional vector space, where, for example, a w×h pixels image pattern is usually represented as a vector in w×h-dimensional vector space.

Briefly explain what information can be referred to when choosing an effective dimensionality of L.

The eigenvalues of the covariance (or autocorrelation) matrix

Regularization In regression, regularization is a process of introducing an additional term, the so-called shrinkage penalty. Which one of the three methods includes the additional term? a) k-NN regression. b) Ridge regression. c) Logistic regression.

b) Ridge regression.

Indicate the correct one as the basic strategy for selecting a question (attribute) at each node in decision trees. i. To minimize the expected reduction of the entropy. ii. To minimize the expected reduction of gini impurity. iii. To maximize the expected reduction of the entropy.

iii

The LASSO is known to yield sparse models. Explain what property of it enables the sparsity in a short sentence.

The variable selection property.

Support Vector Machine What does the concept of Structural Risk Minimization address? a) Splitting the data set such that training and testing is supported. b) Selecting a separating hyperplane such that future data is most likely classified correctly. c) Exploring multiple training methods to identify the best classification.

b) Selecting a separating hyperplane such that future data is most likely classified correctly.

In ridge regression, relative to least squares, a term called shrinkage penalty is added in the quantity to be minimised. They give improved prediction accuracy in some situations. Briefly explain when this happens in terms of bias-variance trade-off.

When the increase in bias is less than the decrease in variance.

What criteria are considered useful for generating a basis in PCA? i. Minimum variance criterion. ii. Minimum squared distance criterion. iii. Maximum variance criterion. iv. Maximum squared distance criterion.

ii. Minimum squared distance criterion. iii. Maximum variance criterion.

Which of the support vectors in (a) would have the largest associated alpha-value? You need to motivate the answer!

[0, 0] will have the largest alpha. This is the only negative support vector, so its alpha-value has to balance the sum of the two positive support vectors.

Assume you were given the four estimates from point (a) and the four estimates from point (b) (more data has been added in b), but you were not told how these were obtained. Suggest a way to measure if the estimates in (a) are more or less reliable than those in (b).

A measure of reliability of repeated estimates is the variance of the estimates. If repeated estimates have low variance we can trust each of them more. It is easy to see that the estimates in (b) have lower variance because they are based on a larger number of observations.

Regression and Classification Choose the most proper statement reflecting the output formats of regression and classification. a) Discrete for classification and continuous-valued for regression. b) Discrete for regression and continuous-valued for classification. c) They are both continuous-valued.

a) Discrete for classification and continuous-valued for regression.

Ensemble Learning Which one below correctly describes the characteristics of the boosting method in machine learning? a) Each training example has a weight which is re-weighted through iterations. b) Weak classifiers to be combined are chosen independently of each other. c) Weak classifiers are ensembled with equal contributions (reliability).

a) Each training example has a weight which is re-weighted through iterations.

Kernels What role does the kernel function have in a support vector machine? a) It computes the dot-product in a high-dimensional space. b) It integrates the error over the whole data set. c) It updates the weights in the network.

a) It computes the dot-product in a high-dimensional space.
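This can be checked numerically for the quadratic kernel K(x, z) = (x·z)^2, whose explicit feature map for 2-D inputs contains the degree-2 monomials (a small sketch; the example vectors are arbitrary):

```python
import numpy as np

def quad_kernel(x, z):
    return (x @ z) ** 2                        # kernel computed in input space

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(quad_kernel(x, z), phi(x) @ phi(z))      # both equal (x.z)^2 = 1.0
```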

Probabilistic Learning What are the model parameters in a multivariate normal (Gaussian) distribution? a) Mean vector and covariance matrix. b) Number of data points. c) Likelihood function.

a) Mean vector and covariance matrix.

For each of the learning algorithms a-c, what could be a possible cause for failure, that is, that the algorithm did not find a solution which classifies the training data correctly?
a) A single linear hyperplane classifier, trained using perceptron learning.
b) A two layer neural network with 10 hidden units, trained using backpropagation.
c) A support vector machine with a RBF kernel.
For each algorithm a-c, state which of the three alternative explanations below (i-iii) are possible causes of failure. Multiple answers may be correct (including none or all), so you must motivate why each alternative is possible or impossible (in total, 9 yes/no answers with short motivations).
i) The data was not linearly separable
ii) Learning got stuck in a local minimum
iii) Initial weights were inappropriate

a) Perceptron learning:
i) Yes, a pure hyperplane can only do linear separation.
ii) No, it will always converge.
iii) No, initial weights do not affect convergence.
b) Backpropagation:
i) No, unless only a single hidden unit is used, linear separability is not necessary.
ii) Yes, BP is prone to converge to local minima.
iii) Yes, a bad starting position can ruin convergence.
c) SVM:
i) No, the RBF kernel makes non-linear separation possible.
ii) No, the optimization problem is convex, so only one minimum exists.
iii) No, there are no initial weights involved.

For each answer, you have to give a clear motivation, but you do not have to mathematically derive the answers.
• Point d is outside the margin, so it must have α = 0.
• Point b is inside the margin, so slack is used and, hence, α = C = 4.
• Point a must have α = 1 because the other possible values (1.5 and 3.5) would not make it possible to fulfill the constraint Σ tᵢαᵢ = 0 (the alphas for the two positive samples would then sum up to more than the sum of the alpha values left for the negative samples).
• The remaining question is which of points c and e should have α = 1.5 and which should have α = 3.5. We can note that e is closer to the decision boundary, but more importantly, it must balance point b on the other side, and that point has a maximally high alpha. Hence, e must be the bigger of the two: point e has α = 3.5.
• Point c gets the only remaining value: α = 1.5.

a) Positive sample (−1, 1)
b) Positive sample (1, 1)
c) Negative sample (0, −1)
d) Negative sample (0, −2)
e) Negative sample (2, 1)

a) What are the two kinds of randomness involved in the design of Decision Forests?

a) Randomness (i) in generating bootstrap replicas and (ii) in possible features to use at each node.

Ensemble Methods (2p) Briefly answer the following questions regarding ensemble methods of classification. a) What are the two kinds of randomness involved in the design of Decision Forests? b) In Adaboost algorithm, each training sample is given a weight and it is updated according to some factors through an iteration of training weak classifiers. What are the two most dominant factors in updating the weights, and how are they used?

a) Randomness (i) in generating bootstrap replicas and (ii) in the possible features to use at each node. b) The update is according to (i) whether the sample was misclassified, and (ii) the reliability of the weak classifier based on the training error; the smaller the training error, the greater the reliability. The weight is increased if misclassified and decreased if classified correctly. The reliability is then used as the coefficient.
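A hedged sketch of one round of the standard (discrete) AdaBoost weight update, which is the usual way these two factors enter the algorithm (the exam answer above states it only qualitatively):

```python
import numpy as np

def adaboost_update(w, y, h, eps=1e-12):
    """One boosting round: y are true labels (+1/-1), h the weak classifier's
    predictions, w the current sample weights (summing to 1)."""
    err = np.sum(w * (h != y))                           # weighted training error
    alpha = 0.5 * np.log((1 - err + eps) / (err + eps))  # classifier reliability
    w = w * np.exp(-alpha * y * h)                       # misclassified samples gain weight
    return w / w.sum(), alpha                            # renormalise

w = np.full(4, 0.25)
y = np.array([1, 1, -1, -1])
h = np.array([1, -1, -1, -1])                            # one sample misclassified
print(adaboost_update(w, y, h))                          # its weight grows to 0.5
```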

Support Vector Machine What does it mean when a training sample gets a high α-value in a support vector machine? a) That sample has a large influence on the resulting classifier. b) That sample is associated with high uncertainty. c) That sample occurs many times in the training set.

a) That sample has a large influence on the resulting classifier.

Consider a single toss of a skewed coin (it is likely to show one side more than the other side). Regarding the uncertainty of the outcome {head, tail}: a) The entropy is smaller than one bit. b) The entropy is equal to two bits. c) The entropy does not explain the uncertainty.

a) The entropy is smaller than one bit.

BackProp Learning (3p) There are a number of values involved when training a layered feed-forward artificial neural network:
1) The number of hidden layers
2) Initial weights
3) Updated weights
4) Input vectors
5) Target values
6) Output values for the nodes
7) Local (generalized) errors
8) The number of nodes in each layer
9) The step-size
10) The number of training samples
Learning using the back-propagation (BackProp) learning algorithm involves computation in three steps: a forward propagating step, a backward propagating step, and a final local step. What value (pick one from the list above) is computed in each of these steps:
a) The forward propagating step
b) The backward propagating step
c) The local step

a) The forward propagating step: Output values for the nodes. This must be executed in a forward propagating manner, because the output of every node depends on the outputs from nodes in the preceding layer.
b) The backward propagating step: Local (generalized) errors. The generalized errors are first computed at the output nodes, where the targets are known, and then propagated backwards to assign local errors to every node in the network.
c) The local step: Updated weights. Once the output values (from a) and generalized errors (from b) are known everywhere in the network, the weights can be updated. This can be done locally since no more global communication is needed.
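A minimal two-layer numpy sketch of the three steps for a single training sample (layer sizes, activation, loss, and step size are illustrative assumptions, not taken from the exam):

```python
import numpy as np

rng = np.random.default_rng(0)
x, target = rng.normal(size=3), np.array([1.0])            # one training sample
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))  # initial weights
eta = 0.1                                                   # step size
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# a) Forward propagating step: output values for the nodes.
h = sigmoid(W1 @ x)
out = sigmoid(W2 @ h)

# b) Backward propagating step: local (generalized) errors.
delta_out = (out - target) * out * (1 - out)   # at the output nodes (squared error)
delta_h = (W2.T @ delta_out) * h * (1 - h)     # propagated backwards

# c) Local step: updated weights, using only locally available quantities.
W2 -= eta * np.outer(delta_out, h)
W1 -= eta * np.outer(delta_h, x)
```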

Artificial Neural Networks Which statement describes the functionality of an artificial neuron (the perceptron)? a) The perceptron generates an output signal based on the integrated weighted input. b) Each perceptron solves a partial differential equation. c) The perceptron can be trained to compute arbitrary complex functions.

a) The perceptron generates an output signal based on the integrated weighted input.

Artificial Neural Networks What is the underlying principle when using backpropagation to train an artificial neural network? a) The weights are modified to minimize the mismatch between the actual and the desired output. b) A Gaussian distribution is used to approximate the training data. c) The number of hyper-planes is maximized using a dual formulation.

a) The weights are modified to minimize the mismatch between the actual and the desired output.

Artificial Neural Networks What values of an artificial neural network are adjusted during training when using the Back-Propagation algorithm? a) Weights and thresholds b) Means and covariance c) Labels of training samples

a) Weights and thresholds

To apply the maximum variance criterion in Principal Component Analysis (PCA), a) covariance matrix of given vectors b) random sampling from given vectors c) least squares approximation is a useful tool. Choose the most proper statement.

a) covariance matrix of given vectors In the Maximum Variance Formulation, the goal is to find the orthogonal projection of the data into a lower dimensional linear space such that the variance of the projected data is maximised
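A short sketch of PCA via the covariance matrix; the eigenvalue spectrum is what the earlier card refers to for choosing an effective dimensionality (the function name and interface are my own):

```python
import numpy as np

def pca(X, n_components):
    """Maximum-variance PCA: eigen-decompose the covariance matrix and keep
    the eigenvectors with the largest eigenvalues as the orthonormal basis."""
    Xc = X - X.mean(axis=0)                   # centre the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    basis = eigvecs[:, :n_components]         # orthonormal basis {u1, ..., up}
    return Xc @ basis, basis, eigvals         # projections, basis, spectrum
```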

Probabilistic Learning What is the goal of maximum likelihood classification of an observation?To find the class that: a) maximizes the probability of the observation conditioned on the class. b) has the maximum probability in a Gaussian distribution. c) has the maximum prior probability.

a) maximizes the probability of the observation conditioned on the class.

Probabilistic Learning The goal of maximum a posteriori estimation is to find the model parameters that ... a) optimize the likelihood of the new observations in conjunction with the a priori information. b) maximize a convex optimality criterion. c) maximize the prior.

a) optimize the likelihood of the new observations in conjunction with the a priori information.

Terminology (4p) For each term (a-h) in the left list, find the explanation from the right list which best describes how the term is used in machine learning.
a) Occam's razor
b) RANSAC
c) Dropout
d) Fisher's criterion
e) Support Vector
f) Perceptron Learning
g) k-means
h) Posterior probability
1) Clustering method based on centroids
2) A concept of accepting high model complexity
3) Method to find separating hyperplanes
4) Conditional probability taking into account the evidence
5) Robust method to fit a model to data with outliers
6) Sudden drop of performance
7) Random strategy for amplitude compensation
8) Learning trying to mimic human vision
9) Data point affecting the decision boundary
10) A principle to choose the simplest explanation
11) An approach to train artificial neural networks
12) Probability at a later time
13) Vector representation of a feature
14) Method for estimating the average of k observations
15) An approach to find useful dimension for classification

a-10, b-5, c-11, d-15, e-9, f-3, g-1, h-4 (f-11 is also allowed.)

Given a dataset D = {(x,y)1,...,(x,y)n}, Maximum Likelihood estimates of the parameters of P(x,y) are computed with which assumption and optimality criterion? a) Choose the parameters that maximize the likelihood of D assuming all observations of D are independently and identically distributed multivariate Gaussian. b) Choose the parameters that maximize the likelihood of D assuming all observations of D are independently and identically distributed given y. c) Choose the parameters that maximize the likelihood of D assuming it is a representative sample of the problem domain.

b) Choose the parameters that maximize the likelihood of D assuming all observations of D are independently and identically distributed given y.

Terminology (4p) For each term (a-h) in the left list, find the explanation from the right list which best describes how the term is used in machine learning.
a) Bagging
b) Posterior probability
c) Dropout
d) Expectation Maximization
e) Curse of dimensionality
f) Occam's razor
g) k-means
h) RANSAC
1) A principle to choose the simplest explanation
2) Probability before observation
3) Algorithm to learn with latent variables
4) Issues in data sparsity in space
5) Estimating expected value
6) An approach to train artificial neural networks
7) Random strategy for amplitude compensation
8) A strategy to generate k different models
9) Probability at a later time
10) Method for estimating the mean of k observations
11) Robust method to fit a model to data with outliers
12) A technique for assessing a model while exploiting available data for training and testing
13) Conditional probability taking into account the evidence
14) Clustering method based on centroids
15) Bootstrap aggregating

a-15, b-13, c-6, d-3 e-4, f-1, g-14, h-11

For each term (a-h) in the left list, find the explanation that best describes how the term is used in machine learning from the list on the right, and indicate it by the number.
a) Expectation Maximization
b) Posterior probability
c) RANSAC
d) Dropout
e) The Lasso
f) Bagging
g) Error back-propagation
h) k-fold cross validation
1) A technique for assessing a model while exploiting available data for training and testing
2) A method for preventing artificial neural networks from overfitting
3) Algorithm to learn with latent variables
4) The last solution
5) Conditional probability taking into account the evidence
6) Probability at a later time
7) An approach to regression that results in feature selection
8) Sudden drop of performance
9) A strategy to generate k different models
10) Random strategy for amplitude compensation
11) Algorithm to train artificial neural networks
12) Implementation of the bag-of-words model
13) Estimating expected value
14) Bootstrap aggregating
15) Robust method to fit a model to data with outliers

a-3, b-5, c-15, d-2, e-7, f-14, g-11, h-1

For each term (a-e) in the left list, find the explanation from the right list which best describes how the term is used in machine learning.
a) k-means
b) Dropout
c) k-fold cross validation
d) Expectation Maximization
e) k-nearest neighbour
1) Robust method to fit a model to data with outliers
2) Algorithm to learn with latent variables
3) An approach to train artificial neural networks
4) A strategy to generate k different models
5) Method for estimating the mean of k observations
6) Clustering method based on centroids
7) Class prediction by a majority vote
8) Estimating expected value
9) A technique for assessing a model while exploiting available data for training and testing
10) An approach to find useful dimension for classification

a-6, b-3, c-9, d-2, e-7

Terminology (4p) For each term (a-h) in the left list, find the explanation from the right list which best describes how the term is used in machine learning.
a) Posterior probability
b) k-fold cross validation
c) Curse of dimensionality
d) The Lasso
e) Dropout
f) Perceptron Learning
g) RANSAC
h) Projection length
1) Robust method to fit a model to data with outliers
2) Method to find separating hyperplanes
3) A strategy to generate k different models
4) The last solution
5) A technique for assessing a model while exploiting available data for training and testing
6) Learning trying to mimic human vision
7) Conditional probability taking into account the evidence
8) A similarity measure in the subspace method
9) The length of cast shadow
10) Issues in data sparsity in space
11) Probability at a later time
12) An approach to train artificial neural networks
13) Sudden drop of performance
14) Random strategy for amplitude compensation
15) An approach to regression that results in feature selection

a-7, b-5, c-10, d-15, e-12, f-2, g-1, h-8

Terminology (4p) For each term (a-h) in the left list, find the explanation from the right list which best describes how the term is used in machine learning.
a) k-means
b) The Lasso
c) RANSAC
d) Dropout
e) Curse of dimensionality
f) k-nearest neighbour
g) Expectation Maximization
h) k-fold cross validation
1) An approach to train artificial neural networks
2) Estimating expected value
3) Robust method to fit a model to data with outliers
4) Random strategy for amplitude compensation
5) An approach to regression that results in feature selection
6) The final solution
7) Method for estimating the mean of k observations
8) Clustering method based on centroids
9) Sudden drop of performance
10) Issues in data sparsity in space
11) A technique for assessing a model while exploiting available data for training and testing
12) An approach to generate k different models
13) Algorithm to learn with latent variables
14) Problems in slow computation
15) Class prediction by a majority vote

a-8, b-5, c-3, d-1, e-10, f-15, g-13, h-11

For each term (a-e) in the left list, find the explanation from the right list which best describes how the term is used in machine learning.
a) Curse of dimensionality
b) Fisher's criterion
c) Bagging
d) RANSAC
e) Occam's razor
1) Problems in high computational cost
2) The bag-of-words model
3) An example of ensemble learning
4) A concept of accepting high model complexity
5) Robust method to fit a model to data with outliers
6) A principle to choose the simplest explanation
7) An approach to find useful dimension for classification
8) Issues in data sparsity in space
9) An example of unsupervised learning
10) Random strategy for amplitude compensation

a-8, b-7, c-3, d-5, e-6

Assume someone suggests using a non-linear kernel for the SVM classification of the above data set (A-H). Give one argument in favor and one argument against using non-linear SVM classification for such a data set. USE KEYWORDS!

At least one of each; max 1 point for (+) and 1 point for (-).
+) The decision boundary margin might get wider with a non-linear kernel.
+) The same learning approach is likely to work for additional (possibly more complex) data.
-) More computing resources required.
-) The algorithm is more difficult to implement.

Ensemble Learning Which one below correctly describes the property of the Adaboost Algorithm for classification? a) Models to be combined are required to be as similar as possible to each other. b) A weight is given to each training sample, and it is iteratively updated. c) Adaboost algorithm is more suited to multi-class classification than binary classification.

b) A weight is given to each training sample, and it is iteratively updated.

Naive Bayes Classifier Naive Bayes classification assumes Pr(x1, ..., xD | Y = y) = ∏_{d=1}^{D} Pr(xd | Y = y). This assumption means: a) All D dimensions of an observation are conditionally distributed Bernoulli. b) All D dimensions of an observation are conditionally independent given Y. c) Y is conditionally independent of all D dimensions of an observation.

b) All D dimensions of an observation are conditionally independent given Y .

Support Vector Machine When using a kernel function, for example in a support vector machine, what does this function correspond to, mathematically? a) The generalised distance between any data point and the decision boundary. b) The scalar product between two data points transformed into a higher dimensional space. c) The midpoint of the training data, computed separately for each class.

b) The scalar product between two data points transformed into a higher dimensional space.

b) In Adaboost algorithm, each training sample is given a weight and it is updated according to some factors through an iteration of training weak classifiers. What are the two most dominant factors in updating the weights, and how are they used?

b) The update is according to (i) whether the sample was misclassified, and (ii) the reliability of the weak classifier based on the training error; the smaller the training error, the greater the reliability. The weight is increased if misclassified and decreased if classified correctly. The reliability is then used as the coefficient.

Principal Component Analysis (PCA) Which one is considered as the main purpose of the principal component analysis (PCA)? a) To find the least squares fit. b) To reduce the effective number of variables. c) To use class labels in an optimal way.

b) To reduce the effective number of variables.

Principal Component Analysis (PCA) Which one is considered as the main purpose of the principal component analysis (PCA)? a) To treat the data in infinite dimensional space. b) To reduce the effective number of variables. c) To find the least squares fit.

b) To reduce the effective number of variables.

Support Vector Machine What are the support vectors in a support vector machine? a) Weights describing how important each sample is b) Training data samples used to define the decision boundary c) Orthogonal base vectors used to describe the kernel

b) Training data samples used to define the decision boundary

Probabilistic Learning What are latent variables in estimation? a) Variables that do not influence the accuracy. b) Variables that are not directly observed. c) Deterministic factors.

b) Variables that are not directly observed.

Perceptron Learning Rule The Perceptron Learning Rule is used to ... a) adjust the step size for optimal learning. b) update the weights when a training sample is erroneously classified. c) minimize the entropy over the whole training dataset.

b) update the weights when a training sample is erroneously classified.
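A minimal perceptron-learning sketch consistent with these cards (bias folded into the weights, labels in {+1, -1}; the epoch limit is an illustrative safeguard for non-separable data):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Update the weights only when a sample is misclassified; stop once
    all training data are classified correctly."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:              # misclassified (or on the boundary)
                w += yi * xi                    # perceptron learning rule
                errors += 1
        if errors == 0:                         # all samples correctly classified
            break
    return w
```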

In Adaboost algorithm, each training sample is given a weight and it is updated according to some factors through an iteration of training classifiers. b-1. What are the two most dominant factors in updating the weights? b-2. How are those two factors used to update the weight?

b-1) The update is according to (i) if the sample was misclassified, and (ii) the reliability of the weak classifier based on the training error; the smaller the training error, the greater the reliability. b-2) The weight is increased if the sample was misclassified, and decreased if correctly classified. The reliability is then used as the coefficient.

What is the advantage of using a multi-layered artificial neural network (as opposed to a single-layered)? a) Learning is guaranteed to converge to a unique solution b) All input variables become independent c) More complex decision boundaries can be formed

c) More complex decision boundaries can be formed

The Subspace Method For the subspace methods, a technique of dimensionality reduction is often used to represent the data distribution in each class. Which of these techniques is most suited for this purpose? a) Pulse-Code Modulation (PCM). b) Phase Change Memory (PCM). c) Principal Component Analysis (PCA).

c) Principal Component Analysis (PCA).

Probabilistic Learning Which of the following statements is false? a) Probabilistic learning involves estimating P(x,y) from a dataset D = {(x,y)1,...,(x,y)n}. b) Probabilistic learning helps one work with uncertainty in a problem domain. c) Probabilistic learning can only be used to create generative models.

c) Probabilistic learning can only be used to create generative models.

Support Vector Machine What property of the Support Vector Machine makes it possible to use the Kernel Trick? a) The weights are non-zero only in a limited part of the state space. b) The margin width grows linearly with the number of sample points. c) The only operation needed in the high dimensional space is to compute scalar products between pairs of samples.

c) The only operation needed in the high dimensional space is to compute scalar products between pairs of samples.

Principal Component Analysis (PCA) Which one is considered as the main purpose of the principal component analysis (PCA)? a) To find the least squares fit. b) To treat the data in infinite dimensional space. c) To reduce the effective number of variables.

c) To reduce the effective number of variables.

Mainly two kinds of randomness are known to form the basic principle of Decision Forests. In which two of the following processes are those randomnesses involved? i. In the rule of terminating a node as a leaf node. ii. In the way to formulate the information gain. iii. In feature selection at each node. iv. In deciding the number of trees used. v. In generating bootstrap replicas. vi. In combining the results from multiple trees. Simply indicate two among those above.

iii and v.

Mainly two kinds of randomness are known to form the principle of Decision Forests. In which two of the following processes are those randomnesses involved? (1p) i. In deciding the number of trees used. ii. In deciding the depth of trees used. iii. In the way to formulate the information gain. iv. In feature selection at each node. v. In generating bootstrap replicas.

iv. In feature selection at each node. v. In generating bootstrap replicas.

The Subspace Method (2p) Given a set of feature vectors which all belong to a specific class C (i.e. with an identical class label), we performed PCA on them and generated an orthonormal basis {u1, ..., up} as the outcome. When the training samples in the class are well localised, the basis can be considered as a tool to represent possible variations of feature vectors within C in terms of a p-dimensional subspace, L. Provide an answer to the following questions. Now, we consider solving a K-class classification problem with the Subspace Method and assume that a subspace L(j) (j = 1, ..., K) has been computed with training data for each class, respectively. Briefly explain the way to determine the class to which vector x should belong using the projection lengths.

x should belong to the class where the projection length to the corresponding subspace is maximised.
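A hedged sketch of that rule (assuming each class subspace L(j) is represented by a matrix U_j with orthonormal columns, e.g. from PCA on that class's training data):

```python
import numpy as np

def classify_by_projection_length(x, bases):
    """bases[j]: (d, p_j) matrix whose orthonormal columns span L(j).
    The projection length of x onto L(j) is ||U_j^T x||; pick the largest."""
    lengths = [np.linalg.norm(U.T @ x) for U in bases]
    return int(np.argmax(lengths)), lengths
```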

