TBMI26

"k-means clustering" and "mixture of gaussians" are two examples of the EM-algorithm (Expectation Maximization).

"k-means clustering" and "mixture of gaussians" are two examples of a general optimisation method. What is that method called?

Start at one of the exits. Compute V(s) for the preceding square using the formula. Continue until you cannot back up any further.

A learning system is exploring a room to find the best way to either of the two exits according to the drawing below. Note that the exits give different rewards. In the context of reinforcement learning, give the value function for each position in the room for the movement policy given by the arrows. Use the discount factor 0.5.
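The backward sweep described in the answer can be sketched as follows. This is a minimal illustration, not the exam's room: the corridor layout, the reward of 8, and the convention that the reward is received on the move into the exit are all assumptions, and `evaluate_policy` is a hypothetical helper name. It also assumes the policy's arrows never form a loop.

```python
def evaluate_policy(next_state, reward, gamma=0.5):
    """Return V(s) for every non-terminal state under a fixed policy.

    next_state[s] is the state the policy's arrow leads to from s, and
    reward[s] is the reward received for that move. States that never
    appear as keys of next_state are treated as exits with V = 0.
    """
    values = {}

    def value(s):
        if s not in next_state:          # exit: nothing more to collect
            return 0.0
        if s not in values:              # back up from the successor state
            values[s] = reward[s] + gamma * value(next_state[s])
        return values[s]

    for s in next_state:
        value(s)
    return values


# Hypothetical corridor a -> b -> c -> exit; stepping into the exit pays 8.
corridor = {"a": "b", "b": "c", "c": "exit"}
rewards = {"a": 0.0, "b": 0.0, "c": 8.0}
V = evaluate_policy(corridor, rewards)
```

With a discount factor of 0.5, each step further from the exit halves the value.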

The training sample was most likely wrongly classified and thus got an increased weight. It could be an outlier and could cause the classifier to overtrain.

After you have trained an AdaBoost classifier you note that one training sample has received a larger weight than other training samples. What could be the reason and effect of this?
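Why a misclassified sample gains weight can be seen in one round of the standard discrete AdaBoost update. The sketch below assumes labels in {-1, +1} and a weighted error strictly between 0 and 1; the function name and the four-sample data are made up for illustration, and the course may use a slightly different formulation.

```python
import math


def adaboost_reweight(weights, labels, predictions):
    """One round of the discrete AdaBoost weight update (labels in {-1, +1})."""
    # Weighted error of the current weak classifier.
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples (y * h = -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
predictions = [+1, +1, -1, +1]   # only the last sample is misclassified
new_weights, alpha = adaboost_reweight(weights, labels, predictions)
```

A sample that keeps being misclassified round after round, such as an outlier, accumulates weight in this way.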

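The same small set of filter weights is reused at every position of the input, so W is sparse and contains the same values repeated across its rows. This is called weight sharing.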

All the weights on one layer in a neural network can be described as a matrix W. Describe an important property of this matrix for a convolutional layer in a CNN (convolutional neural network).

Face detection often uses ensemble learning, i.e. supervised learning.

An algorithm that detects faces is common in modern cameras and mobile phones. Is such a function typically trained using supervised learning, unsupervised learning or reinforcement learning?

• k-Nearest Neighbors - S • Support Vector Machines - S • AdaBoost - S • Principal Component Analysis - U • Multi-layer Perceptron (Neural Network) - S • Mixture of Gaussian Clustering - U

Classify the following learning methods as supervised (S) or unsupervised (U): • k-Nearest Neighbors • Support Vector Machines • AdaBoost • Principal Component Analysis • Multi-layer Perceptron (Neural Network) • Mixture of Gaussian Clustering

Slack variables are required when the classes cannot be separated by a single line, i.e. when they are not linearly separable.

Describe or draw a situation when the so-called slack variables are required in Support Vector Machines.

Genetic algorithms can be used for problems where the variables are discrete and where we cannot differentiate and use gradient descent methods.

Describe the kind of optimization problems for which Genetic algorithms may be a suitable method.

It refers to the problem of choosing between looking for new policies (exploration) in the hope of finding better solutions and using current knowledge to maximize the reward (exploitation).

Explain (briefly) the exploration-exploitation dilemma in reinforcement learning.

Slack variables are used to allow some samples to be misclassified, in order not to overfit to noisy data and outliers.

Explain the purpose with the so-called slack variables in Support Vector Machines.

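It will improve the accuracy when the distributions of the classes are very different, since the boundary will be in the middle between the class centroids. Without the sigmoid, the mean square error also penalizes correctly classified samples that lie far from the boundary; the sigmoid saturates, so those samples no longer pull the boundary.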

Explain, or draw an example of, when a sigmoid function in the output layer of a linear classifier trained by minimizing the mean square error can improve the accuracy.

Number of correctly classified samples / Total number of samples

How is the accuracy of a classifier calculated?
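As a small worked example of the ratio above, assuming the true and predicted labels are given as two equally long lists (the function name and the labels are illustrative only):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Three of the four hypothetical samples are classified correctly.
result = accuracy(["X", "O", "X", "O"], ["X", "O", "O", "O"])
```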

(No answer yet: hamru will answer once he has taken that exam.)

In which kind of learning tasks are linear units more useful than sigmoid activation functions in the output layer of a multi-layer neural network?

Kernel methods need to store all the training samples for the classification. Having lots of training data would require a lot of storage space.

Kernel methods have a main disadvantage when we have lots of training data. Which?

• Q-learning: Reinforcement learning • k-means: Unsupervised • kNN: Supervised

Machine learning is often divided into the categories Supervised, Unsupervised and Reinforcement learning. Categorize the following three learning methods accordingly: • Q-learning • k-means • kNN (k nearest neighbors)

Support Vector Machines (SVM)

Mention a classifier that uses the maximum margin principle.

For example, using a momentum term.

Mention a method that can speed up gradient search.

Correlated features could mean that redundant information is provided. This could slow down the learning, increase the risk of overfitting and, in the case of decision trees, "hide" some paths.

Mention a reason why correlated features are undesired in supervised learning.

V(s) = max over a of Q(s, a), i.e. the Q-value of the best action in that state. For a move from A to B: Q(A, B) = reward + γ * V(B).

Suppose that you know the Q-function values for a certain state. How do you determine the V-value for that state?
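The V-value of a state is the largest Q-value available in that state, and the action attaining it is the greedy action. A minimal sketch, assuming the Q-values of one state are stored in a dictionary keyed by action (the action names and numbers are made up):

```python
def state_value(q_values):
    """V(s) = max over actions a of Q(s, a)."""
    return max(q_values.values())


def greedy_action(q_values):
    """The action whose Q-value equals V(s)."""
    return max(q_values, key=q_values.get)


# Hypothetical Q-values for a single state.
q_for_state = {"left": 1.0, "right": 4.0, "up": 2.5}
v = state_value(q_for_state)
best = greedy_action(q_for_state)
```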

10x10 (one entry for the kernel value between each pair of samples)

We have 10 samples in a 20-dimensional space that we want to analyse with a kernel method. How large is the kernel matrix?
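The size of the kernel matrix depends only on the number of samples, not on their dimension. The sketch below builds one for 10 random 20-dimensional samples; the choice of a Gaussian (RBF) kernel, the value of `sigma` and the random data are assumptions for illustration only.

```python
import math
import random


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(samples, kernel):
    """K[i][j] = kernel(samples[i], samples[j]) for every pair of samples."""
    return [[kernel(x, y) for y in samples] for x in samples]


random.seed(0)
samples = [[random.gauss(0.0, 1.0) for _ in range(20)] for _ in range(10)]
K = kernel_matrix(samples, rbf_kernel)
size = (len(K), len(K[0]))   # number of samples by number of samples
```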

It assumes that both classes are Gaussian (normally) distributed with equal covariance matrices.

What assumption is made about the distributions of the two classes in linear discriminant analysis (LDA)?

Random Forest

What do you get if you combine Bagging and decision trees?

The ratio between the squared difference of the projected class means and the sum of the variances of the projected classes.

What is being optimized by Linear Discriminant Analysis (LDA)?

The empirical risk is the number of wrong classifications on the training data.

What is described by the "empirical risk"?

The first eigenvalue of the data covariance matrix describes the maximum variance of the data

What is described by the first eigenvalue of the data covariance matrix?

The first eigenvector of the covariance matrix describes the direction of maximum variance in the data.

What is described by the first eigenvector of the data covariance matrix (if the data have zero mean)?
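The relation between the first eigenvalue (how much variance) and the first eigenvector (in which direction) can be checked numerically. The sketch below uses power iteration on a 2x2 covariance matrix; the four zero-mean data points are made up, and `first_eigenpair` is a hypothetical helper rather than a library routine.

```python
def covariance(data):
    """2x2 covariance matrix of zero-mean 2-D data (normalised by N)."""
    n = len(data)
    cxx = sum(x * x for x, _ in data) / n
    cxy = sum(x * y for x, y in data) / n
    cyy = sum(y * y for _, y in data) / n
    return [[cxx, cxy], [cxy, cyy]]


def first_eigenpair(c, iterations=100):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    v = [1.0, 0.0]
    for _ in range(iterations):
        w = [c[0][0] * v[0] + c[0][1] * v[1],
             c[1][0] * v[0] + c[1][1] * v[1]]
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = [w[0] / norm, w[1] / norm]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v.
    eigenvalue = (v[0] * (c[0][0] * v[0] + c[0][1] * v[1])
                  + v[1] * (c[1][0] * v[0] + c[1][1] * v[1]))
    return eigenvalue, v


# Hypothetical zero-mean data, spread mostly along the direction (2, 1).
data = [(2.0, 1.0), (-2.0, -1.0), (-0.5, 1.0), (0.5, -1.0)]
eigenvalue, direction = first_eigenpair(covariance(data))

# Variance of the data after projection onto the first eigenvector.
projections = [x * direction[0] + y * direction[1] for x, y in data]
projected_variance = sum(p * p for p in projections) / len(data)
```

The variance of the projected data equals the first eigenvalue, and no other direction gives a larger projected variance.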

It needs to be differentiable

What is the basic requirement on an activation function for being used in a multilayer perceptron with back propagation?

Fewer parameters to train, which reduces the risk of over-fitting.

What is the benefit with a convolutional layer compared to a fully connected layer in a multi-layer neural network?

The purpose of the hidden layers in a multi-layer perceptron classifier is to transform the data to a space where the problem is linearly separable.

What is the purpose of the hidden layers in a multi-layer perceptron classifier?

The momentum term improves the speed of convergence by bringing some eigen components of the system closer to critical damping

What is the purpose with a momentum term in gradient descent?
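The effect of a momentum term can be tried on a small ill-conditioned quadratic. The sketch below compares plain gradient descent with the classical momentum update; the test function, the learning rate 0.03, the momentum factor 0.8 and the step count are arbitrary choices for illustration.

```python
def gradient(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = p
    return (x, 25.0 * y)


def descend(start, lr=0.03, momentum=0.0, steps=200):
    """Gradient descent with an optional momentum (velocity) term."""
    p = list(start)
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(p)
        for i in range(2):
            # The velocity keeps a decaying memory of earlier gradient steps.
            velocity[i] = momentum * velocity[i] - lr * g[i]
            p[i] += velocity[i]
    return p


def distance_to_minimum(p):
    """Euclidean distance to the minimum at the origin."""
    return (p[0] ** 2 + p[1] ** 2) ** 0.5


plain = descend((1.0, 1.0), momentum=0.0)
with_momentum = descend((1.0, 1.0), momentum=0.8)
```

In this setup the slowly converging x-direction limits plain gradient descent, while the momentum run ends much closer to the minimum after the same number of steps.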

The exploration-exploitation problem

What problem can be illustrated by the multi-armed bandit?

• k = 1: x -> O • k = 3: x -> X

Which class does the data sample 'x' belong to using a k-Nearest Neighbor classifier with k = 1 and k = 3 respectively?
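How the answer can change with k is easy to reproduce. The sketch below is a minimal k-Nearest Neighbor classifier with majority voting; the training points are made up and are not the exam figure, but they are placed so that the single nearest neighbor is an O while two of the three nearest are X.

```python
from collections import Counter


def knn_classify(query, samples, k):
    """samples is a list of (point, label) pairs; returns the majority label."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(point, query))

    nearest = sorted(samples, key=lambda s: sq_dist(s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: one O close to the query, two X a bit further.
training = [((1.0, 0.0), "O"), ((2.0, 0.0), "X"),
            ((0.0, 2.0), "X"), ((5.0, 5.0), "O")]
query = (0.0, 0.0)
label_k1 = knn_classify(query, training, k=1)
label_k3 = knn_classify(query, training, k=3)
```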

• Back-propagation • k-NN • LDA (Linear Discriminant Analysis) • SVM (Support Vector Machines)

Which of the following methods are supervised learning methods: • Mixture of Gaussians • Back-propagation • k-NN • LDA (Linear Discriminant Analysis) • SVM (Support Vector Machines) • PCA (Principal Component Analysis)

• y = s: NO, it cannot be used, since the activation function in a hidden layer needs to be non-linear to be useful • y = tanh(s): YES • y = s/abs(s): NO, this function is not differentiable at all points • y = e^(-s^2): YES

Which of these functions can be used in the hidden layers of a back-prop network? • y = s • y = tanh(s) • y = s/abs(s) • y = e^(−s^2)

The trick is to pre-train the weights with unsupervised learning before backpropagation is applied to the system.

Which trick is used in deep learning of a neural network to train many layers?

• y = s: the activation function in a hidden layer needs to be non-linear in order to be useful • y = sign(s): it is not differentiable at every point and is therefore not useful

Why are the following two functions not useful in the hidden layers in a back-propagation neural network? • y = s • y = sign(s)

k(x,y) = Φ(x)^T * Φ(y) where Φ(x) is a non-linear function

Write the general definition of a kernel function k(x, y)
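The definition can be checked for a concrete case. The sketch below uses the homogeneous second-degree polynomial kernel on 2-D inputs, for which an explicit feature map is known; this particular kernel and the two sample vectors are chosen only for illustration.

```python
import math


def poly_kernel(x, y):
    """k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit feature map with phi(x) . phi(y) = (x . y)^2."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Ordinary inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(x, y)          # computed in the 2-D input space
via_features = dot(phi(x), phi(y))      # computed in the 3-D feature space
```

Both routes give the same number, but the kernel never forms the feature vectors explicitly.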

PCA (projects the data onto orthogonal axes, which makes the new features uncorrelated)

You have a high-dimensional data set where many of the features are correlated. Suggest a proper pre-processing before feeding that data into a classifier.

• PCA: A line separating the two classes • FLD: A line going through both classes (so that the classes are separated from each other when projected onto that line)

You want to perform a dimensionality reduction from 2 dimensions to 1 dimension of the 2-class data below. Draw the approximate projection directions you would obtain using Principal Component Analysis and Fisher Linear Discriminant.

