Machine Learning Exam Questions


a. COMMENT: The risk increases since the network will have more parameters than necessary to solve the problem, which is likely to lead to overfitting (= overtraining). Another way to describe it is that more parameters make it easier for the network to learn the patterns "by heart", instead of learning the underlying function. b. The risk decreases. More patterns make it more difficult to learn the patterns by heart. c. The risk increases. The network gets more time to overfit or to learn patterns by heart.

1. For neural networks, how is the risk of overtraining/overfitting affected (increased or decreased) by: a) increased network size? (number of hidden nodes and/or layers) b) increased training set size? (number of input-target examples) c) increased training time? (number of passes through the training set)

a.) The line equation of a 2-input binary perceptron (given by setting the weighted sum to 0, since the node flips there, and solving for x2) is: x2 = -(w1/w2)x1 + theta/w2. This cuts the x2 axis at theta/w2 and the x1 axis at theta/w1. So in this case, it cuts the x2 axis at 2 and the x1 axis at -1. b.) The node responds with +1 on the outside (of the diamond), i.e. with 0 on the origin side. This is easiest to test for the origin: with both inputs equal to 0, the weighted sum is -2 (due to the threshold), and therefore the node's output is 0 on that side. c.) By negating all weights (including the threshold).

A binary perceptron defines a hyperplane in the input space and responds with a +1 on one side of it, 0 on the other. The figure to the left illustrates four hyperplanes, together forming the shape of a diamond. Below that, a binary perceptron which defines one of them. The hyperplanes are enumerated 1-4 and they cut the x1-axis at -1 and +1 and the x2-axis at -2 and +2. a.) which of the four hyperplanes is defined by the perceptron below? b.) on which side of the hyperplane does the perceptron in the figure respond with +1? c.) What is the simplest way to adjust the weights of this perceptron so that it responds with a +1 on the other side?

In pbest, a particle only considers its personal best. It gets no information from the other particles, and pbest is therefore the most similar to random search (most chaotic). In gbest, particles strive for a weighted combination of their personal bests and the best position found by any particle in the swarm. This leads to a much more directed search with a tight swarm, where most particles strive in the same general direction. gbest is therefore the most likely to get stuck. lbest is a compromise between the two - a generalization of the algorithm which contains pbest and gbest as two extremes. Here, a particle strives for a weighted combination of its personal best and the best of its neighbours (usually not the whole flock), defined by a neighbourhood graph.

Comparing the three Particle Swarm Optimization variants pbest, lbest and gbest, which one is most similar to random search? Which one is most likely to get stuck in local optima, due to premature convergence?

c) It's a polygon. I should have pointed out that the space is Euclidean. A node's Voronoi region is the input region for which it would win for any incoming pattern. The border between two adjacent nodes must therefore (in 2D Euclidean space) be a line, and if a node is surrounded by several other nodes, the lines would form the sides of a polygon.

Consider a competitive learning network applied to a 2D clustering problem. The network has been trained for a while, so the nodes have already found their approximate positions. Consider a node somewhere in the middle (surrounded by other nodes). Which of the following best describes the shape of its Voronoi region? (Mark one) a) It's a circle b) It's a square c) It's a polygon d) It's a smooth polygon (a polygon where the sides may be curved) e) It can take any shape

wk = [0.4, -0.2, 0.1]

Consider a competitive learning network to which we present an input vector x = [0.5, -0.3, 0.1]. The closest node, the winner, k, has the weight vector wk = [0.3, -0.1, 0.1]. Compute the result of updating wk by the standard competitive learning rule, with learning rate h=0.5.
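The arithmetic behind the answer can be checked with a few lines of Python. This is a minimal sketch, assuming the standard rule moves the winner a fraction h of the way towards the input, wk <- wk + h(x - wk); the function name competitive_update is made up for illustration.

```python
def competitive_update(w, x, h):
    """Move the winner's weight vector w a fraction h of the way towards x."""
    return [wi + h * (xi - wi) for wi, xi in zip(w, x)]

x = [0.5, -0.3, 0.1]    # input vector from the question
wk = [0.3, -0.1, 0.1]   # winner's weights before the update
new_wk = competitive_update(wk, x, h=0.5)
print([round(v, 6) for v in new_wk])
```

With h=0.5 the winner ends up halfway between its old position and the input, which matches wk = [0.4, -0.2, 0.1].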

e) 0.95^2. Q(s,a) = r + g*max Q(s',a'). For the last state, s'', before the goal it is just r (there are no Q-values in the goal state). So, going north there should have the value 1. Going to that state from any neighbouring state, s', should therefore have the value g*1 (= 0.95), and the state, s, in the bottom left, should have the value g*g*1 (= 0.95^2). d) 0.95^3. That's a wasted time step, since we end up in the same state, s, but we can still go to the goal after that. The chain just becomes one step longer. The value should be g*g*g*1 (= 0.95^3).

Consider a reinforcement learning problem where all possible actions are rewarded by 0 except for the one(s) leading directly to a goal state, for which the reward is 1. The Gridworld problem from lab 2, where you trained Marvin the robot to find the goal in a simple maze, was an example of such a problem. Let's assume that the agent is trained by Q-Learning and that it is currently in the position illustrated by the robot in the figure to the left, i.e. three steps from the goal (the cup). The exploration rate, e, is 0.5. The discount factor, g, is 0.95. The learning rate, h, is 0.1. What should the Q-value for moving right converge to? (Mark one) a) 0 b) 0.095 c) 0.5 d) 0.95^3 e) 0.95^2 What should the Q-value for moving down (into the wall) converge to, assuming that this action is allowed? (the agent remains in the same state) (Mark one) a) 0 b) 0.095 c) 0.5 d) 0.95^3 e) 0.95^2
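The two converged values can be reproduced on a simplified stand-in for the maze. The sketch below is an assumption-laden illustration, not the lab's Gridworld: it uses a straight corridor of three non-goal states, two hypothetical actions (FORWARD and STAY, where STAY models bumping into the wall), and repeatedly applies the relation Q(s,a) = r + g*max Q(s',a') deterministically instead of running stochastic Q-Learning.

```python
GAMMA = 0.95          # discount factor g from the question
N_STATES = 3          # non-goal states 0, 1, 2; the goal follows state 2
FORWARD, STAY = 0, 1  # hypothetical action names

def step(state, action):
    """Return (reward, next_state); next_state None marks the goal."""
    if action == STAY:               # bumping into a wall: same state, reward 0
        return 0.0, state
    if state == N_STATES - 1:        # the move that enters the goal
        return 1.0, None
    return 0.0, state + 1

q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(200):                 # sweep until the values stop changing
    for s in range(N_STATES):
        for a in (FORWARD, STAY):
            r, s_next = step(s, a)
            future = 0.0 if s_next is None else max(q[s_next])
            q[s][a] = r + GAMMA * future

print(q[0][FORWARD], GAMMA ** 2)     # three steps from the goal, moving towards it
print(q[0][STAY], GAMMA ** 3)        # wasted step, chain becomes one step longer
```

The exploration rate and learning rate do not appear here because they affect how quickly Q-Learning gets to these values, not the values it converges to.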

Best: c) 15 binary inputs for each integer - where the integer ni is represented by setting the first ni bits to 1 and the rest to 0. Worst: b) Four binary inputs for each integer - one for each bit in a binary representation (i.e. the same integer in base 2). c) should be the best. Neural networks are pattern recognizers and c) represents the numbers as patterns (think of it as the bar of a thermometer). Using distributed representations (avoiding the need to decode data) is more important than keeping the number of weights down. The number of effective parameters to adjust in the network is far less than the number of weights in this case, since they are so dependent (and many come from 0-inputs). a) is OK, but may lead to numerical problems. It hides an important piece of information from the network - that the numbers are actually integers. Here, they are represented as floats. It may be difficult for the network to distinguish between two adjacent integers. b) is, by far, the worst. Many students marked it as the best. This approach forces the network to learn not only how to multiply two numbers, but also how to decode binary numbers. Two numbers close to each other have very different bit string representations, which makes it difficult for the network to exploit one of its main advantages - its ability to generalize. Furthermore, the target function is monotonic w.r.t. the input representations in a) and c), but not in b). Multilayer perceptrons usually find monotonic functions faster than non-monotonic ones (since the sigmoids are monotonic).

Consider training a multilayer perceptron to approximate a monotonic function of two integers (e.g. to multiply them, but the exact function is not relevant here). Both integer inputs are in the range [0,15]. How would you represent the integers? a) One real number input, xi in [0,1], for each integer, ni, where xi = ni/15. b) Four binary inputs for each integer - one for each bit in a binary representation (i.e. the same integer in base 2). c) 15 binary inputs for each integer - where the integer ni is represented by setting the first ni bits to 1 and the rest to 0. Which is the best representation, and which is the worst? (Ring in the best, cross/strikeout the worst)
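The point about adjacent integers having very different bit strings in b) but not in c) can be made concrete. A small sketch, with made-up helper names (binary_code, thermometer_code, hamming), comparing the number of differing inputs for the neighbours 7 and 8:

```python
def binary_code(n, bits=4):
    """Representation b): the integer in base 2, most significant bit first."""
    return [(n >> i) & 1 for i in reversed(range(bits))]

def thermometer_code(n, size=15):
    """Representation c): the first n inputs are 1, the rest are 0."""
    return [1 if i < n else 0 for i in range(size)]

def hamming(a, b):
    """Number of positions where two codes differ."""
    return sum(x != y for x, y in zip(a, b))

print(binary_code(7), binary_code(8))
print(hamming(binary_code(7), binary_code(8)))            # all four bits flip
print(hamming(thermometer_code(7), thermometer_code(8)))  # a single input changes
```

In the thermometer code, the distance between two codes equals the difference between the integers, which is one way to see why the target stays monotonic in the inputs.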

Indirect communication and coordination, by local modification and sensing of the environment

Define Stigmergy in one sentence!

Q(s,a) = r + g*Q(s',a'), where g is the discount factor and a' is the selected action in state s' (i.e. not necessarily the one with the maximum Q-value, as in Q-Learning).

Define a' (in your equation)

A genotype is an encoding of a suggested solution to the problem. The phenotype is the interpretation of that encoding, i.e. what the individual actually does.

Define the two evolutionary computing concepts genotype and phenotype! (One sentence each should be sufficient)

w1 = +2, THETA = +2, w2 = +1. The line equation of a 2-input binary perceptron is given by setting the weighted sum to 0, since the node flips there, and solving for x2: x2 = -(w1/w2)x1 + (THETA/w2). This cuts the x2 axis at (THETA/w2) and the x1 axis at (THETA/w1). So, in this case, THETA should be equal to w1 and w2 should be half of that. Negating all values defines the same hyperplane; the only difference is on which side it responds with +1.

Set the weights of a binary perceptron (write the values down in the empty squares below) to implement the hyperplane in the coordinate system to the right! The hyperplane cuts the x2 axis at +2 and the x1 axis at +1.
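The intercept formulas can be checked directly. This is a minimal sketch, assuming the weighted sum is w1*x1 + w2*x2 - THETA and that the node outputs 1 when the sum is positive; the function names are illustrative only.

```python
def intercepts(w1, w2, theta):
    """Axis crossings of the line w1*x1 + w2*x2 - theta = 0."""
    return theta / w1, theta / w2    # (x1-intercept, x2-intercept)

def output(w1, w2, theta, x1, x2):
    """Binary perceptron: 1 on the positive side of the hyperplane, else 0."""
    return 1 if w1 * x1 + w2 * x2 - theta > 0 else 0

print(intercepts(2.0, 1.0, 2.0))        # cuts x1 at +1 and x2 at +2
print(intercepts(-2.0, -1.0, -2.0))     # negated values: same hyperplane
print(output(2.0, 1.0, 2.0, 0, 0))      # response at the origin
print(output(-2.0, -1.0, -2.0, 0, 0))   # negated values: the response flips
```

Any positive multiple of (w1, w2, THETA) = (2, 1, 2) gives the same hyperplane and the same responses, so the answer is not unique.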

c) The training set had greater variance than this test set. The activated area has a lower variance than the whole map, which must therefore have been trained on data with greater variance. If it had been the other way around (d), many of the test patterns would activate the edge nodes (since the inputs would in that case often be outside the area which the network was trained on).

The figure below illustrates the result after testing a self-organizing feature map on a test set (different from the training set trained on earlier). The squares represent the nodes and the blue numbers in some of them are the number of test cases for which that node was the winner (closest to the input). An empty square means that that node never won. What is the most likely reason why only a few of the nodes win for this test set? (Mark one) a) The network has not been trained for long enough b) The network has been trained for too long c) The training set had greater variance than this test set d) The training set had smaller variance than this test set e) The network got stuck in a local optimum

10001111 11010110

The two bit vectors to the left below have been selected for two-point crossover. The arrows mark the crossover points. What is the result? (fill in the empty squares to the right, representing the offspring)

The learning rate (h), i.e. the step length when moving nodes, and the radius/width of the neighbourhood function (if it is a Gaussian, this is controlled by s, the standard deviation)

Training a self-organizing feature map usually works best if at least two of its parameters are decreased over time. Which are these two parameters? (Name them and describe them briefly. One sentence each should be sufficient)

Train with batch learning. Initializing the network from the data can of course also be done in competitive learning, but it does not make it equivalent to K-means if you still train by pattern learning.

What is the simplest way to make Standard Competitive Learning equivalent to K-means?

c) A feedforward neural network with state, s, on its inputs and one output for every possible action, estimating Q(s, ai) for all actions ai. Only one of the outputs is trained, using the right hand side of the relation from question 20 as target.

When the state space is too large for a table, a neural network can be used to estimate the Q-values. Which of the following setups should be best for this? (Mark one) a) A feedforward neural network with state, s, and action, a, on its inputs, and one output, trained to estimate Q(s, a) using the right hand side of the relation in question 20 as target. b) Same as a) but the output is trained using the right hand side of the equation you selected in question 21 as target. c) A feedforward neural network with state, s, on its inputs and one output for every possible action, estimating Q(s, ai) for all actions ai. Only one of the outputs is trained, using the right hand side of the relation from question 20 as target. d) Same as c) but the outputs are trained using the right hand side of the equation you selected in question 21 as target. e) A feedforward neural network with state, s, on its inputs and one output for every possible action, estimating Q(s, ai) for all actions ai. All outputs are trained using the right hand side of the relation from question 20 as target. f) Same as e) but the outputs are trained using the right hand side of the equation you selected in question 21 as target.

d) Add some noise to input values during training. The second best of the suggested techniques would be b), which would probably also work pretty well as a regularization method. The rest are rather weird suggestions, and a) is probably the worst, since it would make the mapping non-functional (many-to-many instead of many-to-one).

Which of the below explanations best describes the regularization technique called noise injection? (Mark one) a) Add some noise to target values during training b) Add some noise to randomly selected weights during training c) Add some noise to randomly selected weights after training d) Add some noise to input values during training e) Add some noise to control parameters (hyper-parameter) during training

c) Find the closest node and move it towards x.

Which of the descriptions below, best describes what Standard Competitive Learning does, given an input vector, x? (Mark one) a) Move all nodes towards x. b) Find the K closest nodes and move them towards x. c) Find the closest node and move it towards x. d) Find the closest node, move it towards x, and move all other nodes away from x. e) Find the closest node and move it, and its closest neighbours, towards x. The neighbours are moved by a step length which decreases with distance from the closest node. f) Find the closest node and move it, and its closest neighbours, towards x. The neighbours are moved by a smaller step length than that of the closest node, but it is the same for all neighbours.

a) One hidden layer, which must be non-linear. One non-linear hidden layer is sufficient (in theory). The output layer may be linear and therefore should be, for simplicity. It must have infinite range since we can't know the range of the target function, and making it linear is the simplest way to ensure that.

Which of the following best describes the theoretical requirements for a multilayer perceptron to be able to approximate any continuous function to any degree of accuracy? (Mark one) a) One hidden layer, which must be non-linear b) One hidden layer. Both the hidden layer and the output layer must be non-linear c) Two hidden layers, which must both be non-linear d) Two hidden layers. Both the hidden layers and the output layer must be non-linear e) Infinitely many hidden layers, which must all be non-linear f) Infinitely many hidden layers. All hidden layers and the output layer must be non-linear

d) After each state transition, instead of updating Q(s,a), store everything needed to do so (i.e. store the tuple (s,a, r, s')). Then select at random a previously stored such tuple and update that Q-value instead of the one we just stored. This is to avoid temporal dependencies between updates by training on them in random order.

Which of the following descriptions best explains experience replay, when used for Q-Learning? (Mark one) a) Similarly to batch learning, compute changes to Q(s,a) after every action, but don't actually update them until we have reached the goal. We then go through the whole sequence of states and actions again, updating the Q-values. b) Train in two phases. In the first phase, the agent is trained the usual way, but we also store particularly successful trials/sequences during training so that we can replay them later. In the second phase, the agent is trained only on those stored trials (leading the agent through them), thus reinforcing the experience of the best trials. c) Similar to b) but the trials trained on are those of another agent, trained on the same problem, not the agent's own. The agents learn from each other's experiences. d) After each state transition, instead of updating Q(s,a), store everything needed to do so (i.e. store the tuple (s,a, r, s')). Then select at random a previously stored such tuple and update that Q-value instead of the one we just stored. This is to avoid temporal dependencies between updates by training on them in random order. e) After a state transition, instead of just updating Q(s,a) for the previous state visited, as we usually do, we go back and update the Q-values for all previously visited states, proportionally to how recently they were visited.

b) Backprop with adaptive and local momentum (for each weight)

Which of the following explanations best describes RPROP? (Mark one) a) Backprop for recurrent networks b) Backprop with adaptive and local momentum (for each weight) c) Backprop where the partial derivative, dE/dw, only decides the step length (ignores its sign) d) Backprop with regularization e) Backprop for deep learning networks (solves the problem with vanishing gradients)

b) Using elitism d) Using fitness selection instead of rank selection

Which of the following would increase the risk of premature convergence in evolutionary computing? (Mark one or several) a) A larger population b) Using elitism c) A higher mutation rate d) Using fitness selection instead of rank selection e) Using roulette wheel selection instead of fitness selection

d) Learning to imitate

Which of the statements below, best describes supervised learning? (Mark one) a) Learning by interaction b) Learning from observations of how data is structured c) Learning by trial-and-error d) Learning to imitate e) Learning to cluster data

b) Dynamic size (number of nodes) c) Constant parameters e) Dynamic neighbourhood

Which three of the following explanations best describe typical properties of Growing Neural Gas, when compared to Self-Organizing Feature Maps? (Mark three, all three must be correct) a) Fewer parameters b) Dynamic size (number of nodes) c) Constant parameters d) No parameters (parameter free) e) Dynamic neighbourhood f) Nodes are moved around in a more directed, but slightly more complex, way g) The behaviour of nodes is based on models of how gas molecules move around h) No neighbourhood function is required

c) The population has converged but maybe not the search e) The population has lost diversity

Which two of the following explanations best describe the concept of premature convergence (Mark two, both must be correct) a) The algorithm has converged faster than expected b) The population has converged and therefore also the search c) The population has converged but maybe not the search d) The search has converged but the population has not e) The population has lost diversity

d) Q-Learning assumes that all future actions are greedy, though they may actually not be. f) Both explore, it's just the update equation of Q-Learning which assumes that we do not (from now on).

Which two of the following explanations best describe the difference between Q-Learning and Sarsa? (Mark two, both must be correct) a) Q-Learning is greedy, Sarsa is not. b) Q-Learning is not allowed to explore during training, Sarsa is. c) Q-Learning requires that all future actions are greedy. d) Q-Learning assumes that all future actions are greedy, though they may actually not be. e) Both explore, it's just the update equation of Sarsa which assumes that we do not (from now on). f) Both explore, it's just the update equation of Q-Learning which assumes that we do not (from now on).

b) Q-Learning should work, but not Sarsa. I should have stated more explicitly (than the bracketed comment did) that the exploration policy is assumed to be e-greedy. Q-Learning updates Q-values towards the maximum Q-value in the next state even when the agent explores, which means that the agent will learn correct Q-values despite walking around randomly. It just never exploits what it learns. Q-Learning learns 'off-policy'. Sarsa learns 'on-policy'. It updates Q-values towards the value of the selected (randomly, in this case) action in the next state. The Q-values will never converge, though their expected value should - to the average of all values in the next state, since the actions are chosen with equal probability.

Would it be possible to train Q-Learning or Sarsa to find the correct Q-values for a reinforcement learning problem if the exploration rate, e, for some reason is stuck at e=1.0 and can't be changed? (in other words, the agent will always choose actions at random) (Mark one) a) No, neither Q-Learning nor Sarsa should work b) Q-Learning should work, but not Sarsa c) Sarsa should work, but not Q-Learning d) Both Q-Learning and Sarsa should work

d) It gets smaller and changes shape

a) It gets larger b) It gets larger and changes shape c) It gets smaller d) It gets smaller and changes shape e) Nothing. It is not affected.

d) Weights would change, but it still wouldn't work for non-trivial problems, since all hidden layer nodes would in effect be equal (have the same weight values). The problem here is not the value 0 as such, but that all the weights are equal. This means that all nodes are equal from the start. The outputs will move in different directions during training, since they have different target values, but all hidden nodes will have made the same contribution to the output errors, and will therefore also receive the same delta values (error terms) from the output layer. So they will remain equal.

An MLP to be trained by Backprop, should be initialized by setting the weights to small random numbers with a zero mean. What would happen if we simply set all initial weights to 0? (for simplicity, let's assume that the MLP only has one hidden layer) (Mark one) a) It would work just as well b) It would work, but take much longer to train c) It would not work, because all weights will remain 0 forever d) Weights would change, but it still wouldn't work for non-trivial problems, since all hidden layer nodes would in effect be equal (have the same weight values) e) Weights would change, but it still wouldn't work for non-trivial problems, since all output layer nodes would in effect be equal (have the same weight values)
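The symmetry argument can be checked numerically. The following is a minimal sketch, not the course's implementation: a pure-Python MLP with sigmoid hidden nodes and a single linear output, trained pattern by pattern on XOR. The network size, the learning rate and the name train_zero_init are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_zero_init(data, n_hidden=3, lr=0.2, epochs=200):
    """Backprop on a one-hidden-layer MLP whose weights all start at 0."""
    n_in = len(data[0][0])
    w_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]  # last entry is the bias
    w_out = [0.0] * (n_hidden + 1)                         # last entry is the bias
    for _ in range(epochs):
        for x, target in data:
            xb = list(x) + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
            hb = h + [1.0]
            y = sum(w * v for w, v in zip(w_out, hb))      # linear output node
            delta_out = target - y
            # Hidden deltas use the output weights from before this update.
            delta_hid = [delta_out * w_out[j] * h[j] * (1.0 - h[j])
                         for j in range(n_hidden)]
            w_out = [w + lr * delta_out * v for w, v in zip(w_out, hb)]
            for j in range(n_hidden):
                w_hid[j] = [w + lr * delta_hid[j] * v
                            for w, v in zip(w_hid[j], xb)]
    return w_hid, w_out

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w_hid, w_out = train_zero_init(xor_data)
for row in w_hid:
    print(row)   # the weights have changed, but every hidden node has the same row
```

Because every hidden node starts identical and receives identical updates, the three hidden nodes behave as a single node, which is not enough for XOR.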

Because it follows gradients. Weights are moved in the direction of the negative gradient, dw = -n * dE/dw, where n is the step length (learning rate), i.e. the algorithm always moves downhill. If you only move downhill you will get stuck in the closest local minimum, unless the step length (controlled by n) is great enough to overshoot it (in which case you risk overshooting the global minimum as well).

Explain how and why the backpropagation algorithm, as all other gradient descent methods, may get stuck in local optima!

Binary perceptrons classify by positioning a separating hyperplane between the classes. In the XOR problem the classes are not separable by a hyperplane (a line, in this 2D case). It requires at least two lines (see figure). Note that the word 'discriminant' does not imply anything about its shape. An ellipse is also a discriminant, for example, and in that case one discriminant would suffice to solve the XOR problem. So, the explanation here must state that the discriminant formed by a binary perceptron is a hyperplane.

Explain why a single binary perceptron can not solve the XOR problem!
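A brute-force search illustrates the claim, although it is not a proof: it only tries weights on a coarse grid. The sketch below assumes a perceptron that outputs 1 when w1*x1 + w2*x2 - theta > 0; the grid range and the function names are arbitrary choices for illustration.

```python
from itertools import product

def perceptron(w1, w2, theta, x1, x2):
    """Binary perceptron: 1 on the positive side of the line, else 0."""
    return 1 if w1 * x1 + w2 * x2 - theta > 0 else 0

def solvable(targets, grid):
    """True if some (w1, w2, theta) on the grid reproduces the truth table."""
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, theta in product(grid, repeat=3):
        if all(perceptron(w1, w2, theta, *x) == t
               for x, t in zip(inputs, targets)):
            return True
    return False

grid = [i / 2 for i in range(-8, 9)]   # -4.0 ... 4.0 in steps of 0.5
print(solvable([0, 0, 0, 1], grid))    # AND: a single line is enough
print(solvable([0, 1, 1, 0], grid))    # XOR: no line on the grid works
```

The actual reason no finer grid would help is the one given above: the two XOR classes cannot be separated by a single line.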

c) It may be difficult to find the exact value for the optimum

In supervised learning, what is the most important reason to avoid too high learning rates? (mark one) a) Training may be very slow b) We are likely to get stuck in a local optimum c) It may be difficult to find the exact value for the optimum d) We may overshoot a local optimum e) There are no good reasons to avoid high learning rates

Divide the data you have into K equal-sized parts. Then, for each part i (1 <= i <= K), train on all the data except part i and test on part i. Then compute the average error over these K experiments. An extreme (but common) case of this is the so-called leave-one-out principle, where K equals the size of the training set, i.e. each part contains only one pattern. Cross validation techniques are not specific to neural networks. They are applicable to any supervised learning system trained on a finite set of data.

Machine Learning usually requires lots of data to be trained properly. If you have too little data (too few input-target pairs) the first thing to try is to get more. However, sometimes this is simply not possible and, then, to split up the few data you have in a training set and a test set might be considered wasteful. Describe how K-fold cross validation can be used instead, to measure the system's ability to generalize
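The procedure can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the data is split in its given order (no shuffling), and train_fn and error_fn are hypothetical callbacks standing in for whatever learner and error measure are used.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(data, k, train_fn, error_fn):
    """Average test error over k train/test experiments."""
    errors = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        model = train_fn([data[i] for i in train_idx])
        errors.append(error_fn(model, [data[i] for i in test_idx]))
    return sum(errors) / k

# Toy usage: the "model" is just the mean of the training values.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean_model = lambda rows: sum(rows) / len(rows)
squared_error = lambda model, rows: sum((r - model) ** 2 for r in rows) / len(rows)
avg_error = cross_validate(data, 3, mean_model, squared_error)
print(avg_error)
```

Calling cross_validate with k equal to len(data) gives the leave-one-out case. In practice the data would normally be shuffled before splitting.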

The update value D has an effect similar to that of the momentum term in the previous question, but it's not the same thing. In RPROP, as long as the partial derivative ∂E/∂wi keeps its sign (which means that we are still moving downhill), the step length is increased. If the partial derivative changes sign, it indicates that we overshot a minimum. The algorithm then cancels the weight change and decreases the step length. (In other words, we don't actually cross the minimum - we cancel the weight change if we would)

RPROP is a version of Backprop where only the sign of the partial derivative is used. The step length is instead determined by an update value, D, that is adaptive and individual for each weight. What is the basic principle for the adaptation of this update value?
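The adaptation principle can be sketched for a single weight. This is a simplified illustration under assumptions, not a full RPROP implementation: the increase and decrease factors (1.2 and 0.5), the bounds on the update value, and the quadratic error E(w) = (w - 3)^2 are all choices made for the example.

```python
def sign(v):
    return (v > 0) - (v < 0)

def rprop_minimize(grad, w, steps=50, delta=0.1, inc=1.2, dec=0.5,
                   delta_min=1e-6, delta_max=50.0):
    """Minimize a 1-D error using only the sign of the derivative."""
    prev_grad = 0.0
    prev_change = 0.0
    for _ in range(steps):
        g = grad(w)
        if g * prev_grad > 0:        # same sign: still moving downhill
            delta = min(delta * inc, delta_max)
            change = -sign(g) * delta
            w += change
            prev_grad, prev_change = g, change
        elif g * prev_grad < 0:      # sign flipped: we overshot a minimum
            delta = max(delta * dec, delta_min)
            w -= prev_change         # cancel the last weight change
            prev_grad, prev_change = 0.0, 0.0   # no adaptation on the next step
        else:                        # first step, or the step after a cancel
            change = -sign(g) * delta
            w += change
            prev_grad, prev_change = g, change
    return w

# dE/dw for the assumed error E(w) = (w - 3)^2
w_final = rprop_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

Note that the magnitude of the derivative is never used: only its sign decides the direction, and the update value alone decides how far to move.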

When an action, a, has been selected in state s and the new state, s', has been reached, we know, by definition of Q-values, that the following relation must hold (on average): Q(s,a) = r + g*max Q(s',a'), where r is the reward received when going from state s to state s', g is a discount factor and the max is over all possible actions a' in state s'. We can make a learning rule from this, by using the right hand side (r.h.s.) of this relation as a target for the left hand side (l.h.s.). In other words, we move Q(s,a) towards the r.h.s. by adding a proportion, n (the learning rate), of the r.h.s. minus the l.h.s.: Q(s,a) <- Q(s,a) + n(r + g*max Q(s',a') - Q(s,a))

Reinforcement Learning (RL): How are Q-values updated in Q-Learning? Write down the update equation and explain its parts, or write down the intuition first, which in that case should lead you to the equation.
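As a concrete illustration of the update rule, here is a minimal tabular sketch. The dictionary layout, the state and action names, and the numbers are invented for the example; lr and discount stand for the learning rate and the discount factor.

```python
def q_learning_update(q, s, a, r, s_next, lr, discount, terminal=False):
    """Move Q(s,a) a fraction lr of the way towards r + discount * max Q(s',a')."""
    best_next = 0.0 if terminal else max(q[s_next].values())
    target = r + discount * best_next        # right hand side of the relation
    q[s][a] += lr * (target - q[s][a])       # r.h.s. minus l.h.s., scaled by lr
    return q[s][a]

# Hypothetical two-state table.
q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.2, "right": 0.8}}
updated = q_learning_update(q, "s0", "right", r=0.0, s_next="s1",
                            lr=0.1, discount=0.95)
print(updated)   # target is 0.95 * 0.8, of which one tenth is added
```

The max over the next state's actions is what makes this Q-Learning; using the value of the action actually selected in s' instead would give the Sarsa update.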

The usual way to exploit this type of information is to add the constraints as weighted terms to the objective function. Let l denote lambda. So instead of minimizing just the squared error, Esq, we minimize Esq + l1*g1(x,y) + l2*g2(x,y) + ... and derive a new update rule from this. Each l decides the relative importance of its constraint. This requires that all the constraint functions g(x,y) are differentiable.

Sometimes when training a neural network to optimize something, there are known constraints on the problem which can be expressed as functions g(x,y) which are zero when the constraint is satisfied (x is the input vector, y is the output vector). How can this type of prior knowledge be used in neural network training?

The expression is 'a AND (b OR c OR d)'. Input a alone, or just b, c and d together (without a), is not sufficient to make the weighted sum exceed the threshold (3.5). It requires a and at least one of the other inputs.

What is the boolean function (expression) implemented by the binary perceptron below, where a 0 input/output represents False and 1 represents True? (the numbers by the links are weight values)
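The answer can be checked by enumerating the truth table. The figure with the weights is not reproduced here, so the weights below (3 for a, 1 each for b, c and d) are an assumption chosen to be consistent with the explanation above; only the threshold 3.5 comes from the answer.

```python
from itertools import product

WEIGHTS = {"a": 3.0, "b": 1.0, "c": 1.0, "d": 1.0}   # assumed, not from the figure
THRESHOLD = 3.5                                      # stated in the answer

def perceptron(a, b, c, d):
    """Binary perceptron: 1 if the weighted sum exceeds the threshold."""
    total = (WEIGHTS["a"] * a + WEIGHTS["b"] * b
             + WEIGHTS["c"] * c + WEIGHTS["d"] * d)
    return int(total > THRESHOLD)

matches = all(
    perceptron(a, b, c, d) == int(a and (b or c or d))
    for a, b, c, d in product((0, 1), repeat=4)
)
print(matches)
```

With these weights, a alone and b, c, d together both give a sum of 3, which stays below the threshold, while a plus any one other input gives at least 4.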

f(x) = 1/(1 + exp(-lambda*x))

Write down the equation and draw a graph of the logistic sigmoid function, often used as activation function in neural networks!
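In place of the graph, a few sample values show the S-shape. A minimal sketch, where lam stands for the slope parameter lambda and the default value 1.0 is an assumption:

```python
import math

def logistic(x, lam=1.0):
    """Logistic sigmoid f(x) = 1 / (1 + exp(-lam * x))."""
    return 1.0 / (1.0 + math.exp(-lam * x))

for x in (-4, -2, 0, 2, 4):
    print(x, round(logistic(x), 3))
```

The function passes through 0.5 at x = 0, approaches 0 for large negative x and 1 for large positive x, and a larger lambda makes the transition steeper.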

a. The correct answer is B. The line equation of a 2-input binary perceptron (given by setting the weighted sum to 0, since the node flips there, and solving for x2) is: x2 = -(w1/w2)x1 + THETA/w2. This cuts the x2 axis at THETA/w2 and the x1 axis at THETA/w1. So, in this case, THETA should be equal to w2 and w1 should be half of that. That's only true for perceptron B. b. The correct answer is E, the only perceptron given here which has a 0 threshold. That makes the hyperplane cut through the origin. c. A and F. They have the same weights, but with opposite signs. This does not change the slope or position of the hyperplane, only on which side the node will respond with +1.

b.) One of the perceptrons defines a hyperplane which crosses the origin (0,0). Which one? c.) Two of the perceptrons define the same hyperplane. They only differ in which side the node would output a 1 on. Which pair?

K-fold cross validation: Divide the data you have into K equal-sized parts. Then, for each part i (1 <= i <= K), train on all the data except part i and test on part i. Then compute the average error over these K experiments. This is done when we have too little data to afford separate training and test sets. By using K-fold cross validation, we can use all the data for training and still get a generalization measure. An extreme (but common) case of this is the so-called leave-one-out principle, where K = the size of the training set, i.e. each part contains only one pattern, but I don't think I mentioned this variant in the lecture.

Explain K-fold cross validation in supervised learning! How is it done and why?

a.) GP operates on parse trees, as illustrated in class. Crossover is to select a random sub-tree of each parent and swap them. Some point reductions for vagueness (illustrated examples would probably have helped). No points to students who did not heed the warning (within parentheses, the question was about genetic programming, not genetic algorithms). b.) No, we cannot. In fact, it is very likely that crossover produces non-functional offspring this way, unless we put constraints on how the subtrees are selected. Even a simple operation such as swapping two constants may lead to a division-by-zero error, for example. In theory it works anyway, since the non-functional individuals are not likely to get a high fitness, so they will probably not be selected for the next generation. But this can be very inefficient. If we want GP to perform better than random search, we must consider this issue.

Genetic Programming a) Describe the most common basic form of recombination (crossover) used in Genetic Programming! (Note, GP, not GA). b) Given this crossover operator, can we be sure that the new individual encodes a legal (evaluable) expression? If not, is this a problem?

The neighbourhood graph defines who is neighbour to whom. A node which wins will drag its neighbours along towards the input. In SOFM, where the neighbourhood is fixed (though the radius is not), this means that some nodes will end up in between clusters, as for example node A in the illustration below, because they have two immediate neighbours (B and C) which pull at it in two directions. GNG solves this by having a dynamic neighbourhood graph. Connections in this graph have an associated 'age' parameter which is increased when one node is the winner (= closest to the input), and the other one is not the second closest (to the same input). Connections are removed if they grow too 'old' (when the age parameter is greater than a defined maximum). So, in the above example, at least one of the two connections of A (to B or C) will be removed, and A can move into the remaining neighbour node's cluster (unless it is still pulled away by other remaining nodes).

In Self Organizing Feature Maps, winning nodes drag their neighbours (in the network structure) along towards the input. The network structure is a fixed grid in SOFM. Explain how this may be a problem, in particular how nodes may get stuck between clusters, and how Growing Neural Gas tries to solve this problem!

In gbest, all particles strive (partly) towards the same point - the best position found so far in the whole swarm. This tends to keep the swarm tight together, with low variance compared to lbest. In lbest a particle strives (again, partly) for the best position found among its neighbours in a (usually sparse) structure, e.g. a ring. Some particles will still strive for the global best solution because that solution happens to be found by a particle in their neighbourhood, but some will not. In other words particles are more independent and base their decisions on more local information. This keeps the swarm more diverse and, therefore, less likely to get stuck. (gbest is a special case of lbest where the neighbourhood structure is a star)

In particle swarm optimization, why is gbest more likely to get stuck in suboptimal solutions than lbest?

b) We are likely to get stuck in a local optimum

In supervised learning, what is the most important reason to avoid too small learning rates? (Mark one) a) Training may be very slow b) We are likely to get stuck in a local optimum c) It may be difficult to find the exact value for the optimum d) We may overshoot a local optimum e) There are no good reasons to avoid small learning rates

Elitism - to guarantee that the best individual(s) in the population are copied/reproduced as they are, unaltered, to the next generation. This way we know that the best solution is preserved.

It has been claimed in some course books that one important difference between particle swarm optimization and evolutionary computation is that PSO has memory (the particles remember their personal bests), while EC does not. This is not strictly true - EC may also have such a memory. What is that concept in EC which corresponds to the memory in PSO?

Supervised learning (SL) gets more specific feedback and is therefore, in general, more efficient and faster than reinforcement learning. On the other hand SL requires an expert (to give the correct answer) and that expert can be difficult to find, difficult to get clear answers from and/or expensive to hire. Furthermore the SL system can never become better than that expert since the goal is to imitate the expert, not to beat it. (From an SL perspective, beating the expert would actually be wrong!) Reinforcement learning (RL) learns by trial and error, and therefore requires no expert and may very well become better than such an expert, if there is one. In many applications, for example board games, where the task is to compete against a similar opponent, RL is also more autonomous since the agent can learn from playing against itself over and over again, without the need of an operator/supervisor to control learning. On the other hand, learning by trial and error is to learn from a very weak/noisy feedback so training may take a very long time. This is essentially RL's only drawback, but it is a big one.

Learning to play a board game, such as chess, is an example of an application where you could choose to use supervised learning or reinforcement learning. What are the main advantages/disadvantages of each in that case?

a. Train the network on all the data you have, for a long time, i.e. try to overtrain the network on purpose. If you can't get a low error even if allowed to do that, the network is too small (or there are serious errors in the data). b. You may overtrain the network, i.e. lose the ability to generalize to unseen data. It also takes longer time, of course, but that is not the main problem. c. Split the data into at least three subsets - a training set, a validation set and a test set - so that you can decide when to stop training ('early stopping'). An alternative is to use K-fold cross validation, but you still need to make sure that you don't train for too long. Noise injection also helps.

Let's assume that you are given a multilayer perceptron which is a black box to you. You can see how many inputs and outputs there are, but you cannot peek inside to decide the number of hidden nodes, nor can you change any other internal properties of the network. Let's also assume that you have a set of data (i.e. a set of input vectors with corresponding desired output vectors) representing some unknown target function. The input and output vectors in this set have the same dimensionality as the number of inputs and outputs of the network, respectively. You are allowed to do what you wish with the given set of examples, but you will not be able to obtain more data. a) Suggest a method to find out if the network is powerful enough (i.e. has enough hidden nodes) to find the target function! b) What bad effects can be expected if the network is too large? c) If the network is too large, how can you avoid or minimize the bad effects when, as in this case, you are not allowed to change the size of the network?

Create a population of individuals where each individual represents a solution to the problem, for example a bit string (in the original form of genetic algorithms, GA). Evaluate all individuals, using a fitness function which measures the quality of the represented solution (usually we formulate this as a maximization problem, so greater fitness is better). Create a new population for the next generation, by selecting individuals probabilistically (directly or indirectly by fitness) for reproduction, mutation, and/or recombination. Reproduction is simply to copy the individual to the next generation/population. Mutation is to do the same, but make minor random changes to the copy, for example in GA by flipping a bit. Recombination, or crossover, is to combine two individuals (or more, but usually two). In GA, for example, by one-point crossover, where we create two new individuals from two parents by selecting a point in the two parents' bit strings and swapping the tails. We now have a new population where the average fitness is hopefully a little better. Repeat from the evaluation step until a stop criterion is met.
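
The loop described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the tournament selection of size two, the single elite copy, the parameter values and all function names are assumptions made for the example.

```python
import random


def one_point_crossover(a, b, rng):
    """Cut both parents at the same random point and swap the tails."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]


def evolve(fitness, length=20, pop_size=30, generations=60, rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Reproduction: copy the best individual unchanged (assumed elitism).
        new_pop = [max(pop, key=fitness)[:]]
        while len(new_pop) < pop_size:
            # Selection: the fitter of two random individuals (assumed tournament).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            c1, c2 = one_point_crossover(p1, p2, rng)
            new_pop.extend([mutate(c1, rate, rng), mutate(c2, rate, rng)])
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)


# Toy fitness: the number of ones in the bit string (higher is better).
best = evolve(sum)
print(best, sum(best))
```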

Population Methods 8. Explain evolutionary computing, on a general level! Explain and give examples of the concepts and operators you use

To follow gradients in the search space is to use the slope of the objective function (its derivative) to decide in which direction to move, for example in the direction of the steepest descent. (Many students did not answer this part) Pro: The most important advantage (required for full credit) of not following gradients is that it should reduce the risk to get stuck in local optima. If we always follow gradients, we will almost certainly get stuck in the closest local optimum. Pro: Another advantage is that the objective function does not have to be differentiable. In classification, for example, we usually want to minimize the number of misclassifications, which is an integer and not differentiable so we have to express the objective function in some other way if the method requires gradients. Pro: A third possible advantage is that it might be too time-consuming to compute gradients for each member of a large population. Population methods work best if the computations per member are as simple as possible. Cons: The disadvantage of not following gradients is that it probably makes it more difficult to find the exact optimum when close to it. Fine-tuning is difficult if you throw away information on gradients. Unjustified claims that one way is faster than the other did not receive points.

Population methods, for example genetic algorithms and particle swarm optimization, usually don't follow, or depend on, gradients in the search space. Explain what this means and why this should be both an advantage (at least one, there are several) and a disadvantage!

see pic

RPROP is a version of the back propagation algorithm where only the sign of the partial derivative is used. The step length is instead determined by an update value, Δ, that is adaptive and individual for each weight. What is the basic principle for the adaptation of this update value?

RBF hidden nodes (only the hidden nodes) compute a distance between the input vector and a weight vector, whereas MLP hidden nodes compute weighted sums. This makes the RBF discriminant a hypersphere (assuming Euclidean distance), which covers a local region, whereas MLP hidden nodes form hyperplanes which are infinite. Each hyperplane cuts the whole universe in half. RBFs feed this distance through a Gaussian function, which has a maximum for 0 and decreases when the distance increases. In other words, an RBF hidden node only responds with high values for patterns close to its region, and with very low values for 'outliers' far from that region. MLP hidden nodes feed their weighted sums through a sigmoid. The response only depends on the distance from the hyperplane, which is infinitely long, and the maximum/minimum values are for patterns infinitely far from it.

Radial basis function networks belong to a family of algorithms sometimes called 'localized learning systems'. What is the difference between RBFs and MLPs, which makes the RBFs 'localized' and the MLPs more global?

Binary perceptrons form a hyperplane in the input space, which separates the classes, i.e. it only works if the classes are linearly separable. The classes in XOR are not. Full credit required an illustration (or good argument) of why XOR is not linearly separable, not simply a statement that it is not. However, it was not required to prove/show that perceptrons implement hyperplanes.

Supervised Learning Explain why a single binary perceptron cannot solve the XOR problem

d) The weighted sum. COMMENT: "only the shape, not the position" in the problem formulation should have been a strong hint. Only d) changes the discriminant's shape (the fact that it is a hyperplane). The other alternatives only change its angle (b), its position (c) or its fuzziness (e). Only 8 students passed, despite this being discussed in a lecture.

The Perceptron 6. What is it, in the equations defining a binary perceptron, which most directly defines the shape (only the shape, not the position) of the discriminant formed when using the perceptron for classification? (Mark one) a) The input values b) The weight values c) The threshold value d) The weighted sum e) The activation function

For example, make the output node an OR gate of the two network inputs and the hidden node an AND. Then connect the hidden node with the output node with a negative weight, so that the output will fire only if one of the inputs is 1, but not both. For example:
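
One possible set of weights is sketched below. The original figure is not reproduced here, so these particular values are an assumption chosen for illustration; many other choices work. Each node is a binary perceptron that outputs 1 when its weighted sum exceeds its threshold.

```python
def step(weighted_sum, threshold):
    """Binary perceptron activation: 1 above the threshold, otherwise 0."""
    return 1 if weighted_sum > threshold else 0


def xor_net(x1, x2):
    # Hidden node acts as AND: fires only when both inputs are 1.
    hidden = step(1.0 * x1 + 1.0 * x2, 1.5)
    # Output node acts as OR on the direct inputs, inhibited by the hidden node.
    return step(1.0 * x1 + 1.0 * x2 - 2.0 * hidden, 0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```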

The most common architecture to solve the XOR problem is a 2-2-1 multilayer perceptron, that is to say a network with two inputs, two hidden nodes and 1 output. It is actually possible to solve the XOR problem with only one hidden node, if the output node also has direct connections to the inputs, bypassing the hidden node (see figure). Find a set of weights for this MLP which would solve the XOR problem!

Because the winner drags other nodes (its neighbours on the grid weighted by a neighbourhood function) along when moving towards the input. Point deductions for answers which just mentioned "updates" with no indication that these updates are moves in the input space towards a common goal - the current input.

Unsupervised Learning 6. Self organizing feature maps are less likely to be affected by the winner-takes-all problem than standard competitive learning is. Why?

a.) Competitive learning is used for classification/clustering. A competitive learning network consists of one layer of nodes, all fully connected to the inputs. Each node's weight vector represents a position in the input space, a so-called codebook vector. When an input vector, x, is presented, the node with the closest (Euclidean distance) weight vector, w, is moved towards that input vector: w := w + η(x - w), where η is the learning rate. The other nodes (weight vectors) do not move. b.) K-means is equivalent to competitive learning, if the latter is trained in batch mode (accumulating weight changes for all patterns in the training set before updating the weights). In other words, the node is moved towards the center of the patterns within its Voronoi region. (K is the number of classes/nodes.)
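
A single on-line update of standard competitive learning might look like the sketch below. The weight vectors and the input are hypothetical values chosen for illustration, and `competitive_step` is a name introduced here, not one from the course.

```python
def competitive_step(weights, x, eta):
    """Move only the closest weight vector towards x; return the winner's index."""
    def squared_distance(w):
        return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

    winner = min(range(len(weights)), key=lambda i: squared_distance(weights[i]))
    # Update rule: w := w + eta * (x - w), applied to the winner only.
    weights[winner] = [wi + eta * (xi - wi) for xi, wi in zip(x, weights[winner])]
    return winner


weights = [[0.0, 0.0], [4.0, 4.0]]   # two hypothetical codebook vectors
winner = competitive_step(weights, [3.0, 5.0], eta=0.5)
print(winner, weights)               # node 1 moves halfway towards the input
```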

Unsupervised Learning Competitive learning a) Explain standard competitive learning, both in words (the general idea), and formally (as an algorithm, including an update equation for the weights)! b) How is this different from K-means?

If the distribution moves slowly enough, all nodes will still have a chance to win and will therefore move along. But if the distribution moves fast enough to leave a node's Voronoi region before any more patterns were drawn from there, that node will never win again and therefore never move. Unlike SOFM there are no 'neighbour' nodes who will help by dragging it along either. In the extreme case, where the distribution jumps away, only the closest node will win and we encounter the classic "winner-takes-all" problem (where all patterns are in the same Voronoi region). Some point reductions for describing only the jumping case, not why it should work in the case of slowly moving distributions. I admit that the question formulation was unclear here (that I wanted an answer to both cases), but I still felt that I could not give max credit to students who skipped one of the cases.

Unsupervised Learning (UL) Consider training a standard competitive learning network on data drawn from a non-stationary distribution (the data points move around). This might work if the data points move slowly enough, but probably not if they move too fast (or if they jump around). Why?

c) A weighted sum over δ-values in the output layer. This is why it's called backpropagation (of errors). The delta-values are fed back from the outputs to the hidden layer. 53 students passed.

a) A weighted sum over δ-values in the input layer b) A weighted sum over δ-values in the hidden layer c) A weighted sum over δ-values in the output layer

a.) Genetic algorithms would typically require more computing power/time since the populations are usually much larger (10 times larger or more) than in particle swarm optimization. This means that many more individuals have to be evaluated and this evaluation (the fitness function) is where both algorithms will have to spend most of their computing time. PSO must reevaluate each particle every generation, though, which is not necessarily the case in GA. Some students claimed the opposite. PSO must do this, since all particles move all the time and each one has to check if it found a new personal best. In GA, if an individual is reproduced (copied as it is) to the next generation and if the optimization problem is stationary, there is no reason to reevaluate it. However, both this difference and the fact that PSO particles must remember their personal bests are overshadowed by the difference in population size. b.) For problems where the objective function is smooth and has few local optima and/or plateaus, gradient methods work very well and are hard to beat. An extreme example is the bowl function shown in the PSO lecture. Throwing away the gradient information also means that PSO and GA are not very good for fine-tuning. They may find the tallest hill faster than the gradient methods, but may then have difficulties finding the exact position of the hill-top. (By the way, this applies to RPROP as well - it also throws away most of the gradient information, but still requires differentiability!)

a. Particle Swarm Optimization and Genetic Algorithms are similar in that they can both be applied to the same type of problems. One of them is likely to require more computing time (and memory) than the other, though. Which, and why? b.) Particle Swarm Optimization and Genetic Algorithms have in common that they do not depend on gradients of the objective function to work, in contrast to many neural network algorithms. This independence is usually considered an advantage, but the "No Free Lunch" theorem tells us that there must be cases where it is a disadvantage. Describe such a case!

a. Node 2 wins because it is closest to the input. The same node also wins if we use the alternative interpretation of competitive learning, where we compute weighted sums instead and search for the maximum. b. w2 is moved halfway (since η=0.5) towards the input and becomes [+1, -1, 0]. The other weight vectors do not change, since they did not win.

Consider a competitive learning network with three inputs and four nodes. The four nodes have the following weight vectors: a.) Now we present the input vector [ +1, -1, +1 ] to the network. Which node wins the competition? b.) Given a learning rate η=0.5, what are the new weight vectors of the four nodes, after updating the network using the standard competitive learning rule?

See graph to the left (a similar example was discussed on lecture 8). The choices marked with * are the greedy choices (the ones with max Q-value). Q-Learning updates Q-values toward the max Q-value in the next state, i.e. it assumes that the agent will choose a greedy move there. Sarsa updates Q-values towards the Q-value of the chosen action (greedy or not). In the figure, in a, we chose the greedy action (b=b*) so there is no difference in that case. In b, however, we chose c which is not the greedy action (c*), so Q-Learning and Sarsa will update the Q-value of b towards different values.

Draw a state transition graph for a board game (or similar) being played by a reinforcement learning agent! Then use this graph to explain how Q-values are updated in Q-Learning, and how this is different from Sarsa! (The explanation is more important than remembering the exact update equations, though that probably helps)

a.) There are two important choices: 1. How a solution is represented (how the genotype is defined), and, 2. How the crossover operator is defined over that representation. It is important to choose a representation and a crossover operator, so that crossover is likely to be constructive (likely to produce good new individuals). Unfortunately, this is much easier said than done. Crossover mixes individuals, by taking one part from individual A and one part from individual B. Is the part which is taken from A still meaningful when concatenated with B? Does it still represent the same thing? (Is it even 'legal', i.e. possible to evaluate?) There may be dependencies between parts of the genome which make this difficult. In one-point crossover, for example, the interpretation of the tail (which is swapped) may depend on the head (which is not). If so, crossover is likely to just be an inefficient form of mutation. b.) The representation discussed on the course for the Travelling-salesman problem (TSP), combined with a simple one-point crossover operator, is one example. The result of crossover is not even likely to be legal (it is likely to violate the constraints). Another example from the course is the basic form of Grammatical Evolution presented on lecture, where a numerical genotype is used to generate an expression given a BNF grammar. The meaning of a number in that sequence depends on the numbers before it, since that affects where we are in the grammar when we generate the expression.

Evolutionary computing is a form of stochastic search, to solve optimization problems. Stochastic, but hopefully not purely random. We want an evolutionary algorithm to be more 'clever' than random search, but there is actually a great danger that it will turn out to become just that - random search. a) What are the most important design choices to prevent this? b) Give an example of a bad design choice which would almost certainly make the algorithm degenerate into random search!

The idea is to look at the sign of the partial derivative, ∂E/∂w. If it keeps its sign, it indicates that we are moving downhill and can increase speed (increase the step length). If it changes sign we have probably overshot a minimum and should reduce it. Some point reductions for being unclear what the sign represents (it's that it flips which is interesting, it's not the case that positive is "good" or negative is "bad"), or for just stating general principles such as reducing the step length over time (which is a good principle but not what RPROP does).
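
The principle can be sketched for a single weight as follows. This is a simplified variant without weight backtracking, and the increase/decrease factors (1.2 and 0.5) and step limits are commonly used defaults assumed here, not values taken from the course. The function name `rprop_minimize` is hypothetical.

```python
def sign(x):
    return (x > 0) - (x < 0)


def rprop_minimize(grad_fn, w, steps=200, step=0.1, inc=1.2, dec=0.5,
                   step_min=1e-6, step_max=50.0):
    """Minimize a 1-D function using only the sign of its derivative."""
    prev_grad = 0.0
    for _ in range(steps):
        grad = grad_fn(w)
        if grad * prev_grad > 0:
            # Same sign as last time: still going downhill, so speed up.
            step = min(step * inc, step_max)
        elif grad * prev_grad < 0:
            # Sign flipped: we probably overshot a minimum, so slow down.
            step = max(step * dec, step_min)
        # Only the sign of the derivative decides the direction of the move.
        w -= sign(grad) * step
        prev_grad = grad
    return w


# Toy error function E(w) = (w - 3)^2 with derivative 2 * (w - 3).
print(rprop_minimize(lambda w: 2.0 * (w - 3.0), w=0.0))
```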

Explain the basic principle for manipulating step lengths in RPROP!

Momentum is to avoid zig-zag behaviour across the bottom of ravines, and to cross flat surfaces (plateaus), in the error landscape. We may also overshoot shallow local minima, if we're lucky. Implementation: Add a small proportion of the previous weight change, to the current one: (SEE PIC)
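
Since the picture is not reproduced here, the sketch below shows the standard textbook form of the momentum update, which may differ in notation from the original figure. The learning rate `eta` and momentum coefficient `alpha` are assumed names and values.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """Gradient step plus a proportion alpha of the previous weight change."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta


# Two consecutive steps with the same gradient: the second change is larger,
# which is what helps in crossing plateaus.
w, delta = momentum_step(1.0, grad=2.0, prev_delta=0.0)
w, delta = momentum_step(w, grad=2.0, prev_delta=delta)
print(w, delta)
```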

Explain the common extension to Backprop called Momentum! What is the intention and how is it implemented?

Competitive learning is to move the closest node (codebook vector) towards the latest input vector, to make it more likely to win also next time the same input vector is presented. The winner-takes-all problem occurs when a few nodes, in the extreme case only one node, win all the time because they are closer than the other nodes to all the data. The other nodes will never win and therefore never move. At best this leads to underutilization of the network, at worst all the data is classified to the same class. The most common way to avoid this is to initialize the weight vectors by drawing vectors from the input data, so that each node is guaranteed to win for at least one input vector. Another way is to include the frequency of winning in the distance measure, so that a node that wins a lot will appear to be further away than it actually is. Introducing a neighbourhood function works as well, but is no longer SCL (by definition, see first page). Some point reductions also for not being clear why the winner wins and/or what happens when it does.

Explain the winner-takes-all problem in standard competitive learning, and what can be done to avoid it (still using SCL)!

GP operates on parse-trees of computer programs. A parse tree is a tree that represents the syntactic structure of a computer program (or any other string which is generated by a grammar). For example, the expression 5+3*x would be represented by the tree in the margin to the left. Crossover: Select two "parents". Pick a random sub-tree in each parent's parse-tree (i.e. a randomly selected sub-expression) and swap them. This produces two new individuals which are inserted in the new generation. Mutation: There are many ways to do this. For example, replace a subtree with a new randomly generated one, search for a constant and change it, replace a function node with another function (of the same arity), replace a variable by another (or a constant), etc.

Genetic algorithms operate on strings or vectors of values, while genetic programming, usually, operates on ... what? Describe this structure and show how crossover and mutation operate on this!

Two things are different. (1) A hidden layer RBF node computes the distance between its weight vector and the input vector, instead of a weighted sum, and (2) the result is fed through a Gaussian activation function, instead of a sigmoid.

How is the hidden layer of a radial basis function network different from the multilayer perceptron?

The risk grows with network size (a), shrinks with training set size (b) and grows with increased training time (c).

How is the risk of overtraining a neural network affected (increased or decreased) by: a) Increased network size (number of hidden nodes)? b) Increased training set size (number of input-output samples)? c) Increased training time (number of passes through the training set)?

2x3 input-hidden weights + 3x1 hidden-output weights + 4 thresholds (one per node) = 13.
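
The same count can be written as a small helper for any fully connected layered network. This is a sketch assuming one threshold per non-input node and no bypass connections; `mlp_parameter_count` is a hypothetical name.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights between consecutive layers plus one threshold per non-input node."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    thresholds = sum(layer_sizes[1:])
    return weights + thresholds


print(mlp_parameter_count([2, 3, 1]))  # 2*3 + 3*1 + (3 + 1) = 13
```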

How many network parameters are there to adjust in a conventional MLP with 2 inputs, 3 hidden nodes and one output?

Local information and Stigmergy (pheromones) see picture

In Ant Colony Optimization, simulated ants move around probabilistically according to a weighted combination of two parts. What do these parts represent and what happens if the algorithm is set to only care about either one of them?

Q-Learning moves the action value for action a in state s, Q(s,a), towards the maximum Q-value over all possible actions in the state we ended up in, s'. In other words, it assumes that the next action will be greedy. Whether the agent actually selected the greedy action in state s', or not, is irrelevant in Q-learning. Not so for Sarsa. Sarsa waits until a decision has been made in state s' to perform action a', and then moves Q(s,a) towards the selected action's Q-value, Q(s',a'), greedy or not. In the special case when the agent happens to make the greedy choice in s' the Q-value update is equivalent to that in Q-Learning, but if the agent explores, the average target values will be smaller in Sarsa than in Q-Learning where it is always the max (also when it explores, which of course it does too). If we reduce the exploration rate over time, towards 0, and if we do it slowly enough, Q-Learning and Sarsa will converge to the same values.
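
The two update targets can be compared directly. The sketch below uses hypothetical Q-values for the next state and an assumed discount factor of 0.9; the function names are introduced here for illustration only.

```python
def q_learning_target(reward, q_next, gamma):
    """Target assumes the next action is greedy, whatever was actually chosen."""
    return reward + gamma * max(q_next.values())


def sarsa_target(reward, q_next, chosen_action, gamma):
    """Target uses the Q-value of the action that was actually chosen in s'."""
    return reward + gamma * q_next[chosen_action]


q_next = {"left": 0.5, "right": 0.8}   # hypothetical Q(s', .) values
print(q_learning_target(0.0, q_next, 0.9))      # uses max = 0.8
print(sarsa_target(0.0, q_next, "right", 0.9))  # greedy choice: same target
print(sarsa_target(0.0, q_next, "left", 0.9))   # exploratory choice: smaller target
```

Because the Sarsa target is never larger than the Q-Learning target, averaging over exploratory moves gives Sarsa the smaller values described above.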

In general, with constant and non-zero control parameters, Q-Learning will converge to slightly greater Q-values than Sarsa, when trained on the same problem. Why?

Supervised learning is to learn to imitate, in this case to imitate a human expert, who for every board position (= input to the system) can tell which move would be best. The learning system's output is compared to the expert's recommended move, and the difference/error is used to compute changes in the learning system, to reduce the error. Reinforcement learning is to learn by trial-and-error, by interaction with an environment, to maximize a long-term scalar measure of success. In this case, the system would learn by playing the game over and over, against an opponent which plays at approximately the same level. (The most common way to balance this game-level is to let the learning system play against itself.) The system only learns from the result of the games, it is not given expert advice on which moves to make in each situation. Supervised training is faster - the learning is based on less noisy and more direct feedback and it limits the search to the (presumably) most interesting parts of the state space. Reinforcement learning can be very slow since it needs to explore. On the other hand, supervised learning can never become better than the expert it imitates. There is no such limit in reinforcement learning, since it does not try to imitate. Also, finding the expert required for supervised learning may be hard and/or expensive. It may be beneficial to combine the two. A reinforcement learning agent can be initially trained supervised by leading it through the game. In effect, the expert is the action selector, though the agent learns as if it had made its own choices. After a while, we drop the expert and let the agent train on by itself, using stochastic action selection as usual.

Learning to play a board game is an example of a learning problem which could be solved by either supervised learning or reinforcement learning, or a combination of them (AlphaGo, which recently beat the world champion in the game of Go, was trained in both ways). Describe both approaches, to make a machine learning system learn to play a board game, using supervised and reinforcement learning respectively. Also compare them to each other, what would be the main advantages/disadvantages of each? (Conceptual descriptions are sufficient, no equations required)

It is a problem in non-stationary environments. We would have to retrain the system from scratch every time a significant change is detected. With decaying parameters, the algorithm will, at best, converge to a solution which was the optimal one when training started, but may no longer be. Predicting when/if the environment will change is usually not possible, so it is difficult to set the decay rate of the parameters to cover this.

Many learning algorithms have control parameters which appear as constants in the update formulae, but which in practice are often decreased over time. (The learning rate in most algorithms and the radius parameter in self organizing feature maps, for example.) This decrease over time is a potential problem. Why?

a.) Because we usually do not know the range of the target function, and even if we do it is unlikely that the range happens to be in the range of the usual sigmoid [0,1]. A function approximator should be able to output any value. This does not reduce the network's computational abilities. It is the hidden layer which must be non-linear, not the output layer. b.) Five, since the function has five monotonic regions. The hidden layer of a multilayer perceptron consists of sigmoidal nodes and sigmoids are monotonic. We therefore usually need at least as many hidden nodes as there are monotonic regions in the target function. c.) Too many hidden nodes is likely to lead to overfitting and to an approximation that oscillates between the training set data points, i.e. solutions where there are more monotonic regions than necessary to fit the data. d.) Weight decay is to let each weight strive for 0. A simple way to implement this: After updating a weight w, update it again using w_new = (1-ε)w_old, where ε in [0,1) is the decay rate. Weight decay can be used as a pruning technique (cutting connections with weights ending up close to 0), but it restricts the freedom of the hidden layer even if we don't cut connections. (It's also good for numerical reasons.) The general concept here is called regularization. Other regularization methods mentioned or discussed on this course include Early stopping, Dropout, Noise injection, Multitask learning, and Lagrangian optimization (adding constraint terms to the objective function).
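
The simple weight decay rule from part d) can be sketched as follows. The decay rate 0.1 and the weight values are arbitrary illustrations, and `apply_weight_decay` is a hypothetical helper name.

```python
def apply_weight_decay(weights, epsilon):
    """Shrink every weight towards 0: w_new = (1 - epsilon) * w_old."""
    return [(1.0 - epsilon) * w for w in weights]


weights = [1.0, -2.0, 0.05]
for _ in range(3):
    # In training this would follow each ordinary (e.g. backprop) weight update.
    weights = apply_weight_decay(weights, epsilon=0.1)
print(weights)  # each weight has been scaled by 0.9 three times
```

Weights that the error gradient does not actively maintain shrink geometrically towards 0, which is what makes pruning by magnitude possible.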

Multilayer perceptrons a.) Neural networks used for function approximation usually should have linear outputs. Why? b.) How many hidden nodes would be required (at least) for a conventional multilayer perceptron (with one hidden layer) to do a good approximation of the function below? Why? (PIC) c.) What is the likely effect on this approximation if you have too many hidden nodes? d.) One way to reduce this effect is to restrict the freedom of the hidden layer nodes. Several such methods have been discussed on this course, for example weight decay. What is weight decay?

a.) The discount factor, γ. Future rewards are discounted by this factor to the power of n, where n is the number of steps away. Q-learning is trying to maximize a discounted sum of future rewards. The expected value of a state-action pair can be computed recursively as Q*(s,a) = r + γ max Q*(s',a'), where r is the reward, γ is the discount factor, s' is the next state and the max is over all possible actions (a') from that state. The purpose of γ is to make the sum of future rewards (which we want to maximize) finite even if the goal is infinitely far away (or does not exist). It also has the side effect of making the agent prefer shorter paths to the goal than longer ones. b.) γ^2 = 0.81. COMMENT: Let's call the state immediately preceding the goal s0, the state one step away from that s1 and the state where the robot is now s2. In s0, the agent would receive a reward 1 for the action going down, so Q*(s0,a) for that action is 1 (there is no max Q-value in the goal state to discount). All other rewards are 0, so in s1, Q*(s1,a) is simply γ times the maximum Q-value in s0 (=1), and in the state where the robot is now, s2, it would be γ times that. So, Q*(s2,a) = 1*0.9*0.9 = 0.81. Some students had the right idea, but extended the chain to yet another step (i.e. 0.9^3). The agent is rewarded with +1 for the last step, not 0.9. c.) In SARSA, the expected value Q*(s,a) is r + γQ*(s',a'), i.e. it depends on which move was actually chosen in s'. In Q-Learning we always update towards the maximum Q-value in that state, disregarding if we explored or not. In Sarsa, therefore, and in contrast to Q-Learning, the expected value depends on the exploration rate, which was the parameter asked for here.
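
The computation in b) can be checked by unrolling the recursion backwards from the goal. This sketch assumes, as in the question, a reward of 1 for the action entering the goal and 0 for all other actions; `optimal_q_along_path` is a hypothetical helper name.

```python
def optimal_q_along_path(steps_to_goal, gamma, goal_reward=1.0):
    """Q* of the best action when the goal is `steps_to_goal` moves away."""
    # One step away: the action enters the goal and earns the reward directly.
    q = goal_reward
    # Each additional step earns reward 0 and discounts the next best Q-value.
    for _ in range(steps_to_goal - 1):
        q = 0.0 + gamma * q
    return q


for n in (1, 2, 3):
    print(n, optimal_q_along_path(n, gamma=0.9))
```

Three steps from the goal gives 0.9 squared (about 0.81), not 0.9 cubed, because the final step is rewarded with the undiscounted +1.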

Reinforcement Learning Consider a reinforcement learning problem where all possible actions are rewarded by 0 except for the ones leading directly to the goal state, which are rewarded by 1. The Gridworld problem from lab 2, shown to the left, where you trained a robot (Marvin) to find the goal (the gold cup) in a simple maze, is an example of such a problem. Let's assume that the agent is trained by Q-Learning and that the agent is currently three steps from the goal state (as in the figure). The best move the agent can make is to take one step closer to the goal, in this example going left or down. All moves from this state, including the best one(s), are rewarded by 0 but the best one(s) should have a greater value (Q-value). a) In order to compute the expected value (the Q-value the algorithm should converge to) of taking the best action from this state, you would have to know the value of a certain parameter used in Q-learning. Which parameter is that and what is its purpose? b) Given that the value of that parameter is 0.9, what is the expected value of taking the best action from this particular state? (i.e. three steps from the goal) c) If SARSA were used instead of Q-learning, the computation of the corresponding expected values would become more difficult and you would need to know the value of yet another parameter. Which parameter is that and why is it relevant for SARSA, but not for Q-Learning?

a.) V(s) = r + γ*V(s'), where γ is a discount factor in the range [0,1]. This is a relation, not an update rule. Learning rates do not apply here. Nor do exploration rates. b.) They are updated proportionally (the learning rate, η, is the proportionality constant) to the TD-error, which is the difference between the two sides of the equation from the previous question. The update rule then becomes: V(s) := V(s) + η[r + γ*V(s') - V(s)] c.) ε-greedy is perhaps the simplest way to implement stochastic action selection. With probability ε, the agent explores (selects a random action). Otherwise (= with probability 1-ε) it exploits what it knows, i.e. takes the action that is currently believed to be best, i.e. the one with the highest Q-value.
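As a rough illustration of parts b and c, the update rule and the action selector might look as follows in code. The function names, the dictionary used as value table and the default parameter values are assumptions made for this sketch, not part of the original answer.

```python
import random


def td0_update(V, s, r, s_next, eta=0.1, gamma=0.9):
    """One TD(0) step: move V[s] towards r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += eta * td_error  # the learning rate scales the TD-error
    return td_error


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


V = {"A": 0.0, "B": 1.0}  # hypothetical two-state value table
td0_update(V, "A", r=0.0, s_next="B", eta=0.5)
action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.1)
```

With epsilon set to 0 the selector is purely greedy, and with epsilon set to 1 it is purely random.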

a.) Define V(s) recursively in terms of the following rewards and values. b.) Show how values are updated, using the TD(0) update rule. c.) In Q-learning and Sarsa, values are associated with state-action pairs instead of just states, as in TD(0). A common way to make sure that Q-learning/Sarsa agents explore is to use the ε-greedy algorithm. What is ε-greedy?

a.) The greedy action is the action which is currently believed to be the best one (has the greatest "merit" or value). In other words, it is the opposite of an exploratory action. b.) (see pic) c.) The temporal difference error here (for Q-learning) is the right hand side minus the left hand side of the above equation, i.e. ErrTD = r + γ max Q(s',a') - Q(s,a)

A Q-value in Q-Learning can be defined as follows: Q(s, a) is a discounted sum of future rewards, after taking action a in state s, and assuming that all future actions are greedy. a. What does it mean to be greedy here? b. The definition can be rewritten in a simple recursive form, as an equation in terms of Q itself. How? Write down the equation and explain its parts. c. How is this equation then used to define the temporal difference error?

Overtraining (= overfitting) typically occurs when a learning system with too great representational power (for example a neural network with too many hidden nodes) is trained on too little data, or for too long. This increases the risk that the learning system will either learn the training set examples by heart, or create a more complex solution than necessary to fit the data. An overtrained system will not generalize well, i.e. it will perform badly on new data, not seen during training. Point reductions for some vague descriptions (for example being unclear when/why it occurs). Noise injection means adding a small amount of noise to the input values every time an input pattern is presented to the learning system, so that the patterns never look quite the same. It forces the learning system to find more general representations. Some students suggested (without penalty) that this is done by extending the training set with slightly modified copies of the existing patterns. That would work too, but maybe not as well since all patterns would still look the same every time they are presented.
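A minimal sketch of noise injection, assuming zero-mean Gaussian noise with a small standard deviation (the answer above does not specify the noise distribution, so both the distribution and the name `with_input_noise` are assumptions):

```python
import random


def with_input_noise(pattern, sigma=0.05, rng=random):
    """Return a freshly perturbed copy of an input pattern.

    Called on every presentation, so the learner never sees exactly
    the same input twice. The stored pattern itself is not modified.
    """
    return [x + rng.gauss(0.0, sigma) for x in pattern]


pattern = [1.0, 0.5]  # hypothetical training input
for epoch in range(3):
    noisy = with_input_noise(pattern)
    # train_on(noisy, target) would go here; the target stays unchanged
```

The difference from extending the training set with modified copies is visible in the loop: new noise is drawn at every presentation instead of once.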

A very common challenge in machine learning, which almost all learning algorithms must face, is to avoid overtraining. Explain the concept, and how noise injection may help prevent this!

Example 1: Compression/decompression of data. If we train a linear MLP like this, the N hidden nodes will approximate the N first principal components (more powerful with non-linear hidden nodes). However, we then probably need three hidden layers, since the compressor alone may need one (the hidden layer here is the output layer of the compressor), and if it does the decompressor will need one too. Example 2: Novelty detection. If we train a network to reproduce the inputs on the outputs, this is likely to fail if something unexpected happens to the input data (something which was never seen during training - a novelty). So in this application it is the error of the output layer which is our output/alarm signal. Some helicopters have alarm systems like this being trained on data from an accelerometer on the tail rotor axis. Auto-association works as long as the axis vibrates the way it usually does, but even a small crack in the rotor will change the vibration pattern and can be detected this way. Example 3: Auto-associators like this are sometimes used to train the (many) hidden layers in deep learning networks, to build up layer upon layer of increasingly more complex feature detectors. Bad example 1 (partial credit): Error correction. The idea is that the network should be able to repair damaged inputs. Recurrent networks, such as Hopfield networks, can be used for this, but it will probably not work very well with a feed-forward MLP like this. Note that if this worked, novelty detection (example 2) would not.

An auto-associator (or auto-associative memory) is a system which tries to reproduce its inputs on its outputs. A multilayer perceptron can be set up as an auto-associator by having the same number of inputs and outputs, and usually a lower number of hidden nodes. See, for example, the 5-3-5 MLP to the left. Since the network is an auto-associator the target vector for each input vector is the input vector itself. What possible applications could this have? (There are several, describe at least one)

a. Because there is no one to tell the agent which solution is the best one (unlike supervised learning). If the agent does not explore, it will most likely (*) get stuck in the first solution found (if it finds any solutions at all), never discovering that there are better solutions. (*) Actually, the environment might be stochastic enough to make the agent discover new solutions anyway, but I did not expect the students to note that. b. By the use of a stochastic action selector. The most common is ε-greedy where the agent explores with probability ε, by selecting actions at random. Otherwise it exploits what it knows instead, i.e. takes the action that is currently believed to be best. Another common action selector is Boltzmann selection, where the probability of an action being selected is proportional to its value.

An important issue in reinforcement learning is to know when, and to what extent, the agent should be allowed to explore. a) Why is it necessary to let the agent explore? b) How can this be achieved/implemented?

Stigmergy, i.e. indirect communication and coordination by local modification and sensing of the environment. ACO works by modelling real ants' use of stigmergy, through pheromone trails. Since this form of communication is indirect with information stored locally in the environment, as opposed to point-to-point communication with information stored in the ants, it does not matter much how many agents we have. They are almost independent and do not require central control. A real ant colony may consist of millions of individuals, so point-to-point communication would not be feasible as the only form of communication (of course, they do that too). Another advantage is that the individual agents can be made very simple, yet the system as a whole can solve fairly complicated coordination tasks.

Ant colony optimization algorithms are very scalable with respect to the number of agents, i.e. the complexity of the algorithm does not grow very much with the number of ants. What is the fundamental mechanism in these algorithms, which makes them so scalable (in that respect)?

Counting the number of mis-classifications is not a differentiable function. All functions of the network, including the objective function, must be differentiable in order to derive (literally) how much to move a weight given an input, an output and a desired output. Apart from not being differentiable, however, counting the number of mis-classifications is a perfectly valid objective function and would work if the network is trained by PSO or GA for example, which do not depend on gradients.

Back propagation a) When Backprop is used for classification, the objective function to be minimized is usually the same as in function approximation - a sum of the squared error between the output and the target. But in classification, the real objective (usually) is to minimize the number of mis-classifications. Why can this not be used directly, as the objective function?

Q-Learning updates the Q-value Q(s,a) for state s and action a, based on the maximum Q-value in the next state, s'. It does not matter if the agent selected another action there, due to exploration; the maximum Q-value in state s' will still be used to update Q(s,a). So even if the agent explores from s' and therefore enters the danger zone, Q(s,a) will not be affected. Sarsa, on the other hand, updates Q(s,a) based on the action actually selected in s', so if that was an exploratory move into the danger zone, Q(s,a) will be directly affected by that. This will in effect push down the expected value for Q(s,a), which may make it worthwhile to take a longer route around the danger zone. [Actually, it is not certain that the expected value of Q(s,a) will be pushed down sufficiently to make a detour worthwhile. It also depends on the exploration rate and how strongly we reward taking the shortest route, for example on how large the positive reward for reaching the goal is, compared to the cost of getting there. I did not expect students to find that weakness in the problem formulation, and as far as I could tell, no one did]

Consider a navigation problem (sketched below) where we want a reinforcement learning agent to find the shortest path from A to B. There is a dangerous area between A and B (a region of states where the agent would be punished or possibly even killed if it enters), so the agent must find a path around it, to avoid the area. If we train both a Q-learning agent and a Sarsa agent on this problem, with constant and non-zero control parameters, the Q-Learning agent would typically converge to a shorter route (closer to the danger zone) than Sarsa, which would seem to be more careful and converge to a slightly longer route around the dangerous area. Why?

a.) The leftmost node wins, since its weight vector, [1.4, 0.3], is closest (Euclidean distance) to the input vector, [1.0, 0.5]. Some students had the network compute weighted sums instead of distances, which in this case would identify the same winner, so that's OK, but in general the weights should be normalized first if you want to define the winner this way. Also, if you compute weighted sums, the winner is the node with the greatest weighted sum, not the smallest. b.) The weight vector of the winner is moved halfway (since n=0.5) towards the input vector, so the new weight vector of that node is [1.2, 0.4]. The rest of the weights in the network are left unchanged in standard competitive learning. Surprisingly many students computed weight changes without using the input values (for example, just multiplying the weight by the step length). It should be obvious that no learning rule could work if it does not care about the input values. Some students had the right idea, but made some minor mathematical error (a sign error, for example) which moves the node away from the input. The mistakes may have been minor, but it should have been obvious in those cases that the result was wrong (further away from the input).
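The calculation can be reproduced with a short sketch of the standard competitive learning rule. The winner's weights and the input are taken from the answer above; the weights of the other two nodes are only visible in the figure, so the values used here for them are made up.

```python
import math


def competitive_step(weights, x, step):
    """Find the closest weight vector and move only that one towards x."""
    winner = min(range(len(weights)), key=lambda j: math.dist(weights[j], x))
    weights[winner] = [w + step * (xi - w) for w, xi in zip(weights[winner], x)]
    return winner


weights = [
    [1.4, 0.3],   # leftmost node (from the answer)
    [0.2, 1.5],   # hypothetical values, not from the source
    [-0.5, 0.9],  # hypothetical values, not from the source
]
winner = competitive_step(weights, [1.0, 0.5], step=0.5)
print(winner, weights[winner])
```

Because the update is w + n(x - w), a sign error turns it into a move away from the input, which is the mistake described in part b.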

Consider the competitive learning network depicted here to the left. The three circles on top are the nodes. The numbers by the connections are the weights. The black circles at the bottom are the inputs and the numbers below are the current input values. What happens now: a.) Which node wins, given this input, and why? b.) What happens to the weights of the network if the standard competitive learning rule is applied to this situation with step length n=0.5? Write down the new set of weights and justify your answer

where y is the output, wi is a weight, xi is an input and θ is the threshold. Point reductions for incomplete definitions, or for describing some special case (e.g. only for two inputs which is not what "binary" means here), for having a sigmoidal activation function (not binary), or no activation function at all. Many students did not answer this first part of the question, in which a formal definition was explicitly asked for. A general description without any information on how the node computes its output value received no points. Full credit for the second part required that the student illustrated the XOR problem as a classification problem, to show that the classes are not linearly separable, together with at least a statement that the perceptron forms a linear discriminant (line, or hyperplane). (I did not require proof of the latter though, since the question was to "explain" this, not to "show" or "prove".) It is not the node being binary (the step function) which makes the perceptron form a hyperplane. The shape of the discriminant is a consequence of the weighted sum (linear combination).
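To make the XOR argument concrete, the sketch below defines a binary perceptron and searches a coarse grid of weights and thresholds. It assumes the convention that the node outputs 1 when the weighted sum exceeds the threshold and 0 otherwise (the original formula is not reproduced in this card, so that convention is an assumption). A grid search is an illustration, not a proof.

```python
from itertools import product


def binary_perceptron(weights, theta, x):
    """Step function applied to a weighted sum compared with a threshold."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > theta else 0


def solves(table, weights, theta):
    """True if the perceptron reproduces every row of the truth table."""
    return all(binary_perceptron(weights, theta, x) == t for x, t in table.items())


AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
candidates = list(product(grid, repeat=3))  # (w1, w2, theta)
and_solutions = [p for p in candidates if solves(AND, p[:2], p[2])]
xor_solutions = [p for p in candidates if solves(XOR, p[:2], p[2])]
print(len(and_solutions) > 0, len(xor_solutions))
```

AND has many solutions on the grid, XOR has none. The reason is the one given in the answer: XOR would need theta >= 0, w1 > theta, w2 > theta and w1 + w2 <= theta at the same time, which a single line cannot satisfy.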

Define a binary perceptron (formally, i.e. write down the mathematical expressions for what it computes), and explain why a single binary perceptron cannot solve the XOR problem!

1. Initialize. Set all weights to small random values with zero mean.
2. Present an input vector, x = (x1, x2, ... , xn), and corresponding target vector, d = (d1, d2, ..., dm).
3. Feed forward phase. Compute network outputs by updating the nodes layer by layer from the first hidden layer to the outputs. The first hidden layer computes (for all nodes, yj): yj = f(Σ_{i=0..n} wji xi), where x0 = -1. The next layer applies the same formula, substituting this layer's node values for xi, etc.
4. Back propagation phase. Compute weight changes iteratively, layer by layer, from the output to the first hidden layer (the sum is over all k nodes in the layer 'above', i.e. the layer for which s-values were computed in the previous iteration).
5. Repeat from step 2 with a new input-target pair.
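The five steps can be sketched for a network with one hidden layer as below. The formulas for step 4 are not reproduced in this card, so the sketch fills them in with the usual textbook form under stated assumptions: logistic activations with slope 1, a squared error objective, pattern (online) updates, and the names `delta_out` and `delta_hid` for the per-node error terms. Treat it as one possible reading, not as the course's reference implementation.

```python
import math
import random


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_weights(n_nodes, n_inputs, rng):
    """Step 1: small random weights, one extra per node for the threshold."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_nodes)]


def forward(W_hid, W_out, x):
    """Step 3: feed forward, with x0 = -1 feeding the threshold weight."""
    xb = [-1.0] + list(x)
    h = [logistic(sum(w * v for w, v in zip(row, xb))) for row in W_hid]
    hb = [-1.0] + h
    y = [logistic(sum(w * v for w, v in zip(row, hb))) for row in W_out]
    return xb, hb, y


def backprop_step(W_hid, W_out, x, d, eta=0.5):
    """Steps 3 and 4 for one input-target pair; returns the error before the update."""
    xb, hb, y = forward(W_hid, W_out, x)
    # Output layer: derivative of the logistic times the output error.
    delta_out = [yk * (1 - yk) * (dk - yk) for yk, dk in zip(y, d)]
    # Hidden layer: sum over all k nodes in the layer above, using the
    # weights as they were before this update. hb[j + 1] is hidden node j.
    delta_hid = [hb[j + 1] * (1 - hb[j + 1]) *
                 sum(delta_out[k] * W_out[k][j + 1] for k in range(len(W_out)))
                 for j in range(len(W_hid))]
    for k, row in enumerate(W_out):
        for j in range(len(row)):
            row[j] += eta * delta_out[k] * hb[j]
    for j, row in enumerate(W_hid):
        for i in range(len(row)):
            row[i] += eta * delta_hid[j] * xb[i]
    return sum((dk - yk) ** 2 for yk, dk in zip(y, d))


rng = random.Random(0)
W_hid = init_weights(3, 2, rng)  # 2 inputs, 3 hidden nodes (arbitrary sizes)
W_out = init_weights(1, 3, rng)  # 1 output node
error_before = backprop_step(W_hid, W_out, [1.0, 0.0], [1.0])
```

Step 5 corresponds to calling `backprop_step` repeatedly with new input-target pairs. Computing all error terms before changing any weight matters: the hidden layer must use the output weights from before the update.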

Describe the back propagation learning rule, with enough detail for a reader to be able to take your description and implement it! You may assume that the reader knows what neurons, weights, and multilayer perceptrons are, but not how they are trained

b.) Premature convergence is when the population converges to one spot, losing diversity (and therefore the point of having a population at all). Consider an extreme case where one individual X in the population has a much greater fitness than any other. In fitness selection, X is very much more likely to be selected for recombination/reproduction/mutation due to the great difference in fitness. Thus, its genotype, or parts of it, tends to spread rapidly in the population, in effect pulling all other individuals closer to it. In rank selection, since X is ranked number one, it is still the most likely to be selected, as it should be, but the probability does not depend on how much greater its fitness is. So, less fit individuals have a slightly greater probability of being selected in rank selection, and the whole population is therefore more likely to maintain diversity.
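The extreme case can be put in numbers. The sketch below compares fitness-proportional selection with a simple linear rank selection (probability proportional to rank, worst = 1). Linear ranking is only one of several rank schemes and is assumed here for illustration, as are the fitness values.

```python
def fitness_selection_probs(fitness):
    """Selection probability proportional to raw fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]


def rank_selection_probs(fitness):
    """Selection probability proportional to rank (worst = 1, best = n)."""
    n = len(fitness)
    order = sorted(range(n), key=lambda i: fitness[i])  # worst first
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = n * (n + 1) / 2
    return [r / total for r in ranks]


fitness = [100.0, 3.0, 2.0, 1.0]  # hypothetical population, X is the first
print(fitness_selection_probs(fitness))
print(rank_selection_probs(fitness))
```

Under fitness selection X is picked in well over 90% of the draws, under linear rank selection in 40%, no matter how large its fitness lead is.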

Evolutionary computation a.) Select a (non-trivial) crossover point and perform a one point crossover on the two genotypes (character strings) in the margin to the left! What is the result? b.) One big challenge in evolutionary computation is to avoid premature convergence. This is less likely to occur if we use rank selection instead of fitness selection. Why?

Premature convergence is when all individuals in the population converge to the same place in the search space. To be precise, the word "convergence" here means that the individuals have converged (grouped together), not that the algorithm has converged (to a local minimum/maximum - though this is likely the case as well). It is bad because we then miss the main point of evolutionary computing - parallel search. The most common cure - to maintain diversity in the population - is mutation. Other ways include modifying the selection mechanism, reducing elitism, and/or increasing population size.

Explain the evolutionary computing concept premature convergence! What is it, why is it bad and how can it be avoided?

Both learn from a finite training set by computing an error for each presented pattern and then the amount Δwji by which the parameters in the learning system (e.g. the weights in a neural network) should be changed to reduce that error. In pattern learning, the parameters are updated directly after each pattern presentation. This requires a random order of presentation, which is why this form of learning is sometimes called "stochastic". In batch learning the computed parameter changes are accumulated over time and the system is not updated until the whole training set has been presented once (one epoch). This is the more correct form of learning, from a gradient descent perspective (it is gradient descent).

Explain the two concepts pattern learning (= stochastic/online learning) and batch learning (= epoch/offline learning). From a theoretical (gradient descent) perspective, one is more correct than the other, which?

As in question 7, we use the r.h.s. of the relation Q(s,a) = r + γ max Q(s',a') (not the update equation!) as the target for output a (with s on the inputs). The other outputs cannot be trained, since we don't know what would have happened if they had been selected, so the target values for those outputs are set equal to their output values (= 0 error).
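In code, building the target vector for one observed transition could look like the sketch below. The function name, the argument layout and the handling of terminal states (no bootstrapped term when s' is the goal) are assumptions made for illustration.

```python
def q_network_targets(q_s, action, reward, q_next, gamma=0.9, terminal=False):
    """Target vector for the Q-network outputs after taking `action` in s.

    q_s    : current network outputs for state s (one per action)
    q_next : current network outputs for the next state s'
    """
    targets = list(q_s)  # untrained outputs: target = own output, so zero error
    bootstrap = 0.0 if terminal else gamma * max(q_next)
    targets[action] = reward + bootstrap  # r.h.s. of the recursive relation
    return targets


# Hypothetical outputs for three actions; action 1 was taken, reward 0.
print(q_network_targets([0.2, 0.5, 0.1], 1, 0.0, [0.3, 1.0, 0.4]))
```

Only one component of the target differs from the current output, so Backprop propagates an error from that single output node.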

For most interesting RL applications, the state space is too large for the Q-values to be stored in a table. A common solution is to use a neural network, for example a multilayer perceptron, to approximate the Q-values. See figure to the left. Note that the network estimates Q-values for all n possible actions (a) in state s in parallel. This network is then trained by supervised learning, for example using Backprop, so we have to define target values for the outputs (desired Q-values). How?

a.) Typical applications would be ones where we still don't know exactly how they can be solved, but where we can easily generate training examples. Face recognition is a classical example. Any human (almost) can do it, but we don't know exactly how. It is therefore difficult to write down a solution as an algorithm but it is easy to create a set of [input(image), target(name)] pairs to train a learning system on. b.) In batch learning, the computed weight changes are accumulated (summed up) over time before being used to update the actual weights at the end of each epoch. Addition is commutative - the order in which we sum the weight changes up does not matter. In pattern learning, on the other hand, weights are updated after each pattern presentation, which makes the order significant. Therefore the order is usually randomized, which is why this method is sometimes called stochastic learning. c.) Split the data into three subsets - a training set (largest), a validation set (smallest) and a test set. The training set is used for training, while plotting the error for the validation set. When the error on the validation set levels out, or starts to increase over time, there is no use in continuing training since it will reduce generalization ability. It is this risk of overtraining the learning system which this method tries to address. Mark the point (the epoch number) where this occurred and retrain the network for this number of epochs (possibly now including the validation set in the training set). Finally test for generalization on the test set.
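The "mark the point" step in part c can be automated from a recorded validation error curve. The sketch below picks the epoch with the lowest validation error and stops looking once no improvement has been seen for a number of epochs. That `patience` parameter is not part of the answer above; it is a common practical addition and is assumed here, as are the error values.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the (1-based) epoch with the lowest validation error so far,
    giving up after `patience` epochs without improvement."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors, start=1):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # validation error has levelled out or is rising
    return best_epoch


# Hypothetical validation errors, one per epoch.
curve = [0.90, 0.60, 0.40, 0.35, 0.36, 0.38, 0.41, 0.45]
print(early_stopping_epoch(curve))
```

The returned epoch number is what the network would then be retrained for, before the final check on the test set.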

General training procedure questions a.) Give an example of an application where you think it would be necessary to use some form of learning system, and motivate why a conventional programming approach would be less likely to succeed for this application! b.) The order in which the training patterns are presented to a learning system during training is irrelevant when batch (epoch) learning is used, but not so in pattern (on-line) learning. Why? c.) Describe the training procedure called early stopping! How is it used, and why?

see pic

In some neural network text books, it is claimed that a multilayer perceptron with one hidden layer can only form convex decision regions when used for classification. [A region is said to be convex when any two points within the region can be connected with a line without crossing the border. Circles and squares are examples of convex regions.] Consider the region R shown in the left margin, defined by the four lines a, b, c and d. Clearly, R is not a convex region, since the left side is concave. [The two corners to the left cannot be connected with a line without passing points outside the region.] Nevertheless, this region can be formed by a binary multilayer perceptron with two inputs (the x and y coordinates), four hidden nodes (representing the four lines, a, b, c and d, respectively) and one output which responds with a 1 for any point inside the region and 0 for any other point. Show, by construction, that the output node can form the depicted region, assuming that the four linear discriminants have already been found by the hidden nodes! (you only need to give the weights for the output node)

a.) RBFN should work best since it puts out hyperspheres in the input space, i.e. one node is sufficient to separate the two classes in this case. An MLP would have to form a circular(ish) region by combining hyperplanes, in this case at least three hidden nodes (= a triangle with extremely rounded corners), plus one output node. Some students had confused hyperspheres and Voronoi regions. Voronoi regions are not really applicable here. In contrast to competitive learning, there is no "winner" here or search for the closest node. Some point reductions for bad explanations and/or for not being clear (or being wrong) on what the discriminants do in the input space. b.) Because it affects the shapes. For example, if we transformed the coordinate system to polar coordinates in this case, the problem would become trivial for an MLP, but more difficult for RBF. A single binary perceptron node would suffice (and it would in this case actually only require one of the two input values - the distance from the origin). Some students claimed that MLP and/or RBFN are only defined for Cartesian space. This is not true. The network as such does not "know" anything about the coordinate system used. It just makes the problem more or less difficult because the shapes (of both data and discriminants) change. Some students seemed to think that Cartesian means 2D, and that this problem therefore is Cartesian by definition. Instead they explained how MLP and RBFs behave differently in higher dimensions, which is not what was asked for here. Some students seemed to think that the coordinate system is somehow connected to the activation function and that a problem in non-Cartesian space would no longer be differentiable. Figure for question 5: Two circular regions.

MLP vs RBF a.) Consider the classification problem depicted here to the left. The task is to separate the outer (black) circle from the inner (grey), in a Cartesian (x,y) coordinate system. Which architecture should be most suitable for this task (require the fewest number of nodes) - a multilayer perceptron or a radial basis function network? Why? b.) Why is it relevant for the previous subquestion, that the coordinate system is Cartesian?

a.) CONDITION 1 is that node j is an output node, CONDITION 2 is that node j is a hidden node. b.) The sum is over all nodes in the adjacent node layer which is closer to the outputs (or is the output layer). This is the node layer for which we computed the S-values in the previous step of the back propagation procedure. c.) Lambda*yj(1-yj) would change in both rows where it occurs. This is the derivative of the logistic function. Lambda is the slope parameter of the logistic function. d.) (dj-yj) in the S-update for the output layer nodes would change. The update for the hidden layer is not affected (other than indirectly through the sum over k). To be precise, the output layer S-update also changes its sign if we do this, since the derivative of the objective function is actually -(dj-yj). e.) The effect is a very noisy error graph, noisy here meaning the opposite of smooth, where the error fluctuates (oscillates) a lot.

Multilayer perceptrons and back propagation Under the most common assumptions, the back propagation algorithm updates the weights of a multilayer perceptron as follows: a.) What is CONDITION 1 and CONDITION 2 here? b.) Explain the sum (over k) in the last row! What is k an index of? c.) These equations assume that the neurons have logistic activation functions. Which part or parts would change if we used another activation function? d.) Another assumption made above is that the objective function to be minimized is the squared error. Which part or parts would change if we replaced this by another objective function? e.) During training it is common to plot the error as it changes over time in a graph. What is the typical effect seen in this graph, if the learning rate (gain, n) is set too high?

Premature convergence is when all individuals in the population converge to the same place in the search space, before the global optimum is found. Note that the word "convergence" here means that the individuals have converged (grouped together), not that the algorithm has converged (to a local minimum/maximum - though this is likely the case as well). It is bad because we then miss the main point of evolutionary computing - parallel search. Crossover would no longer find new solutions (excellent point from several students). The cure is to try to maintain diversity in the population, for example by increasing the mutation rate, or by modifying the mutation operator. Other ways include modifying the selection mechanism (e.g. rank selection instead of fitness selection), reducing elitism, and/or increasing population size.

Population Methods (EC and SI) What is premature convergence? Explain the concept, why it is a problem and suggest at least two ways to avoid it in evolutionary computing.

a.) Fitness selection may lead to a greater risk of premature convergence than rank selection. In fitness selection, individuals are selected proportionally to their fitness value. If one individual has a much greater fitness than the others, that individual will be very likely to be selected and may therefore quickly dominate the population (in effect pulling all other individuals to it = premature convergence). In rank selection we instead select proportionally to rank in a list sorted by fitness. The best individual still has the greatest selection probability, but not that much greater than number two in the list (even if there is a big difference in actual fitness). b.) I should have asked for the 'most important' choices. Of course there may be others than the ones I had intended for students to discuss here, but the most important are: 1. The neighbourhood structure (how sparse/dense it is). The sparser the structure is, the less risk of premature convergence, since the particles will not affect each other directly. Therefore gbest (which is the densest form of lbest) is more likely to cause problems than lbest using a ring structure, for example. 2. The weight parameters, q1 and q2, in the update formula, which control the balance of the cognitive and social components. If the social component is too strong, the population will be more likely to converge prematurely. 3. Population size (larger is better). (Not required for full credit)

Population methods (evolutionary computing and swarm intelligence) a.) How does the choice of using fitness or rank selection affect the risk of premature convergence in evolutionary computing, and why? b.) Which design choices in particle swarm optimization affect the risk of premature convergence and how? c.) Which design choices in Ant System (the basic form of ant colony optimization discussed on this course) affect the risk of premature convergence and how? The intended answer here was the choice of the parameters α and β, which control the balance between following pheromone trails vs. acting on local information in the update formula. It's the same argument as for the balance of the cognitive/social components in PSO. If the "social component" (which here corresponds to the pheromone trails) is too strong, the system is more likely to converge prematurely. Population size does not have the same effect here, and if it does it may actually affect the risk both ways (but I did not expect students to note that). Same thing with the evaporation rate - for normal values it should not affect this much, but in extreme cases it may (no evaporation at all for example, which may saturate the paths).

a.) RBF hidden nodes compute the distance between the input vector x and the weight vector w and feed that through a Gaussian activation function (or similar shape), thus producing a node value which is at max (1) for x=w and decreasing with distance from w (for a Gaussian, asymptotically towards 0). There is no threshold weight, but on the other hand there must be a parameter which defines the width of the activation function (for a Gaussian, the standard deviation, σ). The comment in the question on "all relevant aspects" was intended as a hint not to forget the widths. b.) The two layers are trained differently. The output layer is just a layer of regular perceptrons and as such can be trained by for example the delta-rule. The hidden layer positions (x), are usually trained by competitive learning or K-means. Their widths (σ) are usually set to a constant value, computed after the positions have been found, for example to the average distance between a node and its closest neighbour (in weight space).
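A hidden node and the width heuristic from part b might be sketched as follows. The Gaussian is written in its usual form exp(-d^2 / (2 sigma^2)); the function names and the example centres are assumptions for illustration.

```python
import math


def rbf_hidden_output(x, w, sigma):
    """Gaussian of the Euclidean distance between input x and centre w."""
    dist_sq = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


def width_from_nearest_neighbours(centres):
    """Average distance from each centre to its closest other centre."""
    nearest = [min(math.dist(c, other) for j, other in enumerate(centres) if j != i)
               for i, c in enumerate(centres)]
    return sum(nearest) / len(nearest)


centres = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]  # hypothetical trained positions
sigma = width_from_nearest_neighbours(centres)
print(rbf_hidden_output([0.0, 0.0], centres[0], sigma))  # input on the centre
print(rbf_hidden_output([2.0, 0.0], centres[0], sigma))  # input further away
```

The output is 1 when the input coincides with the centre and falls towards 0 with distance, with no threshold weight involved.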

Radial basis function (RBF) networks a.) What is computed by a hidden node in an RBF network? (in the feed forward phase) b.) How are RBF networks usually trained?

a.) RL is learning by trial and error. It is based on exploration. That's how it finds new solutions. Unlike supervised learning, there is no one who can tell the agent which solution is the best one (or better than the ones already found). If the agent does not explore, it will most likely get stuck in the first solution found (if it finds any solutions at all), never discovering that there are better solutions. (Actually, the environment may be stochastic enough to make the agent discover new solutions anyway, but I did not expect the students to note that.) b.) This is achieved by the use of a stochastic action selector. The most common is "ε-greedy" where the agent explores with probability ε, by selecting actions at random. Otherwise it exploits what it knows, i.e. takes the action that is currently believed to be best. Another common action selector is Boltzmann selection (a.k.a. Roulette wheel sampling), where the probability of an action being selected is proportional to its Q-value (relative to the other possible actions from that state).

Reinforcement Learning An important issue in reinforcement learning is to know when, and to what extent, the agent should be allowed to explore. a) Why is it necessary to let the agent explore? b) Describe a common way to implement this! (there are several)

a. Epsilon-greedy: With probability ε, disregard the learning system's suggestions (action merits, Q-values) and select an action at random instead. Otherwise, make the greedy choice (select the one with highest merit). Boltzmann selection: Select an action with probability proportional to its merit value relative to the others, e.g. in Q-Learning with probability (see pic). b. Q-Learning moves the action value for action a in state s, Q(s,a), towards the maximum Q-value over all possible actions in the state we ended up in, s'. In other words, it assumes that the next action will be greedy. Whether the agent actually is greedy or not in state s' is irrelevant to Q-learning. Not so for Sarsa. Sarsa waits until a decision has been made in state s' to perform action a', and then moves Q(s,a) towards that Q-value, Q(s',a'), greedy or not. Of course, if the agent is greedy this is equivalent to Q-Learning, but if the agent explores as it should (in both cases), the average target values will be smaller in Sarsa than in Q-Learning. c. It is a delayed reinforcement learning problem. Despite receiving rewards after every action, Marvin must associate the rewards with a whole sequence of actions (delayed RL) instead of just the latest one (immediate RL) if he is to succeed here. Otherwise, when he eventually finds the socket (goal state) and gets his +100 reward, only the states immediately preceding the goal state will be affected.
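The claim in part b, that Sarsa's targets are smaller on average, can be illustrated by comparing the two targets for the same next state. The sketch assumes an epsilon-greedy selector that picks uniformly among all actions when exploring; the function names and the Q-values are made up for the example.

```python
def q_learning_target(r, q_next, gamma):
    """Q-Learning always bootstraps from the best action in s'."""
    return r + gamma * max(q_next)


def sarsa_target(r, q_next, a_next, gamma):
    """Sarsa bootstraps from the action actually chosen in s'."""
    return r + gamma * q_next[a_next]


def average_sarsa_target(r, q_next, gamma, epsilon):
    """Mean Sarsa target when a' is chosen epsilon-greedily in s'."""
    mean_q = sum(q_next) / len(q_next)
    return r + gamma * ((1 - epsilon) * max(q_next) + epsilon * mean_q)


q_next = [1.0, 0.0, -5.0]  # hypothetical Q-values in s', one action is bad
print(q_learning_target(0.0, q_next, 0.9))
print(average_sarsa_target(0.0, q_next, 0.9, epsilon=0.1))
```

With epsilon set to 0 the two targets coincide. With any exploration the Sarsa average is pulled down by the non-greedy actions, and more so the worse those actions are.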

Reinforcement learning a. Reinforcement learning agents must be allowed to explore the state space in order to find the best solution, or they may get stuck in the first viable solution they find. Give one (established) example of how this exploration behaviour can be implemented! b. In general, Q-learning comes up with greater action values than Sarsa, when trained on the same task. Why? c. Marvin, the robot, is depressed. His energy resources are almost up and he has to find a short route to a wall socket in order to recharge his batteries. Marvin is more than one move away from the closest wall and every move he makes costs one unit of energy (corresponding to an immediate reward of -1 for each action), but he will, on the other hand, get +100 units of energy when he finds a socket. Is this an immediate or a delayed reinforcement learning problem? Why?
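The difference between the two update targets discussed in part b of this card can be made concrete with a small sketch. The discount factor `gamma`, the learning rate `alpha` and the function names are assumptions for illustration; the card itself only describes the targets in words.

```python
def q_learning_target(reward, q_next, gamma):
    """Target assumes the next action is greedy: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(q_next)

def sarsa_target(reward, q_next, next_action, gamma):
    """Target uses the action actually chosen in s': r + gamma * Q(s', a')."""
    return reward + gamma * q_next[next_action]

def td_update(q_sa, target, alpha):
    """Move Q(s, a) a fraction alpha of the way towards the target."""
    return q_sa + alpha * (target - q_sa)
```

Since max over a' of Q(s',a') is never smaller than Q(s',a') for any particular a', the Sarsa target can never exceed the Q-Learning target for the same transition, and it is strictly smaller whenever the agent explores a non-greedy action.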

a. To reduce dimensionality in a way which maintains the statistical distribution of the data, so that, for example, two points that are close to each other in the high-dimensional input space are still close to each other in the lower-dimensional output space (and vice versa). b. Instead of just moving the winner (the weight vector which is closest to the input vector), we move all weight vectors towards the input, but to a degree which depends on how far that node is from the winner in the network structure (usually a 2-dimensional grid). This is done by defining a neighbourhood function f(j,k) where j is the index of the node to move and k is the index of the winner node. This function should have a max=1 for the winner itself (j=k) and then decrease with distance between j and k. A Gaussian function, for example. This neighbourhood function is used as a multiplier in the update formula: (SEE PIC)

Self Organizing Feature Maps can be described as an extension to Standard Competitive Learning to implement topologically preserving dimension reduction. a. Explain what it means to be topologically preserving! b. How is the Standard Competitive Learning Rule extended to implement this?

Topologically preserving dimension reduction is to reduce dimensionality (project data from a high dimensional space to a lower dimensional space) in a way which maintains the statistical distribution of the data. For example, so that two points that are close to each other in the high-dimensional input space are still close to each other also in the lower-dimensional output space (and vice versa). Instead of just moving the winner (the weight vector which is closest to the input vector in the input space) as in SCL, all weight vectors are moved towards the input, but to a degree which depends on how far that node is from the winner in the network structure (usually a 2-dimensional grid), not in the input space. This is done by defining a neighbourhood function f(j,k) where j is the index of the node to move and k is the index of the winner node. This function should have a max=1 for the winner itself (j=k) and then decrease with distance (on the grid) between j and k. A Gaussian function, for example. This neighbourhood function is used as a multiplier in the update formula: (SEE PIC)

Self Organizing Feature Maps can be described as an extension to Standard Competitive Learning to implement topologically preserving dimension reduction. Explain what it means to be topologically preserving in this context, and how the Standard Competitive Learning Rule is extended to implement this in SOFM!

The neighbourhood topology decides which nodes in the network, if any, are moved together with the winner towards the input. In SCL there is no such neighbourhood, i.e. only the winner is moved. In SOFM, the neighbourhood is hard-coded as a grid and in that sense fixed, though the amount by how much a neighbour is moved towards the input may change over time. In other words, who is neighbour to whom is fixed, but the step length is not. In GNG, the neighbourhood is changing over time, by updating a graph. Nodes that strive for the same cluster may become neighbours, i.e. be connected in this graph if they were not already, and neighbouring nodes that strive for different clusters may have their neighbour status revoked, by cutting the edge between them in the graph. Point reductions for incomplete descriptions, most commonly not mentioning how the neighbourhood is defined (by a graph), what the effect is (that it decides which nodes to move), or implying that edges in GNG can only be removed (not created).

Self-organizing feature maps have a fixed neighbourhood topology, Growing Neural Gas has a dynamic neighbourhood and Standard Competitive Learning has no neighbourhood at all. Explain what this means in all three cases!

You show this by first writing down the definition of what a perceptron computes (not only the weighted sum). Then you set the weighted sum to 0. That's where the perceptron "decides" what to do. So, now you have an equation where you can express x2 as a function of x1. That equation becomes a line equation x2 = k*x1 + m where k = -(w1/w2) and m = theta/w2. A discriminant is not by definition a hyperplane, as suggested by a few students. A discriminant can take any shape. Nor does the linear shape of the discriminant follow from the fact that binary perceptrons are binary.

Show that the discriminant formed by a binary perceptron with two inputs is a line
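The steps in the answer to this card can be written out as a short derivation. It assumes the convention used elsewhere in this set, where the threshold theta is subtracted from the weighted sum, and it assumes w2 is nonzero (otherwise the line is vertical and cannot be solved for x2).

```latex
y(x_1, x_2) =
\begin{cases}
1 & \text{if } w_1 x_1 + w_2 x_2 - \theta > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad\Longrightarrow\qquad
w_1 x_1 + w_2 x_2 - \theta = 0
\;\Longleftrightarrow\;
x_2 = -\frac{w_1}{w_2}\,x_1 + \frac{\theta}{w_2}
\quad (w_2 \neq 0)
```

The right-hand side has the form x2 = k*x1 + m, which is a line in the (x1, x2) plane.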

This is actually just a boolean logic problem. The question boils down to this: Is it possible for a single binary perceptron (the top node here) to implement the boolean function ab(c+d)? This is what "inside R" means given the above assumptions. One solution is wa=wb=2, wc=wd=1, theta=4.5. The most common point reductions were for making unmotivated changes to the given assumptions (e.g. on which side the hyperplanes respond with +1), for coming up with the wrong boolean function (given your assumptions if you changed them), or no such function at all, or for not finding the correct weights to implement that function. Common mistake: Making the node a 4-input AND. AND is convex. Sidenote: So, ab(c+d) is possible to implement by a single perceptron, but we know that XOR is not. Where is the limit? A single binary perceptron can implement any conjunction which contains at most one disjunction (such as ab(c+d)), or any disjunction which contains at most one conjunction (e.g. a+b+cd+e). XOR (a·NOT b + NOT a·b) is a disjunction of two conjunctions, so it is not solvable. In geometric terms, the region may contain one, but only one, concavity. The rest of it must be convex.

Some years ago, many neural network text books wrongly claimed that a multilayer perceptron with one hidden layer can only form convex decision regions when used for classification. (A region is said to be convex when any two points within the region can be connected with a line without crossing the border. Circles and squares are examples of convex regions.) Consider the region R shown in the left margin, defined by the four lines a, b, c and d. Clearly, R is not a convex region, due to the concavity on the left side. (Points inside R close to the two corners to the left cannot be connected with a line without passing points outside the region.) Nevertheless, this region can be formed by a binary multilayer perceptron with two inputs (the x and y coordinates), four hidden nodes (representing the four lines/hyperplanes, a, b, c, and d, respectively) and one output which responds with a 1 for any point inside the region and 0 for any other point. Show, by construction, that the output node (in this case a binary node) can form the depicted region, assuming that the four linear discriminants have already been found by the hidden nodes! (You only need to give the weights for the output node.)
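The solution given in the answer to this card (weights 2, 2, 1, 1 with threshold 4.5) can be checked by brute force over all 16 input combinations. The sketch below assumes the output node fires when the weighted sum exceeds the threshold; the helper names are illustrative.

```python
from itertools import product

def perceptron(inputs, weights, theta):
    """Binary perceptron: 1 if the weighted sum exceeds the threshold, else 0."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > theta else 0

WEIGHTS = (2, 2, 1, 1)  # wa, wb, wc, wd from the answer
THETA = 4.5             # threshold from the answer

def target(a, b, c, d):
    """The boolean function ab(c+d), i.e. a AND b AND (c OR d)."""
    return int(a and b and (c or d))

# Collect every input combination where the perceptron disagrees with ab(c+d).
mismatches = [bits for bits in product((0, 1), repeat=4)
              if perceptron(bits, WEIGHTS, THETA) != target(*bits)]
print(mismatches)
```

An empty list means the single node reproduces ab(c+d) on every input. Note that the input (1, 1, 1, 0) gives output 1, which a 4-input AND would reject.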

The training set is used for training, i.e. to compute parameter changes (weights, if it is a neural network). The test set is used to test for generalization, i.e. how well the system performs on previously unseen data, after training. In early stopping, the validation set is used to decide when to stop training, by plotting the error on this set (while training on the training data). Sooner or later, the error on this validation set will stop decreasing, or possibly even increase. There is no point in continuing training after that (the system will become specialized on the training data). The most common reasons for point reduction were not mentioning how to see when to stop (looking for a minimum using the validation set), and/or what to do then. Usually the network is simply reset and retrained for that number of epochs, but in principle we could 'rewind' the network to that point in time, if we kept all weight set versions during training. Confusing the test set with the validation set is a fairly common mistake, and usually excusable, but in this case it should have been clear from the question formulation which is which.

Supervised Learning When training a learning system to solve a task, you should divide your training data into a training set and a test set. Furthermore, if you use early stopping (a form of cross validation) you should also have a validation set. Why? Explain the need for the three sets, and how they are used!
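The stopping decision described in the answer to this card can be sketched as a search for the minimum of the validation error curve. The `patience` parameter (how many epochs without improvement to tolerate before giving up) is an assumption added for illustration; the card only says to look for the minimum.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (best_epoch, best_error) for a sequence of per-epoch validation errors.

    Epochs are numbered from 1. Scanning stops once `patience` epochs have
    passed without a new minimum.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors, start=1):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch, best_error
```

The returned epoch count is what the network would be retrained for after a reset, or the point to 'rewind' to if weight snapshots were kept.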

a. Stigmergy = Indirect communication and coordination by local modification and sensing of the environment. The most common example (it was not a requirement here to give examples) is how pheromones are used by ants to control the behaviour of other ants, but many other land-living animals, including humans, also use stigmergy. Dogs leave messages to other dogs by urinating on points of interest, and humans sometimes leave messages to family members on kitchen tables. Road signs are another human example. It is of particular interest to computer science because it is an extremely scalable form of communication. An ant colony may consist of millions of individuals, so point-to-point communication would not be feasible. Furthermore, the individual agents can be made very simple, and yet the system as a whole can solve fairly complicated coordination tasks. b. The cognitive component strives for the best position found so far by this particle, i.e. the previously visited position in space which had the best fitness. The social component strives for the best position found so far by the neighbours of this particle in a neighbourhood graph. The neighbourhood is social, not topological - it is a network of friends, rather than a network of neighbours. In the extreme case (gbest) this neighbourhood graph covers the whole swarm, i.e. the social component strives for the best position found so far by any particle in the swarm. This is usually not a good idea though. Local interaction works better.

Swarm intelligence a. Define stigmergy in one sentence, and then explain why it is of particular interest to computer science! b. In the Particle Swarm Optimization variant called lbest, the particle velocities are updated by a weighted sum of two parts - a trade-off between a cognitive component and a social component. What are these two components trying to achieve? (what are they striving for?)

a. A codebook vector is a vector which represents a region in space. Any vector within this region is associated to this codebook vector and, consequently, treated as equal. It is likely, but not necessary, that the learning algorithm moves these codebook vectors to the centres of clusters found in the data. In a competitive learning network (for example) the codebook vectors are the weight vectors - one for each neuron/codebook. b. It is the name of the region described above, i.e. it is a region in space surrounding codebook vector v such that any vector within that region would have v as its closest codebook vector. Whether the data actually contains any vectors within that region or not is irrelevant. Formulated differently, if you think of this as a competitive learning neural network, the Voronoi region of node v is the set of input vectors, existing or not, for which v would win the competition. c. The hint in the question refers to the neural network implementation of PCA, mentioned on lecture 9, where a multilayer perceptron with equal number of inputs and outputs and a narrow hidden layer, is trained to reproduce the inputs on the outputs. This can for example be used as a compressor/decompressor of images. It is trained by a supervised training algorithm such as backprop, but in an unsupervised context - the target values are taken from the inputs, not from an external source. Of course, other techniques for dimension reduction, such as SOFM or conventional PCA, are also examples of self-organization which are not (in themselves) classification methods. Though the reason for reducing dimensionality is often to facilitate later classification, dimension reduction in itself is not classification.

Texts about unsupervised learning (self-organization) often mention codebook vectors and Voronoi regions. a. What is a codebook vector and what does it correspond to in neural network implementations such as Competitive Learning networks? b. What is a Voronoi region? c. Self-organization is almost always about classification. Can you think of a case which is not? In other words, a learning problem which is unsupervised (not relying on external target values) but which is not a classification problem? [Hint! At least one such case has been mentioned in lectures, though it was perhaps not pointed out that it is an unsupervised learning situation].

see pic

The binary perceptron a. Define a binary perceptron, i.e. write down its function y(x) where y is a 0 or 1 and x is a vector of input values! b. Now set the weights of such a binary perceptron, in this case with two inputs, a and b, so that it would implement the discriminant (thick black line) shown in the margin here to the left! The perceptron should respond with 1 on the right hand side and 0 on the left hand side of the discriminant.

a.) The boolean function is a+bc, i.e. a OR (b AND c). b.) The corresponding input value is added to the weight, to increase the weighted sum.

The binary perceptron a. What is the boolean logic function (write down the expression) implemented by the binary perceptron in the margin to the left? a, b, and c are binary inputs, either 0 (false) or 1 (true) b. Binary perceptrons can be trained by the Perceptron Convergence Procedure. How are the weights updated in this algorithm if the perceptron output is 0 and the desired output is 1?
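The weight update in part b of this card can be sketched as a small training loop. The perceptron in the margin figure is not available here, so the sketch trains a fresh node on the function a+bc from part a instead. The step size `eta`, the zero initialization, the epoch limit and the treatment of the threshold as a weight on a constant input of -1 are all assumptions for illustration.

```python
from itertools import product

def predict(weights, theta, x):
    """Binary perceptron output: 1 if the weighted sum exceeds theta, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > theta else 0

def train_perceptron(patterns, n_inputs, eta=1.0, max_epochs=1000):
    """Repeat passes over the patterns until one pass produces no errors."""
    weights, theta = [0.0] * n_inputs, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, d in patterns:
            y = predict(weights, theta, x)
            if y != d:
                errors += 1
                # d=1, y=0: add the input to each weight; d=0, y=1: subtract it.
                weights = [w + eta * (d - y) * xi for w, xi in zip(weights, x)]
                # The threshold behaves like a weight on a constant input of -1.
                theta -= eta * (d - y)
        if errors == 0:
            break
    return weights, theta

# All 8 input patterns with the desired output a OR (b AND c).
patterns = [((a, b, c), int(a or (b and c))) for a, b, c in product((0, 1), repeat=3)]
weights, theta = train_perceptron(patterns, n_inputs=3)
```

Because a+bc is linearly separable, the procedure is guaranteed to stop after a finite number of corrections.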

a.) pi,d is the best position found so far by this particle (i), pl,d is the best position found so far by any particle among this particle's neighbours. This neighbourhood is fixed and defined by a graph, for example a ring structure or a hypercube. Point reductions (very common) for not explaining the neighbourhood (at least that it is structural and not topological). b.) Because the average is then close to 1. theta 1 and theta 2 control the step length towards the personal best and the local best, respectively. Setting the value close to 2 means that the average value of the random numbers will be close to 1, i.e. the search will be centered (roughly) around the best positions found so far. There is usually no point in going back to the exact best positions again, but searching around them seems reasonable.

The velocity update equation of the particle swarm optimization method lbest, in its basic form, can be written as: (PIC) where vi,d(t) is the velocity of particle i in dimension d at time t, xi,d(t) is the position of particle i in dimension d at time t, and U(0, theta 1) and U(0, theta 2) are random numbers drawn from a uniform distribution in the range [0, theta 1] and [0, theta 2], respectively. a.) Explain the variables pi,d and pl,d! b.) theta 1 and theta 2 are constants, usually recommended to be set close to 2. What's the intuition behind this recommendation? Why 2?
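Since the equation itself is only available as a picture, the following is a sketch of the update as the surrounding text describes it: the old velocity plus a random step towards the personal best and a random step towards the local best. The absence of an inertia weight, the ring neighbourhood that includes the particle itself, and the convention that lower fitness is better are all assumptions.

```python
import random

def lbest_velocity(v, x, p_i, p_l, theta1=2.0, theta2=2.0, rng=random):
    """New velocity, computed per dimension d.

    v, x : current velocity and position of particle i
    p_i  : best position found so far by particle i (cognitive component)
    p_l  : best position found so far in i's neighbourhood (social component)
    """
    return [
        v_d
        + rng.uniform(0.0, theta1) * (pi_d - x_d)
        + rng.uniform(0.0, theta2) * (pl_d - x_d)
        for v_d, x_d, pi_d, pl_d in zip(v, x, p_i, p_l)
    ]

def ring_local_best(personal_bests, fitnesses, i):
    """Best personal best among particle i and its two neighbours on a ring."""
    n = len(personal_bests)
    neighbours = [(i - 1) % n, i, (i + 1) % n]
    best = min(neighbours, key=lambda j: fitnesses[j])  # lower is better (assumed)
    return personal_bests[best]
```

With theta1 = theta2 = 2, each random factor averages 1, so on average each attraction term moves the particle exactly onto the corresponding best position, and the random spread makes it search around those positions.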

They are closely related. A codebook vector (centroid) is a representative of a cluster/class. An input vector is classified to the class of its closest codebook vector. If we think of this as a neural network, a codebook vector is a weight vector of the neuron which represents that class. The point of unsupervised learning is to move these codebook vectors around in the input space, to represent the clusters of data as well as possible. A Voronoi region is the region around a codebook vector, for which it is the closest codebook vector, i.e. the region for which it would 'win' for any input vector generated there. One student expressed this very well as the codebook vector's "area of influence" (thanks! I will use that). The shape of this region depends on where the other codebook vectors are, and on the distance measure used. In (finite) Euclidean space a Voronoi region is a convex polytope. (Discussions on shapes were not required for full credit)

Unsupervised Learning Texts about unsupervised learning (self-organization) often mention codebook vectors (also known as centroids) and Voronoi regions. Explain both concepts!
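The relation between codebook vectors and Voronoi regions in this card can be illustrated with a nearest-neighbour lookup: a point lies in the Voronoi region of whichever codebook vector is closest to it. The sketch assumes Euclidean distance, and the function name is illustrative.

```python
import math

def nearest_codebook(x, codebook):
    """Index of the codebook vector closest to x, i.e. the 'winner' for x."""
    return min(range(len(codebook)), key=lambda k: math.dist(x, codebook[k]))
```

For example, with codebook vectors at (0, 0), (10, 0) and (0, 10), the point (4, 7) falls in the Voronoi region of the third vector, whether or not the data set actually contains such a point.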

a.) They are equivalent, if standard competitive learning is trained by epoch/batch learning. (If pattern learning is used, standard competitive learning becomes more stochastic, due to the random order of pattern presentations.) b.) They would also be uniformly distributed. The point of self organizing feature maps is to preserve topology, i.e. to preserve statistical distributions. Some students claimed that the weights would become equal, i.e. converge to the same spot. That would not preserve topology. c.) The error is a (discounted) accumulated distance the node has moved around as a winner. A node which moves around a lot, when it wins, is likely to need help covering the data.

Unsupervised learning a.) Standard competitive learning and k-means are closely related. How? b.) What would happen to the network weights if you trained a self organizing feature map on random data (i.e. data drawn from a uniform distribution)? c.) In Growing Neural Gas, new nodes are inserted between the node with the greatest accumulated error, and the node among its current neighbours with the greatest error. How is this error defined?

a. They are equivalent, if SCL is trained in batch learning mode. b. Competitive learning is to move the closest node (codebook vector) towards the latest input vector, to make it more likely to win also next time the same input vector is presented. The winner-takes-all problem occurs when a few nodes, in the extreme case only one node, win all the time because they are closer to all the data than the other nodes (which never win and therefore never move). At best this leads to underutilization of the network, at worst all the data is classified to the same class. c. In between the node x which has the greatest error and the node y, among its current neighbours, with the greatest error. The error here is proportional to how much the node has moved lately (as a winner), so a great error indicates that the node is in an area with too few nodes to cover the data well.

Unsupervised learning / Self organization a) How is standard competitive learning related to the classical method K-means? b) Explain the winner-takes-all problem which can occur in competitive learning! c) When a new node is to be inserted in the Growing Neural Gas algorithm, it is inserted between two already existing nodes. Which two nodes?

e) The activation function. The required number of nodes corresponds to the number of monotonic regions in the target function, since the network output is a linear combination of monotonic functions - the hidden layer activation functions.

What is it, in the equations defining a hidden node in an MLP, which most directly decides the number of hidden nodes required to approximate a function well? (Mark one) a) The input values (to the hidden layer) b) The weight values c) The threshold value d) The weighted sum e) The activation function

In SOFM, the nodes are connected in a neighbourhood structure (usually a 2D grid). Instead of just moving the winner (the weight vector which is closest to the input vector), as in competitive learning, we move all weight vectors towards the input, but to a degree which depends on how far that node is from the winner in the network structure. This is done by defining a neighbourhood function f(j,k) where j is the index of the node to move and k is the index of the winner node. This function should have a max=1 for the winner itself (j=k) and then decrease with distance between j and k. A Gaussian function, for example. This neighbourhood function is used as a multiplier in the update formula: (REFER TO PIC ON OTHER SIDE) This is what makes self organizing feature maps a map, i.e. topologically preserving, so that inputs close to each other in the input space will also activate areas close to each other on the grid. In other words, this is what makes the 2D map approximate the density function in the high-dimensional input space.

What is the purpose of the neighbourhood function used in self organizing feature maps? Please explain how the neighbourhood function is used, what it typically looks like (as a function), and why it is needed (what we want to achieve)!
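The update formula is only referenced as a picture in the answer to this card, so the sketch below uses the common form in which each weight vector moves towards the input by a step scaled by the neighbourhood function: w_j += eta * f(j,k) * (x - w_j). The learning rate `eta`, the width `sigma`, the dictionary layout and the function names are assumptions.

```python
import math

def gaussian_neighbourhood(j, k, sigma):
    """f(j, k): 1 for the winner itself, decreasing with distance on the grid."""
    d2 = (j[0] - k[0]) ** 2 + (j[1] - k[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

def sofm_update(weights, x, eta=0.1, sigma=1.0):
    """One SOFM step. `weights` maps grid coordinates (row, col) to weight vectors."""
    # The winner is the node whose weight vector is closest to x in input space.
    winner = min(weights, key=lambda j: math.dist(weights[j], x))
    for j, w in weights.items():
        # Distance is measured on the grid, not in the input space.
        f = gaussian_neighbourhood(j, winner, sigma)
        weights[j] = [wi + eta * f * (xi - wi) for wi, xi in zip(w, x)]
    return winner
```

Setting `sigma` very small makes f close to 0 for every node except the winner, which reduces the rule to standard competitive learning. In practice `sigma` and `eta` are usually decreased over time.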

tau_ij(t) is the amount of pheromone on the trail from city i to city j (at time t - it decays over time). eta_ij is the inverse distance from city i to j. The transition probability is a trade-off between these two, weighted by the constants alpha and beta. The sum over C is just for normalization (to make this a probability, i.e. so that it sums to 1). C is the set of feasible cities of ant k (i.e. the remaining cities to visit).

When Ant Colony Optimization is used in the travelling salesman problem, the probability that ant k moves from city i to city j at time t can be defined as follows:
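The formula itself is missing from this card, but the answer describes its ingredients: a pheromone level, an inverse distance, two weighting constants and a normalizing sum over the feasible cities. The sketch below assumes the commonly used form in which the two constants act as exponents (called `alpha` and `beta` here, with the pheromone matrix called `tau`); the default values are illustrative only.

```python
def transition_probabilities(i, feasible, tau, dist, alpha=1.0, beta=2.0):
    """Probability of moving from city i to each city j in `feasible`.

    tau[i][j]  : pheromone on the trail from i to j
    dist[i][j] : distance from i to j (its inverse is the visibility)
    """
    scores = {
        j: (tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta)
        for j in feasible
    }
    total = sum(scores.values())  # normalization over the feasible cities
    return {j: s / total for j, s in scores.items()}
```

With equal pheromone everywhere, the choice is driven by distance alone; as pheromone accumulates on good trails, the first factor increasingly dominates.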

If all weights are equal from the start (0 is just a special case), some of them (not all of them) will remain equal forever. Specifically, all hidden-to-output weights to a given output node will remain equal and all input-to-hidden weights from a given input will remain equal. This in effect makes the hidden layer useless, since all hidden nodes will implement the same function. Almost all students claimed that all weights will remain 0, but this is not the case. When the first pattern is presented, all weighted sums will be 0 and all node outputs will therefore be 0.5. However, unless 0.5 happens to be the correct answer for an output, there will be an error (d-y) there, which means that the hidden-to-output layer weights will change. Though the output delta-values for the first pattern are propagated back to the hidden layer through 0 weights and therefore do not affect the hidden layer, the output-deltas for the following patterns will, since the hidden-to-output weights will then have changed.

When training a multilayer perceptron with the backpropagation algorithm, the network should first be initialized by randomizing the weights. Why would it not work to initialize them by setting them all to zero?
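The symmetry argument in the answer to this card can be observed directly with a tiny network. The sketch below is a plain-Python backprop loop with logistic nodes, squared error and pattern learning; the network size, learning rate, epoch count and the use of a bias weight on a constant input of 1 (instead of an explicit threshold) are assumptions for illustration.

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_from_constant_init(patterns, n_hidden=3, init=0.0, eta=0.5, epochs=200):
    """Train a one-output MLP whose weights all start at the same value `init`."""
    n_in = len(patterns[0][0])
    w_hid = [[init] * (n_in + 1) for _ in range(n_hidden)]  # last entry is the bias
    w_out = [init] * (n_hidden + 1)                         # last entry is the bias
    for _ in range(epochs):
        for x, d in patterns:
            xb = list(x) + [1.0]
            h = [logistic(sum(w * xi for w, xi in zip(row, xb))) for row in w_hid]
            hb = h + [1.0]
            y = logistic(sum(w * hi for w, hi in zip(w_out, hb)))
            # Deltas are computed before any weight is changed.
            delta_out = (d - y) * y * (1.0 - y)
            delta_hid = [h[j] * (1.0 - h[j]) * w_out[j] * delta_out
                         for j in range(n_hidden)]
            w_out = [w + eta * delta_out * hi for w, hi in zip(w_out, hb)]
            for j in range(n_hidden):
                w_hid[j] = [w + eta * delta_hid[j] * xi
                            for w, xi in zip(w_hid[j], xb)]
    return w_hid, w_out

# XOR as a small example task (chosen here for illustration).
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w_hid, w_out = train_from_constant_init(xor)
```

After training, every row of `w_hid` is identical and the hidden-to-output weights are identical too, yet the weights are no longer zero. All hidden nodes therefore compute the same function, which is the failure the answer describes.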

d) A family of methods to avoid overfitting/overtraining

Which of the below explanations best describes the concept called regularization? (Mark one) a) To normalize the input values b) To make approximations smoother by adding smoothness constraints to the objective function c) To make classification results more regular by removing outliers d) A family of methods to avoid overfitting/overtraining e) To restrict weight values within chosen bounds

The randomization of the initial weights, and the random order in which the patterns were presented to the networks (we used pattern learning). Similarity between some patterns, which makes them hard to separate, is not a source of non-determinism. No random variables are involved.

Which were the two sources of non-determinism when you trained a multilayer perceptron to recognize letters, on the introductory lab?

a. 1. There is less risk of getting stuck in local optima, since GA and PSO are population methods. 2. GA and PSO do not require the objective function (or any other function for that matter) to be differentiable, as back propagation and other gradient descent methods do. For example, multilayer perceptrons and backprop are often used in classification, but must then be trained to minimize another objective function - the squared error - than the one we usually want - the number of mis-classifications (which is not differentiable). b. In a neural network, knowledge is stored distributed over all weights. There is an infinite number of possible weight distributions which would implement the same function. Therefore, two networks that both perform well and have found similar functions are probably not at all similar internally, looking at the weights. Combining two such networks, as genetic algorithms do through crossover, is therefore very risky. Weights that were very good in network A may be completely useless when inserted in the same place in network B. PSO does not combine individuals, it moves them instead, and therefore does not have this problem. c. Elitism in evolutionary computing (e.g. genetic algorithms) is used to make sure that the algorithm does not destroy what it has learnt so far. Without elitism, there is a risk that the individual representing the best solution found so far does not survive, despite having the best fitness. Using elitism, the best individual(s) in the population is guaranteed to transfer, unaltered, to the next generation. All particles in PSO always remember their personal bests found so far. There is no potentially destructive operator (such as crossover) which could destroy this memory - the personal bests are only overwritten if an even better solution is found. The particles themselves may (and should) move around chaotically, but their stored memories are stable.

a. Genetic algorithms and particle swarm optimization can both be used to train neural networks. What are the two main advantages of training a neural network this way, compared to using a conventional algorithm such as back propagation? b. There is a property of neural networks, however, which makes it more difficult to set up a genetic algorithm for this particular application than to use particle swarm optimization. Which property is that and why is this a problem for genetic algorithms? c. Genetic algorithms often use a concept called Elitism. What is this, and why is it not needed (or already built in) in particle swarm optimization?

a.) The first S equation is for the output layer, the second (with the sum) for the hidden layer(s). b.) Only (dj-yj), in the equation for the output layer nodes, would change. The equation for the hidden layer is not affected (the hidden nodes are affected, indirectly, through the sum over k, but the equation as such is not). c.) lambda*yj(1-yj) would change in both equations where it occurs. This is the derivative of the logistic function, y=1/(1+e^(-lambda*s)), and lambda controls its slope.

a.) One of the S equations applies to the hidden layer(s), one applies to the output layer. Which is which? b.) One assumption made in these equations is that the objective function to be minimized is the squared error. Which part or parts would change if we wanted to change that, to minimize something else? c.) Both equations assume that the nodes are sigmoidal (to be more precise, that the activation functions are logistic). Which part or parts would change if we used another activation function for the nodes?

