Neural Networks and Distributed Information Processing
There is something missing, however.
As we have seen, the key to getting single units to represent Boolean functions such as NOT and OR lies in setting the weights and the threshold. But this raises some fundamental questions: How do the weights get set? How does the threshold get set? Is there any room for learning?
A few years later, in 1949, Donald Hebb published The Organization of Behavior, in which he speculated about how learning might take place in the brain.
His basic idea (the idea behind what we now call Hebbian learning) is that learning is at bottom an associative process. He famously wrote: "When an axon of a cell A is near enough to excite cell B or repeatedly or persistently takes part in firing it, some growth or metabolic change takes place in both cells such that A's efficiency, as one of the cells firing B, is increased." Hebbian learning proceeds by synaptic modification. If A is a presynaptic neuron and B a postsynaptic neuron, then every occasion on which B fires after A fires increases the probability that B will fire after A fires in the future (this is what Hebb means by an increase in A's efficiency).
This single application of the perceptron convergence rule is enough to turn our single-unit network with randomly chosen weight and threshold into a NOT-gate.
If we input a 1 into the network then the total input is $1 \times -0.6 = -0.6$, which is below the threshold. So the output signal is 0, as required. And if we input a 0 into the network then the total input is $0 \times -0.6 = 0$, which is above the threshold of -0.3. So the output signal is 1, as required. In both cases we have δ = 0 and so no further learning takes place. The network has converged on a solution.
There are Boolean functions besides the binary ones.
In fact, there are n-ary Boolean functions for every natural number n (including 0). But cognitive scientists are generally only interested in one non-binary Boolean function. This is the unary function NOT. As its name suggests, NOT A is true if A is false and NOT A is false if A is true. Again, this is easily represented by a single network unit, as illustrated in Figure 8.6. The trick is to set the weights and threshold to get the desired result.
The perceptron convergence rule cannot be applied to multilayer networks, however.
In order to apply the rule we need to know what the required output is for a given unit. This gives us the δ value (the error value), and without that value we cannot apply the rule. The problem is that there is no required output for hidden units. If we know what function we are trying to compute then we know what the required output is. But knowing the function does not tell us what any hidden units might be supposed to do. And even if we do know what the hidden units are supposed to be doing, adjusting the thresholds and weights of the hidden units according to the perceptron convergence rule would just throw our updating algorithm for the output unit completely out of step.
The situation after Minsky and Papert's critique of perceptrons was the following.
It was known (a) that any computable function could be computed by a multilayer network and (b) that single-layer networks could only compute linearly separable functions. The basic problem, however, was that the main interest of neural networks for cognitive scientists was that they could learn. And it was also the case that (c) the learning algorithms that were known applied only to single-layer networks. The great breakthrough came with the discovery of an algorithm for training multilayer networks.
The trick in getting a network to do this is to set the weights and the threshold appropriately.
Look at Figure 8.5. If we set the weights at 1 for both inputs and the threshold at 2, then the unit will only fire when both inputs are 1. If both inputs are 1 then the total input is $(I_1 \times W_1) + (I_2 \times W_2) = (1 \times 1) + (1 \times 1) = 2$, which is the threshold. Since the network is using a binary threshold activation function (as described in the previous paragraph), in this case the output will be 1. If either input is a 0 (or both are) then the threshold will not be met, and so the output is 0. If we take 1 to represent TRUE and 0 to represent FALSE, then this network represents the AND function. It functions as what computer scientists call an AND-gate.
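The AND-gate just described can be written out as a short sketch. The function names are illustrative only; the weights (1 and 1) and the threshold (2) are the ones given in the text, and the unit is assumed to fire when the total input reaches the threshold.

```python
def binary_threshold_unit(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    total_input = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total_input >= threshold else 0


def and_gate(a, b):
    # Weights of 1 on both inputs and a threshold of 2, as in the text.
    return binary_threshold_unit([a, b], weights=[1, 1], threshold=2)


# Print the output for each of the four possible input pairs.
for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, and_gate(*pair))
```

Only the pair (1, 1) produces a total input of 2, so it is the only pair for which the unit fires.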
Before giving an informal account of the learning algorithm we need to remind ourselves of some basic facts about how multilayer networks actually function.
Multilayer networks are organized into different layers. Each layer contains a number of units. The units in each layer are typically not connected to each other. All networks contain an input layer, an output layer, and a number (possibly 0) of what are called hidden layers. The hidden layers are so called because they are connected only to other network units. They are hidden from the "outside world."
Let's see what's going on here.
One obvious feature is that the two changes have opposite signs. Suppose δ is positive. This means that our network has undershot (because it means that the correct output is greater than the actual output). Since the actual output is weaker than required we can make two sorts of changes in order to close the gap between the required output and the actual output. We can decrease the threshold and we can increase the weights. This is exactly what the perceptron convergence rule tells us to do. We end up decreasing the threshold because when δ is positive, $-\varepsilon \times \delta$ is negative. And we end up increasing the weights, because $\varepsilon \times \delta \times I_i$ comes out positive when δ is positive (for any input $I_i$ that is itself positive).
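The two changes can be put side by side in a small sketch. It assumes the rule has the form suggested by the passage above: the threshold changes by $-\varepsilon \times \delta$ and each weight changes by $\varepsilon \times \delta \times I_i$, where δ is the required output minus the actual output. The function name and the sample numbers are hypothetical.

```python
def perceptron_update(weights, threshold, inputs, delta, epsilon):
    """Apply one step of the perceptron convergence rule (as sketched here).

    delta   -- required output minus actual output
    epsilon -- learning rate, a constant between 0 and 1
    """
    new_threshold = threshold - epsilon * delta
    new_weights = [w + epsilon * delta * i for w, i in zip(weights, inputs)]
    return new_weights, new_threshold


# The network undershot: required output 1, actual output 0, so delta = 1.
weights, threshold = perceptron_update(
    [0.25, -0.5], 0.5, inputs=[1, 0], delta=1, epsilon=0.5
)
print(weights, threshold)
```

With a positive δ the threshold goes down and the weight on the active input goes up. The weight on the input that was 0 is left alone, since that input contributed nothing to the total.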
Rosenblatt was looking for a learning rule that would allow a network with random weights and a random threshold to settle on a configuration of weights and thresholds that would allow it to solve a given problem.
Solving a given problem means producing the right output for every input.
Local learning algorithms are often used in networks that learn through unsupervised learning.
The backpropagation algorithm requires very detailed feedback, as well as a way of spreading an error signal back through the network. Competitive networks, in contrast, do not require any feedback at all. There is no fixed target for each output unit and there is no external teacher. What the network does is classify a set of inputs in such a way that each output unit fires in response to a particular set of input patterns.
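One common way of realizing this idea is a winner-take-all scheme, sketched below. The text does not spell out the details, so everything here is an assumption made for illustration: the output unit with the largest total input "wins," and only the winner's weights are nudged toward the current input pattern. No target output appears anywhere in the code.

```python
def winner(input_vector, weight_rows):
    """Index of the output unit with the largest total input."""
    totals = [sum(w * x for w, x in zip(row, input_vector)) for row in weight_rows]
    return totals.index(max(totals))


def competitive_step(input_vector, weight_rows, epsilon=0.5):
    """Move the winning unit's weights toward the input; leave the others alone."""
    k = winner(input_vector, weight_rows)
    weight_rows[k] = [w + epsilon * (x - w) for w, x in zip(weight_rows[k], input_vector)]
    return k


# Two output units, two input units (hypothetical starting weights).
weights = [[0.75, 0.25], [0.25, 0.75]]
first = competitive_step([1, 0], weights)
second = competitive_step([0, 1], weights)
print(first, second, weights)
```

After these two steps each output unit has become slightly more specialized for the pattern it won, which is the sense in which the network classifies its inputs without a teacher.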
Rosenblatt called his learning rule the perceptron convergence rule.
The perceptron convergence rule has some similarities with Hebbian learning. Like Hebbian learning it relies on the basic principle that changes in weight are determined solely by what happens locally - that is, by what happens at the input and what happens at the output. But, unlike Hebbian learning, it is a supervised algorithm - it requires feedback about the correct solution to the problem the network is trying to solve.
Many cognitive scientists at the time saw this proof as a death sentence for the research program of neural networks.
The problem does not seem too serious for binary Boolean functions. There are 16 binary Boolean functions and all but 2 are linearly separable. But things get worse when one starts to consider n-ary Boolean functions for n greater than 2. There are 256 ternary Boolean functions and only 104 are linearly separable. By the time we get to n = 4 we have a total of 65,536 quaternary Boolean functions, of which only 1,882 are linearly separable. Things get very much worse as n increases.
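The figure for binary functions can be checked by brute force. The sketch below tries a small grid of weights and thresholds (the grid is an assumption chosen for illustration, not something taken from the text) and records which of the 16 truth tables a single binary threshold unit can produce. A search of this kind can only show which functions are computable with the values tried; that the remaining two cannot be computed with any weights follows from the linear separability argument, not from the search.

```python
from itertools import product

# The four possible input pairs, in the order (0,0), (0,1), (1,0), (1,1).
INPUTS = list(product([0, 1], repeat=2))


def unit_truth_table(w1, w2, threshold):
    """Outputs of a single binary threshold unit for the four input pairs."""
    return tuple(1 if a * w1 + b * w2 >= threshold else 0 for a, b in INPUTS)


# Assumed search grid: a handful of weights and thresholds is enough here.
WEIGHTS = [-1, 0, 1]
THRESHOLDS = [-1.5, -0.5, 0.5, 1.5]

computable = {
    unit_truth_table(w1, w2, t)
    for w1, w2, t in product(WEIGHTS, WEIGHTS, THRESHOLDS)
}
all_functions = set(product([0, 1], repeat=4))
missing = sorted(all_functions - computable)

print(len(computable), "of", len(all_functions))
print(missing)
```

The two truth tables left over are (0, 1, 1, 0), which is XOR, and (1, 0, 0, 1), which is its negation.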
But much more significant are the problems posed by the training methods for artificial neural networks.
There is no evidence that anything like the backpropagation of error takes place in the brain. Researchers have failed to find any neural connections that transmit information about error. What makes backpropagation so powerful is that it allows for a form of "action at a distance." Units in the hidden layers have their weights changed as a function of what happens at the output units, which may be many layers away. Nothing like this is believed to occur in the brain.
I began this chapter by describing artificial neural networks as responding to a need for a neurally inspired approach to modeling information processing. But just how biologically plausible are neural networks?
This is a question to which computational neuroscientists and connectionist modelers have devoted considerable attention.
It is standard when constructing neural networks to specify a learning rate.
This is a constant number between 0 and 1 that determines how large the changes are on each trial. We can label the learning rate constant ε (epsilon). The perceptron convergence rule is a very simple function of δ and ε.
Once we understand how a single unit works it is straightforward to see how the whole network functions.
We can think of it as a series of n time steps where n is the number of layers (including the input, hidden, and output layers). In the first time step every unit in the input layer is activated. We can write this down as an ordered series of numbers - what mathematicians call a vector. At step 2 the network calculates the activation level of each unit in the first hidden layer, by the process described in the previous paragraph. This gives another vector. And so on until at step n the network has calculated the activation level of each unit in the output layer to give the output vector.
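The step-by-step picture can be sketched as a loop that turns one vector into the next. The weights, the threshold of 0.5, and the use of a simple on/off activation function are all assumptions made to keep the example small; any activation function could be passed in instead.

```python
def step(total_input, threshold=0.5):
    """A simple on/off activation function (assumed here for illustration)."""
    return 1.0 if total_input >= threshold else 0.0


def next_layer(previous, weights, activation=step):
    """Activation vector of a layer, given the previous layer's vector.

    weights[i][j] is the weight on the connection from unit j in the
    previous layer to unit i in this layer.
    """
    return [activation(sum(w * a for w, a in zip(row, previous))) for row in weights]


def forward(input_vector, weight_matrices):
    """Return one activation vector per layer, input layer first."""
    vectors = [list(input_vector)]
    for weights in weight_matrices:
        vectors.append(next_layer(vectors[-1], weights))
    return vectors


# A hypothetical network: 2 input units, 3 hidden units, 1 output unit.
hidden_weights = [[1.0, 1.0], [0.5, -0.5], [0.0, 1.0]]
output_weights = [[1.0, -1.0, 0.5]]
vectors = forward([1.0, 0.0], [hidden_weights, output_weights])
for layer_number, vector in enumerate(vectors, start=1):
    print("step", layer_number, vector)
```

With three layers there are three steps and three vectors, the last of which is the output vector.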
So, XOR fails to be linearly separable and is also not computable by a single-layer network.
You might wonder whether there is a general lesson here. In fact there is. The class of Boolean functions that can be computed by a single-unit network is precisely the class of linearly separable functions. This was proved by Marvin Minsky and Seymour Papert in a very influential book entitled Perceptrons that was published in 1969.
Information enters the network via the input layer.
Each unit in the input layer receives a certain degree of activation, which we can represent numerically. Each unit in the input layer is connected to each unit in the next layer. Each connection has a weight, again representable numerically. The most common neural networks are feedforward networks. As the name suggests, activation spreads forward through the network. There is no spread of activation between units in a given layer, or backwards from one layer to the previous layer.
No straight line separates the black dots from the white dots in the graph of XOR.
This means that XOR is not linearly separable. It turns out, moreover, that XOR cannot be represented by a single-layer network. This is easier to see if we represent XOR in a truth table. The table shows what the output is for each of the four different possible pairs of inputs - we can think of 1 as the TRUE input and 0 as the FALSE input.
In its simplest form Hebbian learning is an example of unsupervised learning,
since the association between neurons can be strengthened without any feedback. In slogan form, Hebbian learning is the principle that neurons that fire together, wire together. It has proved to be a very useful tool in modeling basic pattern recognition and pattern completion, as well as featuring in more complicated learning algorithms, such as the competitive learning algorithm.
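The slogan is often formalized by letting the weight between two units grow in proportion to the product of their activity. The sketch below uses that formalization; it is a common textbook rendering rather than anything stated in the passage above, and the learning rate of 0.25 is an arbitrary choice.

```python
def hebbian_update(weight, pre_activity, post_activity, epsilon=0.25):
    """Strengthen the connection only when both units are active together."""
    return weight + epsilon * pre_activity * post_activity


weight = 0.0
# (presynaptic activity, postsynaptic activity) on four successive occasions
for pre, post in [(1, 1), (1, 1), (1, 0), (0, 1)]:
    weight = hebbian_update(weight, pre, post)
print(weight)
```

The weight grows on the two occasions when both units fire and stays put otherwise. Notice that no error signal or target output is involved, which is why the rule counts as unsupervised.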
We see, then, how an individual unit in a network functions.
The next step is to see how units can be used to process information. This typically requires combining units into neural networks. But before looking at how that works it will be useful to think about a restricted class of neural networks, standardly called single-layer networks.
The simplest activation function is a linear function, on which the output signal increases in direct proportion to the total input. (Linear functions are so called because they take the form of a straight line when drawn on a graph.)
The threshold linear function is a slight modification of this. This function yields no output signal until the total input reaches the threshold, and then the strength of the output signal increases proportionately to the total input. There is also a binary threshold function, which effectively operates like an on/off switch. It yields either zero output (when the input signal is below threshold) or maximum output (when the input signal is at or above threshold).
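The three activation functions can be compared directly in a short sketch. The function names, the default threshold of 1.0, and the slope and maximum of 1.0 are all assumptions made for illustration.

```python
def linear(total_input, slope=1.0):
    """Output grows in direct proportion to the total input."""
    return slope * total_input


def threshold_linear(total_input, threshold=1.0, slope=1.0):
    """No output below the threshold; proportional to the input from there on."""
    return slope * total_input if total_input >= threshold else 0.0


def binary_threshold(total_input, threshold=1.0, maximum=1.0):
    """An on/off switch: zero below the threshold, maximum output otherwise."""
    return maximum if total_input >= threshold else 0.0


for total in [0.5, 1.0, 2.0]:
    print(total, linear(total), threshold_linear(total), binary_threshold(total))
```

For a total input of 0.5 only the linear function gives any output. For a total input of 2.0 the linear and threshold linear functions agree, while the binary threshold function stays at its maximum of 1.0.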
Neural networks are made up of individual units loosely based on biological neurons.
There are many different types of neuron in the nervous system, but they all share a common basic structure. Each neuron is a cell and so has a cell body (soma) containing a nucleus. There are many root-like extensions from the cell body. These are called neurites. There are two different types of neurite. Each neuron has many dendrites and a single axon. The dendrites are thinner than the axon and form what looks like a little bush. The axon itself eventually splits into a number of branches, each terminating in a little endbulb that comes close to the dendrites of another neuron.
Anybody who has taken a course in elementary logic will be familiar with an important class of mapping functions.
These functions all have the same range as our even number functions, namely the set consisting of the two truth values TRUE and FALSE. Using standard notation for sets we can write the range of the function as {TRUE, FALSE}. Instead of having numbers in the domain, however, the domain of these functions is made up of pairs of truth values. These functions, the so-called binary Boolean functions, take pairs of truth values as their inputs and deliver truth values as their outputs. They are called binary functions because the domain of the function consists of pairs (addition is also a binary function). They are called Boolean functions (after the nineteenth-century mathematician George Boole) because both the domain and range are built up from truth values.
We see, then, that individual neuron-like units can achieve a lot.
A single unit can represent some very basic Boolean functions. In fact, as any computer scientist knows, modern digital computers are in the last analysis no more than incredibly complicated systems of AND-gates, OR-gates, and NOT-gates. So, by chaining together individual network units into a network we can do anything that can be done by a digital computer. (This is why I earlier said that cognitive scientists are generally only interested in one non-binary Boolean function. AND, NOT, OR, and a little ingenuity are enough to simulate any n-ary Boolean function, no matter how complicated.)
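As a small illustration of chaining, the sketch below builds three gates out of single threshold units and then combines them to compute XOR. The AND weights and threshold are the ones used for the AND-gate in this chapter; the values chosen for OR and NOT, and the particular way of composing the gates, are assumptions that happen to work rather than the only possibilities.

```python
def unit(inputs, weights, threshold):
    """A single binary threshold unit."""
    return 1 if sum(i * w for i, w in zip(inputs, weights)) >= threshold else 0


def AND(a, b):
    return unit([a, b], [1, 1], 2)      # weights 1, 1 and threshold 2


def OR(a, b):
    return unit([a, b], [1, 1], 1)      # assumed values: fires if either input is 1


def NOT(a):
    return unit([a], [-1], -0.5)        # assumed values: fires only when the input is 0


def XOR(a, b):
    # "A or B, but not both", built purely from the three gates above.
    return AND(OR(a, b), NOT(AND(a, b)))


for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, XOR(*pair))
```

No single unit computes XOR here. The function emerges only from the way several units feed into one another.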
No clear distinction between information storage and information processing
According to the physical symbol system hypothesis all information processing is rule-governed symbol manipulation. If information is carried by symbolic formulas in the language of thought, for example, then information processing is a matter of transforming those formulas by rules that operate only on the formal features of the formulas. In the last analysis, information is carried by physical structures and the rules are rules for manipulating those symbol structures. This all depends upon the idea that we can distinguish within a cognitive system between the representations on which the rules operate and the rules themselves - just as, within a logical system such as the propositional or predicate calculus, we can distinguish between symbolic formulas and the rules that we use to build those symbolic formulas up into more complex formulas and to transform them.
Consider how AND might be computed according to the physical symbol system hypothesis. A system for computing AND might take as its basic alphabet the symbol "0" and the symbol "1." The inputs to the system would be pairs of symbols and the system would have built into it rules to ensure that when the input is a pair of "1"s then the system outputs a "1," while in all other cases it outputs a "0." What might such a rule look like? Well, we might think about the system along the lines of a Turing machine (as illustrated in section 1.2). In this case the inputs would be symbols written on two squares of a tape. Assume that the head starts just to the left of the input squares. The following program will work.
Step 1 Move one square R.
Step 2 If square contains "1" then delete it, move one square R and go to Step 6.
Step 3 If square contains "0" then delete it, move one square R and go to Step 4.
Step 4 Delete what is in square and write "0."
Step 5 Stop.
Step 6 If square contains "0" then stop.
Step 7 If square contains "1" then stop.
The tape ends up with a "1" on it only when the tape started out with two "1"s on it. If the tape starts out with one or more "0"s on it then it will stop with a "0." The final state of the tape is reached by transforming the initial symbol structure by formal rules, exactly as required by the physical symbol system hypothesis. And the rules are completely distinct from the symbols on which they operate.
Distributed representations
According to the physical symbol system hypothesis, representations are distinct and identifiable components in a cognitive system. If we examine a cognitive system from the outside, as it were, it will be possible to identify the representations. This is because physical symbol structures are clearly identifiable objects. If the information a physical symbol carries is complex, then the symbol is itself complex. In fact, as emerges very clearly in the language of thought hypothesis, the structure and shape of the physical symbol structure is directly correlated with the structure and shape of the information it is carrying.
This need not be true in artificial neural networks. There are some networks for which it holds. These are called localist networks. What distinguishes localist networks is that each unit codes for a specific feature in the input data. We might think of the individual units as analogs of concepts. They are activated when the input has the feature that the unit encodes. The individual units work as simple feature-detectors. There are many interesting things that can be done with localist networks. But the artificial neural networks that researchers have tended to find most exciting have typically been distributed networks rather than localist ones. Certainly, all the networks that we have looked at in this chapter have been distributed.
The information that a distributed network carries is not located in any specific place. Or rather, it is distributed across many specific places. A network stores information in its pattern of weights. It is the particular pattern of weights in the network that determines what output it produces in response to particular inputs. A network learns by adjusting its weights until it settles into a particular configuration - hopefully the configuration that produces the right output! The upshot of the learning algorithm is that the network's "knowledge" is distributed across the relative strengths of the connections between different units.
Similarly, OR is the name of the Boolean function that maps the pair {FALSE, FALSE} to FALSE, and the other three pairs to TRUE.
Alternatively, if you are given sentences A and B then the only circumstance in which it is false to claim A OR B is the circumstance in which both A and B have the value FALSE.
In short, there are many ways of developing the basic insights in neural network models that are more biologically plausible than standard feedforward networks that require detailed feedback and a mechanism for the backpropagation of error.
And in any case, the question of whether a given artificial neural network is biologically plausible needs to be considered in the context of whether it is a good model. Neural network models should be judged by the same criteria as other mathematical models. In particular, the results of the network need to mesh reasonably closely with what is known about the large-scale behavior of the cognitive ability being modeled. So, for example, if what is being modeled is the ability to master some linguistic rule (such as the rule governing the formation of the past tense), one would expect a good model to display a learning profile similar to that generally seen in the average language learner. In the next chapter we will look at two examples of models that do seem very promising in this regard. First, though, we need to make explicit some of the general features of the neural network approach to information processing.
We have seen how our single-layer networks can function as AND-gates, OR-gates, and NOT-gates.
And we have also seen an example of how the perceptron convergence rule can be used to train a network with a randomly assigned weight and a randomly assigned threshold to function as a NOT-gate. It turns out that these functions share a common property and that that common property is shared by every function that a single-layer network can be trained to compute. This gives us a very straightforward way of classifying what networks can learn to do via the perceptron convergence rule.
Hebb was speculating about real neurons, not artificial ones.
And, although there is strong evidence that Hebbian learning does take place in the nervous system, the first significant research on learning in artificial neural networks modified the Hebbian model very significantly. In the 1950s Frank Rosenblatt studied learning in single-layer networks. In an influential article in 1958 he called these networks perceptrons.
It is important that the OR function assigns TRUE to the pair {TRUE, TRUE}, so that A OR B is true in the case where both A and B are true.
As we shall see, there is a Boolean function that behaves just like OR, except that it assigns FALSE to {TRUE, TRUE}. This is the so-called XOR function (an abbreviation of exclusive-OR). XOR cannot be represented by a single-layer network.
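The relation between OR and XOR is easy to see if each function is written out as a mapping from pairs of truth values to truth values. The sketch below does this with Python dictionaries; the variable names are illustrative only.

```python
from itertools import product

# The four possible pairs of truth values.
PAIRS = list(product([True, False], repeat=2))

# Each binary Boolean function is a mapping from pairs to truth values.
AND = {(a, b): a and b for a, b in PAIRS}
OR = {(a, b): a or b for a, b in PAIRS}
XOR = {(a, b): a != b for a, b in PAIRS}

# Find every pair on which OR and XOR disagree.
differences = [pair for pair in PAIRS if OR[pair] != XOR[pair]]
print(differences)
```

The only pair on which the two functions disagree is (True, True), which OR maps to TRUE and XOR maps to FALSE.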
As we see in the figure, our sample unit $u_i$ integrates the activation it receives from all the units in the earlier layer to which it is connected.
Assume that there are n units connected to $u_i$. Multiplying the activation of each by the appropriate weight and adding the resulting numbers all together gives the total input to the unit, which we can write as total input($i$). If we represent the activation of each unit $u_j$ by $a_j$, then we can write down this sum as $\sum_{j=1}^{n} W_{ij} a_j$. We then apply the activation function to the total input. This will determine the unit's activity level, which we can write down as $a_i$. In the figure the activation function is a sigmoid function. This means that $a_i$ is low when total input($i$) is below the threshold. Once the threshold is reached, $a_i$ increases more or less proportionally to total input. It then levels out once the unit's ceiling is reached.
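A single unit of this kind can be sketched in a few lines. The weighted sum follows the formula in the text. The sigmoid is taken to be the logistic function centered on the threshold, which is one common choice but an assumption here, as are the parameter names and the sample weights and activations.

```python
import math


def total_input(weights, activations):
    """Weighted sum of the activations arriving at the unit."""
    return sum(w * a for w, a in zip(weights, activations))


def sigmoid_activation(total, threshold=0.0, gain=1.0):
    """Logistic sigmoid: low below the threshold, levelling out near 1 above it."""
    return 1.0 / (1.0 + math.exp(-gain * (total - threshold)))


# Three hypothetical units feeding into the sample unit.
weights = [0.5, -1.0, 2.0]
activations = [1.0, 0.5, 0.25]
total = total_input(weights, activations)
print(total, sigmoid_activation(total, threshold=0.5))

# The S-shape: low, then rising, then flat.
for t in [-10.0, 0.0, 0.5, 1.0, 10.0]:
    print(t, round(sigmoid_activation(t, threshold=0.5), 4))
```

The printed values stay close to 0 well below the threshold, climb steadily around it, and flatten out close to 1, which plays the role of the unit's ceiling.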
Multilayered networks can compute any computable function - not just the linearly separable ones.
But what stopped researchers in their tracks in 1969 was the fact that they had no idea how to train multilayered networks. The reason that so much weight was placed on single-layer networks was that there were rules for training those networks to converge on patterns of weights and thresholds that would compute certain functions - the best known of those rules being the perceptron convergence rule explained above. Single-layer networks do not have to be completely programmed in advance. They can learn.
This tells us what the network is doing from a mathematical point of view.
But what the network is doing from an information-processing point of view depends on how we interpret the input and output units. In section 3.3 we looked at a network designed to distinguish between sonar echoes from rocks and sonar echoes from mines. The activation levels of the input units represent the energy levels of a sonar echo at different frequencies, while the activation levels of the two output units represent the network's "confidence" that it is encountering a rock or a mine. In the previous section we looked at a network computing the Boolean XOR function. Here the inputs and outputs represent truth values. In the next chapter we will look at other examples of neural networks. In order to appreciate what all these networks are doing, however, we need to understand how they are trained. This takes us back to Paul Werbos's learning algorithm.
Moreover, most neural networks are supervised networks and only learn because they are given detailed information about the extent of the error at each output unit.
But very little biological learning seems to involve this sort of detailed feedback. Feedback in learning is typically diffuse and relatively unfocused. The feedback might simply be the presence (or absence) of a reward - a long way away from the precise calibration of degree of error required to train artificial neural networks.
It is easiest to see what this property is if we use a graph to visualize the "space" of possible inputs into one of the gates.
Figure 8.8 shows how to do this for two functions. The function on the left is the AND function. On the graph a black dot is used to mark the inputs for which the AND-gate outputs a 1, and a white dot marks the inputs that get a 0. There are four possible inputs and, as expected, only one black dot (corresponding to the case where both inputs have the value TRUE). The graph for AND shows that we can use a straight line to separate out the inputs that receive the value 1 from the inputs that receive the value 0. Functions that have this property are said to be linearly separable.
It is important to keep these arguments in perspective, however.
For one thing, the backpropagation of error is not the only learning algorithm. There are others that are much more biologically plausible. Computational neuroscientists and connectionist modelers have a number of learning algorithms that are much more realistic than the backpropagation algorithm. These algorithms tend to be what are known as local algorithms.
An example may make things clearer.
Let's consider the very simple single-layer network depicted in Figure 8.7. This network only takes one input and so we only have one weight to worry about. We can take the starting weight to be -0.6 and the threshold to be 0.2. Let's set our learning constant at 0.5 and use the perceptron convergence rule to train this network to function as a NOT-gate.
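The training run can be followed step by step in a short sketch. The starting weight, threshold, and learning constant are the ones given in the text. Everything else is an assumption made for illustration: the unit is taken to fire when the total input reaches the threshold, the two training inputs are presented in the order 1 then 0, and the rule is taken to change the threshold by $-\varepsilon \times \delta$ and the weight by $\varepsilon \times \delta \times I$.

```python
def fires(total_input, threshold):
    """Binary threshold activation: 1 once the threshold is reached, else 0."""
    return 1 if total_input >= threshold else 0


def train_not_gate(weight=-0.6, threshold=0.2, epsilon=0.5, max_epochs=20):
    """Train a one-input unit as a NOT-gate with the perceptron convergence rule."""
    examples = [(1, 0), (0, 1)]            # (input, required output) for NOT
    for _ in range(max_epochs):
        errors = 0
        for x, required in examples:
            actual = fires(x * weight, threshold)
            delta = required - actual       # the error value
            if delta != 0:
                errors += 1
                threshold -= epsilon * delta
                weight += epsilon * delta * x
        if errors == 0:                     # every input gives the right output
            break
    return weight, threshold


weight, threshold = train_not_gate()
print(round(weight, 2), round(threshold, 2))
print(fires(1 * weight, threshold), fires(0 * weight, threshold))
```

Under these assumptions the only error occurs on the input 0, and a single correction to the threshold is enough. The weight is untouched because the input on that trial was 0.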
The ability to learn from "experience"
Of course, talk of neural networks learning from experience should not be taken too seriously. Neural networks do not experience anything. They just receive different types of input. But the important point is that they are not fixed in how they respond to inputs. This is because they can change their weights. We have looked at several different ways in which this can take place - at several different forms of learning algorithm. Supervised learning algorithms, such as the backpropagation algorithm, change the weights in direct response to explicit feedback about how the network's actual output diverges from intended output. But networks can also engage in unsupervised learning (as we saw when we looked briefly at competitive networks). Here the network imposes its own order on the inputs it receives, typically by means of a local learning algorithm, such as some form of Hebbian learning.
This capacity to learn makes neural networks a powerful tool for modeling cognitive abilities that develop and evolve over time. We will look at examples of how this can be done in the next chapter.
We can start with a little history. The discovery of neural networks is standardly credited to the publication in 1943 of a pathbreaking paper by Warren McCulloch and Walter Pitts entitled "A logical calculus of the ideas immanent in nervous activity."
One of the things that McCulloch and Pitts did in that paper was propose that any digital computer can be simulated by a network built up from single-unit networks similar to those discussed in the previous section. They were working with fixed networks. Their networks had fixed weights and fixed thresholds and they did not explore the possibility of changing those weights through learning.
There are certainly some obvious and striking dissimilarities at many different levels between neural networks and the brain.
So, for example, whereas neural network units are all homogeneous, there are many different types of neuron in the brain - twelve different types in the neocortex alone. And brains are nowhere near as massively parallel as typical neural networks. Each cortical neuron is connected to a roughly constant number of neurons (approximately 3 percent of the neurons in the surrounding square millimeter of cortex). Moreover, the scale of connectionist networks seems wrong. The cortical column is an important level of neural organization. Each cortical column consists of a population of highly interconnected neurons with similar response properties. A single cortical column cuts vertically across a range of horizontal layers (laminae) and can contain as many as 200,000 neurons - whereas even the most complicated artificial neural networks rarely have more than 5,000 units. This "scaling up" from artificial neural networks to cortical columns is likely to bring a range of further disanalogies in its wake. In particular, genuine neural systems will work on data that are far less circumscribed than the inputs to artificial neural networks.
The learning in this case is supervised learning.
So, whenever the network produces the wrong output for a given input, this means that there is something wrong with the weights and/or the threshold. The process of learning (for a neural network) is the process of changing the weights in response to error. Learning is successful when these changes in the weights and/or the threshold converge upon a configuration that always produces the desired output for a given input.
It is easier to see what is going on if you think of a binary Boolean function as a way of showing how the truth value of a complex sentence is determined by the truth values of the individual sentences from which it is built up.
Some of the Boolean functions should be very familiar. There is a binary Boolean function standardly known as AND, for example. AND maps the pair {TRUE, TRUE} to TRUE and maps all other pairs of truth values to FALSE. To put it another way, if you are given a sentence A and a sentence B, then the only circumstance in which it is true to claim A AND B is the circumstance in which both A and B have the value TRUE.
But the real problems come with the type of learning that artificial neural networks can do.
Some of these are practical. As we have seen, artificial neural networks learn by modifying connection weights and even in relatively simple networks this requires hundreds and thousands of training cycles. It is not clear how much weight to attach to this. After all, the principal reason why training a network takes so long is that networks tend to start with a random assignment of weights and this is not something one would expect to find in a well-designed brain.
Information processing in multilayer networks is really a scaled-up version of information processing in single-unit networks.
The activation at a given input unit is transmitted to all of the units to which it is connected in the next layer. The exact quantity of activation transmitted by each unit in the input layer depends upon the weight of the connection. The total input to a given unit in the first hidden layer is determined exactly as in the single-unit case. It is the sum of all the quantities of activation that reach it. If the total input to the unit reaches the threshold then the unit fires (i.e. transmits its own activation). The amount of activation that each unit transmits is given by its activation function.
The backpropagation algorithm solves this problem by finding a way of calculating the error in the activation level of a given hidden unit even though there is no explicit target activation level for that unit.
The basic idea is that each hidden unit connected to an output unit bears a degree of "responsibility" for the error of that output unit. If, for example, the activation level of an output unit is too low, then this can only be because insufficient activation has spread from the hidden units to which it is connected. This gives us a way of assigning error to each hidden unit. In essence, the error level of a hidden unit is a function of the extent to which it contributes to the error of the output unit to which it is connected. Once this degree of responsibility, and consequent error level, is assigned to a hidden unit, it then becomes possible to modify the weights between that unit and the output unit to decrease the error.
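The idea of "responsibility" can be made concrete with a small sketch. The text gives only an informal account, so the formulas below are an assumption: they follow the standard textbook version of backpropagation for units with a logistic sigmoid activation function, in which a hidden unit's error is the weighted sum of the output errors it feeds into, scaled by the slope of its own activation function. The function names and sample numbers are hypothetical.

```python
def output_delta(required, actual):
    """Error signal at a sigmoid output unit."""
    return (required - actual) * actual * (1 - actual)


def hidden_deltas(hidden_activations, output_deltas, output_weights):
    """Share out the output errors among the hidden units.

    output_weights[k][j] is the weight from hidden unit j to output unit k.
    """
    deltas = []
    for j, a_j in enumerate(hidden_activations):
        # How much this hidden unit contributed to each output unit's error.
        responsibility = sum(
            output_weights[k][j] * d_k for k, d_k in enumerate(output_deltas)
        )
        deltas.append(a_j * (1 - a_j) * responsibility)
    return deltas


# One output unit whose activation (0.5) is lower than required (1.0),
# fed by two hidden units through weights 1.0 and -2.0.
d_out = output_delta(required=1.0, actual=0.5)
d_hidden = hidden_deltas([0.5, 0.5], [d_out], [[1.0, -2.0]])
print(d_out, d_hidden)
```

The hidden unit with the positive weight receives a positive error (it should have been more active), while the one with the negative weight receives a negative error (it should have been less active). Its larger error reflects its stronger connection to the output unit.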
What has this got to do with neural networks?
The connection is that the network units that we looked at in section 8.1 can be used to represent some of the binary Boolean functions. The first step is to represent Boolean functions using numbers (since we need numbers as inputs and outputs for the arithmetic of the activation function to work). This is easy. We can represent TRUE by the number 1 and FALSE by 0, as is standard in logic and computer science. If we design our network unit so that it only takes 1 and 0 as inputs and only produces 1 and 0 as outputs then it will be computing a Boolean function. If it has two inputs then it will be computing a binary Boolean function. If it has three inputs, a ternary Boolean function. And so on.
Paul Werbos is one of the great unsung heroes of cognitive science.
The dissertation he submitted at Harvard University in 1974 for his PhD degree contained what is generally thought to be the earliest description of a learning algorithm for multilayer networks. Unfortunately, as with most PhD theses, it languished unread for many years. Werbos published an extended version of the dissertation in 1994, but (as discussed in section 3.3) the start of neural network research in cognitive science is generally credited to the publication in 1986 of a very influential two-volume collection of papers edited by Jay McClelland and David Rumelhart and entitled Parallel Distributed Processing: Explorations in the Microstructure of Cognition. The papers in the collection showed what could be done by training multilayer neural networks. It was the start of a new way of thinking about information processing in cognitive science.
It is easy to see how we design our network unit to take only 0 and 1 as input. But how do we design it to produce only 0 and 1 as output?
The key is to use a binary threshold activation function. As we saw in Figure 8.3, a binary threshold activation function outputs 0 until the threshold is reached. Once the threshold is reached it outputs 1, irrespective of how the input increases. What we need to do, therefore, if we want to represent a particular Boolean function, is to set the weights and the threshold in such a way that the network mimics the truth table for that Boolean function. A network that represents AND, for example, will have to output a 0 whenever the input is either (0, 0), (0, 1), or (1, 0). And it will have to output a 1 whenever the input is (1, 1).
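As a minimal sketch of this idea, the Python below builds a single binary threshold unit and checks it against the truth table for AND. The weights (1, 1) and the threshold 1.5 are illustrative assumptions rather than values taken from the text, and many other settings would work equally well. The unit is assumed to fire as soon as the total input reaches the threshold.

```python
from itertools import product

def threshold_unit(inputs, weights, threshold):
    """Binary threshold unit: output 1 if the weighted sum reaches the threshold, else 0."""
    total_input = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total_input >= threshold else 0

# Hypothetical settings chosen for illustration only.
AND_WEIGHTS, AND_THRESHOLD = (1.0, 1.0), 1.5

# Run the unit on all four possible input pairs.
and_table = {pair: threshold_unit(pair, AND_WEIGHTS, AND_THRESHOLD)
             for pair in product((0, 1), repeat=2)}

for pair, output in and_table.items():
    print(pair, "->", output)
```

Lowering the threshold to 0.5 while keeping the same weights would turn the same unit into an OR-gate, which is one way of seeing that the weights and the threshold together fix which Boolean function is represented.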
Werbos called his algorithm the backpropagation algorithm .
The name has stuck and it is very revealing. The basic idea is that error is propagated backwards through the network from the output units to the hidden units. Recall the basic problem for training multilayer networks. We know what the target activation levels are for the output units. We know, for example, that a network computing XOR should output 0 when the inputs are both 1. And we know that a mine/rock detector should output (1, 0) when its inputs correspond to a mine and (0, 1) when its inputs correspond to a rock. Given this we can calculate the degree of error in a given output unit. But since we don't know what the target activation levels are for the hidden units we have no way of calculating the degree of error in a given hidden unit. And that seems to mean that we have no way of knowing how to adjust the weights of connections to hidden units.
There is no comparable distinction between rules and representations in artificial neural networks.
The only rules are those governing the spread of activation values forwards through the network and those governing how weights adjust. Look again at the network computing XOR and think about how it works. If we input two 1s into the network (corresponding to a pair of propositions, both of which are true), then the information processing in the network proceeds in two basic stages. In the first stage activation spreads from the input layer to the hidden layer and both hidden units fire. In the second stage activation spreads from the hidden units to the output unit and the output unit responds (with 0, the value of XOR when both inputs are 1). The only rules that are exploited are, first, the rule for calculating the total input to a unit and, second, the rule that determines whether a unit will fire for a given total input (i.e. the activation function). But these are exactly the same rules that would be activated if the network were computing AND or OR. These "updating rules" apply to all feedforward networks of this type. What distinguishes the networks are their different patterns of weights. But a pattern of weights is not a rule, or an algorithm of any kind. Rather a particular pattern of weights is what results from the application of one rule (the learning algorithm). And it is one of the inputs into another rule (the updating algorithm).
The presence of hidden units is what allows the network in Figure 8.9 to compute the XOR function.
The problem for a single-unit network trying to compute XOR is that it can only assign one weight to each input. This is why a network that outputs 1 when the first input is 1 and outputs 1 when the second input is 1 has to output 1 when both inputs are 1. This problem goes away when a network has hidden units. Each input now has its own unit and each input unit is connected to two different hidden units. This means that two different weights can now be assigned to each input.
The perceptron convergence rule allows learning by reducing error.
The starting point is that we (as the supervisors of the network) know what the correct solution to the problem is, since we know what mapping function we are trying to train the network to compute. This allows us to measure the discrepancy between the output that the network actually produces and the output that it is supposed to produce. We can label that discrepancy δ (small delta). It will be a number: the number reached by subtracting the actual output from the correct output. So: δ = INTENDED OUTPUT - ACTUAL OUTPUT
Suppose that we input a 1 into this network (where, as before, 1 represents TRUE and 0 represents FALSE).
The total input is 1 × -0.6 = -0.6. This is below the threshold of 0.2 and so the output signal is 0. Since this is the desired output we have δ = 0 and so no learning takes place (since ΔT = -ε × δ = -0.5 × 0 = 0, and ΔW also comes out as 0). But if we input a 0 then we get a total input of 0 × -0.6 = 0. Since this is also below the threshold the output signal is 0. But this is not the desired output, which is 1. So we can calculate δ = 1 - 0 = 1. This gives ΔT = -0.5 × 1 = -0.5 and ΔW = 0.5 × 1 × 0 = 0. This changes the threshold (to -0.3) and leaves the weight unchanged.
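The arithmetic of this example can be retraced in a few lines of Python. This is only a sketch: it uses the starting weight of -0.6, the starting threshold of 0.2, and ε = 0.5 from the example, and it assumes that the unit fires when the total input reaches the threshold. The function names are hypothetical.

```python
def output(inp, weight, threshold):
    """Binary threshold unit with one input: fire (1) when the total input reaches the threshold."""
    return 1 if inp * weight >= threshold else 0

def apply_rule(inp, target, weight, threshold, epsilon=0.5):
    """One application of the perceptron convergence rule; returns the updated (weight, threshold)."""
    delta = target - output(inp, weight, threshold)   # intended output minus actual output
    new_threshold = threshold + (-epsilon * delta)    # change in T is -epsilon * delta
    new_weight = weight + (epsilon * delta * inp)     # change in W is epsilon * delta * input
    return new_weight, new_threshold

weight, threshold = -0.6, 0.2      # starting values from the example
not_targets = {1: 0, 0: 1}         # NOT maps 1 to 0 and 0 to 1

for inp, target in not_targets.items():
    weight, threshold = apply_rule(inp, target, weight, threshold)
    print(f"after input {inp}: weight = {weight:.1f}, threshold = {threshold:.1f}")

# Check the behaviour of the unit with its final weight and threshold.
final_outputs = {inp: output(inp, weight, threshold) for inp in not_targets}
print(final_outputs)
```

Only the second input triggers an adjustment, and the adjustment falls entirely on the threshold, because the input value 0 makes the weight change vanish.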
We can represent these functions using what logicians call a truth table.
The truth table for AND tells us how the truth value of A AND B varies according to the truth value of A and B respectively (or, as a logician would say, as a function of the truth values of A and B). A AND B is TRUE when A and B are both TRUE, and FALSE in the other three cases. This truth table should come as no surprise. It just formalizes how we use the English word "and."
The process is illustrated in Figure 8.10, which shows the operation of a sample hidden unit in a simple network with only one layer of hidden units. (Note that the diagram follows the rather confusing notation standard in the neural network literature.)
The usual practice is to label a particular unit with the subscript i. So we write the name of the unit as u_i. If we want to talk about an arbitrary unit from an earlier layer connected to u_i, we label that earlier unit with the subscript j and write the name of the unit as u_j. Just to make things as difficult as possible, when we label the weight of the connection from u_j to u_i we use the subscript ij, with the label of the later unit coming first. So, W_ij is the weight of the connection that runs from u_j to u_i.
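One way to keep the subscript order straight is to store the weights in a table whose first index is the later unit and whose second index is the earlier unit. The sketch below assumes this layout; the activation values and weights are made up for illustration, and indices start at 0 because that is how Python counts.

```python
# Activations of three units u_j in the earlier layer (illustrative values).
earlier_activations = [1.0, 0.0, 1.0]

# weights[i][j] plays the role of W_ij: the weight on the connection
# FROM earlier unit u_j TO later unit u_i. The numbers are made up.
weights = [
    [0.5, -0.2, 0.1],   # connections into later unit u_0
    [-0.4, 0.3, 0.8],   # connections into later unit u_1
]

def total_input(i):
    """Total input to later unit u_i: the sum over j of W_ij times the activation of u_j."""
    return sum(weights[i][j] * a_j for j, a_j in enumerate(earlier_activations))

totals = [total_input(i) for i in range(len(weights))]
print(totals)
```

Reading `weights[i][j]` as "into i, from j" mirrors the convention that the label of the later unit comes first.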
As one might imagine, competitive networks are particularly good at classification tasks, which require detecting similarities between different input patterns.
They have been used, for example, to model visual pattern recognition. One of the amazing properties of the visual system is its ability to recognize the same object from many different angles and perspectives. There are several competitive network models of this type of position-invariant object recognition, including the VisNet model of visual processing developed by Edmund Rolls and T. T. Milward. VisNet is designed to reproduce the flow of information through the early visual system (as sketched in section 3.2). It has different layers intended to correspond to the stages from area V1 to the inferior temporal cortex. Each layer is itself a competitive network, learning by a version of the Hebbian rule.
In local learning algorithms (as their name suggests) an individual unit's weight changes directly as a function of the inputs to and outputs from that unit.
Thinking about it in terms of neurons, the information for changing the weight of a synaptic connection is directly available to the presynaptic axon and the postsynaptic dendrite. The Hebbian learning rule that we briefly looked at earlier is an example of a local learning rule. Neural network modelers think of it as much more biologically plausible than the backpropagation rule.
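To see what "local" means in practice, here is a minimal sketch of a Hebbian-style update. It assumes one common textbook formulation, in which the weight change is the product of a small constant, the presynaptic activity, and the postsynaptic activity. The function name, the constant 0.1, and the starting weight are all illustrative assumptions.

```python
def hebbian_update(weight, pre, post, epsilon=0.1):
    """Strengthen the connection in proportion to joint pre- and postsynaptic activity.

    Only quantities available at the connection itself are used:
    the current weight, the presynaptic activity, and the postsynaptic activity.
    """
    return weight + epsilon * pre * post

w = 0.2
w_after_joint_firing = hebbian_update(w, pre=1, post=1)   # both units active: weight grows
w_after_pre_only = hebbian_update(w, pre=1, post=0)       # only the presynaptic unit active: no change
print(w_after_joint_firing, w_after_pre_only)
```

Nothing in the update refers to a target output or to error elsewhere in the network, which is the contrast with backpropagation.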
You may have been struck by the following thought. Earlier in this section I said that any Boolean function, no matter how complicated, could be computed by a combination of AND-gates, OR-gates, and NOT-gates.
This applies both to those Boolean functions that are linearly separable and to those that are not. So, why does it matter that single-layer networks cannot compute Boolean functions that are not linearly separable? Surely we can just put together a suitable network of AND-gates, OR-gates, and NOT-gates in order to compute XOR, or any other Boolean function that fails to be linearly separable. So why did researchers react so strongly to the discovery that single-unit networks can only compute linearly separable Boolean functions?
It should not take long to see, however, that the function on the right is not linearly separable.
This is the exclusive-OR function (standardly written as XOR). The OR function that we have been looking at up to now has the value TRUE except when both inputs have the value FALSE. So, A OR B has the value TRUE even when both A and B have the value TRUE. This is not how the word "or" often works in English. If I am offered a choice between A or B it often means that I have to choose one, but not both. This way of thinking about "or" is captured by the function XOR. A XOR B has the value TRUE only when exactly one of A and B has the value TRUE.
The key to making this work is that there are inhibitory connections between the output units.
This is very much in contrast to standard feedforward networks, where there are typically no connections between units in a single layer. The point of these inhibitory connections is that they allow the output units to compete with each other. Each output unit inhibits the other output units in proportion to its firing rate. So, the unit that fires the most will win the competition. Only the winning unit is "rewarded" (by having its weights increased). This increase in weights makes it more likely to win the competition when the input is similar. The end result is that each output unit ends up firing in response to a set of similar inputs.
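The following sketch shows the winner-take-all idea in miniature. It makes two simplifying assumptions that are not in the text. First, the competition is settled by simply picking the output unit with the largest total input, rather than by simulating the inhibitory connections. Second, the winner is "rewarded" by moving its weights part of the way toward the current input pattern, which is one standard way of implementing competitive learning. All numbers are made up.

```python
def winner(weights, pattern):
    """Index of the output unit with the largest total input (stands in for the inhibitory competition)."""
    totals = [sum(w * x for w, x in zip(unit_weights, pattern)) for unit_weights in weights]
    return totals.index(max(totals))

def competitive_step(weights, pattern, epsilon=0.5):
    """Reward only the winning unit by moving its weights toward the input pattern."""
    k = winner(weights, pattern)
    weights[k] = [w + epsilon * (x - w) for w, x in zip(weights[k], pattern)]
    return k

# Two output units with made-up starting weights over four input units.
weights = [[0.6, 0.4, 0.3, 0.2], [0.2, 0.3, 0.4, 0.6]]

# Four input patterns: the first two resemble each other, as do the last two.
patterns = [[1, 1, 0, 0], [1, 0.8, 0, 0], [0, 0, 1, 1], [0, 0, 0.8, 1]]

for _ in range(5):
    for p in patterns:
        competitive_step(weights, p)

# Which output unit responds to each pattern after learning?
assignments = [winner(weights, p) for p in patterns]
print(assignments)
```

No target outputs are supplied anywhere, so this is a case of unsupervised learning: the grouping of similar patterns emerges from the competition alone.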
Suppose, for example, that we are trying to produce a network that functions as an AND-gate.
This means that, when the inputs each have value 1, the desired output is 1 (since A AND B is true in the case where A is true and B is true). If the output that the network actually produces is 0, then δ = 1. If, in contrast, the inputs each have value 0 and the actual output is 1, then δ = -1.
This is a very good question. It is indeed not too hard to construct a network that will compute XOR.
This had been known for a long time before Minsky and Papert published their critique of Rosenblatt's perceptrons, at least as far back as the 1943 article by McCulloch and Pitts. Figure 8.9 shows a network that will do the job. This network is what is known as a multilayer network. Up to now we have been looking at single-layer networks. The units in single-layer networks receive inputs directly. Multilayer networks, in contrast, contain units that only receive inputs indirectly. These are known as hidden units. The only inputs they can receive are outputs from other units.
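A small multilayer network for XOR can be sketched as follows. The weights and thresholds here are hypothetical choices made for illustration and are not claimed to be those of Figure 8.9. One hidden unit behaves like an OR-gate, the other like an AND-gate, and the output unit fires when the first hidden unit fires and the second does not. Units are assumed to fire when the total input reaches the threshold.

```python
def fires(total_input, threshold):
    """Binary threshold activation: 1 once the threshold is reached, otherwise 0."""
    return 1 if total_input >= threshold else 0

def xor_network(a, b):
    """Two-stage feedforward pass: inputs -> two hidden units -> one output unit."""
    # Stage 1: the hidden units receive the inputs (hypothetical weights and thresholds).
    h_or = fires(1.0 * a + 1.0 * b, 0.5)     # fires if at least one input is 1
    h_and = fires(1.0 * a + 1.0 * b, 1.5)    # fires only if both inputs are 1
    # Stage 2: the output unit receives only the outputs of the hidden units.
    return fires(1.0 * h_or - 1.0 * h_and, 0.5)

xor_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_table)
```

The negative weight on the connection from the second hidden unit is what cancels the output when both inputs are 1, something no single unit with one weight per input can arrange.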
This method can be applied to as many levels of hidden units as there are in the network.
We begin with the error levels of the output units and then assign error levels to the first layer of hidden units. This allows the network both to modify the weights between the first layer of hidden units and the output units and to assign error levels to the next layer of hidden units. And so the error is propagated back down through the network until the input layer is reached. It is very important to remember that activation and error travel through the network in opposite directions. Activation spreads forwards through the network (at least in feedforward networks), while error is propagated backwards.
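The two directions of travel can be made concrete with a small sketch of one backpropagation step. It rests on several assumptions that go beyond the text: the units use the sigmoid activation function, thresholds are left out to keep the code short, the network has two inputs, two hidden units, and one output unit, and the starting weights are made up. The error terms follow the standard formulation for sigmoid units.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Activation spreads forwards: inputs -> hidden units -> single output unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, out

def backward(x, target, w_hidden, w_out):
    """Error is propagated backwards: the output error first, then each hidden unit's share."""
    hidden, out = forward(x, w_hidden, w_out)
    # Error term for the output unit, whose target is known.
    delta_out = (target - out) * out * (1.0 - out)
    # Each hidden unit's error reflects its "responsibility": the output error scaled by
    # the weight of its connection to the output unit and the slope of its own activation.
    delta_hidden = [h * (1.0 - h) * w * delta_out for h, w in zip(hidden, w_out)]
    return delta_out, delta_hidden, hidden

def train_step(x, target, w_hidden, w_out, epsilon=0.5):
    """Return new weights after one backpropagation update."""
    delta_out, delta_hidden, hidden = backward(x, target, w_hidden, w_out)
    new_w_out = [w + epsilon * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w + epsilon * d * xi for w, xi in zip(row, x)]
                    for row, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out

# Made-up starting weights; w_hidden[k] holds the weights into hidden unit k.
w_hidden = [[0.3, -0.2], [0.1, 0.4]]
w_out = [0.5, -0.3]
x, target = [1.0, 0.0], 1.0

error_before = target - forward(x, w_hidden, w_out)[1]
w_hidden, w_out = train_step(x, target, w_hidden, w_out)
error_after = target - forward(x, w_hidden, w_out)[1]
print(error_before, error_after)
```

With more hidden layers the same pattern repeats: the error terms of one layer are used to compute the error terms of the layer before it.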
The perceptron convergence rule can be described with a little symbolism.
We can assume that our networks are single-layer networks like those discussed earlier in this section. They have a binary threshold activation function set up so that they output either 1 or 0, depending on whether or not the total input exceeds the threshold. We assume also that the inputs to the network are always either 0 or 1 (so that the networks are really computing Boolean functions).
Now, think about how we would need to set the weights and the threshold to get a single-layer network to generate the right outputs.
We need the network to output a 1 when the first input is 0 and the second input is 1. This means that W2 (the weight for the second input) must be such that 1 × W2 is greater than the threshold. Likewise for the case where the first input is 1 and the second input is 0. In order to get this to come out right we need W1 to be such that 1 × W1 is greater than the threshold. But now, with the weights set like that, it is inevitable that the network will output a 1 when both inputs are 1: if each input is weighted so that it exceeds the threshold, then it is certain that adding them together will exceed the threshold. In symbols, if W1 > T and W2 > T, then it is inevitable that W1 + W2 > T (given that T cannot be negative, since the network must output 0 when both inputs are 0).
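The argument can be checked by brute force, at least over a finite set of candidate values. The sketch below tries every combination of W1, W2, and T on a grid from -2 to 2 in steps of 0.25 and looks for a setting that reproduces a given truth table. The grid and the function names are assumptions made for illustration, and the unit is assumed to fire when the total input reaches the threshold. A search of this kind illustrates the point but does not replace the general argument.

```python
from itertools import product

def unit(a, b, w1, w2, t):
    """Single binary threshold unit with two inputs."""
    return 1 if a * w1 + b * w2 >= t else 0

def find_settings(truth_table, grid):
    """Return the first (w1, w2, t) on the grid that reproduces the truth table, or None."""
    for w1, w2, t in product(grid, repeat=3):
        if all(unit(a, b, w1, w2, t) == out for (a, b), out in truth_table.items()):
            return (w1, w2, t)
    return None

grid = [x / 4 for x in range(-8, 9)]   # candidate values from -2 to 2 in steps of 0.25

OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

or_settings = find_settings(OR, grid)
xor_settings = find_settings(XOR, grid)
print("OR:", or_settings)
print("XOR:", xor_settings)
```

The search finds settings for OR but comes back empty-handed for XOR, as the reasoning above predicts.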
Thinking about these questions takes us to the heart of the theory and practice of neural networks.
What makes neural networks such a powerful tool in cognitive science is that they are capable of learning. This learning can be supervised (when the network is "told" what errors it is making) or unsupervised (when the network does not receive feedback). In order to appreciate how neural networks can learn, however, we need to start with single-layer networks. Single-layer networks have some crucial limitations in what they can learn. The most important event in the development of neural networks was the discovery of a learning algorithm that could overcome the limitations of single-unit networks.
Neurons receive signals from other neurons.
A typical neuron might receive input from 10,000 neurons, but the number is as great as 50,000 for some neurons in the brain area called the hippocampus. These signals are received through the dendrites, which can be thought of as the receiving end of the neuron. A sending neuron transmits a signal along its axon to a synapse, which is the site where the end of an axon branch comes close to a dendrite or the cell body of another neuron. When the signal from the sending (or presynaptic) neuron reaches the synapse, it generates an electrical signal in the dendrites of the receiving (or postsynaptic) neuron.
In short, we do not have the equipment and resources to study populations of neurons directly.
And therefore many researchers have taken a new tack. They have developed techniques for studying populations of neurons indirectly. The approach is via models that approximate populations of neurons in certain important respects. These models are standardly called neural network models.
It is true that there are ways of directly studying the overall activity of populations of neurons.
Event-related potentials (ERPs) and event-related magnetic fields (ERFs) are cortical signals that reflect neural network activity and that can be recorded non-invasively from outside the skull. Recordings of ERPs and ERFs have the advantage over information derived from PET and fMRI of permitting far greater temporal resolution and hence of giving a much more precise sense of the time course of neural events. Yet information from ERPs and ERFs is still insufficiently fine-grained. They reflect the summed electrical activity of populations of neurons, but offer no insight into how that total activity level is generated by the activity of individual neurons.
Neural networks are built up of interconnected populations of units that are designed to capture some of the generic characteristics of biological neurons.
For this reason they are sometimes called artificial neurons. Figure 8.2 illustrates a typical network unit. The unit receives a number of different inputs. There are n inputs, corresponding to synaptic connections to presynaptic neurons. Signals from the presynaptic neurons might be excitatory or inhibitory. This is captured in the model by assigning a numerical weight W_j to each input I_j. Typically the weight will be a real number between -1 and 1. A positive weight corresponds to an excitatory synapse and a negative weight to an inhibitory synapse.
One way of thinking about information processing is in terms of mapping functions.
Functions are being understood here in the strict mathematical sense. The basic idea of a function should be familiar, even if the terminology may not be. Addition is a function. Given two numbers as inputs, the addition function yields a third number as output. The output is the sum of the two inputs. Multiplication is also a function. Here the third number is the product of the two inputs.
The mapping function of addition has a domain made up of all the possible pairs of numbers.
Its range is made up of all the numbers. In this case we can certainly have several different items in the domain mapping onto a single item in the range. Take A1 to be the pair <1, 3> and A2 to be the pair <2, 2>. The addition function maps both A1 and A2 onto 4 (which we can take to be B2).
Other techniques have made it possible to study brain activity (in non-human animals, from monkeys to sea-slugs) at the level of the single neuron.
Microelectrodes can be used to record electrical activity both inside a single neuron and in the vicinity of that neuron. Recording from inside neurons allows a picture to be built up of the different types of input to the neuron, both excitatory and inhibitory, and of the mechanisms that modulate output signals. In contrast, extra-cellular recordings made outside the neuron allow researchers to track the activation levels of an individual neuron over extended periods of time and to investigate how it responds to distinct types of sensory input and how it discharges when, for example, particular movements are made.
The perceptron convergence rule is very powerful.
In fact, it can be proved (although we shan't do so here) that applying the rule is guaranteed to converge on a solution in every case that a solution exists. But can we say anything about when there is no solution, and hence about which functions a network can learn to compute via the perceptron convergence rule and which will forever remain beyond its reach? It turns out that there is a relatively simple way of classifying the functions that a network can learn to compute by applying the perceptron convergence rule. We will see how to do it later in this section.
For our purposes here, the differences between computational neuroscientists and connectionist modelers are less important than what they have in common.
Neural network models have given rise to a way of thinking about information processing very different from the PSSH and the LoT hypothesis. Neural network models are distinctive in how they store information, how they retrieve it, and how they process it. And even those models that are not biologically driven remain neurally inspired.
Using microelectrodes to study individual neurons provides few clues to the complex patterns of interconnection between neurons.
Single neuron recordings tell us what the results of those interconnections are for the individual neuron, as they are manifested in action potentials, synaptic potentials, and the flow of neurotransmitters, but not about how the behavior of the population as a whole is a function of the activity in individual neurons and the connections between them. At the other end of the spectrum, large-scale information about blood flow in the brain will tell us which brain systems are active, but is silent about how the activity of the brain is a function of the activity of the various neural circuits of which it is composed.
Let us make this a little more precise.
Suppose that we have a set of items. We can call that a domain. Let there be another set of items, which we call the range. A mapping function maps each item from the domain onto exactly one item from the range. The defining feature of a function is that no item in the domain gets mapped to more than one item in the range. Functions are single-valued. The operation of taking square roots, for example, is not a function (at least when negative numbers are included), since every positive number has two square roots.
So far in this chapter we have been looking at the machinery of artificial neural networks: at how they work, how they learn, what they can do, and the ways they relate to networks of neurons in the brain.
It is easy to get lost in the details. But it is important to remember why we are studying them. We are looking at neural networks because we are interested in mental architectures. In particular we are interested in them as models of information processing very different from the type of models called for by the physical symbol system hypothesis. From this perspective, the niceties of different types of network and different types of learning rule are not so important. What are important are certain very general features of how neural networks process information. This section summarizes three of the most important features.
Detailed knowledge of how the brain works has increased dramatically in recent years.
Technological developments have been very important. Neuroimaging techniques, such as fMRI and PET, have allowed neuroscientists to begin establishing large-scale correlations between types of cognitive functioning and specific brain areas. PET and fMRI scans allow neuroscientists to identify the neural areas that are activated during specific tasks. Combining this with the information available from studies of brain-damaged patients allows cognitive scientists to build up a functional map of the brain.
Figure 8.4 gives an example of a mapping function.
The arrows indicate which item in the domain is mapped to each item in the range. It is perfectly acceptable for two or more items in the domain to be mapped to a single item in the range (as is the case with A1 and A2). But, because functions are single-valued, no item in the domain can be mapped onto more than one item in the range.
The threshold functions are intended to reflect a very basic property of biological neurons, which is that they only fire when their total input is suitably strong.
The binary threshold activation function models neurons that either fire or don't fire, while the threshold linear function models neurons whose firing rate increases in proportion to the total input once the threshold has been reached.
Everything we know about the brain suggests that we will not be able to understand cognition unless we understand what goes on at levels of organization between large-scale brain areas and individual neurons.
The brain is an extraordinarily complicated set of interlocking and interconnected circuits. The most fundamental feature of the brain is its connectivity, and the crucial question in understanding the brain is how distributed patterns of activation across populations of neurons can give rise to perception, memory, sensori-motor control, and high-level cognition. But we have (as yet) limited tools for directly studying how populations of neurons work.
There are many different types of neural network models and many different ways of using them.
The focus in computational neuroscience is on modeling biological neurons and populations of neurons. Computational neuroscientists start from what is known about the biology of the brain and then construct models by abstracting away from some biological details while preserving others. Connectionist modelers often pay less attention to the constraints of biology. They tend to start with generic models. Their aim is to show how those models can be modified and adapted to simulate and reproduce well-documented psychological phenomena, such as the patterns of development that children go through when they acquire language, or the way in which cognitive processes break down in brain-damaged patients.
None of these ways of studying the brain gives us direct insight into how information is processed in the brain.
The problem is one of fineness of grain. Basically, the various techniques of neuroimaging are too coarse-grained and the techniques of single neuron recordings too fine-grained (at least for studying higher cognitive functions). PET and fMRI are good sources of information about which brain areas are involved in particular cognitive tasks, but they do not tell us anything about how those cognitive tasks are actually carried out. A functional map of the brain tells us very little about how the brain carries out the functions in question. We need to know not just what particular regions of the brain do, but how they do it. Nor will this information come from single neuron recordings. We may well find out from single neuron recordings in monkeys that particular types of neuron in particular areas of the brain respond very selectively to a narrow range of visual stimuli, but we have as yet no idea how to scale this up into an account of how vision works.
The basic activity of a neuron is to fire an electrical impulse along its axon.
The single most important fact about the firing of neurons is that it depends upon activity at the synapses. Some of the signals reaching the neuron's dendrites promote firing and others inhibit it. These are called excitatory and inhibitory synapses respectively. If we think of an excitatory synapse as having a positive weight and an inhibitory synapse a negative weight, then we can calculate the strength of each synapse (by multiplying the strength of the incoming signal by the corresponding synaptic weight). Adding all the synapses together gives the total strength of the signals received at the synapses, and hence the total input to the neuron. If this total input exceeds the threshold of the neuron then the neuron will fire.
There are four different possible pairs of truth values.
These pairs form the domain of the binary Boolean functions. The range, as with all Boolean functions, is given by the set {TRUE, FALSE}. Each binary Boolean function assigns either TRUE or FALSE to each pair of truth values.
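Since each of the four pairs can be assigned either TRUE or FALSE independently, there are 2 × 2 × 2 × 2 = 16 binary Boolean functions in all. The short Python sketch below enumerates them. Representing each function as a dictionary from pairs to truth values is an illustrative choice.

```python
from itertools import product

TRUTH_VALUES = (True, False)

# The domain: the four possible pairs of truth values.
domain = list(product(TRUTH_VALUES, repeat=2))

# Each binary Boolean function assigns TRUE or FALSE to every pair in the domain,
# so a function can be written as a dictionary from pairs to truth values.
all_functions = [dict(zip(domain, outputs))
                 for outputs in product(TRUTH_VALUES, repeat=len(domain))]

# AND is one of these functions: it maps a pair to TRUE only when both members are TRUE.
AND = {pair: pair[0] and pair[1] for pair in domain}

print(len(domain), len(all_functions), AND in all_functions)
```

AND, OR, and XOR are three of the sixteen. The others include functions that always return TRUE, always return FALSE, or simply copy one of their inputs.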
Like all mathematical models they try to strike a balance between biological realism, on the one hand, and computational tractability on the other.
They need to be sufficiently "brain-like" that we can hope to use them to learn about how the brain works. At the same time they need to be simple enough to manipulate and understand. The aim is to abstract away from many biological details of neural functioning in the hope of capturing some of the crucial general principles governing the way the brain works. The multilayered complexity of brain activity is reduced to a relatively small number of variables whose activity and interaction can be rigorously controlled and studied.
The first step in calculating the total input to the neuron is to multiply each input by its weight.
This corresponds to the strength of the signal at each synapse. Adding all these individual signals (or activation levels) together gives the total input to the unit, corresponding to the total signal reaching the nucleus of the neuron. This is represented using the standard mathematical format in Figure 8.2. (Σ is the symbol for summation, or repeated addition. The N above the summation sign indicates that there are N many things to add together. Each of the things added together is the product of I_j and W_j for some value of j between 1 and N.) If the total input exceeds the threshold (T) then the neuron "fires" and transmits an output signal.
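The figure itself is not reproduced here, but on the description just given the summation and the firing condition can be written out as follows. The symbols for the inputs, weights, and threshold follow the text. The labels "total input" and "output" and the case layout are a reconstruction, offered as a sketch rather than as a copy of the figure.

```latex
\text{total input} \;=\; \sum_{j=1}^{N} W_j \, I_j ,
\qquad
\text{output} \;=\;
\begin{cases}
\text{a signal is transmitted} & \text{if } \displaystyle\sum_{j=1}^{N} W_j \, I_j > T, \\[6pt]
\text{no signal} & \text{otherwise.}
\end{cases}
```

For a unit with two inputs, for example, the sum is simply the first weight times the first input plus the second weight times the second input.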
The sigmoid function is a very commonly used nonlinear activation function.
This reflects some of the properties of real neurons in that it effectively has a threshold below which total input has little effect and a ceiling above which the output remains more or less constant despite increases in total input. The ceiling corresponds to the maximum firing rate of the neuron. Between the threshold and the ceiling the strength of the output signal is roughly proportionate to the total input and so looks linear. But the function as a whole is nonlinear and drawn with a curve.
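For comparison, here is a sketch of three activation functions in Python: binary threshold, threshold linear, and sigmoid. The exact formulas are assumptions. In particular, the threshold linear function is taken to output the amount by which the total input exceeds the threshold, and the sigmoid is the standard logistic function, whose ceiling is 1.

```python
import math

def binary_threshold(total_input, threshold=0.0):
    """Output 0 below the threshold and 1 once it is reached."""
    return 1.0 if total_input >= threshold else 0.0

def threshold_linear(total_input, threshold=0.0):
    """Output 0 below the threshold, then grow in proportion to the input above it."""
    return max(0.0, total_input - threshold)

def sigmoid(total_input):
    """Smooth S-shaped curve: close to 0 for very negative input, close to 1 for very positive input."""
    return 1.0 / (1.0 + math.exp(-total_input))

# Compare the three functions on a range of total inputs.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, binary_threshold(x), threshold_linear(x), round(sigmoid(x), 3))
```

Running the loop shows the difference in shape: the binary threshold function jumps, the threshold linear function keeps rising, and the sigmoid flattens out at both ends.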
Consider now a mapping function with two items in its range.
We can think about this as a way of classifying objects in the domain of the function. Imagine that the domain of the function contains all the natural numbers and the range of the function contains two items corresponding to TRUE and FALSE. Then we can identify any subset we please of the natural numbers by mapping the members of that subset onto TRUE and all the others onto FALSE. If the subset that the function maps onto TRUE contains all and only the even numbers, for example, then we have a way of picking out the set of the even numbers. This in fact is how the famous mathematician Gottlob Frege, who invented modern logic, thought about concepts. He thought of the concept even number as a function that maps every even number to TRUE and everything else to FALSE.
The one thing that remains to be specified is the strength of the output signal.
We know that the unit will transmit a signal if the total input exceeds its designated threshold, but we do not yet know what that signal is. For this we need to specify an activation function: a function that assigns an output signal on the basis of the total input. Neural network designers standardly choose from several different types of activation function. Some of these are illustrated in Figure 8.3.
If we use the symbol Δ (big delta) to indicate the adjustment that we will make after each application of the rule, then the perceptron convergence rule can be written like this (remembering that T is the threshold; I_i is the i-th input; W_i is the weight attached to the i-th input; and ε is a constant that sets the size of each adjustment):
ΔT = -ε × δ
ΔW_i = ε × δ × I_i
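To see the rule at work over many applications, the sketch below trains a single unit to behave as an AND-gate. Several details are assumptions made for illustration: the unit fires when the total input reaches the threshold, ε is set to 0.5, the weights and the threshold all start at 0, and the four input pairs are presented in a fixed order until a complete pass produces no errors.

```python
def output(inputs, weights, threshold):
    """Binary threshold unit: 1 if the weighted sum of the inputs reaches the threshold."""
    total_input = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total_input >= threshold else 0

def train(examples, weights, threshold, epsilon=0.5, max_epochs=100):
    """Apply the rule to every example repeatedly until a full pass produces no errors."""
    for epoch in range(max_epochs):
        errors = 0
        for inputs, intended in examples:
            delta = intended - output(inputs, weights, threshold)
            if delta != 0:
                errors += 1
                # Threshold adjustment: -epsilon * delta
                threshold += -epsilon * delta
                # Weight adjustment for each input: epsilon * delta * input
                weights = [w + epsilon * delta * i for w, i in zip(weights, inputs)]
        if errors == 0:
            return weights, threshold, epoch
    raise RuntimeError("no solution found within max_epochs")

AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, threshold, epochs = train(AND_EXAMPLES, weights=[0.0, 0.0], threshold=0.0)
learned = {inputs: output(inputs, weights, threshold) for inputs, _ in AND_EXAMPLES}
print(weights, threshold, epochs, learned)
```

Replacing the examples with the truth table for XOR would make the loop run until `max_epochs` is exhausted, since no setting of two weights and a threshold can satisfy all four cases.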