Connectionist Models: The Briefest Course


Robert M. French

LEAD – CNRS UMR 5022 Dijon, France

What do cows drink?

Symbolic AI: ISA(cow, mammal). ISA(mammal, animal). Rule 1: IF animal(X) AND thirsty(X) THEN lack_water(X). Rule 2: IF lack_water(X) THEN drink_water(X). Conclusion: Cows drink water.
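To make the symbolic approach concrete, here is a minimal forward-chaining sketch in Python; the fact encoding and the helper function are illustrative, not taken from any particular production system.

```python
# Minimal forward-chaining sketch of the symbolic inference above.
facts = {("isa", "cow", "mammal"), ("isa", "mammal", "animal"),
         ("thirsty", "cow")}

def isa_closure(facts):
    """Add transitive ISA facts, e.g. ISA(cow, animal)."""
    changed = True
    while changed:
        changed = False
        isa = [f for f in facts if f[0] == "isa"]
        for (_, a, b) in isa:
            for (_, c, d) in isa:
                if b == c and ("isa", a, d) not in facts:
                    facts.add(("isa", a, d))
                    changed = True
    return facts

facts = isa_closure(facts)
# Rule 1: IF animal(X) AND thirsty(X) THEN lack_water(X)
if ("isa", "cow", "animal") in facts and ("thirsty", "cow") in facts:
    facts.add(("lack_water", "cow"))
# Rule 2: IF lack_water(X) THEN drink_water(X)
if ("lack_water", "cow") in facts:
    facts.add(("drink_water", "cow"))
print(("drink_water", "cow") in facts)  # True: cows drink water
```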

What do cows drink?

Connectionism: hearing the question activates COW and DRINK, which in turn activate MILK, all within about 100 ms. What interests symbolic AI is the correct inference ("water"); what interests connectionism is that these MILK neurons are activated without our ever having heard the word "milk".

Artificial Neural Networks

“Systems that are deliberately constructed to make use of some of the organizational principles that are felt to be used in the human brain.”

(Anderson & Rosenfeld, 1990, Neurocomputing, p. xiii)

The Origin of Connectionist Networks Major Dates

William James (1892): the idea of a network of associations in the brain.

McCulloch & Pitts (1943, 1947): the "logical" neuron.

Hebb (1949): The Organization of Behavior: Hebbian learning and the formation of cell assemblies.

Hodgkin and Huxley (1952): description of the chemistry of neuron firing.

Rochester, Holland, Haibt, & Duda (1956): first real neural network computer model.

Rosenblatt (1958, 1962): the perceptron.

Minsky and Papert (1969): bring the walls down on perceptrons.

Hopfield (1982, 1984): the Hopfield network, settling to an attractor.

Kohonen (1982): unsupervised learning networks.

Rumelhart & McClelland and the PDP Research Group (1986): backpropagation, etc.

Elman (1990): the simple recurrent network.

Hinton (1980 to present): just about everything else...

McCulloch & Pitts (1943, 1947)

[Figure: a two-input threshold unit; inputs feed through to an output via a threshold T.]

The McCulloch & Pitts representation of the "essential" neuron was that it was a logic gate (here an AND gate). The real neuron was far, far more complex, but they felt that they had captured its essence. Neurons were the biological equivalent of logic gates.

Conclusion: Collections of neurons, appropriately wired together, can do logical calculus.

Cognition is just a complex logical calculus.
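A minimal sketch of a McCulloch-Pitts unit in Python; the weights and threshold shown (both weights 1, threshold 2) are one illustrative way to realize an AND gate (threshold 1 would give OR).

```python
# A McCulloch-Pitts "logical" neuron: weighted binary inputs, a fixed
# threshold, and a binary output.
def mp_neuron(inputs, weights, threshold):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# With weights (1, 1) and threshold 2, the unit computes logical AND.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mp_neuron((x1, x2), (1, 1), threshold=2))
```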

Hebb (1949) Connecting changes in neurons to cognition

Hebb asked: What changes at the neuronal level might make possible our acquisition of high-level (semantic) information?

His answer: a learning rule of synaptic reinforcement (Hebbian learning). When neuron A fires and is followed immediately by the firing of neuron B, the synapse between the two neurons is strengthened, i.e., the next time A fires, it will be easier for B to fire.
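A minimal sketch of this rule in Python; the learning rate and the toy firing history are illustrative assumptions.

```python
# Hebbian learning: the synapse from A to B is strengthened whenever
# A's firing is followed by B's firing.
eta = 0.1                                    # learning rate (illustrative)
w_ab = 0.0                                   # synaptic weight from A to B
history = [(1, 1), (1, 1), (0, 1), (1, 0), (1, 1)]   # (A fired, B fired)
for a_fired, b_fired in history:
    w_ab += eta * a_fired * b_fired          # strengthened only when both fire
print(round(w_ab, 2))                        # 0.3 after three co-firings
```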

Connecting neural function to behavior

[Diagram: levels of modeling, from top to bottom:]

High-level models of human cognition and behavior

(the Hebbian Gap)

Neuronal population coding models

Low-level models of single neurons

Even lower-level models of synapses and ion channels

Cell assemblies: Closing the Hebbian Gap

Cell assemblies at the neuronal level give rise to categories at the semantic level.

• The formation of cell assemblies involves persistence of activity without external input.

• Cell assemblies can overlap: e.g., the cell assembly associated with "dog" will overlap with those associated with "wolf", "cat", etc.

• Recruitment: creation of a new cell assembly (via Hebbian learning) corresponding to a new concept.

• Fractionation: creation of new cell assemblies from an old one, corresponding to the refinement of a concept.

A Hebbian Cell Assembly

By means of the Hebbian learning rule, a circuit of continuously firing neurons could be learned by the network. The continuing activation in this cell assembly does not require external input. The activation of the neurons in this circuit would correspond to the perception of a concept.

A Cell Assembly

[Animation: input from the environment activates the assembly; activity then circulates around the loop of neurons. Notice that once the input from the environment is gone, the assembly continues to fire on its own.]

Rochester, Holland, Haibt, & Duda (1956)

• First real simulation that attempted to implement the principles outlined by Hebb in real computer hardware.

• Attempted to simulate the emergence of cell assemblies in a small network of 69 neurons. They found that everything became active in their network.

• They decided that they needed to include inhibitory synapses (Hebb had only discussed excitatory synapses). This worked, and cell assemblies did, indeed, form.

• Probably the earliest example in neural network modeling of a network that made a prediction (i.e., inhibitory synapses are needed to form cell assemblies) that was later confirmed in real brain circuitry.

Rosenblatt (1958, 1962): The Perceptron

• Rosenblatt’s perceptron could learn to associate inputs with outputs.

• He believed this was how the visual system learned to associate low-level visual input with higher-level concepts.

• He introduced a learning rule (weight-change algorithm) that allowed the perceptron to learn associations.

The elementary perceptron

Consists of:
• two layers of nodes (one layer of weights)
• only feedforward connections
• a threshold function on each output unit
• a linear summation of the weights times the inputs

[Figure: inputs $x_1, x_2$ feed through weights $w_1, w_2$ to an output unit with threshold $T$; the actual output $y$ is compared with the desired output $t$ (the "teacher").]

The output is given by

$$y = \begin{cases} 1 & \text{if } \sum_i w_i x_i \geq T \\ 0 & \text{otherwise.} \end{cases}$$

The perceptron (Widrow-Hoff) learning rule (weight-change rule) is:

$$w_{\text{new}} = w_{\text{old}} + \eta \, x \, (t - y)$$

where $0 < \eta \leq 1$ is the learning constant, $t$ is the desired output, and $y$ is the actual output.
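A minimal sketch of this rule in Python, training a perceptron on the (linearly separable) AND function; the learning constant and the fixed threshold are illustrative choices.

```python
# Widrow-Hoff learning: w_new = w_old + eta * x * (t - y).
eta, T = 0.3, 0.5
w = [0.0, 0.0]
patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND

for epoch in range(20):
    for x, t in patterns:
        y = 1 if w[0] * x[0] + w[1] * x[1] >= T else 0
        for i in range(2):
            w[i] += eta * x[i] * (t - y)   # nudge weights toward the teacher

print(w)  # [0.3, 0.3]: the unit now fires only when both inputs are 1
```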

[Figure: a grid of input pixels $x_i$, connected by weights $w_i$ to an output unit representing "X".]

This perceptron learns to associate the visual input of two crossed straight lines with the character "X". In other words, the output of the network will be the character "X".

Generalization

[Figure: a degraded image of an X presented to the same network, which still outputs "X".]

The real image in the world is degraded, but if the network has already learned to correctly identify the original complete "X", it will recognize the degraded X as being an "X".

Fundamental limitations of the perceptron

Minsky & Papert (1969) showed that the Rosenblatt two-layer perceptron had a fundamental limitation: it could only classify linearly separable sets.

[Figure: a linearly separable arrangement of X's and Y's, which a single straight line can divide, versus a non-linearly-separable arrangement, which no straight line can.]

The (infamous) XOR problem

• Minsky and Papert showed there were a number of extremely simple patterns that no perceptron could learn, including the logic function XOR.

• Since cognition supposedly required elementary logical operations, this severely weakened the perceptron’s claim to be able to do general cognition.

The XOR function:

Input (x1, x2):  (0,0)  (0,1)  (1,0)  (1,1)
Output:            0      1      1      0

There is no set of weights $w_1$ and $w_2$ and a threshold $T$ such that the perceptron below can learn the above XOR function.

[Figure: inputs $x_1$ and $x_2$ connected by weights $w_1$ and $w_2$ to an output node with threshold $T$; the actual output $y$ is compared with the desired output $t$ (the "teacher").]

The activation arriving at the output node is $w_1 x_1 + w_2 x_2$. If $w_1 x_1 + w_2 x_2 \geq T$, then we output 1, otherwise 0. But $w_1 x_1 + w_2 x_2 = T$ is a straight line if we consider $x_1$ and $x_2$ to be the axes of a coordinate system.

[Figure: the four input points (0,0), (0,1), (1,0), and (1,1) plotted in the $(x_1, x_2)$ plane, together with the line $w_1 x_1 + w_2 x_2 = T$.]

NO! No values of $w_1$, $w_2$, and $T$ will form a straight line $w_1 x_1 + w_2 x_2 = T$ with (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.

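The point can also be checked by brute force. The sketch below (the grid of candidate weights and thresholds is an illustrative choice) finds no two-layer perceptron that computes XOR.

```python
# Brute-force confirmation that XOR is not linearly separable.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def solves_xor(w1, w2, T):
    return all((1 if w1 * x1 + w2 * x2 >= T else 0) == t
               for (x1, x2), t in xor.items())

grid = [i / 4 for i in range(-20, 21)]        # -5.0 ... 5.0 in steps of 0.25
found = any(solves_xor(w1, w2, T)
            for w1 in grid for w2 in grid for T in grid)
print(found)  # False: no weights and threshold in the grid compute XOR
```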

The Revival of the (Multi-layered) Perceptron: The Connectionist Revolution (1985) and the Statistical Nature of Cognition

By the early 1980’s Symbolic AI had hit a wall.

"Simple" tasks that humans do (almost) effortlessly (face, word, and speech recognition; retrieving information from incomplete cues; generalizing; etc.) proved to be notoriously hard for symbolic AI.

• Minsky (1967): “Within a generation the problem of creating ‘artificial intelligence’ will be substantially solved.” • Minsky (1982): “The AI problem is one of the hardest ever undertaken by science.”

By the early 1980's the statistical nature of much of cognition became ever more apparent.

Three factors contributed to the revival of the perceptron:
• the radical failure of AI to achieve the goals announced in the 1960's;
• the growing awareness of the statistical and "fuzzy" nature of cognition;
• the development of improved perceptrons, capable of overcoming the linear-separability problems brought to light by Minsky & Papert.

Advantages of Connectionist Models compared to Symbolic AI

• Learning: specifically designed to learn.

• Pattern completion of familiar patterns.

• Generalization: can generalize to novel patterns based on previously learned patterns.

• Retrieval with partial information: can retrieve information in memory based on nearly any attribute of the representation.

• Massive parallelism: the 100-step processing constraint (Feldman & Ballard, 1982). Neural hardware is too slow and too unreliable for sequential models of processing. Transmission across a synapse (a gap of ~10⁻⁶ in.) takes about 1 ms, yet we carry out very complex processing in a few hundred ms. Complex tasks must therefore be accomplished in no more than a few hundred serial steps, which is impossible for a strictly sequential model.

• Graceful degradation: when these networks are damaged, their performance degrades gradually.

Real Brains and Connectionist Networks

Some characteristics of real brains that serve as the basis of ANN design:

• Neurons receive input from lots of other neurons.
• Massive parallelism: neurons are slow, but there are lots of them.
• Learning involves modifying the strength of synaptic connections.
• Neurons communicate with one another via activation or inhibition.
• Connections in the brain have a clear geometric and topological structure.
• Information is continuously available to the brain.
• Graceful degradation of performance in the face of damage and information overload.
• Control is distributed, not central (i.e., no central executive).
• One primary way of understanding what the brain does is relaxation to attractors.

General principles of all connectionist networks

• a set of processing units;
• a state of activation defined over all of the units;
• an output function ("squashing function") for each unit, which transforms unit activation into outgoing activation;
• a connectivity pattern with two features: the weights of the connections and the locations of the connections;
• an activation rule for combining the inputs impinging on a unit to produce a total activation for the unit;
• a learning rule, by which the connectivity pattern is changed;
• an environment in which the system operates (i.e., how the input/output is represented and given to/taken from the system).

Knowledge storage and Learning

• Knowledge storage: knowledge is stored exclusively in the pattern of strengths of the connections (weights) between units. The network stores multiple patterns in the SAME set of connections.

• Learning: the system learns by automatically adjusting the strengths of these weights as it receives information from its environment.

There are no high-level rules programmed into the system. Because all patterns are stored in the same set of connections, generalization, graceful degradation, etc. are relatively easy in connectionist networks. It is also what makes planning, logic, etc. so hard.

Two major classes of networks

Supervised: includes all error-driven learning algorithms. The error between the desired output and the actual output determines how to change the weights. This error is gradually decreased by the learning algorithm.

Unsupervised: there is no error feedback signal. The network automatically clusters the input into categories. Example: if the network is presented with 100 patterns, half of which are different kinds of ellipses and half of which are different types of rectangles, it would automatically group these patterns into the two appropriate categories. There is no feedback to tell the network explicitly "this is a rectangle" or "this is an ellipse."

So, how did they solve the problem of linear separability?

ANSWER: (i) by adding another "hidden" layer to the perceptron, between the input and output layers; (ii) by introducing a differentiable squashing function; and (iii) by discovering a new learning rule (the "generalized delta rule").

"Concurrent" learning

Learning a series of patterns: if each pattern in the series is learned to criterion (i.e., completely) sequentially, the learning of the new patterns will erase the learning of the previously learned patterns. This is why concurrent learning must be used; otherwise, catastrophic forgetting may occur.

Concurrent learning (one epoch):
- 1st pattern presented to the network: change its weights a little to reduce the error on that pattern;
- 2nd pattern: change its weights a little to reduce the error on that pattern; etc.
- last pattern: change its weights a little to reduce the error on that pattern;
- REPEAT until the error for all patterns is below criterion.

Backpropagation

[Figure: a three-layer network. The input layer (nodes subscripted with k's) receives input from the environment; weights w_jk connect it to the hidden layer (nodes subscripted with j's), which forms the hidden-layer representation; weights w_ij connect the hidden layer to the output layer (nodes subscripted with i's). The error is the difference between the desired output (the "teacher") and the actual output.]

Training of a backpropagation network:
i) Feedforward activation pass, with activation "squashed" at the hidden layer.
ii) The output is compared with the desired output (= error signal).
iii) This error signal is "backpropagated" through the network to change the network's weights (with gradient descent).
iv) When the overall error is below a predefined criterion, learning stops.
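For concreteness, here is a minimal sketch in Python of such a network learning XOR; the 2-2-1 architecture, logistic squashing function, learning rate, and epoch count are illustrative assumptions, not the specific choices of any model discussed here. Note the concurrent presentation of the patterns: one small weight change per pattern, repeated over many epochs.

```python
import math, random

random.seed(0)

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

# 2-2-1 network: weights w_jk (input -> hidden) and w_ij (hidden -> output).
w_jk = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b_j = [random.uniform(-1, 1) for _ in range(2)]   # hidden biases
w_ij = [random.uniform(-1, 1) for _ in range(2)]
b_i = random.uniform(-1, 1)                       # output bias
eta = 0.5
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

def forward(x):
    h = [sig(sum(w_jk[j][k] * x[k] for k in range(2)) + b_j[j])
         for j in range(2)]                       # i) feedforward, squashed
    return h, sig(sum(w_ij[j] * h[j] for j in range(2)) + b_i)

for epoch in range(20000):        # concurrent learning over the pattern set
    for x, t in patterns:
        h, y = forward(x)
        delta_o = (t - y) * y * (1 - y)           # ii) error signal
        for j in range(2):                        # iii) backpropagate the error
            delta_h = delta_o * w_ij[j] * h[j] * (1 - h[j])
            w_ij[j] += eta * delta_o * h[j]
            for k in range(2):
                w_jk[j][k] += eta * delta_h * x[k]
            b_j[j] += eta * delta_h
        b_i += eta * delta_o

for x, t in patterns:
    h, y = forward(x)
    print(x, t, round(y, 2))   # outputs should approach the 0/1 targets
```

With most random initializations the four outputs approach their targets (a few initializations can stall in a local minimum); the hidden layer is what lets the network carve up the input space that no single straight line could.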

Backpropagation networks are excellent function-learners...

[Figure: a learning curve; proportion correct rises from 0 toward 1.0 over training.]

...but they also suffer from catastrophic interference.

Humans:

[Figure: proportion correct on the A-B list and the A-C list, plotted against 5, 10, and 20 learning trials on the A-C list.]

Backpropagation networks:

[Figure: proportion correct on the A-B list and the A-C list, plotted against 0 to 50 learning epochs on the A-C list.]

They can learn to read words aloud (NETtalk, 1987)...

... but they have trouble learning sequences.

Much of our cognition involves learning sequences of patterns. Standard BP networks are fine for learning input-output patterns, but they cannot be used effectively to learn sequences of patterns.

Consider the sequence:

A B C D E F G H I

For this sequence we could train a network to associate the following input-output pairs:

A -> B, B -> C, C -> D, D -> E, E -> F, F -> G, G -> H, H -> I

If we give the network A as its "seed", it would produce B on output, which we would feed back into the network to produce C on output, and so on. Thus, we could reproduce the original sequence.

But what about context-dependent sequences?

But what if the sequence were:

A B C D E F C H I

Here C is repeated. The technique above would give:

A -> B, B -> C, C -> D, D -> E, E -> F, F -> C, C -> H, H -> I

But the network could not learn this sequence, since it has no context to distinguish the two different outputs associated with C (for the first occurrence, D; for the second, H).
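A tiny illustration in Python (using an illustrative table as a stand-in for the trained network): building the single-letter association table for this sequence necessarily loses one of C's two targets.

```python
# Single-letter associations for the sequence A B C D E F C H I.
# The second occurrence of C (C -> H) overwrites the first (C -> D),
# just as a context-free network cannot keep both targets.
seq = "ABCDEFCHI"
table = {}
for cur, nxt in zip(seq, seq[1:]):
    table[cur] = nxt
print(table["C"])   # 'H': the C -> D association has been lost
```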

A “sliding window” solution

Consider a "sliding window" solution to provide the context. Instead of having the network learn single-letter inputs, it will learn two-letter inputs, thus:

AB -> C, BC -> D, CD -> E, DE -> F, EF -> C, FC -> H, CH -> I

Now the network is fed AB (here, "A" serves as "context" for "B") as its seed, and it can reproduce the sequence with the repeated C without difficulty.

But what if we needed more than one letter's worth of context, as in a sequence like this: A B C D E B C H I? Here two-letter windows no longer suffice (BC is followed by D the first time and by H the second), so the network needs another context letter... and so on.

Conclusion: The sliding-window technique doesn't work in general.

Elman's solution (1990): The Simple Recurrent Network

[Architecture: input units and context units both feed the hidden units; the hidden units feed the output units; after each time step the hidden-unit activations are copied back into the context units.]
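A minimal sketch of the SRN forward pass in Python; the layer sizes, random weights, and logistic squashing function are illustrative assumptions (training by backpropagation is omitted here).

```python
import math, random

random.seed(1)
n_in, n_hid, n_out = 4, 3, 4

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

W_ih = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
W_ch = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_hid)]
W_ho = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_out)]

def srn_step(x, context):
    # Hidden units receive the current input AND the previous hidden state.
    hidden = [sig(sum(W_ih[j][k] * x[k] for k in range(n_in)) +
                  sum(W_ch[j][c] * context[c] for c in range(n_hid)))
              for j in range(n_hid)]
    output = [sig(sum(W_ho[i][j] * hidden[j] for j in range(n_hid)))
              for i in range(n_out)]
    return output, hidden      # hidden is copied back as the next context

context = [0.0] * n_hid        # context units start empty
for x in ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]):  # a toy one-hot sequence
    output, context = srn_step(x, context)
print([round(o, 2) for o in output])   # prediction after three time steps
```

The copied-back hidden state gives the network exactly the context the sliding window lacked: how it responds to an input now depends on the whole sequence that preceded it.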

SRN Bilingual language learning

(French, 1998; French & Jacquet, 2004)

Input to the SRN:
- two "micro" languages, Alpha & Beta, 12 words each
- an SVO grammar for each language
- unpredictable language switching

Attempted prediction on a stream such as:

BOY LIFTS TOY MAN SEES PEN GIRL PUSHES BALL BOY PUSHES BOOK FEMME SOULEVE STYLO FILLE PREND STYLO GARÇON TOUCHE LIVRE FEMME POUSSE BALLON FILLE SOULEVE JOUET WOMAN PUSHES TOY... (Note the absence of markers between sentences and between languages.)

The network tries each time to predict the next element.

We do a cluster analysis of its internal (hidden-unit) representations after having seen 20,000 sentences.

Clustering of the internal representations formed by the SRN

[Figure: hierarchical clustering of the hidden-unit representations. The Alpha words (BOY, GIRL, MAN, WOMAN, LIFTS, SEES, TAKES, PUSHES, TOY, BALL, PEN, BOOK) and the Beta words (GARCON, FILLE, FEMME, HOMME, SOULEVE, POUSSE, PREND, VOIT, JOUET, BALLON, STYLO, LIVRE) form two separate clusters.]

N.B. It also works for micro-languages with 768 words each.

Unsupervised learning: Kohonen networks

Kohonen networks cluster inputs in an unsupervised manner. There are no activation-spreading or summing processes here: Kohonen networks adjust weight vectors to match input vectors.

[Figure: a Kohonen network with an input layer of six nodes (1 to 6) fully connected to two output nodes by weights w11, w12, ..., w52, w62.]
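A minimal sketch of Kohonen-style competitive learning in Python; for simplicity it omits the neighborhood function of a full self-organizing map, and the data and learning rate are illustrative.

```python
import random

random.seed(2)
dim, n_nodes, eta = 6, 2, 0.2
W = [[random.random() for _ in range(dim)] for _ in range(n_nodes)]

def winner(x):
    # The output node whose weight vector best matches the input.
    return min(range(n_nodes),
               key=lambda n: sum((W[n][d] - x[d]) ** 2 for d in range(dim)))

inputs = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1],
          [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
for _ in range(100):
    for x in inputs:
        n = winner(x)
        for d in range(dim):                   # move the winner's weights
            W[n][d] += eta * (x[d] - W[n][d])  # toward the input vector

print([winner(x) for x in inputs])  # typically the two input clusters
                                    # end up on two different output nodes
```

No teacher ever tells the network which cluster an input belongs to; the categories emerge from the statistics of the inputs alone.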

The next frontier...

Computational neuroscience uses spiking neurons, and variables such as their connection density and their firing timing and synchrony, to better understand human cognitive functions.

We are almost at a point where the population dynamics of large networks of these kinds of simulated neurons can realistically be studied.

Further in the future, neuronal models with Hodgkin-Huxley equations of membrane potentials and neuronal firing will be incorporated into our computational models of cognition.

Ultimately...

Gradually, neural network models and the computers they run on will become good enough to give us a deep understanding of neurophysiological processes and their behavioral counterparts and to make precise predictions about them.

They will be used to study epilepsy, Alzheimer’s disease, and the effects of various kinds of stroke, without requiring the presence of human patients.

They will be, in short, like the models used in all of the other hard sciences. Neural modeling and neurobiology will then have achieved a truly symbiotic relationship.