A Brain-Like Computer for
Cognitive Applications:
The Ersatz Brain Project
James A. Anderson
[email protected]
Department of Cognitive and Linguistic Sciences
Brown University, Providence, RI 02912
Paul Allopenna
[email protected]
Aptima, Inc.
12 Gill Street, Suite 1400, Woburn, MA
Our Goal:
We want to build a first-rate, second-rate
brain.
Participants
Faculty:
Jim Anderson, Cognitive Science.
Gerry Guralnik, Physics.
Tom Dean, Computer Science.
David Sheinberg, Neuroscience.
Students:
Socrates Dimitriadis, Cognitive Science.
Brian Merritt, Cognitive Science.
Benjamin Machta, Physics.
Private Industry:
Paul Allopenna, Aptima, Inc.
John Santini, Anteon, Inc.
Acknowledgements
This work was supported by:
A seed money grant from the Office of the Vice
President for Research, Brown University.
An SBIR, The Ersatz Brain Project, FA8750-05-C0122, to Aptima, Inc. (Woburn MA), Dr. Paul
Allopenna, Project Manager.
Also: Early support was received from a DARPA
grant to Brown University Engineering
Department in the Bio/Info/Micro program,
MDA972-00-1-0026.
Comparison of Silicon Computers
and Carbon Computer
Digital computers are
• Made from silicon
• Accurate (essentially no errors)
• Fast (nanoseconds)
• Execute long chains of logical
operations (billions)
• Often irritating (because they
don’t think like us).
Comparison of Silicon Computers
and Carbon Computer
Brains are
• Made from carbon
• Inaccurate (low precision, noisy)
• Slow (milliseconds, 10^6 times
slower)
• Execute short chains of parallel
alogical associative operations
(perhaps 10 operations/second)
• Yet largely understandable
(because they think like us).
Comparison of Silicon Computers
and Carbon Computer
• Huge disadvantage for carbon: more
than 10^12 in the product of speed and
power.
• But we still do better than them in
many perceptual skills: speech
recognition, object recognition, face
recognition, motor control.
• Implication: Cognitive “software” uses
only a few but very powerful elementary
operations.
Major Point
Brains and computers are very different in their
underlying hardware, leading to major
differences in software.
Computers, as the result of 60 years of
evolution, are great at modeling physics.
They are not great (after 50 years of trying, and
largely failing) at modeling human cognition.
One possible reason: inappropriate hardware leads
to inappropriate software.
Maybe we need something completely different: new
software, new hardware, new basic operations,
even new ideas about computation.
So Why Build a Brain-Like Computer?
1. Engineering.
Computers are all special purpose devices.
Many of the most important practical computer applications
of the next few decades will be cognitive in nature:
• Natural language processing.
• Internet search.
• Cognitive data mining.
• Decent human-computer interfaces.
• Text understanding.
We claim it will be necessary to have a cortex-like
architecture (either software or hardware) to run these
applications efficiently.
2. Science:
Such a system, even in simulation, becomes a
powerful research tool.
It leads to designing software with a particular
structure to match the brain-like computer.
If we capture any of the essence of the cortex,
writing good programs will give insight into
biology and cognitive science.
If we can write good software for a vaguely brain
like computer we may show we really understand
something important about the brain.
3. Personal:
It would be the ultimate cool gadget.
A technological vision:
In 2055 the personal computer you buy in Wal-Mart will
have two CPUs with very different architectures:
First, a traditional von Neumann machine that runs
spreadsheets, does word processing, keeps your
calendar straight, etc. etc. What they do now.
Second, a brain-like chip
• To handle the interface with the von Neumann
machine,
• Give you the data that you need from the Web or
your files (but didn’t think to ask for).
• Be your silicon friend, guide, and confidant.
History: Technical Issues
Many have proposed the construction of brain-like
computers.
These attempts usually start with
• massively parallel arrays of neural computing
elements,
• elements based on biological neurons, and
• the layered 2-D anatomy of mammalian cerebral cortex.
Such attempts have failed commercially.
The early Connection Machines from Thinking
Machines, Inc. (W.D. Hillis, The Connection Machine,
1987) were the most nearly successful commercially and are
the most like the architecture we are proposing here.
Consider the extremes of computational brain models.
First Extreme: Biological Realism
The human brain is composed of on the order of 10^10
neurons, connected together with at least 10^14 neural
connections. (Both probably underestimates.)
Biological neurons and their connections are extremely
complex electrochemical structures.
The more realistic the neuron approximation the smaller
the network that can be modeled.
There is good evidence that for cerebral cortex a
bigger brain is a better brain.
Projects that model neurons in detail are of scientific
importance.
But they are not large enough to simulate interesting
cognition.
Neural Networks.
The most successful brain
inspired models are
neural networks.
They are built from simple
approximations of
biological neurons:
nonlinear integration of
many weighted inputs.
Throw out all the other
biological detail.
Neural Network Systems
Units with these
approximations can build
systems that
  can be made large,
  can be analyzed,
  can be simulated,
  can display complex
cognitive behavior.
Neural networks have been
used to model (rather
well) important aspects of
human cognition.
Second Extreme: Associatively
Linked Networks.
The second class of brain-like
computing models is a basic
part of computer science:
Associatively linked
structures.
One example of such a
structure is a semantic
network.
Such structures underlie most
of the practically
successful applications of
artificial intelligence.
Associatively Linked Networks (2)
The connection between the biological nervous system
and such a structure is unclear.
Few believe that nodes in a semantic network
correspond in any sense to single neurons.
Physiology (fMRI) suggests that a complex cognitive
structure – a word, for instance – gives rise to
widely distributed cortical activation.
Major virtue of Linked Networks: They have sparsely
connected “interesting” nodes. (words, concepts)
In practical systems, the number of links converging
on a node ranges from one or two up to a dozen or so.
The Ersatz Brain Approximation:
The Network of Networks.
Conventional wisdom says neurons are the basic
computational units of the brain.
The Ersatz Brain Project is based on a different
assumption.
The Network of Networks model was developed in
collaboration with Jeff Sutton (Harvard Medical
School, now at NSBRI).
Cerebral cortex contains intermediate level
structure, between neurons and an entire
cortical region.
Intermediate level brain structures are hard to
study experimentally because they require
recording from many cells simultaneously.
Network of Networks Approximation
We use the Network of
Networks [NofN]
approximation to structure
the hardware and to reduce
the number of connections.
We assume the basic
computing units are not
neurons, but small (10^4
neurons) attractor
networks.
Basic Network of Networks
Architecture:
• 2 Dimensional array of
modules
• Locally connected to
neighbors
Cortical Columns: Minicolumns
“The basic unit of cortical operation is the
minicolumn … It contains of the order
of 80-100 neurons except in the
primate striate cortex, where the
number is more than doubled. The
minicolumn measures of the order of
40-50 µm in transverse diameter,
separated from adjacent minicolumns
by vertical, cell-sparse zones … The
minicolumn is produced by the
iterative division of a small number of
progenitor cells in the
neuroepithelium.” (Mountcastle, p. 2)
VB Mountcastle (2003). Introduction [to a special
issue of Cerebral Cortex on columns]. Cerebral
Cortex, 13, 2-4.
Figure: Nissl stain of cortex in planum
temporale.
Columns: Functional
Groupings of minicolumns seem to form the
physiologically observed functional columns.
Best known example is orientation columns in
V1.
They are significantly bigger than minicolumns,
typically around 0.3-0.5 mm.
Mountcastle’s summation:
“Cortical columns are formed by the binding together of many
minicolumns by common input and short range horizontal connections.
… The number of minicolumns per column varies … between 50 and
80. Long range intracortical projections link columns with similar
functional properties.” (p. 3)
Cells in a column ~ (80)(100) = 8000
The activity of the nonlinear attractor
networks (modules) is
dominated by their
attractor states.
Attractor states may be
built in or acquired
through learning.
We approximate the
activity of a module
as a weighted sum of
attractor states. That
is, the attractor states
form an adequate set of
basis functions.
Activity of Module:
x = Σ c_i a_i
where the a_i are the
attractor states.
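The weighted-sum approximation can be sketched in a few lines of NumPy. This is a toy illustration: the three orthonormal attractor states and the coefficients are invented for the example, not taken from the project.

```python
import numpy as np

# Toy module: three orthonormal attractor states a_i (rows) and weights c_i.
attractors = np.eye(3)               # a_1, a_2, a_3
coeffs = np.array([0.7, 0.2, 0.1])   # c_i

# Module activity x = sum_i c_i * a_i  (a weighted sum of attractor states)
x = coeffs @ attractors

# Because the attractor states form a basis, the weights can be read back out
# by projecting the activity onto each state.
recovered = attractors @ x
```

With a real module the attractor states need not be orthogonal, but the basis-function picture is the same.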
Elementary Modules
The Single Module: BSB
The attractor network we use for the individual
modules is the BSB network (Anderson, 1993).
It can be analyzed using the eigenvectors and
eigenvalues of its local connections.
Interactions between Modules
Interactions between modules are described by state
interaction matrices, M.
The state interaction matrix elements give the
contribution of an attractor state in one module to the
amplitude of an attractor state in a connected module.
In the BSB linear region

x(t+1) = Σ_i M_i s_i + f + x(t)

where Σ_i M_i s_i is the weighted sum of input from
other modules, f is the external input, and x(t) is
the ongoing activity.
The Linear-Nonlinear Transition
The first BSB processing stage is linear and sums
influences from other modules.
The second processing stage is nonlinear.
This linear to nonlinear transition is a powerful
computational tool for cognitive applications.
It describes the processing path taken by many
cognitive processes.
A generalization from cognitive science:
Sensory inputs → (categories, concepts, words)
Cognitive processing moves from continuous values
to discrete entities.
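The linear-then-nonlinear path can be sketched with a BSB-style update, using box clipping as the nonlinearity. The matrix, gain, and starting state below are toy values chosen for the sketch, not the project's actual parameters.

```python
import numpy as np

def bsb_step(x, A, ext, alpha=0.5, lo=-1.0, hi=1.0):
    """One BSB update: a linear integration stage, then box clipping."""
    linear = x + alpha * (A @ x) + ext   # stage 1: linear sum of influences
    return np.clip(linear, lo, hi)       # stage 2: nonlinear limiting

# Toy 2-D module: positive feedback drives activity into a corner of the box,
# i.e. from continuous values toward a discrete attractor state.
A = np.eye(2)
x = np.array([0.1, -0.1])
for _ in range(10):
    x = bsb_step(x, A, ext=np.zeros(2))
```

After a few iterations the small continuous starting pattern has saturated at a corner of the box: the linear-to-nonlinear transition in miniature.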
We can extend this
associative model to larger
scale groupings.
It may become possible to
suggest a natural way to
bridge the gap in scale
between single neurons and
entire brain regions.
Networks >
Networks of Networks >
Networks of
(Networks of Networks) >
Networks of
(Networks of (Networks
of Networks))
and so on …
Scaling
Binding Module Patterns Together.
An associative Hebbian
learning event will tend
to link f with g through
the local connections.
There is a speculative
connection to the
important binding
problem of cognitive
science and
neuroscience.
The larger groupings will
act like a unit.
Responses will be stronger
to the pair f,g than to
either f or g by itself.
Two adjacent modules interacting.
Hebbian learning will tend to bind
responses of modules together if f
and g frequently co-occur.
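One way to sketch such a Hebbian linking event is an outer-product rule on the local connections. The patterns and the normalization here are illustrative assumptions, not the model's actual learning rule.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.choice([-1.0, 1.0], size=8)   # pattern on one module
g = rng.choice([-1.0, 1.0], size=8)   # pattern on the adjacent module

# Hebbian (outer-product) learning on the local connections linking them.
M = np.outer(g, f) / len(f)

# After learning, presenting f on the first module evokes g on the second,
# so the pair f, g tends to act as a unit.
evoked = M @ f
```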
Sparse Connectivity
The brain is sparsely connected. (Unlike most neural
nets.)
A neuron in cortex may have on the order of 100,000
synapses. There are more than 1010 neurons in the
brain. Fractional connectivity is very low: 0.001%.
Implications:
• Connections are expensive biologically since they
take up space, use energy, and are hard to wire up
correctly.
• Therefore, connections are valuable.
• The pattern of connection is under tight control.
• Short local connections are cheaper than long ones.
Our approximation makes extensive use of local
connections for computation.
Interference Patterns
We are using local transmission of (vector)
patterns, not scalar activity level.
We have the potential for traveling pattern waves
using the local connections.
Lateral information flow allows the potential for
the formation of feature combinations in the
interference patterns where two different
patterns collide.
Learning the Interference Pattern
The individual modules are nonlinear learning networks.
We can form new attractor states when an interference
pattern forms when two patterns meet at a module.
Module Evolution
Module evolution with learning:
• from an initial repertoire of basic attractor
states
• to the development of specialized pattern
combination states unique to the history of
each module.
Biological Evidence:
Columnar Organization in Inferotemporal
Cortex
Tanaka (2003)
suggests a columnar
organization of
different response
classes in primate
inferotemporal
cortex.
There seems to be
some internal
structure in these
regions: for
example, spatial
representation of
orientation of the
image in the
column.
IT Response Clusters: Imaging
Tanaka (2003) used
intrinsic visual
imaging of cortex.
Train a video camera
on the exposed cortex;
cell activity can
be picked up.
At least a factor of
ten higher
resolution than
fMRI.
Size of response is
around the size of
functional columns
seen elsewhere:
300-400 microns.
Columns: Inferotemporal Cortex
Responses of a region
of IT to complex
images involve
discrete columns.
The response to a
picture of a fire
extinguisher shows
how regions of
activity are
determined.
Boundaries are where
the activity falls
by a half.
Note: some spots are
roughly equally
spaced.
Active IT Regions for a Complex Stimulus
Note the large number of roughly equally distant
spots (2 mm) for a familiar complex image.
Histogram of Distances
We were able to plot
histograms of
distances in a number
of published IT
intrinsic images of
complex figures.
Distances computed from
data in previous
figure (Dimitriadis)
Back-of-the-Envelope
Engineering
Considerations
Network of Networks Functional Summary.
• The NofN approximation assumes a two dimensional array of
attractor networks.
• The attractor states dominate the output of the system at
all levels.
• Interactions between different modules are approximated by
interactions between their attractor states.
• Lateral information propagation plus nonlinear learning
allows formation of new attractors at the location of
interference patterns.
• There is a linear and a nonlinear region of operation in
both single and multiple modules.
• The qualitative behavior of the attractor networks can be
controlled by analog gain control parameters.
Engineering Hardware Considerations
We feel that there is a size, connectivity, and computational
power “sweet spot” at the level of the parameters of the
network of network model.
If an elementary attractor network has 10^4 actual neurons,
that network might display 50 attractor states. Each elementary
network might connect to 50 others through state
connection matrices.
A brain-sized system might consist of 10^6 elementary units
with about 10^11 (0.1 terabyte) numbers specifying the
connections.
If 100 to 1000 elementary units can be placed on a chip there
would be a total of 1,000 to 10,000 chips in a cortex
sized system.
These numbers are large but within the upper bounds of
current technology.
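The back-of-the-envelope arithmetic above can be checked directly. The bookkeeping below is our reading of the slide: one 50 x 50 state-interaction matrix per connected pair of elementary units.

```python
modules = 10**6      # elementary units in a brain-sized system
attractors = 50      # attractor states per elementary network
fan_out = 50         # other elementary networks each one connects to

# One state-interaction matrix (attractors x attractors numbers) per link.
connection_numbers = modules * fan_out * attractors * attractors
# connection_numbers is 1.25e11, about the 10^11 figure quoted above.
```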
Modules
Function of Computational (NofN) Modules:
• Simulate local integration: Addition
of inputs from outside, other modules.
• Simulate local dynamics.
• Communications Controller: Handle long
range (i.e. not neighboring)
interactions.
Simpler approximations are possible:
• “Cellular automaton”. (Ignore local
dynamics.)
• Approximations to dynamics.
Topographic Model for
Information
Integration
A Software Example:
Sensor Fusion
A potential application is to sensor fusion. Sensor fusion
means merging information from different sensors into a
unified interpretation.
We were involved in such a project in collaboration with Texas
Instruments and Distributed Data Systems, Inc.
The project was a way to do the de-interleaving problem in
radar signal processing using a neural net.
In a radar environment the problem is to determine how many
radar emitters are present and whom they belong to.
Biologically, this corresponds to the behaviorally important
question, “Who is looking at me?” (To be followed, of
course, by “And what am I going to do about it?”)
Radar
A receiver for radar pulses provides several kinds of
quantitative data:
• frequency,
• intensity,
• pulse width,
• angle of arrival, and
• time of arrival.
The user of the radar system wants to know qualitative
information:
• How many emitters?
• What type are they?
• Who owns them?
• Has a new emitter appeared?
Concepts
The way we solved the problem was by using a
concept forming model from cognitive science.
Concepts are labels for a large class of members
that may differ substantially from each other.
(For example, birds, tables, furniture.)
We built a system where a nonlinear network
developed an attractor structure where each
attractor corresponded to an emitter.
That is, emitters became discrete, valid
concepts.
Human Concepts
One of the most useful computational properties
of human concepts is that they often show a
hierarchical structure.
Examples might be:
animal > bird > canary > Tweetie
or
artifact > motor vehicle > car > Porsche > 911.
A weakness of the radar concept model is that it
did not allow development of these important
hierarchical structures.
Sensor Fusion and Information Integration
with the Ersatz Brain.
We can do simple sensor fusion in the Ersatz
Brain.
The data representation we develop is directly
based on the topographic data representations
used in the brain: topographic computation.
Spatializing the data, that is letting it find a
natural topographic organization that reflects
the relationships between data values, is a
technique of potential utility.
We are working with relationships between values,
not with the values themselves.
Spatializing the problem provides a way of
“programming” a parallel computer.
Topographic Data Representation
Low Values
Medium Values
High Values
••++++••••••••••••••••••••••••••••••••••••••••••••
•••••••••••••••••••••••++++•••••••••••••••••••••••
••••••••••••••••••••••••••••••••••••••••••••++++••
We initially will use a simple bar code to code the
value of a single parameter.
The precision of this coding is low.
But we don’t care about quantitative precision: we
want qualitative analysis.
Brains are good at qualitative analysis, poor at
quantitative analysis. (Traditional computers are
the opposite.)
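A bar code like the ones shown above can be generated with a short sketch. The width and bar length are chosen to match the 50-character examples; the rounding rule is an arbitrary choice.

```python
def bar_code(value, width=50, bar=4):
    """Code a scalar in [0, 1] as a short bar at a proportional position."""
    pos = int(round(value * (width - bar)))
    return "•" * pos + "+" * bar + "•" * (width - bar - pos)

low = bar_code(0.05)     # bar near the left end
medium = bar_code(0.5)   # bar near the middle
high = bar_code(0.95)    # bar near the right end
```

The coding is deliberately coarse: nearby values produce overlapping bars, which is all the qualitative analysis needs.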
For our demo Ersatz Brain program, we will assume we
have four parameters derived from a source.
An “object” is characterized by values of these four
parameters, coded as bar codes on the edges of the
array of CPUs.
We assume local linear transmission of patterns from
module to module.
Demo
Each pair of input patterns gives rise to an
interference pattern, a line perpendicular to the
midpoint of the line between the pair of input
locations.
There are places where three or four features meet
at a module. Geometry determines location.
The higher-level combinations represent relations
between several individual data values in the input
pattern. The combinations have literally fused the
spatial relations of the input data.
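Under the assumption of uniform propagation speed, the meeting places of two patterns can be found by brute force on a small grid: the modules roughly equidistant from both sources. The grid size and tolerance below are arbitrary sketch values.

```python
import numpy as np

def meeting_modules(p1, p2, size=16, tol=0.5):
    """Modules where patterns from sources p1 and p2 (x, y) arrive together:
    grid cells roughly equidistant from both sources, i.e. the
    perpendicular bisector of the segment joining them."""
    ys, xs = np.mgrid[0:size, 0:size]
    d1 = np.hypot(xs - p1[0], ys - p1[1])
    d2 = np.hypot(xs - p2[0], ys - p2[1])
    return np.argwhere(np.abs(d1 - d2) < tol)  # rows of (y, x) pairs

hits = meeting_modules((0, 4), (0, 12))   # two inputs on the left edge
```

For two sources on the same edge the hits trace out the horizontal line midway between them, as the slide describes.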
Formation of Hierarchical Concepts.
This approach allows the formation of what look like
hierarchical concept representations.
Suppose we have three parameter values that are fixed for
each object and one value that varies widely from
example to example.
The system develops two different types of spatial data.
In the first, some high order feature combinations are
fixed since the three fixed input (core) patterns never
change.
In the second there is a varying set of feature
combinations corresponding to the details of each
specific example of the object.
The specific examples all contain the common core pattern.
Core Representation
The group of coincidences in the center of the
array is due to the three input values arranged
around the left, top and bottom edges.
At left are two examples where there is a
different value on the right side of the array.
Note the common core pattern (above).
Development of A “Hierarchy” Through
Spatial Localization.
The coincidences due to the core (three values)
and to the examples (all four values) are
spatially separated.
We can use the core as a representation of the
examples since it is present in all of them.
It acts as the higher level in a simple
hierarchy: all examples contain the core.
Key Point: This approach is based on
relationships between parameter values
and not on the values themselves.
Relationships are Valuable
Consider:
Which pair is most similar?
Experimental Results
One pair has high physical similarity to the initial
stimulus, that is, one half of the figure is
identical.
The other pair has high relational similarity, that
is, they form a pair of identical figures.
Adults tend to choose relational similarity.
Children tend to choose physical similarity.
However, it is easy to bias adults and children
toward either relational or physical similarity.
Potentially a very flexible and programmable
system.
Filtering Using
Topographical
Representations
Now, show how to use these ideas to do
something (perhaps) useful. …
The Problem
• Develop a topographic data
representation inspired by the
perceptual invariances seen in human
speech.
• Look at problems analyzing vowels in a
speech signal as an example of an
important class of signals.
• First in a series of demonstrations
using the topography of data
representations to do useful
computation.
Speech Signal Basics
Vowels are long duration and often stable.
• But still hard to analyze correctly.
• Problems: different speakers, accents, high
variability, diphthongs, similarity between
vowels, context effects, gender.
• The acoustic signals from a vowel are dominated
by the resonances of the vocal tract, called
formants.
• We are interested in using this problem as a
test case.
• Show difficulties of biological signal
processing.
• But: Important signal types, brains very good
with this type of data.
Vowel Processing
• Vocal tracts come in different sizes: men,
women, children, Alvin the Chipmunk.
• Resonant peaks change their frequency as a
function of vocal tract length.
• This frequency shift can be substantial.
• But: causes little problem for human speech
perception.
• An important perceptual feature for phoneme
recognition seems to be the ratios between the
formant frequencies, not just absolute values
of frequency.
• How can we make a system respond to ratios?
Power Spectrum of a Steady
State Vowel
Sound Spectrogram: Male American
Words: heed, hid, head, had, hod, hawed, hood, who’d
From: P Ladefoged (2000), A Course in Phonetics, 4th Edition, Heinle
Sound Spectrogram: Female American
Words: heed, hid, head, had, hod, hawed, hood, who’d
From: P Ladefoged (2000), A Course in Phonetics, 4th Edition, Heinle
Average Formant Frequencies for
Men, Women and Children.
Values are in Hz; numbers in parentheses are the ratio to the women’s value.

F1      Men          Women        Children
[i]     267 (0.86)   310 (1.00)   360 (1.16)
[æ]     664 (0.77)   863 (1.00)   1017 (1.18)
[u]     307 (0.81)   378 (1.00)   432 (1.14)

F2      Men          Women        Children
[i]     2294 (0.82)  2783 (1.00)  3178 (1.14)
[æ]     1727 (0.84)  2049 (1.00)  2334 (1.14)
[u]     876 (0.91)   961 (1.00)   1193 (1.24)

F3      Men          Women        Children
[i]     2937 (0.89)  3312 (1.00)  3763 (1.14)
[æ]     2420 (0.85)  2832 (1.00)  3336 (1.18)
[u]     2239 (0.84)  2666 (1.00)  3250 (1.21)
Data taken from Watrous (1991) derived originally from
Peterson and Barney (1952).
Ratios Between Formant Frequencies
for Men, Women and Children.

F1/F2   Men    Women  Children
[i]     0.12   0.11   0.11
[æ]     0.38   0.42   0.43
[u]     0.35   0.39   0.36

F2/F3   Men    Women  Children
[i]     0.78   0.84   0.84
[æ]     0.71   0.72   0.70
[u]     0.39   0.36   0.37
Data taken from Watrous (1991) derived
originally from Peterson and Barney
(1952).
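The near-constancy of the ratios can be checked directly from the table values; the sketch below uses the [u] row as an example.

```python
# F1 and F2 (Hz) for the vowel [u], read from the tables above.
f1 = {"men": 307.0, "women": 378.0, "children": 432.0}
f2 = {"men": 876.0, "women": 961.0, "children": 1193.0}

# Absolute frequencies vary by roughly 40% across speaker groups...
f1_spread = (max(f1.values()) - min(f1.values())) / min(f1.values())

# ...but the F1/F2 ratio barely moves.
ratios = {who: f1[who] / f2[who] for who in f1}
ratio_spread = max(ratios.values()) - min(ratios.values())
```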
Other Representation Issues
• There is a roughly logarithmic spatial mapping
of frequency onto the surface of auditory
cortex.
• Sometimes called a tonotopic mapping.
• Logarithmic coding of a parameter changes
multiplication by a constant into the addition
of a constant.
• A logarithmic spatial coding therefore moves
every parameter multiplied by the same constant
through the same distance on the map.
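This multiplication-becomes-translation property takes two lines to verify; the frequencies and the scale factor c below are arbitrary example values.

```python
import numpy as np

freqs = np.array([300.0, 1200.0, 2400.0])   # three frequency components (Hz)
c = 1.3                                     # a vocal-tract scaling factor

# On a linear map, multiplying by c moves each point a different distance...
linear_shift = c * freqs - freqs
# ...but on a logarithmic map every point moves the same amount, log(c).
log_shift = np.log(c * freqs) - np.log(freqs)
```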
Spatial Coding of Frequency
Three data points on a map of frequency.
Multiply by ‘c’: the distance moved on the map
varies from point to point.
Now suppose we use the log of the data value
and scale by ‘c’: each point moves the same
amount, D.
Multiple
Maps
fMRI-derived maps in human auditory cortex.
Note at least five, probably six, maps.
Some are joined at the high-frequency end and
some at the low-frequency end.
(Figure 6 from Talavage, et al., p. 1290)
Representational Filtering
Our computational goal:
Enhance the representation of ratios between
formant frequencies
De-emphasize the exact values of those
frequencies.
We wish to make a filter, using the data
representation, that responds to one aspect of
the input data.
We suggest that brain-like computers can make use
of this strategy.
Use the Information
Integration Architecture
• Assume the information integration square array
of modules, with parameters fed in from the
edges.
• Map of frequency along an edge.
• Assume formant frequencies are precise points.
(Actually they are somewhat broad.)
• We start by duplicating the frequency
representation along the edges of a square.
Simple Topographic System
To Represent Relationships
Simplest system: two
opposing maps of
frequency.
Look at points equally
distant between f1 on
one map and f2 on the
other.
Shift the frequencies by
a constant amount, D.
The point of equal
distance between the new
frequencies (f1+D) and
(f2+D) does not move.
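The center-line invariance can be sketched in one dimension, assuming two opposed logarithmic maps of length L (so the shift D corresponds to multiplying both frequencies by a constant). All values are illustrative.

```python
import numpy as np

L = 10.0  # length of the one-dimensional map (arbitrary units)

def center_point(f1, f2):
    """Point equally distant between f1 on a forward log map and f2 on the
    opposing, reversed log map; it depends only on the ratio f1/f2."""
    p1 = np.log(f1)        # position of f1 on the forward map
    p2 = L - np.log(f2)    # position of f2 on the reversed map
    return (p1 + p2) / 2

m = center_point(500.0, 1500.0)
m_scaled = center_point(1.3 * 500.0, 1.3 * 1500.0)  # same ratio of f1 to f2
```

Scaling both frequencies leaves the equidistant point fixed; changing the ratio moves it.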
Problems
• Unfortunately, this desirable
invariance property only holds on the
center line.
• Two points determine a line, not a
point. There are many equidistant
points.
• What happens off the center line is
more complex.
• Still interesting, but a triple
equidistant coincidence would be much
more stable.
Three Parameter Coincidences
• Assume we are interested in the more complex
system where three frequency components come
together at a single module at the same time.
• We conjecture the target module may form a new
internal representation corresponding to this
triple coincidence.
• Assume uniform transmission speed between
modules.
• Then we look for module locations equidistant
from the locations of triple sets of
frequencies.
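Under the uniform-speed assumption, the triple-coincidence modules can be found by brute force: grid cells roughly equidistant from all three sources. Grid size, tolerance, and source positions below are sketch values.

```python
import numpy as np

def triple_coincidences(points, size=32, tol=0.5):
    """Grid modules roughly equidistant from all three (x, y) sources,
    assuming uniform transmission speed between modules."""
    ys, xs = np.mgrid[0:size, 0:size]
    d = [np.hypot(xs - x, ys - y) for x, y in points]
    spread = np.maximum.reduce(d) - np.minimum.reduce(d)
    return np.argwhere(spread < tol)  # rows of (y, x) pairs

# Three sources on the edges of the array meet at a single module.
hits = triple_coincidences([(0, 10), (10, 0), (20, 10)])
```

Unlike the pairwise case, which produces a whole line of coincidences, three sources pick out essentially one module, which is why the triple coincidence is the more stable landmark.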
Triple Coincidences
Construction
The location of a triple
coincidence is a
function of
• the ratios of the f’s, and
• the values of the f’s.
A careful parametric
study has not yet
been done.
But note: the representation
now mixes frequency and
ratios.
Data Representation
Multiple triple coincidence
locations are present.
Depending on the triple,
different modules are
activated.
A three-“formant” system has
six locations corresponding to
possible triples.
If we shift the frequency by
an amount D (multiplication by
a constant!), the location of
the triple shifts slightly.
Two Different
Stimuli:
Selectivity of
Representation
The geometry of the triple
coincidence points varies
with the location of the
inputs along the edges.
A different set of
frequencies will give rise to
a different set of triple
coincidences.
Representation is selective.
Robust Data
Representation
The system is robust.
Changes in the shape of the
maps do not affect the
qualitative results.
Different spatial data
arrangements work nicely.
Changes in geometry have
possibilities for
computation.
The non-square arrangement
spreads out the triple
coincidence points along
the vertical axis.
Module Assemblies
Representation of a vowel is composed of multiple
triple coincidences (multiple active modules).
But since information can move laterally,
closed loops of activity become possible.
Idea proposed before: Hebb cell assemblies were
self exciting neural loops. Corresponded to
cognitive entities: concepts.
Hebb’s cell assemblies were hard to make work
because of the use of scalar interconnected units.
We have pattern sensitive interconnections.
Module assemblies may become a powerful feature of
the Network of Networks approach.
See if we can integrate relatively dense local
connections to form module assemblies.
Loops
If the modules are
simultaneously active the
pairwise associations forming
the loop abcda can be learned
through simple Hebb learning.
The path closes on itself.
Consider a. After traversing
the linked path a>b>c>d>a, the
pattern arriving at a around
the loop is a constant times
the pattern on a.
If the constant is positive
there is the potential for
positive feedback if the total
loop gain is greater than one.
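The loop-gain argument can be caricatured in one dimension, with clipping standing in for the BSB box nonlinearity. The gain values and starting activity are illustrative.

```python
import numpy as np

def loop_response(gain, steps=20, x0=0.1):
    """Activity on module a after repeated trips around the loop a>b>c>d>a,
    with clipping standing in for the BSB box nonlinearity."""
    x = x0
    for _ in range(steps):
        x = np.clip(gain * x, -1.0, 1.0)
    return float(x)

sustained = loop_response(1.5)   # total loop gain > 1: activity self-sustains
decayed = loop_response(0.5)     # total loop gain < 1: activity dies away
```

With gain above one the loop saturates and holds its activity (a candidate module assembly); below one the activity fades away.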
Formation of
Module
Assemblies
A single frequency pattern
will give rise to multiple
triple coincidences.
Speculation: Assume a
module assembly mechanism:
Simultaneous activation can
associate the active
regions together for a
particular pattern.
Two different patterns can
give rise to different
module assemblies.
Provocative Neurobiology
• The behavior of the active regions under
transformations (i.e. multiplication by a
constant) has similarity to one of Tanaka’s
observations.
• Tanaka shows an intrinsic imaging response in
inferotemporal cortex to the image of a model
head.
• As the head rotates there is a gradual shift of
the columnar-sized region.
• The total movement for 180 degree rotation is
about 1 mm (three or four columns).
• The shift seems to be smooth with rotation.
• Tanaka was sufficiently impressed with this
result to modify his columnar model.
Rotating Face Representation in
Inferotemporal Cortex
(from Tanaka, 2003)
Revised Tanaka
Columnar Model
for
Inferotemporal
Cortex
Theme, Variations, Transformations
• Speculation: Cortical processing involving
common continuous transformations may be
working on a “theme and variations” principle.
• There are an infinite number of possible
transformations.
• But the most common ones seem to be topographically
represented by a small, physically contiguous
range of locations on the surface of cortex.
• By far the most common transformation for a
head would be rotation around the vertical axis
of the head caused by different viewing angles.
Potential Value
This is an example of an approach to signal
processing for biological and cognitive
signals.
Many important problems: for example, vision,
speech, even much cognition and information
integration.
Potentially interesting aspects to algorithm.
• Largely parallel
• Conjecture: Should be robust.
• Conjecture: May be able to handle common
important transformations.
• Speculation: May put information in useful form
for later cognitive processing.
• Speculation: If many small active areas
(modules) is the right form for output, then
this technique may work.
Potential Value (2)
To be done:
• Develop general rules for topographic
geometries
• Are the filter characteristics good?
Over what range of values?
• Example: Could we develop a “pure”
stable ratio filter? Right now, mixed.
• Since we are assuming traveling waves
underlying this model, what are the
temporal dynamics?
• Does it work for real data?
Conclusions: Representation
• Topographic maps of the type we suggest
can do information processing.
• They can act like filters, enhancing
some aspects of the input pattern and
suppressing others.
• Here, enhancing ratios of frequency
components and suppressing absolute
frequency values.
• Speculation: Their behavior may have
some similarities to effects seen in
cortex.
Sparse Neural Systems:
The Ersatz Brain gets
Thinner
Layers
• Up to now we have emphasized local, lateral
interactions between cells and cortical
columns.
• But there are also long range projections in
cortex where one large group of cells projects
to another one some distance away.
• Traditional neural net processing is built
around these projection systems and has little
lateral interaction.
• They usually assume full connectivity between
layers.
• Is this correct?
Neural Network Systems
Standard neural
network is
formed using:
• multiple layers
• projections
between layers.
A Fully Connected Network
Most neural nets
assume full
connectivity between
layers.
A fully connected
neural net uses lots
of connections!
Limitation 1: Sparse Connectivity
We believe that the computational strategy used by the
brain is strongly determined by severe hardware
limitations.
Example: The brain is sparsely connected. Fractional
connectivity of the brain is very low: 0.001%.
Implications:
• Connections are expensive biologically since they
take up space, use energy, and are hard to wire up
correctly.
• Connections are valuable.
• The pattern of connection is under tight control.
• Short local connections are cheaper than long ones.
But many long projections do exist and are very
important.
Limitation 2: Sparse Coding
In sparse coding only a few active units
represent an event.
“In recent years a combination of experimental, computational, and
theoretical studies have pointed to the existence of a common
underlying principle involved in sensory information processing,
namely that information is represented by a relatively small number
of simultaneously active neurons out of a large population,
commonly referred to as ‘sparse coding.’” (p. 481)
B.A. Olshausen and D.J. Field (2004). Sparse coding of sensory
inputs. Current Opinion in Neurobiology, 14, 481-487.
Advantages of Sparse Coding
There are numerous advantages to sparse coding.
Sparse coding
• increases storage capacity in associative memories
• is easy to work with computationally
• is very fast (few or no network interactions)
• is energy efficient
Best of all:
It seems to exist!
Higher levels (further from sensory inputs) show
sparser coding than lower levels.
Inferotemporal cortex seems to be more selective,
less spontaneously active than primary areas (V1).
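The storage and crosstalk claims above can be checked in a toy simulation. The sketch below (sizes, threshold, and pattern counts are invented for illustration, not taken from the project) stores 50 pairs of sparse binary patterns in an outer-product Hebb associative memory; thresholded recall succeeds because sparse patterns barely overlap, so crosstalk stays below threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, PAIRS = 1000, 10, 50   # units, active units per pattern, stored pairs

def sparse_pattern():
    """Binary vector with only K of N units active (sparse coding)."""
    v = np.zeros(N)
    v[rng.choice(N, K, replace=False)] = 1.0
    return v

inputs = [sparse_pattern() for _ in range(PAIRS)]
outputs = [sparse_pattern() for _ in range(PAIRS)]

# Outer-product Hebb learning: W accumulates g f^T for each stored pair.
W = np.zeros((N, N))
for f, g in zip(inputs, outputs):
    W += np.outer(g, f)

# Recall: threshold the matrix-vector product at K. Units of the target
# pattern receive at least K from the stored pair itself; crosstalk from
# the other pairs is far smaller because the sparse patterns barely overlap.
recalled = (W @ inputs[0] >= K).astype(float)
print(np.array_equal(recalled, outputs[0]))
```

Note the speed advantage as well: recall is a single matrix-vector product and a threshold, with no iterative network interactions.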
Sparse Connectivity + Sparse
Coding
See if we can make a learning
system that starts from the
assumption of both
•sparse connectivity and
•sparse coding.
If we use simple neural net
units it doesn’t work so well.
But if we use our Network of
Networks approximation, it
works better and makes some
interesting predictions.
The Simplest Connection
The simplest sparse
system has a single
active unit connecting
to a single active
unit.
If the potential
connection does exist,
simple outer-product
Hebb learning can learn
it easily.
Not interesting.
Paths
A useful notion in sparse
systems is the idea of a
path.
A path connects a sparsely
coded input unit with a
sparsely coded output unit.
Paths have strengths just as
connections do.
Strengths are based on the
entire path, from input to
output, which may involve
intermediate connections.
It is easy for Hebb synaptic
learning to learn paths.
Common Parts of a Path
One of many problems.
Suppose there is a
common portion of a
path for two single
active unit
associations,
a with d (a>b>c>d) and
e with f (e>b>c>f).
We cannot easily weaken
or strengthen the
common part of the path
(b>c) because it is
used in multiple
associations.
Interference occurs.
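The interference can be seen in a toy scalar model. Assuming, purely for illustration, that a path's strength is the product of its scalar link strengths, weakening the shared link b>c to adjust the a>b>c>d association unavoidably weakens e>b>c>f as well:

```python
from math import prod

# Scalar link strengths; link b>c is shared by both paths.
links = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0,
         ("e", "b"): 1.0, ("c", "f"): 1.0}

def path_strength(path):
    """Strength of a path = product of its scalar link strengths."""
    return prod(links[(u, v)] for u, v in zip(path, path[1:]))

before = path_strength("ebcf")
links[("b", "c")] *= 0.5       # try to weaken only the a>b>c>d association
after = path_strength("ebcf")
print(before, after)           # the e>b>c>f association was weakened too
```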
Make Many, Many Paths!
Some speculations: If independent paths are
desirable, an initial construction bias would be to
make available as many potential paths as possible.
In a fully connected system, adding more units than
contained in the input and output layers would be
redundant.
They would add no additional processing power.
Obviously not so in sparse systems!
Fact: There is a huge expansion in number of units
going from retina to thalamus to cortex.
In V1, a million input fibers drive 200 million V1
neurons.
Network of Networks Approximation
Single units do not work so
well in sparse systems.
Let us use our Network of
Networks approximation and
see if we can do better.
Network of Networks: the
basic computing units are
not neurons, but small
(10^4 neurons) attractor
networks.
Basic Network of Networks
Architecture:
• 2 Dimensional array of
modules
• Locally connected to
neighbors
Interactions between Modules
Interactions between modules are vector in nature
not simple scalar activity.
Interactions between modules are described by state
interaction matrices instead of simple scalar
weights.
Gain greater path selectivity this way.
Feedforward, Feedback
•Emphasize: Cortex is not a simple
feedforward system moving “upward” from
layer to layer. (Input to output)
•It has massive connections backwards
from layer to layer, at least as dense
as the forward connections.
•There is not a simple processing
hierarchy!
Columns and Their Connections
Columnar organization is
maintained in both forward
and backward projections
“The anatomical column acts as a
functionally tuned unit and point of
information collation from laterally offset
regions and feedback pathways.” (p. 12)
“… feedback projections from extra-striate
cortex target the clusters of neurons that
provide feedforward projections to the same
extra-striate site. … .” (p. 22).
Lund, Angelucci and Bressloff (2003).
Cerebral Cortex, 13, 15-24.
Sparse Network of Networks
Return to the simplest
situation for layers:
Modules a and b can display
two orthogonal patterns, A
and C on a and B and D on b.
The same pathways can learn
to associate A with B and C
with D.
Path selectivity can
overcome the limitations of
scalar systems.
Paths are both upward and
downward.
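A minimal sketch of this path selectivity (patterns and sizes are invented for illustration): a single matrix coupling between modules a and b, learned by summed outer products, carries both associations at once because the patterns are orthogonal. A scalar weight could only scale one signal.

```python
import numpy as np

# Orthogonal patterns: A and C on module a, B and D on module b
# (values chosen only for illustration).
A = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
C = np.array([0.0, 0.0, 1.0, 1.0]) / np.sqrt(2)
B = np.array([1.0, 0.0, 1.0, 0.0]) / np.sqrt(2)
D = np.array([0.0, 1.0, 0.0, -1.0]) / np.sqrt(2)

# One matrix coupling on the single a-b pathway learns both pairs
# by summed outer-product Hebb learning.
W = np.outer(B, A) + np.outer(D, C)

# Because A and C are orthogonal, each input evokes only its partner.
print(np.allclose(W @ A, B), np.allclose(W @ C, D))
```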
Common Paths Revisited
Consider the common path
situation again.
We want to associate patterns
on two paths, a-b-c-d and e-b-c-f, with link b-c in common.
Parts of the path are
physically common but they can
be functionally separated if
they use different patterns.
Pattern information
propagating forwards and
backwards can sharpen and
strengthen specific paths
without interfering with the
strengths of other paths.
Associative Learning along
a Path
Just stringing together simple associators works:
For module b:
Change in coupling term between a and b: Δ(S_ab) = η b a^T
Change in coupling term between c and b: Δ(T_cb) = η b c^T
For module c:
Change in coupling term between d and c: Δ(U_dc) = η c d^T
Change in coupling term between b and c: Δ(T_bc) = η c b^T
If pattern a is presented at layer 1, then:
Pattern on d = (U_cd)(T_bc)(S_ab) a
             = η^3 (d c^T)(c b^T)(b a^T) a
             = (constant) d
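The chain of associators can be checked numerically. In this sketch (the learning rate eta and pattern size are arbitrary choices) the three forward couplings are built as scaled outer products, and presenting a yields eta^3 times the pattern d, matching the derivation above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Unit-length random patterns on the four modules along the path a>b>c>d.
a, b, c, d = (v / np.linalg.norm(v) for v in rng.standard_normal((4, 16)))

eta = 0.5                     # learning rate (arbitrary choice)
S_ab = eta * np.outer(b, a)   # coupling a -> b: eta * b a^T
T_bc = eta * np.outer(c, b)   # coupling b -> c: eta * c b^T
U_cd = eta * np.outer(d, c)   # coupling c -> d: eta * d c^T

# Each stage passes on the next module's pattern scaled by eta,
# since (a^T a) = (b^T b) = (c^T c) = 1 for unit-length patterns.
out = U_cd @ T_bc @ S_ab @ a
print(np.allclose(out, eta**3 * d))
```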
Module Assemblies
Because information moves backward, forward, and
sideways, closed loops are possible and likely.
Tried before: Hebb cell assemblies were self-exciting
neural loops. Corresponded to cognitive
entities: for example, concepts.
Hebb’s cell assemblies are hard to make work
because of the use of scalar interconnected units.
But module assemblies can become a powerful feature
of the sparse approach.
We have more selective connections.
See if we can integrate relatively dense local
connections with relatively sparse projections to
and from other layers to form module assemblies.
Biological Evidence:
Columnar Organization in IT
Tanaka (2003)
suggests a columnar
organization of
different response
classes in primate
inferotemporal
cortex.
There seems to be
some internal
structure in these
regions: for
example, spatial
representation of
orientation of the
image in the
column.
Columns: Inferotemporal Cortex
Responses of a region
of IT to complex
images involve
discrete columns.
The response to a
picture of a fire
extinguisher shows
how regions of
activity are
determined.
Boundaries are drawn
where the activity
falls to half its
peak value.
Note: some spots are
roughly equally
spaced.
Active IT Regions for a Complex Stimulus
Note the large number of roughly equally distant
spots (2 mm) for a familiar complex image.
Intralayer Connections
Intralayer connections are sufficiently dense so that
active modules a little distance apart can become
associatively linked.
Recurrent collaterals of cortical pyramidal cells
form relatively dense projections around a pyramidal
cell. The extent of lateral spread of recurrent
collaterals in cortex seems to be over a circle of
roughly 3 mm diameter.
If we assume that:
•A column is roughly a third of a mm,
•There are roughly 10 columns in a square mm.
•A 3 mm diameter circle has an area of roughly 10
square mm,
A column projects locally to about 100 other columns.
Loops
If the modules are
simultaneously active the
pairwise associations forming
the loop abcda can be learned
through simple Hebb learning.
The path closes on itself.
Consider a. After traversing
the linked path a>b>c>d>a, the
pattern arriving at a around
the loop is a constant times
the pattern on a.
If the constant is positive
there is the potential for
positive feedback if the total
loop gain is greater than one.
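A toy simulation of the loop (gains and pattern sizes are illustrative, not from the project): each link is a Hebb outer product scaled by a gain g, so one full traversal multiplies the pattern on a by g^4, and activity grows when the total loop gain exceeds one and decays otherwise.

```python
import numpy as np

rng = np.random.default_rng(2)
# Unit-length patterns on modules a, b, c, d (sizes invented).
pats = [v / np.linalg.norm(v) for v in rng.standard_normal((4, 8))]

def run_loop(g, traversals=10):
    """Circulate module a's pattern around the learned loop a>b>c>d>a.

    Each coupling is a Hebb outer product scaled by gain g, so one full
    traversal multiplies the pattern on a by g**4 (the total loop gain).
    """
    couplings = [g * np.outer(pats[(i + 1) % 4], pats[i]) for i in range(4)]
    x = pats[0]
    for _ in range(traversals):
        for W in couplings:
            x = W @ x
    return np.linalg.norm(x)

print(run_loop(1.2) > 1.0)   # total loop gain 1.2**4 > 1: activity grows
print(run_loop(0.8) < 1.0)   # total loop gain 0.8**4 < 1: activity decays
```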
Loops with Common Modules
Loops can be kept separate
even with common modules.
If the b pattern is
different in the two loops,
there is no problem. The
selectivity of links will
keep activities separate.
Activity from one loop will
not spread into the other
(unlike Hebb cell
assemblies).
If b is identical in the two loops, b is ambiguous. There is
no a priori reason to activate Loop 1, Loop 2, or both.
Selective loop activation is still possible, though it
requires additional assumptions to accomplish.
Richly Connected Loops
More complex connection
patterns are possible.
Richer interconnection
patterns might have all
connections learned.
Ambiguous module b will
receive input from d as well
as a and c.
A larger context would allow
better loop disambiguation
by increasing the coupling
strength of modules.
Working Together
Putting it All Together:
Sparse interlayer connections
and dense intralayer
connections work together.
Once a coupled module
assembly is formed, it can be
linked to by other layers.
The result is a dynamic,
adaptive computational
architecture that is both
workable and interesting.
Two Parts …
Suppose we have two
such assemblies
that co-occur
frequently.
Parts of an object
say …
Make a Whole!
As learning continues:
Groups of module
assemblies bind
together through Hebb
associative learning.
The small assemblies
can act as the “sub-symbolic”
substrate of cognition and
the larger assemblies, as
symbols and concepts.
Note the many new
interconnections.
Conclusion (1)
• The binding process looks like
compositionality.
• The virtues of compositionality are
well known.
• It is a powerful and flexible way to
build cognitive information processing
systems.
• Complex mental and cognitive objects
can be built from previously
constructed, statistically well-designed
pieces (like cognitive Legos).
Conclusion (2)
• We are suggesting here a possible model
for the dynamics and learning in a
compositional-like system.
• It is built based on constraints
derived from connectivity, learning,
and dynamics and not as a way to do
optimal information processing.
• Perhaps this property of cognitive
systems is more like a splendid bug fix
than a well chosen computational
strategy.
• Sparseness is an idea worth pursuing.
• May be a way to organize and teach a
cognitive computer.
Conclusions
Speculation: Perhaps digital computers and humans
(and brain-like computers??) are evolving
toward a complementary relationship.
• Each computational style has its virtues:
– Humans (and brain-like computers??): show
flexibility, estimation, connection to the
physical world
– Digital Computers: show speed, logic,
accuracy.
• Both styles of computation are valuable. There
is a place for both.
• But their hardware is so different that
brain-like coprocessors make sense.
• As always, software will be more difficult
to build and understand than hardware.