The Brain's Capability for Language
Language-readiness: the capacity to acquire and use language.
Biological Evolution: The processes of genetic selection that yielded a
language-ready brain
Cultural Evolution: The processes of non-biological, social selection
whereby a variety of languages arose and “cross-pollinated”.
“Resetting the null hypothesis”, we claim that being “language-ready” does not imply “having language”, and that many of the
features of language are the product of cultural evolution.
 The Mirror System Hypothesis: An account of how and why the
human brain differs from that of other primates to make humans
“language ready”.

Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
Social Evolution
Ease of acquisition of a skill does not imply genetic encoding of the skill
per se:
 Surfing the Web and playing video games:
 computer technology has evolved to match the preadaptations of the human
brain and body.
Social evolution can result from both biological and cultural (non-genetic) evolution. Clearly, language enables an immense amplification
of the second factor.
Human evolution saw the co-evolution of increasingly complex social
structures and of increasingly complex patterns of behavior and
communication to serve those social interactions.
Gamble in Timewalkers views human evolution in terms of preadaptation
for global colonization, with language one of many relevant traits in that
evolution.
He emphasizes the relation of humans to other species in the same
environment.
Basic Questions
a. What were those features of the human brain that pre-adapted us for human
language, i.e., made the human brain “language ready”?
b. What aspects of human language were already present in the earliest humans of
200,000 years ago?
c. How can this perspective on language evolution help explain why language looks
the way it does today?
d. What has been the interplay of biological inheritance and cultural evolution since
the emergence of Homo sapiens?
 Subsidiary debate: How can we best describe cultural evolution in a way which
reflects both its dependence on that biological inheritance and the vast variety
behavior exhibits across different cultures?
Dynamics of language on multiple time-scales: How can the study of language
acquisition (a minor focus of the course) and of historical linguistics (at most a side
topic) help tease apart biological and cultural contributions to the mastery of
language by present-day humans?

Biological Basis of Human Evolution
Background factors for our work: Homo sapiens has:
 Bipedality
 Manual dexterity
 Larynx well-suited for vocal production
We will stress (The Mirror System Hypothesis):
Ability to relate the actions of others to one’s own actions
Key issue beyond the mirror:
 Imitation: Ability to generate and comprehend hierarchical
structures “on the fly”.
Ability to rapidly acquire a vast array of flexible strategies for
pragmatic and communicative action.
The Ancestral Communication System
For want of better data, we assume that our common human-monkey ancestors
shared with monkeys the following:
Primate Call System
a limited set of species-specific calls
Oro-Facial Gesture System
a limited set of gestures expressive of
emotion and related social indicators
Note the linkage between the two systems: communication is inherently multi-modal.
Combinatorial properties for the openness of communication are virtually absent in
primate calls and oro-facial communication
A Challenge for Our Account of Brain Evolution
The neural substrate for primate calls is in a region of monkey
cingulate cortex
For most humans, language is heavily intertwined with speech
Why then is the cingulate area, already involved in monkey
vocalization, not homologous to the substrate for language
in Broca's area?
Evolution of Perceptual Systems
Primates have evolved general-purpose sensory mechanisms which are not
restricted to providing innate releasing mechanisms for specific fixed action
patterns.
 We do not require that all sensory processes be linked to/preceded by
motor activity.
 Yet our need to understand the integrated role of speaker/signer and
hearer/comprehender links us back to the motor system.
This does not require a strict motor theory of perception, but we still need to
specify how the mirror system is reflected into a multi-level language system.
Contrast sorting phonemes without meaning (tuning the perceptual system) vs.
recognizing the meaning of "milk" - integrating the action of drinking with the
sensory experience (sight, taste) and reinforcement (reduction of thirst and
hunger) that go with it.
Selection and Binding
Whether we are planning an action or creating a sentence we
are choosing a subset of objects, states of action, and
relationships from a complex (perceived, remembered, or
imagined) scene.
How is the selection of those objects, actions and
relationships, and the binding of distributed representations of
each of these, neurally represented as an integrated subset of
a greater integrated whole?
The Minimal Subscene as a Meeting Ground for
Action, Action Recognition and Language
Explore the interaction between focal visual attention and the recognition of objects and actions
to better understand how humans perceive and act upon complex dynamic scenes and how such
perception and action are linked to language.
Key Concept: "Minimal subscenes" in which an object is linked to another object or two via
some action.
Hypothesis: These ground the basic sentences containing a verb together with noun phrases
which express such a subscene.
Modeling Component:
 Integrated neural models for action recognition, visual attention, and minimal subscene
representation to provide dynamic scene understanding integrated with scene description and
question answering.
Experimental Component:
 Visual psychophysics experiments to test human performance on minimal
subscene description and recognition, question answering, and related attentional
processes.
 fMRI experiments to constrain our analysis of how the interactions among our
model components should be developed.
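As a purely illustrative sketch of the Key Concept above, a minimal subscene can be written as a small data structure: one action linking an object to another object or two, from which a basic verb-plus-noun-phrase sentence is read off. The class name, field names, and example scene are hypothetical and are not part of the models described on this slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MinimalSubscene:
    """Hypothetical container: an action linking an object to one or two others."""
    agent: str                            # noun phrase for the acting object
    action: str                           # verb expressing the action
    target: str                           # noun phrase for the object acted upon
    second_target: Optional[str] = None   # optional third participant
    relation: str = "with"                # assumed preposition for the third participant

    def describe(self) -> str:
        """Read off a basic sentence: a verb together with its noun phrases."""
        parts = [self.agent, self.action, self.target]
        if self.second_target is not None:
            parts += [self.relation, self.second_target]
        return " ".join(parts) + "."


two_objects = MinimalSubscene("the monkey", "grasps", "the raisin")
three_objects = MinimalSubscene("the man", "hits", "the nail", "the hammer")
print(two_objects.describe())
print(three_objects.describe())
```

The point of the sketch is only that the same structure serves both scene description (calling `describe`) and question answering (inspecting the fields); how the brain selects and binds these elements is exactly what the modeling and experimental components are meant to address.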

Broca’s and Wernicke’s Aphasias
Paul Broca (1865): Broca's aphasia is characterized by nonfluent speech,
few words, short sentences, and many pauses. The words that the
patient can produce come with great effort and often sound distorted.
The melodic intonation is flat and monopitched. This gives the speech
the general appearance of a telegraphic nature, because of the deletion
of functor words and disturbances in word order. However, aural
comprehension for conversational speech is relatively intact. There is
often an accompanying right hemiparesis involving face, arm, and leg.
Carl Wernicke (1874): Wernicke’s aphasia is known as a fluent aphasia
because the patient does not appear to have any difficulty articulating
speech, but may be paraphasic. However, comprehension of speech is
impaired and sometimes even single words are not comprehended. The
patient may even speak in a meaningless “neologistic” jargon, devoid of
any content but with free use of verb tenses, clauses, and subordinates.
Broca’s and Wernicke’s Aphasias
Warning: Localization of Aphasias is HIGHLY Variable
[Figure: Wernicke’s original drawing (wrong hemisphere!), with Wernicke’s region labeled “Perception” and Broca’s region labeled “Production”]
[Figure: MRI scans marking Broca’s Area (negative image) and Wernicke’s Area. Slice viewed from below, so “right” is left]
MRI scans from Keith A. Johnson, M.D. and J. Alex Becker, The Whole Brain Atlas, http://www.med.harvard.edu./AANLIB/home.html
The Brain is Language-Ready
Versus
Language is Innate in the Brain
The central role of language-readiness is a key tenet of our
approach to the brain mechanisms of language
Brain Mechanisms for Language
We say:
 Broca’s and Wernicke’s areas are key components of the human
brain’s mechanisms for language
but suggest this is shorthand for:
 Broca’s and Wernicke’s areas evolved biologically to support a
variety of mechanisms linking the production and perception of
complex behaviors (e.g., those involved in imitation)
they were so pre-adapted that when humans evolved language
culturally these areas ‘self-organized’ in response to a language-rich
environment to support language production and perception
rather than:
 They evolved to encode the syntax of language per se
Imaging the Human Brain
Measuring rCBF - regional Cerebral Blood Flow - to get a
measure of how activity in a brain region differs from task to
task:
PET: Positron Emission Tomography
fMRI: functional Magnetic Resonance Imaging
A High Level View of Human Brain Activity
Left: A recent positron emission tomography (PET) study found that regional cerebral blood
flow (rCBF) in the cerebellum is correlated with the accuracy of sensory prediction.
 Right: Higher cortical regions, in particular the right dorsolateral prefrontal cortex (DLPFC;
Brodmann area 9/46), come into play when there is a conscious conflict between intentions
and their consequences.

From the review “From the Perception of Action to the Understanding of Intention” by Sarah-Jayne Blakemore and Jean
Decety, Nature Reviews Neuroscience 2001 2:561 et seq.
Beyond Boxology
Not

Looking at the brain one large region at a time
But

Working down through the levels:
 How do brain regions compete and cooperate to yield overall
behavior (including changes of internal states)
 How does the neural circuitry within each region mediate its
contribution to overall “information processing”?
 How does synaptic plasticity allow experience to “self-organize”
the brain in both development and learning so that the brain at any
time is a dynamic blend of “nature and nurture”?
Bickerton on Protolanguage
To keep the argument clear:
A prelanguage is the system of utterances used by a particular
hominid species (including Homo sapiens) which we may
recognize as a precursor to human language in the modern sense.
Warning: We have no traces of any hominid prelanguage!!
Bickerton: Infant language, pidgins, and the “language” taught to
apes are all protolanguages made up of utterances comprising a few
words in the current sense without syntactic structure.
Bickerton’s Hypothesis: The prelanguage of Homo erectus was a
protolanguage in his sense. Language just “added syntax” through
the evolution of Universal Grammar.
My counter-proposal: The prelanguage of Homo erectus and early
Homo sapiens was composed mainly of “unitary utterances”:
“grufluk”
Words co-evolved culturally with syntax through fractionation.
Protolanguage in Historical Linguistics
For Dixon (1997) [studying historical changes in languages]: a protolanguage is a
human language ancestral to a specific family of human languages.
Hypothesis: The rate of historical language change supports the view that language in
its full modern sense may not have been within the repertoire of early Homo sapiens,
and that the subsequent development of languages rested more on cultural than on
biological evolution.
Deep Time
The divergence of the Romance languages took about one thousand years.
The divergence of the Indo-European languages, with their immense diversity
(Hindi, German, Italian, English, ...), took about 6,000 years.
How can we imagine what has changed since the emergence of Homo sapiens some
200,000 years ago?
Or in 5,000,000 years of prior hominid evolution?
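To make the scale of "deep time" concrete, the figures on this slide can be compared directly. The sketch below only divides the quoted time spans by the quoted divergence times; treating language change as proceeding at a uniform rate is a simplifying assumption made for illustration, and the function and variable names are hypothetical.

```python
# Figures quoted on the slide, in years.
ROMANCE_DIVERGENCE = 1_000
INDO_EUROPEAN_DIVERGENCE = 6_000
SINCE_HOMO_SAPIENS = 200_000
PRIOR_HOMINID_EVOLUTION = 5_000_000


def intervals(total_years: int, interval_years: int) -> float:
    """How many back-to-back intervals of a given length fit in a total span."""
    return total_years / interval_years


romance_spans = intervals(SINCE_HOMO_SAPIENS, ROMANCE_DIVERGENCE)
indo_european_spans = intervals(SINCE_HOMO_SAPIENS, INDO_EUROPEAN_DIVERGENCE)
hominid_spans = intervals(PRIOR_HOMINID_EVOLUTION, INDO_EUROPEAN_DIVERGENCE)

print(f"Romance-scale divergences since Homo sapiens: {romance_spans:.0f}")
print(f"Indo-European-scale divergences since Homo sapiens: {indo_european_spans:.0f}")
print(f"Indo-European-scale divergences in prior hominid evolution: {hominid_spans:.0f}")
```

On these numbers, the time since the emergence of Homo sapiens holds about 200 Romance-scale divergences or about 33 Indo-European-scale ones, and the prior 5,000,000 years hold more than 800 of the latter.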
Criteria for Language-Readiness
A Hypothesis on which human brain mechanisms underlie language
Properties Supporting Prelanguage Communication:
Symbolization: The ability to associate an arbitrary symbol with a class of episodes,
objects or actions.
At first, these symbols may not have been word-like in the modern sense
“grufluk”
These symbols may have been manual gestures, rather than being vocalized.
Intentionality: Extension of communication to be intended by the utterer to have a
particular effect on the recipient.
Parity: What counts for the speaker must count for the listener
(Mirror Property)
More General Properties:
Hierarchical Structuring: Perception and action involving components with sub-parts
(Action-oriented perception)
but the units of these structures may not map to symbols
Temporal Ordering: Coding hierarchical structures “of the mind”
Beyond the Here-and-Now: recalling past events, imagining future ones.
Paedomorphy and Sociality: Conditions for complex social learning
Paedomorphy
One key feature of humans is paedomorphy – the infant is helpless for 18
months (or 5 years or …) whereas a guinea pig can fend for itself from
birth.
Therefore, human infants have time to acquire a wide range of culturally determined behavior:
humans broke the bounds of a limited ecological niche and could
reinvent themselves culturally to master more and more new
environments, even to the point of adapting the environments to their
needs.
This in turn required the appropriate co-evolution of the biology of social
relations in general, and of extended child care and mother-child
relations
 Caution: Look for differences of degree rather than kind in
contrasting human with other species.
Criteria for Language
A Hypothesis on what culture and learning add to the brain’s capabilities
Extending Parity, Hierarchical Structuring, and Temporal Ordering of
Language-Readiness:
Symbolization: The symbols become words in the modern sense,
interchangeable and composable units for the expression of meaning.
Syntax and Semantics: The matching of syntactic to semantic structures
co-evolves with the fractionation of utterances
Recursivity is a byproduct
Beyond the Here-and-Now: Verb tenses or other circumlocutions
express the ability to recall past events or imagine future ones.
Learnability: To qualify as a human language, it must contain a
significant subset of symbolic structures learnable by most human
children. [But: It is not true that children master a language by 5 or 7 years of age.]
A Very Quick Tour of Evolution
[Figure: evolutionary tree with the labels Mammals, Primates, Monkeys, Apes, Hominids, Homo sapiens]
Primate Evolution: Two Key Branch Points
Adapted from Clive Gamble: Timewalkers Figure 4.2
20 million years ago: Monkey / Human branch point
5 million years ago: Chimp / Human branch point
5 million years of hominid evolution
What were the biological changes supporting language-readiness?
What were the cultural changes extending the utility of language as a socially transmitted vehicle for communication and representation?
How did biological and cultural change interact “in a spiral” prior to the emergence of Homo sapiens?
Adapted from Clive Gamble: Timewalkers Figure 4.6
Out of Africa - Twice
[Figure: two panels, one labeled NO and one labeled YES]
Adapted from Clive Gamble: Timewalkers Figure 8.1
Gamble, Timewalkers
Table 8.2
Life at the fireside:
the technology of survival
____________________________
Life on the land:
regional exchange
____________________________
Expansion into new habitats
and the rise
of complex behavior
spoken language
increased forward planning
and this is where the story really starts …
The Mirror System Hypothesis
The Grasp System of the Monkey Brain
F5: grasp commands in premotor cortex (Giacomo Rizzolatti)
AIP: grasp affordances in parietal cortex (Hideo Sakata)
Grasp Specificity in an F5 Neuron
Precision pinch (top)
Power grasp (bottom)
(Data from Rizzolatti et al.)
FARS (Fagg-Arbib-Rizzolatti-Sakata) Model 1:
The Role of Prefrontal Cortex
• AIP extracts a set of affordances, but
• IT and PFC are crucial to F5’s selection of the affordance to execute
[Figure: FARS model diagram of the dorsal/ventral streams. Labels: Dorsal Stream: Affordances (AIP), “Ways to grab this ‘thing’”; Ventral Stream: Recognition (IT), “It’s a mug”; PFC; Task Constraints (F6); Working Memory (46?); Instruction Stimuli (F2); F5]
Fagg and Arbib, 1998
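The division of labor named in the two bullets above can be illustrated with a toy selection rule: one function stands in for AIP offering several affordances, another for the PFC bias that depends on what the object is recognized as (the IT contribution) and on the task, and a third for F5 choosing among them. This is not the FARS model itself; the lookup tables, scores, and function names are hypothetical and chosen only to show the idea.

```python
# Hypothetical affordances that an AIP-like stage might offer for an object.
AFFORDANCES = {
    "mug": ["grasp handle", "grasp rim", "grasp body"],
}

# Hypothetical task-dependent biases of the kind attributed here to IT and PFC.
TASK_BIAS = {
    ("mug", "drink"): {"grasp handle": 1.0, "grasp body": 0.4},
    ("mug", "hand over"): {"grasp rim": 1.0, "grasp body": 0.5},
}


def aip_extract(obj: str) -> list:
    """Dorsal stream stand-in: list the ways to grab this 'thing'."""
    return AFFORDANCES[obj]


def pfc_bias(identity: str, task: str) -> dict:
    """Ventral stream plus PFC stand-in: weight affordances by identity and task."""
    return TASK_BIAS.get((identity, task), {})


def f5_select(affordances: list, bias: dict) -> str:
    """F5 stand-in: execute the affordance with the highest bias."""
    return max(affordances, key=lambda a: bias.get(a, 0.0))


for task in ("drink", "hand over"):
    chosen = f5_select(aip_extract("mug"), pfc_bias("mug", task))
    print(f"{task}: {chosen}")
```

The same set of affordances yields different grasps under different tasks, which is the sense in which affordance extraction alone does not determine the action.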
Mirror Neurons
Rizzolatti, Fadiga, Gallese, and Fogassi, 1996:
Premotor cortex and the recognition of motor actions
Mirror Neurons: A subset of the grasp-related premotor neurons of F5 that discharge when the monkey observes meaningful hand movements made by the experimenter.
The effective observed movement matches the effective executed movement.
F5 is endowed with an observation/execution matching system.
The Mirror Neuron System (MNS) Model
[Figure: MNS model diagram. Area labels: cIPS, AIP, STS, 7a, 7b (PF/PG), F5canonical, F5mirror, F4, M1. Function labels: Object features; Object affordance extraction; Object affordance-hand state association; Hand shape recognition; Hand motion detection; Hand-Object spatial relation analysis; Integrate temporal association; Action recognition (Mirror Neurons); Mirror Feedback; Motor program (Grasp); Motor program (Reach); Motor execution; Object location]
Work with Erhan Oztop
Direct and Inverse Models
The vertical path is the execution system.
The loop on the left provides mechanisms for imitating observed gestures in such a way as to create expectations.
The observation matching system (inverse model) goes from "view of gesture" via gesture description (STS) and gesture recognition (PF) to a representation of the "command" for such a gesture.
The expectation system (direct model) goes from an F5 command via the expectation neural network ENN to MP, the motor program for generating a given gesture.
The latter path may mediate a comparison between "expected gesture" and ...
[Figure labels: Learn by Imitation ("Social Learning"); Learn by Doing; Try to Grasp; Object; view of object; AIP; view of action; STS and PF; action description; action recognition; expectation; F5 command; ENN; corollary discharge; MP: Motor Program; MPG; Mirror neurons; Non-Mirror Neurons; grasp of object]
From Arbib and Rizzolatti (1997)
Roles of Mirror Neurons
1) Self-correction: based on the discrepancy between intended and
observed self action.
2) Learning by imitation:
 at the level of the single element
 Beyond the Mirror System (narrowly conceived) based on "parsing"
into familiar elements and then repeating the observed structure
composed from those elements.
3) Social interaction.
A monkey seems at most able to parse some specific classes of
"ecological sequences/behaviors".
But humans can parse "abstractly"
 This progression in behavior may be crucial for sentence
understanding, exploiting the general ability for hierarchical
extraction of constituents.
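Role 1), self-correction from the discrepancy between intended and observed self action, can be sketched with a toy pair of direct and inverse models. Everything here is an assumption made for illustration: gestures and commands are single numbers, both models are linear with a made-up gain, and the update rule is a simple proportional correction rather than anything proposed in the lecture.

```python
MODEL_GAIN = 2.0  # hypothetical internal estimate of how commands map to gestures


def direct_model(command: float) -> float:
    """Expectation: predict the gesture that a command should produce."""
    return MODEL_GAIN * command


def inverse_model(gesture: float) -> float:
    """Observation matching: recover a command that would produce a gesture."""
    return gesture / MODEL_GAIN


def self_correct(intended: float, execute, steps: int = 20, rate: float = 0.5):
    """Adjust the command until the observed gesture matches the intended one."""
    command = inverse_model(intended)
    for _ in range(steps):
        observed = execute(command)          # what the body actually did
        error = intended - observed          # intended vs. observed self action
        command += rate * inverse_model(error)
    return command, intended - execute(command)


def body(command: float) -> float:
    """Hypothetical body that responds more weakly than the internal model expects."""
    return 1.6 * command


command, residual = self_correct(4.0, body)
print(f"corrected command: {command:.3f}, residual error: {residual:.6f}")
```

Because the internal model overestimates the body's gain, the first command undershoots; each pass through the loop shrinks the mismatch, so the residual error ends up close to zero.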
Studying Monkeys
Above: For their intrinsic interest
 Caution: Many different species
Below: To help us understand ourselves
 Monkeys (and chimpanzees and bonobos …) are our
cousins not our ancestors
 But we hope that their study will help us infer more about
our ancestors and how we came to be human.
An Observation/Execution Matching System in Humans
Rizzolatti, Fadiga, Matelli, Bettinardi, Perani, and Fazio:
Broca's region is activated by observation of hand gestures: a PET study.
PET study of human brain with 3 experimental conditions:
 Object observation (control condition)
 Grasping object
 Grasping observation
The most striking result was that the highly significant frontal activation for action and
action recognition (versus control) was in the rostral part of Broca's area.
But Broca’s area is a key language area!!!
Another PET study, by Petrides et al., showed that during execution of a sequence of
self-ordered hand movements there was highly significant activation of Broca's area.
F5 is Homologous to
Area 45 of Broca’s Area
Massimo Matelli (in Rizzolatti and Arbib 1998) provides the key to relating F5 in the monkey to Area 45 in the human.
Human Broca's Area: Areas 44 + 45
[Figure: panels labeled Monkey and Human]
A New Approach to the Evolution of Human Language
Rizzolatti, G., Fadiga L., Gallese, V., and Fogassi, L., 1996, Premotor cortex and the
recognition of motor actions. Cogn Brain Res., 3: 131-141.
Rizzolatti, G., and Arbib, M.A., 1998, Language within our grasp, Trends in
Neurosciences, 21(5):188-194:
The functional specialization of human Broca's area to contribute to language-readiness
derives from an ancient mechanism related to production and understanding of motor
acts.
The "generativity" which some see as the hallmark of language is present in manual
behavior ...which can thus supply the evolutionary substrate for its appearance in
language.
Kimura argues that the left hemisphere is specialized not for language, but for complex
motor programming functions which are, in particular, essential for language
production.
 Language may require its own "copy" of motor sequencing mechanisms,
adjacent to the "old" mechanisms. This adjacency makes lesions which
dissociate the two very rare.
The Mirror System Hypothesis
The parity requirement for language in humans
what counts for the speaker must (approximately!!) count for
the hearer
is met because language evolved from the mirror system for
grasping in the common ancestor of monkey and human with its
capacity to generate and recognize a set of actions.
This adds a neural “missing link” to the tradition that roots speech in a prior system
for communication based on manual gesture. [See most recently:
William C. Stokoe (2001) Language in Hand: Why Sign Came Before Speech.]
Beyond the Mirror: We then have to understand that language (readiness) rests on
far more than a mirror system - seeing F5 as part of a larger mirror system, then
extending our understanding via imitation to language-readiness.
Four stages of evolution
were hypothesized in “Language Within Our Grasp”:
1) grasping
2) a mirror system for grasping (i.e., a system that matches observation and execution)
3) a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire
4) speech as a result of the "invasion" of the vocal apparatus by collaterals from the manual/oro-facial communication system based on F5/Broca's area
What the Mirror System Hypothesis Explains
Why language is multi-modal
 Why Broca’s area is the homologue of F5 rather than the cingulate
area devoted to monkey vocalizations

To achieve these implications we must go beyond the core data on the mirror system to
stress that
manipulation inherently involves hierarchical motor structures
which are unavailable for the closed call system of primates
 Note: These are not the property of premotor cortex in isolation but
involve (at least) their integration with SMA-proper (one division of
the Supplementary Motor Area) and the Basal Ganglia.

What the Mirror System Hypothesis does not say
(i) It does not say that having a mirror system is equivalent to having language.
 Monkeys have mirror systems but do not have language, and we expect that many
species have mirror systems for varied socially relevant behaviors.
(ii) It does not say that the ability to match the perception and production of single
gestures is sufficient for language.
(iii) It does not say that language evolution can be studied in isolation from cognitive
evolution more generally.
 In using language, we make use of, for example, negation, counterfactuals, and
verb tenses.
 But each of these linguistic structures is of no value unless we can understand that
the facts contradict an utterance, and can recall past events and imagine future
possibilities.
How Vertebrate Brains Evolve
(Butler and Hodos 1996)
The course of brain evolution among vertebrates has been determined by:
 Formation of multiple new nuclei through elaboration or duplication
 Regionally specific increases in cell proliferation in different parts of the brain
 Gain of some new connections and loss of some established connections
These phenomena can be influenced by relatively simple mutational
events that can thus become established in a population as the result of
random variation.
Selective pressures determine whether the behavioral phenotypic
expressions of central nervous system organization produced by these
random mutations increase their proportional representation within the
population and eventually become established as the normal condition.
An Integrated State of Knowledge
Big question: How did evolution couple the separate parietal-frontal
subsystems into an “integrated state of knowledge”?
Fuster sees Prefrontal cortex (PFC) as evolving to increase working memory
capacity
Petrides argues that we need PFC to go beyond single items to keeping multiple
objects or events in order.
 Note the challenge of embedding the mirror system in a system for handling
sequential structure, and hierarchical structure more generally.
 How does this relate to the role of HC (hippocampus) in episodic memory?
 Note the parallel problem of keeping multiple objects in spatial relation in
scene perception, and the related syndrome of simultanagnosia.
 Note that events – not objects – are primary in our story, keeping action at
the center.
Basic ingredients for linking the motor system to data on aphasia in the context
of extending the mirror system hypothesis
Enlargement of the pre-frontal lobe (which uses motivation to evaluate
future courses of action) to provide sophisticated memory structures
(coupled, e.g., to hippocampus) to extend the reach in space and time
Extending the number, sophistication and coordination of parietal-frontal
perceptuo-motor systems
 Extending the “reach” of mirror systems (F5/Broca's area)
 Sequential/Hierarchical behaviors (bring in SMA/BG: Adding
prefrontal circuitry with refinements of the basal ganglia and
cerebellum keeping pace; while the ratio of pre-motor cortex to
motor cortex increases drastically from monkey to human)
 POT (Parieto-Occipito-Temporal cortex) is a semantic storehouse; its enlargement is a parallel development to the mirror system story
(Wernicke's area)
 past and future (bring in PFC and HC)
 motor control for vocalization (cingulate cortex, etc.)
Beyond the Mirror
Imitation is the Key
Lewis Carroll
Through the Looking-Glass
and what Alice found there
Illustrations by John Tenniel
From Grasp to Language:
Seven hypothesized stages of evolution
Pre-Hominid
1) grasping
2) a mirror system for grasping (i.e., a system that matches observation and execution) [Shared with common ancestor of human and monkey]
3) a simple imitation system for grasping [Shared with common ancestor of human and chimpanzee]
Hominid Evolution
4) a complex imitation system for grasping
5) a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire
6) proto-speech resting on the "invasion" of the vocal apparatus by collaterals from the communication system based on F5/Broca's area
Cultural Evolution in Homo sapiens
7) language: the change from action-object frames to verb-argument structures to syntax and semantics: co-evolution of cognitive and linguistic complexity
From Hominids to Homo sapiens
My current hypothesis is that:
Stages (4) and (5) and a rudimentary (pre-syntactic) form of (6) were
present in pre-human hominids; but
The "explosive" development of (6) that we know as language (7)
depended on "cultural evolution" well after biological evolution had
formed modern Homo sapiens.
This remains speculative, and one should note that biological evolution
may have continued to reshape the human genome for the brain even
after the skeletal form of Homo sapiens was essentially stabilized, as it
certainly has done for skin pigmentation and other physical
characteristics.
However, the fact that people can master any language equally well,
irrespective of their genetic community, shows that these changes are
not causal with respect to the structure of language.
Chimps are not Monkeys
Apes imitate; monkeys do not. What does this say for our evolutionary
hypothesis?
Whiten study: chimps can learn a sequence quickly, whereas the monkey
cannot.
Speech recognition in Kanzi (a bonobo)
How does this square with the view that the "phonetic module" is a
speech-specific human module?
Hypothesis:
Extension of the mirror system from single actions to compound
actions was the key innovation in the brains of human, chimp and the
common ancestor (as compared to the monkey-human common
ancestor) relevant to language-readiness.
The Monkey Cannot Imitate?
What is the evidence that the monkey "understands", or recognizes the "goal" of an action? What
is involved in linking a goal to an action? To a sequence of actions?
Do the monkey’s social interactions require "understanding" rather than a complex of fixed action
patterns with releasers varying from innate to learned?
On our principle that the motor system can do more than just produce overt behavior, we can see
the latter behavior as more important for its parsing of a sequence than for its production of
behavior.
This "parsing" -- present in the chimp but not in the monkey -- may be a crucial transition
towards mechanisms for language-readiness.
Judy Cameron (personal communication) offers the following observation from the Oregon
Regional Primate Research Center:
"Researchers at the Center had laboriously taught monkeys to run on a treadmill as a basis for
tests they wished to conduct. It took five months to train the first batch of monkeys in this
task. But they then found that if they allowed other monkeys to observe the trained monkeys
running on a treadmill, then the naïve monkeys would run successfully the first time they
were placed on a treadmill."
This is not evidence that the monkey mirror system for grasping is part of a system for imitation
of hand movements, but does render this likely.
Imitation
Imitation involves, in part, seeing the other's performance as a set of
familiar movements. But:
One must not only observe actions and their composition, but also
novelties in the constituents and their variations.
One must also perceive the overlapping and sequencing of all these
moves and then remember the “coordinated control program” so
constructed.
 Each approximation provides the framework in which attention can be
shifted to specific components which can then be tuned and/or
fractionated appropriately, or better coordinated with other components of
the skill.
 This process is recursive, yielding both the mastery of ever finer details,
and the increasing grace and accuracy of the overall performance.
Stage 3: Simple Imitation
Masako Myowa-Yamakoshi:
 the form of “imitation” employed by chimpanzees is a long and laborious process
compared to the rapidity with which humans can acquire novel sequences;
 the focus is on moving objects to objects rather than on the structure of movements
per se.
Monkeys less so and chimpanzees more so (and, presumably, the common ancestor of
human and chimpanzees) have
Simple imitation: imitating simple novel behaviors but only through repeated exposure.
Chimpanzees use and make tools
Different tool traditions in isolated groups of chimpanzees:
 Different types of tools used for “termite fishing” at the Gombe in
Tanzania and at sites in Senegal.
 Chimpanzees use stones and other objects as projectiles to do harm
 Chimpanzees in Tai National Park, Ivory Coast, use stone tools to
crack nuts open, but chimps in the Gombe have not been seen to do this.
The nut-cracking technique is not mastered until adulthood.
Mothers overtly correct and instruct their infants from the time they
first attempt to crack nuts, at age three years, and at least four years
of practice are necessary before any benefits are obtained.
Note: the form of imitation reported here for chimpanzees is a long and
laborious process compared to the rapidity with which humans can
acquire novel sequences.
Stage 4: Complex Imitation
Humans have complex imitation: they can acquire (longer) novel
sequences in a single trial if the sequences are not too long and the
components are relatively familiar.
 The very structure of these sequences can serve as the basis for
immediate imitation or for the immediate construction of an
appropriate response, as well as contributing to the longer-term
enrichment of experience
Extension of the mirror system from single actions to compound
actions adequate to support complex imitation was an evolutionary
change of key relevance to language-readiness
 Hypothesis: This emerged on the hominid line after the divergence
from the common ancestor of humans and chimpanzees.
Action = Movement + Goal/Expectation
What makes a movement into an action is that
(i) it is associated with a goal, and
(ii) initiation of the movement is accompanied by the creation of an
expectation that the goal will be met.
To the extent that the unfolding of the movement departs from that
expectation, to that extent will an error be detected and the movement
modified.
An individual performing an action is able to predict its consequences
and, therefore, the action representation and its consequences are
associated.
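The definition above can be illustrated with a minimal sketch, assuming a one-dimensional reach toward a target. The class name `Action`, the `gain` value, and the idea of an additive disturbance are all illustrative assumptions and are not part of the lecture.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Action:
    """A movement paired with a goal and an expectation of progress."""
    goal: float            # where the hand should end up (hypothetical 1-D position)
    position: float = 0.0  # current hand position
    gain: float = 0.5      # fraction of the remaining distance covered per step (assumed)

    def expected_next(self) -> float:
        # Expectation created when the movement is initiated: each step
        # should close a fixed fraction of the distance to the goal.
        return self.position + self.gain * (self.goal - self.position)

    def step(self, disturbance: float = 0.0) -> float:
        """Move once; return the error between expectation and outcome."""
        expected = self.expected_next()
        self.position = expected + disturbance  # what actually happened
        error = expected - self.position        # departure from the expectation
        self.position += error                  # modify the movement by that error
        return error


def run(action: Action, disturbances: list[float]) -> list[float]:
    """Apply a series of disturbances and collect the detected errors."""
    return [action.step(d) for d in disturbances]
```

With no disturbance the detected error is zero and the movement simply proceeds; a disturbance produces a non-zero error of the same size and opposite sign, which is then corrected.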
Understanding and Awareness
Many authors have suggested that language and understanding are
inseparable, but our experience of scenery and sunsets and songs and
seductions makes clear that we humans understand more than we can
express in words.
Some aspects of such awareness and understanding are available to
animals who do not possess language.
But our development, as "modern" humans, i.e., as individuals
within a language-based society, greatly extends our understanding
beyond that possible for non-humans or for humans raised apart from
a language community.
Conversely other species are aware of aspects of their environment
and society that we humans can at best dimly comprehend.
Hierarchical progress beyond the mirror system
Classic mirror system: just a fixed set of actions at one level. Need to understand multilevel representations and interactions.
Distinguish "imitating" a familiar action from imitating a complex behavior.
Shadowing experiments show effects of the highest applicable level in repeating a
sequence of phonemes - whether the changes were at the phoneme, word, or syntax-of-the-utterance levels.
One may also see filling in of missing elements.
Higher levels may dominate but do not do so completely -- compare Magritte.
In aphasia, note the interaction of perception and production (this is bidirectional, not
just unidirectional).
Communication and Representation
The specific communication system based on primate calling
was not the precursor of language.
However, co-evolution of communication and representation
was essential for the emergence of human language.
Both
representation within the individual and
communication between individuals
 could provide selection pressures for the biological evolution of
language-readiness and the biological and cultural evolution of
language, with advances in the one triggering advances in the other.
From Unit Actions to Complex Behaviors
We hypothesize that the plan of an action (whether observed or "intended") is encoded in the
brain. Three cases:
 the whole set of actions is overlearned and encoded in stable neural connectivity.
 the whole set of actions is planned in advance based on knowledge of the current
situation.
 dynamic planning is involved, with the plan being updated and extended as new
observations become available.
An automaton formalism is broad enough to encompass the above range from overlearned to
dynamic plans, but it is still an open question as to how best to distribute the encoding of the
various components of the automaton between stable synapses, rapidly changing synapses, and
neural firing patterns. In general, this "automaton" will be event-driven, rather than operating on
a fixed clock – different sub-behaviors take different lengths of time, and may be terminated
either because of an external stimulus, or by some internal encoding of completion.
At a basic level, then, we might characterize imitation in terms of ability to "infer automata”.
However, complex behaviors may be expressed as coordinated control programs, which are
built up from assemblages of simpler schemas.
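A minimal sketch of an event-driven automaton of the kind described above. The state and event names are illustrative assumptions, loosely modelled on the phases of a grasp; the transition table is hypothetical and does not come from the lecture.

```python
# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("set", "go_signal"): "extension",
    ("extension", "max_aperture_reached"): "flexion",
    ("flexion", "contact_with_object"): "hold",
    ("hold", "second_go_signal"): "release",
}


def run_automaton(events, start="set"):
    """Advance on events rather than on a fixed clock.

    Events that are irrelevant to the current state are ignored, so each
    sub-behavior lasts until its own terminating event arrives.
    """
    state = start
    trace = [state]
    for event in events:
        nxt = TRANSITIONS.get((state, event))
        if nxt is not None:
            state = nxt
            trace.append(state)
    return trace
```

In this sketch the table is a fixed dictionary, which corresponds to the overlearned case; a dynamic plan would instead add or rewrite entries while the behavior is running.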
The Problem of Serial Order in Behavior (Karl Lashley)
If we tried to learn a sequence like A → B → A → C
by reflex chaining, what is to stop A triggering B every time,
to yield the performance A → B → A → B → A → …..
(or we might get A → B+C → A → B+C → A → …..)?
A solution: Store the “action codes” (motor schemas) A, B, C, … in one
part of the brain (F5 in FARS) and have another area (pre-SMA in
FARS) hold “abstract sequences” and learn to pair the right action with
each element:
(pre-SMA): x1 → x2 → x3 → x4 (abstract sequence)
(F5): A, B, C (action codes/motor schemas)
We further posit that Basal Ganglia (BG) manage priming and inhibition.
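The contrast between reflex chaining and a separate abstract sequence can be sketched as follows. This is only an illustration of the logic: the function names and the dictionary encoding are assumptions, and nothing here models pre-SMA, F5, or the basal ganglia.

```python
def reflex_chain(start, links, steps):
    """Each action directly triggers its single stored successor."""
    out = [start]
    for _ in range(steps - 1):
        out.append(links[out[-1]])
    return out


def abstract_sequence(slots, pairing):
    """Step through abstract slots x1, x2, ... and read out the paired action."""
    return [pairing[slot] for slot in slots]


# A chain holds only one successor per action, so A triggers B every time.
chain = reflex_chain("A", {"A": "B", "B": "A"}, steps=4)

# Separate slots let the same action code A be followed by B once and by C later.
slots = ["x1", "x2", "x3", "x4"]
pairing = {"x1": "A", "x2": "B", "x3": "A", "x4": "C"}
sequence = abstract_sequence(slots, pairing)
```

The chain yields A, B, A, B, whereas the slot-based readout yields the intended A, B, A, C, because the repetition of A is disambiguated by its position in the abstract sequence and not by the action itself.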
FARS (Fagg-Arbib-Rizzolatti-Sakata) Model 2:
Sequential Behavior in the Sakata Task
[Figure (Fagg and Arbib, 1998). Labels: Visual Inputs; AIP (visual, motor); F5 (Precision Pinch); set, Go Signal, extension, maximum aperture reached, flexion, hold, contact with object, release, 2nd Go Signal. Legend: Activation Connection.]
The five F5 units participate in a common program (in this case, a precision grasp), but
each cell fires during a different phase of the program.
But what controls the sequencing?
Beyond the Mirror System
F5 alone is not the “full” mirror system
 We want not only the “unit actions” but also sequences and more
general patterns
The FARS model sketched how to generate a sequence, positing roles for SMA and BG.
Our proposed mirror model must match this with a model of how
 the units of a sequence and
 their order/interweaving
can be recognized.
This new model requires recognition of a complex behavior on multiple occasions with
increasing success in recognizing component actions and in linking them together.
 [cf. scene analysis in vision.]
Sequential Learning: The Hominid Difference
Further, we need to study how the extension of the mirror system from single
actions to compound actions was further refined along the hominoid
evolutionary track.
We distinguish sequential learning at two levels:
(i) the abstraction of regularities in many sentences to come up with “syntax”;
 (ii) the ability, given syntactic and semantic knowledge, to extract the sequential
or semantic structure of an utterance (parsing) to reflect meaning upward from basic
units via constituent structures to larger units.
 Extending the Mirror System Hypothesis, we must show how the ability to
comprehend and create utterances via their underlying syntactico-semantic
hierarchical structure can build upon the observation/execution of single
actions.
 Distinguish between learning individual sequences via conditioning and
the ability to infer and use sequential (and more general hierarchical)
structures “at sight”.
From Vocalization to Manual Gesture
and back to Vocalization again:
The path to proto-speech is indirect
The Ancestral Communication System
As before, we assume that our common human-monkey ancestors shared with
monkeys the following:
Primate Call System
a limited set of species-specific calls
Oro-Facial Gesture System
a limited set of gestures expressive of
emotion and related social indicators
Note the linkage between the two systems: communication is inherently multi-modal.
We have argued that the Mirror System Hypothesis explains why F5, rather than the
cingulate area already involved in monkey vocalization, is homologous to the Broca's
area substrate for language.
Starting with Homo habilis
The Fossil Record
Imprints in the cranial cavity of endocasts indicate that "speech areas" were already
present in early hominids such as H. habilis long before the larynx reached the modern
“speech-optimal” configuration, but
 there is a debate over whether such areas were already present in
australopithecines
 were they for speech or proto-speech or proto-sign-language?
A Related Hypothesis
The transition from australopithecines to early Homo coincided with the transition
from a mirror system, enlarged but only for action recognition
to a human-like mirror system for intentional communication.
Did Homo habilis have language?
Was Homo habilis language-ready?
Homo habilis has a small brain and poorly developed larynx - but
Homo habilis has the motor dexterity for sign language. Is there any
way to prove its brain is “less powerful” than that of a 7-year-old
human?
Homo habilis left few traces of technology that show evidence of a
complex culture that would exploit the use of language - but
 new data (Nature 1999) push sophisticated tool making back to 2
MYr BP
 language can develop to serve highly complex cultural interactions
even in a low-technology society
 Homo sapiens has only recently invented towns and high-technology society, so a language-ready brain does not guarantee high
technology.
From Praxis to Communication
Our hypothetical sequence for manual gesture:
pragmatic action directed towards a goal object
pantomime in which similar actions are produced away from the goal object
 Imitation is the generic attempt to reproduce movements performed by another,
whether to master a skill or simply as part of a social interaction. By contrast,
pantomime is performed with the intention of getting the observer to think of a
specific action or event. It is essentially communicative in its nature. The imitator
observes; the panto-mimic intends to be observed
abstract gestures divorced from their pragmatic origins (if such existed) and available
as elements for the formation of compounds which can be paired with meanings in
more or less arbitrary fashion.
 In pantomime it might be hard to distinguish a grasping movement signifying
“grasping” from one meaning “a [graspable] raisin”, thus providing an “incentive”
for coming up with an arbitrary gesture to distinguish the two meanings.
Two Roles for Imitation in the
Evolution of Manual-Based Communication
1. Extending imitation from imitation of hand movements by hand movements to
pantomime which uses the degrees of freedom of the hand (and arm and body) to
imitate degrees of freedom of objects and actions other than hand movements.
 This extends the repertoire of recognizable and describable actions well beyond those that
can be performed by the speaker/hearer (cf. “the bird is flying”).
2. Extending these pantomime movements to provide ad hoc gestures that may
convey to the observer information which is hard to pantomime in an “obvious”
manner. This requires extending the mirror system from the grasping repertoire to
mediate imitation of gestures to support the transition from ad hoc gestures to
conventional signs which can reduce ambiguity and extend semantic range.
Question: We start with a clear distinction between the representation of the grasp (the
action/proto-verb) and the raisin (the thing/proto-noun) in the brain, but in the spoken language
both noun and verb are uttered by actions. How does this relate to neural correlates of the
distinction between verb and noun?
 Perhaps the answer lies in the perceptual/semantic processing rather than the symbolic/linguistic
processing:
Noun/Verb pairs differentiated by movement
A change in the extent of movement will change the meaning of a sign.
[Figures: Stokoe, Language in Hand, Figures 1, 3 and 6]
A change in the speed of movement will change the meaning of a sign.
Here the noun is characterized by short, repeated movements, while the verb is
characterized by a single, prolonged movement.
Stage 5: Gestural Communication Emerges
A distinct manuo-brachial communication system evolved to complement the primate
calls/oro-facial communication system
On this view, the "speech" area of early hominids
 i.e., the area somewhat homologous to monkey F5 and human Broca’s area
is not yet even a proto-speech area!
Instead, it primarily mediated orofacial and manuo-brachial communication.
Question: Did “protosign” precede the initiation of “protospeech”, or (currently my
preferred hypothesis) is it rather that:
“Protosign” reached sufficient sophistication to provide a basic but effective form
of communication at a time when there were few arbitrary vocal gestures (as
distinct from species-specific primate calls)
The Evolving Larynx and the Breath Group
P. Lieberman views the descent of the larynx first seen in Homo sapiens as being crucial
in enabling the wide articulatory range exploited in human speech.
 Caution: This view is still controversial.
Clearly, some level of language-readiness and vocal communication preceded this:
 a core of proto-speech (but not necessarily language) was needed to provide
pressures for larynx evolution.
Lieberman also suggests that the primate call made by an infant separated from its
mother not only survives in the human infant, but in humans develops into the breath
group that provides the contour for each continuous sequence of an utterance.
Stage 6: From Manual Gesture to Proto-Speech
The "generativity" which some see as the hallmark of language is present in manual
behavior. Combinatorial properties are inherent in the manuo-brachial system. This
provided the evolutionary opportunity for:
Stage 6. The manual-orofacial symbolic system then “recruited” vocalization.
Association of vocalization with manual gestures allowed them to assume a more open
referential character.
This explains why F5, rather than the primate call area, provides the evolutionary
substrate for speech.
This yields our explanation for the
evolutionary prevalence of the lateral motor system over the medial (emotion-related)
primate call system
in becoming the main communication channel in humans.
Collateral Control of Vocalization
TINS, May 98: “This new use of vocalization necessitated its skillful control, a requirement that
could not be fulfilled by the ancient emotional vocalization centers. This new situation was most
likely the ‘cause’ of the emergence of human Broca’s area.”
I would now rather say:
Homo habilis and even more so Homo erectus had a “proto-Broca’s area” based on an
F5-like precursor mediating communication by manual and oro-facial gesture.
This made possible a process of collateralization whereby this “proto” Broca’s area
gained primitive control of the vocal machinery, thus yielding increased skill and
openness in vocalization.
Larynx and brain regions could then co-evolve to yield the configuration seen in Homo
sapiens.
Kojima on onomatopoeia: The above hypothesis would see onomatopoeia as a secondary
mechanism for extending the “vocabulary” of proto-speech, but perhaps we need to respond
with a “multiple factors” account of the evolution of the overall system.
From Primate to Human (A Production Viewpoint)
Primate Call System
a limited set of species-specific calls
→ Larynx and Vocal Cords
Oro-Facial Gesture System
a limited set of gestures expressive of
emotion and related social indicators
→ Facial Muscles
Manual Gesture System
an open set of communicative gestures
→ Arm and Hand
Proto-Speech System
an open set of communicative gestures
Perception systems are not shown. The mirror system is thus implicit.
A primitive system plus an advanced system?
Or one multi-modal controller?
Linking the “F5-Broca” and Vocalization Systems
Our original aim was to show why speech did not evolve “simply” by
extending the classic primate vocalization system.
We now note the co-evolution of the two systems:
 Lesions centered in the anterior cingulate cortex and supplementary
motor areas of the brain can cause mutism in humans, similar to the
effects produced in muting monkey vocalizations
I hypothesize cooperative computation between cingulate cortex and
Broca’s area,
 with cingulate cortex involved in breath groups and emotional
shading (and imprecations!), and
 Broca’s area providing the motor control for rapid production and
interweaving of elements of an utterance.
Gesture Remains
TINS, May 98: “Manual gestures progressively lost their dominance, while in
contrast, vocalization acquired autonomy, until the relation between gestural
and vocal communication inverted and speech took off.”
Our use of writing as a record of speech has long since created the mistaken
impression that language is a speech-based system. But:
 McNeill has used videotape analysis to show the crucial use that people
make of gestures synchronized with speech
 Even blind people use manual gestures when speaking
 Sign languages are full human languages rich in lexicon, syntax, and
semantics.
 Moreover: not only deaf people use sign language, so do some aboriginal
Australian tribes, and some native populations in North America
Language acquisition
We locate phonology in a speech-manual-orofacial gesture complex:
a hearing person shifts the major information load of language -- but by
no means all of it -- into the speech domain, whereas
 for a deaf person the major information load is removed from speech
and taken over by hand and orofacial gestures
 and note that blind children accompany speech with hand
movements
Not three separate systems but a single system
operating in multiple motor and sensory modalities
Primate Call System
a limited set of species-specific calls
→ Larynx and Vocal Cords
Oro-Facial Gesture System
a limited set of gestures expressive of
emotion and related social indicators
→ Facial Muscles
Manual Gesture System
an open set of communicative gestures
→ Arm and Hand
Speech System
an open set of communicative gestures
[Diagram label: Genuine Cooperation]
Caution: One system but many brain regions, each with its own evolutionary story.
Our Descriptions are different from Neural Representations
We may classify the specific structure
Hit (John, Mary, his hand)
as an instance of a more general structure
Hit (Agent, Recipient, Instrument)
but the brain-representations of the constituent entities may or may not entail their
recognition as belonging to these structures.
… we must ensure that descriptive categories are not automatically ascribed to the “neural
strategies” of the subject.
We must be careful:
Hit (John, Mary, his hand) is a symbolic string which represents two different
representations in the brain, neither of which looks like this structure.
 The action-object frame represents (whether as a result of perception or motor planning or
both) the relation between an action, two agents and an “object” without demanding that any
names or words or explicit symbols be attached to any of these entities.
 The verb-argument structure is an abstraction from the action-object frame in that it lacks any
graded representation of the specific event, but is enriched by the linkage of each entity to a
specific name or symbol.
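The distinction can be made concrete with a small descriptive sketch. In keeping with the caution above, this is a notation for our own descriptions and not a claim about neural representation; the class names, the percept identifiers, and the `force` field are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ActionObjectFrame:
    """Non-linguistic: roles plus graded detail of one event, no names attached."""
    roles: dict                                 # role -> opaque percept id (hypothetical)
    graded: dict = field(default_factory=dict)  # e.g. how forceful this particular event was


@dataclass(frozen=True)
class VerbArgument:
    """Linguistic abstraction: each entity linked to a symbol, graded detail dropped."""
    verb: str
    args: tuple


def name_frame(frame, lexicon, verb, role_order):
    """Abstract a frame into a verb-argument structure using a lexicon of names."""
    return VerbArgument(verb, tuple(lexicon[frame.roles[r]] for r in role_order))


frame = ActionObjectFrame(
    roles={"agent": "percept_17", "recipient": "percept_4", "instrument": "percept_9"},
    graded={"force": 0.8},
)
lexicon = {"percept_17": "John", "percept_4": "Mary", "percept_9": "his hand"}
structure = name_frame(frame, lexicon, "Hit", ("agent", "recipient", "instrument"))
```

Here `structure` corresponds to Hit (John, Mary, his hand): the graded `force` value has no counterpart in it, while every role has acquired a name.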
The Biological Basis of Language-Readiness
“Knowing there are things and events”: The ability for perception of Action-Object
Frames, in which an actor, an action, and related role players can be perceived in
relationship, was well established in the primate line.
Hypothesis: The ability to communicate a fair number of such frames was established in
the hominid line prior to the emergence of Homo sapiens.
 Recognizing an object and acting on it; Recognizing a conspecific and interacting
with it
 Recognizing action-object frames
 Extending the mirror system beyond single actions to a repertoire of action-object
frames which is unbounded a priori.
 Naming action-object frames
 creation of a “symbol toolkit” of meaningless elements from which an open ended
class of symbols can be generated
 abstract symbols are grounded in action-oriented perception
Note that such naming does not imply separate names for the actions and objects or their
attributes; i.e., it does not entail that utterances of protolanguage were compounded from words
akin to those we see in, e.g., the Indo-European languages.
The Transition to Language
Hypothesis: The ability to communicate a fair number of action-object frames using
gesture and proto-speech was established in the hominid line prior to the emergence of
Homo sapiens.
The Transition to Homo sapiens may have involved “language amplification” through
increased speech ability and
 Fractionation of symbols to yield symbols for actions and objects, yielding the ability
to create an unlimited set of verb-argument structures linked to action-object frames
 The one word ripe halves the number of fruit names to be learned
 Separating verbs from nouns lets one learn only m+n+p words to be able to form
m*n*p of the most basic utterances.
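The m+n+p versus m*n*p count can be checked with a short sketch. The assignment of m, n and p to agents, actions and objects, and the example words, are assumptions made for illustration only.

```python
from itertools import product


def vocabulary_vs_utterances(agents, actions, objects):
    """Return (words to learn, distinct three-word utterances that can be formed)."""
    words = len(agents) + len(actions) + len(objects)     # m + n + p
    utterances = list(product(agents, actions, objects))  # m * n * p combinations
    return words, len(utterances)


counts = vocabulary_vs_utterances(
    ["John", "Mary"],                      # m = 2
    ["hit", "grasp", "see"],               # n = 3
    ["raisin", "apple", "stone", "stick"], # p = 4
)
```

With these lists, nine words suffice for twenty-four basic utterances; without fractionation each of the twenty-four would need its own unanalyzed symbol.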
Consideration of the spatial basis for “prepositions” may further show how visuomotor
coordination underlies some aspects of language.
However, the basic semantic-syntactic correspondences have been overlaid by a
multitude of later innovations and borrowings.
The spatial basis for “prepositions”
Consideration of the spatial basis for “prepositions”
may help show how visuomotor coordination underlies some aspects of language and
makes clear the “naturalness” of sign.
[Figure: Stokoe, Language in Hand, Figure 10]
The addition of movement transforms IN to INTO and exemplifies the
differences in meaning between the two signs.
However, the basic semantic-syntactic correspondences have been overlaid by a
multitude of later innovations and borrowings.
Co-Evolution of Words and Syntax
 The ability to compound those structures in diverse ways, with
abstraction and compounding of more generic verb-argument structure
Recognition of hierarchical structure rather than mere sequencing
could provide the bridge to constituent analysis in language – relating
particular subactions (themselves further decomposable) to
achievement of certain subgoals in a complex manipulation.
Syntax and semantics: compounding utterances, “going recursive”
The result: A spiraling co-evolution of communication and
representation, extending the repertoire of achievable, recognizable and
describable actions.
Abstraction, Negation, and Hierarchicalization
Claim: Many ways of expressing relationships were the discovery of Homo sapiens.
I.e., adjectives, conjunctions such as "but", "and", or "or", and "that", "unless", or "because", etc.,
might well have been “post-biological” in their origin.
Extending the repertoire of recognizable and describable actions:
Recognition of hierarchical structure rather than mere sequencing could
provide the bridge to constituent analysis in language –
relating particular subactions (themselves further decomposable) to
achievement of certain subgoals in a complex manipulation.
But the power of language comes from breaking away from the here-and-now, not just by hierarchicalization but also by negation and
abstraction.
Need to analyze how the brain can support counterfactual cognitive
representations and relate them to language.
The Action-Object Frame
The action-object frame is non-linguistic
 the representation of an action involving one or
more objects and agents. (Composing them yields
“schema assemblages”)
Verb-argument structure is an overt linguistic
representation
 in modern human languages, generally the action is
named by a verb and the objects are named by nouns
(or noun phrases). (Composing them yields semantic
structures.)
A grammar for a language is then a specific mechanism
(whether explicit or implicit) for converting verb-argument
structures in particular, and more complex structures based on
hierarchical compounds of verb-argument structures more
generally, into strings of words, and vice versa.
Cautionary Note: In the brain there is probably no single
grammar, but rather
 a “direct model/grammar” for production
 an “inverse model/grammar” for perception
[Figure: three levels linked by arrows labeled Production and Perception:
Cognitive Structures (Schema Assemblages);
Semantic Structures (Hierarchical Constituents expressing objects, actions and relationships);
“Phonological” Structures (Ordered Expressive Gestures)]
John hit Mary with his hand
is an English sentence for the structure we
may encode (arbitrarily!) as:
Hit (John, Mary, his hand)
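As a minimal sketch of the idea that a grammar converts verb-argument structures into word strings and back, the toy code below pairs a "direct" mapping for production with an "inverse" mapping for perception. Everything here is an assumption made for illustration: the Frame type, the two-verb lexicon, and the fixed agent-verb-patient word order are hypothetical and are not proposed in the lecture as an account of English or of the brain.

```python
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    """Hypothetical verb-argument structure, e.g. Hit(John, Mary, his hand)."""
    action: str
    agent: str
    patient: str
    instrument: Optional[str] = None


# Toy lexicon (assumed): action label -> English past-tense verb.
VERBS = {"Hit": "hit", "Push": "pushed"}


def produce(frame: Frame) -> str:
    """'Direct grammar': verb-argument structure -> string of words."""
    words = [frame.agent, VERBS[frame.action], frame.patient]
    if frame.instrument:
        words += ["with", frame.instrument]
    return " ".join(words)


def perceive(sentence: str) -> Frame:
    """'Inverse grammar': string of words -> verb-argument structure."""
    inverse_verbs = {verb: action for action, verb in VERBS.items()}
    words = sentence.split()
    # The first known verb splits the agent from the remaining arguments.
    verb_idx = next(i for i, w in enumerate(words) if w in inverse_verbs)
    agent = " ".join(words[:verb_idx])
    rest = words[verb_idx + 1:]
    if "with" in rest:
        k = rest.index("with")
        patient = " ".join(rest[:k])
        instrument = " ".join(rest[k + 1:])
    else:
        patient, instrument = " ".join(rest), None
    return Frame(inverse_verbs[words[verb_idx]], agent, patient, instrument)


frame = Frame("Hit", "John", "Mary", "his hand")
print(produce(frame))
print(perceive("John hit Mary with his hand"))
```

Keeping produce and perceive as separate functions mirrors the cautionary note: the two directions need not share a single stored grammar, only agree on the structures they map between.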
Neural Representations?
The mirror neuron analysis must be extended to address these questions:
How are action-object frames and verb-argument structures represented in the brain?
How are action-object frames mapped to and from verb-argument structures, and how
are the latter mapped to and from the utterances of (spoken, written, signed) language?
Tentative steps towards a “Mirror Neurolinguistics”
Cooperative computation in the brain: to make sense of data relating
different brain regions to different aspects of language.
 Do these data reflect the brain's genetic prespecification and/or the
results of the self-organization of the infant brain when the infant
develops within a particular language community?
Arbib and Caplan (1979), based on Luria (1973)
[Figure: block diagram with components labeled A to J. Recoverable labels:
Inputs: Visual Input; Auditory Input
Processing blocks: Visual Perception; Phonemic Analysis; Lexical Analysis; Speech Memory;
Analysis of Significant Elements; Logical Scheme; Plan Formation;
Formation of the Linear Scheme; Updating the Plan of the Expression;
Switching Control; Selective Naming; Articulatory System
Tasks shown: Naming of objects; Verbal expression of motives; Speech understanding; Speech repetition]
Monkey and Human: A Comparative Approach to Neurolinguistics
My goals:
 A fully articulated model of the monkey mirror system (grounded in
neurophysiology of macaque [and other?] monkeys);
 a cooperative computation model of interacting brain regions for human
neurolinguistics (language-readiness versus language) as well as human mirror systems
and imitation; and
 a coherent evolutionary framework which links them, both by synthetic brain
imaging and by brain imaging across monkeys, chimps, and other primates.
[Figure labels: “Not AIP Homologue: Let’s discuss this!”; “F5 Homologue”]
“What” versus “Where”: Mishkin and Ungerleider
“What” versus “How”: Goodale and Milner (DF); Jeannerod et al. (AT)
[Figure: Visual Cortex projects to Parietal Cortex (How, dorsal: reach programming, grasp programming)
and to Inferotemporal Cortex (What, ventral)]
AT (Jeannerod et al.): lesion in the dorsal (“How”) pathway. Inability to preshape
(except for objects with size “in the semantics”).
DF (Goodale and Milner): lesion in the ventral (“What”) pathway. Inability to verbalize or
pantomime size or orientation.
Goodale and Milner 1
Our evolutionary theory suggests a progression from action to
pantomime to (pre)language
 object → AIP → F5canonical: pragmatics
 action → PF → F5mirror: action understanding
 scene → Wernicke’s → Broca’s: utterance
The “zero order” model of AT and DF data is:
 Parietal “affordances” → preshape
 IT “perception of object” → pantomime or verbally describe size
Goodale and Milner 2
 IT “perception of object” is needed to pantomime or verbally describe
size
seems to imply one cannot pantomime or verbalize an affordance; one
needs a “unified view of the object” (IT) to express attributes.
The problem with this is that the “language” path as shown in the IT route
(IT “perception of object” → pantomime or verbally describe size) is
completely independent of the parietal → F5 system, and so the data
seem to contradict our view in the pathway:
 scene → Wernicke’s → Broca’s: utterance
Necessary Background:
FARS (Fagg-Arbib-Rizzolatti-Sakata) Model Overview
•AIP extracts a set of affordances but
•IT and PFC are crucial to F5’s selection of the affordance to execute
[Figure: dorsal/ventral streams converging on F5 canonical.
Dorsal Stream: AIP, Affordances (“Ways to grab this ‘thing’”)
Ventral Stream: IT, Recognition (“It’s a mug”)
PFC inputs: Task Constraints (F6); Working Memory (46?); Instruction Stimuli (F2)]
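The division of labor described here (AIP proposes several affordances, while IT and PFC bias which one F5 selects) can be sketched as a biased choice among candidates. This is a deliberately crude illustration and not the FARS model itself, which is a neural network model: the function select_affordance, the additive bias, and all activation values are assumptions invented for this sketch.

```python
from typing import Dict, Tuple


def select_affordance(
    aip_affordances: Dict[str, float],
    it_identity: str,
    pfc_bias: Dict[Tuple[str, str], float],
) -> str:
    """Pick the grasp with the highest biased activation.

    aip_affordances: grasp name -> activation proposed by AIP (assumed values).
    it_identity:     object label from IT recognition, e.g. "mug".
    pfc_bias:        (object label, grasp name) -> additive bias standing in
                     for task constraints and working memory (assumed form).
    """
    scores = {
        grasp: activation + pfc_bias.get((it_identity, grasp), 0.0)
        for grasp, activation in aip_affordances.items()
    }
    return max(scores, key=scores.get)


# Two ways to grab the same object, as proposed by the dorsal stream.
affordances = {"power grasp on body": 0.6, "precision grasp on handle": 0.5}

# Without any bias, the most active affordance wins.
print(select_affordance(affordances, "mug", {}))

# Knowing "it's a mug" and that the task is drinking favors the handle.
drink_bias = {("mug", "precision grasp on handle"): 0.3}
print(select_affordance(affordances, "mug", drink_bias))
```

The point of the sketch is only that the bias is keyed on object identity, which the dorsal stream alone does not supply.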
Some Crucial Psychophysical Data
Bridgeman, Peery & Anand, 1997:
An observer sees a target in one of several possible positions, and a frame either
centered before the observer or deviated left or right.
Verbal judgments of target position are altered by the background
frame's position
 Pointing at the target never misses, regardless of the frame's
position.

The data demonstrate independent representations of visual space in the two systems,
with the observer aware only of the spatial values in the IT system.
They have also shown that a symbolic message about which of two targets to jab can be
communicated from the cognitive (inferotemporal) to the sensorimotor (parietal)
system without communicating the cognitive system's spatial bias as well.
Hypothesis:
Just as F5mirror receives its parietal input from PF rather than AIP, so Broca's area
receives its size data as well as object identity data from IT via PFC, rather than via a
side path from AIP.
Enhancing the Pathways
We thus enhance each of the pathways
 object → AIP → F5canonical: pragmatics
 action → PF → F5mirror: action understanding
 scene → Wernicke’s → Broca’s: utterance
by having PFC modulate the activity of each premotor component.
An Early Pass on the AT/DF Challenge
[Figure: boxes driven by Visual Input:
AIP and F5canonical (Choosing an Action);
PF and F5mirror (Recognizing an Action);
STS and IT (Recognizing an Object or an Action);
Wernicke’s Area and Broca’s Area (Describing an Episode, Object or Action);
Prefrontal (PFC) Memory]
Do these link the right boxes? What is the relationship?
Is PF a homologue of Wernicke’s area? How does the role of PFC in the FARS model
relate to its roles in the mirror system of monkey and in language?
Going Further ...
 To extend our existing model of the mirror system in monkey to chart the
human brain mechanisms for recognizing interactions of actors and objects
 Use this to ground a theory of brain representations underlying the capacity
to symbolize episodes, actions, objects and actors.
 Use these representations to ground a functional/cognitive account
integrating syntax and semantics, and link this to a new approach to
neurolinguistics that goes "Beyond the Mirror" to test and refine the Mirror
System Hypothesis.
 Develop new models of the evolutionary linkage from the primitive mirror
system to modern human brain mechanisms: e.g., for the evolution of
increasingly subtle forms of imitation, and the transition from manual to
vocal gestures.
A Sample of Further Research Topics
Modeling of monkey brain mechanisms for
 visually guided behavior & mirror neurons
 vocalization, communication & multi-modal integration
 compound behaviors and social interactions
Comparative modeling of primate (including human) brain mechanisms:
 extending the monkey model to chimp and human
 comparative/evolutionary model of different types of imitation
The minimal subscene as a meeting ground for action, action recognition and language
Neurolinguistics and a Functional/Cognitive Integration of Syntax and Semantics
 extending the basic sentence frames for descriptions and questions with minimal
subscenes
 Cognitive Form: linking the action-frame to semantic and syntactic structures
Evolution: From Grasp to Imitation to Language
 includes a study of the linkage between sign language and vocalization as a basis
for evolutionary theorizing
Evolution: From Grasp to Imitation to Language
We will build the bridge that links the basic mirror system for grasping to language via
the stages of
 a complex imitation system for grasping
 a manual-based communication system, and
 proto-speech, characterized as the open-ended production and perception of
sequences of vocal gestures.
Major Goals:
 to extend the work on modeling the mirror system to the study of the imitation of
actions in monkey, chimpanzee and human.
 to explain how, in the course of hominid evolution, the manual-orofacial symbolic
system may have "recruited" vocalization.
This approach will help us understand language as more a cultural product than a
biologically innate one, with culture shaping a brain which is language-ready in a
multi-modal way, so that language performance may integrate speech with manual and
oro-facial gesture, or may reduce to sign language as readily as to spoken language.
Beyond the Mirror to Neurolinguistics
[Figure: The Mirror Neuron System (MNS) Model. Recoverable labels:
cIPS: Object features
AIP: Object affordance extraction
STS: Hand shape recognition; Hand motion detection
7a: Hand-Object spatial relation analysis
7b: PF/PG: Object affordance-hand state association; Integrate temporal association
F5mirror (Mirror Neurons): Action recognition; Mirror Feedback
F5canonical: Motor program (Grasp)
F4: Motor program (Reach)
M1: Motor execution
Object location]
If the monkey needs so many brain regions for the mirror system for
grasping, how many more brain regions will we need for an account of
language-readiness that goes beyond the mirror
to develop a full neurolinguistic model
far beyond the F5 → Broca’s area homology?