A neglected problem in the computational theory of mind Object Tracking and the Mind-World gap Zenon Pylyshyn Rutgers Center for Cognitive Science.


A neglected problem in the
computational theory of mind
Object Tracking and the Mind-World gap
Zenon Pylyshyn
Rutgers Center for Cognitive Science
Before I begin I would like you to see a ‘video
game’ that will figure in the last part of my talk

The demonstration shows a task called
“Multiple Object Tracking”
 Track the initially-distinct (flashing) items
through the trial (here 10 secs) and indicate
at the end which items are the “targets”
 After each example I’d like you to ask
yourself, “How do I do it?”
 If you are like most of our subjects you will
have no idea, or a false idea…
Keep track of the objects that flash
How do we do it? What properties
of individual objects do we use?
Going behind occluding surfaces does not disrupt tracking
Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual
objecthood. Cognitive Psychology, 38(2), 259-290.
Not all well-defined features can be tracked:
Track endpoints of these lines
Endpoints move exactly as the squares did!
The basic problem of cognitive science
 What determines our behavior is not how the
world is, but how we represent it as being
 As Chomsky pointed out in his review of Skinner, if we
describe behavior in relation to the objective properties
of the world, we would have to conclude that behavior
is essentially stimulus-independent
 Every naturally-occurring behavioral regularity is
cognitively penetrable
 Any information that changes beliefs can
systematically and rationally change behavior
Representation and Mind
Why representations are essential

Do representations only come into play in
“higher level” mental activities, such as
reasoning?
 Even at early stages of perception many of
the states that must be postulated are
representations (i.e. what they are about
plays a role in explanations).
Examples from vision (1): Intrapercept constraints
Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.
Examples from vision (2):
The Poggendorff illusion depends on perceived
contours – they need not be physical edges
The rules of color mixing apply to perceived color
‘Red light and yellow light mix to produce orange light’
 This “law” holds regardless of how the red light and
yellow light are produced;
 The yellow may be light of 580 nanometer wavelength,
or it may be a mixture of light of 530 nm and 650 nm
wavelengths.
 So long as one light looks yellow and the other looks red
the “law” will hold – the mixture will look orange.
Another example of a classical representation
Other forms of representation….
Lines FG, BC are parallel and equal.
Lines EH, AD are parallel and equal.
Lines FB, GC are parallel and equal.
Lines EA, HD are parallel and equal.
Vertices EF, HG, DC and AB are joined....
Part-Of{Cube, Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)}
Part-Of{Top-Face, Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG)},…
What’s wrong with this picture?
What’s wrong is that the CTM is incomplete — it does
not address a number of fundamental questions
 It fails to specify how representations connect with
what they represent – it’s not enough to use English
words in the representation (that’s been a common
confusion in AI) or to draw pictures (a common
confusion in theories of mental imagery)
 English labels and pictures may help the theorist recall
which objects are being referred to …
 But what makes it the case that a particular mental
symbol refers to one thing rather than another?
 How are concepts grounded? (Symbol Grounding Problem)
Another way to look at what the
Computational Theory of Mind lacks

The missing function in the CTM is a mechanism
that allows perception to refer to individual things
in the visual field directly and nonconceptually:
 Not as “whatever has properties P1, P2, P3, ...”, but as a
singular term that refers directly to an individual and
does not appeal to a representation of the individual’s
properties.
 Such a reference is like a proper name or a pointer in a
computer data structure, or like a demonstrative term
(like this or that) in natural language.
 Note that in a computer a pointer does not refer via a
location, despite what the term “pointer” suggests
An example from personal history: Why
we need to pick out individual things
without referring to their properties

We wanted to develop a computer system that would
reason about geometry by actually drawing a diagram
and noticing adventitious properties of the diagram
from which it would conjecture lemmas to prove
 We wanted the system to be as psychologically
realistic as possible so we assumed that it had a narrow
field of view and noticed only limited, spatially restricted information as it examined the drawing
 This immediately raised the problem of coordinating
noticings and led us to the idea of visual indexes to
keep track of previously encoded parts of the diagram.
Begin by drawing a line….
L1
Now draw a second line….
L2
And draw a third line….
L3
Notice what you have so far….(noticings are
local – you encode what you attend to)
L1
V6
L2
There is an intersection of two lines…
But which of the two lines you drew are they?
There is no way to indicate which individual
things are seen again without a way to refer to
individual (token) things
Look around some more to see what is there ….
L5
L2
V12
Here is another intersection of two lines…
Is it the same intersection as the one seen earlier?
Without a special way to keep track of individuals the only
way to tell would be to encode unique properties of each of
the lines. Which properties should you encode?
In examining a geometrical figure one only
gets to see a sequence of local glimpses
The incremental construction of visual
representations requires solving a
correspondence problem over time

We have to determine whether a particular individual
element seen at time t is identical to another individual
element seen at a previous time t − Δt. This is one
manifestation of the correspondence problem.
 Solving the correspondence problem is equivalent to
picking out and tracking the identity of token
individuals as they change their appearance, their
location or the way they are encoded or conceptualized
 To do that we need the capacity to refer to token
individuals (I will call them objects) without doing so
by appealing to their properties. This requires a special
form of demonstrative reference I call a Visual Index.
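The contrast between picking out a token by its properties and picking it out directly can be sketched in Python. This is only an illustration, not a model from the talk: the `Element` class and both matching functions are hypothetical names, and a Python object reference merely stands in for a Visual Index. Matching by a stored description goes wrong when the token changes or when another token looks the same, whereas the reference keeps picking out the same individual.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: equality falls back to token identity
class Element:
    kind: str
    x: float
    y: float


def same_by_description(earlier: Element, later: Element) -> bool:
    """Descriptive route (hypothetical criterion): compare encoded properties."""
    return (earlier.kind, earlier.x, earlier.y) == (later.kind, later.x, later.y)


def same_by_index(index: Element, later: Element) -> bool:
    """Index route: does the reference still pick out this very token?"""
    return index is later


line_a = Element("line", 0.0, 0.0)
line_b = Element("line", 0.0, 0.0)  # a distinct token with identical properties

index = line_a                                        # grab the token, record nothing about it
snapshot = Element(line_a.kind, line_a.x, line_a.y)   # description stored at the earlier time

line_a.x, line_a.y = 3.0, 4.0                         # the token changes location

# The stored description now matches the wrong token and misses the right one.
description_confuses_tokens = same_by_description(snapshot, line_b)
description_loses_token = not same_by_description(snapshot, line_a)

# The reference is unaffected by the change in properties.
index_keeps_token = same_by_index(index, line_a) and not same_by_index(index, line_b)
```

The sketch says nothing about how an index stays attached to a moving object; it only shows that, once attached, identity over time need not be computed from properties.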
A note about the use of labels in this example
 There are two purposes for figure labels. One is to specify
what type of individual it is (line, vertex, ...). The other is to
specify which individual it is so it is individuated and thus
can be selected or bound to the argument of a predicate.
 The second of these is what I am concerned with because
indicating which individual it is is essential in vision.
 Many people (e.g., Marr, Yantis) have suggested that individuals
may be marked by tags, but that won’t do since one cannot
literally place a tag on an object and even if we could it would not
obviate the need to individuate and index, just as labels don’t help.
 Labeling things in the world is not enough because to refer
to the line labeled L1 you would have to be able to think
“this is line L1” and you could not think that unless you
had a way of first picking out the referent of this.
The difference between a direct (demonstrative) and a
descriptive way of picking something out has produced
many “You are here” cartoons.
It is also illustrated in this recent New Yorker cartoon…
The difference between descriptive and
demonstrative ways of picking something out
(illustrated in this New Yorker cartoon by Sipress)
‘Picking out’

Picking out entails individuating, in the sense of separating
something from a background (what Gestalt psychologists
called a figure-ground distinction)

This sort of picking out has been studied in psychology under
the heading of focal or selective attention.
 Focal attention appears to pick out and adhere to objects rather than places

In addition to a unitary focal attention there is also evidence
for a mechanism of multiple references (about 4 or 5), that I
have called a visual index or a FINST
 Indexes are different from focal attention in many ways that we
have studied in our laboratory (I will mention a few later)
 A visual index is like a pointer in a computer data structure – it
allows access but does not itself tell you anything about what is
being pointed to
The requirements for picking out and keeping
track of several individual things reminded me of
an early comic book character called Plastic Man
Imagine being able to place several of your fingers on things
in the world without recognizing their properties while doing
so. You could then refer to those things (e.g. ‘what finger #2 is
touching’) and could move your attention to them. You would
then be said to possess FINgers of INSTantiation (FINSTs)
FINST Theory postulates a limited number of pointers
in early vision that are elicited by certain events in the
visual field and that enable vision to refer to those
things without doing so under a concept or a description
FINSTs and Object Files form the link between the
world and its conceptualization
The only nonconceptual contents in this picture are FINST indexes!
Object File contents are conceptual!
Information (causal) link
FINST demonstrative reference link
A note on terminology





A FINST provides a reference to an individual visible ‘thing’
I sometimes call this referent a FING by analogy with FINST and
sometimes an object to conform with usage in psych, but FINGs are
nonconceptual so they do not pick out something as an object,
because OBJECT is a concept. Maybe “proto-object”?
I have also called it a pointer, but that erroneously suggests that it
“points to” the location of an object, as opposed to the object itself.
In a computer, a pointer is the name of a stored datum.
I have said that a FINST is a visual demonstrative like ‘this’ or ‘that’,
but that too is misleading because the reference of a demonstrative
depends on the intentions of the speaker
I have also noted that a FINST is like a proper name but that won’t
do since a name can pick out something not in sensory contact
whereas a FINST can only refer to a visible item (or one that is
briefly out of sight).
A quick tour of some evidence for FINSTs
• The correspondence problem
 The binding problem
 Evaluating multi-place visual predicates
(recognizing multi-element patterns)
 Operating over several visual elements at
once without having to search for them first
 Subitizing
 Subset search
 Multiple-Object Tracking
• Cognizing space without requiring a spatial
display in the head
Individual objects and the binding problem


We can distinguish scenes that differ by conjunctions
of properties, so early vision must somehow keep track
of how properties co-occur – conjunction must not be
obscured. This is called the binding problem.
The most common proposal is that vision keeps track
of properties according to their location and binds
together co-located properties.
The proposal of binding conjunctions by the location
of conjuncts does not work when feature location is
not punctate and becomes even more problematic if
they are co-located – e.g., if their relation is “inside”
Binding as object-based
 The proposal that properties are conjoined by virtue of their
common location has many problems
 In order to assign a location to a property you need to know its
boundaries, which requires distinguishing the object that has those
properties from its background (figure-ground individuation)
 Properties are properties of objects, not of locations – which is why
properties move when objects move. Empty locations have no
causal properties.
 The alternative to conjoining-by-location is conjoining by
object. According to this view, solving the binding problem
requires first selecting individual objects and then keeping
track of each object’s properties (in its object file)
 If only properties of selected objects are encoded and if those
properties are recorded in object files specific to each object, then
all conjoined properties will be recorded in the same object file,
thus solving the binding problem
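The difference between conjoining by location and conjoining by object can be illustrated with a small Python sketch. The scene, the feature records, and the `bind` helper are all hypothetical; in particular, the `"object"` key simply assumes that individuation has already happened (the job the talk assigns to indexes). The scene is a green circle inside a red square sharing one centre, so the features are co-located.

```python
from collections import defaultdict

# Hypothetical feature records for a green circle inside a red square.
# Both shapes share the same centre, so every feature has the same location.
features = [
    {"object": "obj1", "location": (5, 5), "color": "red"},
    {"object": "obj1", "location": (5, 5), "shape": "square"},
    {"object": "obj2", "location": (5, 5), "color": "green"},
    {"object": "obj2", "location": (5, 5), "shape": "circle"},
]


def bind(records, key):
    """Group feature values into bundles keyed by `key` ('location' or 'object')."""
    bundles = defaultdict(dict)
    for record in records:
        for name, value in record.items():
            if name not in ("object", "location"):
                bundles[record[key]].setdefault(name, set()).add(value)
    return dict(bundles)


by_location = bind(features, "location")  # one bundle mixing both objects' features
by_object = bind(features, "object")      # one bundle (object file) per object
```

Grouping by location yields a single bundle holding red, green, square and circle, so which color goes with which shape is lost. Grouping by object keeps red with square and green with circle.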
Attention spreads over perceived objects
[Figure: panels with locations A, B, C and D. Depending on how the objects are perceived, attention spreads to B and not C, or to C and not B.]
Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of a prime spreads to
other parts of the same visual object more than to equally distant parts of different objects.
Being able to pick out and refer to individual
distal elements is essential for encoding patterns
 Encoding relational predicates; e.g., Collinear
(x,y,z,..); Inside (x, C); Above (x,y); Square (w,x,y,z),
requires simultaneously binding the arguments of
n-place predicates to n elements in the visual scene
 Evaluating such visual predicates requires
individuating and referring to the objects over
which the predicate is evaluated: i.e., the arguments
in the predicate must be bound to individual
elements in the scene.
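A rough Python sketch of what binding predicate arguments to scene elements amounts to. Everything here is an assumption made for illustration: the `Thing` class, the numbered `indexes` table, and the two predicate functions are hypothetical, and coordinates are used only inside the predicates, after the arguments have been bound.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # elements are individuated by identity, not by their values
class Thing:
    x: float
    y: float


def collinear(a: Thing, b: Thing, c: Thing, tol: float = 1e-9) -> bool:
    """3-place predicate: the cross product of (b - a) and (c - a) is zero on a line."""
    cross = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)
    return abs(cross) <= tol


def above(a: Thing, b: Thing) -> bool:
    """2-place predicate; assumes y grows upward."""
    return a.y > b.y


scene = [Thing(0, 0), Thing(1, 1), Thing(2, 2), Thing(2, 0)]

# Hypothetical index table: each index refers to one scene element directly.
indexes = {1: scene[0], 2: scene[1], 3: scene[2], 4: scene[3]}

# A predicate can only be evaluated once each argument is bound to an element.
line_up = collinear(indexes[1], indexes[2], indexes[3])
off_line = collinear(indexes[1], indexes[2], indexes[4])
higher = above(indexes[3], indexes[4])
```

Here `line_up` is true, `off_line` is false, and `higher` is true. The point of the sketch is the order of operations: selection of the n elements comes first, evaluation of the n-place predicate second.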
Several objects must be picked out at
once in making relational judgments
When we judge that certain objects are collinear, we must first
pick out the relevant objects while ignoring their properties
Several objects must be picked out at
once in making relational judgments

The same is true for other relational judgments like inside or on-the-same-contour, etc. We must pick out the relevant individual
objects first. Are the dots inside the same contour? On the same contour?
More functions of FINSTs
Further experimental explorations
using different paradigms

Recognizing the cardinality of small sets of things:
Subitizing vs counting (Trick, 1994)
 Searching through subsets – selecting items to search
through (Burkell, 1997)
 Selecting subsets and maintaining the selection during a
saccade (Currie, 2002)

Application of FINST index theory to infant
cardinality studies (Carey, Spelke, Leslie, Uller, etc.)
Indexes explain how children are able to acquire
words for objects by ostension without suffering
Quine’s Gavagai problem.
Another example of MOT: With self occlusion
Self-occlusion does not seriously impair tracking
Some findings of Multiple Object Tracking
 Basic finding: Most people can track at least 4 targets that
move randomly among identical non-target objects (even
5-year-old children can track 3 objects)
 Object properties do not appear to be recorded during
tracking and tracking is not improved if all objects are
visually distinct (no two objects have the same color, shape or size)
 How is it done?
 We showed that it is unlikely that the tracking is done by
keeping a record of the targets’ locations and updating them by
serially visiting the objects (Pylyshyn & Storm, 1998)
 Other strategies may be employed (e.g., tracking a single
deforming pattern), but they do not explain tracking
 Hypothesis: FINST Indexes get assigned to targets. At the end
of the trial these pointers can be used to move attention to the
targets and hence to select them
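The hypothesis can be pictured with a toy Python simulation. It is a sketch under strong assumptions, not the experimental procedure or a model of the tracking mechanism: the `Dot` class, the motion rule, and the trial length are invented for illustration, and Python references stand in for FINST indexes. The sketch takes for granted that an index stays attached to its object; it only shows that, given such attachment, the targets can be selected at the end without any stored location or appearance.

```python
import random


class Dot:
    """A display item. All dots look identical; only their positions differ."""

    def __init__(self, rng: random.Random) -> None:
        self.x, self.y = rng.random(), rng.random()

    def step(self, rng: random.Random, speed: float = 0.05) -> None:
        # Hypothetical motion rule: a small random displacement each frame.
        self.x += rng.uniform(-speed, speed)
        self.y += rng.uniform(-speed, speed)


rng = random.Random(0)
dots = [Dot(rng) for _ in range(8)]

targets = dots[:4]       # the items that flash at the start of the trial
indexes = list(targets)  # stand-ins for indexes: references only, no stored properties

for _ in range(200):     # the trial: every dot moves independently
    for dot in dots:
        dot.step(rng)

rng.shuffle(dots)        # display order carries no information about target status

# End of trial: use the indexes to pick the targets out of the display.
selected = [dot for dot in dots if any(dot is index for index in indexes)]
```

No record of where the targets started or ended is consulted when building `selected`; the references alone determine which four dots are chosen.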
What role do visual properties play in MOT?

Certain properties may have to be present in order for an
object to be indexed, and certain properties (probably
different properties) may be required in order for the index to
keep track of the object, but this does not mean that such
properties are encoded, stored, or used in tracking.
 Compare this with Kripke’s distinction between properties that fix
the referent of a proper name and the property that the name refers
to. The former only plays a role at the name’s initial “baptism.”

Is there something special about location? Do we record and
track properties-at-locations?
 Location in time & space may be essential for individuating objects,
but locations need not be encoded or made cognitively available
 The fact that an object is actually at some location or other does not
mean that it is represented as such. Representing property ‘P’ (where
P happens to be at location L) ≠ Representing property ‘P-is-at-L’.
A way of viewing what goes on in MOT

According to Kahneman & Treisman’s Object File theory, the
appearance of a new visual object causes a new Object File to
be created. Each object file is associated with its respective
object – presumably through a FINST Index.

The object file may contain information about the object to
which it is attached. But according to FINST Theory, keeping
track of the object’s identity does not require the use of this
information. The evidence suggests that in MOT, little or
nothing is stored in the object file except maybe in special cases
(e.g., when the object suddenly changes or disappears).

What makes something the same object over time is that it
remains connected to the same object-file (by the same FINST).
Thus, for vision to treat something as the same enduring
individual does not require appeal to properties or concepts.
Why is this relevant to foundational
questions in the philosophy of mind?

According to Quine, Strawson, and most philosophers, you
cannot pick out or track individuals without concepts (sortals)
 But you also cannot pick out individuals with only concepts
 Sooner or later you have to pick out individuals using non-
conceptual causal connections between thoughts and things

The present proposal is that FINSTs provide the needed
non-conceptual mechanism for individuating objects and for
tracking their identity, which works most of the time in our
kind of world. It relies on a natural constraint (Marr)
 FINST indexes provide the right sort of connection for
predicating properties of the world by allowing the
arguments of predicates to be bound to objects prior to the
predicates being evaluated. They may thus be the basis for
early vocabulary learning.
But there must be some properties
that cause indexes to be grabbed!

Of course there are properties that are causally
responsible for indexes being grabbed, and also
properties (probably different ones) that make it
possible for objects to be tracked;
 But these properties need not be represented
(encoded) and used in tracking
 The distinction between object properties that cause
indexes to be assigned and those that are represented
(in Object Files) is similar to Kripke’s distinction
between properties that are needed to pick out (name)
an object and those that constitute its meaning
Effect of target properties on MOT

Changes of target properties are not reported nor
even noticed during MOT

Keeping all targets at different color, size, or shape
does not improve tracking
Observers do not use target speed or direction in
tracking (e.g., by anticipating where the targets will
be when they reappear after occlusion)

Some open questions
 We have arrived at the view that only properties of
selected (indexed) objects enter into subsequent
conceptualization and perception-based thought (i.e.,
only information in object files is made available to
cognition)
So what happens to the rest of the visual
information?
 Visual information seems rich and fine-grained while
this theory only allows for the properties of 4 or 5
objects to be encoded!
 The present view leaves no room for nonconceptual
representations whose content corresponds to the
content of conscious experience
 According to the present view, the only content that
An intriguing possibility….
Maybe the theoretically relevant information we
take in is less than (or at least different from) what
we experience
 This possibility has received attention recently with the
discovery of various “blindnesses” (e.g., change blindness, inattentional blindness, blindsight…) as well as
the discovery of independent vision systems (e.g.,
recognition and motor control)
 The qualitative content of conscious experience may not
play a role in explanations of cognitive processes
 Even if unconceptualized information enters into causal
processes (e.g., motor control), it may not be represented or
made available to the cognitive mind, not even as a
nonconceptual representation
Vision science has always been deeply
ambivalent about the role of conscious
experience
Isn’t how things appear one of the things that our
theories must explain? Answer: There is no a priori
‘must explain’!
● The content of subjective experience is a major type of
evidence. But it may turn out not to be the most reliable
source for inferring the relevant functional states. It
competes with other types of evidence.
● How things appear cannot be taken at face value: it carries
substantive theoretical assumptions. It also draws on many
levels of processing.
It was a serious obstacle to early theories of vision (Kepler)
It has been a poor guide in the case of theories of mental imagery
(e.g., color mixing, image size, image distances). ‘Reading X off an
What next?
This picture leaves many unanswered
questions, but it does provide a mechanism
for solving the binding problem and also
explaining how mental representations
could have a nonconceptual connection with
objects in the world (something required if
mental representations are to connect with
actions)

For a copy of these slides see:
http://ruccs.rutgers.edu/faculty/pylyshyn/SelectionReference.ppt

Or MIT Press
Paperback
You are now here
X
But you are also here