Plan of talk: Visual Indexes


Vision needs non-conceptual
connections to objects in the
world (just as concepts do)
Introduction to a theory of
visual indexes (aka FINSTs)
Zenon Pylyshyn
Rutgers Center for Cognitive Science
Plan of talk: Visual Indexes
Theoretical motivations behind the FINST theory
  • The need for a primitive mechanism of individuation
  • Because individuation must be of distal objects, we
    have the Correspondence Problem: When do two
    proximal tokens correspond to the same distal object?
      • A special case: incremental construction of visual representations
Empirical studies of individuation and indexing
  • Object-specific effects (static & moving objects)
  • Multiple Object Tracking technique
What, if any, encoded properties are used to
individuate, index, and track objects?
Visual Indexes (FINSTs) and what they
mean for vision science and cognitive science
The need for a mechanism
that individuates objects
Examples of solving the
correspondence problem
Individuating distal objects
requires solving the
correspondence problem
Object-based allocation of
visual attention
A special case of the correspondence problem occurs
when visual representations
are constructed over time.
Multiple Object Tracking
and Visual Indexes: what it
means for connecting
vision and the world
An important function of early vision is
to individuate and select token elements
(let’s call them “objects” for now)
 The most basic perceptual operation is the
individuation and selection that precedes the
formulation of perceptual judgments.
 Making visual judgments presupposes that
the things (objects) that judgments are about
have been individuated and selected (or
indexed – i.e., made accessible).
Another way to put this is that the arguments
of perceptual predicates P(x,y,z,…) must be
bound to things in the world in order for the
judgment to have perceptual content.
Several objects must be picked out
at once in relational judgments
 For example, when we judge that certain objects are
collinear, we must select (and the visual system must
be able to refer to) the relevant individual objects.
Several objects must be picked out
at once in relational judgments
 The same is true for other relational predicates,
like inside or on-the-same-contour… etc. We
must pick out the relevant individual objects first.
Several objects must be picked out
at once in numerical judgments
In subitizing, the cardinality of sets of 4 or
fewer can be judged rapidly and accurately
(enumerating more than 4 is slow and error-prone).
Subitizing only occurs if items can be
automatically individuated.
Enumerating different layouts of squares
Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers
enumerated differently? A limited capacity preattentive stage in vision.
Psychological Review, 101(1), 80-102.
Another property that cannot be
used to subitize: on-same-contour
Individuation is different from
discrimination
How do we select (and index)
objects in our field of view?
 The principal way we select individual objects is
by foveating them – by looking directly at them
(Notice that this results in a deictic reference).
 We can also select with focal attention, which is
independent of direction of gaze.
 Focal attention appears to be unitary, yet we can
select more than one thing at a time (e.g., in
making a relational judgment). So it seems that
we need to distinguish attending from selecting:
That’s where Visual Indexes or FINSTs come in.
 A question for later: In virtue of what properties
are primitive objects individuated and indexed?
Indexes must individuate and select objects
in the world. This leads to the ubiquitous
correspondence problem in vision
 Apparent motion, stereo vision, tracking, and
very many visual computations face the problem
of identifying which proximal image-features
correspond to the same individual distal object.
 Less well known is the correspondence problem
faced when a single visual representation is
constructed incrementally over time.
 The way the correspondence problem is solved
determines what the vision system counts as an
individual. These primitive individuals (called
“objects”) are thus mind-dependent.
Example of the correspondence
problem for apparent motion
The gray disks correspond to the first flash and the black ones
to the second flash. Which of the 24 possible matches will the
visual system select as the solution to this correspondence
problem? What principle does it use? (Dawson & Pylyshyn, 1988)
Curved matches
Linear matches
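To make the count of 24 concrete: four first-flash items can be paired one-to-one with four second-flash items in 4! = 24 ways. The sketch below enumerates them and picks one using a minimum-total-displacement rule. The coordinates are hypothetical, and the minimizing rule is only one candidate principle chosen for illustration; the slide does not say that the visual system uses it.

```python
from itertools import permutations
from math import dist

# Hypothetical disk positions (not taken from the slide).
first_flash = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
second_flash = [(0.2, 0.1), (1.3, 0.1), (0.1, 1.2), (1.2, 1.3)]

# matching[i] = j means first-flash disk i is matched to second-flash disk j.
matchings = list(permutations(range(len(second_flash))))


def total_displacement(matching):
    """Sum of distances travelled if this matching were the solution."""
    return sum(dist(first_flash[i], second_flash[j])
               for i, j in enumerate(matching))


# Assumed selection rule for illustration: smallest total displacement.
best = min(matchings, key=total_displacement)

print(len(matchings))  # number of candidate one-to-one matchings
print(best)
```

Swapping in a different cost function (for example, one that penalizes neighbouring elements moving in different directions) would change which of the 24 candidates is selected.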
One of the most troubling forms of
the correspondence problem occurs
because visual representations are
constructed incrementally over time
It is clear that when vision requires eye
movements, a visual representation is constructed
incrementally. But there is also evidence that
percepts are built up over time even for the
automatic perception of simple forms. So this
type of correspondence problem is routine in
vision. Why does it constitute a special problem?
Example: Drawing a diagram and noticing its properties
Some of the distinct “views” while exploring the diagram
The correspondence problem for
incremental construction of a visual
representation
 When a property F of some particular individual
(token) object O is noticed or encoded, the visual
system must check whether object O is already
represented. If it is, the new property must be
associated with the existing representation of O.
 If the only way to identify a particular individual
object O is by its description, then the way to solve
this correspondence problem is to find an object in
memory that bears a particular description (one that
had been unique at the time). Which description? If
objects can change their properties, we don’t know
under what description the object was last stored.
Perhaps we look for an object with a description that
overlaps the present one, or perhaps we construct a
description that somehow incorporates time.
The correspondence problem for
incremental construction of a visual
representation
 Even if it were otherwise feasible to solve the
correspondence problem by searching for a
unique past description, this would in general be
computationally intractable (technically, matching
descriptions is an NP-hard problem).
In any case it is unlikely that this is what our
visual system does, for many reasons – e.g., we do
not in general find it more difficult to perceive a
scene that has many identical parts, as would be
predicted from this technique (since it would then
be more difficult to find a unique descriptor for
each object and the correspondence problem
would quickly grow in complexity).
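The contrast between finding an object by its description and holding a direct reference to it can be sketched as follows. All class and function names are hypothetical, and modelling an index as a Python object reference is an assumption made only to illustrate the argument, not a claim about the mechanism.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # tokens are distinguished by identity, not by properties
class DistalObject:
    color: str
    shape: str


@dataclass
class Record:
    properties: dict = field(default_factory=dict)


def find_by_description(memory, description):
    """Return every stored record whose properties match the description."""
    return [rec for rec in memory
            if all(rec.properties.get(k) == v for k, v in description.items())]


# A scene with two identical-looking parts.
a = DistalObject("red", "square")
b = DistalObject("red", "square")

# Description route: both stored records fit, so the match is ambiguous.
memory = [Record({"color": a.color, "shape": a.shape}),
          Record({"color": b.color, "shape": b.shape})]
candidates = find_by_description(memory, {"color": "red", "shape": "square"})

# Index route: each record is bound to the token itself, so a newly
# noticed property goes to the right record without any search.
index_table = {a: Record(), b: Record()}
index_table[a].properties["has_notch"] = True

print(len(candidates))
print(index_table[a].properties, index_table[b].properties)
```

With identical parts the description route returns more than one candidate, while the reference route is unaffected by how similar the objects look.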
In virtue of what visual properties
are objects individuated?
The most plausible property used in selecting
and accessing an object is its location (this is
often the only unique property available).
The notion of a pointer suggests the use of
location-as-access.
Virtually all theories of visual attention and
property detection assume that we access an
object’s properties by first retrieving its
location.
But….
 Although there is a great deal of evidence for the
priority of encoding location, this does not show
that properties must be accessed by their location.
 In studies in which objects remain stationary,
location is confounded with individuality since in
these cases being at a particular location is
coextensive with being a particular individual.
 But there is also recent evidence that we can
access an object’s properties solely by virtue of
the object’s persistence qua individual. This is
referred to as object-based attention.
Unconfounding location and
individuality
There are at least two possible ways to
unconfound location and individuality:
1. use moving objects
2. use “objects” whose identity and/or
‘motion’ is independent of their spatial
location.
Distinguishing access-by-location
and access-by-individual
1. Moving objects
 Object-specific priming (Object Files)
 Object-specific Inhibition of Return *
 Simultanagnosia & Visual Neglect *
 Multiple Object Tracking (MOT)
2. Spatially coincident objects
 Single-object advantage *
 tracking in “feature space”
* Some of these may be omitted for lack of time
Moving object studies…
Object-specific Priming (object-file theory)
Kahneman, D., Treisman, A., & Gibbs, B. J. (1992). The reviewing of object files:
Object-specific integration of information. Cognitive Psychology, 24(2), 175-219.
Sequence of displays in a simple Object-Priming experiment
Object File example: Wrong letter in box
Object File example: Correct letter in box
Moving object studies…
Inhibition of Return
(Tipper et al. 1991)
When the cue-target interval is between
300 ms and 900 ms it takes longer to
detect a cued target than an uncued one
– this is called Inhibition of Return.
Moving object studies…
“Inhibition of Return” moves
with the object that is inhibited
If the cued object moves, the
Inhibition of Return moves with it.
Multiple Object Tracking Experiments
How do we do it? What properties of
individual objects do we use?
People can track 5 or more objects
under a wide variety of conditions
Objects don’t even have to avoid collisions!
Objects can even disappear from view,
as long as they do it in the right way
There must be local evidence of an occluding surface.
A possible location-updating tracking algorithm
1. While the targets are visually distinct, scan attention to each target
   and encode its location on a list. When targets begin to move:
2. For n = 1 to 4: check the n’th position in the list and retrieve the
   location Loc(n) listed there.
3. Go to location Loc(n). Find the closest element to Loc(n).
4. Update the n’th position on the list with the actual location of the
   element found in #3. This becomes the new value of Loc(n).
5. Move attention to the location encoded in the next list position,
   Loc(n+1).
6. Repeat from #2 until elements stop moving.
7. Go to each Loc(n) in turn and report elements located there.
We compared the above algorithm with human performance on the very
same displays. The algorithm assumes that (1) focal attention is required
to encode locations (i.e., encoding is not parallel), and (2) focal attention
is unitary and has to be scanned from location to location. However, it
assumes no encoding (or dwell) time at each element.
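As a rough illustration, the location-updating algorithm might be sketched in Python as follows. The function name, the frame format, and the simplification that attention makes exactly one visit per display frame are assumptions of this sketch, not details given in the talk.

```python
from math import dist


def track_by_location_updating(initial_locations, frames):
    """Sketch of a serial location-updating tracker.

    initial_locations: (x, y) of each target, encoded while the targets
        are still visually distinct.
    frames: successive displays, each a list of (x, y) element positions.
        Assumption: attention visits one stored location per frame.
    Returns the stored locations when motion stops.
    """
    loc = list(initial_locations)
    n = 0
    for elements in frames:
        # Retrieve Loc(n), go there, and find the closest element.
        closest = min(elements, key=lambda e: dist(e, loc[n]))
        # The element's actual position becomes the new Loc(n).
        loc[n] = closest
        # Move attention to the next list position, wrapping around.
        n = (n + 1) % len(loc)
    return loc


# Two targets drifting toward each other, plus one nontarget.
frames = [
    [(1.0, 0.0), (9.0, 0.0), (5.0, 5.0)],
    [(2.0, 0.0), (8.0, 0.0), (5.0, 4.0)],
    [(3.0, 0.0), (7.0, 0.0), (5.0, 3.0)],
    [(4.0, 0.0), (6.0, 0.0), (5.0, 2.0)],
]
final = track_by_location_updating([(0.0, 0.0), (10.0, 0.0)], frames)
print(final)
```

Because each stored location is refreshed only once per scan cycle, a nontarget can end up closer to a stale Loc(n) than the target itself. In this sketch that is how targets would be lost when scanning is slow relative to object motion.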
Predicted performance of the location updating
algorithm as a function of attention scanning speed
What properties are used in
(a) selecting objects, and
(b) tracking objects?
 Notice that these are different
operations and need not involve
the same properties
Role of object properties
What properties can be used to
select (index) an object in MOT?
We have evidence that under certain
conditions selecting objects can be done
either automatically or voluntarily.
 Automatic selection requires “popout” features
(sudden appearance, motion, stereo depth, etc)
 Voluntary selection can use any discriminable
property, but the objects must be attended
serially and the property must be available
long enough for this to occur (Annan study)
Role of object properties (continued)
What properties can be used to
track indexed objects?
 We have some evidence that observers do not
encode or use intrinsic object properties (e.g.,
color, shape) during tracking:
 When we stop and ask, observers cannot tell us what
properties objects had and they do not notice when
properties like color/shape change during occlusion;
 There is some evidence that tracking occurs (at least
for small numbers of objects) even if it is not
task-relevant (e.g., object-based priming and IOR);
 We have some evidence that when objects differ in
non-identifying (asynchronously changing) properties,
they are not tracked any better than if they do not
differ in these properties.
Role of object properties (continued)
What properties can be used to
track indexed objects?
 We have some evidence that observers do not use
an encoding of the trajectory of objects in
tracking – i.e., tracking is not predictive (Brian
Keane).
  • Tested conditions in which all objects disappeared for t
    milliseconds (up to half a second) and then reappeared:
      • where they would have been at that time (worst);
      • where they were when they disappeared (best);
      • where they were t ms previously (almost as good as above);
      • all shifted left, right, up or down by the same distance.
 Targets are tracked most poorly when they reappeared
where they would have been at that time, best when
they reappeared where they disappeared, and in
between for the other conditions.
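The four reappearance conditions can be written out as simple position arithmetic. This sketch assumes constant-velocity motion during the gap and uses hypothetical names and units (pixels, milliseconds); the talk does not specify how the displays were generated.

```python
def reappearance_positions(p, v, t, shift):
    """Where one object reappears under each tested condition.

    p: (x, y) position at the moment of disappearance
    v: (vx, vy) velocity in pixels per millisecond (assumed constant)
    t: duration of the disappearance in milliseconds
    shift: (dx, dy) common displacement applied to all objects
    """
    x, y = p
    vx, vy = v
    dx, dy = shift
    return {
        "extrapolated": (x + vx * t, y + vy * t),   # where it would have been
        "at_disappearance": (x, y),                 # where it vanished
        "t_ms_earlier": (x - vx * t, y - vy * t),   # where it was t ms before
        "shifted": (x + dx, y + dy),                # same offset for all objects
    }


positions = reappearance_positions(
    p=(100.0, 200.0), v=(0.25, -0.125), t=400, shift=(0.0, 30.0)
)
for condition, where in positions.items():
    print(condition, where)
```

A predictive tracker would be expected to do best in the "extrapolated" condition, which is the opposite of the reported ordering.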
Role of object properties (continued)
Do observers use some version
of object locations for tracking?
 It has been suggested that perhaps instead of using
the location-updating method to track, observers
respond to the objects’ “spatiotemporal trajectory”
property (e.g., to their “space-time worms”).
Spatiotemporal continuity as a
property that is used in tracking
 Could a mechanism respond to spatiotemporal
continuity without responding to object identity?
 The notion of spatiotemporal trajectory presupposes
that it is the trajectory of a single individual object,
and not a sequence of time-slices of different objects.
Therefore it assumes that the individual object has
been selected and tracked. Responding to a
spatiotemporal trajectory may be the same as tracking
an object’s identity.
Another way to unconfound
individuality and location
Can we attend to objects that are not
distinguished by their location?
 Single-object advantage studies
Can we track (generalized) objects that do
not move through real space, but move
through some other property space?
Observers can track non-spatial
‘virtual objects’ that move
through a ‘property space’:
Tracking superimposed surfaces
Two superimposed
Gabor patches that
vary in spatial
frequency, color and
angle
Blaser, Pylyshyn & Holcombe (2000)
Changing feature dimensions
Surfaces move randomly
in “feature-space”
Snapshots
snapshots taken every 250 ms
Such generalized ‘objects’ can be tracked
individually, and they also show single-object
superiority for change detection.
Some speculations about what vision
needs and what the Early Vision
module may provide (1)
1. We need a mechanism that puts us in
causal contact with distal objects in a
visual scene – a contact that does not
depend on the object satisfying a certain
(conceptual) description, but on a brute
causal connection.
 We need such a connection in order to
connect vision and action.
 We need such a connection in order to ground
concepts to their instances.*
Speculations on what vision needs and
what the visual module may provide (2)
2. We need a mechanism that keeps track of the
identity of distal objects without using their
encoded properties – this happens whenever the
correspondence problem is solved.
  • Such a mechanism realizes a rudimentary identity-tracker,
    with its own internal ‘rules’.
3. This is not a general identity-maintenance
process; it will not allow you to recognize the
identity of a person in a picture and a person on
the street. But it may provide a way to maintain
same-objecthood within the modular early vision
system. There is also this tantalizing fact …
 There is evidence for such a mechanism in babies as
young as 4 months (Leslie, Spelke)!
Other studies: Implications for
visually-controlled action, infant
cognition, and robotics
A short tour of research in which the notion
of deictic (or indexical) reference has been
appealed to.
Ballard, Hayhoe et al.’s proposal
for a “deictic strategy”
People appear to use their direction-of-gaze as a
reference point in encoding patterns and would
prefer to make more eye movements rather than
devote extra effort to memorizing a simple pattern.
Ballard, D. H., Hayhoe, M. M., et al. (1997). Deictic codes for the
embodiment of cognition. Behavioral and Brain Sciences, 20(4), 723-767.
Use of deictic pointers in the Ballard et
al. study
 The task is to copy the model by getting blocks from the resource
and constructing a copy of the model in the workspace
 If subjects memorized and copied 2-block patterns it would take
them 4 glances. Instead, subjects made 18 fixations into the
model and did not memorize more than they needed for the next
basic one-block action. The most common sequence was: fixate
model; fixate and pickup block; fixate model; fixate workspace
and drop off block (M-P-M-D). If the color/location of a block
changed during the sequence, the change went undetected,
showing that the color/location of other blocks was not encoded.
 The strategy of using where the eye points as the reference for the
memory representation is inefficient from the perspective of the
number of eye movements required, but it appears to be more
efficient from the point of view of the memory cost.
 This result illustrates the habitual use of a deictic strategy wherein
pointing into a real scene takes precedence over memorization.
Ballard, Hayhoe et al. call this
method of exercising perceptual
motor skills the “deictic strategy”
 People appear to use direction-of-gaze as the reference
point in encoding patterns and would prefer to make more
eye movements than memorize even a very simple pattern.
 But notice that subjects need to be able to move their gaze
back to where they left off, so they need more than one
deictic reference pointer, just as FINST theory postulates.
 The use of deictic references is a very general strategy, not
only because of the cost of storing a complex spatial
representation, but also because the information is then in
the right form for action – for the command “pick that up”
where the demonstrative refers to what is being attended
or foveated, which may remain unrecognized and indeed
even unconceptualized in any way.
Relation to work on infants’ sensitivity
to the cardinality of sets of objects
Alan Leslie’s “Object Indexes”
Infants as young as 4 months of age show surprise (longer
looking time) when they watch two things being placed behind
a screen and when the screen is lifted it reveals only one thing.
Below 10 months of age they are in general not surprised when
the screen is lifted to reveal two things that are different from
the ones they saw being placed behind the screen, so long as
their numerosity is correct.
In some cases, infants (age 10 months) use the difference in
color of the objects they are shown one-at-a-time to infer their
numerosity, but they do not record the colors and use them to
identify the objects that are revealed when the screen is lifted.
Leslie & Tremoulet: Infants aged 10 and 12 months are shown a red and then a green object which are then
hidden behind a screen. The 10 month old is surprised if raising the screen reveals the wrong number of
objects, not if it reveals the wrong color of objects. Color is used to individuate objects, but not to keep track
of them! At 12 months children can use color to keep track of how many objects went behind the screen.
Object Indexes in infant enumeration
Leslie, A. M., Xu, F., Tremoulet, P. D., & Scholl, B. J. (1998). Indexing and the
object concept: Developing `what' and `where' systems. Trends in Cognitive
Sciences, 2(1), 10-18.
 Leslie used the notion of an “object index” (which
is the same as a FINST) to explain these results.
According to his account, babies set up an index to
each object they attend to. When the object
disappears behind the screen, the index remains
active. When objects reappear and the indexes are
not matched one-one, it creates a failure of
expectation, which leads to longer looking times.
 Much remains unspecified (e.g. what do the
indexes point to when the objects are hidden?) but
the appeal to indexes is consistent with the apparent
abstraction to the numerical identity of objects.
Mental representation of space:
The core of the imagery debate
 It appears that some forms of thought (i.e., those
accompanied by the phenomenology of “seeing
with the mind’s eye”) have spatial properties in a
way that other forms of thought do not.
 It is, of course, possible to encode spatial relations
in any form of representation, but what do we do
about such properties of imagery as …
 S-R compatibility, eye-movements, visual-motor
adaptation, image superposition findings (scanning,
interference, illusions,…). These all suggest that
images have spatial properties. This has led to the
picture-in-the-head neuroscience program.
The good news is: We don’t need a
spatial display in our head if we have
the right kind of deictic contact with
real (perceived) space
 None of the experiments that are alleged to show
the existence of a spatial display (in visual cortex)
need to appeal to anything more than a small
number of imagined locations.
 If we can index a small number of (occupied)
locations in real space (using FINSTs) we can use
them to allocate attention or to program motor
commands.
 If these indexed objects are also bound to objects
of thought this will result in our thoughts (i.e.
images) having persisting spatial relations.
Some related trends in artificial
intelligence: Situated Robots
 Some people in Artificial Intelligence have
embraced (and have in some cases been overcome
by) a recognition of the need for a special
indexical relation between representations and
the world. While some of this “situated”
movement has become a fad, there is an
important point behind the situated movement,
and it is the same point the Visual Index theory
has been making: We need some nonconceptual
connections between representations and things.
Forms of representation for a robot: using indexicals
Pylyshyn, Z. W. (2000). Situating vision in the world.
Trends in Cognitive Sciences, 4(5), 197-207
Indexes play a role very similar to that of demonstratives.
Are demonstratives essential for characterizing beliefs
and for explaining the connection between beliefs and
actions? Here is an example due to John Perry*:
“The author of the book Hiker’s Guide to the Desolation
Wilderness stands in the wilderness beside Gilmore Lake,
looking at the Mt. Tallac trail as it leaves the lake and climbs
the mountain. He desires to leave the wilderness. He
believes that the best way out from Gilmore Lake is to follow
the Mt. Tallac trail up the mountain … But he doesn’t move.
He is lost. He is not sure whether he is standing beside
Gilmore Lake, looking at Mt. Tallac, or beside Clyde Lake,
looking at the Maggie peaks. Then he begins to move along
the Mt. Tallac trail. If asked, he would have to explain the
crucial change in his beliefs in this way: ‘I came to believe
that this is the Mt. Tallac trail and that is Gilmore Lake’.”
* Perry, J. The problem of the essential indexical. In Themes from Kaplan (eds.
Almog, J., Perry, J. & Wettstein, H.) (Oxford University Press, New York, 1989).
Perry’s example is intended to show that in order to
understand and explain the action of the lost author
it is essential to use demonstratives such as this and
that in expressing the author’s beliefs.
A unique description of the Mt. Tallac trail
might help bring the person to the right belief,
but the problem of connecting the belief to an
action would remain unsolved until the person
had a deictic or demonstrative thought such as:
“That is the Mt. Tallac trail.”
or perhaps,
“The trail I am now looking at is the Mt. Tallac
trail”
Summary: FINSTs keep us
connected with the world
Selected references related to this talk
• Annan, V., & Pylyshyn, Z. W. (2002). Can indexes be
voluntarily assigned in multiple object tracking? Paper
presented at Vision Sciences 2002, Sarasota, FL.
• Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N.
(1997). Deictic codes for the embodiment of cognition.
Behavioral and Brain Sciences, 20(4), 723-767.
• Blaser, E., Pylyshyn, Z. W., & Holcombe, A. O. (2000).
Tracking an object through feature-space. Nature, 408(9), 196-199.
• Burkell, J., & Pylyshyn, Z. W. (1997). Searching through
subsets: A test of the visual indexing hypothesis. Spatial Vision,
11(2), 225-258.
• Dawson, M., & Pylyshyn, Z. W. (1988). Natural constraints in
apparent motion. In Z. W. Pylyshyn (Ed.), Computational
Processes in Human Vision: An interdisciplinary perspective
(pp. 99-120). Stamford, CT: Ablex Publishing.
• Intriligator, J., & Cavanagh, P. (2001). The spatial resolution of
attention. Cognitive Psychology, 43(3), 171-216.
• Leslie, A. M., Xu, F., Tremoulet, P. D., & Scholl, B. J. (1998).
Indexing and the object concept: Developing `what' and `where'
systems. Trends in Cognitive Sciences, 2(1), 10-18.
• Nissen, M. J. (1985). Accessing features and objects: Is location
special? In M. I. Posner & O. S. Marin (Eds.), Attention and
performance XI (pp. 205-219). Hillsdale, NJ: Lawrence
Erlbaum.
• Pylyshyn, Z. W. (1989). The role of location indexes in spatial
perception: A sketch of the FINST spatial-index model.
Cognition, 32, 65-97.
• Pylyshyn, Z. W. (1994). Some primitive mechanisms of spatial
attention. Cognition, 50, 363-384.
• Pylyshyn, Z. W. (2000). Situating vision in the world. Trends in
Cognitive Sciences, 4(5), 197-207.
• Pylyshyn, Z. W. (2001). Visual indexes, preconceptual objects, and
situated vision. Cognition, 80(1/2), 127-158.
• Pylyshyn, Z. W. (submitted). Tracking without keeping track: some
puzzling findings concerning multiple object tracking.
• Pylyshyn, Z. W., Burkell, J., Fisher, B., Sears, C., Schmidt, W., &
Trick, L. (1994). Multiple parallel access in visual attention.
Canadian Journal of Experimental Psychology, 48(2), 260-283.
• Pylyshyn, Z. W., & Storm, R. W. (1988). Tracking multiple
independent targets: evidence for a parallel tracking mechanism.
Spatial Vision, 3(3), 1-19.
• Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items
through occlusion: Clues to visual objecthood. Cognitive
Psychology, 38(2), 259-290.
• Scholl, B. J., Pylyshyn, Z. W., & Feldman, J. (2001). What is a
visual object: Evidence from target-merging in multiple-object
tracking. Cognition, 80, 159-177.
• Scholl, B. J., Pylyshyn, Z. W., & Franconeri, S. L. (submitted). The
relationship between property-encoding and object-based
attention: Evidence from multiple-object tracking.
• Sears, C. R., & Pylyshyn, Z. W. (2000). Multiple object tracking
and attentional processes. Canadian Journal of Experimental
Psychology, 54(1), 1-14.
• Tipper, S., Driver, J., & Weaver, B. (1991). Object-centered
inhibition of return of visual attention. Quarterly Journal of
Experimental Psychology, 43A, 289-298.
The End . . . (except for appendices)
 Appendix 1: Other findings concerning the
Multiple Object Tracking task
 The question of how the correspondence problem
in vision is solved in general
 The question of whether objects are individuated
and/or tracked by their locations.
 Almost everyone (except me) believes that they are.
Appendix: Some other findings concerning
object tracking (1)
 Detection of events on targets is better than on
nontargets, but this does not generalize to locations
between targets;
 Objects can continue to be tracked when they
disappear completely behind occluders, as long as the
mode of disappearance is compatible with there
being an occluding surface;
 Objects can all disappear from view for as long as
330 ms without impairing tracking;
 When objects disappear behind an occluder and
come out a different color or shape, the change is
unnoticed;
Appendix: Some other findings
concerning object tracking (2)
 Not all distinct feature clusters can be tracked; some,
like the endpoints of a line, cannot;
 People can track items that automatically attract
attention, or they can decide which items to track;
but in the latter case it appears that they may have to
visit each object serially;
 Successful tracking of an object entails keeping track
of it as a particular individual, yet people are poor at
keeping track of which successfully tracked (initially
numbered) item is which. This may be because:
 When observers make errors, they are more likely to switch
the identity of a target with that of another target than the
identity of a target with that of a nontarget.
The whole truth about multiple
object tracking
And many more demos ….
How do we do it? What properties of
individual objects do we use?
• MOT with occlusion
• MOT with Virtual Occluders
• MOT with implosion/explosion
• MOT of the endpoints of a line
• MOT squares with rubber band connections
• MOT with IDs (which is which?)
• Track non-flashed (3 blinks)
• Track non-flashed (one flash)
Most theories of attention assume
that objects are accessed by the
prior encoding of their location:
1. Theories of visual search, including Treisman’s
Feature Integration Theory, assume that location
provides the means for detecting property-conjunctions.
To find a conjunction of properties
one finds the first property, determines its location
on the master feature map, and checks to see
whether the second property is also located there.
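The location-mediated conjunction check described in this point can be sketched as a lookup through per-property location sets. The map contents, names, and grid coordinates are hypothetical, and the sketch is a simplification for illustration rather than an implementation of Feature Integration Theory.

```python
# Hypothetical feature maps: each property value lists the locations
# at which it was detected.
color_map = {"red": {(0, 0), (2, 1)}, "green": {(1, 1)}}
shape_map = {"square": {(2, 1), (1, 1)}, "circle": {(0, 0)}}


def find_conjunction(color, shape):
    """Find a location holding both properties, going through location.

    Find the first property, take each of its locations in turn, and
    check whether the second property is also present at that location.
    """
    for location in sorted(color_map.get(color, ())):
        if location in shape_map.get(shape, ()):
            return location
    return None


print(find_conjunction("red", "square"))
print(find_conjunction("green", "circle"))
```

Here the only link between the two properties is a shared location, which is the assumption that the location-mediation view makes explicit.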
The case for the prior encoding of location
2. It has been frequently reported that when people
detect certain properties (e.g., color) in a display,
they very often also know where these properties
are located – even when they fail to report any
other properties (e.g., shape).
* There are also many reports of the detection of properties without
being able to report where the properties occurred. This happens
mainly when a second task is being performed that distracts attention,
and it leads to such errors as conjunction-illusions.
The case for the prior encoding of location …
3. Some people have explicitly tested the location-mediation
hypothesis by cuing a search display
with one property and examining the resulting
joint probabilities of detecting various other
properties. For example, Mary-Jo Nissen cued a
search display with a color C and measured (or
estimated) the probability of detecting shape S,
location L, and both shape and location S & L.
She showed that:
P(L & S | C) = P(L | C) * P(S | L)
which is what one would expect if location
mediated the joint detection.
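A small numerical sketch shows what the factorization predicts and how it differs from treating shape and location as independent given the cue. The probabilities are invented for illustration, and the joint table is constructed under the mediation assumption (shape is detected only via a detected location), so the sketch shows the consequence of the hypothesis rather than testing it.

```python
from fractions import Fraction as F

# Hypothetical values, chosen only for illustration.
p_L_given_C = F(3, 5)      # P(L | C): location detected given the color cue
p_S_given_L = F(2, 3)      # P(S | L): shape detected once location is known
p_S_given_not_L = F(0)     # strict mediation: no location, no shape


def joint_under_mediation():
    """Joint distribution over (L detected?, S detected?) given cue C."""
    joint = {}
    for l_detected in (True, False):
        p_l = p_L_given_C if l_detected else 1 - p_L_given_C
        p_s = p_S_given_L if l_detected else p_S_given_not_L
        for s_detected in (True, False):
            joint[(l_detected, s_detected)] = p_l * (p_s if s_detected else 1 - p_s)
    return joint


joint = joint_under_mediation()
p_LS = joint[(True, True)]                        # P(L & S | C)
p_S = joint[(True, True)] + joint[(False, True)]  # P(S | C)

print(p_LS, p_L_given_C * p_S_given_L)  # mediation: the two agree
print(p_L_given_C * p_S)                # what independence would predict
```

Under these assumed numbers the mediated joint probability differs from the independence prediction, which is the kind of contrast the joint-probability analysis relies on.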
The correspondence problem for
incremental construction of a visual
representation
 We are interested in solutions that could be carried out
by the vision module (as opposed to the cognitive
mind), so the solution should meet certain criteria –
e.g., capitalize on a natural constraint, as it does in
apparent motion and other early vision phenomena.
 It would make sense if early vision kept track of
individual objects using only “local support” evidence,
without relying on specific encoded properties.
 We will see that it is unlikely that locations or other object
properties are stored and used in solving the general
correspondence problem.
 I will consider some proposals for how our visual
system solves the correspondence problem – e.g., the
proposal that it uses spatiotemporal information.
If object properties are not used in
solving the general correspondence
problem, where does this leave us?
 It leaves us needing a primitive indexing
mechanism that picks out individual objects qua
individuals, and that keeps track of these objects as
they move around and change their properties
 We do not need to assume an unlimited capacity for
indexing. Indeed it seems that there might not be
more than 4 or 5 of these indexes available.
 Index maintenance favors continuous movements,
but can track objects that disappear when local cues
are compatible with certain phenomena that hold in
our kind of world (e.g., occlusions by opaque
surfaces, blinks, saccadic eye movements, etc)
 Such a mechanism was proposed in 1978 and was
called a FINST.
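The properties listed above (a small fixed pool of indexes, binding to individuals rather than to descriptions, and persistence while properties change) can be sketched as a data structure. The class name, method names, and default capacity of 4 are assumptions of this sketch, not part of the FINST theory's specification.

```python
class IndexPool:
    """A small, fixed pool of indexes, each bound to one individual."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._bound = {}  # index number -> the object it is bound to

    def assign(self, obj):
        """Bind a free index to obj; return None if all indexes are in use."""
        for i in range(self.capacity):
            if i not in self._bound:
                self._bound[i] = obj
                return i
        return None

    def deref(self, i):
        """Access the indexed individual, whatever its current properties."""
        return self._bound[i]

    def release(self, i):
        """Free an index so it can be assigned to another object."""
        self._bound.pop(i, None)


pool = IndexPool(capacity=4)
objects = [{"color": "red", "pos": (k, 0)} for k in range(5)]
handles = [pool.assign(obj) for obj in objects]

# The object changes its properties; the index still picks out the same one.
objects[0]["color"] = "green"
objects[0]["pos"] = (7, 3)

print(handles)
print(pool.deref(0))
```

The fifth object receives no index because the pool is exhausted, and dereferencing index 0 returns the same individual even though none of its stored properties match what was there at assignment.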