Lecture 1 Connecting vision and the world: An empirical


Translate ‘thinking’ to ‘symbol processing’
• The dilemma of naturalizing the (cognitive) mind arises because the content of representations (what they are about) need not exist, even in principle, and even when they do exist, it's how we take them to be that matters.
 Nothing that plays a causal role in the natural world can have these properties
• What to do?
 Computation is the only straw afloat (Fodor).
Translate ‘thinking’ to ‘symbol processing’
• If we encode beliefs, knowledge, thoughts, and other
“propositional attitudes” (e.g., want, fear, hope, know,
guess,…) as symbolic codes, then we can use the
rules of logic to draw inferences from them to new
beliefs. We call this process “reasoning.”
• If we can design a machine so that its physical states
correspond to symbolic expressions, we may be able
to get the machine to change states in such a way as
to generate the inference (which is truth preserving, at
least under certain background conditions – e.g. if
there is enough working memory and enough time.)
That’s just what a Turing Machine does!
So does every stored-program computer built since the ENIAC (1946)


A Turing Machine is a very simple theoretical
machine that changes its state depending on what
symbols happen to be under the read head on its tape
and what state it is currently in.
Alan Turing showed, by working through the details, that such a machine could compute any function we can formally describe. In particular, one special set of TM rules can compute whatever function any other Turing Machine would compute given a certain input. This machine is called the Universal Turing Machine and is one of the most important discoveries of the 20th century.
A Turing Machine could not be simpler!
Rule 1: If you are in state S1 and you read Q, then move left, write D, and go into state S2.
[Figure: a tape of symbols (# C D Q T B ∆ V C …) with a read/write head in state Si and a rule box; the head reads 'B' and writes 'T']
The very minimum one needs to be a Turing machine is the capacity to ‘read’
symbols, write symbols, move the read/write head and change states depending on
what symbol it reads and what state it is in. It uses a finite but unbounded tape.
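The slide's description of the minimal capacities of a Turing machine (read, write, move the head, change state by a finite rule table) can be sketched in a few lines of Python. The rule set and symbols below are invented for illustration; the rule generalizes Rule 1 from the slide, except that it moves right rather than left.

```python
# Minimal sketch of a Turing machine: read the symbol under the head,
# then write, move, and change state according to a finite rule table.

def run_turing_machine(rules, tape, state="S1", halt_state="HALT", max_steps=100):
    """rules maps (state, symbol) -> (write_symbol, move, next_state)."""
    tape = dict(enumerate(tape))  # unbounded tape as a sparse dict
    head = 0
    for _ in range(max_steps):
        if state == halt_state:
            break
        symbol = tape.get(head, "_")          # '_' is the blank symbol
        write, move, state = rules[(state, symbol)]
        tape[head] = write
        head += 1 if move == "R" else -1      # move the read/write head
    return "".join(tape[i] for i in sorted(tape))

# Illustrative rule set: in S1 reading Q, write D, move right, stay in S1;
# halt when a blank is reached.
rules = {
    ("S1", "Q"): ("D", "R", "S1"),
    ("S1", "_"): ("_", "R", "HALT"),
}
print(run_turing_machine(rules, "QQQ"))  # DDD_
```

The point of the sketch is only that nothing beyond this read/write/move/state-change loop is needed for the machine to be universal, given the right rule table.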
Cognitive Science and the Tri-Level Hypothesis
Intelligent systems are organized at three
(or more) distinct levels:
1. The physical or biological level
2. The symbolic or syntactic level
3. The semantic or knowledge level
This means that different regularities
may require appeal to different levels
The central role of representation creates
some serious problems for a natural science

A serious problem arises because we are not aware
of (most of) our representations
 What we may sometimes be aware of is what our thoughts
are about, or what they would look or sound like if they
were expressed – but never properties of the thoughts
themselves.
 It is important to distinguish properties of thoughts from
properties of what they are about (e.g. mental images)


We are not even aware of deciding, choosing or
willing an action (Wegner)
Introspective evidence is just one type of evidence
and it has turned out to be mostly misleading
 What we are often aware of is our confabulations (Tammet)
Steps in naturalizing the
Representational Theory of Mind
• Mechanize the process of inference (e.g., CTM)
• Show how the basic constructs of a theory of cognition could be naturalized in principle – in terms of some natural process.
Since explanations of cognitive phenomena involve
such concepts as refer, individuate, keep track of,
encode, attend, select, imagine and so on, it is
important to the overall project that such accounts
be translatable in principle to natural processes.
 It is not enough to show where in the brain some process occurs, or to develop a theory that looks like a nervous system!
An important neglected problem in the
Representational/Computational Theory of Mind

RTM and CTM fail to specify how representations connect
with what they represent – it’s not enough to use English
words in the representation (that’s been a common
confusion in AI) or to draw pictures (a common confusion
in theories of mental imagery).
 English labels and pictures may help the theorist recall which
objects he/she intended to refer to … but
 What makes it the case that a particular mental symbol refers
to one thing rather than another? Or,
 How are concepts grounded? (The Symbol Grounding Problem, and some confusions that it engendered – Barsalou, 1999)
Another way to look at what the
Computational Theory of Mind lacks
The missing function in the CTM is a mechanism that allows perception to refer to individual things in the visual field directly and nonconceptually:
 Not as “whatever has properties P1, P2, P3, ...”, but as a singular term that refers directly to an individual and does not appeal to a representation of the individual’s properties.
 Such a reference is like a proper name, or like a pointer in a computer data structure, or like a demonstrative term (like this or that) in natural language.
 There is much more to come on the mechanism of visual indexing, because it has been explored empirically and also because it seems to have the right properties for our purpose (it works like a reference yet is clearly a causal link).
An example from personal history: Why
we need to pick out individual things
without referring to their properties



We wanted to develop a computer system that would
reason about geometry by actually drawing a diagram
and noticing adventitious properties of the diagram,
from which it would conjecture lemmas to prove
We wanted the system to be as psychologically realistic as possible, so we assumed that it had a narrow field of view and noticed only limited, spatially restricted information as it examined the drawing
This immediately raised the problem of coordinating
noticings and led us to the idea of visual indexes to
keep track of previously encoded parts of the diagram.
Begin by drawing a line….
L1
Now draw a second line….
L2
And draw a third line….
L3
What do we have so far?
You know there are three lines, but you
don’t know the spatial relations between
them. That requires:
1. Seeing several of them together (at least in pairs)
2. Knowing which object that was noticed at time t+1
corresponds to which object that was noticed at time t.
Establishing (2) requires solving one form of
what is called the correspondence problem.
This problem is ubiquitous in perception.
Solving it over time is called tracking.
For example, suppose you recall noticing
two intersecting lines such as these:
L1
L2
You see that there is an intersection of two lines…
But which of the two lines you drew earlier are they?
There is no way to indicate which individual things
are seen again without a way to refer to individual
(token) things
Look around some more to see what is there ….
L5
L2
V12
Here is another intersection of two lines…
Is it the same intersection as the one seen earlier?
Without a special way to keep track of individuals the only
way to tell would be to encode unique properties of each of
the lines. Which properties should you encode?
In examining a geometrical figure one only
gets to see a sequence of local glimpses
A note about the use of labels in this example
There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex, …). The other is to specify which individual it is, in order to keep track of it and in order to bind it to the argument of a predicate.
 The second of these is what I am concerned with, because indicating which individual it is is essential in vision.
 Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by tags, but that won’t do, since one cannot literally place a tag on an object; and even if we could, it would not obviate the need to individuate and index, just as labels don’t help.
Labeling things in the world is not enough, because to refer to the line labeled L1 you would have to be able to think “this is line L1”, and you could not think that unless you had a way of first picking out the referent of this.
The Correspondence Problem
A frequent task in perception is to establish a correspondence between proximal tokens that arise from the same distal token.
 Apparent Motion. Tokens at different times may correspond to the same object that has moved.
 Constructing a representation over time (and over eye fixations) requires determining the correspondence between tokens at different stages in constructing the representation.
 Tracking token individuals over time/space. To distinguish “here it is again” from “here is another one” and so to maintain the identity of objects.
 Stereo Vision requires establishing a correspondence between two proximal (retinal) tokens – one in each eye
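The bookkeeping involved in any of these correspondence computations can be sketched as a matching step. The greedy nearest-distance rule below is only one candidate criterion, chosen for illustration; which criterion the visual module actually prefers is an empirical question (as the Dawson experiments below ask).

```python
# Sketch of one candidate criterion for the correspondence problem:
# match each token at time t to the nearest token at time t+1.
from math import dist

def match_by_nearest(tokens_t, tokens_t1):
    """Return pairs (i, j): token i at time t corresponds to token j at t+1."""
    remaining = list(range(len(tokens_t1)))
    pairs = []
    for i, p in enumerate(tokens_t):
        j = min(remaining, key=lambda k: dist(p, tokens_t1[k]))
        remaining.remove(j)          # each proximal token is matched once
        pairs.append((i, j))
    return pairs

# Two tokens move slightly: each should be paired with its displaced
# counterpart rather than the two being swapped.
print(match_by_nearest([(0, 0), (10, 0)], [(1, 0), (11, 0)]))  # [(0, 0), (1, 1)]
```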
The difference between a direct (demonstrative) and a
descriptive way of picking something out has produced
many “You are here” cartoons.
It is also illustrated in this recent New Yorker cartoon…
The difference between descriptive and
demonstrative ways of picking something out
(illustrated in this New Yorker cartoon by Sipress )
Picking out
 Picking out entails individuating, in the sense of separating something from a background (what Gestalt psychologists called a figure-ground distinction)
 This sort of picking out has been studied in psychology under the heading of focal or selective attention.
 Focal attention appears to pick out and adhere to objects rather than places
 In addition to a unitary focal attention there is also evidence for a mechanism of multiple references (about 4 or 5), which I have called a visual index or a FINST
 Indexes are different from focal attention in many ways that we have studied in our laboratory (I will mention a few later)
 A visual index is like a pointer in a computer data structure – it allows access but does not itself tell you anything about what is being pointed to. Note that the English word pointer is misleading, because it suggests that vision picks out objects by pointing to their location.
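The pointer analogy can be made concrete in a few lines. In the sketch below (the scene and its fields are invented for illustration) the reference itself carries no description of its referent: properties are only available by using the reference to query the object, and the reference keeps picking out the same token even when that object's properties change.

```python
# Sketch of the pointer analogy: an index gives access to an object but
# does not itself carry any description of it.

scene = [
    {"kind": "line", "color": "red"},
    {"kind": "vertex", "color": "blue"},
]

index_2 = scene[1]        # like FINST #2: a bare reference, no description

# Properties are available only by using the reference to query the object:
print(index_2["kind"])    # vertex

# If the object changes its properties, the index still picks out the
# same token individual:
scene[1]["color"] = "green"
print(index_2["color"])   # green
```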
The requirements for picking out and keeping
track of several individual things reminded me of
an early comic book character called Plastic Man
Imagine being able to place several of your fingers on things
in the world without recognizing their properties while doing
so. You could then refer to those things (e.g. ‘what finger #2 is
touching’) and could move your attention to them. You would
then be said to possess FINgers of INSTantiation (FINSTs)
FINST Theory postulates a small pool of indexes in modular vision that are elicited (grabbed) by certain object-centered events in the visual field. These indexes provide a reference to those objects without using concepts
This idea is intriguing but it is missing
some important functions and distinctions
 We need to distinguish the mechanisms of early vision (inside the vision module) from those of general cognition
 We may need to distinguish different types of information in different parts of vision (e.g., representations vs physical states, conceptual vs nonconceptual, as well as personal vs subpersonal).
 Closely related to this, we need to distinguish the process of vision (seeing) from that of belief fixation.
 Finally, we need to provide a motivated proposal for what the modular (nonconceptual? subpersonal?) part of vision provides to the rest of the cognitive mind. This is a difficult problem and will occupy some of our time in the rest of this class.
Summarizing the theory so far
 A FINST index is a primitive mechanism of reference for referring to individual visible objects in the world. Only a small number (around 4–5) of indexes are available at any one time.
 Indexes refer to individual objects without referring to them under conceptual categories. They provide direct or demonstrative reference. Q: Is this seeing without seeing as?
 Indexing objects is prior to property encoding. Objects are picked out and referred to without using any encoding of their properties.
 This does not mean that object properties are irrelevant to the grabbing of indexes or to their subsequent tracking.
 The claim that we initially refer to objects without having encoded their location is surprising to many people (why?)
 What may be even more surprising is the assumption that vision can refer to objects without knowing anything about them!
Summarizing the theory so far
 An important function of these indexes is to bind arguments of visual predicates to things in the world to which the predicates apply. Only predicates with bound arguments can be evaluated. Since predicates are quintessential concepts, indexes are a necessary bridge from physical objects to conceptual representations.
 Indexes can also bind arguments of motor commands, including the command to move focal attention or gaze to the indexed object: e.g., MoveGaze(x).
 We will see shortly that when certain properties of objects are encoded (based on some priority ranking), the codes are stored in files that are specific to the individual objects. These are what we called Object Files and they play an important role in several functions computed early in the visual process.
 So this is what the model consists of at this stage 
First Approximation: FINSTs and Object Files and
the link between the world and its conceptualization
Inside the Module?
A note on terminology
● I sometimes refer to an index as a pointer and sometimes as a demonstrative, or even as the name of an object. These terms all suggest some important aspect of an index, but they are misleading
● Unlike a linguistic demonstrative, an index is grabbed (causally) by things in the world independent of the intentions of a perceiver.
● And although an index is like a proper name, it can only refer to objects with which the perceiver is in sensory contact (in what Fodor calls the “perceptual circle”).
 Notice a strange consequence of the assumption that indexes do
not refer to objects in terms of (person-level*) concepts: indexing
an object tells us neither what nor where it is!
* I added this proviso because some people speak of the categories involved in
Modular representations as being concepts. If so they are different from the
concepts with which we think.
A quick tour of some evidence for Indexes
● The correspondence problem
● The binding problem
● Evaluating multi-place visual predicates
● Operating over several visual elements without having to search for them first:
 Subitizing
 Subset search
● Multiple-Object Tracking (MOT)
 Does MOT require voluntary attention? Does it require the encoding of object properties? Could it use such codes?
● Imagining space without requiring a spatial display in the head {This is a large topic beyond the scope of this class, but see Things and Places, Chapter 5}
Apparent Motion solves a correspondence problem:
Dawson Configuration (Dawson & Pylyshyn, 1988)
Linear trajectory? Curved trajectory?
Which criterion does the visual module prefer?
Dawson Configuration (animated)
Apparent Motion solves a correspondence problem
Dawson Configuration (Dawson & Pylyshyn, 1988)
Nearest mean distance? Nearest vector distance? Nearest configural distance?
Which criterion does the visual module prefer?
Dawson Configuration (animated)
Colors & shapes are ignored
Dawson Configuration: different properties ignored
Apparent motion and computing 3D structure
Apparent motion and computing 3D structure
Rotating Sphere
Walker
Use of the “Ternus Configuration” to demonstrate
the connection between individuation of objects
and computing correspondence
The finding is that the correspondence problem,
like the subitizing phenomena to be described
next, requires the prior individuation of objects.
If objects are not individuated for any of a
number of reasons, the correspondence
problem is not solved.
Yantis’s use of the “Ternus Configuration” to
demonstrate the early visual effect of objecthood
Short time delays result in the middle object being
individuated as a single object persisting over time.
Because of this the middle object cannot be seen as
moving with the other objects, resulting in “element
motion” (since it is seen as the “same object” it does not
appear to move)
Long time delays result in “group motion” because
the middle object does not persist but is perceived
as a new object each time it reappears
Relevance to the present theme
 These different examples illustrate the need and the capability of vision to keep track of objects’ numerical identity (or their same-individualness) in a primitive nonconceptual way which ignores their visible properties.
 In each case the correspondence is computed over the individuals without any conscious awareness.
 The examples and others (stereo vision, incremental construction of representations, and keeping track of individuality over time/space) are on different time scales, so it is an empirical matter whether they involve the same mechanism, but they do address the same problem – tracking individuals without using their unique properties.
The incremental construction of visual
representations requires solving a
correspondence problem over time
 We have to determine whether a particular individual element seen at time t is identical to another individual element seen at an earlier time. This is one manifestation of the correspondence problem.
 Solving the correspondence problem is equivalent to picking out and tracking the identity of token individuals as they change their appearance, their location, or any other identifying property.
 To do that, vision needs to refer to token individuals (we generally call them objects) without doing so by appealing to their properties. This requires a special form of demonstrative reference we call a Visual Index.
A quick tour of evidence for Indexes
● The correspondence problem
● The binding problem
● Evaluating multi-place visual predicates (recognizing multi-element patterns)
● Operating over several visual elements at once without having to search for them first
 Subitizing
 Subset selection
 Multiple-Object Tracking
● Imagining space without requiring a spatial display in the head
Individual objects and the binding problem
 We can distinguish scenes that differ by conjunctions of properties, so early vision must somehow keep track of how properties co-occur – conjunctions must not be obscured. How to do this is called the binding problem.
 The most common proposal is that vision keeps track of properties according to their location and binds together co-located properties.
The proposal of binding conjunctions by the location of conjuncts does not work when feature location is not punctate, and it becomes even more problematic if the features are co-located – e.g., if their relation is “inside”
Pandemonium
An early architecture for vision, called Pandemonium, was proposed by Oliver Selfridge in 1959. This idea continues to be at the heart of many psychological models, including ones implemented in contemporary connectionist or neural net models.
[Figure: the Pandemonium architecture – cortical signal processing feeds image demons; feature demons (vertical, horizontal, and oblique lines; right and acute angles; discontinuous and continuous curves) feed cognitive demons and a decision demon]
Binding as object-based

The proposal that properties are conjoined by virtue of their
common location has many problems
 In order to assign a location to a property you need to know its
boundaries, which requires distinguishing the object that has those
properties from its background (figure-ground individuation)
 Properties are properties of objects, not of locations – which is why
properties move when objects move. Empty locations have no
causal properties.

The alternative to conjoining-by-location is conjoining by
same object. According to this view, solving the binding
problem requires first selecting individual objects and then
keeping track of selected (salient) properties of each object
(in its Object File) but not using these properties to track.
 If only properties of index-selected objects are encoded and
recorded in each object’s OFs, then all conjoined properties will be
recorded in the same object file, thus solving the binding problem
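The binding-by-object proposal can be sketched as simple bookkeeping: each detected property is entered in the file of the object it belongs to, so conjunctions fall out for free. The property names and the `record` helper below are invented for illustration.

```python
# Sketch of binding-by-object: properties of an index-selected object are
# recorded in that object's file, so all conjoined properties end up in
# the same object file - which is the solution to the binding problem.

object_files = {}            # one file per FINST index

def record(index, prop, value):
    """Enter a detected property into the file of the indexed object."""
    object_files.setdefault(index, {})[prop] = value

# Features arrive separately, but each is tagged with the object it
# belongs to, not with a location:
record(1, "color", "red")
record(2, "color", "green")
record(1, "shape", "circle")
record(2, "shape", "square")

# The conjunction "red circle" is just the contents of one object file:
print(object_files[1])   # {'color': 'red', 'shape': 'circle'}
```

Note that nothing in this scheme needs a location: co-located or overlapping objects ("inside") pose no special problem, since binding is keyed to the object, not to where it is.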
A quick tour of some evidence for FINSTs
● The correspondence problem (mentioned earlier)
● The binding problem
● Evaluating multi-place visual predicates (recognizing multi-element patterns)
● Operating over several visual elements at once without having to search for them first
 Subitizing
 Subset selection
● Multiple-Object Tracking
● Cognizing space without requiring a spatial display in the head
Being able to refer to individual objects or
object-parts is essential for recognizing patterns
Encoding relational predicates; e.g., Collinear (x,y,z,..);
Inside (x, C); Above (x,y); Square (w,x,y,z), requires
simultaneously binding the arguments of
n-place predicates to n elements* in the visual scene
 Evaluating such visual predicates requires individuating and referring to the objects over which the predicate is evaluated: i.e., the arguments in the predicate must be bound to individual token elements in the scene.
*Note: “elements” is used to refer to objects that serve as parts of other objects
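What it means for an n-place predicate to be evaluated only over bound arguments can be sketched directly. The coordinates and the cross-product collinearity test below are illustrative assumptions; the point is that `collinear` can only be applied once each argument slot is bound to a particular token in the scene.

```python
# Sketch of evaluating a 3-place visual predicate, Collinear(x, y, z),
# once its arguments are bound to indexed tokens.

def collinear(p, q, r, tol=1e-9):
    """True if the three bound tokens lie on one line (zero cross product)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) < tol

# Indexes bind the predicate's argument slots to token objects in the scene:
tokens = {1: (0, 0), 2: (1, 1), 3: (2, 2), 4: (2, 0)}
print(collinear(tokens[1], tokens[2], tokens[3]))  # True
print(collinear(tokens[1], tokens[2], tokens[4]))  # False
```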
Several objects must be picked out at
once in making relational judgments
When we judge that certain objects are collinear, we must first
pick out the relevant objects while ignoring their properties
Several objects must be picked out at
once in making relational judgments

The same is true for other relational judgments like inside or on-the-same-contour, etc. We must pick out the relevant individual objects first. Are the dots inside the same contour? On the same contour?
*Note: Ullman (1984) has shown that some patterns cannot be recognized without doing so in a serial manner, where the serial elements must be indexed first.
And that is yet another reason why Connectionist architectures cannot work!
A quick tour of some evidence for FINSTs
• The correspondence problem
• The binding problem
• Evaluating multi-place visual predicates (recognizing multi-element patterns)
• Operating over several visual elements at once without first having to search for them
 Subitizing
 Subset selection
• Multiple-Object Tracking
• Cognizing space without requiring a spatial display in the head
More functions of FINSTs
Further experimental explorations using different paradigms
 Recognizing the cardinality of small sets of things: Subitizing vs counting (Trick, 1994)
 Searching through subsets – selecting items to search through (Burkell, 1997)
 Selecting subsets and maintaining the selection during a saccade (Currie, 2002)
 Application of FINST index theory to infant cardinality studies (Carey, Spelke, Leslie, Uller, etc.)
Indexes explain how children are able to acquire words for objects by ostension without suffering Quine’s Gavagai problem.
Signature subitizing phenomena only appear when
objects are automatically individuated and indexed
Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A
limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
Subitizing results
 There is evidence that a different mechanism is involved in enumerating small (n ≤ 4) and large (n > 4) numbers of items (even different brain mechanisms – Dehaene & Cohen, 1994)
 Rapid small-number enumeration (subitizing) only occurs when items are first (automatically) individuated*
 Subitizing is not affected by precuing location, while counting is*
 Subitizing is insensitive to distance among items*
 Our account of what is special about subitizing is that once FINST indexes are assigned to n ≤ 4 individual objects, the objects can be enumerated without first searching for them. In fact they might be enumerated simply by counting active indexes, which is fast and accurate because it does not require visual scanning
* Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
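The "count the active indexes" account of subitizing can be sketched as a toy model. The pool size of 4 and the subitize/count distinction follow the slides; everything else (function names, the flag list) is an illustrative assumption.

```python
# Toy sketch of the FINST account of subitizing: small sets are enumerated
# by counting active indexes (no visual scanning); larger sets must be
# counted serially because the index pool is exhausted.

MAX_INDEXES = 4   # size of the FINST pool, per the theory

def enumerate_items(n_objects):
    """Return (count, method): subitize via indexes if the objects fit the pool."""
    if n_objects <= MAX_INDEXES:
        active = [True] * n_objects        # indexes grabbed automatically
        return sum(active), "subitized"    # just count active indexes
    return n_objects, "counted"            # serial counting beyond the pool

print(enumerate_items(3))   # (3, 'subitized')
print(enumerate_items(7))   # (7, 'counted')
```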
Subset selection for search
[Figure: search displays with a fixation cross and a designated target; late-onset cues select a subset that constitutes either a single-feature search or a conjunction-feature search]
Burkell, J., & Pylyshyn, Z. W. (1997). Searching through subsets: A test of the visual indexing hypothesis. Spatial Vision, 11(2), 225-258.
Subset search results:
 Only properties of the subset matter – but note that properties of the entire subset are taken into account simultaneously (since that is what distinguishes a feature search from a conjunction search)
 If the subset constitutes a single-feature search, search is fast and the slope (RT vs number of items) is shallow
 If the subset constitutes a conjunction search, it takes longer and is more sensitive to the set size
 As with subitizing, the distance between targets does not matter, so observers don’t seem to be scanning the display looking for the target
The stability of the visual world entails the capacity
to track some individuals after a saccade
 There is no problem about how tactile selection can provide a stable world when you move around while keeping your fingers on the same objects – because in that case retaining individual identity is automatic
 But with FINSTs the same can be true in vision – for a small number of visual objects
 This is compatible with the fact that one appears to retain the relative locations of only about 4 elements during saccadic eye movements (Irwin, 1996)
[Irwin, D. E. (1996). Integrating information across saccadic eye movements. Current Directions in Psychological Science, 5(3), 94-100.]
The selective search experiment with a saccade induced
between the late onset cues and start of search
[Figure: the onset of new objects grabs indexes; a saccade occurs between the late-onset cues and the start of search; the cued subset constitutes either a single-feature search or a conjunction-feature search]
Even with a saccade between selection and access, items can be accessed efficiently
A quick tour of some evidence for FINSTs
● The correspondence problem (mentioned earlier)
● The binding problem
● Evaluating multi-place visual predicates (recognizing multi-element patterns)
● Operating over several visual elements at once without having to search for them first
 Subitizing
 Subset selection
● Multiple-Object Tracking
● Imagining space without requiring a spatial display in the head
Demonstrating the function of FINSTs with
Multiple Object Tracking (MOT)
 In a typical experiment, 8 simple identical objects are presented on a screen and 4 of them are briefly distinguished in some visual manner – usually by flashing them on and off.
 After these 4 targets are briefly identified, all objects resume their identical appearance and move randomly. The observers’ task is to keep track of the ones that had been designated as targets at the start
 After a period of 5-10 seconds the motion stops and observers must indicate, using a mouse, which objects are the targets
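The structure of an MOT trial can be sketched as a toy simulation. The nearest-position update used below is purely to make the bookkeeping concrete; the slides argue that actual tracking is unlikely to work by serially updating stored locations, so treat this as a cartoon of the task, not of the mechanism. Frame data and object ids are invented for illustration. Note that the tracker consults no object properties at all: identity is carried by the index from frame to frame.

```python
# Toy sketch of an MOT trial: indexes are grabbed by the flashed targets,
# then carried from frame to frame without using any object properties.
from math import dist

def track(frames, target_ids):
    """frames: list of {object_id: (x, y)}; return the ids selected at the end."""
    # Indexes grab the flashed targets on the first frame:
    positions = [frames[0][i] for i in target_ids]
    for frame in frames[1:]:
        # Each index slides to the nearest object position in the new frame
        # (a stand-in for whatever the real tracking mechanism does):
        positions = [frame[min(frame, key=lambda oid: dist(p, frame[oid]))]
                     for p in positions]
    # At the end of the trial the indexes are used to select the targets:
    last = frames[-1]
    return {oid for oid in last if last[oid] in positions}

# Three frames: targets T1, T2 and distractor D all move smoothly.
frames = [
    {"T1": (0, 0), "T2": (10, 0), "D": (5, 5)},
    {"T1": (1, 0), "T2": (9, 0),  "D": (5, 4)},
    {"T1": (2, 0), "T2": (8, 0),  "D": (5, 3)},
]
print(sorted(track(frames, ["T1", "T2"])))  # ['T1', 'T2']
```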
Another example of MOT: with self-occlusion
Self-occlusion does not seriously impair tracking
[Demos: tracking self-occluding circles; tracking boxes with repulsion]
Some findings with Multiple Object Tracking
● Basic finding: Most people can track at least 4 targets that move randomly among identical non-target objects (even some 5-year-old children can track 3 objects)
● Object properties do not appear to be recorded during tracking, and tracking is not improved if all objects are visually distinct (no two objects have the same color, shape or size)
How is it done?
 We showed that it is unlikely that the tracking is done by keeping a record of the targets’ locations and updating them by serially visiting the objects (Pylyshyn & Storm, 1988)
 Other strategies may be employed (e.g., tracking a single deforming pattern), but they do not explain tracking
 Hypothesis: FINST Indexes are grabbed by targets. At the end of the trial these indexes can be used to move attention to the targets and hence to select them in making the response
What role do visual properties play in MOT?
 Certain properties must be present in order for an index to be grabbed, and certain properties (probably different properties) must be present in order for the index to keep track of the object, but this does not mean that such properties are encoded, stored, or used in tracking.
 Is there something special about location? Do we record and track properties-at-locations?
 Location in time & space may be essential for individuating or clustering objects, but metrical coordinates need not be encoded or made cognitively available
 The fact that an object is actually at some location or other does not mean that it is represented as such. Representing property ‘P’ (where P happens to be at location L) ≠ Representing property ‘P-at-L’.
A way of viewing what goes on in MOT

An object file may contain information about the object to
which it is bound. But according to FINST Theory, keeping
track of the object’s identity does not require the use of this
information. The evidence suggests that in MOT, little or
nothing is stored in the object file. Occasionally some
information may get encoded and entered in the Object File
(e.g., when an object appears or disappears) but this is neither
mandatory nor is it relevant to the tracking process itself.
Another way of viewing MOT
 What makes something the same object over time is that it remains connected to the same object-file by the same Index. Thus, for something to be seen as the same enduring object no appeal to properties or concepts is needed, as long as the object is trackable. In other words, an object is something that can be visually tracked.
 It seems that tracking may be a reflex – there is some evidence that it proceeds without interference from other attentive tasks
 Franconeri et al. showed that the apparent sensitivity of tracking performance to such properties as speed is due to a confound of speed with inter-object distance.1 Minimum distance between objects appears to be the only property critical to MOT performance2
1 Franconeri, S., Lin, J., Pylyshyn, Z., Fisher, B., & Enns, J. (2008). Evidence against a speed limit in multiple-object tracking.
Psychonomic Bulletin & Review, 15(4), 802-808.
2 Franconeri, S., Jonathan, S. J., & Scimeca, J. M. (2010). Tracking Multiple Objects Is Limited Only by Object Spacing, Not by
Speed, Time, or Capacity. Psychological Science, 21, 920-925.
Additional examples of MOT
 MOT with occlusion
 MOT with virtual occluders
 MOT with matched nonoccluding disappearance
 Track endpoints of lines
 Track rubber-band linked boxes
 Track and remember ID by location
 Track and remember ID by name (number)
 Track while everything briefly disappears (½ sec) and goes on moving while invisible
 Track while everything briefly disappears and reappears where they were when they disappeared
Why is this relevant to foundational
questions in the philosophy of mind?
● According to Quine, Strawson, and most philosophers, you cannot pick out or track individuals without concepts (sortals)
● But you also cannot pick out individuals with only concepts
 Sooner or later you have to pick out individuals using non-conceptual causal connections between things and thoughts.
● The present proposal is that FINSTs provide the needed non-conceptual mechanism for individuating objects and for tracking their (numerical) identity, which works most of the time in our kind of world. It relies on a natural constraint (Marr).
● FINST indexes provide the right sort of connection to allow the arguments of predicates to be bound to objects prior to the predicates being evaluated.
 They may also be the basis for learning nouns by ostension.
There must be some properties that
cause indexes to be grabbed!
 Of course there are properties that are causally responsible for indexes being grabbed, and also properties (probably different ones) that make it possible for objects to be tracked;
 But these properties need not be represented (encoded) and used in referring to enduring objects
 The distinction between properties that cause indexes to be grabbed and those that are represented (in Object Files) is similar to Kripke’s distinction between the properties needed to name an object (its baptism) and those that constitute its meaning
Effect of target properties on MOT
● Changes of object properties are not noticed during MOT
● Keeping all targets distinct in color, size, or shape does not improve tracking
● Observers do not use target speed or direction in tracking (e.g., they do not track by anticipating where the targets will be when they reappear after occlusion)
◦ Targets can go behind an opaque screen and come out the other side transformed in color, shape, speed, or direction of motion (up to 60° from the pre-occlusion direction) without affecting tracking – but also without observers noticing the change!
◦ What affects tracking is the distance travelled while behind the occluding screen: the closer the reappearance to the point of disappearance, the better the tracking – even if the closer location happens to be in the middle of the occluding screen!
Some open questions
● We have arrived at the view that only properties of selected (indexed) objects enter into subsequent conceptualization and perception-based thought (i.e., only information in object files is made available to cognition). So what happens to the rest of the visual information?
● Visual information seems rich and fine-grained, while this theory only allows the properties of 4 or 5 objects to be encoded!
◦ The present view also leaves no room for representations whose content corresponds to the content of conscious experience
◦ According to the present view, the only content that modular nonconceptual representations have is the demonstrative content of indexes that refer to perceptual objects
◦ Question: Why do we need any more than that?
An intriguing possibility….
Maybe the theoretically relevant information we take in is less than (or at least different from) what we experience
● This possibility has received attention recently with the discovery of various “blindnesses” (e.g., change blindness, inattentional blindness, blindsight…) as well as the discovery of independent vision systems (e.g., recognition and motor control)
● The qualitative content of conscious experience (its qualia) may not play a role in explanations of cognitive processes.
● Even if detailed quantitative information enters into causal processes (e.g., motor control), it may not be represented!
◦ For something to be a representation, its content must figure in explanations – it must capture generalizations. It must have truth conditions and therefore allow for misrepresentation. It is an empirical question whether current proposals do (e.g., primal sketch, scenarios). Cf. Devitt: Pylyshyn’s Razor.
An alternative view of reference by Indexes
● This provisional revised theory responds to Fodor’s argument that there is no seeing without seeing-as
● According to Fodor, the vision module must do more than the current theory assumes, because its output must provide the basis for induction over what something is seen as.
◦ This is not the traditional argument that percepts have a finer grain than concepts. But that argument relies too much on our phenomenology, which more often than not leads us astray.
● So the vision module must contain more than object files. It must be able to classify objects by their visual properties alone, or to compute for each object a particular appearance class to which it belongs (see the black swan example).
An alternative view of reference by Indexes
● Since the vision module is encapsulated, it must assign each object x to an equivalence class based solely on its sensory appearance. It must do this for a large number of such classes, based on innate mechanisms modified parametrically by visual experience (i.e., the form of modification is constrained to a finite set of options provided by the vision module – cf. Waltz’s blocks world).
● Denote by L(x) the equivalence class to which each token visual object x is assigned. The L(x) must be sufficiently distinctive to allow the cognitive system subsequently to recognize x as an instance of something it knows about. The sequence from x to recognition must be correct most of the time in our kind of world (so it must embody a natural constraint).
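A minimal sketch of what such an L(x) might look like (the feature vector, bin width, and function name are illustrative assumptions, not part of the lecture’s proposal): coarsely quantize a sensory feature vector, so that tokens with similar appearances usually fall into the same equivalence class without any appeal to general knowledge.

```python
def appearance_class(features, bin_width=0.25):
    """Map a sensory feature vector (values in [0, 1]) onto a discrete
    equivalence class L(x) by coarse quantization. The mapping is
    encapsulated: it uses only the feature values themselves, never
    general knowledge. Feature names and bin width are illustrative."""
    return tuple(int(v // bin_width) for v in features)

# Two tokens with nearly identical appearances land in the same class...
x1 = (0.10, 0.80, 0.55)   # e.g., elongation, brightness, curvature
x2 = (0.12, 0.78, 0.56)
assert appearance_class(x1) == appearance_class(x2)

# ...while a clearly different appearance gets a different class.
x3 = (0.90, 0.20, 0.10)
assert appearance_class(x1) != appearance_class(x3)
```

The “most of the time in our kind of world” hedge is visible here: two similar appearances can straddle a bin boundary, so the mapping is reliable only under a natural constraint that tokens of a kind cluster in feature space.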
An alternative view of reference by Indexes
● This idea of an appearance class L(x) has been explored in computational vision, where a number of different functions have been proposed, many of them based on mathematical compression or encryption functions.
◦ An early idea with implications for the present discussion is David Marr’s Multiple-View proposal. He wrote:
“The Multiple View representation is based on the insight that if one chooses one’s primitives correctly, the number of qualitatively different views of an object may be quite small.” Marr cites Minsky as speculating that the representation of a 3D shape might consist of a catalog of different appearances of that shape, and that this catalog may not need to be very large. (Marr & Nishihara, 1976)
◦ The search for the most general form of representation has yielded many proposals, many of which have been tested in psychology labs.
e.g., generalized cylinders and part decomposition: Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-147.
Seeing without Seeing As?
● Vision delivers an equivalence class to which the object belongs by virtue of sharing whatever appearance was mapped by the function L(x). It is an appearance class because it can use only information from the sensorium and is subject to constraints built into the modular vision system (i.e., what we called “natural constraints”). So in that respect one might say that seeing is always a seeing-as, where the category implied by the “as” is L(x). But insofar as this is not the category under which the representation of a familiar object enters into thought, it is not the same seeing-as it eventually becomes when interpreted in terms of knowledge and long-term memory. The L(x) is then replaced by familiar categories of thought (e.g., table, chair, Coca-Cola bottle, Warhol Brillo Box, and so on – categories rich in their connections).
More on the structure of the Visual Module
● In order to compute L(x), the vision module must contain enough machinery to map a token object x onto an equivalence class designated by L(x), using only module-specific processes and representations, without appealing to general knowledge.
● The module must also have around 4-5 Object Files, as described earlier, because it needs them to solve the binding problem, to bind predicate arguments to objects, and to use the Recognition-by-Components process for recognizing complex objects (Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-147).
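The limited pool of indexes and Object Files can be pictured with a small data-structure sketch. The 4-index limit and the binding-before-evaluation order come from the lecture; every name, class, and predicate below is a hypothetical illustration, not an implementation of the theory.

```python
MAX_INDEXES = 4  # the lecture's ~4-5 FINST limit

class ObjectFile:
    """A file of sensory properties keyed by a FINST index."""
    def __init__(self, index):
        self.index = index
        self.properties = {}

class FinstPool:
    def __init__(self):
        self.free = list(range(MAX_INDEXES))
        self.files = {}          # object id -> ObjectFile

    def grab(self, obj_id):
        """A salient object grabs an index, if one is available."""
        if obj_id in self.files:
            return self.files[obj_id]
        if not self.free:
            return None          # all indexes in use: object not tracked
        f = ObjectFile(self.free.pop())
        self.files[obj_id] = f
        return f

    def bind(self, predicate, *obj_ids):
        """Bind a predicate's arguments to indexed objects, then
        evaluate; binding precedes evaluation, as in the proposal."""
        files = [self.files.get(o) for o in obj_ids]
        if any(f is None for f in files):
            return None          # an argument is unindexed: cannot evaluate
        return predicate(*files)

pool = FinstPool()
for obj in ["a", "b", "c", "d", "e"]:
    pool.grab(obj)               # "e" finds no free index

distinct = lambda f1, f2: f1.index != f2.index   # stand-in predicate
assert pool.bind(distinct, "a", "b") is True
assert pool.bind(distinct, "a", "e") is None     # "e" was never indexed
```

The point of the sketch is the division of labor: the pool supplies bare demonstrative reference (indexes), the Object Files accumulate properties, and predicates apply only to what is indexed.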
An alternative view of reference by Indexes
The alternative view of what goes on inside the vision module might look something like this:
[Diagram: sensory properties, filed by object ID in Object Files, feed a process that (1) computes relations between objects R(x,y) and (2) computes appearances by consulting a table of canonical appearances. Input is sensory information from the OFs; output is the standard form of the object’s shape, L(x).]
A note on top-down vs bottom-up flow
● This dichotomy has been the source of a great deal of misunderstanding about what goes on in a module. Information does flow in both directions, but our claim is that it does so only within the visual module, not across the capsule boundary.
● A more interesting distinction is alluded to here under the phrase “the visual system queries”, as opposed to passively receiving information. This is a question of where control resides. An appropriate event in the visual scene grabs an index – the responsibility here rests with the external event. In computer talk this is called an interrupt, because the current process is interrupted by an external event. To say that the system interrogates the scene through the index is to say that the initiative belongs with the internal process. In computer talk this corresponds to a test operation.
A note on top-down vs bottom-up flow
● What is interesting about the distinction between interrupt and test is that only an interrupt can be open-ended: things can be set up so that it is not known in advance what sort of property will cause an interrupt. On the other hand, you cannot have a test operation unless you specify what you are looking for – you have to test for something, which, like seeing as or selecting for, is an intentional act, whereas an interrupt can be a purely causal event.
● So in our case grabbing is a causal event, whereas querying is an intentional, representation-governed event. Similarly, selecting is intentional, so scanning or switching visual attention would be intentional, whereas attention can also be elicited or grabbed, in which case the event would be causal.
● There may also be combinations of the two, as when you decide to track certain targets. To do that, according to this theory, you combine the intentional act of moving your focal attention to a particular object with the causal event whereby some object within the scope of attention is enabled and can then grab an index.
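The interrupt/test contrast can be made concrete with a toy sketch (illustrative code, not a model of the visual system): an interrupt handler is open-ended because it is registered without naming the event that will fire it, whereas a test must specify in advance the property it is testing for.

```python
class Scene:
    def __init__(self):
        self.handlers = []       # interrupt-style: event-driven
        self.objects = {}        # object id -> set of properties

    def on_event(self, handler):
        """Interrupt style: register a handler without saying what kind
        of event will trigger it. Control rests with the external event."""
        self.handlers.append(handler)

    def occur(self, obj_id, properties):
        """An external event: an object appears or changes. Every
        registered handler is interrupted into action, whatever the
        property that caused the event."""
        self.objects[obj_id] = set(properties)
        for h in self.handlers:
            h(obj_id, set(properties))

    def test(self, obj_id, prop):
        """Test style: the internal process takes the initiative and
        must name what it is looking for."""
        return prop in self.objects.get(obj_id, set())

grabbed = []
scene = Scene()
scene.on_event(lambda obj, props: grabbed.append(obj))  # open-ended grab

scene.occur("x1", {"abrupt-onset"})
scene.occur("x2", {"flicker"})   # a property the handler never named
assert grabbed == ["x1", "x2"]   # interrupts fire regardless of property

assert scene.test("x2", "flicker") is True   # must name the property
assert scene.test("x2", "red") is False
```

Note the asymmetry the lecture points to: `occur` needs no prior specification of what counts as relevant (causal), while `test` cannot even be called without a named property (intentional).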
Open Questions about the augmented FINST model
● The modular processes must somehow recover relations between objects, and these may or may not be encoded in Object Files.
● Since information in the module may serve a number of subsequent functions – including visual-motor coordination and multimodal perceptual integration – it will have to represent metrical information, very likely in analog form. The question of representing metrical information is one we leave for the future, since little is known about how analog representations might function in cognition (they will have to connect with thoughts, so will they need A-D conversion?)