Attention in Psychology


Attention in Psychology:
Historical Background

Attention was one of the first concepts to appear in
Psychology texts (ca 1730) – e.g., Ebbinghaus, Titchener, …
 Early discussions (Hatfield, 1998) focused on properties such as
 Narrowing (Aristotle, 4th century BC)
 Active Directing (Lucretius, 1st century AD)
 Involuntary shifts (Augustine of Hippo, 400 AD)
 Clarity (Buridan, 14th century)
 Fixation over time (Descartes, 17th century)
 Effector sensitivity (Descartes)
 All the above phenomena (William James, 1890)
The functions of focal attention


A central notion in the present analysis is the notion of “picking
out” or selecting. The usual mechanism that is appealed to in
explaining perceptual selection is attention (sometimes called
focal attention or selective attention).
Why must we select anyway?
This is a rarely asked question to which there are several
answers:
 We need to select because we can’t process all the information
available. This is the resource-limitation reason. <But in what
ways is it limited? Along what dimensions?>
 We need to select because certain patterns cannot be computed
without first marking certain special elements of a scene
 We need to select because of the way relevant information in the
world is packaged (Strawson’s Collecting Principles). It is a
response to the Binding Problem
 We need to select because selection is a consequence of the first
line of causal contact between mind and world: it precedes all
conceptualizing and predicating.

Attention and Selection
We will concentrate on the Selection or Filtering aspects of
attention. We will ask:
1. Why do we need to select anyway?
 Because our processing capacity is limited?
 The Big Question: In what way is it limited? (Miller, 1956)
 We will return to this core question after some preliminaries on
the early study of attention as selection and the filter theory.
2. On what basis do we select? Some alternatives:
 We select according to what is important to us (e.g., affordances)
 We select what can be described physically (i.e., “channels”)
 We select based on what can be encoded without accessing LTM
 We “pick out” things to which we subsequently attach concepts: i.e., we pick
out objects (or regions?)
3. What happens to what we have not selected? A largely unsolved mystery
(though in some cases there are plausible answers).

Big Question #1: Why do we need to select
information? Because capacity is limited.
Along which dimensions is human information
processing capacity limited?
 Channel capacity: Shannon-Hartley Theorem
$C = B \log_2\left(1 + \frac{S}{N}\right)$, where C = channel capacity, B = bandwidth, S/N = signal-to-noise ratio
Capacity measured in some sort of “chunks” (Miller)
 Capacity measured in terms of the number of
arguments that can be simultaneously bound to
cognitive routines (Newell)
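The Shannon-Hartley formula can be evaluated directly. A minimal sketch, assuming bandwidth in hertz and S/N given as a plain power ratio (not decibels); the function name `channel_capacity` is illustrative only.

```python
import math


def channel_capacity(bandwidth_hz: float, signal_to_noise: float) -> float:
    """Shannon-Hartley capacity C = B * log2(1 + S/N), in bits per second.

    bandwidth_hz: channel bandwidth B in hertz.
    signal_to_noise: S/N as a power ratio (not in decibels).
    """
    if bandwidth_hz < 0 or signal_to_noise < 0:
        raise ValueError("bandwidth and S/N must be non-negative")
    return bandwidth_hz * math.log2(1 + signal_to_noise)


# A 1000 Hz channel with S/N = 3 carries 1000 * log2(4) bits per second.
print(channel_capacity(1000, 3))  # 2000.0
```

Capacity grows linearly with bandwidth but only logarithmically with the signal-to-noise ratio.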

To what things in the world can the arguments of
visual predicates be bound?
Amount of information in terms of the
Information-theoretic measure (entropy)




Amount of information in a signal depends on how much one’s
estimate of the probability of events is changed by the signal.
H = -pi Log2 (pi) … information in bits
“One of by land, two if by sea” contains one bit of information if
the two possibilities were equally likely, less if they were not (e.g.,
if one was twice as likely as the other the information in the
message would be ⅓ Log ⅓ + ⅔ Log ⅔ = 0.92 bits <using Excel>)
The amount of information transmitted depends on the potential
amount of information in the message and the amount of correlation
between message sent and message received. So information
transmitted is a type of I-O correlation measure.
The information measure is an “ideal receiver” or competence
measure. It is the maximum information that could be transmitted,
given the statistical properties of messages, assuming that the
sender and receiver know the code.
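Both measures can be computed in a few lines. A minimal sketch: `entropy` applies the formula for H, and `information_transmitted` treats "information transmitted" as the mutual information between stimuli and responses, H(stimulus) + H(response) − H(stimulus, response). The function names are illustrative, and probabilities are estimated by simple plug-in frequencies with no small-sample bias correction (an assumption, not part of the original analysis).

```python
import math
from collections import Counter


def entropy(probabilities):
    """H = -sum(p_i * log2(p_i)) in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_transmitted(pairs):
    """Bits transmitted, estimated from a list of (stimulus, response) pairs.

    Computed as H(stimulus) + H(response) - H(stimulus, response): zero when
    responses are unrelated to stimuli, H(stimulus) when they are perfectly correlated.
    """
    n = len(pairs)

    def h(counts):
        return entropy(c / n for c in counts.values())

    stimuli = Counter(s for s, _ in pairs)
    responses = Counter(r for _, r in pairs)
    joint = Counter(pairs)
    return h(stimuli) + h(responses) - h(joint)


print(round(entropy([0.5, 0.5]), 3))      # 1.0  (two equally likely messages)
print(round(entropy([1 / 3, 2 / 3]), 3))  # 0.918 (one message twice as likely)
print(round(information_transmitted([(s, s) for s in "abcd"]), 3))  # 2.0
print(round(2 ** 2.5, 2))                  # 5.66 equiprobable alternatives = 2.5 bits
```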
Information transmitted in a typical
absolute judgment experiment
 Information transmitted in an experiment in which subjects were presented with
tones drawn from a known practiced set (of a given size, which determines the
value of input information) and had to name the tones from a learned name set.
 The information transmitted was always around 2.5 bits, equivalent to about
5.7 equiprobable alternatives ($2^{2.5} \approx 5.66$)!
Why can we retain different amounts of information
just by using a different encoding vocabulary?

Answer: The architecture of the cognitive system has
the property that it can deal with a fixed maximum
number of items, regardless of what the items are.

This property can be exploited to get around the
bottleneck of the short-term memory. We do this by
recoding the input into a smaller number of discrete
units, called chunks.

There is also evidence that it takes additional time to
encode and decode chunks, so the recoding technique
is a case of time-capacity tradeoff or what is known in
CS as a compute-vs-store tradeoff.

Allen Newell’s novel model to account for the time taken
in the Sternberg memory scan experiment attributes the
observed RT to encoding or chunking.
Example of the use of chunking
• To recall a string of binary bits – e.g., 00101110101110110101001
• People can recall a string of about 8 binary digits. If they learn a
binary encoding rule (00→0, 01→1, 10→2, 11→3) they can recall
about 8 such chunks, or 16 binary bits. If they learn a 3:1 chunking rule
(called the octal number system) they can recall a 24-bit string, etc.
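The recoding rules can be written as a small function. A sketch under stated assumptions: `recode` and `decode` are hypothetical names, and the bit string is left-padded with zeros so it divides evenly into groups. A group size of 2 gives the 2:1 rule (00→0, 01→1, 10→2, 11→3); a group size of 3 gives the octal rule.

```python
def recode(bits: str, group_size: int) -> list[int]:
    """Recode a binary string into chunks of `group_size` bits each.

    group_size=2 is the 2:1 rule (00->0, 01->1, 10->2, 11->3);
    group_size=3 is the 3:1 (octal) rule.
    The string is left-padded with zeros so that it divides evenly.
    """
    if set(bits) - {"0", "1"}:
        raise ValueError("expected a string of 0s and 1s")
    remainder = len(bits) % group_size
    if remainder:
        bits = "0" * (group_size - remainder) + bits
    return [int(bits[i:i + group_size], 2) for i in range(0, len(bits), group_size)]


def decode(chunks: list[int], group_size: int) -> str:
    """Expand each chunk back into `group_size` binary digits."""
    return "".join(format(c, f"0{group_size}b") for c in chunks)


sequence = "00101110101110110101001"  # the 23-bit example string
print(len(recode(sequence, 2)))  # 12 base-4 chunks
print(len(recode(sequence, 3)))  # 8 octal chunks
```

The memory load drops from 23 items to 8, at the cost of the encoding and decoding steps, which is the time-capacity tradeoff described above.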
Early studies: Colin Cherry’s
“Cocktail Party Problem”
 What determines how well you can select one conversation among
several? Why are we so good at it?
 The more controlled version of this study used dichotic
presentations – one “channel” per ear.
 Cherry found that when attention is fully occupied in selecting
information from one ear (through use of the “shadowing” task),
almost nothing is noticed in the “rejected” ear (only if it was not
speech).
 More careful observations show this was not quite true
 Change in spectral properties (pitch) is noticed
 You are likely to notice your name spoken
 Even meaning is extracted, as shown by involuntary ear switching and
disambiguating effect of rejected channel content
Broadbent’s Filter Theory
[Figure: Broadbent's filter model. Components: Senses, Very Short Term Store, Filter, Limited Capacity Channel, Rehearsal loop, Store of conditional probabilities of past events (in LTM), Motor planner, Effectors]
Broadbent, D. E. (1958). Perception and Communication. London: Pergamon Press.
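The core claim of an early filter, selection on physical properties only, can be sketched as follows. This is a toy illustration, not Broadbent's own formulation: the `Message` fields, the `PHYSICAL` attribute set, and the `capacity` parameter standing in for the limited capacity channel are all assumptions.

```python
from dataclasses import dataclass

# Attributes the filter is allowed to use: physical "channels" only.
PHYSICAL = {"ear", "pitch"}


@dataclass
class Message:
    ear: str      # physical channel the message arrives on
    pitch: str    # another physical property (e.g. "high" or "low" voice)
    content: str  # meaning; never inspected by the filter


def early_selection(messages, attribute, value, capacity=1):
    """Pass messages whose physical attribute matches, up to the channel's capacity.

    Rejected messages are simply dropped, so their content can have no later effect.
    """
    if attribute not in PHYSICAL:
        raise ValueError("the filter can only select on physical properties")
    passed = [m for m in messages if getattr(m, attribute) == value]
    return passed[:capacity]


inputs = [
    Message("left", "low", "I think I will go down to the bank"),
    Message("right", "high", "The rain has resulted in erosion by the overflowing river"),
]
for m in early_selection(inputs, "ear", "left"):
    print(m.content)  # I think I will go down to the bank
```

A strict filter of this kind predicts that rejected-ear content has no effect at all, which is what the later observations on noticing one's name and on disambiguation by the rejected channel call into question.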
Stroop test demonstrates top down effects
Baseline: Name the colors of the ink











Stroop Effect in English
Name the colors of the ink
RED GREEN BLUE PINK BROWN ORANGE GREEN
PINK RED YELLOW GREEN YELLOW RED BROWN
RED BLUE BROWN GREEN RED ORANGE RED
BLUE YELLOW PINK ORANGE GREEN BLUE
BROWN PINK RED YELLOW GREEN YELLOW RED
BROWN PINK RED YELLOW GREEN YELLOW RED
PINK ORANGE GREEN BLUE BROWN PINK RED
YELLOW GREEN YELLOW RED BROWN RED
BLUE GREEN BROWN YELLOW GREEN YELLOW
RED PINK ORANGE GREEN RED BLUE BROWN
GREEN RED ORANGE RED BLUE YELLOW
YELLOW GREEN YELLOW RED BROWN PINK RED
YELLOW GREEN PINK RED YELLOW
Stroop Effect in Spanish
Name the colors of the ink
TINTO VERDE AZUL MARROM ROSA NARANJA
VERDE ROSA TINTO AMARELO VERDE AMARELO
TINTO MARROM TINTO AZUL MARROM VERDE
TINTO NARANJA TINTO AZUL AMARELO ROSA
NARANJA VERDE AZUL MARROM ROSA TINTO
AMARELO VERDE AMARELO TINTO MARROM ROSA
TINTO AMARELO VERDE AMARELO TINTO ROSA
NARANJA VERDE AZUL MARROM ROSA TINTO
AMARELO VERDE AMARELO TINTO MARROM
TINTO AZUL MARROM VERDE AMARELO VERDE
AMARELO TINTO ROSA NARANJA VERDE TINTO
AZUL MARROM VERDE TINTO NARANJA TINTO
AZUL TINTO NARANJA AMARELO VERDE ROSA
AMARELO VERDE AMARELO TINTO AZUL NARANJA
Type of interference with the attended message shows
that the rejected message was understood

Moral: Although the rejected channel appears to be rejected,
it is being processed enough to understand the words!
 The semantic interpretation of the attended message depends on
the meaning content of the rejected message. Subjects were
asked to paraphrase the attended message in:
 Channel 1 (attended): “I think I will go down to the bank but I will
be back for dinner”
 Channel 2 (rejected): “The election results will depend on the
value of the dollar against the Euro and the state of the economy”
 OR Channel 2 (rejected): “The rain has resulted in erosion by the
overflowing river”
From: Lackner, J. R., & Garrett, M. F. (1972). Resolving ambiguity: Effects of biasing context in
the unattended ear. Cognition, 1, 359-372.
From here on I will focus on the
special case of visual attention

Visual working memory and visual selection
 What is the nature of the input, storage and
information processing limits in vision?
Studies of the capacity of Visual
Working Memory (Luck & Vogel, 1997)

People appear to be able to retain about 4 values of a single property
(4 colors, 4 shapes, 4 orientations, etc.) over a short time

People can also retain the identity of 4 objects for a short time.

Luck and Vogel (1997) found that as long as there are not
more than 4 objects (with up to 4 properties per object), people can
retain large numbers of properties when the properties are conjoined in
those objects (a phenomenon that is reminiscent of Miller’s
“chunking hypothesis” except the chunks are visual objects).
*
Luck, S., & Vogel, E. (1997). The capacity of visual working memory for features and
conjunctions. Nature, 390, 279-281.
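The object-based limit can be illustrated with a toy "slot" model. This is a deliberately idealized sketch, not Luck and Vogel's analysis: it assumes 4 object slots, up to 4 features per stored object, and that the first objects in the list are the ones stored. The feature names are illustrative.

```python
def retained_features(objects, object_slots=4, max_features_per_object=4):
    """Toy object-slot model of visual working memory.

    objects: list of feature dicts, one per object in the display.
    Up to `object_slots` objects are stored (here simply the first ones), and
    each stored object keeps up to `max_features_per_object` of its features.
    Returns the number of features retained.
    """
    stored = objects[:object_slots]
    return sum(min(len(obj), max_features_per_object) for obj in stored)


one_feature_each = [{"color": c} for c in range(8)]
four_features_each = [
    {"color": i, "orientation": i, "size": i, "gap": i} for i in range(4)
]
print(retained_features(one_feature_each))    # 4
print(retained_features(four_features_each))  # 16
```

In this model the count of retained features depends on how the features are packaged into objects, not on the raw number of features in the display.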
Luck & Vogel on visual STM
What does visual attention select?
(What is the basis for selection?)

If visual attention is selection, what does it select?
 An obvious answer is places. We can select places by moving
our eyes so our gaze lands on different places.
 When places are selected, are they selected automatically?
 Must we always move our eyes to change what we attend to?
 Studies of Covert Attention-Movement: Posner (1980).
 How does attention switch from one place to another?
 Is it always the case that we attend to places? Can we attend to
any other property? Can we select on the basis of color, depth,
spatial frequency, affordances, or the property a painting has of
having been painted by Da Vinci (A property to which Bernard
Berenson was able to attend extremely well). cf Gibson


How else can visual attention select?
Can we control the size and shape of the region that is
selected, or is selection always punctate and data-driven?
 Zoom Lens model of spatial attention (Eriksen & St James, 1986).
 We control where attention moves:
 Is this automatic or voluntary?
 How do we know where to direct our attention? How do we
specify a location or object prior to attending to it?
 We need a way to specify where or what prior to attending to it!
Keep this conundrum in mind – we will return to it later!

How narrowly can we focus our attention? Can we make it
pick out one out of several objects?
 Are there special conditions under which we are able to pick out
individual things? We will return to “attentional resolution” or
the minimum spacing for selecting individual objects.
Covert movement of attention
Example of an experiment using a cue-validity paradigm for showing that the
locus of attention moves without eye movements and for estimating its speed.
Posner, M. I. (1980). Orienting of Attention. Quarterly Journal of Experimental Psychology, 32, 3-25.
Extension of Posner’s demonstration of attention switch
[Figure: trial sequence of fixation frame, cue, target-cue interval, and detection target; targets appear at cued, uncued, and along-the-path locations]
Does the improved detection in intermediate locations entail that the “spotlight of
attention” moves continuously through empty space?
But there are empirical reasons why objects are a
better basis for attentional selection than location

There is experimental evidence that attention attaches
to things rather than places
 When attention is exogenously summoned, the
appearance of analog movement of focal attention can
be explained by a punctate object-based theory of
attention-allocation – Sperling & Weichselgartner (1995)
Sperling & Weichselgartner (1995) “Episodic” or
Quantal Theory of Attention switching
Assumes a quantal “shift” in attention in which the spotlight pointed
at location -2 is extinguished and, simultaneously, the spotlight at
location +2 is turned on. Because extinction and onset take a
measurable amount of time, there is a brief period when the
spotlights partially illuminate both locations simultaneously.
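The difference between a quantal shift and a continuously moving spotlight can be made concrete. In this sketch the quantal shift follows the description above (the spotlight at −2 decays while the one at +2 rises), and a sweeping Gaussian spotlight is included only as a contrast case. The exponential time course, the time constant, the sweep duration, and the spotlight width are all arbitrary assumptions chosen to show the qualitative contrast, not fitted values.

```python
import math

LOCATIONS = (-2, 0, 2)


def quantal_shift(t_ms, tau_ms=50.0):
    """Quantal shift from location -2 to +2.

    The old spotlight decays while the new one grows with the same (assumed)
    time constant, so both endpoints are partly lit for a while, but the
    intermediate location 0 is never lit.
    """
    if t_ms <= 0:
        return {-2: 1.0, 0: 0.0, 2: 0.0}
    decay = math.exp(-t_ms / tau_ms)
    return {-2: decay, 0: 0.0, 2: 1.0 - decay}


def analog_shift(t_ms, duration_ms=100.0, width=0.5):
    """Contrast case: a Gaussian spotlight whose center sweeps from -2 to +2."""
    progress = min(max(t_ms / duration_ms, 0.0), 1.0)
    center = -2 + 4 * progress
    return {x: math.exp(-((x - center) ** 2) / (2 * width ** 2)) for x in LOCATIONS}


for t in (0, 25, 50, 100):
    q, a = quantal_shift(t), analog_shift(t)
    print(t, round(q[0], 2), round(a[0], 2))  # illumination at the midpoint
```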
This object-based view of attentional
selection is at the heart of FINST theory

When we discuss some of the reasons for attention
and the mechanisms involved I will propose that there
are good reasons on both grounds for supposing that
attention attaches itself to objects rather than locations
It also appears that we can to some extent
control the shape of our attended region
Farah, M. J. (1989). Mechanisms of imagery-perception interaction. Journal of
Experimental Psychology: Human Perception and Performance, 15, 203-211.
We can select a shape even when it is
intertwined among other similar shapes
Are there items on the left and on the right that have
the same shape? On a surprise test at the end, subjects
were not able to recall shapes that had been present but
had not been attended in the task (Rock & Gutman, 1981)
Other examples of attentionally induced
inhibition

Negative Priming (Treisman & DeSchepper, 1996).
 Is there a figure on the right that is the same as the figure on the left?
 When the figure on the left is one that had appeared as an ignored
figure on the right, RT is long and accuracy poor.
 This “negative priming” effect persisted over 200 intervening trials
and lasted for a month!
Another negative attention effect: Inattentional Blindness
Inattentional Blindness

The background task is to report which of two arms of the + is
longer. One critical trial per subject, after about 3 or 4 background
trials. Another “critical” trial presented as a divided attention
control.

25% of subjects failed to see the square when it was presented in
the parafovea (2° from fixation).

But 65% failed to see it when it was at fixation!

When the background task cross was made 10% as large,
Inattentional Blindness increased from 25% to 66%.

It is not known whether this IB is due to concentration of
attention at the primary task, or whether there is inhibition of
outside regions.
(Mack & Rock, 1998)
Does inhibition play a role? Noticing odd
stimuli when their location is not pre-marked
Inattentional Blindness 50%
Inattentional Blindness 20%
In what other ways might our visual
information capacity be limited?

There are obviously limitations on the input side of
vision that depend on the acuity of the sensors and the
range of physical properties to which they respond.

But there is a limitation beyond that of acuity: The
perceptual system is limited in what it can individuate
and how many of these individuals it can deal with at
one time. The capacity to individuate is different from
the capacity to discriminate.
 Some reasons for thinking that individuating is a distinct
process
Exploring the limits of attention and the
units over which selection operates

It appears that the human information-processing
bottleneck cannot be expressed perspicuously in terms of
information-theoretic measures, nor can it be specified in
physical parameters (e.g., in terms of locations or spatiotemporal regions), although such measures often do
capture important aspects of attention (e.g., visual attention
often moves continuously through space).
 But there are other possible ways one might consider
expressing the limits of attention.
 Over the past 25 years evidence has been accumulating that the
human attention system is, at least in part, tuned to individual
objects in the world. This would certainly make sense from an
evolutionary perspective. But what does this mean?
Summary of what we have so far

We saw that visual representations must be conceptual for
empirical and logical reasons

The empirical reasons derive in part from the nature of
generalizations and errors of recall
 The logical reason is that vision must interact with thoughts
and lead to new beliefs and plans of action


We saw that a large part of vision is cognitively
impenetrable and encapsulated and that cognition can only
be brought to bear prior to or after its automatic operation:
as attention or interpretation.
We saw that there are good design reasons for vision to be
selective and we considered several bases for selection.
But selection is not only for filtering information to a more
manageable amount, but it is also required for other
reasons. These other reasons make it plausible that
selection should operate over objects rather than bits of
information in the Shannon sense.
The increasingly important role played by
‘Objects’ in studies of visual attention

Miller’s ‘Magic Number 7’ has continued to haunt us even
beyond studies of short-term memory (STM).

There is a limitation in visual information processing that is
beyond the limitation of acuity and of channel capacity: The
perceptual system is limited in what it can individuate and
how many of these individuals it can deal with at one time.

The capacity to individuate is different from memory capacity
and discrimination capacity.
 This notion of individuating and of individuals may be related to
Miller’s “chunks”, but it has a special role in vision which we will
explore in the next lecture
 First some reasons why individuating is a distinct process
End of Attention segment

Next we will deal with a mechanism that is very
closely related to focal attention but yet which
(according to some people) is quite different
 This mechanism is what enables us to select several
objects at once and to keep that selection even if the
properties of the object change and the object moves
around. It is a sticky pointer or perceptual
demonstrative reference
 This is about picking out individuals qua individuals
Segment on Visual Indexing

To be continued later…
Picking out is different from discriminating:
Pick out the third contour from the left
Individuating as a distinct process

Individuating has its own psychometric function: The
minimum distance for individuating is much larger than
for discriminating.
 It may be that in vision our attention is limited in the
number of things we can individuate and simultaneously
access (more on this later). But how do you determine
what counts as a “thing”?
 Individuating is a prerequisite for recognition of patterns
and other properties defined among a number of
individual parts
 An example of how we can easily detect patterns if they are
defined over a small enough number of parts is subitizing
 Another area where the concept of an individual has
become important is in cognitive development, where it is
clear that babies are sensitive to the numerosity of
individual things in a way that is distinct from their
perceptual abilities but is limited in its capacity
Pick out 3 dots and keep track of them
Pick out 3 dots I will cue and keep track of them
 After you pick out the 3 cued dots, I’ll ask you to move your attention from the
center one. Describe the new relation among the three dots.
 In a field of identical elements you can select several of them and move your
attention among them (e.g., “move one up” or “move 2 right”, etc.) so long as at no
time do you have to hold on to more than 4 dots (Intriligator & Cavanagh, 2001)
Visual Indexes (aka FINSTs)

The hypothesis is that in vision there is a limit to how
many objects (individuals) can be selected and bound
to the arguments of cognitive functions at one time.
 There is evidence that we can hold on to 4 objects in
visual short term memory (Luck & Vogel, 1997).
 There is evidence that Objects (i.e., individual things)
may be the basic units of visual attention
 FINST Theory claims that there is a mechanism for
picking out and referring to (pointing to) primitive
visual elements (which are generally referred to as
Objects)
The requirements for picking out individual things
and keeping track of them reminded me of an
early comic book character called “Plastic Man”
Imagine being able to place several of your fingers on things
in the world without being able to detect their properties in
this way, but being able to refer to those things so you could
move your gaze or attention to them. If you could, you would
possess FINgers of INSTantiation = FINSTs!
Individuals and patterns

Vision does not recognize patterns by applying templates
since the size, shape, retinal location, orientation, and
other properties must be abstracted away.

A pattern is encoded over time (and often over saccades),
therefore the visual system must keep track of the
individual parts and merge descriptions of the same part
at different times and stages of encoding

Therefore in order to recognize a pattern, the visual
system must pick out individual parts and bind them to
the representation being constructed

Examples include what Ullman called “visual routines”
Are there collinear items (n>3)?
Several objects must be picked out at once
in making relational judgments

The same is true for other relational judgments like inside or on-the-same-contour, etc. We must pick out the relevant individual objects first.
Respond: Inside-same contour? On-same contour?
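The collinearity question can be written as a predicate whose arguments are the individuated items. A brute-force sketch, assuming the items have already been picked out and are given as integer coordinates (so the cross-product test is exact), and reading “n>3” as “at least four”. The function names are hypothetical.

```python
from itertools import combinations

Point = tuple[int, int]


def collinear(a: Point, b: Point, c: Point) -> bool:
    """Three points are collinear when the cross product of (b - a) and (c - a) is zero."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]) == 0


def has_collinear_items(items: list[Point], n: int = 4) -> bool:
    """Are at least n of the already-individuated items on one straight line?

    Each candidate line is defined by a pair of distinct items and every item is
    tested against it, so the predicate needs all the items bound as arguments.
    """
    for a, b in combinations(items, 2):
        if a == b:
            continue
        if sum(collinear(a, b, p) for p in items) >= n:
            return True
    return False


print(has_collinear_items([(0, 0), (1, 1), (2, 2), (3, 3), (5, 0)]))  # True
print(has_collinear_items([(0, 0), (1, 1), (2, 2), (0, 5)]))          # False
```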
When items cannot be individuated, predicates over
them cannot be evaluated
 Do these figures contain one or two distinct curves?
 Individuating these curves requires a “curve tracing”
operation, so Number_of_curves (C1, C2, …) takes
time proportional to the length of the shortest curve.
The figure on the left is one continuous curve, the one
on the right is two distinct curves – as shown in color.
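Counting curves requires visiting the pixels that make up each curve, which is why the time grows with curve length. A sketch under stated assumptions: the figure is a binary pixel grid, curves are 8-connected, and breadth-first flood fill stands in for serial curve tracing (it is not a model of how people trace). The second return value counts the pixels traced.

```python
from collections import deque


def count_curves(grid):
    """Count 8-connected curves in a grid of 0/1 cells; also return pixels traced."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    curves, traced = 0, 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and (r, c) not in seen:
                curves += 1  # an untraced pixel starts a new curve
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    traced += 1
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and grid[ny][nx] and (ny, nx) not in seen):
                                seen.add((ny, nx))
                                queue.append((ny, nx))
    return curves, traced


one_curve = [[1, 1, 1, 0, 0],
             [0, 0, 1, 0, 0],
             [0, 0, 1, 1, 1]]
two_curves = [[1, 1, 1, 1, 1],
              [0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1]]
print(count_curves(one_curve))   # (1, 7)
print(count_curves(two_curves))  # (2, 10)
```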
Signature subitizing phenomena only appear when objects
are automatically individuated and indexed
Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A
limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.
Encoding conjunctions of properties

Experiments showing the special difficulty that
vision has in detecting conjunctions of several
properties have provided a basis for understanding
an important problem in visual analysis
How are conjunctions of features detected?
Read the vertical line of digits in the following display
Under these conditions Conjunction Errors are very frequent
Rapid visual search (Treisman)
Find the following simple figure in the next slide:
This case is easy – and the time is independent of
how many nontargets there are – because there is
only one red item. This is called a ‘popout’ search
This case is also easy – and the time is independent of
how many nontargets there are – because there is only
one right-leaning item. This is also a ‘popout’ search.
Rapid visual search (conjunction)
Find the following simple figure in the next slide:
Find the unique item in this slide
Serial vs parallel search?



Finding an element that differs from all others in a scene
by a single feature – called a single-feature search – is
fast, error-free and almost independent of how many
nontargets there are;
Finding an object that differs from all others by a
conjunction of two or more features (and that shares at
least one feature with each object in the scene) – called a
conjunction search – is usually slow, error-prone, and is
worse the more nontargets there are in the scene*.
These results suggest that in order to find a conjunction,
which requires solving the binding problem, attention has
to be scanned serially over the objects.
* This way of putting it simplifies things. Under certain
conditions the serial-parallel distinction breaks down.
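The idealized contrast between the two kinds of search can be expressed as a simple reaction-time model. The sketch assumes parallel feature search (flat in set size) and serial self-terminating conjunction search, in which attention visits on average (n + 1)/2 items when the target is present and all n items when it is absent. The base time and per-item slope are arbitrary illustrative values, not data, and the footnote's caveat about the serial-parallel distinction applies.

```python
import math


def predicted_rt(set_size, search, base_s=0.45, slope_s=0.05, target_present=True):
    """Idealized reaction time (seconds) for a visual search display.

    "feature": parallel search, flat in set size.
    "conjunction": serial self-terminating search over the items.
    base_s and slope_s are arbitrary illustrative values, not fitted data.
    """
    if search == "feature":
        return base_s
    if search == "conjunction":
        items_visited = (set_size + 1) / 2 if target_present else set_size
        return base_s + slope_s * items_visited
    raise ValueError(f"unknown search type: {search!r}")


for n in (2, 3, 5, 7, 11, 16):
    print(n, predicted_rt(n, "feature"), round(predicted_rt(n, "conjunction"), 3))
```

A telltale sign of the serial self-terminating account is that the target-absent slope comes out twice the target-present slope.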
Single-Feature vs Conjunction-feature search
[Figure: Idealized graph of Single-Feature and Conjunction Feature Search. Reaction time (seconds, 0 to 1.8) plotted against number of elements in search set (2, 3, 5, 7, 11, 16)]
What is attention for?
Treisman’s Attention as Glue Hypothesis
 The purpose of visual attention is to Bind properties
together in order to recognize objects
 This is called the “binding problem” or the “many
properties problem” and it is of considerable interest to
philosophers as well as vision scientists
 We can recognize not only the presence of “squareness”
and “redness” in our field of view, but we can also
distinguish between different ways they may be conjoined
The role of attention to location in Treisman’s Feature Integration Theory
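[Figure: the attention “beam” selects a location in the master location map of the original input, linking the color maps (R, Y, G), shape maps, and orientation maps at that location, so that a conjunction is detected]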
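The division of labor in Feature Integration Theory can be sketched with feature maps keyed by location. In this toy version (all names hypothetical), checking whether a single feature is present never consults location, while finding a conjunction requires visiting locations one at a time and reading every map at that location, which is the role the theory assigns to the attention beam.

```python
Location = tuple[int, int]


def feature_present(feature_map: dict[Location, str], value: str) -> bool:
    """Preattentive check: is the value anywhere in the map? Location is not consulted."""
    return value in feature_map.values()


def find_conjunction(color_map: dict[Location, str], shape_map: dict[Location, str],
                     color: str, shape: str):
    """Attentive check: visit master-map locations serially and read all maps there."""
    for loc in sorted(color_map):  # the attention "beam" moves one location at a time
        if color_map[loc] == color and shape_map.get(loc) == shape:
            return loc
    return None


colors = {(0, 0): "red", (1, 0): "green"}
shapes = {(0, 0): "circle", (1, 0): "square"}
print(feature_present(colors, "red"), feature_present(shapes, "square"))  # True True
print(find_conjunction(colors, shapes, "red", "square"))                   # None
print(find_conjunction(colors, shapes, "green", "square"))                 # (1, 0)
```

Redness and squareness are both present in this display, yet there is no red square; only the location-by-location check can tell the difference.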
The ‘attention-as-glue’ hypothesis has a corollary:
In computing conjunctions of properties attention
must be directed primarily at objects since it is
objects that have the conjoined properties

Instead of being like a spotlight beam that can be
scanned around a scene, and can be zoomed to cover a
larger or smaller area, maybe attention can only be
directed towards occupied places – i.e., to visual objects
An alternative view of how we
solve the binding problem

If we assume that only properties of indexed objects
are encoded and stored in Object Files, then properties
that belong to the same object are stored in the same
Object File, so the binding problem does not arise


This is the Object-Based Attention view exemplified by
FINST Theory
The assumption that only properties of indexed objects
are encoded raises the problem of what happens to
properties of the other (unindexed) objects or
unencoded properties in a display
I will return to this conundrum later.
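Why the binding problem does not arise for object files can be shown with a minimal data structure. A sketch, assuming one file per indexed object with its index given by its position in the list (the class and function names are hypothetical). Properties of the same object are written into the same file, so answering “what shape is the green thing?” needs no binding step at query time. Unindexed objects simply get no file.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    index: int                                      # the visual index the file is attached to
    properties: dict = field(default_factory=dict)  # everything encoded about that object


def open_object_files(indexed_objects):
    """Open one object file per indexed object; unindexed objects get none."""
    return [ObjectFile(i, dict(props)) for i, props in enumerate(indexed_objects)]


def properties_bound_to(files, attribute, value):
    """Return the full property sets of objects having attribute == value."""
    return [f.properties for f in files if f.properties.get(attribute) == value]


files = open_object_files([
    {"color": "red", "shape": "circle"},
    {"color": "green", "shape": "square"},
])
print(properties_bound_to(files, "color", "green"))  # [{'color': 'green', 'shape': 'square'}]
```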
FINST Theory postulates a limited number of pointers in early
vision that are elicited by causal events in the visual field and
that enable vision to refer to things without doing so under a
concept or description
What happens to unattended
objects in vision (esp in tracking)?
There are three possibilities
1. No properties other than of indexed objects are encoded
 It may be that the richness of visual phenomenology is
illusory!
 Visual information without experience & vice-versa
2. Other properties are encoded but are only available within
modules (e.g., two visual systems)
3. Unattended (unindexed) objects are tracked but access to
them is inhibited
 Mack & Rock (Inattentional Blindness)
 MOT research
Evidence for attentional selection based on
Objects

Single Object Advantage: pairs of judgments are faster when
both apply to the same perceived object

Entire objects acquire enhanced sensitivity from focal
attention to a part of the object
Single-Object advantage occurs even with generalized
“objects” defined in feature space



Simultanagnosia and hemispatial neglect show object-based
effect
Attention moves with Moving Objects
 IOR
 Object Files
 MOT
Single-object superiority even when the
shapes are controlled
More controls for the Baylis study… (Baylis, 1994)
Controls for
separability,
convexity, area…
Attention spreads over perceived objects
[Figure: displays with regions A, B, C, D; attention cued to A spreads to B and not C, or to C and not B, depending on which regions belong to the same perceived object]
Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of a prime spreads to
other parts of the same visual object compared to equally distant parts of different objects.
Objecthood endures over time

Several studies have shown that what counts as
an object (as the same object) endures over
time and over changes in location;
 Certain forms of disappearances in time and changes in
location preserve objecthood.

This gives what we have been calling a “visual
object” a real physical-object character and
partly justifies our calling it an “object”.
The time-course of attention:
Inhibition of return

If we vary the time between the cue and target in a modified Posner
paradigm, we find that when the Cue-Target-Onset-Asynchrony
(CTOA) gets to around 300-900 ms, reaction time to the target begins
to increase. This is called Inhibition-of-return (Klein, 2000).
 To get this effect we actually have to attract attention to the target
location and then attract it back to the origin. IOR is one of many
examples of an inhibition effect being produced by attention.
Inhibition of return appears to be object-based
(as well as to some extent location-based)

Inhibition-of-return is thought to help in visual
search since it prevents previously visited objects
from being revisited
 The original study used static objects. Then
(Tipper, Driver & Weaver, 1991) showed that IOR
moves with the inhibited object.
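The difference between a location-based and an object-based inhibitory tag shows up as soon as the tagged object moves. A toy sketch, assuming a one-dimensional track with integer positions and two objects that swap places after the cue (all names hypothetical). A location-based tag stays behind and inhibits whatever now occupies the old place; an object-based tag travels with the object.

```python
from dataclasses import dataclass


@dataclass
class Item:
    ident: int
    x: int  # position on a one-dimensional track, in arbitrary units


def inhibited_location_based(tag_x, item):
    """Location-based tag: whatever now occupies the tagged place is inhibited."""
    return item.x == tag_x


def inhibited_object_based(tag_ident, item):
    """Object-based tag: the inhibition travels with the tagged object."""
    return item.ident == tag_ident


# Object 1 is cued at x=0, then the two objects swap places.
before = [Item(1, 0), Item(2, 10)]
after = [Item(1, 10), Item(2, 0)]
tag_x, tag_ident = before[0].x, before[0].ident

print([i.ident for i in after if inhibited_location_based(tag_x, i)])   # [2]
print([i.ident for i in after if inhibited_object_based(tag_ident, i)])  # [1]
```

The Tipper, Driver & Weaver result corresponds to the second pattern; the slide's “(as well as to some extent location-based)” suggests both kinds of tag may operate.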
IOR appears to be object-based (it travels
with the object that was attended)
Demo of Object File Experiment
Tracking objects not defined by distinct spatial
locations and spatial trajectories
Blaser, E., Pylyshyn, Z. W., & Holcombe, A. O. (2000). Tracking an object
through feature-space. Nature, 408(Nov 9), 196-199.
There is also evidence from neuropsychology
that is consistent with the object-based view

Neglect
 Balint and simultanagnosic patients
Visual neglect syndrome is object-based
When a right neglect patient is shown a dumbbell that rotates,
the patient continues to neglect the object that had been on the
right, even though it is now on the left (Behrmann & Tipper, 1999).
Simultanagnosic (Balint Syndrome) patients only attend
to one object at a time
Simultanagnosic patients cannot judge the relative length of two
lines, but they can tell that a figure made by connecting the ends
of the lines is not a rectangle but a trapezoid (Holmes & Horrax, 1919).
Balint patients can only attend to one object at a time
even if they are overlapping
Luria, 1959
Multiple Object Tracking

One of the clearest cases illustrating object-based
attention is Multiple Object Tracking

Keeping track of individual objects in a scene requires
a mechanism for individuating, selecting, accessing
and tracking the identity of individuals over time
 These are the functions we have proposed are carried out by
the mechanism of visual indexes (FINSTs)
 We have been using a variety of methods for studying visual
indexing, including subitizing, subset selection for search,
and Multiple Object Tracking (MOT).
Multiple Object Tracking

In a typical experiment, 8 simple identical objects are
presented on a screen and 4 of them are briefly
distinguished in some visual manner – usually by
flashing them on and off.

After these 4 “targets” have been briefly identified, all
objects resume their identical appearance and move
randomly. The subjects’ task is to keep track of which
ones had earlier been designated as targets.

After a period of 5-10 seconds the motion stops and
subjects must indicate, using a mouse, which objects
were the targets.

People are very good at this task (80%-98% correct).
The question is: How do they do it?
Keep track of the objects that flash
How do we do it? What properties
of individual objects do we use?
Explaining Multiple Object Tracking
 Basic finding: People (even 5-year-old children)
can track 4 to 5 individual objects that have no
unique visual properties
 How is it done?
 Can it be done by keeping track of the only
distinctive property of objects – their location?
A possible location-based tracking algorithm
1. While the targets are visually distinct, scan
attention to each target in turn and encode its
location on a list.
2. When targets begin to move, check the n’th position
in the list and go to the location encoded there:
Call it Loc(n).
3. Find the closest element to Loc(n).
4. Update the actual location of the element found in
#3 in position n in the list: this becomes the new
value of Loc(n).
5. Move attention to the location encoded in the next
list position, Loc(n+1).
6. Repeat from #3, returning to the first list position after the last, until elements stop moving.
7. Report elements whose locations are on the list.
Use of the above algorithm assumes (1) focal attention is required to
encode locations (i.e., encoding is not parallel), (2) focal attention is
unitary and has to be scanned continuously from location to
location. It assumes no encoding (or dwell) time at each element.
Predicted performance for the serial tracking algorithm as a function of the
speed of movement of attention
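The serial algorithm above, together with its two assumptions, can be turned into a small simulation that predicts tracking accuracy as a function of how fast attention moves. This is a sketch under stated assumptions, not the simulation behind the figure: the arena size, object speed, heading jitter, time step `dt`, and the function name `simulate_serial_tracking` are all hypothetical choices. Attention is modeled as a point that travels at `attention_speed` toward the stored location Loc(n) and spends no dwell time there; at most one list entry is updated per time step.

```python
import math
import random


def simulate_serial_tracking(attention_speed, object_speed=1.0, n_objects=8,
                             n_targets=4, arena=20.0, duration=10.0,
                             dt=0.05, seed=0):
    """Proportion of targets correctly reported by the serial, location-based
    tracking algorithm (steps 1-7). All parameters are illustrative."""
    rng = random.Random(seed)
    pos = [[rng.uniform(0, arena), rng.uniform(0, arena)]
           for _ in range(n_objects)]
    heading = [rng.uniform(0, 2 * math.pi) for _ in range(n_objects)]
    targets = set(range(n_targets))          # objects 0..n_targets-1 are targets

    # Step 1: while the targets are distinct, encode each target's location.
    locs = [tuple(pos[i]) for i in range(n_targets)]
    att = locs[0]                            # attention starts at Loc(0)
    n = 0                                    # list entry attention is heading to

    for _ in range(round(duration / dt)):
        # Objects drift on a jittered heading and bounce off the arena walls.
        for i, p in enumerate(pos):
            heading[i] += rng.gauss(0.0, 0.3)
            p[0] += object_speed * dt * math.cos(heading[i])
            p[1] += object_speed * dt * math.sin(heading[i])
            if not 0 <= p[0] <= arena:
                p[0] = -p[0] if p[0] < 0 else 2 * arena - p[0]
                heading[i] = math.pi - heading[i]
            if not 0 <= p[1] <= arena:
                p[1] = -p[1] if p[1] < 0 else 2 * arena - p[1]
                heading[i] = -heading[i]

        # Steps 2 and 5: attention travels toward Loc(n) at a finite speed.
        gap = math.dist(att, locs[n])
        reach = attention_speed * dt
        if gap <= reach:
            # Step 3: find the element closest to Loc(n).
            j = min(range(n_objects), key=lambda k: math.dist(pos[k], locs[n]))
            # Step 4: its current location becomes the new Loc(n).
            locs[n] = tuple(pos[j])
            att = locs[n]
            n = (n + 1) % n_targets          # step 5: on to Loc(n+1)
        else:
            frac = reach / gap
            att = (att[0] + frac * (locs[n][0] - att[0]),
                   att[1] + frac * (locs[n][1] - att[1]))
        # Step 6: the loop repeats until the motion stops (duration elapses).

    # Step 7: report the element nearest each listed location.
    reported = {min(range(n_objects), key=lambda k: math.dist(pos[k], loc))
                for loc in locs}
    return len(reported & targets) / n_targets


if __name__ == "__main__":
    for speed in (0.5, 2.0, 8.0, 32.0):
        runs = [simulate_serial_tracking(speed, seed=s) for s in range(20)]
        print(f"attention speed {speed:>5}: mean accuracy {sum(runs) / len(runs):.2f}")
```

With these made-up parameters, slow attention leaves most stored locations stale, so step 3 tends to pick up whichever element happens to be nearby; faster attention keeps each Loc(n) close to its target. The sketch inherits the assumptions listed above (unitary attention, no dwell time).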
If we are not using and updating objects’
locations, then how are we tracking them?

Our hypothesis, which is independently motivated, is that
there are a small number of primitive indexes or pointers,
each of which can pick out a particular individual object
 The index keeps providing access to the object as the object
changes its properties and its location.

The object is not selected by using an encoding of any of its
properties. It is picked out nonconceptually, just as the
demonstrative “that” does in language.
 Nonconceptual selection is selection without classification
(without encoding the selected thing as having certain properties
or as being a member of a certain category)
 Nonconceptual contact with the world is essential in order to
ground concepts in causal connections
A FINST is a mechanism that:
1. Picks out, and
2. Keeps track of individual distal elements, and
3. Does so directly (i.e., without mediation of concepts and
without appealing to or using any encoded properties of
the individuals). Therefore,
4. FINSTs pick out and track individuals as individuals
rather than as bearers of certain properties
5. FINSTs do not pick out and track individuals as members
of any category: The connection to the world is purely
causal and nonconceptual, so there is no “seeing as”
relation.

So the visual system (and the person) literally does not know what is
being selected and tracked, even though this indexed selection
allows further properties of the object in question to be encoded
subsequently!
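The properties listed above (an index picks out and keeps track of an individual without encoding any of its properties) can be made concrete with an ordinary object reference. In this hypothetical sketch, a Python reference stands in for the causal link between an index and a distal object; the names `FinstPool`, `grab`, `access`, and `DistalObject`, and the capacity of 4, are illustrative assumptions (the text says only that there are about 4 or 5 indexes). Nothing here models how an index gets assigned in the first place.

```python
from dataclasses import dataclass


@dataclass(eq=False)      # identity-based: objects with equal properties stay distinct
class DistalObject:
    x: float
    y: float
    color: str


class FinstPool:
    """Hypothetical sketch: a small pool of indexes, each holding a bare
    reference to an object and no record of its properties."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._bound = []           # references only; nothing is encoded

    def grab(self, obj):
        """Assign a free index to obj; return its slot number, or None."""
        if len(self._bound) >= self.capacity:
            return None            # indexes exhausted
        self._bound.append(obj)
        return len(self._bound) - 1

    def access(self, slot):
        """Follow the index to its referent, whatever it is like right now."""
        return self._bound[slot]


dot = DistalObject(1.0, 2.0, "red")
pool = FinstPool()
i = pool.grab(dot)
dot.x, dot.y, dot.color = 5.0, 7.0, "green"   # the object moves and changes
print(pool.access(i).color)                   # green: still the same individual
```

Because the pool stores references rather than descriptions, the index still reaches the same individual after its location and color change, and the pool itself never "knows" what it points to; any property has to be read off the referent through `access`.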
Additional examples of MOT
 MOT with occlusion
 MOT with virtual occluders
 MOT with implosions
 MOT with line endpoints
 "Rubber band" displays
 MOT with IDs (corners)
Summary of some properties of indexing revealed
by recent experiments
1. Targets can be tracked even when they disappear
behind an occluder and, under certain conditions,
even when all objects disappear from view (Scholl &
Pylyshyn, 1999; Keane & Pylyshyn, VSS2003). Demo: MOT with occlusion
2. Properties of targets are not encoded during MOT nor
are they used in tracking. Changes in target
properties are not even noticed (Scholl, Pylyshyn & Franconeri,
1999; Bahrami, 2003).
3. Not all well-defined clusters of features can be
tracked: Only ones that correspond to objects (Scholl,
Pylyshyn & Feldman, 2001). Demo: "Rubber band" displays
4. Indexes are assigned primarily in an exogenous,
automatic, involuntary and data-driven manner. They can
also be assigned endogenously (voluntarily) but we
believe this happens only by moving focal attention to
each target serially (Annon & Pylyshyn, VSS2003).
5. Index maintenance in tracking appears to be nonpredictive and non-attentive (Keane & Pylyshyn, VSS2003; Leonard &
Pylyshyn, VSS2003).
6. Target-target confusions are much more numerous than
target-nontarget confusions. The reason appears to be
that nontargets are inhibited, which may prevent them
from being swapped with targets (Pylyshyn & Leonard, VSS2003).
7. Keeping track of objects as targets is easier than keeping
track of their identity (when the latter is provided at the
start of the trial by a name or special location)
The poorer recall of object identities is surprising, given that in order to
judge an object as a target one needs to trace its identity back to an object
that had been visibly distinct at the start of a trial! So why is ID lost?
8. One reason is that target-target confusions are much more
numerous than target-nontarget confusions. But why
should this be so?
9. One reason may be that nontargets are inhibited, which
may prevent them from being swapped with targets.
We have shown this is so experimentally. But that leaves
a serious puzzle: How can inhibition travel with objects
when no indexes are available for tracking?
The beginnings of the puzzle of clustering prior to
indexing, and what that might mean!



If moving objects are inhibited then inhibition moves along with the
objects. How can this be unless they are being tracked? And if they
are being tracked there must be at least 8 FINSTs!
This puzzle may signal the need for a kind of individuation that is
weaker than the individuation we have discussed so far – a mere
clustering, circumscribing, figure-ground distinction without a
pointer or access mechanism – i.e. without reference!
It turns out that such a circumscribing-clustering process is needed
to fulfill many different functions in early vision. It is needed
whenever the correspondence problem arises – whenever visual
elements need to be placed in correspondence or paired with other
elements. This occurs in computing stereo, apparent motion, and
other grouping situations in which the number of elements does not
affect ease of pairing (or even results in faster pairing when there are
more elements). Correspondence is not computed over continuous
visual manifolds but only over some pre-clustered elements.
An alternative view of how to solve the
Binding Problem

According to the current version of FINST theory, only
properties of indexed objects are encoded (conceptualized)
 The binding problem never arises because properties are always
encoded as properties of an indexed object, and no other
properties are encoded at all.

This is in conflict with strong intuitions – namely that we
see much more than we conceptualize. So what do we do
about the things we “see” but do not conceptualize?
 Some philosophers say they are represented nonconceptually.

But what is such a representation like? And what makes it a
representation, as opposed to just a biological reaction?
 My provisional answer is that such biological reactions
(e.g., retinal activity) are not representations at all – they
have no truth values and so they cannot misrepresent
 This is another hard issue to be deferred to later
Puzzles raised by FINST theory and MOT results
 If only information about indexed objects is
encoded and made available to the cognitive mind,
what happens to information about other parts of the
visual scene?
 There are, after all, only about 4 or 5 indexes and surely
we see a lot more of the world than 4 or 5 objects!
 This raises the question of whether non-indexed
objects are ‘processed’ in any sense at all, and
whether they are even represented in some
(presumably nonconceptual) way.
 Do objects that are not indexed have any effect on
the visual system at all?
 The mystery of unattended objects
 Functional blindness in normal vision (to come)
The problem is what to do about the items
that were not attended but in some sense
had been ‘seen’
Some considerations:
 We should not equate ‘attended’ with indexed or selected
or with any other information-processing function. To be
attended is typically defined in terms of either the task
goals (where unattended means unreported) or the
perceptual experience
 More on forms of inattentional blindness later
 Non-indexed items may continue to be indexable for a short
time after they physically disappear (e.g., occlusions in MOT)
 The question is whether this persistence is a form of
nonconceptual representation or a mere latency or inertia in the
visual mechanism, and that question eventually comes back to
whether we must advert to semantical notions in stating the
generalizations (Lloyd Morgan’s Canon or Occam’s Razor).
Another puzzle: Punctate inhibition of moving objects?

We have recently obtained evidence that nontargets are
inhibited (as measured by the rate of detection of small faint
probe dots).



There appears to be no inhibition of the empty region through which
the nontargets move
The inhibition is spatially local
How can a punctate moving object be inhibited unless the
object is being tracked? And how can it be tracked if there are
many (n > 5) of them?

But there is some sense in which moving objects must be tracked:
 E.g., Dynamic random-dot stereograms, kinetic depth effect

Maybe Indexing is a two-stage process?
1. Individuate
2. Reference (for accessing)
Exp 1: Probe-dot detection (statistically adjusted using regression)
Figure: adjusted mean probe detection performance (%) by location of probe (NonTargets, Open Space, Targets)
Recent experimental results on Inhibition of nontargets
Experiment 1: 3 locations
Figure: probe detection (%) while tracking vs. non-tracking control, by probe location (Open Space, Target, NonTarget)
Expt 2: 5 locations
Figure: probes detected (%) during tracking vs. nontracking (control), by probe location (Space, Target, NonTarget, NearTarget, NearNonTarg)
Exp 2: Showing results when statistically adjusted using regression
Figure: probe detection accuracy (%), adjusted for nontracking control, by location of probe (Empty Space, Target, NonTarget, NearTarg, NearNonTarg)
The effect of doubling the number of nontargets
Probe detection statistically adjusted using the
nontracking control as covariate
Figure: probe detection (%) by probe location (Space, Nontarget, Target), for displays of 8 items vs. 12 items
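The plots above report probe detection "statistically adjusted using regression", with the nontracking control as covariate. The text does not give the exact model, so the sketch below shows only one standard way such an adjustment is done, the classic ANCOVA adjusted mean: each group's outcome mean is shifted by the pooled within-group slope times the distance of that group's covariate mean from the grand covariate mean. Treating each probe location as a group, the nontracking-control detection rate as x, and the detection rate while tracking as y is an assumption, as is the function name `ancova_adjusted_means`.

```python
from statistics import fmean


def ancova_adjusted_means(groups):
    """groups: dict mapping a group label to a list of (x, y) pairs, where x
    is the covariate and y the outcome. Returns each group's adjusted mean:
        adj_g = mean_y_g - b_w * (mean_x_g - grand_mean_x)
    with b_w the pooled within-group regression slope of y on x."""
    grand_x = fmean(x for pairs in groups.values() for x, _ in pairs)
    sxy = sxx = 0.0
    for pairs in groups.values():
        mx = fmean(x for x, _ in pairs)
        my = fmean(y for _, y in pairs)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    b_w = sxy / sxx          # fails if the covariate never varies within groups
    return {g: fmean(y for _, y in pairs)
               - b_w * (fmean(x for x, _ in pairs) - grand_x)
            for g, pairs in groups.items()}
```

The pooled slope removes the part of each location's detection rate that is predictable from baseline (nontracking) detectability, so the adjusted means compare locations as if they had the same baseline.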
The beginnings of the puzzle of individuating
prior to indexing, and what that might mean!



If moving objects are inhibited then inhibition moves along with
the objects. How can this be unless they are being tracked? And
if they are being tracked there must be at least 8 FINSTs!
This puzzle may signal the need for a kind of individuation that is
weaker than the individuation we have discussed so far – a mere
clustering, circumscribing, figure-ground distinction without a
pointer or access mechanism – i.e. without reference!
It turns out that such a circumscribing-clustering process is needed
to fulfill many different functions in early vision. It is needed
whenever the correspondence problem arises – whenever visual
elements need to be placed in correspondence or paired with other
elements. This occurs in stereo, apparent motion, and other
situations in which increasing the number of elements does not
increase the difficulty of computing correspondences.
 Correspondence is not computed over continuous visual manifolds but only
over some pre-clustered elements.
Example of the correspondence
problem for apparent motion
The grey disks correspond to the first flash and the black ones
to the second flash. Which of the 24 possible matches will the
visual system select as the solution to this correspondence
problem? What principle does it use?
Curved matches
Linear matches
Here is how it actually looks
Why does the apparent motion take the form it does?

The principle appears to be one of minimizing the vector
difference between each possible correspondence pair
and that of its nearest neighbors (Dawson & Pylyshyn, 1988)
 This principle arises from (is justified by) the natural
constraints of rigidity and opacity:
 In our kind of world most image features arise from distal
elements on the surface of opaque rigid objects, i.e., the vast
majority of perceived distal elements are on the visible surface
of opaque rigid objects
 Therefore each distal element is likely to move the same
amount and in the same direction as elements near to it (since
they are likely to be on the same surface)
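With only four elements per flash, the nearest-neighbor principle can be checked by brute force, since 4 elements give 4! = 24 candidate one-to-one matchings. This is a sketch under assumptions, not Dawson & Pylyshyn's model, which may compute the solution quite differently: here the cost of a matching is taken to be the summed squared difference between each element's displacement vector and that of its single nearest neighbor in the first flash, and the name `choose_correspondence` is hypothetical.

```python
import itertools
import math


def choose_correspondence(frame1, frame2):
    """Return match, a tuple where match[i] is the index in frame2 paired with
    frame1[i], minimizing the mismatch between each element's displacement
    and the displacement of its nearest neighbour in frame1."""
    n = len(frame1)
    # Nearest neighbour (within the first flash) of each first-flash element.
    nearest = [min((j for j in range(n) if j != i),
                   key=lambda j: math.dist(frame1[i], frame1[j]))
               for i in range(n)]

    def cost(match):
        disp = [(frame2[match[i]][0] - frame1[i][0],
                 frame2[match[i]][1] - frame1[i][1]) for i in range(n)]
        return sum((disp[i][0] - disp[k][0]) ** 2 + (disp[i][1] - disp[k][1]) ** 2
                   for i, k in enumerate(nearest))

    # Exhaustive search over all n! one-to-one matchings (24 when n == 4).
    return min(itertools.permutations(range(n)), key=cost)
```

For a rigid translation, the matching that gives every element the same displacement has zero cost, which reflects the rigidity constraint: neighboring elements on one surface should move by the same amount in the same direction.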
Views of a dome
Structure from Motion Demo
Cylinder Kinetic Depth Effect
The correspondence problem for biological motion
Reprise … what are FINSTs?





They are a primitive reference mechanism that refers to
individual objects in the world (FINGs?)
Objects are picked out and referred to without using any
encoding of their properties, including their location.
Picking out objects is prior to encoding their locations!
Indexing is nonconceptual because it does not represent
an individual as a member of some conceptual category –
not even as being in the category individual or object!
FINSTs serve as visual demonstratives, much like the
terms this or that do in language, by picking out and
referring to individuals without using their properties.
The central function of FINST indexes is to bind
arguments of visual predicates or of motor commands to
things in the world to which they must refer. Only
predicates with bound arguments can be evaluated.
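The claim that only predicates with bound arguments can be evaluated can be illustrated with a toy evaluator. In this hypothetical sketch, a dictionary plays the role of the FINST bindings (index names such as `"FINST1"` are made up), `left_of` is an example two-place visual predicate, and `evaluate` refuses to compute a truth value when any argument index is unbound.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Thing:
    x: float
    y: float


def left_of(a, b):
    """An example two-place visual predicate: is a to the left of b?"""
    return a.x < b.x


def evaluate(predicate, bindings, *arg_indexes):
    """Evaluate predicate only if every argument index is bound to a thing.
    An unbound argument leaves nothing for the predicate to be true or false of."""
    unbound = [k for k in arg_indexes if k not in bindings]
    if unbound:
        raise LookupError(f"unbound argument index(es): {unbound}")
    return predicate(*(bindings[k] for k in arg_indexes))


a, b = Thing(1.0, 0.0), Thing(4.0, 0.0)
bindings = {"FINST1": a, "FINST2": b}
print(evaluate(left_of, bindings, "FINST1", "FINST2"))   # True
try:
    evaluate(left_of, bindings, "FINST1", "FINST3")
except LookupError as err:
    print(err)   # unbound argument index(es): ['FINST3']
```

The predicate itself never searches for its arguments by their properties; it is handed whatever the indexes currently point to, which is the sense in which indexes bind arguments to things in the world.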
Schema for how FINSTs function in
visual-motor control