Transcript Document

Value-based decision making:
behavior and theory
Institute for Theoretical Physics and Mathematics
Tehran
January, 2006
Leo Sugrue
Greg Corrado
Flow diagram 1: SENSORY INPUT → low level sensory analyzers → DECISION MECHANISMS → motor output structures → ADAPTIVE BEHAVIOR
Flow diagram 2: the same pathway, with DECISION MECHANISMS now containing a representation of stimulus/action value that is updated by REWARD HISTORY
How do we measure value?
Herrnstein RJ, 1961
Choice Fraction
The Matching Law
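For reference, Herrnstein's matching law says that the fraction of responses allocated to an option equals the fraction of rewards obtained from it:

choice fraction = B1 / (B1 + B2) = R1 / (R1 + R2) = reward fraction

where B1, B2 are the numbers of responses to the two options and R1, R2 are the numbers of rewards earned from them.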
Behavior: What computation does the
monkey use to ‘match’?
Theory: Can we build a model that replicates
the monkeys’ behavior on the matching task?
How can we validate the performance of the
model? Why is a model useful?
Physiology: What are the neural circuits
and signal transformations within the brain
that implement the computation?
An eye movement matching task
Baiting Fraction: 6:1, 6:1, 2:1, 1:1, 1:2, 1:2, 1:6
Dynamic Matching Behavior
Figure: Rewards and Responses
Relation Between Reward and Choice is Local
Figure: Rewards and Responses
How do they do this?
What local mechanism underlies the
monkey’s choices in this game?
To estimate this mechanism we need a
modeling framework.
Linear-Nonlinear-Poisson (LNP) models of choice behavior
Strategy estimation is straightforward
Estimating the form of
the linear stage
How do animals weigh past rewards
in determining current choice?
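As a rough sketch of one common choice for the linear stage (illustrative code and parameter names, not the authors' fitting procedure), past rewards can be weighted by an exponentially decaying kernel so that recent rewards count most; the talk later refers to a two-component kernel with a short time constant τ1 and a long time constant τ2, but a single exponential conveys the idea:

import numpy as np

def exp_kernel(n_trials, tau):
    # exponentially decaying weights over the past n_trials (most recent trial first)
    lags = np.arange(n_trials)
    w = np.exp(-lags / tau)
    return w / w.sum()            # normalize so the weights sum to 1

def local_income(rewards, kernel):
    # weighted sum of a 0/1 reward history (oldest trial first, most recent trial last)
    recent = list(rewards)[::-1][:len(kernel)]
    return float(np.dot(recent, kernel[:len(recent)]))

# example: one target's reward history over 13 trials, weighted with tau = 5 trials
rewards_red = [int(x) for x in "0000010100001"]
print(local_income(rewards_red, exp_kernel(n_trials=13, tau=5.0)))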
Estimating the form of
the nonlinear stage
Differential Value
How is differential value mapped onto the
animal’s instantaneous probability of
choice?
Figure: Probability of Choice (red) vs. Differential Value (rewards), for Monkey F and Monkey G
Our LNP Model of
Choice Behavior
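Putting the stages together, a minimal sketch of an LNP-style choice model (reusing exp_kernel and local_income from the sketch above; parameter names and values are illustrative, and the final probabilistic stage is approximated here by a per-trial Bernoulli draw):

import numpy as np

def lnp_choice_prob(rewards_red, rewards_grn, tau=5.0, sigma=1.0, n_trials=50):
    # L: filter each target's reward history; N: sigmoid of differential value
    w = exp_kernel(n_trials, tau)
    dv = local_income(rewards_red, w) - local_income(rewards_grn, w)   # differential value
    return 1.0 / (1.0 + np.exp(-dv / sigma))                           # p(choose red)

def next_choice(p_red, rng=None):
    # probabilistic output stage: sample the next choice from p(choose red)
    rng = rng or np.random.default_rng()
    return 'red' if rng.random() < p_red else 'green'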
Model Validation
• Can the model predict the monkey’s next choice?
• Can the model generate behavior on its own?
Can the model predict the monkey’s
next choice?
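One simple scoring scheme, sketched here with the illustrative functions above: on each trial, compute the model's p(choice) from the histories observed so far and ask whether the option the model considers more probable is the one the monkey actually chose (the log-likelihood of the choice sequence is a common alternative):

def predictive_score(monkey_choices, rewards_red, rewards_grn, **model_params):
    # fraction of trials on which the model's more-probable target was actually chosen
    hits = 0
    for t in range(1, len(monkey_choices)):
        p_red = lnp_choice_prob(rewards_red[:t], rewards_grn[:t], **model_params)
        predicted = 'red' if p_red >= 0.5 else 'green'
        hits += (predicted == monkey_choices[t])
    return hits / (len(monkey_choices) - 1)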
Predicting the next choice: single experiment
Predicting the next choice: all experiments
Can the model
generate behavior on its own?
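To test generative performance, the fitted model can be run as an agent in closed loop on a simplified version of the task; the toy simulation below ignores several task details (block changes in the baiting fractions, persistence of baited rewards, the change-over delay) and uses the illustrative functions defined earlier:

import numpy as np

def simulate(n_trials=1000, p_bait=(0.3, 0.1), tau=5.0, sigma=1.0, seed=0):
    # run the LNP model as an agent; returns its choices (0 = red, 1 = green) and rewards earned
    rng = np.random.default_rng(seed)
    rew = [np.zeros(0), np.zeros(0)]                 # per-target reward histories
    choices, earned = [], []
    for _ in range(n_trials):
        p_red = lnp_choice_prob(rew[0], rew[1], tau=tau, sigma=sigma)
        c = 0 if rng.random() < p_red else 1
        r = float(rng.random() < p_bait[c])          # simplified: reward delivered with fixed probability
        for i in (0, 1):
            rew[i] = np.append(rew[i], r if i == c else 0.0)
        choices.append(c)
        earned.append(r)
    return np.array(choices), np.array(earned)

def stay_durations(choices):
    # lengths of runs of consecutive choices of the same target (the summary statistic used below)
    runs, length = [], 1
    for a, b in zip(choices[:-1], choices[1:]):
        if a == b:
            length += 1
        else:
            runs.append(length)
            length = 1
    runs.append(length)
    return np.array(runs)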
Model generated behavior: single experiment
Distribution of stay durations summarizes
behavior across all experiments
Stay Duration (trials)
Model generated behavior: all experiments
Stay Duration (trials)
Ok, now that you have a reasonable model
what can you do with it?
1. Explore second order behavioral questions
2. Explore neural correlates of valuation
Choice of Model Input
reward history:
0000010100001
choice history:
0000111110011
Surely ‘not getting a reward’ also has
some influence on the monkey’s behavior?
Choice of Model Input
reward history:
0000010100001
choice history:
0000111110011
hybrid history:
0000d1d1d00d1
d = the value of an unrewarded choice
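As a minimal sketch (illustrative code, not the authors' implementation), the hybrid history can be built from the reward and choice strings above by assigning 1 to rewarded trials, d to chosen-but-unrewarded trials, and 0 to unchosen trials:

def hybrid_history(rewards, choices, d):
    # rewards, choices: per-trial 0/1 sequences for one target; d: value of an unrewarded choice
    return [1.0 if r else (d if c else 0.0) for r, c in zip(rewards, choices)]

rewards = [int(x) for x in "0000010100001"]   # reward history from the slide
choices = [int(x) for x in "0000111110011"]   # choice history from the slide
print(hybrid_history(rewards, choices, d=0.5))
# trials 5, 7, 9 and 12 (chosen but unrewarded) receive the value d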
Can we build a better model by taking
unrewarded choices into account?
hybrid history:
0000d1d1d00d1
• Systematically vary the value of d
• Estimate new L and N stages for the model
• Test each new model’s ability to
a) predict choice and
b) generate behavior
Unrewarded choices: The value of nothin’
Figure: Predictive Performance and Generative Performance as a function of the Value of Unrewarded Choices (d)
Choice of Model Input
Contrary to our intuition, including information about unrewarded choices does not improve model performance.
Optimality of Parameters
Weighting of past rewards
Is there an ‘optimal’ weighting function to maximize
the rewards a player can harvest in this game?
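One way to probe this numerically, sketched with the toy simulation above (again, illustrative parameters and a simplified task): sweep the filter's time constant (standing in for τ2) and compare the average reward rate harvested at each setting; the same sweep can be run over σ or τ1.

for tau in (1, 2, 5, 10, 20):                        # candidate time constants
    choices, earned = simulate(n_trials=5000, tau=tau)
    print(tau, earned.mean())                        # foraging efficiency: mean rewards per trial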
Weighting of past rewards
• The tuning of the 2 (long) component of the Lstage affects foraging efficiency. Monkeys have
found this optimum.
• The tuning of the s, the nonlinear function relating
value to p(choice) affects foraging efficiency. The
monkeys have found this optimum also.
• The 1 (short) component of the L-stage does not
affect foraging efficiency. Why do monkeys
overweight recent rewards?
The differential model is a better predictor of
monkey choice
• Monkeys match; best LNP model
• Model predicts and generates choices
• Unrewarded choices have no effect
• Monkeys find optimal τ2 and σ; τ1 not critical
• Differential value predicts choices better
than fractional value
Best LNP model:
Candidate decision variable, differential value:
pc = g(v1 - v2)
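For comparison with the 'fractional value' alternative mentioned in the summary above, the two candidate decision variables can be written in the same notation (v1, v2 are the filtered local incomes; g and h are fitted nonlinear functions):

differential value: pc = g(v1 - v2)
fractional value: pc = h(v1 / (v1 + v2))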
Aside: what would Bayes do?
1) maintain beliefs over baiting probabilities
2) be greedy or use dynamic programming
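As a sketch of point (1) (illustrative code; the real task's block structure and reward persistence are ignored), an idealized Bayesian player could keep a Beta posterior over each target's baiting probability, updating only the chosen target's belief each trial, and per point (2) greedily pick the target with the higher posterior mean:

class BetaBelief:
    # Beta(a, b) posterior over one target's baiting probability
    def __init__(self, a=1.0, b=1.0):
        self.a, self.b = a, b
    def update(self, rewarded):
        # only the chosen target's outcome is observed on a given trial
        if rewarded:
            self.a += 1
        else:
            self.b += 1
    @property
    def mean(self):
        return self.a / (self.a + self.b)

beliefs = [BetaBelief(), BetaBelief()]
choice = int(beliefs[1].mean > beliefs[0].mean)      # greedy: pick the higher posterior mean
# after observing outcome r (0 or 1): beliefs[choice].update(r)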
Firing rates in LIP are related
to target value on a trial-by-trial basis
Figure: LIP responses for targets into RF vs. out of RF, plotted against Target Value (gm020b; http://brainmap.wustl.edu/vanessen.html)
The differential model also accounts for more
variance in LIP firing rates
What I’ve told you:
• How we control/measure value: the matching law
• An experimental task based on that principle: a dynamic foraging task
• A simple model of value-based choice: our Linear-Nonlinear-Poisson model
• How we validate that model: predictive and generative validation
• How we use the model to explore behavior: hybrid models, optimality of reward weights
• How we use the model to explore value-related signals in the brain: neural firing in area LIP correlates with 'differential value' on a trial-by-trial basis
Foraging Efficiency Varies as a Function of τ2
Foraging Efficiency Does Not Vary as a Function of τ1
What do animals do?
Animals match.
Matching is a probabilistic policy:
pchoose = f(pbait1, pbait2)
Matching is almost optimal within
the set of probabilistic policies.
+ the change-over delay
Greg Corrado
How do we implement
the change-over delay?
only one ‘live’ target at a time