Searching in the Right Space
Perspectives on Computational
Reinforcement Learning
Andrew G. Barto
Autonomous Learning Laboratory
Department of Computer Science
University of Massachusetts
Amherst
[email protected]
Autonomous Learning Laboratory – Department of Computer Science
Computational Reinforcement Learning
[diagram: Computational Reinforcement Learning (RL) at the center, linked to Artificial Intelligence (machine learning), Control Theory and Operations Research, Psychology, Neuroscience, and Artificial Neural Networks]
“Reinforcement learning (RL) bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning.” (Dayan, 2001)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Plan
 High-level intro to RL
 Part I: The personal odyssey
 Part II: The modern view
 Part III: Intrinsically Motivated RL
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The View from Machine Learning
 Unsupervised Learning
• recode data based on some given principle
 Supervised Learning
• “Learning from examples”, “Learning with a
teacher”, related to Classical (or Pavlovian)
Conditioning
 Reinforcement Learning
• “Learning with a critic”, related to Instrumental
(or Thorndikian) Conditioning
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Classical Conditioning
Pavlov, 1927
Tone (CS: Conditioned Stimulus)
Food (US: Unconditioned Stimulus)
Salivation (UR: Unconditioned Response)
Anticipatory salivation (CR: Conditioned Response)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Edward L. Thorndike (1874-1949)
Learning by “Trial-and-Error”
puzzle box
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Trial-and-Error = Error Correction
Artificial Neural Network:
learns from a set of examples via error-correction
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“Least-Mean-Square” (LMS) Learning Rule
“delta rule”, Adaline, Widrow and Hoff, 1960
[diagram: input pattern x_1, x_2, ..., x_n, weights w_1, w_2, ..., w_n, actual output V, desired output z; the error z − V is used to adjust the weights]
Δw_i = α [z − V] x_i
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
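To make the rule concrete, here is a minimal Python sketch of the LMS/delta-rule update; the learning rate, the random training data, and the vector sizes are illustrative assumptions rather than anything from the slides.

```python
import numpy as np

def lms_update(w, x, z, alpha=0.1):
    """One LMS (delta-rule) step: w <- w + alpha * (z - V) * x, where V = w.x."""
    V = np.dot(w, x)                  # actual output
    return w + alpha * (z - V) * x    # adjust weights toward the desired output z

# Toy usage: recover a target linear map from examples (illustrative data).
rng = np.random.default_rng(0)
true_w = np.array([0.5, -1.0, 2.0])
w = np.zeros(3)
for _ in range(2000):
    x = rng.normal(size=3)            # input pattern
    z = float(np.dot(true_w, x))      # desired output
    w = lms_update(w, x, z)
print(w)  # approaches true_w
```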
Trial-and-Error?
 “The boss continually seeks a better worker by
trial and error experimentation with the structure
of the worker. Adaptation is a multidimensional
performance feedback process. The `error’ signal
in the feedback control sense is the gradient of
the mean square error with respect to the
adjustment.”
Widrow and Hoff, “Adaptive Switching Circuits”
1960 IRE WESCON Convention Record
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
MENACE
Michie 1961
“Matchbox Educable Noughts and Crosses Engine”
[diagram: a tree of noughts-and-crosses (tic-tac-toe) board positions, one matchbox per position]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Essence of RL (for me at least!): Search + Memory
 Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection, . . .
 Memory: remember what worked best
for each situation and start from there
next time
RL is about caching search results
(so you don’t have to keep searching!)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Generate-and-Test
 Generator should be smart:
• Generate lots of things that are likely to be good based on prior knowledge and prior experience
• But also take chances …
 Tester should be smart too:
• Evaluate based on real criteria, not convenient
surrogates
• But be able to recognize partial success
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Plan
 High-level intro to RL
 Part I: The personal odyssey
 Part II: The modern view
 Part III: Intrinsically Motivated RL
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Key Players
 Harry Klopf
 Rich Sutton
 Me
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“Neural Models and Memory”
Arbib, Kilmer, and Spinelli
in Neural Mechanisms of Learning and
Memory, Rosenzweig and Bennett, 1974
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
A. Harry Klopf
“Brain Function and Adaptive Systems - A Heterostatic Theory”
Air Force Cambridge Research Laboratories
Technical Report 3 March 1972
“…it is a theory which assumes that living adaptive systems seek, as
their primary goal, a maximal condition (heterostasis), rather than
assuming that the primary goal is a steady-state condition
(homeostasis). It is further assumed that the heterostatic nature of
animals, including man, derives from the heterostatic nature of neurons.
The postulate that the neuron
is a heterostat (that is, a maximizer) is a generalization of a more
specific postulate, namely, that the neuron is a hedonist.”
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Klopf’s theory (very briefly!)
 Inspiration: The nervous system is a society of self-interested agents.
• Nervous Systems = Social Systems
• Neuron = Man
• Man = Hedonist
• Neuron = Hedonist
• Depolarization = Pleasure
• Hyperpolarization = Pain
 A neuronal model:
• A neuron “decides” when to fire based on comparing a spatial and temporal summation of weighted inputs with a threshold.
• A neuron is in a condition of heterostasis from time t to t+τ if it maximizes the amount of depolarization and minimizes the amount of hyperpolarization over this interval.
• Two ways to adapt weights to do this:
• Push excitatory weights to upper limits; zero out inhibitory weights
• Make the neuron control its input.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Heterostatic Adaptation
 When a neuron fires, all of its synapses that were active
during the summation of potentials leading to the response
become eligible to undergo changes in their
transmittances.
 The transmittance of an eligible excitatory synapse increases if the generation of an action potential is followed by further depolarization for a limited time after the response.
 The transmittance of an eligible inhibitory synapse increases if the generation of an action potential is followed by further hyperpolarization for a limited time after the response.
 Add a mechanism that prevents synapses that participate
in the reinforcement from undergoing changes due to that
reinforcement (“zerosetting”).
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Key Components of Klopf’s Theory
 Eligibility
 Closed-loop control by neurons
 Extremization (e.g., maximization) as goal
instead of zeroing something
 “Generalized Reinforcement”: reinforcement is
not delivered by a specialized channel
The Hedonistic Neuron
A Theory of Memory, Learning, and Intelligence
A. Harry Klopf
Hemisphere Publishing Corporation, 1982
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Eligibility Traces
Klopf, 1972
[figure: time courses of presynaptic activity x, postsynaptic activity y, and their product xy; the eligibility trace has the shape of a histogram of the lengths of feedback pathways in which the neuron is embedded]
Optimal ISI: the same curve as the reinforcement-effectiveness curve in conditioning: max at 400 ms; 0 after approx 4 s.
Δw_t = α y_t \overline{xy}_t , where \overline{xy}_t is the eligibility trace of the product xy
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Later Simplified Eligibility Traces
[figure: visits to state s marked along a time axis; the accumulating trace adds 1 at each visit and decays between visits, while the replacing trace is reset to 1 at each visit and decays between visits]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
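As a concrete illustration of the two trace types just pictured, here is a small Python sketch; the decay factors and the per-visit increment/reset conventions follow the figure, and the particular parameter values are illustrative assumptions.

```python
import numpy as np

def update_traces(e_accum, e_replace, visited_state, gamma=0.9, lam=0.8):
    """One step of trace decay plus a visit to `visited_state`.

    Accumulating trace: decay everywhere, then add 1 at the visited state.
    Replacing trace:    decay everywhere, then reset to 1 at the visited state.
    """
    e_accum = gamma * lam * e_accum
    e_replace = gamma * lam * e_replace
    e_accum[visited_state] += 1.0
    e_replace[visited_state] = 1.0
    return e_accum, e_replace

# Toy usage: state 2 is visited on every step, so the two traces diverge there.
e_a = np.zeros(5)
e_r = np.zeros(5)
for _ in range(10):
    e_a, e_r = update_traces(e_a, e_r, visited_state=2)
print(e_a[2], e_r[2])  # accumulating trace exceeds 1, replacing trace stays at 1
```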
Rich Sutton
 BA Psychology, Stanford, 1978
 As an undergrad, discovered Klopf’s 1972 tech
report
 Two unpublished undergraduate reports:
• “Learning Theory Support for a Single Channel Theory
of the Brain” 1978
• “A Unified Theory of Expectation in Classical and
Instrumental Conditioning” 1978 (?)
 Rich’s first paper:
• “Single Channel Theory: A Neuronal Theory of Learning”
Brain Theory Newsletter, 1978.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Sutton’s Theory
dV_ij/dt = C_ij [A_j − P_j] E_ij

 A_j: level of activation of mode j at time t
 V_ij: sign and magnitude of association from mode i to mode j at time t
 E_ij: eligibility of V_ij for undergoing changes at time t. It is proportional to the average of the product A_i(t)A_j(t) over some small past time interval (or an average of the logical AND).
 P_j: expected level of activation of mode j at time t (a prediction of the level of activation of mode j)
 C_ij: a constant depending on the particular association being changed
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
What exactly is Pj?
 Based on recent activation of the mode: The
higher the activation within the last few seconds,
the higher the level expected for the present . . .
 Pj(t) is proportional to the average of the
activation level over some small time interval (a
few seconds or less) before t.
[diagram: input x_t, weight w_t, eligibility trace \overline{xy}_t, output y_t]
Δw_t = α [y_t − ȳ_t] \overline{xy}_t
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Sutton’s theory
 Contingent Principle: based on the reinforcement a neuron receives after it fires and on which synapses were involved in the firing, the neuron modifies its synapses so that they will cause it to fire when firing increases the neuron’s expected subsequent reinforcement.
• Basis of Instrumental, or Thorndikian, conditioning
 Predictive Principle: if a synapse’s activity predicts
(frequently precedes) the arrival of reinforcement at the
neuron, then that activity will come to have an effect on
the neuron similar to that of reinforcement.
• Basis of Classical, or Pavlovian, conditioning
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Sutton’s Theory
 Main addition to Klopf’s theory: the difference term, a temporal difference term
 Showed relationship to the Rescorla-Wagner
model (1972) of Classical Conditioning
• Blocking
• Overshadowing
 Sutton’s model was a real-time model of both
classical and instrumental conditioning
 Emphasized conditioned reinforcement
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Rescorla Wagner Model, 1972
“Organisms only learn when
events violate their expectations.”
ΔV_A = αβ(λ − V̄)

ΔV_A: change in associative strength of CS A
α: parameter related to CS intensity
β: parameter related to US intensity
λ: asymptote of conditioning determined by the US
V̄: sum of associative strengths of all CSs present (“composite expectation”)
A “trial-level” model
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
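A minimal Python sketch of a Rescorla-Wagner trial, included to make the trial-level update concrete; the parameter values and the blocking demo at the end are illustrative assumptions, not results from the talk.

```python
def rescorla_wagner_trial(V, present_cs, us_lambda, alpha=0.3, beta=0.5):
    """One trial of the Rescorla-Wagner model.

    V:          dict mapping CS name -> associative strength
    present_cs: CSs present on this trial
    us_lambda:  asymptote set by the US (e.g., 1.0 if US present, 0.0 if absent)
    """
    V_bar = sum(V[cs] for cs in present_cs)      # composite expectation
    error = us_lambda - V_bar                    # learn only when expectations are violated
    for cs in present_cs:
        V[cs] += alpha * beta * error            # all present CSs share the same error
    return V

# Toy blocking demo (illustrative parameters): train A alone, then A+B together.
V = {"A": 0.0, "B": 0.0}
for _ in range(50):
    V = rescorla_wagner_trial(V, ["A"], us_lambda=1.0)
for _ in range(50):
    V = rescorla_wagner_trial(V, ["A", "B"], us_lambda=1.0)
print(V)  # A near 1.0, B stays near 0.0: prior learning on A blocks B
```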
Conditioned Reinforcement
 Stimuli associated with reinforcement take on
reinforcing properties themselves
 Follows immediately from the predictive
principle: “By the predictive principle we propose
that the neurons of the brain are learning to
have predictors of stimuli have the same effect
on them as the stimuli themselves” (Sutton,
1978)
 “In principle this chaining can go back for any
length …” (Sutton, 1978)
 Equated Pavlovian conditioned reinforcement
with instrumental higher-order conditioning
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Where was I coming from?
 Studied at the University of Michigan: at the time a hotbed of
genetic algorithm activity due to John Holland’s influence (PhD
in 1975)
 Holland talked a lot about the exploration/exploitation tradeoff
 But I studied dynamic system theory, relationship between
state-space and input/output representations of systems,
convolution and harmonic analysis, finally cellular automata
 Fascinated by how simple local rules can generate complex global behavior:
• Dynamic systems
• Cellular automata
• Self-organization
• Neural networks
• Evolution
• Learning
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Sutton and Barto, 1981
 “Toward a Modern Theory of Adaptive Networks: Expectation
and Prediction” Psych Review 88, 1981
 Drew on Rich’s earlier work, but clarified the math and
simplified the eligibility term to be non-contingent: just a
trace of x instead of xy.
 Emphasized anticipatory nature of the CR
 Related to “Adaptive System Theory”:
• Other neural models (Hebb, Widrow & Hoff’s LMS, Uttley’s
“Informon”, Anderson’s associative memory networks)
• Pointed out relationship between Rescorla-Wagner model
and Adaline, or LMS algorithm
• Studied algorithm stability
• Reviewed possible neural mechanisms: e.g. eligibility =
intracellular Ca ion concentrations
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“SB Model” of Classical Conditioning
[diagram: input x_t with weight w_t producing output y_t]
Δw_t = c [y_t − ȳ_t] x̄_t
ȳ_t = y_{t−1}
x̄_{t+1} = α x̄_t + x_t
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
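A minimal Python sketch of these three SB-model equations; the parameter values, vector shapes, and the way the US is folded into the input vector x are illustrative assumptions.

```python
import numpy as np

def sb_model_step(w, x, x_trace, y_prev, c=0.1, alpha=0.8):
    """One time step of the Sutton-Barto (1981) classical-conditioning model.

    w:       weights (associative strengths), one per input component
    x:       current input vector (CS components; US also enters as an input)
    x_trace: eligibility trace of the inputs, x̄_t
    y_prev:  output on the previous step, used as the prediction ȳ_t = y_{t-1}
    """
    y = float(np.dot(w, x))               # current output
    w = w + c * (y - y_prev) * x_trace     # Δw_t = c [y_t - ȳ_t] x̄_t
    x_trace = alpha * x_trace + x          # x̄_{t+1} = α x̄_t + x_t
    return w, x_trace, y
```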
Temporal Primacy Overrides Blocking in SB model
our simulation
Kehoe, Schreurs,
and Graham 1987
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Intratrial Time Courses (part 2 of blocking)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Adaline Learning Rule
LMS rule, Widrow and Hoff, 1960
[diagram: input pattern x_t, weights w_t, output y_t, target output z_t]
y_t = w_t^T x_t
Δw_t = α [z_t − y_t] x_t
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“Rescorla–Wagner Unit”
[diagram: US input z_t drives the UR; CS input x_t, through w_t, the vector of “associative strengths”, produces y_t = w_t^T x_t, the “composite expectation”, which drives the CR]
Δw_t = α [z_t − y_t] x_t
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Important Notes
 The “target output” of LMS corresponds to the
US input to Rescorla-Wagner model
 In both cases, this input is specialized in that it
does not directly activate the unit but only
directs learning
 The SB model is different, with the US input
activating the unit and directing learning
 Hence, SB model can do secondary
reinforcement
 SB model stayed with Klopf’s idea of “generalized
reinforcement”
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
One Neural Implementation of S-B Model
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
A Major Problem: US offset
 e.g., if a CS has the same time course as the US, the weights change so that the US is cancelled out.
[figure: time courses of the US, the CS, and the final result, with the US response cancelled]
Why? Because the rule is trying to zero out y_t − y_{t−1}
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Associative Memory Networks
Kohonen et al. 1976, 1977; Anderson et al. 1977
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Associative Search Network
Barto, Sutton, & Brouwer 1981
Input vector X(t) = (x_1(t), ..., x_n(t)), randomly chosen from the set X = {X_1, X_2, ..., X_k}
Output vector in response to X(t) is Y(t) = (y_1(t), ..., y_m(t))
For each X_a(t) the payoff is a scalar Z_a(Y(t))
X(t) is the context vector at time t

y(t) = 1 if s(t) + NOISE(t) > 0, 0 otherwise
where s(t) = Σ_{i=1}^n w_i(t) x_i(t)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Associative Search Network
Barto, Sutton, Brouwer, 1981
Δw_i(t) = c [z(t) − z(t−1)] [y(t−1) − y(t−2)] x_i(t−1)

Problem of context transitions: add a predictor

Δw_i(t) = c [z(t) − p(t−1)] [y(t−1) − y(t−2)] x_i(t−1)
Δw_pi(t) = c_p [z(t) − p(t−1)] x_i(t−1)
“one-step-ahead LMS predictor”
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
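To make the associative search unit and its one-step-ahead predictor concrete, here is a small Python sketch; the binary output convention, noise scale, and learning rates are illustrative assumptions.

```python
import numpy as np

class ASNUnit:
    """One associative search unit with a one-step-ahead LMS predictor (sketch)."""
    def __init__(self, n_inputs, c=0.5, c_p=0.2, noise_std=0.1, seed=0):
        self.w = np.zeros(n_inputs)    # action weights
        self.w_p = np.zeros(n_inputs)  # predictor weights
        self.c, self.c_p, self.noise_std = c, c_p, noise_std
        self.rng = np.random.default_rng(seed)

    def act(self, x):
        """y(t) = 1 if s(t) + NOISE(t) > 0, else 0."""
        s = np.dot(self.w, x) + self.rng.normal(scale=self.noise_std)
        return 1.0 if s > 0 else 0.0

    def learn(self, x_prev, y_prev, y_prev2, z):
        """Apply the two update rules using the context and outputs from earlier steps."""
        p_prev = np.dot(self.w_p, x_prev)                         # prediction made last step
        self.w += self.c * (z - p_prev) * (y_prev - y_prev2) * x_prev
        self.w_p += self.c_p * (z - p_prev) * x_prev              # one-step-ahead LMS predictor
```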
Relation to Klopf/Sutton Theory
Δw_i(t) = c [z(t) − p(t−1)] [y(t−1) − y(t−2)] x_i(t−1)
eligibility = ẏ x

 Did not include generalized reinforcement, since z(t) is a specialized reward input
 Associative version of the ALOPEX algorithm of Harth & Tzanakou, and later Unnikrishnan
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Associative Search Network
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“Landmark Learning”
An illustration of associative search
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Barto & Sutton 1981
“Landmark Learning”
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“Landmark Learning”
swap E and W
landmarks
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Note: Diffuse Reward Signal
[diagram: a single diffuse reward signal broadcast to three units with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3]
Units can learn different things despite receiving identical inputs . . .
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Provided there is variability
 ASN just used noisy units to introduce variability
 Variability drives the search
 Needs to have an element of “blindness”, as in
“blind variation”: i.e. outcome is not completely
known beforehand
 BUT does not have to be random
 IMPORTANT POINT:
Blind Variation does not have to be random, or dumb
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Pole Balancing
Barto, Sutton, & Anderson 1984
Widrow & Smith, 1964
“Pattern Recognizing Control Systems”
Michie & Chambers, 1968
“Boxes: An Experiment in Adaptive Control”
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
MENACE
Michie 1961
“Matchbox Educable Noughts and Crosses Engine”
[diagram: the tree of noughts-and-crosses (tic-tac-toe) board positions shown earlier, one matchbox per position]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The “Boxes” Idea
 “Although the construction of this Matchbox Educable
Noughts and Crosses Engine (Michie 1961, 1963) was
undertaken as a ‘fun project’, there was present a more
serious intention to demonstrate the principle that it may
be easier to learn to play many easy games than one
difficult one. Consequently it may be advantageous to
decompose a game into a number of mutually independent
sub-games even if much relevant information is put out of
reach in the process.”
Michie and Chambers, “Boxes: An Experiment in Adaptive Control”
Machine Intelligence 2, 1968
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Boxes
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Actor-Critic Architecture
ACE = adaptive critic element
ASE = associative search element
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Actor
ASE: associative search element
y(t) = +1 if s(t) + NOISE(t) ≥ 0, −1 otherwise

Δw_i(t) = α r(t) e_i(t)
where eligibility e_i(t) = \overline{y x_i}(t)
Note:
1) Move from changes in evaluation to just r
2) Move from ẏ to just y in the eligibility
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Critic
ACE: adaptive critic element
p(t) = Σ_{i=1}^n v_i(t) x_i(t)

Δv_i(t) = β [r(t) + γ p(t) − p(t−1)] ē_i(t)
where eligibility ē_i(t) = x̄_i(t)

Note differences with the SB model:
1) Reward has been pulled out of the weighted sum
2) Discount factor γ: decay rate of predictions if not sustained by external reinforcement
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Putting them Together
r̂(t) = r(t) + γ p(t) − p(t−1)
internal reinforcement = effective reinforcement = temporal-difference (TD) error δ(t)

[diagram: taking action y in state s leads to state s′; the reward prediction rises from p(s) to the higher p(s′), so taking action y in state s is made more likely]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
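A minimal tabular actor-critic sketch in Python built around this TD error; for brevity it omits the eligibility traces, uses a softmax over actor preferences in place of the noisy threshold unit, and assumes a deterministic toy environment, so it is an illustration of the idea rather than the ASE/ACE implementation.

```python
import numpy as np

def actor_critic_episode(P, R, n_actions, v, w,
                         alpha=0.1, beta=0.1, gamma=0.95, steps=200, seed=0):
    """Tabular actor-critic: critic learns p(s), actor is adjusted by the TD error.

    P[s][a] -> next state, R[s][a] -> reward (a deterministic toy model).
    v: critic values per state; w: actor preferences per (state, action).
    """
    rng = np.random.default_rng(seed)
    s = 0
    for _ in range(steps):
        prefs = w[s] - w[s].max()
        probs = np.exp(prefs) / np.exp(prefs).sum()   # softmax over action preferences
        a = int(rng.choice(n_actions, p=probs))
        s_next, r = P[s][a], R[s][a]
        delta = r + gamma * v[s_next] - v[s]          # TD error: r + gamma*p(s') - p(s)
        v[s] += beta * delta                          # critic update
        w[s, a] += alpha * delta                      # actor update: make a in s more/less likely
        s = s_next
    return v, w

# Toy two-state chain: action 1 moves toward state 1, which yields reward.
P = [[0, 1], [0, 1]]
R = [[0.0, 0.0], [0.0, 1.0]]
v = np.zeros(2)
w = np.zeros((2, 2))
v, w = actor_critic_episode(P, R, 2, v, w)
print(v, w)
```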
Actor & Critic learning almost identical
Adaptive Critic: e = trace of presynaptic activity only
Actor: e = trace of pre- and postsynaptic correlation
Δw ∝ δ e
[diagram: prediction p, primary reward r, and noise feed the TD error δ, which drives both the critic’s prediction weights and the actor’s action weights]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“Credit Assignment Problem”
Marvin Minsky, 1961
Getting useful training information to the
right places at the right times
Spatial
Temporal
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“Associative Reward-Penalty Element” (AR-P)
Barto & Anandan 1985
y(t) = +1 if s(t) + NOISE(t) ≥ 0, −1 otherwise   (same as ASE)
where s(t) = Σ_{i=1}^n w_i(t) x_i(t)

Δw_i(t) = α [r(t) y(t) − E{y(t) | s(t)}] x_i(t)    if r(t) = +1
Δw_i(t) = λα [r(t) y(t) − E{y(t) | s(t)}] x_i(t)   if r(t) = −1
0 ≤ λ ≤ 1
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
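A small Python sketch of one A_R-P step, assuming Gaussian noise so that E{y(t) | s(t)} can be computed in closed form; the parameter values are illustrative.

```python
import math
import numpy as np

def arp_step(w, x, r, alpha=0.1, lam=0.05, noise_std=0.3, rng=None):
    """One step of an Associative Reward-Penalty (A_R-P) unit (sketch).

    w: weights; x: input pattern; r: reward in {+1, -1}.
    With lam = 0 this reduces to the Associative Reward-Inaction (A_R-I) rule.
    """
    rng = rng or np.random.default_rng()
    s = float(np.dot(w, x))
    y = 1.0 if s + rng.normal(scale=noise_std) >= 0 else -1.0
    # Expected output of the noisy threshold unit: E{y|s} = 2*P(s + noise >= 0) - 1
    p_fire = 0.5 * (1.0 + math.erf(s / (noise_std * math.sqrt(2.0))))
    ey = 2.0 * p_fire - 1.0
    rate = alpha if r == 1 else lam * alpha        # penalty steps are scaled by lambda
    w = w + rate * (r * y - ey) * x
    return w, y
```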
AR-P
Δw_i(t) = α [r(t) y(t) − E{y(t) | s(t)}] x_i(t)    if r(t) = +1
Δw_i(t) = λα [r(t) y(t) − E{y(t) | s(t)}] x_i(t)   if r(t) = −1

 If λ = 0, “Associative Reward-Inaction Element” A_R-I
 Think of r(t)y(t) as the desired response
 Stochastic version of Widrow et al.’s “Selective Bootstrap Element” [Widrow, Gupta, & Maitra, “Punish/Reward: Learning with a Critic in Adaptive Threshold Systems”, 1973] (where we got the term “Critic”)
 Associative generalization of L_R-P, a “stochastic learning automaton” algorithm (with roots in Tsetlin’s work and in mathematical psychology, e.g., Bush & Mosteller, 1955)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
AR-P Convergence Theorem
 Input patterns linearly independent
 Each input has nonzero probability of being
presented on a trial
 NOISE has cumulative distribution that is strictly
monotonically increasing (excludes uniform dist.
and deterministic case)
 α has to decrease as usual….
 For all stochastic reward contingencies, as λ approaches 0, the probability of each correct action approaches 1.
 BUT, does not work when λ = 0.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Contingency Space: 2 actions (two-armed bandit)
Explore/Exploit
Dilemma
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Interesting follow up to AR-P theorem
 Williams’ REINFORCE class of algorithms (1987) generalizes A_R-I (i.e., λ = 0).
 He showed that the weights change according to
an unbiased estimate of the gradient of the
reward function
 BUT NOTE: this is the case for which our
theorem isn’t true!
 Recent “policy gradient” methods generalize
REINFORCE algorithms
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Learning by Statistical Cooperation
Barto 1985
Feedforward networks of AR-P units
Most reward
achieved when the
network
implements the
identity map
each unit has an (unshown) constant input
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Identity Network Results
 = .04
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
XOR Network
Most reward
achieved when the
network
implements XOR
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
XOR Network Behavior
 = .08
Visible element
Hidden element
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Notes on AR-P Nets
 None of these networks work with λ = 0. They almost always converge to a local maximum.
 Elements face non-stationary reward contingencies;
they have to converge for all contingencies, even hard
ones.
 Rumelhart, Hinton, & Williams published the backprop
paper shortly after this (in 1986).
 AR-P networks and backprop networks do pretty much
the same thing, BUT backprop is much faster.
 Barto & Jordan “Gradient Following without
Backpropagation in Layered Networks” First IEEE
conference on NNs, 1987.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
On the speeds of various layered network algorithms
My recollection of a talk by Geoffrey Hinton c. 1988
 Backprop: slow
 Boltzmann Machine: glacial
 Reinforcement Learning: don’t ask!
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
“Credit Assignment Problem”
Marvin Minsky, 1961
Getting useful training information to the
right places at the right times
Spatial
Temporal
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Teams of Learning Automata
 Tsetlin, M. L. Automata Theory and Modeling of
Biological Systems, Academic Press NY, 1973
 e.g. the “Goore Game”
 Real games were studied too…
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Neurons and Bacteria
 Koshland’s (1980) model of bacterial tumbling
 Barto (1989) “From Chemotaxis to Cooperativity: Abstract
Exercises in Neuronal Learning Strategies” in The
Computing Neuron, Durbin, Miall, & Mitchison (eds.),
Addison-Wesley, Wokingham, England
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
TD Model of Pavlovian Conditioning
Sutton & Barto 1990
 The adaptive critic (slightly modified) as a model of
Pavlovian conditioning
p(t) = [ Σ_{i=1}^n v_i(t) x_i(t) ]⁺   (the “floor” keeps p(t) from going below zero)
Δv_i(t) = β [λ(t) + γ p(t) − p(t−1)] ē_i(t)
where eligibility ē_i(t) = x̄_i(t)
λ(t), the US intensity, appears instead of r
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
TD Model
Predictions of what?
“imminence weighted
sum of future USs”
i.e. discounting
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
TD Model
“Complete Serial Compound”
“tapped delay line”
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Summary of Part I
• Eligibility
• Neurons as closed-loop controllers
• Generalized reinforcement
• Prediction
• Real-time conditioning models
• Conditioned reinforcement
• Adaptive system/machine learning theory
• Stochastic search
• Associative Reinforcement Learning
• Teams of self-interested units
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Key Computational Issues
 Trial-and-error
 Error-correction
 Essence of RL (for me): search + memory
 Variability is essential
 Variability needs to be somewhat blind but not dumb
 Smart generator; smart tester
 The “Boxes” idea: break up a large search into many small searches
 Prediction is important
 What to predict: total future reward
 Changes in prediction are useful local evaluations
 Credit assignment problems
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Plan
 High-level intro to RL
 Part I: The personal odyssey
 Part II: The modern view
 Part III: Intrinsically Motivated RL
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Part II: The Modern View
 Shift from animal learning to sequential
decision problems: stochastic optimal
control
 Markov Decision Processes (MDPs)
 Dynamic Programming (DP)
 RL as approximate DP
 Give up the neural models…
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Samuel’s Checkers Player 1959
[diagram: the current board position is fed to the evaluation function (value function) V, which outputs a score such as +20]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Arthur L. Samuel
“. . . we are attempting to make the score,
calculated for the current board position,
look like that calculated for the terminal
board positions of the chain of moves
which most probably occur during actual
play.”
Some Studies in Machine Learning
Using the Game of Checkers, 1959
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
TD-Gammon
Tesauro, 1992–1995
Value = estimated probability of winning
STATES: configurations of the playing board (about 10^20)
ACTIONS: moves
REWARDS: win: +1; lose: 0

Start with a random network
Play very many games against self
Learn a value function from this simulated experience
This produces (arguably) the best player in the world
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Sequential Decision Problems
 Decisions are made in stages.
 The outcome of each decision is not fully
predictable but can be observed before the next
decision is made.
 The objective is to maximize a numerical measure
of total reward over the entire sequence of
stages: called the return
 Decisions cannot be viewed in isolation: need to
balance desire for immediate reward with
possibility of high reward in the future.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Agent-Environment Interface
Agent and environment interact at discrete time steps: t = 0, 1, 2, ...
Agent observes state at step t: s_t ∈ S
produces action at step t: a_t ∈ A(s_t)
gets resulting reward: r_{t+1} ∈ ℝ
and resulting next state: s_{t+1}

... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Markov Decision Processes
 If a reinforcement learning task has the Markov
Property, it is basically a Markov Decision
Process (MDP).
 If state and action sets are finite, it is a finite
MDP.
 To define a finite MDP, you need to give:
• state and action sets
• one-step “dynamics” defined by transition probabilities:
  P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }  for all s, s′ ∈ S, a ∈ A(s)
• reward expectations:
  R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }  for all s, s′ ∈ S, a ∈ A(s)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Elements of the MDP view
 Policies
 Return: e.g., discounted sum of future rewards
 Value functions
 Optimal value functions
 Optimal policies
 Greedy policies
 Models: probability models, sample models
 Backups
 Etc.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Backups
V(s) ← ?
[diagram: backup tree rooted at state s, branching over actions and possible next states down to terminal states T]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Stochastic Dynamic Programming
V(s) ← max_a E[ r + γ V(successor of s under a) ]

Needs a probability model to compute all the required expected values
[diagram: full DP backup from state s over all actions a, rewards r, and successor states s′]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
e.g., Value Iteration
Lookup-table storage of V
V_0 → V_1 → ... → V_k → V_{k+1} → ... → V*

a SWEEP = update the value of each state once using the max backup
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
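A minimal Python sketch of lookup-table value iteration using the max backup; the array-based model representation and the toy example are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, sweeps=100):
    """Lookup-table value iteration: repeated sweeps of the max backup.

    P[a, s, s'] = transition probability, R[a, s, s'] = expected reward
    (array shapes: [n_actions, n_states, n_states]).
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(sweeps):
        # One sweep: back up every state once using the max over actions
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        V = Q.max(axis=0)
    return V

# Toy 2-state, 2-action example (illustrative): action 1 reaches the rewarding state 1.
P = np.array([[[1., 0.], [1., 0.]], [[0., 1.], [0., 1.]]])
R = np.zeros((2, 2, 2))
R[1, :, 1] = 1.0
print(value_iteration(P, R))  # both states approach 1 / (1 - gamma) = 10
```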
Dynamic Programming
Bellman 195?
“… it’s impossible to use the word,
dynamic, in a pejorative sense. Try
thinking of some combination which will
possibly give it a pejorative meaning.
It’s impossible. … It was something not
even a Congressman could object to.”
Bellman
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Stochastic Dynamic Programming
COMPUTATIONALLY COMPLEX:
• Multiple exhaustive sweeps
• Complex "backup" operation
• Complete storage of evaluation function V
NEEDS ACCURATE PROBABILITY MODEL
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Approximating Stochastic DP
 AVOID EXHAUSTIVE SWEEPS OF STATE SET
• To which states should the backup operation be applied?
 SIMPLIFY THE BACKUP OPERATION
• Can one avoid evaluating all possible next states in
each backup operation?
 REDUCE DEPENDENCE ON MODELS
• What if details of process are unknown or hard to
quantify?
 COMPACTLY APPROXIMATE V
• Can one avoid explicitly storing all of V?
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Avoiding Exhaustive Sweeps
 Generate multiple sample paths: in reality or
with a simulation (sample) model
 FOCUS backups around sample paths
 Accumulate results in V
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Simplifying Backups
V(s) ← ?
[diagram: the same backup tree rooted at state s as before, to be simplified]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Simple Monte Carlo
V(s) ← (1 − α) V(s) + α · REWARD(path)

• no probability model needed
• real or simulated experience
• relatively efficient on very large problems
[diagram: a single sample path from state s down to a terminal state T; its return backs up V(s)]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Temporal Difference Backup
V(s) ← (1 − α) V(s) + α [ r + γ V(s′) ]

• no probability model needed
• real or simulated experience
• incremental
• but less informative than a DP backup
[diagram: a one-step sample transition from s via reward r to s′ backs up V(s)]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Our familiar TD error
Rewrite this:
V(s) ← (1 − α) V(s) + α [ r + γ V(s′) ]
to get:
V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]
TD error
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
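The same TD error as a minimal tabular TD(0) update in Python; the toy usage numbers are illustrative.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V, td_error

# Toy usage: the value of state 0 rises toward gamma * V(1) = 0.9.
V = [0.0, 1.0]
for _ in range(100):
    V, delta = td0_update(V, s=0, r=0.0, s_next=1)
print(V[0])  # approaches 0.9
```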
Why TD?
[diagram: a New state leads to a known Bad state, from which the outcome is Loss 90% of the time and Win 10% of the time]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Compactly Approximate V
Function Approximation Methods:
e.g., artificial neural networks
[diagram: a description of state s is fed to an ANN, which outputs V(s), the evaluation of s]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Q-Learning
Watkins 1989; Leigh Tesfatsion
action values Q*(s, a): expected return for taking action a in state s and following an optimal policy thereafter

For any state, any action with a maximal optimal action value is an optimal action:
(optimal action in s) = arg max_a Q*(s, a)

Let Q(s, a) = current estimate of Q*(s, a)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Q-Learning Backup
Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_b Q(s′, b) ]

Does not need a probability model (for either learning or performance)
[diagram: a one-step sample transition from (s, a) via reward r to s′ backs up Q(s, a)]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
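A minimal Python sketch of the Q-learning backup plus an epsilon-greedy behavior policy; the tabular representation and parameter values are illustrative assumptions.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning backup: Q(s,a) <- (1-alpha)Q(s,a) + alpha[r + gamma*max_b Q(s',b)]."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """Behavior policy used to gather experience; Q-learning still estimates Q* off-policy."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```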
Another View: Temporal Consistency
V_{t−1} = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯
V_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯
so: V_{t−1} = r_t + γ V_t
or: r_t + γ V_t − V_{t−1} = 0   (“TD error”)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Review
 MDPs
 Dynamic Programming
 Backups
 Bellman equations (temporal consistency)
 Approximating DP
• Avoid exhaustive sweeps
• Simplify backups
• Reduce dependence on models
• Compactly approximate V
 A good case can be made for using RL to
approximate solutions to large MDPs
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Plan
 High-level intro to RL
 Part I: The personal odyssey
 Part II: The modern view
 Part III: Intrinsically Motivated RL
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
A Common View
[diagram: the Agent sends actions to the Environment and receives state and reward in return]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
A Less Misleading Agent View…
[diagram: an RL agent embedded in a larger organism; external sensations, memory, and internal sensations produce the state and reward signals delivered to the RL agent, which emits actions]
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Motivation
 “Forces” that energize an organism to act and
that direct its activity.
 Extrinsic Motivation: being moved to do
something because of some external reward ($$,
a prize, etc.).
 Intrinsic Motivation: being moved to do
something because it is inherently enjoyable.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Intrinsic Motivation
 An activity is intrinsically motivated if the agent
does it for its own sake rather than as a step
toward solving a specific problem
 Curiosity, Exploration, Manipulation, Play, Learning
itself . . .
 Can an artificial learning system be intrinsically
motivated?
 Specifically, can a Reinforcement Learning system
be intrinsically motivated?
Working with Satinder Singh
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Usual View of RL
Reward looks extrinsic
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Less Misleading View
All reward is intrinsic.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
So What is IMRL?
 Key distinction:
• Extrinsic reward = problem specific
• Intrinsic reward = problem independent
 Learning phases:
• Developmental Phase: gain general
competence
• Mature Phase: learn to solve specific
problems
 Why important: open-ended learning via
hierarchical exploration
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Scaling Up: Abstraction
 Ignore irrelevant details
• Learn and plan at a higher level
• Reduce search space size
• Hierarchical planning and control
• Knowledge transfer
• Quickly react to new situations
• c.f. macros, chunks, skills, behaviors, . . .
 Temporal abstraction: ignore temporal
details (as opposed to aggregating states)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The “Macro” Idea
 A sequence of operations with a name; can be
invoked like a primitive operation
• Can invoke other macros. . . hierarchy
• But: an open-loop policy
 Closed-loop macros
• A decision policy with a name; can be
invoked like a primitive control action
• behavior (Brooks, 1986), skill (Thrun &
Schwartz, 1995), mode (e.g., Grudic &
Ungar, 2000), activity (Harel, 1987),
temporally-extended action, option (Sutton,
Precup, & Singh, 1997)
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Options (Precup, Sutton, & Singh, 1997)
A generalization of actions to include temporally-extended
courses of action
An option is a triple o = ⟨I, π, β⟩
 I: initiation set: the set of states in which o may be started
 π: the policy followed during o
 β: termination condition: gives the probability of terminating in each state

Example: robot docking
 I: all states in which the charger is in sight
 π: pre-defined controller
 β: terminate when docked or charger not visible
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
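A minimal Python sketch of an option as a triple ⟨I, π, β⟩; the types and the toy “go to hallway” example are illustrative assumptions, not code from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option o = <I, pi, beta> (sketch).

    initiation:  states where the option may be started (I)
    policy:      maps a state to a primitive action while the option runs (pi)
    termination: maps a state to the probability of terminating there (beta)
    """
    initiation: Set[int]
    policy: Callable[[int], int]
    termination: Callable[[int], float]

# Illustrative "go to hallway" option for a gridworld with integer states:
go_to_hallway = Option(
    initiation={0, 1, 2, 3},                        # any state inside the room
    policy=lambda s: 1,                             # e.g., always step toward the hallway
    termination=lambda s: 1.0 if s == 4 else 0.0,   # stop at the hallway state
)
```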
Options cont.
 Policies can select from a set of options & primitive
actions
 Generalizations of the usual concepts:
• Transition probabilities (“option models”)
• Value functions
• Learning and planning algorithms
 Intra-option off-policy learning:
• Can simultaneously learn policies for many
options from same experience
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Options define a Semi-Markov Decision Process
[diagram: state trajectories over time at three levels]
MDP: discrete time, homogeneous discount
SMDP: continuous time, discrete events, interval-dependent discount
Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

A discrete-time SMDP overlaid on an MDP; can be analyzed at either level
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Where do Options come from?
 Dominant approach: hand-crafted from the start
 How can an agent create useful options for
itself?
• Several different approaches (McGovern, Digney,
Hengst, ….). All involve defining subgoals of various
kinds.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Canonical Illustration: Rooms Example
[diagram: a gridworld of 4 rooms connected by 4 hallways, with options O1 and O2 and candidate goal locations G?]
4 unreliable primitive actions: up, down, left, right; fail 33% of the time
8 multi-step options (to each room's 2 hallways)
Given the goal location, quickly plan the shortest route
Goal states are given a terminal value of 1; all rewards zero; γ = .9
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Task-Independent Subgoals
 “Bottlenecks”, “Hubs”, “Access States”, …
 Surprising events
 Novel events
 Incongruous events
 Etc. …
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
A Developmental Approach
 Subgoals: events that are “intrinsically
interesting”; not in the service of any specific
task
 Create options to achieve them
 Once option is well learned, the triggering event
becomes less interesting
 Previously learned options are available as
actions in learning new option policies
 When facing a specific problem: extract a
“working set” of actions (primitive and abstract)
for planning and learning
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
For Example:
 Built-in salient stimuli: changes in lights and
sounds
 Intrinsic reward generated by each salient event:
• Proportional to the error in prediction of that event
according to the option model for that event
(“surprise”)
 Motivated in part by novelty responses of
dopamine neurons
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
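A minimal Python sketch of this kind of intrinsic reward signal; the particular form (1 minus the option model's predicted probability of the event) and the scale factor are illustrative assumptions.

```python
def intrinsic_reward(event, predicted_prob, scale=1.0):
    """Intrinsic reward for a salient event, proportional to how poorly the event
    was predicted by its option model ("surprise"); the 1 - predicted_prob form
    and the scale parameter are illustrative assumptions."""
    return scale * (1.0 - predicted_prob)

# As the option model for "light turns on" improves, predicted_prob rises and the
# intrinsic reward for that event fades, so the event becomes less interesting.
print(intrinsic_reward("light_on", predicted_prob=0.1))   # novel: large reward
print(intrinsic_reward("light_on", predicted_prob=0.95))  # well learned: small reward
```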
Creating Options
 Upon first occurrence of a salient event: create an option and initialize:
• Initiation set
• Policy
• Termination condition
• Option model
 All options and option models updated all the
time using intra-option learning
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Playroom Domain
Agent has eye, hand, visual
marker
Actions:
move eye to hand
move eye to marker
move eye N, S, E, or W
move eye to random object
move hand to eye
move hand to marker
move marker to eye
move marker to hand
If both eye and hand are on an object: turn on light, push ball, etc.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
The Playroom Domain cont.
Switch controls room lights
Bell rings and moves one square
if ball hits it
Press blue/red block turns music
on/off
Lights have to be on to see
colors
Can push blocks
Monkey cries out if bell and
music both sound in dark
room
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Skills
 To make the monkey cry out:
• Move eye to switch
• Move hand to eye
• Turn lights on
• Move eye to blue block
• Move hand to eye
• Turn music on
• Move eye to switch
• Move hand to eye
• Turn light off
• Move eye to bell
• Move marker to eye
• Move eye to ball
• Move hand to ball
• Kick ball to make bell ring

 Using skills (options):
• Turn lights on
• Turn music on
• Turn lights off
• Ring bell
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Reward for Salient Events
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Speed of Learning Various Skills
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Learning to Make the Monkey Cry Out
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Connects with Previous RL Work
 Schmidhuber
 Thrun and Moller
 Sutton
 Kaplan and Oudeyer
 Duff
 Others….
But these did not have the option framework
and related algorithms available
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Beware the “Fallacy of Misplaced Concreteness”
Alfred North Whitehead
We have a tendency to mistake our models for
reality, especially when they are good models.
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Thanks to all my past PhD Students
 Rich Sutton
 Chuck Anderson
 Stephen Judd
 Robbie Jacobs
 Jonathan Bachrach
 Vijay Gullapalli
 Satinder Singh
 Bob Crites
 Steve Bradtke
 Mike Duff
 Amy McGovern
 Ted Perkins
 Mike Rosenstein
 Balaraman Ravindran
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
And my current students
 Colin Barringer
 Anders Jonsson
 George D. Konidaris
 Ashvin Shah
 Özgür Şimşek
 Andrew Stout
 Chris Vigorito
 Pippin Wolfe
And the funding agencies
AFOSR, NSF, NIH, DARPA
Autonomous Learning Laboratory – Department of Computer Science
Andrew Barto, Okinawa Computational Neuroscience Course, July 2005
Whew!
Thanks!