From Reflex to Reason
Rich Sutton
AT&T Labs
with thanks to Satinder Singh,
Doina Precup, and Andy Barto
Overall Goal
A computational understanding of a broad span of the mind’s activities
– what it computes
– why it computes it
At a high level, without
– specifics of sensory and motor systems
– specific representations and algorithms
– neural implementations
– language
What does the mind do?
Is there an overall, simple answer?
Marr’s 3 levels
Main Claims
• Mind is about predictions
– making predictions
– discovering what predictions can be made
Prediction Semantics
• Knowledge is predictions
– action-contingent and temporally-flexible predictions
– agent-centric, grounded in experience from the bottom up
• The mind’s ultimate goal is to make reward-maximizing decisions
– but most of its effort is devoted to the subgoal of prediction
• A few simple mechanisms enable working flexibly
with predictions
– TD learning and Bellman backups
Prediction Semantics
• A prediction is a signal with meaning
[Diagram: a new link from Y to the Prediction of X, and an existing link from the Prediction of X to the Response, alongside X]
• Knowing that one signal is a prediction of another
enables it to do useful work for you
• When something new predicts X, you know what to do
• Prediction semantics constrains in two directions
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
These together are much of what the mind does
Can we explain them all in a uniform way?
Pavlovian Conditioning,
the Conditioning of Reflexes
[Diagram: before learning, CS (Tone) is followed by US (Eye shock), which elicits UR (Eyeblink); after learning, CS (Tone) alone elicits CR (Eyeblink), with no US]
Almost any reflex can be conditioned:
salivation
orienting
heart rate, blood pressure
gill withdrawal
nausea, taste aversion
fear, secondary reinforcers
CER: freezing, suppression
neutral stimuli
• Animal can be viewed as learning that the CS predicts the US
• And then responding in anticipation
• But Why? Why should a prediction of the US produce the
same response as the US?
(Inadequate) Comp. Theories of CC
• Instrumental theories -- the CR makes the US feel
better
– Works well for eyeblink, salivation, not for 2ndary reinforcers
– Does not explain the similarity of CR and UR
– Does not explain apparent conflict of CC and instrumental
• Anticipation theories -- whatever you are going to do,
CC causes you to do it earlier
– Why earlier? Earlier is not always better!
– How much earlier? CR tends to occur at time of US
• Prediction theories -- CC is learning to predict the US
– Works for fear, CER, 2ndary reinforcers
– Does not explain the response (the CR or the UR)
– Explains “What” but not “Why”
Pred Rep’n Theory of Conditioning
The reflex is not US → Response:

[Diagram, crossed out: US → (reflex) → R, with a learnable CS → R link]

But Prediction of US → Response:

[Diagram: US and CS (via a learnable link) feed a Prediction of US, which produces R through the reflex. USs could habituate!]
Pred Rep’n Theory of Conditioning (2)
[Diagram: CS1, CS2, and the supervisory US (as a cue) all feed a prediction of the supervisory US, which produces the Response]
• Consider an innate, learnable association US → Response
– represents an innate guess, e.g., that a shock now is a good predictor of a shock coming up
– but could be wrong
– Predicts URs could habituate, change over time depending on
their relationship to themselves
– Long USs predict themselves
– Short USs are poor self-predictors
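To make the prediction-representation idea concrete, here is a minimal sketch (my own illustration, not code from the talk): the response is driven by a learned prediction of the US rather than by the US directly, and a simple delta-rule/TD-style update stands in for whatever learning mechanism the brain uses. All names and constants (alpha, the threshold, the initial "innate guess" weight) are assumptions.

```python
alpha = 0.3        # learning rate (arbitrary)
threshold = 0.5    # response is emitted when the prediction exceeds this
w = {"CS": 0.0, "US": 1.0}   # the "US" weight is the innate, but learnable, guess

def predict_us(stimuli):
    """Prediction of the US, given the currently present stimuli."""
    return sum(w[s] for s in stimuli)

def respond(stimuli):
    """The reflex fires off the prediction, not off the US directly."""
    return predict_us(stimuli) > threshold

def trial(stimuli, us_outcome):
    """One conditioning trial: stimuli are presented, then the US occurs (or not)."""
    error = us_outcome - predict_us(stimuli)
    for s in stimuli:
        w[s] += alpha * error          # delta-rule / TD-style update

# Pairing CS with US: the CS alone comes to elicit the response (a CR).
for _ in range(20):
    trial(["CS"], us_outcome=1.0)
print(respond(["CS"]))                 # True: conditioned response

# A US that is a poor predictor of further US loses its weight: the UR habituates.
for _ in range(20):
    trial(["US"], us_outcome=0.0)
print(respond(["US"]))                 # False: the UR has habituated
```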
Pred Rep’n Theory of Conditioning (3)
• Implications for response topography/generation
– predicts maximal CR at time of US onset (correct)
– predicts CR onset only so early as to enable this
– predicts threshold phenomena in CR production
– predicts interaction of threshold with relative effectiveness of reinforced and unreinforced trials
[Diagram: response topography, with the CR rising to a peak at the time of the US]
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
The Reward Hypothesis
That purposes can be adequately represented as
maximization of the cumulative sum of a scalar
reward signal received from the environment
• Is this reasonable?
• Is it demeaning?
• Is there no other choice?
• It seems to be adequate
and perhaps completely satisfactory
Reinforcement Learning Theory:
What to Compute and Why
• Policies
  π : States → Pr(Actions)
• Value Functions
  V^π : States → ℝ
  $V^{\pi}(s) = E\{\textstyle\sum_{t=1}^{\infty} \gamma^{t-1}\, \mathrm{reward}_t \mid s_0 = s,\ \text{follow } \pi\}$
• 1-Step Models
  $\Pr\{s_{t+1} \mid s_t, a_t\}$,  $E\{r_{t+1} \mid s_t, a_t\}$
  Predictions!
Honeybee Brain & VUM Neuron
Hammer, Menzel
The Acrobot Problem
Goal: Raise tip above line
e.g., DeJong & Spong, 1994; Sutton, 1995
[Diagram: two-link acrobot on a fixed base, with torque applied at the middle joint, joint angles q1 and q2, and the tip to be raised above the line]
Minimum–Time–to–Goal:
  4 state variables: 2 joint angles, 2 angular velocities
  CMAC of 48 layers
  RL same as Mountain Car
  Reward = -1 per time step
Prediction Semantics of RL
[Diagram: representations of state and action feed, via learned links, a Value, a prediction of reward; a fixed link ties the value to reward]
Action Selection: Pick the highest valued action
An action that predicts reward in a state...
should to that extent be favored in that state
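A one-line rendering of this action-selection rule, greedy with respect to the learned action values, with an optional bit of exploration. The Q table, state names, and epsilon are placeholders, not anything from the talk.

```python
import random

def select_action(Q, state, actions, epsilon=0.1):
    """Pick the highest-valued action in this state (with a little exploration)."""
    if random.random() < epsilon:
        return random.choice(actions)                        # occasionally explore
    return max(actions, key=lambda a: Q[(state, a)])         # otherwise, highest value wins

# Usage with a made-up value table:
Q = {("s1", "left"): 0.2, ("s1", "right"): 0.7}
print(select_action(Q, "s1", ["left", "right"], epsilon=0.0))   # -> "right"
```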
Examples of Reinforcement Learning
• Robocup Soccer Teams
Stone & Veloso, Riedmiller et al.
– World’s best player of simulated soccer, 1999; Runner-up 2000
• Inventory Management
Van Roy, Bertsekas, Lee & Tsitsiklis
– 10-15% improvement over industry standard methods
• Dynamic Channel Assignment
Singh & Bertsekas, Nie & Haykin
– World's best assigner of radio channels to mobile telephone calls
• Elevator Control
Crites & Barto
– (Probably) world's best down-peak elevator controller
• Many Robots
– navigation, bi-pedal walking, grasping, switching between skills...
• TD-Gammon and Jellyfish
Tesauro, Dahl
– World's best backgammon player
TD-Gammon
[Diagram: a multilayer network maps the backgammon position to a Value; learning is driven by the TD error V_{t+1} - V_t]
Tesauro, 1992-1995
Action selection by 2-3 ply search
Start with a random network
Play millions of games against itself
Learn a value function from this simulated experience
This produces arguably the best player in the world
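The TD error V_{t+1} - V_t shown above drives a simple update. Here is a minimal TD(0) sketch, tabular rather than TD-Gammon's neural network, with invented positions and an undiscounted episodic outcome standing in for the game result:

```python
from collections import defaultdict

def td0_episode(V, states, outcome, alpha=0.1):
    """Update value estimates from one episode.

    states:  the sequence of positions visited
    outcome: the final result (e.g., 1.0 for a win, 0.0 for a loss)
    The TD error is V_{t+1} - V_t, with the outcome standing in for the
    value of the terminal position (no discounting).
    """
    for t in range(len(states)):
        v_t = V[states[t]]
        v_next = V[states[t + 1]] if t + 1 < len(states) else outcome
        V[states[t]] += alpha * (v_next - v_t)   # learn a prediction from a prediction

# Usage with made-up positions:
V = defaultdict(float)
td0_episode(V, states=["pos1", "pos2", "pos3"], outcome=1.0)
print(dict(V))
```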
Prediction Semantics in TD-Gammon
• A prediction of winning can substitute for winning
– the central idea of Temporal-Difference (TD) learning
• learning a prediction from a prediction!
– also key idea of dynamic programming
– and all heuristic search
• In lookahead search, predictions are composed to
produce longer-term predictions
– key to all state-space planning
– suggests prediction semantics is key element of reasoning
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
Planning as RL over Mental Simulation
I.e., learning on model-generated experience:
1. Learn a model of the world’s transition dynamics
transition probabilities, expected immediate rewards
“1-step model” of the world
2. Use model to generate imaginary experiences
internal thought trials, mental simulation (Craik, 1943)
3. Apply RL as if experience had really happened
[Diagram: Reward, Policy, Value Function, 1-Step Model]
Dyna Algorithm
1. s ← current state
2. Choose an action, a, and take it
3. Receive next state, s’, and reward, r
4. Apply RL backup to s, a, s’, r
   e.g., Q-learning update
5. Update Model(s, a) with s’, r
6. Repeat k times:
   - select a previously seen state-action pair s, a
   - s’, r ← Model(s, a)
   - Apply RL backup to s, a, s’, r
7. Go to 1

[Diagram: value/policy and model, linked by direct RL, planning, model learning, acting, and experience]
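A compact Python sketch of the algorithm above, as standard tabular Dyna-Q. The environment interface (env.reset, env.step returning next state, reward, and a done flag) and all constants are assumptions made for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=100, k=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q, following steps 1-7 above.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)          # action values
    model = {}                      # Model(s, a) -> (s', r, done)

    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def backup(s, a, s2, r, terminal):
        target = r if terminal else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])        # Q-learning update

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = choose(s)                        # 2. choose an action and take it
            s2, r, done = env.step(a)            # 3. receive next state and reward
            backup(s, a, s2, r, done)            # 4. direct RL backup
            model[(s, a)] = (s2, r, done)        # 5. model learning
            for _ in range(k):                   # 6. planning: k imagined backups
                ps, pa = random.choice(list(model))      # previously seen (s, a)
                ps2, pr, pdone = model[(ps, pa)]
                backup(ps, pa, ps2, pr, pdone)
            s = s2                               # 7. continue from the new state
    return Q
```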
State-Space Search
is based on a Prediction Semantics
[Diagram of a lookahead tree: in seeking to evaluate this state, we use predictions from these successor states]
Prediction Semantics in Planning
is just like in TD-Gammon
• Predictions substitute for path outcomes
• Predictions are composed to predict consequences of
arbitrary sequences of action
Naïve RL Theory of Reason
Reason is RL on model-generated experience
[Diagram: Reward, Policy, Value Function, 1-Step Model]
• Pro:
– Very simple, uniform, general
– Sufficient to reproduce e.g., latent learning
• Con
– Seems too low-level
– Represents only a limited kind of knowledge
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
Experience
A mind interacts with its world
[Diagram: the Agent sends actions to the World and receives observations in return]
To produce two time series:
Actions:        …, a_{t-3}, a_{t-2}, a_{t-1}, a_t, … ?
Observations:   …, o_{t-3}, o_{t-2}, o_{t-1}, o_t, … ?
Experience
Experience is the data; it is all we really know
Experience provides something for knowledge to be about
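A minimal rendering of this interaction loop; the World here is a stub, and the only point is that experience is nothing but the two interleaved streams of actions and observations.

```python
import random

class World:
    """A stub black-box world: observations in response to actions."""
    def observe(self, action):
        return random.randint(0, 1)   # arbitrary observation

def run(agent_policy, world, steps=5):
    actions, observations = [], []    # the two time series: ..., a_t, ... and ..., o_t, ...
    obs = None
    for _ in range(steps):
        a = agent_policy(obs)
        actions.append(a)
        obs = world.observe(a)
        observations.append(obs)
    return actions, observations

acts, obs = run(lambda o: random.choice(["left", "right"]), World())
print(acts, obs)
```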
World Knowledge = Predictions
• The world is a black box, known only by its I/O
behavior (observations in response to actions)
• Therefore, all meaningful statements about the
world are statements about the observations it
generates
• The only observations worth talking about are
future ones
Therefore:
The only meaningful things to say about
the world are predictions
Predictions = statements about the joint distribution
of future observations and actions
Non-predictive “Knowledge”
• Mathematical knowledge, theorems and proofs
– always true, but tell us nothing about the world
– not world knowledge
• Uninterpreted signals, e.g., useful representations
– real and useful, but not by themselves world knowledge,
only an aid to acquiring it
• Knowledge of the past
• Policies
– could be viewed as predictions of value
– but by themselves are more like uninterpreted signals
Predictions capture “regular”, descriptive world knowledge
Every Prediction must be Grounded
in Two Directions
[Diagram: a prediction, e.g., “if I do action 1, then obs 12 will be 0 for three steps”, is grounded on one side in the history of actions & observations (recognition grounding, “symbol grounding”) and on the other in the future experience it predicts (prediction grounding, “prediction semantics”)]
Both Recognition and Prediction
Grounding are Needed
• “Classical” AI systems omit recognition grounding
– e.g., “Tweety is a bird”, “John loves Mary”
– sometimes called the “symbol grounding problem”
• Modern AI systems tend to skimp on prediction grounding
– supervised learning, Bayes nets, robotics…
• It is not OK to leave prediction grounding to external,
human observers
– the information is just not in the machine
– we don’t understand it; we haven’t done our job!
• Yet this is such an appealing shortcut that we have
almost always done it
Prediction Semantics
formalized as Macro-Actions
Sutton, Precup & Singh, AIJ 1999
Let π : States → Pr(Actions) be an arbitrary policy
Let β : States → Pr({0,1}) be a termination condition
Then macro-action <π, β> is a kind of experiment
– do π until β says “stop”
– measure something about the resulting experience
Suppose we measure
– the state at the end of the experiment
– the total reward during the experiment
Then the macro prediction for <π, β> would predict
Pr(end-state), E{total reward} given start-state
Predictions of this form can represent a lot...
...possibly all world knowledge
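A sketch of this formalization in Python: an option is a pair <π, β>, and its macro prediction (end-state distribution and expected total reward from a given start state) can be estimated simply by running the experiment many times. The env interface (env.set_state, env.step) is an assumption for illustration.

```python
import random
from collections import defaultdict

class Option:
    """A macro-action <pi, beta>: a policy plus a termination condition."""
    def __init__(self, pi, beta):
        self.pi = pi          # pi(state)   -> action
        self.beta = beta      # beta(state) -> probability of stopping here

def macro_prediction(env, option, start_state, n_runs=1000):
    """Estimate the option's prediction from start_state:
    Pr(end-state) and E{total reward} until beta says 'stop'.
    Assumes env.set_state(s) and env.step(a) -> (next_state, reward)."""
    end_counts = defaultdict(int)
    total_reward = 0.0
    for _ in range(n_runs):
        s = start_state
        env.set_state(s)
        while random.random() >= option.beta(s):    # run pi until beta terminates
            s, r = env.step(option.pi(s))
            total_reward += r
        end_counts[s] += 1
    pr_end = {s: c / n_runs for s, c in end_counts.items()}
    return pr_end, total_reward / n_runs
```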
Sutton, Precup, & Singh, 1999
Rooms Example
[Diagram: four-room gridworld with hallways, hallway options o1 and o2, and goals G1 and G2]
4 stochastic primitive actions (up, down, left, right); they fail 33% of the time
8 multi-step macro-actions (to each room's 2 hallways)
[Diagram: the policy of one macro-action, driving toward its target hallway]
Planning with Macro-Predictions
[Diagram: value iteration with V(goal)=1 at Iterations #0, #1, #2, with cell-to-cell primitive actions (top) and with room-to-room macro-actions (bottom)]
Learning Path-to-Goal with and
without Hallway Macro-Actions
[Plot: steps per episode (log scale, 10 to 1000) vs. episodes (10 to 10,000), comparing primitive actions only, macros & actions, and macros only]
Illustration: Reconnaissance
Mission Planning (Problem)
[Map: sites with rewards of roughly 5 to 100, a Base, about 100 decision steps, 8 options, and a mean time between weather changes of 25]
• Mission: Fly over (observe) most valuable sites and return to base
• Stochastic weather affects observability (cloudy or clear) of sites
• Limited fuel
• Intractable with classical optimal control methods
• Temporal scales:
  – Actions: which direction to fly now
  – Options: which site to head for
• Options compress space and time
  – Reduce steps from ~600 to ~6
  – Reduce states from ~10^11 to ~10^6
$Q_O^*(s, o) = r_s^o + \sum_{s'} p_{ss'}^o V_O^*(s')$
  where s is any state (~10^6) and s' ranges over sites only (6)
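Writing the backup in the equation above out as code: given option models r[(s, o)] (expected reward) and p[(s, o)] (discounted end-state probabilities), the option-value and state-value functions follow from a simple SMDP value iteration. A sketch only; the model dictionaries are assumed inputs rather than anything computed in the talk.

```python
def smdp_value_iteration(states, options, r, p, n_sweeps=100):
    """Value iteration over option models.

    r[(s, o)]      : expected (discounted) reward for running o from s
    p[(s, o)][s2]  : (discounted) probability of ending in s2
    Implements Q*_O(s, o) = r_s^o + sum_{s'} p^o_{ss'} V*_O(s').
    """
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        for s in states:
            V[s] = max(r[(s, o)] + sum(prob * V[s2] for s2, prob in p[(s, o)].items())
                       for o in options)
    Q = {(s, o): r[(s, o)] + sum(prob * V[s2] for s2, prob in p[(s, o)].items())
         for s in states for o in options}
    return V, Q
```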
Illustration: Reconnaissance
Mission Planning (Results)
• SMDP planner:
  – Assumes options followed to completion
  – Plans optimal SMDP solution
• SMDP planner with re-evaluation of options on each step:
  – Plans as if options must be followed to completion
  – But actually takes them for only one step
  – Re-picks a new option on every step
• Static planner:
  – Assumes weather will not change
  – Plans optimal tour among clear sites
  – Re-plans whenever weather changes
[Bar chart: expected reward per mission, roughly 30 to 60, for the SMDP planner, the SMDP re-planner with re-evaluation, and the static planner, under high and low fuel]
Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
Reason
Combining knowledge to obtain new knowledge,
flexibly and generally
We must be able to reason about any event as a
possible (sub)goal, not just about rewards
This is the final step
Subgoals
• Many natural macro-actions are goal-oriented
– E.g., drive-to-work, open-the-door
• So replicate planning in-miniature for each subgoal
• Macros can then be learned to achieve each subgoal
• Many can be learned at once, independently
– Solves classic problem of subgoal credit assignment
– Solves psychological puzzle of goal-oriented action
(rooms example)
• Models of such macros are goal-oriented recognizers
– correspond to classical “concepts”
– e.g., a “chair” state is one where sitting is predicted to work
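One way to picture the "many can be learned at once, independently" point: each subgoal g keeps its own prediction table, and every real transition updates all of them with a TD-style rule in which reaching g plays the role of that subgoal's reward. This is a simplified sketch under the behavior policy, with invented names, not the exact algorithm of the talk.

```python
from collections import defaultdict

def update_all_subgoals(predictions, s, s2, subgoals, alpha=0.1, gamma=0.9):
    """Update every subgoal's prediction from one real transition s -> s2.

    predictions[g][s] estimates the (discounted) likelihood of reaching
    subgoal g from s.  Reaching g acts as that subgoal's 'reward'; all
    subgoals learn in parallel from the same stream of experience.
    """
    for g in subgoals:
        target = 1.0 if s2 == g else gamma * predictions[g][s2]
        predictions[g][s] += alpha * (target - predictions[g][s])

# Usage with made-up states and subgoals:
subgoals = ["hallway1", "hallway2"]
predictions = {g: defaultdict(float) for g in subgoals}
update_all_subgoals(predictions, "roomA", "hallway1", subgoals)
print({g: dict(predictions[g]) for g in subgoals})
```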
Rooms Example
Independent learning of all 8 Subgoals
[Plots: RMS error in subgoal values for the upper and lower hallway subgoals, falling toward zero over 100,000 time steps; and two subgoal state values, with learned values approaching the ideal values]
All 8 hallway macros and predictions are learned accurately
and efficiently while actions are selected totally at random
Co-Existence of Hedonism
and Exploration/Constructivism
• The ultimate goal is still reward
• Still one primary policy and set of values
• But many other policies, values, and predictions are
learned not directly in service of reward
• Most time is spent in exploration and discovery,
gaining knowledge rather than reward:
– What possibilities does the world afford?
– How can I control and predict it in a variety of ways?
– What concepts can be learned that might help later?
• From hedonism to curiosity and constructivism
Main Claims
• Mind is about predictions
– making predictions
– discovering what predictions can be made
Prediction Semantics
• Knowledge is predictions
– action-contingent and temporally-flexible predictions
– agent-centric, grounded in experience from the bottom up
• The mind’s ultimate goal is to make reward-maximizing decisions
– but most of its effort is devoted to the subgoal of prediction
• A few simple mechanisms enable working flexibly
with predictions
– TD learning and Bellman backups
What is New?
• The formalization of macro-actions
– provide temporal abstraction
– as well as action contingency (experiments)
– mesh seamlessly with learning and planning methods
• Using the goal-oriented machinery of RL
– for knowledge construction
– for perceptual concepts
• Taking the discipline of predictive knowledge seriously
– speaking only in terms of the subjective, experiential data
Should Knowledge be Experiential?
Allowing only Predictions in terms of Data?
loses
• Expressiveness
– can’t talk about objects, space, people; no “is-a” or “part-of”
• External (human) coherence
– verbal labels, interpretability, explainability, calibration
– the “shortcut” of entering knowledge directly into the agent
gains
• The knowledge will have meaning to the machine
• It can be mechanically learned/verified/extended
• It will be suited for general reasoning processes
– composition and backup of predictions to yield new predictions