Transcript IntroAI_14x

Reinforcement Learning (2)
Lirong Xia
Tue, March 21, 2014
Reminder
• Project 2 due tonight
• Project 3 is online (more later)
– due in two weeks
1
Recap: MDPs
• Markov decision processes:
– States S
– Start state s0
– Actions A
– Transition p(s'|s,a) (or T(s,a,s'))
– Reward R(s,a,s') (and discount γ)
• MDP quantities:
– Policy = Choice of action for each (MAX) state
– Utility (or return) = sum of discounted rewards
2
Optimal Utilities
• The value of a state s:
– V*(s) = expected utility starting in s and
acting optimally
• The value of a Q-state (s,a):
– Q*(s,a) = expected utility starting in s,
taking action a and thereafter acting
optimally
• The optimal policy:
– π*(s) = optimal action from state s
3
Solving MDPs
• Value iteration
– Start with V1(s) = 0
– Given Vi, calculate the values for all states for depth i+1:
V i 1  s   m ax  T  s , a , s '   R  s , a , s '    V i  s '  
a
s'
– Repeat until converge
– Use Vi as evaluation function when computing Vi+1
• Policy iteration
– Step 1: policy evaluation: calculate utilities for some fixed
policy
– Step 2: policy improvement: update policy using one-step
look-ahead with resulting utilities as future values
– Repeat until policy converges
4
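The batch value-iteration update above can be sketched in a few lines of Python. The toy MDP below (the states, actions, and transition table T) is a made-up illustration, not the project's GridWorld API; a real implementation would read these quantities from the MDP object.

```python
# A minimal value-iteration sketch. GAMMA and the toy transition table T are
# illustrative assumptions; T[(s, a)] lists (next_state, probability, reward).
GAMMA = 0.9
T = {
    ("A", "go"):   [("B", 1.0, 0.0)],
    ("B", "go"):   [("exit", 1.0, 1.0)],
    ("A", "stay"): [("A", 1.0, 0.0)],
    ("B", "stay"): [("B", 1.0, 0.0)],
}
states = ["A", "B", "exit"]
actions = ["go", "stay"]

def value_iteration(iterations=100):
    V = {s: 0.0 for s in states}                  # start with V1(s) = 0
    for _ in range(iterations):
        newV = {}
        for s in states:
            q_values = [sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[(s, a)])
                        for a in actions if (s, a) in T]
            newV[s] = max(q_values) if q_values else 0.0   # terminal state: no actions
        V = newV                                  # batch update: Vi is used to build Vi+1
    return V

print(value_iteration())   # e.g. {'A': 0.9, 'B': 1.0, 'exit': 0.0}
```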
Reinforcement learning
• Don’t know T and/or R, but can observe R
– Learn by doing
– can have multiple episodes (trials)
5
The Story So Far: MDPs and RL
Things we know how to do (technique in parentheses):
• If we know the MDP
– Compute V*, Q*, π* exactly (computation: value and policy iteration)
– Evaluate a fixed policy π (policy evaluation)
• If we don’t know the MDP
– We can estimate the MDP, then solve it (model-based RL)
– We can estimate V for a fixed policy π (sampling)
– We can estimate Q*(s,a) for the optimal policy while executing an exploration policy (model-free RL: Q-learning)
6
Model-Free Learning
• Model-free (temporal difference) learning
– Experience world through episodes
(s,a,r,s’,a’,r’,s’’,a’’,r’’,s’’’…)
– Update estimates each transition (s,a,r,s’)
– Over time, updates will mimic Bellman updates
Q-Value Iteration (model-based, requires a known MDP):
Q_{i+1}(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q_i(s',a') \right]
Q-Learning (model-free, requires only experienced transitions):
Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right]
7
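The "experience world through episodes, update each transition" loop can be sketched as follows. The `env` object with its reset/step/is_terminal methods and the `choose_action` and `update` callbacks are hypothetical stand-ins, not the project's API.

```python
# Sketch of the model-free learning loop: run an episode and apply an update
# on every observed transition (s, a, r, s').
def run_episode(env, choose_action, update):
    s = env.reset()
    while not env.is_terminal(s):
        a = choose_action(s)
        s_next, r = env.step(s, a)      # the environment supplies r and s'
        update(s, a, r, s_next)         # e.g. the Q-learning update above
        s = s_next
```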
Detour: Q-Value Iteration
• We’d like to do Q-value updates to each Q-state:
Q i 1  s , a    T  s , a , s '   R  s , a , s '    m ax Q i  s ', a '  
a'


s'
– But can’t compute this update without knowing T,R
• Instead, compute average as we go
– Receive a sample transition (s,a,r,s’)
– This sample suggests
Q(s,a) \approx r + \gamma \max_{a'} Q(s',a')
– But we want to merge the new observation with the old ones
– So keep a running average
Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right]
8
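A minimal sketch of this running-average update, assuming a plain dictionary of Q-values and illustrative ALPHA/GAMMA constants (the project supplies its own data structures and parameters):

```python
# Sketch of the tabular Q-learning update from an observed transition (s, a, r, s').
ALPHA, GAMMA = 0.1, 0.9
Q = {}   # (state, action) -> estimated Q-value; missing entries default to 0.0

def q_update(s, a, r, s_next, legal_next_actions):
    """Blend the old estimate with the new sample r + gamma * max_a' Q(s', a')."""
    sample = r + GAMMA * max((Q.get((s_next, a2), 0.0) for a2 in legal_next_actions),
                             default=0.0)        # default covers a terminal s'
    Q[(s, a)] = (1 - ALPHA) * Q.get((s, a), 0.0) + ALPHA * sample

# Example: one observed transition
q_update("A", "go", 0.0, "B", ["go", "stay"])
```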
Q-Learning Properties
• Will converge to optimal policy
– If you explore enough (i.e. visit each q-state many times)
– If you make the learning rate small enough
– Basically doesn’t matter how you select actions (!)
• Off-policy learning: learns optimal q-values, not the
values of the policy you are following
9
Q-Learning
• Q-learning produces tables of q-values:
10
Exploration / Exploitation
• Random actions (ε-greedy); see the sketch after this slide
– Every time step, flip a coin
– With probability ε, act randomly
– With probability 1-ε, act according to current
policy
11
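A minimal sketch of ε-greedy action selection; EPSILON, the Q dictionary, and the legal_actions list are illustrative assumptions rather than the project's API.

```python
import random

EPSILON = 0.1

def epsilon_greedy(state, legal_actions, Q):
    """With probability EPSILON act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < EPSILON:
        return random.choice(legal_actions)                          # explore
    return max(legal_actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```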
Today: Q-Learning with state
abstraction
• In realistic situations, we cannot possibly learn
about every single state!
– Too many states to visit them all in training
– Too many states to hold the Q-tables in memory
• Instead, we want to generalize:
– Learn about some small number of training states from
experience
– Generalize that experience to new, similar states
– This is a fundamental idea in machine learning, and we’ll
see it over and over again
12
Example: Pacman
• Let’s say we discover through
experience that this state is
bad:
• In naive Q-learning, we know
nothing about this state or its
Q-states:
• Or even this one!
13
Feature-Based Representations
• Solution: describe a state using
a vector of features (properties)
– Features are functions from states
to real numbers (often 0/1) that
capture important properties of the
state
– Example features:
• Distance to closest ghost
• Distance to closest dot
• Number of ghosts
• 1 / (distance to closest dot)^2
• Is Pacman in a tunnel? (0/1)
• Is it the exact state on this slide?
• …etc.
– Similar to an evaluation function
– Can also describe a Q-state (s,a) with features (e.g. the action moves closer to food); a feature-extractor sketch follows this slide
14
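A sketch of what such a feature extractor for a Q-state (s,a) might look like. The state attributes used here (successor, ghost_positions, food) are hypothetical stand-ins for whatever the project's real GameState API provides.

```python
def manhattan(p, q):
    # Manhattan distance between two (x, y) grid positions
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def extract_features(state, action):
    """Map a Q-state (state, action) to a dict of named real-valued features."""
    next_pos = state.successor(action)            # hypothetical: position after acting
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": min(manhattan(next_pos, g) for g in state.ghost_positions),
        "dist-to-closest-dot": min(manhattan(next_pos, d) for d in state.food),
        "num-ghosts": float(len(state.ghost_positions)),
    }
```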
Linear Feature Functions
• Using a feature representation, we can write a Q
function (or value function) for any state using a few
weights:
Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots + w_n f_n(s,a)
V(s) = w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s)
• Advantage: our experience is summed up in a few
powerful numbers
• Disadvantage: states may share features but
actually be very different in value!
15
Function Approximation
Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots + w_n f_n(s,a)
• Q-learning with linear Q-functions:
transition = (s,a,r,s')
difference = \left[ r + \gamma \max_{a'} Q(s',a') \right] - Q(s,a)
Exact Q's:        Q(s,a) \leftarrow Q(s,a) + \alpha \, [\text{difference}]
Approximate Q's:  w_i \leftarrow w_i + \alpha \, [\text{difference}] \, f_i(s,a)
• Intuitive interpretation:
– Adjust weights of active features
– E.g. if something unexpectedly bad happens, disprefer all
states with that state’s features
• Formal justification: online least squares (a weight-update sketch follows this slide)
16
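A minimal sketch of Q-learning with linear Q-functions. It reuses the hypothetical extract_features from the feature slide, and ALPHA/GAMMA are illustrative constants rather than the project's defaults.

```python
ALPHA, GAMMA = 0.01, 0.9
weights = {}   # feature name -> weight

def q_value(state, action):
    """Q(s,a) = sum_i w_i * f_i(s,a) under the current weights."""
    feats = extract_features(state, action)
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def update(state, action, reward, next_state, legal_next_actions):
    """w_i <- w_i + alpha * difference * f_i(s,a) for every active feature."""
    best_next = max((q_value(next_state, a) for a in legal_next_actions),
                    default=0.0)                     # 0 if s' has no legal actions
    difference = (reward + GAMMA * best_next) - q_value(state, action)
    for f, v in extract_features(state, action).items():
        weights[f] = weights.get(f, 0.0) + ALPHA * difference * v
```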
Example: Q-Pacman
Initial weights:   Q(s,a) = 4.0 f_DOT(s,a) - 1.0 f_GST(s,a)
Features of the observed Q-state:   f_DOT(s,NORTH) = 0.5,   f_GST(s,NORTH) = 1.0,   so Q(s,NORTH) = +1
Pacman takes NORTH from s, reaching s' with r = R(s,a,s') = -500
difference = \left[ r + \gamma \max_{a'} Q(s',a') \right] - Q(s,NORTH) = [-500 + 0] - 1 = -501
Weight updates:   w_DOT \leftarrow 4.0 + \alpha[-501](0.5),   w_GST \leftarrow -1.0 + \alpha[-501](1.0)
New approximate Q-function (with \alpha = 0.004):   Q(s,a) = 3.0 f_DOT(s,a) - 3.0 f_GST(s,a)
17
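A quick arithmetic check of the example above in Python; α = 0.004 is the value implied by the updated weights, not a constant stated in the transcript.

```python
# Check the Q-Pacman weight updates numerically.
alpha, gamma = 0.004, 0.9
w_dot, w_gst = 4.0, -1.0
f_dot, f_gst = 0.5, 1.0

q_sa = w_dot * f_dot + w_gst * f_gst            # = +1.0
difference = (-500 + gamma * 0.0) - q_sa        # = -501.0 (the -501 implies max_a' Q(s',a') = 0)
w_dot += alpha * difference * f_dot             # -> ~3.0
w_gst += alpha * difference * f_gst             # -> ~-3.0
print(q_sa, difference, round(w_dot, 3), round(w_gst, 3))
```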
Linear Regression
prediction:   \hat{y} = w_0 + w_1 f_1(x)
prediction:   \hat{y} = w_0 + w_1 f_1(x) + w_2 f_2(x)
18
Ordinary Least Squares (OLS)
total error = \sum_i \left( y_i - \hat{y}_i \right)^2 = \sum_i \left( y_i - \sum_k w_k f_k(x_i) \right)^2
19
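A small sketch of ordinary least squares on made-up data: find the weights w that minimize the total squared error above. The data and feature choice here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
F = np.column_stack([np.ones_like(x), x])        # features: f_0 = 1 (bias), f_1 = x
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)    # noisy linear target

w, *_ = np.linalg.lstsq(F, y, rcond=None)        # minimizes sum_i (y_i - F_i @ w)^2
print(w)                                         # approximately [3.0, 2.0]
```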
Minimizing Error
Imagine we had only one point x with features f(x):
error(w) = \frac{1}{2} \left( y - \sum_k w_k f_k(x) \right)^2
\frac{\partial \, error(w)}{\partial w_m} = -\left( y - \sum_k w_k f_k(x) \right) f_m(x)
w_m \leftarrow w_m + \alpha \left( y - \sum_k w_k f_k(x) \right) f_m(x)
Approximate Q update explained, with "target" r + \gamma \max_{a'} Q(s',a') and "prediction" Q(s,a):
w_m \leftarrow w_m + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] f_m(s,a)
20
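A numeric check of the single-point gradient step above, with made-up numbers for the target, features, and learning rate.

```python
# One gradient step on a single data point: w_m <- w_m + alpha * (y - prediction) * f_m(x)
alpha = 0.1
y = 2.0                          # target
f = [1.0, 0.5]                   # features f_1(x), f_2(x)
w = [0.0, 0.0]                   # weights

prediction = sum(wk * fk for wk, fk in zip(w, f))
residual = y - prediction                         # (y - sum_k w_k f_k(x)) = 2.0
w = [wk + alpha * residual * fk for wk, fk in zip(w, f)]
print(w)                                          # [0.2, 0.1]: each weight moves toward the target
```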
How many features should we use?
• As many as possible?
– computational burden
– overfitting
• Feature selection is important
– requires domain expertise
21
Overfitting
22
The Elements of Statistical Learning, Fig. 2.11.
23
Overview of Project 3
• MDPs
– Q1: value iteration
– Q2: find parameters that lead to a certain optimal policy
– Q3: similar to Q2
• Q-learning
– Q4: implement the Q-learning algorithm
– Q5: implement ε-greedy action selection
– Q6: try the algorithm
• Approximate Q-learning and state abstraction
– Q7: Pacman
– Q8 (bonus): design and implement a state-abstraction Q-learning algorithm
• Hints
– make your implementation general
– try to use class methods as much as possible
24