more from Thursday (modified from Dan Klein's)
Reinforcement Learning
Known:
•Current state
•Available actions
•Experienced rewards
Unknown:
•Transition model
•Reward structure
Assumed:
•Markov transitions
•Fixed reward for (s,a,s')
Problem: Find values for fixed policy (policy evaluation)
Model-based learning: Learn the model, solve for values
Model-free learning: Solve for values directly (by sampling)
1
The Story So Far: MDPs and RL
Things we know how to do, and the techniques for them:
If we know the MDP (model-based DPs):
•Compute V*, Q*, π* exactly (value and policy iteration)
•Evaluate a fixed policy π (policy evaluation)
If we don't know the MDP:
•Estimate the MDP, then solve it (model-based RL)
•Estimate V for a fixed policy π (model-free RL: value learning)
•Estimate Q*(s,a) for the optimal policy while executing an exploration policy (model-free RL: Q-learning)
2
Recap: Optimal Utilities
The utility of a state s:
V*(s) = expected utility starting in s and acting optimally
The utility of a q-state (s,a):
Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
[Diagram: s is a state, (s, a) is a q-state, (s,a,s') is a transition]
The optimal policy:
π*(s) = optimal action from state s
3
Temporal-Difference Learning
Big idea: learn from every experience!
Update V(s) each time we experience (s,a,s’,r)
Likely s’ will contribute updates more often
Temporal difference learning
Policy still fixed!
Move values toward value of whatever
successor occurs: running average!
Sample of V(s): sample = R(s,π(s),s') + γ V(s')
Update to V(s): V(s) ← (1-α) V(s) + α · sample
Same update, rewritten: V(s) ← V(s) + α · (sample − V(s))
5
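The same update loop, as a minimal Python sketch; the environment interface (env.reset, env.step) and the policy function are assumptions for illustration, not part of the lecture:

```python
from collections import defaultdict

def td_policy_evaluation(env, policy, episodes=1000, alpha=0.5, gamma=1.0):
    V = defaultdict(float)                      # V(s), all values start at 0
    for _ in range(episodes):
        s = env.reset()                         # start state
        done = False
        while not done:
            a = policy(s)                       # fixed policy pi(s)
            s_next, r, done = env.step(a)       # experience (s, a, s', r)
            sample = r + gamma * V[s_next]      # sample of V(s)
            V[s] += alpha * (sample - V[s])     # move V(s) toward the sample
            s = s_next
    return V
```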
Example: TD Policy Evaluation
Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
Take γ = 1, α = 0.5
7
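As a check on the update rule (assuming all values are initialized to 0): the first transition of Episode 1 is (1,1) up -1, landing in (1,2), so V(1,1) ← (1−0.5)·0 + 0.5·(−1 + 1·0) = −0.5.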
Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation
However, if we want to turn values into a (new) policy, we're sunk:
π(s) = argmax_a Q(s,a), where Q(s,a) = Σs' T(s,a,s') [ R(s,a,s') + γ V(s') ]
Extracting a policy from values requires a one-step lookahead, which needs T and R
Idea: learn Q-values directly
Makes action selection model-free too!
8
Active Learning
Full reinforcement learning
You don’t know the transitions T(s,a,s’)
You don’t know the rewards R(s,a,s’)
You can choose any actions you like
Goal: learn the optimal policy
… what value iteration did!
In this case:
Learner makes choices!
Fundamental tradeoff: exploration vs. exploitation
This is NOT offline planning! You actually take actions in the world and find out what happens…
9
Model-Based Active Learning
In general, we want to learn the optimal policy, not evaluate a fixed policy
Idea: adaptive dynamic programming (ADP)
Learn an initial model of the environment: estimate T(s,a,s') and R(s,a,s') from experience
Solve for the optimal policy for this model (value or policy iteration)
Refine the model through experience and repeat
Crucial: we have to make sure we actually learn about all of the model
10
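A sketch of the model-estimation step in Python; the transition-list input and dictionary shapes are illustrative assumptions:

```python
from collections import defaultdict

def estimate_model(transitions):
    """transitions: observed (s, a, s_next, r) tuples collected while acting."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = visit count
    rewards = {}                                     # rewards[(s, a, s')] = observed r
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r                  # fixed-reward assumption from the setup
    T = {}                                           # T[(s, a)][s'] = empirical probability
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        T[sa] = {s_next: n / total for s_next, n in outcomes.items()}
    return T, rewards
```

ADP would then solve this estimated MDP with value or policy iteration, gather more experience under the resulting policy, re-estimate, and repeat.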
Example: Greedy ADP
Imagine we find the lower path to the good exit first
Some states will never be visited following this policy from (1,1)
We'll keep re-using this policy, because following it never explores the regions of the model we need to learn the optimal policy
11
What Went Wrong?
Problem with following the optimal policy for the current model:
We never learn about better regions of the space if the current policy neglects them
Fundamental tradeoff: exploration vs. exploitation
Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility
Exploitation: once the true optimal policy is learned, exploration reduces utility
Systems must explore in the beginning and exploit in the limit
12
Detour: Q-Value Iteration
Value iteration: find successive approximations to the optimal values
Start with V0*(s) = 0, which we know is right (why?)
Given Vi*, calculate the values for all states for depth i+1:
Vi+1*(s) ← max_a Σs' T(s,a,s') [ R(s,a,s') + γ Vi*(s') ]
But Q-values are more useful!
Start with Q0*(s,a) = 0, which we know is right (why?)
Given Qi*, calculate the q-values for all q-states for depth i+1:
Qi+1*(s,a) ← Σs' T(s,a,s') [ R(s,a,s') + γ max_a' Qi*(s',a') ]
13
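A sketch of Q-value iteration, assuming the model is available as dictionaries T[(s,a)] = {s': probability} and R[(s,a,s')] = reward (the same illustrative shapes as the estimation sketch above):

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    # Q0*(s,a) = 0 for all q-states
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iterations):
        Q_new = {}
        for s in states:
            for a in actions:
                total = 0.0
                for s_next, prob in T.get((s, a), {}).items():
                    best_next = max(Q[(s_next, a2)] for a2 in actions)
                    total += prob * (R[(s, a, s_next)] + gamma * best_next)
                Q_new[(s, a)] = total      # one depth-(i+1) backup for this q-state
        Q = Q_new
    return Q
```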
Q-Learning
We'd like to do Q-value updates to each Q-state:
Qi+1*(s,a) ← Σs' T(s,a,s') [ R(s,a,s') + γ max_a' Qi*(s',a') ]
But we can't compute this update without knowing T, R
Instead, compute the average as we go
Receive a sample transition (s,a,r,s')
This sample suggests Q(s,a) ≈ r + γ max_a' Q(s',a')
But we want to average over results from (s,a) (Why?)
So keep a running average: Q(s,a) ← (1-α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]
14
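The running-average update as a small Python sketch; the dictionary-of-q-values representation is an assumption for illustration:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0, done=False):
    # max_a' Q(s', a'); terminal states contribute no future value
    best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    sample = r + gamma * best_next                                   # what this transition suggests
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample    # running average
```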
Q-Learning Properties
Will converge to optimal policy
If you explore enough (i.e. visit each q-state many times)
If you make the learning rate small enough
Basically doesn’t matter how you select actions (!)
Off-policy learning: learns optimal q-values, not the values of the policy you are following
15
Exploration / Exploitation
Several schemes for forcing exploration
Simplest: random actions (ε-greedy)
Every time step, flip a coin
With probability ε, act randomly
With probability 1-ε, act according to the current policy
Regret: expected gap between rewards during learning and rewards from optimal action
Q-learning with random actions will converge to optimal values, but possibly very slowly, and will get low rewards on the way
Results will be optimal but regret will be large
How to make regret small?
16
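A minimal sketch of ε-greedy action selection over a q-value table (the dictionary representation is assumed):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore: random action
    return max(actions, key=lambda a: Q.get((s, a), 0.0))      # exploit: best current estimate
```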
Exploration Functions
When to explore
Random actions: explore a fixed amount
Better ideas: explore areas whose badness is not (yet) established, explore less over time
One way: exploration function
Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
(exact form not important)
17
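A sketch of one possible exploration function of this form; the constant k and the guard for unvisited q-states are illustrative choices:

```python
def exploration_value(q_estimate, visit_count, k=1.0):
    # f(u, n) = u + k/n: the bonus shrinks as the q-state is visited more often
    return q_estimate + k / max(visit_count, 1)   # guard against n = 0 for never-tried q-states
```

In the Q-learning sample, max_a' Q(s',a') would then be replaced by max_a' f(Q(s',a'), N(s',a')), where N counts visits to each q-state.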
Q-Learning
Q-learning produces tables of q-values:
18
Q-Learning
In realistic situations, we cannot possibly learn about every single state!
Too many states to visit them all in training
Too many states to hold the q-tables in memory
Instead, we want to generalize:
Learn about some small number of training states from experience
Generalize that experience to new, similar states
This is a fundamental idea in machine learning, and we'll see it over and over again
19
Example: Pacman
Let's say we discover through experience that this state is bad:
In naïve Q-learning, we know nothing about this state or its q-states:
Or even this one!
20
Feature-Based Representations
Solution: describe a state using a vector of features (properties)
Features are functions from states to real numbers (often 0/1) that capture important properties of the state
Example features:
Distance to closest ghost
Distance to closest dot
Number of ghosts
1 / (dist to dot)²
Is Pacman in a tunnel? (0/1)
… etc.
Is it the exact state on this slide?
Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
21
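A sketch of what such a feature extractor might look like for a Pacman-like state; the state fields (position, ghost_positions, food) are hypothetical names for illustration:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def extract_features(state):
    # Hypothetical state fields: state.position, state.ghost_positions, state.food
    ghost_dist = min(manhattan(state.position, g) for g in state.ghost_positions)
    dot_dist = min(manhattan(state.position, f) for f in state.food)
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": float(ghost_dist),
        "num-ghosts": float(len(state.ghost_positions)),
        "inverse-dist-to-dot-squared": 1.0 / (dot_dist ** 2) if dot_dist else 1.0,
    }
```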
Linear Feature Functions
Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
V(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s)
Q(s,a) = w1·f1(s,a) + w2·f2(s,a) + … + wn·fn(s,a)
Advantage: our experience is summed up in a few powerful numbers
Disadvantage: states may share features but actually be very different in value!
22
Function Approximation
Q-learning with linear q-functions:
Observe a transition (s,a,r,s') and compute: difference = [ r + γ max_a' Q(s',a') ] − Q(s,a)
Exact Q's: Q(s,a) ← Q(s,a) + α · difference
Approximate Q's: wi ← wi + α · difference · fi(s,a)
Intuitive interpretation:
Adjust weights of active features
E.g. if something unexpectedly bad happens, disprefer all states with that state's features
Formal justification: online least squares
23
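Putting the pieces together, a sketch of approximate Q-learning with a linear q-function; feat_fn is an assumed feature function over q-states (s, a), along the lines of the earlier feature-extractor sketch:

```python
def q_value(weights, features):
    # Q(s,a) = sum_i w_i * f_i(s,a) for a linear q-function
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def approx_q_update(weights, feat_fn, s, a, r, s_next, actions,
                    alpha=0.01, gamma=0.9, done=False):
    best_next = 0.0 if done else max(q_value(weights, feat_fn(s_next, a2)) for a2 in actions)
    difference = (r + gamma * best_next) - q_value(weights, feat_fn(s, a))
    for f, v in feat_fn(s, a).items():
        weights[f] = weights.get(f, 0.0) + alpha * difference * v   # adjust active features
    return weights
```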
Example: Q-Pacman
24
Linear Regression
[Figures omitted: example data with regression fits; panel label: Prediction]
25
Overfitting
[Figure: data fit with a degree-15 polynomial, illustrating overfitting]
Policy Search
29
Policy Search
Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
We'll see this distinction between modeling and prediction again later in the course
Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
This is the idea behind policy search, such as what controlled the upside-down helicopter
30
Policy Search
Simplest policy search:
Start with an initial linear value function or q-function
Nudge each feature weight up and down and see if your policy is better than before
Problems:
How do we tell the policy got better?
Need to run many sample episodes!
If there are a lot of features, this can be impractical
31
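A crude sketch of this nudge-and-evaluate search; evaluate_policy is an assumed helper that runs sample episodes under the induced policy and returns the average reward:

```python
import random

def hill_climb(weights, evaluate_policy, step=0.1, iterations=100):
    best_score = evaluate_policy(weights)
    for _ in range(iterations):
        f = random.choice(list(weights.keys()))
        candidate = dict(weights)
        candidate[f] += random.choice([+step, -step])     # nudge one feature weight up or down
        score = evaluate_policy(candidate)                # expensive: needs many sample episodes
        if score > best_score:
            weights, best_score = candidate, score        # keep changes that improve the policy
    return weights
```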