CS 5368: Artificial Intelligence
Fall 2010
Lecture 12: MDP + RL (Part 2)
10/14/2010
Mohan Sridharan
Slides adapted from Dan Klein
1
Recap: Reinforcement Learning
Basic idea:
Receive feedback in the form of rewards.
Agent’s utility is defined by the reward function.
Must learn to act so as to maximize expected rewards.
2
Reinforcement Learning
Known:
•Current state
•Available actions
•Experienced rewards
Unknown:
•Transition model
•Reward structure
Assumed:
•Markov transitions
•Fixed reward for (s,a,s’)
Problem: Find values for a fixed policy π (policy evaluation):
Model-based learning: Learn the model, solve for values.
Model-free learning: Solve for values directly (by sampling).
3
The Story So Far: MDPs and RL
Things we know how to do, and the corresponding techniques:
If we know the MDP:
Compute V*, Q*, π* exactly: value and policy iteration.
Evaluate a fixed policy π: policy evaluation.
If we don’t know the MDP:
Model-based RL: estimate the MDP from experience, then solve it.
Model-free RL:
Estimate V for a fixed policy π: value learning.
Estimate Q*(s,a) for the optimal policy while executing an exploration policy: Q-learning.
4
Passive Learning
Simplified task:
You do not know the transitions T(s,a,s’).
You do not know the rewards R(s,a,s’).
You are given a policy π(s).
Goal: learn the state values (and the model?). Policy evaluation.
In this case:
Learner “along for the ride”.
No choice about what actions to take: just execute the policy and learn from experience.
This is NOT offline planning!
5
Active Learning
More complex task.
R and T are still unknown, and the agent does not have a fixed policy that determines its behavior!
Must learn what actions to take.
Same set of algorithms can be modified to address the challenges.
We begin with passive model-based and model-free methods.
6
Example: Direct Estimation
Episodes (grid world; exit rewards +100 at (4,3) and -100 at (4,2)):
Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
With γ = 1 and living reward R = -1, average the observed returns:
V(2,3) ≈ (96 + -103) / 2 = -3.5
V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3
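To make the averaging concrete, here is a minimal sketch of direct estimation (every-visit averaging of observed returns), assuming episodes are given as lists of (state, action, reward) steps and γ = 1 as on the slide; function and variable names are illustrative, not the lecture's code.

from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    """Estimate V(s) by averaging the observed returns for every visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Accumulate the return following each step by summing rewards backwards.
        G = 0.0
        visits = []  # (state, return-from-that-state) pairs
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            visits.append((state, G))
        for state, G in visits:
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# The two episodes from this slide (the exit transition ends each episode).
episode1 = [((1,1),'up',-1), ((1,2),'up',-1), ((1,2),'up',-1), ((1,3),'right',-1),
            ((2,3),'right',-1), ((3,3),'right',-1), ((3,2),'up',-1),
            ((3,3),'right',-1), ((4,3),'exit',100)]
episode2 = [((1,1),'up',-1), ((1,2),'up',-1), ((1,3),'right',-1), ((2,3),'right',-1),
            ((3,3),'right',-1), ((3,2),'up',-1), ((4,2),'exit',-100)]

V = direct_estimation([episode1, episode2])
print(V[(2,3)], V[(3,3)])  # -3.5 and roughly 31.3, matching the slide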
7
Model-Based Learning
Idea:
Learn the model empirically through experience.
Solve for values as if the learned model were correct.
Simple empirical model learning:
Count outcomes for each s, a.
Normalize to give estimate of T(s, a, s’).
Discover R(s, a, s’) when we experience (s, a, s’).
Solving the MDP with the learned model:
Iterative policy evaluation, for example:
V^π_{i+1}(s) ← Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π_i(s’) ]
8
Example: Model-Based Learning
Episodes (same grid world; exit rewards +100 at (4,3) and -100 at (4,2)):
Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
With γ = 1, counting and normalizing the observed outcomes gives:
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2
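A minimal sketch of this counting step, assuming experience arrives as (s, a, s’, r) transitions; the names are illustrative, and the learned T and R would then be handed to a standard MDP solver such as the policy-evaluation update above.

from collections import defaultdict

def learn_model(transitions):
    """Estimate T(s,a,s') by normalized counts and record R(s,a,s') as observed."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    rewards = {}                                     # (s, a, s') -> observed reward
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r                  # fixed-reward assumption
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s_next, c in outcomes.items():
            T[(s, a, s_next)] = c / total
    return T, rewards

# The three observed (3,3)-right transitions from the episodes above:
sample = [((3,3),'right',(3,2),-1), ((3,3),'right',(3,2),-1), ((3,3),'right',(4,3),-1)]
T, R = learn_model(sample)
print(T[((3,3), 'right', (4,3))])   # 1/3, matching the slide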
9
Model-Free Learning
Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x).
Model-based: estimate P(x) from samples, compute expectation.
Model-free: estimate expectation directly from samples.
Why does this work? Because samples appear with the right frequencies!
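As a tiny illustration of the two estimates (the distribution and the function f below are arbitrary choices for this example, not from the lecture):

import random

def model_based_estimate(samples, f):
    # Estimate P(x) from counts, then compute the weighted expectation.
    counts = {}
    for x in samples:
        counts[x] = counts.get(x, 0) + 1
    n = len(samples)
    return sum((c / n) * f(x) for x, c in counts.items())

def model_free_estimate(samples, f):
    # Average f over the samples directly; the frequencies enter implicitly.
    return sum(f(x) for x in samples) / len(samples)

samples = [random.choice([0, 0, 0, 1]) for _ in range(10000)]  # P(1) = 0.25
f = lambda x: 10 * x
# Both estimates approach E[f(x)] = 2.5 and agree up to floating point.
print(model_based_estimate(samples, f), model_free_estimate(samples, f))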
10
Sample-Based Policy Evaluation?
Who needs T and R? Approximate the expectation with samples of the successor state s’ (drawn from T):
sample_k = R(s, π(s), s_k’) + γ V^π_i(s_k’),   V^π_{i+1}(s) ← (1/n) Σ_k sample_k
Almost! But we only actually make progress when we move to i+1.
11
Temporal-Difference Learning
Big idea: learn from every experience!
Update V(s) each time we experience a transition (s, a, s’, r).
Likely successors s’ will contribute to updates more often.
Temporal difference learning:
The policy can still be fixed!
Move values toward the value of whatever successor occurs: a running average!
Sample of V(s): sample = R(s, π(s), s’) + γ V^π(s’)
Update to V(s): V^π(s) ← (1-α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α · (sample - V^π(s))
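A minimal sketch of one TD update, assuming we observe transitions (s, r, s’) while following the fixed policy and keep value estimates in a dictionary V; the names are illustrative.

def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """One temporal-difference update toward the observed successor's value."""
    sample = r + gamma * V.get(s_next, 0.0)               # sample of V(s)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # running average
    return V

# Example: the first transition of Episode 1, with V initialized to zero.
V = {}
td_update(V, (1,1), -1, (1,2))   # V[(1,1)] becomes 0.5 * (-1 + 0) = -0.5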
13
Example: TD Policy Evaluation
Episode 1:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
Episode 2:
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
Take γ = 1, α = 0.5.
15
Problems with TD Value Learning
TD value learning is a model-free way to do policy evaluation.
However, if we want to turn values into a (new) policy, we are sunk:
π(s) = argmax_a Q(s,a), where Q(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]
Computing this argmax needs the model (T and R), which we do not have.
Idea: learn Q-values directly.
Makes action selection model-free too!
16
Return to Active Learning
Full reinforcement learning
You do not know the transitions T(s,a,s’)
You do not know the rewards R(s,a,s’)
You can choose any actions you like.
Goal: learn the optimal policy.
… what value iteration did!
In this case:
Learner makes choices!
Fundamental tradeoff: exploration vs. exploitation.
You actually take actions in the world and find out what happens.
17
Model-Based Active Learning
In general, we want to learn the optimal policy, not evaluate a fixed policy.
Idea: adaptive dynamic programming.
Learn an initial model of the environment.
Solve for the optimal policy for this model (value or policy iteration).
Refine the model through experience and repeat.
Crucial: we have to make sure we actually learn about all of the model.
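As a sketch only: the loop below reuses the learn_model routine from the earlier model-based example and assumes a caller-supplied solve_mdp(T, R, states, actions, gamma) planner (e.g. value iteration) plus environment hooks env_reset() and env_step(s, a) returning (next state, reward, done); all of these names are illustrative, not the lecture's code.

def adp_loop(env_reset, env_step, solve_mdp, states, actions, episodes=100, gamma=0.9):
    """Alternate between acting, re-estimating the model, and re-planning."""
    transitions = []
    policy = {s: actions[0] for s in states}              # arbitrary initial policy
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            a = policy[s]
            s_next, r, done = env_step(s, a)              # act in the world
            transitions.append((s, a, s_next, r))
            s = s_next
        T, R = learn_model(transitions)                   # refine the learned model
        policy = solve_mdp(T, R, states, actions, gamma)  # re-plan against it
    return policy

Note that this purely greedy re-planning loop has exactly the exploration problem discussed on the next two slides.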
18
Example: Greedy ADP
Imagine we find the lower path to the good exit first.
Some states will never be visited following this policy from (1,1).
We can keep re-using this policy, but following it never explores the regions of the model we need in order to learn the optimal policy.
19
What Went Wrong?
Problem with following the optimal policy for the current model:
It never learns about better regions of the space if the current policy neglects them.
Fundamental tradeoff: exploration vs. exploitation.
Exploration: take actions with suboptimal estimates to discover new rewards and increase eventual utility.
Exploitation: once the true optimal policy is learned, exploration reduces utility.
Systems must explore in the beginning and exploit in the limit.
20
Detour: Q-Value Iteration
Value iteration: find successive approximations to the optimal values.
Start with V0*(s) = 0, which we know is right (why?).
Given Vi*, calculate the values for all states for depth i+1:
V*_{i+1}(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*_i(s’) ]
But Q-values are more useful!
Start with Q0*(s,a) = 0, which we know is right (why?).
Given Qi*, calculate the q-values for all q-states for depth i+1:
Q*_{i+1}(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Q*_i(s’,a’) ]
21
Q-Learning
We would like to do Q-value updates to each Q-state:
Q_{i+1}(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Q_i(s’,a’) ]
But we cannot compute this update without knowing T and R.
Instead, compute the average as we go:
Receive a sample transition (s,a,r,s’).
This sample suggests: sample = r + γ max_{a’} Q(s’,a’)
But we want to average over results from (s,a) (why?).
So keep a running average: Q(s,a) ← (1-α) Q(s,a) + α · sample
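A minimal sketch of this running average as code, assuming a dictionary Q of q-values and a helper that lists the legal actions in s’; both are illustrative.

def q_update(Q, s, a, r, s_next, actions_in, alpha=0.5, gamma=1.0):
    """Q-learning update from a single sample transition (s, a, r, s')."""
    next_actions = actions_in(s_next)
    # Value of the successor under the current Q estimate (0 if no actions).
    max_next = max((Q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    sample = r + gamma * max_next
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q

# Example: one sample from the gridworld, where only 'exit' is legal in (4,3).
Q = {}
q_update(Q, (3,3), 'right', -1, (4,3), actions_in=lambda s: ['exit'])
# Q[((3,3), 'right')] becomes 0.5 * (-1) = -0.5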
22
Q-Learning Properties
Will converge to optimal policy:
If you explore enough (i.e. visit each q-state many times).
If you make the learning rate small enough.
Basically does not matter how you select actions (!)
Off-policy learning: learns optimal q-values, not the values of the policy you are following.
On-policy vs. off-policy: see Chapter 5 of the RL textbook.
23
Exploration / Exploitation
Several schemes for forcing exploration:
Simplest: random actions (ε-greedy).
Every time step, flip a coin.
With probability ε, act randomly.
With probability 1-ε, act according to the current policy.
Regret: the expected gap between rewards during learning and rewards from acting optimally.
Q-learning with random actions will converge to optimal values, but possibly very slowly, and it will get low rewards on the way.
Results will be optimal but regret will be large.
How to make regret small?
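For concreteness, a minimal ε-greedy action-selection sketch over a Q-table like the one built above; the Q dictionary and action list are illustrative.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# Example: pick an action in state (1,1) from a (possibly empty) Q-table.
action = epsilon_greedy({}, (1,1), ['up', 'down', 'left', 'right'], epsilon=0.1)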
24
Exploration Functions
When to explore:
Random actions: explore a fixed amount.
Better ideas: explore areas whose badness is not (yet) established; explore less over time.
One way: an exploration function.
Takes a value estimate and a visit count, and returns an optimistic utility (the exact form is not important):
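The slide leaves the form open; as one hedged example (an assumption, not the lecture's definition), an exploration function can add a bonus that fades with the visit count, e.g. f(u, n) = u + k / (n + 1):

def exploration_value(u, n, k=10.0):
    """Optimistic utility: the estimate u plus a bonus that fades with visit count n."""
    return u + k / (n + 1)

# Rarely-tried q-states look better than their current estimate suggests.
print(exploration_value(0.0, 0))   # 10.0: strongly encouraged
print(exploration_value(0.0, 99))  # 0.1: the bonus has nearly vanished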
25
Q-Learning
Q-learning produces tables of q-values, one entry per q-state (s, a).
26
Q-Learning
In realistic situations, we cannot possibly learn about every single state!
Too many states to visit them all in training.
Too many states to hold the q-tables in memory.
Instead, we want to generalize:
Learn about some small number of training states from experience.
Generalize that experience to new, similar states.
This is a fundamental idea in machine learning, and we will see it over and over again!
27
Example: Pacman
Let’s say we discover through experience that this state is bad.
In naïve Q-learning, we know nothing about this state or its q-states.
Or even this one!
28
Feature-Based Representations
Solution: describe a state using a vector of features (properties).
Features map from states to real numbers that capture important properties of the state.
Example features:
Distance to closest ghost/dot.
Number of ghosts.
1 / (distance to dot)².
Is Pacman in a tunnel? (0/1)
Is it the exact state on this slide?
Can also describe a q-state (s, a) with features (e.g. the action moves closer to food).
29
Linear Feature Functions
Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
Advantage: our experience is summed up in a few powerful numbers.
Disadvantage: states may share features but actually be very different in value!
30
Function Approximation
Q-learning with linear q-functions: observe a transition (s, a, r, s’) and compute
difference = [ r + γ max_{a’} Q(s’,a’) ] - Q(s,a)
Exact Q’s: Q(s,a) ← Q(s,a) + α · difference
Approximate Q’s: w_i ← w_i + α · difference · f_i(s,a)
Intuitive interpretation:
Adjust the weights of active features.
E.g. if something unexpectedly bad happens, do not prefer all states with that state’s features.
Formal justification: online least squares.
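A minimal sketch of this weight update, assuming a feature extractor that returns a dictionary of feature values for (s, a); the extractor, the toy states, and the step sizes below are illustrative only.

def q_value(weights, feats):
    """Linear q-value: dot product of weights and feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def approx_q_update(weights, features, s, a, r, s_next, next_actions,
                    alpha=0.01, gamma=0.9):
    """Update feature weights from one transition (s, a, r, s')."""
    feats = features(s, a)
    max_next = max((q_value(weights, features(s_next, a2)) for a2 in next_actions),
                   default=0.0)
    difference = (r + gamma * max_next) - q_value(weights, feats)
    for name, value in feats.items():
        # Each active feature's weight moves in proportion to its activation.
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights

# Illustrative feature extractor: a bias plus a distance-to-food feature.
features = lambda s, a: {'bias': 1.0, 'dist-to-food': 1.0 / (1 + abs(s - 3))}
w = approx_q_update({}, features, s=0, a='right', r=-1, s_next=1, next_actions=['right'])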
31
Example: Q-Pacman
32
Linear Regression
[Figure: linear regression examples; the fitted model is used for prediction]
33
Overfitting
[Figure: data fit with a degree 15 polynomial, illustrating overfitting]
36
Policy Search
37
Policy Search
Problem: often the feature-based policies that work well are not the ones that approximate V or Q best.
E.g. value functions may provide horrible estimates of future rewards, but they can still produce good decisions.
We will see the distinction between modeling and prediction again later in the course.
Solution: learn the policy that maximizes rewards rather than the value that predicts rewards.
This is the idea behind policy search, such as what controlled the upside-down helicopter.
38
Policy Search
Simplest policy search:
Start with an initial linear value function or q-function.
Nudge each feature weight up and down and see if your policy is better than before (see the sketch below).
Problems:
How do we tell whether the policy got better?
We need to run many sample episodes!
If there are a lot of features, this can be impractical.
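A minimal sketch of this naive hill-climbing search, assuming a caller-supplied evaluate(weights) that runs sample episodes and returns the average reward; every name here is illustrative.

def hill_climb_policy_search(weights, evaluate, step=0.1, iterations=100):
    """Nudge each weight up and down; keep any change that improves the policy."""
    best_score = evaluate(weights)
    for _ in range(iterations):
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta
                score = evaluate(candidate)   # requires many sample episodes
                if score > best_score:
                    weights, best_score = candidate, score
    return weights, best_score

The cost is hidden in evaluate: each candidate needs enough episodes for a reliable comparison, which is exactly the impracticality noted above.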
39