CS 5368: Artificial Intelligence
Fall 2010
Lecture 12: MDP + RL (Part 2)
10/14/2010
Mohan Sridharan
Slides adapted from Dan Klein
1
Recap: Reinforcement Learning
 Basic idea:
 Receive feedback in the form of rewards.
 Agent’s utility is defined by the reward function.
 Must learn to act so as to maximize expected rewards.
2
Reinforcement Learning
Known: current state, available actions, experienced rewards.
Unknown: transition model T(s,a,s'), reward structure R(s,a,s').
Assumed: Markov transitions, a fixed reward for each (s,a,s').
Problem: find values for a fixed policy π (policy evaluation).
Model-based learning: Learn the model, solve for values.
Model-free learning: Solve for values directly (by sampling).
3
The Story So Far: MDPs and RL
Things we know how to do, and the techniques that do them:
 If we know the MDP:
  Compute V*, Q*, π* exactly: value iteration and policy iteration.
  Evaluate a fixed policy π: policy evaluation.
 If we don't know the MDP:
  Estimate the MDP from experience, then solve it: model-based RL.
  Estimate V for a fixed policy π: model-free value learning.
  Estimate Q*(s,a) for the optimal policy while executing an exploration policy: Q-learning.
4
Passive Learning
 Simplified task:
  You do not know the transitions T(s,a,s').
  You do not know the rewards R(s,a,s').
  You are given a policy π(s).
  Goal: learn the state values (and the model?). Policy evaluation.
 In this case:
 Learner “along for the ride”.
 No choice about what actions to take. Just execute the policy
and learn from experience.
 This is NOT offline planning!
5
Active Learning
 More complex task.
 R and T are still unknown, and the agent does not have a fixed policy that determines its behavior!
 Must learn what actions to take.
 Same set of algorithms can be modified to address the challenges.
 We begin with passive model-based and model-free methods.
6
Example: Direct Estimation
[Gridworld figure: exit rewards +100 at (4,3) and -100 at (4,2)]
 Episodes:
  Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100, (done)
  Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100, (done)
 γ = 1, R = -1
 V(2,3) ≈ (96 + -103) / 2 = -3.5
 V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3
7
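To make the direct-estimation arithmetic above concrete, here is a minimal sketch that recomputes those averages from the two episodes; the data layout and function name are illustrative choices, not from the slides:

```python
from collections import defaultdict

# Each episode: list of (state, action, reward) triples, in order.
episode1 = [((1,1),'up',-1), ((1,2),'up',-1), ((1,2),'up',-1), ((1,3),'right',-1),
            ((2,3),'right',-1), ((3,3),'right',-1), ((3,2),'up',-1),
            ((3,3),'right',-1), ((4,3),'exit',100)]
episode2 = [((1,1),'up',-1), ((1,2),'up',-1), ((1,3),'right',-1), ((2,3),'right',-1),
            ((3,3),'right',-1), ((3,2),'up',-1), ((4,2),'exit',-100)]

def direct_estimate(episodes, gamma=1.0):
    """Average, over all visits, the discounted return that followed each state."""
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        # Walk backwards so the return-to-go is easy to accumulate.
        for state, action, reward in reversed(ep):
            G = reward + gamma * G
            returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

V = direct_estimate([episode1, episode2])
print(V[(2, 3)])   # (96 + -103) / 2 = -3.5
print(V[(3, 3)])   # (99 + 97 + -102) / 3 ≈ 31.3
```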
Model-Based Learning
 Idea:
 Learn the model empirically through experience.
 Solve for values as if the learned model were correct.
 Simple empirical model learning:
 Count outcomes for each s, a.
 Normalize to give estimate of T(s, a, s’).
 Discover R(s, a, s’) when we experience (s, a, s’).
 Solving the MDP with the learned model:
 Iterative policy evaluation, for example: V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_i(s') ]
8
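To make the "solve with the learned model" step concrete, here is a minimal sketch of iterative policy evaluation against estimated tables T_hat and R_hat; the dictionary layout and function name are assumptions made for illustration, not from the slides:

```python
def evaluate_policy(policy, T_hat, R_hat, states, gamma=1.0, iters=100):
    """Iterative policy evaluation on a learned model.
    T_hat[(s, a)] is a dict {s_next: probability}; R_hat[(s, a, s_next)] is a reward."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V_new = {}
        for s in states:
            a = policy[s]
            V_new[s] = sum(p * (R_hat[(s, a, s2)] + gamma * V[s2])
                           for s2, p in T_hat.get((s, a), {}).items())
        V = V_new
    return V
```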
Example: Model-Based Learning
[Gridworld figure: exit rewards +100 at (4,3) and -100 at (4,2)]
 Episodes:
  Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100, (done)
  Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100, (done)
 γ = 1
 T(<3,3>, right, <4,3>) = 1 / 3
 T(<2,3>, right, <3,3>) = 2 / 2
9
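A minimal sketch of the counting-and-normalizing step described above; it assumes the same (state, action, reward) episode layout as the earlier direct-estimation sketch, and the names are again illustrative:

```python
from collections import defaultdict

def learn_model(episodes):
    """Estimate T(s,a,s') by normalized counts and record R(s,a,s') as observed."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = n
    R_hat = {}                                       # R_hat[(s, a, s')] = observed reward
    for ep in episodes:
        for (s, a, r), (s2, _, _) in zip(ep, ep[1:]):
            counts[(s, a)][s2] += 1
            R_hat[(s, a, s2)] = r
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s2: n / total for s2, n in outcomes.items()}
    return T_hat, R_hat

# With the two episodes above:
#   T_hat[((3,3), 'right')] -> {(3,2): 2/3, (4,3): 1/3}
#   T_hat[((2,3), 'right')] -> {(3,3): 1.0}
```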
Model-Free Learning
 Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x).
 Model-based: estimate P̂(x) from samples, then compute Σ_x P̂(x) f(x).
 Model-free: estimate the expectation directly from samples: E[f(x)] ≈ (1/N) Σ_i f(x_i), where x_i ~ P(x).
 Why does this work? Because samples appear with the right frequencies!
10
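A tiny numeric illustration of the two routes to the same expectation; the distribution and function here are made up purely for illustration:

```python
import random

random.seed(0)

def f(x):
    return x * x

true_P = {1: 0.5, 2: 0.3, 3: 0.2}   # unknown to the model-free learner
samples = random.choices(list(true_P), weights=true_P.values(), k=10000)

# Model-based: estimate P(x) from the samples, then sum P_hat(x) * f(x).
P_hat = {x: samples.count(x) / len(samples) for x in set(samples)}
model_based = sum(p * f(x) for x, p in P_hat.items())

# Model-free: average f over the samples directly.
model_free = sum(f(x) for x in samples) / len(samples)

print(model_based, model_free)   # both ≈ 0.5*1 + 0.3*4 + 0.2*9 = 3.5
```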
Sample-Based Policy Evaluation?
 Who needs T and R? Approximate the expectation with samples (drawn from T):
  sample_k = R(s, π(s), s_k') + γ V^π_i(s_k')
  V^π_{i+1}(s) ← (1/n) Σ_k sample_k
Almost! But we only actually make progress when we move to i+1.
11
Temporal-Difference Learning
 Big idea: learn from every experience!
 Update V(s) each time we experience a transition (s, a, s', r).
 Likely outcomes s' will contribute to updates more often.
 Temporal difference learning:
  Policy can still be fixed!
  Move values toward the value of whatever successor occurs: running average!
 Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
 Update to V(s): V^π(s) ← (1 − α) V^π(s) + α · sample
 Same update: V^π(s) ← V^π(s) + α (sample − V^π(s))
13
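A minimal sketch of this TD update, with illustrative names and default α, γ values; it assumes V is a dict from states to value estimates:

```python
def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0, terminal=False):
    """One temporal-difference update after experiencing (s, pi(s), r, s_next)."""
    sample = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V
```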
Example: TD Policy Evaluation
 Episodes:
  Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100, (done)
  Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100, (done)
 Take γ = 1, α = 0.5.
15
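For concreteness, assuming all values start at 0 and episode 1 is processed in order: after the first transition, (1,1) up -1 landing in (1,2), V(1,1) ← (1 - 0.5)·0 + 0.5·(-1 + 1·0) = -0.5; after the second, (1,2) up -1 landing back in (1,2), V(1,2) ← (1 - 0.5)·0 + 0.5·(-1 + 1·0) = -0.5. (The zero initialization and the processing order are assumptions; the slides do not show the intermediate numbers.)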
Problems with TD Value Learning
 TD value learning is a model-free way to do policy evaluation.
 However, if we want to turn values into a (new) policy, we are sunk: acting greedily requires a one-step lookahead,
  π(s) = argmax_a Q(s, a), where Q(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ],
 which needs exactly the T and R we do not know.
 Idea: learn Q-values directly.
 Makes action selection model-free too!
16
Return to Active Learning
 Full reinforcement learning:
  You do not know the transitions T(s,a,s').
  You do not know the rewards R(s,a,s').
  You can choose any actions you like.
  Goal: learn the optimal policy … what value iteration did!
 In this case:
 Learner makes choices!
 Fundamental tradeoff: exploration vs. exploitation.
 You actually take actions in the world and find out what happens.
17
Model-Based Active Learning
 In general, want to learn the optimal policy, not evaluate
a fixed policy.
 Idea: adaptive dynamic programming 
 Learn an initial model of the environment.
 Solve for the optimal policy for this model (value or policy
iteration).
 Refine model through experience and repeat.
 Crucial: we have to make sure we actually learn about all of the
model.
18
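A schematic sketch of this adaptive-dynamic-programming loop. The environment interface, fit_model, and solve_mdp are all assumed, illustrative pieces (e.g. fit_model could be the counting estimator sketched earlier, and solve_mdp could run value or policy iteration):

```python
def adp_agent(env, fit_model, solve_mdp, episodes=100):
    """Greedy ADP loop: act, refit the model, re-solve, repeat.

    Assumed (illustrative) interfaces, not from the slides:
      env.reset() -> state;  env.step(a) -> (next_state, reward, done)
      env.random_action() -> some action (used before the model covers a state)
      fit_model(transitions) -> (T_hat, R_hat), e.g. by normalized counts
      solve_mdp(T_hat, R_hat) -> policy dict mapping states to actions
    """
    transitions, policy = [], {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy[s] if s in policy else env.random_action()
            s2, r, done = env.step(a)
            transitions.append((s, a, r, s2))
            s = s2
        T_hat, R_hat = fit_model(transitions)
        policy = solve_mdp(T_hat, R_hat)   # value or policy iteration on the learned model
    return policy
```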
Example: Greedy ADP
 Imagine we find the lower path to
the good exit first.
 Some states will never be visited
following this policy from (1,1).
 We can keep re-using this policy, but following it never explores the regions of the model we need in order to learn the optimal policy.
19
What Went Wrong?
 Problem with following optimal policy
for current model:
 Never learns about better regions of
the space if current policy neglects
them.
 Fundamental tradeoff: exploration vs.
exploitation.
 Exploration: take actions with
suboptimal estimates to discover new
rewards and increase eventual utility.
 Exploitation: once the true optimal
policy is learned, exploration reduces
utility.
 Systems must explore in the
beginning and exploit in the limit.
20
Detour: Q-Value Iteration
 Value iteration: find successive approximations to the optimal values.
  Start with V*_0(s) = 0, which we know is right (why?).
  Given V*_i, calculate the values for all states for depth i+1:
   V*_{i+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*_i(s') ]
 But Q-values are more useful!
  Start with Q*_0(s,a) = 0, which we know is right (why?).
  Given Q*_i, calculate the q-values for all q-states for depth i+1:
   Q*_{i+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q*_i(s', a') ]
21
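A minimal sketch of Q-value iteration under the same assumed dictionary layout for T and R used in the earlier sketches:

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Q_{i+1}(s,a) = sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * max_{a'} Q_i(s',a')].
    T[(s, a)] is a dict {s': prob}; R[(s, a, s')] is a reward."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q_new = {}
        for s in states:
            for a in actions:
                Q_new[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in T.get((s, a), {}).items())
        Q = Q_new
    return Q
```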
Q-Learning
 We would like to do Q-value updates to each Q-state:
  Q*_{i+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q*_i(s', a') ]
 But we cannot compute this update without knowing T and R.
 Instead, compute the average as we go:
  Receive a sample transition (s, a, r, s').
  This sample suggests Q(s, a) ≈ r + γ max_{a'} Q(s', a').
  But we want to average over results from (s, a). (Why?)
  So keep a running average: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]
22
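A minimal sketch of the running-average Q-update above; the q-table is assumed to be a dict keyed by (state, action):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, terminal=False):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma * max_a' Q(s', a')]."""
    future = 0.0 if terminal else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    sample = r + gamma * future
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```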
Q-Learning Properties
 Will converge to optimal policy:
 If you explore enough (i.e. visit each q-state many times).
 If you make the learning rate small enough.
 Basically does not matter how you select actions (!)
 Off-policy learning: learns optimal q-values, not the values of the
policy you are following.
 On-policy vs. off-policy: see Chapter 5 of the RL textbook.
23
Exploration / Exploitation
 Several schemes for forcing exploration:
 Simplest: random actions (ε-greedy).
  Every time step, flip a coin.
  With probability ε, act randomly.
  With probability 1 − ε, act according to the current policy.
 Regret: expected gap between rewards during learning
and rewards from optimal action.
 Q-learning with random actions will converge to optimal values,
but possibly very slowly, and will get low rewards on the way.
 Results will be optimal but regret will be large.
 How to make regret small?
24
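A minimal sketch of ε-greedy action selection over such a q-table (names and defaults are illustrative):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```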
Exploration Functions
 When to explore:
 Random actions: explore a fixed amount.
 Better ideas: explore areas whose badness is not (yet) established, and explore less over time.
 One way: an exploration function.
  Takes a value estimate u and a visit count n, and returns an optimistic utility (exact form not important), e.g. f(u, n) = u + k/n.
25
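One illustrative exploration function of this kind; the specific form u + k/n (written with n + 1 in the code to avoid dividing by zero for unvisited q-states) is an example choice, not prescribed by the slides:

```python
def exploration_value(u, n, k=1.0):
    """Optimistic utility: the estimate u plus a bonus that shrinks with visit count n."""
    return u + k / (n + 1)

# Action selection then prefers argmax_a exploration_value(Q[(s, a)], N[(s, a)]).
```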
Q-Learning
 Q-learning produces tables of q-values, one entry per (state, action) pair.
26
Q-Learning
 In realistic situations, we cannot possibly learn about
every single state!
 Too many states to visit them all in training.
 Too many states to hold the q-tables in memory.
 Instead, we want to generalize:
 Learn about some small number of training states from
experience.
 Generalize that experience to new, similar states.
 This is a fundamental idea in machine learning, and we will see it
over and over again!
27
Example: Pacman
 Let's say we discover through experience that this state (pictured on the slide) is bad.
 In naïve Q-learning, we know nothing about a second, similar state or its q-states.
 Or even a third one!
28
Feature-Based Representations
 Solution: describe a state using a
vector of features (properties).
 Features map from states to real
numbers that capture important
properties of the state.
 Example features:
  Distance to the closest ghost/dot.
  Number of ghosts.
  1 / (distance to the closest dot)²
  Is Pacman in a tunnel? (0/1)
  Is it the exact state on this slide?
 Can also describe a q-state (s, a) with
features (e.g. action moves closer to
food).
29
Linear Feature Functions
 Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
  V(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
  Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
 Advantage: our experience is summed up in a few powerful numbers.
 Disadvantage: states may share features but actually be very different in value!
30
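A minimal sketch of a linear q-function over features. The feature extractor below is hypothetical: the feature names, state layout, and weights are invented for illustration:

```python
def features(s, a):
    """Hypothetical feature extractor. Here `s` is assumed to be a dict of
    precomputed quantities about the game state; real features are domain-specific."""
    return {
        'bias': 1.0,
        'dist-to-closest-dot': s['dist_to_dot'][a],    # assumed precomputed per action
        'ghosts-one-step-away': s['ghosts_near'][a],   # assumed precomputed per action
    }

def q_value(weights, s, a):
    """Linear q-function: Q(s,a) = w1*f1(s,a) + ... + wn*fn(s,a)."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features(s, a).items())

# Example: a state where moving 'up' is one step from a dot and away from ghosts.
s = {'dist_to_dot': {'up': 1.0}, 'ghosts_near': {'up': 0.0}}
w = {'bias': 0.0, 'dist-to-closest-dot': -0.5, 'ghosts-one-step-away': -10.0}
print(q_value(w, s, 'up'))   # -0.5
```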
Function Approximation
 Q-learning with linear q-functions: observe a transition (s, a, r, s') and compute
  difference = [ r + γ max_{a'} Q(s', a') ] − Q(s, a)
  Exact Q's: Q(s, a) ← Q(s, a) + α · difference
  Approximate Q's: w_i ← w_i + α · difference · f_i(s, a)
 Intuitive interpretation:
  Adjust the weights of active features.
  E.g. if something unexpectedly bad happens, do not prefer all states with that state's features.
 Formal justification: online least squares.
31
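A minimal sketch of this approximate Q-learning update, reusing the hypothetical features and q_value helpers from the previous sketch:

```python
def approx_q_update(weights, s, a, r, s_next, actions,
                    alpha=0.01, gamma=0.9, terminal=False):
    """w_i <- w_i + alpha * difference * f_i(s,a), with
    difference = [r + gamma * max_a' Q(s',a')] - Q(s,a).
    Reuses the hypothetical features()/q_value() helpers sketched above."""
    future = 0.0 if terminal else max(q_value(weights, s_next, a2) for a2 in actions)
    difference = (r + gamma * future) - q_value(weights, s, a)
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```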
Example: Q-Pacman
32
Linear Regression
[Figure: scatter plots of data points with fitted linear predictions]
33
Overfitting
[Figure: a degree-15 polynomial fit to the data points, illustrating overfitting]
36
Policy Search
37
Policy Search
 Problem: often the feature-based policies that work well are not the
ones that approximate V or Q best.
 E.g. value functions may provide horrible estimates of future rewards, but they
can still produce good decisions.
 We will see the distinction between modeling and prediction again later in the course.
 Solution: learn the policy that maximizes rewards rather than the
value that predicts rewards.
 This is the idea behind policy search, such as what controlled the
upside-down helicopter.
38
Policy Search
 Simplest policy search:
 Start with an initial linear value function or q-function.
 Nudge each feature weight up and down and see if
your policy is better than before.
 Problems:
  How do we tell whether the policy got better?
  We need to run many sample episodes!
  If there are a lot of features, this can be impractical.
39
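A minimal sketch of that hill-climbing search; evaluate_policy_return is a hypothetical helper that would run many sample episodes with the given weights and return the average episode reward:

```python
def hill_climb_weights(weights, evaluate_policy_return, step=0.1, passes=10):
    """Nudge each feature weight up and down; keep a change only if the
    estimated average episode return improves (which needs many sample episodes)."""
    best = evaluate_policy_return(weights)
    for _ in range(passes):
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta
                score = evaluate_policy_return(candidate)
                if score > best:
                    weights, best = candidate, score
                    break
    return weights
```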