Transcript IntroAI_14x
Reinforcement Learning (2)
Lirong Xia
Tue, March 21, 2014

Reminder
• Project 2 due tonight
• Project 3 is online (more later) – due in two weeks

Recap: MDPs
• Markov decision processes:
  – States S
  – Start state s0
  – Actions A
  – Transitions p(s'|s,a) (or T(s,a,s'))
  – Rewards R(s,a,s') (and discount γ)
• MDP quantities:
  – Policy = choice of action for each (MAX) state
  – Utility (or return) = sum of discounted rewards

Optimal Utilities
• The value of a state s:
  – V*(s) = expected utility starting in s and acting optimally
• The value of a Q-state (s,a):
  – Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
• The optimal policy:
  – π*(s) = optimal action from state s

Solving MDPs
• Value iteration
  – Start with V_1(s) = 0
  – Given V_i, calculate the values for all states at depth i+1:
    V_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
  – Repeat until convergence
  – Use V_i as the evaluation function when computing V_{i+1}
• Policy iteration
  – Step 1: policy evaluation: calculate utilities for some fixed policy
  – Step 2: policy improvement: update the policy using one-step look-ahead with the resulting utilities as future values
  – Repeat until the policy converges

Reinforcement learning
• Don't know T and/or R, but can observe R
  – Learn by doing
  – Can have multiple episodes (trials)

The Story So Far: MDPs and RL
• Things we know how to do:
  – If we know the MDP:
    • Compute V*, Q*, π* exactly
    • Evaluate a fixed policy π
  – If we don't know the MDP:
    • If we can estimate the MDP, then solve it
    • We can estimate V for a fixed policy π
    • We can estimate Q*(s,a) for the optimal policy while executing an exploration policy
• Techniques:
  – Computation: value and policy iteration; policy evaluation
  – Model-based RL: sampling
  – Model-free RL: Q-learning

Model-Free Learning
• Model-free (temporal difference) learning
  – Experience the world through episodes (s, a, r, s', a', r', s'', a'', r'', s''', ...)
  – Update estimates on each transition (s, a, r, s')
  – Over time, the updates will mimic Bellman updates
• Q-value iteration (model-based, requires a known MDP):
  Q_{i+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_i(s',a') ]
• Q-learning (model-free, requires only experienced transitions):
  Q(s,a) ← (1−α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]

Detour: Q-Value Iteration
• We'd like to do Q-value updates to each Q-state:
  Q_{i+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_i(s',a') ]
  – But we can't compute this update without knowing T and R
• Instead, compute an average as we go
  – Receive a sample transition (s, a, r, s')
  – This sample suggests Q(s,a) ≈ r + γ max_{a'} Q(s',a')
  – But we want to merge the new observation with the old ones
  – So keep a running average:
    Q(s,a) ← (1−α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]

Q-Learning Properties
• Will converge to the optimal policy
  – If you explore enough (i.e., visit each Q-state many times)
  – If you make the learning rate small enough
  – It basically doesn't matter how you select actions (!)
• Off-policy learning: learns the optimal Q-values, not the values of the policy you are following

Q-Learning
• Q-learning produces tables of Q-values.

Exploration / Exploitation
• Random actions (ε-greedy)
  – Every time step, flip a coin
  – With probability ε, act randomly
  – With probability 1−ε, act according to the current policy
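To make the tabular algorithm concrete, here is a minimal Python sketch of Q-learning with ε-greedy action selection, combining the running-average update and the exploration rule above. The environment interface (reset, step, actions) and all parameter values are illustrative assumptions, not the Project 3 starter-code API.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    Assumes a hypothetical env with reset() -> state,
    step(a) -> (next_state, reward, done), and actions(state) -> list of actions.
    """
    Q = defaultdict(float)  # Q[(s, a)], defaults to 0

    def best_action(s):
        # argmax_a Q(s, a); ties broken arbitrarily
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = best_action(s)
            s2, r, done = env.step(a)
            # sample estimate of the target: r + gamma * max_a' Q(s', a')
            if done:
                target = r
            else:
                target = r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            # running average: Q(s,a) <- (1 - alpha) Q(s,a) + alpha * target
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q

Because the update uses max_{a'} Q(s',a') rather than the action actually taken next, the learned Q-values approach the optimal ones regardless of the behavior policy (off-policy learning), provided every Q-state is visited often enough and the learning rate is made small enough.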
Today: Q-Learning with state abstraction
• In realistic situations, we cannot possibly learn about every single state!
  – Too many states to visit them all in training
  – Too many states to hold the Q-tables in memory
• Instead, we want to generalize:
  – Learn about some small number of training states from experience
  – Generalize that experience to new, similar states
  – This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman
• Let's say we discover through experience that this state is bad.
• In naive Q-learning, we know nothing about this state or its Q-states.
• Or even this one!

Feature-Based Representations
• Solution: describe a state using a vector of features (properties)
  – Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  – Example features:
    • Distance to the closest ghost
    • Distance to the closest dot
    • Number of ghosts
    • 1 / (distance to dot)²
    • Is Pacman in a tunnel? (0/1)
    • Is it the exact state on this slide?
    • ...etc.
  – Similar to an evaluation function
  – Can also describe a Q-state (s,a) with features (e.g., the action moves closer to food)

Linear Feature Functions
• Using a feature representation, we can write a Q-function (or value function) for any state using a few weights:
  V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
  Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states may share features but actually be very different in value!

Function Approximation
• Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
• Q-learning with linear Q-functions: for each transition (s, a, r, s'):
  difference = [ r + γ max_{a'} Q(s',a') ] − Q(s,a)
  – Exact Q's: Q(s,a) ← Q(s,a) + α [difference]
  – Approximate Q's: w_i ← w_i + α [difference] f_i(s,a)
• Intuitive interpretation:
  – Adjust the weights of the active features
  – E.g., if something unexpectedly bad happens, disprefer all states with that state's features
• Formal justification: online least squares (a short code sketch of this weight update follows the Project 3 overview at the end of these notes)

Example: Q-Pacman
• Initial Q-function: Q(s,a) = 4.0 f_DOT(s,a) − 1.0 f_GST(s,a)
• Observed transition: action = NORTH, r = −500
  – f_DOT(s, NORTH) = 0.5, f_GST(s, NORTH) = 1.0
  – Q(s, NORTH) = +1, R(s,a,s') = −500
  – difference = −501
• Weight updates (with α = 0.004):
  – w_DOT ← 4.0 + α [−501] (0.5)
  – w_GST ← −1.0 + α [−501] (1.0)
• Updated Q-function: Q(s,a) = 3.0 f_DOT(s,a) − 3.0 f_GST(s,a)

Linear Regression
• Prediction with one feature: ŷ = w_0 + w_1 f_1(x)
• Prediction with two features: ŷ = w_0 + w_1 f_1(x) + w_2 f_2(x)

Ordinary Least Squares (OLS)
• total error = Σ_i ( y_i − ŷ_i )² = Σ_i ( y_i − Σ_k w_k f_k(x_i) )²

Minimizing Error
• Imagine we had only one point x with features f(x):
  error(w) = ½ ( y − Σ_k w_k f_k(x) )²
  ∂error(w)/∂w_m = − ( y − Σ_k w_k f_k(x) ) f_m(x)
  w_m ← w_m + α ( y − Σ_k w_k f_k(x) ) f_m(x)
• The approximate Q update explained, with target r + γ max_{a'} Q(s',a') and prediction Q(s,a):
  w_m ← w_m + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] f_m(s,a)

How many features should we use?
• As many as possible?
  – Computational burden
  – Overfitting
• Feature selection is important
  – Requires domain expertise

Overfitting
• Figure: The Elements of Statistical Learning, Fig. 2.11.

Overview of Project 3
• MDPs
  – Q1: value iteration
  – Q2: find parameters that lead to a certain optimal policy
  – Q3: similar to Q2
• Q-learning
  – Q4: implement the Q-learning algorithm
  – Q5: implement ε-greedy action selection
  – Q6: try the algorithm
• Approximate Q-learning and state abstraction
  – Q7: Pacman
  – Q8 (bonus): design and implement a state-abstraction Q-learning algorithm
• Hints
  – Make your implementation general
  – Try to use class methods as much as possible
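As a companion to the function-approximation slides (and to Q7), here is a minimal sketch of the linear approximate Q-learning update w_i ← w_i + α [difference] f_i(s,a). The feature-extractor interface, the feature names, and α = 0.004 (chosen so the numbers reproduce the Q-Pacman example above) are assumptions for illustration, not the project's actual code.

from collections import defaultdict

def linear_q(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a), with features given as {name: value}."""
    return sum(weights[name] * value for name, value in features.items())

def approx_q_update(weights, feat_fn, s, a, r, s2, actions_s2, alpha=0.004, gamma=0.9):
    """One approximate Q-learning update on a transition (s, a, r, s').

    feat_fn(state, action) -> {feature_name: value} is a hypothetical
    feature extractor (e.g. dot/ghost distance features for Pacman).
    actions_s2 is the list of legal actions in s' (empty if s' is terminal).
    """
    # difference = [ r + gamma * max_a' Q(s',a') ] - Q(s,a)
    q_next = max((linear_q(weights, feat_fn(s2, a2)) for a2 in actions_s2), default=0.0)
    difference = (r + gamma * q_next) - linear_q(weights, feat_fn(s, a))
    # w_i <- w_i + alpha * [difference] * f_i(s,a): adjust weights of active features
    for name, value in feat_fn(s, a).items():
        weights[name] += alpha * difference * value
    return weights

# Reproducing the Q-Pacman example: Q = 4.0*f_DOT - 1.0*f_GST, action NORTH,
# r = -500, f_DOT(s, NORTH) = 0.5, f_GST(s, NORTH) = 1.0, terminal next state.
weights = defaultdict(float, {"DOT": 4.0, "GST": -1.0})
feat_fn = lambda state, action: {"DOT": 0.5, "GST": 1.0}
approx_q_update(weights, feat_fn, s="s", a="NORTH", r=-500, s2=None, actions_s2=[])
print(dict(weights))  # approximately {'DOT': 3.0, 'GST': -3.0}, as on the Q-Pacman slide

Only the handful of weights are stored, not a table over states, which is what lets the experience from a few training states generalize to new, similar states.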