
Markov Decision Processes
Basic Concepts
Alan Fern
1
Some AI Planning Problems
Fire & Rescue Response Planning
Helicopter Control
Solitaire
Real-Time Strategy Games
Legged Robot Control
Network Security/Control
Some AI Planning Problems
 Health Care
   Personalized treatment planning
   Hospital Logistics/Scheduling
 Transportation
   Autonomous Vehicles
   Supply Chain Logistics
   Air traffic control
 Assistive Technologies
   Dialog Management
   Automated assistants for elderly/disabled
   Household robots
 Sustainability
   Smart grid
   Forest fire management
 …
3
Common Elements
 We have a system that changes state over time
 Can (partially) control the system state transitions by
taking actions
 Problem gives an objective that specifies which
states (or state sequences) are more/less preferred
 Problem: At each moment must select an action to
optimize the overall (long-term) objective
 Produce most preferred state sequences
4
Observe-Act Loop of AI Planning Agent
[Diagram: the planning agent (????) receives observations of the state of the world/system and sends actions back to the world/system. Goal: maximize expected reward over lifetime.]
5
Stochastic/Probabilistic Planning:
Markov Decision Process (MDP) Model
[Diagram: the agent (????) observes the world state of a Markov Decision Process and chooses an action from a finite set. Goal: maximize expected reward over lifetime.]
6
Example MDP
[Diagram: the agent (????) plays a dice game — the state describes all visible info about the dice, the actions are the different choices of dice to roll, and the goal is to maximize score or score more than the opponent.]
7
Markov Decision Processes
 An MDP has four components: S, A, R, T:
   finite state set S (|S| = n)
   finite action set A (|A| = m)
   transition function T(s,a,s’) = Pr(s’ | s,a)
     Probability of going to state s’ after taking action a in state s
   bounded, real-valued reward function R(s,a)
     Immediate reward we get for being in state s and taking action a
 Roughly speaking, the objective is to select actions in order to maximize total reward
 For example, in a goal-based domain R(s,a) may equal 1 for goal states and 0 for all others (or -1 reward for non-goal states)
8
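As a concrete illustration (not from the slides), the four components might be encoded in Python roughly as follows; the two-state MDP, its dictionary-based transition table, and all names are assumptions made for the example:

import random

# Illustrative two-state, two-action MDP
S = ["s0", "s1"]
A = ["stay", "go"]

# T[(s, a)] maps each next state s' to Pr(s' | s, a)
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.5, "s1": 0.5},
}

# R(s, a): bounded, real-valued immediate reward (goal-style: 1 in the "goal" state s1)
def R(s, a):
    return 1.0 if s == "s1" else 0.0

def sample_next_state(s, a):
    """Sample s' according to Pr(s' | s, a)."""
    dist = T[(s, a)]
    return random.choices(list(dist.keys()), weights=list(dist.values()))[0]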
[Diagram: from a state, the action "roll the two 5’s" leads by a probabilistic state transition to one of the possible next states 1,1; 1,2; …; 6,6.]
What is a solution to an MDP?
MDP Planning Problem:
Input: an MDP (S,A,R,T)
Output: ????
 Should the solution to an MDP be just a sequence of
actions such as (a1,a2,a3, ….) ?
 Consider a single-player card game like Blackjack/Solitaire.
 No! In general, an action sequence is not sufficient
 Actions have stochastic effects, so the state we end up in is
uncertain
 This means that we might end up in states where the remainder
of the action sequence doesn’t apply or is a bad choice
 A solution should tell us what the best action is for any possible
situation/state that might arise
10
Policies (“plans” for MDPs)
 A solution to an MDP is a policy
 Two types of policies: nonstationary and stationary
 Nonstationary policies are used when we are given a
finite planning horizon H
 I.e. we are told how many actions we will be allowed to take
 Nonstationary policies are functions from states and
times to actions
 π: S × T → A, where T is the non-negative integers
 π(s,t) tells us what action to take at state s when there are t
stages-to-go (note that we are using the convention that t
represents stages/decisions to go, rather than the time step)
11
Policies (“plans” for MDPs)
 What if we want to continue taking actions indefinitely?
 Use stationary policies
 A stationary policy is a mapping from states to actions
 π: S → A
 π(s) is action to do at state s (regardless of time)
 specifies a continuously reactive controller
 Note that both nonstationary and stationary policies
assume or have these properties:
 full observability of the state
 history-independence
 deterministic action choice
12
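To make the two policy types concrete, here is a minimal sketch (my own illustration, reusing the toy states and actions above, not from the slides):

# Stationary policy: action depends only on the state
def pi_stationary(s):
    return "go" if s == "s0" else "stay"

# Nonstationary policy: action also depends on the stages-to-go t
def pi_nonstationary(s, t):
    # behave differently when few decisions remain (purely illustrative rule)
    if t <= 1:
        return "stay"
    return "go" if s == "s0" else "stay"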
What is a solution to an MDP?
MDP Planning Problem:
Input: an MDP (S,A,R,T)
Output: a policy such that ????
 We don’t want to output just any policy
 We want to output a “good” policy
 One that accumulates lots of reward
13
Value of a Policy
 How good is a policy π?
 How do we measure reward “accumulated” by π?
 Value function V: S → ℝ associates a value with each state (or each state and time for a non-stationary π)
 Vπ(s) denotes value of policy at state s
 Depends on immediate reward, but also what you achieve
subsequently by following π
 An optimal policy is one that is no worse than any other
policy at any state
 The goal of MDP planning is to compute an optimal
policy
14
What is a solution to an MDP?
MDP Planning Problem:
Input: an MDP (S,A,R,T)
Output: a policy that achieves an “optimal value”
 This depends on how we define the value of a policy
 There are several choices and the solution algorithms
depend on the choice
 We will consider finite-horizon value
15
Finite-Horizon Value Functions
 We first consider maximizing expected total reward
over a finite horizon
 Assumes the agent has H time steps to live
 To act optimally, should the agent use a stationary
or non-stationary policy?
 I.e. Should the action it takes depend on absolute time?
 Put another way:
 If you had only one week to live would you act the same
way as if you had fifty years to live?
16
Finite Horizon Problems
 Value (utility) depends on stages-to-go
 hence use a nonstationary policy
 $V^k_\pi(s)$ is the k-stage-to-go value function for π
   expected total reward for executing π starting in s for k time steps

$V^k_\pi(s) \;=\; E\big[\sum_{t=0}^{k} R^t \mid \pi, s\big] \;=\; E\big[\sum_{t=0}^{k} R(s^t) \mid a^t = \pi(s^t,\, k-t),\ s^0 = s\big]$
 Here R^t and s^t are random variables denoting the reward received and state at time-step t when starting in s
 These are random variables since the world is stochastic
17
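A small worked instance of this definition (my own example, not from the slides): suppose π's action moves from s0 to an absorbing state s1 with probability 0.5 (otherwise staying in s0), with R(s0) = 0 and R(s1) = 1. Then

$V^1_\pi(s_0) = E[R(s^0) + R(s^1)] = 0 + 0.5 \cdot 1 = 0.5$

$V^2_\pi(s_0) = E[R(s^0) + R(s^1) + R(s^2)] = 0 + 0.5 + (1 - 0.5^2) = 1.25$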
Computational Problems
 There are two problems that are typically of interest
 Policy evaluation:
 Given an MDP and a policy π
 Compute finite-horizon value function V^k_π(s) for any k and s
 Policy optimization:
 Given an MDP and a horizon H
 Compute the optimal finite-horizon policy
 How many finite horizon policies are there?
 |A|^(H·n), since a nonstationary policy chooses one of |A| actions for each of the n states at each of the H stages-to-go
 So we can’t just enumerate policies for efficient optimization
19
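To get a feel for how fast the policy count |A|^(H·n) grows, a quick check with illustrative sizes (numbers are not from the slides):

# Number of nonstationary policies |A|^(H*n) for a few illustrative sizes
for n, m, H in [(10, 2, 5), (20, 4, 10), (100, 10, 20)]:
    count = m ** (H * n)
    print(f"n={n}, m={m}, H={H}: about 10^{len(str(count)) - 1} policies")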
Computational Problems
 Dynamic programming techniques can be used for both
policy evaluation and optimization
 Polynomial time in # of states and actions
 http://web.engr.oregonstate.edu/~afern/classes/cs533/
 Is polytime in # of states and actions good?
 Not when these numbers are enormous!
 As is the case for most realistic applications
 Consider Klondike Solitaire, Computer Network Control, etc.
Enter Monte-Carlo Planning
20
Approaches for Large Worlds:
Monte-Carlo Planning
 Often a simulator of a planning domain is available
or can be learned from data
Fire & Emergency Response
Klondike Solitaire
21
Large Worlds: Monte-Carlo Approach
 Often a simulator of a planning domain is available
or can be learned from data
 Monte-Carlo Planning: compute a good policy for
an MDP by interacting with an MDP simulator
[Diagram: the planner interacts with a World Simulator in place of the Real World — it sends an action and receives the next state + reward.]
22
Example Domains with Simulators
 Traffic simulators
 Robotics simulators
 Military campaign simulators
 Computer network simulators
 Emergency planning simulators
 large-scale disaster and municipal
 Sports domains
 Board games / Video games
 Go / RTS
In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where model-based planners are applicable.
23
MDP: Simulation-Based Representation
 A simulation-based representation gives: S, A, R, T, I:
   finite state set S (|S| = n and is generally very large)
   finite action set A (|A| = m and will assume is of reasonable size)
   Stochastic, real-valued, bounded reward function R(s,a) = r
     Stochastically returns a reward r given input s and a
   Stochastic transition function T(s,a) = s’ (i.e. a simulator)
     Stochastically returns a state s’ given input s and a
     Probability of returning s’ is dictated by Pr(s’ | s,a) of MDP
   Stochastic initial state function I
     Stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language!
24
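For instance, such a simulator could be wrapped in a small Python interface like the sketch below; the class and method names are my own, not part of the lecture:

import random

class MDPSimulator:
    """Illustrative simulation-based MDP with stochastic R, T, and I."""

    def __init__(self, transition, reward, initial_states):
        self._transition = transition          # dict {(s, a): {s': prob}}
        self._reward = reward                  # function (s, a) -> r, possibly stochastic
        self._initial_states = initial_states  # list of possible start states

    def I(self):
        """Stochastically return an initial state."""
        return random.choice(self._initial_states)

    def R(self, s, a):
        """Stochastically return a reward for taking action a in state s."""
        return self._reward(s, a)

    def T(self, s, a):
        """Stochastically return a next state s' with probability Pr(s' | s, a)."""
        dist = self._transition[(s, a)]
        return random.choices(list(dist.keys()), weights=list(dist.values()))[0]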
Computational Problems
 Policy evaluation:
 Given an MDP simulator, a policy 𝜋, a state s, and horizon h
 Compute finite-horizon value function V^h_π(s)
 Policy optimization:
 Given an MDP simulator, a horizon H, and a state s
 Compute the optimal finite-horizon policy at state s
25
Trajectories
 We can use the simulator to observe trajectories of
any policy π from any state s:
 Let Traj(s, π , h) be a stochastic function that
returns a length h trajectory of π starting at s.
 Traj(s, π, h)
   s_0 = s
   For i = 1 to h−1
     s_i = T(s_{i−1}, π(s_{i−1}))
   Return s_0, s_1, …, s_{h−1}
 The total reward of a trajectory is given by

$R(s_0, \ldots, s_{h-1}) = \sum_{i=0}^{h-1} R(s_i)$
26
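A direct Python rendering of this procedure, assuming the hypothetical MDPSimulator interface sketched earlier; since that simulator's reward takes both a state and an action, the total reward below is evaluated as R(s, π(s)):

def traj(sim, s, pi, h):
    """Sample a length-h trajectory of policy pi starting at state s."""
    states = [s]
    for _ in range(h - 1):
        states.append(sim.T(states[-1], pi(states[-1])))
    return states

def total_reward(sim, states, pi):
    """Total reward accumulated along a sampled trajectory."""
    return sum(sim.R(s, pi(s)) for s in states)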
Policy Evaluation
 Given a policy π and a state s, how can we estimate V^h_π(s)?
 Simply sample w trajectories starting at s and average the total reward received
 Select a sampling width w and horizon h.
 The function EstimateV(s, π, h, w) returns an estimate of V^h_π(s)

EstimateV(s, π, h, w)
  V = 0
  For i = 1 to w
    V = V + R(Traj(s, π, h))   ; add total reward of a trajectory
  Return V / w

 How close to the true value V^h_π(s) will this estimate be?
27
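And the corresponding Python sketch of EstimateV, built on the illustrative traj and total_reward helpers above:

def estimate_v(sim, s, pi, h, w):
    """Monte-Carlo estimate of V^h_pi(s): average total reward over w sampled trajectories."""
    total = 0.0
    for _ in range(w):
        total += total_reward(sim, traj(sim, s, pi, h), pi)
    return total / w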
Sampling-Error Bound
h 1
Vh ( s )  E [   t R t |  , s ]  E[ R(Traj ( s,  , h))]
t 0
approximation due to sampling
w
Estim ateV(s,  , h, w)  w1  ri , ri  R(Traj( s,  , h))
i 1
• Note that the ri are samples of random variable R(Traj(s, π , h))
• We can apply the additive Chernoff bound which bounds the
difference between an expectation and an emprical average
28
Aside: Additive Chernoff Bound
• Let X be a random variable with maximum absolute value Z.
And let x_i, i = 1,…,w, be i.i.d. samples of X
• The Chernoff bound gives a bound on the probability that the
average of the x_i is far from E[X]

Let {x_i | i = 1,…,w} be i.i.d. samples of random variable X;
then with probability at least 1 − δ we have that

$\big|\,E[X] - \tfrac{1}{w}\sum_{i=1}^{w} x_i\,\big| \;\le\; Z\sqrt{\tfrac{1}{w}\ln\tfrac{1}{\delta}}$

equivalently

$\Pr\Big(\big|\,E[X] - \tfrac{1}{w}\sum_{i=1}^{w} x_i\,\big| \ge \epsilon\Big) \;\le\; \exp\!\big(-(\epsilon/Z)^{2}\, w\big)$
29
Aside: Coin Flip Example
• Suppose we have a coin with probability of heads equal to p.
• Let X be a random variable where X=1 if the coin flip
gives heads and zero otherwise. (so Z from bound is 1)
E[X] = 1*p + 0*(1-p) = p
• After flipping a coin w times we can estimate the heads prob.
by average of xi.
• The Chernoff bound tells us that this estimate converges
exponentially fast to the true mean (coin bias) p.
$\Pr\Big(\big|\,p - \tfrac{1}{w}\sum_{i=1}^{w} x_i\,\big| \ge \epsilon\Big) \;\le\; \exp\!\big(-\epsilon^{2} w\big)$
30
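A quick empirical check of this statement (illustrative code, not from the lecture): estimate p from w flips and compare the deviation to the level ε = sqrt((1/w) ln(1/δ)) allowed with probability at least 1 − δ:

import math
import random

p, delta = 0.3, 0.05
for w in [10, 100, 1000, 10000]:
    estimate = sum(random.random() < p for _ in range(w)) / w
    eps = math.sqrt(math.log(1 / delta) / w)   # deviation allowed with prob >= 1 - delta
    print(f"w={w:6d}  |p - estimate| = {abs(p - estimate):.4f}   eps = {eps:.4f}")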
Sampling Error Bound
h 1
V ( s, h)  E [   t R t |  , s ]  E[ R(Traj( s,  , h))]
t 0
approximation due to sampling
w
Estim ateV(s,  , h, w)  w1  ri , ri  R(Traj( s,  , h))
i 1
We get that,
V ( s )  EstimateV ( s,  , h, w)  Vmax
h
with probability at least
1
w
ln 1
1 
31
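One practical use of this bound (my own illustration): solving V_max · sqrt((1/w) ln(1/δ)) ≤ ε for w gives the sampling width w ≥ (V_max/ε)² ln(1/δ) needed for accuracy ε with probability at least 1 − δ:

import math

def required_width(v_max, eps, delta):
    """Sampling width w so that |V - EstimateV| <= eps with probability >= 1 - delta."""
    return math.ceil((v_max / eps) ** 2 * math.log(1 / delta))

print(required_width(v_max=10.0, eps=0.5, delta=0.05))   # roughly 1200 trajectories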
Two Player MDP (aka Markov Games)
• So far we have only discussed single-player MDPs/games
• Your labs and competition will be 2-player zero-sum games
(zero sum means sum of player rewards is zero)
• We assume players take turns (non-simultaneous moves)
[Diagram: in a Markov Game, Player 1 and Player 2 each send actions to the game and each receive the resulting state/reward.]
Simulators for 2-Player Games
 A simulation-based representation gives: S, A1, A2, R, T, I:
   finite state set S (assume state encodes whose turn it is)
   action sets A1 and A2 for player 1 (called max) and player 2 (called min)
   Stochastic, real-valued, bounded reward function R(s,a)
     Stochastically returns a reward r for max given input s and action a (it is assumed here that reward for min is -r)
     So min maximizes its reward by minimizing the reward of max
   Stochastic transition function T(s,a) = s’ (i.e. a simulator)
     Stochastically returns a state s’ given input s and a
     Probability of returning s’ is dictated by Pr(s’ | s,a) of game
     Generally s’ will be a turn for the other player
These stochastic functions can be implemented in any language!
33
Finite Horizon Value of Game
 Given two policies π1 and π2, one for each player, we can define the finite horizon value
   V^h_{π1,π2}(s) is the h-horizon value function with respect to player 1
   expected total reward for player 1 after executing π1 and π2 starting in s for h steps
   For zero-sum games the value with respect to player 2 is just −V^h_{π1,π2}(s)
 Given π1 and π2 we can easily use the simulator to estimate V^h_{π1,π2}(s)
   Just as we did for the single-player MDP setting
34
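Following the single-player EstimateV sketch, a hedged Python version for the two-player case might look like this; it assumes the game simulator exposes T and R as described (rewards reported for max) and that a helper turn_of(s) reads whose turn it is out of the state, all of which are my own conventions:

def traj_game(sim, s, pi1, pi2, h, turn_of):
    """Sample h states, letting pi1 or pi2 choose the action on its own turn."""
    states = [s]
    for _ in range(h - 1):
        cur = states[-1]
        a = pi1(cur) if turn_of(cur) == 1 else pi2(cur)
        states.append(sim.T(cur, a))
    return states

def estimate_game_value(sim, s, pi1, pi2, h, w, turn_of):
    """Monte-Carlo estimate of V^h_{pi1,pi2}(s) with respect to player 1 (max)."""
    total = 0.0
    for _ in range(w):
        states = traj_game(sim, s, pi1, pi2, h, turn_of)
        total += sum(sim.R(st, (pi1 if turn_of(st) == 1 else pi2)(st)) for st in states)
    return total / w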