Markov Decision Processes
Basic Concepts
Alan Fern
1
Some AI Planning Problems
Fire & Rescue Response Planning
Helicopter Control
Solitaire
Real-Time Strategy Games
Legged Robot Control
Network Security/Control
Some AI Planning Problems
Health Care
  Personalized treatment planning
  Hospital logistics/scheduling
Transportation
  Autonomous vehicles
  Supply chain logistics
  Air traffic control
Assistive Technologies
  Dialog management
  Automated assistants for the elderly/disabled
  Household robots
Sustainability
  Smart grid
  Forest fire management
…
3
Common Elements
We have a system that changes state over time
Can (partially) control the system state transitions by taking actions
Problem gives an objective that specifies which states (or state sequences) are more/less preferred
Problem: at each moment we must select an action to optimize the overall (long-term) objective
  Produce the most preferred state sequences
4
Observe-Act Loop of AI Planning Agent
[Diagram: the planning agent (????) observes the state of the world/system and sends actions back to the world/system.]
Goal: maximize expected reward over lifetime
5
Stochastic/Probabilistic Planning:
Markov Decision Process (MDP) Model
[Diagram: the agent (????) receives the world state from a Markov Decision Process and chooses an action from a finite set.]
Goal: maximize expected reward over lifetime
6
Example MDP
State describes all visible info about the dice
Actions are the different choices of dice to roll
Goal: maximize score, or score more than the opponent
7
Markov Decision Processes
An MDP has four components: S, A, R, T:
finite state set S (|S| = n)
finite action set A (|A| = m)
transition function T(s,a,s’) = Pr(s’ | s,a)
Probability of going to state s’ after taking action a in state s
bounded, real-valued reward function R(s,a)
Immediate reward we get for being in state s and taking action a
Roughly speaking the objective is to select actions in order to
maximize total reward
For example in a goal-based domain R(s,a) may equal 1 for goal
states and 0 for all others (or -1 reward for non-goal states)
8
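As a concrete illustration (not part of the lecture), here is a minimal sketch of the four components as an explicit tabular MDP in Python; the two states, two actions, and all numbers are made up:

```python
# A tiny, hypothetical explicit MDP with |S| = 2 states and |A| = 2 actions.
mdp = {
    "S": [0, 1],                       # finite state set S
    "A": ["stay", "move"],             # finite action set A
    # T[s][a][s'] = Pr(s' | s, a); each row sums to 1
    "T": {
        0: {"stay": {0: 0.9, 1: 0.1}, "move": {0: 0.2, 1: 0.8}},
        1: {"stay": {0: 0.1, 1: 0.9}, "move": {0: 0.7, 1: 0.3}},
    },
    # R[s][a] = bounded, real-valued immediate reward; here a goal-based
    # reward of 1 for acting in the goal state 1 and 0 elsewhere
    "R": {
        0: {"stay": 0.0, "move": 0.0},
        1: {"stay": 1.0, "move": 1.0},
    },
}
```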
[Diagram: an example state of the dice; the action "roll the two 5's" leads by a probabilistic state transition to one of the outcomes 1,1; 1,2; …; 6,6.]
What is a solution to an MDP?
MDP Planning Problem:
Input: an MDP (S,A,R,T)
Output: ????
Should the solution to an MDP be just a sequence of
actions such as (a1,a2,a3, ….) ?
Consider a single player card game like Blackjack/Solitaire.
No! In general an action sequence is not sufficient
Actions have stochastic effects, so the state we end up in is
uncertain
This means that we might end up in states where the remainder
of the action sequence doesn’t apply or is a bad choice
A solution should tell us what the best action is for any possible
situation/state that might arise
10
Policies (“plans” for MDPs)
A solution to an MDP is a policy
Two types of policies: nonstationary and stationary
Nonstationary policies are used when we are given a
finite planning horizon H
I.e. we are told how many actions we will be allowed to take
Nonstationary policies are functions from states and
times to actions
π: S × T → A, where T here denotes the non-negative integers (stages-to-go), not the transition function
π(s,t) tells us what action to take at state s when there are t
stages-to-go (note that we are using the convention that t
represents stages/decisions to go, rather than the time step)
11
Policies (“plans” for MDPs)
What if we want to continue taking actions indefinitely?
Use stationary policies
A stationary policy is a mapping from states to actions
π:S → A
π(s) is action to do at state s (regardless of time)
specifies a continuously reactive controller
Note that both nonstationary and stationary policies
assume or have these properties:
full observability of the state
history-independence
deterministic action choice
12
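As an illustration (mine, not the lecture's), the two policy types for the tiny example MDP above can be written as plain Python callables; the particular action choices are arbitrary:

```python
def stationary_policy(s):
    # pi: S -> A -- the action depends only on the state s
    return "move" if s == 0 else "stay"

def nonstationary_policy(s, t):
    # pi: S x T -> A -- t is the number of stages-to-go
    if t <= 1:
        return "stay"          # e.g. act differently when almost out of time
    return stationary_policy(s)
```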
What is a solution to an MDP?
MDP Planning Problem:
Input: an MDP (S,A,R,T)
Output: a policy such that ????
We don’t want to output just any policy
We want to output a “good” policy
One that accumulates lots of reward
13
Value of a Policy
How good is a policy π?
How do we measure reward “accumulated” by π?
Value function V: S →ℝ associates value with each
state (or each state and time for non-stationary π)
Vπ(s) denotes value of policy at state s
Depends on immediate reward, but also what you achieve
subsequently by following π
An optimal policy is one that is no worse than any other
policy at any state
The goal of MDP planning is to compute an optimal
policy
14
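In symbols, the optimality condition in the bullets above reads:

```latex
\pi^{*} \text{ is optimal} \iff V_{\pi^{*}}(s) \;\ge\; V_{\pi}(s) \quad \text{for every policy } \pi \text{ and every state } s
```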
What is a solution to an MDP?
MDP Planning Problem:
Input: an MDP (S,A,R,T)
Output: a policy that achieves an “optimal value”
This depends on how we define the value of a policy
There are several choices and the solution algorithms
depend on the choice
We will consider finite-horizon value
15
Finite-Horizon Value Functions
We first consider maximizing expected total reward
over a finite horizon
Assumes the agent has H time steps to live
To act optimally, should the agent use a stationary
or non-stationary policy?
I.e. Should the action it takes depend on absolute time?
Put another way:
If you had only one week to live would you act the same
way as if you had fifty years to live?
16
Finite Horizon Problems
Value (utility) depends on stage-to-go
hence use a nonstationary policy
$V_\pi^k(s)$ is the k-stage-to-go value function for π
  expected total reward for executing π starting in s for k time steps

$$V_\pi^k(s) \;=\; E_\pi\!\left[\sum_{t=0}^{k} R^t \,\middle|\, \pi, s\right] \;=\; E\!\left[\sum_{t=0}^{k} R(s^t, a^t) \,\middle|\, a^t = \pi(s^t, k-t),\; s^0 = s\right]$$

Here $R^t$ and $s^t$ are random variables denoting the reward received and the state at time-step t when starting in s
These are random variables since the world is stochastic
17
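The expectation above can also be unrolled one stage at a time. Under this slide's convention (rewards collected at stages k down to 0), one standard recursive form, stated here for reference rather than taken verbatim from the slides, is:

```latex
V^{0}_{\pi}(s) \;=\; R\big(s, \pi(s,0)\big), \qquad
V^{k}_{\pi}(s) \;=\; R\big(s, \pi(s,k)\big) \;+\; \sum_{s'} T\big(s, \pi(s,k), s'\big)\, V^{k-1}_{\pi}(s')
```

This recursion is what the dynamic-programming methods on the next slides exploit.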
Computational Problems
There are two problems that are of typical interest
Policy evaluation:
Given an MDP and a policy π
Compute the finite-horizon value function $V_\pi^k(s)$ for any k and s
Policy optimization:
Given an MDP and a horizon H
Compute the optimal finite-horizon policy
How many finite-horizon policies are there?
  $|A|^{Hn}$ (one of |A| actions for each of the n states at each of the H stages)
So we can't just enumerate policies for efficient optimization
19
Computational Problems
Dynamic programming techniques can be used for both
policy evaluation and optimization
Polynomial time in # of states and actions
http://web.engr.oregonstate.edu/~afern/classes/cs533/
Is polytime in # of states and actions good?
Not when these numbers are enormous!
As is the case for most realistic applications
Consider Klondike Solitaire, computer network control, etc.
Enter Monte-Carlo Planning
20
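To make the polynomial-time claim concrete, here is a minimal sketch (assuming the hypothetical tabular MDP dictionary from the earlier example; not the course's code) of finite-horizon dynamic programming for policy optimization, following the recursion above with the fixed policy action replaced by a max over actions:

```python
def finite_horizon_value_iteration(mdp, H):
    """Sketch: optimal k-stage-to-go values V[k][s] and a nonstationary policy
    pi[k][s], in time polynomial in |S|, |A|, and H."""
    S, A, T, R = mdp["S"], mdp["A"], mdp["T"], mdp["R"]
    V, pi = {}, {}
    # stage 0: only an immediate reward remains
    pi[0] = {s: max(A, key=lambda a: R[s][a]) for s in S}
    V[0] = {s: R[s][pi[0][s]] for s in S}
    for k in range(1, H + 1):
        V[k], pi[k] = {}, {}
        for s in S:
            # Q(s, a) = immediate reward + expected (k-1)-stage-to-go value
            q = {a: R[s][a] + sum(T[s][a][s2] * V[k - 1][s2] for s2 in S)
                 for a in A}
            pi[k][s] = max(q, key=q.get)   # best action with k stages to go
            V[k][s] = q[pi[k][s]]
    return V, pi
```

Policy evaluation for a fixed π is the same loop with the max replaced by the action π(s, k).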
Approaches for Large Worlds:
Monte-Carlo Planning
Often a simulator of a planning domain is available
or can be learned from data
Fire & Emergency Response
Klondike Solitaire
21
Large Worlds: Monte-Carlo Approach
Often a simulator of a planning domain is available
or can be learned from data
Monte-Carlo Planning: compute a good policy for
an MDP by interacting with an MDP simulator
[Diagram: the planner interacts with a World Simulator, sending actions and receiving states and rewards, in place of the Real World.]
22
Example Domains with Simulators
Traffic simulators
Robotics simulators
Military campaign simulators
Computer network simulators
Emergency planning simulators
large-scale disasters and municipal emergencies
Sports domains
Board games / Video games
Go / RTS
In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where model-based planners are applicable.
23
MDP: Simulation-Based Representation
A simulation-based representation gives: S, A, R, T, I:
finite state set S (|S|=n and is generally very large)
finite action set A (|A| = m, which we assume is of reasonable size)
Stochastic, real-valued, bounded reward function R(s,a) = r
Stochastically returns a reward r given input s and a
Stochastic transition function T(s,a) = s’ (i.e. a simulator)
Stochastically returns a state s’ given input s and a
Probability of returning s’ is dictated by Pr(s’ | s,a) of MDP
Stochastic initial state function I.
Stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language!
24
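As a sketch of what such a simulation-based representation might look like in code (illustrative only; the two states, action names, and probabilities below are made up):

```python
import random

def sample_initial_state():
    """I: stochastically returns a state from the initial state distribution."""
    return random.choice([0, 1])

def sample_reward(s, a):
    """R(s, a) = r: stochastically returns a bounded reward given s and a."""
    return 1.0 if s == 1 else 0.0          # deterministic here for simplicity

def sample_next_state(s, a):
    """T(s, a) = s': stochastically returns s' with probability Pr(s' | s, a)."""
    p_stay = 0.9 if a == "stay" else 0.2
    return s if random.random() < p_stay else 1 - s
```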
Computational Problems
Policy evaluation:
Given an MDP simulator, a policy 𝜋, a state s, and horizon h
Compute finite-horizon value function 𝑉𝜋ℎ (𝑠)
Policy optimization:
Given an MDP simulator, a horizon H, and a state s
Compute the optimal finite-horizon policy at state s
25
Trajectories
We can use the simulator to observe trajectories of
any policy π from any state s:
Let Traj(s, π , h) be a stochastic function that
returns a length h trajectory of π starting at s.
Traj(s, π , h)
s0 = s
For i = 1 to h-1
si = T(si-1, π(si-1))
Return s0, s1, …, sh-1
The total reward of a trajectory is given by

$$R(s_0, \ldots, s_{h-1}) \;=\; \sum_{i=0}^{h-1} R(s_i)$$
26
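A runnable version of the Traj pseudocode, written against the hypothetical simulator sketch from a few slides back (the names traj and trajectory_reward are mine, not the lecture's):

```python
def traj(s, pi, h):
    """Sample a length-h trajectory of policy pi starting at s."""
    states = [s]
    for _ in range(1, h):
        # s_i = T(s_{i-1}, pi(s_{i-1}))
        states.append(sample_next_state(states[-1], pi(states[-1])))
    return states

def trajectory_reward(states, pi):
    """Total reward of a trajectory: sum over its states of the sampled
    reward for the action pi chooses there."""
    return sum(sample_reward(s, pi(s)) for s in states)
```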
Policy Evaluation
Given a policy π and a state s, how can we estimate 𝑉𝜋ℎ (𝑠)?
Simply sample w trajectories starting at s and average the
total reward received
Select a sampling width w and horizon h.
The function EstimateV(s, π, h, w) returns an estimate of $V_\pi^h(s)$
EstimateV(s, π , h, w)
V=0
For i = 1 to w
V = V + R( Traj(s, π , h) ) ; add total reward of a trajectory
Return V / w
How close to the true value $V_\pi^h(s)$ will this estimate be?
27
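The same procedure in runnable form, reusing the traj and trajectory_reward sketches above (again mine, not the course's code):

```python
def estimate_v(s, pi, h, w):
    """Monte-Carlo estimate of V_pi^h(s): average total reward of w sampled
    length-h trajectories of pi starting at s."""
    v = 0.0
    for _ in range(w):
        v += trajectory_reward(traj(s, pi, h), pi)
    return v / w
```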
Sampling-Error Bound
$$V_\pi^h(s) \;=\; E_\pi\!\left[\sum_{t=0}^{h-1} R^t \,\middle|\, \pi, s\right] \;=\; E\big[\,R(\mathrm{Traj}(s, \pi, h))\,\big]$$

approximation due to sampling:

$$\mathrm{EstimateV}(s, \pi, h, w) \;=\; \frac{1}{w}\sum_{i=1}^{w} r_i, \qquad r_i \sim R(\mathrm{Traj}(s, \pi, h))$$

• Note that the $r_i$ are samples of the random variable R(Traj(s, π, h))
• We can apply the additive Chernoff bound, which bounds the difference between an expectation and an empirical average
28
Aside: Additive Chernoff Bound
• Let X be a random variable with maximum absolute value Z,
  and let $x_i$, i = 1,…,w, be i.i.d. samples of X
• The Chernoff bound bounds the probability that the average of the $x_i$ is far from E[X]

Let {$x_i$ | i = 1,…,w} be i.i.d. samples of random variable X; then with probability at least 1 − δ,

$$\left|\, E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \,\right| \;\le\; Z\sqrt{\frac{1}{w}\ln\frac{1}{\delta}}$$

equivalently,

$$\Pr\!\left(\left|\, E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \,\right| \ge \epsilon\right) \;\le\; \exp\!\left(-w\left(\frac{\epsilon}{Z}\right)^{2}\right)$$
29
Aside: Coin Flip Example
• Suppose we have a coin with probability of heads equal to p.
• Let X be a random variable where X = 1 if the coin flip gives heads and 0 otherwise (so Z from the bound is 1).
  E[X] = 1·p + 0·(1 − p) = p
• After flipping the coin w times we can estimate the heads probability by the average of the $x_i$.
• The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (coin bias) p:

$$\Pr\!\left(\left|\, p - \frac{1}{w}\sum_{i=1}^{w} x_i \,\right| \ge \epsilon\right) \;\le\; \exp\!\left(-w\,\epsilon^{2}\right)$$
30
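A quick simulation of the coin-flip example (mine, not from the slides), checking the empirical failure probability against the bound:

```python
import math
import random

def coin_flip_demo(p=0.3, w=1000, eps=0.05, trials=2000):
    """The fraction of runs whose empirical mean is at least eps away from p
    should not exceed exp(-w * eps^2)."""
    bad = 0
    for _ in range(trials):
        mean = sum(random.random() < p for _ in range(w)) / w
        if abs(mean - p) >= eps:
            bad += 1
    return bad / trials, math.exp(-w * eps ** 2)

# Typically the empirical failure rate here is essentially 0, well below the
# bound exp(-1000 * 0.05**2) = exp(-2.5) ≈ 0.082.
```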
Sampling Error Bound
$$V_\pi^h(s) \;=\; E_\pi\!\left[\sum_{t=0}^{h-1} R^t \,\middle|\, \pi, s\right] \;=\; E\big[\,R(\mathrm{Traj}(s, \pi, h))\,\big]$$

approximation due to sampling:

$$\mathrm{EstimateV}(s, \pi, h, w) \;=\; \frac{1}{w}\sum_{i=1}^{w} r_i, \qquad r_i \sim R(\mathrm{Traj}(s, \pi, h))$$

We get that, with probability at least 1 − δ,

$$\left|\, V_\pi^h(s) - \mathrm{EstimateV}(s, \pi, h, w) \,\right| \;\le\; V_{\max}\sqrt{\frac{1}{w}\ln\frac{1}{\delta}}$$

where $V_{\max}$ is a bound on the total reward of any length-h trajectory (playing the role of Z in the Chernoff bound).
31
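Rearranging the bound gives a standard sample-size rule of thumb (not stated on the slide): to guarantee error at most ε with probability at least 1 − δ, it suffices to take

```latex
V_{\max}\sqrt{\frac{1}{w}\ln\frac{1}{\delta}} \;\le\; \epsilon
\quad\Longleftrightarrow\quad
w \;\ge\; \left(\frac{V_{\max}}{\epsilon}\right)^{2} \ln\frac{1}{\delta}
```

so the number of sampled trajectories grows only logarithmically in 1/δ and quadratically in V_max/ε.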
Two Player MDP (aka Markov Games)
• So far we have only discussed single-player MDPs/games
• Your labs and competition will be 2-player zero-sum games
(zero sum means sum of player rewards is zero)
• We assume players take turns (non-simultaneous moves)
[Diagram: Player 1 and Player 2 alternately send actions to the Markov Game, and each receives back the resulting state and reward.]
Simulators for 2-Player Games
A simulation-based representation gives: 𝑆, 𝐴1 , 𝐴2 , 𝑅, 𝑇, 𝐼:
finite state set S (assume state encodes whose turn it is)
action sets 𝐴1 and 𝐴2 for player 1 (called max) and player 2 (called min)
Stochastic, real-valued, bounded reward function R(s, a)
Stochastically returns a reward r for max given input s and action a
(it is assumed here that reward for min is -r)
So min maximizes its reward by minimizing the reward of max
Stochastic transition function T(s,a) = s’ (i.e. a simulator)
Stochastically returns a state s’ given input s and a
Probability of returning s’ is dictated by Pr(s’ | s,a) of game
Generally s' will be a state in which it is the other player's turn
These stochastic functions can be implemented in any language!
33
Finite Horizon Value of Game
Given two policies π1 and π2, one for each player, we can define the finite-horizon value
  $V^h_{\pi_1,\pi_2}(s)$ is the h-horizon value function with respect to player 1
    expected total reward for player 1 after executing π1 and π2 starting in s for h steps
  For zero-sum games the value with respect to player 2 is just $-V^h_{\pi_1,\pi_2}(s)$
Given π1 and π2 we can easily use the simulator to estimate $V^h_{\pi_1,\pi_2}(s)$
Just as we did for the single-player MDP setting
34
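As on the single-player side, a minimal sketch of that estimate (assumptions: a hypothetical whose_turn(s) helper, since the state is assumed to encode whose turn it is, plus the simulator-style sample_reward / sample_next_state functions from earlier):

```python
def estimate_game_value(s, pi1, pi2, h, w, whose_turn):
    """Monte-Carlo estimate of V^h_{pi1,pi2}(s) for player 1 (max).
    whose_turn(s) returns 1 or 2; sample_reward(s, a) is assumed to return
    max's reward, min's reward being its negation."""
    total = 0.0
    for _ in range(w):
        state = s
        for _ in range(h):
            a = pi1(state) if whose_turn(state) == 1 else pi2(state)
            total += sample_reward(state, a)
            state = sample_next_state(state, a)
    return total / w
```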