
An Introduction to Markov Decision Processes

Sarah Hickmott

Decision Theory

• Probability Theory: describes what an agent should believe based on evidence.
• Utility Theory: describes what an agent wants.
• Decision Theory = Probability Theory + Utility Theory: describes what an agent should do.

Markov Assumption

• Andrei Markov (1913)
• kth-order Markov Assumption: the next state’s conditional probability depends only on a finite history of the previous k states (a kth-order Markov process).
• 1st-order Markov Assumption: the next state’s conditional probability depends only on its immediately previous state (a 1st-order Markov process).
• The definitions are equivalent: a kth-order process becomes a 1st-order process once the state is augmented to include the previous k values.

Any algorithm that makes the 1st-order Markov Assumption can therefore be applied to any Markov Process.
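As a brief illustration of why the definitions coincide, here is a minimal Python sketch (an assumed example, not from the slides) that lifts a hypothetical 2nd-order transition function `p2` to a 1st-order one over pairs of states:

```python
# Sketch: any 2nd-order Markov process can be viewed as a 1st-order process
# whose states are pairs (previous, current). `p2` is a hypothetical
# 2nd-order transition function p2(next | prev, cur).

def lift_to_first_order(p2):
    """Return a 1st-order transition function over augmented states (prev, cur)."""
    def p1(next_pair, cur_pair):
        prev, cur = cur_pair
        cur2, nxt = next_pair
        # Transitions are consistent only if the shared history overlaps.
        if cur2 != cur:
            return 0.0
        return p2(nxt, prev, cur)
    return p1
```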

Markov Decision Process

The specification of a sequential decision problem for a fully observable environment that satisfies the Markov Assumption and yields additive costs.

Markov Decision Process

An MDP has:

• A set of states S = {s_1, s_2, ..., s_N}
• A set of actions A = {a_1, a_2, ..., a_M}
• A real-valued cost function g(s, a)
• A transition probability function p(s' | s, a)

Note: we will assume the stationary Markov transition property, which states that the effect of an action is independent of time.
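As a concrete encoding of these four ingredients, a toy MDP might look like the following Python sketch (the states, actions, costs, and probabilities are illustrative values, not from the slides):

```python
# A minimal sketch of the MDP ingredients above; all values are made up.

states = ["s1", "s2"]
actions = ["a1", "a2"]

# Cost function g(s, a).
g = {
    ("s1", "a1"): 1.0, ("s1", "a2"): 2.0,
    ("s2", "a1"): 0.0, ("s2", "a2"): 3.0,
}

# Stationary transition probabilities p(s' | s, a): independent of time.
p = {
    ("s1", "a1"): {"s1": 0.9, "s2": 0.1},
    ("s1", "a2"): {"s1": 0.2, "s2": 0.8},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s1": 0.0, "s2": 1.0},
}
```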

Notation

• k indexes discrete time
• N is the horizon, or number of times the control is applied
• x_k is the state of the system at time k
• μ_k(x_k) is the control variable to be selected given the system is in state x_k at time k; μ_k : S_k → A_k
• π = {μ_0, ..., μ_{N-1}} is a policy
• π* is the optimal policy
• The system evolves according to x_{k+1} = f(x_k, μ_k(x_k)), k = 0, ..., N-1

Policy

A policy is a mapping from states to actions. Following a policy:

1. Determine the current state x_k
2. Execute the action μ_k(x_k)
3. Repeat 1-2
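A minimal sketch of this loop, assuming a `policy` dict mapping states to actions and the dict-based transition function `p` from the MDP sketch above:

```python
import random

def follow_policy(policy, p, x0, horizon):
    """Follow a (stationary) policy: observe the state, execute its action, repeat.

    `policy` maps states to actions; `p[(s, a)]` is a dict of next-state
    probabilities in the (assumed) format of the MDP sketch above.
    """
    x = x0
    trajectory = [x]
    for _ in range(horizon):
        a = policy[x]                       # action mu_k(x_k) for the current state
        dist = p[(x, a)]                    # transition distribution p(. | x, a)
        x = random.choices(list(dist), weights=list(dist.values()))[0]
        trajectory.append(x)
    return trajectory
```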

Solution to an MDP

The expected cost of a policy π = {μ_0, ..., μ_{N-1}} starting at state x_0 is:

J_π(x_0) = E{ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(x_k)) }
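This expectation can be approximated by simulation. A rough Monte Carlo sketch, reusing the assumed dict-based `g` and `p` structures and a hypothetical terminal-cost dict `g_N`:

```python
import random

def estimate_cost(policy, p, g, g_N, x0, N, num_runs=1000):
    """Monte Carlo estimate of J_pi(x0) for a finite horizon N (sketch)."""
    total = 0.0
    for _ in range(num_runs):
        x, run_cost = x0, 0.0
        for _ in range(N):
            a = policy[x]
            run_cost += g[(x, a)]            # stage cost g_k(x_k, mu_k(x_k))
            dist = p[(x, a)]
            x = random.choices(list(dist), weights=list(dist.values()))[0]
        run_cost += g_N[x]                   # terminal cost g_N(x_N)
        total += run_cost
    return total / num_runs
```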

Goal: find the policy π* which specifies which action to take in each state, so as to minimise the cost function. This is encapsulated by Bellman’s Equation:

J_{π*}(x_0) = min_π J_π(x_0)

Assigning Costs to Sequences

The objective cost function maps infinite sequences of costs to single real numbers. Options:

• Set a finite horizon and simply add the costs.
• If the horizon is infinite, i.e. N → ∞, some possibilities are:
  – Discount to prefer earlier costs:
    J_π(x_0) = lim_{N→∞} E{ Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k)) }
  – Average the cost per stage:
    J_π(x_0) = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N-1} g(x_k, μ_k(x_k)) }
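For a single simulated sequence of stage costs, the two objectives look like this (a sketch; `alpha` and `costs` are hypothetical values):

```python
# Sketch: the two infinite-horizon objectives, truncated to a finite sample
# of stage costs (made-up values).

costs = [1.0, 0.5, 2.0, 0.0, 1.5]
alpha = 0.9

discounted = sum(alpha**k * c for k, c in enumerate(costs))   # prefer earlier costs
average_per_stage = sum(costs) / len(costs)                   # average cost per stage
```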

MDP Algorithms

Value Iteration

For each state s, select any initial value J_0(s).
k = 1
while k < maximum iterations:
    for each state s, find the action a that minimises the equation
        J_k(s) = g(s, a) + Σ_{s'} p(s' | s, a) J_{k-1}(s')
    then assign μ(s) = a
    k = k + 1
end
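A compact Python sketch of this loop, assuming the dict-based MDP format from the earlier sketch (not the slides' notation):

```python
def value_iteration(states, actions, g, p, max_iters=100):
    """Value iteration sketch for the assumed dict-based MDP format.

    `g[(s, a)]` is the stage cost and `p[(s, a)]` a dict of next-state probabilities.
    """
    J = {s: 0.0 for s in states}                  # arbitrary initial values J_0(s)
    policy = {}
    for _ in range(max_iters):
        J_new = {}
        for s in states:
            # Find the action minimising g(s, a) + sum_{s'} p(s'|s, a) J(s').
            best_a, best_cost = None, float("inf")
            for a in actions:
                cost = g[(s, a)] + sum(prob * J[s2] for s2, prob in p[(s, a)].items())
                if cost < best_cost:
                    best_a, best_cost = a, cost
            J_new[s] = best_cost
            policy[s] = best_a                    # then assign mu(s) = a
        J = J_new
    return J, policy
```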

MDP Algorithms

Policy Iteration

Start with a randomly selected initial policy, then refine it repeatedly.

Value Determination: solve the |S| simultaneous Bellman equations for the current policy μ_i:

J_i(s) = g(s, μ_i(s)) + Σ_{s'} p(s' | s, μ_i(s)) J_i(s')

Policy Improvement: for any state, if an action exists which reduces the current estimated cost, then change it in the policy.

Each step of Policy Iteration is computationally more expensive than a step of Value Iteration. However, Policy Iteration typically needs fewer steps to converge than Value Iteration.
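A sketch of both steps in Python, again assuming the dict-based MDP format; a discount factor `alpha` is added here (an assumption, not on the slide) so the value-determination linear system has a unique solution:

```python
import numpy as np

def policy_iteration(states, actions, g, p, alpha=0.9, max_iters=100):
    """Policy iteration sketch for the assumed dict-based MDP format."""
    idx = {s: i for i, s in enumerate(states)}
    policy = {s: actions[0] for s in states}        # arbitrary initial policy
    for _ in range(max_iters):
        # Value determination: solve the |S| simultaneous Bellman equations
        #   J(s) = g(s, mu(s)) + alpha * sum_{s'} p(s'|s, mu(s)) J(s')
        P = np.zeros((len(states), len(states)))
        c = np.zeros(len(states))
        for s in states:
            a = policy[s]
            c[idx[s]] = g[(s, a)]
            for s2, prob in p[(s, a)].items():
                P[idx[s], idx[s2]] = prob
        J = np.linalg.solve(np.eye(len(states)) - alpha * P, c)
        # Policy improvement: change the action in any state where a cheaper one exists.
        improved = False
        for s in states:
            best_a = min(actions,
                         key=lambda a: g[(s, a)]
                         + alpha * sum(prob * J[idx[s2]] for s2, prob in p[(s, a)].items()))
            if best_a != policy[s]:
                policy[s], improved = best_a, True
        if not improved:
            break
    return policy, {s: J[idx[s]] for s in states}
```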

MDPs and PNs

• MDPs modeled by live Petri nets lead to Average Cost per Stage problems.
• A policy is equivalent to a trace through the net.
• The aim is to use the finite prefix of an unfolding to derive decentralised Bellman equations, possibly associated with local configurations, and the communication between interacting parts.

• Initially we will assume actions and their effects are deterministic.

• Some work has been done unfolding Petri nets such that concurrent events are statistically independent.