KI2 - 10
Markov Decision Processes
AIMA, Chapter 17
Kunstmatige Intelligentie / RuG
Markov Decision Problem

How to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the payoffs will not be obtained until several (or many) actions have passed.
The Solution

Sequential decision problems in uncertain environments can be solved by calculating a policy that associates an optimal decision with every state that the agent might reach
=> Markov Decision Process (MDP)
Example

The world: a 4 x 3 grid of states with the start state at (1,1), a terminal state with reward +1 at (4,3), a terminal state with reward -1 at (4,2), and an obstacle at (2,2).

Actions have uncertain consequences: the intended move succeeds with probability 0.8, and with probability 0.1 each the agent slips to one of the two perpendicular directions.

[Figure: the 4 x 3 grid world with the start state, the +1 and -1 terminal states, and the 0.8 / 0.1 / 0.1 transition model]
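To make the transition model concrete, here is a minimal Python sketch of the 4 x 3 grid world as an MDP. All identifiers (STATES, transition, and so on) are illustrative choices, not code from the course, and the step reward of -0.04 is an assumption the slides do not state.

```python
# Minimal 4 x 3 grid-world MDP, following the transition model on the slide:
# the intended move succeeds with probability 0.8, and with probability 0.1
# each the agent slips to one of the two perpendicular directions.

ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                 'left': ('up', 'down'), 'right': ('up', 'down')}

STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STEP_REWARD = -0.04          # assumed step cost; not given on the slides

def reward(s):
    return TERMINALS.get(s, STEP_REWARD)

def move(s, a):
    """Deterministic result of action a in state s (bumping a wall stays put)."""
    x, y = s
    dx, dy = ACTIONS[a]
    s2 = (x + dx, y + dy)
    return s2 if s2 in STATES else s

def transition(s, a):
    """Return {s': T(s, a, s')} under the 0.8 / 0.1 / 0.1 model."""
    if s in TERMINALS:       # terminal states have no outgoing transitions
        return {}
    probs = {}
    for prob, direction in [(0.8, a)] + [(0.1, p) for p in PERPENDICULAR[a]]:
        s2 = move(s, direction)
        probs[s2] = probs.get(s2, 0.0) + prob
    return probs
```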
Utility of a State Sequence

Additive rewards:
U_h([s_0, s_1, s_2, \dots]) = R(s_0) + R(s_1) + R(s_2) + \cdots

Discounted rewards:
U_h([s_0, s_1, s_2, \dots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots
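As a small illustration of the discounted case, the sketch below sums \gamma^t R(s_t) over a given state sequence; the function name and the example rewards are made up for illustration.

```python
def discounted_utility(rewards, gamma=0.9):
    """Return R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... for a finite sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps costing -0.04 each, then reaching the +1 terminal state.
print(discounted_utility([-0.04, -0.04, -0.04, 1.0], gamma=0.9))
```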
Utility of a State

The utility of each state is the expected sum of discounted rewards if the agent executes the policy \pi:

U^\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, \pi, s_0 = s \right]

The true utility of a state corresponds to the optimal policy \pi^*.
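Since the utility is an expectation over stochastic state sequences, one way to make it concrete is a Monte Carlo estimate: simulate many runs of the policy from s and average the discounted returns. This estimator is not part of the slides; it is only an illustrative sketch that reuses the hypothetical grid-world helpers above and assumes the policy is a dict mapping states to actions.

```python
import random

def estimate_utility(s, policy, gamma=0.9, episodes=10_000, horizon=100):
    """Monte Carlo estimate of U^pi(s): average discounted return over sampled runs."""
    total = 0.0
    for _ in range(episodes):
        state, discount, ret = s, 1.0, 0.0
        for _ in range(horizon):
            ret += discount * reward(state)
            if state in TERMINALS:
                break
            probs = transition(state, policy[state])
            state = random.choices(list(probs), weights=list(probs.values()))[0]
            discount *= gamma
        total += ret
    return total / episodes
```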
Algorithms for Calculating the Optimal Policy

• Value iteration
• Policy iteration
Value Iteration

• Calculate the utility of each state
• Then use the state utilities to select an optimal action in each state

\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s') \, U(s')
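A direct transcription of this extraction step into Python might look like the following; it reuses the hypothetical transition helper from the grid-world sketch and assumes U is a dict of state utilities.

```python
def extract_policy(U):
    """pi*(s) = argmax_a sum_{s'} T(s, a, s') U(s')."""
    policy = {}
    for s in STATES:
        if s in TERMINALS:
            continue
        policy[s] = max(ACTIONS, key=lambda a: sum(p * U[s2]
                                                   for s2, p in transition(s, a).items()))
    return policy
```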
Value Iteration Algorithm

function value-iteration(MDP) returns a utility function
  local variables: U, U', initially identical to R
  repeat
    U ← U'
    for each state s do
      U'(s) ← R(s) + \gamma \max_a \sum_{s'} T(s, a, s') \, U(s')        (Bellman update)
    end
  until close-enough(U, U')
  return U
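A runnable sketch of the same loop, under the same illustrative assumptions as the earlier grid-world helpers; the discount factor and convergence threshold are arbitrary choices, not values from the slides.

```python
def value_iteration(gamma=0.9, epsilon=1e-4):
    """Iterate the Bellman update U'(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U1 = {s: reward(s) for s in STATES}      # U' initially identical to R
    while True:
        U = U1.copy()                        # U <- U'
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            U1[s] = reward(s) + gamma * max(
                sum(p * U[s2] for s2, p in transition(s, a).items())
                for a in ACTIONS)
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < epsilon:                  # close-enough(U, U')
            return U1
```

The resulting utilities can then be passed to the extract_policy sketch above to obtain a policy.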
The Utilities of the States Obtained After Value Iteration

The utilities of the states computed by the value iteration algorithm:

          1       2       3       4
    3   0.812   0.868   0.912    +1
    2   0.762           0.660    -1
    1   0.705   0.655   0.611   0.388
Policy Iteration

• Pick a policy, then calculate the utility of each state given that policy (value determination step)
• Update the policy at each state using the utilities of the successor states
• Repeat until the policy stabilizes
Policy Iteration Algorithm

function policy-iteration(MDP) returns a policy
  local variables: U, a utility function; \pi, a policy
  repeat
    U ← value-determination(\pi, U, MDP, R)
    unchanged? ← true
    for each state s do
      if \max_a \sum_{s'} T(s, a, s') \, U(s') > \sum_{s'} T(s, \pi(s), s') \, U(s') then
        \pi(s) ← \arg\max_a \sum_{s'} T(s, a, s') \, U(s')
        unchanged? ← false
    end
  until unchanged?
  return \pi
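Under the same illustrative assumptions as before, a compact Python version of this loop could look as follows; it relies on a value_determination helper like the one sketched after the next slide, and the initial policy is arbitrary.

```python
def policy_iteration(gamma=0.9):
    """Alternate value determination and greedy policy improvement."""
    policy = {s: 'up' for s in STATES if s not in TERMINALS}   # arbitrary initial policy
    while True:
        U = value_determination(policy, gamma)                 # see sketch below
        unchanged = True
        for s in policy:
            def q(a):
                return sum(p * U[s2] for s2, p in transition(s, a).items())
            best = max(ACTIONS, key=q)
            if q(best) > q(policy[s]):
                policy[s] = best
                unchanged = False
        if unchanged:
            return policy, U
```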
Value Determination

• Simplification of the value iteration algorithm, because the policy is fixed
• Linear equations, because the max() operator has been removed: with \pi fixed, the utilities satisfy U(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') \, U(s')
• Solve exactly for the utilities using standard linear algebra
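One way to carry out the "standard linear algebra" step is to assemble the system (I - \gamma T_\pi) U = R and solve it with numpy. The sketch below again reuses the hypothetical grid-world helpers; the function name and interface are my own.

```python
import numpy as np

def value_determination(policy, gamma=0.9):
    """Solve U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s') as a linear system."""
    index = {s: i for i, s in enumerate(STATES)}
    n = len(STATES)
    A = np.eye(n)                                    # becomes I - gamma * T_pi
    b = np.array([reward(s) for s in STATES])
    for s in STATES:
        if s in TERMINALS:
            continue                                 # U(terminal) = R(terminal)
        for s2, p in transition(s, policy[s]).items():
            A[index[s], index[s2]] -= gamma * p
    U = np.linalg.solve(A, b)
    return {s: U[index[s]] for s in STATES}
```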
Optimal Policy
(policy iteration with 11 linear equations)

[Figure: the optimal policy for the 4 x 3 grid world, with the +1 terminal state at (4,3) and the -1 terminal state at (4,2)]

u(1,1) = 0.8 u(1,2) + 0.1 u(2,1) + 0.1 u(1,1)
u(1,2) = 0.8 u(1,3) + 0.2 u(1,2)
…
Partially observable MDP (POMDP)

• In an inaccessible environment, the percept does not provide enough information to determine the state or the transition probability
• POMDP
  – State transition function: P(s_{t+1} | s_t, a_t)
  – Observation function: P(o_t | s_t, a_t)
  – Reward function: E(r_t | s_t, a_t)
• Approach
  – Calculate a probability distribution over the possible states given all previous percepts, and base decisions on this distribution
• Difficulty
  – Actions cause the agent to obtain new percepts, which cause the agent's beliefs to change in complex ways
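The "probability distribution over the possible states" is usually maintained as a belief state. Below is a minimal sketch of the standard belief update b'(s') ∝ P(o | s', a) \sum_s P(s' | s, a) b(s); the dictionaries and the transition_model / observation_model interfaces are assumptions for illustration, not anything specified on the slides.

```python
def update_belief(belief, action, observation, transition_model, observation_model):
    """One step of belief tracking: b'(s') ∝ P(o | s', a) * sum_s P(s' | s, a) * b(s).

    belief: dict state -> probability
    transition_model(s, a): dict s' -> P(s' | s, a)   (hypothetical interface)
    observation_model(s, a): dict o  -> P(o  | s, a)  (hypothetical interface)
    """
    new_belief = {}
    for s2 in belief:
        predicted = sum(transition_model(s, action).get(s2, 0.0) * p
                        for s, p in belief.items())
        new_belief[s2] = observation_model(s2, action).get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total > 0 else new_belief
```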