Optimal Policies for POMDP
Presented by Alp Sardağ
As Much Reward As Possible?
Greedy Agent
How long does the agent keep making decisions?
Finite Horizon
Infinite Horizon (discount factor)
Values will converge.
Good model if the number of decision steps
is not given.
Policy
General plan
Deterministic : one action for each state
Stochastic : pdf over the set of actions
Stationary : can be applied at any time
Non-stationary : dependent on time
Memoryless : no history
Finite Horizon
The agent has to make k decisions; the policy is non-stationary.
Infinite Horizon
We do not need a different policy for each
time step.
Discount factor: 0 < γ < 1
Infiniteness helps us to find
stationary policy.
Non-stationary: π = {π_0, π_1, ..., π_t}
Stationary: π = {π_i, π_i, ..., π_i}
MDP
Finite horizon: solved with dynamic
programming.
Infinite horizon: |S| equations in |S| unknowns,
solved with linear programming (LP).
MDP
Actions may be stochastic.
Do you know which state you end up in?
Dealing with uncertainty in observations.
POMDP Model
Finite set of states
Finite set of actions
Transition probabilities (as in MDP)
Observation model
Reinforcement
POMDP Model
Immediate reward for performing action
a in state i.
POMDP Model
Belief state : probability distribution
over states.
b = {b_0, b_1, ..., b_|S|}
Drawback: computing the next belief state requires the world
model. From Bayes' rule:
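The Bayes-rule update mentioned on this slide can be sketched as follows. This is a minimal sketch, not the presenter's formula: the table layout (T[a][s][s2] for transitions, O[a][s2][z] for observations) and the 0.85 listening accuracy used in the example are assumptions made for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Bayes-rule belief update.

    T[a][s][s2] = P(s2 | s, a)   (assumed layout)
    O[a][s2][z] = P(z | s2, a)   (assumed layout)
    """
    n = len(belief)
    # Unnormalized posterior: P(z | s2, a) * sum_s P(s2 | s, a) * b(s)
    unnormalized = [
        O[action][s2][observation]
        * sum(T[action][s][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)  # P(z | b, a), the normalizing constant
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return [p / total for p in unnormalized]


# Example with assumed numbers: a single "listen" action that does not
# change the state and reports the correct side 85% of the time.
LISTEN = 0
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.85, 0.15], [0.15, 0.85]]]
new_belief = belief_update([0.5, 0.5], LISTEN, 0, T, O)
```

Starting from a uniform belief, one observation pointing at state 0 moves the belief to roughly (0.85, 0.15) under these assumed numbers. The update needs T and O, which is the "world model needed" drawback noted above.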
POMDP Model
Control dynamics for a POMDP
Policies for POMDP
The belief space is infinite, so storing value functions in tables
is infeasible.
For horizon length 1.
No control over observations (an issue not found in MDPs):
weigh all observations.
Value functions for POMDPs
The formula is complex; however, if the VF is piecewise linear
(a way of representing a VF over a continuous space), it can be written:
Value functions for POMDPs
Value Functions for POMDPs
Given Vt-1, Vt can be calculated.
Keep the action that gives rise to
each specific vector.
To find optimal policy at a belief state,
just perform maximization over all
vectors and take the associated action.
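The maximization described above can be sketched in a few lines. The (action, vector) pairing and the example numbers are hypothetical, chosen only to show the mechanics.

```python
def best_vector(belief, vectors):
    """Return (value, action) for the vector with the largest dot product.

    `vectors` is a list of (action, alpha) pairs, where alpha holds one
    value per state.
    """
    def dot(alpha):
        return sum(a * b for a, b in zip(alpha, belief))

    action, alpha = max(vectors, key=lambda pair: dot(pair[1]))
    return dot(alpha), action


# Hypothetical two-state value function made of three vectors.
V = [("a1", (1.0, 0.0)), ("a2", (0.0, 1.0)), ("a3", (0.6, 0.6))]
value_mid, action_mid = best_vector((0.5, 0.5), V)    # picks "a3"
value_edge, action_edge = best_vector((0.9, 0.1), V)  # picks "a1"
```

Because each vector carries its action, the value function and the policy are read off from the same set.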
Geometric Interpretation of VF
Belief simplex:
2 dimensional case:
Geometric Interpretation of VF
3 dimensional case :
Alternate VF Interpretation
A decision tree could enumerate each possible policy
for a k-horizon problem, if the initial belief state is given.
Alternate VF Interpretation
The number of nodes for each action:
The number of possible trees (|A| possible actions for each node):
If we can somehow generate only the useful trees, the complexity will be
greatly reduced.
Previously, creating the entire VF meant generating vectors for all belief
states, too many for the algorithm to work.
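The counts referred to on this slide were lost with the slide images. As a hedged reconstruction, the standard count for a k-horizon policy tree is one node at the root and one branch per observation at each level, with one of |A| actions chosen at every node. The function names below are hypothetical.

```python
def policy_tree_nodes(horizon, n_observations):
    """Nodes in a policy tree: 1 + |Ω| + |Ω|^2 + ... + |Ω|^(k-1)."""
    return sum(n_observations ** depth for depth in range(horizon))


def policy_tree_count(horizon, n_actions, n_observations):
    """Distinct trees: one of |A| actions at every node."""
    return n_actions ** policy_tree_nodes(horizon, n_observations)


# Illustrative sizes: 3 actions, 2 observations, horizon 4.
nodes = policy_tree_nodes(4, 2)      # 15
trees = policy_tree_count(4, 3, 2)   # 3 ** 15 = 14348907
```

Even for this small case there are over fourteen million trees, which is why generating only the useful ones matters.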
POMDP Solutions
For finite horizon:
Iterate over time steps. Given Vt-1 compute Vt.
Retain all intermediate solutions.
For finitely transient policies, the same idea applies to finding
the infinite-horizon solution.
Iterate until the optimal value functions
are the same for two consecutive time
steps.
Once the infinite-horizon solution is found, discard all
intermediate results.
POMDP Solutions
Given Vt-1, the Vt vector for one belief point can be calculated from the
previous formula, but with no knowledge of the
region over which it is optimal. (Sondik)
Too many belief points to construct the VF; one possible
solution:
Choose random points.
If the number of points is large, one is unlikely to miss
any of the true vectors.
How many points to choose? No guarantee.
Find optimal policies by developing a
systematic algorithm to explore the entire
continuous space of beliefs.
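The single-point computation mentioned on this slide can be sketched as below. This is one common way to write the backup, not necessarily the presenter's formula. The table layouts, the 0.85 listening accuracy, and the assumption that opening a door resets the problem with an uninformative observation are all illustrative assumptions.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def backup_at_point(belief, prev_vectors, T, O, R, gamma=1.0):
    """Return (value, action, vector) of the best Vt vector at one belief.

    T[a][s][s2] = P(s2 | s, a), O[a][s2][z] = P(z | s2, a),
    R[a][s] = immediate reward (assumed layouts).
    """
    n = len(belief)
    best = None
    for a in range(len(T)):
        alpha_a = list(R[a])
        for z in range(len(O[a][0])):
            # Project every Vt-1 vector back through action a, observation z.
            projected = [
                [sum(T[a][s][s2] * O[a][s2][z] * alpha[s2] for s2 in range(n))
                 for s in range(n)]
                for alpha in prev_vectors
            ]
            # Keep the projection that is best at this belief point.
            g = max(projected, key=lambda v: dot(v, belief))
            alpha_a = [x + gamma * y for x, y in zip(alpha_a, g)]
        value = dot(alpha_a, belief)
        if best is None or value > best[0]:
            best = (value, a, alpha_a)
    return best


# Tiger-style model with assumed dynamics. Actions: 0 = open left,
# 1 = open right, 2 = listen. States: 0 = tiger left, 1 = tiger right.
UNIFORM = [[0.5, 0.5], [0.5, 0.5]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
T = [UNIFORM, UNIFORM, IDENTITY]
O = [UNIFORM, UNIFORM, [[0.85, 0.15], [0.15, 0.85]]]
R = [[-100, 10], [10, -100], [-1, -1]]

# The horizon-1 vectors are the immediate reward vectors themselves.
value, action, vector = backup_at_point([0.5, 0.5], R, T, O, R)
```

The result is one vector that is optimal at the chosen point. Nothing in the computation says how far around that point it stays optimal, which is the difficulty the slide attributes to Sondik.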
Tiger Problem
Actions: open left door, open right door, listen.
Listening is not accurate.
s0: tiger on the left, s1: tiger on the right.
Rewards: +10 for opening the correct door, -100 for the wrong door,
-1 for listening.
Initially: b = (0.5, 0.5)
Tiger Problem
Tiger Problem
First action, intuitively:
(-100 + 10) / 2 = -45 for opening a door vs. -1 for listening
For horizon length 1:
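The horizon-1 comparison can be checked with a short calculation using only the rewards stated for the tiger problem. The action names and the use of exact fractions are choices made for this sketch.

```python
from fractions import Fraction

# Reward per action as (tiger on the left, tiger on the right).
REWARDS = {
    "open-left": (-100, 10),
    "open-right": (10, -100),
    "listen": (-1, -1),
}


def expected_reward(action, p_left):
    """Expected immediate reward when P(tiger on the left) = p_left."""
    r_left, r_right = REWARDS[action]
    return p_left * r_left + (1 - p_left) * r_right


def horizon1_action(p_left):
    """Best single action for a given belief."""
    return max(REWARDS, key=lambda a: expected_reward(a, p_left))


half = Fraction(1, 2)
at_uniform = {a: expected_reward(a, half) for a in REWARDS}
# open-left: -45, open-right: -45, listen: -1
```

With these rewards, opening the left door only beats listening when 10 - 110p > -1, that is, when the probability p that the tiger is on the left is below 0.1; by symmetry the right door is preferred only above 0.9.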
Tiger Problem
For Horizon length 2:
Tiger Problem
For horizon length 4, nice features:
For the same action & observation, a belief state is
transformed to a single next belief state.
Observations made precisely define the nodes in
the graph that would be traversed.
Infinite Horizon
Finite horizon is cumbersome: a different policy for the
same belief point at each time step.
Different set of vectors for each time step.
Add a discount factor to the tiger problem: after the 56th step
the underlying vectors are only slightly different:
Infinite Horizon for Tiger Problem
In this way the finite-horizon algorithms
can be used for infinite-horizon
problems.
Advantage of the infinite horizon: keep only the
last policy.
Policy Graphs
A way to encode a policy without keeping vectors; no dot
products are needed.
Beginning state
End state
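Executing a policy graph needs only a table lookup per step. The graph below is a hypothetical example written for this sketch (it is not the graph from the slide and is not claimed to be optimal): each node names an action, and each observation selects the next node.

```python
# node -> (action, {observation: next node}); hypothetical graph.
POLICY_GRAPH = {
    0: ("listen", {"hear-left": 1, "hear-right": 2}),
    1: ("listen", {"hear-left": 3, "hear-right": 0}),
    2: ("listen", {"hear-left": 0, "hear-right": 4}),
    3: ("open-right", {"hear-left": 0, "hear-right": 0}),
    4: ("open-left", {"hear-left": 0, "hear-right": 0}),
}


def run_policy_graph(graph, observations, start=0):
    """Return the actions taken while following the observations."""
    node = start
    actions = [graph[node][0]]
    for obs in observations:
        node = graph[node][1][obs]  # observation selects the next node
        actions.append(graph[node][0])
    return actions


actions = run_policy_graph(POLICY_GRAPH, ["hear-left", "hear-left"])
```

No belief state or vector is stored at run time: the current node summarizes everything the agent needs.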
Finite Transience
All the belief states within a particular
partition element will be transformed to
another element for a particular action
and observation.
For non-finitely transient policies,
policy graphs that are exactly optimal
cannot be constructed.
Overview of Algorithms
All performed iteratively.
All try to find the set of vectors that
define both the value function and the
optimal policy at each time step.
Two separate classes:
Given Vt-1, generate a superset of Vt, then reduce
that set until the optimal Vt is found
(Monahan and Eagle).
Given Vt-1, construct a subset of the optimal Vt.
These subsets grow larger until the optimal Vt
is found.
Monahan Algorithm
Easy to implement.
Do not expect it to solve anything but
the smallest of problems.
Provides background for understanding
the other algorithms.
Monahan Enumeration Phase
Generate all vectors:
Number of generated vectors = |A| M^|Ω|
where M is the number of vectors from the previous step and Ω is the set of observations
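The enumeration phase can be sketched as follows: for every action, choose one previous vector per observation, in every possible combination. This is a minimal sketch under assumed table layouts; the tiger-style dynamics and the 0.85 listening accuracy are illustrative assumptions, and the function name is hypothetical.

```python
from itertools import product


def enumerate_vectors(prev_vectors, T, O, R, gamma=1.0):
    """Generate all |A| * M**|Ω| candidate (action, vector) pairs.

    T[a][s][s2] = P(s2 | s, a), O[a][s2][z] = P(z | s2, a),
    R[a][s] = immediate reward (assumed layouts).
    """
    n_states = len(R[0])
    n_obs = len(O[0][0])
    new_vectors = []
    for a in range(len(T)):
        # projected[z][i] = previous vector i pushed back through (a, z).
        projected = [
            [
                [sum(T[a][s][s2] * O[a][s2][z] * alpha[s2]
                     for s2 in range(n_states))
                 for s in range(n_states)]
                for alpha in prev_vectors
            ]
            for z in range(n_obs)
        ]
        # One previous vector per observation: M ** n_obs combinations.
        for choice in product(range(len(prev_vectors)), repeat=n_obs):
            alpha_new = [
                R[a][s]
                + gamma * sum(projected[z][choice[z]][s] for z in range(n_obs))
                for s in range(n_states)
            ]
            new_vectors.append((a, alpha_new))
    return new_vectors


# Tiger-style model with assumed dynamics (3 actions, 2 observations).
UNIFORM = [[0.5, 0.5], [0.5, 0.5]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
T = [UNIFORM, UNIFORM, IDENTITY]
O = [UNIFORM, UNIFORM, [[0.85, 0.15], [0.15, 0.85]]]
R = [[-100, 10], [10, -100], [-1, -1]]

candidates = enumerate_vectors(R, T, O, R)  # M = 3 previous vectors
```

With 3 actions, 3 previous vectors and 2 observations this yields 3 * 3^2 = 27 candidates after one step; the count grows quickly because the output of each step is the M of the next.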
Monahan Reduction Phase
All vectors can be kept:
Each time, maximize over all vectors.
A lot of excess baggage.
The number of vectors in the next step will be
even larger.
LP is used to trim away useless vectors.
Monahan Reduction Phase
For a vector to be useful, there must be
at least one belief point where it gives a larger
value than all the others:
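The usefulness test can be illustrated without an LP solver in the two-state case, where the belief is a single number p and the test reduces to checking a handful of candidate points. This is a swapped-in simplification for illustration, not Monahan's LP; the function names and the extra example vectors are hypothetical.

```python
def value_at(alpha, p):
    """Value of a two-state vector at belief (1 - p, p)."""
    return alpha[0] * (1 - p) + alpha[1] * p


def is_useful(alpha, others, eps=1e-9):
    """True if alpha is strictly best at some belief point.

    The margin over the other vectors is concave and piecewise linear in
    p, so its maximum lies at p = 0, p = 1, or where two of the other
    vectors cross.
    """
    points = [0.0, 1.0]
    for i, u in enumerate(others):
        for v in others[i + 1:]:
            slope_diff = (u[1] - u[0]) - (v[1] - v[0])
            if slope_diff != 0:
                p = (v[0] - u[0]) / slope_diff
                if 0.0 < p < 1.0:
                    points.append(p)
    return any(
        all(value_at(alpha, p) > value_at(o, p) + eps for o in others)
        for p in points
    )


def prune(vectors):
    """Keep only the vectors that are useful against all the others."""
    unique = list(dict.fromkeys(tuple(v) for v in vectors))
    return [
        v for v in unique
        if is_useful(v, [o for o in unique if o != v])
    ]


# Horizon-1 tiger vectors plus two hypothetical extras.
vectors = [(-100, 10), (10, -100), (-1, -1), (-50, -50), (-40, 2)]
kept = prune(vectors)
```

The vector (-40, 2) is not dominated by any single other vector, yet it is never the maximum anywhere on the belief line, so it is trimmed along with (-50, -50). Catching such cases is what the LP is for when there are more than two states.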
Monahan Algorithm
Monahan’s LP Complication
Future Work
Eagle’s Variant of Monahan’s Algorithm.
Sondik’s One-Pass Algorithm.
Cheng’s Relaxed Region Algorithm.
Cheng’s Linear Support Algorithm.