Artificial Intelligence Group Exploiting C-TÆMS Models for Policy Search Brad Clement

Artificial Intelligence Group
Exploiting C-TÆMS Models for Policy Search
Brad Clement
Steve Schaffer
Problem
• What is the best performance the agents could be expected to achieve, given a full, centralized view of the problem and execution?
  – Complete information, but cannot see into the future.
• Centrally provide optimal choices of action for all agents at all times.
  – offline computation of a policy:
    • a contingency plan
    • a function from system states to joint actions (starting or aborting methods)
  – the theoretical best computation time grows as a polynomial function of the size of the policy, which is o^(am) in the worst case, for
    • a agents
    • m methods per agent
    • o outcomes per method
[Figure: example policy tree rooted at state S0 — joint actions such as (A1 do a, A2 do b) and (A2 do b, A3 do c) branch to outcome states with probabilities such as 0.1, 0.4, 0.5, 0.8, and 0.9.]
Overview
• C-TAEMS → multiagent MDP
• AO* policy search
• Minimizing state creation time
• Avoiding redundant plan/policy exploration
• Merging equivalent states
• Estimating expected quality
• Handling joint action explosion
TAEMS to C-TAEMS
• Task groups represent goals
• Tasks represent sub-goals
• Methods are executable primitives – uncertain quality and duration
• Resources model resource state
• Pre/postconditions used for location/movement
• Non-local effects (NLEs) model interactions between activities
  – enables, disables, facilitates, hinders (uncertain effects on quality & duration)
• QAFs specify how quality is accrued from sub-tasks
  – sum, sum-and, sync-sum, min, max, exactly-one
C-TAEMS as a Multiagent MDP
• MDP for planning
  = state → action choices → outcome state & reward distribution
• MMDP
  = state → joint action choices → . . .
• A policy is a choice of actions
• C-TAEMS state representation is the state of activity (sketched below):
  – for each method
    • phase: pending, active, complete, failed, aborted, abandoned, maybe-pending, maybe-active
    • outcome: duration, quality, cost
    • start time
  – time (eliminates state looping, policy space is a DAG)
• Actions are starting and aborting methods
[Figure: example MMDP policy tree from S0 with joint actions (A1 do a, A2 do b) and (A2 do b, A3 do c) and outcome probabilities 0.1, 0.4, 0.5, 0.8, 0.9.]
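As an illustration only (not the authors' implementation), here is a minimal Python sketch of this state representation; names such as MethodState, CTAEMSState, and JointAction are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple

class Phase(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    COMPLETE = "complete"
    FAILED = "failed"
    ABORTED = "aborted"
    ABANDONED = "abandoned"
    MAYBE_PENDING = "maybe-pending"
    MAYBE_ACTIVE = "maybe-active"

@dataclass(frozen=True)
class MethodState:
    phase: Phase = Phase.PENDING
    start_time: Optional[int] = None
    # Realized outcome once the method finishes: (duration, quality, cost).
    outcome: Optional[Tuple[int, float, float]] = None

@dataclass(frozen=True)
class CTAEMSState:
    # Time plus per-method activity state; including time keeps the policy
    # space a DAG (no state looping).
    time: int
    methods: Tuple[Tuple[str, MethodState], ...] = ()  # (method name, state) pairs

# A joint action assigns each agent a (verb, method) choice,
# where verb is "start" or "abort".
JointAction = Dict[str, Tuple[str, str]]
```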
Computing policy while expanding MDP state-action space
• Compute policy while expanding (AO*) — a sketch of this loop follows below
  – Add outcomes
  – Calculate quality bounds
  – Update policy
  – Prune dominated branches (LB > UB)
  – Expand joint start/abort actions
  – Choose the state in the policy with highest probability
• Want to push expansion deeper
• Want to explore more likely states
• Don't want to expand bad actions
[Figure: partially expanded optimal-policy tree from S0 with joint actions "ab" and "bc", outcome probabilities 0.1, 0.4, 0.5, 0.8, 0.9, and expected-quality bounds such as [2.35, 4.95], [3.2, 4.45], and [2.2, 5.6] attached to states.]
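A schematic sketch of this expand-and-update loop (hypothetical helpers expand and heuristic_bounds; not the actual solver, and ancestor bound propagation is omitted), assuming the CTAEMSState sketch above:

```python
import heapq
import itertools

def ao_star(initial_state, expand, heuristic_bounds, iterations=1000):
    """Schematic AO*-style loop: repeatedly expand a likely reachable state,
    back up expected-quality bounds, and prune actions whose upper bound
    falls below a sibling's lower bound."""
    tie = itertools.count()                        # heap tie-breaker
    state_bounds = {initial_state: heuristic_bounds(initial_state)}
    policy = {}
    frontier = [(-1.0, next(tie), initial_state)]  # (-reach probability, _, state)

    for _ in range(iterations):
        if not frontier:
            break
        neg_p, _, state = heapq.heappop(frontier)
        reach_p = -neg_p
        actions = expand(state)                    # {action: [(prob, outcome_state), ...]}
        if not actions:
            continue                               # terminal state
        action_bounds = {}
        for action, outcomes in actions.items():
            lo = hi = 0.0
            for p, outcome in outcomes:
                if outcome not in state_bounds:
                    state_bounds[outcome] = heuristic_bounds(outcome)
                    heapq.heappush(frontier, (-(reach_p * p), next(tie), outcome))
                lo += p * state_bounds[outcome][0]
                hi += p * state_bounds[outcome][1]
            action_bounds[action] = (lo, hi)
        best = max(action_bounds, key=lambda a: action_bounds[a][0])
        best_lo = action_bounds[best][0]
        # Prune dominated branches: upper bound below the best lower bound.
        kept = [a for a in action_bounds if action_bounds[a][1] >= best_lo]
        policy[state] = best
        state_bounds[state] = (best_lo, max(action_bounds[a][1] for a in kept))
        # A full AO* would also propagate the updated bounds to ancestors.
    return policy, state_bounds
```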
Minimizing state creation time
Idea
• Never create states from scratch.
• The next state is a minor change to the current one.
Expand combinations of actions and their outcomes like incrementing a counter (see the enumeration sketch below):
• 0110
• 0111 — the lowest-order digit changes each iteration; the next higher-order digit changes when the lower one "rolls over"
• 1000
[Figure: counter values 0000, 0001, 0010, 0011, 0100, 0110 branching from S0.]
Higher-order "digits" are joint actions; lower-order ones are outcomes:
• agent
  – method
    • action (start or abort)
      – outcome
        » duration
        » quality
        » NLEs
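As a rough illustration (hypothetical code, not from the slides), an odometer-style enumerator in Python: each "digit" indexes one component's choices, and only the digits that roll over change, so the next state is a small patch to the current one:

```python
def odometer(choice_lists):
    """Enumerate all combinations of per-component choices like a counter:
    the last component advances every iteration, and a component to its left
    advances only when everything to its right rolls over. Yields the indices
    of the digits that changed, so a new state can be built by patching only
    those components of the previous state."""
    digits = [0] * len(choice_lists)
    yield list(digits), list(range(len(digits)))   # first combination: all digits "changed"
    while True:
        i = len(digits) - 1
        while i >= 0 and digits[i] == len(choice_lists[i]) - 1:
            digits[i] = 0                          # roll this digit over
            i -= 1
        if i < 0:
            return                                 # every digit rolled over: done
        digits[i] += 1
        yield list(digits), list(range(i, len(digits)))

# Example: digits ordered coarsest first (agent, method action, outcome).
for combo, changed in odometer([[0, 1], [0, 1], [0, 1, 2]]):
    pass  # rebuild only the components listed in `changed`
```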
Minimizing state creation time (example)
iteration
• agent B
– method b2
• action start
– outcome duration > 2
– outcome duration=2, quality=0
– outcome duration=1, quality=1
– method rollover from b2 to b1
• action start
– outcome duration=4, quality=2
• agent rollover from B to A
– method a2
• action start
– outcome duration=2, quality=5
• agent B
– method b2
• action start
– outcome b2 duration > 2
– outcome a2 duration=2, quality=5
– outcome b2 duration=2, quality=0
                  agent A          agent B
time \ method    a1      a2      b1      b2      expansion
t=1              pend    pend    pend    pend
t=1              pend    pend    pend    actv    action
t=3              pend    pend    pend    actv    state
t=3              pend    pend    pend    d2q0    state
t=3              pend    pend    pend    d2q1    state
t=1              pend    pend    actv    pend    action
t=5              pend    pend    d4q2    pend    state
t=1              pend    actv    pend    pend    action
t=3              pend    d2q5    pend    pend    state
t=1              pend    actv    pend    actv    action
t=3              pend    d2q5    pend    actv    state
t=3              pend    d2q5    pend    d2q0    state
(dNqM = completed with duration N and quality M)
Avoiding exploration of redundant plans/policies
• A simple brute-force approach is not practical.
  – expand all subsets of methods at each clock tick
  – 30 methods → 2^30 > 1 billion actions to expand just at the 1st time step
• The obvious: never start a method
  – for an agent that is already executing another,
  – before the method's release time,
  – after it can no longer possibly meet its deadline,
  – when disabled, or
  – when not enabled.
• Only consider starting a method (see the sketch below)
  – at its release time,
  – when the agent finishes executing another method,
  – when the method is enabled or facilitated (after the delay), and
  – one time unit after it would disable or hinder another (hard!).
• Discrete simulation – skip to the earliest time when there is an action choice or a method completes.
• Redundant abort times are more difficult to identify.
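A simplified sketch (hypothetical names; it assumes a method object carrying release, deadline, and duration-distribution fields) of restricting candidate start times to the events above rather than every clock tick:

```python
def candidate_start_times(method, agent_busy_until, enable_times, horizon):
    """Collect the only times worth considering for starting `method`:
    its release time, the time its agent frees up, and the times its
    enabling/facilitating NLEs take effect (source end time + delay).
    Times past the latest start that could still meet the deadline are
    filtered out."""
    latest_start = method.deadline - min(method.durations)  # needs the shortest duration to finish in time
    candidates = {method.release, agent_busy_until}
    candidates.update(enable_times)
    return sorted(t for t in candidates
                  if method.release <= t <= min(latest_start, horizon))
```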
Start times for sources of disables/hinders NLEs
• NLEs have a delayed effect.
• No problem for enables & facilitates: start the target method delay
after source ends—it is just part of the simulation.
• Need to end a disabler/hinderer at delay-1 from the start of the NLE
target
– can’t simulate potential start times of source unless start of target is known
– can’t repair state action space because actions may have been pruned
• Solution
– generate a temporal network of start times as they depend on other
start/end times
  – during state-action space expansion, create a start action only if the start time is
supported by the network — search for a support path back to a release time (see the sketch below)
[Figure: temporal network relating method start/end times — release times, durations, "follows" links (e.g., A1 follows A2), and enable/hinder delays among methods A1, A2, B1, C1, and C2.]
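A rough sketch (hypothetical structures, not the authors' network code) of the support check: a proposed start time is accepted only if it can be traced through the network's dependency edges back to some release time:

```python
def supported(node, supports, releases, visited=None):
    """Return True if start-time variable `node` can be traced back to a
    release time through `supports`, which maps each node to the start/end
    times that could justify it (end of the agent's previous method,
    NLE source end + delay, ...)."""
    if visited is None:
        visited = set()
    if node in releases:
        return True
    if node in visited:
        return False                      # avoid revisiting / cycles
    visited.add(node)
    return any(supported(p, supports, releases, visited)
               for p in supports.get(node, []))

# Hypothetical usage: accept a start action only if its time is supported.
supports = {"start(C2)": ["end(C1)+enable_delay"],
            "end(C1)+enable_delay": ["start(C1)"]}
releases = {"start(C1)"}
print(supported("start(C2)", supports, releases))   # True
```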
Merge equivalent states? DAG or tree?
• MDPs are often defined such that multiple outcomes point to the same state.
• If an outcome is equivalent to one that already exists, only one outcome is needed, so "merging" them into one can save memory and the time of re-expanding the outcome.
  – each state is followed by an exponentially expanding number of states
  – eliminating a few states early in the plan could significantly shrink the search space
• A "looser" equivalence definition allows more outcomes to merge.
  – Ideally, equivalence is found whenever the agents "wouldn't do anything different from this point on."
  – Defining this was fragile for C-TAEMS:
    • computing equivalence became a major slowdown
    • it produced a lot of subtle bugs
• It turns out that merging actually increased memory!
  – Large problems had few merged outcomes.
  – The container for lookup required more memory than merging could save.
  – Better performance resulted from expanding the policy space as a tree without checking for state equivalence.
[Figure: policy tree from S0 with joint actions (A1 do a, A2 do b) and (A2 do b, A3 do c) and outcome probabilities 0.1, 0.4, 0.5, 0.8, 0.9.]
Better estimating future quality
• AO* is A*.
• The algorithm uses a heuristic to identify which action choice leads to the highest overall quality.
• The heuristic gives a quick estimate of upper and lower bounds on expected quality (EQ).
  – the upper bound needs to be an overestimate to be admissible
  – the lower bound needs to be an underestimate to ensure soundness
  – the tighter the bounds, the fewer states are required to prove a policy optimal
• QAFs can be problematic (a worked calculation follows below).
  – the EQ of a max QAF cannot be computed from the lower and upper bounds of its children; for example,
    • method A quality distribution (50% q=20, 50% q=40), EQ = 30
    • method B quality distribution (50% q=0, 50% q=60), EQ = 30
    • the EQ of a task with QAF max over methods A and B is not 30!
    • if executing both, EQ = 20*25% + 40*25% + 60*50% = 45
• Compute tighter bound distributions based on method quality and duration distributions → complicated!
  – precompute for methods at different time points near the deadline
• Result: worth it
  – significant but not bad time overhead (~2x?)
  – reduction in states more significant for most (not all) problems
[Figure: policy tree from S0 with joint actions (A1 do a, A2 do b) and (A2 do b, A3 do c) and outcome probabilities 0.1, 0.4, 0.5, 0.8, 0.9.]
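To make the max-QAF point concrete, a small hypothetical calculation: the expected quality of the max must be computed from the joint outcome distribution, not from the children's expected values or bounds:

```python
from itertools import product
from math import prod

# Quality distributions as (probability, quality) pairs (from the slide).
method_a = [(0.5, 20), (0.5, 40)]   # EQ = 30
method_b = [(0.5, 0), (0.5, 60)]    # EQ = 30

def expected(dist):
    return sum(p * q for p, q in dist)

def expected_max(*dists):
    """EQ of a max QAF over independent children: enumerate the joint outcome
    distribution and take the maximum quality in each joint outcome."""
    return sum(prod(p for p, _ in combo) * max(q for _, q in combo)
               for combo in product(*dists))

print(expected(method_a), expected(method_b))   # 30.0 30.0
print(expected_max(method_a, method_b))         # 45.0, not 30.0
```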
Partially expanding joint actions
• 10 agents each with 9 methods → 10^10 joint actions
• How can we preserve optimality without enumerating all joint actions?
• Choose actions sequentially with intermediate states (sketched below).
[Figure: flat expansion of S0 into joint actions a1b1, a1b2, a2b1, a2b2 versus sequential expansion in which S0 first branches on a1/a2 and each intermediate state then branches on b1/b2.]
• Ended up not being helpful.
• Although it could expand forward, problems were too big to get useful bounds on the optimal EQ (e.g., [1, 100]).
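A minimal sketch (hypothetical; it assumes each agent has a small list of action choices) of sequential expansion through intermediate states, generating one agent's choice at a time instead of enumerating the full cross product up front:

```python
def sequential_expand(agent_choices, partial=()):
    """Lazily expand joint actions one agent at a time. Prefixes of agent
    choices are yielded as intermediate nodes so that dominated prefixes can
    be pruned before the remaining agents' choices are ever generated."""
    i = len(partial)
    if i == len(agent_choices):
        yield ("joint action", partial)        # every agent has chosen
        return
    for choice in agent_choices[i]:
        prefix = partial + (choice,)
        if i + 1 < len(agent_choices):
            yield ("intermediate", prefix)     # bounds can be computed & pruned here
        yield from sequential_expand(agent_choices, prefix)

# Hypothetical usage with two agents (choices a1/a2 and b1/b2):
for kind, actions in sequential_expand([["a1", "a2"], ["b1", "b2"]]):
    print(kind, actions)
```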
Summary
• Many ways to exploit problem structure (the model)
  – some are obvious
  – for others, it's hard to know what will help
• Helped scaling:
  1. efficient enumeration/creation of individual actions and states,
  2. selective start and abort times,
  3. more precise expected quality estimates (trading time for space), and
  4. instantiating duplicates of equivalent states to avoid the overhead of a lookup container.
• Did not help scaling:
  1. merging equivalent outcome states to avoid expanding duplicates (same finding as #4 above),
  2. using more inclusive equivalence definitions, and
  3. partially expanding actions to avoid the intractability of joint actions.
• Seems like other things should help:
  – use single-agent policies as a heuristic
  – plan for most likely outcomes as a heuristic
  – identify independent subproblems
Backup
States and their generation
• State representation similar to Mausam & Weld, 2005:
  – time
  – for each method
    • phase: pending, active, complete, failed, aborted, abandoned, maybe-pending, maybe-active
    • outcome: duration, quality, & cost
    • start time
• Extended state of frontier nodes
  – methods being aborted
  – methods intended to never be executed
  – for each method
    • possible start times
    • possible abort times
    • NLE quality coefficient distribution & iterator
    • outcome distribution (duration, quality) & iterator
    • current outcome probability
    • remaining outcome probability in unexpanded states
• Using the extended state, generating a new state is simply an iteration of the last state over
  – agents
  – methods
  – phase transitions
  – NLE outcomes
  – outcomes
• Usually uses 2 GB in 2-3 minutes, so another version calculates (instead of storing) the extended state before generating actions & outcomes
  – slower
  – many more states fit in memory
Algorithm details
• Expand the state space for all orderings/concurrency of methods based on temporal constraints:
  – an agent cannot execute more than one method at a time
  – a method must be enabled and not disabled
  – facilitates: the set of potential time delays A could start after B that could lead to increased quality
  – hinders: the set of potential times A could start before B that could lead to increased quality
• The time of outcomes is computed as the minimum of possible method start times, abort times, and completion times
• Try to avoid expanding the state space for suboptimal actions
  – every agent must be executing an action unless all remaining activities are NLE targets
  – focus expansion on states following more promising actions (A*) and more likely outcomes
    • more promising actions are determined by computing the policy during expansion based on bounds on expected quality
  – prove other actions suboptimal and prune!
• The optimal policy falls out of state expansion (see the sketch below)
  – accumulated quality is part of the state
  – state expansion has no cycles (DAG)
  – we compute it by walking from the leaves of the expansion back to the initial state
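A minimal sketch (hypothetical; it assumes the expansion is stored as a dict mapping states to actions to outcome lists) of computing the policy by walking back from the leaves of the DAG to the initial state:

```python
def backward_induction(graph, leaf_quality):
    """Compute expected quality and best action for every state of an acyclic
    expansion. `graph[state]` maps each action to a list of (probability,
    outcome_state); leaves are states missing from `graph`, and their value is
    given by `leaf_quality` (e.g., accumulated quality stored in the state)."""
    value, policy = {}, {}

    def eval_state(state):
        if state in value:
            return value[state]
        if state not in graph:                       # leaf: quality is in the state
            value[state] = leaf_quality(state)
            return value[state]
        best_action, best_eq = None, float("-inf")
        for action, outcomes in graph[state].items():
            eq = sum(p * eval_state(s) for p, s in outcomes)
            if eq > best_eq:
                best_action, best_eq = action, eq
        value[state], policy[state] = best_eq, best_action
        return best_eq

    for state in graph:
        eval_state(state)
    return value, policy
```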
Memory
• algorithm (see the sketch below)
  – freeing memory is slow and not always necessary
  – wait to prune until memory is completely used
  – use the freed memory to expand further
  – repeat
• problems
  – not easy to back out in the middle of expansion
    • expanding one state could take up GBs of RAM
    • we added an auto-adjustable prune limit (5 GB – 7.5 GB – 8.75 GB – 9.375 GB – 10 GB)
  – Linux doesn't report all available memory
    • adapted the spacecraft VxWorks memory manager to keep track
• reclaim memory while executing (not yet implemented)
  – compute the policy with the memory available
  – take a step in the simulator
  – prune unused actions and other outcomes
  – repeat
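A schematic sketch (hypothetical helpers expand_one and prune_dominated, not the actual solver) of the prune-when-full policy: keep expanding until memory use approaches the limit, then spend time pruning and continue:

```python
import resource

def memory_used_gb():
    """Approximate peak resident set size in GB. Note: ru_maxrss is reported
    in kilobytes on Linux (bytes on macOS), so this conversion assumes Linux."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)

def expand_until_done(frontier, expand_one, prune_dominated, limit_gb=10.0):
    """Expand states until the frontier is exhausted; only when memory use
    approaches `limit_gb` is time spent pruning dominated branches, since
    freeing memory is slow and not always necessary."""
    while frontier:
        expand_one(frontier)                 # grows the state-action space
        if memory_used_gb() > limit_gb:
            prune_dominated()                # reclaim memory, then keep expanding
```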
Experiments
[Several slides of experimental result charts; only an axis label ("1 GB") survived extraction.]
Merged States
• storing states in a binary tree (C++ STL set)
• try to define state equivalence as "wouldn't do anything different from this point on"
• actual definition (fragile!) — a sketch follows below
  – are method states ==?
    • both quality zero? failed, aborted (, abandoned?)
    • otherwise, are both pending, active, or complete?
    • if active, are start times ==?
    • if complete,
      – quality ==?
      – are all NLE targets complete?
        » is the method the last to be completed by this agent?
        » is duration ==?
  – if any methods are pending,
    • is the current time ordered the same with respect to release times?
  – is time ==?
• result: ~10x fewer states
• other potential improvements
  – an active method that has no effect on decisions (possibly when only one possible remaining end time, eliminating abort decisions)
  – a method that has no effect (quality is guaranteed or doesn't matter)
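A rough sketch (hypothetical field names, and a simplified subset of the checks above) of such an equivalence test:

```python
def methods_equivalent(m1, m2):
    """Hedged sketch of the per-method equivalence test: two method states
    match if neither can lead the agents to act differently from here on."""
    # Zero-quality terminal phases (failed, aborted, abandoned) all look alike.
    dead = {"failed", "aborted", "abandoned"}
    if m1.phase in dead and m2.phase in dead:
        return True
    if m1.phase != m2.phase:
        return False
    if m1.phase == "active":
        return m1.start_time == m2.start_time
    if m1.phase == "complete":
        return (m1.quality == m2.quality and
                m1.duration == m2.duration)   # duration matters only in some cases
    return True                               # both pending

def states_equivalent(s1, s2):
    return (s1.time == s2.time and
            len(s1.methods) == len(s2.methods) and
            all(methods_equivalent(a, b)
                for a, b in zip(s1.methods, s2.methods)))
```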
New tricks - partially expanding joint actions
• 10 agents each with 10 methods results in 10^10 joint actions
• choose actions sequentially with intermediate states
• explore some joint actions without generating others
[Figure: flat expansion of S0 into joint actions a1b1, a1b2, a2b1, a2b2 versus sequential expansion in which S0 first branches on a1/a2 and then on b1/b2.]
New tricks - subpolicies
• when part of the problem can be solved independently, carve it off as a subproblem with a subpolicy
• exactly-one is the only QAF where subtasks can't possibly be split
• look for loose coupling and use the subpolicy as a heuristic
Performance summary
• extended state caching
– without merged states – less memory, slightly slower
– with merged states – more memory, slightly faster
• lower bound vs. upper bound heuristic
– lower bound uses more states
– 2x slower when not merging states; ~same when merging
• merging states
– 10x fewer states / less memory
– slower? (was 5x faster, now ~3x slower)
• partial joint actions
– slightly slower (sometimes ~same, sometimes 2X slower)
– slightly more memory
– range on optimal EQ for large problems not good (e.g. [1,100])
• potentially fixable with better lower bound heuristic
Algorithm Complexity
state space size / policy size (worst case): on the order of (o · o_q · o_d)^(ma), bounded above by a · (m · o · o_q · o_d)^(ma), where
• a = # agents
• m = # methods per agent
• o = # outcomes per method
• o_q = # values in quality distribution per outcome
• o_d = # values in duration distribution per outcome
[Figure: policy tree from S0 with joint actions "ab" and "bc" and outcome probabilities 0.1, 0.4, 0.5, 0.8, 0.2.]
Approaches to scaling the solver
• Explore the state space heuristically
  – heuristics for estimating lower and upper bounds of a state
    • compute information for making estimates offline as much as possible
    • don't use relaxed-state lookahead: heuristic expansion accomplishes the same without throwing away work
  – heuristics to expand actions that maximize pruning
    • now we choose the highest quality action
    • pick actions with a wider gap between upper and lower bound estimates
    • pick the action whose bounds will be tightened the most
  – stochastically expand the state-action space
[Figure: policy tree from S0 with joint actions "ab" and "bc" and outcome probabilities 0.1, 0.4, 0.5, 0.8, 0.2.]
Approaches to scaling the solver
• Try to use memory efficiently
  – best effort solutions while executing (mostly implemented)
    • compute a best-effort policy with the memory available
    • take the best action
    • prune the space of unused actions and unrealized outcomes
    • repeat
  – minimize state-action space expansion
    • where the order of methods doesn't matter, only explore one ordering
    • where the choice of method doesn't matter (e.g., qaf_max), only consider one
    • only order methods that produce highest quality when . . . ???
  – compress the state-action space
    • encode in bits
    • encode states as differences from prior states
    • make the state representation simpler so that states more likely match (and merge)
    • factor the state space?
    • heuristically merge similar states
• Use more memory
  – ~16 GB computers
  – parallelize across the network
    • load balance states to expand based on memory available
    • simple protocol of sending/receiving
      – state to expand
      – states to prune
      – updates on quality bounds of states
      – memory available
      – busy/waiting
[Figure: policy tree from S0 with joint actions "ab" and "bc" and outcome probabilities 0.1, 0.4, 0.5, 0.8, 0.2.]
Related work
• Our algorithm is AO*
  – in this case, policy computation is trivial because the state space is a DAG
  – the policy is computed as we expand the state space
• State representation like Mausam & Weld, '05
• We only explore states reachable from the initial state. This is called "reachability analysis," as in RTDP (Barto et al., '95) and Looping AO* (LAO*, Hansen & Zilberstein, '01)
• RTDP
  – focuses policy computation on more likely states and higher scoring actions
    • we do this for expansion
  – Labeled RTDP focuses computation on what hasn't converged in order to include unlikely (but potentially important) states
    • an opportunity to improve ours
• NMRDP – non-Markovian reward decision process (Bacchus et al., '96)
  – solved by converting to a regular MDP (Thiébaux et al., '06)
  – for C-TAEMS, overall quality is a non-Markovian reward that we converted to an MDP
[Figure: policy tree from S0 with joint actions "ab" and "bc" and outcome probabilities 0.1, 0.4, 0.5, 0.8, 0.2.]