Transcript slides

Automatic Induction of MAXQ Hierarchies
Neville Mehta
Michael Wynkoop
Soumya Ray
Prasad Tadepalli
Tom Dietterich
School of EECS
Oregon State University
Funded by DARPA Transfer Learning Program
Hierarchical Reinforcement Learning

Exploits domain structure to facilitate learning
- Policy constraints
- State abstraction
Paradigms: Options, HAMs, MAXQ
MAXQ task hierarchy
- Directed acyclic graph of subtasks
- Leaves are the primitive MDP actions
Traditionally, task structure is provided as prior knowledge to the learning agent
Model Representation


Dynamic Bayesian Networks (DBNs) for the transition and reward models
Symbolic representation of the conditional probabilities/reward values as decision trees
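As a rough sketch of how such a model could be stored (illustrative only; the class and field names DTNode and DBNActionModel below are hypothetical, not from the talk), each action gets one decision tree per next-state variable plus a reward tree:

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class DTNode:
    # Internal node: tests one parent variable; leaf: holds a distribution or a reward value.
    test_var: Optional[str] = None
    children: Dict[Any, "DTNode"] = field(default_factory=dict)  # test value -> subtree
    leaf: Any = None

@dataclass
class DBNActionModel:
    action: str
    transition: Dict[str, DTNode]   # next-state variable -> its decision-tree CPT
    reward: DTNode                  # decision tree over parent variables -> reward

The tree structure (which variables each tree tests) is what the causal analysis on the later slides inspects.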
Goal: Learn Task Hierarchies

Avoid the significant manual engineering of task decomposition
- Requires a deep understanding of the purpose and function of subroutines, as in computer science
Frameworks for learning exit-option hierarchies:
- HEXQ: determine exit states through random exploration
- VISA: determine exit states by analyzing DBN action models
Focused Creation of Subtasks

HEXQ & VISA: create a separate subtask for each possible exit state
- This can generate a large number of subtasks
Claim: defining good subtasks requires maximizing state abstraction while identifying “useful” subgoals
Our approach: selectively define subtasks with single abstract exit states
Transfer Learning Scenario

Working hypothesis:
- MAXQ value-function learning is much quicker than non-hierarchical (flat) Q-learning
- Hierarchical structure is more amenable to transfer from source tasks to the target than value functions
Transfer scenario:
- Solve a “source problem” (no CPU time limit)
  - Learn DBN models
  - Learn MAXQ hierarchy
- Solve a “target problem” under the assumption that the same hierarchical structure applies
  - This constraint will be relaxed in future work
MaxNode State Abstraction


[DBN fragment for action A_t: nodes X_t, Y_t → X_t+1, Y_t+1 and reward R_t+1]
Y is irrelevant within this action
- It affects the dynamics but not the reward function
In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed
- As a side effect, this enables “funnel” abstractions in parent tasks
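A hedged sketch of the irrelevance test implied above, reusing the hypothetical DBNActionModel/DTNode structures from the earlier sketch: a variable can be abstracted away inside a subtask if, for every action the subtask uses, it is never tested by the reward tree or by any other variable's transition tree (it may still drive its own dynamics, as Y does here).

def tree_tests(node, var):
    # Does this decision tree test `var` anywhere?
    if node is None or node.test_var is None:
        return False
    return node.test_var == var or any(
        tree_tests(child, var) for child in node.children.values())

def irrelevant_in_subtask(y, action_models):
    # action_models: DBNActionModel instances for the actions under this MAX node.
    for m in action_models:
        if tree_tests(m.reward, y):
            return False                       # y influences the reward
        for var, tree in m.transition.items():
            if var != y and tree_tests(tree, y):
                return False                   # y influences another variable
    return True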
Our Approach: AI-MAXQ
1. Learn DBN action models via random exploration (other work)
2. Apply Q-learning to solve the source problem
3. Generate a good trajectory from the learned Q function
4. Analyze the trajectory to produce a CAT (this talk)
5. Analyze the CAT to define the MAXQ hierarchy (this talk)
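In code form, the pipeline is roughly the following. This is a sketch only: each stage is passed in as a function supplied by the caller, and all names are placeholders for the components described on the surrounding slides.

def ai_maxq(source_mdp, learn_dbn_models, q_learning, greedy_rollout,
            build_cat, cat_scan):
    models = learn_dbn_models(source_mdp)        # random exploration (other work)
    q = q_learning(source_mdp)                   # solve the source problem
    trajectory = greedy_rollout(source_mdp, q)   # one good demonstration
    cat = build_cat(trajectory, models)          # causally annotated trajectory
    return cat_scan(cat, models)                 # induce the MAXQ hierarchy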
Wargus Resource-Gathering Domain
Causally Annotated Trajectory (CAT)
[CAT figure: Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End, with arcs labeled by the relevant variables (e.g. req.gold, req.wood, reg.*, a.r, a.l)]
A variable v is relevant to an action if the DBN for that action tests or changes that variable (this includes both the variable nodes and the reward nodes)
Create an arc from A to B labeled with variable v iff v is relevant to A and B but not to any intermediate actions
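The arc rule above translates fairly directly into code. In this sketch (illustrative only; relevant(action, v) is assumed to consult the learned DBN for that action), an arc labeled v connects each pair of consecutive steps to which v is relevant:

def build_cat(trajectory, variables, relevant):
    # trajectory: list of actions, with Start/End sentinels already included.
    # Returns arcs (i, j, v): v is relevant to steps i and j but to none in between.
    arcs = []
    for v in variables:
        hits = [i for i, action in enumerate(trajectory) if relevant(action, v)]
        for i, j in zip(hits, hits[1:]):   # consecutive relevant steps
            arcs.append((i, j, v))
    return arcs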
CAT Scan
[CAT figure repeated: Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End]
An action is absorbed regressively as long as:
- It does not have an effect beyond the trajectory segment (this prevents exogenous effects)
- It does not increase the state abstraction
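A minimal sketch of the regressive absorption loop, under my reading of the rule above (not the authors' code): starting from the action that achieves a goal, earlier actions are pulled into the segment while both conditions hold. The two predicates are supplied by the caller and stand in for checks against the CAT arcs and the DBN-derived abstraction.

def absorb_regressively(goal_index, has_effect_beyond, increases_abstraction):
    # has_effect_beyond(i, segment): does step i have a CAT arc to a step outside the segment?
    # increases_abstraction(i, segment): would absorbing step i enlarge the state abstraction?
    segment = [goal_index]
    i = goal_index - 1
    while (i >= 0
           and not has_effect_beyond(i, segment)
           and not increases_abstraction(i, segment)):
        segment.append(i)
        i -= 1
    return sorted(segment)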
CAT Scan (continued)
[Successive slides: the scan segments the trajectory, introduces a Root task over the whole trajectory, and groups the segments into Harvest Gold and Harvest Wood subtasks under Root]
Induced Wargus Hierarchy
Root
- Harvest Gold
  - Get Gold: GGoto(goldmine), Mine Gold
  - Put Gold: GGoto(townhall), GDeposit
- Harvest Wood
  - Get Wood: WGoto(forest), Chop Wood
  - Put Wood: WGoto(townhall), WDeposit
All four *Goto subtasks invoke the shared Goto(loc) task
Induced Abstraction & Termination
Task Name | State Abstraction | Termination Condition
Root | req.gold, req.wood | req.gold = 1 && req.wood = 1
Harvest Gold | req.gold, agent.resource, region.townhall | req.gold = 1
Get Gold | agent.resource, region.goldmine | agent.resource = gold
Put Gold | req.gold, agent.resource, region.townhall | agent.resource = 0
GGoto(goldmine) | agent.x, agent.y | agent.resource = 0 && region.goldmine = 1
GGoto(townhall) | agent.x, agent.y | req.gold = 0 && agent.resource = gold && region.townhall = 1
Harvest Wood | req.wood, agent.resource, region.townhall | req.wood = 1
Get Wood | agent.resource, region.forest | agent.resource = wood
Put Wood | req.wood, agent.resource, region.townhall | agent.resource = 0
WGoto(forest) | agent.x, agent.y | agent.resource = 0 && region.forest = 1
WGoto(townhall) | agent.x, agent.y | req.wood = 0 && agent.resource = wood && region.townhall = 1
Mine Gold | agent.resource, region.goldmine | NA
Chop Wood | agent.resource, region.forest | NA
GDeposit | req.gold, agent.resource, region.townhall | NA
WDeposit | req.wood, agent.resource, region.townhall | NA
Goto(loc) | agent.x, agent.y | NA
Note: because each subtask has a unique terminal state, Result Distribution Irrelevance applies
Claims

The resulting hierarchy is unique
- It does not depend on the order in which goals and trajectory sequences are analyzed
All state abstractions are safe
- There exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory
- Extends MAXQ Node Irrelevance to the induced structure
The learned hierarchical structure is “locally optimal”
- No local change in the trajectory segmentation can improve the state abstractions (very weak)
Experimental Setup





- Randomly generate pairs of source-target resource-gathering maps in Wargus
- Learn the optimal policy in the source
- Induce the task hierarchy from a single (near-)optimal trajectory
- Transfer this hierarchical structure to the MAXQ value-function learner for the target
- Compare to direct Q-learning and to MAXQ learning on a manually engineered hierarchy within the target
Hand-Built Wargus Hierarchy
[Hand-built hierarchy: Root over the subtasks Get Gold, Get Wood, and GWDeposit, with the primitives Mine Gold, Chop Wood, Deposit, and a shared Goto(loc)]
Hand-Built Abstractions & Terminations
Task Name | State Abstraction | Termination Condition
Root | req.gold, req.wood, agent.resource | req.gold = 1 && req.wood = 1
Harvest Gold | agent.resource, region.goldmine | agent.resource ≠ 0
Harvest Wood | agent.resource, region.forest | agent.resource ≠ 0
GWDeposit | req.gold, req.wood, agent.resource, region.townhall | agent.resource = 0
Mine Gold | region.goldmine | NA
Chop Wood | region.forest | NA
Deposit | req.gold, req.wood, agent.resource, region.townhall | NA
Goto(loc) | agent.x, agent.y | NA
Results: Wargus
[Learning curves, Wargus domain, 7 reps: Total Duration vs. Episode (0-100), comparing Induced (MAXQ), Hand-engineered (MAXQ), and No transfer (Q)]
Need For Demonstrations

VISA uses only DBNs for causal information
Problems:
- The analysis applies globally across the state space, without focusing on the pertinent subspace
- Global variable coupling might prevent concise abstractions
- Exit states can grow exponentially: one for each path in the decision-tree encoding
The modified bitflip domain exposes these shortcomings
Modified Bitflip Domain


State space: bits b0, …, bn-1
Action space:
- Flip(i), 0 ≤ i < n-1:
  - If b0 ∧ … ∧ bi-1 = 1 then bi ← ~bi
  - Else b0 ← 0, …, bi ← 0
- Flip(n-1):
  - If parity(b0, …, bn-2) ∧ bn-2 = 1 then bn-1 ← ~bn-1
  - Else b0 ← 0, …, bn-1 ← 0
  - parity(…) is even if n-1 is even, odd otherwise
Reward: -1 for all actions
Terminal/goal state: b0 ∧ … ∧ bn-1 = 1
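The dynamics above translate directly into code. This is a sketch under two stated assumptions: Flip(i) is defined down to i = 0 (the precondition is then vacuously true), and parity(b0, …, bn-2) means the number of 1s among b0, …, bn-2 must be even when n-1 is even and odd otherwise.

def flip(state, i):
    # state: list of n bits; returns (next_state, reward); reward is -1 for every action.
    n = len(state)
    if i < n - 1:                                  # Flip(i), 0 <= i < n-1
        if all(state[:i]):                         # b0 ∧ ... ∧ b_{i-1} = 1
            state[i] = 1 - state[i]
        else:
            for j in range(i + 1):                 # reset b0, ..., bi
                state[j] = 0
    else:                                          # Flip(n-1)
        want_even = (n - 1) % 2 == 0
        parity_ok = (sum(state[:n - 1]) % 2 == 0) == want_even
        if parity_ok and state[n - 2] == 1:
            state[n - 1] = 1 - state[n - 1]
        else:
            for j in range(n):                     # reset b0, ..., b_{n-1}
                state[j] = 0
    return state, -1

def is_goal(state):
    return all(state)                              # b0 ∧ ... ∧ b_{n-1} = 1

The worked example on the next slide (n = 7) is consistent with this reading of the dynamics.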
Modified Bitflip Domain
[Example (n = 7): 1110000 → Flip(3) → 1111000 → Flip(1) → 1011000 → Flip(4) → 0000000]
VISA’s Causal Graph
[Causal graph: variable nodes b0, b1, b2, …, bn-2, bn-1 and the reward node R, with edges labeled by the Flip actions]
- Variables are grouped into two strongly connected components (dashed ellipses)
- Both components affect the reward node
VISA task hierarchy
[VISA hierarchy: Root with 2^(n-3) exit options for the exit condition parity(b0, …, bn-2) ∧ bn-2 = 1 followed by Flip(n-1); the options are built from the primitive actions Flip(0), Flip(1), …, Flip(n-1)]
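As a sanity check on the exponential count (my derivation, not stated on the slide): the exit condition fixes bn-2 = 1 and constrains the parity of b0, …, bn-2, so
Number of exits = #{ (b0, …, bn-2) : bn-2 = 1 ∧ parity(b0, …, bn-2) holds } = (1/2) · 2^(n-2) = 2^(n-3),
one exit per satisfying assignment, i.e., per path in the decision-tree encoding of the condition.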
Bitflip CAT
[Bitflip CAT: Start → Flip(0) → Flip(1) → … → Flip(n-2) → Flip(n-1) → End; arcs are labeled with the bits relevant to each pair of actions, e.g. b0, …, bn-2 into Flip(n-1) and b0, …, bn-1 into End]
Induced MAXQ task hierarchy
[Induced hierarchy: Root invokes Flip(n-1) and a subtask with goal b0 ∧ … ∧ bn-2 = 1; that subtask invokes Flip(n-2) and a subtask with goal b0 ∧ … ∧ bn-3 = 1; and so on down to a subtask with goal b0 ∧ b1 = 1 that invokes Flip(0) and Flip(1)]
Results: Bitflip
[Learning curves, bitflip domain, 7 bits, 20 reps: Total Duration vs. Episode (0-100), comparing Q, MaxQ, and VISA]
Conclusion


Causality analysis is the key to our approach
- Enables us to find concise subtask definitions from a demonstration
- The CAT scan is easy to perform
Need to extend the approach to learn from multiple demonstrations
- Disjunctive goals