Transcript slides
Automatic Induction of MAXQ Hierarchies
Neville Mehta, Michael Wynkoop, Soumya Ray, Prasad Tadepalli, Tom Dietterich
School of EECS, Oregon State University
Funded by the DARPA Transfer Learning Program

Hierarchical Reinforcement Learning
- Exploits domain structure to facilitate learning:
  - Policy constraints
  - State abstraction
- Paradigms: Options, HAMs, MAXQ
- MAXQ task hierarchy:
  - Directed acyclic graph of subtasks
  - Leaves are the primitive MDP actions
- Traditionally, the task structure is provided as prior knowledge to the learning agent

Model Representation
- Dynamic Bayesian Networks (DBNs) for the transition and reward models
- Symbolic representation of the conditional probabilities/reward values as decision trees

Goal: Learn Task Hierarchies
- Avoid the significant manual engineering of task decomposition, which requires a deep understanding of the purpose and function of subroutines, as in computer science
- Frameworks for learning exit-option hierarchies:
  - HEXQ: determines exit states through random exploration
  - VISA: determines exit states by analyzing DBN action models

Focused Creation of Subtasks
- HEXQ & VISA create a separate subtask for each possible exit state; this can generate a large number of subtasks
- Claim: defining good subtasks requires maximizing state abstraction while identifying "useful" subgoals
- Our approach: selectively define subtasks with single abstract exit states

Transfer Learning Scenario
- Working hypotheses:
  - MAXQ value-function learning is much quicker than non-hierarchical (flat) Q-learning
  - Hierarchical structure is more amenable to transfer from source tasks to the target than value functions
- Transfer scenario:
  - Solve a "source problem" (no CPU time limit): learn the DBN models, then learn the MAXQ hierarchy
  - Solve a "target problem" under the assumption that the same hierarchical structure applies (this constraint will be relaxed in future work)

MaxNode State Abstraction
[DBN fragment for action A_t: state variables X_t → X_t+1, Y_t → Y_t+1, and reward R_t+1]
- Y is irrelevant within this action A_t: it affects the dynamics but not the reward function
- In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed
- As a side-effect, this enables "funnel" abstractions in parent tasks

Our Approach: AI-MAXQ
- Learn DBN action models via random exploration (other work)
- Apply Q-learning to solve the source problem
- Generate a good trajectory from the learned Q function
- Analyze the trajectory to produce a causally annotated trajectory, or CAT (this talk)
- Analyze the CAT to define the MAXQ hierarchy (this talk)

Wargus Resource-Gathering Domain

Causally Annotated Trajectory (CAT)
[CAT diagram for the Wargus trajectory Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End, with arcs labeled by variables such as req.gold, req.wood, a.r, a.l, a.*, reg.*]
- A variable v is relevant to an action if the DBN for that action tests or changes that variable (this includes both the variable nodes and the reward nodes)
- Create an arc from A to B labeled with variable v iff v is relevant to A and B but not to any intermediate actions (sketched below)
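The arc-creation rule above amounts to connecting, for each variable, consecutive relevant actions along the trajectory. Below is a minimal Python sketch of that rule, assuming the trajectory is given as a list of action names (with Start and End pseudo-actions at the ends) and that a relevance predicate derived from the learned DBNs is supplied by the caller; the names build_cat and relevant are illustrative, not from the talk.

# Sketch of CAT construction (illustrative names, not the authors' code).
# An arc (i, j, v) links trajectory positions i and j with label v iff v is
# relevant to both actions but to none of the actions strictly between them.
def build_cat(trajectory, variables, relevant):
    # trajectory: list of action names, with Start/End pseudo-actions at the ends.
    # relevant(a, v): True if action a's DBN tests or changes variable v
    # (including the reward node).
    arcs = []
    for v in variables:
        hits = [i for i, a in enumerate(trajectory) if relevant(a, v)]
        # Consecutive relevant positions are linked; everything between them
        # is irrelevant to v by construction.
        arcs.extend((i, j, v) for i, j in zip(hits, hits[1:]))
    return arcs

Each variable thus contributes a chain of arcs over exactly those actions that test or change it.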
CAT Scan
- An action is absorbed regressively into a trajectory segment as long as:
  - It does not have an effect beyond the trajectory segment (preventing exogenous effects)
  - It does not increase the state abstraction
[CAT-scan diagram, shown over several animation steps: the trajectory Start → Goto → MG → Goto → Dep → Goto → CW → Goto → Dep → End is segmented under Root into a Harvest Gold segment (Goto, MG, Goto, Dep) and a Harvest Wood segment (Goto, CW, Goto, Dep)]

Induced Wargus Hierarchy
[Hierarchy diagram: Root has children Harvest Gold and Harvest Wood; Harvest Gold decomposes into Get Gold (GGoto(goldmine), Mine Gold) and Put Gold (GGoto(townhall), GDeposit); Harvest Wood decomposes into Get Wood (WGoto(forest), Chop Wood) and Put Wood (WGoto(townhall), WDeposit); the Goto subtasks share Goto(loc)]

Induced Abstraction & Termination
Task Name | State Abstraction | Termination Condition
Root | req.gold, req.wood | req.gold = 1 && req.wood = 1
Harvest Gold | req.gold, agent.resource, region.townhall | req.gold = 1
Get Gold | agent.resource, region.goldmine | agent.resource = gold
Put Gold | req.gold, agent.resource, region.townhall | agent.resource = 0
GGoto(goldmine) | agent.x, agent.y | agent.resource = 0 && region.goldmine = 1
GGoto(townhall) | agent.x, agent.y | req.gold = 0 && agent.resource = gold && region.townhall = 1
Harvest Wood | req.wood, agent.resource, region.townhall | req.wood = 1
Get Wood | agent.resource, region.forest | agent.resource = wood
Put Wood | req.wood, agent.resource, region.townhall | agent.resource = 0
WGoto(forest) | agent.x, agent.y | agent.resource = 0 && region.forest = 1
WGoto(townhall) | agent.x, agent.y | req.wood = 0 && agent.resource = wood && region.townhall = 1
Mine Gold | agent.resource, region.goldmine | NA
Chop Wood | agent.resource, region.forest | NA
GDeposit | req.gold, agent.resource, region.townhall | NA
WDeposit | req.wood, agent.resource, region.townhall | NA
Goto(loc) | agent.x, agent.y | NA
- Note that because each subtask has a unique terminal state, Result Distribution Irrelevance applies (a minimal representation sketch of these task definitions appears after the Wargus results below)

Claims
- The resulting hierarchy is unique: it does not depend on the order in which goals and trajectory sequences are analyzed
- All state abstractions are safe: we extend MAXQ Node Irrelevance to the induced structure
- There exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory
- The learned hierarchical structure is "locally optimal": no local change in the trajectory segmentation can improve the state abstractions (a very weak guarantee)

Experimental Setup
- Randomly generate pairs of source-target resource-gathering maps in Wargus
- Learn the optimal policy in the source
- Induce the task hierarchy from a single (near-)optimal trajectory
- Transfer this hierarchical structure to the MAXQ value-function learner for the target
- Compare to direct Q-learning, and to MAXQ learning on a manually engineered hierarchy, within the target

Hand-Built Wargus Hierarchy
[Hierarchy diagram: Root has children Get Gold, Get Wood, and GWDeposit; Get Gold decomposes into Goto(loc) and Mine Gold, Get Wood into Goto(loc) and Chop Wood, and GWDeposit into Goto(loc) and Deposit]

Hand-Built Abstractions & Terminations
Task Name | State Abstraction | Termination Condition
Root | req.gold, req.wood, agent.resource | req.gold = 1 && req.wood = 1
Harvest Gold | agent.resource, region.goldmine | agent.resource ≠ 0
Harvest Wood | agent.resource, region.forest | agent.resource ≠ 0
GWDeposit | req.gold, req.wood, agent.resource, region.townhall | agent.resource = 0
Mine Gold | region.goldmine | NA
Chop Wood | region.forest | NA
Deposit | req.gold, req.wood, agent.resource, region.townhall | NA
Goto(loc) | agent.x, agent.y | NA

Results: Wargus
[Learning curves for the Wargus domain (7 reps): Total Duration vs. Episode (0-100) for Induced (MAXQ), Hand-engineered (MAXQ), and No transfer (Q)]
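The induced hierarchy is fully described by, for each task, the set of retained state variables and a termination condition over them. Below is a minimal Python sketch of one way such a task definition could be represented, with two entries transcribed from the induced Wargus table above; the Task class and its field names are illustrative, not the authors' code.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    # Illustrative container for one induced MAXQ subtask.
    name: str
    abstraction: List[str]                 # state variables kept by the abstraction
    terminated: Callable[[Dict], bool]     # termination condition over the abstract state
    children: List["Task"] = field(default_factory=list)

get_gold = Task(
    name="Get Gold",
    abstraction=["agent.resource", "region.goldmine"],
    terminated=lambda s: s["agent.resource"] == "gold",
)
harvest_gold = Task(
    name="Harvest Gold",
    abstraction=["req.gold", "agent.resource", "region.townhall"],
    terminated=lambda s: s["req.gold"] == 1,
    children=[get_gold],                   # Put Gold and its children omitted
)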
Need For Demonstrations
- VISA only uses DBNs for causal information
- Problems:
  - The causal analysis is applied globally across the state space, without focusing on the pertinent subspace
  - Global variable coupling might prevent concise abstraction
  - Exit states can grow exponentially: one for each path in the decision-tree encoding
- The modified bitflip domain exposes these shortcomings

Modified Bitflip Domain
- State space: b0, …, bn-1
- Action space:
  - Flip(i), 0 ≤ i < n-1: if b0 ∧ … ∧ bi-1 = 1 then bi ← ¬bi, else b0 ← 0, …, bi ← 0
  - Flip(n-1): if parity(b0, …, bn-2) ∧ bn-2 = 1 then bn-1 ← ¬bn-1, else b0 ← 0, …, bn-1 ← 0, where parity(…) = even if n-1 is even, odd otherwise
- Reward: -1 for all actions
- Terminal/goal state: b0 ∧ … ∧ bn-1 = 1
[Example (n = 7): 1110000 —Flip(3)→ 1111000 —Flip(1)→ 1011000 —Flip(4)→ 0000000]

VISA's Causal Graph
[Causal graph diagram: variable nodes b0, b1, b2, …, bn-2, bn-1 and reward node R, with edges labeled by the Flip actions]
- Variables are grouped into two strongly connected components (dashed ellipses)
- Both components affect the reward node

VISA Task Hierarchy
[Hierarchy diagram: Root sits above 2^(n-3) exit options for the exit condition parity(b0, …, bn-2) ∧ bn-2 = 1, over the primitive actions Flip(0), Flip(1), …, Flip(n-1)]

Bitflip CAT
[CAT diagram: Start → Flip(0) → Flip(1) → … → Flip(n-2) → Flip(n-1) → End, with arcs labeled b0; b1; …; b0, …, bn-2; b0, …, bn-1]

Induced MAXQ Task Hierarchy
[Hierarchy diagram: Root invokes Flip(n-1) and a subtask with goal b0 ∧ … ∧ bn-2 = 1; that subtask invokes Flip(n-2) and a subtask with goal b0 ∧ … ∧ bn-3 = 1; and so on, down to a subtask with goal b0 ∧ b1 = 1 that invokes Flip(1) and Flip(0)]

Results: Bitflip
[Learning curves for the bitflip domain (7 bits, 20 reps): Total Duration vs. Episode (0-100) for Q, MAXQ, and VISA]

Conclusion
- Causality analysis is the key to our approach:
  - It enables us to find concise subtask definitions from a demonstration
  - The CAT scan is easy to perform
- Need to extend to:
  - Learning from multiple demonstrations
  - Disjunctive goals
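To make the Flip dynamics and the parity guard on Flip(n-1) concrete, here is a minimal Python sketch of the modified bitflip transitions as described above, assuming states are lists of 0/1 bits; the function names parity_ok and flip are illustrative. It reproduces the seven-bit example shown earlier.

# Minimal simulator for the modified bitflip domain described above.
# State: list of n bits b[0..n-1]; goal: all bits 1; reward: -1 per action.
def parity_ok(b):
    # parity(b0, ..., bn-2): even if n-1 is even, odd otherwise.
    n = len(b)
    want_even = (n - 1) % 2 == 0
    return (sum(b[:n - 1]) % 2 == 0) == want_even

def flip(b, i):
    # Apply Flip(i) to state b, returning (next_state, reward).
    b = list(b)
    n = len(b)
    if i < n - 1:
        if all(b[:i]):              # b0 ... b(i-1) all 1 (vacuously true for i = 0)
            b[i] ^= 1
        else:
            for j in range(i + 1):
                b[j] = 0
    else:                           # Flip(n-1): parity-guarded flip of the last bit
        if parity_ok(b) and b[n - 2] == 1:
            b[n - 1] ^= 1
        else:
            b = [0] * n
    return b, -1

# Worked example from the slide (n = 7):
# 1110000 -Flip(3)-> 1111000 -Flip(1)-> 1011000 -Flip(4)-> 0000000
s = [1, 1, 1, 0, 0, 0, 0]
for a in (3, 1, 4):
    s, _ = flip(s, a)
print(s)   # [0, 0, 0, 0, 0, 0, 0]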