Multiagent Coordination, Planning, Learning and Generalization with Factored MDPs Carlos Guestrin


Multiagent Coordination, Planning,
Learning and Generalization
with Factored MDPs
Carlos Guestrin
Daphne Koller
Stanford University
Includes collaborations with:
Geoffrey Gordon1, Michail Lagoudakis2, Ronald Parr2,
Relu Patrascu4, Dale Schuurmans4, Shobha Venkataraman3
1 Carnegie Mellon University   2 Duke University   3 Stanford University   4 University of Waterloo
Multiagent Coordination Examples






Search and rescue
Factory management
Supply chain
Firefighting
Network routing
Air traffic control



Multiple, simultaneous decisions
Limited observability
Limited communication
Network Management Problem
[Figure: two-time-slice model of the four-machine network M1–M4. Each machine i has a status S_i and load L_i at time t; its status S_i' and load L_i' at time t+1 depend on its own variables, its neighboring machines, and the administrator's action A_i. A reward R_i is received when a process terminates successfully.]
Administrators must coordinate to
maximize global reward
Joint Decision Space

Represent as MDP:



Action space: joint action a = {a_1, …, a_n} for all agents
State space: joint state x of entire system
Reward function: total reward r

Action space is exponential in # agents

State space is exponential in # variables

Global decision requires complete observation
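To make the blow-up concrete, here is a toy Python sketch (not from the talk) that enumerates the joint spaces for a small instance:

# A toy sketch of why the joint view blows up: with n agents taking binary
# actions and m binary state variables, the joint action and state spaces
# have 2^n and 2^m elements, and a tabular Q needs 2^(n+m) entries.
from itertools import product

n_agents, n_state_vars = 4, 4
joint_actions = list(product([0, 1], repeat=n_agents))
joint_states = list(product([0, 1], repeat=n_state_vars))
print(len(joint_actions), len(joint_states))      # 16 16 for this toy size
print(len(joint_actions) * len(joint_states))     # 256 (state, action) pairs
# For hundreds of agents these counts become astronomically large.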
Multiagents with Factored MDPs

Coordination

Planning

Learning

Generalization
Long-Term Utilities

One step utility:




SysAdmin Ai receives reward ($) if process completes
Total utility: sum of rewards
Optimal action requires long-term planning
Long-term utility Q(x,a):


Expected long-term reward, given current state x and action a
Optimal action at state x is:
\max_{\mathbf{a}} Q(\mathbf{x}, \mathbf{a})
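Written out (a standard discounted formulation; the discount factor γ is an assumption here, since the slide only says "sum of rewards"):

Q(\mathbf{x}, \mathbf{a}) \;=\; R(\mathbf{x}, \mathbf{a}) \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) \, V(\mathbf{x}'),
\qquad
\pi^*(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}} Q(\mathbf{x}, \mathbf{a})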
Local Q function Approximation
[Guestrin, Koller, Parr ‘01]
Q(A_1, \dots, A_4, X_1, \dots, X_4) \approx Q_1(A_1, A_4, X_1, X_4) + Q_2(A_1, A_2, X_1, X_2) + Q_3(A_2, A_3, X_2, X_3) + Q_4(A_3, A_4, X_3, X_4)

[Figure: four-machine ring M1–M4; Q_3 is associated with agent 3, which observes only X_2 and X_3.]

Limited observability: agent i only observes variables in Q_i
Must choose action to maximize \sum_i Q_i
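A minimal Python sketch of this decomposition for the four-machine ring; the payoff numbers are invented, and only the variable scopes mirror the slide:

# Each local Qi depends only on two neighboring machines' states and actions.
def Q1(a1, a4, x1, x4): return 1.0 * x1 - 0.2 * (a1 + a4)   # hypothetical payoffs
def Q2(a1, a2, x1, x2): return 1.0 * x2 - 0.2 * (a1 + a2)
def Q3(a2, a3, x2, x3): return 1.0 * x3 - 0.2 * (a2 + a3)
def Q4(a3, a4, x3, x4): return 1.0 * x4 - 0.2 * (a3 + a4)

def Q_global(a, x):
    """Approximate global Q as the sum of local terms. Agent 3, for example,
    only needs to observe X2 and X3 to evaluate its own Q3."""
    a1, a2, a3, a4 = a
    x1, x2, x3, x4 = x
    return (Q1(a1, a4, x1, x4) + Q2(a1, a2, x1, x2) +
            Q3(a2, a3, x2, x3) + Q4(a3, a4, x3, x4))

print(Q_global(a=(0, 1, 0, 0), x=(1, 0, 1, 1)))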
Maximizing \sum_i Q_i: Coordination Graph
[Guestrin, Koller, Parr ‘01]




Trees don’t increase communication requirements
Cycles require graph triangulation

[Figure: example coordination graph over agents A_1–A_11.]
Limited communication for optimal action choice
Comm. bandwidth = induced width of coord. graph
Variable Coordination Structure
[Guestrin, Venkataraman, Koller ‘02]

With whom should I coordinate?

It depends!

Real-world: coordination structure must be dynamic

Exploit context-specific independence

Obtain a coordination structure that changes with the state
Multiagents with Factored MDPs

Coordination

Planning

Learning

Generalization
Where do the Qi come from?

Use function approximation to find Qi:
Q(X_1, \dots, X_4, A_1, \dots, A_4) \approx Q_1(A_1, A_4, X_1, X_4) + Q_2(A_1, A_2, X_1, X_2) + Q_3(A_2, A_3, X_2, X_3) + Q_4(A_3, A_4, X_3, X_4)

Long-term planning requires Markov Decision Process



# states exponential
# actions exponential
Efficient approximation by exploiting structure!
Dynamic Decision Diagram
[Figure: two-time-slice dynamic decision network for machines M1–M4, with action nodes A_i, state nodes X_i and X_i', and local transition models such as P(X_1' | X_1, X_4, A_1).]




State
Dynamics
Decisions
Rewards
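The factored structure can be read as a generative model in code; the following Python sketch uses invented probabilities, and only the parent sets (own status, one neighbor's status, own action) mirror the diagram:

import random

def p_working_next(x_i, x_neighbor, a_i):
    """Probability that machine i works at t+1 (1 = working, 0 = faulty),
    given only its parents in the network: X_i, a neighbor's X, and A_i."""
    if a_i == 1:              # reboot: comes back up with high probability
        return 0.95
    base = 0.9 if x_i == 1 else 0.1
    if x_neighbor == 0:       # a faulty neighbor drags the machine down
        base -= 0.3
    return max(base, 0.0)

def sample_next_state(x, a, neighbor_of):
    """Sample X' one variable at a time, using only each variable's parents."""
    return tuple(int(random.random() < p_working_next(x[i], x[neighbor_of[i]], a[i]))
                 for i in range(len(x)))

neighbor_of = {0: 3, 1: 0, 2: 1, 3: 2}   # 4-machine ring
print(sample_next_state(x=(1, 1, 0, 1), a=(0, 0, 1, 0), neighbor_of=neighbor_of))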
Long-term Utility = Value of MDP

Value computed by linear programming:
[Manne `60]
minimize : V (x )
x
V ( x )  Q (a, x )
subject to : 
x, a



One variable V (x) for each state
One constraint for each state x and action a
Number of states and actions exponential!
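For a tiny MDP the exact LP can be written down directly. The sketch below uses made-up two-state, two-action numbers and scipy; there is one LP variable per state and one constraint per (state, action) pair, which is exactly what stops scaling once states and actions are exponential:

import numpy as np
from scipy.optimize import linprog

gamma = 0.9
# P[a][x, x'] = transition probabilities, R[a][x] = rewards (hypothetical numbers).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.6, 0.4]])]
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]

n = 2
A_ub, b_ub = [], []
for a in range(2):
    # Constraint V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x') for every x,
    # rewritten as -(I - gamma*P_a) V <= -R_a for the solver.
    A_ub.append(-(np.eye(n) - gamma * P[a]))
    b_ub.append(-R[a])

res = linprog(c=np.ones(n), A_ub=np.vstack(A_ub), b_ub=np.hstack(b_ub),
              bounds=[(None, None)] * n)
print(res.x)   # the optimal value function V*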
Decomposable Value Functions
Linear combination of restricted domain functions
[Bellman et al. `63]
[Tsitsiklis & Van Roy `96]
[Koller & Parr `99,`00]
[Guestrin et al. `01]

V(\mathbf{x}) \approx \sum_i w_i \, h_i(\mathbf{x})

Each h_i is the status of a small part of the complex system:
Status of a machine and its neighbors
Load on a machine
Must find w giving good approximate value function
Single LP Solution for Factored MDPs
[Schweitzer and Seidmann ‘85]
[de Farias and Van Roy ‘01]

\text{minimize: } \sum_i w_i
\text{subject to: } \sum_i w_i h_i(\mathbf{x}) \ge \sum_i Q_i(\mathbf{a}, \mathbf{x}) \quad \forall\, \mathbf{x}, \mathbf{a}

One variable w_i for each basis function → polynomially many LP variables
One constraint for every state and action → exponentially many LP constraints
h_i, Q_i depend on small sets of variables/actions → exploit structure as in variable elimination
[Guestrin, Koller Parr `01]
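The sketch below writes down this approximate LP for a toy four-machine problem (dynamics, rewards, and basis functions are invented). It simply enumerates all (x, a) constraints, which only works because the toy problem is tiny; the factored-LP algorithm instead represents the exponentially many constraints compactly, using the same variable-elimination ideas as the coordination graph:

import numpy as np
from itertools import product
from scipy.optimize import linprog

gamma, n = 0.95, 4

def basis(x):
    # h_0(x) = 1 (constant basis), h_i(x) = "machine i is working"
    return np.array([1.0] + list(x))

def expected_basis_next(x, a):
    # Toy factored dynamics: a rebooted machine works next step w.p. 0.95,
    # otherwise it keeps working w.p. 0.9; so E[h_i(x') | x, a] is cheap.
    p = [0.95 if a[i] == 1 else 0.9 * x[i] for i in range(n)]
    return np.array([1.0] + p)

def reward(x, a):
    return sum(x) - 0.1 * sum(a)     # working machines minus a small reboot cost

A_ub, b_ub = [], []
for x, a in product(product([0, 1], repeat=n), repeat=2):
    # Constraint: sum_i w_i [h_i(x) - gamma * E[h_i(x') | x, a]] >= R(x, a)
    coeff = basis(x) - gamma * expected_basis_next(x, a)
    A_ub.append(-coeff)
    b_ub.append(-reward(x, a))

res = linprog(c=np.ones(n + 1), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * (n + 1))
print(res.x)   # weights w_0, ..., w_4 of the approximate value function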
Summary of Algorithm
1. Pick local basis functions hi
2. Single LP to compute local Qi’s in factored MDP
3. Coordination graph computes maximizing action
Multiagent Policy Quality
Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. ‘99]

[Figure: estimated value per agent vs. number of agents (2–16), comparing the LP solution with a single basis and with a pair basis against Distributed Reward and Distributed Value Function; the utopic maximum value is shown for reference.]
Multiagent Running Time

[Figure: total running time in seconds (0–180) vs. number of agents (2–16) for three configurations: ring of rings, star with pair basis, and star with single basis.]
Solve Very Large MDPs
Solved MDPs with:
500 agents;
over 10^150 actions and
1322070819480806636890455259752144365965422032752148167664920368
2268285973467048995407783138506080619639097776968725823559509545
8210061891186534272525795367402762022519832080387801477422896484
1274390400117588618041128947815623094438061566173054086674490506
1781254803444055470543970388958174653682549161362208302685637785
8229022846398307887896918556404084898937609373242171846359938695
5167650189405881090604260896714388641028143503856487471658320106
14366132173102768902855220001
states
Multiagents with Factored MDPs

Coordination

Planning

Learning

Generalization
Reinforcement Learning
Do we know the model?
NO:


Reinforcement learning



Model-free approach:


Training data: < x, a, x’, r >
Data collected while acting in the world
Learn Q-function directly
[Guestrin, Lagoudakis, Parr `02]
Model-based approach:

Learn factored MDP
[Guestrin, Patrascu, Schuurmans `02]
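As a deliberately simplified, model-free sketch (this is not the coordinated LSPI algorithm of Guestrin, Lagoudakis & Parr): plain Q-learning with a linear Q over per-agent features, trained from ⟨x, a, x', r⟩ samples; the greedy maximization is brute force here, where the real algorithm would use the coordination graph:

import numpy as np
from itertools import product

n, gamma, alpha = 2, 0.9, 0.05
joint_actions = list(product([0, 1], repeat=n))

def features(x, a):
    # One local feature block per agent, indexed by that agent's (x_i, a_i).
    phi = np.zeros(n * 4)
    for i in range(n):
        phi[i * 4 + 2 * x[i] + a[i]] = 1.0
    return phi

w = np.zeros(n * 4)

def q(x, a):
    return w @ features(x, a)

def q_learning_update(x, a, x_next, r):
    """One TD update from a single <x, a, x', r> sample collected in the world."""
    global w
    target = r + gamma * max(q(x_next, a2) for a2 in joint_actions)
    w += alpha * (target - q(x, a)) * features(x, a)

q_learning_update(x=(1, 0), a=(0, 1), x_next=(1, 1), r=1.0)   # made-up sample
print(w)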
Power Grid – Multiagent LSPI
[Schneider et al ‘99]

Grid | DR           | DVF          | Multiagent LSPI (no comm.) | Multiagent LSPI (pairwise comm.)
A    | 41.00 ± 0.30 | 17.17 ± 5.87 | 28.19 ± 8.30               | 0.08 ± 0.01
B    | 0.65 ± 0.57  | 0.32 ± 0.07  | 37.08 ± 21.85              | 0.13 ± 0.02
C    | 90.00 ± 1.78 | 44.00 ± 8.75 | 83.32 ± 16.57              | 40.86 ± 1.14
D    | 0.32 ± 0.19  | 0.17 ± 0.02  | 28.45 ± 17.83              | 0.11 ± 0.02

Lower is better!
Multiagents with Factored MDPs

Coordination

Planning

Learning

Generalization
Hierarchical and Relational Models
[Guestrin, Gordon ‘02]
[Guestrin, Koller ‘02]
[Figure: relational model with Server and Client classes, their instances, and the relations between them.]

Classes of objects
Instances
Relations
Value functions at the class level
Factored MDP equivalents of:


OOBNs [Koller, Pfeffer ‘97]
PRMs [Koller, Pfeffer ‘98]
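Roughly, one set of weights is learned per class and reused for every instance of that class; in notation assumed here (not taken from the slides), the value function of a whole world factors as:

V_{\text{world}}(\mathbf{x}) \;\approx\; \sum_{C \in \text{classes}} \; \sum_{o \in \mathcal{O}[C]} w_C^{\top} \, h_C(\mathbf{x}[o])

where \mathcal{O}[C] are the objects of class C in the current world and \mathbf{x}[o] are the state variables of object o.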
Generalization



Sample a set of scenarios
Solve a linear program with these scenarios to obtain class value functions
When faced with a new problem:


Use class value function
No re-planning needed
Theorem
Exponentially (infinitely) many worlds ⇒ need exponentially many samples?
NO! Polynomially many sampled worlds suffice:
value function within ε, with prob. at least 1 − δ.
Proof method related to [de Farias, Van Roy ‘02]
Generalizing to New Problems

[Figure: estimated policy value per agent on the Ring, Star, and Three legs topologies, comparing the class-based value function, the 'optimal' approximate value function, and the utopic maximum value.]
Classes of Objects Discovered
Learned 3 classes

[Figure: network topology with nodes labeled by the discovered classes: Server, Intermediate, and Leaf.]
Learning Classes of Objects

[Figure: max-norm error of the value function on the Ring, Star, and Three legs topologies, comparing learnt classes against no class learning.]
Roadmap for Multiagents
Multiagent Coordination and Planning
Variable Coordination Structure
Coordinated Reinforcement Learning
Loopy Approximate Linear Programming
Hierarchical Factored MDPs
Relational MDPs
Conclusions

Multiagent planning algorithm:


Limited Communication
Limited Observability

Unified view of function approximation and multiagent communication
Single LP solution is simple and very efficient
Efficient reinforcement learning
Generalization to new domains

Exploit structure to reduce computation costs!




Solve very large MDPs efficiently
Maximizing \sum_i Q_i: Coordination Graph
[Guestrin, Koller, Parr ‘01]

Use variable elimination for maximization:
[Bertele & Brioschi ‘72]

\max_{A_1, A_2, A_3, A_4} \; Q_1(A_1, A_2) + Q_2(A_1, A_3) + Q_3(A_3, A_4) + Q_4(A_2, A_4)
\;=\; \max_{A_1, A_2, A_3} \; Q_1(A_1, A_2) + Q_2(A_1, A_3) + \max_{A_4} \big[ Q_3(A_3, A_4) + Q_4(A_2, A_4) \big]
\;=\; \max_{A_1, A_2, A_3} \; Q_1(A_1, A_2) + Q_2(A_1, A_3) + g_1(A_2, A_3)

[Figure: coordination graph over agents A_1–A_4 with local terms Q_1, Q_2, Q_3, Q_4 attached to its edges.]

Example: if A_2 reboots and A_3 does nothing, then A_4 gets $10.

Here we need only 23, instead of 63 sum operations.

Limited communication for optimal action choice
Comm. bandwidth = induced width of coord. graph
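A small Python sketch of this elimination on the example above; the payoff tables are invented, and only the elimination order and the intermediate function g_1(A_2, A_3) follow the slide:

from itertools import product

# Hypothetical local payoff tables Q_i[(a_i, a_j)] -> value, binary actions.
Q1 = dict(zip(product([0, 1], repeat=2), [0.0, 1.0, 2.0, 0.5]))   # (A1, A2)
Q2 = dict(zip(product([0, 1], repeat=2), [1.0, 0.0, 0.5, 2.0]))   # (A1, A3)
Q3 = dict(zip(product([0, 1], repeat=2), [0.0, 2.0, 1.0, 0.0]))   # (A3, A4)
Q4 = dict(zip(product([0, 1], repeat=2), [1.5, 0.0, 0.0, 1.0]))   # (A2, A4)

# Step 1: eliminate A4, producing g1(A2, A3) = max_{A4} [Q3(A3,A4) + Q4(A2,A4)].
g1 = {(a2, a3): max(Q3[(a3, a4)] + Q4[(a2, a4)] for a4 in (0, 1))
      for a2, a3 in product((0, 1), repeat=2)}

# Step 2: maximize the remaining terms over A1, A2, A3.
best_value, best_a123 = max(
    (Q1[(a1, a2)] + Q2[(a1, a3)] + g1[(a2, a3)], (a1, a2, a3))
    for a1, a2, a3 in product((0, 1), repeat=3))
print(best_value, best_a123)   # A4 is then recovered by a backward pass over g1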