
Learning to Improve the Quality of Plans Produced by Partial-order Planners
M. Afzal Upal
Intelligent Agents & Multiagent Systems Lab
Outline
- Artificial Intelligence Planning: Problems and Solutions
- Why Learn to Improve Plan Quality?
- The Performance Improving Partial-order Planner (PIP)
  - Intra-solution Learning (ISL) algorithm
  - Search-control vs rewrite rules
  - Empirical evaluation
- Conclusion
The Performance Task: Classical AI Planning
Given:
- Initial state, e.g., the 8-puzzle board
    1 3 4
    8 2 _
    7 6 5
- Goals, e.g., the board
    1 2 3
    8 _ 4
    7 6 5
- Actions: {up, down, left, right}
Find:
- A sequence of actions that achieves the goals when executed in the initial state, e.g., down(4), right(3), up(2)
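To make the performance task concrete, here is a minimal sketch (my own illustration, not part of the original talk) that encodes the 8-puzzle instance above and finds an action sequence by blind forward search over world states; the down/right/up/left naming of moves follows the slide.

from collections import deque

# Sketch (illustration only): the 8-puzzle instance from the slide, solved by
# breadth-first search over complete board states.  Boards are 9-tuples in
# row-major order; 0 marks the blank.  An action ('down', 4) corresponds to
# down(4) on the slide: tile 4 slides down into the blank.

INITIAL = (1, 3, 4,
           8, 2, 0,
           7, 6, 5)
GOAL = (1, 2, 3,
        8, 0, 4,
        7, 6, 5)

def successors(state):
    """Yield (action, next_state) pairs obtained by sliding one tile."""
    b = state.index(0)              # index of the blank
    row, col = divmod(b, 3)
    moves = []
    if row > 0: moves.append((b - 3, 'down'))   # tile above slides down
    if row < 2: moves.append((b + 3, 'up'))     # tile below slides up
    if col > 0: moves.append((b - 1, 'right'))  # tile on the left slides right
    if col < 2: moves.append((b + 1, 'left'))   # tile on the right slides left
    for t, direction in moves:
        board = list(state)
        board[b], board[t] = board[t], 0
        yield (direction, state[t]), tuple(board)

def find_plan(initial, goal):
    """Breadth-first search: returns a shortest sequence of actions."""
    frontier, seen = deque([(initial, [])]), {initial}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None

print(find_plan(INITIAL, GOAL))     # e.g. [('down', 4), ('right', 3), ('up', 2)]

Plan-space planners, discussed on the following slides, reach the same kind of plan without ever enumerating complete board states.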
Automated Planning Systems
- Domain-independent planning systems: modular, sound, and complete
- Domain-dependent planning systems: practical, efficient, and produce high-quality plans
Domain-Independent Systems
- State-space search (each search node is a valid world state), e.g., PRODIGY, FF
- Partial-order plan-space search (each search node is a partially ordered plan), e.g., SNLP, UCPOP
- Graphplan-based search (each search node is a union of world states), e.g., STAN
- Compilation to general search:
  - satisfiability engines, e.g., SATPLAN
  - constraint satisfaction engines, e.g., CPLAN
State-space vs Plan-space Planning
[Figure: an 8-puzzle problem solved two ways. State-space search moves through a sequence of complete board states via right(8), down(2), left(4), up(6); plan-space search builds a partially ordered set of the same actions, right(8), down(2), left(4), up(6), terminated by END.]
Partial-order Plan-space Planning
- Partial-order planning is the process of removing flaws: unresolved (open) goals, and unordered actions that cannot take place at the same time (threats)
Partial-order Plan-space Planning
- Decouple the order in which actions are added during planning from the order in which they appear in the final plan
[Figure: a partial order over four plan steps; only some pairs of steps are ordered relative to each other.]
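The following sketch (my own, with invented field names rather than PIP's actual data structures) shows what a partial-order plan typically records: steps carry explicit ordering constraints and causal links instead of a fixed position in a sequence, so several linearizations of the same plan remain possible.

from dataclasses import dataclass, field
from itertools import permutations

# Sketch of a partial-order plan: steps plus only the orderings that matter.
# Field names are illustrative, not taken from PIP.

@dataclass
class PartialPlan:
    steps: set = field(default_factory=set)
    orderings: set = field(default_factory=set)        # (a, b) means a must precede b
    causal_links: set = field(default_factory=set)      # (producer, condition, consumer)
    open_conditions: set = field(default_factory=set)   # (condition, consumer) not yet supported

    def linearizations(self):
        """All total orders consistent with the ordering constraints."""
        for seq in permutations(self.steps):
            if all(seq.index(a) < seq.index(b) for a, b in self.orderings):
                yield seq

plan = PartialPlan(steps={'s1', 's2', 's3', 's4'},
                   orderings={('s1', 's2'), ('s3', 's4')})
print(len(list(plan.linearizations())))   # 6 valid orders for 4 partially ordered steps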
Learning to Improve Plan Quality for Partial-order Planners
- How to represent plan quality information?
  - Extended STRIPS operators + a value function
- How to identify learning opportunities? (there are no planning failures or successes to learn from)
  - Assume a better-quality model plan for the given problem is available (from a domain expert or through a more extensive automated search of the problem's search space)
- What search features should the quality-improving search-control knowledge be based on?
The Logistics Transportation Domain
Initial state: at-object(parcel, postoffice), at-truck(truck1, postoffice), at-plane(plane1, airport)
Goals: at-object(parcel, airport)
STRIPS encoding of the Logistics Transportation Domain

LOAD-TRUCK(Object, Truck, Location)
  Preconditions: {at-object(Object, Location), at-truck(Truck, Location)}
  Effects: {in(Object, Truck), not(at-object(Object, Location))}

UNLOAD-TRUCK(Object, Truck, Location)
  Preconditions: {in(Object, Truck), at-truck(Truck, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Truck))}

DRIVE-TRUCK(Truck, From, To)
  Preconditions: {at-truck(Truck, From), samecity(From, To)}
  Effects: {at-truck(Truck, To), not(at-truck(Truck, From))}
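As a quick illustration of how such operators are used (a sketch under my own encoding, not code from the talk): a ground STRIPS action applies to a state of ground literals when its preconditions are a subset of the state, and applying it removes the negated effects and adds the positive ones.

# Sketch: ground STRIPS semantics with states as sets of ground literals.

def applicable(state, preconditions):
    return preconditions <= state

def apply_action(state, add_effects, delete_effects):
    return (state - delete_effects) | add_effects

state = {'at-object(parcel, postoffice)',
         'at-truck(truck1, postoffice)',
         'at-plane(plane1, airport)'}

# Ground instance LOAD-TRUCK(parcel, truck1, postoffice) of the operator above.
pre = {'at-object(parcel, postoffice)', 'at-truck(truck1, postoffice)'}
add = {'in(parcel, truck1)'}
dele = {'at-object(parcel, postoffice)'}

if applicable(state, pre):
    state = apply_action(state, add, dele)
print(sorted(state))
# ['at-plane(plane1, airport)', 'at-truck(truck1, postoffice)', 'in(parcel, truck1)']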
PR-STRIPS (similar to PDDL 2.1 level 2)
- A state is described using propositional as well as metric attributes (which specify the levels of the resources in that state).
- An action can have propositional as well as metric effects (functions that specify the amount of each resource the action consumes).
- A value function specifies the relative importance of the amount of each resource consumed and defines plan quality as a function of the amount of resources consumed by all actions in the plan.
PR-STRIPS encoding of the Logistics Transportation Domain

LOAD-TRUCK(Object, Truck, Location)
  Preconditions: {at-object(Object, Location), at-truck(Truck, Location)}
  Effects: {in(Object, Truck), not(at-object(Object, Location)), time(-0.5), money(-5)}

UNLOAD-TRUCK(Object, Truck, Location)
  Preconditions: {in(Object, Truck), at-truck(Truck, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Truck)), time(-0.5), money(-5)}

DRIVE-TRUCK(Truck, From, To)
  Preconditions: {at-truck(Truck, From)}
  Effects: {at-truck(Truck, To), not(at-truck(Truck, From)), time(-0.02*distance(From, To)), money(-distance(From, To))}
PR-STRIPS encoding of the Logistics Transportation Domain (continued)

LOAD-PLANE(Object, Plane, Location)
  Preconditions: {at-object(Object, Location), at-plane(Plane, Location)}
  Effects: {in(Object, Plane), not(at-object(Object, Location)), time(-0.5), money(-5)}

UNLOAD-PLANE(Object, Plane, Location)
  Preconditions: {in(Object, Plane), at-plane(Plane, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Plane)), time(-0.5), money(-5)}

FLY-PLANE(Plane, From, To)
  Preconditions: {at-plane(Plane, From), airport(To)}
  Effects: {at-plane(Plane, To), not(at-plane(Plane, From)), time(-0.02*distance(From, To)), money(-distance(From, To))}
PR-STRIPS encoding of the Logistics Transportation Domain (continued)
Quality(Plan) = 1 / (2*time-used(Plan) + 5*money-used(Plan))
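A sketch of how this value function scores a plan (my own code; the per-action resource figures are read off the PR-STRIPS operators above, and distance(lax, sjc) = 250 is the value used in the example problem later in the talk).

# Sketch: PR-STRIPS plan quality under Quality(Plan) = 1/(2*time + 5*money).

def plan_quality(actions):
    """actions: list of (time_consumed, money_consumed) pairs, one per action."""
    time_used = sum(t for t, _ in actions)
    money_used = sum(m for _, m in actions)
    return 1.0 / (2 * time_used + 5 * money_used)

d = 250                               # distance(lax, sjc) in the example problem
model_plan = [(0.5, 5),               # load-plane(o1, p1, lax)
              (0.5, 5),               # load-plane(o2, p1, lax)
              (0.02 * d, d),          # fly-plane(p1, lax, sjc)
              (0.5, 5),               # unload-plane(o1, p1, sjc)
              (0.5, 5)]               # unload-plane(o2, p1, sjc)
print(plan_quality(model_plan))       # 1 / (2*7 + 5*270) ~= 0.00073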
The Learning Problem
Given:
- A planning problem (goals, initial state, and initial resource levels)
- Domain knowledge (actions, plan quality knowledge)
- A partial-order planner
- A model plan for the given problem
Find:
- Domain-specific rules that can be used by the given planner to produce better quality plans (than the plans it would have produced had it not learned those rules).
Solution: The Intra-solution Learning Algorithm
1. Find a learning opportunity
2. Choose the relevant information and ignore the rest
3. Generalize the relevant information using a generalization theory
Phase 1: Find a Learning Opportunity
1. Generate the system's default plan and a default planning trace using the given partial-order planner for the given problem
2. Compare the default plan with the model plan. If the model plan is not of higher quality, then go to Step 1
3. Infer the planning decisions that produced the model plan
4. Compare the inferred model planning trace with the default planning trace to identify the decision points where the two traces differ. These are the conflicting choice points
[Figure: the model trace and the system's default planning trace share a set of common nodes before diverging.]
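A rough sketch of step 4 (my reconstruction, not PIP's implementation): walk the two ordered constraint sets in parallel and report the first decision point at which they disagree; that is the conflicting choice point. The example decisions are taken from the logistics problem introduced on the following slides.

# Sketch: locate the conflicting choice point between two planning traces.

def first_conflict(default_trace, model_trace):
    for i, (d, m) in enumerate(zip(default_trace, model_trace)):
        if d != m:
            return i, d, m        # position plus the two conflicting decisions
    return None                   # no conflict on the common prefix

default_trace = ['START < END',
                 'add-step unload-truck(o1,tr1,sjc) for at-object(o1,sjc) at END']
model_trace = ['START < END',
               'add-step unload-plane(o1,p1,sjc) for at-object(o1,sjc) at END']
print(first_conflict(default_trace, model_trace)[0])   # 1: the traces diverge at the second decision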
Phase 2: Choose the Relevant Information
Examine the downstream planning traces, identifying relevant planning decisions using the following heuristics:
1. A planning decision to add an action Q is relevant if Q supplies a relevant condition to a relevant action
2. A planning decision to establish an open condition is relevant if it binds an uninstantiated variable of a relevant open condition
3. A planning decision to resolve a threat is relevant if all three actions involved are relevant
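A sketch of the first heuristic (my own reconstruction of the idea, not the ISL code): relevance propagates backwards over causal links from the decisions at the conflicting choice point. Using causal links from the logistics example that follows, the decisions supporting o1 become relevant while those supporting only o2 do not.

# Sketch: mark as relevant every action that supplies a condition to an
# already-relevant action, starting from the conflicting decision.

def relevant_actions(causal_links, seeds):
    relevant = set(seeds)
    changed = True
    while changed:
        changed = False
        for producer, _condition, consumer in causal_links:
            if consumer in relevant and producer not in relevant:
                relevant.add(producer)
                changed = True
    return relevant

links = [('unload-plane(o1)', 'at-object(o1,sjc)', 'END'),
         ('fly-plane(p1)', 'at-plane(p1,sjc)', 'unload-plane(o1)'),
         ('load-plane(o1)', 'in(o1,p1)', 'unload-plane(o1)'),
         ('load-plane(o2)', 'in(o2,p1)', 'unload-plane(o2)')]
print(sorted(relevant_actions(links, {'unload-plane(o1)'})))
# ['fly-plane(p1)', 'load-plane(o1)', 'unload-plane(o1)'] -- the o2 decisions stay irrelevant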
Phase 3: Generalize the Relevant Information
Generalize the relevant information using a generalization theory:
1. Replace all constants with variables
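A minimal sketch of this variablization step (my own code; the variable-naming scheme is an arbitrary choice): every distinct constant in the relevant decisions is replaced by the same fresh variable wherever it occurs. The literals come from the model chain of the logistics example below.

import re

# Sketch: replace every constant with a variable, one variable per constant.

def variablize(literals):
    table = {}
    def var_for(const):
        if const not in table:
            table[const] = '?' + const.rstrip('0123456789').upper() + str(len(table))
        return table[const]
    generalized = []
    for lit in literals:
        name, args = re.match(r'(\S+)\((.*)\)', lit).groups()
        new_args = [var_for(a.strip()) for a in args.split(',')]
        generalized.append(f"{name}({', '.join(new_args)})")
    return generalized

print(variablize(['load-plane(o1, p1, lax)',
                  'fly-plane(p1, lax, sjc)',
                  'unload-plane(o1, p1, sjc)']))
# ['load-plane(?O0, ?P1, ?LAX2)', 'fly-plane(?P1, ?LAX2, ?SJC3)',
#  'unload-plane(?O0, ?P1, ?SJC3)']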
An Example Logistics Problem
Initial state:
{at-object(o1, lax), at-object(o2, lax), at-truck(tr1, lax), at-plane(p1, lax), airport(sjc), distance(lax, sjc)=250, time=0, money=500}
Goals:
{at-object(o1, sjc), at-object(o2, sjc)}
Generate the System's Default Plan and Default Planning Trace
- Use the given planner to generate the system's default planning trace (an ordered constraint set)
  - Each add-step/establishment decision adds a causal link and an ordering constraint
  - Each threat-resolution decision adds an ordering constraint

1. START ‹ END
2. unload-truck(o1, Tr, sjc) ‹ END, with the causal link unload-truck(o1, Tr, sjc) --at-object(o1, sjc)--> END
3. load-truck(o1, Tr, sjc) ‹ unload-truck(), with the causal link load-truck(o1, Tr, sjc) --in-truck(o1, Tr)--> unload-truck()
4. drive-truck(Tr, X, sjc) ‹ unload-truck(), with the causal link drive-truck(Tr, X, sjc) --at-truck(Tr, sjc)--> unload-truck()
5. …
Compare the System's Default Plan with the Model Plan

System's Default Plan:
  load-truck(o1, tr1, lax)
  load-truck(o2, tr1, lax)
  drive-truck(tr1, lax, sjc)
  unload-truck(o1, tr1, sjc)
  unload-truck(o2, tr1, sjc)

Model Plan:
  load-plane(o1, p1, lax)
  load-plane(o2, p1, lax)
  fly-plane(p1, lax, sjc)
  unload-plane(o1, p1, sjc)
  unload-plane(o2, p1, sjc)
Infer the Unordered Model Constraint Set
[Figure: the model plan's causal-link structure. START supplies at-object(o1, lax), at-object(o2, lax), and at-plane(p1, lax) to load-plane(o1, p1, lax), load-plane(o2, p1, lax), and fly-plane(p1, lax, sjc); these steps in turn support unload-plane(o1, p1, sjc) and unload-plane(o2, p1, sjc), which supply at-object(o1, sjc) and at-object(o2, sjc) to END.]
Compare the Two Planning Traces to Identify Learning Opportunities
Both traces begin with START ‹ END and the open condition at-object(o1, sjc) at END. A learning opportunity arises where they diverge:
- Default trace: unload-truck(o1, tr1, sjc) ‹ END, with unload-truck(o1, tr1, sjc) supplying at-object(o1, sjc) to END
- Model trace: unload-plane(o1, p1, sjc) ‹ END, with unload-plane(o1, p1, sjc) supplying at-object(o1, sjc) to END
Choose the Relevant Planning Decisions
Learning opportunity: add-actions: START-END
Relevant decisions: add-action: unload-plane(o1), add-action: fly-plane(), add-action: load-plane(o1) in the model trace, and add-action: unload-truck(o1), add-action: drive-truck(), add-action: load-truck(o1) in the default trace
Irrelevant decisions: add-action: unload-plane(o2), add-action: load-plane(o2), add-action: load-truck(o2)
Generalize the Relevant Planning Decision Chains
Model chain: add-actions: START-END, add-action: unload-plane(O, P), add-action: fly-plane(P, X, Y), add-action: load-plane(O, P)
Default chain: add-actions: START-END, add-action: unload-truck(O, T), add-action: drive-truck(T, X, Y), add-action: load-truck(O, T)
In What Form Should the Learned Knowledge be Stored?

Search-Control Rule:
  Given the goals {at-object(O, Y)} to resolve, the effects {at-truck(T, X), at-plane(P, X), airport(Y)}, and distance(X, Y) > 100,
  prefer the planning decisions {add-step(unload-plane(O, P, Y)), add-step(load-plane(O, P, X)), add-step(fly-plane(P, X, Y))}
  over the planning decisions {add-step(unload-truck(O, T, Y)), add-step(load-truck(O, T, X)), add-step(drive-truck(T, X, Y))}

Rewrite Rule:
  To-be-replaced actions: {load-truck(O, T, X), drive-truck(T, X, Y), unload-truck(O, T, Y)}
  Replacing actions: {load-plane(O, P, X), fly-plane(P, X, Y), unload-plane(O, P, Y)}
Search Control Knowledge
- A heuristic function that provides an estimate of the quality of the plan a node is expected to lead to
[Figure: a search tree rooted at root; the children of node n carry quality estimates 4, 8, and 2.]
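As an illustration (a sketch with placeholder functions, not PIP's API), search-control knowledge of this kind can drive a best-first loop that always refines the partial plan with the highest estimated quality.

import heapq

# Sketch: best-first refinement guided by a plan-quality estimate.
# estimate(), refinements(), and is_complete() are placeholders.

def best_first(root, estimate, refinements, is_complete):
    frontier = [(-estimate(root), 0, root)]   # negate: heapq is a min-heap
    counter = 1                               # tie-breaker for equal estimates
    while frontier:
        _, _, node = heapq.heappop(frontier)
        if is_complete(node):
            return node
        for child in refinements(node):
            heapq.heappush(frontier, (-estimate(child), counter, child))
            counter += 1
    return None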
Rewrite Rules
- A rewrite rule is a 2-tuple: (to-be-replaced subplan, replacing subplan)
- Used after search has produced a complete plan, to rewrite it into a higher quality plan
- Only useful in those domains where it is possible to efficiently produce a low quality plan but hard to produce a higher quality plan
- E.g., to-be-replaced subplan: A4, A5; replacing subplan: B1
Planning by Rewriting
[Figure: a complete plan over actions A1-A6 in which the subplan A4, A5 is rewritten into the single action B1.]
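Here is a rough sketch of how the rewrite rule shown earlier could be applied to a complete plan (my own representation and matcher, not PIP's; in particular, the binding for the new plane variable P is supplied by hand, where a real rewriter would pick an available plane).

# Sketch: apply a rewrite rule <to-be-replaced actions, replacing actions> to a
# complete plan.  Actions are tuples; arguments starting with an upper-case
# letter are variables.

def match(pattern, action, binding):
    if pattern[0] != action[0] or len(pattern) != len(action):
        return None
    binding = dict(binding)
    for p, a in zip(pattern[1:], action[1:]):
        if p[0].isupper():                      # variable: bind or check consistency
            if binding.setdefault(p, a) != a:
                return None
        elif p != a:                            # constant: must match exactly
            return None
    return binding

def rewrite(plan, to_replace, replacing, extra_binding):
    binding, matched = {}, []
    for pattern in to_replace:
        for action in plan:
            if action not in matched:
                new_binding = match(pattern, action, binding)
                if new_binding is not None:
                    binding = new_binding
                    matched.append(action)
                    break
        else:
            return plan                         # rule does not apply to this plan
    binding.update(extra_binding)               # e.g. the plane chosen for P
    new_steps = [tuple(binding.get(arg, arg) for arg in pattern)
                 for pattern in replacing]
    return [a for a in plan if a not in matched] + new_steps

plan = [('load-truck', 'o1', 'tr1', 'lax'),
        ('drive-truck', 'tr1', 'lax', 'sjc'),
        ('unload-truck', 'o1', 'tr1', 'sjc')]
to_replace = [('load-truck', 'O', 'T', 'X'),
              ('drive-truck', 'T', 'X', 'Y'),
              ('unload-truck', 'O', 'T', 'Y')]
replacing = [('load-plane', 'O', 'P', 'X'),
             ('fly-plane', 'P', 'X', 'Y'),
             ('unload-plane', 'O', 'P', 'Y')]
print(rewrite(plan, to_replace, replacing, {'P': 'p1'}))
# [('load-plane', 'o1', 'p1', 'lax'), ('fly-plane', 'p1', 'lax', 'sjc'),
#  ('unload-plane', 'o1', 'p1', 'sjc')]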
Empirical Evaluation I: In What Form Should the Learned Knowledge be Stored?
- Perform empirical experiments to compare the performance of a version of PIP that learns search-control rules (Sys-search-control) with a version that learns rewrite rules (Sys-rewrite).
- Both Sys-rewrite-first and Sys-rewrite-best perform up to two rewritings.
- At each rewriting:
  - Sys-rewrite-first randomly chooses one of the applicable rewrite rules
  - Sys-rewrite-best applies all applicable rewrite rules to try all ways of rewriting a plan.
Experimental Set-up
- Three benchmark planning domains: logistics, softbot, and process planning
- Randomly generate 120 unique problem instances
- Train Sys-search-control and Sys-rewrite on optimal-quality solutions for 20, 30, 40, and 60 examples and test them on the remaining examples (cross-validation)
- Plan quality is one minus the average distance of the plans generated by a system from the optimal-quality plans
- Planning efficiency is measured by counting the average number of new nodes generated by each system
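For concreteness, here is a small sketch of the two evaluation measures under one reading of the slide; the assumption that "distance from the optimal-quality plans" means the quality gap normalized by the optimal quality is mine.

# Sketch of the evaluation metrics, under an assumed definition of "distance".

def quality_score(system_qualities, optimal_qualities):
    """1 minus the average normalized quality gap; 1.0 means always optimal."""
    gaps = [(opt - got) / opt
            for got, opt in zip(system_qualities, optimal_qualities)]
    return 1.0 - sum(gaps) / len(gaps)

def efficiency(new_node_counts):
    """Average number of new nodes generated per test problem."""
    return sum(new_node_counts) / len(new_node_counts)

print(quality_score([0.8, 1.0, 0.5], [1.0, 1.0, 1.0]))   # ~0.77
print(efficiency([12, 30, 18]))                          # 20.0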
Results
[Charts: plan quality (0 to 1) and average number of new nodes generated, plotted against the number of training examples (20, 30, 40, 60) for Sys-Rewrite-first, Sys-Rewrite-best, and Sys-Search-control in the Logistics, Softbot, and Process Planning domains.]
Conclusion I
- Both search-control and rewrite rules lead to improvements in plan quality.
- Rewrite rules have a larger cost in terms of the loss of planning efficiency than search-control rules.
- Need a mechanism to distinguish good rules from bad rules and to forget the bad rules.
- Comparing planning traces seems to be a better technique for learning search-control rules than rewrite rules.
- Need to explore alternate strategies for learning rewrite rules:
  - by comparing two completed plans of different quality
  - through static domain analysis
Empirical Evaluation II: A Study of the Factors Affecting PIP's Learning Performance
- Generated 25 abstract domains varying along a number of seemingly relevant dimensions:
  - Instance similarity
  - Quality branching factor (the average number of alternative solutions of different quality per problem)
  - Association between the default planning bias and the quality bias
- Are there any statistically significant differences in PIP's performance as each factor is varied (Student's t-test)?
Results
- PIP's learning leads to greater improvements in domains where:
  - the quality branching factor is large
  - the planner's default biases are negatively correlated with the quality-improving heuristic function
- There is no simple relationship between instance similarity and PIP's learning performance
Conclusion II
- Need to address scale-up issues
- Need to keep up with advances in AI planning technologies
  - "It is arguably more difficult to accelerate a new generation planner by outfitting it with learning as the overhead cost incurred by the learning system can overwhelm the gains in search efficiency" (Kambhampati 2001)
- The problem is not the lack of a well-defined task!
- Organize a symposium/special issue on issues of how to efficiently organize, retrieve, and forget learned knowledge
- An open source planning and learning software?