The International Planning Competition Series and
Empirical Evaluation of AI Planning Systems
Derek Long and Maria Fox
University of Strathclyde, Glasgow, UK
AI Planning

Given:
– a collection of actions
– an initial state and
– a goal condition
find a schedule of actions that transforms the initial state into one satisfying the goal
Computationally hard problem (PSPACE-complete)
Of longstanding interest in AI
Applications in many areas
– Eg space missions (Deep Space 1, MER, EO-1)
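
As a concrete illustration, the sketch below implements the definition above in the most naive way: breadth-first search forward from the initial state, with each action encoded as precondition/add/delete sets. The encoding and the toy logistics problem are invented for this example; real competition planners read PDDL and use far more sophisticated search.

```python
# Minimal breadth-first planner: states are sets of facts, actions are
# (name, preconditions, add effects, delete effects) tuples.
from collections import deque

def plan(initial, goal, actions):
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, schedule = frontier.popleft()
        if goal <= state:                      # goal condition satisfied
            return schedule
        for name, pre, add, delete in actions:
            if pre <= state:                   # action applicable in this state
                succ = frozenset((state - delete) | add)
                if succ not in visited:
                    visited.add(succ)
                    frontier.append((succ, schedule + [name]))
    return None                                # goal unreachable

# Toy problem: drive a truck from A to B to deliver a package.
acts = [
    ("drive-A-B", frozenset({"truck-at-A"}), frozenset({"truck-at-B"}), frozenset({"truck-at-A"})),
    ("load",   frozenset({"truck-at-A", "pkg-at-A"}), frozenset({"pkg-in-truck"}), frozenset({"pkg-at-A"})),
    ("unload", frozenset({"truck-at-B", "pkg-in-truck"}), frozenset({"pkg-at-B"}), frozenset({"pkg-in-truck"})),
]
print(plan(frozenset({"truck-at-A", "pkg-at-A"}), {"pkg-at-B"}, acts))
# -> ['load', 'drive-A-B', 'unload']
```

Blind search of this kind makes the PSPACE-hardness tangible: the state space is exponential in the number of facts, which is why competition planners rely on heuristics and structural analysis instead.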
International Planning Competitions

Series started by Drew McDermott in 1998
Biennial event, with 5th IPC just completed
Steady increase in sophistication of planning problems:
– Simple logical sequencing problems
– Temporal problems including numbers
– Plan quality metrics
– Realistic application domains
– Trajectory constraints and soft goals

Series has inspired dramatic advances in the field
IPC Organisation

A broad objective for the competition is selected (eg IPC5: explore planning under trajectory constraints)
Domains are constructed, each with a family of problems
Competitors run their planners on one machine, on all problems, over a period of a few weeks
Results are analysed by organisers and winners announced at the International Conference on Automated Planning and Scheduling (ICAPS)
IPC5 Domains

TPP: traveling and buying goods at selected markets, minimizing costs
Openstacks: combinatorial optimization problem in production scheduling
Storage: moving and storing crates of goods by hoists from containers to depots with spatial maps
Pathways: finding a sequence of biochemical reactions (pathways) in an organism producing certain substances
Trucks: moving packages between locations by trucks under certain spatial constraints and delivery deadlines
Rovers: controlling multiple rovers to achieve science missions
PipesWorld: oil movement and storage

In IPC5 over 900 problems were posed, for about 15 planners
Opportunities and Challenges

IPC generates a huge amount of data:
– time to generate solutions
– quality of solutions
– problems solved
Data is controlled for machine and for representation
Results are used to judge the relative success of the various competing planners
How to make judgements?
– How to evaluate the relative performances of planners?
– General performance inferred from competition data samples
IPC3, 2001

Co-organised with Maria Fox
Objective: to introduce temporal planning, metric constraints and plan quality metrics
8 domains, each with 4 or 5 variants; about 900 problems
13 planners generated ~4000 plans
Two types of planners:
– Fully automated (use standard domain encoding)
– Hand-coded (use special encoding that includes control information constructed by hand)
Analysis of IPC3 Data

Questions:
– Which planner performed best?
  – Fastest to produce plans?
  – Best quality plans?
– Scaling performance?
– Coverage (which parts of the domain language can a planner handle?)
Methodological question:
– How can results be evaluated in a way that establishes a useful strategy for empirical comparisons of performance?

Journal of AI Research, volume 20: An overview and analysis of the results of the 3rd IPC
Problems

Hard to control for domain types – performance often depends heavily on the domain structure
Competition is not a carefully constructed scientific experiment: no prior formulation of hypotheses
Limited resources (not a significant problem in practice)
No easy way to construct families of problems that have specific qualities (eg scaling pattern of difficulty)
No control for coding quality or programming language
Results: a sample
Statistical Analyses I

We compared planners in terms of time taken to solve problems and (separately) quality of solutions
Pairwise comparisons
Means across problems make no sense (different sizes of problems) and we have no knowledge of the problem distribution
Therefore used a non-parametric test: the Wilcoxon signed-rank test for matched pairs
Using appropriately corrected p-values we combined results to arrive at a partial order on planner performances
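
A hedged sketch of this pairwise comparison using SciPy: the Wilcoxon signed-rank test on paired solve times, with a Bonferroni-style correction across all planner pairs (the specific correction used in IPC3 is not detailed here). The planner timings below are invented for illustration; in practice each pair of planners is compared on the problems that both of them solved.

```python
# Paired, non-parametric comparison of two planners' solve times.
from scipy.stats import wilcoxon

# Hypothetical solve times (seconds) on the same 10 problems.
planner_a = [0.4, 1.2, 3.1, 0.9, 7.8, 2.2, 15.0, 4.4, 0.7, 9.3]
planner_b = [0.6, 1.9, 4.0, 1.1, 9.5, 2.0, 21.3, 6.2, 1.2, 12.8]

stat, p = wilcoxon(planner_a, planner_b)   # signed-rank test on the pairs
n_pairs = 13 * 12 // 2                     # all pairwise comparisons of 13 planners
alpha = 0.05 / n_pairs                     # Bonferroni-style corrected threshold
if p < alpha:
    print(f"significant difference (p = {p:.4g} < corrected alpha {alpha:.4g})")
else:
    print(f"no significant difference at corrected alpha (p = {p:.4g})")
```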

Statistical Analyses II

We did not know whether our efforts at constructing problems of increasing difficulty were successful
Used a test for rank correlations in multiple judgements, with planners as judges of relative hardness (ie the ranking of the problems)
We found that there was broad agreement on the ranking, although some problem families showed statistically significant disagreement
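
A standard test for agreement among multiple rankings is Kendall's coefficient of concordance (W); the sketch below shows that computation with planners as the judges of problem hardness. The rankings are fabricated, and W is offered as a plausible instance of "a test for rank correlations in multiple judgements", not necessarily the exact test used in IPC3.

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's W for an m x n array: m judges each ranking n items 1..n.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    deviations of the per-item rank sums from their mean (no tie correction).
    """
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three hypothetical planners ranking five problems by solve time.
ranks = np.array([[1, 2, 3, 4, 5],
                  [1, 3, 2, 4, 5],
                  [2, 1, 3, 5, 4]])
print(kendalls_w(ranks))   # ~0.84: W near 1 indicates broad agreement
```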
Statistical Analyses III

Where planners agreed about the ranking of difficulty of problems it was possible to consider their relative scaling behaviour
The Spearman rank correlation test was used to compare pairs of planners, looking for a correlation between problem difficulty and the difference in planner performance
That is: we wanted to see whether there was a trend towards a growing gap in performance with growing problem difficulty
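
SciPy provides this test directly; the sketch below correlates an agreed difficulty ranking with the performance gap between two planners. The numbers are invented for illustration.

```python
from scipy.stats import spearmanr

difficulty = [1, 2, 3, 4, 5, 6, 7, 8]            # agreed hardness ranking
gap = [0.1, 0.3, 0.2, 0.9, 1.5, 2.8, 6.0, 11.4]  # hypothetical time(A) - time(B), seconds

rho, p = spearmanr(difficulty, gap)
if p < 0.05 and rho > 0:
    print(f"gap grows with difficulty (rho = {rho:.2f}, p = {p:.4g})")
else:
    print(f"no significant trend (rho = {rho:.2f}, p = {p:.4g})")
```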
Conclusions

Our analyses illustrated the potential role of statistical techniques in interpreting large bodies of empirical data gathered in this way
Competitions play many roles, and the analysis of data they generate is somewhat compromised by the need to meet other demands
AI Planning has benefited enormously from the competition series and its impact on the pursuit of research across the field as a whole
Empirical analysis is typically much stronger in this area than it was 10 years ago
– There is a trend towards improving scientific methods and more careful analysis of data