
Report from the 1st UAI 2006 Evaluation of Probabilistic Inference

July 14th, 2006

What is this presentation about?

• Goal: The purpose of this evaluation is to compare the performance of a variety of different software systems on a single set of Bayesian network (BN) problems.

• By creating a friendly evaluation (as is often done in other communities such as SAT, and also in speech recognition and machine translation with their DARPA evaluations), we hope to foster new research in fast inference methods for performing a variety of queries in graphical models.

• Over the past few months, the 1st such evaluation took place at UAI.

• This presentation summarizes the outcome of this evaluation.


Who we are

• Evaluators
  – Jeff Bilmes – University of Washington, Seattle
  – Rina Dechter – University of California, Irvine
• Graduate Student Assistance
  – Chris Bartels – University of Washington, Seattle
  – Radu Marinescu – University of California, Irvine
  – Karim Filali – University of Washington, Seattle
• Advisory Council
  – Dan Geiger – Technion, Israel Institute of Technology
  – Fahiem Bacchus – University of Toronto
  – Kevin Murphy – University of British Columbia

Outline

• Background, goals
• Scope (rationale)
• Final chosen queries
• The UAI 2006 BN benchmark evaluation corpus
• Scoring strategies
• Participants and team members
• Results for PE and MPE
• Team presentations – team 1 (UCLA), team 2 (IET), team 3 (UBC), team 4 (U. Pitt/DSL), team 5 (UCI)
• Conclusion/Open discussion

Acknowledgements: Graduate Student Help

Chris Bartels, University of Washington
Radu Marinescu, University of California, Irvine
Karim Filali, University of Washington

Also, thanks to another U. Washington student, Mukund Narasimhan (now at MSR).

Background

• Early 2005: Rina Dechter & Dan Geiger decide there should be some form of UAI inference evaluation (as in the SAT community) and discuss the idea (by email) with Adnan Darwiche, Fahiem Bacchus, Hector Geffner, Nir Friedman, and Thomas Richardson.
• I (Jeff Bilmes) take on the task of running it this first time.

– Speech recognition and the DARPA evaluations: evaluation of ASR systems using error rate as a metric.


Scope

• Many “queries” could be evaluated, including:
  – MAP – maximum a posteriori hypothesis
  – MPE – most probable explanation (also called the Viterbi assignment)
  – PE – probability of evidence
  – N-best – compute the N best of the above
• Many algorithmic variants:
  – Exact inference
  – Enforced limited time bounds and/or space bounds
  – Approximate inference, and tradeoffs between time/space/accuracy
• Classes of models:
  – Static BNs with a generic description (list of CPTs)
  – More complex description languages (e.g., context-specific independence)
  – Static models vs. dynamic models (e.g., dynamic Bayesian networks, and DGMs) vs. relational models

Decisions for this first evaluation.

• Emphasis: Keep things simple.

• Focus on exact inference – exact inference can still be useful.
  – “Exact inference is NP-complete, so we perform approximate inference” is often seen in the literature.
  – With smart algorithms, and for fixed (but real-world) problem sizes, exact inference is quite doable and can be better for applications.

• Focus on a small number of queries:
  – Original plan: PE, MPE, and MAP for both static and dynamic models.
  – From the final participants list, narrowed this down to: PE and MPE on static Bayesian networks.

Query: Probability of Evidence (PE)
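For reference, the standard formulation of the PE query: given evidence $\mathbf{e}$ on a subset of the variables of a Bayesian network over $X_1, \ldots, X_n$,

$$P(\mathbf{e}) \;=\; \sum_{\mathbf{x} \sim \mathbf{e}} \; \prod_{i=1}^{n} P\big(x_i \mid \mathrm{pa}(X_i)\big),$$

where the sum ranges over all full assignments $\mathbf{x}$ consistent with the evidence $\mathbf{e}$.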


Query: Most Probable Explanation (MPE)
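For reference, the standard formulation of the MPE query: find the most likely full assignment consistent with the evidence,

$$\mathbf{x}^{\mathrm{MPE}} \;=\; \operatorname*{arg\,max}_{\mathbf{x} \sim \mathbf{e}} \; \prod_{i=1}^{n} P\big(x_i \mid \mathrm{pa}(X_i)\big).$$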


The UAI06 BN Evaluation Corpus

• J=78 BNs were used for PE queries and J=57 BNs for MPE queries. The two sets were not exactly the same.

• The BNs were the following (more details will appear on the web page):
  – random mutations of the burglar alarm graph
  – diagnosis networks (Druzdzel)
  – DBNs from speech recognition that were unrolled a fixed amount
  – variations on the Water DBN
  – various forms of grids
  – variations on the ISCAS 85 electrical circuits
  – variations on the ISCAS 89 electrical circuits
  – various genetic linkage graphs (Geiger)
  – BNs from a computer-based patient care system (CPCS)
  – various randomly generated graphs (F. Cozman’s algorithm)
  – various known-treewidth random k-trees, with determinism (k=24)
  – various known-treewidth random positive k-trees (k=24)
  – various linear block coding graphs
• While some of these have been seen before, BNs were “anonymized” before being distributed.
• BNs were distributed in xbif format (basically XML).

Timing Platform and Limits

• Timing machines: dual-CPU 3.8 GHz Pentium Xeons with 8 GB of RAM each, with hyper-threading turned on.

• Single-threaded performance only in this evaluation.

• Each team had 4 days of dedicated machine usage to complete their timings (other than this, there was no upper time bound).
• No one asked for more time than these 4 days -- after timing the BNs, teams could use the rest of the 4 days as they wished for further tuning. After final numbers were sent to me, no further adjustment of timing numbers took place (say, based on seeing others’ results).

• Each timing number was the result of running a query 10 times, and then reporting the fastest (lowest) time.
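To make the protocol concrete, here is a minimal sketch of the best-of-10 timing loop (the command and harness names are hypothetical; the evaluation's actual scripts are not shown in these slides):

```python
import subprocess
import time

def time_query(cmd, repetitions=10):
    """Run an inference query `repetitions` times; keep the fastest wall time."""
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # e.g. cmd = ["./solver", "bn_0.xbif"]
        times.append(time.perf_counter() - start)
    return min(times)  # the fastest (lowest) of the runs is reported
```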


The Teams

• Thanks to every member of every team: each was crucial to making this a successful event!


Team 1: UCLA
• David Allen (now at HRL Labs, CA)
• Mark Chavira (graduate student)
• Adnan Darwiche
• Keith Cascio
• Arthur Choi (graduate student)
• Jinbo Huang (now at NICTA, Australia)

Team 2: IET
From right to left in photo:
• Masami Takikawa
• Hans Dettmar
• Francis Fung
• Rick Kissh
Other team members:
• Stephen Cannon
• Chad Bisk
• Brandon Goldfedder
Other key contributors:
• Bruce D'Ambrosio
• Kathy Laskey
• Ed Wright
• Suzanne Mahoney
• Charles Twardy
• Tod Levitt

Team 3: UBC

Jacek Kisynski, University of British Columbia
David Poole, University of British Columbia
Michael Chiang, University of British Columbia

Team 4: U. Pittsburgh, DSL

Tomasz Sowinski, University of Pittsburgh, DSL
Marek J. Druzdzel, University of Pittsburgh, DSL

Team 5: UCI

Robert Mateescu, University of California, Irvine
Radu Marinescu, University of California, Irvine
Rina Dechter, University of California, Irvine

The Results


Definition of a Correct Answer


Definition of “FAIL”

• Each team had 4 days to complete the evaluation.
• No time limit was placed on any particular BN.

• A “FAILED” score meant that either the system failed to complete the query, or that the system underflowed its own numeric precision.

– Some of the networks were designed not to fit within IEEE 64-bit double precision, so either scaling or log arithmetic needed to be used (which is a speed hit for PE); a generic log-space sketch follows this list.

• Teams had the option to submit multiple separate submissions; none did.

• Systems were allowed to “back off” from no-scaling to, say, a log-arithmetic mode (but that was included in the time charge).
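As a generic illustration of the log-arithmetic workaround (not any particular team's code), sums of probabilities can be computed stably in log space:

```python
import math

def log_sum_exp(log_probs):
    """Stable log(sum_i exp(log_probs[i])), for probabilities that would
    underflow IEEE 64-bit doubles if summed directly."""
    m = max(log_probs)
    if m == -math.inf:  # all probabilities are exactly zero
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in log_probs))
```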

Definition of Average Speedup
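A formula consistent with the next slide's description (“average speedup of the best performance over a particular team's performance”) and with the “lower is better” axes on the result charts is:

$$S_t(C) \;=\; \frac{1}{|C|} \sum_{j \in C} \frac{T_t(j)}{\min_u T_u(j)},$$

where $T_t(j)$ is team $t$'s time on BN $j$ and $C$ is a category of BNs. A team that is fastest on every BN in $C$ scores 1, and lower is better.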


Results – PE and MPE

78 PE BNs and 57 MPE BNs.
1. Failure rates – the number of networks for which each team failed to produce a score or a correct score (including underflow).
2. Speedup results on categorized BNs – a BN is categorized based on how many teams failed to produce a score (so 0-5 for PE, or 0-3 for MPE); we report the average speedup of the best performance over a particular team’s performance on each category.
3. Rank scores – the number of times a particular team was rank n, for various n.
4. Workload scores – in what workload region (if any) a particular team is optimal.
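A small sketch tying the first three scores together (the `times[team][bn]` data layout, with `None` marking a failure, is hypothetical; the workload scores in item 4 are a separate optimization covered on the later linear-programming slide):

```python
from collections import Counter

def score(times):
    """times[team][bn] = seconds, or None on failure (illustrative layout)."""
    teams = list(times)
    bns = list(next(iter(times.values())))
    # 1. Failure rates, as a percentage of all BNs.
    fail_rate = {t: 100.0 * sum(v is None for v in times[t].values()) / len(bns)
                 for t in teams}
    # 2. Categorize each BN by how many teams failed on it, then compute the
    #    average speedup of the best time over each team's time per category.
    by_fails = {}
    for bn in bns:
        by_fails.setdefault(sum(times[t][bn] is None for t in teams), []).append(bn)
    speedup = {}
    for nfail, cat in by_fails.items():
        for t in teams:
            ratios = [times[t][bn] / min(times[u][bn] for u in teams
                                         if times[u][bn] is not None)
                      for bn in cat if times[t][bn] is not None]
            if ratios:
                speedup[(t, nfail)] = sum(ratios) / len(ratios)
    # 3. Rank scores: how often each team finished at each rank
    #    (failed runs simply do not receive a rank here).
    ranks = {t: Counter() for t in teams}
    for bn in bns:
        order = sorted((t for t in teams if times[t][bn] is not None),
                       key=lambda t: times[t][bn])
        for r, t in enumerate(order, start=1):
            ranks[t][r] += 1
    return fail_rate, speedup, ranks
```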


PE Results


PE Failure Rate Results (low is best)

Reminder: 78 BNs total for PE

Failure rate (%):
Team 1: 0
Team 2: 10.26
Team 3: 19.23
Team 4: 14.1
Team 5: 19.23

Avg. Speedups: BNs that ran on all 5 systems

61 out of 78 BNs ran on all systems. Remember: lower is better!

Team 1: 85.59
Team 2: 52.44
Team 3: 77.33
Team 4: 1.02
Team 5: 17.15

Avg. Speedups: BNs that ran on all systems

[Chart: average, standard deviation, minimum, and maximum of the per-BN speedups for each of the five teams.]

Avg. Speedups: BNs that ran on (3-4)/5 systems

8 out of 78 BNs ran on only 3 or 4 systems.

Team 1: 4.93
Team 2: 12.95
Team 3: 1.85
Team 4: 1
Team 5: 18.42

Avg. Speedups: BNs that ran on (3-4)/5 systems

[Chart: average, standard deviation, minimum, and maximum of the per-BN speedups for each team in this category.]

Avg. Speedups: BNs that ran on (1-2)/5 systems

9 out of 78 BNs ran on only 1 or 2 systems. Only 2 teams (Teams 1 and 2) had systems that could run the BNs in this category; these include the genetic linkage BNs.

Team 1: 1
Team 2: 11.28

Rank Proportions (how often was each team a particular rank; rank 1 is best)

[Chart: stacked proportions of Rank 1 through Rank 5 and Fail for each of the five teams.]

Another look at PE: Wall Time and Inference Time

[Tables, slides 33-35: per-BN wall-clock time and inference time in seconds for Teams 1-5, with FAIL entries where a system could not complete a query. Slide 33 covers BN_0-BN_40 (alarm graph, diagnosis, speech recognition, Water DBN, grids); slide 34 covers BN_42-BN_77 (ISCAS 85, ISCAS 89, genetic linkage); slide 35 covers BN_78-BN_124 (CPCS, random graphs, k-trees with determinism (k=24), positive k-trees (k=24)).]

MPE Results

• Only three teams participated: Team 1, Team 2, and Team 5.
• 57 BNs, not the same ones as for PE, but some are variations of the same original BN.


MPE Failure Rate Results

Failure rate (%):
Team 1: 0
Team 2: 38.6
Team 5: 7.01

MPE Avg. Speedups: BNs that ran on all 3 systems

31 out of 57 BNs ran on all systems.

Team 1: 5.38
Team 2: 2.75
Team 5: 7.37

MPE Avg. Speedups: BNs that ran on all 3 systems

[Chart: average, standard deviation, minimum, and maximum of the per-BN speedups for each of the three teams.]

MPE Avg. Speedups: BNs that ran on 2/3 systems

26 out of 57 BNs ran on 2 systems.

Team 1: 6.6
Team 2: 1.2
Team 5: 47.46

MPE Avg. Speedups: BNs that ran on 2/3 systems

[Chart: average, standard deviation, minimum, and maximum of the per-BN speedups for each team in this category.]

Rank Proportions (how often was each team a particular rank; rank 1 is best)

[Chart: stacked proportions of Rank 1, Rank 2, Rank 3, and Fail for Teams 1, 2, and 5.]

[Table, slide 43: per-BN wall-clock and inference times in seconds for Teams 1, 2, and 5 on the MPE networks (BN_17-BN_134: diagnosis graph, speech recognition, Water DBN, grids, ISCAS 85, CPCS, random, k-trees with determinism, k-trees positive, coding), with FAIL and u/flow (underflow) entries where applicable.]

PE and MPE Results


Workload Scores: PE and MPE


Workload Scores and Linear Programming


Workload Scores: PE and MPE

• So each team is a winner; it depends on the workload.
• One could attempt to further rank teams based on the volume of the workload region where a team wins.
• Which measure, however, should we use on the simplex: uniform? Why not something else?
• “A Bayesian approach to performance ranking” – UAI does system performance measures …

Team technical descriptions

• 5 minutes for each team.

• Current plan: more details to ultimately appear on the inference evaluation web site (see main UAI page).


Team 1: UCLA Technical Description

• presented by Adnan Darwiche

Team UCLA

Performance summary:
• Solved all 78 P(e) networks in 319s: about 4s per instance
• Solved all 57 MPE networks in 466s: about 8s per instance

MPE approach:
• Prune network
• If network has treewidth 25 or less, run RC
• Else if network has enough local structure, run Ace
• Else run BnB/Ace

P(e) approach:
• Prune network
• If network has genetic net characteristics, run RC_Link
• Else if network has treewidth 25 or less, run RC
• Else run Ace

The approach is powerful enough to solve every network in every suite, yet it incurs a fixed overhead that disadvantages it on easy networks.
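A schematic of this portfolio-style dispatch (every callable argument is a placeholder for one of the components named above, not the team's actual API):

```python
def solve_pe(net, prune, looks_like_linkage, treewidth_estimate, rc_link, rc, ace):
    """P(e) dispatch sketch: prune, then pick RC_Link, RC, or Ace."""
    net = prune(net)
    if looks_like_linkage(net):
        return rc_link(net)
    if treewidth_estimate(net) <= 25:
        return rc(net)
    return ace(net)

def solve_mpe(net, prune, treewidth_estimate, has_local_structure, rc, ace, bnb_ace):
    """MPE dispatch sketch: prune, then pick RC, Ace, or BnB/Ace."""
    net = prune(net)
    if treewidth_estimate(net) <= 25:
        return rc(net)
    if has_local_structure(net):
        return ace(net)
    return bnb_ace(net)
```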

RC and RC_Link

Recursive Conditioning:
• Conditioning/search algorithm
• Based on decomposing the network
• Inference exponential in treewidth
• VE/Jointree could have been used for this!

RC_Link:
• RC with local structure exploitation
• Not necessarily exponential in treewidth
• Download: http://reasoning.cs.ucla.edu/rc_link


Ace
• Compiles a BN to an arithmetic circuit
• Reduces to logical inference
• Strength: local structure (determinism & CSI)
• Strength: online inference
• Inference not exponential in treewidth
• http://reasoning.cs.ucla.edu/ace


Branch & Bound
• Approximate the network by deleting edges, which provides an upper bound on MPE
• Compile the network using Ace and use it to drive the search
• Use belief propagation to seed:
  – a static variable order
  – for each variable, an ordering on values



Team 2: IET Technical Description

• presented by Masami Takikawa

Basic SPI Algorithm (Team 2)

Was able to solve 59 out of 78 challenges.

Pipeline (repeat until all variables are eliminated):
1. Collect factors (d-separation, barren-node removal & evidence propagation)
2. Order nodes (minimum weight heuristic)
3. Multiplication
4. Summation

Solved challenges by maximum factor weight:
MaxWeight        #BNs
<=100               4
<=1,000             2
<=10,000           20
<=100,000          16
<=1,000,000        13
<=2,000,000         4
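A sketch of the minimum-weight ordering heuristic used in step 2 (variable names and the graph representation are illustrative):

```python
def min_weight_order(domain, neighbors):
    """Greedy min-weight elimination order: repeatedly eliminate the variable
    whose clique (the variable plus its not-yet-eliminated neighbors) has the
    smallest product of domain sizes. neighbors: var -> set of adjacent vars."""
    nbrs = {v: set(ns) for v, ns in neighbors.items()}
    remaining = set(nbrs)
    order = []
    while remaining:
        def weight(v):
            w = domain[v]
            for u in nbrs[v] & remaining:
                w *= domain[u]
            return w
        v = min(remaining, key=weight)
        order.append(v)
        live = (nbrs[v] & remaining) - {v}
        for u in live:
            nbrs[u] |= live - {u}  # connect v's neighbors, as elimination does
        remaining.remove(v)
    return order
```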


Extensions (aka additional overhead)

Solved an additional 11 challenges.
• Intra-node factorization (INF): collect factors from a factored graph.
• Time-slice ordering of nodes if the BN is a DBN.
• Normalization: needed to avoid underflow for BN_20-26.

Effect of intra-node factorization (max factor weight, with width in parentheses):
BN      Without INF    With INF
BN_30   130M (27)      66K (16)
BN_32   17G (34)       260K (18)
BN_34   1.1G (30)      3.1M (21)
BN_36   4.3G (32)      130K (17)
BN_38   4.3G (32)      520K (19)
BN_40   1.1G (30)      130K (17)

Effect of time-slice ordering:
BN      Min-Weight     Time-slice
BN_70   8.3E19 (56)    9.0E7 (21)
BN_72   2.9E21 (46)    3.5E11 (38)
BN_73   6.4E22 (51)    4.3E9 (26)
BN_75   5.5E18 (43)    1.7E10 (34)
BN_76   1.9E20 (35)    1.7E11 (24)

Team 3: UBC Technical Description

• presented by David Poole

Variable Elimination Code
by David Poole and Jacek Kisyński

• This is an implementation of variable elimination in Java 1.5 (without threads).
• We wanted to test how well our base VE system that we were using compared with other systems.
• (…) orderings and 2GB of memory.

The most interesting part of the implementation is in the representation of factors:
• A factor is essentially a list of variables and a one-dimensional array of values.
• There is a total order of all variables and a total ordering of all values, which gives a canonical order of the values in a factor.
• We can multiply factors and sum out variables without doing random access to the values, instead using the canonical ordering to enumerate the values.


• Multiplication is done lazily.
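An illustrative Python rendering of this factor data structure (the real Java code walks the canonical order with strides; the per-entry dictionary lookups here are for clarity only):

```python
from itertools import product

class Factor:
    """A list of variables plus a flat row-major array of values, following a
    fixed global variable order (here: sorted())."""
    def __init__(self, variables, card, values):
        self.vars = sorted(variables)   # canonical (global) variable order
        self.card = card                # var -> domain size
        self.values = list(values)      # row-major over self.vars

    def index(self, assignment):
        """Row-major position of a full assignment covering self.vars."""
        i = 0
        for v in self.vars:
            i = i * self.card[v] + assignment[v]
        return i

def multiply(f, g, card):
    out = sorted(set(f.vars) | set(g.vars))
    vals = []
    for combo in product(*(range(card[v]) for v in out)):  # canonical order
        a = dict(zip(out, combo))
        vals.append(f.values[f.index(a)] * g.values[g.index(a)])
    return Factor(out, card, vals)

def sum_out(f, var, card):
    out = [v for v in f.vars if v != var]
    size = 1
    for v in out:
        size *= card[v]
    res = Factor(out, card, [0.0] * size)
    for combo in product(*(range(card[v]) for v in f.vars)):
        a = dict(zip(f.vars, combo))
        res.values[res.index(a)] += f.values[f.index(a)]
    return res
```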

• This code was written for David Poole and Nevin Lianwen Zhang, ``Exploiting contextual independence in probabilistic inference'', Journal of Artificial Intelligence Research,18, 263-313, 2003. http://www.jair.org/papers/paper1122.html

• This is also the code that is used in the CIspace belief and decision network applet. A new version of the applet will be released in July. See: http://www.cs.ubc.ca/labs/lci/CIspace/
• We plan to release the VE code as open source.


Team 4: U. Pitt/DSL Technical Description

• presented by Jeff Bilmes (Marek Druzdzel was unable to attend).


Decision Systems Laboratory
University of Pittsburgh
[email protected]
http://dsl.sis.pitt.edu/

UAI Competition: Sources of speedup

Good theory (in addition to good implementation):
1. Clustering algorithm at the foundation of the program [Lauritzen & Spiegelhalter] (Pr(E) as the normalizing factor).
2. Relevance reasoning, based on conditional independence [Dawid 1979; Geiger et al. 1990], structured in [Suermondt 1992] and [Druzdzel 1992], summarized in [Druzdzel & Suermondt 1994]. Relevance steps:
   – In P(E), focusing inference on the evidence nodes
   – Removal of barren nodes
   – Removal of nuisance nodes
   – Reuse of valid posteriors
3. For very large models: Relevance-based Decomposition [Lin & Druzdzel 1997] and Relevance-based Incremental Belief Updating [Lin & Druzdzel 1999].
Full references are included in the GeNIe on-line help, http://genie.sis.pitt.edu/.

Good engineering (Tomek Sowinski):
• Efficient and reliable implementation in C++ (SMILE)
• Tested by over eight years of both academic and industrial use


Where did our program spend the most time?

[Charts: per-network breakdown of time spent in relevance, triangulation, find hosts, init potentials, collect, distribute, and other, together with the speedup due to relevance: bn_22 (speech DBN, 1x), bn_82 (CPCS, 38x), bn_94 (random, 1,046x), bn_18 (diagnosis, ∞).]

Broader context: GeNIe and SMILE

A developer’s environment for graphical decision models (http://genie.sis.pitt.edu/).
• Model developer module: GeNIe
• Support for model building: ImaGeNIe
• Qualitative interface: QGeNIe
• Learning and discovery module: SMiner
• Diagnosis: Diagnosis
• GeNIeRate
• Reasoning engine: SMILE (Structural Modeling, Inference, and Learning Engine), a platform-independent library of C++ classes for graphical models.
• Wrappers (SMILE.NET, jSMILE, Pocket SMILE): allow SMILE to be accessed from applications other than a C++ compiler.
Implemented in Visual C++ in the Windows environment.


Team 5: UCI Technical Description

• presented by Rina Dechter

PE & MPE – AND/OR Search

[Figure: a Bayesian network over A, B, C, D, E; its pseudo tree with contexts A: [ ], B: [A], C: [AB], D: [BC], E: [AB]; the corresponding AND/OR search tree; and the context-minimal graph.]

Adaptive caching

• context(X) = [X_1 X_2 … X_{k-i} X_{k-i+1} … X_k]
• i-context(X) = [X_{k-i+1} … X_k] in the conditioned subproblem, with i-bound < k
• The i-cache for X is purged for every new instantiation of X_{k-i}.

PE solver – implementation

• C++ implementation
• Caching based on context (table caching); adaptive caching when contexts are too large
• Switch to variable elimination for small and nondeterministic problems
• Constraint propagation
• No nogood learning – just caching of nogoods
• Dynamic range support (for very small probabilities)

MPE solver – AOMB(i,j)

• Node value v(n): the most probable explanation of the subproblem rooted by n.
• Caching: identical subproblems rooted at AND nodes (identified by their contexts) are solved once and the results cached. The j-bound (context size) controls the memory used for caching.
• Heuristics: pruning is based on heuristic estimates which are pre-computed by bounded inference (i.e., the mini-bucket approximation). The i-bound (mini-bucket size) controls the accuracy of the heuristic.
• No constraint propagation.

AOMB(i,j) – Mini-Bucket Heuristics

• Each node n has a static heuristic estimate h(n) of v(n):
  – h(n) is an upper bound on the value v(n)
  – h(n) is computed from the augmented bucket structure generated by the mini-bucket approximation MBE(i)
• For every node n in the AND/OR search graph:
  – lb(n) – current best solution cost rooted at n
  – ub(n) – upper bound on the most probable explanation at n
  – Prune the search space below the current node t if ub(m) < lb(m), where m is an ancestor of t along the current path from the root
• During search, merge nodes based on context (caching); maintain cache tables of size O(exp(j)), where j is a bound on the size of the context.
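A plain OR-tree rendering of this pruning rule (the actual solver searches an AND/OR graph with context-based caching; `h` stands in for the mini-bucket upper bound):

```python
def bnb_mpe(variables, domains, value_of, h, assignment=None, best=0.0):
    """Depth-first branch-and-bound for MPE. value_of scores a full
    assignment; h(partial) upper-bounds the best completion of a partial one."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return max(best, value_of(assignment))
    var = variables[len(assignment)]
    for val in domains[var]:
        assignment[var] = val
        # Prune: if even the optimistic bound cannot beat the incumbent
        # (ub <= lb in the slide's notation), skip this subtree.
        if h(assignment) > best:
            best = bnb_mpe(variables, domains, value_of, h, assignment, best)
        del assignment[var]
    return best
```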

AOMB(i,j) – Implementation

• C++ implementation
• B&B procedure is recursive
  – Could be a bit faster if we simulated the stack
• Cache tables implemented as hash tables
• No ASM code or other optimizations
• Static variable ordering determined by the min-fill ordering (minimizes the context size)
• Choosing the (i,j) parameters:
  – i-bound: choose i such that the augmented bucket structure generated by MBE(i) fits in 2GB of RAM (i < 22)
  – j-bound: j = i + 0.75*i (j < 30)
• No constraint propagation

References

1. Rina Dechter and Robert Mateescu. AND/OR Search Spaces for Graphical Models. Artificial Intelligence, 2006. To appear.
2. Rina Dechter and Robert Mateescu. Mixtures of Deterministic-Probabilistic Networks and their AND/OR Search Space. In Proceedings of UAI-04, Banff, Canada.
3. Robert Mateescu and Rina Dechter. AND/OR Cutset Conditioning. In Proceedings of IJCAI-05, Edinburgh, Scotland.
4. Radu Marinescu and Rina Dechter. Memory Intensive Branch-and-Bound Search for Graphical Models. In Proceedings of AAAI-06, Boston, USA.
5. Radu Marinescu and Rina Dechter. AND/OR Branch-and-Bound for Graphical Models. In Proceedings of IJCAI-05, Edinburgh, Scotland.

Conclusions


Conclusions and Discussion

• Most teams said they had fun and it was a learning experience – people also became somewhat competitive :)
• Teams that used C++ (teams 4-5) arguably had faster times than those that used Java (teams 1-3).
• Use harder BNs and/or harder queries next year – it is hard to find real-world BNs that are easily available but hard. If you have a BN that is hard, please make it available for next year.
  – Regardless of who runs it next year, please send candidate networks directly to me for now.
• Provide better resolution and a standardized/unified timer (use the IPM package).
• Have a dynamic-model category (needs more interest).
• Have an approximate inference category; look at time/space/accuracy tradeoffs.