Transcript Document
Report from the 1st UAI 2006 Evaluation of Probabilistic Inference
July 14th, 2006
7/14/2006 UAI06-Inference Evaluation Page 1
What is this presentation about?
• Goal: The purpose of this evaluation is to compare the performance of a variety of different software systems on a single set of Bayesian network (BN) problems.
• By creating a friendly evaluation (as is often done in other communities such as SAT, and also in speech recognition and machine translation with their DARPA evaluations), we hope to foster new research in fast inference methods for performing a variety of queries in graphical models.
• Over the past few months, the 1st such evaluation took place at UAI.
• This presentation summarizes the outcome of this evaluation.
7/14/2006 UAI06-Inference Evaluation Page 2
Who we are
• Evaluators
 – Jeff Bilmes – University of Washington, Seattle
 – Rina Dechter – University of California, Irvine
• Graduate Student Assistance
 – Chris Bartels – University of Washington, Seattle
 – Radu Marinescu – University of California, Irvine
 – Karim Filali – University of Washington, Seattle
• Advisory Council
 – Dan Geiger – Technion, Israel Institute of Technology
 – Faheim Bacchus – University of Toronto
 – Kevin Murphy – University of British Columbia
7/14/2006 UAI06-Inference Evaluation Page 3
Outline
• Background, goals
• Scope (rationale)
• Final chosen queries
• The UAI 2006 BN benchmark evaluation corpus
• Scoring strategies
• Participants and team members
• Results for PE and MPE
• Team presentations
 – team 1 (UCLA), team 2 (IET), team 3 (UBC), team 4 (U. Pitt/DSL), team 5 (UCI)
• Conclusion/Open discussion
7/14/2006 UAI06-Inference Evaluation Page 4
Acknowledgements: Graduate Student Help
Chris Bartels, University of Washington
Radu Marinescu, University of California, Irvine
Karim Filali, University of Washington
Also, thanks to another U. Washington student, Mukund Narasimhan (now at MSR)
7/14/2006 UAI06-Inference Evaluation Page 5
Background
• Early 2005: Rina Dechter & Dan Geiger decide there should be some form of UAI inference evaluation (like in the SAT community) and discuss the idea (by email) with Adnan Darwiche, Faheim Bacchus, Hector Geffner, Nir Friedman, and Thomas Richardson.
• I (Jeff Bilmes) take on the task of running it this first time.
• Precedent: speech recognition and the DARPA evaluations
 – evaluation of ASR systems using error rate as a metric.
7/14/2006 UAI06-Inference Evaluation Page 6
Scope
• Many “queries” could be evaluated, including:
 – MAP – maximum a posteriori hypothesis
 – MPE – most probable explanation (also called the Viterbi assignment)
 – PE – probability of evidence
 – N-best – compute the N best of the above
• Many algorithmic variants:
 – Exact inference
 – Enforced limited time bounds and/or space bounds
 – Approximate inference, and tradeoffs between time/space/accuracy
• Classes of models:
 – Static BNs with a generic description (list of CPTs)
 – More complex description languages (e.g., context-specific independence)
 – Static models vs. dynamic models (e.g., Dynamic Bayesian Networks and DGMs) vs. relational models
7/14/2006 UAI06-Inference Evaluation Page 7
Decisions for this first evaluation.
• Emphasis: Keep things simple.
• Focus on exact inference
 – exact inference can still be useful
 – “Exact inference is NP-complete, so we perform approximate inference” is often seen in the literature
 – with smart algorithms, and for fixed (but real-world) problem sizes, exact inference is quite doable and can be better for applications.
• Focus on a small number of queries:
 – original plan: PE, MPE, and MAP for both static and dynamic models
 – from the final participant list, narrowed down to: PE and MPE on static Bayesian networks
7/14/2006 UAI06-Inference Evaluation Page 8
Query: Probability of Evidence (PE)
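The equation on this slide is not reproduced in the transcript; for reference, the standard definition (under the usual BN factorization, with evidence $\mathbf{e}$ on a subset of the variables) is
$$ P(\mathbf{e}) \;=\; \sum_{\mathbf{x}\,\sim\,\mathbf{e}} \; \prod_{i} P\big(x_i \mid \mathrm{pa}(X_i)\big), $$
the sum, over all complete assignments $\mathbf{x}$ consistent with $\mathbf{e}$, of the product of the network's CPT entries.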
7/14/2006 UAI06-Inference Evaluation Page 9
Query: Most Probable Explanation (MPE)
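The equation on this slide is likewise not reproduced in the transcript; the standard definition is
$$ \mathrm{MPE}(\mathbf{e}) \;=\; \operatorname*{arg\,max}_{\mathbf{x}\,\sim\,\mathbf{e}} \; \prod_{i} P\big(x_i \mid \mathrm{pa}(X_i)\big), $$
the single most probable complete assignment consistent with the evidence (the Viterbi assignment mentioned on the Scope slide).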
7/14/2006 UAI06-Inference Evaluation Page 10
The UAI06 BN Evaluation Corpus
• J=78 BNs were used for the PE queries, and J=57 BNs for the MPE queries; the two sets were not exactly the same.
• The BNs were the following (more details will appear on the web page):
 – random mutations of the burglar alarm graph
 – diagnosis network (Druzdzel)
 – DBNs from speech recognition, unrolled a fixed amount
 – variations on the Water DBN
 – various forms of grids
 – variations on the ISCAS 85 electrical circuits
 – variations on the ISCAS 89 electrical circuits
 – various genetic linkage graphs (Geiger)
 – BNs from a computer-based patient care system (CPCS)
 – various randomly generated graphs (F. Cozman's algorithm)
 – various known-tree-width random k-trees with determinism (k=24)
 – various known-tree-width random positive k-trees (k=24)
 – various linear block coding graphs
• While some of these have been seen before, BNs were “anonymized” before being distributed.
• BNs were distributed in xbif format (basically XML).
7/14/2006 UAI06-Inference Evaluation Page 11
Timing Platform and Limits
• Timing machines: dual-CPU 3.8GHz Pentium Xeons with 8 GB of RAM each, with hyper-threading turned on.
• Single threaded performance only in this evaluation.
• Each team had 4 days of dedicated machine usage to complete their timings (other than this, there was no upper time bound).
• No one asked for more time than these 4 days; after timing the BNs, teams could use the rest of the 4 days as they wished for further tuning. After the final numbers were sent to me, no further adjustment of the timing numbers took place (say, based on seeing others' results).
• Each timing number was the result of running a query 10 times, and then reporting the fastest (lowest) time.
7/14/2006 UAI06-Inference Evaluation Page 12
The Teams
• Thanks to every member of every team: Each member of every team was crucial to making this a successful event!!
7/14/2006 UAI06-Inference Evaluation Page 13
Team 1: UCLA
• David Allen (now at HRL Labs, CA)
• Mark Chavira (graduate student)
• Adnan Darwiche
• Keith Cascio
• Arthur Choi (graduate student)
• Jinbo Huang (now at NICTA, Australia)
7/14/2006 UAI06-Inference Evaluation Page 14
Team 2: IET
From right to left in photo:
• Masami Takikawa
• Hans Dettmar
• Francis Fung
• Rick Kissh
Other team members:
• Stephen Cannon
• Chad Bisk
• Brandon Goldfedder
Other key contributors:
• Bruce D'Ambrosio
• Kathy Laskey
• Ed Wright
• Suzanne Mahoney
• Charles Twardy
• Tod Levitt
7/14/2006 UAI06-Inference Evaluation Page 15
Team 3: UBC
Jacek Kisynski, University of British Columbia
David Poole, University of British Columbia
Michael Chiang, University of British Columbia
7/14/2006 UAI06-Inference Evaluation Page 16
Team 4: U. Pittsburgh, DSL
Tomasz Sowinski, University of Pittsburgh, DSL
Marek J. Druzdzel, University of Pittsburgh, DSL
7/14/2006 UAI06-Inference Evaluation Page 17
Team 4: Decision Systems Laboratory
UAI Software Competition
Team 5: UCI
Robert Mateescu, University of California, Irvine
Radu Marinescu, University of California, Irvine
Rina Dechter, University of California, Irvine
7/14/2006 UAI06-Inference Evaluation Page 19
The Results
7/14/2006 UAI06-Inference Evaluation Page 20
Definition Correct Answer
7/14/2006 UAI06-Inference Evaluation Page 21
Definition of “FAIL”
• Each team had 4 days to complete the evaluation; no time limit was placed on any particular BN.
• A “FAILED” score meant either that the system failed to complete the query, or that the system underflowed its own numeric precision.
 – some of the networks were designed not to fit within IEEE 64-bit double precision, so either scaling or log-arithmetic needed to be used (which is a speed hit for PE); a minimal sketch of the log-space approach appears after this slide.
• Teams had the option to submit multiple separate submissions; none did.
• Systems were allowed to “back off” from no-scaling to, say, a log-arithmetic mode (but that was included in the time charge).
7/14/2006 UAI06-Inference Evaluation Page 22
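As a minimal, illustrative sketch of the log-arithmetic workaround mentioned above (my own Python, not any team's code): probabilities are kept as logarithms, products become sums, and marginalization uses a log-sum-exp step, so values of P(e) far below the IEEE double-precision range remain representable.

```python
import math

def log_sum_exp(log_vals):
    """Stable log(sum(exp(v) for v in log_vals)), for summing probabilities in log space."""
    m = max(log_vals)
    if m == float("-inf"):            # all terms are exactly zero
        return float("-inf")
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

# Product of many tiny probabilities: add their logs instead of multiplying.
log_p = sum(math.log(p) for p in [1e-200, 1e-180, 1e-150])
# math.exp(log_p) would underflow to 0.0, but log_p itself is a perfectly good number.

# Marginalizing (summing alternatives) without ever leaving log space:
log_total = log_sum_exp([log_p, log_p - 2.0, log_p - 5.0])
print(log_p, log_total)
```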
Definition of Average Speedup
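The defining formula on this slide is not reproduced in the transcript. Based on the wording of the results slides ("average speedup of the best performance over a particular team's performance on each category"), a plausible reconstruction for a category $C$ of BNs is
$$ \mathrm{Speedup}_{k}(C) \;=\; \frac{1}{|C|} \sum_{j \in C} \frac{t_{k}(j)}{\min_{k'} t_{k'}(j)}, $$
where $t_{k}(j)$ is team $k$'s time on BN $j$; a value of 1 means the team was fastest on every BN in the category, and lower is better.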
7/14/2006 UAI06-Inference Evaluation Page 23
Results – PE and MPE
78 PE BNs and 57 MPE BNs.
1. Failure rates – the number of networks for which each team failed to produce a score, or produced an incorrect score (including underflow).
2. Speedup results on categorized BNs
 – a BN is categorized by how many teams failed to produce a score on it (so 0-5 for PE, or 0-3 for MPE)
 – average speedup of the best performance over a particular team's performance, within each category.
3. Rank scores – the number of times a particular team was rank n, for various n.
4. Workload scores – in what workload region (if any) a particular team is optimal.
7/14/2006 UAI06-Inference Evaluation Page 24
PE Results
7/14/2006 UAI06-Inference Evaluation Page 25
PE Failure Rate Results (low is best)
Reminder: 78 BNs total for PE
Failure rate (%): Team 1: 0; Team 2: 10.26; Team 3: 19.23; Team 4: 14.1; Team 5: 19.23
7/14/2006 UAI06-Inference Evaluation Page 26
Avg. Speedups: BNs that ran on all 5 systems 61 out of 78 BNs ran on all systems.
Remember:
lower is better!
Average speedup of the best system over each team (lower is better):
Team 1: 85.59; Team 2: 52.44; Team 3: 77.33; Team 4: 1.02; Team 5: 17.15
7/14/2006 UAI06-Inference Evaluation Page 27
Avg. Speedups: BNs that ran on all systems
[Chart: Average, Std, Min, and Max speedup per team (Teams 1-5); numeric values are not recoverable from the transcript.]
7/14/2006 UAI06-Inference Evaluation Page 28
Avg. Speedups: BNs that ran on (3-4)/5 systems 8 out of 78 BNs ran on only 3 or 4 systems.
Average speedup of the best system over each team (lower is better):
Team 1: 4.93; Team 2: 12.95; Team 3: 1.85; Team 4: 1; Team 5: 18.42
7/14/2006 UAI06-Inference Evaluation Page 29
Avg. Speedups: BNs that ran on (3-4)/5 systems
[Chart: Average, Std, Min, and Max speedup per team (Teams 1-5); numeric values are not recoverable from the transcript.]
7/14/2006 UAI06-Inference Evaluation Page 30
Avg. Speedups: BNs that ran on (1-2)/5 systems 9 out of 78 BNs ran on only 1 or 2 systems.
Only 2 teams (team 1 and 2) had systems that could run this category BN.
These include the genetic linkage BNs.
Average speedup of the best system over each team (lower is better): Team 1: 1; Team 2: 11.28
7/14/2006 UAI06-Inference Evaluation Page 31
Rank Proportions (how often was each team a particular rank, rank 1 is best)
[Chart: for each of Teams 1-5, the proportion of BNs on which the team was rank 1, rank 2, rank 3, rank 4, rank 5, or failed; numeric values are not recoverable from the transcript.]
7/14/2006 UAI06-Inference Evaluation Page 32
Another look at PE: per-BN wall time and inference time
[Table residue: detailed per-BN wall-clock and inference-only times (in seconds) for Teams 1-5 across the alarm-graph, diagnosis, speech-recognition, Water DBN, grid, ISCAS 85, ISCAS 89, genetic linkage, CPCS, random-graph, and k-tree (k=24, deterministic and positive) networks, BN_0 through BN_124; entries where a system did not complete are marked FAIL. The per-cell values cannot be reliably reconstructed from the transcript.]
7/14/2006 UAI06-Inference Evaluation Pages 33-35
MPE Results
• Only three teams participated: Team 1, Team 2, and Team 5.
• 57 BNs – not the same ones as for PE, but some are variations of the same original BNs.
7/14/2006 UAI06-Inference Evaluation Page 36
MPE Failure Rate Results
Failure rate (%): Team 1: 0; Team 2: 38.6; Team 5: 7.01
7/14/2006 UAI06-Inference Evaluation Page 37
MPE Avg. Speedups: BNs that ran on all 3 systems 31 out of 57 BNs ran on all systems.
Average speedup of the best system over each team (lower is better): Team 1: 5.38; Team 2: 2.75; Team 5: 7.37
7/14/2006 UAI06-Inference Evaluation Page 38
MPE Avg. Speedups: BNs that ran on all 3 systems 31 out of 57 BNs ran on all systems.
[Chart: Average, Std, Min, and Max speedup for Teams 1, 2, and 5; numeric values are not recoverable from the transcript.]
7/14/2006 UAI06-Inference Evaluation Page 39
MPE Avg. Speedups: BNs that ran on 2/3 systems 26 out of 57 BNs ran on 2 systems.
Average speedup of the best system over each team (lower is better): Team 1: 6.6; Team 2: 1.2; Team 5: 47.46
7/14/2006 UAI06-Inference Evaluation Page 40
MPE Avg. Speedups: BNs that ran on 2/3 systems
[Chart: Average, Std, Min, and Max speedup for Teams 1, 2, and 5; numeric values are not recoverable from the transcript.]
7/14/2006 UAI06-Inference Evaluation Page 41
Rank Proportions (how often was each team a particular rank, rank 1 is best)
[Chart: for each of Teams 1, 2, and 5, the proportion of BNs on which the team was rank 1, rank 2, rank 3, or failed; numeric values are not recoverable from the transcript.]
7/14/2006 UAI06-Inference Evaluation Page 42
[Table residue: per-BN wall-clock and inference-only MPE times (in seconds) for Teams 1, 2, and 5, covering BN_17 through BN_134 (diagnosis, speech recognition, Water DBN, grids, ISCAS 85, CPCS, random k-trees with determinism, positive k-trees, and coding networks); entries are marked FAIL where a system did not complete and u/flow where it underflowed. The per-cell values cannot be reliably reconstructed from the transcript.]
7/14/2006 UAI06-Inference Evaluation 43
PE and MPE Results
7/14/2006 UAI06-Inference Evaluation Page 44
Workload Scores: PE and MPE
7/14/2006 UAI06-Inference Evaluation Page 45
Workload Scores and Linear Programming
7/14/2006 UAI06-Inference Evaluation Page 46
Workload Scores: PE and MPE
• So each team is a winner; it depends on the workload.
• We could attempt further to rank teams based on the volume of the workload region where each team wins.
• Which measure, however, should we use on the simplex – uniform? Why not something else?
• “A Bayesian approach to performance ranking” – UAI does system performance measures …
7/14/2006 UAI06-Inference Evaluation Page 47
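The workload-score and linear-programming slides themselves are not reproduced in the transcript; the following is only a reconstruction of the idea from the bullets above. Treat a workload as a point $w$ on the probability simplex over the $J$ benchmark BNs (the fraction of queries drawn from each BN), so that team $k$'s expected time is
$$ T_k(w) \;=\; \sum_{j=1}^{J} w_j\, t_k(j), $$
and team $k$ "wins" at the workloads $w$ where $T_k(w) \le T_{k'}(w)$ for every other team $k'$. Whether that winning region is nonempty can be checked with a linear program, and ranking teams by the volume of their winning regions requires choosing a measure on the simplex, which is exactly the question raised above.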
Team technical descriptions
• 5 minutes for each team.
• Current plan: more details to ultimately appear on the inference evaluation web site (see main UAI page).
7/14/2006 UAI06-Inference Evaluation Page 48
Team 1: UCLA Technical Description
• presented by Adnan Darwiche 7/14/2006 UAI06-Inference Evaluation Page 49
Team UCLA
Performance summary:
Solved all 78 P(e) networks in 319s: about 4s per instance.
Solved all 57 MPE networks in 466s: about 8s per instance.

MPE approach:
 Prune the network.
 If the network has treewidth 25 or less, run RC.
 Else if the network has enough local structure, run Ace.
 Else run BnB/Ace.

P(e) approach:
 Prune the network.
 If the network has genetic-net characteristics, run RC_Link.
 Else if the network has treewidth 25 or less, run RC.
 Else run Ace.

The approach is powerful enough to solve every network in every suite. Yet it incurs a fixed overhead that disadvantages it on easy networks.
7/14/2006 UAI06-Inference Evaluation 50
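A minimal sketch of the dispatch logic described above (my own illustration; the predicate names and precomputed inputs are placeholders, not the team's actual code or API, and the pruning step that precedes the dispatch is omitted):

```python
def choose_pe_solver(treewidth, is_genetic_linkage):
    """Pick a P(e) solver following the decision order on the slide."""
    if is_genetic_linkage:
        return "RC_Link"   # RC with local-structure exploitation
    if treewidth <= 25:
        return "RC"        # recursive conditioning
    return "Ace"           # arithmetic-circuit compilation

def choose_mpe_solver(treewidth, has_local_structure):
    """Pick an MPE solver following the decision order on the slide."""
    if treewidth <= 25:
        return "RC"
    if has_local_structure:    # determinism / context-specific independence
        return "Ace"
    return "BnB/Ace"           # branch-and-bound guided by an Ace compilation

print(choose_pe_solver(treewidth=30, is_genetic_linkage=False))    # -> Ace
print(choose_mpe_solver(treewidth=40, has_local_structure=False))  # -> BnB/Ace
```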
RC and RC Link
Recursive Conditioning
• Conditioning/search algorithm
• Based on decomposing the network
• Inference exponential in treewidth
• VE/Jointree could have been used for this!
RC Link
• RC with local-structure exploitation
• Not necessarily exponential in treewidth
• Download: http://reasoning.cs.ucla.edu/rc_link
7/14/2006 UAI06-Inference Evaluation 51
Ace
• Compiles the BN to an arithmetic circuit
• Reduces to logical inference
• Strength: local structure (determinism & CSI)
• Strength: online inference
• Inference not exponential in treewidth
• http://reasoning.cs.ucla.edu/ace
7/14/2006 UAI06-Inference Evaluation 52
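As a rough illustration of the arithmetic-circuit idea (my own toy example, not Ace's actual representation or API): once a BN is compiled into a circuit of sums and products over evidence indicators and CPT parameters, P(e) is obtained by a single bottom-up pass over the circuit.

```python
# Each node is ("leaf", value) or ("add"/"mul", [child indices]); children precede parents.
def eval_circuit(nodes):
    vals = []
    for kind, arg in nodes:
        if kind == "leaf":
            vals.append(arg)
        elif kind == "add":
            vals.append(sum(vals[c] for c in arg))
        else:  # "mul"
            prod = 1.0
            for c in arg:
                prod *= vals[c]
            vals.append(prod)
    return vals[-1]  # root value = P(e) for the encoded evidence

# Tiny hand-built circuit for P(e) = sum_a P(a) * P(e|a) with a binary A:
# leaves: P(a0)=0.6, P(e|a0)=0.2, P(a1)=0.4, P(e|a1)=0.9
nodes = [("leaf", 0.6), ("leaf", 0.2), ("leaf", 0.4), ("leaf", 0.9),
         ("mul", [0, 1]), ("mul", [2, 3]), ("add", [4, 5])]
print(eval_circuit(nodes))  # 0.6*0.2 + 0.4*0.9 = 0.48
```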
Branch & Bound
• Approximate the network by deleting edges, to provide an upper bound on MPE
• Compile the approximate network using Ace and use it to drive the search
• Use belief propagation to seed a static variable order and, for each variable, an ordering on its values
7/14/2006 UAI06-Inference Evaluation 53
7/14/2006 UAI06-Inference Evaluation 54
Team 2: IET Technical Description
• presented by Masami Takikawa 7/14/2006 UAI06-Inference Evaluation Page 55
Basic SPI Algorithm – Team 2
Was able to solve 59 out of the 78 challenges.
• Collect factors (d-separation, barren-node removal & evidence propagation)
• Order nodes (minimum-weight heuristic)
• Multiplication
• Summation
• Repeat these steps until all variables are eliminated

Challenges solved, by maximum factor weight:
MaxWeight   <=100   <=1,000   <=10,000   <=100,000   <=1,000,000   <=2,000,000
#BNs            4         2         20          16            13             4
UAI2006 BN Engine Eval. Copyright©, IET, 2006.
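A minimal sketch of the eliminate-one-variable-at-a-time loop outlined on the slide above (a generic sum-product variable-elimination pass with a naive factor representation, not IET's SPI implementation):

```python
from itertools import product

# A factor: (vars, table) where table maps assignment tuples (one value per var) to numbers.
def multiply(f, g, card):
    fv, ft = f
    gv, gt = g
    vs = list(dict.fromkeys(fv + gv))                       # union of scopes, order-preserving
    table = {}
    for asg in product(*(range(card[v]) for v in vs)):
        a = dict(zip(vs, asg))
        table[asg] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return (tuple(vs), table)

def sum_out(f, var):
    fv, ft = f
    vs = tuple(v for v in fv if v != var)
    table = {}
    for asg, val in ft.items():
        key = tuple(x for v, x in zip(fv, asg) if v != var)
        table[key] = table.get(key, 0.0) + val
    return (vs, table)

def prob_evidence(factors, card, order):
    """Eliminate variables one at a time; the remaining scalar factors multiply to P(e)."""
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        if not bucket:
            continue
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f, card)
        factors.append(sum_out(prod, var))
    result = 1.0
    for _, table in factors:
        result *= table[()]
    return result

# A -> B with evidence B = 1 already absorbed (rows with B != 1 zeroed out):
card = {"A": 2, "B": 2}
fa = (("A",), {(0,): 0.6, (1,): 0.4})
fb = (("A", "B"), {(0, 0): 0.0, (0, 1): 0.2, (1, 0): 0.0, (1, 1): 0.9})
print(prob_evidence([fa, fb], card, ["A", "B"]))   # 0.6*0.2 + 0.4*0.9 = 0.48
```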
Extensions (aka additional overhead)
Solved an additional 11 challenges.
• Intra-node factorization: collect factors from a factored graph
• Order nodes using a time-slice ordering if the BN is a DBN
• Multiplication, Summation, Normalization
 – normalization was needed to avoid underflow for BN_20-26

Effect of intra-node factorization on maximum factor weight (treewidth in parentheses):
BN       Without INF    With INF
BN_30    130M (27)      66K (16)
BN_32    17G (34)       260K (18)
BN_34    1.1G (30)      3.1M (21)
BN_36    4.3G (32)      130K (17)
BN_38    4.3G (32)      520K (19)
BN_40    1.1G (30)      130K (17)

Effect of time-slice ordering on maximum factor weight:
BN       Min-Weight     Time-slice
BN_70    8.3E19 (56)    9.0E7 (21)
BN_72    2.9E21 (46)    3.5E11 (38)
BN_73    6.4E22 (51)    4.3E9 (26)
BN_75    5.5E18 (43)    1.7E10 (34)
BN_76    1.9E20 (35)    1.7E11 (24)

UAI2006 BN Engine Eval. Copyright©, IET, 2006.
Team 3: UBC Technical Description
• presented by David Poole 7/14/2006 UAI06-Inference Evaluation Page 58
Variable Elimination Code
by David Poole and Jacek Kisyński
•This is an implementation of variable elimination in Java 1.5 (without threads).
• We wanted to test how well the base VE system that we were using compared with other systems.
• […] orderings and 2GB of memory.
UAI06-Inference Evaluation 59
The most interesting part of the implementation is in the representation of factors: • A factor is essentially a list of variables, and a one-dimensional array of values.
• There is a total order of all variables and a total ordering of all values, which gives a canonical order of the values in a factor.
• We can multiply factors and sum out variables without doing random access to the values, but rather using the canonical ordering to enumerate the values.
7/14/2006 UAI06-Inference Evaluation 60
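A minimal sketch of the factor representation described on this slide (illustrative Python rather than the team's Java code; the real implementation enumerates values via the canonical ordering itself rather than recomputing indices for each assignment):

```python
from itertools import product

class Factor:
    """A factor = ordered list of variables + flat array of values.
    Variables are kept sorted by a global order so the value layout is canonical."""
    def __init__(self, variables, card, values):
        self.vars = sorted(variables)          # canonical variable order
        self.card = card                       # global map: variable -> cardinality
        self.values = values                   # row-major in self.vars order

    def index(self, assignment):
        """Flat index of a full assignment (dict var -> value) in the canonical layout."""
        idx = 0
        for v in self.vars:
            idx = idx * self.card[v] + assignment[v]
        return idx

def multiply(f, g):
    union = sorted(set(f.vars) | set(g.vars))
    card = f.card
    values = []
    # Enumerate assignments to the union in canonical order; the result array is
    # filled sequentially, so only the operand factors are indexed.
    for combo in product(*(range(card[v]) for v in union)):
        a = dict(zip(union, combo))
        values.append(f.values[f.index(a)] * g.values[g.index(a)])
    return Factor(union, card, values)

card = {"A": 2, "B": 2}
fa = Factor(["A"], card, [0.6, 0.4])                    # P(A)
fb = Factor(["A", "B"], card, [0.8, 0.2, 0.1, 0.9])     # P(B|A), rows A=0 then A=1
fab = multiply(fa, fb)
print(fab.values)   # [0.48, 0.12, 0.04, 0.36] = joint P(A, B)
```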
• Multiplication is done lazily.
• This code was written for David Poole and Nevin Lianwen Zhang, ``Exploiting contextual independence in probabilistic inference'', Journal of Artificial Intelligence Research,18, 263-313, 2003. http://www.jair.org/papers/paper1122.html
• This is also the code that is used in the CIspace belief and decision network applet. A new version of the applet will be released in July. See: http://www.cs.ubc.ca/labs/lci/CIspace/
• We plan to release the VE code as open source.
UAI06-Inference Evaluation 61
Team 4: U. Pitt/DSL Technical Description
• presented by Jeff Bilmes (Marek Druzdzel was unable to attend).
7/14/2006 UAI06-Inference Evaluation Page 62
Decision Systems Laboratory
University of Pittsburgh [email protected]
http://dsl.sis.pitt.edu/
UAI Software Competition
UAI Competition: Sources of speedup
Good theory (in addition to good implementation):
1. Clustering algorithm at the foundation of the program [Lauritzen & Spiegelhalter].
2. Relevance reasoning. Relevance steps:
 – in P(E), focusing inference on the evidence nodes
 – removal of barren nodes
 – removal of nuisance nodes
 – reuse of valid posteriors
3. For very large models: Relevance-based Decomposition [Lin & Druzdzel 1997] and Relevance-based Incremental Belief Updating [Lin & Druzdzel 1999].
Full references are included in the GeNIe on-line help, http://genie.sis.pitt.edu/ .
• Good engineering (Tomek Sowinski).
• Efficient and reliable implementation in C++ (SMILE), tested by over eight years of both academic and industrial use.
UAI-06 Software Evaluation
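As a minimal sketch of one of the relevance steps listed above, barren-node removal (my own illustrative code, not SMILE): a node that is neither an evidence nor a query node and has no evidence or query node among its descendants cannot affect P(E), so only ancestors of the evidence and query nodes need to be kept.

```python
def remove_barren(parents, evidence, query=()):
    """Return the set of nodes that remain after barren-node removal.

    parents: dict mapping each node to the list of its parents (the DAG).
    A node is kept iff it is an evidence/query node or an ancestor of one;
    all other (barren) nodes cannot influence P(evidence, query)."""
    keep = set()
    stack = list(evidence) + list(query)
    while stack:
        node = stack.pop()
        if node in keep:
            continue
        keep.add(node)
        stack.extend(parents.get(node, []))   # walk upward to ancestors
    return keep

# A -> B -> C and B -> D; with evidence on C, node D is barren.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}
print(sorted(remove_barren(parents, evidence=["C"])))   # ['A', 'B', 'C']
```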
Where did our program spend the most time?
[Pie charts break each network's runtime into Relevance, Triangulation, Find Hosts, Init Potentials, Collect, Distribute, and Other, for bn_22 (speech DBN), bn_82 (CPCS), bn_94 (random), and bn_18 (diagnosis); the speedup due to relevance on these networks ranges from 1x and 38x up to 1,046x and ∞.]
UAI-06 Software Evaluation
Broader context: GeNIe and SMILE
A developer's environment for graphical decision models ( http://genie.sis.pitt.edu/ ), implemented in Visual C++ in the Windows environment.
• Model developer module: GeNIe
• Support for model building: ImaGeNIe
• Qualitative interface: QGeNIe
• Learning and discovery module: SMiner
• Diagnosis: Diagnosis
• GeNIeRate
• Reasoning engine: SMILE (Structural Modeling, Inference, and Learning Engine) – a platform-independent library of C++ classes for graphical models
• Wrappers: SMILE.NET, jSMILE, Pocket SMILE – allow SMILE to be accessed from applications other than a C++ compiler
UAI-06 Software Evaluation
UAI Competition: Sources of speedup
Good theory rather than engineering tricks:
1. Clustering algorithm at the foundation of the program [Lauritzen & Spiegelhalter] (Pr(E) as the normalizing factor).
2. Relevance reasoning, based on conditional independence [Dawid 1979, Geiger et al. 1990], structured in [Suermondt 1992] and [Druzdzel 1992], summarized in [Druzdzel & Suermondt 1994].
3. For very large models: Relevance-based Decomposition [Lin & Druzdzel 1997] and Relevance-based Incremental Belief Updating [Lin & Druzdzel 1999].
Full references are included in the GeNIe on-line help, http://genie.sis.pitt.edu/ .
• Top research programmer (Tomek Sowinski).
• Efficient and reliable implementation in C++ (SMILE).
UAI Software Competition
Team 5: UCI Technical Description
• presented by Rina Dechter 7/14/2006 UAI06-Inference Evaluation Page 69
PE & MPE – AND/OR Search
[Figure: an example Bayesian network over variables A, B, C, D, E (contexts: A: [ ], B: [A], C: [AB], D: [BC], E: [AB]), its pseudo tree, the corresponding AND/OR search tree, and the context-minimal AND/OR search graph.]
7/14/2006 UAI06-Inference Evaluation 70
Adaptive caching
• Full context: context(X) = [X_1 X_2 … X_{k-i} X_{k-i+1} … X_k]
• With an i-bound i < k, only the i-context is cached: i-context(X) = [X_{k-i+1} … X_k], the part of the context lying inside the conditioned subproblem
• The i-cache for X is purged for every new instantiation of X_{k-i}
7/14/2006 UAI06-Inference Evaluation 71
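A toy illustration of the purging rule above (my own sketch, not the UCI code): entries are keyed only by the i-context, and the whole cache for X is discarded whenever the variable just outside that window, X_{k-i}, is re-instantiated.

```python
class AdaptiveCache:
    """Per-variable cache keyed by the i-context (the last i context variables).
    watched_var plays the role of X_{k-i}: whenever it is re-instantiated
    during search, the cache for X is emptied."""
    def __init__(self, watched_var):
        self.watched_var = watched_var
        self.watched_val = None
        self.table = {}

    def notify_assignment(self, var, val):
        if var == self.watched_var and val != self.watched_val:
            self.table.clear()              # X_{k-i} changed: purge the i-cache
            self.watched_val = val

    def lookup(self, i_context):            # i_context: tuple of the last i values
        return self.table.get(i_context)

    def store(self, i_context, value):
        self.table[i_context] = value

cache = AdaptiveCache(watched_var="X2")
cache.notify_assignment("X2", 0)
cache.store(("a", "b"), 0.37)
cache.notify_assignment("X2", 1)            # purges the cache
print(cache.lookup(("a", "b")))             # None
```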
PE solver - implementation
• C++ implementation
• Caching based on context (table caching)
 – adaptive caching when contexts are too large
• Switch to variable elimination for small and nondeterministic problems
• Constraint propagation
• No nogood learning – just caching of nogoods
• Dynamic-range support (for very small probabilities)
7/14/2006 UAI06-Inference Evaluation 72
MPE solver - AOMB(i,j)
• Node value v(n): the most probable explanation of the subproblem rooted at n.
• Caching: identical subproblems rooted at AND nodes (identified by their contexts) are solved once and the results cached.
 – the j-bound (context size) controls the memory used for caching
• Heuristics: pruning is based on heuristic estimates that are pre-computed by bounded inference (i.e., the mini-bucket approximation).
 – the i-bound (mini-bucket size) controls the accuracy of the heuristic
• No constraint propagation.
7/14/2006 UAI06-Inference Evaluation 73
AOMB(i,j) – Mini-Bucket Heuristics
• Each node n has a static heuristic estimate h(n) of v(n):
 – h(n) is an upper bound on the value v(n)
 – h(n) is computed from the augmented bucket structure generated by the mini-bucket approximation MBE(i)
• For every node n in the AND/OR search graph:
 – lb(n) – current best solution cost rooted at n
 – ub(n) – upper bound on the most probable explanation at n
 – prune the search space below the current node t if ub(m) < lb(m), where m is an ancestor of t along the current path from the root
• During search, merge nodes based on context (caching); maintain cache tables of size O(exp(j)), where j is a bound on the size of the context.
7/14/2006 UAI06-Inference Evaluation 74
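A minimal sketch of upper-bound pruning in depth-first branch-and-bound MPE search (a generic illustration under simplifying assumptions, not the AOMB(i,j) code: plain OR search over a chain of binary variables, with the product of per-factor maxima standing in for the precomputed mini-bucket heuristic h(n)):

```python
def mpe_branch_and_bound(unary, pairwise, card=2):
    """Depth-first MPE search over a chain x_0 - x_1 - ... - x_{n-1}.

    unary[v]          : factor over x_0
    pairwise[i][u][v] : factor over (x_i, x_{i+1})
    The upper bound multiplies the best case of every not-yet-used factor,
    a crude stand-in for a precomputed mini-bucket heuristic."""
    n = len(pairwise) + 1
    # remaining_max[d] = product of maxima of all factors introduced at depths >= d
    remaining_max = [1.0] * (n + 1)
    for d in range(n - 1, 0, -1):
        remaining_max[d] = remaining_max[d + 1] * max(max(row) for row in pairwise[d - 1])
    remaining_max[0] = remaining_max[1] * max(unary)

    best = {"value": 0.0, "assignment": None}

    def dfs(depth, partial, assignment):
        if depth == n:
            if partial > best["value"]:
                best["value"], best["assignment"] = partial, tuple(assignment)
            return
        for v in range(card):
            cost = partial * (unary[v] if depth == 0 else pairwise[depth - 1][assignment[-1]][v])
            if cost * remaining_max[depth + 1] <= best["value"]:
                continue                     # prune: even the optimistic bound cannot win
            dfs(depth + 1, cost, assignment + [v])

    dfs(0, 1.0, [])
    return best["value"], best["assignment"]

unary = [0.3, 0.7]
pairwise = [[[0.9, 0.1], [0.2, 0.8]],        # f(x0, x1)
            [[0.6, 0.4], [0.5, 0.5]]]        # f(x1, x2)
print(mpe_branch_and_bound(unary, pairwise)) # (0.28, (1, 1, 0))
```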
AOMB(i,j) – Implementation
• C++ implementation
• The B&B procedure is recursive
 – could be a bit faster if we simulated the stack
• Cache tables implemented as hash tables
• No ASM code or other optimizations
• Static variable ordering determined by the min-fill ordering (minimizes the context size)
• Choosing the (i,j) parameters:
 – i-bound: choose i such that the augmented bucket structure generated by MBE(i) fits in 2GB of RAM (i < 22)
 – j-bound: j = i + 0.75*i (j < 30)
• No constraint propagation
7/14/2006 UAI06-Inference Evaluation 75
References
1. Rina Dechter and Robert Mateescu. AND/OR Search Spaces for Graphical Models. Artificial Intelligence, 2006. To appear.
2. Rina Dechter and Robert Mateescu. Mixtures of Deterministic-Probabilistic Networks and their AND/OR Search Space. In Proceedings of UAI-04, Banff, Canada.
3. Robert Mateescu and Rina Dechter. AND/OR Cutset Conditioning. In Proceedings of IJCAI-05, Edinburgh, Scotland.
4. Radu Marinescu and Rina Dechter. Memory Intensive Branch-and-Bound Search for Graphical Models. In Proceedings of AAAI-06, Boston, USA.
5. Radu Marinescu and Rina Dechter. AND/OR Branch-and-Bound for Graphical Models. In Proceedings of IJCAI-05, Edinburgh, Scotland.
7/14/2006 UAI06-Inference Evaluation 76
Conclusions
7/14/2006 UAI06-Inference Evaluation Page 77
Conclusions and Discussion
• Most teams said they had fun and that it was a learning experience
 – people also became somewhat competitive
• Teams that used C++ (teams 4-5) arguably had faster times than those that used Java (teams 1-3).
• Use harder BNs and/or harder queries next year
 – it is hard to find real-world BNs that are both easily available and hard; if you have a BN that is hard, please make it available for next year
 – regardless of who runs it next year, please send candidate networks directly to me for now
• Have a dynamic-model category (there needs to be more interest).
• Have an approximate inference category; look at time/space/accuracy tradeoffs.
7/14/2006 UAI06-Inference Evaluation Page 78