Transcript poster

A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
Jing Xiang & Seyoung Kim
Bayesian Network Structure Learning
• Two-stage methods: Stage 1, parent selection; Stage 2, search for the DAG (e.g. L1MB, and DP + A* for discrete variables [2,3,4]).
• Single-stage methods: combined parent selection + DAG search (e.g. SBN [1]).
Dynamic Programming (DP) with Lasso
• Construct the ordering by decomposing the problem with DP.
• DP must consider ALL possible paths in the search space.
• DP must visit 2^|V| states!

A* Lasso for Pruning the Search Space
• Score each state Sk with f(Sk) = g(Sk) + h(Sk).
• Admissible: h(Sk) is always an underestimate of the true cost to the goal, so A* is guaranteed to find the optimal solution.
• Consistent: h(Sk) always satisfies h(Sk) ≤ cost(Sk, Sl) + h(Sl) for every successor Sl, so the first path found to a state is guaranteed to be the shortest path to it and all other paths to that state can be pruned.
• Admissible + Consistent = Efficient + Optimal! (Both conditions are illustrated in the sketch below.)
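The two conditions can be written as simple checks. The helper below is only an illustration of the definitions for a finite search graph; the dictionaries h, cost, and true_cost_to_go are hypothetical names, not part of the authors' implementation.

# Illustrative check of admissibility and consistency on a finite search graph.
# h: {state: heuristic value}, true_cost_to_go: {state: optimal remaining cost},
# cost: {(state, successor): edge cost}.
def is_admissible(h, true_cost_to_go):
    return all(h[s] <= true_cost_to_go[s] for s in h)

def is_consistent(h, cost):
    return all(h[s] <= cost[(s, t)] + h[t] for (s, t) in cost)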
Recovery of V-structures
[Figure: precision/recall results for v-structure recovery on benchmark networks (Alarm, Alarm 2, Hailfinder 2, Barley, Water), comparing L1MB, SBN, and A* lasso under different queue limits (Qlim).]
Contributions
We address the problem of learning a sparse Bayesian network structure for continuous variables in a high-dimensional space.
1. We present the single-stage methods A* lasso and Dynamic Programming (DP) lasso.
2. Both A* lasso and DP lasso guarantee the optimality of the learned structure for continuous variables.
3. A* lasso gives a huge speed-up over DP lasso: it improves on the exponential time required by DP lasso and by previous optimal methods for discrete variables.
• DP decomposition: find the optimal score for the first node Xj in the ordering, then find the optimal score for the remaining nodes excluding Xj.
Heuristic
• g(Sk): the cost incurred so far, i.e. the LassoScore from the start state to Sk.
• h(Sk): a heuristic estimate of the cost to reach the goal from Sk, i.e. the estimated future LassoScore from Sk to the goal state, ignoring the DAG constraint (see the sketch below).
• Using g(Sk) only = greedy search: fast but suboptimal.
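A minimal sketch of these quantities, assuming data is an n x |V| numpy array and using an illustrative lasso_score() helper (built on scikit-learn's Lasso) for the per-node penalized regression error; none of this is the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso

def lasso_score(data, j, parents, lam):
    """Penalized regression error of X_j on a candidate parent set (illustrative)."""
    y = data[:, j]
    if not parents:                          # empty parent set: intercept-only model
        resid = y - y.mean()
        return 0.5 * np.dot(resid, resid)
    X = data[:, sorted(parents)]
    # sklearn minimizes (1/(2n))||y - Xb||^2 + alpha||b||_1, so rescale lambda.
    model = Lasso(alpha=lam / len(y), fit_intercept=True)
    model.fit(X, y)
    resid = y - model.predict(X)
    return 0.5 * np.dot(resid, resid) + lam * np.abs(model.coef_).sum()

def g_score(data, order_so_far, lam):
    """Cost incurred so far: lasso scores accumulated along the partial ordering."""
    total, preceding = 0.0, set()
    for j in order_so_far:
        total += lasso_score(data, j, preceding, lam)
        preceding.add(j)
    return total

def h_score(data, ordered_set, lam):
    """Optimistic estimate of the remaining cost: each still-unordered variable is
    regressed on all other variables, ignoring the DAG constraint."""
    n_vars = data.shape[1]
    remaining = set(range(n_vars)) - set(ordered_set)
    return sum(lasso_score(data, j, set(range(n_vars)) - {j}, lam)
               for j in remaining)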
• Finding the optimal ordering = finding the shortest path from the start state to the goal state.
• DP is not practical for more than about 20 nodes; we need to prune the search space, so we use A* search.
• Learning a Bayes net under the DAG constraint = learning the optimal ordering of the variables (see the DP sketch below).
• Given an ordering, the candidate parents Pa(Xj) are the variables that precede Xj in the ordering.
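The DP decomposition described above can be written as a recursion over variable subsets: the best score of a subset S is obtained by choosing which variable of S comes last in the ordering. The sketch below reuses the illustrative lasso_score() from the heuristic sketch and visits all 2^|V| subsets, which is exactly the cost A* lasso avoids; it is an illustration, not the authors' code.

from itertools import combinations

def dp_lasso(data, lam):
    n_vars = data.shape[1]
    all_vars = frozenset(range(n_vars))
    best = {frozenset(): 0.0}       # best[S] = optimal score over orderings of S
    best_last = {}                  # minimizing choice of last variable for S
    for size in range(1, n_vars + 1):
        for subset in combinations(range(n_vars), size):
            S = frozenset(subset)
            score, last = min(
                (best[S - {j}] + lasso_score(data, j, S - {j}, lam), j)
                for j in S)
            best[S], best_last[S] = score, last
    order, S = [], all_vars         # recover an optimal ordering by backtracking
    while S:
        j = best_last[S]
        order.append(j)
        S = S - {j}
    return best[all_vars], list(reversed(order))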
Recovery of Skeleton
[Figure: precision/recall results for skeleton recovery on the same benchmark networks and methods.]
Prediction Error for Benchmark Networks
[Figure: prediction error of each method on the benchmark networks.]
Comparison of Methods
Method             1-Stage   Optimal   Allows Sparse Parent Set   Computational Time
DP [3]             No        Yes       No                         Exp.
A* [4]             No        Yes       No                         ≤ Exp.
L1MB [2]           No        No        Yes                        Fast
SBN [1]            Yes       No        Yes                        Fast
DP Lasso           Yes       Yes       Yes                        Exp.
A* Lasso           Yes       Yes       Yes                        ≤ Exp.
A* Lasso + Qlimit  Yes       No        Yes                        Fast
Bayesian Network Model
• A Bayesian network for continuous variables is defined over a DAG G with node set V = {X1, …, X|V|}. The probability model factorizes as p(X1, …, X|V|) = ∏j p(Xj | Pa(Xj)).
• We observe n i.i.d. samples of all |V| variables.

Linear Regression Model
• Each factor is modeled as a linear regression of Xj on its parents, Xj = X_Pa(Xj) βj + εj, with a sparse coefficient vector βj.

Optimization Problem for Learning
• Learn the structure by minimizing, over graphs G that are DAGs, the sum of per-node lasso objectives
  LassoScore(Xj, Pa(Xj)) = min over βj of 0.5 ||Xj − X_Pa(Xj) βj||² + λ ||βj||1
  (sketched below).
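A minimal sketch of this objective, reusing the illustrative lasso_score() defined earlier: the score of a candidate structure is the sum of per-node lasso scores, and an ordering determines each node's candidate parent set. The helper names are assumptions for illustration, not the authors' API.

def network_score(data, parents, lam):
    """Sum of per-node lasso scores; `parents` maps node index -> parent set."""
    return sum(lasso_score(data, j, parents[j], lam)
               for j in range(data.shape[1]))

def parents_from_ordering(order):
    """Given an ordering, the candidate parents of a node are the variables before it."""
    return {j: set(order[:i]) for i, j in enumerate(order)}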
Example of A* Search with an Admissible and Consistent Heuristic
States: S0 = {}, S1 = {X1}, S2 = {X2}, S3 = {X3}, S4 = {X1,X2}, S5 = {X1,X3}, S6 = {X2,X3}, S7 = {X1,X2,X3} (goal).
Heuristic values: h(S1) = 4, h(S2) = 5, h(S3) = 10, h(S4) = 9, h(S5) = 5, h(S6) = 6 (h of the goal is 0).

Expand S0. Queue:
  {S0,S1}: f = 1+4 = 5
  {S0,S2}: f = 2+5 = 7
  {S0,S3}: f = 3+10 = 13

Expand S1. Queue:
  {S0,S2}: f = 2+5 = 7
  {S0,S1,S5}: f = (1+4)+5 = 10
  {S0,S3}: f = 3+10 = 13
  {S0,S1,S4}: f = (1+5)+9 = 15

Expand S2. Queue:
  {S0,S1,S5}: f = (1+4)+5 = 10
  {S0,S3}: f = 3+10 = 13
  {S0,S2,S6}: f = (2+5)+6 = 13
  {S0,S1,S4}: f = (1+5)+9 = 15
  {S0,S2,S4}: f = (2+6)+9 = 17

Expand S5. Queue:
  {S0,S1,S5,S7}: f = (1+4)+7 = 12
  {S0,S3}: f = 3+10 = 13
  {S0,S2,S6}: f = (2+5)+6 = 13
  {S0,S1,S4}: f = (1+5)+9 = 15

Goal reached! The next state popped from the queue, S7, is the goal, so the ordering along {S0,S1,S5,S7} is optimal. Consistency: the two paths to S4 have different costs (g = 1+5 = 6 via S1, g = 2+6 = 8 via S2); because the heuristic is consistent, the first (cheapest) path found to a state is its shortest, so the more expensive path {S0,S2,S4} is pruned from the queue. A corresponding search loop is sketched below.
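A minimal sketch of the search in the trace above: states are sets of already-ordered variables, the edge cost of appending Xj to state S is its lasso score with candidate parents S, and the priority queue is ordered by f = g + h. It reuses the illustrative lasso_score() and h_score() helpers and is not the authors' implementation.

import heapq
from itertools import count

def astar_lasso(data, lam):
    n_vars = data.shape[1]
    start, goal = frozenset(), frozenset(range(n_vars))
    tie = count()                       # tie-breaker so the heap never compares states
    queue = [(h_score(data, start, lam), 0.0, next(tie), start, [])]
    closed = set()
    while queue:
        f, g, _, S, order = heapq.heappop(queue)
        if S == goal:
            return g, order             # optimal score and ordering
        if S in closed:                 # consistent h: first pop of a state is optimal
            continue
        closed.add(S)
        for j in set(range(n_vars)) - S:
            S_next = S | {j}
            if S_next in closed:
                continue
            g_next = g + lasso_score(data, j, S, lam)
            f_next = g_next + h_score(data, S_next, lam)
            heapq.heappush(queue, (f_next, g_next, next(tie), S_next, order + [j]))
    raise RuntimeError("goal state not reached")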
Comparing Computation Time of Different Methods
[Figure: computation time of each method.]
Improving Scalability
• We do NOT naively limit the queue; that would dramatically reduce the quality of the solutions.
• The best intermediate results occupy the shallow part of the search space, so we distribute the results to be discarded across different depths.
• To discard k results with maximum depth |V|, we discard k/|V| intermediate results at each depth (as sketched below).
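An illustration of this depth-balanced discarding, using the same queue-entry format as the A* sketch above; the exact quota per depth here is an assumption for illustration, not the authors' exact rule.

from collections import defaultdict

def prune_queue(queue, qlimit, n_vars):
    """Trim `queue` (entries (f, g, tie, state, order)) roughly down to qlimit,
    spreading the discarded entries across depths instead of dropping only the
    globally worst ones."""
    k = len(queue) - qlimit
    if k <= 0:
        return queue
    per_depth = max(1, k // n_vars)            # discard quota at each depth
    by_depth = defaultdict(list)
    for entry in queue:
        by_depth[len(entry[3])].append(entry)  # depth = number of ordered variables
    kept = []
    for entries in by_depth.values():
        entries.sort(key=lambda e: e[0])       # best (smallest f) first
        drop = min(per_depth, max(0, len(entries) - 1))
        kept.extend(entries[:len(entries) - drop])
    return kept

After pruning, the kept list would be re-heapified (heapq.heapify) before the search continues.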
Prediction Error for S&P Stock Price Data
• Daily stock price data of 125 S&P companies over 1500 time points (1/3/07-12/17/12).
• We estimated the Bayes net using the first 1000 time points, then computed prediction errors on the remaining 500 time points (sketched below).
[Figure: prediction error (5.0-6.0) for each method: L1MB-5e4, L1MB-1e5, SBN, A*-Qlim=5, A*-Qlim=100.]
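A minimal sketch of such a train/test evaluation: each node's regression is fit on its learned parents over the training window and scored on the held-out window. The parent map, error metric, and function names are illustrative assumptions, not the authors' pipeline.

import numpy as np
from sklearn.linear_model import LinearRegression

def prediction_error(data, parents, n_train=1000):
    """Mean squared prediction error per node; `parents` maps node -> parent list."""
    train, test = data[:n_train], data[n_train:]
    total = 0.0
    for j in range(data.shape[1]):
        pa = parents.get(j, [])
        if not pa:
            pred = np.full(len(test), train[:, j].mean())
        else:
            model = LinearRegression().fit(train[:, pa], train[:, j])
            pred = model.predict(test[:, pa])
        total += np.mean((test[:, j] - pred) ** 2)
    return total / data.shape[1]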
Conclusions
• We proposed A* lasso for Bayes net structure learning with continuous variables; it guarantees optimality and reduces computation time compared to the previous optimal algorithm, DP.
• We also presented a heuristic scheme that further improves speed without significantly sacrificing the quality of the solution.
References
1. Huang et al. A sparse structure learning algorithm for Gaussian Bayesian network identification from high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6), 2013.
2. Schmidt et al. Learning graphical model structure using L1-regularization paths. In Proceedings of AAAI, volume 22, 2007.
3. Singh and Moore. Finding optimal Bayesian networks by dynamic programming. Technical Report 05-106, School of Computer Science, Carnegie Mellon University, 2005.
4. Yuan et al. Learning optimal Bayesian networks using A* search. In Proceedings of AAAI, 2011.