
A Tutorial on
Inference and Learning
in Bayesian Networks
Irina Rish
Moninder Singh
IBM T.J. Watson Research Center
rish,[email protected]
“Road map”
- Introduction: Bayesian networks
  - What are BNs: representation, types, etc.
  - Why use BNs: applications (classes) of BNs
  - Information sources, software, etc.
- Probabilistic inference
  - Exact inference
  - Approximate inference
- Learning Bayesian networks
  - Learning parameters
  - Learning graph structure
- Summary
Bayesian Networks

BN = (G, Θ): a directed acyclic graph G over the variables plus conditional probability distributions (CPDs) Θ, one per node given its parents.

Example (the "Asia" network): Visit to Asia (A), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Chest X-ray (C), Dyspnoea (D), with CPDs P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B).

Example CPD, P(D|T,L,B):

  T  L  B | P(D=0)  P(D=1)
  0  0  0 |  0.1     0.9
  0  0  1 |  0.7     0.3
  0  1  0 |  0.8     0.2
  0  1  1 |  0.9     0.1
  ...

Joint distribution:
P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)

Conditional independencies => efficient representation.
[Lauritzen & Spiegelhalter, 95]
Bayesian Networks
- Structured, graphical representation of probabilistic relationships between several random variables
- Explicit representation of conditional independencies
  - Missing arcs encode conditional independence
- Efficient representation of the joint pdf
- Allows arbitrary queries to be answered, e.g.
  P(lung cancer=yes | smoking=no, dyspnoea=yes) = ?
Example: Printer Troubleshooting (Microsoft Windows 95)
[Figure: a Bayesian network over printer-troubleshooting variables such as Application Output OK, Print Spooling On, Spool Process OK, Local Disk Space Adequate, Network Up, Correct Printer Path, Net Cable Connected, Spooled Data OK, GDI Data Input OK, Uncorrupted Driver, Correct Driver, GDI Data Output OK, Correct Printer Selected, Print Data OK, Net Path OK, Printer On and Online, Correct Driver Settings, Net/Local Printing, PC to Printer Transport OK, Printer Data OK, Print Output OK, Local Path OK, Correct Local Port, Local Cable Connected, Paper Loaded, Printer Memory Adequate]
[Heckerman, 95]
Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
Independence Assumptions
[Figure: the three connection types illustrated on the Asia network: tail-to-tail (a common parent, e.g. Smoking with children Lung Cancer and Bronchitis), head-to-tail (a chain, e.g. Visit to Asia, Tuberculosis, Chest X-ray), and head-to-head (a common child, e.g. Tuberculosis, Lung Cancer and Bronchitis converging on Dyspnoea)]
Independence Assumptions
Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if
- every head-to-head node along the trail is in Z or has a descendant in Z, and
- every other node along the trail is not in Z.
Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y.
If X and Y are d-separated by Z, then X and Y are conditionally independent given Z.
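One way to test d-separation programmatically (not from the tutorial, but a standard equivalent criterion) is via the moralized ancestral graph: X and Y are d-separated by Z iff they are disconnected in the moral graph of the ancestral subgraph of X ∪ Y ∪ Z after removing Z. Below is a minimal sketch in plain Python; the Asia-network encoding and function names are our own.

```python
# Parent sets of the "Asia" network (node -> list of parents)
ASIA = {
    "A": [], "S": [], "T": ["A"], "L": ["S"], "B": ["S"],
    "C": ["T", "L"], "D": ["T", "L", "B"],
}

def ancestors(nodes, parents):
    """All nodes in `nodes` plus their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def d_separated(x, y, z, parents):
    """True iff x and y are d-separated by the set z (moral-graph criterion)."""
    keep = ancestors({x, y} | set(z), parents)
    # Moralize the ancestral subgraph: connect co-parents, drop directions.
    adj = {n: set() for n in keep}
    for n in keep:
        ps = [p for p in parents[n] if p in keep]
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # Remove the conditioning set and test reachability x -> y.
    blocked = set(z)
    stack, seen = [x], {x}
    while stack:
        n = stack.pop()
        if n == y:
            return False          # connected => not d-separated
        for m in adj[n]:
            if m not in seen and m not in blocked:
                seen.add(m); stack.append(m)
    return True

print(d_separated("C", "B", {"S"}, ASIA))  # True: X-ray ⟂ Bronchitis given Smoking
print(d_separated("C", "B", {"D"}, ASIA))  # False: e.g. the trail C-L-S-B is unblocked
```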
Independence Assumptions
A variable (node) is conditionally independent of its non-descendants given its parents.
[Figure: the Asia network (Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Chest X-ray, Dyspnoea)]
Independence Assumptions
[Figure: a network over Age, Gender, Exposure to Toxins, Diet, Smoking, Cancer, Serum Calcium, Lung Tumor]
Cancer is independent of Diet given Exposure to Toxins and Smoking.
[Breese & Koller, 97]
Independence Assumptions
This means that the joint pdf can be represented as a product of local distributions:

P(A,S,T,L,B,C,D) = P(A) P(S|A) P(T|A,S) P(L|A,S,T) P(B|A,S,T,L) P(C|A,S,T,L,B) P(D|A,S,T,L,B,C)
                 = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)

[Figure: the Asia network]
Independence Assumptions
Thus, the general product rule for Bayesian networks is

P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))

where Pa(Xi) is the set of parents of Xi.
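As a concrete illustration of the product rule, the sketch below evaluates the joint probability of one full assignment in the Asia network by multiplying local CPD entries. Only the P(D|T,L,B) numbers come from the table above; the remaining CPT values are hypothetical placeholders.

```python
# node: (parents, table mapping parent values -> P(node=1 | parents))
CPDS = {
    "A": ((), {(): 0.01}),                      # hypothetical
    "S": ((), {(): 0.50}),                      # hypothetical
    "T": (("A",), {(0,): 0.01, (1,): 0.05}),    # hypothetical
    "L": (("S",), {(0,): 0.01, (1,): 0.10}),    # hypothetical
    "B": (("S",), {(0,): 0.30, (1,): 0.60}),    # hypothetical
    "C": (("T", "L"), {(0, 0): 0.05, (0, 1): 0.98,
                       (1, 0): 0.98, (1, 1): 0.98}),        # hypothetical
    # From the CPD table in the text: P(D=1 | T=0, L, B); T=1 rows are hypothetical.
    "D": (("T", "L", "B"), {(0, 0, 0): 0.9, (0, 0, 1): 0.3,
                            (0, 1, 0): 0.2, (0, 1, 1): 0.1,
                            (1, 0, 0): 0.9, (1, 0, 1): 0.9,
                            (1, 1, 0): 0.9, (1, 1, 1): 0.9}),
}

def joint(assignment):
    """P(assignment) as a product of local CPD entries P(xi | pa(xi))."""
    p = 1.0
    for node, (parents, table) in CPDS.items():
        p1 = table[tuple(assignment[pa] for pa in parents)]
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

print(joint({"A": 0, "S": 1, "T": 0, "L": 0, "B": 1, "C": 0, "D": 1}))
```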
The Knowledge Acquisition Task
Variables:
- collectively exhaustive, mutually exclusive values
- clarity test: a value should be knowable in principle
Structure:
- if data are available, it can be learned
- constructed by hand (using "expert" knowledge)
- variable ordering matters: causal knowledge usually simplifies
Probabilities:
- can be learned from data
- the second decimal usually does not matter; relative probabilities matter more
- sensitivity analysis
The Knowledge Acquisition Task
[Figure: two networks over Fuel, Battery, Gauge, TurnOver, Start built from different variable orderings]
Variable order is important; causal knowledge simplifies construction.
The Knowledge Acquisition Task
- Naive Bayesian Classifiers [Duda & Hart; Langley 92]
- Selective Naive Bayesian Classifiers [Langley & Sage 94]
- Conditional Trees [Geiger 92; Friedman et al 97]
- Selective Bayesian Networks [Singh & Provan, 95; 96]
What are BNs useful for?
- Diagnosis: P(cause|symptom) = ?
- Prediction: P(symptom|cause) = ?
- Classification: max_class P(class|data)
- Decision-making (given a cost function)
- Data mining: induce the best model from data
Application areas: medicine, bioinformatics, speech recognition, stock market, text classification, computer troubleshooting.
What are BNs useful for?
Decision making: maximize expected utility.
[Figure: known predisposing factors and unknown but important causes lead to effects; decisions and imperfect observations feed a value node; predictive inference runs from cause to effect, diagnostic reasoning from effect to cause]
What are BNs useful for?
Value of Information.
[Figure: salient observations produce an assignment of belief, the probability of each fault i (Fault 1, Fault 2, Fault 3, ...); if the belief is decisive, act now (Action 1, Action 2, or do nothing) and halt; otherwise choose the next best observation (value of information), incorporate the new observation, and repeat]
Why use BNs?
- Explicit management of uncertainty
- Modularity implies maintainability
- Better, flexible and robust decision making (MEU, VOI)
- Can be used to answer arbitrary queries (multiple-fault problems)
- Easy to incorporate prior knowledge
- Easy to understand
Application Examples
Intellipath
- commercial version of Pathfinder
- lymph-node diseases (60), 100 findings
APRI system developed at AT&T Bell Labs
- learns & uses Bayesian networks from data to identify customers liable to default on bill payments
NASA Vista system
- predicts failures in propulsion systems
- considers time criticality & suggests the highest-utility action
- dynamically decides what information to show
Application Examples
Answer Wizard in MS Office 95 / MS Project
- Bayesian-network-based free-text help facility
- uses naive Bayesian classifiers
Office Assistant in MS Office 97
- extension of the Answer Wizard
- uses naive Bayesian networks
- help based on past experience (keyboard/mouse use) and the task the user is currently doing
- this is the "smiley face" you get in your MS Office applications
Application Examples
Microsoft Pregnancy and Child-Care
- available on MSN in the Health section
- frequently occurring children's symptoms are linked to expert modules that repeatedly ask parents relevant questions
- asks the next best question based on the information provided
- presents articles that are deemed relevant based on the information provided
Application Examples
Printer troubleshooting
- HP bought a 40% stake in HUGIN; developing printer troubleshooters for HP printers
- Microsoft has 70+ online troubleshooters on their web site
  - they use Bayesian networks with multiple-fault models and incorporate utilities
Fax machine troubleshooting
- Ricoh uses Bayesian-network-based troubleshooters at call centers
- enabled Ricoh to answer twice the number of calls in half the time
Online/print resources on BNs
Conferences & Journals
- UAI, ICML, AAAI, AISTAT, KDD
- MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI
Books and Papers
- Bayesian Networks without Tears by Eugene Charniak. AI Magazine, Winter 1991.
- Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann, 1988.
- Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley, 1990.
- CACM special issue on real-world applications of BNs, March 1995.
Online/Print Resources on BNs
Wealth of online information at www.auai.org
Links to:
- electronic proceedings for UAI conferences
- other sites with information on BNs and reasoning under uncertainty
- several tutorials and important articles
- research groups & companies working in this area
- other societies, mailing lists and conferences
Publicly available s/w for BNs
List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
- several free packages: generally research only
- commercial packages: the most powerful (& expensive) is HUGIN; others include Netica and Dxpress
- we are working on developing a Java-based BN toolkit here at Watson; it will also work within ABLE
“Road map”
- Introduction: Bayesian networks
  - What are BNs: representation, types, etc.
  - Why use BNs: applications (classes) of BNs
  - Information sources, software, etc.
- Probabilistic inference
  - Exact inference
  - Approximate inference
- Learning Bayesian networks
  - Learning parameters
  - Learning graph structure
- Summary
Probabilistic Inference Tasks
- Belief updating:
  BEL(Xi) = P(Xi = xi | evidence)
- Finding the most probable explanation (MPE):
  x* = arg max_x P(x, e)
- Finding the maximum a posteriori (MAP) hypothesis:
  (a1*, ..., ak*) = arg max_a Σ_{X/A} P(x, e),   A ⊆ X: hypothesis variables
- Finding the maximum-expected-utility (MEU) decision:
  (d1*, ..., dk*) = arg max_d Σ_{X/D} P(x, e) U(x),   D ⊆ X: decision variables, U(x): utility function
Belief Updating
[Figure: the Asia network (Smoking, Bronchitis, Lung Cancer, X-ray, Dyspnoea)]
P(lung cancer=yes | smoking=no, dyspnoea=yes) = ?
Belief updating: P(X|evidence) = ?

P(a | e=0) ∝ P(a, e=0) = Σ_{e=0, d, c, b} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
  = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) Σ_b P(b|a) P(d|b,a) P(e|b,c)

[Figure: the example network over A, B, C, D, E and its "moral" graph]
Variable Elimination
Bucket elimination: algorithm elim-bel (Dechter 1996). Elimination operator: summation, e.g. Σ_b.

bucket B:  P(b|a), P(d|b,a), P(e|b,c)   →  h^B(a,d,c,e) = Σ_b P(b|a) P(d|b,a) P(e|b,c)
bucket C:  P(c|a), h^B(a,d,c,e)         →  h^C(a,d,e)
bucket D:  h^C(a,d,e)                   →  h^D(a,e)
bucket E:  e=0, h^D(a,e)                →  h^E(a)
bucket A:  P(a), h^E(a)                 →  P(a|e=0)

W* = 4: the "induced width" (max clique size of the induced graph).
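A minimal bucket-elimination sketch for the belief-updating example above is given below. The factor representation, function names and all CPT numbers are our own hypothetical placeholders; the network structure (A→B, A→C, (A,B)→D, (B,C)→E) and the elimination ordering follow the slide.

```python
from itertools import product

def make_factor(vars_, fn):
    """Tabulate fn over all 0/1 assignments of vars_ (variables kept sorted)."""
    vars_ = tuple(sorted(vars_))
    return vars_, {vals: fn(dict(zip(vars_, vals)))
                   for vals in product([0, 1], repeat=len(vars_))}

def multiply(f, g):
    fv, ft = f; gv, gt = g
    vars_ = tuple(sorted(set(fv) | set(gv)))
    table = {}
    for vals in product([0, 1], repeat=len(vars_)):
        a = dict(zip(vars_, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return vars_, table

def sum_out(f, var):
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for vals, p in ft.items():
        a = dict(zip(fv, vals))
        key = tuple(a[v] for v in keep)
        table[key] = table.get(key, 0.0) + p
    return keep, table

def eliminate(factors, order):
    """Bucket elimination: for each variable, multiply the factors mentioning it
    and sum it out; finally multiply whatever factors remain."""
    factors = list(factors)
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        if bucket:
            prod = bucket[0]
            for g in bucket[1:]:
                prod = multiply(prod, g)
            factors.append(sum_out(prod, var))
    result = factors[0]
    for g in factors[1:]:
        result = multiply(result, g)
    return result

# Hypothetical CPDs (placeholders, not from the tutorial).
P_A = make_factor("A", lambda a: 0.6 if a["A"] == 1 else 0.4)
P_B = make_factor("AB", lambda a: [[0.7, 0.3], [0.2, 0.8]][a["A"]][a["B"]])
P_C = make_factor("AC", lambda a: [[0.9, 0.1], [0.4, 0.6]][a["A"]][a["C"]])
P_D = make_factor("ABD", lambda a: [[[0.5, 0.5], [0.6, 0.4]],
                                    [[0.1, 0.9], [0.3, 0.7]]][a["A"]][a["B"]][a["D"]])
P_E = make_factor("BCE", lambda a: [[[0.8, 0.2], [0.5, 0.5]],
                                    [[0.3, 0.7], [0.1, 0.9]]][a["B"]][a["C"]][a["E"]])

# Clamp the evidence E=0 by zeroing out E=1 entries, then eliminate B, C, D, E.
ev, et = P_E
P_E = ev, {vals: (p if dict(zip(ev, vals))["E"] == 0 else 0.0) for vals, p in et.items()}

vars_, table = eliminate([P_A, P_B, P_C, P_D, P_E], order="BCDE")   # factor over A
z = sum(table.values())
print({a[0]: p / z for a, p in table.items()})   # P(A | E=0)
```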
Finding MPE
MPE = max_x P(x)
Algorithm elim-mpe (Dechter 1996): Σ is replaced by max. Elimination operator: maximization, e.g. max_b.

MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|a,b) P(e|b,c)

bucket B:  P(b|a), P(d|b,a), P(e|b,c)   →  h^B(a,d,c,e) = max_b P(b|a) P(d|b,a) P(e|b,c)
bucket C:  P(c|a), h^B(a,d,c,e)         →  h^C(a,d,e)
bucket D:  h^C(a,d,e)                   →  h^D(a,e)
bucket E:  e=0, h^D(a,e)                →  h^E(a)
bucket A:  P(a), h^E(a)                 →  MPE

W* = 4: the "induced width" (max clique size).
Generating the MPE-tuple
(assign values in the reverse order of elimination, consulting the recorded functions)

B: P(b|a), P(d|b,a), P(e|b,c)
C: P(c|a), h^B(a,d,c,e)
D: h^C(a,d,e)
E: e=0, h^D(a,e)
A: P(a), h^E(a)

1. a' = arg max_a P(a) h^E(a)
2. e' = 0
3. d' = arg max_d h^C(a', d, e')
4. c' = arg max_c P(c|a') h^B(a', d', c, e')
5. b' = arg max_b P(b|a') P(d'|b, a') P(e'|b, c')

Return (a', b', c', d', e')
Complexity of inference

O(n exp(w*(d)))

where w*(d) is the induced width of the moral graph along ordering d.

The effect of the ordering:
[Figure: the "moral" graph over A, B, C, D, E and its induced graphs along two different orderings d1 and d2]
w*(d1) = 4,  w*(d2) = 2
Other tasks and algorithms
MAP and MEU tasks:
- similar bucket-elimination algorithms: elim-map, elim-meu (Dechter 1996)
- elimination operation: either summation or maximization
- restriction on the variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
Other inference algorithms:
- join-tree clustering
- Pearl's poly-tree propagation
- conditioning, etc.
Relationship with join-tree clustering
Ordering: A, B, C, D, E

bucket(E): P(e|b,c)
bucket(D): P(d|a,b)
bucket(C): P(c|a) || h^D(a,b)
bucket(B): P(b|a) || h^C(a,b)
bucket(A): P(a) || h^B(a)

A cluster is a set of buckets (a "super-bucket"); here the clusters correspond to BCE, ADB, and ABC.
Relationship with Pearl’s belief propagation in poly-trees
[Figure: a poly-tree fragment around node X1 with parents U1, U2, U3 and child Y1: "causal support" messages π_X1(ui) flow from the parents (which also receive λ_Z(ui) messages from their other children Z1, Z2, Z3), and "diagnostic support" messages λ_Y1(x1) flow from the child]

Pearl's belief propagation for a single-root query = elim-bel using a topological ordering and super-buckets for families.
Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
“Road map”
- Introduction: Bayesian networks
- Probabilistic inference
  - Exact inference
  - Approximate inference
- Learning Bayesian networks
  - Learning parameters
  - Learning graph structure
- Summary
Inference is NP-hard => approximations
Exact inference is O(n exp(w*)).
[Figure: an example network over S, C, X, B, D]
Approximations:
- Local inference
- Stochastic simulations
- Variational approximations
- etc.
Local Inference Idea
Bucket-elimination approximation: "mini-buckets"
- Local inference idea: bound the size of recorded dependencies
- Computation in a bucket is time and space exponential in the number of variables involved
- Therefore, partition the functions in a bucket into "mini-buckets" over smaller numbers of variables
Mini-bucket approximation: MPE task
Split a bucket into mini-buckets => bound complexity:

  max_X (Π h · Π g) ≤ (max_X Π h) · (max_X Π g)

Exponential complexity decrease: O(e^n) → O(e^r) + O(e^{n-r})

Approx-mpe(i)
- Input: i - max number of variables allowed in a mini-bucket
- Output: [lower bound (P of a sub-optimal solution), upper bound]
Example: approx-mpe(3) versus elim-mpe
[Figure: bucket structures for approx-mpe(3), with w* = 2, versus elim-mpe, with w* = 4]
Properties of approx-mpe(i)
- Complexity: O(exp(2i)) time and O(exp(i)) space
- Accuracy: determined by the upper/lower (U/L) bound
- As i increases, both accuracy and complexity increase
- Possible uses of mini-bucket approximations:
  - as anytime algorithms (Dechter and Rish, 1997)
  - as heuristics in best-first search (Kask and Dechter, 1999)
- Other tasks: similar mini-bucket approximations for belief updating, MAP and MEU (Dechter and Rish, 1997)
Anytime Approximation

anytime-mpe(ε):
  Initialize: i = i0
  While time and space resources are available:
    i = i + i_step
    U = upper bound computed by approx-mpe(i)
    L = lower bound computed by approx-mpe(i)
    keep the best solution found so far
    if 1 ≤ U/L ≤ 1 + ε, return the solution
  end
  return the largest L and the smallest U
Empirical Evaluation
(Dechter and Rish, 1997; Rish, 1999)
- Randomly generated networks
  - uniform random probabilities
  - random noisy-OR
- CPCS networks
- Probabilistic decoding
Comparing approx-mpe and anytime-mpe versus elim-mpe.
Random networks
- Uniform random: 60 nodes, 90 edges (200 instances)
  - In 80% of cases, a 10-100 times speed-up while U/L < 2
- Noisy-OR: even better results
  P(x=0 | y1, ..., yn) = Π_{i: yi=1} qi,  with random noise parameters qi ≤ q
  - Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
CPCS networks – medical diagnosis (noisy-OR model)
Test case: no evidence.
[Figure: anytime-mpe(0.0001), U/L error versus time for cpcs360b and cpcs422b, with i ranging from 1 to 21]

  Algorithm                     Time (sec), cpcs360   Time (sec), cpcs422
  elim-mpe                            115.8                 1697.6
  anytime-mpe(ε), ε = 10^-4            70.3                  505.2
  anytime-mpe(ε), ε = 10^-1            70.3                  110.5
Effect of evidence
More likely evidence => higher MPE => higher accuracy (why?)
[Figure: log(U/L) histograms for i=10 on 1000 instances of random (unlikely) evidence versus 1000 instances of likely evidence]
Probabilistic decoding
Error-correcting linear block codes.
State-of-the-art: an approximate algorithm, iterative belief propagation (IBP) (Pearl's poly-tree algorithm applied to loopy networks).
approx-mpe vs. IBP:
- approx-mpe is better on low-w* codes
- IBP is better on randomly generated (high-w*) codes
[Figure: bit error rate (BER) as a function of noise (sigma)]
Mini-buckets: summary
- Mini-buckets: a local inference approximation
- Idea: bound the size of recorded functions
- Approx-mpe(i): mini-bucket algorithm for MPE
  - better results for noisy-OR than for random problems
  - accuracy increases with decreasing noise
  - accuracy increases for likely evidence
  - sparser graphs -> higher accuracy
  - coding networks: approx-mpe outperforms IBP on low-induced-width codes
“Road map”
- Introduction: Bayesian networks
- Probabilistic inference
  - Exact inference
  - Approximate inference
    - Local inference
    - Stochastic simulations
    - Variational approximations
- Learning Bayesian networks
- Summary
Approximation via Sampling
1. Generate N samples from P(X):
   S = (s^1, ..., s^N), where s^i = (x1^i, x2^i, ..., xn^i)
2. Estimate probabilities by frequencies:
   P(Y = y) ≈ (# samples with Y = y) / N
3. How to handle evidence E?
   - acceptance-rejection (e.g., forward sampling)
   - "clamping" evidence nodes to their values:
     * likelihood weighing
     * Gibbs sampling (MCMC)
Forward Sampling
(logic sampling (Henrion, 1988))

Input: E - evidence, N - # of samples, an ancestral ordering o = (X1, ..., Xn)
Output: N samples consistent with E
1. For sample # = 1 to N
2.   for i = 1 to n
3.     Xi ← sample xi from P(xi | pa_i)
4.     if Xi ∈ E and xi ≠ ei, reject the sample:
5.       i ← 1 and go to step 2
Forward sampling (example)
[Figure: network X1 → X2, X1 → X3, (X2, X3) → X4 with CPDs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3)]
Evidence: X3 = 0
// generate sample k
1. Sample x1 from P(x1)
2. Sample x2 from P(x2 | x1)
3. Sample x3 from P(x3 | x1)
4. If x3 ≠ 0, reject the sample and start from step 1; otherwise
5. sample x4 from P(x4 | x2, x3)
Drawback: high rejection rate!
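A minimal forward-sampling (logic-sampling) sketch for the four-node example above; the CPT numbers are hypothetical placeholders.

```python
import random

# Hypothetical CPTs for X1 -> X2, X1 -> X3, (X2, X3) -> X4; entries are P(node=1 | parents).
P = {"X1": {(): 0.3}, "X2": {(0,): 0.2, (1,): 0.7}, "X3": {(0,): 0.6, (1,): 0.1},
     "X4": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}}
PARENTS = {"X1": (), "X2": ("X1",), "X3": ("X1",), "X4": ("X2", "X3")}
ORDER = ["X1", "X2", "X3", "X4"]          # ancestral ordering

def forward_sample(evidence, n_samples):
    """Logic sampling: sample ancestrally, reject samples inconsistent with evidence."""
    accepted = []
    for _ in range(n_samples):
        sample, ok = {}, True
        for node in ORDER:
            pa = tuple(sample[p] for p in PARENTS[node])
            sample[node] = 1 if random.random() < P[node][pa] else 0
            if node in evidence and sample[node] != evidence[node]:
                ok = False                # reject and move on to the next sample
                break
        if ok:
            accepted.append(sample)
    return accepted

samples = forward_sample({"X3": 0}, 10000)
# Estimate P(X4=1 | X3=0) by the frequency among the accepted samples.
print(sum(s["X4"] for s in samples) / max(len(samples), 1))
```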
Likelihood Weighing
(Fung and Chang, 1990; Shachter and Peot, 1990)

"Clamping" evidence + forward sampling + weighing samples by the evidence likelihood.

1. For each Xi ∈ E, assign xi = ei.
2. Find an ancestral ordering of the nodes: o = (X1, ..., Xn).
3. For sample # = 1 to N
4.   for Xi ∉ E
5.     Xi ← sample xi from P(xi | pa_i)
6.   score(sample) = Π_{Xi ∈ E} P(ei | pa_i)
7. Normalize the scores; then
   P(Y = y | E) ≈ Σ_{samples where Y=y} score(sample)

Works well for likely evidence!
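A likelihood-weighting sketch for the same four-node example; it reuses the same hypothetical CPTs as the forward-sampling sketch, repeated here so the block is self-contained.

```python
import random

P = {"X1": {(): 0.3}, "X2": {(0,): 0.2, (1,): 0.7}, "X3": {(0,): 0.6, (1,): 0.1},
     "X4": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}}   # hypothetical
PARENTS = {"X1": (), "X2": ("X1",), "X3": ("X1",), "X4": ("X2", "X3")}
ORDER = ["X1", "X2", "X3", "X4"]

def likelihood_weighting(evidence, query_node, n_samples):
    """Clamp evidence nodes and weigh each sample by the likelihood of the evidence."""
    weights = {0: 0.0, 1: 0.0}
    for _ in range(n_samples):
        sample, w = dict(evidence), 1.0
        for node in ORDER:
            pa = tuple(sample[p] for p in PARENTS[node])
            p1 = P[node][pa]
            if node in evidence:
                # Clamped: multiply the weight by P(evidence value | parents).
                w *= p1 if evidence[node] == 1 else 1.0 - p1
            else:
                sample[node] = 1 if random.random() < p1 else 0
        weights[sample[query_node]] += w
    total = weights[0] + weights[1]
    return {v: w / total for v, w in weights.items()}

print(likelihood_weighting({"X3": 0}, "X4", 10000))   # estimate of P(X4 | X3=0)
```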
Gibbs Sampling
(Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples.

1. For each Xi ∈ E, xi = ei.
2. For each Xi ∉ E, xi = a random value.
3. For sample # = 1 to N
4.   for Xi ∉ E
5.     Xi ← sample xi from P(xi | X \ {Xi})

Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
Gibbs Sampling (cont’d)
(Pearl, 1988)
Important: P(xi | X \ {Xi}) is computed locally:

  P(xi | X \ {Xi}) ∝ P(xi | pa_i) Π_{j ∈ ch_i} P(xj | pa_j)

Markov blanket:
  M(Xi) = pa_i ∪ ch_i ∪ (∪_{j ∈ ch_i} pa_j)

Given its Markov blanket (parents, children, and their parents), Xi is independent of all other nodes.
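A Gibbs-sampling sketch for the same four-node example, using the local (Markov-blanket) computation above; CPTs are the same hypothetical placeholders, and the burn-in length is an arbitrary choice.

```python
import random

P = {"X1": {(): 0.3}, "X2": {(0,): 0.2, (1,): 0.7}, "X3": {(0,): 0.6, (1,): 0.1},
     "X4": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}}   # hypothetical
PARENTS = {"X1": (), "X2": ("X1",), "X3": ("X1",), "X4": ("X2", "X3")}
ORDER = ["X1", "X2", "X3", "X4"]
CHILDREN = {n: [c for c in ORDER if n in PARENTS[c]] for n in ORDER}

def local_prob(node, value, state):
    """P(node=value | parents) * prod over children of P(child | its parents),
    with `node` set to `value` -- proportional to P(node | Markov blanket)."""
    s = dict(state, **{node: value})
    def cpd(n):
        p1 = P[n][tuple(s[p] for p in PARENTS[n])]
        return p1 if s[n] == 1 else 1.0 - p1
    prob = cpd(node)
    for child in CHILDREN[node]:
        prob *= cpd(child)
    return prob

def gibbs(evidence, query_node, n_samples, burn_in=500):
    state = {n: evidence.get(n, random.randint(0, 1)) for n in ORDER}
    counts = {0: 0, 1: 0}
    for t in range(n_samples + burn_in):
        for node in ORDER:
            if node in evidence:
                continue
            p1 = local_prob(node, 1, state)
            p0 = local_prob(node, 0, state)
            state[node] = 1 if random.random() < p1 / (p1 + p0) else 0
        if t >= burn_in:
            counts[state[query_node]] += 1
    return {v: c / n_samples for v, c in counts.items()}

print(gibbs({"X3": 0}, "X4", 10000))   # estimate of P(X4 | X3=0)
```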
“Road map”
- Introduction: Bayesian networks
- Probabilistic inference
  - Exact inference
  - Approximate inference
    - Local inference
    - Stochastic simulations
    - Variational approximations
- Learning Bayesian networks
- Summary
Variational Approximations
Idea: a variational transformation of the CPDs simplifies inference.
Advantages:
- Compute upper and lower bounds on P(Y)
- Usually faster than sampling techniques
Disadvantages:
- More complex and less general: must be derived for each particular form of CPD functions
Variational bounds: example

log(x) = min_λ {λx − log λ − 1} ≤ λx − log λ − 1,   λ: variational parameter

This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds:
convex duality (Jaakkola and Jordan, 1997)
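A quick numeric sanity check of this bound (our own illustration, not from the tutorial): for any λ > 0, λx − log λ − 1 is an upper bound on log x, and it is tight at λ = 1/x.

```python
import math

def upper_bound(x, lam):
    """Variational upper bound on log(x): lam*x - log(lam) - 1."""
    return lam * x - math.log(lam) - 1.0

x = 3.0
print(math.log(x))                 # exact value, ~1.0986
print(upper_bound(x, 0.5))         # a loose upper bound for an arbitrary lambda
print(upper_bound(x, 1.0 / x))     # tight: equals log(x) at lambda = 1/x
```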
Convex duality
(Jaakkola and Jordan, 1997)
1. If f(x) is concave, it has a dual function f*(λ) such that:
   f(x) = min_λ {λ^T x − f*(λ)},   f*(λ) = min_x {λ^T x − f(x)}
   and we get upper bounds:
   f(x) ≤ λ^T x − f*(λ),   f*(λ) ≤ λ^T x − f(x)
2. For convex f(x), we get lower bounds.
Example: QMR-DT network
(Quick Medical Reference – Decision-Theoretic (Shwe et al., 1991))
[Figure: a two-layer network with 600 disease nodes d1, d2, ..., dk on top and 4000 finding nodes f1, f2, f3, ..., fn below]

Noisy-OR model:

P(fi = 0 | d) = (1 − qi0) Π_{dj ∈ pa_i} (1 − qij)^{dj}
              = e^{−θi0 − Σ_{dj ∈ pa_i} θij dj},   where θij = −log(1 − qij)
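A small sketch of the noisy-OR CPD in both of the forms above (product over active parents and the equivalent exponential form). The link probabilities, leak probability and disease indicators below are hypothetical.

```python
import math

def noisy_or_p_f0(d, q, q0):
    """P(f_i = 0 | d): leak term (1 - q0) times (1 - q_ij) for each active parent d_j = 1."""
    p = 1.0 - q0
    for j, dj in d.items():
        p *= (1.0 - q[j]) ** dj
    return p

def noisy_or_exp_form(d, q, q0):
    """Same quantity via theta_ij = -log(1 - q_ij)."""
    theta0 = -math.log(1.0 - q0)
    return math.exp(-theta0 - sum(-math.log(1.0 - q[j]) * dj for j, dj in d.items()))

d = {1: 1, 2: 0, 3: 1}             # disease indicators (hypothetical)
q = {1: 0.8, 2: 0.5, 3: 0.3}       # link probabilities q_ij (hypothetical)
print(noisy_or_p_f0(d, q, q0=0.01), noisy_or_exp_form(d, q, q0=0.01))   # identical
```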
Inference in QMR-DT

Inference: P(d1 | f) = Σ_{d2,...,dk} P(d, f)

P(d, f) = P(f | d) P(d)
        = Π_{fi=1} P(fi | d) · Π_{fi=0} P(fi | d) · Π_j P(dj)
        = Π_{fi=1} (1 − e^{−θi0 − Σ_{dj ∈ pa_i} θij dj}) · Π_{fi=0} e^{−θi0 − Σ_{dj ∈ pa_i} θij dj} · Π_j P(dj)

The negative findings factorize over the diseases:

  Π_{fi=0} e^{−θi0 − Σ_{dj ∈ pa_i} θij dj} = Π_{fi=0} e^{−θi0} · Π_j [e^{−Σ_{fi=0} θij}]^{dj}   (factorized)

Positive evidence "couples" the disease nodes.

Inference complexity: O(exp(min{p, k})),
where p = # of positive findings, k = max family size
(Heckerman, 1989 ("Quickscore"); Rish and Dechter, 1998)
Variational approach to QMR-DT
(Jaakkola and Jordan, 1997)

f(x) = ln(1 − e^{−x}) is concave and has a dual
f*(λ) = −λ ln λ + (λ + 1) ln(λ + 1)

Then P(fi = 1 | d) = 1 − e^{−θi0 − Σ_{dj ∈ pa_i} θij dj} can be bounded by:

P(fi = 1 | d) ≤ e^{λi (θi0 + Σ_{dj ∈ pa_i} θij dj) − f*(λi)}
             = e^{λi θi0 − f*(λi)} · Π_{dj ∈ pa_i} [e^{λi θij}]^{dj}

The effect of positive evidence is now factorized (the diseases are "decoupled").
Variational approximations
- Bounds on local CPDs yield a bound on the posterior
- Two approaches: sequential and block
- Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering, then optimizes over the variational parameters
- Block: selects in advance the nodes to be transformed, then selects variational parameters minimizing the KL-distance between the true and approximate posteriors
Block approach

P(Y | E): exact posterior of Y given evidence E
Q(Y | E, λ): approximation after replacing some CPDs with their variational bounds

Find λ* = arg min_λ D(Q || P),
where D(Q || P) is the Kullback-Leibler (KL) distance:

  D(Q || P) = Σ_S Q(S) log [Q(S) / P(S)]
Inference in BN: summary
- Exact inference is often intractable => need approximations
- Approximation principles:
  - Approximating elimination: local inference, bounding the size of dependencies among variables (cliques in a problem's graph) - mini-buckets, IBP
  - Other approximations: stochastic simulations, variational techniques, etc.
- Further research:
  - Combining "orthogonal" approximation approaches
  - Better understanding of "what works well where": which approximation suits which problem structure
  - Other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)
“Road map”
- Introduction: Bayesian networks
- Probabilistic inference
  - Exact inference
  - Approximate inference
- Learning Bayesian networks
  - Learning parameters
  - Learning graph structure
- Summary
Why learn Bayesian networks?
- Combining domain expert knowledge with data
  [Data: records with missing values, e.g. <9.7 0.6 8 14 18>, <0.2 1.3 5 ?? ??>, <1.3 2.8 ?? 0 1>, <?? 5.6 0 10 ??>, ...]
- Efficient representation and inference
- Incremental learning: updating P(H)
- Handling missing data, e.g. <1.3 2.8 ?? 0 1>
- Learning causal relationships, e.g. S → C
Learning Bayesian Networks
- Known graph - learn parameters
  - Complete data: parameter estimation (ML, MAP)
  - Incomplete data: non-linear parametric optimization (gradient descent, EM)
  [Figure: a fixed structure over S, C, B, X, D with CPDs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)]
- Unknown graph - learn graph and parameters
  - Complete data: optimization (search in the space of graphs)
  - Incomplete data: EM plus Multiple Imputation, structural EM, mixture models
  [Figure: candidate structures over S, C, B, X, D]

  Ĝ = arg max_G Score(G)
Learning Parameters: complete data

ML-estimate: max_Θ log P(D | Θ) - decomposable!

For multinomial CPDs P(x | pa_X) with parameters θ_{x,pa_X}:

  ML(θ_{x,pa_X}) = N_{x,pa_X} / Σ_x N_{x,pa_X}        (counts)

MAP-estimate (Bayesian statistics): max_Θ log P(D | Θ) P(Θ)

Conjugate priors - Dirichlet, Dir(θ_{pa_X} | α_{1,pa_X}, ..., α_{m,pa_X}):

  MAP(θ_{x,pa_X}) = (N_{x,pa_X} + α_{x,pa_X}) / Σ_x (N_{x,pa_X} + α_{x,pa_X})

The α's act as an equivalent sample size (prior knowledge).
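A minimal sketch of these two estimators on complete data: counting (X, Pa) configurations and normalizing, with or without Dirichlet pseudo-counts. The column names and the toy records are hypothetical.

```python
from collections import Counter

def estimate_cpt(records, child, parents, alpha=1.0):
    """Return {parent assignment: {child value: estimate of P(child | parents)}}.
    alpha = 0 gives the ML estimate; alpha > 0 gives the MAP/Dirichlet estimate."""
    counts = Counter((tuple(r[p] for p in parents), r[child]) for r in records)
    values = sorted({r[child] for r in records})
    cpt = {}
    for pa in {k[0] for k in counts}:
        total = sum(counts[(pa, v)] + alpha for v in values)
        cpt[pa] = {v: (counts[(pa, v)] + alpha) / total for v in values}
    return cpt

data = [  # hypothetical complete records over S (smoking) and B (bronchitis)
    {"S": 1, "B": 1}, {"S": 1, "B": 1}, {"S": 1, "B": 0},
    {"S": 0, "B": 0}, {"S": 0, "B": 0}, {"S": 0, "B": 1},
]
print(estimate_cpt(data, "B", ("S",), alpha=0.0))   # ML: counts / totals
print(estimate_cpt(data, "B", ("S",), alpha=1.0))   # MAP with Dirichlet(1, ..., 1)
```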
Learning graph structure

Find Ĝ = arg max_G Score(G): an NP-hard optimization problem.

- Heuristic search over the space of graphs (local moves: add, delete, or reverse an arc such as S → B)
  - Complete data: local computations (decomposable score)
  - Incomplete data (score non-decomposable): stochastic methods
- Constraint-based methods
  - The data impose independence relations (constraints)
Learning BNs: incomplete data
- Learning parameters
  - EM algorithm [Lauritzen, 95]
  - Gibbs Sampling [Heckerman, 96]
  - Gradient Descent [Russell et al., 96]
- Learning both structure and parameters
  - Sum over missing values [Cooper & Herskovits, 92; Cooper, 95]
  - Monte-Carlo approaches [Heckerman, 96]
  - Gaussian approximation [Heckerman, 96]
  - Structural EM [Friedman, 98]
  - EM and Multiple Imputation [Singh 97, 98, 00]
Learning Parameters: incomplete data
Non-decomposable marginal likelihood (hidden nodes).

EM-algorithm: start from initial parameters and iterate until convergence.
- Expectation: using the current model (G, Θ), run inference (e.g., P(S | X=0, D=1, C=0, B=1)) to compute expected counts from the incomplete data.
  Data (S X D C B): <? 0 1 0 1>, <1 1 ? 0 1>, <0 0 0 ? ?>, <? ? 0 ? 1>, ...
  Expected (completed) counts, e.g.: <1 0 1 0 1>, <1 1 1 0 1>, <0 0 0 0 0>, <1 0 0 0 1>, ...
- Maximization: update the parameters (ML, MAP) from the expected counts.
Learning Parameters: incomplete data
(Lauritzen, 95)
- The complete-data log-likelihood is
  Σ_{i=1}^{n} Σ_{j=1}^{q_i} Σ_{k=1}^{r_i} N_ijk log θ_ijk
- E step: compute E(N_ijk | Y_obs, Θ)
- M step: compute θ_ijk = E(N_ijk | Y_obs, Θ) / E(N_ij | Y_obs, Θ)
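To make the E/M steps concrete, here is a deliberately tiny EM sketch for a two-node network S → B where S is sometimes missing and B is always observed. The data, initial values and iteration count are hypothetical; for larger networks the E step would use the inference machinery discussed earlier.

```python
# Parameters: p_s = P(S=1), p_b[s] = P(B=1 | S=s).
data = [(1, 1), (None, 1), (0, 0), (None, 0), (1, 1), (None, 1)]  # (S, B) records
p_s, p_b = 0.5, {0: 0.3, 1: 0.7}                                  # initial guess

for _ in range(100):
    # E step: expected count of S=1 per record (posterior P(S=1 | B) when S is missing).
    n1 = n0 = b1_s1 = b1_s0 = 0.0
    for s, b in data:
        if s is None:
            lik1 = p_s * (p_b[1] if b == 1 else 1 - p_b[1])
            lik0 = (1 - p_s) * (p_b[0] if b == 1 else 1 - p_b[0])
            w1 = lik1 / (lik1 + lik0)
        else:
            w1 = float(s)
        n1 += w1
        n0 += 1 - w1
        b1_s1 += w1 * (b == 1)
        b1_s0 += (1 - w1) * (b == 1)
    # M step: ML estimates from the expected counts.
    p_s = n1 / len(data)
    p_b = {1: b1_s1 / n1, 0: b1_s0 / n0}

print(p_s, p_b)
```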
Learning structure: incomplete data
- Depends on the type of missing data: missing independently of anything else (MCAR), or missing based on the values of other variables (MAR)
- While MCAR can be resolved by decomposable scores, MAR cannot
- For likelihood-based methods, there is no need to explicitly model the missing-data mechanism
- Very few attempts at MAR: stochastic methods
Learning structure: incomplete data
- Approximate EM by using Multiple Imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]
  - trade-off between performance & quality
  - the learned network is almost optimal
  - approximates the complete-data log-likelihood function using Multiple Imputation
  - yields a decomposable score, dependent only on each node & its parents
  - converges to a local maximum of the observed-data likelihood
Learning structure: incomplete data

Q(B_S | B_S^(t)) = ∫ l(B_S | D^obs, D^mis) P(D^mis | D^obs, B_S^(t)) dD^mis
                 ≈ (1/M) Σ_{s=1}^{M} l(B_S | D^obs, D_s^mis)
                 = (1/M) Σ_{s=1}^{M} log P(D^obs, D_s^mis | B_S)

P(D^mis | D^obs, B_S^(t)) = ∫ P(D^mis | D^obs, B_S^(t), Θ) P(Θ | D^obs, B_S^(t)) dΘ
                          ≈ (1/T) Σ_{r=1}^{T} P(D^mis | D^obs, B_S^(t), Θ_r)
Scoring functions: Minimum Description Length (MDL)
- Learning ⇔ data compression

  MDL(BN | D) = −log P(D | Θ, G) + (log N / 2) |Θ|

  (first term: DL(Data | model); second term: DL(Model))

- Other scores: MDL = −BIC (Bayesian Information Criterion)
- Bayesian score (BDe): asymptotically equivalent to MDL
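A minimal MDL-score sketch for fully observed data and a candidate structure: the negative maximized log-likelihood plus (log N / 2) times the number of free parameters. Variable names and the toy data are hypothetical, and the parameter count uses only the parent configurations seen in the data (a simplification).

```python
import math
from collections import Counter

def mdl_score(records, structure):
    """structure: {node: tuple of parents}. Lower score is better."""
    n = len(records)
    loglik, n_params = 0.0, 0
    for node, parents in structure.items():
        # Counts N(x, pa) and N(pa) give the ML parameter estimates.
        joint = Counter((tuple(r[p] for p in parents), r[node]) for r in records)
        marg = Counter(tuple(r[p] for p in parents) for r in records)
        for (pa, x), c in joint.items():
            loglik += c * math.log(c / marg[pa])
        n_values = len({r[node] for r in records})
        n_params += (n_values - 1) * len(marg)   # observed parent configurations
    return -loglik + 0.5 * math.log(n) * n_params

data = [{"S": 1, "B": 1}, {"S": 1, "B": 0}, {"S": 0, "B": 0},
        {"S": 0, "B": 0}, {"S": 1, "B": 1}, {"S": 0, "B": 1}]
print(mdl_score(data, {"S": (), "B": ("S",)}))   # structure S -> B
print(mdl_score(data, {"S": (), "B": ()}))       # empty graph
```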
Learning Structure plus Parameters

p(Y | D) = Σ_M p(Y | M, D) p(M | D)

The number of models is super-exponential.
Alternatives: Model Selection or Model Averaging.
Model Selection
Generally, choose a single model M*. This is equivalent to saying P(M*|D) = 1:

  p(Y | D) ≈ p(Y | M*, D)

The task is now to: 1) define a metric to decide which model is best; 2) search for that model through the space of all models.
One Reasonable Score: Posterior Probability of a Structure

p(S^h | D) ∝ p(S^h) p(D | S^h)
           = p(S^h) ∫ p(D | θ_S, S^h) p(θ_S | S^h) dθ_S

(structure prior × likelihood, integrated against the parameter prior)
Global and Local Predictive Scores
[Spiegelhalter et al 93]

Bayes' factor: p(D | S^h) / p(D | S_0^h)

log p(D | S^h) = Σ_{l=1}^{m} log p(x_l | x_1, ..., x_{l-1}, S^h)
              = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + ...

The local score is useful for diagnostic problems.
Local Predictive Score
Spiegelhalter et al. (1993)
[Figure: a diagnostic network with a disease node Y and symptom nodes X1, X2, ..., Xn]

pred(S^h) = Σ_{l=1}^{m} log p(y_l | x_l, d_1, ..., d_{l-1}, S^h)
Exact computation of p(D|S^h)
Assumptions:
- no missing data
- cases are independent, given the model
- uniform priors on parameters
- discrete variables

p(D | S^h) = Π_{i=1}^{n} g(X_i, Pa_i)

[Cooper & Herskovits, 92]
Bayesian Dirichlet Score
Cooper and Herskovits (1991)

p(D | S^h) = Π_{i=1}^{n} Π_{j=1}^{q_i} [Γ(α_ij) / Γ(α_ij + N_ij)] Π_{k=1}^{r_i} [Γ(α_ijk + N_ijk) / Γ(α_ijk)]

N_ijk: # cases where X_i = x_i^k and Pa_i = pa_i^j
r_i: number of states of X_i
q_i: number of instances of the parents of X_i
α_ij = Σ_{k=1}^{r_i} α_ijk,   N_ij = Σ_{k=1}^{r_i} N_ijk
Learning BNs without specifying an ordering
- There are n! orderings, and the ordering greatly affects the quality of the network learned.
- Use conditional independence tests and d-separation to obtain an ordering.
[Singh & Valtorta, 95]
Learning BNs via the MDL principle
- Idea: the best model is the one that gives the most compact representation of the data.
- So, encode the data using the model, plus encode the model itself, and minimize the total.
[Lam & Bacchus, 93]
Learning BNs: summary
- Bayesian Networks: graphical probabilistic models
- Efficient representation and inference
- Expert knowledge + learning from data
- Learning:
  - parameters (parameter estimation, EM)
  - structure (optimization with score functions, e.g. MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TANBLT (SRI))
- Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.