CMSC 671
Fall 2010
Class #18/19 – Wednesday, November 3 /
Monday, November 8
Some material borrowed with
permission from Lise Getoor
1
Next two classes
• Probability theory (quick review!)
• Bayesian networks
– Network structure
– Conditional probability tables
– Conditional independence
• Bayesian inference
– From the joint distribution
– Using independence/factoring
– From sources of evidence
2
Bayesian Reasoning
Chapter 13
3
Sources of uncertainty
• Uncertain inputs
– Missing data
– Noisy data
• Uncertain knowledge
– Multiple causes lead to multiple effects
– Incomplete enumeration of conditions or effects
– Incomplete knowledge of causality in the domain
– Probabilistic/stochastic effects
• Uncertain outputs
– Abduction and induction are inherently uncertain
– Default reasoning, even in deductive fashion, is uncertain
– Incomplete deductive inference may be uncertain
Probabilistic reasoning only gives probabilistic
results (summarizes uncertainty from various sources)
4
Decision making with uncertainty
• Rational behavior:
– For each possible action, identify the possible outcomes
– Compute the probability of each outcome
– Compute the utility of each outcome
– Compute the probability-weighted (expected) utility
over possible outcomes for each action
– Select the action with the highest expected utility
(principle of Maximum Expected Utility)
5
Why probabilities anyway?
• Kolmogorov showed that three simple axioms lead to the rules of probability theory
  – De Finetti, Cox, and Carnap have also provided compelling arguments for these axioms
1. All probabilities are between 0 and 1:
   • 0 ≤ P(a) ≤ 1
2. Valid propositions (tautologies) have probability 1, and unsatisfiable propositions have probability 0:
   • P(true) = 1 ; P(false) = 0
3. The probability of a disjunction is given by:
   • P(a ∨ b) = P(a) + P(b) – P(a ∧ b)

  [Venn diagram: circles a and b, overlapping in the region a ∧ b]
6
Probability theory
• Random variables
  – Domain (e.g., Alarm, Burglary, Earthquake)
  – Boolean (like these), discrete, continuous
• Atomic event: complete specification of state
  – e.g., Alarm=true ∧ Burglary=true ∧ Earthquake=false
    (alarm ∧ burglary ∧ ¬earthquake)
• Prior probability: degree of belief without any other evidence
  – e.g., P(Burglary) = .1
• Joint probability: matrix of combined probabilities of a set of variables
  – e.g., P(Alarm, Burglary) =

                   alarm   ¬alarm
      burglary      .09      .01
      ¬burglary     .1       .8
7
Probability theory (cont.)
• Conditional probability: probability of effect given causes
  – e.g., P(burglary | alarm) = .47, P(alarm | burglary) = .9
• Computing conditional probabilities:
  – P(a | b) = P(a ∧ b) / P(b)
  – P(b): normalizing constant
  – e.g., P(burglary | alarm) = P(burglary ∧ alarm) / P(alarm) = .09 / .19 = .47
• Product rule:
  – P(a ∧ b) = P(a | b) P(b)
  – e.g., P(burglary ∧ alarm) = P(burglary | alarm) P(alarm) = .47 * .19 = .09
• Marginalizing:
  – P(B) = Σa P(B, a)
  – P(B) = Σa P(B | a) P(a)   (conditioning)
  – e.g., P(alarm) = P(alarm ∧ burglary) + P(alarm ∧ ¬burglary) = .09 + .1 = .19
8
Example: Inference from the joint

P(Burglary, Alarm, Earthquake):

                       alarm                       ¬alarm
              earthquake  ¬earthquake     earthquake  ¬earthquake
  burglary       .01          .08            .001         .009
  ¬burglary      .01          .09            .01          .79

P(Burglary | alarm) = α P(Burglary, alarm)
  = α [P(Burglary, alarm, earthquake) + P(Burglary, alarm, ¬earthquake)]
  = α [(.01, .01) + (.08, .09)]
  = α (.09, .1)
Since P(burglary | alarm) + P(¬burglary | alarm) = 1, α = 1/(.09 + .1) = 5.26
(i.e., P(alarm) = 1/α = .19 – quizlet: how can you verify this?)
P(burglary | alarm) = .09 * 5.26 = .474
P(¬burglary | alarm) = .1 * 5.26 = .526
9
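A minimal Python sketch of this calculation (the dictionary layout and names below are illustrative assumptions, not from the slides): store the eight joint entries, sum out Earthquake with Alarm fixed to true, and normalize over Burglary.

```python
# Joint distribution P(Burglary, Alarm, Earthquake) from the table above,
# keyed by (burglary, alarm, earthquake).
full_joint = {
    (True,  True,  True):  .01,  (True,  True,  False): .08,
    (True,  False, True):  .001, (True,  False, False): .009,
    (False, True,  True):  .01,  (False, True,  False): .09,
    (False, False, True):  .01,  (False, False, False): .79,
}

def burglary_given_alarm(joint_table):
    # Sum out Earthquake while keeping Alarm = true, then normalize.
    unnormalized = {
        b: sum(p for (bb, a, e), p in joint_table.items() if bb == b and a)
        for b in (True, False)
    }
    alpha = 1.0 / sum(unnormalized.values())   # 1 / P(alarm) = 1 / .19
    return {b: alpha * p for b, p in unnormalized.items()}

print(burglary_given_alarm(full_joint))        # ~{True: .474, False: .526}
```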
Exercise: Inference from the joint

P(smart ∧ study ∧ prep):

                      smart                ¬smart
                 study   ¬study       study   ¬study
  prepared        .432     .16         .084     .008
  ¬prepared       .048     .16         .036     .072
• Queries:
– What is the prior probability of smart?
– What is the prior probability of study?
– What is the conditional probability of prepared, given
study and smart?
• Save these answers for next time!
10
Independence
• When two sets of propositions do not affect each others’
probabilities, we call them independent, and can easily
compute their joint and conditional probability:
– Independent (A, B) → P(A ∧ B) = P(A) P(B), P(A | B) = P(A)
• For example, {moon-phase, light-level} might be
independent of {burglary, alarm, earthquake}
– Then again, it might not: Burglars might be more likely to
burglarize houses when there’s a new moon (and hence little light)
– But if we know the light level, the moon phase doesn’t affect
whether we are burglarized
– Once we’re burglarized, light level doesn’t affect whether the alarm
goes off
• We need a more complex notion of independence, and
methods for reasoning about these kinds of relationships
11
Exercise: Independence

P(smart ∧ study ∧ prep):

                      smart                ¬smart
                 study   ¬study       study   ¬study
  prepared        .432     .16         .084     .008
  ¬prepared       .048     .16         .036     .072
• Queries:
– Is smart independent of study?
– Is prepared independent of study?
12
Conditional independence
• Absolute independence:
– A and B are independent if P(A  B) = P(A) P(B); equivalently,
P(A) = P(A | B) and P(B) = P(B | A)
• A and B are conditionally independent given C if
– P(A  B | C) = P(A | C) P(B | C)
• This lets us decompose the joint distribution:
– P(A  B  C) = P(A | C) P(B | C) P(C)
• Moon-Phase and Burglary are conditionally independent
given Light-Level
• Conditional independence is weaker than absolute
independence, but still useful in decomposing the full joint
probability distribution
13
Exercise: Conditional independence

P(smart ∧ study ∧ prep):

                      smart                ¬smart
                 study   ¬study       study   ¬study
  prepared        .432     .16         .084     .008
  ¬prepared       .048     .16         .036     .072
• Queries:
– Is smart conditionally independent of prepared, given
study?
– Is study conditionally independent of prepared, given
smart?
14
Bayes’s rule
• Bayes’s rule is derived from the product rule:
– P(Y | X) = P(X | Y) P(Y) / P(X)
• Often useful for diagnosis:
– If X are (observed) effects and Y are (hidden) causes,
– We may have a model for how causes lead to effects (P(X | Y))
– We may also have prior beliefs (based on experience) about the frequency of
occurrence of causes (P(Y))
– Which allows us to reason abductively from effects to causes (P(Y | X))
15
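A minimal Python sketch of Bayes's rule used diagnostically, plugging in the alarm numbers from the earlier slides (P(alarm | burglary) = .9, P(burglary) = .1, P(alarm) = .19); the function name is just an illustrative choice.

```python
def bayes(likelihood, prior, evidence):
    """P(Y | X) = P(X | Y) P(Y) / P(X)."""
    return likelihood * prior / evidence

# Reason from the observed effect (alarm) back to the hidden cause (burglary):
p_burglary_given_alarm = bayes(likelihood=.9, prior=.1, evidence=.19)
print(round(p_burglary_given_alarm, 3))   # 0.474
```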
Bayesian inference
• In the setting of diagnostic/evidential reasoning:

  [Figure: hypotheses H1, …, Hi, … with priors P(Hi); evidence/manifestations
   E1, …, Ej, …, Em linked to the hypotheses by conditional probabilities P(Ej | Hi)]

  – Know the prior probability of each hypothesis, P(Hi), and the conditional
    probability of each manifestation, P(Ej | Hi)
  – Want to compute the posterior probability P(Hi | Ej)
• Bayes' theorem (formula 1):
    P(Hi | Ej) = P(Hi) P(Ej | Hi) / P(Ej)
16
Simple Bayesian diagnostic reasoning
• Knowledge base:
  – Evidence / manifestations: E1, …, Em
  – Hypotheses / disorders: H1, …, Hn
    • Ej and Hi are binary; hypotheses are mutually exclusive (non-overlapping)
      and exhaustive (cover all possible cases)
  – Conditional probabilities: P(Ej | Hi), i = 1, …, n; j = 1, …, m
• Cases (evidence for a particular instance): E1, …, El
• Goal: Find the hypothesis Hi with the highest posterior
  – Maxi P(Hi | E1, …, El)
17
Bayesian diagnostic reasoning II
• Bayes’ rule says that
– P(Hi | E1, …, El) = P(E1, …, El | Hi) P(Hi) / P(E1, …, El)
• Assume each piece of evidence Ej is conditionally independent of the others,
given a hypothesis Hi; then:
– P(E1, …, El | Hi) = ∏j=1..l P(Ej | Hi)
• If we only care about relative probabilities for the Hi, then we have:
– P(Hi | E1, …, El) = α P(Hi) ∏j=1..l P(Ej | Hi)
18
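A minimal Python sketch of this relative-posterior computation; the hypotheses, priors, and likelihood values below are made-up illustration numbers, not from the slides.

```python
from math import prod

priors = {'flu': 0.1, 'cold': 0.3, 'healthy': 0.6}           # P(Hi), assumed values
likelihoods = {                                               # P(Ej | Hi), assumed values
    'flu':     {'fever': 0.9,  'cough': 0.8},
    'cold':    {'fever': 0.2,  'cough': 0.7},
    'healthy': {'fever': 0.01, 'cough': 0.1},
}

def posterior(evidence):
    # P(Hi | E1..El) = alpha * P(Hi) * prod_j P(Ej | Hi), under the
    # conditional-independence assumption above.
    unnormalized = {h: priors[h] * prod(likelihoods[h][e] for e in evidence)
                    for h in priors}
    alpha = 1.0 / sum(unnormalized.values())
    return {h: alpha * u for h, u in unnormalized.items()}

print(posterior(['fever', 'cough']))   # 'flu' gets the highest posterior
```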
Limitations of simple
Bayesian inference
• Cannot easily handle multi-fault situation, nor cases where
intermediate (hidden) causes exist:
– Disease D causes syndrome S, which causes correlated
manifestations M1 and M2
• Consider a composite hypothesis H1 ∧ H2, where H1 and H2 are independent.
What is the relative posterior?
– P(H1 ∧ H2 | E1, …, El) = α P(E1, …, El | H1 ∧ H2) P(H1 ∧ H2)
  = α P(E1, …, El | H1 ∧ H2) P(H1) P(H2)
  = α ∏j=1..l P(Ej | H1 ∧ H2) P(H1) P(H2)
• How do we compute P(Ej | H1 ∧ H2)?
19
Limitations of simple Bayesian
inference II
• Assume H1 and H2 are independent, given E1, …, El?
– P(H1 ∧ H2 | E1, …, El) = P(H1 | E1, …, El) P(H2 | E1, …, El)
• This is a very unreasonable assumption
– Earthquake and Burglar are independent, but not given Alarm:
• P(burglar | alarm, earthquake) << P(burglar | alarm)
• Another limitation is that simple application of Bayes’s rule doesn’t
allow us to handle causal chaining:
– A: this year’s weather; B: cotton production; C: next year’s cotton price
– A influences C indirectly: A → B → C
– P(C | B, A) = P(C | B)
• Need a richer representation to model interacting hypotheses,
conditional independence, and causal chaining
• Next time: conditional independence and Bayesian networks!
20
Bayesian Networks
Chapter 14.1-14.3
Some material borrowed
from Lise Getoor
21
Bayesian Belief Networks (BNs)
• Definition: BN = (DAG, CPD)
– DAG: directed acyclic graph (BN’s structure)
• Nodes: random variables (typically binary or discrete, but
methods also exist to handle continuous variables)
• Arcs: indicate probabilistic dependencies between nodes
(lack of link signifies conditional independence)
– CPD: conditional probability distribution (BN's parameters)
  • Conditional probabilities at each node, usually stored as a table
    (conditional probability table, or CPT):
      P(xi | πi), where πi is the set of all parent nodes of xi
– Root nodes are a special case – no parents, so just use priors in CPD:
      πi = ∅, so P(xi | πi) = P(xi)
22
Example BN

  [Network structure: A → B, A → C; B and C → D; C → E]

  P(A) = 0.001
  P(B|A) = 0.3         P(B|¬A) = 0.001
  P(C|A) = 0.2         P(C|¬A) = 0.005
  P(D|B,C) = 0.1       P(D|B,¬C) = 0.01
  P(D|¬B,C) = 0.01     P(D|¬B,¬C) = 0.00001
  P(E|C) = 0.4         P(E|¬C) = 0.002

Note that we only specify P(A) etc., not P(¬A), since they have to add to one
23
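One possible Python representation of these CPTs (the names cpt, parents, and p are illustrative assumptions): store only P(node = true | parent values), since P(node = false | …) is one minus that entry.

```python
# CPT entries give P(node = true | parent values); parents lists each node's parents.
cpt = {
    'A': {(): 0.001},
    'B': {(True,): 0.3,  (False,): 0.001},               # parent: A
    'C': {(True,): 0.2,  (False,): 0.005},               # parent: A
    'D': {(True, True): 0.1,   (True, False): 0.01,      # parents: B, C
          (False, True): 0.01, (False, False): 0.00001},
    'E': {(True,): 0.4,  (False,): 0.002},               # parent: C
}
parents = {'A': (), 'B': ('A',), 'C': ('A',), 'D': ('B', 'C'), 'E': ('C',)}

def p(node, value, assignment):
    """P(node = value | parent values taken from assignment)."""
    p_true = cpt[node][tuple(assignment[q] for q in parents[node])]
    return p_true if value else 1.0 - p_true
```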
Conditional independence and chaining
• Conditional independence assumption:
    P(xi | πi, q) = P(xi | πi)
  where q is any set of variables (nodes) other than xi and its successors
  – πi blocks the influence of other nodes on xi and its successors
    (q influences xi only through variables in πi)
  – With this assumption, the complete joint probability distribution of all
    variables in the network can be represented by (recovered from) local CPDs
    by chaining these CPDs:
    P(x1, …, xn) = ∏i=1..n P(xi | πi)
24
Chaining: Example

  [Same network as before: A → B, A → C; B and C → D; C → E]

Computing the joint probability for all variables is easy:
P(a, b, c, d, e)
  = P(e | a, b, c, d) P(a, b, c, d)        (by the product rule)
  = P(e | c) P(a, b, c, d)                 (by the cond. indep. assumption)
  = P(e | c) P(d | a, b, c) P(a, b, c)
  = P(e | c) P(d | b, c) P(c | a, b) P(a, b)
  = P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)
25
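A minimal Python sketch of this chaining step, reusing the cpt/parents/p() representation sketched after the example BN: the full joint is just the product of each node's CPT entry given its parents.

```python
def joint(assignment, order=('A', 'B', 'C', 'D', 'E')):
    """P(x1, ..., xn) = product over nodes of P(xi | parents(xi))."""
    result = 1.0
    for node in order:                      # parents precede children in order
        result *= p(node, assignment[node], assignment)
    return result

# P(a, b, c, d, e) with every variable true:
# = P(e|c) P(d|b,c) P(c|a) P(b|a) P(a) = 0.4 * 0.1 * 0.2 * 0.3 * 0.001
print(joint({'A': True, 'B': True, 'C': True, 'D': True, 'E': True}))
```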
Topological semantics
• A node is conditionally independent of its nondescendants given its parents
• A node is conditionally independent of all other nodes in
the network given its parents, children, and children’s
parents (also known as its Markov blanket)
• The method called d-separation can be applied to decide
whether a set of nodes X is independent of another set Y,
given a third set Z
26
Inference in Bayesian
Networks
Chapter 14.4-14.5
Some material borrowed
from Lise Getoor
28
Inference tasks
• Simple queries: Compute the posterior marginal P(Xi | E=e)
– E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)
• Conjunctive queries:
– P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e)
• Optimal decisions: Decision networks include utility
information; probabilistic inference is required to find
P(outcome | action, evidence)
• Value of information: Which evidence should we seek next?
• Sensitivity analysis: Which probability values are most
critical?
• Explanation: Why do I need a new starter motor?
29
Approaches to inference
• Exact inference
  – Enumeration
  – Belief propagation in polytrees
  – Variable elimination
  – Clustering / join tree algorithms
• Approximate inference
  – Stochastic simulation / sampling methods
  – Markov chain Monte Carlo methods
  – Genetic algorithms
  – Neural networks
  – Simulated annealing
  – Mean field theory
30
Direct inference with BNs
• Instead of computing the joint, suppose we just want the
probability for one variable
• Exact methods of computation:
– Enumeration
– Variable elimination
– Join trees: get the probabilities associated with every query variable
31
Inference by enumeration
• Add all of the terms (atomic event probabilities) from the
full joint distribution
• If E are the evidence (observed) variables and Y are the other (unobserved)
variables, then:
  P(X | e) = α P(X, e) = α Σy P(X, e, y)
• Each P(X, e, y) term can be computed using the chain rule
• Computationally expensive!
32
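A minimal Python sketch of enumeration, reusing the joint() chaining sketch above: sum the chained joint over every setting of the hidden variables and normalize over the query variable (enumerate_ask is an illustrative name, not a library call).

```python
from itertools import product as cartesian

def enumerate_ask(query_var, evidence, variables):
    hidden = [v for v in variables if v != query_var and v not in evidence]
    unnormalized = {}
    for qval in (True, False):
        total = 0.0
        for values in cartesian((True, False), repeat=len(hidden)):
            assignment = dict(evidence, **{query_var: qval},
                              **dict(zip(hidden, values)))
            total += joint(assignment, order=variables)   # chain-rule product
        unnormalized[qval] = total
    alpha = 1.0 / sum(unnormalized.values())
    return {v: alpha * t for v, t in unnormalized.items()}

# E.g., P(D | E = true) in the five-node example network:
print(enumerate_ask('D', {'E': True}, ('A', 'B', 'C', 'D', 'E')))
```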
Example: Enumeration

  [Same five-node network: A → B, A → C; B and C → D; C → E]

• P(xi) = Σπi P(xi | πi) P(πi)
• Suppose we want P(D=true), and only the value of E is given as true
• P(d | e) = α ΣA ΣB ΣC P(a, b, c, d, e)
           = α ΣA ΣB ΣC P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
• With simple iteration to compute this expression, there's going to be a lot
  of repetition (e.g., P(e|c) has to be recomputed every time we iterate over
  C=true)
33
Exercise: Enumeration

  [Network: smart → prepared, study → prepared; smart, prepared, fair → pass]

  p(smart) = .8     p(study) = .6     p(fair) = .9

  p(prep | …):          smart    ¬smart
          study           .9        .7
          ¬study          .5        .1

  p(pass | …):       smart             ¬smart
                 prep   ¬prep      prep   ¬prep
        fair      .9      .7        .7      .2
        ¬fair     .1      .1        .1      .1

Query: What is the probability that a student studied, given that they pass
the exam?
34
Variable elimination
• Basically just enumeration, but with caching of local
calculations
• Linear for polytrees (singly connected BNs)
• Potentially exponential for multiply connected BNs
Exact inference in Bayesian networks is NP-hard!
• Join tree algorithms are an extension of variable elimination
methods that compute posterior probabilities for all nodes
in a BN simultaneously
35
Variable elimination
General idea:
• Write the query in the form
    P(Xn, e) = Σxk … Σx3 Σx2 ∏i P(xi | pai)
• Iteratively:
  – Move all irrelevant terms outside of the innermost sum
  – Perform the innermost sum, getting a new term
  – Insert the new term into the product
36
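A minimal Python sketch (one assumed representation) of the two factor operations this loop relies on: a factor is a table mapping value tuples to numbers plus the list of variables it mentions; factors are multiplied pointwise, then a variable is summed out.

```python
from itertools import product as cartesian

def multiply(f1, vars1, f2, vars2):
    """Pointwise product of two factors; returns (table, variable list)."""
    out_vars = list(dict.fromkeys(list(vars1) + list(vars2)))
    table = {}
    for values in cartesian((True, False), repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        table[values] = (f1[tuple(row[v] for v in vars1)] *
                         f2[tuple(row[v] for v in vars2)])
    return table, out_vars

def sum_out(var, factor, variables):
    """Sum the named variable out of a factor, producing a smaller factor."""
    keep = [v for v in variables if v != var]
    table = {}
    for values, prob in factor.items():
        row = dict(zip(variables, values))
        key = tuple(row[v] for v in keep)
        table[key] = table.get(key, 0.0) + prob
    return table, keep
```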
Variable elimination: Example

  [Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler and Rain → WetGrass]

  P(w) = Σr,s,c P(w | r,s) P(r | c) P(s | c) P(c)
       = Σr,s P(w | r,s) Σc P(r | c) P(s | c) P(c)
       = Σr,s P(w | r,s) f1(r,s)
37
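A usage sketch of those helpers on the first step of this derivation (multiply P(r|c), P(s|c), and P(c), then sum out c to get f1(r, s)); the Cloudy/Rain/Sprinkler numbers below are illustrative values, not given on the slide.

```python
p_c  = ({(True,): 0.5, (False,): 0.5}, ['C'])                     # P(c), assumed values
p_rc = ({(True, True): 0.8, (True, False): 0.2,
         (False, True): 0.2, (False, False): 0.8}, ['R', 'C'])    # P(r|c), assumed values
p_sc = ({(True, True): 0.1, (True, False): 0.5,
         (False, True): 0.9, (False, False): 0.5}, ['S', 'C'])    # P(s|c), assumed values

table, vs = multiply(*p_rc, *p_sc)       # P(r|c) P(s|c)
table, vs = multiply(table, vs, *p_c)    # ... times P(c)
f1, f1_vars = sum_out('C', table, vs)    # f1(r, s), as in the derivation above
print(f1, f1_vars)
```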
A more complex example
• "Asia" network:

  [Nodes: Visit to Asia (V), Smoking (S), Tuberculosis (T), Lung Cancer (L),
   Bronchitis (B), Abnormality in Chest (A), X-Ray (X), Dyspnea (D);
   arcs: V → T, S → L, S → B, T → A, L → A, A → X, A → D, B → D]
39
• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b
• Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
40
• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b
• Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• Eliminate: v.  Compute:  fv(t) = Σv P(v) P(t|v)
    ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• Note: fv(t) = P(t)
• In general, the result of elimination is not necessarily a probability term
41
• We want to compute P(d)
• Need to eliminate: s, x, t, l, a, b
• Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• Eliminate: s.  Compute:  fs(b,l) = Σs P(s) P(b|s) P(l|s)
    ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
• Summing on s results in a factor with two arguments, fs(b,l)
• In general, the result of elimination may be a function of several variables
42
• We want to compute P(d)
• Need to eliminate: x, t, l, a, b
• Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
• Eliminate: x.  Compute:  fx(a) = Σx P(x|a)
    ⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
• Note: fx(a) = 1 for all values of a!
43
• We want to compute P(d)
• Need to eliminate: t, l, a, b
• Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
• Eliminate: t.  Compute:  ft(a,l) = Σt fv(t) P(a|t,l)
    ⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)
44
• We want to compute P(d)
• Need to eliminate: l, a, b
• Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
    ⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)
• Eliminate: l.  Compute:  fl(a,b) = Σl fs(b,l) ft(a,l)
    ⇒ fl(a,b) fx(a) P(d|a,b)
45
• We want to compute P(d)
• Need to eliminate: a, b
• Initial factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b)
    ⇒ fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b)
    ⇒ fs(b,l) fx(a) ft(a,l) P(d|a,b)
    ⇒ fl(a,b) fx(a) P(d|a,b)  ⇒  fa(b,d)  ⇒  fb(d)
• Eliminate: a, b.  Compute:
    fa(b,d) = Σa fl(a,b) fx(a) P(d|a,b)        fb(d) = Σb fa(b,d)
46
Dealing with evidence
• How do we deal with evidence?

  [Same Asia network: V, S, T, L, B, A, X, D]

• Suppose we are given evidence V = t, S = f, D = t
• We want to compute P(L, V = t, S = f, D = t)
47
Dealing with evidence
• We start by writing the factors:
    P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• Since we know that V = t, we don't need to eliminate V
• Instead, we can replace the factors P(V) and P(T|V) with
    fP(V) = P(V = t)        fP(T|V)(T) = P(T | V = t)
• These "select" the appropriate parts of the original factors given the evidence
• Note that fP(V) is a constant, and thus does not appear in elimination of
  other variables
48
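A minimal Python sketch of this evidence step, using the factor representation sketched earlier: rather than eliminating an observed variable, each factor that mentions it is restricted to the observed value (restrict is an illustrative name).

```python
def restrict(factor, variables, var, value):
    """Keep only the rows of a factor where var == value, and drop var."""
    keep = [v for v in variables if v != var]
    table = {}
    for values, prob in factor.items():
        row = dict(zip(variables, values))
        if row[var] == value:
            table[tuple(row[v] for v in keep)] = prob
    return table, keep

# E.g., turning a P(T|V) table (a hypothetical p_tv_table over ['T', 'V'])
# into f_{P(T|V)}(T) = P(T | V = t):
# restricted, rest_vars = restrict(p_tv_table, ['T', 'V'], 'V', True)
```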
Dealing with evidence
• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
49
Dealing with evidence
• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
• Eliminating x, we get
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) fx(a) fP(d|a,b)(a,b)
50
Dealing with evidence
• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
• Eliminating x, we get
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) fx(a) fP(d|a,b)(a,b)
• Eliminating t, we get
    fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) ft(a,l) fx(a) fP(d|a,b)(a,b)
51
Dealing with evidence
• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
• Eliminating x, we get
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) fx(a) fP(d|a,b)(a,b)
• Eliminating t, we get
    fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) ft(a,l) fx(a) fP(d|a,b)(a,b)
• Eliminating a, we get
    fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) fa(b,l)
52
Dealing with evidence
• Given evidence V = t, S = f, D = t
• Compute P(L, V = t, S = f, D = t)
• Initial factors, after setting evidence:
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) P(x|a) fP(d|a,b)(a,b)
• Eliminating x, we get
    fP(v) fP(s) fP(t|v)(t) fP(l|s)(l) fP(b|s)(b) P(a|t,l) fx(a) fP(d|a,b)(a,b)
• Eliminating t, we get
    fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) ft(a,l) fx(a) fP(d|a,b)(a,b)
• Eliminating a, we get
    fP(v) fP(s) fP(l|s)(l) fP(b|s)(b) fa(b,l)
• Eliminating b, we get
    fP(v) fP(s) fP(l|s)(l) fb(l)
53
Variable elimination algorithm
• Let X1, …, Xm be an ordering on the non-query variables:
    ΣX1 ΣX2 … ΣXm ∏j P(Xj | Parents(Xj))
• For i = m, …, 1:
  – Leave in the summation for Xi only factors mentioning Xi
  – Multiply the factors, getting a factor that contains a number for each
    value of the variables mentioned, including Xi
  – Sum out Xi, getting a factor f that contains a number for each value of
    the variables mentioned, not including Xi
  – Replace the multiplied factor in the summation
54
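A minimal Python sketch of this loop, built on the multiply/sum_out helpers sketched earlier; factors are (table, variable-list) pairs and eliminate is an illustrative name, not a library routine.

```python
def eliminate(factors, elimination_order):
    """Sum out each variable in turn; returns the single remaining factor."""
    for var in elimination_order:
        mentioning = [f for f in factors if var in f[1]]   # factors mentioning var
        if not mentioning:
            continue
        rest = [f for f in factors if var not in f[1]]
        table, vs = mentioning[0]
        for tbl, tbl_vars in mentioning[1:]:               # multiply them together...
            table, vs = multiply(table, vs, tbl, tbl_vars)
        factors = rest + [sum_out(var, table, vs)]         # ...then sum out var
    result, vs = factors[0]
    for tbl, tbl_vars in factors[1:]:                      # combine what remains
        result, vs = multiply(result, vs, tbl, tbl_vars)
    return result, vs
```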
Complexity of variable elimination
Suppose in one elimination step we compute
    fx(y1, …, yk) = Σx f'x(x, y1, …, yk)
    f'x(x, y1, …, yk) = ∏i=1..m fi(x, yi,1, …, yi,li)
This requires
    m · |Val(X)| · ∏i |Val(Yi)|
multiplications (for each value of x, y1, …, yk, we do m multiplications) and
    |Val(X)| · ∏i |Val(Yi)|
additions (for each value of y1, …, yk, we do |Val(X)| additions)
► Complexity is exponential in the number of variables in the intermediate factors
► Finding an optimal ordering is NP-hard
55
Exercise: Variable elimination

  [Network: smart → prepared, study → prepared; smart, prepared, fair → pass]

  p(smart) = .8     p(study) = .6     p(fair) = .9

  p(prep | …):          smart    ¬smart
          study           .9        .7
          ¬study          .5        .1

  p(pass | …):       smart             ¬smart
                 prep   ¬prep      prep   ¬prep
        fair      .9      .7        .7      .2
        ¬fair     .1      .1        .1      .1

Query: What is the probability that a student is smart, given that they pass
the exam?
56
Conditioning

  [Same five-node network: A → B, A → C; B and C → D; C → E]

• Conditioning: Find the network's smallest cutset S (a set of nodes whose
  removal renders the network singly connected)
  – In this network, S = {A} or {B} or {C} or {D}
• For each instantiation of S, compute the belief update with your favorite
  inference algorithm
• Combine the results from all instantiations of S
• Computationally expensive (finding the smallest cutset is in general
  NP-hard, and the total number of possible instantiations of S is O(2^|S|))
57
Approximate inference:
Direct sampling
• Suppose you are given values for some subset of the
variables, E, and want to infer values for unknown
variables, Z
• Randomly generate a very large number of instantiations
from the BN
– Generate instantiations for all variables – start at root variables and
work your way “forward” in topological order
• Rejection sampling: Only keep those instantiations that are
consistent with the values for E
• Use the frequency of values for Z to get estimated
probabilities
• Accuracy of the results depends on the size of the sample
(asymptotically approaches exact results)
58
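A minimal Python sketch of direct sampling with rejection, reusing the cpt/parents/p() representation sketched earlier: sample each variable in topological order, keep only the samples consistent with the evidence, and estimate the query from frequencies.

```python
import random

def prior_sample(order):
    """Sample every variable top-down from its CPT, parents before children."""
    sample = {}
    for node in order:
        sample[node] = random.random() < p(node, True, sample)
    return sample

def rejection_sample(query_var, evidence, order, n=100_000):
    counts = {True: 0, False: 0}
    for _ in range(n):
        s = prior_sample(order)
        if all(s[var] == val for var, val in evidence.items()):
            counts[s[query_var]] += 1                      # consistent: keep it
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()} if total else None

# E.g., estimate P(D | E = true) in the earlier five-node network:
print(rejection_sample('D', {'E': True}, ('A', 'B', 'C', 'D', 'E')))
```

When the evidence is unlikely, most samples are rejected and the estimate is noisy, which motivates likelihood weighting on the next slide.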
Exercise: Direct sampling

  [Network: smart → prepared, study → prepared; smart, prepared, fair → pass]

  p(smart) = .8     p(study) = .6     p(fair) = .9

  p(prep | …):          smart    ¬smart
          study           .9        .7
          ¬study          .5        .1

  p(pass | …):       smart             ¬smart
                 prep   ¬prep      prep   ¬prep
        fair      .9      .7        .7      .2
        ¬fair     .1      .1        .1      .1

Topological order = …?
Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42
59
Likelihood weighting
• Idea: Don’t generate samples that need to be rejected in the
first place!
• Sample only from the unknown variables Z
• Weight each sample according to the likelihood that it
would occur, given the evidence E
60
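A minimal Python sketch of likelihood weighting over the same cpt/parents/p() representation: evidence variables are fixed rather than sampled, and each sample is weighted by the likelihood of the evidence given its parents.

```python
import random

def weighted_sample(order, evidence):
    sample, weight = {}, 1.0
    for node in order:
        if node in evidence:
            sample[node] = evidence[node]
            weight *= p(node, evidence[node], sample)   # weight by evidence likelihood
        else:
            sample[node] = random.random() < p(node, True, sample)
    return sample, weight

def likelihood_weighting(query_var, evidence, order, n=100_000):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        s, w = weighted_sample(order, evidence)
        totals[s[query_var]] += w                       # accumulate weights, not counts
    alpha = 1.0 / (totals[True] + totals[False])
    return {v: alpha * t for v, t in totals.items()}

print(likelihood_weighting('D', {'E': True}, ('A', 'B', 'C', 'D', 'E')))
```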
Markov chain Monte Carlo algorithm
• So called because
– Markov chain – each instance generated in the sample is dependent
on the previous instance
– Monte Carlo – statistical sampling method
• Perform a random walk through variable assignment space,
collecting statistics as you go
– Start with a random instantiation, consistent with evidence variables
– At each step, for some nonevidence variable, randomly sample its
value, consistent with the other current assignments
• Given enough samples, MCMC gives an accurate estimate
of the true distribution of values
61
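A minimal Python sketch of MCMC (Gibbs) sampling over the same representation: start from a random state consistent with the evidence, then repeatedly resample one non-evidence variable in proportion to P(x | parents) times the CPT entries of its children, i.e. its distribution given its Markov blanket.

```python
import random

def blanket_prob(node, value, state, order):
    """Unnormalized P(node = value | Markov blanket of node)."""
    trial = dict(state, **{node: value})
    prob = p(node, value, trial)
    for child in order:
        if node in parents[child]:
            prob *= p(child, trial[child], trial)
    return prob

def gibbs(query_var, evidence, order, n=50_000):
    nonevidence = [v for v in order if v not in evidence]
    state = dict(evidence, **{v: random.random() < 0.5 for v in nonevidence})
    counts = {True: 0, False: 0}
    for _ in range(n):
        var = random.choice(nonevidence)                 # random walk over assignments
        pt = blanket_prob(var, True, state, order)
        pf = blanket_prob(var, False, state, order)
        state[var] = random.random() < pt / (pt + pf)
        counts[state[query_var]] += 1                    # collect statistics as we go
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}

print(gibbs('D', {'E': True}, ('A', 'B', 'C', 'D', 'E')))
```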
Exercise: MCMC sampling

  [Network: smart → prepared, study → prepared; smart, prepared, fair → pass]

  p(smart) = .8     p(study) = .6     p(fair) = .9

  p(prep | …):          smart    ¬smart
          study           .9        .7
          ¬study          .5        .1

  p(pass | …):       smart             ¬smart
                 prep   ¬prep      prep   ¬prep
        fair      .9      .7        .7      .2
        ¬fair     .1      .1        .1      .1

Topological order = …?
Random number generator: .35, .76, .51, .44, .08, .28, .03, .92, .02, .42
62
Summary
• Bayes nets
– Structure
– Parameters
– Conditional independence
– Chaining
• BN inference
– Enumeration
– Variable elimination
– Sampling methods
63