UIUC CS 497: Section EA
Lecture #6
Reasoning in Artificial Intelligence
Professor: Eyal Amir
Spring Semester 2004
(Based on slides by Lise Getoor and Alvaro Cardenas (UMD)
(in turn based on slides by Nir Friedman (Hebrew U)))
Last Time
• Tree decomposition in first-order logic
– Provably better computational bounds when
the propositional treewidth is low
– Eliminate possible interactions between
clauses
• Applications:
– Planning, spatial reasoning
Today
1. Probabilistic graphical models
2. Treewidth methods:
1. Variable elimination
2. Clique tree algorithm
3. Applications du jour: Sensor Networks
Independent Random Variables
• Two variables X and Y are independent if
– P(X = x|Y = y) = P(X = x) for all values x,y
– That is, learning the values of Y does not
change prediction of X
• If X and Y are independent then
– P(X,Y) = P(X|Y)P(Y) = P(X)P(Y)
• In general, if X1,…,Xp are independent,
then P(X1,…,Xp) = P(X1)···P(Xp)
– Requires only O(p) parameters
Conditional Independence
• Unfortunately, most random variables of
interest are not independent of each other
• A more suitable notion is that of
conditional independence
• Two variables X and Y are conditionally
independent given Z if
– P(X = x|Y = y,Z=z) = P(X = x|Z=z) for all values x,y,z
– That is, learning the values of Y does not change
prediction of X once we know the value of Z
– notation: I ( X , Y | Z )
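As a quick sanity check on the definition, the sketch below verifies I(X, Y | Z) numerically from a full joint table; the three-variable joint and its numbers are invented for illustration, not taken from the lecture.

```python
# Minimal sketch: check I(X, Y | Z) numerically from a full joint table.
# The joint P(X, Y, Z) below is illustrative only (made up for this example).
from itertools import product

# Constructed so that X and Y are independent given Z: P(x,y,z) = P(z)P(x|z)P(y|z)
P_Z = {0: 0.6, 1: 0.4}
P_X_given_Z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P_X_given_Z[z][x]
P_Y_given_Z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}   # P_Y_given_Z[z][y]
P = {(x, y, z): P_Z[z] * P_X_given_Z[z][x] * P_Y_given_Z[z][y]
     for x, y, z in product([0, 1], repeat=3)}

def cond_indep(P, tol=1e-9):
    """Return True if P(x | y, z) == P(x | z) for all values x, y, z."""
    for x, y, z in product([0, 1], repeat=3):
        p_yz = sum(P[(x2, y, z)] for x2 in [0, 1])
        p_z = sum(P[(x2, y2, z)] for x2 in [0, 1] for y2 in [0, 1])
        p_xz = sum(P[(x, y2, z)] for y2 in [0, 1])
        if abs(P[(x, y, z)] / p_yz - p_xz / p_z) > tol:
            return False
    return True

print(cond_indep(P))   # True: I(X, Y | Z) holds by construction
```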
Example: Family trees
Noisy stochastic process:
[Figure: pedigree over Homer, Marge, Bart, Lisa, Maggie]
Example: Pedigree
• A node represents an individual’s genotype
Modeling assumptions:
Ancestors can affect descendants' genotype only by passing
genetic materials through intermediate generations
Markov Assumption
• We now make this
independence assumption
more precise for directed
acyclic graphs (DAGs)
• Each random variable X is
independent of its non-descendants, given its
parents Pa(X)
• Formally,
I (X, NonDesc(X) | Pa(X))
[Figure: DAG fragment showing an ancestor, X's parents Y1 and Y2, X, a non-descendant, and a descendant of X]
Markov Assumption Example
[Figure: DAG with edges Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call]
• In this example:
– I ( E, B )
– I ( B, {E, R} )
– I ( R, {A, B, C} | E )
– I ( A, R | B, E )
– I ( C, {B, E, R} | A )
I-Maps
• A DAG G is an I-Map of a distribution P if all
Markov assumptions implied by G are satisfied
by P
(Assuming G and P both use the same set of random variables)
Examples:
[Figure: example DAGs over two variables X and Y]
Factorization
• Given that G is an I-Map of P, can we
simplify the representation of P?
• Example:
[Figure: graph over X and Y with no edge between them]
• Since I(X,Y), we have that P(X|Y) = P(X)
• Applying the chain rule:
P(X,Y) = P(X|Y) P(Y) = P(X) P(Y)
• Thus, we have a simpler representation of
P(X,Y)
Factorization Theorem
Thm: if G is an I-Map of P, then
P(X1,...,Xp) = Π_i P(Xi | Pa(Xi))
Proof:
• wlog. X1,…,Xp is an ordering consistent with G
• By the chain rule: P(X1,...,Xp) = Π_i P(Xi | X1,...,Xi-1)
• From the assumption: Pa(Xi) ⊆ {X1,…,Xi-1}
and {X1,…,Xi-1} ⊆ Pa(Xi) ∪ NonDesc(Xi)
• Since G is an I-Map, I(Xi, NonDesc(Xi) | Pa(Xi))
• Hence, I(Xi, {X1,…,Xi-1} − Pa(Xi) | Pa(Xi))
• We conclude, P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))
Factorization Example
[Figure: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call]
P(C,A,R,E,B) = P(B)P(E|B)P(R|E,B)P(A|R,B,E)P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
Consequences
• We can write P in terms of “local” conditional probabilities
• If G is sparse,
– that is, |Pa(Xi)| < k,
⇒ each conditional probability can be specified compactly
– e.g. for binary variables, these require O(2^k) params.
⇒ representation of P is compact
– linear in number of variables
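To make the savings concrete, here is a small sketch that evaluates the factored joint P(B)P(E)P(R|E)P(A|B,E)P(C|A) from the example above; only the factorization structure comes from the slides, the CPT numbers are invented placeholders.

```python
# Sketch: evaluate P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
# for the Earthquake/Burglary example. All probability values are
# made-up placeholders; only the factorization structure is from the slide.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_R_given_E = {True: {True: 0.9, False: 0.1},
               False: {True: 0.01, False: 0.99}}          # P_R_given_E[e][r]
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.3, (False, False): 0.001}  # P(A=True | b, e)
P_C_given_A = {True: 0.8, False: 0.05}                       # P(C=True | a)

def joint(b, e, r, a, c):
    p_a = P_A_given_BE[(b, e)] if a else 1 - P_A_given_BE[(b, e)]
    p_c = P_C_given_A[a] if c else 1 - P_C_given_A[a]
    return P_B[b] * P_E[e] * P_R_given_E[e][r] * p_a * p_c

# 1 + 1 + 2 + 4 + 2 = 10 parameters instead of 2^5 - 1 = 31 for the full joint.
print(joint(b=True, e=False, r=False, a=True, c=True))
```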
Summary
We defined the following concepts
• The Markov Independencies of a DAG G
– I (Xi , NonDesc(Xi) | Pai )
• G is an I-Map of a distribution P
– If P satisfies the Markov independencies implied by G
We proved the factorization theorem
• if G is an I-Map of P, then
P(X1,...,Xn) = Π_i P(Xi | Pai)
Conditional Independencies
• Let Markov(G) be the set of Markov Independencies
implied by G
• The factorization theorem shows:
G is an I-Map of P ⇒ P(X1,...,Xn) = Π_i P(Xi | Pai)
• We can also show the opposite:
Thm:
P(X1,...,Xn) = Π_i P(Xi | Pai) ⇒ G is an I-Map of P
Proof (Outline)
Example: [Figure: graph Y ← X → Z]
P(Z | X,Y) = P(X,Y,Z) / P(X,Y)
= P(X)P(Y|X)P(Z|X) / (P(X)P(Y|X))
= P(Z | X)
Implied Independencies
• Does a graph G imply additional independencies
as a consequence of Markov(G)?
• We can define a logic of independence
statements
• Some axioms:
– I( X ; Y | Z ) ⇒ I( Y; X | Z )
– I( X ; Y1, Y2 | Z ) ⇒ I( X; Y1 | Z )
d-Separation
• A procedure d-sep(X; Y | Z, G) that given
a DAG G, and sets X, Y, and Z returns
either yes or no
• Goal:
d-sep(X; Y | Z, G) = yes iff I(X;Y|Z) follows
from Markov(G)
Paths
• Intuition: dependency must “flow” along paths
in the graph
• A path is a sequence of neighboring variables
[Figure: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call]
Examples:
• R ← E → A ← B
• C ← A ← E → R
Paths
• We want to know when a path is
– active -- creates dependency between end
nodes
– blocked -- cannot create dependency between
end nodes
• We want to classify situations in which
paths are active.
Path Blockage
Three cases:
– Common cause
[Figure: R ← E → A. Blocked when E is given (observed); unblocked (active) otherwise.]
Path Blockage
Three cases:
– Common cause
– Intermediate cause
[Figure: E → A → C. Blocked when A is given (observed); unblocked (active) otherwise.]
Path Blockage
Three cases:
– Common cause
– Intermediate cause
– Common effect
[Figure: B → A ← E, with C a descendant of A. Blocked when neither A nor any of its descendants is given; unblocked (active) when A or one of its descendants is given.]
Path Blockage -- General Case
A path is active, given evidence Z, if
• Whenever we have the configuration A → B ← C,
then B or one of its descendants is in Z
• No other nodes in the path are in Z
A path is blocked, given evidence Z, if it is not active.
Example
– d-sep(R,B)?
[Figure: E → R, E → A, B → A, A → C]
Example
– d-sep(R,B) = yes
– d-sep(R,B|A)?
[Figure: E → R, E → A, B → A, A → C]
Example
– d-sep(R,B) = yes
– d-sep(R,B|A) = no
– d-sep(R,B|E,A)?
[Figure: E → R, E → A, B → A, A → C]
d-Separation
• X is d-separated from Y, given Z, if all paths from a
node in X to a node in Y are blocked, given Z.
• Checking d-separation can be done efficiently
(linear time in number of edges)
– Bottom-up phase:
Mark all nodes whose descendants are in Z
– X to Y phase:
Traverse (BFS) all edges on paths from X to Y and check if they
are blocked
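The two-phase check described above can be written out directly. The sketch below is one standard way to do it (mark ancestors of Z, then traverse (node, direction) states along active trails); the graph encoding as parent lists and the function names are my own, not from the lecture.

```python
# Sketch of d-separation: d_separated(parents, X, Y, Z) where `parents`
# maps each node to the list of its parents.
from collections import deque

def d_separated(parents, X, Y, Z):
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # Phase 1 (bottom-up): mark Z and all ancestors of Z (needed for the
    # common-effect rule "B or one of its descendants is in Z").
    anc, stack = set(), list(Z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])

    # Phase 2: BFS over (node, direction) states; 'up' = reached from a child,
    # 'down' = reached from a parent.
    visited, reachable = set(), set()
    queue = deque((x, 'up') for x in X)
    while queue:
        n, d = queue.popleft()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n not in Z:
            reachable.add(n)
        if d == 'up' and n not in Z:
            queue.extend((p, 'up') for p in parents[n])
            queue.extend((c, 'down') for c in children[n])
        elif d == 'down':
            if n not in Z:
                queue.extend((c, 'down') for c in children[n])
            if n in anc:                  # n is in Z or has a descendant in Z
                queue.extend((p, 'up') for p in parents[n])
    return not (reachable & set(Y))

# Earthquake network from the example slides: E->R, E->A, B->A, A->C
G = {'E': [], 'B': [], 'R': ['E'], 'A': ['E', 'B'], 'C': ['A']}
print(d_separated(G, {'R'}, {'B'}, set()))   # True:  d-sep(R,B)
print(d_separated(G, {'R'}, {'B'}, {'A'}))   # False: d-sep(R,B|A)
```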
Soundness
Thm: If
– G is an I-Map of P
– d-sep( X; Y | Z, G ) = yes
• then
– P satisfies I( X; Y | Z )
Informally: Any independence reported by d-separation is satisfied by the underlying
distribution
Completeness
Thm: If d-sep( X; Y | Z, G ) = no
• then there is a distribution P such that
– G is an I-Map of P
– P does not satisfy I( X; Y | Z )
Informally: Any independence not reported by d-separation might be violated by the underlying
distribution
• We cannot determine this by examining the
graph structure alone
Summary: Structure
• We explored DAGs as a representation of
conditional independencies:
– Markov independencies of a DAG
– Tight correspondence between Markov(G) and the
factorization defined by G
– d-separation, a sound & complete procedure for
computing the consequences of the independencies
– Notion of minimal I-Map
– P-Maps
• This theory is the basis for defining Bayesian
networks
Inference
• We now have compact representations of
probability distributions:
– Bayesian Networks
– Markov Networks
• Network describes a unique probability
distribution P
• How do we answer queries about P?
• We use inference as a name for the process
of computing answers to such queries
Today
1. Probabilistic graphical models
2. Treewidth methods:
1. Variable elimination
2. Clique tree algorithm
3. Applications du jour: Sensor Networks
Queries: Likelihood
• There are many types of queries we might ask.
• Most of these involve evidence
– Evidence e is an assignment of values to a set E of
variables in the domain
– Without loss of generality E = { Xk+1, …, Xn }
• Simplest query: compute probability of evidence
P(e) = Σ_{x1} … Σ_{xk} P(x1,…,xk, e)
• This is often referred to as computing the
likelihood of the evidence
Queries: A posteriori belief
• Often we are interested in the conditional
probability of a variable given the evidence
P(X | e) = P(X, e) / P(e)
• This is the a posteriori belief in X, given evidence e:
P(X = x | e) = P(X = x, e) / Σ_{x'} P(X = x', e)
• A related task is computing the term P(X, e)
– i.e., the likelihood of e and X = x for values of X
A posteriori belief
This query is useful in many cases:
• Prediction: what is the probability of an
outcome given the starting condition
– Target is a descendant of the evidence
• Diagnosis: what is the probability of
disease/fault given symptoms
– Target is an ancestor of the evidence
• The direction of the edges between variables does not
restrict the direction of the queries
Queries: MAP
• In this query we want to find the maximum a
posteriori assignment for some variable of
interest (say X1,…,Xl )
• That is, x1,…,xl maximize the probability
P(x1,…,xl | e)
• Note that this is equivalent to maximizing
P(x1,…,xl, e)
Queries: MAP
We can use MAP for:
• Classification
– find most likely label, given the evidence
• Explanation
– What is the most likely scenario, given the
evidence
Complexity of Inference
Thm:
Computing P(X = x) in a Bayesian network
is NP-hard
Not surprising, since we can simulate
Boolean gates.
Approaches to inference
• Exact inference
–Inference in Simple Chains
–Variable elimination
–Clustering / join tree algorithms
• Approximate inference – next time
–Stochastic simulation / sampling methods
–Markov chain Monte Carlo methods
–Mean field theory – your presentation
Variable Elimination
General idea:
• Write query in the form
P(Xn, e) = Σ_{xk} … Σ_{x3} Σ_{x2} Π_i P(xi | pai)
• Iteratively
– Move all irrelevant terms outside of innermost sum
– Perform innermost sum, getting a new term
– Insert the new term into the product
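A rough sketch of this loop, with factors stored as dicts from value tuples to numbers; the data structures and helper names are assumptions of mine, not part of the lecture.

```python
# Sketch of variable elimination. A factor is (vars, table) where vars is a
# tuple of variable names and table maps value-tuples to numbers.
from itertools import product

def multiply(f1, f2, domains):
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = tuple(dict.fromkeys(vars1 + vars2))        # union, order-preserving
    out = {}
    for vals in product(*(domains[v] for v in out_vars)):
        asg = dict(zip(out_vars, vals))
        out[vals] = (t1[tuple(asg[v] for v in vars1)] *
                     t2[tuple(asg[v] for v in vars2)])
    return out_vars, out

def sum_out(var, factor):
    vars_, t = factor
    out_vars = tuple(v for v in vars_ if v != var)
    out = {}
    for vals, p in t.items():
        key = tuple(v for v, name in zip(vals, vars_) if name != var)
        out[key] = out.get(key, 0.0) + p
    return out_vars, out

def eliminate(factors, order, domains):
    """Eliminate the variables in `order`; return the remaining factors."""
    factors = list(factors)
    for var in order:
        related = [f for f in factors if var in f[0]]      # factors mentioning var
        factors = [f for f in factors if var not in f[0]]  # move the rest outside
        if not related:
            continue
        prod = related[0]
        for f in related[1:]:
            prod = multiply(prod, f, domains)
        factors.append(sum_out(var, prod))                 # insert the new factor
    return factors
```

Run on the CPT factors of the “Asia” network below with the ordering v,s,x,t,l,a,b, this would reproduce the intermediate factors f_v(t), f_s(b,l), … derived on the following slides.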
Example
• “Asia” network:
[Figure: Visit to Asia (V) → Tuberculosis (T); Smoking (S) → Lung Cancer (L) and Bronchitis (B); T, L → Abnormality in Chest (A); A → X-Ray (X); A, B → Dyspnea (D)]
• We want to compute P(d)
• Need to eliminate: v,s,x,t,l,a,b
Initial factors:
P(v,s,t,l,a,b,x,d) =
P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
“Brute force approach”
P(d) = Σ_v Σ_s Σ_t Σ_l Σ_a Σ_b Σ_x P(v,s,t,l,a,b,x,d)
Complexity is exponential in the size of the graph
(number of variables T, with N = number of states for each
variable): O(N^T)
• We want to compute P(d)
• Need to eliminate: v,s,x,t,l,a,b
• Initial factors:
P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
Eliminate: v
Compute: f_v(t) = Σ_v P(v)P(t|v)
⇒ f_v(t)P(s)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
Note: f_v(t) = P(t)
In general, the result of elimination is not necessarily a probability
term
• We want to compute P(d)
• Need to eliminate: s,x,t,l,a,b
• Initial factors:
P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)P(s)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
Eliminate: s
Compute: f_s(b,l) = Σ_s P(s)P(b|s)P(l|s)
⇒ f_v(t)f_s(b,l)P(a|t,l)P(x|a)P(d|a,b)
Summing on s results in a factor with two arguments, f_s(b,l)
In general, the result of elimination may be a function of several
variables
• We want to compute P(d)
• Need to eliminate: x,t,l,a,b
• Initial factors:
P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)P(s)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)f_s(b,l)P(a|t,l)P(x|a)P(d|a,b)
Eliminate: x
Compute: f_x(a) = Σ_x P(x|a)
⇒ f_v(t)f_s(b,l)f_x(a)P(a|t,l)P(d|a,b)
Note: f_x(a) = 1 for all values of a !!
• We want to compute P(d)
• Need to eliminate: t,l,a,b
• Initial factors:
P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)P(s)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)f_s(b,l)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)f_s(b,l)f_x(a)P(a|t,l)P(d|a,b)
Eliminate: t
Compute: f_t(a,l) = Σ_t f_v(t)P(a|t,l)
⇒ f_s(b,l)f_x(a)f_t(a,l)P(d|a,b)
• We want to compute P(d)
• Need to eliminate: l,a,b
• Initial factors:
P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)P(s)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)f_s(b,l)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)f_s(b,l)f_x(a)P(a|t,l)P(d|a,b)
⇒ f_s(b,l)f_x(a)f_t(a,l)P(d|a,b)
Eliminate: l
Compute: f_l(a,b) = Σ_l f_s(b,l)f_t(a,l)
⇒ f_l(a,b)f_x(a)P(d|a,b)
• We want to compute P(d)
• Need to eliminate: a,b
• Initial factors:
P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)P(s)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)f_s(b,l)P(a|t,l)P(x|a)P(d|a,b)
⇒ f_v(t)f_s(b,l)f_x(a)P(a|t,l)P(d|a,b)
⇒ f_s(b,l)f_x(a)f_t(a,l)P(d|a,b)
⇒ f_l(a,b)f_x(a)P(d|a,b) ⇒ f_a(b,d) ⇒ f_b(d)
Eliminate: a, b
Compute: f_a(b,d) = Σ_a f_l(a,b)f_x(a)P(d|a,b) and f_b(d) = Σ_b f_a(b,d)
P(d) = f_b(d)
• A different elimination ordering:
• Need to eliminate: a,b,x,t,v,s,l
• Initial factors:
P(v)P(s)P(t|v)P(l|s)P(b|s)P(a|t,l)P(x|a)P(d|a,b)
Intermediate factors:
g_a(l,t,d,b,x)
g_b(l,t,d,x,s)
g_x(l,t,d,s)
g_t(l,d,s,v)
g_v(l,d,s)
g_s(l,d)
g_l(d)
Complexity is exponential in the size of the factors!
Variable Elimination
• We now understand variable elimination as a
sequence of rewriting operations
• Actual computation is done in elimination step
• Exactly the same computation procedure applies
to Markov networks
• Computation depends on order of elimination
Markov Network
(Undirected Graphical Models)
• A graph with hyper-edges (multi-vertex
edges)
• Every hyper-edge e=(x1…xk) has a
potential function fe(x1…xk)
• The probability distribution is
P(X1,...,Xn) = Z · Π_{e∈E} f_e(x_e1,...,x_ek)
where Z = 1 / ( Σ_{x1} … Σ_{xn} Π_{e∈E} f_e(x_e1,...,x_ek) )
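For concreteness, a minimal sketch of this definition on a tiny network with two hyper-edges over three binary variables, computing Z by brute-force enumeration; the potential values are invented.

```python
# Sketch: a tiny Markov network over binary A, B, C with hyper-edges
# {A,B} and {B,C}. Potential values are invented for illustration.
from itertools import product

f_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}
f_BC = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}

def unnormalized(a, b, c):
    return f_AB[(a, b)] * f_BC[(b, c)]

# Z as defined on the slide: the reciprocal of the sum over all assignments
Z = 1.0 / sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

def prob(a, b, c):
    # P(A,B,C) = Z * product of edge potentials
    return Z * unnormalized(a, b, c)

print(prob(1, 1, 1))
print(sum(prob(a, b, c) for a, b, c in product([0, 1], repeat=3)))  # 1.0
```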
Complexity of variable
elimination
• Suppose in one elimination step we compute
f_x(y1,…,yk) = Σ_x f'_x(x, y1,…,yk)
where f'_x(x, y1,…,yk) = Π_{i=1..m} f_i(x, y_{i,1},…,y_{i,l_i})
This requires
• m · |Val(X)| · Π_i |Val(Yi)| multiplications
– For each value of x, y1,…,yk, we do m multiplications
• |Val(X)| · Π_i |Val(Yi)| additions
– For each value of y1,…,yk, we do |Val(X)| additions
⇒ Complexity is exponential in the number of variables in the
intermediate factor
Undirected graph representation
• At each stage of the procedure, we have an
algebraic term that we need to evaluate
• In general this term is of the form:
P(x1,…,xk) = Σ_{y1} … Σ_{yn} Π_i f_i(Z_i)
where Zi are sets of variables
• We now plot a graph where there is
undirected edge X--Y if X,Y are arguments of
some factor
– that is, if X,Y are in some Zi
• Note: this is the Markov network that
describes the probability on the variables we
did not eliminate yet
Chordal Graphs
• elimination ordering ⇒ undirected chordal graph
[Figure: the “Asia” DAG and the chordal undirected graph induced by elimination]
Graph:
• Maximal cliques are factors in elimination
• Factors in elimination are cliques in the graph
• Complexity is exponential in size of the largest
clique in graph
Induced Width
• The size of the largest clique in the induced
graph is thus an indicator for the complexity of
variable elimination
• This quantity is called the induced width of a
graph according to the specified ordering
• Finding a good ordering for a graph is equivalent
to finding the minimal induced width of the graph
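The induced width of a given ordering can be computed by simulating elimination on the undirected graph: when a node is eliminated, connect its remaining neighbors (fill-in edges) and record how many neighbors it had. A sketch, using the moralized Earthquake example as assumed input:

```python
# Sketch: induced width of an elimination ordering, by simulating elimination
# on an undirected graph (adjacency sets) and adding fill-in edges.
def induced_width(adj, order):
    adj = {v: set(ns) for v, ns in adj.items()}   # copy; we will mutate it
    width = 0
    for v in order:
        nbrs = adj[v]
        width = max(width, len(nbrs))  # neighbors at elimination time
                                       # (size of the created clique minus one)
        for a in nbrs:                 # connect remaining neighbors pairwise
            for b in nbrs:
                if a != b:
                    adj[a].add(b)
        for a in nbrs:                 # remove v from the graph
            adj[a].discard(v)
        del adj[v]
    return width

# Moral graph of the Earthquake example: E-R, E-A, B-A, A-C, plus E-B (married)
moral = {'E': {'R', 'A', 'B'}, 'B': {'A', 'E'}, 'R': {'E'},
         'A': {'E', 'B', 'C'}, 'C': {'A'}}
print(induced_width(moral, ['R', 'C', 'E', 'B', 'A']))  # 2 for this ordering
```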
PolyTrees
• A polytree is a network where there is at most
one path from one variable to another
[Figure: an example polytree over nodes A–H]
Thm:
• Inference in a polytree is linear in the
representation size of the network
– This assumes tabular CPT representation
Today
1. Probabilistic graphical models
2. Treewidth methods:
1. Variable elimination
2. Clique tree algorithm
3. Applications du jour: Sensor Networks
Junction Tree
• Why junction tree?
– More efficient for some tasks than variable
elimination
– We can avoid cycles if we turn highly
interconnected subsets of the nodes into
“supernodes” ⇒ clusters
• Objective
– Compute P(V = v | E = e)
• v is a value of a variable V and e is evidence
for a set of variables E
Properties of Junction Tree
• An undirected tree
• Each node is a cluster (nonempty set)
of variables
• Running intersection property:
– Given two clusters X and Y, all clusters
on the path between X and Y contain X ∩ Y
• Separator sets (sepsets):
– Intersection of the adjacent clusters
[Figure: clusters ABD — ADE — DEF, with sepset AD between ABD and ADE and sepset DE between ADE and DEF]
Potentials
• Potentials: φ : X → ℝ⁺ ∪ {0}
– Denoted by φ_X
• Marginalization
– For X ⊆ Y, the marginalization of φ_Y into X:
φ_X = Σ_{Y\X} φ_Y
• Multiplication
– For Z = X ∪ Y, the multiplication of φ_X and φ_Y:
φ_Z = φ_X φ_Y
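A minimal sketch of both operations with numpy arrays (one axis per variable), using the ABD/ADE clusters and sepset AD from the figure; the potential values are random placeholders just to show the shapes.

```python
# Sketch: potentials as numpy arrays, one axis per variable (all binary here).
# Axis order follows the variable order in the name.
import numpy as np

phi_ABD = np.random.rand(2, 2, 2)   # potential over (A, B, D)
phi_ADE = np.random.rand(2, 2, 2)   # potential over (A, D, E)

# Marginalization: project phi_ABD onto the sepset AD by summing out B (axis 1)
phi_AD = phi_ABD.sum(axis=1)        # shape (2, 2), over (A, D)

# Multiplication: combine the sepset potential with phi_ADE by broadcasting
# phi_AD over the missing E axis: (A, D, 1) * (A, D, E)
phi_ADE_new = phi_ADE * phi_AD[:, :, None]
print(phi_ADE_new.shape)            # (2, 2, 2)
```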
Properties of Junction Tree
• Belief potentials:
– Map each instantiation of clusters or sepsets into a
real number
• Constraints:
– Consistency: for each cluster X and neighboring
sepset S:
Σ_{X\S} φ_X = φ_S
– The joint distribution:
P(U) = ( Π_i φ_Xi ) / ( Π_j φ_Sj )
Properties of Junction Tree
• If a junction tree satisfies the properties, it
follows that:
– For each cluster (or sepset) X,
φ_X = P(X)
– The probability distribution of any variable V,
using any cluster (or sepset) X that contains V:
P(V) = Σ_{X\{V}} φ_X
Building Junction Trees
DAG → Moral Graph → Triangulated Graph → Identifying Cliques → Junction Tree
Constructing the Moral Graph
[Figure: an example DAG over nodes A, B, C, D, E, F, G, H]
Constructing The Moral Graph
• Add undirected
edges between all co-parents which are
not currently joined
– Marrying parents
[Figure: the DAG with the added co-parent edges]
Constructing The Moral Graph
• Add undirected
edges between all co-parents which are
not currently joined
– Marrying parents
• Drop the directions
of the arcs
[Figure: the resulting moral graph over A–H]
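Moralization is mechanical enough to sketch in a few lines: connect the parents of each node pairwise, then drop edge directions. The sketch below assumes the DAG is given as parent lists and uses the earlier Earthquake network (not the A–H graph in the figure) as its example.

```python
# Sketch: build the moral graph (undirected adjacency sets) from a DAG
# given as a dict mapping each node to the list of its parents.
from itertools import combinations

def moral_graph(parents):
    adj = {v: set() for v in parents}
    for child, ps in parents.items():
        for p in ps:                       # drop directions: parent-child edges
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(ps, 2):   # "marry" the co-parents of each node
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Earthquake example: A has parents E and B, so moralization adds edge E-B.
dag = {'E': [], 'B': [], 'R': ['E'], 'A': ['E', 'B'], 'C': ['A']}
print(moral_graph(dag)['E'])   # {'R', 'A', 'B'}
```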
Triangulating
• An undirected graph is triangulated iff
every cycle of length > 3 contains an edge
that connects two nonadjacent nodes
[Figure: the moral graph with fill-in edges added so that it is triangulated]
Identifying Cliques
• A clique is a subgraph of an undirected
graph that is complete and maximal
[Figure: the triangulated graph over A–H, with maximal cliques ABD, ADE, ACE, CEG, DEF, EGH]
Junction Tree
• A junction tree is a subgraph of the
clique graph that
– is a tree
– contains all the cliques
– satisfies the running intersection property
[Figure: clique graph and a junction tree over the cliques:
ABD —AD— ADE —AE— ACE —CE— CEG —EG— EGH, with DEF attached to ADE via sepset DE]
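The slides do not say how the tree is chosen from the clique graph; a common construction (assumed here, not stated in the lecture) is a maximum-weight spanning tree with edge weight equal to sepset size, which on these cliques yields the tree shown in the figure.

```python
# Sketch: build a junction tree as a maximum-weight spanning tree of the
# clique graph, weighting each candidate edge by the size of its sepset.
from itertools import combinations

cliques = [frozenset(c) for c in ['ABD', 'ADE', 'ACE', 'CEG', 'EGH', 'DEF']]

edges = [(len(c1 & c2), c1, c2)
         for c1, c2 in combinations(cliques, 2) if c1 & c2]
edges.sort(key=lambda e: e[0], reverse=True)   # heaviest sepsets first

# Kruskal's algorithm with a simple union-find over cliques.
parent = {c: c for c in cliques}
def find(c):
    while parent[c] != c:
        c = parent[c]
    return c

tree = []
for w, c1, c2 in edges:
    r1, r2 = find(c1), find(c2)
    if r1 != r2:
        parent[r1] = r2
        tree.append((''.join(sorted(c1)), ''.join(sorted(c2)),
                     ''.join(sorted(c1 & c2))))

for c1, c2, sep in tree:
    print(c1, '--', sep, '--', c2)   # the five edges of the junction tree
```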
Principle of Inference
DAG → Junction Tree → Initialization → Inconsistent Junction Tree → Propagation → Consistent Junction Tree → Marginalization → P(V = v | E = e)
Example: Create Join Tree
HMM with 2 time steps:
[Figure: X1 → X2, X1 → Y1, X2 → Y2]
Junction Tree:
[X1,Y1] —X1— [X1,X2] —X2— [X2,Y2]
Example: Initialization
[Junction tree: [X1,Y1] —X1— [X1,X2] —X2— [X2,Y2]]
Variable | Associated Cluster | Potential function
X1       | X1,Y1              | φ_{X1,Y1} ← P(X1)
Y1       | X1,Y1              | φ_{X1,Y1} ← P(X1)P(Y1|X1)
X2       | X1,X2              | φ_{X1,X2} ← P(X2|X1)
Y2       | X2,Y2              | φ_{X2,Y2} ← P(Y2|X2)
Example: Collect Evidence
• Choose arbitrary clique, e.g. X1,X2, where
all potential functions will be collected.
• Call recursively neighboring cliques for
messages:
• 1. Call X1,Y1:
– 1. Projection: φ_X1 ← Σ_{{X1,Y1}\{X1}} φ_{X1,Y1} = Σ_{Y1} P(X1,Y1) = P(X1)
– 2. Absorption: φ_{X1,X2} ← φ_{X1,X2} · (φ_X1 / φ_X1^old) = P(X2|X1)P(X1) = P(X1,X2)
Example: Collect Evidence
(cont.)
• 2. Call X2,Y2:
– 1. Projection: φ_X2 ← Σ_{{X2,Y2}\{X2}} φ_{X2,Y2} = Σ_{Y2} P(Y2|X2) = 1
– 2. Absorption: φ_{X1,X2} ← φ_{X1,X2} · (φ_X2 / φ_X2^old) = P(X1,X2)
[Junction tree: [X1,Y1] —X1— [X1,X2] —X2— [X2,Y2]]
Example: Distribute Evidence
• Pass messages recursively to neighboring
nodes
• Pass message from X1,X2 to X1,Y1:
– 1. Projection: φ_X1 ← Σ_{{X1,X2}\{X1}} φ_{X1,X2} = Σ_{X2} P(X1,X2) = P(X1)
– 2. Absorption: φ_{X1,Y1} ← φ_{X1,Y1} · (φ_X1 / φ_X1^old) = P(X1,Y1) · P(X1)/P(X1) = P(X1,Y1)
Example: Distribute Evidence
(cont.)
• Pass message from X1,X2 to X2,Y2:
– 1. Projection: φ_X2 ← Σ_{{X1,X2}\{X2}} φ_{X1,X2} = Σ_{X1} P(X1,X2) = P(X2)
– 2. Absorption: φ_{X2,Y2} ← φ_{X2,Y2} · (φ_X2 / φ_X2^old) = P(Y2|X2) · P(X2)/1 = P(Y2,X2)
[Junction tree: [X1,Y1] —X1— [X1,X2] —X2— [X2,Y2]]
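The whole collect/distribute pass above fits in a few lines of array code. The sketch below assumes numpy and invented HMM parameters; only the cluster structure, initialization, and projection/absorption rules follow the slides.

```python
# Sketch: propagation on the 2-step HMM junction tree
# [X1,Y1] --X1-- [X1,X2] --X2-- [X2,Y2]. Parameter values are invented.
import numpy as np

P_X1 = np.array([0.6, 0.4])                    # P(X1)
P_Y1_X1 = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(Y1|X1), rows indexed by X1
P_X2_X1 = np.array([[0.7, 0.3], [0.4, 0.6]])   # P(X2|X1), rows indexed by X1
P_Y2_X2 = np.array([[0.8, 0.2], [0.3, 0.7]])   # P(Y2|X2), rows indexed by X2

# Initialization (as in the table above): cluster and sepset potentials
phi_X1Y1 = P_X1[:, None] * P_Y1_X1             # axes (X1, Y1)
phi_X1X2 = P_X2_X1.copy()                      # axes (X1, X2)
phi_X2Y2 = P_Y2_X2.copy()                      # axes (X2, Y2)
sep_X1, sep_X2 = np.ones(2), np.ones(2)

def pass_message(phi_from, sum_axis, sep_old, phi_to, broadcast):
    """Project phi_from onto the sepset, then absorb the ratio into phi_to."""
    sep_new = phi_from.sum(axis=sum_axis)
    phi_to *= broadcast(sep_new / sep_old)
    return sep_new

# Collect evidence into [X1,X2]
sep_X1 = pass_message(phi_X1Y1, 1, sep_X1, phi_X1X2, lambda s: s[:, None])
sep_X2 = pass_message(phi_X2Y2, 1, sep_X2, phi_X1X2, lambda s: s[None, :])
# Distribute evidence back out
sep_X1 = pass_message(phi_X1X2, 1, sep_X1, phi_X1Y1, lambda s: s[:, None])
sep_X2 = pass_message(phi_X1X2, 0, sep_X2, phi_X2Y2, lambda s: s[:, None])

print(phi_X1X2)          # now equals P(X1, X2)
print(phi_X2Y2.sum())    # 1.0: phi_X2Y2 now equals P(X2, Y2)
```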
Example: Inference with
evidence
• Assume we want to compute:
P(X2|Y1=0,Y2=1) (state estimation)
• Assign likelihoods to the potential
functions during initialization:
φ_{X1,Y1} ← 0 if Y1 = 1;  P(X1, Y1 = 0) if Y1 = 0
φ_{X2,Y2} ← 0 if Y2 = 0;  P(Y2 = 1 | X2) if Y2 = 1
Example: Inference with
evidence (cont.)
• Repeating the same steps as in the
previous case, we obtain:
φ_{X1,Y1} = 0 if Y1 = 1;  P(X1, Y1 = 0, Y2 = 1) if Y1 = 0
φ_X1 = P(X1, Y1 = 0, Y2 = 1)
φ_{X1,X2} = P(X1, Y1 = 0, X2, Y2 = 1)
φ_X2 = P(Y1 = 0, X2, Y2 = 1)
φ_{X2,Y2} = 0 if Y2 = 0;  P(Y1 = 0, X2, Y2 = 1) if Y2 = 1
Next Time
• Approximate Probabilistic Inference via
sampling
– Gibbs
– Priority
– MCMC
THE END
Example: Naïve Bayesian
Model
• A common model in early diagnosis:
– Symptoms are conditionally independent
given the disease (or fault)
• Thus, if
– X1,…,Xp denote the symptoms
exhibited by the patient (headache, high fever, etc.) and
– H denotes the hypothesis about the patient's
health
• then, P(X1,…,Xp,H) = P(H) Π_i P(Xi | H)
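A closing sketch of the resulting classifier: the posterior P(H | x1,…,xp) ∝ P(H) Π_i P(xi | H), computed directly; hypothesis names, symptoms, and all numbers below are invented placeholders.

```python
# Sketch: naive Bayes posterior P(H | symptoms) ∝ P(H) * prod_i P(x_i | H).
# Hypotheses, symptoms, and all probabilities are invented placeholders.
P_H = {'flu': 0.1, 'cold': 0.3, 'healthy': 0.6}
P_x_given_H = {                          # P(symptom present | H)
    'headache':   {'flu': 0.8, 'cold': 0.4, 'healthy': 0.1},
    'high_fever': {'flu': 0.7, 'cold': 0.1, 'healthy': 0.01},
}

def posterior(observed):
    """observed: dict mapping symptom -> True/False."""
    scores = {}
    for h, prior in P_H.items():
        p = prior
        for sym, present in observed.items():
            p_present = P_x_given_H[sym][h]
            p *= p_present if present else (1 - p_present)
        scores[h] = p
    Z = sum(scores.values())             # normalize over hypotheses
    return {h: s / Z for h, s in scores.items()}

print(posterior({'headache': True, 'high_fever': True}))
```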
Elimination on Trees
• Formally, for any tree, there is an
elimination ordering with induced width = 1
Thm
• Inference on trees is linear in number of
variables