Bayesian Statistics and Belief Networks
Overview
• Book: Ch 8.3
• Refresher on Bayesian statistics
• Bayesian classifiers
• Belief Networks / Bayesian Networks
Why Should We Care?
• Theoretical framework for machine learning, classification, knowledge representation, and analysis
• Bayesian methods are capable of handling noisy, incomplete data sets
• Bayesian methods are commonly in use today
Bayesian Approach To Probability and Statistics
• Classical Probability: A physical property of the world (e.g., the 50% chance of heads on a flip of a fair coin). The true probability.
• Bayesian Probability: A person's degree of belief in event X. Personal probability.
• Unlike classical probability, Bayesian probabilities benefit from, but do not require, repeated trials; they can focus on just the next event, e.g., the probability that the Seawolves win their next game.
Bayes Rule
Product Rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
Equating the two right-hand sides and dividing:
P(Class|evidence) = P(evidence|Class) P(Class) / P(evidence)
All classification methods can be seen as estimates of Bayes' Rule, with different techniques to estimate P(evidence|Class).
Simple Bayes Rule Example
Probability your computer has a virus, V, = 1/1000.
If infected with a virus, the probability of a crash that day, C, is 4/5.
Probability your computer crashes in one day, C, = 1/10.
P(C|V) = 0.8, P(V) = 1/1000, P(C) = 1/10
P(V|C) = P(C|V) P(V) / P(C) = (0.8)(0.001) / (0.1) = 0.008
Even though a crash is a strong indicator of a virus, we expect only 8/1000 crashes to be caused by viruses. Why not compute P(V|C) from direct evidence? This is the distinction between causal and diagnostic knowledge: the causal quantity P(C|V) is stable, while a direct diagnostic estimate would break (consider what happens if P(C) suddenly drops).
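The arithmetic is easy to check in code; a minimal sketch (variable names are mine):

```python
# Bayes' rule for the virus example: P(V|C) = P(C|V) * P(V) / P(C)
p_v = 1 / 1000       # prior probability of a virus
p_c = 1 / 10         # probability of a crash on any given day
p_c_given_v = 4 / 5  # probability of a crash given a virus

p_v_given_c = p_c_given_v * p_v / p_c
print(p_v_given_c)   # 0.008
```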
Bayesian Classifiers
P(Class|e) = P(e|Class) P(Class) / P(e)
If we're selecting the single most likely class, the denominator is the same for every class, so we only need to find the class that maximizes P(e|Class)P(Class).
The hard part is estimating P(e|Class).
Evidence e typically consists of a set of observations:
E = (e_1, e_2, ..., e_n)
The usual simplifying assumption is conditional independence:
P(E|C) = ∏_{i=1}^{n} P(e_i|C)
Bayesian Classifier Example
              P(C)   P(crashes|C)   P(diskfull|C)
C=Virus       0.4    0.1            0.6
C=Bad Disk    0.6    0.2            0.1
Given a case where the disk is full and computer crashes, the classifier chooses Virus as most likely since (0.4)(0.1)(0.6) > (0.6)(0.2)(0.1).
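The same computation as a tiny naive Bayes classifier; a minimal sketch, where the dict layout and function name `classify` are illustrative choices rather than anything from the slides:

```python
# Naive Bayes: score each class by P(C) * product of P(e_i|C) over the evidence.
priors = {"Virus": 0.4, "Bad Disk": 0.6}
likelihoods = {
    "Virus":    {"crashes": 0.1, "diskfull": 0.6},
    "Bad Disk": {"crashes": 0.2, "diskfull": 0.1},
}

def classify(evidence):
    """Return the class maximizing P(C) * prod_i P(e_i|C), plus all scores."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for e in evidence:
            score *= likelihoods[c][e]
        scores[c] = score
    return max(scores, key=scores.get), scores

print(classify(["crashes", "diskfull"]))
# ('Virus', {'Virus': 0.024, 'Bad Disk': 0.012})
```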
Beyond Conditional Independence
[Figure: a linear classifier separating classes C1 and C2.]
• Include second-order dependencies, i.e., pairwise combinations of variables via joint probabilities: replace the product P(e_i|C)P(e_j|C) with the joint P(e_i, e_j|C), applied as a correction factor to the first-order estimate.
• The correction factor is difficult to compute, and there are C(n,2) = n(n-1)/2 pairwise joint probabilities to consider.
Belief Networks
• A DAG that represents the dependencies between variables and specifies the joint probability distribution
• Random variables make up the nodes
• Directed links represent direct causal influences
• Each node has a conditional probability table quantifying the effects from its parents
• No directed cycles
Burglary Alarm Example
P(B) = 0.001    P(E) = 0.002

Alarm CPT, P(A|B,E):
  B E    P(A)
  T T    0.95
  T F    0.94
  F T    0.29
  F F    0.001

John Calls CPT, P(J|A):    Mary Calls CPT, P(M|A):
  A    P(J)                  A    P(M)
  T    0.90                  T    0.70
  F    0.05                  F    0.01
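These tables translate directly into a small data structure; a minimal sketch, assuming a dict-of-CPTs layout (the representation and the helper name `p` are my own choices):

```python
# Burglary network CPTs, from the slides. Each entry maps a tuple of
# parent values to P(variable = True | parents).
PARENTS = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}
CPT = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}

def p(var, value, assignment):
    """P(var = value | parent values taken from assignment)."""
    prob_true = CPT[var][tuple(assignment[u] for u in PARENTS[var])]
    return prob_true if value else 1.0 - prob_true
```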
Sample Bayesian Network
[Figure: an example Bayesian network.]
Using The Belief Network
[Figure: the burglary network again, with the CPTs shown above.]
The network specifies the full joint distribution as a product of local conditional probabilities:
P(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} P(x_i | Parents(x_i))
Probability of alarm, no burglary or earthquake, both John and Mary call:
P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B ∧ ¬E) P(¬B) P(¬E)
= (0.90)(0.70)(0.001)(0.999)(0.998) ≈ 0.00062
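Continuing the sketch above (and reusing its `PARENTS`, `CPT`, and `p`), the chain-rule product and the worked example might look like this; `joint` is a name I introduce:

```python
def joint(assignment):
    """Full joint probability: product over i of P(x_i | Parents(x_i))."""
    prob = 1.0
    for var in PARENTS:  # insertion order B, E, A, J, M is topological
        prob *= p(var, assignment[var], assignment)
    return prob

# Alarm sounds, no burglary or earthquake, both John and Mary call:
event = {"B": False, "E": False, "A": True, "J": True, "M": True}
print(joint(event))  # (0.999)(0.998)(0.001)(0.9)(0.7) ≈ 0.00062
```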
Belief Computations
• Two types; both are NP-hard
• Belief Revision
  – Models explanatory/diagnostic tasks
  – Given evidence, what is the most likely hypothesis to explain the evidence?
  – Also called abductive reasoning
• Belief Updating
  – Queries: given evidence, what is the probability of some other random variable occurring?
Belief Revision
• Given some evidence variables, find the state of all other variables that maximizes the probability.
• E.g.: We know John calls, but not Mary. What is the most likely state? Only consider assignments where J=T and M=F, and maximize the joint probability. The best assignment sets B, E, and A all false:
P(¬B) P(¬E) P(¬A|¬B ∧ ¬E) P(J|¬A) P(¬M|¬A) = (0.999)(0.998)(0.999)(0.05)(0.99) ≈ 0.049
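Brute-force belief revision is then a loop over the full joint; a sketch reusing `joint` and `PARENTS` from above. Its exponential enumeration is exactly why the general problem is hard:

```python
from itertools import product

def most_probable_state(evidence):
    """Belief revision by brute force: maximize the joint over all complete
    assignments consistent with the evidence."""
    names = list(PARENTS)  # ["B", "E", "A", "J", "M"]
    best, best_prob = None, -1.0
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        if all(a[v] == val for v, val in evidence.items()):
            pr = joint(a)
            if pr > best_prob:
                best, best_prob = a, pr
    return best, best_prob

state, prob = most_probable_state({"J": True, "M": False})
print(state, prob)  # everything False except J; probability ~0.049
```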
Belief Updating
• Causal Inferences: evidence at a cause, query an effect (E → Q)
• Diagnostic Inferences: evidence at an effect, query a cause (Q ← E)
• Intercausal Inferences: evidence at a common effect, query one of its causes
• Mixed Inferences: combinations of the above
Causal Inferences
Inference from cause to effect. E.g., given a burglary, what is P(J|B)?
P(A|B) = P(A|B ∧ E) P(E) + P(A|B ∧ ¬E) P(¬E) = (0.95)(0.002) + (0.94)(0.998) ≈ 0.94
P(J|B) = P(J|A) P(A|B) + P(J|¬A) P(¬A|B) = (0.9)(0.94) + (0.05)(0.06) ≈ 0.85
Similarly, P(M|B) = (0.70)(0.94) + (0.01)(0.06) ≈ 0.66.
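Each of these updating queries can be answered, inefficiently, by summing out the full joint; a sketch reusing `joint` and `PARENTS` from above (the helper name `query` is mine):

```python
from itertools import product

def query(target, evidence):
    """Belief updating by enumeration: P(target = True | evidence)."""
    names = list(PARENTS)
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(names)):
        a = dict(zip(names, values))
        if all(a[v] == val for v, val in evidence.items()):
            totals[a[target]] += joint(a)
    return totals[True] / (totals[True] + totals[False])

print(query("J", {"B": True}))  # causal: ~0.85
print(query("M", {"B": True}))  # causal: ~0.66
```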
Diagnostic Inferences
From effect to cause. E.g., given that John calls, what is P(B|J)?
P(B|J) = P(J|B) P(B) / P(J)
What is P(J)? Need P(A) first:
P(A) = P(A|B ∧ E) P(B) P(E) + P(A|¬B ∧ E) P(¬B) P(E) + P(A|B ∧ ¬E) P(B) P(¬E) + P(A|¬B ∧ ¬E) P(¬B) P(¬E)
     = (0.001)(0.002)(0.95) + (0.999)(0.002)(0.29) + (0.001)(0.998)(0.94) + (0.999)(0.998)(0.001)
P(A) ≈ 0.002517
P(J) = P(J|A) P(A) + P(J|¬A) P(¬A) = (0.002517)(0.9) + (0.9975)(0.05) ≈ 0.052
P(B|J) = (0.85)(0.001) / (0.052) ≈ 0.016
Many false positives.
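The `query` sketch above gives the same diagnostic answer directly:

```python
print(query("B", {"J": True}))  # diagnostic: ~0.016
```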
Intercausal Inferences
Also called explaining-away inferences.
Given an alarm, P(B|A) = 0.37. But if we add the evidence that Earthquake is true, then P(B|A ∧ E) = 0.003.
Even though B and E are independent a priori, observing their common effect A makes them dependent: the presence of one can make the other more or less likely (here the earthquake explains away the alarm).
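The explaining-away numbers fall out of the same `query` sketch:

```python
print(query("B", {"A": True}))             # ~0.37
print(query("B", {"A": True, "E": True}))  # ~0.003: E explains away the alarm
```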
Mixed Inferences
Simultaneous intercausal and diagnostic inference.
E.g., if John calls and Earthquake is false:
P(A|J ∧ ¬E) ≈ 0.03
P(B|J ∧ ¬E) ≈ 0.017
Computing these values exactly is somewhat complicated.
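The mixed query is no harder for brute-force enumeration (`query` gives ≈0.016 for the second value, which the slide rounds to 0.017):

```python
print(query("A", {"J": True, "E": False}))  # ~0.03
print(query("B", {"J": True, "E": False}))  # ~0.016
```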
Exact Computation Polytree Algorithm
• Judea Pearl, 1982
• Only works on singly-connected networks: at most one undirected path between any two nodes
• A backward-chaining, message-passing algorithm for computing posterior probabilities for a query node X
  – Compute causal support for X: evidence variables "above" X
  – Compute evidential support for X: evidence variables "below" X
Polytree Computation
[Figure: query node X with parents U(1), ..., U(m), children Y(1), ..., Y(n), and each child Y(i)'s other parents Z(i,j). Evidence connected through X's parents is E_X^+; evidence connected through X's children is E_X^-.]
P(X|E) = α P(E_X^- | X) P(X | E_X^+)
Causal support (evidence "above" X):
P(X | E_X^+) = Σ_u P(X|u) ∏_i P(u_i | E_{U_i \ X})
Evidential support (evidence "below" X):
P(E_X^- | X) = β ∏_i Σ_{y_i} P(E_{Y_i} | y_i) Σ_{z_ij} P(y_i | X, z_ij) ∏_j P(z_ij | E_{Z_ij \ Y_i})
The algorithm is recursive, evaluated as a message-passing chain over the network.
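To make the causal/evidential (π/λ) idea concrete without the full polytree bookkeeping, here is a minimal sketch of message passing on a chain X1 → X2 → X3, the simplest possible polytree; all CPT values here are invented for illustration:

```python
import numpy as np

# Chain X1 -> X2 -> X3, each variable binary. prior is P(X1); trans[i] is
# the CPT P(X_{i+1} | X_i) as a 2x2 matrix (row = parent value).
prior = np.array([0.2, 0.8])
trans = [np.array([[0.9, 0.1], [0.3, 0.7]]),
         np.array([[0.6, 0.4], [0.5, 0.5]])]

def chain_posteriors(evidence):
    """P(X_i | evidence) for every node; evidence maps node index -> value."""
    n = len(trans) + 1
    # pi[i]: causal support, P(X_i and evidence above i); passed forward.
    pi = [prior.copy()]
    for i in range(n - 1):
        msg = pi[i].copy()
        if i in evidence:                      # clamp an observed node
            mask = np.zeros(2); mask[evidence[i]] = 1.0; msg *= mask
        pi.append(msg @ trans[i])
    # lam[i]: evidential support, P(evidence below i | X_i); passed backward.
    lam = [np.ones(2) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        msg = lam[i + 1].copy()
        if i + 1 in evidence:
            mask = np.zeros(2); mask[evidence[i + 1]] = 1.0; msg *= mask
        lam[i] = trans[i] @ msg
    beliefs = []
    for i in range(n):
        b = pi[i] * lam[i]
        if i in evidence:
            mask = np.zeros(2); mask[evidence[i]] = 1.0; b *= mask
        beliefs.append(b / b.sum())            # normalize (the alpha factor)
    return beliefs

print(chain_posteriors({2: 1}))  # posteriors of X1, X2, X3 given X3 = 1
```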
Other Query Methods
• Exact Algorithms
  – Clustering: merge nodes into clusters until the network is singly connected, then message-pass over the cluster network
  – Symbolic Probabilistic Inference: uses d-separation to find expressions to combine
• Approximate Algorithms: select a sampling distribution, conduct trials sampling from the root nodes down to the evidence nodes, accumulating a weight for each node. Still tractable for dense networks (a minimal sampling sketch follows below).
  – Forward Simulation
  – Stochastic Simulation
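As one concrete instance of the weighted-sampling scheme sketched above, here is a minimal likelihood-weighting estimate for the burglary network, reusing `PARENTS` and `CPT` from earlier (function names are mine):

```python
import random

def weighted_sample(evidence):
    """Sample non-evidence variables top-down; weight by P(evidence | parents)."""
    a, weight = {}, 1.0
    for var in PARENTS:  # topological order: B, E, A, J, M
        prob_true = CPT[var][tuple(a[u] for u in PARENTS[var])]
        if var in evidence:
            a[var] = evidence[var]
            weight *= prob_true if evidence[var] else 1.0 - prob_true
        else:
            a[var] = random.random() < prob_true
    return a, weight

def approx_query(target, evidence, trials=100_000):
    """Estimate P(target = True | evidence) from weighted samples."""
    num = den = 0.0
    for _ in range(trials):
        a, w = weighted_sample(evidence)
        den += w
        if a[target]:
            num += w
    return num / den

print(approx_query("B", {"J": True, "M": True}))  # ~0.28
```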
Summary
• Bayesian methods provide a sound theory and framework for implementing classifiers
• Bayesian networks are a natural way to represent conditional independence information: qualitative information in the links, quantitative information in the tables
• Computing exact values is NP-hard; it is typical to make simplifying assumptions or to use approximate methods
• Many Bayesian tools and systems exist
References
• Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall.
• Weiss, S. and Kulikowski, C. (1991). Computer Systems That Learn. Morgan Kaufmann.
• Heckerman, D. (1996). A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR-95-06.
• Internet resources on Bayesian networks and machine learning: http://www.cs.orst.edu/~wangxi/resource.html