Bayesian Statistics and Belief Networks


Overview

• Book: Ch 8.3

• Refresher on Bayesian statistics
• Bayesian classifiers
• Belief Networks / Bayesian Networks

Why Should We Care?

• Theoretical framework for machine learning, classification, knowledge representation, and analysis
• Bayesian methods are capable of handling noisy, incomplete data sets
• Bayesian methods are commonly in use today

Bayesian Approach To Probability and Statistics

• Classical Probability: A physical property of the world (e.g., a fair coin comes up heads 50% of the time). "True" probability.

• Bayesian Probability: A person's degree of belief in event X. Personal probability.

• Unlike classical probabilities, Bayesian probabilities benefit from, but do not require, repeated trials; they can focus on just the next event, e.g., the probability that the Seawolves win their next game.

Bayes Rule

Product Rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

Equating the two right-hand sides and dividing, i.e.:

P(A|B) = P(B|A) P(A) / P(B)

Applied to classification:

P(Class|evidence) = P(evidence|Class) P(Class) / P(evidence)

All classification methods can be seen as estimates of Bayes' Rule, with different techniques to estimate P(evidence|Class).

Simple Bayes Rule Example

Probability your computer has a virus, V: P(V) = 1/1000.

If infected, the probability of a crash that day: P(C|V) = 4/5 = 0.8.

Probability your computer crashes in one day: P(C) = 1/10.

P(V|C) = P(C|V) P(V) / P(C) = (0.8)(0.001) / (0.1) = 0.008

Even though a crash is a strong indicator of a virus, we expect only 8/1000 crashes to be caused by viruses. Why not compute P(V|C) from direct evidence? Causal vs. diagnostic knowledge (consider what happens if P(C) suddenly drops).
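The same arithmetic as a minimal Python sketch (the variable names are mine):

```python
# Bayes rule for the virus example: P(V|C) = P(C|V) P(V) / P(C)
p_v = 1 / 1000       # prior probability of a virus
p_c_given_v = 0.8    # probability of a crash that day, given a virus
p_c = 1 / 10         # overall probability of a crash in one day

p_v_given_c = p_c_given_v * p_v / p_c
print(f"P(V|C) = {p_v_given_c:.3f}")  # 0.008, i.e. 8 of 1000 crashes
```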

Bayesian Classifiers

P(Class|evidence) = P(evidence|Class) P(Class) / P(evidence)

If we're selecting the single most likely class, we only need to find the class that maximizes P(evidence|Class) P(Class).

The hard part is estimating P(evidence|Class). The evidence e typically consists of a set of observations:

e = (e_1, e_2, ..., e_n)

The usual simplifying assumption is conditional independence:

P(e|C) = Π_{i=1..n} P(e_i|C)

Bayesian Classifier Example

Probability     P(C)   P(crashes|C)   P(diskfull|C)
C = Virus       0.4    0.1            0.6
C = Bad Disk    0.6    0.2            0.1

Given a case where the disk is full and computer crashes, the classifier chooses Virus as most likely since (0.4)(0.1)(0.6) > (0.6)(0.2)(0.1).
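That computation is just a product over the rows of the table; a minimal naive-Bayes sketch in Python (the dictionary layout and function name are mine):

```python
# Naive Bayes: score each class by P(C) * prod_i P(e_i|C), take the argmax.
priors = {"Virus": 0.4, "Bad Disk": 0.6}
likelihoods = {
    "Virus":    {"crashes": 0.1, "diskfull": 0.6},
    "Bad Disk": {"crashes": 0.2, "diskfull": 0.1},
}

def classify(evidence):
    scores = dict(priors)               # start from the class priors
    for c in scores:
        for e in evidence:
            scores[c] *= likelihoods[c][e]
    return max(scores, key=scores.get), scores

print(classify(["crashes", "diskfull"]))
# ('Virus', ...): Virus wins since 0.024 > 0.012
```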

Beyond Conditional Independence

• Linear classifier (figure: classes C1 and C2)
• Include second-order dependencies, i.e., pairwise combinations of variables via joint probabilities:

P(e_1, e_2 | c) = P(e_1|c) P(e_2|c) [1 + correction factor]

• The correction factor is difficult to compute: on the order of n² joint probabilities to consider

Belief Networks

• A DAG that represents the dependencies between variables and specifies the joint probability distribution
• Random variables make up the nodes
• Directed links represent causal direct influences
• Each node has a conditional probability table quantifying the effects from its parents
• No directed cycles

Burglary Alarm Example

Burglary: P(B) = 0.001        Earthquake: P(E) = 0.002

Alarm:
  B  E | P(A)
  T  T | 0.95
  T  F | 0.94
  F  T | 0.29
  F  F | 0.001

John Calls:          Mary Calls:
  A | P(J)             A | P(M)
  T | 0.90             T | 0.70
  F | 0.05             F | 0.01

Sample Bayesian Network

(Figure: a sample Bayesian network.)

Using The Belief Network

(Figure: the burglary network again, with the same CPTs as above.)

The network specifies the full joint distribution:

P(x_1, x_2, ..., x_n) = Π_{i=1..n} P(x_i | Parents(x_i))

Probability of alarm, no burglary or earthquake, both John and Mary calling:

P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B ∧ ¬E) P(¬B) P(¬E)
                       = (0.90)(0.70)(0.001)(0.999)(0.998) ≈ 0.00062
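The chain-rule product is easy to mechanize; a minimal sketch with the slide's CPTs hard-coded (the function p_joint and its argument names are mine):

```python
# Joint probability from the network: P(x1..xn) = prod_i P(xi | Parents(xi)).
def p_joint(b, e, a, j, m):
    p_b = 0.001 if b else 0.999
    p_e = 0.002 if e else 0.998
    pa = {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001}[(b, e)]
    p_a = pa if a else 1 - pa
    p_j = (0.90 if a else 0.05) if j else (0.10 if a else 0.95)
    p_m = (0.70 if a else 0.01) if m else (0.30 if a else 0.99)
    return p_b * p_e * p_a * p_j * p_m

# Alarm, no burglary or earthquake, both John and Mary call:
print(p_joint(b=False, e=False, a=True, j=True, m=True))  # ~0.00062
```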

Belief Computations

• Two types; both are NP-hard
• Belief Revision
  – Models explanatory/diagnostic tasks
  – Given evidence, what is the most likely hypothesis to explain the evidence?
  – Also called abductive reasoning
• Belief Updating
  – Queries
  – Given evidence, what is the probability of some other random variable occurring?

Belief Revision

• Given some evidence variables, find the state of all other variables that maximizes the joint probability.

• E.g.: We know John calls, but not Mary. What is the most likely state of the network? Only consider assignments where J = T and M = F, and maximize. Best:

P(¬B) P(¬E) P(¬A|¬B ∧ ¬E) P(J|¬A) P(¬M|¬A) = (0.999)(0.998)(0.999)(0.05)(0.99) ≈ 0.049
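With a network this small, belief revision can be done by brute force: fix J = T and M = F and enumerate every assignment of B, E, A, keeping the one with the largest joint probability. A minimal sketch (it assumes the p_joint function from the earlier sketch is in scope):

```python
# Belief revision (MAP) by enumeration over the three free variables;
# assumes p_joint from the previous sketch.
from itertools import product

best = max(product([True, False], repeat=3),
           key=lambda s: p_joint(*s, j=True, m=False))
print(best, p_joint(*best, j=True, m=False))
# (False, False, False) with probability ~0.049: no burglary, no earthquake,
# no alarm; John's call was a false alarm.
```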

Belief Updating

• Causal Inferences: from cause to effect (evidence above the query)
• Diagnostic Inferences: from effect to cause (evidence below the query)
• Intercausal Inferences: between causes of a common effect
• Mixed Inferences: combinations of the above

Causal Inferences

Inference from cause to effect. E.g., given a burglary, what is P(J|B)?

P(A|B) = P(E)(0.95) + P(¬E)(0.94)
       = (0.002)(0.95) + (0.998)(0.94) ≈ 0.94

P(J|B) = P(A|B)(0.90) + P(¬A|B)(0.05)
       = (0.94)(0.90) + (0.06)(0.05) ≈ 0.85

P(M|B) ≈ 0.66 via similar calculations.
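The same sums, sketched in Python (the names are mine):

```python
# Causal inference: sum out E to get P(A|B), then sum out A.
p_e = 0.002
p_a_given_b = p_e * 0.95 + (1 - p_e) * 0.94              # ~0.94
p_j_given_b = p_a_given_b * 0.90 + (1 - p_a_given_b) * 0.05
p_m_given_b = p_a_given_b * 0.70 + (1 - p_a_given_b) * 0.01
print(round(p_j_given_b, 2), round(p_m_given_b, 2))      # 0.85 0.66
```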

Diagnostic Inferences

From effect to cause. E.g., given that John calls, what is P(B|J)?

P(B|J) = P(J|B) P(B) / P(J)

What is P(J)? We need P(A) first:

P(A) = P(B)P(E)(0.95) + P(¬B)P(E)(0.29) + P(B)P(¬E)(0.94) + P(¬B)P(¬E)(0.001)
     = (0.001)(0.002)(0.95) + (0.999)(0.002)(0.29) + (0.001)(0.998)(0.94) + (0.999)(0.998)(0.001)
     = 0.002517

P(J) = P(A)(0.90) + P(¬A)(0.05) = (0.002517)(0.90) + (0.9975)(0.05) ≈ 0.052

P(B|J) = (0.85)(0.001) / (0.052) ≈ 0.016

Many false positives.
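The hand derivation can be checked by summing the joint over the hidden variables; a minimal sketch (it assumes p_joint from the earlier sketch is in scope):

```python
# Diagnostic inference: P(B|J) = P(B,J) / P(J), each term a sum of the
# joint over the hidden variables E, A, M; assumes p_joint.
from itertools import product

def p_b_and_j(b):
    return sum(p_joint(b, e, a, True, m)
               for e, a, m in product([True, False], repeat=3))

p_b_given_j = p_b_and_j(True) / (p_b_and_j(True) + p_b_and_j(False))
print(round(p_b_given_j, 3))  # ~0.016: many false positives
```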

Intercausal Inferences

Also called "explaining away."

Given an alarm, P(B|A) = 0.37. But if we add the evidence that the earthquake is true, then P(B|A ∧ E) = 0.003.

Even though B and E are independent a priori, once their common effect (the alarm) is observed, evidence for one cause can make the other more or less likely.

Mixed Inferences

Simultaneous intercausal and diagnostic inference.

E.g., if John calls and Earthquake is false:

P(A|J ∧ ¬E) ≈ 0.03
P(B|J ∧ ¬E) ≈ 0.016

Computing these values exactly is somewhat complicated.
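These values, too, fall out of enumeration over the joint; a minimal sketch (again assuming p_joint from the earlier sketch):

```python
# Mixed inference with evidence J = True, E = False: sum the joint over
# the unconstrained variables, then normalize; assumes p_joint.
from itertools import product

def p_with_evidence(b=None, a=None):
    bvals = [b] if b is not None else [True, False]
    avals = [a] if a is not None else [True, False]
    return sum(p_joint(bv, False, av, True, m)
               for bv in bvals for av in avals for m in (True, False))

print(round(p_with_evidence(a=True) / p_with_evidence(), 3))  # P(A|J,~E) ~0.034
print(round(p_with_evidence(b=True) / p_with_evidence(), 3))  # P(B|J,~E) ~0.016
```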

Exact Computation: The Polytree Algorithm

• Judea Pearl, 1982
• Only works on singly connected networks: at most one undirected path between any two nodes
• A backward-chaining, message-passing algorithm for computing posterior probabilities for a query node X
  – Compute causal support for X: evidence variables "above" X
  – Compute evidential support for X: evidence variables "below" X

Polytree Computation

(Figure: query node X with parents U(1), ..., U(m) and children Y(1), ..., Y(n); each child Y(i) has additional parents Z(i,j). E_X^+ denotes the evidence connected to X through its parents, E_X^- the evidence connected through its children.)

P(X|E) = α P(E_X^- | X) P(X | E_X^+)

Causal support (evidence from above):

P(X | E_X^+) = Σ_u P(X|u) Π_i P(u_i | E_{U_i \ X})

Evidential support (evidence from below):

P(E_X^- | X) = β Π_i Σ_{y_i} P(E_{Y_i}^- | y_i) Σ_{z_i} P(y_i | X, z_i) Π_j P(z_ij | E_{Z_ij \ Y_i})

(α and β are normalizing constants.) The algorithm is recursive: a chain of messages passed between nodes.

Other Query Methods

• Exact Algorithms
  – Clustering: cluster nodes to make a single cluster and message-pass along that cluster
  – Symbolic Probabilistic Inference: uses d-separation to find expressions to combine
• Approximate Algorithms (see the sketch below)
  – Select a sampling distribution, conduct trials sampling from the roots to the evidence nodes, accumulating a weight for each node; still tractable for dense networks
  – Forward Simulation
  – Stochastic Simulation
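As one concrete instance of the sampling idea above, here is a minimal likelihood-weighting sketch for P(B|J) on the burglary network (the naming is mine, and likelihood weighting is one particular stochastic-simulation scheme, not necessarily the exact variant the slide intends): sample the unobserved nodes from the roots down, and weight each trial by the probability of the observed evidence.

```python
import random

# Likelihood weighting for P(B | J=True): sample B, E, A from their CPTs,
# weight each trial by P(J=True | A); M is unobserved and can be ignored.
def weighted_sample():
    b = random.random() < 0.001
    e = random.random() < 0.002
    p_a = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}[(b, e)]
    a = random.random() < p_a
    return b, (0.90 if a else 0.05)    # weight = P(J=True | a)

def estimate(trials=200_000):
    num = den = 0.0
    for _ in range(trials):
        b, w = weighted_sample()
        den += w
        num += w * b
    return num / den

print(estimate())  # fluctuates around P(B|J) ~0.016
```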

Summary

• Bayesian methods provide a sound theory and framework for implementing classifiers
• Bayesian networks are a natural way to represent conditional independence information: qualitative information in the links, quantitative information in the tables
• Computing exact values is NP-hard; it is typical to make simplifying assumptions or to use approximate methods

• Many Bayesian tools and systems exist

References

• Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall.

• Weiss, S. and Kulikowski, C. (1991). Computer Systems That Learn. Morgan Kaufmann.

• Heckerman, D. (1996). A Tutorial on Learning with Bayesian Networks. Microsoft Technical Report MSR-TR-95-06.

• Internet resources on Bayesian networks and machine learning: http://www.cs.orst.edu/~wangxi/resource.html