
Probabilistic Reasoning Systems - Introduction to Bayesian Networks -

CS570 AI Team #7: T. M. Kim, J. B. Hur, H. Y. Park Speaker: Kim, Tae Min

Outline

 Introduction to graphical models
 Review: Uncertainty and Probability
 Representing Knowledge in an Uncertain Domain
 Semantics of Bayesian Networks
 Inference in Bayesian Networks
 Summary
 Practice Questions
 Useful links on the WWW

What is a graphical model?

 A graphical model is a way of representing probabilistic relationships between random variables.

 Variables are represented by nodes; conditional (in)dependencies are represented by (missing) edges.
 Undirected edges simply encode correlations between variables (Markov random field, or undirected graphical model).
 Directed edges encode causal relationships (Bayesian network, or directed graphical model).

 [Figure: a small directed model over Weather, Cavity, Toothache, and Catch.]

Significance of graphical model

 “Graphical models are a marriage between probability theory and graph theory.
 Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data.
 The graph-theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly interacting sets of variables, and a data structure that lends itself naturally to the design of efficient general-purpose algorithms.
 They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering -- uncertainty and complexity.

 In particular, they are playing an increasingly important role in the design and analysis of machine learning algorithms.

Significance of graphical model

 Fundamental to the idea of a graphical model is the notion of modularity -- a complex system is built by combining simpler parts.

 Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism -- examples include mixture models, factor analysis, hidden Markov models, and Kalman filters.
 The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism.
 This view has many advantages -- in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely.
 Moreover, the graphical model formalism provides a natural framework for the design of new systems.” --- Michael Jordan, 1998.

We already know many graphical models: (Picture by Zoubin Ghahramani and Sam Roweis)


Taxonomy of graphical models

Review: Probabilistic Independence

 Joint probability: P(X, Y) is the probability of the joint event X ∧ Y.

 Independence:
   X ⊥ Y  ⟺  P(X, Y) = P(X) P(Y)

 Conditional independence:
   X ⊥ Y | Z  ⟺  P(X | Y, Z) = P(X | Z)

Review: Basic Formulas for Probabilities

 Bayes' Rule:   P(A | B) = P(B | A) P(A) / P(B)
 Product Rule:  P(A, B) = P(A | B) P(B)
 Chain Rule (repeatedly using the Product Rule):
   P(x_1, ..., x_n) = P(x_n | x_1, ..., x_{n-1}) P(x_1, ..., x_{n-1})
                    = P(x_n | x_1, ..., x_{n-1}) P(x_{n-1} | x_1, ..., x_{n-2}) ... P(x_2 | x_1) P(x_1)
                    = ∏_{i=1}^{n} P(x_i | x_1, ..., x_{i-1})

 Theorem of Total Probability
 Suppose events A_1, A_2, ..., A_n are mutually exclusive and exhaustive (∑_i P(A_i) = 1). Then
   P(B) = ∑_{i=1}^{n} P(B | A_i) P(A_i)   : conditioning
   P(B) = ∑_{i=1}^{n} P(B, A_i)           : marginalization

Uncertain Knowledge Representation

 By using a data structure called a Bayesian network (also known as a belief network, probabilistic network, causal network, or knowledge map), we can represent the dependences between variables and give a concise specification of the joint probability distribution.

 Definition of a Bayesian network: topology of the network + CPTs
 A set of random variables makes up the nodes of the network.
 A set of directed links or arrows connects pairs of nodes. Example: X -> Y means X has a direct influence on Y.
 Each node has a conditional probability table (CPT) that quantifies the effects that the parents have on the node. The parents of a node are all those nodes that have arrows pointing to it.
 The graph has no directed cycles (hence it is a directed acyclic graph, or DAG).

Representing Knowledge Example

 [Figure: the sprinkler network. Cloudy (C) is the parent of Sprinkler (S) and Rain (R); Sprinkler and Rain are the parents of WetGrass (W).]

 Cloudy:               Sprinkler:                  Rain:
   P(C=F)  P(C=T)        C   P(S=F)  P(S=T)          C   P(R=F)  P(R=T)
   0.5     0.5           F   0.5     0.5             F   0.8     0.2
                         T   0.9     0.1             T   0.2     0.8

 WetGrass:
   S  R   P(W=F)  P(W=T)
   F  F   1.0     0.0
   T  F   0.1     0.9
   F  T   0.1     0.9
   T  T   0.01    0.99

Conditional Probability Table (CPT)

 Once we have the topology of the network, a conditional probability table (CPT) must be specified for each node.
 Example of the CPT for the variable WetGrass:

   S  R   P(W=F)  P(W=T)
   F  F   1.0     0.0
   T  F   0.1     0.9
   F  T   0.1     0.9
   T  T   0.01    0.99

 Each row in the table contains the conditional probability of each node value for a conditioning case. A conditioning case is a possible combination of values for the parent nodes.
 Each row must sum to 1, because the entries represent an exhaustive set of cases for the variable.
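
As an aside (not from the slides), the CPTs above can be written down directly as Python dictionaries; the layout and names below are illustrative.

```python
# A minimal sketch (not from the slides): the sprinkler CPTs as Python
# dictionaries. True/False stand for T/F; each row stores both column values.
cloudy_prior  = (0.5, 0.5)                                  # (P(C=F), P(C=T))
sprinkler_cpt = {False: (0.5, 0.5), True: (0.9, 0.1)}       # C -> (P(S=F|C), P(S=T|C))
rain_cpt      = {False: (0.8, 0.2), True: (0.2, 0.8)}       # C -> (P(R=F|C), P(R=T|C))
wetgrass_cpt  = {(False, False): (1.0, 0.0),                # (S, R) -> (P(W=F|S,R), P(W=T|S,R))
                 (True,  False): (0.1, 0.9),
                 (False, True):  (0.1, 0.9),
                 (True,  True):  (0.01, 0.99)}

# Every row is a distribution over the child variable, so it must sum to 1.
for cpt in (sprinkler_cpt, rain_cpt, wetgrass_cpt):
    for row in cpt.values():
        assert abs(sum(row) - 1.0) < 1e-9
assert abs(sum(cloudy_prior) - 1.0) < 1e-9
```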

An Example

 Pr(W=1) = Σ_{c,s,r} Pr(C=c, S=s, R=r, W=1) = 0.6471

 Pr(S=1 | W=1) = Pr(S=1, W=1) / Pr(W=1) = 0.2781 / 0.6471 = 0.430
 Pr(R=1 | W=1) = Pr(R=1, W=1) / Pr(W=1) = 0.4581 / 0.6471 = 0.708

 (Here Pr(S=1, W=1) = Σ_{c,r} Pr(C=c, S=1, R=r, W=1), and similarly for Pr(R=1, W=1).)

 Explaining away
 In the above example, notice that the two causes "compete" to "explain" the observed data. Hence S and R become dependent given that their common child, W, is observed, even though they are independent given C alone.
 For example, suppose the grass is wet, but that we also know that it is raining. Then the posterior probability that the sprinkler is on goes down:
 • Pr(S=1 | W=1, R=1) = 0.1945
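
The posteriors above can be reproduced by brute-force enumeration of the joint; the following is a minimal sketch (not from the slides), using the sprinkler CPTs given earlier.

```python
# Enumerate the joint P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S,R) and
# condition on the evidence by summing the matching worlds.
from itertools import product

P_C = {True: 0.5, False: 0.5}
P_S = {True: 0.1, False: 0.5}          # P(S=T | C)
P_R = {True: 0.8, False: 0.2}          # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(W=T | S, R)

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) from the factored form."""
    p = P_C[c]
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return p

def prob(query, **evidence):
    """P(query variable = True | evidence) by summing over the joint."""
    num = den = 0.0
    for c, s, r, w in product([False, True], repeat=4):
        world = dict(C=c, S=s, R=r, W=w)
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(c, s, r, w)
            den += p
            if world[query]:
                num += p
    return num / den

print(prob('S', W=True))          # 0.4298 (= 0.2781 / 0.6471)
print(prob('R', W=True))          # 0.7079 (= 0.4581 / 0.6471)
print(prob('S', W=True, R=True))  # 0.1945 (explaining away)
```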

Typical example of BN

 [Figure: a typical Bayesian network with its CPTs.]
 The network represents the joint distribution compactly: since each CPT row must sum to 1, only one entry per row (e.g. P(X=T | parents)) needs to be stored.

Conditional independence in BN

 Chain rule:
   p(x_1, ..., x_n) = ∏_{i=1}^{n} p(x_i | x_1, ..., x_{i-1})

 In a Bayesian network, each factor conditions only on the node's parents π_i ⊆ {x_1, ..., x_{i-1}}:
   p(x_1, ..., x_n) = ∏_{i=1}^{n} p(x_i | x_{π_i})

 Example (for the six-node network X1, ..., X6 shown in the figure):
   Chain rule:
     p(x_{1:6}) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, ..., x_4) p(x_6 | x_1, ..., x_5)
   Using the graph structure:
     p(x_{1:6}) = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2) p(x_5 | x_3) p(x_6 | x_2, x_5)
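
To make the gain from the factorization concrete, here is a small sketch (not from the slides) that counts the free parameters needed for binary variables under an unrestricted joint versus under the graph factorization above.

```python
# Free parameters for binary variables: a node with k parents needs 2**k rows,
# each storing one number (the other entry follows from the row summing to 1).
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}

factored = sum(2 ** len(ps) for ps in parents.values())
full_joint = 2 ** len(parents) - 1

print(factored)    # 13 parameters with the BN factorization
print(full_joint)  # 63 parameters for an unrestricted joint over 6 binary variables
```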

Conditional independence in BN

 Question: is X ⊥ Z | Y?

 [Figure: an example network over the nodes X, Y, Z, A, B, C, D, E in which this question is posed; the machinery for answering it is developed on the following slides.]

Conditional independence in BN

 Def. A topological (total) ordering I of the graph G is an ordering in which all parents of node i occur earlier in I than i.
 Ex. For the six-node network, I = {1, 2, 3, 4, 5, 6} is a total ordering; I = {6, 2, 5, 3, 1, 4} is not a total ordering.

 Def. Non-descendants: the non-descendants ν_i of node i are the indices that come before i in the total ordering I, other than the parents π_i of i:
   ν_i = { j : j comes before i in I, j ∉ π_i }

 Markov property: each node is conditionally independent of its non-descendants given its parents,
   X_i ⊥ X_{ν_i} | X_{π_i}

 Ex. With I = {1, 2, 3, 4, 5, 6} in the six-node network:
   X_4 ⊥ {X_1, X_3} | X_2
   X_5 ⊥ {X_1, X_2, X_4} | X_3
   X_6 ⊥ {X_1, X_3, X_4} | {X_2, X_5}
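
A small sketch (not from the slides) that computes the parent and non-descendant sets used in the Markov property, assuming the six-node structure and ordering above.

```python
# Non-descendants nu_i: nodes that come before i in the topological ordering,
# excluding i's parents. (Structure and ordering are the ones on the slide.)
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}
ordering = [1, 2, 3, 4, 5, 6]

def non_descendants(i):
    earlier = ordering[:ordering.index(i)]
    return [j for j in earlier if j not in parents[i]]

for i in ordering:
    # Markov property: X_i is independent of these nodes given its parents.
    print(i, parents[i], non_descendants(i))
# e.g. node 6: parents [2, 5], non-descendants [1, 3, 4]
```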

Conditional independence in BN

 To verify X_4 ⊥ {X_1, X_3} | X_2, marginalize out x_5 and x_6:

   p(x_{1:4}) = Σ_{x_5, x_6} p(x_{1:6})
              = Σ_{x_5, x_6} p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2) p(x_5 | x_3) p(x_6 | x_2, x_5)
              = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2) Σ_{x_5} p(x_5 | x_3) Σ_{x_6} p(x_6 | x_2, x_5)
              = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2)

   p(x_{1:3}) = Σ_{x_4} p(x_{1:4}) = p(x_1) p(x_2 | x_1) p(x_3 | x_1)

   p(x_4 | x_{1:3}) = p(x_{1:4}) / p(x_{1:3}) = p(x_4 | x_2)

 Hence X_4 ⊥ {X_1, X_3} | X_2.

Constructing a BN

 General procedure for incremental network construction:
 choose the set of relevant variables X_i that describe the domain
 choose an ordering for the variables
 while there are variables left:
 • pick a variable X_i and add a node to the network for it
 • set Parents(X_i) by testing its conditional independence against the nodes already in the network
 • define the conditional probability table for X_i
 (☞ learning the CPTs from data is covered later)

 Example: suppose we choose the ordering B, E, A, J, M (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls):
 P(E | B) = P(E)?  Yes
 P(A | B) = P(A)?  P(A | E) = P(A)?  No
 P(J | A, B, E) = P(J | A)?  Yes    P(J | A) = P(J)?  No
 P(M | A, J) = P(M | A)?  Yes       P(M | A) = P(M)?  No

Constructing a Bad BN Example

 Suppose we choose the ordering M, J, A, B, E (MaryCalls, JohnCalls, Alarm, Burglary, Earthquake):
 P(J | M) = P(J)?  No
 P(A | J, M) = P(A | J)?  P(A | J, M) = P(A)?  No
 P(B | A, J, M) = P(B | A)?  Yes    P(B | A) = P(B)?  No
 P(E | A, J, M) = P(E | A)?  Yes
 P(E | A, B) = P(E | A)?  No        P(E | A) = P(E)?  No

Bayes ball algorithm

 A set of nodes E d-separates two sets of nodes X and Y if every undirected path from a node in X to a node in Y is blocked given E.
 If every undirected path from a node in X to a node in Y is d-separated by E, then X and Y are conditionally independent given E.
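
The d-separation test can be implemented as a reachability search over (node, direction) pairs, which is essentially what the Bayes ball rules on the following slides encode. The sketch below (not from the slides; a standard algorithm) assumes the DAG is given as a dictionary mapping each node to its list of parents.

```python
from collections import deque

def d_separated(parents, X, Y, Z):
    """True iff the node sets X and Y are d-separated given Z in the DAG
    described by `parents` (dict: node -> list of parent nodes)."""
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # Nodes that are in Z or have a descendant in Z: an observed collider
    # (or an ancestor of one) lets the "ball" pass back up to parents.
    anc_of_Z = set()
    stack = list(Z)
    while stack:
        n = stack.pop()
        if n not in anc_of_Z:
            anc_of_Z.add(n)
            stack.extend(parents[n])

    # Search over (direction, node): 'up' = arrived from a child,
    # 'down' = arrived from a parent.
    queue = deque(('up', x) for x in X)
    visited = set()
    while queue:
        direction, n = queue.popleft()
        if (direction, n) in visited:
            continue
        visited.add((direction, n))
        if n in Y and n not in Z:
            return False                       # found an active path to Y
        if direction == 'up' and n not in Z:   # chain/fork through n
            for p in parents[n]:
                queue.append(('up', p))
            for c in children[n]:
                queue.append(('down', c))
        elif direction == 'down':
            if n not in Z:                     # chain continues downward
                for c in children[n]:
                    queue.append(('down', c))
            if n in anc_of_Z:                  # collider activated by Z
                for p in parents[n]:
                    queue.append(('up', p))
    return True

# The six-node example network used earlier:
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}
print(d_separated(parents, {1}, {6}, {2, 3}))   # True  (X1 ⊥ X6 | {X2, X3})
print(d_separated(parents, {2}, {3}, {1, 6}))   # False (observing X6 couples X2 and X3)
```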

Three canonical GM’s

 Case I. Serial connection (Markov chain):  X -> Y -> Z

   p(x, y, z) = p(x) p(y | x) p(z | y)

   p(z | x, y) = p(x, y, z) / p(x, y)
               = p(x) p(y | x) p(z | y) / ( p(x) p(y | x) )
               = p(z | y)

 Hence X ⊥ Z | Y: once Y is observed, X gives no further information about Z.

Three canonical GM’s

 Case II. Diverging connection:  X <- Y -> Z

   p(x, y, z) = p(y) p(x | y) p(z | y),  so  X ⊥ Z | Y.

 Example: Age is a common cause of Shoe Size and Gray Hair; given Age, the two are independent.

Three canonical GM’s

 Case III. Converging connection:  X -> Y <- Z   (explaining away)

   p(x, y, z) = p(x) p(z) p(y | x, z)

 X and Z are marginally independent (X ⊥ Z), but they become dependent once Y (or any descendant of Y) is observed.
 Examples: Rain -> Lawn wet <- Sprinkler;  Burglar -> Alarm <- Earthquake.

Bayes ball algorithm

 Bayes ball bouncing rules (see figures):
 Serial connection: the ball passes through the middle node when it is hidden, and is blocked when it is observed.
 Diverging connection: the ball passes through the common parent when it is hidden, and is blocked when it is observed.

Bayes ball algorithm

 Converging connection: the ball is blocked at the common child when it is hidden, and passes between the parents when the child (or one of its descendants) is observed.
 Boundary conditions: special rules for balls arriving at nodes with no further links (see figure).

Bayes ball algorithm: Summary

Examples of Bayes Ball Algorithm

 Markov chain:  X1 -> X2 -> X3 -> X4 -> X5

   {X1, X2} ⊥ {X4, X5} | X3
   X1 ⊥ X5 | X3

 Six-node network (X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X5, {X2, X5} -> X6):

   X4 ⊥ {X1, X3} | X2
   X4 ⊥ {X1, X3, X5, X6} | X2

Examples of Bayes Ball Algorithm

 Same six-node network:

   X1 ⊥ X6 | {X2, X3}
   X4 ⊥ X5 | {X2, X3}

Examples of Bayes Ball Algorithm

 Same six-node network, but now the common child X6 is in the conditioning set:

   X2 and X3 are not d-separated given {X1, X6}
   X4 and X5 are not d-separated given {X1, X6}

 (Observing X6 activates the converging connection X2 -> X6 <- X5, so the paths are no longer blocked.)

Examples of Bayes Ball Algorithm

 [Figure: a larger example network over the nodes A, B, C, D, E, X, Y, Z; the Bayes ball rules are applied to decide whether X and Y are conditionally independent given the observed nodes.]

Non-descendants

 [Figure: the non-descendants of a node X in an example DAG, as used in the Markov property X ⊥ (non-descendants of X) | parents(X).]

Markov blanket

 Markov blanket: parents + children + children's parents
 A node is conditionally independent of every other node in the network given its Markov blanket.
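
A small sketch (not from the slides) that computes a node's Markov blanket from the parent lists, using the sprinkler network as the example.

```python
def markov_blanket(parents, node):
    """Parents + children + children's other parents of `node`."""
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for c in children:
        blanket |= set(parents[c])
    blanket.discard(node)
    return blanket

# Sprinkler network: Cloudy -> {Sprinkler, Rain}, {Sprinkler, Rain} -> WetGrass
parents = {'C': [], 'S': ['C'], 'R': ['C'], 'W': ['S', 'R']}
print(markov_blanket(parents, 'R'))   # {'C', 'S', 'W'}
```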

The four types of inferences

 [Figure: the four inference patterns -- diagnostic, causal, intercausal, and mixed.]
 Note that these are just terms used to describe the types of inference in various systems. Using a Bayesian network, we don't need to distinguish the type of reasoning it performs; i.e., it treats everything as mixed.

Other usage of the 4 patterns

 Making decisions based on probabilities in the network and on the agent's utilities.
 Deciding which additional evidence variables should be observed in order to gain useful information.
 Performing sensitivity analysis to understand which aspects of the model have the greatest impact on the probabilities of the query variables (and therefore must be accurate).
 Explaining the results of probabilistic inference to the user.

Exact inference

 Inference by enumeration (with the alarm example):

   P(B | j, m) = α P(B, j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)

 Variable elimination by the distributive law:

   P(B | j, m) = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)

 In general, for query Q, evidence E, and hidden variables H:

   P(Q | E) = Σ_h P(Q, h | E) = Σ_h P(Q | h, E) P(h | E)
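
As a concrete sketch (not from the slides), the enumeration above can be coded directly. The CPT numbers below are the usual textbook values for the burglary/alarm network and are assumed here purely for illustration.

```python
from itertools import product

# Assumed (textbook-style) CPTs for the burglary/alarm network.
P_B = 0.001                      # P(Burglary)
P_E = 0.002                      # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}  # P(JohnCalls | A)
P_M = {True: 0.70, False: 0.01}  # P(MaryCalls | A)

def joint(b, e, a, j, m):
    p = P_B if b else 1 - P_B
    p *= P_E if e else 1 - P_E
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# P(B | j, m) = alpha * sum_e sum_a P(B, e, a, j, m)
unnormalized = {b: sum(joint(b, e, a, True, True)
                       for e, a in product([False, True], repeat=2))
                for b in (True, False)}
alpha = 1.0 / sum(unnormalized.values())
print(alpha * unnormalized[True])   # about 0.284 with these assumed numbers
```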

Exact inference

C S R  Another example of variable elimination Pr(

W

w

) = = 

c

 

c s

Pr(

s



C r r

Pr(

C

Pr(

C

  Pr(

W

w

) 

c

c c

) Pr( Pr(

C

 

c

) 

s c

) 

s

Pr( Pr(

S S

 

S

    

c

)  Pr(

W

w

)  

c

Pr(

C T T

 Complexity of exact inference 

s

 

r

w

)

c

) Pr(

R

c

)

r

Pr(

R

Pr(

R

Pr(

S

     

c

) Pr(

W

c

) Pr(

W

) Pr(

W

   W    

r

) 

r

) 

r

)

O(n)

for polytree(singly connected network) there exist at most one undirectd path between any two nodes in the networks (e.g. alarm example)  Multiply connected network : exponential time (e.g. wet grass ex.)

36
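
A minimal sketch (not from the slides) of the elimination order shown above, using the sprinkler CPTs given earlier; pushing the sums inside the product avoids building the full joint.

```python
P_C = {True: 0.5, False: 0.5}    # P(C)
P_S = {True: 0.1, False: 0.5}    # P(S=T | C)
P_R = {True: 0.8, False: 0.2}    # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(W=T | S, R)

def bern(p_true, value):
    """P(X = value) for a binary variable with P(X = True) = p_true."""
    return p_true if value else 1 - p_true

vals = (False, True)
p_w_true = 0.0
for c in vals:                                   # outermost sum over C
    inner_s = 0.0
    for s in vals:                               # then S
        inner_r = sum(bern(P_R[c], r) * P_W[(s, r)] for r in vals)  # innermost R
        inner_s += bern(P_S[c], s) * inner_r
    p_w_true += P_C[c] * inner_s

print(p_w_true)   # 0.6471, matching Pr(W = 1) on the earlier slide
```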

Exact inference

 Clustering algorithm
 Computing posterior probabilities for all the variables costs O(n^2) even in a polytree.
 Instead, cluster the network into a polytree and apply a constraint-propagation-style algorithm (refer to Chap. 5, Constraint Satisfaction Problems, in the text).
 Widely used because the posteriors for all variables can then be computed in O(n).

 Example: Sprinkler and Rain are merged into a single node Spr+Rain with CPT P(S, R | C):

   C    P(S,R=FF)  P(S,R=FT)  P(S,R=TF)  P(S,R=TT)
   F    0.40       0.10       0.40       0.10
   T    0.18       0.72       0.02       0.08

Hybrid (discrete + continuous) BNs

 Option 1: Discretization -- possibly large errors and large CPTs.
 Option 2: Finitely parameterized canonical forms.

 Continuous variable with discrete + continuous parents (e.g. Cost in the Subsidy/Harvest/Cost/Buys network): use a linear Gaussian for each value of the discrete parent,

   P(C = c | H = h, S = true) = N(a_t h + b_t, σ_t^2)(c)
                              = (1 / (σ_t √(2π))) exp( -1/2 ((c - (a_t h + b_t)) / σ_t)^2 )

 Discrete variable with continuous parents (e.g. Buys):
 • Probit (cumulative normal CDF): more realistic, but difficult to manage.
 • Logit (sigmoid function): practical because of its simple derivative.
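
A small sketch (not from the slides) of these two canonical forms; the parameter values (a_t, b_t, σ_t, and the logit parameters) are placeholders, not values from the slides.

```python
import math

def linear_gaussian(c, h, a_t=1.0, b_t=0.0, sigma_t=1.0):
    """Density of Cost = c given Harvest = h and Subsidy = true:
    N(a_t * h + b_t, sigma_t^2) evaluated at c. Parameters are illustrative."""
    mu = a_t * h + b_t
    return math.exp(-0.5 * ((c - mu) / sigma_t) ** 2) / (sigma_t * math.sqrt(2 * math.pi))

def logit_cpd(cost, mu=5.0, scale=1.0):
    """P(Buys = true | Cost = cost): a sigmoid that decreases with cost.
    mu and scale are illustrative parameters."""
    return 1.0 / (1.0 + math.exp((cost - mu) / scale))

print(linear_gaussian(c=3.0, h=2.5))
print(logit_cpd(cost=4.0))
```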

Simple BNs

 Notation
 Circles denote continuous random variables; squares denote discrete random variables.
 Clear (unshaded) nodes are hidden; shaded nodes are observed.

 Examples
 [Figures: principal component analysis / factor analysis (hidden factors X1, ..., Xn with observed children Y1, ..., Ym) and a mixture of factor analyzers (an additional discrete mixture indicator Q).]

Temporal(Dynamic) models

 Hidden Markov models (HMMs) and autoregressive HMMs: a chain of discrete hidden states Q1, Q2, ... with observations Y1, Y2, ...

 Linear dynamical systems (LDSs) and the Kalman filter:
   x(t+1) = A x(t) + w(t),  w ~ N(0, Q),  x(0) ~ N(x_0, V_0)
   y(t)   = C x(t) + v(t),  v ~ N(0, R)

 [Figures: graphical models for the HMM, autoregressive HMM, Kalman filter, and AR(1) model.]

Approximate inference

 To avoid intractable exact computation:
 Monte Carlo methods
 Variational methods
 Loopy junction graphs and loopy belief propagation

Monte Carlo Methods

 Direct sampling
 Generate random samples according to the CPTs, then count the fraction of samples matching the query.

 Likelihood weighting
 To avoid rejecting samples, generate only events consistent with the evidence and calculate the likelihood weight of each event.
 The posterior is estimated by normalizing the summed weights for each value of the query variable.
 Example for P(R=T | S=T, W=T), starting with weight w = 1:
 • Sample from P(C) = (0.5, 0.5); suppose it returns true.
 • S is evidence, so w = w × P(S=T | C=T) = 0.1.
 • Sample from P(R | C=T) = (0.2, 0.8) over (F, T); suppose it returns true.
 • W is evidence, so w = w × P(W=T | S=T, R=T) = 0.1 × 0.99 = 0.099.
 • The sample [t, t, t, t] is recorded with weight 0.099.
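
A runnable sketch (not from the slides) of likelihood weighting for this query, using the sprinkler CPTs given earlier.

```python
import random

P_C = 0.5
P_S = {True: 0.1, False: 0.5}    # P(S=T | C)
P_R = {True: 0.8, False: 0.2}    # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(W=T | S, R)

def weighted_sample():
    """One event consistent with the evidence S=T, W=T, plus its weight."""
    w = 1.0
    c = random.random() < P_C             # sample C from its prior
    w *= P_S[c]                           # S is evidence (S=T): multiply weight
    r = random.random() < P_R[c]          # sample R given C
    w *= P_W[(True, r)]                   # W is evidence (W=T): multiply weight
    return r, w

def estimate(n=100_000):
    num = den = 0.0
    for _ in range(n):
        r, w = weighted_sample()
        den += w
        if r:
            num += w
    return num / den

print(estimate())   # about 0.32 (exact value is 0.0891 / 0.2781 = 0.3204)
```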

Monte Carlo Methods Examples

 [Figure: the sprinkler network and its CPTs, together with a table of sampled events. For direct sampling, complete events (C, S, R, W) are generated and the fraction with R=T estimates P(R=T). For likelihood weighting with evidence S=T, W=T, only C and R are sampled and each event carries a likelihood weight (e.g. 0.099); the weighted fraction with R=T estimates P(R=T | S=T, W=T).]

Markov Chain Monte Carlo methods

 MCMC (Gibbs sampling) algorithm
 Repeatedly resample one of the non-evidence variables Xi, conditioned on the current values of the variables in the Markov blanket of Xi.

 Example: to calculate P(R | S=T, W=T)
 Repeat the following steps, starting from the initial state [C S R W] = [T T F T]:
 Sample C from P(C | S=T, R=F) = P(C, S=T, R=F) / P(S=T, R=F); suppose it returns F.
 Sample R from P(R | C=F, S=T, W=T); suppose it returns T.
 The fraction of visited states with R=T estimates P(R=T | S=T, W=T).
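
A runnable sketch (not from the slides) of Gibbs sampling for the same query; the Markov-blanket conditionals are derived from the sprinkler CPTs given earlier.

```python
import random

P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}

def bern(p_true, value):
    return p_true if value else 1 - p_true

def sample_C(r):
    # P(C | S=T, R=r) is proportional to P(C) P(S=T | C) P(R=r | C)
    scores = {c: (P_C if c else 1 - P_C) * P_S[c] * bern(P_R[c], r)
              for c in (False, True)}
    return random.random() < scores[True] / (scores[True] + scores[False])

def sample_R(c):
    # P(R | C=c, S=T, W=T) is proportional to P(R | C=c) P(W=T | S=T, R)
    scores = {r: bern(P_R[c], r) * P_W[(True, r)] for r in (False, True)}
    return random.random() < scores[True] / (scores[True] + scores[False])

def gibbs(n=100_000, burn_in=1_000):
    c, r = True, False                 # initial state [C, R] = [T, F]
    count = 0
    for t in range(n + burn_in):
        c = sample_C(r)
        r = sample_R(c)
        if t >= burn_in and r:
            count += 1
    return count / n

print(gibbs())   # about 0.32, consistent with likelihood weighting
```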

Learning from data

 What must be learned depends on what is known:
 the parameters (CPT entries),
 the observability of the data, and
 the model structure, if it is unknown.

                        Complete data                                  Incomplete data
   Known structure      Statistical parametric estimation (MLE, MAP)   Parametric optimization; EM algorithm
   Unknown structure    Model selection; discrete optimization         Combined: EM + model selection
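
For the known-structure, complete-data cell, maximum-likelihood estimation of a CPT reduces to counting. A minimal sketch (not from the slides); the data below are invented purely for illustration.

```python
# Maximum-likelihood estimate of P(S=T | C) from fully observed (C, S) pairs.
data = [(True, False), (True, False), (True, True), (False, True),
        (False, False), (False, True), (True, False), (False, True)]

counts = {True: [0, 0], False: [0, 0]}     # per C value: [# of S=F, # of S=T]
for c, s in data:
    counts[c][int(s)] += 1

mle = {c: n_t / (n_f + n_t) for c, (n_f, n_t) in counts.items()}
print(mle)   # MLE of P(S=T | C=c) for each value c
```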

Summary

 Bayesian networks are a natural way to represent conditional independence information.
 A Bayesian network is a complete and compact representation of the joint probability distribution for the domain.
 Inference in Bayesian networks means computing the probability distribution of a set of query variables, given a set of evidence variables.
 Bayesian networks can reason causally, diagnostically, in mixed mode, or intercausally. No other uncertain-reasoning mechanism can handle all these modes.
 The complexity of Bayesian network inference depends on the network structure. In polytrees the computation time is linear in the size of the network.
 With large and highly connected graphical models, the number of computations required for exact inference blows up exponentially.
 Given the intractability of exact inference in large multiply connected networks, it is essential to consider approximate inference methods such as Monte Carlo methods and MCMC.

References

 http://www.cs.sunysb.edu/~liu/cse352/#handouts
 http://sern.ucalgary.ca/courses/CPSC/533/W99/presentations/L2_15B_Griscowsky_Kainth/
 Jeff Bilmes, Introduction to Graphical Models, lecture notes
 http://www.ai.mit.edu/~murphyk/Bayes/bayes_tutorial.pdf
 Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall