
Exploiting Parameter Domain Knowledge
for
Learning in Bayesian Networks
~ Thesis Defense ~
Stefan Niculescu
Carnegie Mellon University, July 2005
Thesis Committee:
Tom Mitchell (Chair)
John Lafferty
Andrew Moore
Bharat Rao (Siemens Medical Solutions)
1
Domain Knowledge
• In the real world, data is often too sparse to allow building an accurate model
• Domain knowledge can help alleviate this problem
• Several types of domain knowledge:
– Relevance of variables (feature selection)
– Conditional Independences among variables
– Parameter Domain Knowledge
2
Parameter Domain Knowledge
• In a Bayesian Network for a real world domain:
– can have huge number of parameters
– not enough data to estimate them accurately
• Parameter Domain Knowledge constraints:
– reduce the space of feasible parameters
– reduce the variance of parameter estimates
3
Parameter Domain Knowledge
Examples:
• DK: “If a person has a Family history of Heart Attack, Race and
Pollution are not significant factors for the probability of
getting a Heart Attack.”
• DK: “Two voxels in the brain may exhibit the same activation
patterns during a cognitive task, but with different amplitudes.”
• DK: “Two countries may have different Heart Disease rates, but
the relative proportion of Heart Attack to CHF is the same.”
• DK: “The aggregate probability of Adverbs in English is less
than the aggregate probability of Verbs”.
4
Thesis
Standard methods for performing parameter estimation
in Bayesian Networks can be naturally extended to
take advantage of parameter domain knowledge that
can be provided by a domain expert. These new
learning algorithms perform better (in terms of
probability density estimation) than existing ones.
5
Outline
• Motivation
→ Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
6
Parameter Domain Knowledge Framework
~ Domain Knowledge Constraints ~
7
Parameter Domain Knowledge Framework
~ Frequentist Approach, Complete Data ~
8
Parameter Domain Knowledge Framework
~ Frequentist Approach, Complete Data ~
9
Parameter Domain Knowledge Framework
~ Frequentist Approach, Incomplete Data ~
EM Algorithm. Repeat until convergence:
10
Parameter Domain Knowledge Framework
~ Frequentist Approach, Incomplete Data ~
~ Discrete Variables ~
EM Algorithm. Repeat until convergence:
11
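The EM update equations themselves did not survive the slide extraction. As a rough illustration of the loop structure only (a minimal sketch with my own toy setup and sharing groups, not the thesis code): the E-step fills in expected counts for partially observed records, and the M-step applies a constrained closed-form update, shown here for simple parameter sharing within one discrete distribution.

```python
import numpy as np

# Minimal sketch of a constrained EM loop (toy setup, not the thesis code):
# a discrete variable B with 4 values whose distribution obeys simple
# parameter sharing (theta_0 == theta_1).  Some records are only coarsely
# observed (we only know B is in {0,1} or in {2,3}), so EM fills in expected
# counts before applying the constrained closed-form M-step.

groups = [[0, 1], [2], [3]]                 # hypothetical sharing groups
V = 4

def m_step(expected_counts):
    """Constrained M-step: theta_i = N_i / (k_i * N) per share group."""
    theta = np.empty(V)
    total = expected_counts.sum()
    for g in groups:
        theta[g] = expected_counts[g].sum() / (len(g) * total)
    return theta

def em(exact_counts, coarse_counts, n_iter=100):
    """exact_counts: (4,) fully observed counts.
    coarse_counts: dict mapping a tuple of possible values -> count."""
    theta = np.full(V, 1.0 / V)
    for _ in range(n_iter):
        expected = exact_counts.astype(float).copy()
        # E-step: split each coarse observation across its compatible values.
        for vals, n in coarse_counts.items():
            idx = list(vals)
            expected[idx] += n * theta[idx] / theta[idx].sum()
        theta = m_step(expected)             # M-step: constrained closed form
    return theta

print(em(np.array([5, 1, 8, 6]), {(0, 1): 10, (2, 3): 4}))
```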
Parameter Domain Knowledge Framework
~ Bayesian Approach ~
12
Parameter Domain Knowledge Framework
~ Bayesian Approach ~
13
Parameter Domain Knowledge Framework
~ Computing the Normalization Constant ~
14
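The slides' own derivation of the normalization constant is not reproduced in this transcript. As a generic stand-in only: if the constrained prior is taken to be a Dirichlet density restricted to the feasible region defined by the domain knowledge, its normalization constant equals the Dirichlet probability mass of that region, which a simple Monte Carlo estimate can approximate. The hyperparameters and the inequality constraint below are invented for illustration; this is not the slides' method.

```python
import numpy as np

# Monte Carlo sketch (illustrative only): the normalization constant of a
# Dirichlet density restricted to a feasible region C is the Dirichlet mass
# of C, estimated here by the fraction of unconstrained samples landing in C.

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0, 2.0, 2.0])        # hypothetical hyperparameters

def in_feasible_region(theta):
    # Example inequality constraint: theta_0 + theta_1 <= theta_2 + theta_3.
    return theta[:, 0] + theta[:, 1] <= theta[:, 2] + theta[:, 3]

samples = rng.dirichlet(alpha, size=200_000)
z_hat = in_feasible_region(samples).mean()     # estimate of the constant
print(f"Estimated normalization constant: {z_hat:.4f}")
```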
Parameter Domain Knowledge Framework
~ Computing the Normalization Constant ~
15
Outline
• Motivation
• Parameter Domain Knowledge Framework
→ Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
16
Simple Parameter Sharing
~ Maximum Likelihood Estimators ~
• Example: a cubical die cut symmetrically at each corner, so the 6 square faces share one probability (k_1 = 6) and the 8 corner faces share another (k_2 = 8).
• In general, parameter θ_i appears in k_i places; N_i is the aggregate count of observations on the outcomes that share θ_i, and N = Σ_j N_j is the total count.
Theorem. The Maximum Likelihood parameters are given by:
θ_i = N_i / (k_i · N)
17
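A minimal numeric illustration of the theorem above, using the cut-die example: the 6 square faces share one parameter and the 8 corner faces share another. The counts below are hypothetical.

```python
import numpy as np

# Sketch of the ML estimator under simple parameter sharing for the cut-die
# example: theta_i = N_i / (k_i * N), where N_i aggregates the counts of all
# faces sharing parameter i and N is the total number of throws.

k = np.array([6, 8])                  # group sizes (square faces, corner faces)
N_group = np.array([820.0, 180.0])    # hypothetical aggregate counts per group
N = N_group.sum()

theta = N_group / (k * N)
print(theta)                          # per-face probabilities
print((k * theta).sum())              # sanity check: probabilities sum to 1
```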
Simple Parameter Sharing
~ Dependent Dirichlet Priors ~
18
Simple Parameter Sharing
~ Variance Reduction in Parameter Estimates ~
19
Simple Parameter Sharing
~ Experiments – Learning a Probability Distribution ~
• Synthetic Dataset:
– Probability distribution over 50 values
– 50 randomly generated parameters:
• 6 of them shared, each repeated between 2 and 5 times, so that shared parameters account for about half of the values
• the rest “not shared” (shared exactly once)
– 1000 examples sampled from this distribution
– Purpose:
• Domain Knowledge readily available
• To be able to study the effect of training set size (up to 1000)
• To be able to compare our estimated distribution to the true
distribution
• Models:
– STBN ( Standard Bayesian Network )
– PDKBN ( Bayesian Network with PDK )
20
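A rough sketch of this kind of experiment (the generating distribution, sharing structure, sample size and smoothing below are made up, not the thesis setup): build a 50-value distribution with a few shared parameters, sample from it, and compare the standard MLE with the sharing-aware estimate by KL divergence to the true distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the synthetic experiment: a distribution over 50
# values in which 6 parameters are shared (each 2-5 times), estimated with and
# without the sharing constraint and compared to the truth by KL divergence.

n_vals, groups = 50, []
idx = list(range(n_vals))
rng.shuffle(idx)
pos = 0
for _ in range(6):
    size = rng.integers(2, 6)                     # shared 2 to 5 times
    groups.append(idx[pos:pos + size]); pos += size
groups += [[i] for i in idx[pos:]]                # the rest unshared

raw = rng.random(len(groups))
true_theta = np.empty(n_vals)
for g, r in zip(groups, raw):
    true_theta[g] = r
true_theta /= true_theta.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

data = rng.choice(n_vals, size=200, p=true_theta)
counts = np.bincount(data, minlength=n_vals).astype(float) + 1e-6  # smoothed

mle = counts / counts.sum()                        # standard estimate (STBN-like)
shared = np.empty(n_vals)                          # sharing-aware estimate (PDKBN-like)
for g in groups:
    shared[g] = counts[g].sum() / (len(g) * counts.sum())

print("KL(true || standard):", kl(true_theta, mle))
print("KL(true || shared):  ", kl(true_theta, shared))
```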
Experimental Results
• PDKBN performs better than STBN
– Largest difference: 0.05 (30 ex)
• On average, STBN needs 1.86 times
more examples to catch up in KL !!!
• 40 (PDKBN) ~ 103 (STBN)
• 200 (PDKBN) ~ 516 (STBN)
• 650 (PDKBN) ~ >1000 (STBN)
• The difference between PDKBN and STBN shrinks as the size of the training set increases, but PDKBN is much better when training data is scarce.
21
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
→ Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
22
Hidden Process Models
One observation (trial):
N different trials:
All trials and all Processes
have equal length T
23
Parameter Sharing in HPMs
• Voxels exhibit similar shape activity, but with different amplitudes
• The process shapes P_1, P_2 are shared across voxels; each voxel v has its own amplitudes c_1v, c_2v
• Observation model for voxel v at time t:
X_vt ~ N( c_1v · P_1(t − t_1) + c_2v · P_2(t − t_2), σ² )
24
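A small generative sketch of the observation model reconstructed above, with hypothetical Gaussian-shaped process time courses standing in for the real ones: the shapes P1 and P2 are shared across voxels, while each voxel gets its own amplitudes.

```python
import numpy as np

# Sketch of the shared-HPM observation model: a voxel's signal is a noisy sum
# of the two process time courses, started at their stimulus onsets t1, t2 and
# scaled by voxel-specific amplitudes c1v, c2v.  Shapes P1, P2 are shared.

rng = np.random.default_rng(0)
T = 32                                                   # time slices per trial
P1 = np.exp(-0.5 * ((np.arange(T) - 4) / 2.0) ** 2)      # hypothetical shape 1
P2 = np.exp(-0.5 * ((np.arange(T) - 6) / 2.0) ** 2)      # hypothetical shape 2

def shifted(P, onset):
    """Process shape P delayed so that it starts at time `onset` (zeros before)."""
    out = np.zeros(T)
    out[onset:] = P[:T - onset]
    return out

def voxel_signal(c1v, c2v, t1, t2, sigma=0.1):
    mean = c1v * shifted(P1, t1) + c2v * shifted(P2, t2)
    return mean + rng.normal(0.0, sigma, size=T)

x = voxel_signal(c1v=1.0, c2v=0.4, t1=0, t2=16)          # one simulated voxel
```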
Parameter Sharing in HPMs
~ Maximum Likelihood Estimation ~
ℓ'(P, C) is quadratic in (P, C), but:
• linear in P (for fixed C) !
• linear in C (for fixed P) !
so P and C can be estimated alternately, each step in closed form.
25
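Because the model is linear in P for fixed C and linear in C for fixed P, the two blocks can be fit alternately by ordinary least squares. The sketch below illustrates that idea for a single trial with known onsets; it is an illustrative reconstruction, not the thesis implementation.

```python
import numpy as np

# Alternating least-squares sketch for a shared HPM (single trial, known onsets
# t1, t2; illustrative only).  Each half-step is an ordinary least-squares fit.

def shift_matrix(T, onset):
    """S @ P places process shape P so that it starts at time `onset`."""
    S = np.zeros((T, T))
    for t in range(onset, T):
        S[t, t - onset] = 1.0
    return S

def fit_hpm(X, t1, t2, n_iter=20):
    """X: (V, T) voxel signals. Returns process shapes P (2, T) and amplitudes C (V, 2)."""
    V, T = X.shape
    S1, S2 = shift_matrix(T, t1), shift_matrix(T, t2)
    P = np.vstack([X.mean(axis=0), X.mean(axis=0)])      # crude initialization
    C = np.ones((V, 2))
    for _ in range(n_iter):
        # C-step: per voxel, regress its signal on the two shifted shapes.
        D = np.column_stack([S1 @ P[0], S2 @ P[1]])              # (T, 2)
        C = np.linalg.lstsq(D, X.T, rcond=None)[0].T              # (V, 2)
        # P-step: stack all voxels; the prediction is linear in [P1; P2].
        A = np.vstack([np.hstack([c1 * S1, c2 * S2]) for c1, c2 in C])
        p = np.linalg.lstsq(A, X.reshape(-1), rcond=None)[0]
        P = p.reshape(2, T)
    return P, C
```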
Parameter Sharing in HPMs
~ Maximum Likelihood Estimation ~
26
Starplus Dataset
• Trial:
– read sentence
– view picture
– answer whether sentence describes picture
• 40 trials – 32 time slices (2/sec)
– picture presented first in half of trials
– sentence first in the other half
• Three possible objects: star, dollar, plus
• Collected by Just et al.
• IDEA: model using HPMs with two processes:
– “Sentence” and “Picture”
– We assume a process starts when stimulus is presented
– Will use Shared HPMs where possible
27
Example sentence stimulus: “It is true that the star is above the plus?”
28
29
[Picture stimulus: a plus shown above a star.]
30
31
Parameter Sharing in HPMs
~ Hierarchical Partitioning Algorithm ~
32
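The partitioning algorithm itself is not reproduced in this transcript; the sketch below only illustrates the general idea one would expect behind such a procedure, namely recursively deciding, per cluster of voxels, whether to share one set of HPM parameters or to split and recurse, based on a held-out score. The splitting rule and score here are placeholders, not the thesis' algorithm.

```python
# Generic recursive-partitioning sketch (placeholder splitting rule and score).

def partition(voxels, fit_shared, score, split_in_two):
    """Return voxel clusters that will each share one set of HPM parameters."""
    if len(voxels) <= 1:
        return [voxels]
    whole = score(fit_shared(voxels), voxels)
    left, right = split_in_two(voxels)               # e.g. a spatial bisection
    split = score(fit_shared(left), left) + score(fit_shared(right), right)
    if split > whole:                                # splitting helps on held-out data
        return (partition(left, fit_shared, score, split_in_two)
                + partition(right, fit_shared, score, split_in_two))
    return [voxels]                                  # otherwise share the whole cluster
```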
Parameter Sharing in HPMs
~ Experiments ~
• We compare three models, based on Average (per trial) Likelihood:
– StHPM – Standard, per voxel HPM
– ShHPM – One HPM for all voxels in an ROI (24 total)
– HieHPM – Hierarchical HPM
• Effect of training set size (6 to 40) in CALC:
– ShHPM biased here
• Better than StHPM at small sample size
• Worse at 40 examples
– HieHPM – the best
• It can represent both models
• e^106 times better data likelihood than StHPM at 40 examples
• StHPM needs 2.9 times more examples to catch up
33
Parameter Sharing in HPMs
~ Experiments ~
Performance over whole brain (40 examples):
– HieHPM – the best
• e^1792 times better data likelihood than StHPM
• Better than StHPM in 23/24 ROIs
• Better than ShHPM in 12/24 ROIs, equal in 11/24
– ShHPM – second best
• e^464 times better data likelihood than StHPM
• Better than StHPM in 18/24 ROIs
• It is biased, but sharing whole ROIs that are not involved in the cognitive task makes sense
34
Learned Voxel Clusters
• In the whole brain:
– ~ 300 clusters
– ~ 15 voxels / cluster
• In CALC:
– ~ 60 clusters
– ~ 5 voxels / cluster
35
Sentence Process in CALC
36
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
→ Types of Parameter Domain Knowledge
• Related Work
• Summary / Future Work
37
Parameter Domain Knowledge Types
• DISCRETE:
• Known Parameter Values
• Parameter Sharing and Proportionality Constants – One Distribution
• Sum Sharing and Ratio Sharing – One Distribution
• Parameter Sharing and Hierarchical Sharing – Multiple Distributions
• Sum Sharing and Ratio Sharing – Multiple Distributions
• CONTINUOUS (Gaussian Distributions):
• Parameter Sharing and Proportionality Constants – One Distribution
• Parameter Sharing in Hidden Process Models
• INEQUALITY CONSTRAINTS:
• Between Sums of Parameters – One Distribution
• Upper Bounds on Sums of Parameters – One Distribution
38
Probability Ratio Sharing
• Want to model P(Word | Language)
• Two languages: English, Spanish; different sets of words
• Domain Knowledge – word groups:
– T1 “Computer Words”: computer, keyboard, monitor, etc.
– T2 “Business Words”
– Relative frequency of “computer” to “keyboard” is the same in both languages
– Aggregate mass of a group can be different
[Figure: a table of parameters θ_i1 = P(Word_i | English) and θ_i2 = P(Word_i | Spanish); each column sums to 1. Within a word group the two columns keep the same relative ratios, while the group's aggregate masses c_1 (under English) and c_2 (under Spanish) may differ.]
39
Probability Ratio Sharing
[Table: parameters p_ij for distributions j = 1, …, k; row i lists p_i1, p_i2, …, p_ik; each column sums to 1.]
DK: Parameters of a given color preserve their relative ratios across all distributions!
p_11 / p_21 = p_12 / p_22 = … = p_1k / p_2k
40
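Under this constraint the ML estimates factor into shared within-group proportions (pooled across distributions) and per-distribution group masses. The sketch below follows that decomposition under the simplifying assumption that every word belongs to exactly one group (singleton groups allowed); the counts are invented for illustration.

```python
import numpy as np

# Sketch of ML estimation under probability ratio sharing (illustrative
# derivation under the stated constraints, assuming every word belongs to
# exactly one group).  Within a group, relative word proportions are pooled
# across languages; each group's aggregate mass is estimated per language.

def ratio_sharing_mle(counts, groups):
    """counts: (n_words, n_langs) word counts; groups: list of index lists."""
    theta = np.empty_like(counts, dtype=float)
    lang_totals = counts.sum(axis=0)                      # N_j per language
    for g in groups:
        group_counts = counts[g]                          # (|g|, n_langs)
        mass = group_counts.sum(axis=0) / lang_totals     # per-language group mass
        pooled = group_counts.sum(axis=1)                 # counts pooled over languages
        ratios = pooled / pooled.sum()                    # shared within-group proportions
        theta[g] = np.outer(ratios, mass)
    return theta                                          # each column sums to 1

# Toy example: 4 words, 2 languages, words 0-1 form the "computer" group.
counts = np.array([[30., 6.], [15., 3.], [40., 60.], [15., 31.]])
print(ratio_sharing_mle(counts, groups=[[0, 1], [2], [3]]))
```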
Proportionality Constants for Gaussians
41
Inequalities between Sums of Parameters
• In spoken language:
– Each Adverb comes along with a Verb
– Each Adjective comes with a Noun or Pronoun
• Therefore it is reasonable to expect that:
– The frequency of Adverbs is less than that of Verbs
– The frequency of Adjectives is less than that of Nouns and Pronouns
• Equivalently: P(Adverb) ≤ P(Verb) and P(Adjective) ≤ P(Noun) + P(Pronoun)
• In general, within the same distribution, the domain knowledge bounds one sum of parameters by another: Σ_{i∈A} θ_i ≤ Σ_{j∈B} θ_j
42
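A generic numerical sketch of estimating a single multinomial under such an inequality (this is not the thesis' closed-form solution, just a way to see the constraint's effect with made-up counts): if the raw frequencies violate P(Adverb) ≤ P(Verb), the constrained estimate ends up pulling the two probabilities to a common value.

```python
import numpy as np
from scipy.optimize import minimize

# Numerical maximum likelihood for one multinomial under an inequality between
# parameters, e.g. P(Adverb) <= P(Verb).  Counts are hypothetical.

labels = ["Verb", "Adverb", "Noun", "Other"]
counts = np.array([20.0, 30.0, 35.0, 15.0])     # raw MLE would violate the constraint

def neg_log_lik(theta):
    return -np.sum(counts * np.log(theta))

constraints = [
    {"type": "eq",   "fun": lambda th: th.sum() - 1.0},    # parameters sum to one
    {"type": "ineq", "fun": lambda th: th[0] - th[1]},     # P(Verb) >= P(Adverb)
]
res = minimize(neg_log_lik, x0=np.full(4, 0.25), bounds=[(1e-6, 1.0)] * 4,
               method="SLSQP", constraints=constraints)
print(dict(zip(labels, np.round(res.x, 3))))
```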
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
→ Related Work
• Summary / Future Work
43
Dirichlet Priors in a Bayes Net
[Figure: a Dirichlet prior, characterized by a Prior Belief and a Spread around it.]
• The Domain Expert specifies an assignment of parameters.
– leaves room for some error (Variance).
• Several types:
– Standard
– Dirichlet Tree Priors
– Dependent Dirichlet
44
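For comparison with the constrained estimators discussed earlier, the standard MAP estimate under an independent Dirichlet prior has the familiar closed form θ_i = (N_i + α_i − 1) / (N + Σ_j α_j − K). A tiny sketch with made-up counts and hyperparameters:

```python
import numpy as np

# Standard (unconstrained) MAP estimate with an independent Dirichlet prior:
# theta_i = (N_i + alpha_i - 1) / (N + sum(alpha) - K).  The hyperparameters
# alpha encode the expert's prior belief; their magnitude controls the spread.

def dirichlet_map(counts, alpha):
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    return (counts + alpha - 1.0) / (counts.sum() + alpha.sum() - len(counts))

print(dirichlet_map([3, 0, 1], alpha=[2, 2, 2]))   # -> [4/7, 1/7, 2/7]
```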
Markov Models
… → W_{t−1} → W_t → W_{t+1} → …
45
Module Networks
In a Module:
• Same parents
• Same CPTs
Image from “Learning Module Networks” by Eran Segal and Daphne Koller
46
Context Specific Independence
[Figure: example Bayesian network over Burglary, Set and Alarm illustrating context specific independence.]
47
Limitations of Current Models
• Dirichlet priors
– When the number of parameters is huge, specifying a
useful prior is difficult
– Unable to enforce even simple constraints:
• extra hyperparameters are needed to enforce even basic parameter sharing, but then no closed form MAP estimates can be computed !
– Dependent Dirichlet Priors are not conjugate priors
• Our priors are dependent and also conjugate !!!
• Markov Models, Module Networks and CSI
– Particular cases of our Parameter Sharing DK
– Do not allow sharing at parameter level of granularity
48
Outline
• Motivation
• Parameter Domain Knowledge Framework
• Simple Parameter Sharing
• Parameter Sharing in Hidden Process Models
• Types of Parameter Domain Knowledge
• Related Work
→ Summary / Future Work
49
Summary
• Parameter-related Domain Knowledge is needed when data is scarce
– Reduces the number of free parameters
– Reduces the variance in parameter estimates (illustrated on Simple Parameter Sharing)
• Developed a unified Parameter Domain Knowledge Framework
– From both a frequentist and a Bayesian point of view
– For both complete and incomplete data
• Developed efficient learning algorithms for several types of PDK:
– Closed form solutions for most of these types
– For both discrete and continuous variables
– For both equality and inequality constraints
– Particular cases of our parameter sharing framework: Markov Models, Module Nets, Context Specific Independence
• Developed a method for automatically learning the domain knowledge (illustrated on HPMs)
• Experiments show the superiority of models using PDK
50
Future Work
• Interactions among different types of Parameter
Domain Knowledge
• Incorporate Parameter Domain Knowledge in
Structure Learning
• Hard vs. Soft constraints
• Parameter Domain Knowledge for learning
Undirected Graphical Models
51
Questions ?
52
THE END
53