
Some Comments on
Sebastiani et al., Nature Genetics 37(4), 2005
1
Bayesian Classifiers & Structure Learners
They come in several varieties designed to balance the
following properties to different degrees:
1. Expressiveness: can they learn/represent arbitrary or
constrained functions?
2. Computational tractability: can they perform learning and
inference fast?
3. Sample efficiency: how much sample is needed?
4. Structure discovery: can they be used to infer
structural relationships, even causal ones?
2
Variants:
• Exhaustive Bayes
• Simple (aka Naïve) Bayes
• Bayesian Networks (BNs)
• TANs (Tree-Augmented Simple Bayes)
• BANs (Bayes-Net-Augmented Simple Bayes)
• Bayesian Multinets (TAN- or BAN-based)
• FANs (Finite-Mixture-Augmented Naïve Bayes)
Several others exist but are not examined here (e.g.,
Markov blanket classifiers, model averaging)
3
Exhaustive Bayes

Bayes’ Theorem (or formula) says that:
P(D | F) = P(D) · P(F | D) / P(F)
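As a quick numeric sketch of the theorem in a diagnostic setting (all numbers below are hypothetical illustration values):

```python
# Bayes' theorem for one disease D and one finding F (hypothetical numbers).
p_d = 0.01              # prior P(D+): disease prevalence
p_f_given_d = 0.90      # sensitivity P(F+ | D+)
p_f_given_not_d = 0.05  # false-positive rate P(F+ | D-)

# Total probability: P(F+) = P(D+)P(F+|D+) + P(D-)P(F+|D-)
p_f = p_d * p_f_given_d + (1 - p_d) * p_f_given_not_d

# Bayes' theorem: P(D+ | F+) = P(D+) P(F+ | D+) / P(F+)
posterior = p_d * p_f_given_d / p_f
print(round(posterior, 4))  # 0.1538: low prevalence keeps the posterior modest
```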
4
Exhaustive Bayes
1. Expressiveness: can learn any function
2. Computational tractability: exponential
3. Sample efficiency: exponential
4. Structure discovery: does not reveal structure
5
Simple Bayes

Requires that findings are independent conditioned on
the disease states (note: this does not mean that the
findings are independent in general, but rather, that they
are conditionally independent).
6
Simple Bayes: less sample and less
computation but more restricted in what it
can learn than Exhaustive Bayes

Simple Bayes can be implemented by plugging in the main formula:
P(F | Dj) = ∏i P(Fi | Dj)
where Fi is the ith (singular) finding and Dj the jth (singular) disease.
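A minimal sketch of this product rule, with hypothetical disease priors and finding sensitivities:

```python
# Simple (Naive) Bayes decision rule: P(Dj | F) ∝ P(Dj) * Π_i P(Fi | Dj).
# All numbers below are hypothetical illustration values.
priors = {"flu": 0.3, "cold": 0.7}            # P(Dj)
likelihoods = {                                # P(Fi present | Dj)
    "flu":  {"fever": 0.9, "cough": 0.8},
    "cold": {"fever": 0.2, "cough": 0.7},
}

def posterior(findings):
    """Posterior over disease states for the given present findings."""
    scores = {}
    for d, prior in priors.items():
        score = prior
        for f in findings:
            score *= likelihoods[d][f]   # conditional-independence product
        scores[d] = score
    z = sum(scores.values())             # normalize over the disease states
    return {d: s / z for d, s in scores.items()}

post = posterior(["fever", "cough"])
print({d: round(p, 3) for d, p in post.items()})  # flu ≈ 0.688
```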
7
Naive Bayes
1. Expressiveness: can learn a small fraction of
functions (that shrinks exponentially fast as # of
dimensions grows)
2. Computational tractability: linear
3. Sample efficiency: needs a number of parameters linear in the
# of variables; each parameter can be estimated fairly efficiently
since it involves conditioning on one variable (the class node). E.g.,
in the diagnosis context one needs only the prevalence of each disease
and the sensitivity of each finding for the disease.
4. Structure discovery: does not reveal structure
8
Bayesian Networks:
Achieve trade-off between flexibility of
Exhaustive Bayes and tractability of
Simple Bayes
Also allow discovery of structural
relationships
9
Bayesian Networks
1. Expressiveness: can represent any function
2. Computational tractability: Depends on the dependency structure
of the underlying distribution. It is worst-case intractable but for
sparse or tree-like networks it can be very fast.
Representational tractability is excellent in sparse networks
3. Sample efficiency: There is no formal characterization because
(a) highly depends on the underlying structure of the distribution
and (b) in most practical learners local errors propagate to remote
areas in the network. Large-scale empirical studies show that very
complicated structures (i.e., with hundreds or even thousands of
variables and medium to small densities) can be learned accurately
with relatively small samples (i.e., a few hundred samples).
4. Structure discovery: under well-defined and reasonable
conditions is capable of revealing causal structure.
10
Bayesian Networks: The Bayesian Network
Model and Its Uses


BN = Graph (variables (nodes), dependencies (arcs)) + Joint
Probability Distribution + Markov Property
The graph has to be a DAG (directed acyclic graph) in the standard BN model
[Figure: DAG with arcs A → B and A → C]

JPD:
P(A+, B+, C+)=0.006
P(A+, B+, C-)=0.014
P(A+, B-, C+)=0.054
P(A+, B-, C-)=0.126
P(A-, B+, C+)=0.240
P(A-, B+, C-)=0.160
P(A-, B-, C+)=0.240
P(A-, B-, C-)=0.160
Theorem: any JPD can be represented in BN form
11
Bayesian Networks: The Bayesian Network
Model and Its Uses

Markov Property: the probability distribution of any node N, given its
parents, is independent of any subset W of the non-descendant nodes
of N

[Figure: DAG over nodes A–J]

e.g.:
D ⊥ {B, C, E, F, G} | A
F ⊥ {A, D, E, G, H, I, J} | {B, C}
12
Bayesian Networks: The Bayesian Network
Model and Its Uses

Theorem: the Markov property enables us to decompose (factor) the
joint probability distribution into a product of prior and conditional
probability distributions
The original JPD (up to exponential in size):
P(A+, B+, C+)=0.006
P(A+, B+, C-)=0.014
P(A+, B-, C+)=0.054
P(A+, B-, C-)=0.126
P(A-, B+, C+)=0.240
P(A-, B+, C-)=0.160
P(A-, B-, C+)=0.240
P(A-, B-, C-)=0.160

[Figure: DAG with arcs A → B and A → C]

P(V) = ∏i P(Vi | Pa(Vi))

becomes:
P(A+)=0.2
P(B+ | A+)=0.1
P(B+ | A-)=0.5
P(C+ | A+)=0.3
P(C+ | A-)=0.6

This is an up-to-exponential saving in the number of parameters!
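A minimal numeric check of this factorization: rebuild all eight JPD entries from one prior and two conditional tables (P(A+) = 0.2 here, the prior implied by the JPD table above).

```python
from itertools import product

# CPTs for the network A -> B, A -> C (P(A+) = 0.2 is the prior implied by the JPD).
p_a = {"+": 0.2, "-": 0.8}
p_b_plus_given_a = {"+": 0.1, "-": 0.5}   # P(B+ | A)
p_c_plus_given_a = {"+": 0.3, "-": 0.6}   # P(C+ | A)

def p_b(b, a):
    return p_b_plus_given_a[a] if b == "+" else 1 - p_b_plus_given_a[a]

def p_c(c, a):
    return p_c_plus_given_a[a] if c == "+" else 1 - p_c_plus_given_a[a]

# Markov factorization: P(A, B, C) = P(A) * P(B | A) * P(C | A)
jpd = {(a, b, c): p_a[a] * p_b(b, a) * p_c(c, a)
       for a, b, c in product("+-", repeat=3)}

# 5 parameters recover all 2^3 entries of the full table.
print(round(jpd[("+", "+", "+")], 3))  # 0.006, matching the JPD
```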
13
Bayesian Networks: The Bayesian Network
Model and Its Uses

Once we have a BN model of some domain we can ask
questions:
[Figure: DAG over nodes A–J]

• Forward: P(D+, I- | A+) = ?
• Backward: P(A+ | C+, D+) = ?
• Forward & Backward: P(D+, C- | I+, E+) = ?
• Arbitrary abstraction / arbitrary predictor and predicted variables
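Such queries reduce to summations over the JPD. A sketch for the backward query P(A+ | C+) on the earlier three-node network, by brute-force enumeration (real BN engines exploit the factorization instead):

```python
# JPD of the three-node network from the earlier slide; inference by brute-force
# enumeration over all assignments to (A, B, C).
jpd = {
    ("+", "+", "+"): 0.006, ("+", "+", "-"): 0.014,
    ("+", "-", "+"): 0.054, ("+", "-", "-"): 0.126,
    ("-", "+", "+"): 0.240, ("-", "+", "-"): 0.160,
    ("-", "-", "+"): 0.240, ("-", "-", "-"): 0.160,
}

def query(a=None, b=None, c=None):
    """Sum the JPD entries that match a partial assignment over (A, B, C)."""
    return sum(p for (va, vb, vc), p in jpd.items()
               if a in (None, va) and b in (None, vb) and c in (None, vc))

# Backward query: P(A+ | C+) = P(A+, C+) / P(C+)
posterior = query(a="+", c="+") / query(c="+")
print(round(posterior, 3))  # 0.111
```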
14
Other Restricted Bayesian
Classifiers: TANs, BANs, FANs, Multinets
15
Other Restricted Bayesian
Classifiers: TANs, BANs, FANs, Multinets
1. Expressiveness: can represent limited classes of functions (more
expressive than SB, less so than BNs)
2. Computational tractability: Worse than Simple Bayes, often
faster than BNs.
3. Sample efficiency: There is no formal characterization. Empirical
studies so far are limited; however, results are promising.
4. Structure discovery: not designed to reveal causal structure.
16
TANs

The TAN classifier extends Naïve Bayes with
“augmenting” edges among findings such that
the resulting network among the findings is a
tree
[Figure: D pointing to F1–F4, with tree-forming augmenting edges among F1–F4]
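A sketch of the TAN decision rule with two findings, where F2's tree-parent is F1 (all numbers hypothetical): score(D) = P(D) · P(F1 | D) · P(F2 | D, F1).

```python
# TAN decision rule sketch (hypothetical numbers): each finding conditions on
# the class D plus at most one tree-parent finding.
priors = {0: 0.4, 1: 0.6}                      # P(D)
p_f1_plus = {0: 0.3, 1: 0.8}                   # P(F1+ | D); F1 is the tree root
p_f2_plus = {(0, "+"): 0.7, (0, "-"): 0.2,     # P(F2+ | D, F1)
             (1, "+"): 0.9, (1, "-"): 0.4}

def tan_posterior(f1, f2):
    """Posterior over class values given the two findings."""
    scores = {}
    for d in priors:
        pf1 = p_f1_plus[d] if f1 == "+" else 1 - p_f1_plus[d]
        pf2 = p_f2_plus[(d, f1)] if f2 == "+" else 1 - p_f2_plus[(d, f1)]
        scores[d] = priors[d] * pf1 * pf2
    z = sum(scores.values())
    return {d: s / z for d, s in scores.items()}

post = tan_posterior("+", "+")
```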
17
TAN multinet

The TAN multinet classifier uses a different TAN for
each value of D and then chooses the predicted class
to be the value of D that has the highest posterior
given the findings (over all TANs)
[Figure: three TANs over F1–F4, one for each class value D=1, D=2, D=3]
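The multinet decision rule itself is simple; a sketch with stand-in likelihoods (in a real multinet each likelihood would be the factored product of that class's own TAN over the findings):

```python
# TAN-multinet decision rule (hypothetical numbers): a separate network per
# class value; predict the class maximizing P(D=k) * P(F | network_k).
priors = {1: 0.5, 2: 0.3, 3: 0.2}              # P(D=k)

# Stand-in likelihoods P(F | network_k); in a real multinet each would be the
# factored product of that class's own TAN over the findings F1..F4.
likelihoods = {
    1: lambda findings: 0.10,
    2: lambda findings: 0.30,
    3: lambda findings: 0.25,
}

def classify(findings):
    """Return the class value with the highest joint score."""
    scores = {k: priors[k] * likelihoods[k](findings) for k in priors}
    return max(scores, key=scores.get)

print(classify({"F1": "+"}))  # 2: 0.3 * 0.30 = 0.09 beats 0.05 for classes 1 and 3
```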
18
BANs

The BAN classifier extends Naïve Bayes with
“augmenting” edges among findings such that
the resulting network among the findings is a
graph
[Figure: D pointing to F1–F4, with DAG augmenting edges among F1–F4]
19
FANs (Finite Mixture Augmented Naïve
Bayes)

The FAN classifier extends Naïve Bayes by
modeling extra dependencies among findings
via an unmeasured hidden confounder (Finite
Mixture model) parameterized via EM
[Figure: D and a hidden node H both pointing to F1–F4]
20
How feasible is it to learn structure accurately
with Bayesian Network learners and
realistic samples?
Abundance of empirical evidence shows that it is very feasible. A few
examples:

• C.F. Aliferis, G.F. Cooper. “An Evaluation of an Algorithm for Inductive Learning of
Bayesian Belief Networks Using Simulated Data Sets.” In Proceedings of Uncertainty
in Artificial Intelligence, 1994.
  - 67 random BNs with samples from <200 to 1500 and up to 50 variables obtained mean
sensitivity of 92% and a mean superfluous-arc ratio of 5%.

• I. Tsamardinos, L.E. Brown, C.F. Aliferis. “The Max-Min Hill-Climbing Bayesian
Network Structure Learning Algorithm.” Machine Learning Journal, 2006.
  - 22 networks from 20 variables to 5000, and samples from 500 to 5000, yielding excellent
Structural Hamming Distances (for details please see paper).

• I. Tsamardinos, C.F. Aliferis, A. Statnikov. “Time and Sample Efficient Discovery of
Markov Blankets and Direct Causal Relations.” In Proceedings of the 9th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, August 24-27, 2003,
Washington, DC, USA, ACM Press, pages 673-678.
  - 8 networks with 27 to 5000 variables and 500 to 5000 samples yielded average
sensitivity/specificity of 90%.

See www.dsl-lab.org for details
21
Other comments on the paper
1. The paper aspires to build powerful classifiers and to reveal structure in one
modeling step. Several important predictive modeling approaches and structure
learners that a priori seem more suitable are ignored.
2. Analysis is conducted exclusively with a commercial product owned by one of the
authors. The conflict is disclosed in the paper.
3. Using (approximately) a BAN may facilitate parameterization; however, it does
not facilitate structure discovery.
4. Ordering SNPs is a good idea.
5. No more than 3 parents per node means that approx. 20 samples are used for
each independent cell in the conditional probability tables. Experience shows
that this number is more than enough for sufficient parameterization IF this
density is correct.
6. The proposed classifier achieves accuracy close to what one gets by classifying
everything to the class with the higher prevalence (since the distribution is very
unbalanced). However, close inspection shows that the classification is much
more discriminatory. Accuracy is a very poor metric to show this.
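To illustrate the point about accuracy on unbalanced data, a toy sketch (hypothetical scores) comparing majority-vote accuracy with a rank-based metric (AUC):

```python
# Why accuracy misleads on unbalanced data (toy illustration, hypothetical scores).
labels = [0] * 90 + [1] * 10               # 90 controls, 10 cases
scores = [0.1] * 90 + [0.9] * 10           # classifier scores that rank cases perfectly
majority = [0] * 100                       # classify everything as the majority class

# Accuracy of the trivial majority classifier
accuracy_majority = sum(int(p == y) for p, y in zip(majority, labels)) / len(labels)

# AUC = probability that a random case outranks a random control (Mann-Whitney form)
case_scores = [s for s, y in zip(scores, labels) if y == 1]
control_scores = [s for s, y in zip(scores, labels) if y == 0]
pairs = [(s1, s0) for s1 in case_scores for s0 in control_scores]
auc = sum(int(s1 > s0) + 0.5 * int(s1 == s0) for s1, s0 in pairs) / len(pairs)

print(accuracy_majority, auc)  # 0.9 1.0: high accuracy, zero vs. perfect discrimination
```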
22
Other comments on the paper
7. A very appropriate analysis not pursued here is to convert the graph to its
equivalence class and examine structural dependencies there.
8. No examination of structure stability across the 5 folds of cross-validation, or
via bootstrapping.
9. Table 1 confuses explanatory with predictive modeling. SNP contributions are
estimated in the very small sample while they should be estimated in the larger
sample (Table 1 offers an explanatory analysis).
10. It is not clear what set each SNP/gene SNP set is removed from to compute
Table 1.
11. Mixing source populations in the evaluation set may have biased the
evaluation.
12. Discretization has a huge effect on structure discovery algorithms. The
applied discretization procedure for continuous variables is suboptimal.
13. When using selected cases and controls, artifactual dependencies are
introduced among some of the variables. This is well known, and corrections to
the Bayesian metric have been devised to deal with it. The paper ignores this
despite the fact that its purpose is precisely to infer such dependencies.
23
Other comments on the paper
14. The paper makes the argument that by enforcing that arcs go from the
phenotype to SNPs, the resulting model needs less sample to parameterize.
While this may be true for the parameterization of the phenotype node, it is not
true in general for the other nodes. In fact, by doing so genotype nodes have, in
general, to be more densely connected, and thus their parameterization
becomes more sample-intensive. At the same time, the validity of the inferred
structure may be compromised.
15. There has been quite a bit of “simulations to evaluate heuristic choices” and
parameter values chosen by “sensitivity analysis” and other such pre-modeling,
which opens up the possibility of some manual over-fitting.
24