Transcript sews4 7240

Consistent probabilistic outputs
for protein function prediction
William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
Outline
• Motivation and background
• Methods
– Shared base method
– Reconciliation methods
• Results
The problem
Given:
• protein sequence,
• knockout phenotype,
• gene expression profile,
• protein-protein interactions, and
• phylogenetic profile
Predict:
• a probability for every term in the Gene Ontology
Heterogeneous data
Missing data
Multiple labels per gene
Structured output
Consistent predictions
Cytoplasmic membrane-bound vesicle (GO:0016023) is a cytoplasmic vesicle (GO:0031410).
The probability that protein X is a cytoplasmic membrane-bound vesicle must be less than or equal to the probability that protein X is a cytoplasmic vesicle.
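The consistency requirement can be checked mechanically. The following is a minimal sketch, not code from the talk: the function name `find_violations`, the `parents` mapping, and the example probabilities are all hypothetical, chosen only to illustrate the child-versus-parent inequality.

```python
def find_violations(prob, parents, tol=1e-12):
    """Return (child, parent) pairs where the child is more probable than the parent."""
    return [
        (child, parent)
        for child, parent_list in parents.items()
        for parent in parent_list
        if prob[child] > prob[parent] + tol
    ]


# "is a" relation from the slide: GO:0016023 is a GO:0031410.
parents = {"GO:0016023": ["GO:0031410"]}

# Made-up probabilities for one protein (illustrative values only).
consistent = {"GO:0016023": 0.30, "GO:0031410": 0.45}
inconsistent = {"GO:0016023": 0.60, "GO:0031410": 0.45}

print(find_violations(consistent, parents))
print(find_violations(inconsistent, parents))
```

An empty list means every child term is no more probable than each of its parents.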
Data sets
Kernels
SVM → Naïve Bayes
[Diagram: each data set (Data 1 through Data 33) feeds its own SVM with an asymmetric Laplace (AL) fit (SVM/AL 1 through SVM/AL 33), giving one probability per data set (Probability 1 through Probability 33). These are combined by a product, plus Bayes' rule, into a single probability. Other labels: Gaussian, Asymmetric Laplace.]
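The "product, plus Bayes' rule" step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation behind the talk: the asymmetric Laplace parameterization (mode `theta`, left rate `beta`, right rate `gamma`), the choice to skip missing data sets, and all parameter values are assumptions made for the example.

```python
import math


def asym_laplace_logpdf(x, theta, beta, gamma):
    """Log density of an asymmetric Laplace: mode theta, left rate beta, right rate gamma."""
    log_norm = math.log(beta * gamma / (beta + gamma))
    if x <= theta:
        return log_norm - beta * (theta - x)
    return log_norm - gamma * (x - theta)


def naive_bayes_posterior(svm_outputs, pos_params, neg_params, prior):
    """Combine per-data-set SVM outputs into P(term is on | outputs)."""
    log_pos = math.log(prior)
    log_neg = math.log(1.0 - prior)
    for x, pos, neg in zip(svm_outputs, pos_params, neg_params):
        if x is None:
            continue  # assumption: a missing data set contributes no evidence
        log_pos += asym_laplace_logpdf(x, *pos)  # product of likelihoods, in log space
        log_neg += asym_laplace_logpdf(x, *neg)
    # Bayes' rule, normalised stably in log space.
    top = max(log_pos, log_neg)
    w_pos = math.exp(log_pos - top)
    w_neg = math.exp(log_neg - top)
    return w_pos / (w_pos + w_neg)


# Hypothetical fitted parameters for three data sets (theta, beta, gamma).
pos_params = [(1.0, 2.0, 1.0)] * 3
neg_params = [(-1.0, 1.0, 2.0)] * 3
posterior = naive_bayes_posterior([0.8, None, 1.2], pos_params, neg_params, prior=0.05)
print(posterior)
```

Working in log space keeps the product over many data sets from underflowing.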
SVM → logistic regression
[Diagram: each data set (Data 1 through Data 33) feeds its own SVM (SVM 1 through SVM 33), giving one prediction per data set (Predict 1 through Predict 33). The predictions feed logistic regressors (Logistic regressor 1, 2, 3, through 11), which output a probability.]
Reconciliation Methods
• 3 heuristic methods
• 3 Bayesian networks
• 1 cascaded logistic regression
• 3 projection methods
Heuristic methods
• Max: Report the maximum probability of self and all descendants.
  $p_i = \max_{j \in D_i} \hat{p}_j$
• And: Report the product of probabilities of all ancestors and self.
  $p_i = \prod_{j \in A_i} \hat{p}_j$
• Or: Compute the probability that at least one descendant of the GO term is “on,” assuming independence.
  $p_i = 1 - \prod_{j \in D_i} (1 - \hat{p}_j)$
• Here $D_i$ is term $i$ together with its descendants, and $A_i$ is term $i$ together with its ancestors.
• All three methods use probabilities estimated by logistic regression.
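The three heuristics can be sketched in a few lines. This is a minimal illustration, assuming the ontology is given as hypothetical `children` and `parents` dictionaries and that the descendant and ancestor sets include the term itself; the two-term example and its probabilities are made up.

```python
from math import prod


def reachable(term, edges):
    """Return `term` together with every term reachable from it via `edges`."""
    seen, stack = {term}, [term]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def heuristic_reconcile(p_hat, children, parents):
    """Apply the Max, And, and Or heuristics to unreconciled probabilities p_hat."""
    result = {}
    for term in p_hat:
        desc = reachable(term, children)  # self and all descendants
        anc = reachable(term, parents)    # self and all ancestors
        result[term] = {
            "max": max(p_hat[j] for j in desc),
            "and": prod(p_hat[j] for j in anc),
            "or": 1.0 - prod(1.0 - p_hat[j] for j in desc),
        }
    return result


# Toy ontology: "leaf" is a child of "root"; the inputs are inconsistent (0.8 > 0.5).
children = {"root": ["leaf"]}
parents = {"leaf": ["root"]}
reconciled = heuristic_reconcile({"root": 0.5, "leaf": 0.8}, children, parents)
print(reconciled)
```

In this toy case each heuristic returns a child probability no larger than the parent's: Max and Or raise the parent, while And lowers the child.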
Bayesian network
• Belief propagation on a graphical model with the topology of the GO.
• Given $Y_i$, the distribution of each SVM output $X_i$ is modeled as an independent asymmetric Laplace distribution.
• Solved using a variational inference algorithm.
• “Flipped” variant: reverse the directionality of edges in the graph.
Cascaded logistic regression
• Fit a logistic regression to the SVM output only for those proteins that belong to all parent terms.
• Models the conditional distribution of the term, given all parents.
• The final probability is the product of these conditionals:
  $p_i = \prod_{j \in A_i} p_j$
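The final product step can be sketched as below. This covers only the combination of already-fitted conditionals, not the logistic regression fits themselves; the `parents` mapping, the function names, and the conditional values are hypothetical, and the ancestor set is assumed to include the term itself.

```python
from math import prod


def ancestors_and_self(term, parents):
    """Return `term` together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def cascaded_probabilities(conditional, parents):
    """Multiply each term's conditional with those of all its ancestors."""
    return {
        term: prod(conditional[j] for j in ancestors_and_self(term, parents))
        for term in conditional
    }


# Toy chain root -> a -> b with made-up conditionals P(term | all parents).
parents = {"a": ["root"], "b": ["a"]}
conditional = {"root": 0.9, "a": 0.5, "b": 0.4}
final = cascaded_probabilities(conditional, parents)
print(final)
```

Because every conditional is at most 1, a term's product can never exceed that of any of its ancestors, so the output is consistent by construction.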
Isotonic regression
• Consider the squared Euclidean distance between two sets of probabilities.
• Find the closest set of probabilities to the logistic regression values that satisfy all the inequality constraints.
  $\min_{p_i,\, i \in I} \sum_{i \in I} (p_i - \hat{p}_i)^2 \quad \text{s.t.} \quad p_j \le p_i, \ (i, j) \in E$
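One way to compute such a projection on a small ontology is Dykstra's alternating projection, sketched below. The talk does not say which solver was used, so this is a generic technique swapped in for illustration. It assumes each edge is a (parent, child) pair with the constraint that the child's probability is at most the parent's; the function name, the sweep count, and the example values are hypothetical.

```python
def isotonic_projection(p_hat, edges, n_sweeps=200):
    """Approximate the closest p (squared distance) with p[child] <= p[parent] on every edge."""
    p = dict(p_hat)
    # One Dykstra correction per constraint, covering only that edge's two terms.
    corr = {edge: (0.0, 0.0) for edge in edges}
    for _ in range(n_sweeps):
        for parent, child in edges:
            c_par, c_ch = corr[(parent, child)]
            # Add back the correction removed at the previous visit to this constraint.
            z_par, z_ch = p[parent] + c_par, p[child] + c_ch
            if z_ch > z_par:
                # Projection onto {child <= parent}: both move to their mean.
                new_par = new_ch = 0.5 * (z_par + z_ch)
            else:
                new_par, new_ch = z_par, z_ch
            corr[(parent, child)] = (z_par - new_par, z_ch - new_ch)
            p[parent], p[child] = new_par, new_ch
    return p


# Toy tree: root has children a and b; a violates the constraint (0.5 > 0.3).
tree = isotonic_projection({"root": 0.3, "a": 0.5, "b": 0.1},
                           [("root", "a"), ("root", "b")])
# Toy chain root -> a -> b with increasing inputs; the projection pools all three.
chain = isotonic_projection({"root": 0.2, "a": 0.4, "b": 0.9},
                            [("root", "a"), ("a", "b")])
print(tree)
print(chain)
```

The result is approximate: the iterates converge to the projection, so constraints hold only up to a small numerical tolerance after a finite number of sweeps. A fixed sweep count is used instead of an early-stopping test to keep the sketch simple.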
Isotonic regression
• Same inequality constraints, with a divergence $D$ in place of the squared Euclidean distance:
  $\min_{p_i,\, i \in I} \sum_{i \in I} D(\hat{p}_i \,\|\, p_i) \quad \text{s.t.} \quad p_j \le p_i, \ (i, j) \in E$
Kullback-Leibler projection
• Kullback-Leibler projection onto the set of distributions which factorize according to the ontology graph.
• Two variants, depending on the directions of the edges.
Hybrid method
[Diagram labels: KLP, BPAL, BPLR; likelihood ratios obtained from logistic regression.]
• Replace the Bayesian log posterior for $Y_i$ by the marginal log posterior obtained from the logistic regression.
• Uses discriminative posteriors from logistic regression, but still uses a structural prior.
Axes of evaluation
• Ontology
– biological process
– cellular component
– molecular function
• Term size
– 3-10 proteins
– 11-30 proteins
– 31-100 proteins
– 101-200 proteins
• Evaluation mode
– Joint evaluation
– Per protein
– Per term
• Recall
– 1%
– 10%
– 50%
– 80%
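Reporting precision at a fixed recall level can be sketched as follows. This is one common convention, assumed here rather than taken from the talk: predictions are ranked by score, and precision is read off at the first cutoff whose recall reaches the target. Ties are not treated specially, and the function name and example data are hypothetical.

```python
import math


def precision_at_recall(scores, labels, recall_level):
    """Precision at the shortest ranked prefix whose recall reaches recall_level."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # True positives required; the small offset guards against float round-up.
    needed = max(1, math.ceil(recall_level * n_pos - 1e-9))
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos >= needed:
            return true_pos / k  # TP / (TP + FP) among the top k predictions


# Made-up scores and 0/1 labels for five (protein, term) pairs.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
for level in (0.01, 0.10, 0.50, 0.80):
    print(level, precision_at_recall(scores, labels, level))
```

The same function could be applied to pooled predictions, to the terms of a single protein, or to the proteins of a single term, matching the three evaluation modes.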
Legend
Belief propagation, asymmetric Laplace
Belief propagation, asymmetric Laplace, flipped
Belief propagation, logistic regression
Cascaded logistic regression
Isotonic regression
Logistic regression
Kullback-Leibler projection
Kullback-Leibler projection, flipped
Naïve Bayes, asymmetric Laplace
Joint evaluation
[Figures: precision, TP/(TP+FP), versus recall, TP/(TP+FN). Panels: biological process ontology, large terms (101-200); biological process ontology; molecular function ontology; cellular component ontology.]
Conclusions: Joint evaluation
• Reconciliation does not always help.
• Isotonic regression performs well overall, especially for recall > 20%.
• For lower recall values, both Kullback-Leibler projection methods work well.
Average precision per protein
[Figures: average precision per protein. Panels: biological process, all term sizes; biological process, statistical significance; biological process, large terms.]
Number of proteins by ontology and term size:
Ontology              3-10   11-30   31-100   101-200
Biological process     953     435      239       100
Molecular function     476     142      111        35
Cellular component     196     135      171       278
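Average precision for a single protein can be sketched as below. The talk does not spell out its exact definition, so this uses a common one as an assumption: rank the terms by predicted probability and average the precision observed at each true annotation. The function name and example data are hypothetical.

```python
def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive label."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    precisions = []
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            precisions.append(true_pos / k)
    if not precisions:
        raise ValueError("need at least one positive label")
    return sum(precisions) / len(precisions)


# Made-up predictions for one protein over five GO terms.
term_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
term_labels = [1, 0, 1, 1, 0]
ap = average_precision(term_scores, term_labels)
print(ap)
```

Averaging this quantity over proteins gives a per-protein summary; swapping the roles of proteins and terms gives the per-term version.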
Conclusions: per protein
• Several methods perform well
– Unreconciled logistic regression
– Unreconciled naïve Bayes
– Isotonic regression
– Belief propagation with asymmetric Laplace
• For small terms
– For molecular function and biological process, we do not observe many significant differences.
– For cellular components, belief propagation with logistic regression works well.
Average precision per term
[Figure: average precision per term. Panel: biological process, all term sizes.]
Number of terms by ontology and term size:
Ontology              3-10   11-30   31-100   101-200
Biological process     953     435      239       100
Molecular function     476     142      111        35
Cellular component     152      97       48        30
Conclusions
• Reconciliation does not always help.
• Isotonic regression (IR) performs well overall.
• For small biological process and molecular function terms, it is less clear that IR is one of the best methods.
Acknowledgments
Guillaume Obozinski
Charles Grant
Michael Jordan
Gert Lanckriet
The mousefunc organizers
• Tim Hughes
• Lourdes Pena-Castillo
• Fritz Roth
• Gabriel Berriz
• Frank Gibbons
Per term for small terms
[Figures: biological process, molecular function, cellular component.]