Remote homology detection
Genome 541
Prof. William Stafford Noble
Outline
• Pairwise alignment algorithms
• Hidden Markov models
• Iterative profile methods
• Profile-profile alignment
• Support vector machines
• Network diffusion
• Metric space embedding
Pairwise alignment algorithms
Smith-Waterman (Smith J. Mol Biol 1981)
BLAST (Altschul Nucleic Acids Research 1990)
A pairwise alignment algorithm ranks homologs and produces an alignment
[Figure: BLAST searches a sequence database with a query and returns homologs]
Ideally, each target sequence also has an associated statistical confidence estimate.
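A minimal sketch of Smith-Waterman local alignment, to make "ranks homologs and produces an alignment" concrete. The match/mismatch/gap values are illustrative assumptions; real searches normally use a substitution matrix such as BLOSUM62 with affine gap penalties, and this sketch produces no statistical confidence estimate.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned a, aligned b) using a linear gap cost."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Rank a toy "database" by local alignment score against the query.
query = "EEFGSVDGLVNNA"
database = ["QKYGRLDVMINNA", "FEPEGPEKGMWGLVNNA", "WWWWWWWW"]
ranked = sorted(database, key=lambda t: smith_waterman(query, t)[0], reverse=True)
for target in ranked:
    print(target, smith_waterman(query, target))
```

The same score that orders the targets also comes with an alignment, which is the property the later classifier-based methods give up.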
Profile hidden Markov models
SAM (Krogh J. Mol. Biol. 1994)
HMMER (Eddy ISMB 1995)
Some positions are more likely to have a gap than others
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA
EEFGSVDGLVNNA
Charge more for a gap here than here.
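As a rough illustration of why a gap should cost different amounts at different positions, the fraction of gapped rows per column can be computed directly from the alignment above. The function name is hypothetical; a real profile HMM estimates insert and delete transition probabilities rather than using raw gap fractions.

```python
alignment = [
    "EEFG----SVDGLVNNA",
    "QKYG----RLDVMINNA",
    "RRLG----TLNVLVNNA",
    "GGIG----PVD-LVNNA",
    "KALG----GFNVIVNNA",
    "ARFG----KID-LIPNA",
    "FEPEGPEKGMWGLVNNA",
    "AQLK----TVDVLINGA",
]


def gap_fraction_per_column(rows):
    """Fraction of sequences that have a gap character in each column."""
    n = len(rows)
    return [sum(row[c] == "-" for row in rows) / n for c in range(len(rows[0]))]


gap_fractions = gap_fraction_per_column(alignment)
for column, fraction in enumerate(gap_fractions):
    print(column, fraction)
```

Columns 4 to 7 are gapped in seven of the eight rows, column 11 in two, and every other column in none, so a gap in the first block should be cheap and a gap elsewhere expensive.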
A profile HMM has position-specific substitutions, insertions and deletions
[Figure: profile HMM with begin state B, insert states I0 to I8, match states M1 to M8, delete states D1 to D8, and end state E]
HMM dynamic programming
[Figure: dynamic programming over the sequence E M R K L and the HMM states B, M1, I1, I2, D1 and E]
Iterative profile methods
PSI-BLAST (Altschul Nucleic Acids Research 1997)
SAM-T98 (Karplus Bioinformatics 1998)
Position-specific iterated BLAST
[Figure: the query is searched against a sequence database with BLAST; the homologs found are used to build a position-specific scoring matrix (PSSM), a statistical model of the protein family]
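A simplified sketch of how a PSSM can be built from aligned homologs and used to score a new segment. It assumes a uniform background distribution and a flat pseudocount, both chosen for brevity; PSI-BLAST itself is considerably more elaborate (for example, it weights sequences and derives pseudocounts differently), so the function names and parameters here are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for c in range(len(aligned[0])):
        counts = Counter(seq[c] for seq in aligned if seq[c] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


homologs = ["LVNNA", "MINNA", "LVNNA", "LIPNA"]
pssm = build_pssm(homologs)
print(score_segment(pssm, "LVNNA"), score_segment(pssm, "WWWWW"))
```

In an iterated search, the segments that score well under the current matrix would be added to the alignment and the matrix rebuilt.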
Profile-profile alignment
HHPred / HHSearch (Söding Bioinformatics 2005)
A path through two HMMs corresponds to a sequence co-emitted by both
Support vector machine methods
Fisher SVM (Jaakkola ISMB 1999)
Spectrum kernel (Leslie NIPS 2002)
Pairwise kernel (Liao RECOMB 2002)
SVM-fold (Melvin BMC Bioinformatics 2007)
We can frame homology detection as a classification task
• Many statistical and machine learning algorithms exist for performing binary classification.
– Naïve Bayes classifier
– Nearest neighbor classifier
– Linear discriminant analysis
– Decision trees
– Neural networks
– Support vector machines
• We can use such methods to answer questions of the form, “Is this protein a penicillin amidase?”
[Figure: the question “Is protein X a penicillin amidase?” is passed to a classifier, which answers “Yes”]
Important caveat: This approach only produces a ranking, not an alignment.
Problem: Protein lengths vary
• Most classifiers require fixed-length input.
• We need a method to convert a protein into a vector.
[Figure: “Featurizer”]
Fisher kernel uses sufficient statistics from HMM training
• For each training example x, the Fisher score is the gradient of the log-likelihood for x given M+
• For example, the component for an emission parameter is computed from:
– the emission probability of amino acid i in state j
– the number of times that amino acid i is observed in state j
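A sketch of the emission component of the Fisher score, built from the two quantities named above. It assumes the commonly cited form U_ij = E_j(i) / e_j(i) - Σ_k E_j(k), where e_j(i) is the emission probability and E_j(i) is the expected number of times amino acid i is emitted from state j (in practice obtained with the forward-backward algorithm). The function name, the dictionary layout and the numbers in the usage lines are all hypothetical.

```python
def fisher_emission_scores(expected_counts, emission_probs):
    """Return one {amino acid: score} dict per HMM state.

    expected_counts[j][aa]: expected emissions of aa from state j for one protein
    emission_probs[j][aa]:  emission probability of aa in state j
    """
    scores = []
    for counts_j, probs_j in zip(expected_counts, emission_probs):
        state_total = sum(counts_j.values())  # expected visits to state j
        scores.append({
            aa: counts_j.get(aa, 0.0) / p - state_total
            for aa, p in probs_j.items()
        })
    return scores


# Made-up numbers for a single state with a two-letter alphabet.
probs = [{"A": 0.75, "K": 0.25}]
typical = fisher_emission_scores([{"A": 1.5, "K": 0.5}], probs)
atypical = fisher_emission_scores([{"A": 2.0, "K": 0.0}], probs)
print(typical, atypical)
```

The resulting vector has one entry per (state, amino acid) pair regardless of protein length, which is what makes it usable as fixed-length classifier input.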
Spectrum kernel uses counts of k-mers
• The k-spectrum is the set of all length-k contiguous subsequences.
• Features are counts of subsequences.
• Dimension of feature space = 20^k
Example (k = 3): AKQDYYYYEI
Feature:  AAA  AAC  …  AKQ  …  KQD  …  YYY
Count:      0    0  …    1  …    1  …    2
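The k-mer counting can be sketched in a few lines. The feature vector is stored sparsely as a dictionary rather than as an explicit vector with one entry per possible k-mer; the function names are hypothetical, and the kernel is taken to be the plain inner product of the two count vectors.

```python
from collections import Counter


def spectrum(seq, k=3):
    """Count every contiguous length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the two k-mer count vectors."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(count * sy[kmer] for kmer, count in sx.items())


features = spectrum("AKQDYYYYEI")
print(features["AAA"], features["AKQ"], features["KQD"], features["YYY"])  # 0 1 1 2
print(spectrum_kernel("AKQDYYYYEI", "AKQDYYYYEI"))  # 10
```

Because only k-mers that actually occur are stored, the cost depends on sequence length rather than on the size of the full feature space.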
Pairwise kernel uses pairwise sequence comparison scores
[Figure: each feature is the Smith-Waterman score from comparison of protein X and protein Y]
Support vector machine classification
[Figure: proteins plotted by SW score w.r.t. protein X against SW score w.r.t. protein Y]
SVM classification
[Figure: the same axes showing penicillin amidase, other, and unknown proteins, with regions labeled high-confidence “No”, high-confidence “Yes”, and low-confidence “Yes”]
SVMs outperform HMMs
Liao & Noble, Journal of Computational Biology, 2003.
Network diffusion
Rankprop (Weston PNAS 2004)
Adai et al. JMB 340:179-190 (2004).
Protein similarity network
• Compute all-vs-all PSI-BLAST similarity network.
• Store all E-values (no threshold).
• Convert E-values to weights via transfer function (weight = e^(-E/σ)).
• Normalize edges leading into a node to sum to 1.
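The last two steps can be sketched as follows, assuming an exponential transfer function exp(-E / sigma). The scale parameter sigma, its default value, the exclusion of self-edges and the function name are assumptions made for illustration, and the E-values in the usage lines are made up.

```python
import math


def similarity_weights(evalues, sigma=100.0):
    """Turn an all-vs-all E-value matrix into incoming-normalized edge weights.

    evalues[i][j] is the E-value of protein j when protein i is the query.
    In the result, the weights of all edges leading into node j sum to 1.
    """
    n = len(evalues)
    K = [[0.0 if i == j else math.exp(-evalues[i][j] / sigma) for j in range(n)]
         for i in range(n)]
    for j in range(n):
        incoming = sum(K[i][j] for i in range(n))
        if incoming > 0:
            for i in range(n):
                K[i][j] /= incoming
    return K


evalues = [
    [0.0, 1e-30, 1e-5],
    [1e-28, 0.0, 50.0],
    [1e-4, 80.0, 0.0],
]
K = similarity_weights(evalues)
for row in K:
    print(row)
```

Keeping every E-value, including very weak ones, means the network stays densely connected; the transfer function is what pushes unconvincing hits toward small weights.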
The RankProp algorithm
• The query node always has activation of 1.
• Propagate activation to neighbors of the query, weighted by edge weight.
• Propagate activation among more distant nodes with an additional weight factor α.
K_ij = similarity of proteins i and j
y_i(t) = activation of protein i at time t
Toy example (α = 0.95)
[Figure: five-node network with the query at activation 1; edge weights 0.8, 0.9, 0.5, 0.2 and 0.7]
Round 0: the query has activation 1; every other node has activation 0.
Round 1: the other nodes have activations 0.8, 0.5, 0.2 and 0.
Round 2: the other nodes have activations 0.8, 0.5, 0.884 and 0.133, where
0.884 = 0.2 + (0.95 * 0.8 * 0.9)
0.133 = 0 + (0.95 * 0.2 * 0.7)
Activations
        Round 0  Round 1  Round 2  Round 3
Node 1  1
Node 2  0
Node 3  0
Node 4  0
Node 5  0
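One way to fill in such a table is to iterate the update y_i(t+1) = K_1i + α Σ_j K_ji y_j(t), with the query held at 1 and excluded from the sum. This rule is inferred from the toy example's worked arithmetic (0.884 = 0.2 + 0.95 * 0.8 * 0.9). The graph below is an assumption: edge directions and the numbering of nodes 2 to 5 are guesses chosen so that the rule reproduces those numbers, and indices are 0-based, so index 0 is Node 1.

```python
def rankprop_rounds(K, query, alpha=0.95, rounds=3):
    """Return the activation vector after each round, starting with round 0.

    K[j][i] is the weight of the edge from node j into node i.
    """
    n = len(K)
    y = [0.0] * n
    y[query] = 1.0
    history = [y]
    for _ in range(rounds):
        new_y = []
        for i in range(n):
            if i == query:
                new_y.append(1.0)  # the query always keeps activation 1
            else:
                spread = sum(K[j][i] * y[j] for j in range(n) if j != query)
                new_y.append(K[query][i] + alpha * spread)
        y = new_y
        history.append(y)
    return history


# Assumed toy graph: query -> 2 (0.8), query -> 3 (0.5), query -> 4 (0.2),
# 2 -> 4 (0.9), 4 -> 5 (0.7).
K = [[0.0] * 5 for _ in range(5)]
K[0][1], K[0][2], K[0][3] = 0.8, 0.5, 0.2
K[1][3] = 0.9
K[3][4] = 0.7

history = rankprop_rounds(K, query=0)
for t, y in enumerate(history):
    print("Round", t, [round(v, 3) for v in y])
```

Under these assumptions the fifth node, which has no direct edge from the query, still accumulates activation through intermediate nodes; that indirect evidence is what lets the final ranking differ from the direct PSI-BLAST scores.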
Toy example #2
Propagation corrects errors made by PSI-BLAST
[Figure: query and homologs, with PSI-BLAST false negatives marked]
[Figure: network before (many connections) and after (few connections). Legend: member of PYP-like sensor domain superfamily; protein domain with known structure, not in superfamily; protein with no known structure.]
Comparison with PSI-BLAST
Axes are ROC50 scores.
Metric space embedding
Protembed (Melvin PLOS Comp Biol 2011)
Comparison of methods
Method                     Task    Alignment  Confidence  Performance
Pairwise alignment         Single  X          X           1
Hidden Markov models       Family  X          (X)
Iterative profile methods  Single  X          X           2
Profile-profile alignment  Either  X          X           3
Support vector machines    Family
Network diffusion          Single
Metric space embedding     Single
4
X
5
Please read the SVM article before the next class.
Please write a one-minute response before you leave.