Remote homology detection
Genome 541
Prof. William Stafford Noble

Outline
• Pairwise alignment algorithms
• Hidden Markov models
• Iterative profile methods
• Profile-profile alignment
• Support vector machines
• Network diffusion
• Metric space embedding

Pairwise alignment algorithms
• Smith-Waterman (Smith J. Mol Biol 1981)
• BLAST (Altschul Nucleic Acids Research 1990)

A pairwise alignment algorithm ranks homologs and produces an alignment
[Figure labels: Query; Sequence database; BLAST; Homologs.]
Ideally, each target sequence also has an associated statistical confidence estimate.

Profile hidden Markov models
• SAM (Krogh J. Mol. Biol. 1994)
• HMMER (Eddy ISMB 1995)

Some positions are more likely to have a gap than others
EEFG----SVDGLVNNA
QKYG----RLDVMINNA
RRLG----TLNVLVNNA
GGIG----PVD-LVNNA
KALG----GFNVIVNNA
ARFG----KID-LIPNA
FEPEGPEKGMWGLVNNA
AQLK----TVDVLINGA

EEFGSVDGLVNNA
Charge more for a gap here than here.

A profile HMM has position-specific substitutions, insertions and deletions
[Figure: profile HMM with begin state B, end state E, insert states I0 to I8, match states M1 to M8, and delete states D1 to D8.]

HMM dynamic programming
[Figure: dynamic programming matrix for the sequence E M R K L against the HMM states B, M1, I1, D1, I2, E.]

Iterative profile methods
• PSI-BLAST (Altschul Nucleic Acids Research 1997)
• SAM-T98 (Karplus Bioinformatics 1998)

Position-specific iterated BLAST
[Figure labels: Query; Sequence database; BLAST; Homologs; Statistical model of protein family; Position-specific scoring matrix (PSSM).]

Profile-profile alignment
• HHPred / HHSearch (Söding Bioinformatics 2005)
Path through two HMMs corresponds to a sequence co-emitted by both

Support vector machine methods
• Fisher SVM (Jaakkola ISMB 1999)
• Spectrum kernel (Leslie NIPS 2002)
• Pairwise kernel (Liao RECOMB 2002)
• SVM-fold (Melvin BMC Bioinformatics 2007)

We can frame homology detection as a classification task
• Many statistical and machine learning algorithms exist for performing binary classification.
  – Naïve Bayes classifier
  – Nearest neighbor classifier
  – Linear discriminant analysis
  – Decision trees
  – Neural networks
  – Support vector machines
• We can use such methods to answer questions of the form, “Is this protein a penicillin amidase?”
[Figure: “Is protein X a penicillin amidase?” goes into a Classifier, which answers “Yes”.]
Important caveat: This approach only produces a ranking, not an alignment.

Problem: Protein lengths vary
• Most classifiers require fixed-length input.
• We need a method to convert a protein into a vector.
[Figure label: “Featurizer”]

Fisher kernel uses sufficient statistics from HMM training
• For each training example x, the Fisher score is the gradient of the log-likelihood for x given M+.
• For example, [equation not preserved], annotated with:
  – Emission probability of amino acid i in state j.
  – Number of times that amino acid i is observed in state j.

Spectrum kernel uses counts of k-mers
• The k-spectrum is the set of all length-k subsequences.
• Features are counts of subsequences.
• Dimension of feature space = $20^k$
Example (k = 3): AKQDYYYYEI maps to (0, 0, …, 1, …, 1, …, 2), where the entries shown are the counts of AAA, AAC, …, AKQ, …, KQD, …, YYY.

Pairwise kernel uses pairwise sequence comparison scores
[Figure label: Smith-Waterman score from comparison of protein X and protein Y.]

Support vector machine classification
[Figure: axes are SW score w.r.t. protein X and SW score w.r.t. protein Y.]

SVM classification
[Figure: axes are SW score w.r.t. protein X and SW score w.r.t. protein Y. Point classes: Penicillin amidase; Other; Unknown. Regions: High confidence “No”; High confidence “Yes”; Low-confidence “Yes”.]

SVMs outperform HMMs
Liao & Noble, Journal of Computational Biology, 2003.

Network diffusion
• Rankprop (Weston PNAS 2004)

[Figures from Adai et al. JMB 340:179-190 (2004).]

Protein similarity network
• Compute all-vs-all PSI-BLAST similarity network.
• Store all E-values (no threshold).
• Convert E-values to weights via transfer function (weight = $e^{-E/\sigma}$).
• Normalize edges leading into a node to sum to 1.

The RankProp algorithm
• The query node always has activation of 1.
• Propagate activation to neighbors of the query, weighted by edge weight.
• Propagate activation among more distant nodes with an additional weight factor α.
$K_{ij}$ = similarity of proteins i and j
$y_i(t)$ = activation of protein i at time t

Toy example (α = 0.95)
[Figure, built up over several slides: a five-node network with edge weights 0.8, 0.9, 0.5, 0.2 and 0.7.]
Round 0: the query node has activation 1; the other four nodes have activation 0.
Round 1: the query node has activation 1; the other four nodes have activations 0.8, 0.5, 0.2 and 0.
Round 2: the nodes at 0.8 and 0.5 are unchanged; the other two become
  0.884 = 0.2 + (0.95 * 0.8 * 0.9)
  0.133 = 0 + (0.95 * 0.2 * 0.7)

Activations
          Round 0   Round 1   Round 2   Round 3
Node 1    1
Node 2    0
Node 3    0
Node 4    0
Node 5    0

Toy example #2

Propagation corrects errors made by PSI-BLAST
[Figure labels: Homologs; Query; PSI-BLAST false negatives; Many connections; Few connections.]
[Figure: “Before” and “After” panels, each with the legend below.]
• Member of PYP-like sensor domain superfamily.
• Protein domain with known structure, not in superfamily.
• Protein with no known structure.

Comparison with PSI-BLAST
[Figure] Axes are ROC50 scores.

Metric space embedding
• Protembed (Melvin PLOS Comp Biol 2011)

Comparison of methods
Method                       Task     Alignment   Confidence   Performance
Pairwise alignment           Single   X           X            1
Hidden Markov models         Family   X           (X)
Iterative profile methods    Single   X           X            2
Profile-profile alignment    Either   X           X            3
Support vector machines      Family
Network diffusion            Single
Metric space embedding       Single
[Remaining entries for the last three rows, positions not recoverable: 4, X, 5.]

Please read the SVM article before the next class.
Please write a one-minute response before you leave.
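
The RankProp toy example can be replayed in a few lines of Python. This is a minimal sketch under stated assumptions, not the published implementation. The update rule is inferred from the slide arithmetic (for example 0.884 = 0.2 + 0.95 * 0.8 * 0.9): each non-query node receives its edge weight from the query plus α times the weighted activations of its other in-neighbors. The node names `a` to `d` and the wiring of the five edges are hypothetical, chosen so that the slide's numbers come out. The toy weights are used as given, without the "edges into a node sum to 1" normalization.

```python
ALPHA = 0.95  # weight factor alpha from the toy example

# Edge weights keyed by (source, target). The five weights are the ones on the
# slide; the node names and which nodes each edge connects are assumptions
# inferred from the worked arithmetic.
EDGES = {
    ("query", "a"): 0.8,
    ("query", "b"): 0.5,
    ("query", "c"): 0.2,
    ("a", "c"): 0.9,
    ("c", "d"): 0.7,
}
NODES = ["query", "a", "b", "c", "d"]


def rankprop_round(activation, edges, alpha, query="query"):
    """Return the activations after one more round of propagation."""
    updated = {}
    for node in activation:
        if node == query:
            updated[node] = 1.0  # the query node always has activation 1
            continue
        # Activation received directly from the query, weighted by edge weight.
        direct = edges.get((query, node), 0.0)
        # Activation received from non-query nodes, damped by alpha.
        indirect = sum(
            weight * activation[source]
            for (source, target), weight in edges.items()
            if target == node and source != query
        )
        updated[node] = direct + alpha * indirect
    return updated


def rankprop(nodes, edges, alpha, rounds, query="query"):
    """Return one activation dict per round, starting with round 0."""
    activation = {node: 1.0 if node == query else 0.0 for node in nodes}
    history = [activation]
    for _ in range(rounds):
        activation = rankprop_round(activation, edges, alpha, query)
        history.append(activation)
    return history


if __name__ == "__main__":
    for t, activation in enumerate(rankprop(NODES, EDGES, ALPHA, rounds=3)):
        print(t, {node: round(value, 3) for node, value in activation.items()})
```

With this wiring, round 1 gives activations 0.8, 0.5, 0.2 and 0 for the non-query nodes, and round 2 gives 0.884 for node `c` and 0.133 for node `d`, matching the values worked out on the slides.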