Transcript CS5263 Bioinformatics - Department of Computer Science
CS5263 Bioinformatics
Probabilistic modeling approaches for motif finding
Motif representation
• Collection of exact words – {ACGTTAC, ACGCTAC, AGGTGAC, …} • Consensus sequence (with wild cards) – {AcGTgTtAC} – {ASGTKTKAC} S=C/G, K=G/T (IUPAC code) • Position specific weight matrices (PWMs)
Sequence Logo
A C G T I 1 .97
2 .10
3 .02
4 .03
5 .10
6 .01
7 .05
8 .85
9 .03
.01
.01
.40
.40
.01
.95
.04
.03
.05
.40
.01
.01
.05
.3
.05
.05
.03
.03
.01
.10
.02
.90
.45
.97
.6
.05
.91
1.76 0.28 1.64 1.37 0.40 1.76 0.60 1.15 1.42
Finding Motifs
Classification of approaches
• Combinatorial search – Based on enumeration of words and computing word similarities – Analogy to DP for sequence alignment • Probabilistic modeling – Construct models to distinguish motifs vs non motifs – Analogy to HMM for sequence alignment
Combinatorial motif finding
Given a set of sequences S = {x 1 , …, x n } • A motif W is a consensus string w 1 …w K • Find motif W * with “best” match to x 1 , …, x n Definition of “best”: d(W, x i ) = min hamming dist. between W and a word in x i d(W, S) = i d(W, x i ) W* = argmin( d(W, S) )
Exhaustive searches
1. Pattern-driven algorithm: For W = AA…A to TT…T Find d( W, S ) Report W* = argmin( d(W, S) ) (4 K possibilities) Running time: O( K N 4 K (where N = i |x i |) ) Guaranteed to find the optimal solution.
Exhaustive searches
2. Sample-driven algorithm: For W = a K-long word in some x i Find d( W, S ) Report W* = argmin( d( W, S ) ) OR Report a local improvement of W * Running time: O( K N 2 )
WEEDER: algorithm sketch
# mismatches (e, B) Seq occ Current pattern P, |P| < K A C G T T • A list containing all eligible nodes: with at most α mismatches to P • For each node, remember #mismatches accumulated ( e ), and bit vector ( B ) for seq occ, e.g. [011100010] • Bit OR all B’s to get seq occurrence for P • Suppose #occ >= m – Pattern still valid • Now add a letter
WEEDER: algorithm sketch
(e, B) Current pattern P A C G T T A • Simple extension: no branches. – No change to B – e may increase by 1 or no change – Drop node if e > α • Branches: replace a node with its child nodes – Drop if e > α – B may change • Re-do Bit OR using all B’s • Try a different char if #occ < m • Report P when |P| = K
Probabilistic modeling approaches for motif finding
Probabilistic modeling approaches
• A motif model – Usually a PWM – M = (P ij ) , i = 1..4, j = 1..k, k: motif length • A background model – Usually the distribution of base frequencies in the genome (or other selected subsets of sequences) – B = (b i ) , i = 1..4
• A word can be generated by M or B
Expectation-Maximization
• For any word W, P(W | M) = P W[1] 1 P W[2] 2 …P W[K] K P(W | B) = b W[1] b W[2] …b W[K] • Let = P(M), i.e., the probability for any word to be generated by M.
• Then P(B) = 1 • Can compute the posterior probability P(M|W) and P(B|W) P(M|W) ~ P(W|M) * P(B|W) ~ P(W|B) * (1 )
Expectation-Maximization
Initialize: Randomly assign each word to M or B • Let Z xy = 1 if position y in sequence x is a motif, and 0 otherwise • Estimate parameters M, , B Iterate until converge: • E-step: Z xy = P(M | X[y..y+k-1]) for all x and y • M-step: re-estimate M, given Z (B usually fixed)
Expectation-Maximization
position 1 5 9 Initialize E-step M-step • E-step: Z xy = P(M | X[y..y+k-1]) for all x and y • M-step: re-estimate M, given Z 1 5 9
MEME
• Multiple EM for Motif Elicitation • Bailey and Elkan, UCSD • http://meme.sdsc.edu/ • Multiple starting points • Multiple modes: ZOOPS, OOPS, TCM
Gibbs Sampling
• Another very useful technique for estimating missing parameters • EM is deterministic – Often trapped by local optima • Gibbs sampling: stochastic behavior to avoid local optima
Gibbs sampling
Initialize: Randomly assign each word to M or B • Let Z xy = 1 if position y in sequence x is a motif, and 0 otherwise • Estimate parameters M, B, Iterate: • Randomly remove a sequence X* from S • Recalculate model parameters using S \ X* • Compute Z x*y for X* • Sample a y* from Z x*y . • Let Z x*y = 1 for y = y* and 0 otherwise
Gibbs Sampling
position 0.2
0.15
0.1
0.05
0 0 2 4 6 8 10 position 12 14 16 18 20 Sampling • Gibbs sampling: sample one position according to probability – Update prediction of one training sequence at a time • Viterbi: always take the highest • EM: take weighted average Simultaneously update predictions of all sequences
Better background model
• Repeat DNA can be confused as motif – Especially low-complexity CACACA… AAAAA, etc.
• Solution: more elaborate background model – Higher-order Markov model 0 th 1 st … K th order: B = { p A , p C , p G , p T } order: B = { P(A|A), P(A|C), …, P(T|T) } order: B = { P(X | b 1 …b K ); X, b i {A,C,G,T} } Has been applied to EM and Gibbs (up to 3 rd order)
Gibbs sampling motif finders
• Gibbs Sampler – First appeared as: Larence et.al.
Science
– Allow two-block motifs – Consider higher-order markov models 262(5131):208-214. – Continually developed and updated. webpage – The newest version: Thompson et. al.
Nucleic Acids Res
. 35 (s2):W232 W237 • AlignACE – Hughes et al., J. of Mol Bio, 2000 10;296(5):1205-14.
– Allow don’t care positions – Additional tools to scan motifs on new seqs, and to compare and group – motifs • BioProspector, X. Liu et. al. PSB 2001 , an improvement of AlignACE Liu, Brutlag and Liu. Pac Symp Biocomput. 2001;:127-38 .
Limits of Motif Finders
0 ???
gene • Given upstream regions of coregulated genes: – Increasing length makes motif finding harder random motifs clutter the true ones – – Decreasing length makes motif finding motif missing in some sequences harder – true
Challenging problem
k d mutations n = 20 L = 600 • (k, d)-motif challenge problem • Many algorithms fail at (15, 4)-motif for n = 20 and L = 600 • Combinatorial algorithms usually work better on challenge problem – However, they are usually designed to find (k, d)-motifs – Performance in real data varies
(15, 4)-motif
• Information content: 11.7 bits • ~ 6mers. Expected occurrence 1 per 3k bp
Actual Results by MEME
llr = 163 E-value = 3.2e+005 llr = 177 E-value = 1.5e+006 llr = 88 E-value = 2.5e+005
Motif finding in practice
• Now we’ve found some good looking motifs – This is probably the easiest step • What to do next?
– Are they real?
– How do we find more instances in the rest of the genome?
– What are their functional meaning?
• Motifs => regulatory networks
How to make sense of the motifs?
• Each program usually reports a number of motifs (tens to hundreds) – Many motifs are variations of each other – Each program also report some different ones • Each program has its own way of scoring motifs – Best scored motifs often not interesting – AAAAAAAA – ACACACAC – TATATATAT
How to make sense of the motifs?
• Combine results from different algorithms usually helpful – Ones that appeared multiple times are probably more interesting • Except simple repeats like AAAAA or ATATATATA – Cluster motifs into groups. • Compare with known motifs in database – TRANSFAC – JASPAR – YPD (yeast promoter database)
Strategies to improve results
• How to tell real motifs (functional) from noises? Statistical test of significance.
– Enrichment in target sequences vs background sequences Target set T Background set B Assumed to contain a common motif, P Assumed to not contain P, or with very low frequency Ideal case: every sequence in T has P, no sequence in B has P
Statistical test for significance
P Target set T
N
P appeared in
n
sequences P appeared in
m
sequences Background set + target set B + T
M
• If
n / N >> m / M
– P is enriched (over-represented) in T – Statistical significance?
• If we randomly draw
N
sequences likely we will see at least
n
from (B+T) sequences , how having P ?
Hypergeometric distribution
hypegeom
(
n
;
M
,
N
,
m
)
m n
M N
m n
M N
• A box with
M
balls (seqs), of which
m
are red (with motifs), and the rest are blue (without motifs).
– Red ball: sequences with motifs – Blue ball: sequences without motifs • We randomly draw
N
the box balls (seqs) from • What’s the probability we’ll see
n
balls?
red # of choices to have
n
red balls Total # of choices to draw
N
balls
Cumulative hypergeometric test for motif significance
• We are interested in: if we randomly pick
m
balls, how likely that we’ll see at least
n
red balls?
cHypegeom
(
n
;
M
,
N
,
m
) ) min(
i m
,
n N hypogeom
(
i
;
M
,
N
,
m
) 1
i n
1 0
hypogeom
(
i
;
M
, Null hypothesis: our selection is random.
Alternative hypothesis: our selection favored red balls.
When prob is small, we reject the null hypothesis.
1
i n
1 0 Equivalent: we accept the alternative hypothesis
m i
M N M N
m i
(The number of red balls is larger than expected).
N
,
m
)
Example
• Yeast genome has 6000 genes • Select 50 genes believed to be co-regulated by a common TF • Found a motif from the promoter seqs of these 50 genes • The motif appears in 20 of these 50 genes • In the rest of the genome, 100 genes have this motif • M = 6000, N = 50, m = 100+20 = 120, n = 20 • Intuitively: – m/M = 120/6000=1/50. (1 out 50 genes has the motif) – N = 50, would expect only 1 gene in the target set to have the motif – 20-fold enrichment • P-value = cHyperGeom(20; 6000, 50, 120) = 6 x 10 -22 • This motif is significantly enriched in the set of genes
ROC curve for motif significance
• Motif is usually a PWM • Any word will have a score – Typical scoring function: Log (P(W | M) / P(W | B)) – W: a word. – M: a PWM. – B: background model • To determine whether motif M occurred in a sequence, a cutoff has to be decided – Different cutoffs give different # of occurrences – Stringent cutoff: low occurrence in both + and - sequences – Loose cutoff: high occurrence in both + and - sequences – It may be better to look at a range of cutoffs
ROC curve for motif significance
P Target set T N Background set + target set B + T Given a score cutoff Appeared in n sequences Appeared in m sequences M • With different score cutoff, will have different m and n • Assume you want to use P to classify T and B • Sensitivity: n / N • Specificity: (M-N-m+n) / (M-N) • False Positive Rate = 1 – specificity: (m – n) / (M-N) • With decreasing cutoff, sensitivity , FPR
ROC curve for motif significance
A good cutoff 1 Lowest cutoff. Every sequence has the motif. Sensitivity = 1. specificity = 0.
ROC-AUC: area under curve. 1: the best. 0.5: random.
0 0 1-specificity Motif 1 Motif 2 Random 1 Motif 1 is more enriched in motif 2.
Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1.
Other strategies
• Cross-validation – Randomly divide sequences into 10 sets, hold 1 set for test. – Do motif finding on 9 sets. Does the motif also appear in the testing set?
• Phylogenetic conservation information – Does a motif also appears in the homologous genes of another species?
– Strongest evidence – However, will not be able to find species-specific ones
Other strategies
• Finding motif modules – Will two motifs always appear in the same gene?
• Location preference – Some motifs appear to be in certain location • E.g., within 50-150bp upstream to transcription start – If a detect motif has strong positional bias, may be a sign of its function • Evidence from other types of data sources – Do the genes having the motif always have similar activities (gene expression levels) across different conditions?
– Interact with the same set of proteins?
– Similar functions?
– etc.
To search for new instances
• Usually many false positives • Score cutoff is critical • Can estimate a score cutoff from the “true” binding sites Motif finding Scoring function Log (P(W | M) / P(W | B)) A set of scores for the “true” sites. Take mean - std as a cutoff. (or a cutoff such that the majority of “true” sites can be predicted).
To search for new instances
• Use other information, such as positional biases of motifs to restrict the regions that a motif may appear • Use gene expression data to help: the genes having the true motif should have similar activities – Risk of circular reasoning: most likely this is how you get the initial sequences to do motif finding • Phylogenetic conservation is the key
References
• D’haeseleer P (2006) What are DNA sequence motifs?
NATURE BIOTECHNOLOGY
, 24 (4):423-425 • D’haeseleer P (2006) How does DNA sequence motif discovery work?
NATURE BIOTECHNOLOGY
, 24 (8):959-961 • MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory DNA sequence motifs.
PLoS Comput Biol
2(4): e36 • Lawrence CE et. al. (1993)
Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple
Alignment, Science , 262(5131):208-214