CS5263 Bioinformatics - Department of Computer Science

Download Report

Transcript CS5263 Bioinformatics - Department of Computer Science

CS5263 Bioinformatics

Probabilistic modeling approaches for motif finding

Motif representation

• Collection of exact words – {ACGTTAC, ACGCTAC, AGGTGAC, …} • Consensus sequence (with wild cards) – {AcGTgTtAC} – {ASGTKTKAC} S=C/G, K=G/T (IUPAC code) • Position specific weight matrices (PWMs)

Sequence Logo

A C G T I 1 .97

2 .10

3 .02

4 .03

5 .10

6 .01

7 .05

8 .85

9 .03

.01

.01

.40

.40

.01

.95

.04

.03

.05

.40

.01

.01

.05

.3

.05

.05

.03

.03

.01

.10

.02

.90

.45

.97

.6

.05

.91

1.76 0.28 1.64 1.37 0.40 1.76 0.60 1.15 1.42

Finding Motifs

Classification of approaches

• Combinatorial search – Based on enumeration of words and computing word similarities – Analogy to DP for sequence alignment • Probabilistic modeling – Construct models to distinguish motifs vs non motifs – Analogy to HMM for sequence alignment

Combinatorial motif finding

Given a set of sequences S = {x 1 , …, x n } • A motif W is a consensus string w 1 …w K • Find motif W * with “best” match to x 1 , …, x n Definition of “best”: d(W, x i ) = min hamming dist. between W and a word in x i d(W, S) =  i d(W, x i ) W* = argmin( d(W, S) )

Exhaustive searches

1. Pattern-driven algorithm: For W = AA…A to TT…T Find d( W, S ) Report W* = argmin( d(W, S) ) (4 K possibilities) Running time: O( K N 4 K (where N =  i |x i |) ) Guaranteed to find the optimal solution.

Exhaustive searches

2. Sample-driven algorithm: For W = a K-long word in some x i Find d( W, S ) Report W* = argmin( d( W, S ) ) OR Report a local improvement of W * Running time: O( K N 2 )

WEEDER: algorithm sketch

# mismatches (e, B) Seq occ Current pattern P, |P| < K A C G T T • A list containing all eligible nodes: with at most α mismatches to P • For each node, remember #mismatches accumulated ( e ), and bit vector ( B ) for seq occ, e.g. [011100010] • Bit OR all B’s to get seq occurrence for P • Suppose #occ >= m – Pattern still valid • Now add a letter

WEEDER: algorithm sketch

(e, B) Current pattern P A C G T T A • Simple extension: no branches. – No change to B – e may increase by 1 or no change – Drop node if e > α • Branches: replace a node with its child nodes – Drop if e > α – B may change • Re-do Bit OR using all B’s • Try a different char if #occ < m • Report P when |P| = K

Probabilistic modeling approaches for motif finding

Probabilistic modeling approaches

• A motif model – Usually a PWM – M = (P ij ) , i = 1..4, j = 1..k, k: motif length • A background model – Usually the distribution of base frequencies in the genome (or other selected subsets of sequences) – B = (b i ) , i = 1..4

• A word can be generated by M or B

Expectation-Maximization

• For any word W,  P(W | M) = P W[1] 1 P W[2] 2 …P W[K] K  P(W | B) = b W[1] b W[2] …b W[K] • Let  = P(M), i.e., the probability for any word to be generated by M.

• Then P(B) = 1  • Can compute the posterior probability P(M|W) and P(B|W)   P(M|W) ~ P(W|M) *  P(B|W) ~ P(W|B) * (1  )

Expectation-Maximization

Initialize: Randomly assign each word to M or B • Let Z xy = 1 if position y in sequence x is a motif, and 0 otherwise • Estimate parameters M,  , B Iterate until converge: • E-step: Z xy = P(M | X[y..y+k-1]) for all x and y • M-step: re-estimate M,  given Z (B usually fixed)

Expectation-Maximization

position 1 5 9 Initialize E-step M-step • E-step: Z xy = P(M | X[y..y+k-1]) for all x and y • M-step: re-estimate M,  given Z 1 5 9

MEME

• Multiple EM for Motif Elicitation • Bailey and Elkan, UCSD • http://meme.sdsc.edu/ • Multiple starting points • Multiple modes: ZOOPS, OOPS, TCM

Gibbs Sampling

• Another very useful technique for estimating missing parameters • EM is deterministic – Often trapped by local optima • Gibbs sampling: stochastic behavior to avoid local optima

Gibbs sampling

Initialize: Randomly assign each word to M or B • Let Z xy = 1 if position y in sequence x is a motif, and 0 otherwise • Estimate parameters M, B,  Iterate: • Randomly remove a sequence X* from S • Recalculate model parameters using S \ X* • Compute Z x*y for X* • Sample a y* from Z x*y . • Let Z x*y = 1 for y = y* and 0 otherwise

Gibbs Sampling

position 0.2

0.15

0.1

0.05

0 0 2 4 6 8 10 position 12 14 16 18 20 Sampling • Gibbs sampling: sample one position according to probability – Update prediction of one training sequence at a time • Viterbi: always take the highest • EM: take weighted average Simultaneously update predictions of all sequences

Better background model

• Repeat DNA can be confused as motif – Especially low-complexity CACACA… AAAAA, etc.

• Solution: more elaborate background model – Higher-order Markov model 0 th 1 st … K th order: B = { p A , p C , p G , p T } order: B = { P(A|A), P(A|C), …, P(T|T) } order: B = { P(X | b 1 …b K ); X, b i  {A,C,G,T} } Has been applied to EM and Gibbs (up to 3 rd order)

Gibbs sampling motif finders

• Gibbs Sampler – First appeared as: Larence et.al.

Science

– Allow two-block motifs – Consider higher-order markov models 262(5131):208-214. – Continually developed and updated. webpage – The newest version: Thompson et. al.

Nucleic Acids Res

. 35 (s2):W232 W237 • AlignACE – Hughes et al., J. of Mol Bio, 2000 10;296(5):1205-14.

– Allow don’t care positions – Additional tools to scan motifs on new seqs, and to compare and group – motifs • BioProspector, X. Liu et. al. PSB 2001 , an improvement of AlignACE Liu, Brutlag and Liu. Pac Symp Biocomput. 2001;:127-38 .

Limits of Motif Finders

0 ???

gene • Given upstream regions of coregulated genes: – Increasing length makes motif finding harder random motifs clutter the true ones – – Decreasing length makes motif finding motif missing in some sequences harder – true

Challenging problem

k d mutations n = 20 L = 600 • (k, d)-motif challenge problem • Many algorithms fail at (15, 4)-motif for n = 20 and L = 600 • Combinatorial algorithms usually work better on challenge problem – However, they are usually designed to find (k, d)-motifs – Performance in real data varies

(15, 4)-motif

• Information content: 11.7 bits • ~ 6mers. Expected occurrence 1 per 3k bp

Actual Results by MEME

llr = 163 E-value = 3.2e+005 llr = 177 E-value = 1.5e+006 llr = 88 E-value = 2.5e+005

Motif finding in practice

• Now we’ve found some good looking motifs – This is probably the easiest step • What to do next?

– Are they real?

– How do we find more instances in the rest of the genome?

– What are their functional meaning?

• Motifs => regulatory networks

How to make sense of the motifs?

• Each program usually reports a number of motifs (tens to hundreds) – Many motifs are variations of each other – Each program also report some different ones • Each program has its own way of scoring motifs – Best scored motifs often not interesting – AAAAAAAA – ACACACAC – TATATATAT

How to make sense of the motifs?

• Combine results from different algorithms usually helpful – Ones that appeared multiple times are probably more interesting • Except simple repeats like AAAAA or ATATATATA – Cluster motifs into groups. • Compare with known motifs in database – TRANSFAC – JASPAR – YPD (yeast promoter database)

Strategies to improve results

• How to tell real motifs (functional) from noises? Statistical test of significance.

– Enrichment in target sequences vs background sequences Target set T Background set B Assumed to contain a common motif, P Assumed to not contain P, or with very low frequency Ideal case: every sequence in T has P, no sequence in B has P

Statistical test for significance

P Target set T

N

P appeared in

n

sequences P appeared in

m

sequences Background set + target set B + T

M

• If

n / N >> m / M

– P is enriched (over-represented) in T – Statistical significance?

• If we randomly draw

N

sequences likely we will see at least

n

from (B+T) sequences , how having P ?

Hypergeometric distribution

hypegeom

(

n

;

M

,

N

,

m

)   

m n

   

M N

 

m n

   

M N

  • A box with

M

balls (seqs), of which

m

are red (with motifs), and the rest are blue (without motifs).

– Red ball: sequences with motifs – Blue ball: sequences without motifs • We randomly draw

N

the box balls (seqs) from • What’s the probability we’ll see

n

balls?

red # of choices to have

n

red balls Total # of choices to draw

N

balls

Cumulative hypergeometric test for motif significance

• We are interested in: if we randomly pick

m

balls, how likely that we’ll see at least

n

red balls?

cHypegeom

(

n

;

M

,

N

,

m

)  ) min(

i m

,  

n N hypogeom

(

i

;

M

,

N

,

m

)  1 

i n

 1   0

hypogeom

(

i

;

M

, Null hypothesis: our selection is random.

Alternative hypothesis: our selection favored red balls.

When prob is small, we reject the null hypothesis.

 1 

i n

 1   0 Equivalent: we accept the alternative hypothesis   

m i

        

M N M N

    

m i

   (The number of red balls is larger than expected).

N

,

m

)

Example

• Yeast genome has 6000 genes • Select 50 genes believed to be co-regulated by a common TF • Found a motif from the promoter seqs of these 50 genes • The motif appears in 20 of these 50 genes • In the rest of the genome, 100 genes have this motif • M = 6000, N = 50, m = 100+20 = 120, n = 20 • Intuitively: – m/M = 120/6000=1/50. (1 out 50 genes has the motif) – N = 50, would expect only 1 gene in the target set to have the motif – 20-fold enrichment • P-value = cHyperGeom(20; 6000, 50, 120) = 6 x 10 -22 • This motif is significantly enriched in the set of genes

ROC curve for motif significance

• Motif is usually a PWM • Any word will have a score – Typical scoring function: Log (P(W | M) / P(W | B)) – W: a word. – M: a PWM. – B: background model • To determine whether motif M occurred in a sequence, a cutoff has to be decided – Different cutoffs give different # of occurrences – Stringent cutoff: low occurrence in both + and - sequences – Loose cutoff: high occurrence in both + and - sequences – It may be better to look at a range of cutoffs

ROC curve for motif significance

P Target set T N Background set + target set B + T Given a score cutoff Appeared in n sequences Appeared in m sequences M • With different score cutoff, will have different m and n • Assume you want to use P to classify T and B • Sensitivity: n / N • Specificity: (M-N-m+n) / (M-N) • False Positive Rate = 1 – specificity: (m – n) / (M-N) • With decreasing cutoff, sensitivity  , FPR 

ROC curve for motif significance

A good cutoff 1 Lowest cutoff. Every sequence has the motif. Sensitivity = 1. specificity = 0.

ROC-AUC: area under curve. 1: the best. 0.5: random.

0 0 1-specificity Motif 1 Motif 2 Random 1 Motif 1 is more enriched in motif 2.

Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1.

Other strategies

• Cross-validation – Randomly divide sequences into 10 sets, hold 1 set for test. – Do motif finding on 9 sets. Does the motif also appear in the testing set?

• Phylogenetic conservation information – Does a motif also appears in the homologous genes of another species?

– Strongest evidence – However, will not be able to find species-specific ones

Other strategies

• Finding motif modules – Will two motifs always appear in the same gene?

• Location preference – Some motifs appear to be in certain location • E.g., within 50-150bp upstream to transcription start – If a detect motif has strong positional bias, may be a sign of its function • Evidence from other types of data sources – Do the genes having the motif always have similar activities (gene expression levels) across different conditions?

– Interact with the same set of proteins?

– Similar functions?

– etc.

To search for new instances

• Usually many false positives • Score cutoff is critical • Can estimate a score cutoff from the “true” binding sites Motif finding Scoring function Log (P(W | M) / P(W | B)) A set of scores for the “true” sites. Take mean - std as a cutoff. (or a cutoff such that the majority of “true” sites can be predicted).

To search for new instances

• Use other information, such as positional biases of motifs to restrict the regions that a motif may appear • Use gene expression data to help: the genes having the true motif should have similar activities – Risk of circular reasoning: most likely this is how you get the initial sequences to do motif finding • Phylogenetic conservation is the key

References

• D’haeseleer P (2006) What are DNA sequence motifs?

NATURE BIOTECHNOLOGY

, 24 (4):423-425 • D’haeseleer P (2006) How does DNA sequence motif discovery work?

NATURE BIOTECHNOLOGY

, 24 (8):959-961 • MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory DNA sequence motifs.

PLoS Comput Biol

2(4): e36 • Lawrence CE et. al. (1993)

Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple

Alignment, Science , 262(5131):208-214