
CS5263 Bioinformatics
Lecture 19
Motif finding
(Sequence) motif finding
• Given a set of sequences
• Goal: find sequence motifs that appear in all or the majority of the sequences, and are likely associated with some function
– In DNA: regulatory sequences
– In protein: functional/structural domains
Biological background for motif finding
Regulation of Genes
[Figure: a transcription factor (protein) binds a regulatory element on the DNA upstream of a gene; RNA polymerase is recruited to transcribe the gene.]
The Cell as a Regulatory Network
[Figure: genes A, B, C, and D wired into a regulatory network, with rules such as “If C then D”, “If B then NOT D”, “If A and B then D”, and “If D then B”.]
Characteristics of Regulatory Motifs
• Tiny (6-12 bp)
• Intergenic regions are very long
• Highly variable
• ~Constant size
– Because a constant-size transcription factor binds
• Often repeated
• Often conserved
Motif Representation
Motif representation
• Collection of exact words
– {ACGTTAC, ACGCTAC, AGGTGAC, …}
• Consensus sequence (with wild cards)
– {AcGTgTtAC}
– {ASGTKTKAC} S=C/G, K=G/T (IUPAC code)
• Position specific weight matrices
Position Specific Weight Matrix

Pos:   1    2    3    4    5    6    7    8    9
A     .97  .10  .02  .03  .10  .01  .05  .85  .03
C     .01  .40  .01  .04  .05  .01  .05  .05  .03
G     .01  .40  .95  .03  .40  .01  .30  .05  .03
T     .01  .10  .02  .90  .45  .97  .60  .05  .91

Each entry is the frequency of a base at that position.
Consensus (IUPAC): A S G T K T K A C
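As an illustration (not part of the lecture), the following Python sketch scores a candidate word against this PWM by multiplying the per-position frequencies, i.e., P(W | M) as used later in the probabilistic approaches; the example words are arbitrary.

```python
# Minimal sketch: score a word against the PWM above.
# The values are copied from the table; the scoring scheme is the
# product of per-position base frequencies, i.e. P(W | M).
PWM = {
    "A": [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    "C": [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    "G": [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    "T": [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}

def pwm_prob(word, pwm=PWM):
    """P(word | M): product of position-specific base frequencies."""
    p = 1.0
    for j, base in enumerate(word):
        p *= pwm[base][j]
    return p

print(pwm_prob("ACGTTTTAT"))   # per-column argmax of the matrix: a strong match
print(pwm_prob("TTTTTTTTT"))   # a poor match
```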
Sequence Logo
[Figure: sequence logo drawn from the PWM above; at each position the letters A, C, G, T are stacked with heights proportional to their frequencies.]
Sequence Logo
[Figure: sequence logo of the same PWM, with the total letter height at each position scaled by the information content I.]
Information content I (bits) per position:
1.76 0.28 1.64 1.37 0.40 1.76 0.60 1.15 1.42
Background-normalized Seq Logo
[Figure: the same logo after normalizing against a non-uniform background base composition.]
I  (uniform background, bits):       1.76 0.28 1.64 1.37 0.40 1.76 0.60 1.15 1.42
I’ (background-normalized, bits):    2.00 0.13 1.35 1.60 0.45 2.00 0.70 1.37 1.65
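A small sketch (my own, not from the slides) of how the per-position information content and a background-normalized variant can be computed from this PWM; the GC-rich background used in the second call is purely an illustrative assumption.

```python
import math

# PWM from the "Position Specific Weight Matrix" slide (columns = positions 1..9).
PWM = {
    "A": [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    "C": [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    "G": [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    "T": [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}

def info_content(pwm, background=None):
    """Per-position information content in bits.

    With a uniform background this equals 2 - entropy (the usual logo height);
    with a non-uniform background it is the relative entropy
    sum_b p_b * log2(p_b / q_b).
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}
    k = len(next(iter(pwm.values())))
    return [sum(pwm[b][j] * math.log2(pwm[b][j] / background[b])
                for b in "ACGT" if pwm[b][j] > 0)
            for j in range(k)]

print([round(x, 2) for x in info_content(PWM)])                     # uniform background
# Hypothetical GC-rich background, just to illustrate the normalization:
print([round(x, 2) for x in info_content(PWM, {"A": .2, "C": .3, "G": .3, "T": .2})])
```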
Finding Motifs
Motif finding schemes: what is compared (whole genome vs a gene set) and whether cross-species conservation is used.
• Whole genome, with conservation (Genome 1 & 2 & 3): phylogenetic footprinting
• Whole genome, without conservation (Genome 1): dictionary building
• Gene set, with conservation (Gene 1A & 1B & 1C, or Gene set 1 & 2 & 3)
• Gene set, without conservation (Gene set 1): “motif finding” in the usual sense
Ideally, all of this information should be used at some stage, whether inside the algorithm or in pre- or post-processing.
Classification of approaches
• Combinatorial search
– based on enumeration of words and computing word similarities
– analogous to DP for sequence alignment
• Probabilistic modeling
– construct models to distinguish motifs from non-motifs
– analogous to HMMs for sequence alignment
Combinatorial motif finding
Given a set of sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find the motif W* with the “best” match to x1, …, xn
Definition of “best”:
d(W, xi) = min Hamming distance between W and a word in xi
d(W, S) = Σi d(W, xi)
W* = argmin d(W, S)
Exhaustive searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities)
  Find d(W, S)
Report W* = argmin d(W, S)
Running time: O(K N 4^K), where N = Σi |xi|
Guaranteed to find the optimal solution.
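A brute-force sketch of the pattern-driven search under these definitions (the toy sequences and the small K are my own; the enumeration really is O(K·N·4^K), so keep K tiny):

```python
from itertools import product

def d_word_seq(W, x):
    """min Hamming distance between W and any |W|-long word in x."""
    k = len(W)
    return min(sum(a != b for a, b in zip(W, x[i:i + k]))
               for i in range(len(x) - k + 1))

def d_word_set(W, S):
    """d(W, S) = sum over sequences of the best-match distance."""
    return sum(d_word_seq(W, x) for x in S)

def pattern_driven(S, k):
    """Enumerate all 4^k patterns and return the one minimizing d(W, S)."""
    return min(("".join(p) for p in product("ACGT", repeat=k)),
               key=lambda W: d_word_set(W, S))

# Toy example (hypothetical sequences).
S = ["TTGACGTTACCA", "GGACGCTACTTT", "CCAGGTGACGGG"]
print(pattern_driven(S, 5))
```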
Exhaustive searches
2. Sample-driven algorithm:
For W = each K-long word in some xi
  Find d(W, S)
Report W* = argmin d(W, S), or report a local improvement of W*
Running time: O(K N^2)
Exhaustive searches
• Problem with the sample-driven approach:
• If
– the true motif does not occur exactly in the data, and
– the true motif is “weak”,
• then random strings may score better than any instance of the true motif.
Example
• E. coli promoter
• “TATA box”, ~10 bp upstream of the transcription start
• Instances:
  TACGAT
  TAAAAT
  TATACT
  GATAAT
  TATGAT
  TATGTT
• Consensus: TATAAT
• Each instance differs from the consensus by at most 2 bases
• None of the instances matches the consensus perfectly
CONSENSUS
Algorithm:
Cycle 1:
For each word W in S
  For each word W’ in S
    Create a (gap-free) alignment of W and W’
Keep the C1 best alignments A1, …, AC1
e.g. ACGGTTG, ACGCCTG, CGAACTT, AGAACTA, GGGCTCT…, GGGGTGT…
Algorithm (cont’d):
Cycle i:
For each word W in S
  For each alignment Aj from cycle i-1
    Create a (gap-free) alignment of W and Aj
Keep the Ci best alignments A1, …, ACi
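A rough sketch of this greedy scheme (a simplification of my own that adds one sequence per cycle and scores alignments by total information content; the toy sequences are hypothetical):

```python
import math

def columns(words):
    """Per-position base counts for a gap-free alignment (equal-length words)."""
    k = len(words[0])
    counts = [{b: 0 for b in "ACGT"} for _ in range(k)]
    for w in words:
        for j, b in enumerate(w):
            counts[j][b] += 1
    return counts

def info_score(words):
    """Total information content of the alignment (higher = more motif-like)."""
    n = len(words)
    score = 0.0
    for col in columns(words):
        for b, c in col.items():
            if c:
                p = c / n
                score += p * math.log2(p / 0.25)
    return score

def kmers(x, k):
    return [x[i:i + k] for i in range(len(x) - k + 1)]

def consensus_like(S, k, C=10):
    """Greedy CONSENSUS-style search: grow alignments one sequence at a time,
    keeping only the C best alignments after each cycle."""
    # Cycle 1: all pairs of words from the first two sequences.
    alignments = [(w1, w2) for w1 in kmers(S[0], k) for w2 in kmers(S[1], k)]
    alignments = sorted(alignments, key=info_score, reverse=True)[:C]
    # Later cycles: extend each kept alignment with a word from the next sequence.
    for x in S[2:]:
        alignments = [a + (w,) for a in alignments for w in kmers(x, k)]
        alignments = sorted(alignments, key=info_score, reverse=True)[:C]
    return alignments[0]

S = ["TTGACGTTACCA", "GGACGCTACTTT", "CCAGGTGACGGG"]   # toy sequences (hypothetical)
print(consensus_like(S, 5))
```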
Extended sample-driven (ESD) approaches
• Hybrid between pattern-driven and sample-driven
• Assume each instance differs by no more than α bases from the motif (α usually depends on K)
[Figure: a motif and an instance of it lying within the α-neighborhood of the motif.]
• The real motif will reside in the α-neighborhood of some word in S.
• Instead of searching all 4^K patterns, we can search the α-neighborhood of every word in S.
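For illustration (not from the lecture), a tiny sketch that enumerates the α-neighborhood of a word, i.e., all DNA words within Hamming distance α:

```python
def neighborhood(word, alpha):
    """All DNA words within Hamming distance alpha of `word`
    (size roughly C(K, alpha) * 3^alpha for small alpha)."""
    if alpha == 0:
        return {word}
    out = {word}
    for i, b in enumerate(word):
        for c in "ACGT":
            if c != b:
                # substitute position i, then allow alpha-1 further mismatches
                out |= neighborhood(word[:i] + c + word[i + 1:], alpha - 1)
    return out

print(len(neighborhood("ACGTACGT", 2)))   # 8-mer, up to 2 mismatches -> 277 words
```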
WEEDER
• Naïve ESD cost: (N · K^α · 3^α) × (N · K), where N·K^α·3^α is the number of patterns to test and N·K is the number of words in the sequences
Better idea
• Using a joint suffix tree, find all patterns that:
– have length K
– appear in at least m sequences with at most α mismatches
• Post-processing
WEEDER: algorithm sketch
Current pattern P, |P| < K
[Figure: a path A-C-G-T-T in the joint suffix tree; each eligible node on the matching frontier is annotated with (e, B) and its sequence occurrence.]
• Keep a list of all eligible nodes: those with at most α mismatches to P
• For each node, remember the number of accumulated mismatches (e) and a bit vector (B) of sequence occurrence, e.g. [011100010]
• Bit-OR all B’s to get the sequence occurrence of P
• Suppose #occ >= m
– the pattern is still valid
• Now add a letter
WEEDER: algorithm sketch
Current pattern P
[Figure: the same suffix-tree path extended by one letter (A-C-G-T-T-A), with eligible nodes annotated (e, B).]
• Simple extension (no branch):
– no change to B
– e may increase by 1 or stay the same
– drop the node if e > α
• Branches: replace a node with its child nodes
– drop a child if e > α
– B may change
• Re-do the bit-OR using all B’s
• Try a different character if #occ < m
• Report P when |P| = K
WEEDER: complexity
• Can get all D(P, S) in time O(n·N·C(K, α)·3^α) ≈ O(n·N·K^α·3^α)
– n: number of sequences, needed for the bit-OR
• Better than O(K·N·4^K), since usually α << K
• K^α·3^α may still be expensive for large K
– e.g. K = 20, α = 6
WEEDER: more tricks
Current pattern P
[Figure: the suffix-tree path A-C-G-T-T-A again.]
• Original rule: eligible nodes have at most α mismatches to P
• Refined rule: eligible nodes have at most min(L·ε, α) mismatches to P
– L: current pattern length
– ε: error ratio
– this requires the mismatches to be somewhat evenly distributed among positions
• Prune the tree at depth K
MULTIPROFILER
Running example — W*: ACGTACG, W: ATGTAAG
[Figure: the α-neighborhoods of W and of the true motif W* overlap.]
• W differs from W* at up to α positions.
• The consensus sequence of the words in the α-neighborhood of W is similar to W.
• If we ignore all the characters that are similar to W, the rest may suggest the difference between W and W*.
MULTIPROFILER: algorithm sketch
• For each word P in S:
– find its α-neighborhood in S
– call the resulting list of words C
• For each position j from 1..K of the words in C:
– find the most popular character that differs from P[j]
• Replace α positions in P with the characters found above
– call the new word P’
• W* = argmin D(P’, S)
(Running example — W*: ACGTACG, W: ATGTAAG)
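A loose, self-contained sketch of this procedure (my own simplification: the α best-supported substitutions are applied, and the corrected word is kept only if it improves d(·, S); the toy sequences are hypothetical):

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def d_word_set(W, S):
    """Sum over sequences of the best-match Hamming distance to W."""
    k = len(W)
    return sum(min(hamming(W, x[i:i + k]) for i in range(len(x) - k + 1)) for x in S)

def multiprofiler_like(S, k, alpha):
    """For each sample word P: collect the words of S within alpha mismatches,
    vote for the most popular disagreeing character at each position, apply the
    alpha best-supported changes, and keep whichever word scores best."""
    best = None
    for x in S:
        for i in range(len(x) - k + 1):
            P = x[i:i + k]
            # alpha-neighborhood of P restricted to words occurring in S
            C = [y[j:j + k] for y in S for j in range(len(y) - k + 1)
                 if hamming(P, y[j:j + k]) <= alpha]
            votes = []
            for j in range(k):
                cnt = Counter(w[j] for w in C if w[j] != P[j])
                if cnt:
                    ch, n = cnt.most_common(1)[0]
                    votes.append((n, j, ch))
            P2 = list(P)
            for n, j, ch in sorted(votes, reverse=True)[:alpha]:
                P2[j] = ch
            cand = min((P, "".join(P2)), key=lambda w: d_word_set(w, S))
            if best is None or d_word_set(cand, S) < d_word_set(best, S):
                best = cand
    return best

S = ["TTGACGTACGCC", "GGACGTACGTTT", "CCAACGTACGGG"]   # toy sequences (hypothetical)
print(multiprofiler_like(S, 7, 2))
```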
MULTIPROFILER
• Complexity not discussed in the paper
• More efficient than WEEDER for longer patterns, since N < K^α·3^α
• How to choose α is an issue:
– large α: too much noise in the neighborhood
– small α: few true instances in the neighborhood
Probabilistic modeling approaches
for motif finding
Probabilistic modeling approaches
• A motif model
– Usually a PWM
– M = (Pij), i = 1..4, j = 1..k, k: motif length
• A background model
– Usually the distribution of base frequencies in
the genome (or other selected subsets of
sequences)
– B = (bi), i = 1..4
• A word can be generated by M or B
Expectation-Maximization
• For any word W,
P(W | M) = P_{W[1],1} · P_{W[2],2} · … · P_{W[K],K}
P(W | B) = b_{W[1]} · b_{W[2]} · … · b_{W[K]}
• Let π = P(M), i.e., the probability for any word to be generated by M
• Then P(B) = 1 − π
• Can compute the posterior probabilities P(M | W) and P(B | W):
P(M | W) ∝ P(W | M) · π
P(B | W) ∝ P(W | B) · (1 − π)
Expectation-Maximization
Initialize:
• Randomly assign each word to M or B
– let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise
• Estimate parameters M, π, B
Iterate until convergence:
• E-step: Zxy = P(M | X[y..y+k-1]) for all x and y
• M-step: re-estimate M and π given Z (B is usually kept fixed)
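A compact sketch of this EM loop (illustrative only; it assumes a fixed uniform 0th-order background, uses pseudocounts, and is not MEME itself):

```python
import random

def em_motif(S, k, n_iter=50, seed=0):
    """Toy EM for one PWM motif vs a fixed uniform background (a sketch, not MEME)."""
    rng = random.Random(seed)
    bases = "ACGT"
    bg = {b: 0.25 for b in bases}                    # fixed background model B
    pi = 0.1                                         # prior P(M); re-estimated below
    words = [x[i:i + k] for x in S for i in range(len(x) - k + 1)]
    # Initialize the PWM from a few randomly chosen words (plus pseudocounts)
    counts = [{b: 1.0 for b in bases} for _ in range(k)]
    for w in rng.sample(words, min(10, len(words))):
        for j, b in enumerate(w):
            counts[j][b] += 1
    pwm = [{b: c[b] / sum(c.values()) for b in bases} for c in counts]

    for _ in range(n_iter):
        # E-step: Zxy = P(M | word at position y of sequence x) for every word
        Z = []
        for w in words:
            pm, pb = pi, 1 - pi
            for j, b in enumerate(w):
                pm *= pwm[j][b]
                pb *= bg[b]
            Z.append(pm / (pm + pb))
        # M-step: re-estimate the PWM and pi from the expected counts
        counts = [{b: 0.5 for b in bases} for _ in range(k)]
        for w, z in zip(words, Z):
            for j, b in enumerate(w):
                counts[j][b] += z
        pwm = [{b: c[b] / sum(c.values()) for b in bases} for c in counts]
        pi = sum(Z) / len(Z)
    return pwm, pi

S = ["TTGACGTTACCA", "GGACGCTACTTT", "CCAGGTGACGGG"]   # toy sequences (hypothetical)
pwm, pi = em_motif(S, 5)
print(round(pi, 3))
print("".join(max(col, key=col.get) for col in pwm))   # consensus of the learned PWM
```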
Expectation-Maximization
[Figure: one EM iteration on a toy example — initialize, then the E-step computes a probability Zxy for every position of every sequence, and the M-step re-estimates M and π from these probabilities.]
• E-step: Zxy = P(M | X[y..y+k-1]) for all x and y
• M-step: re-estimate M and π given Z
MEME
• Multiple EM for Motif Elicitation
• Bailey and Elkan, UCSD
• http://meme.sdsc.edu/
• Multiple starting points
• Multiple modes: ZOOPS (zero or one occurrence per sequence), OOPS (one occurrence per sequence), TCM (two-component mixture)
Gibbs Sampling
• Another very useful technique for
estimating missing parameters
• EM is deterministic
– Often trapped by local optima
• Gibbs sampling: stochastic behavior to
avoid local optima
Gibbs sampling
Initialize:
• Randomly assign each word to M or B
– let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise
• Estimate parameters M, B, π
Iterate:
• Randomly remove a sequence X* from S
• Recalculate the model parameters using S \ X*
• Compute Zx*y for X*
• Sample a y* from the distribution Zx*y
• Set Zx*y = 1 for y = y*, and 0 elsewhere
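A minimal sketch of this sampler (my own simplification, assuming exactly one motif occurrence per sequence and a uniform background):

```python
import random

def gibbs_motif(S, k, n_iter=200, seed=0):
    """Toy Gibbs site sampler: one motif start position per sequence (a sketch)."""
    rng = random.Random(seed)
    bases = "ACGT"
    pos = [rng.randrange(len(x) - k + 1) for x in S]   # current motif starts (Z)

    def pwm_from(exclude):
        """PWM (with pseudocounts) built from all sequences except `exclude`."""
        counts = [{b: 0.5 for b in bases} for _ in range(k)]
        for idx, x in enumerate(S):
            if idx == exclude:
                continue
            for j, b in enumerate(x[pos[idx]:pos[idx] + k]):
                counts[j][b] += 1
        return [{b: c[b] / sum(c.values()) for b in bases} for c in counts]

    for _ in range(n_iter):
        i = rng.randrange(len(S))          # remove one sequence X*
        pwm = pwm_from(i)                  # re-estimate M from the rest
        x = S[i]
        # weight every start position y in X* by P(word | M) / P(word | B)
        weights = []
        for y in range(len(x) - k + 1):
            score = 1.0
            for j, b in enumerate(x[y:y + k]):
                score *= pwm[j][b] / 0.25   # uniform background assumed
            weights.append(score)
        # sample y* proportionally to the weights (stochastic, unlike EM/Viterbi)
        r = rng.random() * sum(weights)
        acc = 0.0
        for y, w in enumerate(weights):
            acc += w
            if acc >= r:
                pos[i] = y
                break
    return [x[p:p + k] for x, p in zip(S, pos)]

S = ["TTGACGTTACCA", "GGACGCTACTTT", "CCAGGTGACGGG"]   # toy sequences (hypothetical)
print(gibbs_motif(S, 5))
```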
Gibbs Sampling
[Figure: the sampling distribution Zx*y over candidate positions of X* (probability on the y-axis, position on the x-axis); one position y* is then sampled from this distribution.]
Sampling
• Gibbs sampling: sample one position according to its probability
– updates the prediction of one training sequence at a time
• Viterbi: always take the highest-probability position
• EM: take a weighted average
– simultaneously updates the predictions of all sequences
Gibbs sampling motif finders
• Gibbs Sampler, based on C. Lawrence et al., Science, 1993
• AlignACE, Nat Biotech 1998, developed in the Church lab, Harvard University
• BioProspector, X. Liu et al., PSB 2001, an improvement on AlignACE
Better background model
• Repeat DNA can be mistaken for a motif
– especially low-complexity repeats: CACACA…, AAAAA…, etc.
• Solution: a more elaborate background model
– higher-order Markov models
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }
• Has been applied to EM and Gibbs sampling (up to 3rd order)
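A small sketch (not from the lecture) of estimating a higher-order Markov background from sequences and scoring a word under it; the order, pseudocount, and toy sequences are illustrative choices:

```python
from collections import defaultdict
from itertools import product
import math

def markov_background(seqs, order=2, pseudo=1.0):
    """Estimate P(X | previous `order` bases) from background sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for ctx in ("".join(c) for c in product("ACGT", repeat=order)):
        tot = sum(counts[ctx].values()) + 4 * pseudo
        model[ctx] = {b: (counts[ctx][b] + pseudo) / tot for b in "ACGT"}
    return model

def log_prob_background(word, model, order=2):
    """log2 P(word | B) under the Markov background (first `order` bases uniform)."""
    lp = order * math.log2(0.25)
    for i in range(order, len(word)):
        lp += math.log2(model[word[i - order:i]][word[i]])
    return lp

bg_seqs = ["ACGTACGTTGCACACACACATTTT", "GGGCGCGCTATATATAACCGGTTA"]  # toy data
B = markov_background(bg_seqs, order=2)
print(round(log_prob_background("CACACACA", B), 2))   # low-complexity repeat
print(round(log_prob_background("ACGTTACG", B), 2))
```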
Limits of Motif Finders
[Figure: how much of the upstream region of a gene (from 0 to ??? bp) should be searched?]
• Given upstream regions of co-regulated genes:
– increasing the length makes motif finding harder: random motifs clutter the true ones
– decreasing the length makes motif finding harder: the true motif may be missing in some sequences
Challenging problem
• The (k, d)-motif challenge problem: plant a k-mer, each instance carrying d mutations, into n = 20 sequences of length L = 1000
• Many algorithms fail on the (15, 4)-motif problem for n = 20 and L = 1000
• Combinatorial algorithms usually work better on challenge problems
– however, they are usually designed specifically to find (k, d)-motifs
– performance on real data varies
• Information content: 13.8 bits
– about that of a 7-mer; expected occurrence roughly 1 per 16 kb
Motif finding in practice
• Now we’ve found some good-looking motifs
– the easiest step?
• What to do next?
– Are they real?
– How do we find more instances in the rest of the genome?
– What is their functional meaning?
• Motifs => regulatory networks
To make sense of the motifs
• Each program usually reports a number of motifs (tens to hundreds)
– many motifs are variations of each other
– each program also reports some motifs the others miss
• Each program has its own way of scoring motifs
– the best-scored motifs are often not interesting, e.g. AAAAAAAA, ACACACAC, TATATATAT
Strategies to improve results
• Combining results from different algorithms is usually helpful
– motifs that appear multiple times are probably more interesting
• except simple repeats like AAAAA or ATATATATA
– cluster motifs into groups. Issues:
• how to measure the similarity between two motifs (PWMs) — see the sketch below
• how to choose the number of clusters
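One simple way to compare two PWMs, sketched below, is an average column-wise distance over the best ungapped offset; this particular distance is an illustrative choice, not the lecture's, and many alternatives exist:

```python
import math

def column_dist(c1, c2):
    """Euclidean distance between two PWM columns given as dicts over A, C, G, T."""
    return math.sqrt(sum((c1[b] - c2[b]) ** 2 for b in "ACGT"))

def pwm_distance(p1, p2):
    """Average column distance over the best ungapped alignment of two PWMs
    (the shorter PWM is slid along the longer one)."""
    if len(p1) > len(p2):
        p1, p2 = p2, p1
    best = float("inf")
    for off in range(len(p2) - len(p1) + 1):
        d = sum(column_dist(p1[j], p2[off + j]) for j in range(len(p1))) / len(p1)
        best = min(best, d)
    return best

# Two toy PWMs (hypothetical); each is a list of columns.
m1 = [{"A": .90, "C": .03, "G": .04, "T": .03},
      {"A": .10, "C": .10, "G": .70, "T": .10},
      {"A": .05, "C": .05, "G": .05, "T": .85}]
m2 = [{"A": .80, "C": .05, "G": .10, "T": .05},
      {"A": .15, "C": .10, "G": .65, "T": .10},
      {"A": .10, "C": .10, "G": .10, "T": .70}]
print(round(pwm_distance(m1, m2), 3))
```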
Strategies to improve results
• Compare with known motifs in databases
– TRANSFAC
– JASPAR
• Issues:
– computing similarities among motifs
– how similar is similar?
Strategies to improve results
• Statistical test of significance
– enrichment in target sequences vs background sequences
• Target set T: assumed to contain a common motif P
• Background set B: assumed not to contain P, or only at very low frequency
• Ideal case: every sequence in T has P, no sequence in B has P
Statistical test for significance
• Target set T: N sequences; the motif P appears in n of them
• Background set + target set (B+T): M sequences; P appears in m of them
• If n/N >> m/M:
– P is enriched (over-represented) in T
– is this statistically significant?
Hypergeometric distribution
• A box contains M balls, of which N are red and the rest are blue
• We randomly draw m balls from the box
• What is the probability of seeing exactly n red balls?
• Red ball: target sequence; blue ball: background sequence

Hypergeom(M, N, m, n) = (N choose n) × (M−N choose m−n) / (M choose m)

• Total number of ways to draw m balls: (M choose m)
• Number of draws with exactly n red balls: (N choose n) × (M−N choose m−n)
Cumulative hypergeometric test for motif significance
• We are interested in: if we randomly pick m balls, how likely is it that we see at least n red balls?

cHypergeom(M, N, m, n) = Σ_{i = n .. min(m, N)} (N choose i) × (M−N choose m−i) / (M choose m)
                       = 1 − Σ_{i = 0 .. n−1} (N choose i) × (M−N choose m−i) / (M choose m)

This can be interpreted as the p-value for the null hypothesis that we are picking at random.
Alternative hypothesis: our selection favors red balls.
Equivalently: the target set T is enriched with motif P, i.e. P is over-represented in T.
Examples
• The yeast genome has 6000 genes
• Select 50 genes believed to be co-regulated by a common TF
• Found a motif for these 50 genes
• It appears in 20 out of the 50 genes
• In the whole genome, 100 more genes have this motif
• M = 6000, N = 50, m = 100 + 20 = 120, n = 20
• Intuition:
– m/M = 120/6000: in the genome, 1 out of 50 genes has the motif
– with N = 50, we would expect only ~1 gene in the target set to have the motif
– observing 20 is a 20-fold enrichment
• P-value = 6 × 10^-22
• If instead n = 5 (5-fold enrichment), the p-value is 0.003
• Normally a very low p-value is required, e.g. 10^-10
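The same numbers can be checked with SciPy's hypergeometric survival function (a sketch; note that SciPy's parameter order differs from the slide's M, N, m, n notation):

```python
from scipy.stats import hypergeom

# Slide notation: M genes total, N in the target set, m genes with the motif
# genome-wide (background + target), n of the target genes carry the motif.
M, N, m, n = 6000, 50, 120, 20

# P(X >= n) when drawing m genes from M of which N are "red" (in the target set).
# scipy's hypergeom takes (population, successes, draws); sf(k) = P(X > k), so use n - 1.
p_value = hypergeom.sf(n - 1, M, N, m)
print(p_value)                          # the lecture quotes ~6e-22 for this example

# The weaker case from the slide: only 5 target genes carry the motif.
print(hypergeom.sf(5 - 1, M, N, m))     # ~0.003 per the lecture
```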
ROC curve for motif significance
• A motif is usually represented as a PWM
• Any word then receives a score
– typical scoring function: log [ P(W | M) / P(W | B) ]
– W: a word, M: a PWM, B: the background model
• To decide whether a sequence contains the motif, a score cutoff has to be chosen
– different cutoffs give different numbers of genes with the motif
– the hypergeometric test assumes a fixed cutoff
– it may be better to look at a range of cutoffs
ROC curve for motif significance
• Given a score cutoff: the motif P appears in n of the N target sequences (T) and in m of the M sequences in background + target (B+T)
• With a different score cutoff, m and n change
• Suppose we use P to classify sequences as T vs B:
– sensitivity: n / N
– specificity: (M − N − m + n) / (M − N)
– false positive rate = 1 − specificity = (m − n) / (M − N)
• As the cutoff decreases, sensitivity increases and FPR increases
ROC curve for motif significance
[Figure: ROC curves (sensitivity vs 1 − specificity) for Motif 1, Motif 2, and a random motif. The lowest cutoff lets every sequence pass (sensitivity = 1, specificity = 0); the highest cutoff lets nothing pass (sensitivity = 0, specificity = 1); a good cutoff lies on the upper-left bend of the curve.]
• ROC-AUC: area under the curve; 1 is the best, 0.5 is random
• In the figure, Motif 1 is more enriched than Motif 2
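A small sketch (not from the lecture) of how ROC points and the AUC could be computed from motif scores of target and background sequences; the score lists are hypothetical:

```python
def roc_points(target_scores, background_scores):
    """Sweep the score cutoff and return (FPR, sensitivity) points."""
    cutoffs = sorted(set(target_scores + background_scores), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutoffs:
        sens = sum(s >= c for s in target_scores) / len(target_scores)
        fpr = sum(s >= c for s in background_scores) / len(background_scores)
        points.append((fpr, sens))
    points.append((1.0, 1.0))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical log-odds motif scores for target (T) and background (B) sequences.
T = [5.2, 4.8, 4.1, 3.9, 2.5, 1.0]
B = [3.0, 2.2, 1.8, 1.5, 1.1, 0.9, 0.4, -0.2, -1.0, -1.5]
print(round(auc(roc_points(T, B)), 3))
```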
Other strategies
• Cross-validation
– randomly divide the sequences into 10 sets; hold 1 set out for testing
– do motif finding on the other 9 sets: does the motif also appear in the held-out set?
• Phylogenetic conservation information
– does the motif also appear in the homologous genes of another species?
– the strongest evidence
– however, this will not find species-specific motifs
Other strategies
• Finding motif modules
– do two motifs always appear in the same gene?
• Location preference
– some motifs appear at preferred locations
• e.g. within 50-150 bp upstream of the transcription start
– if a detected motif has a strong positional bias, that may be a sign of its function
• Evidence from other types of data sources
– do the genes carrying the motif have similar activities (gene expression levels) across different conditions?
– do they interact with the same set of proteins?
– do they have similar functions?
– etc.
To search for new instances
• Usually many false positives
• The score cutoff is critical
• Usually a score cutoff is estimated from the “true” binding sites:
motif finding → scoring function → a set of scores for the “true” sites → take mean − std as the cutoff
(or a cutoff such that the majority of the “true” sites can be predicted)
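As a concrete illustration, a cutoff derived from the scores of known (“true”) sites might be computed as in the sketch below; the score values are hypothetical:

```python
import math
import statistics

# Hypothetical log-odds scores of the known ("true") binding sites.
true_site_scores = [6.1, 5.8, 5.5, 5.2, 4.9, 4.7, 4.0, 3.6]

# Cutoff = mean - std, as suggested on the slide.
cutoff = statistics.mean(true_site_scores) - statistics.stdev(true_site_scores)
print(round(cutoff, 2))

# Alternative: a cutoff that still recovers at least 80% of the true sites.
k = math.ceil(0.8 * len(true_site_scores))            # number of sites to recover
cutoff80 = sorted(true_site_scores, reverse=True)[k - 1]
print(cutoff80)
```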
To search for new instances
• Use other information, such as the positional biases of motifs, to restrict the regions where a motif may appear
• Use gene expression data to help: the genes carrying the true motif should have similar activities
• Phylogenetic conservation is the key