Gene Regulation and Microarrays CS262 Lecture 9, Win07, Batzoglou Finding Regulatory Motifs CS262 Lecture 9, Win07, Batzoglou.

Download Report

Transcript Gene Regulation and Microarrays CS262 Lecture 9, Win07, Batzoglou Finding Regulatory Motifs CS262 Lecture 9, Win07, Batzoglou.

Gene Regulation and
Microarrays
CS262 Lecture 9, Win07, Batzoglou
Finding Regulatory Motifs
CS262 Lecture 9, Win07, Batzoglou
Regulatory Motif Discovery
DNA
Find motifs within groups of corregulated genes
Common subsequence
Group of co-regulated genes
CS262 Lecture 9, Win07, Batzoglou
slide credits: M. Kellis
Characteristics of Regulatory Motifs
• Tiny
• Highly Variable
• ~Constant Size
 Because a constant-size
transcription factor binds
• Often repeated
• Low-complexity-ish
CS262 Lecture 9, Win07, Batzoglou
Sequence Logos
Height of each letter
proportional to its frequency
Height of all letters
proportional to information content
at that position
CS262 Lecture 9, Win07, Batzoglou
Problem Definition
Given a collection of promoter sequences s1,…, sN of
genes with common expression
Probabilistic
Combinatorial
1iW
1j4
Mij = Prob[ letter j, pos i ]
Motif M: m1…mW
Find best M, and positions
p1,…, pN in sequences
Find M that occurs in all si
with  k differences
Motif: Mij;
CS262 Lecture 9, Win07, Batzoglou
Some of the mi’s blank
Discrete Approaches to Motif Finding
CS262 Lecture 9, Win07, Batzoglou
Discrete Formulations
Given sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find motif W* with “best” match to x1, …, xn
Definition of “best”:
d(W, xi) = min hamming dist. between W and any word in xi
d(W, S) = i d(W, xi)
CS262 Lecture 9, Win07, Batzoglou
Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T
Find d( W, S )
Report W* = argmin( d(W, S) )
(4K possibilities)
Running time: O( K N 4K )
(where N = i |xi|)
Advantage:
Finds provably “best” motif W
Disadvantage: Time
CS262 Lecture 9, Win07, Batzoglou
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some xi
Find d( W, S )
Report W* = argmin( d( W, S ) )
or, Report a local improvement of W*
Running time: O( K N2 )
Advantage:
Time
Disadvantage: If the true motif is weak and does not occur in data
then a random motif may score better than any
instance of true motif
CS262 Lecture 9, Win07, Batzoglou
MULTIPROFILER
• Extended sample-driven approach
Given a K-long word W, define:
Nα(W) = words W’ in S s.t. d(W,W’)  α
Idea:
Assume W is occurrence of true motif W*
Will use Nα(W) to correct “errors” in W
CS262 Lecture 9, Win07, Batzoglou
MULTIPROFILER
Assume W differs from true motif W* in at most L positions
Define:
A wordlet G of W is a L-long pattern with blanks, differing from W
 L is smaller than the word length K
Example:
K = 7; L = 3
W
G
=
=
ACGTTGA
--A--CG
CS262 Lecture 9, Win07, Batzoglou
MULTIPROFILER
Algorithm:
For each W in S:
For L = 1 to Lmax
1. Find the α-neighbors of W in S
2. Find all “strong” L-long wordlets G in Na(W)
3. For each wordlet G,
1. Modify W by the wordlet G
2. Compute d(W’, S)
Report W* = argmin d(W’, S)
Step 2 above: Smaller motif-finding problem;
Use exhaustive search
CS262 Lecture 9, Win07, Batzoglou
 Nα(W)
 W’
Expectation Maximization in Motif
Finding
CS262 Lecture 9, Win07, Batzoglou
Expectation Maximization
motif
All K-long background
words
Algorithm (sketch):
1.
2.
3.
Given genomic sequences find all k-long words
Assume each word is motif or background
Find likeliest
Motif Model
Background Model
classification of words into either Motif or Background
CS262 Lecture 9, Win07, Batzoglou
Expectation Maximization
Given sequences x1, …, xN,
•
•
Find all k-long words X1,…, Xn
motif
Define motif model:
M = (M1,…, MK)
Mi = (Mi1,…, Mi4)
(assume {A, C, G, T})
where Mij = Prob[ letter j occurs in
motif position i ]
•
background
Define background model:
B = B1, …, B4
Bi = Prob[ letter j in background
sequence ]
CS262 Lecture 9, Win07, Batzoglou
M1
A
C
G
T
MK
B
Expectation Maximization
•
•
Define
Zi1 = { 1, if Xi is motif;
0, otherwise }
motif
Zi2 = { 0, if Xi is motif;
1, otherwise }

Given a word Xi = x[s]…x[s+k],
P[ Xi, Zi1=1 ] =  M1x[s]…Mkx[s+k]
P[ Xi, Zi2=1 ] = (1 – ) Bx[s]…Bx[s+k]
Let 1 = ; 2 = (1 – )
CS262 Lecture 9, Win07, Batzoglou
M1
A
C
G
T
background
1–
MK
B
Expectation Maximization
Define:
Parameter space  = (M, B)
1: Motif;
2: Background


M1
1–
MK
A
C
G
T
Objective:
Maximize log likelihood of model:
n
2
log P( X 1...X n , Z |  ,  )   Z ij log( j P( X i |  j ))
i 1 j 1
n

i 1
CS262 Lecture 9, Win07, Batzoglou
2
Z
j 1
n
ij
log P( X i |  j )  
i 1
2
Z
j 1
ij
log  j
B
Expectation Maximization
• Maximize expected likelihood, in iteration of two steps:
Expectation:
Find expected value of log likelihood:
E[logP( X1...X n , Z |  ,  )]
Maximization:
Maximize expected value over , 
CS262 Lecture 9, Win07, Batzoglou
Expectation Maximization: E-step
Expectation:
Find expected value of log likelihood:
E[log P( X 1...X n , Z |  ,  )] 
n
2
  E[Z
i 1
j 1
n
ij
] log P( X i |  j )  
i 1
2
 E[Z
j 1
ij
] log  j
where expected values of Z can be computed as follows:
 j P( X i |  j )
E[Z ij]  Pr ob[Zij  1] 
 Z *ij
P( X i | 1 )  (1   ) P( X i |  2 )
CS262 Lecture 9, Win07, Batzoglou
Expectation Maximization: M-step
Maximization:
Maximize expected value over  and 
independently
For , this has the following solution:
(we won’t prove it)

n
NEW
n
Z *i1
 arg m a x ( Z *i1 log   Z *i 2 log(1   ))  

n
i 1
i 1
Effectively, NEW is the expected # of motifs per
position, given our current parameters
CS262 Lecture 9, Win07, Batzoglou
Expectation Maximization: M-step
•
For  = (M, B), define
cjk = E[ # times letter k appears in motif position j]
c0k = E[ # times letter k appears in background]
• cij values are calculated easily from Z* values
It then follows:
M
NEW
jk

c jk

4
k 1
NEW
k
B
c jk
to not allow any 0’s, add pseudocounts
CS262 Lecture 9, Win07, Batzoglou

c0 k

4
c
k 1 0 k
Initial Parameters Matter!
Consider the following artificial example:
6-mers X1, …, Xn:
(n = 2000)
 990 words “AAAAAA”
 990 words “CCCCCC”
 20 words “ACACAC”
Some local maxima:
 = 49.5%; B = 100/101 C, 1/101 A
M = 100% AAAAAA
 = 1%;
B = 50% C, 50% A
M = 100% ACACAC
CS262 Lecture 9, Win07, Batzoglou
Overview of EM Algorithm
1.
Initialize parameters  = (M, B), :
 Try different values of  from N-1/2 up to 1/(2K)
2.
Repeat:
a. Expectation
b. Maximization
3.
Until change in  = (M, B),  falls below 
4.
Report results for several “good” 
CS262 Lecture 9, Win07, Batzoglou
Gibbs Sampling in Motif Finding
CS262 Lecture 9, Win07, Batzoglou
Gibbs Sampling
• Given:
 x1, …, xN,
 motif length K,
 background B,
• Find:
 Model M
 Locations a1,…, aN in x1, …, xN
Maximizing log-odds likelihood ratio:
N
K
 log
i 1 k 1
CS262 Lecture 9, Win07, Batzoglou
M (k , xai i  k )
B ( xai i  k )
Gibbs Sampling
•
•
AlignACE: first statistical motif finder
BioProspector: improved version of AlignACE
Algorithm (sketch):
1. Initialization:
a.
b.
2.
Select random locations in sequences x1, …, xN
Compute an initial model M from these locations
Sampling Iterations:
a.
b.
c.
Remove one sequence xi
Recalculate model
Pick a new location of motif in xi according to probability the location is
a motif occurrence
CS262 Lecture 9, Win07, Batzoglou
Gibbs Sampling
Initialization:
•
Select random locations 1,…, N in x1, …, xN
•
For these locations, compute M:
N
1
M kj 
( j  ( xai  k  j ))
NB
i 1
where j are pseudocounts to avoid 0s,
and B = j j
•
That is, Mkj is the number of occurrences of letter j in motif
position k, over the total
CS262 Lecture 9, Win07, Batzoglou
Gibbs Sampling
Predictive Update:
• Select a sequence x = xi
• Remove xi, recompute model:
N
1
M kj 
(  j   ( xas  k  j ))
( N  1)  B
s 1, s i
where j are pseudocounts to avoid 0s,
and B = j j
CS262 Lecture 9, Win07, Batzoglou
M
Gibbs Sampling
Sampling:
For every K-long word xj,…,xj+k-1 in x:
Qj = Prob[ word | motif ] = M(1,xj)…M(k,xj+k-1)
Pi = Prob[ word | background ] B(xj)…B(xj+k-1)
Let
Q j / Pj
A j  | x| k 1
 Q j / Pj
Prob
j 1
Sample a random new position ai
according to the probabilities A1,…, A|x|-k+1.
CS262 Lecture 9, Win07, Batzoglou
0
|x|
Gibbs Sampling
Running Gibbs Sampling:
1.
Initialize
2.
Run until convergence
3.
Repeat 1,2 several times, report common motifs
CS262 Lecture 9, Win07, Batzoglou
Advantages / Disadvantages
• Very similar to EM
Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics
Disadvantages:
• More dependent on all sequences to exhibit the motif
• Less systematic search of initial parameter space
CS262 Lecture 9, Win07, Batzoglou
Repeats, and a Better Background
Model
• Repeat DNA can be confused as motif
 Especially low-complexity CACACA… AAAAA, etc.
Solution:
more elaborate background model
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi{A,C,G,T} }
Has been applied to EM and Gibbs (up to 3rd order)
CS262 Lecture 9, Win07, Batzoglou
Limits of Motif Finders
0
???
gene
• Given upstream regions of coregulated genes:
 Increasing length makes motif finding harder – random motifs clutter the
true ones
 Decreasing length makes motif finding harder – true motif missing in
some sequences
Motif Challenge problem:
Find a (15,4) motif in N sequences of length
CS262 Lecture 9, Win07, Batzoglou
Example Application: Motifs in Yeast
Group:
Tavazoie et al. 1999, G. Church’s lab, Harvard
Data:
•
•
Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)
15 time points across two cell cycles
1. Clustering genes according to common expression
•
•
K-means clustering -> 30 clusters, 50-190 genes/cluster
Clusters correlate well with known function
2. AlignACE motif finding
•
600-long upstream regions
CS262 Lecture 9, Win07, Batzoglou
Motifs in Periodic Clusters
CS262 Lecture 9, Win07, Batzoglou
Motifs in Non-periodic Clusters
CS262 Lecture 9, Win07, Batzoglou
Motifs are preferentially conserved across evolution
Gal4
Gal10
GAL10
Scer
Spar
Smik
Sbay
Gal1
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA
CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA
GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA
TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA
* * **** * * *
** ** * *
**
** ** * *
*
**
**
* * * ** * * *
TBP
Scer
Spar
Smik
Sbay
TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG
TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG
TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG
TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG
**
** *** **** ******* **
* *
*
* *
* *
** **
* *** *
***
* * *
Scer
Spar
Smik
Sbay
CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT
CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT
TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT
TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT
** **
** ***** ******* ****** ***** *** ****
* *** ***** * * ****** ***
* ***
GAL4
GAL4
GAL4
GAL4
Scer
Spar
Smik
Sbay
TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA
TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC
ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC
GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT
** *
** *** *
*
***** ** * *
****** **
*
* **
* *
** ***
MIG1
Scer
Spar
Smik
Sbay
GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T
AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G
CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G
GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG
****
*
*
*****
***
* * *
* * *
*
*
**
Scer
Spar
Smik
Sbay
TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT
GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT
TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT
TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA
* *
*
***
* **
* *
*** ***
* * ** ** * ********
****
*
Scer
Spar
Smik
Sbay
TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT
TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT
TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA
TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT
*
*
*
*
* * ** ***
* *
*
* ** ** ** * * * *
* ***
*
Scer
Spar
Smik
Sbay
TTAA-CGTCAAGGA---GAAAAAACTATA
TTAT-CGTCAAGGAAA-GAACAAACTATA
TCGTTCATCAAGAA----AAAAAACTA..
TTATCCCAAAAAAACAACAACAACATATA
*
*
** *
** ** **
MIG1
slide
credits:
M.
Kellis
CS262 Lecture
9, Win07,
Batzoglou
TBP
Is this enough
to discover motifs?
GAL1
Factor footprint
No.
Conservation island
Comparison-based Regulatory Motif Discovery
Study known motifs
Derive conservation rules
Discover novel motifs
CS262 Lecture 9, Win07, Batzoglou
slide credits: M. Kellis
Known motifs are frequently conserved
Err
Err
Err
Human
Dog
Mouse
Rat
• Across the human promoter regions, the Err motif:
 appears 434 times
 is conserved 162 times
Conservation rate: 37%
• Compare to random control motifs
– Conservation rate of control motifs: 6.8%
– Err enrichment: 5.4-fold
– Err p-value < 10-50 (25 standard deviations under binomial)
Motif Conservation Score (MCS)
CS262 Lecture 9, Win07, Batzoglou
slide credits: M. Kellis
Finding conserved motifs in whole genomes
M. Kellis PhD Thesis on yeasts, X. Xie & M. Kellis on mammals
1.
Define seed “mini-motifs”
T C A
N
A C G
2.
Filter and isolate mini-motifs that are more conserved than average
3.
Extend mini-motifs to full motifs
4.
Validate against known databases of motifs & annotations
5.
Report novel motifs
CS262 Lecture 9, Win07, Batzoglou
slide credits: M. Kellis
Test 1: Intergenic conservation
Conserved count
CGG-11-CCG
Total count
CS262 Lecture 9, Win07, Batzoglou
slide credits: M. Kellis
Test 2: Intergenic vs. Coding
Intergenic Conservation
CGG-11-CCG
Higher Conservation in Genes
Coding Conservation
CS262 Lecture 9, Win07, Batzoglou
slide credits: M. Kellis
Test 3: Upstream vs. Downstream
Upstream Conservation
CGG-11-CCG
Downstream motifs?
Most
Patterns
Downstream Conservation
CS262 Lecture 9, Win07, Batzoglou
slide credits: M. Kellis
Constructing full motifs
Test 1
Test 2
Test 3
2,000
Mini-motifs
Extend
R T C A Y
R
T C G
Collapse
G T C A
A T C R
R T C G
C
C
Y
C
Extend
65
Extend
Collapse
Collapse
Merge
Full Motifs
CS262 Lecture 9, Win07, Batzoglou
Extend
A C G R
ACollapse
C G A
A C G A
A C G A
A C G A
72
Full motifs
slide credits: M. Kellis
Summary for promoter motifs
Rank
Discovered Motif
Known
TF motif
Tissue
Enrichment
Distance
bias
NRF-1
Yes
Yes
1
RCGCAnGCGY
2
CACGTG
MYC
Yes
Yes
3
SCGGAAGY
ELK-1
Yes
Yes
4
ACTAYRnnnCCCR
New
Yes
Yes
5
GATTGGY
NF-Y
Yes
Yes
6
GGGCGGR
SP1
Yes
Yes
7
TGAnTCA
AP-1
Yes
8
TMTCGCGAnR
New
Yes
Yes
9
TGAYRTCA
ATF3
Yes
Yes
10
GCCATnTTG
YY1
Yes
11
MGGAAGTG
GABP
Yes
12
CAGGTG
E12
Yes
13
CTTTGT
LEF1
Yes
14
TGACGTCA
ATF3
Yes
15
CAGCTG
AP-4
Yes
16
RYTTCCTG
C-ETS-2
Yes
17
AACTTT
IRF1(*)
Yes
18
TCAnnTGAY
SREBP-1
Yes
Yes
19
GKCGCn(7)TGAYG
New
Yes
Yes
20
GTGACGY
E4F1
Yes
Yes
21
GGAAnCGGAAnY
Yes
Yes
22
TGCGCAnK
New
New
Yes
Yes
23
TAATTA
CHX10
Yes
24
GGGAGGRR
MAZ
Yes
25
TGACCTY
ERRA
Yes
CS262 Lecture 9, Win07, Batzoglou
Yes
Yes
Yes
• 174 promoter motifs
 70 match known TF motifs
 115 expression enrichment
 60 show positional bias
 75% have evidence
• Control sequences
< 2% match known TF motifs
< 5% expression enrichment
< 3% show positional bias
 < 7% false positives
Most discovered motifs
are likely to be functional
slide credits: M. Kellis