Motif Finding

Yueyi Irene Liu
CS374 Lecture
Oct. 17, 2002
Outline
• Background biology
• Motif-finding methods
– Word enumeration
– Gibbs sampling
– Random projection
– Phylogenetic footprinting
– Reducer
Regulation of Gene Expression
• Chromatin structure
• Transcription initiation
• Transcript processing and modification
• RNA transport
• Transcript stability
• Translation initiation
• Post-translational modification
• Protein transport
• Control of protein stability
Typical Structure of a Eukaryotic mRNA Gene
Control of Transcription Initiation
Motif
• A conserved pattern that is found in two or
more sequences
• Can be found in
– DNA (e.g., transcription factor binding sites)
– Protein
– RNA
Models for Representing Motifs
• Regular expression
– Consensus
TGACGCA
TGACGCA
AGACGCA
TGACACA
AGACGCA
• TGACGCA
– Degenerate
• WGACRCA
• Position Specific Matrix
    1    2    3    4    5    6    7
A   0.4  0    1    0    0.2  0    1
T   0.6  0    0    0    0    0    0
G   0    1    0    0    0.8  0    0
C   0    0    0    1    0    1    0
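The three representations above can be derived from the same five instances. The following is a minimal sketch, not the lecturer's code: the function name `motif_models` and the partial IUPAC table are assumptions made for illustration, and any column with three observed bases simply falls back to `N`.

```python
from collections import Counter

# Partial IUPAC table (assumption: only one-, two-, and four-base codes are listed).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("ACGT"): "N",
}

def motif_models(instances):
    """Return (consensus, degenerate, matrix) for equal-length motif instances."""
    length = len(instances[0])
    consensus, degenerate = [], []
    matrix = {base: [] for base in "ATGC"}
    for i in range(length):
        counts = Counter(seq[i] for seq in instances)
        # Consensus: the most frequent base in the column.
        consensus.append(counts.most_common(1)[0][0])
        # Degenerate: one code for the set of bases seen in the column.
        degenerate.append(IUPAC.get(frozenset(counts), "N"))
        # Position specific matrix: per-column base frequencies.
        for base in "ATGC":
            matrix[base].append(counts[base] / len(instances))
    return "".join(consensus), "".join(degenerate), matrix

sites = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
consensus, degenerate, matrix = motif_models(sites)
print(consensus, degenerate, matrix["A"])
```

For the five sites on the slide this reproduces TGACGCA, WGACRCA, and the A row 0.4, 0, 1, 0, 0.2, 0, 1.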
Where to look for motifs?
• Gene families: a set of genes controlled by a
common transcription factor or common
environmental stimulus
• How do you construct gene families?
– Microarray experiments
Microarrays
[Figure: mRNA is isolated from the cells of interest and from a reference sample; known DNA sequences are on a glass slide.]
Resulting data (genes × experiments):
3.25 3.01 1.30 0.70
6.73 2.89 0.92 0.67
1.14 1.15 0.60 0.23
2.12 6.12 0.07 0.02
Motif-finding Methods
• Goal: Look for motifs (5-15 bp) in the data set
• Methods:
– Word enumeration method
– Gibbs sampling
– Random projection
– Phylogenetic footprinting
– Reducer
Word Enumeration
• For every word w, calculate:
– Expected frequency based on the entire upstream region of the yeast genome
• E.g., P(ATTGA) = (0.4)^4 (0.1)^1, given P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
• Expected number of occurrences of ATTGA: n * P(ATTGA)
– Observed frequency in the data set
– Statistical significance of enrichment
Z = (O - E) / sqrt[n p (1 - p)] ~ N(0, 1)
– Disadvantage: only considers exact words
• E.g., the degenerate motif YCTGCA is counted as two separate words, TCTGCA and CCTGCA
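The enrichment score can be sketched in a few lines. This is an illustration under stated assumptions: n is taken to be the number of positions where the word could start, occurrences may overlap, and the function names are hypothetical.

```python
import math

def expected_word_probability(word, base_freq):
    """P(word) under independent bases, e.g. P(ATTGA) = 0.4^4 * 0.1."""
    p = 1.0
    for base in word:
        p *= base_freq[base]
    return p

def count_occurrences(word, sequences):
    """Observed count O of exact (possibly overlapping) occurrences."""
    total = 0
    for seq in sequences:
        for i in range(len(seq) - len(word) + 1):
            if seq[i:i + len(word)] == word:
                total += 1
    return total

def enrichment_z(word, sequences, base_freq):
    """Z = (O - E) / sqrt(n p (1 - p)), with E = n * p."""
    # Assumption: n counts every possible start position in the data set.
    n = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    p = expected_word_probability(word, base_freq)
    observed = count_occurrences(word, sequences)
    expected = n * p
    return (observed - expected) / math.sqrt(n * p * (1 - p))

freq = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
print(expected_word_probability("ATTGA", freq))
print(enrichment_z("ATTGA", ["CATTGATTGA"], freq))
```

With the base frequencies from the slide, P(ATTGA) is 0.4^4 × 0.1 = 0.00256.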
Gibbs Sampling
• Matrix to capture a motif
• Goal: find the best start positions a1, …, ak to maximize the difference between the motif and the background base distribution.
[Figure: sequences with motif start positions a1, a2, a3, a4, …, ak]
Liu, X
Gibbs Sampling
(Lawrence, et al, 1993)
• Step 1: Pick random start positions, compute the current motif matrix
• Step 2: Iterative update
– Take one sequence out, update the motif matrix
– Calculate the fitness score of each position in the left-out sequence
– Pick a start position in the left-out sequence based on weight Ax
– Take out another sequence, …, until convergence
• Step 3: Reset starting position
Liu, X
Gibbs Sampling Initialization
Pick random start position, compute motif matrix
[Figure: sequences with start positions a1, a2, a3, a4, …, ak and a1', a2', a3', a4', …, ak']
Liu, X
Gibbs Sampling Iteration Steps
1) Take out one sequence, calculate the fitness score of
every subsequence relative to the current motif
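[Figure: the start position a1' of the removed sequence is marked "?"; a2', a3', a4', …, ak' are the start positions in the remaining sequences]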
Liu, X
Fitness Score
• Ax = Qx / Px
– Qx: probability of generating subsequence x from current motif
– Px: probability of generating subsequence x from background
Current Motif:
    1    2    3
A   0.1  0.3  0.7
T   0.1  0.2  0.1
G   0.7  0.4  0.1
C   0.1  0.1  0.1
Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
X = GGA: Q? P?
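The question on the slide can be worked through directly: Q multiplies the motif-matrix entries along x, P multiplies the background frequencies. A small sketch follows; the names `MOTIF`, `BACKGROUND`, and `fitness` are chosen here for illustration only.

```python
# Current motif matrix from the slide: MOTIF[base][position].
MOTIF = {
    "A": [0.1, 0.3, 0.7],
    "T": [0.1, 0.2, 0.1],
    "G": [0.7, 0.4, 0.1],
    "C": [0.1, 0.1, 0.1],
}
BACKGROUND = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

def fitness(x, motif=MOTIF, background=BACKGROUND):
    """Return (Qx, Px, Ax) for a subsequence x as long as the motif."""
    q = 1.0  # probability of x under the current motif
    p = 1.0  # probability of x under the background
    for i, base in enumerate(x):
        q *= motif[base][i]
        p *= background[base]
    return q, p, q / p

print(fitness("GGA"))
```

For X = GGA this gives Q = 0.7 × 0.4 × 0.7 = 0.196 and P = 0.1 × 0.1 × 0.4 = 0.004, so Ax = 49 (up to floating-point rounding).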
Gibbs Sampling Iteration Steps
2) Pick new start position sampling from fitness score
[Figure: "Sample from Fitness Score". Fitness (y-axis, 0 to 5) plotted against the starting position of the motif in the sequence (x-axis, 0 to 12, …). Start positions shown: a1'', a2', a3', a4', …, ak']
Liu, X
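Steps 1 and 2 can be put together as a compact sampler. This is a sketch under assumptions that are not on the slides: a pseudocount of 1 keeps every matrix entry positive, a fixed number of sweeps stands in for a convergence test, and Step 3 (resetting and restarting) is left out. All function names are hypothetical.

```python
import random

BASES = "ACGT"

def build_matrix(sequences, starts, width, skip, pseudocount=1.0):
    """Motif matrix from every sequence except the one at index `skip`."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for idx, (seq, start) in enumerate(zip(sequences, starts)):
        if idx == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def gibbs_sample(sequences, width, background, iterations=200, seed=0):
    rng = random.Random(seed)
    # Step 1: pick random start positions.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):  # assumption: fixed sweeps, no convergence check
        for out in range(len(sequences)):
            # Step 2: take one sequence out and update the motif matrix.
            matrix = build_matrix(sequences, starts, width, skip=out)
            seq = sequences[out]
            # Fitness Ax = Qx / Px for every start position of the left-out sequence.
            weights = []
            for start in range(len(seq) - width + 1):
                a = 1.0
                for j in range(width):
                    base = seq[start + j]
                    a *= matrix[j][base] / background[base]
                weights.append(a)
            # Sample the new start position in proportion to the fitness.
            starts[out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["ACGTTGACGCAT", "TTGACGCAGGCA", "CATGACGCATTT"]
uniform = {b: 0.25 for b in BASES}
print(gibbs_sample(seqs, 7, uniform, iterations=50, seed=1))
```

Because the new position is sampled rather than chosen greedily, different seeds can end at different start positions; in practice the sampler is usually run several times.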
Recent Development
• Random Projection
• Phylogenetic Footprinting
• Reducer
Random Projection
(Buhler, 2002)
• (l, d)-motif problem:
– M is an (unknown) motif of length l
– Each occurrence of M is corrupted by exactly d point substitutions in random positions
• No known biological motifs are of (l, d)-motif form
[Figure: example sequences CCcaAG, CCcgAG, CtATgG, CCgcAG, tCtTAG, CCtaAG, CaAcAG, CCtgAG, CCAgAa, CCctAc]
Random Projection Algorithm
• Guiding principle: Some instances of a motif
agree on a subset of positions.
• Use information from multiple motif instances
to construct model.
x(1) ...ccATCCGACca...
x(2) ...ttATGAGGCtc...
x(5) ...ctATAAGTCgc...
x(8) ...tcATGTGACac...
M = ATGCGTC, a (7,2) motif
Buhler, J
k-Projections
• Choose k positions in string of length l.
• Concatenate nucleotides at chosen k
positions to form k-tuple.
• In l-dimensional Hamming space,
projection onto k dimensional subspace.
Example: l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC → TGCTGAT
Buhler, J
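A k-projection is just a selection of characters at fixed positions. The sketch below uses 1-based positions as on the slide; `project` and `random_projection` are illustrative names, not taken from Buhler's implementation.

```python
import random

def project(ltuple, positions):
    """Concatenate the letters of an l-tuple at the chosen 1-based positions."""
    return "".join(ltuple[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random (sorted, 1-based)."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))  # TGCTGAT
print(random_projection(15, 7))
```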
Random Projection Algorithm
• Choose a projection by
selecting k positions
uniformly at random.
• For each l-tuple in input
sequences, hash into
bucket based on letters
at k selected positions.
• Recover motif from
bucket containing
multiple l-tuples.
Input sequence x(i): …TCAATGCACCTAT...
[Figure: the l-tuple TGCACCT is hashed into bucket TGCT]
Buhler, J
Example
• l = 7 (motif size) , k = 4 (projection size)
• Choose projection (1,2,5,7)
Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
Buckets:
ATCCGAC → ATGC
GCCTTAC → GCTC
Buhler, J
Hashing and Buckets
• Hash function h(x) obtained from k
positions of projection.
• Buckets are labeled by values of h(x).
• Enriched buckets: contain more than s l-tuples, for some parameter s.
[Figure: buckets labeled ATGC, GCTC, CATC, ATTC]
Buhler, J
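Hashing every l-tuple into its bucket and keeping the enriched buckets might look like the sketch below. The function names are hypothetical; the input reuses the four instances of the (7,2) motif shown earlier, with their lowercase flanking letters kept as on the slide, and the projection (1,2,5,7) from the example.

```python
from collections import defaultdict

def hash_ltuples(sequences, l, positions):
    """Map each bucket label h(x) to the list of l-tuples that hash to it."""
    buckets = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            ltuple = seq[i:i + l]
            # h(x): letters at the k projected positions (1-based).
            label = "".join(ltuple[p - 1] for p in positions)
            buckets[label].append(ltuple)
    return buckets

def enriched_buckets(buckets, s):
    """Keep buckets containing more than s l-tuples."""
    return {label: tuples for label, tuples in buckets.items() if len(tuples) > s}

seqs = ["ccATCCGACca", "ttATGAGGCtc", "ctATAAGTCgc", "tcATGTGACac"]
buckets = hash_ltuples(seqs, l=7, positions=(1, 2, 5, 7))
print(enriched_buckets(buckets, s=3))
```

All four motif instances agree at positions 1, 2, 5, and 7, so they land together in bucket ATGC even though each differs from M at two positions.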
Motif Refinement
• How do we recover the motif from the
sequences in the enriched buckets?
• k nucleotides are known from hash value of
bucket.
• Use information in other l-k positions as
starting point for local refinement scheme,
e.g. EM or Gibbs sampler
[Figure: the l-tuples in bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) are passed to the local refinement algorithm, which outputs the candidate motif ATGCGTC]
Buhler, J
Parameter Selection
• Projection size k
– Choose k small so that several motif instances hash to the same bucket (k < l - d)
– Choose k large to avoid contamination by spurious l-mers (4^k > t(n - l + 1))
• Bucket threshold s (s = 3, s = 4)
Buhler, J
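The two constraints on k can be checked numerically. In the sketch below, t is assumed to be the number of input sequences and n the length of each sequence (the slide does not define them), and the example values t = 20, n = 600 are illustrative assumptions, not taken from the lecture.

```python
def feasible_projection_sizes(l, d, t, n):
    """All k with k < l - d and 4^k > t * (n - l + 1)."""
    return [k for k in range(1, l - d) if 4 ** k > t * (n - l + 1)]

# Assumed example: a (15, 4)-motif in 20 sequences of length 600.
print(feasible_projection_sizes(l=15, d=4, t=20, n=600))
```

Under these assumed values the feasible sizes are 7 through 10, since 4^7 = 16384 is the first power of 4 above 20 × 586 = 11720.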
Recent Development
• Random Projection
• Phylogenetic Footprinting
• Reducer
Conservation of Regulatory Elements Upstream of the ApoAI Gene
[Figure: upstream sequences from mouse, rabbit, human, and chicken, with hepatic site C, the CCAAT box, and the TATA box marked. Sequence labels: AAGCA, AAGCA, AAGCA, ACGCA, AAGCA]
Substring Parsimony Problem
Given:
• orthologous upstream sequences S1, …, Sn
• phylogenetic tree T of the n species
• size k of the motif, threshold d
Problem:
Find all sets of substrings s1, …, sn of S1, …, Sn, each of size k, such that the parsimony score of s1, …, sn on T is at most d
Blanchette, M
Parsimony Score
[Figure: tree T with nodes labeled s1 to s6]
Parsimony score: the minimum, over all possible labelings of internal nodes, of
$$ \sum_{(u,v) \in E(T)} d(l(u), l(v)) $$
• l(v) – label of node v
• d(l1, l2) – Hamming distance
Blanchette, M
String Parsimony Problem
S1: AAAGCATTC
S2: TACGCACCC
S3: GAAGCAGGG
k = 5, d = 1
[Figure: tree with internal nodes labeled AAGCA and AAGCA, and leaves S1: AAGCA, S2: ACGCA, S3: AAGCA]
Algorithm: version I
• Root the tree at arbitrary internal node r
• Compute table Wu of size 4^k for each node u, where Wu[s] is the best parsimony score for the subtree rooted at u when u is labeled with s:
$$
W_u[s] = \begin{cases}
+\infty & \text{if } u \text{ is a leaf and } s \text{ is not a substring of } S_u \\
0 & \text{if } u \text{ is a leaf and } s \text{ is a substring of } S_u \\
\sum_{v \in Child(u)} \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right) & \text{if } u \text{ is not a leaf}
\end{cases}
$$
• Direct implementation of this recursion gives O(n∙k∙(4^{2k} + l)), where l is the average sequence length
Blanchette, M
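The recursion can be implemented directly for small k. The sketch below is not Blanchette's code: the tree encoding (a leaf is its sequence string, an internal node is a list of children) and the function names are assumptions, leaf tables store only finite entries so that missing strings stand for +∞, and only the root labels with score at most d are reported, without tracing back the leaf substrings.

```python
from itertools import product

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def substrings(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def parsimony_table(node, k, all_kmers):
    """W_u as a dict from label s to score; absent leaf entries mean +infinity."""
    if isinstance(node, str):
        # Leaf: score 0 for every substring of S_u.
        return {s: 0 for s in substrings(node, k)}
    child_tables = [parsimony_table(child, k, all_kmers) for child in node]
    table = {}
    for s in all_kmers:
        # Internal node: sum over children of min_t (W_v[t] + d(s, t)).
        table[s] = sum(
            min(w + hamming(s, t) for t, w in child.items())
            for child in child_tables
        )
    return table

def root_labels_within(tree, k, d):
    """Root labels whose best parsimony score is at most d."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    table = parsimony_table(tree, k, all_kmers)
    return {s: score for s, score in table.items() if score <= d}

# The three-sequence example, with a single internal node as root.
tree = ["AAAGCATTC", "TACGCACCC", "GAAGCAGGG"]
print(root_labels_within(tree, k=5, d=1))
```

On the three-sequence example with k = 5 and d = 1, the only root label within the threshold is AAGCA, with score 1 (one substitution on the branch to S2).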
Algorithm: version II
• Define X(u, v)[s] – best parsimony score for the subtree consisting of edge (u, v) and the subtree rooted at v, when u is labeled s
[Figure: node u, labeled s, with children v and w]
$$ X_{(u,v)}[s] = \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right) $$
$$ W_u[s] = \sum_{v \in Child(u)} X_{(u,v)}[s] $$
Blanchette, M
Algorithm: version II (continued)
• Update X(u, v) in phases: in phase p, maintain the set Bp of sequences t such that X(u, v)[t] = p
• Define:
– Ra = {s: Wv[s] = a}
– N(s) = {t in ∑^k: d(s, t) = 1}
• Start in phase m, the smallest value in Wv, and let Bm = Rm
• Update
$$ B_{p+1} = \Big( R_{p+1} \cup \bigcup_{s \in B_p} N(s) \Big) \setminus \bigcup_{j \le p} B_j $$
• Computation of X(u, v) takes O(k∙4^k)
Blanchette, M
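The phase-by-phase update is essentially a breadth-first search over strings, seeded with the scores in Wv. A minimal sketch follows, assuming Wv is given as a non-empty dict of finite integer scores; `edge_table` and `neighbors` are hypothetical names.

```python
from itertools import product

def neighbors(s, alphabet="ACGT"):
    """N(s): all strings at Hamming distance exactly 1 from s."""
    for i, old in enumerate(s):
        for new in alphabet:
            if new != old:
                yield s[:i] + new + s[i + 1:]

def edge_table(w_v, k, alphabet="ACGT"):
    """X[s] = min over t of (w_v[t] + d(s, t)), filled in phases."""
    by_score = {}
    for t, score in w_v.items():
        by_score.setdefault(score, set()).add(t)  # R_a
    x = {}
    p = min(by_score)        # starting phase m
    previous = set()         # B_{p-1}
    total = len(alphabet) ** k
    while len(x) < total:
        # B_p = (R_p union neighbors of B_{p-1}) minus everything already assigned.
        current = set(by_score.get(p, ()))
        for s in previous:
            current.update(neighbors(s, alphabet))
        current = {s for s in current if s not in x}
        for s in current:
            x[s] = p
        previous = current
        p += 1
    return x

w_v = {"AAA": 0, "ACG": 2, "TTT": 1}
x = edge_table(w_v, 3)
print(x["AAT"], x["TTG"], x["CCG"])
```

Each string is expanded once and has 3k neighbors, which is where the O(k∙4^k) bound comes from.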
Improvements
• Reduce the size of Bp when sequences contribute to X(u, v)
greater than threshold d
In phase p, only care for sequence X(u, v) [s] if
 X (u ,v ) [ s] if X (u ,v ) [ s] has been computed
d  p  max 
wChild ( u ) p  1 otherwise

w v
Leads to significant reductions in stages d/2 … d
• Reduce the number of substrings inserted in W at the
leaves
For substring s of Si, if its best match against any Sj has Hamming distance at least d, s can be discarded
Blanchette, M
Results
• Practical limit on k = 10
• There appeared to be a threshold d0 with very
few solutions below and many above
• Algorithm found ~80% known binding sites
• Performed better than ClustalW, MEME,
Consensus
Blanchette, M
Recent Development
• Random Projection
• Phylogenetic Footprinting
• Reducer
Reducer
(Bussemaker et al., 2001)
• Links motif finding to expression level
• Ag = C + Σu Fu Nug (sum over motifs u = 1, …, M)
– Ag: expression level of gene g (logarithm of expression ratio)
– M: number of significant motifs
– Nug: number of occurrences of motif u in gene g
– C: baseline expression level (same for all genes)
– Fu: increase/decrease of expression level caused by presence of motif u
Reducer (Cont’d)
Expression vector: log ratio of expression levels
Gene1  Gene2  Gene3  Gene4  …  GeneN
1.3    -3.7   10.3   4.5    …  -2.3
Motif vector: number of times that motif occurs in the upstream region of the gene
       Gene1  Gene2  Gene3  Gene4  …  GeneN
AAAAA  2      0      5      3      …  0
AAAAT  5      3      2      1      …  5
…
Liu, X
Reducer (Cont’d)
• Normalize expression (A) and motif (n)
vectors
• Linear regression between A vector and every
n vector to find the best fit n to A
• Step-wise regression to combine effects of
motifs
– Subtract the effect of one motif
– Find the next best motif
Liu, X
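The regression steps can be sketched with plain least squares. The code below is illustrative only: it mean-centers the vectors as a simple stand-in for the normalization step, ranks motifs by absolute correlation with the current residual, and reuses the toy numbers from the expression and motif vectors above as if they were five genes. Function names are hypothetical.

```python
import math

def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def fit_single_motif(expression, counts):
    """Least-squares slope F and correlation for A_g ~ C + F * N_g."""
    a, n = center(expression), center(counts)
    sxx = sum(v * v for v in n)
    syy = sum(v * v for v in a)
    sxy = sum(x * y for x, y in zip(n, a))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

def stepwise_motifs(expression, motif_counts, steps=2):
    """Pick the best-fitting motif, subtract its effect, then repeat."""
    residual = center(expression)
    remaining = dict(motif_counts)
    chosen = []
    for _ in range(min(steps, len(remaining))):
        fits = {m: fit_single_motif(residual, c) for m, c in remaining.items()}
        best = max(fits, key=lambda m: abs(fits[m][1]))
        slope, corr = fits[best]
        chosen.append((best, slope, corr))
        # Subtract the effect of the chosen motif before the next round.
        centered = center(remaining.pop(best))
        residual = [r - slope * c for r, c in zip(residual, centered)]
    return chosen

expression = [1.3, -3.7, 10.3, 4.5, -2.3]
motifs = {"AAAAA": [2, 0, 5, 3, 0], "AAAAT": [5, 3, 2, 1, 5]}
print(stepwise_motifs(expression, motifs))
```

On these toy values AAAAA is selected first, with a correlation of about 0.99 against the expression vector.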
Acknowledgement
• People from whom I borrowed slides:
– Xiaole Liu (Reducer)
– Olga Troyanskaya (Microarray)
– Jeremy Buhler (Random projections)
– Mathieu Blanchette (Phylogenetic footprinting)
– Various web sources
[Figure: cDNA microarray workflow. Labels: cDNA clones (probes), PCR product amplification, purification, printing, 0.1 nl/spot, microarray, mRNA (target), hybridise target to microarray, scanning, laser 1, laser 2, excitation, emission, overlay images and normalise, analysis]
Information Content of Motifs
• Uncertainty
• Information = Hbefore - Hafter
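For a single motif position, the uncertainty can be measured as Shannon entropy in bits. The sketch below assumes a uniform background, so that Hbefore = 2 bits for DNA; that choice and the function names are assumptions, not stated on the slide.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_content(column, background=(0.25, 0.25, 0.25, 0.25)):
    """Information = H_before - H_after for one motif position."""
    return entropy(background) - entropy(column)

print(information_content([0.0, 1.0, 0.0, 0.0]))  # fully conserved position
print(information_content([0.4, 0.6, 0.0, 0.0]))  # position with two possible bases
```

A fully conserved position carries 2 bits, while a position split 0.4/0.6 between two bases carries about 1.03 bits.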
Improvements on the Original Gibbs Sampler
• 0 to n copies of sites in each sequence
• Iterative masking to find multiple motifs
• Use higher order Markov models to improve
motif specificity
Clinical Importance of Defects in
Regulatory Elements
Burkitt’s Lymphoma
Statistical Methods
• Expectation Maximization (EM)
– MEME
• Gibbs sampling
– BioProspector
– AlignACE
Motifs are not limited to DNA
• RNA motifs
– RNA – RNA interaction motifs, e.g., intron-exon
splice sites
– RNA – protein interaction motifs, e.g., binding of
proteins to RNA polyA tail
• Protein motifs
– E.g., Helix-turn-helix motif
Sequence Logo
Why is this Problem Hard?
• Information content of motifs is low
• Hamming distance between motif instances is high