Counting position weight matrices in a sequence & an

Download Report

Transcript Counting position weight matrices in a sequence & an

Counting position weight matrices in a
sequence & an application to discriminative
motif finding
Saurabh Sinha
Computer Science
University of Illinois, Urbana-Champaign
Transcriptional Regulation
TRANSCRIPTION
FACTOR
GENE
ACAGTGA
PROTEIN
Transcriptional Regulation
TRANSCRIPTION
FACTOR
GENE
ACAGTGA
PROTEIN
Binding sites and motifs
Transcription factor binding sites in a
gene’s neighborhood are the fundamental
units of the regulatory network
 Transcription factor binding is specific,
hence binding sites are similar to each
other, but variability is often seen
 A motif is the common sequence pattern
among binding sites of transcription factor

Motif models
Consensus string, e.g., ACGWGT
 Position Weight Matrix (PWM)

Position Weight Matrix
Binding
sites
ACCCGTT
ACCGGTT
ACAGGAT
ACCGGTT
ACATGAT
5 0 2 0 0 2 0 A
0 5 3 1 0 0 0 C
0 0 0 3 5 0 0 G
0 0 0 1 0 3 5 T
PWM
Databases of PWMs
Transfac has ~100s of PWMs for human
 Jaspar: a smaller, perhaps better curated
database of PWMs
 Organism specific databases coming up
frequenctly
 PWMs in databases often derived from
experimentally validated binding sites

Bioinformatics of PWMs





Popular motif model
i.e., several motif finding algorithms that attempt
to find PWMs from sequences
Gibbs sampling: one of the earliest; tries to
sample PWMs with high “relative entropy”
MEME: another early algorithm; uses
expectation maximization to find PWMs that
best “model the sequences”
Many more algorithms to find PWMs from a set
of sequences
Problem: counting motifs





Given DNA sequence, and a consensus motif
(say “ACGWGT”), count the motif in the
sequence
Trivial solution
What if the motif is a Position Weight Matrix
(PWM) ?
Why hasn’t this problem been looked at?
Because previous algorithms used different
scores of PWMs: how “sharp” they are, how
well they explain data, etc.
Counting matches to a PWM: A possibility

For each site s in sequence, compute
If Pr(s | W) > some threshold, call s a site
 Count number of sites in sequence
 No distinction between strong and weak
sites, as long as they are above threshold


binary scheme, not realistic
A wish-list (for the score)




Score should consider both strong and weak
occurrences of motif
Score should assign appropriate weights to
strong and weak occurrences
Score should be aware that there may also be
sites of other known motifs in the sequence
The list goes on: score should be efficiently
computable, score should be differentiable,
score should …
The “w-score”
Defined by a probabilistic model of
sequence generation
 Given one or more motifs, and a
background distribution, defines a
probability space on sequences
 A simple (zeroth order) Hidden Markov
model (HMM)

Probabilistic Model: toy example

W1

Wb
Given two motifs W1,W2, a
“background” motif Wb, and a
sequence length L
Pr(Wi  Wj) = pj

W2

When in state Wi, emit a
substring s chosen with
probability Pr(s | Wi)


transition probability
emission probability
Stop when length of emitted
sequence is L
A stochastic process generating sequences of length L
A “path” through the HMM

Wb
Wb
W1
One possible path T1
Wb
W1
W2
Another
Wb
Wb
W2
W2
possible path T2
Likelihood of sequence & paths

Wb
Wb
W1
Wb
W1

W2

Wb
A path of the HMM defines the
locations of motif matches
For a sequence S & a path T,
the joint probability Pr(S,T) is
easy to compute
Conditional probability of a
path T, given the data S, is:
Wb
W2
W2


Strong matches make the
probability higher
Paths with weak matches have
lower conditional probabilities
The “w-score”
Let the number of occurrences of a motif
(say W1) in path T be
 Compute:


In words: An average of the motif count
, with weights equal to the
probability of T given S
The “w-score” (Cont’d)
Score depends both on number and
quality of matches to motif.
 Every substring is a potential binding site,
and paths placing the motif there will
contribute to the count
 Pr(T | S) depends on the match strength
of all motifs, not just the one being
counted

The wish-list (again)
  Score should give consider both strong
and weak occurrences of motif
  Score should assign appropriate weights
to strong and weak occurrences
  Score should be aware that there may
also be sites of other known motifs in the
sequence
An exciting new feature of this motif score
Computational pros and cons
The w-score computation takes
time, where L is sequence length, and lm
is the motif length. This is relatively
expensive
 The w-score can be differentiated with
respect to all of the PWM parameters in
time


Important feature for search algorithms
Using the “w-score” in
discriminative motif finding
Discriminative motif finding




Suppose we have a set of co-regulated genes,
i.e., we believe they have binding sites of the
same transcription factor (in their regulatory
control regions)
Traditionally, motif finding tries to find these
binding sites, based on over-representation,
conservation etc.
Often we also know a set of genes that should
NOT have binding sites of that transcription
factor
Examples: ChIP-on-chip, In situ hybridization
pictures of Drosophila embryo, etc.
Problem formulation
Given two sets of sequences S+ and S Find a motif that has many occurrences in
S+ and few occurrences in S Maximize the difference in the average
counts of the motif in the two sets
 Let W(S) = count of a motif W in
sequence S
 Maximize:

Optimization problem

Find motif W that maximizes
Derivatives of objective function
Let Wk be the PWM entry for base  in
column k
 We can efficiently compute


We can efficiently differentiate our
objective function
Algorithm
Search space: Set of n = 20 substrings of
sequences in S+ (called “site set”)
 Objective function: Construct PWM W
from site-set, compute score
 Length of sites is user-defined

Algorithm
Current site-set C
S+
Algorithm
Replace one site with any site from sequence
S+
Pick a replacement that improves objective function
Algorithm
Current solution (site-set): C
 Candidate new solution: C
 Many possibilities for C (every substring
of every sequence in S+ is a possible
replacement)
 Evaluate objective function on each
candidate C



Too slow !
Use derivative information !
Algorithm
Estimate the objective function value for
each candidate C using partial
derivatives and first order approximation
 Examine each candidate in decreasing
order of estimated score
 If a candidate C found with greater score
than C, choose it.

Algorithm illustration
Current score = 12
Estimated scores
10
Accurate score
Accurate score
13
Accurate score
11
Algorithm Properties
Objective function has many desirable
properties, but is an expensive operation
 Derivative computation has the same time
complexity, and is used to guide search
 Avoids local optima by searching in a
discretized PWM space
 Performs significantly better and/or faster
than Gibbs sampling and Conjugate
Gradients, for this particular score

Discriminative PWM Search (DIPS)
Software available
 Can easily handle data sets of ~100
sequences
 Can find multiple motifs iteratively, but
without masking:


Find a PWM, then include it in the model as a
known PWM, find another PWM, and so on
Performance tests
Tested on synthetic data
 Compared to traditional motif finder as
well as two discriminative motif finders
 Superior performance in the presence of
“distractor” motifs


it really helps to be able to count a motif in
the presence of other known motifs
Tests on Drosophila Enhancers
200
180
BICOID (ACTIVATOR)
160
140
120
100
80
60
40
20
0
100
HEAD
80
60
40
20
TAIL
Tests on Drosophila Enhancers
100
CAUDAL (ACTIVATOR)
50
0
100
HEAD
80
60
40
20
TAIL
DIPS runs
S+ = promoters of genes expressed in
anterior half of embryo
 S- = promoters of genes expressed in
posterior half of embryo
 Top motif: Bicoid !

200
180
160
BICOID (ACTIVATOR)
140
120
100
80
60
40
20
0
100
HEAD
80
60
40
20
TAIL
DIPS runs
S+ = promoters of genes expressed in
posterior half of embryo
 S- = promoters of genes expressed in
anterior half of embryo
 Top motif: Caudal !

100
CAUDAL (ACTIVATOR)
50
0
100
HEAD
80
60
40
20
TAIL
Summary of results
Phase
1 (activ ator)
2 (repressor)
3 (activ ator)
4 (repressor)
S+
anterior 50%
posterior 50%
terminal 40%
middle 40%
80-100% EL
60-80% EL
40-60% EL
20-40% EL
0-20% EL
anterior 50%
posterior 50%
terminal 40%
middle 40%
80-100% EL
60-80% EL
40-60% EL
20-40% EL
0-20% EL
SFound motifBest match Pvalue of match
posterior 50%
anterior.1
bicoid
0.00
anterior 50%
posterior.1 caudal
0.03
middle 80%
terminal.1
torRE
0.00
0-30%, 70-100% centrall.1
hunchback
0.00
60-80% EL
5.1.1
torRE
0.10
80-100%,40-60%5.2.1
caudal
0.06
60-80%, 20-40% 5.3.1
kruppel
0.25
0-20%, 40-60% 5.4.1
knirps
0.03
20-40% EL
5.5.1
Dichaete
0.07
posterior 50%
anterior.2
huckebein
0.02
anterior 50%
posterior.2 pdm1_2
0.00
middle 80%
terminal.2
caudal
0.06
0-30%, 70-100% centrall.2
huckebein
0.12
60-80% EL
5.1.2
knirps
0.01
80-100%,40-60%5.2.2
torRE
0.01
60-80%, 20-40% 5.3.2
giant
0.09
0-20%, 40-60% 5.4.2
giant
0.05
20-40% EL
5.5.2
bicoid
0.18
Social regulation in honey bee
Transition from nursing in the hive to
foraging for food is age related, but also
regulated by the needs of the colony
 32 genes demonstrated to be significantly
differentially expressed in brains of nurses
and foragers (21 active in foragers only,
11 active in nurses only)
 DIPS run on 2Kbp promoters of these
social behavior-related genes

Results on honey bee genes
Conclusion
Discriminative motif finding increasingly
becoming a necessary analysis
 Motif finding in the presence of other
known motifs also becoming relevant
 A search algorithm that maximizes any
objective function of the motif counts in
the sequences

(as long as its differentiable)
 Several extensions and variations possible

Acknowledgements
Eric Siggia, Eran Segal
 Yoseph Barash (“LearnPSSM”)
 Andrew Smith (“DME”)

Reference

ISMB 2006 (Brazil); Bioinformatics
journal.