Information Encoding in Biological Molecules: DNA and

5.1: Gene Regulation
and Promoter Analysis
Wyeth Wasserman
Centre for Molecular Medicine and Therapeutics
Children’s and Women’s Hospital
Department of Medical Genetics
University of British Columbia
www.cisreg.ca
Lecture 5.1
Overview
• 5.1.0 Bioinformatics for detection of transcription
factor binding sites
• The Specificity Problem
• 5.1.1 Discrimination of regulatory control sequences
• Based on knowledge of established TFBS
• 5.1.2 Discovery of regulatory mechanisms
• Based on de novo pattern discovery
• 5.1.3 Impending advances
Layers of Complexity in Metazoan Transcription
Transcription Simplified
[Figure labels: URF, Pol-II, URE, TATA]
5.1.0 Profile Models for
Prediction of TF Binding Sites
Representing Binding Sites for a TF
• A single site
• AAGTTAATGA
• A set of sites represented as a consensus
• VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites (rows: A, C, G, T; one column per position)
A  14 16  4  0  1 19 20  1
C   3  0  0  0  0  0  0  0
G   4  3 17  0  0  2  0  0
T   0  2  0 21 20  0  1 20

A   4 13  4  4 13 12  3
C   7  3  1  0  3  1 12
G   9  1  3  0  5  2  2
T   1  4 13 17  0  6  4
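A consensus written in IUPAC degenerate codes, such as VDRTWRWWSHD, can be matched against a sequence by expanding each code into the set of bases it allows. The sketch below does this with a regular expression. The function names `consensus_to_regex` and `find_sites` are illustrative and do not come from any tool mentioned in the lecture.

```python
import re

# Standard IUPAC degenerate DNA codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Turn an IUPAC consensus string into a compiled regular expression."""
    return re.compile("".join("[" + IUPAC[c] + "]" for c in consensus.upper()))

def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    pattern = consensus_to_regex(consensus)
    width = len(consensus)
    seq = sequence.upper()
    return [i for i in range(len(seq) - width + 1)
            if pattern.fullmatch(seq, i, i + width)]

# Hypothetical short consensus used only to illustrate the call.
print(find_sites("TWR", "ATAGTTG"))
```

A consensus treats every allowed base as equally good, which is one reason the matrix representation is usually preferred.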
Set of binding sites:
AAGTTAATGA
CAGTTAATAA
GAGTTAAACA
CAGTTAATTA
GAGTTAATAA
CAGTTATTCA
GAGTTAATAA
CAGTTAATCA
AGATTAAAGA
AAGTTAACGA
AGGTTAACGA
ATGTTGATGA
AAGTTAATGA
AAGTTAACGA
AAATTAATGA
GAGTTAATGA
AAGTTAATCA
AAGTTGATGA
AAATTAATGA
ATGTTAATGA
AAGTAAATGA
AAGTTAATGA
AAGTTAATGA
AAATTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
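A set of aligned sites like the one above is turned into a position frequency matrix (PFM) by counting each base in each column. The following is a minimal sketch, assuming all sites have the same length; `build_pfm` is an illustrative name, and only the first four sites from the list are used to keep the example short.

```python
from collections import Counter

def build_pfm(sites, alphabet="ACGT"):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(s[i].upper() for s in sites) for i in range(width)]
    # One row per base, one entry per position.
    return {b: [col[b] for col in columns] for b in alphabet}

# First four sites from the list above.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
```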
PFMs to PWMs
One would like to add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic
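f matrix (counts from N = 5 sites; rows: bases, columns: positions 1 to 5)
A   5   0   1   0   0
C   0   2   2   4   0
G   0   3   1   0   4
T   0   0   1   1   1

$$ w(b,i) = \log_2 \frac{\bigl(f(b,i) + s(N)\bigr) / \bigl(N + \sum_{b'} s(N)\bigr)}{p(b)} $$

where f(b,i) is the count of base b at position i, N is the number of sites, p(b) is the background frequency of base b, and s(N) is a pseudocount function. The values below are consistent with s(N) = sqrt(N) · p(b) and p(b) = 0.25.

w matrix
A   1.6  -1.7  -0.2  -1.7  -1.7
C  -1.7   0.5   0.5   1.3  -1.7
G  -1.7   1.0  -0.2  -1.7   1.3
T  -1.7  -1.7  -0.2  -0.2  -0.2

Score of TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9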
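The conversion from counts to weights can be checked numerically. The sketch below assumes a pseudocount of sqrt(N) · p(b) and a uniform background of 0.25, which is an assumption chosen because it reproduces the rounded values on the slide. The names `pfm_to_pwm` and `score` are illustrative.

```python
import math

def pfm_to_pwm(pfm, background=None):
    """Convert a count matrix {base: [counts]} to log2-odds weights.

    Assumption: the pseudocount for base b is sqrt(N) * p(b),
    where N is the number of sites contributing to each column.
    """
    bases = list(pfm)
    background = background or {b: 1.0 / len(bases) for b in bases}
    width = len(pfm[bases[0]])
    n_sites = sum(pfm[b][0] for b in bases)
    root = math.sqrt(n_sites)
    pwm = {b: [] for b in bases}
    for i in range(width):
        for b in bases:
            corrected = (pfm[b][i] + root * background[b]) / (n_sites + root)
            pwm[b].append(math.log2(corrected / background[b]))
    return pwm

def score(pwm, word):
    """Sum the weight of the observed base in each column."""
    return sum(pwm[base][i] for i, base in enumerate(word))

# Count matrix from the slide (five sites, five positions).
f = {"A": [5, 0, 1, 0, 0], "C": [0, 2, 2, 4, 0],
     "G": [0, 3, 1, 0, 4], "T": [0, 0, 1, 1, 1]}
w = pfm_to_pwm(f)
print(round(score(w, "TGCTG"), 1))  # 0.9, as on the slide
```

Working in log space is what makes the score a simple sum over columns.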
JASPAR
OPEN-ACCESS DATABASE
OF TF BINDING PROFILES
(Some other databases with TF profiles include Transfac, TRRD,
mPromDB, SCPD (yeast), dbTBS and EcoTFS (bacteria))
Performance of Profiles
• 95% of predicted sites bound in vitro (Tronche
1997)
• MyoD binding sites predicted about once
every 600 bp (Fickett 1995)
• Futility Theorem
• Nearly 100% of predicted TFBS have no function in vivo
• Brazma claims it should be called the futility conjecture
1000bp promoter screened with
collection of TF profiles (beta-globin)
5.1.1 Pattern Discrimination
Overcoming the specificity problem by
incorporating biological knowledge into
computational algorithms
Phylogenetic Footprinting
70,000,000 years of evolution reveals
most regulatory regions.
SIDENOTE:
Global Progressive Alignments
(e.g. ORCA, AVID, LAGAN)
ORCA
• Global alignments: memory = product of sequence lengths
• Progressive alignment by banding with local alignments (e.g. BLAST) and running global method on banded sub-segments
• Recursion with decreasingly stringent parameters
Phylogenetic Footprinting to Identify
Functional Segments
[Figure: % identity vs. start position of a 200 bp window (human sequence)]
Actin gene compared between human and mouse.
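A percent-identity profile of this kind can be computed from a pairwise alignment with a sliding window. The sketch below is a simplified illustration: it measures windows in alignment columns rather than in positions of the human sequence, and `window_identity` is a hypothetical helper, not part of ORCA or any other aligner named here.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Percent identity in sliding windows over a pairwise alignment.

    aligned_a and aligned_b are equal-length alignment rows, with '-'
    marking gaps. A column counts as identical only if both rows carry
    the same base and neither is a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("alignment rows must have equal length")
    matches = [int(x == y and x != "-")
               for x, y in zip(aligned_a.upper(), aligned_b.upper())]
    results = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        results.append((start, 100.0 * identical / window))
    return results

# Toy alignment with a 4-column window, for illustration only.
print(window_identity("ACGTACGT", "ACGAAC-T", window=4, step=4))
```

Windows whose identity exceeds a chosen cutoff would then be treated as candidate conserved regions.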
Phylogenetic Footprinting (cont)
[Figure: FoxC2, % identity (0% to 100%) vs. start position of 200 bp window (0 to 7000)]
Recall...
1000bp beta-globin promoter screened
with phylogenetic footprinting
Choosing the “right” species...
Genes evolve at different rates – make gene-specific choice
[Figure panels: chicken vs. human; mouse vs. human; cow vs. human]
Performance: Human vs. Mouse Pairwise
SELECTIVITY
SENSITIVITY
• Testing set: 40 experimentally defined sites in 15 well studied
genes (Replicated with 100+ site set)
• 85-95% of defined sites detected with conservation filter, while
only 11-16% of total predictions retained
ConSite
Now driven by the ORCA Aligner
Selected Emerging Issues
• Multiple sequence comparisons
– Incorporate phylogenetic distances into a scoring metric
– Visualization (see dcode service and Sockeye)
– Analysis of many closely related species
• “Phylogenetic shadowing”
• Genome rearrangements
– Inversion compatible alignment algorithm
• LAGAN
• Higher order models of TFBS
Regulatory Modules
for better specificity
TFs do NOT act in isolation
Layers of Complexity in Metazoan Transcription
Liver regulatory modules
PSSMs for Liver TFs…
[Figure panels: HNF1, C/EBP, HNF3, HNF4]
Detection of Clusters of TFBS
• In the best cases, we have enough data to train a
discriminant function
• Rare to have sufficient data
• Alternatively, identify dense clusters of sites that are
statistically significant
• Diverse methods have been introduced over the past few
years…Berman; Markstein; Frith; Noble; Wagner;…
• Non-trivial to correct for non-random properties of DNA
– Most difficulty comes from local direct repeats
• A primary challenge from the biological side is the
selection of a meaningful grouping of TFs
• Multiple testing problems severe
TFBS Clusters
(MSCAN, MCAST, COMET, etc)
• MSCAN allows users to submit any set of TF profiles
• Calculates significance for each site based on local
sequence characteristics
• G-rich PSSM gets less weight on G-rich region of gene
• Calculates cluster significance using a dynamic
programming approach
• Approximately 1 significant liver cluster / 18 000 bp in human
genome sequence
• Filters to remove “significant” clusters of sites that
contain local repeats
• Identification of non-random characteristics in DNA
http://mscan.cgb.ki.se
Training predictive models for
modules
• Not every combination of sites is meaningful
• Reality: Some factors critical, others secondary
• An alternative is to teach the computer which
combinations are better
• Limited by small size of positive training set
• Explore an older method based on Logistic Regression
Analysis
Recall: Liver regulatory modules
Logistic Regression Analysis
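Each input x_i is multiplied by a weight a_i (a1 to a4 in the figure) and the products are summed (Σ) to give the “logit”:
$$ \mathrm{logit} = \sum_i a_i x_i $$
Optimize a vector to maximize the distance between output values for positive and negative training data.
Output value is:
$$ p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}} $$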
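The output calculation is a weighted sum passed through the logistic function. Below is a minimal sketch, assuming the inputs are one score per TF profile. The optional intercept term and the example weights are assumptions for illustration and are not values from the published liver model.

```python
import math

def logit(scores, weights, intercept=0.0):
    """Weighted sum of the inputs (the intercept is an assumed extra term)."""
    return intercept + sum(a * x for a, x in zip(weights, scores))

def module_probability(scores, weights, intercept=0.0):
    """Logistic output, equal to e^logit / (1 + e^logit)."""
    z = logit(scores, weights, intercept)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for four TF profiles and two hypothetical windows.
weights = [1.2, 0.8, 0.5, 1.0]
print(module_probability([0.0, 0.0, 0.0, 0.0], weights))
print(module_probability([2.0, 1.5, 0.5, 1.0], weights))
```

Training then amounts to choosing the weight vector so that positive and negative examples land far apart on this 0 to 1 scale.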
PERFORMANCE
• Liver (Genome Research, 2001)
– At 1 hit per 35 kbp, identifies 60% of
modules
– Limited to genes expressed late in liver
development
LRA models do not account for
multiple sites for the same TF*
*Frith et al.’s CISTER algorithm circumvents this
problem
UDPGT1 (Gilbert’s Syndrome)
[Figure: liver module model score (-0.2 to 1) vs. “window” position in sequence (100 to 5840); series: wildtype and mutant]
Making better predictions
• Profiles make far too many false predictions to have
predictive value in isolation
• Phylogenetic footprinting eliminates about 90% of
false predictions while retaining ~70% of real sites
(human vs mouse)
• Detection of clusters of binding sites offers better
predictive performance, especially through trained
discrimination functions
Active Issues
• Significance of clusters of sites
• Segmentation of DNA into regions of different composition
• Methods using training to find clusters
• Where to place weights?
• Interaction weighting in the absence of large data collections
• Resources
• Limited number of solid PSSMs
• Need a reference database for functional regulatory regions
• Validation of predictions for tissues/cells not well represented in
cell culture
EMERGING APPLICATION
Regulatory Analysis of
Variation in ENhancers
Genetic variation in TFBS can result in
biomedically important phenotypes
Sequence Variation in TFBS
[Figure labels: URF, AaGT, TSS]
GENE | DISEASE/CONDITION (associated) | REFERENCE
UGT1A1 | Gilbert’s Syndrome – jaundice | PJ Bosma et al., 1995
UCP3 | Elevated Body Mass | S Otabe et al., 2000
TNFalpha | Malaria Susceptibility | JC Knight et al., 1999
Resistin | Elevated Body Mass | JC Engert et al., 2002
IL4Ralpha | Reduced soluble IL4R | H Hackstein et al., 2001
ABCA1 | Coronary artery disease | KY Zwarts et al., 2002
Ob | Leptin levels | J Hager et al., 1998
PEPCK | Obesity | Y Olswang et al., 2002
PR | Endometrial cancer | I DeVivo et al., 2002
LDLR | Familial hypercholesterolemia | Koivisto et al., 1994
Stage 1:
Prediction of Regulatory Regions
Stage 1: Predict Regulatory Regions
• Retrieve orthologous human and mouse gene
sequences
• Align sequences with a global aligner (ORCA)
• Identify regions of conservation
• Design primers for SNP discovery
[Figure: FoxC2, % identity (0% to 100%) vs. window start position (0 to 7000)]
SIDENOTE: Data/Orthology obtained from GeneLynx (www.genelynx.org)
Stage 2:
Analysis of Polymorphisms
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAATGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
ACGCATAAGTTAACGAATAACAGAT
Identify variations that generate allele-specific binding sites (predicted)
[Figure: differences in scores (-4 to 4) at positions 1 to 11]
Pseudo-data for instructional purposes
1234567890123456789012345
ACGCATAAGTTAAtGAATAACAGAT
.............c...........
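The comparison of alleles can be sketched by scoring every window of both sequences with the same weight matrix and taking the difference. The code below is an illustration under assumptions: the matrix is built from four of the sites listed earlier with a pseudocount of 1 and a uniform background, and the function names are hypothetical rather than taken from RAVEN.

```python
import math

def log_odds_matrix(sites, pseudocount=1.0):
    """Column-wise log2-odds weights from aligned sites (uniform background)."""
    n = len(sites)
    matrix = []
    for i in range(len(sites[0])):
        column = {}
        for b in "ACGT":
            count = sum(s[i] == b for s in sites)
            freq = (count + pseudocount / 4) / (n + pseudocount)
            column[b] = math.log2(freq / 0.25)
        matrix.append(column)
    return matrix

def window_scores(matrix, seq):
    """Score of every window of the matrix width along seq."""
    width = len(matrix)
    return [sum(matrix[j][seq[i + j]] for j in range(width))
            for i in range(len(seq) - width + 1)]

def allele_differences(matrix, ref, alt):
    """Reference score minus variant score for every window."""
    return [r - a for r, a in zip(window_scores(matrix, ref),
                                  window_scores(matrix, alt))]

sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "AAGTTAACGA"]
ref = "ACGCATAAGTTAATGAATAACAGAT"
alt = ref[:13] + "C" + ref[14:]   # t -> c at position 14 (1-based)
diffs = allele_differences(log_odds_matrix(sites), ref, alt)
for start, d in enumerate(diffs, start=1):
    print(start, round(d, 2))
```

Only windows that overlap the variant position can change score, so the differences are zero everywhere else.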
RAVEN screenshots
5.1.2 Discovery of Mediating TFBS
for Sets of Co-Regulated Genes
Finding characteristics over-represented in a
set of co-regulated genes
Pattern Discovery
Linking co-expressed genes from
microarrays to candidate TFs
oPOSSUM Project
• A significant subset of TFs are represented by
existing binding profiles
• Within same structural class, often binding specificity
retained (more on this later)
• Can we link known TFs to a putative regulon
by over-representation of predicted binding
sites in promoters?
• Identical concept to the detection of over-represented
GO terms from previous session
oPOSSUM Procedure
1. Set of co-expressed genes
2. Automated sequence retrieval from EnsEMBL
3. Phylogenetic footprinting (ORCA)
4. Detection of transcription factor binding sites
5. Statistical significance of binding sites
6. Putative mediating transcription factors
Reference Gene Sets
A. Muscle-specific (15)
TF | z-score p-value | Fisher p-value
Mef2 * | ++++ | ++
SRF * | ++++ | +
Myf * | ++++ | +
FREAC-7 | ++++ | +
TEF-1 * | ++ |
c-MYB-1 | ++ |
Pax-2 | ++ |
Tal1beta-E47S | + |
Gklf | + |
Thing1-E47 | + |
MZF_5-13 | + |
Yin-Yang | + |
SPI-B | + |
GATA-2 | + |
Ahr-ARNT | + |
B. Liver-specific (15)
TF | z-score p-value | Fisher p-value
HNF-1 * | ++++ | +
COUP-TF | ++++ | +
Gfi | ++++ | +
FREAC-7 | +++ |
GATA-2 | +++ |
FREAC-3 | +++ |
E4BP4 | ++ |
FREAC-4 | ++ |
RORalpha-1 | + |
S8 | + |
HNF-3beta * | + |
Sox-5 | + |
C. Known NF-κB targets (61)
TF | z-score p-value
NF-κB * | ++++
c-REL * | ++++
p65 * | ++++
p50 * | ++++
Irf-2 | ++++
SPI-B | ++++
c-FOS | ++++
SPI-1 | +++
Irf-1 | +++
Brachyury | ++
Elk-1 | +
MZF_1-4 | +
Sox-5 | +
GATA-2 |
Fisher p-value column as extracted, top to bottom (row assignment not recoverable): ++, +, ++, +, +, +, +, +, +, +
Fisher p-values ++ p<1e-05, + p<1e-02
MICROARRAY APPLICATION:
NF-kB Inhibitor-sensitive genes (326)
Genes Significantly Down-regulated After Treatment with Inhibitor
TF | Class | z-score p-value | Fisher p-value
NF-kappaB | Rel/NFkB | ++++ | ++
p65 | Rel/NFkB | ++++ | ++
c-Rel | Rel/NFkB | ++++ | +
p50 | Rel/NFkB | ++++ | +
Pbx | Homeo | ++++ | +
Sox-5 | HMG | +++ | +
SPI-B | ETS | +++ | +
HFH-2 | Forkhead | +++ | +
FREAC-4 | Forkhead | +++ | +
Max | bHLH-ZIP | ++ | +
SRY | HMG | + | +
++++ p<1e-30, +++ p<1e-10, ++ p<1e-05, + p<1e-02
oPOSSUM Server
Over-represented Site Combinations
(Kreiman 2004)
• Based on our understanding of CRMs, likely
that combinations of sites would be more
distinguished than individual sites (better
signal-to-noise?)
• Kreiman has introduced a system to assess
clusters of neighbouring conserved sites
based on counting
– Hypergeometric distribution, simply compare the
frequency of the cluster occurrence vs.
expectation
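The counting approach can be illustrated with an upper-tail hypergeometric probability: how likely is it to see at least k genes carrying a site cluster in a co-regulated set of n genes, when K of the N background genes carry it? This is a generic sketch of that test, not Kreiman's implementation, and the numbers in the example are made up.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the cluster."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical counts: 8 of 20 co-regulated genes carry the cluster,
# against 200 of 10000 genes in the background set.
p_value = hypergeom_upper_tail(8, 20, 200, 10000)
print(p_value)
```

The same calculation applies to single sites; with many candidate clusters the resulting p-values would still need a multiple-testing correction.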
What if the TFBS is novel?
de novo Pattern Discovery Methods
• String-based
– e.g. “Moby Dick” (Bussemaker, Li & Siggia)
– Identify over-represented oligomers in comparison of “+” and
“-” (or complete) promoter collections
• Profile-based
– Monte Carlo/Gibbs Sampling
– e.g. AnnSpec (Workman & Stormo)
– Identify strong patterns in “+” promoter collection vs.
background model of expected sequence characteristics
String-based Exhaustive Methods
Word-based methods:
How likely are X words in a set of
sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC
TTCAAATTTTAACGCCGGAATAATCTCCTATT
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
TATCGTCATTCTCCGCCTCTTTTCTT
GCTTATCAATGCGCCCGGAATAAAACGCTATA
CATTGACTTTATCGAATAAATCTGTT
ATCTATTTACAATGATAAAACTTCAA
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
TTTCAAATCCGGAATTTCCACCCGGAATTACT
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
>EP55011 (-) Ce hsp 16K-1 B; range
Exhaustive methods(2)
Find all words of length 7 in the yeast genome
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG
ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC
GGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGC
CAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGT
CTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAA
ATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTG
GGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAA
GTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTT
TGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGA
AAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCA
AAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATC
ACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGT
CGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAA
GGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAG
ATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCAT
CATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAA
ACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTG
AAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTG
TCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTC
TCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGT
ATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCT
AGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGT
CTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCT
GCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTG
CTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
Make a lookup table:
TTTTTTTT/aaaaaaaa  57788
GATAGGCA/tgcctatc  589
AAACCTTT/aaaggttt  456
Etc...
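A lookup table of this kind pairs each word with its reverse complement so that both strands are counted together. The sketch below shows one way to build it. The function names are illustrative, and the choice to store each pair under the alphabetically smaller word is an assumption made for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reverse complement of an upper-case DNA word."""
    return word.translate(_COMPLEMENT)[::-1]

def count_words(sequences, k=8):
    """Count k-mers, pooling every word with its reverse complement."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):      # skip windows with N or other symbols
                key = min(word, reverse_complement(word))
                counts[key] += 1
    return counts

print(reverse_complement("GATAGGCA"))
print(count_words(["AAAC", "GTTT"], k=4))
```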
Exhaustive methods(3)
$$ P(w \text{ begins in } i) = \prod_{j=1}^{k} p(a_j) $$
$$ E[X_w] = (n - k + 1) \prod_{j=1}^{k} p(a_j) $$
$$ Z_w = \frac{X_w - E[X_w]}{\sqrt{\mathrm{Var}[X_w]}} $$
Over-representation
How many words of type ‘AGGAGTGA’ are found in our sequences?
How likely is this result?
Exhaustive methods(4)
Modeling Properties of DNA
Simple:
How likely are single nucleotides?
(extended Bernoulli)
Complex:
Neglect certain words
Locations of TFBS
Higher-order descriptions of DNA
Exhaustive methods: Key items
• Algorithms with high complexity: large sequences and/or many possible word lengths not possible
• Often string-based
• TFBS are not words (‘fuzzy’ binding)
• Sensitivity susceptible to noisy input data (e.g. microarrays)
Profile-based Methods
(usually probabilistic)
Find a local alignment of width x of sites that
maximizes information content in reasonable time
Usually by Gibbs sampling or EM methods
Motivations:
TFBS are not words
Efficiency
Can be intentionally influenced by biological data
Profile Methods (2)
The Gibbs Sampling algorithm
Two data structures used:
1) Current pattern nucleotide frequencies
q_{i,1}, ..., q_{i,4} and corresponding background
frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site start points in
the N sequences a_1, ..., a_N, i.e. the
alignment that contributes to q_{i,j}.
One starting point in each sequence is
chosen randomly initially.
tgacttcc
tgatctct
agacctca
tgacctct
Profile Methods (3)
Iteration step
A. Remove one sequence z from the set. Update the current pattern according to
$$ q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B} $$
where c_{i,j} is the count of symbol j in column i of the current alignment, b_j is the pseudocount for symbol j, and B is the sum of all pseudocounts in the column.
B. ‘Score’ the current pattern against each possible occurrence a_k in z. Draw a new a_k with probabilities based on respective score divided by the background model.
[Figure: current alignment with sequence z removed]
tgacttcc
tgatctct
agacctca
tgacctct
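The two iteration steps can be sketched as a small site sampler. This is a simplified illustration that assumes exactly one site per sequence, a fixed width, upper-case ACGT input and a single background distribution estimated from the data. It is not the implementation used in AnnSpec or the other tools mentioned in the lecture.

```python
import random

def gibbs_motif_search(seqs, width, iterations=200, pseudocount=0.25, seed=0):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    bases = "ACGT"
    n = len(seqs)
    # Background frequencies estimated from all sequences (add-one smoothing).
    total = sum(len(s) for s in seqs)
    background = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4) for b in bases}
    # One random start point per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(n):
            # Step A: pattern frequencies q from every sequence except z.
            q = []
            for i in range(width):
                counts = {b: pseudocount for b in bases}
                for k in range(n):
                    if k != z:
                        counts[seqs[k][starts[k] + i]] += 1
                denom = (n - 1) + 4 * pseudocount
                q.append({b: counts[b] / denom for b in bases})
            # Step B: weight each candidate start in z by pattern / background.
            weights = []
            for a in range(len(seqs[z]) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seqs[z][a + i]
                    ratio *= q[i][base] / background[base]
                weights.append(ratio)
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy sequences that each contain the word TGACTTCC, for illustration only.
seqs = ["ACGTTGACTTCCAGTA", "TTGATGACTTCCAAAC",
        "CCATGACTTCCGGTAT", "GATTACATGACTTCCA"]
starts = gibbs_motif_search(seqs, 8, iterations=50)
print([s[p:p + 8] for s, p in zip(seqs, starts)])
```

Because the new position is drawn at random rather than chosen greedily, the sampler can escape poor early alignments, at the cost of results that vary between runs and seeds.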
Pattern Discovery Across Orthologous
Promoters from Gram-Positive Bacteria
[Figure: frequency (0.00 to 0.25) vs. MAP value (0 to 5); series: real sets and random]
Yeast Regulatory Sequence
Analysis (YRSA) system
EXAMPLE:
Tests of YRSA System
[Figure panels: DNA-damage response, partially mediated by MCB; classic cell-cycle array data re-clustered by Getz et al; PDR3-regulated genes from array study]
SIDENOTE: Comparison of profiles requires alignment and a scoring function
• Scoring function based on sum of squared differences
• Align frequency matrices with modified Needleman-Wunsch algorithm
• Calculate empirical p-values based on simulated set of matrices
[Figure: frequency vs. score]
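A column-wise sum of squared differences can be turned into a profile similarity score as sketched below. Two simplifications are assumptions of this example: each aligned column pair scores 2 minus its SSD (so identical columns score 2 and disjoint ones score 0), and an ungapped comparison over all offsets stands in for the modified Needleman-Wunsch alignment named above.

```python
def column_ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns (A, C, G, T)."""
    return sum((x - y) ** 2 for x, y in zip(col_a, col_b))

def best_ungapped_similarity(profile_a, profile_b):
    """Best total score over all ungapped offsets of profile_b against profile_a."""
    best = float("-inf")
    for offset in range(-(len(profile_b) - 1), len(profile_a)):
        total = 0.0
        for j, col_b in enumerate(profile_b):
            i = offset + j
            if 0 <= i < len(profile_a):
                total += 2.0 - column_ssd(profile_a[i], col_b)
        best = max(best, total)
    return best

# Hypothetical two-column profiles; each column lists frequencies of A, C, G, T.
a = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0, 0.0)]
print(best_ungapped_similarity(a, a), best_ungapped_similarity(a, b))
```

Scores from such a function only become interpretable once they are compared with scores for simulated matrices, as the last bullet above describes.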
How is the Performance: Hit and Miss
Applied Pattern Discovery is Acutely
Sensitive to Noise
[Figure: pattern similarity vs. true Mef2 profile (10 to 18) plotted against sequence length (0 to 600); reference level: true Mef2 binding sites]
Overcoming the
sensitivity challenge
Metazoan genomes are far from ideal
Biochemical complexity enables
greater complexity in regulation
[Figure: Yeast: regulatory sites (“GO”) within about 500 bp of ORF A. Humans: regulatory sites (“GO”) spread over about 20 000 bp around EXON 1, EXON 2 and EXON 3.]
Four Approaches to Improve
Sensitivity
• Better background models
-Higher-order properties of DNA
• Phylogenetic Footprinting
– Human:Mouse comparison eliminates ~75% of sequence
• Regulatory Modules
– Architectural rules
• Limit the types of binding profiles allowed
– TFBS patterns are NOT random
Phylogenetic Footprinting to
Identify Conserved Regions
Bayes Block Aligner
(Lawrence Group)
ORCA
Skeletal Muscle Genes
• One of the most extensively studied tissues for
transcriptional regulation
– 45 genes partially analyzed
– 26 genes with orthologous genomic sequence from human
and rodent
• Five primary classes of transcription factors
– Principal: Myf (myoD), Mef2, SRF
– Secondary: Sp1 (G/C rich patches), Tef (subset of skeletal
muscle types)
de novo Discovery of Skeletal Muscle
Transcription Factor Binding Sites
Mef2-Like
SRF-Like
Myf-Like
Pattern discovery methods
using biochemical constraints
RECALL:
tgacttcc
tgatctct
agacctca
tgacctct
Gibbs Algorithm
z
tgacttcc
tgatctct
agacctca
tgacctct
Some profile constraints have
been explored…
• Segmentation of informative
columns
• Palindromic patterns
Intra-family PSSM similarity
Match to bHLH
COMPARE
TF Database
(JASPAR)
Jackknife Test
87% correct
Independent Test Set 93% correct
FBPs enhance sensitivity of
pattern detection
APPLICATION:
Cancer Protection Response
• Detoxification-related enzymes are induced by
compounds present in broccoli
• Arrays, SSH and hard work have defined a set of
responsive genes
• A known element mediates the response (Antioxidant
Responsive Element)
• Controversy over the type of mediating leucine zipper
TF
• NF-E2/Maf or Jun/Fos
Application (2)
Gibbs Sampling
Problem: Given a set of co-regulated genes, determine the
common TFBS. Classify the mediating TF. We expect a
leucine zipper-type TF.
Application (3)
Gibbs with FBP Prior
Problem: Given a set of co-regulated genes, determine the
common TFBS. Classify the mediating TF. We expect a
leucine zipper-type TF.
Application (4)
Classify New TF Motif
Maf (p<0.02)
Jun (p<0.98)
Problem: Given a set of co-regulated genes, determine the
common TFBS. Classify the mediating TF. We expect a
leucine zipper-type TF.
EMERGING METHOD
de novo Analysis
of Regulatory Modules
Focus on regulatory modules
for pattern detection
Cluster Genes by Expression
Identify and Model Contributing TFs
Predictive Models
[Figure: example count matrix, 4 rows x 7 columns]
6 0 0 0 7 0 0
2 8 4 7 1 0 2
0 0 4 0 0 8 0
0 0 0 1 0 0 6
Analyze co-regulated genes to define circuit characteristics
[Figure: model components. General Circuit Properties: Binding Profiles (m_i), Number of Sites Distribution (k), Neighbor Interactions (a_ij between m_i and m_j), Separation Distribution (Default = Uniform), Width Distributions (Sum of Separations). Specific Gene Features.]
Discovery performance
• Approximately 50% of annotated TFBS are
detected in the training set sequences of 25
genes
• Only 40% of predicted TFBS are annotated
• We suspect that most of the un-annotated sites will turn
out to be functional. This needs to be determined.
Review of Primary Points
Second Chance
Regulatory regions problem space
Sets of
binding
sites
Specificity profiles for binding sites
A
C
G
T
AATCACCA
AATCACCA
AATCACCA
AATCACCA
AATCTCCC
AATCTCCG
AATCACAC
AATCATCA
AATCTCAC
AATCTCTG
AGTCCCCA
AATCCCGG
AATCTGAG
AATCCATA
ATTCAGCC
AATAACTT
GATAACCT
AATTAGAC
GATTACAG
GATTAGCG
ATTCTTCC
TATGAACA
GATTAAAA
AGACCCCA
[Figure: position weight matrix (rows A, C, G, T) derived from the sites above]
Clusters of binding sites
Transcription factors
URF
Pol-II
URE
TATA
Transcription factor binding sites
Regulatory nucleotide sequences
Detecting binding sites in a single
sequence
Scanning a sequence against a PWM
Sp1
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC
[Figure: Sp1 position weight matrix (rows A, C, G, T; 10 columns) aligned against a window of the sequence]
Abs_score = 13.4 (sum of column scores)
Calculating the relative score
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)
$$ \mathrm{Rel\_score} = \frac{\mathrm{Abs\_score} - \mathrm{Min\_score}}{\mathrm{Max\_score} - \mathrm{Min\_score}} \times 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \times 100\% = 93\% $$
Is 93% better than 82%?
Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%
Ouch.
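The relative score rescales the absolute score between the lowest and highest scores the matrix can produce. The sketch below reproduces the calculation with the three numbers from the slide. `score_bounds` shows how the two limits would be obtained from a matrix, and the small matrix passed to it is hypothetical.

```python
def relative_score(abs_score, max_score, min_score):
    """Absolute score rescaled to 0-100% of the matrix's attainable range."""
    return 100.0 * (abs_score - min_score) / (max_score - min_score)

def score_bounds(pwm):
    """Highest and lowest attainable scores for a PWM given as {base: [weights]}."""
    columns = list(zip(*pwm.values()))
    return sum(max(c) for c in columns), sum(min(c) for c in columns)

# Values from the slide: Abs_score 13.4, Max_score 15.2, Min_score -10.3.
print(round(relative_score(13.4, 15.2, -10.3)))  # 93

# Hypothetical two-column matrix, only to show how the bounds are derived.
print(score_bounds({"A": [1.0, -2.0], "C": [0.5, 3.0]}))
```

A relative score makes thresholds comparable between matrices of different widths, but it does not by itself say how often a given score arises by chance.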
Phylogenetic Footprints
Scanning a single sequence
Low specificity of profiles:
• too many hits
• great majority are not biologically significant
Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions
A dramatic improvement in the percentage of biologically significant detections
Applied Pattern Discovery is Acutely
Sensitive to Noise
[Figure: pattern similarity vs. true Mef2 profile (10 to 18) plotted against sequence length (0 to 600); reference level: true Mef2 binding sites]
Acknowledgements
Wasserman Group
Wynand Alkema
Dave Arenillas
Jochen Brumm
Alice Chou
Shannan Ho Sui
Danielle Kemmer
Jonathan Lim
Raf Podowski
Dora Pak
Albin Sandelin
Chris Walsh
Collaborating Trainees
Malin Andersson (KTH)
Öjvind Johansson (UCSD)
Stuart Lithwick (U.Toronto)
Collaborators
Boris Lenhard (K.I.)
Chip Lawrence (Wadsworth)
William Thompson (Wadsworth)
Jens Lagergren (KTH)
Christer Höög (K.I.)
Brenda Gallie (OCI)
Jacob Odeberg (KTH)
Niclas Jareborg (AZ)
William Hayes (AZ)
James Mortimer (MF)
Group Alumni
Elena Herzog
Annette Höglund
William Krivan
Luis Mendoza
Support: CIHR, CGDN, CFI, Merck-Frosst, BC Children’s
Hospital Foundation, Pharmacia, EC–Marie Curie, KI-Funder
EXTRA SLIDES
What will a computational biologist
do with a scoring function?
Build a similarity tree!
The matrix tree:
bHLH-Zip domain
bHLH domain
Compare with consensus for both classes - CANNTG
