Transcript Document

Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 3
1
Canadian Bioinformatics Workshops 2008
Inferring Regulatory Mechanisms
Governing Sets of Genes
Wyeth W. Wasserman
University of British Columbia
www.cisreg.ca
Deciphering Regulation of Co-Expressed Genes
Co-Expressed
Negative Controls
Module 3: Overview
Part 1: Overview of transcription
Lab 3.1: Promoters in Genome Browser (UCSC)
Part 2: Prediction of transcription factor binding sites
using binding profiles (“Discrimination”)
Lab 3.2: TFBS scan (Footer)
Part 3: Interrogation of sets of co-expressed genes to
identify mediating transcription factors
Lab 3.3: TFBS Over-Representation (oPOSSUM)
Part 4: Detection of novel motifs (TFBS) overrepresented in regulatory regions of co-expressed
genes (“Discovery”)
Lab 3.4: Motif Discovery (MEME/STAMP)
Restrictions in Coverage
• Focus on eukaryotic cells and Pol-II promoters
• Most principles apply to prokaryotes
• Pol-II ~ protein coding genes
• All references are made to activating
sequences
• Information about repression is sparse
Part 1
Introduction to transcription
in eukaryotic cells
Transcription Over-Simplified
Three-step Process:
1. TF binds to TFBS (DNA)
2. TF catalyzes recruitment of
polymerase II complex
3. Production of RNA from
transcription start site (TSS)
TF
Pol-II
TFBS
TATA
TSS
Anatomy of Transcriptional Regulation
WARNING: Terms vary widely in meaning between scientists
[Figure: gene schematic labeling the Distal Regulatory Regions (TFBS clusters), the Proximal Regulatory Region (TFBS), the Core Promoter/Initiation Region (Inr) with TATA and TSR, and the EXONs]
• Core Promoter – Sufficient for initiation of transcription; orientation dependent
• TSR – transcription start region
– Refers to a region rather than specific start site (TSS)
• TFBS – single transcription factor binding site
• Regulatory Regions
– Proximal/Distal – vague reference to distance from TSR
– May be positive (enhancing) or negative (repressing)
– Orientation independent (generally)
– Modules – Sets of TFBS within a region that function together
• Transcriptional Unit
– DNA sequence transcribed as a single polycistronic mRNA
Complexity in Transcription
[Figure: a gene in its chromatin context, labeled with Chromatin, Distal enhancers, Proximal enhancer and Core Promoter]
Lab Discovery of TF Binding Sites
[Figure: series of LUCIFERASE reporter constructs, one carrying a mutation, with Reporter Gene Activity shown on a 0% to 100% scale]
Identify functional regulatory region within a sequence and delineate
specific TFBS through mutagenesis (and in vitro binding studies)
EMSA/Gel Shift Assays
to Identify Binding Proteins
TF + DNA
DNA
http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg
High-throughput Methods
• SELEX
– mix random ds DNA oligonucleotides with TF protein,
recover TF-DNA complexes and sequence DNA
• Protein Binding Arrays
– prepare arrays with ds DNA attached, label protein with a
fluorescent mark and observe DNA bound by protein
• ChIP
– covalently link proteins to DNA in cell, shear DNA, recover
protein-DNA complexes and identify DNA (PCR, array or
sequencing)
Promoters
• In most vertebrates the delineation of the
transcription start position is not easy
• cDNA often incomplete at 5’ end
• Multiple promoters for many genes
• Referencing position relative to the initiation “site” is therefore
not a good idea
– But done almost uniformly in biological papers
• (Translation start equally problematic)
– Can be in internal exon
– Multiple start positions common
mRNA Caps for Mapping
Initiation Sites
• The 5’ end of an mRNA has a “cap” structure that can be precipitated with an antibody
– Allows for large-scale sequencing of “full-length” cDNAs and “tags” derived from the 5’ end of mRNAs
– RIKEN is the leading generator of such sequences
http://departments.oxy.edu/biology/Stillman/bi221/111300/26_18a.GIF
Classes of Initiation Regions
[Figure: initiation site usage by Position for two promoter classes. Bias: TATA Box (“Selective”) and Bias: CpG Island (“Broad”)]
This is over-simplified - see paper for greater detail.
Take home message is that promoters are not drawn
from a single continuous distribution of properties,
rather drawn from at least two classes.
Image from Carninci P, et al (2006). Genome-wide analysis of mammalian promoter
architecture and evolution. Nat Genet. Apr 28 PMID: 16645617
CpG Islands
• DNA methylation occurs in competition with histone
acetylation
• Acetylation promotes open chromatin structure that is permissive
for TF binding to DNA
• Methylation of DNA inhibits histone acetylation
• Certain TFs promote histone acetylation by recruiting acetylases
• Methylation occurs on cytosines
• Preferentially on cytosine adjacent to guanines (CG dinucleotides,
generally referred to as CpG)
• Methylated cytosines frequently undergo deamination to form thymine (CpG -> TpG)
• CpG Islands are regions of DNA where CG dinucleotides
occur at a frequency consistent with C and G
mononucleotide frequencies
• Highlight regions in which histones are acetylated – regions of
active transcription
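The CpG island definition above compares how often CG occurs with how often it would be expected from the C and G content alone. A minimal sketch of that comparison follows. The function name `cpg_obs_exp` and the two test sequences are hypothetical, and the ratio used, (number of CG × length) / (number of C × number of G), is the commonly used observed/expected form, assumed here rather than taken from the slides.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Values near 1 mean CG occurs about as often as the C and G
    mononucleotide frequencies predict; values well below 1 suggest
    CpG depletion (e.g. through methylation and deamination).
    """
    s = seq.upper()
    c, g = s.count("C"), s.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the true count
    cg = s.count("CG")
    return cg * len(s) / (c * g)


island_like = "CGCGGCGCCGACGTCGCG"          # hypothetical CpG-rich fragment
depleted = "CAGTGCTGACTGGCATGCCAGTCTGA"     # hypothetical fragment with no CG
print(round(cpg_obs_exp(island_like), 2))
print(round(cpg_obs_exp(depleted), 2))
```

In practice the ratio would be computed in sliding windows together with a GC-content cutoff; both thresholds are choices left open here.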
CpG Islands (2)
• Important to recognize that promoters
selectively active after early development will
not be acetylated (and hence will be
methylated) in the cell divisions preceding the
establishment of germ cells and therefore will
not have CpG islands.
• Lists of genes that have higher or lower CpG
frequencies than average can misleadingly
appear to have TF binding motifs based on
this compositional characteristic.
Section 3.1
What have we learned?
• Transcription controlled by regulatory regions
• Regulatory regions can be distant from
initiation regions
• Laboratory methods can identify regulatory
regions and TF binding sites
• Concept of single initiation site is flawed
• Promoters fall into subclasses
• CpG vs TATA
• Can impact assessment of motifs in sets of genes
Part 2
Prediction of
TF Binding Sites
Teaching a computer
to find TFBS…
Representing Binding Sites for a TF
• A single site
• AAGTTAATGA
• A set of sites represented as a consensus
• VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites:
A   14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C    3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G    4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T    0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
Logo – A graphical representation of frequency matrix. Y-axis is information content, which reflects the strength of the pattern in each column of the matrix
Set of binding sites
AAGTTAATGA
CAGTTAATAA
GAGTTAAACA
CAGTTAATTA
GAGTTAATAA
CAGTTATTCA
GAGTTAATAA
CAGTTAATCA
AGATTAAAGA
AAGTTAACGA
AGGTTAACGA
ATGTTGATGA
AAGTTAATGA
AAGTTAACGA
AAATTAATGA
GAGTTAATGA
AAGTTAATCA
AAGTTGATGA
AAATTAATGA
ATGTTAATGA
AAGTAAATGA
AAGTTAATGA
AAGTTAATGA
AAATTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
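As a rough illustration of how a set of aligned sites becomes a frequency matrix and a logo height, the sketch below counts bases per column and computes information content in bits. The function names are hypothetical, only the first four sites from the list above are used, and the information content assumes a uniform background (maximum 2 bits) with no small-sample correction.

```python
import math


def build_pfm(sites):
    """Count A/C/G/T at each position of equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {b: [0] * width for b in "ACGT"}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm


def information_content(pfm):
    """Per-column information content in bits: 2 - Shannon entropy."""
    n_sites = sum(pfm[b][0] for b in "ACGT")
    ic = []
    for i in range(len(pfm["A"])):
        entropy = 0.0
        for b in "ACGT":
            p = pfm[b][i] / n_sites
            if p > 0:
                entropy -= p * math.log2(p)
        ic.append(2.0 - entropy)
    return ic


sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
print([round(x, 2) for x in information_content(pfm)])
```

Columns where every site agrees reach the full 2 bits, while mixed columns score lower, which is what the logo's letter heights convey.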
Conversion of PFMs to Position
Specific Scoring Matrices (PSSM)
Add the following features to the matrix profile:
1. Correct for nucleotide frequencies in genome
2. Weight for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic
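pfm
A   5  0  1  0  0
C   0  2  2  4  0
G   0  3  1  0  4
T   0  0  1  1  1

$$\mathrm{pssm}(b,i) = \log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$$

f(b,i): frequency of base b at position i; s(n): correction for the number of sites n (depth); p(b): background frequency of base b

pssm
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2

TGCTG = 0.9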
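The conversion can be sketched in a few lines. The specific form used below is an assumption: a pseudocount of sqrt(N) spread according to the background, a uniform background of 0.25 per base, and log base 2. With those choices the sketch reproduces the pssm values on this slide to one decimal place, but the slide itself does not spell out s(n) or the log base. Function names are hypothetical.

```python
import math

# Counts from the slide's example pfm (5 sites, 5 positions)
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pssm(pfm, background=None):
    """Convert counts to log-odds scores (assumed sqrt(N) pseudocount)."""
    background = background or {b: 0.25 for b in "ACGT"}
    pssm = {b: [] for b in "ACGT"}
    for i in range(len(pfm["A"])):
        n = sum(pfm[b][i] for b in "ACGT")      # depth of this column
        pseudo = math.sqrt(n)                    # assumed confidence weighting
        for b in "ACGT":
            corrected = (pfm[b][i] + pseudo * background[b]) / (n + pseudo)
            pssm[b].append(math.log2(corrected / background[b]))
    return pssm


def score(pssm, site):
    """Raw score: sum of the matrix cells selected by the site's bases."""
    return sum(pssm[b][i] for i, b in enumerate(site.upper()))


pssm = pfm_to_pssm(PFM)
for base in "ACGT":
    print(base, [round(v, 1) for v in pssm[base]])
print("TGCTG =", round(score(pssm, "TGCTG"), 1))
```

Working in log space is what makes scoring a site a simple sum of one cell per column.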
PSSM Scoring Scales
• Raw scores
• Sum of values from indicated cells of the matrix
• Relative Scores (most common)
• Normalize the scores to range of 0-1 or 0%-100%
• Empirical p-values
• Based on distribution of scores for some DNA sequence,
determine a p-value (see next slide)
Detecting binding sites in a single sequence
Sp1 profile scored against the sequence:
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC
[Figure: the Sp1 PSSM (rows A, C, G, T), shown three times to mark the cells summed for the absolute, maximum and minimum scores]
Raw Scores
Abs_score = 13.4 (sum of column scores)
Relative Scores
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)

$$\mathrm{Rel\_score} = \frac{\mathrm{Abs\_score} - \mathrm{Min\_score}}{\mathrm{Max\_score} - \mathrm{Min\_score}} \times 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \times 100\% \approx 93\%$$

Empirical p-value Scores
p-value = (Area to right of value) / (Area under entire curve)
[Figure: distribution of Frequency (y-axis, 0.0 to 0.3) over Relative Score (x-axis, 0.0 to 1.0)]
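The relative score and empirical p-value from this slide can be sketched as two small functions. The function names and the toy background scores are hypothetical; the three input values are the Sp1 numbers given above.

```python
def relative_score(abs_score, min_score, max_score):
    """Rescale a raw score to the 0-1 range spanned by the matrix."""
    return (abs_score - min_score) / (max_score - min_score)


def empirical_p_value(score, background_scores):
    """Fraction of background scores at least as high as `score`.

    This is the 'area to the right of the value' divided by the
    'area under the entire curve' for a sampled score distribution.
    """
    hits = sum(1 for s in background_scores if s >= score)
    return hits / len(background_scores)


rel = relative_score(13.4, -10.3, 15.2)
print(f"{rel:.0%}")

# Hypothetical relative scores obtained by scanning some background DNA
background = [0.41, 0.55, 0.62, 0.70, 0.93, 0.58, 0.49, 0.66, 0.95, 0.52]
print(empirical_p_value(rel, background))
```

The p-value depends entirely on which DNA is chosen as background, so the same relative score can look more or less significant under different choices.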
JASPAR:
AN OPEN-ACCESS DATABASE
OF TF BINDING PROFILES
( jaspar.genereg.net )
The Good…
• Tronche (1997) tested 50 predicted HNF1
TFBS using an in vitro binding test and found
that 96% of the predicted sites were bound!
BINDING
ENERGY
• Stormo and Fields (1998) found in detailed
biochemical studies that the best weight
matrices produce scores highly correlated
with in vitro binding energy
PSSM SCORE
…the Bad…
• Fickett (1995) found that a profile for the
myoD TF made predictions at a rate of 1 per
~500bp of human DNA sequence
– This corresponds to an average of 20 sites / gene
(assuming 10,000 bp as average gene size)
…and the Ugly!
Human Cardiac α-Actin gene analyzed
with a set of profiles
(each line represents a TFBS prediction)
Futility Conjecture:
TFBS predictions are
almost always wrong
Red boxes are protein-coding exons; TFBS predictions within them are excluded in this analysis
ADVANCED TOPIC
Issues of Column Independence
• PSSM model assumes independence
between positions
• For example, if you observe a G at position 2, the model
assumes there is no influence on the likelihood of a T at
position 3 - this is known to be an incorrect assumption
• Other models can represent dependence
• Hidden Markov models of Nth order, where N refers to
the number of influencing positions
• For the very few cases where there are hundreds of
TFBS known for a TF, there has been only modest
improvement in the specificity of TFBS predictions using
advanced column inter-dependent models
A Conundrum…
[Figure: positive predictive value (PPV) plotted against score THRESHOLD]
• Counter to intuition, the ratio of true positives
to predictions fails to improve for “stringent”
thresholds
• For most predictive models this ratio would increase
• Why?
• True binding sites are defined by properties not
incorporated into the profile scores - above some
threshold all sites could be bound if in the right setting
Section 3.2A
What have we learned?
• PSSMs accurately reflect in vitro binding properties of
DNA binding proteins
• Suitable binding sites occur at a rate far too frequent
to reflect in vivo function
• Bioinformatics methods that use PSSMs for binding
site studies must incorporate additional information to
enhance specificity
• Unfiltered predictions are too noisy for most applications
• Organisms with short regulatory sequences are less
problematic (e.g. yeast and E. coli)
Using Phylogenetic Footprinting
to Improve TFBS Discrimination
70,000,000 years of evolution
can reveal regulatory regions
Phylogenetic Footprinting
FoxC2 – a single exon gene
[Figure: % identity (y-axis, 0% to 80%) plotted along the sequence (x-axis, positions 0 to 7000)]
• Align orthologous gene sequences (e.g. LAGAN)
• For the first window of 100 bp of sequence#1, determine the % with identical match in sequence#2
• Step across the first sequence, recording the percentage of identical nucleotides in each window
• Observe that the single exon contains a region of high identity that corresponds to the ORF, with lower identity in the 5’ and 3’ UTRs
• Additional conserved regions could be regulatory regions
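The window-stepping procedure above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: it assumes the two sequences are already aligned (gaps as "-"), counts a gap as a mismatch, and places windows in alignment coordinates rather than in coordinates of sequence#1. The function name and the toy alignment are hypothetical.

```python
def window_identity(aln1, aln2, window=100, step=1):
    """Percent identity per window across two aligned sequences.

    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    # 1 where the two rows carry the same base, 0 for mismatches and gaps
    match = [int(a == b and a != "-") for a, b in zip(aln1.upper(), aln2.upper())]
    profile = []
    for start in range(0, len(match) - window + 1, step):
        identical = sum(match[start:start + window])
        profile.append((start, 100.0 * identical / window))
    return profile


# Hypothetical toy alignment: conserved block followed by a diverged block
seq1 = "ACGTACGTACGT" + "TTGACCA-GGTA"
seq2 = "ACGTACGTACGT" + "AAGTCCATGCTT"
for start, pct in window_identity(seq1, seq2, window=6, step=6):
    print(start, round(pct, 1))
```

Plotting the second value against the first gives the kind of identity profile shown on the slide, with conserved regions appearing as peaks.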
Phylogenetic Footprinting (cont)
% Identity
200 bp Window Start Position (human sequence)
Actin gene compared between human and mouse
Phylogenetic Footprinting Dramatically
Reduces Spurious Hits
Human
Mouse
Actin, alpha cardiac
TFBS Prediction with Human & Mouse
Pairwise Phylogenetic Footprinting
SELECTIVITY
SENSITIVITY
• Testing set: 40 experimentally defined sites in 15 well studied
genes (Replicated with 100+ site set)
• 75-80% of defined sites detected with conservation filter, while
only 11-16% of total predictions retained
1kbp insulin receptor promoter
screened with footprinting
Choosing the “right” species for
pairwise comparison...
CHICKEN
HUMAN
MOUSE
COW
HUMAN
HUMAN
Multi-species
Phylogenetic Footprinting
• PhastCons scores indicate the regions of
DNA that are unusually conserved in sequence
across some subset of the aligned organisms
ConSite
TFBS Discrimination Tools
• Phylogenetic Footprinting Servers
• FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php
• CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/
• rVISTA http://rvista.dcode.org/
• SNPs in TFBS Analysis
• RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home
• Prokaryotes
• PRODORIC http://prodoric.tu-bs.de/
• Software Packages
• TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php
• Programming Tools
• TFBS http://tfbs.genereg.net/
• ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk
Analysis of TFBS with Phylogenetic Footprinting
Scanning a single sequence
Low specificity of profiles:
• too many hits
• great majority not biologically significant
Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions
A dramatic improvement in the percentage of biologically significant detections
Section 3.2B
What have we learned?
• TFBS discrimination coupled with
phylogenetic footprinting has greater
specificity with tolerable loss of sensitivity
• As with any purification process, some true binding sites
will be lost
• Available online resources support
phylogenetic footprinting
Laboratory Exercise 3.2
TF Binding Site Prediction
20 minute break
Until 10:50am
Next: Sections 3.3 and 3.4
Part 3:
Inferring Regulating TFs for
Sets of Co-Expressed Genes
Deciphering Regulation of Co-Expressed Genes
Co-Expressed
Negative Controls
TFBS Over-representation
• Akin to the GO studies yesterday, we seek to
determine if a set of co-expressed genes
contains an over-abundance of predicted
binding sites for a known TF
• Phylogenetic footprinting to reduce false prediction rate
Two Examples of
TFBS Over-Representation
[Figure: two Foreground vs. Background comparisons, one labeled “More Total TFBS” and the other “More Genes with TFBS”]
Statistical Methods for Identifying
Over-represented TFBS
• Binomial test (Z scores)
– Based on the number of occurrences of the TFBS relative
to background
– Normalized for sequence length
– Simple binomial distribution model
• Fisher exact probability scores
– Based on the number of genes containing the TFBS
relative to background
– Hypergeometric probability distribution
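The two statistics can be sketched with the standard library alone. This is an illustrative reading of the bullets, not the exact oPOSSUM implementation: the Z-score treats each foreground nucleotide as a binomial trial with the background hit rate, and the Fisher score is a one-tailed hypergeometric probability over a 2×2 table in which foreground and background gene sets are assumed disjoint. Function names and the example counts are hypothetical.

```python
import math


def binomial_z(fg_hits, fg_len, bg_hits, bg_len):
    """Z-score for the number of TFBS hits, normalized for sequence length."""
    p = bg_hits / bg_len                      # background hit rate per nucleotide
    expected = fg_len * p
    sd = math.sqrt(fg_len * p * (1 - p))      # binomial standard deviation
    return (fg_hits - expected) / sd


def fisher_one_tailed(fg_with, fg_total, bg_with, bg_total):
    """P(at least fg_with foreground genes contain the TFBS) under the
    hypergeometric distribution."""
    total_genes = fg_total + bg_total
    total_with = fg_with + bg_with
    denom = math.comb(total_genes, fg_total)
    upper = min(total_with, fg_total)
    tail = sum(
        math.comb(total_with, x) * math.comb(total_genes - total_with, fg_total - x)
        for x in range(fg_with, upper + 1)
    )
    return tail / denom


# Hypothetical counts: 40 hits in 50 kb of foreground vs 2000 in 10 Mb of background
print(round(binomial_z(40, 50_000, 2_000, 10_000_000), 2))
# Hypothetical counts: 12 of 16 foreground genes vs 900 of 5000 background genes
print(fisher_one_tailed(12, 16, 900, 5000))
```

The two scores answer different questions, which is why a TF can rank highly on one and not the other: the first rewards many sites in total, the second rewards many genes with at least one site.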
oPOSSUM Procedure
1. Set of co-expressed genes
2. Automated sequence retrieval from EnsEMBL
3. Phylogenetic Footprinting (ORCA)
4. Detection of transcription factor binding sites
5. Statistical significance of binding sites
6. Putative mediating transcription factors
Validation using Reference Gene Sets
A. Muscle-specific (23 input; 16 analyzed)
TF            Rank   Z-score   Fisher
SRF           1      21.41     1.18e-02
MEF2          2      18.12     8.05e-04
c-MYB_1       3      14.41     1.25e-03
Myf           4      13.54     3.83e-03
TEF-1         5      11.22     2.87e-03
deltaEF1      6      10.88     1.09e-02
S8            7      5.874     2.93e-01
Irf-1         8      5.245     2.63e-01
Thing1-E47    9      4.485     4.97e-02
HNF-1         10     3.353     2.93e-01

B. Liver-specific (20 input; 12 analyzed)
TF            Rank   Z-score   Fisher
HNF-1         1      38.21     8.83e-08
HLF           2      11.00     9.50e-03
Sox-5         3      9.822     1.22e-01
FREAC-4       4      7.101     1.60e-01
HNF-3beta     5      4.494     4.66e-02
SOX17         6      4.229     4.20e-01
Yin-Yang      7      4.070     1.16e-01
S8            8      3.821     1.61e-02
Irf-1         9      3.477     1.69e-01
COUP-TF       10     3.286     2.97e-01

TFs with experimentally-verified sites in the reference sets.
Empirical Selection of Parameters based
on Reference Studies
[Figure: Z-score (y-axis, -20 to 40) plotted against Fisher p-value (x-axis, 1.0E-09 to 1.0E-01) for the Muscle, Liver and NF-κB reference sets, with the Z-score cutoff and Fisher cutoff marked; labeled points include p65, SRF, c-Rel, HNF-1, p50, TEF-1, MEF2, FREAC-2, Myf, cEBP, SP1 and HNF-3β]
C-Myc SAGE Data
• c-Myc transcription factor dimerizes with the Max
protein
• Key regulator of cell proliferation, differentiation and
apoptosis
• Menssen and Hermeking identified 216 different
SAGE tags corresponding to unique mRNAs that
were induced after adenoviral expression of c-Myc in
HUVEC cells
• They then went on to confirm the induction of 53
genes using microarray analysis and RT-PCR
Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)
TF          Class              Rank   Z-score   Fisher     No. Genes
Myc-Max     bHLH-ZIP           1      21.68     5.35e-03   7
Staf        ZN-FINGER, C2H2    2      20.17     1.70e-02   2
Max         bHLH-ZIP           3      18.32     2.16e-02   12
SAP-1       ETS                4      13.23     1.61e-04   13
USF         bHLH-ZIP           5      11.90     1.84e-01   16
SP1         ZN-FINGER, C2H2    6      11.68     4.40e-02   12
n-MYC       bHLH-ZIP           7      11.11     1.55e-01   20
ARNT        bHLH               8      11.11     1.55e-01   20
Elk-1       ETS                9      10.92     3.88e-03   19
Ahr-ARNT    bHLH               10     10.17     1.11e-01   25
C-Fos Microarray Experiment
• In a study examining the role of
transcriptional repression in oncogenesis,
Ordway et al. compared the gene expression
profiles of fibroblasts transformed by c-fos to
the parental 208F rat fibroblast cell line
• We mapped the list of 252 induced Affymetrix
Rat Genome U34A GeneChip sequences to
136 human orthologs
Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)
TF                 Class               Rank   Z-score   Fisher     No. Genes
c-FOS              bZIP                1      17.53     2.60e-05   45
RREB-1             ZN-FINGER, C2H2     2      8.899     1.41e-01   1
PPARgamma-RXRal    NUCLEAR RECEPTOR    3      3.991     2.98e-01   1
CREB               bZIP                4      3.626     1.25e-01   10
E2F                Unknown             5      2.965     7.67e-02   15
Structurally-related TFs with
Indistinguishable TFBS
• Ets example
oPOSSUM Server
Section 3.3
What have we learned?
• New generation of tools to help interrogate the
meaning of observed clusters of co-expressed genes
• Generally best performance has been with data
directly linked to a transcription factor
• Highly dependent on the experimental design – cannot
overcome noisy data from poor design (Recall Day 1)
• The identity of a mediating TF may not be apparent
when many proteins can bind to the same motif
Laboratory Exercise 3.3
TFBS Over-Representation Analysis
Part 4:
de novo Discovery
of TF Binding Sites
de novo Pattern Discovery
de novo Pattern Discovery
• String-based
– e.g. YMF (Sinha & Tompa)
– Generalization: Identify over-represented oligomers in
comparison of “+” and “-” (or complete) promoter collections
– Used often for yeast promoter analysis
• Profile-based
– e.g. AnnSpec (Workman & Stormo) or MEME (Bailey & Elkan)
– Generalization: Identify strong patterns in “+” promoter
collection vs. background model of expected sequence
characteristics
Assessing Discovered Patterns
• Strength
• Similarity search
String-based methods(1)
How likely are X words in a set of
sequences, given background sequence
characteristics?
>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
CCCGCCGGAATGAAATCTGATTGACATTTTCC
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
>EP55011 (-) Ce hsp 16K-1 B; range
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
String-based methods(2)
Find all words of length n in the yeast promoters (e.g. n=7)
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG
ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC
GGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGC
CAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGT
CTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAA
ATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTG
GGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAA
GTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTT
TGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGA
AAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCA
AAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATC
ACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGT
CGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAA
GGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG
ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAG
ATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCAT
CATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAA
ACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTG
AAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTG
TCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTC
TCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGT
ATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCT
AGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGT
CTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCT
GCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTG
CTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
Make a lookup table:
AAACCTTT   456
TTTTTTTT   57788
GATAGGCA   589
Etc...
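Building the lookup table amounts to sliding a window of length n along every promoter and tallying each word. A minimal sketch follows, using the first two sequence lines from the slide as input. The function name `count_words` is hypothetical, and only the given strand is counted; a real analysis would usually also consider the reverse complement.

```python
from collections import Counter


def count_words(sequences, n=7):
    """Tally every overlapping word of length n across all sequences."""
    counts = Counter()
    for seq in sequences:
        s = seq.upper()
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts


promoters = [
    "GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG",
    "ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC",
]
table = count_words(promoters, n=7)
print(len(table), "distinct words")
print(table.most_common(3))
```

The table has at most 4^n entries, which is why this approach stays practical only for short words.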
String-based methods(3)
$$Z_w = \frac{X_w - E[X_w]}{\sqrt{\mathrm{Var}[X_w]}}$$

X_w: Instances of a word w within our set of X genes
E[X_w]: Average number of instances of w based on number of genes in our set
Var[X_w]: Variance – how much deviation from the average is expected for w
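To make the Z-score concrete, the sketch below estimates E[X_w] and Var[X_w] from a very simple background model. The model is an assumption made for illustration: each word position is treated as an independent trial, and the word probability is the product of single-base background frequencies. Methods such as YMF use richer background models and account for overlapping occurrences, which this sketch ignores. Function names and inputs are hypothetical.

```python
import math


def word_probability(word, base_freq):
    """Probability of a word under independent single-base frequencies."""
    return math.prod(base_freq[b] for b in word)


def word_z_score(word, sequences, base_freq):
    """Z_w = (X_w - E[X_w]) / sqrt(Var[X_w]) under a binomial approximation."""
    n = len(word)
    # number of positions where the word could start
    positions = sum(max(len(s) - n + 1, 0) for s in sequences)
    observed = sum(
        1 for s in sequences for i in range(len(s) - n + 1) if s[i:i + n] == word
    )
    p = word_probability(word, base_freq)
    expected = positions * p
    variance = positions * p * (1 - p)
    return (observed - expected) / math.sqrt(variance)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
plus_set = [
    "CCCGCCGGAATGAAATCTGATTGACATTTTCC",
    "TTCAAATTTTAACGCCGGAATAATCTCCTATT",
    "TCGCTGTAACCGGAATATTTAGTCAGTTTTTG",
]
print(round(word_z_score("CCGGAAT", plus_set, uniform), 1))
```

A large positive value flags a word that occurs far more often than the background model predicts; how large is large enough depends on how many words are tested.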
Limitations of String-based Methods
• Longer word lengths not possible
• While degeneracy codes can be used, TFBS are not
words – we lose quantitation for variable positions
with consensus sequences
• Imagine column in PFM with 7 A’s and 1 T --- in a consensus
sequence we would represent as W or throw out the instance
with T
• Recently the string-based method has found renewed
utility in the analysis of 3’UTRs for the presence of
microRNA target sequences...
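The loss of quantitation in the 7 A's and 1 T example can be shown directly. The sketch below collapses a PFM column to an IUPAC degenerate symbol. The function name and the `min_count` parameter are hypothetical; the symbol table is the standard IUPAC nucleotide code.

```python
# Standard IUPAC codes keyed by the sorted set of allowed bases
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def column_to_iupac(counts, min_count=1):
    """Collapse one PFM column to a single IUPAC symbol.

    Bases seen fewer than `min_count` times are ignored, which is the
    'throw out the instance' option mentioned above.
    """
    present = "".join(sorted(b for b, c in counts.items() if c >= min_count))
    if not present:
        raise ValueError("no base reaches min_count")
    return IUPAC[present]


column = {"A": 7, "C": 0, "G": 0, "T": 1}
print(column_to_iupac(column))               # keeps the single T
print(column_to_iupac(column, min_count=2))  # discards the single T
```

Either way the 7:1 ratio is gone: W treats A and T as equally acceptable, while A forbids T entirely. A matrix column keeps both counts.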
microRNA Target Sequences
• Lim et al expressed miRNAs in cells and observed
that the overall pattern of gene expression shifted
toward the pattern of expression observed in cells
which naturally express the miRNA
• The genes with reduced expression in response to
miRNA exposure shared 7 nt motifs in the 3’UTR of their
transcripts
• Nice website tutorial:
• http://www.ambion.com/main/explorations/mirna.html
Probabilistic
Methods for Pattern
Discovery
•What is a probabilistic method?
•The Gibbs sampler algorithm
Probabilistic Methods
Overview:
Find a local alignment of width x of sites that
maximizes information content (or related
measure) in reasonable time
Usually by Gibbs sampling or EM methods
Motivation:
TFBS are not words
Efficiency – can handle longer patterns than string-based
methods
Can be intentionally influenced to reflect prior knowledge
What does probabilistic mean?
• Based on probability
• Functionally, it means we’re going to guess
our way to a good pattern (TFBS)
• We’re going to try to make a good guess
• Two different flavours of the approach
– Expectation Maximization in which we try to make
the best guess each time
– Gibbs Sampling in which we make our guesses
based on the strength of our conviction
Gibbs Sampling
Two data structures used:
1) Current pattern nucleotide frequencies $q_{i,1}, \ldots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \ldots, p_{i,4}$
2) Current positions of site startpoints in the N sequences $a_1, \ldots, a_N$, i.e. the alignment that contributes to $q_{i,j}$.
One starting point in each sequence is chosen randomly initially.
Example alignment:
tgacttcc
tgatctct
agacctca
tgacctct
Iterations in Gibbs Sampling
A: Remove one sequence z from the set. Update the current pattern according to

$$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}$$

where $c_{i,j}$ is the count of symbol j at position i in the remaining sites, $b_j$ is the pseudocount for symbol j, and $B$ is the sum of all pseudocounts in the column.
Example alignment with z removed:
tgacttcc
tgatctct
agacctca
tgacctct
B: ’Score’ the current pattern against each possible occurrence $a_k$ in z. Draw a new $a_k$ with probabilities based on respective score divided by the background model
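Steps A and B can be put together into a compact site sampler. This is a minimal sketch under stated assumptions, not the implementation of any published tool: one site per sequence, fixed width, a single shared pseudocount value, a background estimated from overall base composition, and a fixed number of sweeps with no convergence test or scoring of the final alignment. The function name, parameters and toy sequences are hypothetical.

```python
import random


def gibbs_motif_search(seqs, width, iterations=200, pseudocount=0.5, seed=0):
    """Return one sampled site start per sequence (sequences must be A/C/G/T)."""
    rng = random.Random(seed)
    bases = "ACGT"
    seqs = [s.upper() for s in seqs]
    # Background frequencies from overall composition (add-one smoothed)
    total = sum(len(s) for s in seqs)
    bg = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4) for b in bases}
    # One starting point in each sequence is chosen randomly initially
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    B = pseudocount * len(bases)   # sum of all pseudocounts in a column
    for _ in range(iterations):
        for z, seq in enumerate(seqs):
            # Step A: rebuild the pattern q from all sites except sequence z
            others = [s[a:a + width]
                      for k, (s, a) in enumerate(zip(seqs, starts)) if k != z]
            q = []
            for i in range(width):
                column = [site[i] for site in others]
                q.append({b: (column.count(b) + pseudocount) / (len(others) + B)
                          for b in bases})
            # Step B: weight every candidate start in z by pattern / background
            weights = []
            for a in range(len(seq) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seq[a + i]
                    ratio *= q[i][base] / bg[base]
                weights.append(ratio)
            # Draw the new start with probability proportional to its weight
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Hypothetical toy input with the same 8-mer embedded in each sequence
toy = ["ACGTTGACCTCAGG", "TTGACCTCATTTAC", "GGGCATGACCTCAA"]
found = gibbs_motif_search(toy, width=8, iterations=50)
print([s[a:a + 8] for s, a in zip(toy, found)])
```

Because new positions are drawn at random rather than always taking the best one, different seeds can give different answers; in practice the sampler is restarted many times and the best-scoring alignment is kept.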
Gibbs Sampling
(grossly over-simplified)
ttcgctcc
cgatacgc
tgctacct
tgacttcc
agacctca
ctgtagtg
acgcatct
    1 2 3 4 5 6 7 8
A   2 0 2 2 2 1 0 1
C   0 2 3 3 2 1 6 2
G   0 4 1 0 1 0 1 1
T   4 1 1 2 2 5 0 2
Pattern Discovery
• Gibbs sampling is guaranteed to return an
optimal pattern if repeated sufficiently often
• Procedure is fast, so running many 1000s of times is
feasible
• Unfortunately, we have a problem…what if
the mediating TFBS are not strongly overrepresented relative to other patterns…
Applied Pattern Discovery is Acutely
Sensitive to Noise
[Figure: PATTERN SIMILARITY vs. TRUE MEF2 PROFILE (y-axis, 10 to 18) plotted against SEQUENCE LENGTH (x-axis, 0 to 600) for sequences containing true Mef2 binding sites. Pink line is negative control with no Mef2 sites included]
Four Approaches to Improve
Sensitivity
• Better background models
– Higher-order properties of DNA
• Phylogenetic Footprinting
– Human:Mouse comparison eliminates ~75% of sequence
• Regulatory Modules
– Architectural rules
• Limit the types of binding profiles allowed
– TFBS patterns are NOT random
Pattern Discovery Summary
• Pattern discovery methods can recover overrepresented patterns in the promoters of coexpressed genes
• Methods are acutely sensitive to noise,
indicating that the signal we seek is weak
• TFs tolerate great variability between binding sites
• As for pattern discrimination, supplementary
information/approaches are required to overcome the noise
Laboratory Exercise 3.4
Motif Discovery
REFLECTIONS
• Part 2
– Futility Theorem – Essentially predictions of individual TFBS
have no relationship to an in vivo function
– Successful bioinformatics methods for site discrimination
incorporate additional information (clusters, conservation)
• Part 3
– TFBS over-representation is a powerful new means to
identify TFs likely to contribute to observed patterns of coexpression
• Part 4
– Pattern discovery methods are severely restricted by the
Signal-to-Noise problem
• Observed patterns must be carefully considered
– Successful methods for pattern discovery will have to
incorporate additional information (conservation, structural
constraints on TFs)
THE END
• Questions before the break?
• Lab exercises address Sections 2 and 3