
In Silico Identification of Promoters in
Prokaryotic Genomes
Manju Bansal
Molecular Biophysics Unit
Indian Institute of Science
Bangalore
Indo-Russia Workshop
Novosibirsk
12-14 Oct 2008
1
How does RNA polymerase know where to
start transcription?
It recognizes sequence motifs that match the consensus sequences in the -10 and -35 regions,
but large variability is seen. Similar sequences are also seen in non-promoter regions.
2
Some typical promoter sequence motifs
Layout: -35 motif, 17 bp spacer, -10 motif, TSS (+1)

Promoter     -35       -10
Consensus    TTGACA    TATAAT
araBAD       CTGACG    TACTGT
araC         TGGACT    GACACT
galP1        GTCACA    TATGGT
• Few sequence motifs exactly match the consensus sequence; large variability is seen.
• Similar sequences are seen in non-promoter regions.
3
Because:
• The sequence motifs are only 6-10 bp long and are degenerate, so the probability of
finding similar sequences in regions other than promoters is quite high.
• E. coli genome size: 4,639,221 bp
• E. coli DNA has ~1400 annotated promoter sites in the EcoCyc database but ~4500 annotated genes
• Number of '-10 consensus' hexamer sequences expected in E. coli:
1,058 (exact match, i.e. no mismatches from the consensus)
35,762 (with 1 mismatch)
326,746 (with 2 mismatches), e.g. consensus TATAAT vs TATGGT
OR
E. coli should have a '-10 like' sequence
every ~4,400 nt (exact match),
every ~130 nt (with 1 mismatch) or every ~14 nt (with 2 mismatches)
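The counting argument above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the slide: the function names are hypothetical, the scan covers one strand only, and it counts windows with exactly the stated number of mismatches (the slide does not say whether its counts are "exactly k" or "up to k", or whether both strands were scanned).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_motif_hits(seq: str, motif: str = "TATAAT", mismatches: int = 0) -> int:
    """Count windows of seq that differ from motif at exactly `mismatches` positions."""
    seq = seq.upper()
    k = len(motif)
    return sum(
        1
        for i in range(len(seq) - k + 1)
        if hamming(seq[i:i + k], motif) == mismatches
    )


def mean_spacing(genome_length: int, hits: int) -> float:
    """Average distance between hits if they were spread evenly over the genome."""
    return genome_length / hits


if __name__ == "__main__":
    toy = "TATAATTATGGT"  # toy sequence, not genomic data
    for k in (0, 1, 2):
        print(k, "mismatch(es):", count_motif_hits(toy, mismatches=k))

    # Spacing implied by the counts quoted on the slide
    for hits in (1058, 35762, 326746):
        print(hits, "hits -> one every", round(mean_spacing(4_639_221, hits)), "nt")
```

Dividing the genome length by each quoted count reproduces the spacings on the slide (roughly 4,400, 130 and 14 nt).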
4
Does this indicate that there are other signals
which help in positioning RNA polymerase?
Hence analysis of structural properties of a DNA sequence to
locate signals that are:
•Relevant to transcription from a
functional/mechanistic/structural point of view.
•Unique to the promoter sequences and can be used to
differentiate between promoter and non-promoters.
•Can be predicted from a given sequence. For example:
1) DNA STABILITY (Ability of DNA to Open up)
2) DNA CURVATURE (Intrinsically curved DNA structure)
3) DNA BENDABILITY (Ability of DNA to bend)
5
Why Stability?
•An important step in transcription is the formation
of an open complex which involves strand
separation of DNA duplex upstream of the
transcription start site (TSS)
•This separation takes place without the help of any
external energy.
•Hence evaluating stabilities of promoter sequences
may give some clues.
6
Stability of base paired dinucleotides,
based on Tm (melting temperature) data on a collection of 108 oligonucleotide duplexes.
SantaLucia J (1998) Proc. Natl. Acad. Sci. USA 95(4):1460-1465.
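A free energy profile of the kind shown in the following slides can be sketched by summing dinucleotide free energies over a sliding window. The values below are the unified nearest-neighbour free energies (kcal/mol, 37 °C) as commonly quoted from the cited paper; they are entered here from memory and should be checked against the original table before use. The 15-nt window is the size quoted on a later slide of this talk; the function name is hypothetical.

```python
# Assumed nearest-neighbour free energies (kcal/mol); verify against SantaLucia (1998).
# Keys are dinucleotide steps read 5'->3' on one strand; complementary steps share a value.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def stability_profile(seq: str, window: int = 15) -> list[float]:
    """Free energy of every `window`-nt stretch, as the sum of its window-1 dinucleotide steps."""
    seq = seq.upper()
    steps = [NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1)]
    n_steps = window - 1
    return [sum(steps[i:i + n_steps]) for i in range(len(steps) - n_steps + 1)]


if __name__ == "__main__":
    at_rich = "TATAATATTAAATATA"
    gc_rich = "GCGGCCGCGGCGCCGC"
    print("AT-rich:", [round(x, 2) for x in stability_profile(at_rich)])
    print("GC-rich:", [round(x, 2) for x in stability_profile(gc_rich)])
```

Less negative values mean a less stable duplex, so AT-rich stretches appear as peaks in such a profile.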
7
A representative free energy profile for 1000nt long
E. coli promoter sequence
8
Vertebrates: 252
E. coli: 227
Plants: 74
B. subtilis: 89
Kanhere and Bansal, Nucl. Acids Res. (2005) 33, 3165-3175
9
Curved DNA sequences are present in the upstream region

Gene name       Organism                          Reference
virF            Shigella flexneri                 Prosseda et al. (2004)
per-fdx         Clostridium perfringens           Kaji et al. (2003)
Streptokinase   Streptococcus equisimilis H46A    Malke et al. (2000)
aprE            Bacillus subtilis                 Jan et al. (2000)
nifLA           Klebsiella pneumoniae             Cheema et al. (1999)
GyrA            Streptococcus pneumoniae          Balas et al. (1998)

Distance of the bent site from TSS (values as listed): -137, -43, -98, -103, -95, -23
10
Roll at junction
Roll at every step
11
Dinucleotide parameters

Dinucleotide    Roll (°)   Tilt (°)   Twist (°)
CA/TG(BI)*        5.10      -0.31      31.03
GG/CC             5.02      -1.83      32.4
AG/CT             4.30       2.68      29.46
CG/CG             3.50       0.15      34.1
TA/TA             2.94       0.04      39.94
AA/TT             2.60      -1.66      35.58
AC/GT            -0.70      -0.15      33.29
AT/AT            -1.79       0.21      32.49
GA/TC            -2.31      -0.88      38.55
GC/GC            -6.49      -0.18      38.61
CA/TG(BII)*      -7.50       0.68      47.62

Bansal M (1996) Biological Structure and Dynamics,
Proceedings of the Ninth Conversation (Vol. I) pp 121-134
12
A representative intrinsic curvature profile for 1000nt
long E. coli promoter sequence
13
Kanhere and Bansal, Nucleic Acids Research (2005) 33, 3165-3175
14
DNA bendability
DNA
Protein
15
Kanhere and Bansal, Nucl. Acids Res. (2005) 33, 3165-3175
16
Distribution of different signals in
272 E. coli promoters
Stability: 79%
Bendability: 43%
Curvature: 42%
90% of sequences show at least one signal
10% of sequences show no signal
[Venn diagram; region percentages as labelled: 24%, 19%, 17%, 19%, 3%, 2%, 4%]
17
Hence:
• The regions upstream and downstream of the TSS show considerable
differences in their properties.
• The upstream region is less stable, more rigid and more curved
than the downstream region, in prokaryotic and
eukaryotic genomes.
• The stability signal is much more common than the other two signals.
• Some of the promoters that show none of the three
signals are internal, secondary or weak promoters.
18
• Can incorporating these features help
improve promoter prediction tools?
Since the low-stability signature was found to be the most
common in promoters, it was examined first.
E. coli promoter data were studied in detail, with
B. subtilis and M. tuberculosis as further examples.
19
Average stability profile for 429 E. coli promoters (from EcoCyc Database V 9.1),
located at least 500 nt apart
20
Nucleotide composition (as fractions) for three bacterial systems.
The difference between Mtb and the others is clearly seen.

                                   E. coli                      B. subtilis                  M. tuberculosis
Region                             A     T     G     C     A+T  A     T     G     C     A+T  A     T     G     C     A+T
Whole genome                       0.25  0.24  0.25  0.25  0.49 0.28  0.28  0.22  0.22  0.56 0.17  0.17  0.33  0.33  0.34
Upstream region (-200 to -100)     0.27  0.27  0.23  0.23  0.54 0.31  0.28  0.21  0.20  0.59 0.19  0.17  0.33  0.32  0.35
Downstream region (100 to 200)     0.25  0.25  0.26  0.24  0.50 0.31  0.26  0.23  0.19  0.57 0.17  0.18  0.34  0.31  0.35
Promoter region (-80 to +20)       0.28  0.29  0.21  0.21  0.58 0.34  0.32  0.18  0.15  0.66 0.20  0.20  0.31  0.29  0.40

The composition was calculated over 101-nt-long regions of the promoter sequences (-200 to -100, 100 to 200 and
-80 to +20 with respect to the TSS). With TSSs at least 200 nt apart, 582 promoter sequences from E. coli, 305 from
B. subtilis and 42 from M. tuberculosis were obtained.
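The composition figures can be reproduced along the following lines. This is a minimal sketch with hypothetical helper names. It assumes the TSS sits at coordinate 0, so that -80 to +20 spans 101 nt as stated in the caption; sequences are plain strings.

```python
from collections import Counter


def region(seq: str, tss_index: int, start: int, end: int) -> str:
    """Slice seq from `start` to `end` (inclusive) relative to the TSS at seq[tss_index]."""
    return seq[tss_index + start: tss_index + end + 1]


def base_composition(seq: str) -> dict[str, float]:
    """Fractions of A, T, G, C and A+T, ignoring any other characters."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ATGC")
    comp = {b: counts[b] / total for b in "ATGC"}
    comp["A+T"] = comp["A"] + comp["T"]
    return comp


if __name__ == "__main__":
    # Toy 1001-nt sequence with the TSS at index 500 (not real data)
    toy = ("ATGC" * 251)[:1001]
    promoter = region(toy, 500, -80, 20)
    print(len(promoter), base_composition(promoter))
```

Averaging the per-sequence fractions over all promoters of an organism gives one row of the table above.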
21
Average stability profile for promoter sequences that are 500 nt apart
A) 429 E. coli promoters (from EcoCyc Database V 9.1)
B) 239 B. subtilis promoters (from DBTBS Database)
C) 40 M. tuberculosis promoters
(from MtbRegList Database)
One sharp peak corresponding to
high A+T content seen
22
23
Sensitivity and precision for promoter prediction of experimentally verified
bacterial TSSs that are 500 nt apart.

                                                  E. coli    B. subtilis    M. tuberculosis
Cutoff values applied (kcal/mol)
  E-cutoff                                        -18.7      -17.2          -21.0
  D-cutoff                                        1.0        1.5            1.0
Total no. of promoter sequences of 1001 nt
length considered for analysis                    429        239            40
No. of True Positives                             366        184            27
No. of False Positives                            272        131            28
No. of False Negatives after I cycle              58         71             6
No. of False Negatives after II cycle             6          12             0
Calculated Sensitivity = TP/(TP+FN)               0.98       0.94           1
Calculated Precision = TP/(TP+FP)                 0.57       0.58           0.49
False negatives after first cycle are taken for the second cycle promoter prediction, with E1 window
size of 50nt. False negatives remaining after second cycle are considered for sensitivity calculation.
True positives and false positives are added up after first and second cycle prediction.
Definition of TP, FP: V Rangannan and M Bansal, J. Biosci. 32, 851-862 (2007).
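As a quick check, the sensitivity and precision values can be recomputed from the counts in the table, using the false negatives remaining after the second cycle as the footnote describes. The snippet is only a worked example of the two formulas; the dictionary layout is illustrative.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of known promoters that were recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of predictions that were correct: TP / (TP + FP)."""
    return tp / (tp + fp)


# (TP, FP, FN after II cycle) as given in the table
counts = {
    "E. coli": (366, 272, 6),
    "B. subtilis": (184, 131, 12),
    "M. tuberculosis": (27, 28, 0),
}

if __name__ == "__main__":
    for organism, (tp, fp, fn) in counts.items():
        print(f"{organism}: sensitivity={sensitivity(tp, fn):.2f}, "
              f"precision={precision(tp, fp):.2f}")
```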
24
Average stability profile for all 4461 genes in E. coli,
aligned w.r.t. their TLS
Average stability profile for 4461
E. coli gene sequences of 1001 nt
length (-500 to +500 w.r.t. TLS)
Number of nucleotides between each TSS (729 in all)
and the TLS (considering the first gene
encountered).
Min dist = 0, Max dist = 708
25
E. coli – Average stability profile for 1089 protein promoter sequences and 59 RNA promoter sequences
E. coli – Average stability profile for 34 tRNA promoter sequences and 13 other RNA promoter sequences
26
Whole genome annotation for promoter regions in E. coli and B. subtilis

E. coli
                             Forward strand              Reverse strand
                             Protein-coding  RNA-coding  Protein-coding  RNA-coding  Total
No of TSSs                   507             34          582             25          1145 (a)
No of genes                  2089            109         2185            73          4456
No of predictions            4369                        4354                        8723
TP calculated w.r.t.
gene TLS (b)                 1329            75          1596            48          3048 (68%)
FP calculated w.r.t.
gene TLS (b)                 1004            9           795             4           1812
TP calculated w.r.t.
TSS (c)                      394             23          429             18          864 (75%)

B. subtilis
                             Forward strand              Reverse strand
                             Protein-coding  RNA         Protein-coding  RNA         Total
No of TSSs                   305             5           302             2           613 (a)
No of genes                  1942            85          2164            34          4225
No of predictions            2692                        2720                        5412
TP calculated w.r.t.
gene TLS (b)                 866             30          1142            9           2038 (48%)
FP calculated w.r.t.
gene TLS (b)                 522             0           425             0           947
TP calculated w.r.t.
TSS (c)                      167             4           189             2           362 (59%)

(a) 3 TSSs of E. coli and 1 TSS of B. subtilis regulate protein as well as RNA genes.
(b) True and false positives are identified against the genes in the forward and reverse strands.
(c) A true positive is calculated with respect to the annotated TSS (located in the -150 to +50 nt region w.r.t. the TSS).
• 63% and 68% accuracy (precision) achieved for E. coli and B. subtilis respectively, w.r.t. TLS.
• 75% and 59% reliability achieved for E. coli and B. subtilis respectively, w.r.t.
annotated TSS (against 37% for SIDD with 927 TSSs in E. coli).
27
Whole genome annotation of promoter regions over M. tuberculosis genome

M. tuberculosis
                             Forward strand              Reverse strand
                             Protein-coding  RNA-coding  Protein-coding  RNA-coding  Total
No of genes                  2010            27          1989            23          4049
No of predictions            3153                        3163                        5316
TP calculated w.r.t.
gene TLS                     692             14          938             6           1650 (41%)
FP calculated w.r.t.
gene                         1032            6           802             3           1843
28
All false positives need not be REAL false positives
• In prokaryotic genomes, the intergenic region is very small (~ 12%).
• Experimental evidence shows that for some genes the regulating transcription start site
lies within the coding region of an upstream neighboring gene.
• For example, the E.coli rpoS gene has its transcribing TSS (rpoSp) within the coding
region of nlpD gene and 567 nt away from its own TLS.
Lange R, Fischer D and Hengge-Aronis R., J Bacteriol. (1995); 177(16):4676-80
29
Distribution of coding and intergenic regions in the bacterial genomes
Histograms showing the distribution of predicted promoter regions in different genomic
regions in E. coli, B. subtilis and M. tuberculosis genomes. Color coding for intergenic
and coding region are shown on top right.
30
Predicted promoter region distribution in E. coli genome
(over ALL 1145 Ecocyc annotated, 1001 nt long promoter sequences).
31
Comparison of our method of promoter prediction with NNPP, w.r.t. TLS at position 0
32
Average energy profile for E.coli genomic fragment 9000bp to 15300bp
33
Average energy profile for E.coli genomic fragment 3483400bp to 3487000bp (DIV intergenic region)
34
Average energy profile for E.coli genomic fragment 2863000bp to 2867600bp (CON intergenic region)
35
Conclusions
• Relative stability of DNA in neighboring regions can help in
annotating for promoter regions in whole genomes
• The method is quite general and shown to work for genomes
with varying AT/GC content.
• The stability criterion performs better than other commonly used
methods based on sequence motif search, as well as the
superhelix induced destabilization in DNA (SIDD) method.
36
No of promoter sequences grouped according to their %GC content in the
three bacterial systems

          No of sequences analyzed*
%GC       E. coli    B. sub    Mtb
30-35     -          6         -
35-40     16         61        -
40-45     47         168       -
45-50     183        47        -
50-55     193        -         -
55-60     18         -         -
60-65     -          -         25
65-70     -          -         15
Total     457        282       40

• TSSs which are 500 nt apart are
considered in E. coli, B. subtilis and
M. tuberculosis.
• GC categorization is based on
the %GC over 1001-nt-long promoter
sequences (ranging from -500 to +500
w.r.t. TSS).
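The grouping step can be sketched as below. The 5% bin width matches the table; treating the lower bin edge as inclusive is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """Percentage of G and C among the A/T/G/C bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total


def gc_bin(seq: str, width: int = 5) -> tuple[int, int]:
    """Return the (low, high) %GC interval containing seq; low edge inclusive (assumed)."""
    low = int(gc_percent(seq) // width) * width
    return (low, low + width)


def group_by_gc(seqs: list[str], width: int = 5) -> Counter:
    """Count how many sequences fall into each %GC interval."""
    return Counter(gc_bin(s, width) for s in seqs)


if __name__ == "__main__":
    toy = ["GGCCAATTAA", "GCGCGCATAT", "ATATATATGC"]  # toy sequences, not real promoters
    for interval, n in sorted(group_by_gc(toy).items()):
        print(interval, n)
```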
37
Average free energy distribution over promoter sequences with diverse GC
composition
(A) -500 to +500 region with respect to TSS
(B) -80 to +20 region with respect to TSS
• For 1001-nt-long promoter regions, the average free energies of regions with similar GC
composition are approximately the same, with E. coli and B. subtilis nearly overlapping
for the %GC intervals 35-40%, 40-45% and 45-50%.
38
Thresholds of free energy values used to predict promoters in
genomic DNA with varying GC content
E is the average free energy over the -80 to +20 region of known promoters, and D is the
difference between E and the average free energy over random sequences generated from
downstream (+100 to +500 region) genomic sequence (REav).
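How two such thresholds might be applied when scanning a genome can be sketched as follows. This is an illustrative reading, not the published procedure: it assumes a per-position free energy profile is already available, takes E1 as the mean over a 100-position window and E2 as the mean over the next 100 positions downstream, and flags a position when E1 is at least the E-cutoff and E1 - E2 is at least the D-cutoff. The window sizes, the comparison rule and the function names are all assumptions.

```python
def mean(values: list[float]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def predict_positions(profile: list[float], e_cutoff: float, d_cutoff: float,
                      window: int = 100) -> list[int]:
    """Indices n where the window starting at n is less stable than both thresholds require.

    E1 = mean free energy over profile[n : n + window]
    E2 = mean free energy over the following `window` positions (downstream)
    A hit needs E1 >= e_cutoff and (E1 - E2) >= d_cutoff (assumed rule).
    """
    hits = []
    for n in range(len(profile) - 2 * window + 1):
        e1 = mean(profile[n:n + window])
        e2 = mean(profile[n + window:n + 2 * window])
        if e1 >= e_cutoff and (e1 - e2) >= d_cutoff:
            hits.append(n)
    return hits


if __name__ == "__main__":
    # Synthetic profile: a less stable stretch followed by a more stable one
    profile = [-18.0] * 100 + [-20.0] * 100
    print(predict_positions(profile, e_cutoff=-18.7, d_cutoff=1.0))
```

Because less negative free energy means lower stability, both tests pick out stretches that are less stable than typical genomic DNA and less stable than their downstream neighbourhood.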
39
41
Stability characteristics of TF binding site (e.g. CRP)
CRP binding region: glpTQp
[Figure: two free energy profiles for glpTQp. Axes: free energy (kcal/mol) versus distance from TSS. One panel spans -600 to 600; the other spans -160 to -100.]
Region of high stability corresponds to a binding site for CRP in E. coli.
The high stability trough extends for ~22 nucleotides (window size = 15 nt),
which is the same as the footprint size of the protein reported in the literature.
42
E. coli CRP binding site consensus sequence for 209 sites
Position specific base composition for 209 CRP
binding sites
[Figure: percent base composition (A, T, G, C; 0% to 100%) at positions 0 to 20.]
43
CRP: Average stability profile
Average stability profile: 209 CRP binding sites
[Figure: free energy (kcal/mol) versus position (-10 to 30).]
Series: CRP average stability for 209 sites; CRP average stability for scrambled sequences
44
CRP: Average stability profile for manipulated sequences
NNNNNNNNNNNNNTGTGANNNNNNACACANNNNNNNNNNNNN
Labelled parts: 5’ flanking region, 6-nt linker, 3’ flanking region
[Figure, two panels: "CRP: Average stability profile for 209 non-redundant binding sites" and "CRP: Average stability profiles for 209 sequences". Axes: free energy (kcal/mol) versus position (-5 to 30; first nucleotide of protein binding site taken as 0).]
Series: WT sequences; scrambled sequences; flanking regions scrambled; linker and flanking regions scrambled; linker region scrambled; GC inverted linker; GC inverted flanking sequences
45
CRP: Average bendability profile
Bendability profiles for CRP binding sites (209)
TGTGANNNNNNACACA
[Figure: bendability versus position (-10 to 30).]
Series: average bendability for CRP (209 sequences); average bendability for scrambled CRP sequences
46
Acknowledgements:
Dr Dhananjay Bhattacharyya
Dr Aditi Kanhere
Ms Vetriselvi R
Mr Vikas Sarma
Mr Nishad Matange
Financial Support:
Dept of Biotechnology, India
Thank You
47
48
Coding and inter-genic region distribution in E. coli and B. subtilis genome.
Histograms show the distribution of predicted promoter regions in different intergenic regions
in E.coli and B.subtilis genomes (as per the color coding in the legend).
49
NarL: Binding site consensus sequence
Percent base composition at each position for 76 NarL
binding sequences
[Figure: percent base composition (A, T, G, C; 0% to 100%) at positions -9 to 17.]
50
NarL: Average stability profile
Average stability profile for 76 NarL binding sequences
[Figure: free energy (kcal/mol) versus position (-6 to 14).]
Series: NarL average stability profile for 76 sequences; average stability profile for scrambled sequences
51
NarL: Average bendability profile
Average bendability for 76 NarL binding sites
[Figure: bendability versus position (-6 to 14).]
Series: average bendability for NarL binding sites (76 sequences); average bendability for scrambled NarL binding site sequences
52
Definition of thresholds of free energy values used to predict promoters in bacterial genome
sequences. G specifies the average free energy over the entire genome. E is the average free energy
over known promoter regions. All energy values are in kcal/mol and the standard deviation values are
also indicated. E-cutoff and D-cutoff are the thresholds used to predict promoter regions.
                                         E. coli          B. subtilis      M. tuberculosis
Average free energy G calculated over whole genome sequence
  Mean G                                 -20.10           -18.88           -22.49
  Standard Deviation (σ)                 0.13             0.06             0.15
  GEav (Mean+3σ)                         -19.70           -18.72           -22.04
Average free energy E calculated over upstream region of TSS
  Upstream region considered             -80 to +20       -80 to +20       -40 to +20
  with respect to TSS                    (101 nt length)  (101 nt length)  (61 nt length)
  Mean E                                 -18.70           -17.20           -21.02
  Standard Deviation (σ)                 0                0                0
  E-cutoff (Mean+3σ)                     -18.70           -17.20           -21
D-cutoff (E-cutoff – GEav)               1.0              1.5              1.0
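The threshold arithmetic can be followed with a short worked example that applies the two formulas in the table (Mean+3σ, and E-cutoff minus GEav) to the tabulated means and standard deviations. Recomputed values can differ from the tabulated ones in the second decimal place, presumably because the slide's inputs are already rounded; the variable names are illustrative.

```python
def mean_plus_k_sigma(mean: float, sigma: float, k: float = 3.0) -> float:
    """Threshold placed k standard deviations above the mean."""
    return mean + k * sigma


# (mean G, sigma G, mean E, sigma E) as tabulated, in kcal/mol
table = {
    "E. coli": (-20.10, 0.13, -18.70, 0.0),
    "B. subtilis": (-18.88, 0.06, -17.20, 0.0),
    "M. tuberculosis": (-22.49, 0.15, -21.02, 0.0),
}

if __name__ == "__main__":
    for organism, (g_mean, g_sd, e_mean, e_sd) in table.items():
        ge_av = mean_plus_k_sigma(g_mean, g_sd)      # GEav = Mean G + 3 sigma
        e_cutoff = mean_plus_k_sigma(e_mean, e_sd)   # E-cutoff = Mean E + 3 sigma
        d_cutoff = e_cutoff - ge_av                  # D-cutoff = E-cutoff - GEav
        print(f"{organism}: GEav={ge_av:.2f}, E-cutoff={e_cutoff:.2f}, "
              f"D-cutoff={d_cutoff:.1f}")
```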
Specific cutoff for diverse %GC sequences
• The number of predicted promoters, as well as the length of the predicted promoter regions,
increased with the generalized cutoff derived for E. coli.
54
Variation in base composition and average free energy (AFE) in different regions of bacterial genomes. Promoter
sequences of 491, 283 and 40 TSSs which are 500 nt apart are considered from E. coli, B. subtilis and
M. tuberculosis respectively. Sequences are aligned with respect to the TSS. Standard deviation from the respective
mean is given in brackets.
Region extracted from respective genome with
respect to TSS
(Length of the region)
AFE
G+C
AFE
G+C
AFE
G+C
Upstream
region
-500 to -100
(401 nt)
-19.9
(1.0)
0.49
(0.06)
-18.8
(0.8)
0.43
(0.05)
-22.4
(0.7)
0.65
(0.03)
-500 to -100
(401 nt) shuffled sequence
-19.6
(1.0)
100 to 500
(401nt)
-20.1
(0.7)
100 to 500
(401nt) shuffled sequence
-19.9
(0.7)
-80 to +20
(101nt)
-18.6
(1.3)
-80 to +20
(101nt) shuffled sequence
-18.5
(1.2)
-500 to +500
(1001nt)
-19.8
(0.7)
-500 to +500
(1001nt) shuffled sequence
-19.5
(0.6)
Downstream
region
Promoter
region
Longer region
Whole genome
E. coli
-20.1
(2.4)
B. subtilis
-19.6
(0.8)
0.49
(0.04)
-19.0
(0.7)
-22.1
(0.6)
0.44
(0.04)
-18.7
(0.7)
0.42
(0.08)
-17.1
(1.0)
-18.6
(0.5)
0.33
(0.06)
-18.9
(2.3)
0.66
(0.03)
-21.4
(1.0)
0.61
(0.05)
-21.4
(0.9)
0.42
(0.03)
-18.4
(0.5)
0.51
-22.5
(0.5)
-22.3
(0.5)
-17.0
(0.9)
0.49
(0.04)
M. tuberculosis
-22.3
(0.4)
0.65
(0.02)
-22.1
(0.33)
0.44
-22.5
(2.1)
55
0.66
Average stability profile for promoter sequences from three different organisms
491 E. coli promoters from EcoCyc Database version 11.0
239 B. subtilis promoters from DBTBS Database
40 M. tuberculosis promoters from MtbRegList Database