In Silico Identification of Promoters in Prokaryotic Genomes
Manju Bansal
Molecular Biophysics Unit, Indian Institute of Science, Bangalore
Indo-Russia Workshop, Novosibirsk, 12-14 Oct 2008
1

How does RNA polymerase know where to start transcription?
It recognizes sequence motifs that match the consensus sequences in the -10 and -35 regions, but these motifs show large variability, and similar sequences are also seen in non-promoter regions.
2

Some typical promoter sequence motifs
            -35       -10
Consensus   TTGACA    TATAAT
araBAD      CTGACG    TACTGT
araC        TGGACT    GACACT
galP1       GTCACA    TATGGT
(The -35 and -10 elements are separated by a 17 bp spacer; the TSS is at position 1.)
• Few sequence motifs exactly match the consensus sequence; large variability is seen.
• Similar sequences are seen in non-promoter regions.
3

Because the sequence motifs are only 6-10 bp long and are degenerate, the probability of finding similar sequences in regions other than promoters is quite high.
• E. coli genome size: 4,639,221 bp
• E. coli DNA has ~1400 annotated promoter sites in the EcoCyc database but ~4500 annotated genes
• Number of '-10 consensus' hexamer sequences expected in E. coli: 1058 (exact match, i.e. no mismatches/changes from the consensus), 35,762 (1 mismatch), 326,746 (2 mismatches), e.g. consensus TATAAT vs TATGGT
OR
E. coli should have a '-10 like' sequence every 4400 nt (exact match), every 130th nt (1 mismatch) or every 14th nt (2 mismatches)
4

Does this indicate that there are other signals which help in positioning RNA polymerase?
Hence analysis of the structural properties of a DNA sequence, to locate signals that are:
• Relevant to transcription from a functional/mechanistic/structural point of view.
• Unique to promoter sequences, so that they can be used to differentiate between promoters and non-promoters.
• Predictable from a given sequence.
For example:
1) DNA STABILITY (ability of DNA to open up)
2) DNA CURVATURE (intrinsically curved DNA structure)
3) DNA BENDABILITY (ability of DNA to bend)
5

Why Stability?
• An important step in transcription is the formation of an open complex, which involves strand separation of the DNA duplex upstream of the transcription start site (TSS).
• This separation takes place without the help of any external energy.
• Hence evaluating the stabilities of promoter sequences may give some clues.
6

Stability of base-paired dinucleotides, based on Tm (melting temperature) data on a collection of 108 oligonucleotide duplexes.
SantaLucia J (1998) Proc. Natl. Acad. Sci. USA 95(4):1460-1465.
7

A representative free energy profile for a 1000 nt long E. coli promoter sequence
8

[Figure panels: Verteb: 252; E. coli: 227; Plants: 74; B. subtilis: 89]
Kanhere and Bansal, Nucl. Acids Res. (2005) 33, 3165-3175
9

Curved DNA sequences are present in the upstream region
Gene name       Organism                            Distance of the bent site from TSS   Reference
virF            Shigella flexneri                   -137                                 Prosseda et al. (2004)
per-fdx         Clostridium perfringens             -43                                  Kaji et al. (2003)
Streptokinase   Streptococcus equisimilis H46A      -98                                  Malke et al. (2000)
aprE            Bacillus subtilis                   -103                                 Jan et al. (2000)
nifLA           Klebsiella pneumoniae               -95                                  Cheema et al. (1999)
GyrA            Streptococcus pneumoniae            -23                                  Balas et al. (1998)
10

[Figure labels: Roll at junction; Roll at every step]
11

Dinucleotide parameters
Dinucleotide    Roll (°)   Tilt (°)   Twist (°)
CA/TG(BI)*       5.10      -0.31      31.03
GG/CC            5.02      -1.83      32.4
AG/CT            4.30       2.68      29.46
CG/CG            3.50       0.15      34.1
TA/TA            2.94       0.04      39.94
AA/TT            2.60      -1.66      35.58
AC/GT           -0.70      -0.15      33.29
AT/AT           -1.79       0.21      32.49
GA/TC           -2.31      -0.88      38.55
GC/GC           -6.49      -0.18      38.61
CA/TG(BII)*     -7.50       0.68      47.62
Bansal M (1996) Biological Structure and Dynamics, Proceedings of the Ninth Conversation (Vol. I) pp 121-134
12

A representative intrinsic curvature profile for a 1000 nt long E. coli promoter sequence
13

Kanhere and Bansal, Nucleic Acids Research (2005) 33, 3165-3175
14

DNA bendability
[Figure labels: DNA; Protein]
15

Kanhere and Bansal, Nucl. Acids Res. (2005) 33, 3165-3175
16

Distribution of different signals in 272 E.
coli promoters
[Venn diagram: Stability 79%, Bendability 43%, Curvature 42%; region values 24%, 19%, 17%, 19%, 3%, 2%, 4%]
10% of sequences show no signals; 90% show at least one signal.
17

Hence:
• The upstream and downstream regions, with respect to the TSS, show considerable differences in their properties.
• The upstream region is less stable, more rigid and more curved than the downstream region, in prokaryotic and eukaryotic genomes.
• The stability signal is much more common than the other two signals.
• Some of the promoters which do not show any of the three signals are internal, secondary or weak promoters.
18

• Can incorporating these features help in improving promoter prediction tools?
Since the low-stability signature was found to be the most common in promoters, it was examined first. E. coli promoter data were studied in detail, with B. subtilis and M. tuberculosis as further examples.
19

Average stability profile for 429 E. coli promoters (from EcoCyc Database V 9.1), located at least 500 nt apart
20

Nucleotide composition (as fractions) for three bacterial systems. The difference between Mtb and the others is clearly seen.
Region                              Organism          A     T     G     C     A+T
Whole genome                        E. coli           0.25  0.24  0.25  0.25  0.49
                                    B. subtilis       0.28  0.28  0.22  0.22  0.56
                                    M. tuberculosis   0.17  0.17  0.33  0.33  0.34
Upstream region (-200 to -100)      E. coli           0.27  0.27  0.23  0.23  0.54
                                    B. subtilis       0.31  0.28  0.21  0.20  0.59
                                    M. tuberculosis   0.19  0.17  0.33  0.32  0.35
Downstream region (100 to 200)      E. coli           0.25  0.25  0.26  0.24  0.50
                                    B. subtilis       0.31  0.26  0.23  0.19  0.57
                                    M. tuberculosis   0.17  0.18  0.34  0.31  0.35
Promoter region (-80 to +20)        E. coli           0.28  0.29  0.21  0.21  0.58
                                    B. subtilis       0.34  0.32  0.18  0.15  0.66
                                    M. tuberculosis   0.20  0.20  0.31  0.29  0.40
The composition was calculated for 101 nt long stretches of the promoter sequences (ranging from -200 to -100, 100 to 200 and -80 to +20 with respect to the TSS). 582 promoter sequences from E. coli, 305 from B. subtilis and 42 from M. tuberculosis were obtained by requiring the TSSs to be 200 nt apart.
21

Average stability profile for promoter sequences that are 500 nt apart
A) 429 E.
coli promoters (from EcoCyc Database V 9.1)
B) 239 B. subtilis promoters (from DBTBS Database)
C) 40 M. tuberculosis promoters (from MtbRegList Database)
One sharp peak, corresponding to high A+T content, is seen.
22

23

Sensitivity and precision for promoter prediction on experimentally verified bacterial TSSs that are 500 nt apart.
                                                        E. coli    B. subtilis   M. tuberculosis
Cutoff values applied (kcal/mol):  E-cutoff             -18.7      -17.2         -21.0
                                   D-cutoff              1.0        1.5           1.0
Total no. of promoter sequences (1001 nt) analyzed      429        239           40
No. of true positives                                   366        184           27
No. of false positives                                  272        131           28
No. of false negatives after cycle I                    58         71            6
No. of false negatives after cycle II                   6          12            0
Calculated sensitivity = TP/(TP+FN)                     0.98       0.94          1
Calculated precision = TP/(TP+FP)                       0.57       0.58          0.49
False negatives after the first cycle are taken for the second cycle of promoter prediction, with an E1 window size of 50 nt. False negatives remaining after the second cycle are considered for the sensitivity calculation. True positives and false positives are added up over the first and second prediction cycles.
Definition of TP, FP: V Rangannan and M Bansal, J. Biosci. 32, 851-862 (2007).
24

Average stability profile for all 4461 genes in E. coli, aligned w.r.t. their TLS
Average stability profile for 4461 E. coli gene sequences of 1001 nt length (-500 to +500 w.r.t. TLS)
Number of nucleotides between each TSS (#729) and TLS (considering the occurrence of the first gene): min dist = 0, max dist = 708
25

E. coli: Average stability profile for 1089 protein promoter sequences and 59 RNA promoter sequences
E. coli: Average stability profile for 34 tRNA promoter sequences and 13 other RNA promoter sequences
26

Whole genome annotation for promoter regions in E. coli and B. subtilis
E. coli B.
subtilis

E. coli
                              Forward strand        Reverse strand
                              Protein    RNA        Protein    RNA       Total
No. of TSSs                   507        34         582        25        1145 (a)
No. of genes                  2089       109        2185       73        4456
No. of predictions            4369                  4354                 8723
TP w.r.t. gene TLS (b)        1329       75         1596       48        3048 (68%)
FP w.r.t. gene TLS (b)        1004       9          795        4         1812
TP w.r.t. TSS (c)             394        23         429        18        864 (75%)

B. subtilis
                              Forward strand        Reverse strand
                              Protein    RNA        Protein    RNA       Total
No. of TSSs                   305        5          302        2         613 (a)
No. of genes                  1942       85         2164       34        4225
No. of predictions            2692                  2720                 5412
TP w.r.t. gene TLS (b)        866        30         1142       9         2038 (48%)
FP w.r.t. gene TLS (b)        522        0          425        0         947
TP w.r.t. TSS (c)             167        4          189        2         362 (59%)

(a) 3 TSSs of E. coli and 1 TSS of B. subtilis regulate protein as well as RNA genes.
(b) True and false positives are identified against the genes in the forward and reverse strands.
(c) True positives are calculated with respect to the annotated TSS (located in the -150 to +50 nt region w.r.t. TSS).
63% and 68% accuracy (precision) achieved for E. coli and B. subtilis respectively, w.r.t. TLS.
75% and 59% reliability achieved for E. coli and B. subtilis respectively, w.r.t. annotated TSS (against 37% in the case of SIDD for 927 TSSs in E. coli).
27

Whole genome annotation of promoter regions over the M. tuberculosis genome
M. tuberculosis
                              Forward strand        Reverse strand
                              Protein    RNA        Protein    RNA       Total
No. of genes                  2010       27         1989       23        4049
No. of predictions            3153                  3163                 5316
TP w.r.t. gene TLS            692        14         938        6         1650 (41%)
FP w.r.t. gene                1032       6          802        3         1843
28

All false positives need not be REAL false positives
• In prokaryotic genomes, the intergenic region is very small (~12%).
• Experimental evidence shows that for some genes the regulating transcription start site lies within the coding region of an upstream neighboring gene.
• For example, the E. coli rpoS gene has its transcribing TSS (rpoSp) within the coding region of the nlpD gene, 567 nt away from its own TLS.
Lange R, Fischer D and Hengge-Aronis R, J. Bacteriol. (1995) 177(16):4676-80
29

Distribution of coding and intergenic regions in the bacterial genomes
Histograms showing the distribution of predicted promoter regions in different genomic regions in the E. coli, B. subtilis and M. tuberculosis genomes. Color coding for intergenic and coding regions is shown at top right.
30

Predicted promoter region distribution in the E. coli genome (over ALL 1145 EcoCyc annotated, 1001 nt long promoter sequences).
31

Comparison of our method of promoter prediction with NNPP, w.r.t TLSS at position 0
32

Average energy profile for E. coli genomic fragment 9000 bp to 15300 bp
33

Average energy profile for E. coli genomic fragment 3483400 bp to 3487000 bp (DIV intergenic region)
34

Average energy profile for E. coli genomic fragment 2863000 bp to 2867600 bp (CON intergenic region)
35

Conclusions
• The relative stability of DNA in neighboring regions can help in annotating promoter regions in whole genomes.
• The method is quite general and is shown to work for genomes with varying AT/GC content.
• The stability criterion performs better than other commonly used methods based on sequence motif search, as well as the superhelical stress-induced duplex destabilization (SIDD) method.
36

No. of promoter sequences grouped according to their %GC content in the three bacterial systems
No. of sequences analyzed*
%GC       E. coli   B. sub   Mtb
30-35     -         6        -
35-40     16        61       -
40-45     47        168      -
45-50     183       47       -
50-55     193       -        -
55-60     18        -        -
60-65     -         -        25
65-70     -         -        15
Total     457       282      40
* TSSs which are 500 nt apart are considered in E. coli, B. subtilis and M. tuberculosis. GC categorization is based on the %GC over 1001 nt long promoter sequences (ranging from -500 to +500 w.r.t. TSS).
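The %GC grouping described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and treating the lower bound of each 5% interval as inclusive is an assumption, since the slide does not say how boundary values were assigned.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """Return the percentage of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(seq: str, width: int = 5) -> str:
    """Return the label of the %GC interval containing seq, e.g. '45-50'.

    Assumption: intervals are [lower, lower + width), except that a
    sequence with 100% GC is placed in the top interval.
    """
    pct = gc_percent(seq)
    lower = int(pct // width) * width
    if lower >= 100:
        lower = 100 - width
    return f"{lower}-{lower + width}"


def count_by_gc_bin(seqs):
    """Count how many sequences fall into each %GC interval."""
    return Counter(gc_bin(s) for s in seqs)


if __name__ == "__main__":
    # Toy sequences only; real input would be the 1001 nt promoter regions.
    toy = ["ATGC" * 10, "GC" * 9 + "AT" * 11, "AT" * 20]
    for label, n in sorted(count_by_gc_bin(toy).items()):
        print(label, n)
```

With real data, each input string would be one 1001 nt region spanning -500 to +500 around a TSS, and the resulting counts would correspond to one column of the table.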
37

Average free energy distribution over promoter sequences with diverse GC composition
(A) -500 to +500 region with respect to TSS
(B) -80 to +20 region with respect to TSS
The average free energies over promoter regions with similar GC composition are approximately the same, with E. coli and B. subtilis nearly overlapping for the %GC intervals 35-40%, 40-45% and 45-50% in the case of 1001 nt long promoter regions.
38

Thresholds of free energy values used to predict promoters in genomic DNA with varying GC content
E is the average free energy over the -80 to +20 region of known promoters, and D is the difference between E and the average free energy over random sequences generated from the downstream (+100 to +500 region) genomic sequence (REav).
39

41

Stability characteristics of a TF binding site (e.g. CRP)
CRP binding region: glpTQp
[Figure: free energy (kcal/mol) vs. distance from TSS for glpTQp; one panel spans -600 to +600, the other -160 to -100]
The region of high stability corresponds to a binding site for CRP in E. coli. The high-stability trough extends for ~22 nucleotides (window size = 15 nt), which is the same as the footprint size of the protein reported in the literature.
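A free energy profile of the kind plotted on these slides can be sketched as a sliding-window sum of nearest-neighbour dinucleotide energies. This is a minimal sketch, not the procedure used in the talk: the function name is hypothetical, the dinucleotide values are the unified nearest-neighbour free energies at 37 °C as commonly quoted from SantaLucia (1998) and should be checked against that paper, and summing the 14 dinucleotide steps of each 15 nt window is an assumption about how the window is applied.

```python
# Nearest-neighbour free energies (kcal/mol) per dinucleotide step, read
# 5'->3' on one strand. Assumed values: unified parameters as commonly
# quoted from SantaLucia (1998); verify against the paper before use.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def stability_profile(seq: str, window: int = 15) -> list:
    """Return the summed free energy of every `window`-nt stretch of seq.

    Each window of n nucleotides contains n - 1 dinucleotide steps, so a
    15 nt window sums 14 step energies. Less negative values indicate a
    less stable (more easily melted) duplex.
    """
    seq = seq.upper()
    if window < 2:
        raise ValueError("window must cover at least one dinucleotide step")
    if set(seq) - set("ACGT"):
        raise ValueError("sequence may contain only A, C, G and T")
    steps = [NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1)]
    n_steps = window - 1
    return [sum(steps[i:i + n_steps]) for i in range(len(steps) - n_steps + 1)]


if __name__ == "__main__":
    # Toy comparison: an A+T-rich stretch against a G+C-rich stretch.
    at_rich = "TATAATTATAATTATAAT"
    gc_rich = "GCGCGCGCGCGCGCGCGC"
    print(round(stability_profile(at_rich)[0], 2))
    print(round(stability_profile(gc_rich)[0], 2))
```

Under these assumptions an A+T-rich window gives a less negative sum than a G+C-rich one, which is the sense in which promoter regions appear as low-stability peaks and a site such as the CRP region above appears as a trough.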
42

E. coli CRP binding site consensus sequence for 209 sites
[Figure: position-specific base composition (%, bases A, T, G, C) for 209 CRP binding sites, by position]
43

CRP: Average stability profile
[Figure: average stability profile for 209 CRP binding sites; free energy (kcal/mol) vs. position; series: average stability for 209 sites, average stability for scrambled sequences]
44

CRP: Average stability profile for manipulated sequences
NNNNNNNNNNNNNTGTGANNNNNNACACANNNNNNNNNNNNN
(5' flanking region, 6-nt linker, 3' flanking region)
[Figure, two panels: average stability profiles for 209 non-redundant CRP binding sites; free energy (kcal/mol) vs. position, with the first nucleotide of the protein binding site taken as 0.
Panel 1 series: WT sequences; scrambled sequences; flanking regions scrambled; linker and flanking regions scrambled; linker region scrambled.
Panel 2 series: WT sequences; GC inverted linker; GC inverted flanking sequences.]
45

CRP: Average bendability profile
[Figure: bendability profiles for 209 CRP binding sites (TGTGANNNNNNACACA); bendability vs. position; series: average bendability for CRP (209 sequences), average bendability for scrambled CRP sequences]
46

Acknowledgements:
Dr Dhananjay Bhattacharyya, Dr Aditi Kanhere, Ms Vetriselvi R, Mr Vikas Sarma, Mr Nishad Matange
Financial Support: Dept of Biotechnology, India
Thank You
47

48

Coding and intergenic region distribution in the E. coli and B. subtilis genomes.
Histograms show the distribution of predicted promoter regions in different intergenic regions in the E. coli and B. subtilis genomes (as per the color coding in the legend).
49

NarL: Binding site consensus sequence
[Figure: percent base composition (A, T, G, C) at each position for 76 NarL binding sequences; positions -9 to 17]
50

NarL: Average stability profile
[Figure: average stability profile for 76 NarL binding sequences; free energy (kcal/mol) vs. position; series: average stability for 76 sequences, average stability for scrambled sequences]
51

NarL: Average bendability profile
[Figure: average bendability for 76 NarL binding sites; bendability vs. position; series: NarL binding sites (76 sequences), scrambled NarL binding site sequences]
52

Definition of thresholds of free energy values used to predict promoters in bacterial genome sequences. G is the average free energy over the entire genome. E is the average free energy over known promoter regions. All energy values are in kcal/mol and the standard deviation values are also indicated. E-cutoff and D-cutoff are the thresholds used to predict promoter regions.
                                            E. coli               B. subtilis           M. tuberculosis
Average free energy G calculated over whole genome sequence
  Mean G                                    -20.10                -18.88                -22.49
  Standard deviation (σ)                    0.13                  0.06                  0.15
  GEav (Mean+3σ)                            -19.70                -18.72                -22.04
Average free energy E calculated over upstream region of TSS
  Upstream region considered w.r.t. TSS     -80 to +20 (101 nt)   -80 to +20 (101 nt)   -40 to +20 (61 nt)
  Mean E                                    -18.70                -17.20                -21.02
  Standard deviation (σ)                    0                     0                     0
  E-cutoff (Mean+3σ)                        -18.70                -17.20                -21
  D-cutoff (E-cutoff - GEav)                1.0                   1.5                   1.0

Specific cutoff for diverse %GC sequences
The number of predicted promoters, as well as the length of the predicted promoter regions, increased with the generalized cutoff derived for E. coli.
54

Variation in base composition and average free energy (AFE) in different regions of bacterial genomes.
Promoter sequences of 491, 283 and 40 TSSs that are 500 nt apart are considered from E. coli, B. subtilis and M. tuberculosis, respectively. Sequences are aligned with respect to the TSS. Standard deviation from the respective mean is given in brackets.
Region extracted from respective genome with respect to TSS (Length of the region) AFE G+C AFE G+C AFE G+C
Upstream region -500 to -100 (401 nt) -19.9 (1.0) 0.49 (0.06) -18.8 (0.8) 0.43 (0.05) -22.4 (0.7) 0.65 (0.03)
-500 to -100 (401 nt) shuffled sequence -19.6 (1.0)
100 to 500 (401nt) -20.1 (0.7)
100 to 500 (401nt) shuffled sequence -19.9 (0.7)
-80 to +20 (101nt) -18.6 (1.3)
-80 to +20 (101nt) shuffled sequence -18.5 (1.2)
-500 to +500 (1001nt) -19.8 (0.7)
-500 to +500 (1001nt) shuffled sequence -19.5 (0.6)
Downstream region Promoter region Longer region Whole genome E. coli -20.1 (2.4) B. subtilis -19.6 (0.8) 0.49 (0.04) -19.0 (0.7) -22.1 (0.6) 0.44 (0.04) -18.7 (0.7) 0.42 (0.08) -17.1 (1.0) -18.6 (0.5) 0.33 (0.06) -18.9 (2.3) 0.66 (0.03) -21.4 (1.0) 0.61 (0.05) -21.4 (0.9) 0.42 (0.03) -18.4 (0.5) 0.51 -22.5 (0.5) -22.3 (0.5) -17.0 (0.9) 0.49 (0.04) M. tuberculosis -22.3 (0.4) 0.65 (0.02) -22.1 (0.33) 0.44 -22.5 (2.1) 0.66
55

Average stability profile for promoter sequences from three different organisms
491 E. coli promoters from EcoCyc Database version 11.0
239 B. subtilis promoters from DBTBS Database
40 M. tuberculosis promoters from MtbRegList Database
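The regions used throughout the talk are given as coordinates relative to the TSS, and their stated lengths (401, 101 and 1001 nt) imply that both end positions and position 0 are counted. A minimal sketch of cutting such regions from a TSS-aligned sequence and computing their G+C fraction follows. The function names are hypothetical, and placing the TSS at index 500 of a 1001 nt string is an assumed convention that is not stated in the slides.

```python
def region(seq: str, start: int, end: int, tss_index: int = 500) -> str:
    """Return the stretch from `start` to `end` (both inclusive, relative
    to the TSS) of a sequence whose TSS sits at `tss_index`.

    Assumption: position 0 is counted, so -80 to +20 yields 101 nt.
    """
    lo, hi = tss_index + start, tss_index + end
    if start > end or lo < 0 or hi >= len(seq):
        raise ValueError("region lies outside the sequence")
    return seq[lo:hi + 1]


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    # Toy 1001 nt sequence: G only in the -80..+20 stretch, A elsewhere.
    toy = "A" * 420 + "G" * 101 + "A" * 480
    for name, (a, b) in {
        "upstream": (-500, -100),
        "promoter": (-80, 20),
        "downstream": (100, 500),
        "longer": (-500, 500),
    }.items():
        part = region(toy, a, b)
        print(name, len(part), round(gc_fraction(part), 2))
```

The same slices could be passed to any per-region statistic, for example an average free energy, to build a table of this shape.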