
Bioinformatics approaches for
studying gene regulation.
By Ilya Ioshikhes, Ph.D.
Department of Biomedical
Informatics.
Molecular Biology of the Cell, 3rd edn. Part I. Introduction to the Cell
Chapter 3. Macromolecules: Structure, Shape, and Information. Nucleic Acids
Figure 3-19. Information flow in protein synthesis. (A) The nucleotides in an
mRNA molecule are joined together to form a complementary copy of a segment of one
strand of DNA. (B) They are then matched three at a time to complementary sets of
three nucleotides in the anticodon regions of tRNA molecules. At the other end of each
type of tRNA molecule, a specific amino acid is held in a high-energy linkage, and
when matching occurs, this amino acid is added to the end of the growing polypeptide
chain. Thus translation of the mRNA nucleotide sequence into an amino acid sequence
depends on complementary base-pairing between codons in the mRNA and
corresponding tRNA anticodons. The molecular basis of information transfer in
translation is therefore very similar to that in DNA replication and transcription. Note
that the mRNA is both synthesized and translated starting from its 5' end.
Molecular Cell Biology
9. Molecular Structure of Genes and Chromosomes
9.5. Organizing Cellular DNA into Chromosomes
Figure 9-30. Structure of the nucleosome. (a) Ribbon diagram
of the nucleosome shown face-on (left) and from the side (right).
One DNA strand is shown in green and the other in brown. H2A
is yellow; H2B, red; H3, blue; H4, green. (b) Space-filling model
shown from the side. DNA is shown in white; histones are colored
as in (a). H2A, H2A′, H2B, H2B′, H3, and H4 indicate the positions
of the respective histone N-terminal tails visible in this view. The
H2A′ N-terminal tail interacts with the upper loop of DNA, while
the H2A N-terminal tail (only partially seen in this view) interacts
with the bottom loop of DNA. The N-terminal tail of one H4 extends
from the bottom of the nucleosome and interacts with the neighboring
histone octamer in the crystal lattice (not shown). The N-terminal
tails of histones H2B, H2B′, H3, and H3′ pass between the two loops
of DNA. The N-terminal tails of H2A, H4, H3, and H2B include an
additional 3, 15, 19, and 23 residues, respectively, that are not
visualized in the crystal structure because they are not highly structured.
They extend further from the surface of the nucleosome where they may
participate in nucleosome-nucleosome interactions in the 30 nm fiber
(See Figure 9-31) or interact with other chromatin-associated proteins.
[From K. Luger et al., 1997, Nature 389:251; courtesy of T. J. Richmond.]
Molecular Biology of the Cell, 3rd edn.
Part II. Molecular Genetics
Chapter 8. The Cell Nucleus
The Global Structure of Chromosomes
Figure 8-30. Model of chromatin packing.
This schematic drawing shows some of the many
orders of chromatin packing postulated to give rise
to the highly condensed mitotic chromosome.
Agalioti T, Lomvardas S, Parekh B, Yie J, Maniatis T, Thanos D.
“Ordered recruitment of chromatin modifying and general transcription factors
to the IFN-beta promoter.”
Cell. 2000 Nov 10;103(4):667-78.
Characteristic features of gene
regulation mechanisms:
• Large number and variety of participating regulatory
elements: thousands of transcription factors (TFs),
chromatin, DNA methylation etc.
• None of those elements is either absolutely necessary or
sufficient for the regulatory processes.
• There are a lot of DNA sequence motifs (signals) related to
these agents: TF binding sites, nucleosome sequence
pattern, CpG islands etc.
• The majority of those signals are very weak.
• Gene expression is regulated by a large number of weak
signals interacting with each other in sophisticated
ways.
Possible approaches to
this study:
• Exhaustive analysis of signals caused by 1-2
elements, with gradual generalization of results.
• From intuitive model to sequence analysis.
• From known sequence features to their
quantitative analysis.
• From sequences to revealing common sequence
motifs.
• In depth analysis of known features.
> SEQ_1 Frog Xenopus borealis ACCURACY
1 bp
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGCTTGGCAGGACAAGGGCAG
CTCTGCAAACTGTAAAACCGGACAAAGGCTTTCCCCTGGCTTACACGCAA
AAGGGAAGGGCCTTTCCTGAGGAGGTGAGCGGCAACCTGGACTCGGGGAT
GGCGCTGGAAGTGATCTGCTTGGATTTTGCTCAAGACTTGGATGCAAGGG
CTATCCCGATGAGCTGACAAGGGCCTTGGGAGGGGGGCGGGGGCTGTGCA
GATAACAAGCTGTCCACTTCCAGGCACTGCCCTTCCGTGGCTCCCGTAGC
> SEQ_2 Frog Xenopus borealis ACCURACY
1 bp
GGGCTCCGCCCXTTCGGAAGGATGCTAGGGAGCCGGAGAGAGCGCAGAGA
GGCGGGGTGAAAGGGATGGGGGGAGCTGAGGCAGGAGGGCAGGCTGTCAA
GGCCGGGCTTGTTTTCCTGCCTGGGGGAAAAGACCCTGGCATGGGGAGGA
GCTGGGCCCCCCCCAGAAGGCAGCACAAGGGGAGGAAAAGTCAGCCTTGT
GCTCGCCTACGGCCATACCACCCTGAAAGTGCCCGATATCGTCTGATCTC
GGAAGCCAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCT
GGGAATACCAGGTGTCGTAGGCTTTTGCACTTTTGCCATTCTGAGTAACA
GCAGGGGGCAGTCTCCTCCATGCATTTTTCTTTCCCCGAACAGCTGCCTG
> SEQ_3 African Green Monkey ACCURACY
1 bp
ACTGCTCTGTGTTCTGTTAATTCATCTCACAGAGTTACATCTTTCCCTTC
AAGAAGCCTTTCGCTAAGGCTGTTCTTGTGGAATTGGCAAAGGGATATTT
GGAAGCCCATAGAGGGCTATGGTGAAAAAGGAAATATCTTCCGTTCAAAA
CTGGAAAGAAGCTTTCTGAGAAACTGCTCTGTGTTCTGTTAATTCATCTC
ACAGAGTTACATCTTTCCCTTCAAGAAGCCTTTCGCTAAGGCTGTTCTTG
TGGAATTGGCAAAGGGATATTTGGAAGCCCATAGAGGGCTATGGTGAAAA
AGGAAATATCTTCCGTTCAAAACTGGAAAGAAGCTTTCTGAGAAACTGCT
CTGTGTTCTGTTAATTCATCTCACAGAGTTACATCTTTCCCTTCAAGAAG
> SEQ_4 Mouse ACCURACY
1 bp
AAAATGAGAAACATCCACTTGACGACTTGAAAAATGACGAAATCACTAAA
AAACGTGAAAAATGAGAAATGCACACTGAAGGACCTGGAATATGGCGAGA
AAACTGAAAATCACGGAAAATGAGAAATACACACTTTAGGACGTGAAATA
TGGCGAGGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCCACTGTAGG
ACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCC
ACTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAG
AAATGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCACGG
AAAATGAGAAATACACACTTTAGGACGTGAAATATGGCGAGGAAAACTGA
> SEQ_5 Psammechinus miliaris (sea urchin) ACCURACY
1 bp
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNAGCTTATAATCATCCTTATACACGCG
CAGTCGATGAGATGAAAAGTTCATTAACGCTACATTTACAGTGTTTTGGG
CAATTCTCCCTCCCCCCCCCCCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
TCTCCCTTCCTCTAAATATGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> SEQ_6 Yeast Saccharomyces cerevisiae ACCURACY
1 bp
AGTACAGAGGTCAATGGCAGTAATGGCACTTGGTGCGGCTTCTGTGCCAG
TAATGTGGCTTTCTCTAACAAGTTGGATGCACATCGGCGAGAGACAAGGC
TTTAGAATACGGTCACAGATATTGGAGGCATATTTGGAGGAAAAGCCAAT
GGAATGGTACGACAATAATGAAAAATTGTTAGGAGATTTTACTCAAATCA
ACAGATGTGTGGAAGAGCTAAGATCAAGCTCCGCAGAGGCATCAGCCATA
ACTTTCCAGAATTTAGTTGCAATATGTGCGCTTCTGGGGACGTCATTCTA
CTATTCTTGGTCATTAACTTTAATTATTCTTTGCAGCTCTCCAATAATCA
CATTTTTTGCAGTGGTGTTTTCCAGAATGATTCATGTATATTCAGAGAAG
> SEQ_7 Yeast Saccharomyces cerevisiae ACCURACY
1 bp
TTCTCTATTCTGCCACTATACAATTTATTGTTTTCCACAAAGGGTAAAGG
TACTTTAAGAAAATAGTTTCTTATTTTTTTTGCCATGTAATTACCTAATA
GGGAAATTTACACGCTGCTTCGCACATATACAATTGTTTCAGATATGAAA
ACTGTTGCATTATTGCCGTTCATCATTTAAATACCAGAGCTTATAAACCT
GGATATGGCTGAACTATCTCCCGTTGTTACGTTCACACAGAGAGCTTTCA
AGTGCCGCTGAAAATTCCACTAGGAAACAAAGAACAAGCTACGTCATGAA
CTTTTTAAGTTTTAAGACTACAAAACACTATCACATTTTCAGGTACGTGA
Sequences absolutely dissimilar.
No conserved regions.
Conventional evolution-based
approaches of sequence alignment
(like BLAST) are hardly applicable.
Dinucleotides (AA/TT first)
are the primary subject of alignment.
Possible number of
configurations: $\prod_i (2\,\mathrm{Ac}_i + 1)$
204 sequences, Ac. 1 to 55
Roughly $51^{204}$ configurations
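For a sense of scale, the estimate above can be checked numerically. This is a minimal sketch, assuming each sequence contributes about 51 possible shifts (the rough per-sequence figure on the slide); the individual accuracies are not listed here, so a uniform value of 25 bp is used as a stand-in.

```python
import math

def log10_configurations(accuracies):
    """log10 of prod_i (2*Ac_i + 1) for mapping accuracies Ac_i (in bp)."""
    return sum(math.log10(2 * ac + 1) for ac in accuracies)

# Assumption: 204 sequences, each with Ac = 25 bp, so that 2*Ac + 1 = 51
# matches the rough per-sequence factor quoted above.
exponent = log10_configurations([25] * 204)
print(f"about 10^{exponent:.0f} configurations")
```

Working in logarithms avoids handling the full integer and makes it plain why exhaustive enumeration is out of reach.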
Algorithms of
multiple sequence alignment.
• Alignment of the most accurately mapped nucleosome sequences.
• Multicycle consecutive alignment – AA/TT matrices Mi of Ac.-sorted
sequences aligned one by one to pattern derived on previous step.
Results of 10,000 cycles are averaged.
• Quasi-exhaustive consecutive alignment – keeps track of several
“suboptimal” alignments; the alignment with the highest $\mathrm{SIM} = \sum_{ij} (M_i * M_j)$
is final.
• Alignment with simulated annealing strategy: a new alignment is
accepted if $\mathrm{SIM}_{k+1} > \mathrm{SIM}_k$, or otherwise with probability $P(-E) = e^{-E/T}$, where
$-E = \mathrm{SIM}_{k+1} - \mathrm{SIM}_k$. $T$ is a decreasing “temperature” factor.
• Multiple alignment by positional entropy criterion using Gibbs
sampling strategy.
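The simulated-annealing acceptance rule in the list above can be sketched as follows. This is an illustration only: the function names, the geometric cooling schedule and the toy similarity function are assumptions, not the alignment program itself.

```python
import math
import random

def accept(sim_new, sim_old, temperature, rng=random):
    """Accept an improvement always; otherwise with probability exp(-E/T),
    where -E = sim_new - sim_old (so E >= 0 for a non-improving move)."""
    if sim_new > sim_old:
        return True
    energy = sim_old - sim_new
    return rng.random() < math.exp(-energy / temperature)

def anneal(initial, propose, similarity, t_start=1.0, cooling=0.95,
           steps=200, seed=0):
    """Generic annealing loop; `propose` and `similarity` are caller-supplied."""
    rng = random.Random(seed)
    current, sim = initial, similarity(initial)
    best, best_sim = current, sim
    temperature = t_start
    for _ in range(steps):
        candidate = propose(current, rng)
        cand_sim = similarity(candidate)
        if accept(cand_sim, sim, temperature, rng):
            current, sim = candidate, cand_sim
            if sim > best_sim:
                best, best_sim = current, sim
        temperature *= cooling  # geometric cooling schedule (an assumption)
    return best, best_sim

# Toy stand-in for an alignment: a single integer shift whose
# similarity peaks at shift 7 (hypothetical scoring function).
toy_similarity = lambda shift: -(shift - 7) ** 2
toy_propose = lambda shift, rng: shift + rng.choice([-1, 1])
best_shift, best_sim = anneal(0, toy_propose, toy_similarity)
print(best_shift, best_sim)
```

In a real alignment the state would be the set of per-sequence shifts and the similarity would be the SIM score of the resulting AA/TT matrix.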
Approach 2
Chromatin structure of promoter sequences
and regularity in positioning of TF sites –
an example of intuitive conceptual model.
TF – nucleosome correlation.
• Putative TF binding sites mapped on promoter sequences.
• Distribution of each TF site over all sequences calculated.
• Scanning with a “nucleosomal” 145 bp window through
distributions of all TF sites.
• Calculation of spectral distribution for each TF inside the
window in every scanning point.
• Evaluation of number N of TFs with main “nucleosomal”
period 10.1-10.5 bp in their spectra.
• Evaluation of difference between N and statistically
expected R number of such TFs:
dS(StD)=(N-R)/SQRT(R).
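The last three steps above can be sketched as follows. The slides do not say how the spectrum is computed, so a plain Fourier amplitude scan over candidate periods is an assumption here, as are the function names and the synthetic profile.

```python
import cmath
import math

def main_period(profile, min_period=5.0, max_period=20.0, step=0.1):
    """Period (bp) with the largest Fourier amplitude in a positional profile."""
    mean = sum(profile) / len(profile)
    centered = [v - mean for v in profile]
    best_period, best_amp = None, -1.0
    n_steps = int(round((max_period - min_period) / step))
    for k in range(n_steps + 1):
        period = min_period + k * step
        amp = abs(sum(v * cmath.exp(-2j * math.pi * x / period)
                      for x, v in enumerate(centered)))
        if amp > best_amp:
            best_period, best_amp = period, amp
    return best_period

def ds_score(n_observed, n_expected):
    """dS = (N - R) / sqrt(R), in units of standard deviation."""
    return (n_observed - n_expected) / math.sqrt(n_expected)

# Synthetic 145 bp profile with a 10.3 bp periodicity (illustrative data only).
profile = [1 + math.cos(2 * math.pi * x / 10.3) for x in range(145)]
period = main_period(profile)
is_nucleosomal = 10.1 <= round(period, 1) <= 10.5
print(round(period, 1), is_nucleosomal, ds_score(25, 16))
```

Counting the TFs for which `is_nucleosomal` holds in a given window gives N; R would come from the same count on randomized data.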
Left: Order of events leading to transcription initiation from the IFN-ß promoter.
I and II represent nucleosomes positioned in the promoter area. Derived from
Agalioti, T., Lomvardas, S., Parekh, B., Yie, J., Maniatis, T., and Thanos, D. 2000.
Ordered recruitment of chromatin modifying and general transcription factors to the IFN-ß promoter.
Cell 103:667-678.
Right: Nucleosome positioning at the pS2 promoter. Derived from Sewack, G.F. and
Hansen, U. 1997. Nucleosome positioning and transcription-associated chromatin alterations on
the human estrogen-responsive pS2 promoter. J. Biol. Chem. 272:31118-31129.
To further optimize the findings and increase the
statistical significance of the results, we varied the
length of the windows.
The results of the calculation indicate the
most statistically significant effect, 6.68 StD, for the
windows (-46…+121) and (-46…+124), covering the
TSS. The size of these windows (167–170 bp) is similar to
that of the chromatosome.
Nucleosome-TF correlation.
• Very consistent effect of high statistical
significance.
• Obtained on two large, representative and
essentially independent data sets.
• Obtained by two independent approaches.
• Has many correlations with known
experimental data.
Approach 3
Large-scale human promoter
mapping using CpG islands.
(Program CpG_promoter by
Quadratic Discriminant Analysis QDA)
Quantitative analysis of known
sequence feature
Definition of CpG island
• Length > 200 bp
• C + G content > 50%
• CpG ratio Obs/Exp > 0.6
(Gardiner-Garden and Frommer,
J.Mol.Biol. 196, 261-282 (1987))
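The three criteria above can be checked for a candidate sequence as in the sketch below. It assumes the usual observed/expected ratio, (number of CpG × length) / (number of C × number of G), and applies the test to one whole sequence rather than a sliding window; function names are illustrative.

```python
def cpg_stats(seq):
    """Return (length, C+G fraction, CpG observed/expected ratio)."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return n, gc_content, obs_exp

def is_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the length, C+G content and Obs/Exp thresholds listed above."""
    n, gc, ratio = cpg_stats(seq)
    return n > min_len and gc > min_gc and ratio > min_ratio

print(is_cpg_island("CG" * 150))  # long, GC-rich, CpG-rich toy sequence
print(is_cpg_island("AT" * 150))  # long but contains no C or G
```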
Molecular Biology of the Cell, 3rd edn. Part II. Molecular
Genetics. Chapter 9. Control of Gene Expression. The Molecular
Genetic Mechanisms That Create Specialized Cell Types
Figure 9-70. The CG islands surrounding the promoter in three
mammalian housekeeping genes. The yellow boxes show the extent of
each island. Note also that, as for most genes in mammals, the exons
(dark red) are very short relative to the introns (light red). (Adapted from
A.P. Bird, Trends Genet. 3:342-347, 1987.)
SN and SP
Sensitivity SN is proportion of True Positive (TP) predictions
out of all de-facto positives:
SN = TP / (TP + FN)
Specificity SP is proportion of True Positive (TP) predictions
out of all positive predictions:
SP = TP / (TP + FP)
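A short worked example of the two definitions, with hypothetical counts chosen only for illustration:

```python
def sensitivity(tp, fn):
    """SN = TP / (TP + FN): fraction of real positives that were predicted."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """SP = TP / (TP + FP): fraction of positive predictions that are real."""
    return tp / (tp + fp)

# Hypothetical counts: 50 real promoters, 40 of them found, 60 false alarms.
tp, fn, fp = 40, 10, 60
print(sensitivity(tp, fn), specificity(tp, fp))
```

Note that SP as defined here depends only on the predictions made, not on true negatives.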
Results of promoter mapping
(Test Set 2)
• 135 genes
• 68 have CpG island around promoter
• 63 recognized
• SN = 0.47 (0.93)
• SP = 0.34 (1 Pos./26 kb; 1/36 kb is in fact)
• Promoter Scan gives
SN = 0.44
SP = 0.06 (1 Pos. / 4.7 kb)
Approach 4
Revealing regulatory mechanisms in
promoter sequences.
From sequence to model.
(Work in progress)
Alternative Architecture Types
of Human Pol II Promoters
Molecular Cell Biology
4. Nucleic Acids, the Genetic Code,
and the Synthesis of Macromolecules
4.3. Nucleic Acid Synthesis
Figure 4-15. Transcription of DNA into RNA
is catalyzed by RNA polymerase, which can
initiate the synthesis of strands de novo on
DNA templates. The nucleotide at the 5′ end of
an RNA strand retains all three of its phosphate
groups; all subsequent nucleotides release
pyrophosphate (PPi) when added to the chain
and retain only their α phosphate (red). The
released PPi is subsequently hydrolyzed by
pyrophosphatase to Pi, driving the equilibrium
of the overall reaction toward chain elongation.
In most cases, only one DNA strand is transcribed
into RNA.
The Cell
II. The Flow of Genetic Information 6. RNA Synthesis and Processing
Eukaryotic RNA Polymerases and General Transcription Factors
Figure 6.14. RNA polymerase II holoenzyme The holoenzyme consists of a preformed complex
of RNA polymerase II, the general transcription factors TFIIB, TFIIE, TFIIF, and TFIIH, and
several other proteins that activate transcription. This complex can be recruited directly to a
promoter via interaction with TFIID (TBP + TAFs).
An Introduction to Genetic Analysis
11. Regulation of Gene Transcription
Transcription: an overview of gene
regulation in eukaryotes.
Figure 11-29. (a) Assembly of the RNA
Polymerase II initiation complex begins with the
binding of transcription factor TFIID to the TATA
box. TFIID is composed of one TATA box-binding
subunit called TBP (dark blue) and more than eight
other subunits (TAFs), represented by one large
symbol (light blue). Inhibitors can bind to the
TFIID-promoter complex, blocking the binding of
other general transcription factors. Binding of
TFIIA to the TFIID-promoter complex (to form
the D-A complex) prevents inhibitor binding.
TFIIB then binds to the D-A complex, followed by
binding of a preformed complex between TFIIF
and RNA polymerase II. Finally, TFIIE, TFIIH,
and TFIIJ must add to the complex, in that order,
for transcription to be initiated. (From H.Lodish,
D.Baltimore, A.Berk, S.L.Zipursky, P.Matsudaira,
and J.Darnell, Molecular Cell Biology, 3d ed.
Copyright © 1995 by Scientific American Books)
Figure 10-52. Structure of the complex
formed between TBP, promoter DNA, and
TFIIB. In in vitro transcription systems, TFIIB
binds to the assembled TBP – promoter DNA
complex. Shown here are the C-terminal
domain of Arabidopsis TBP and the C-terminal
domain of human TFIIB. Transcription
initiation in vivo also requires TFIIA, which
binds to the TBP – promoter DNA complex on
the side opposite to where TFIIB binds. TFIIA
is thought to bind before TFIIB does. [Adapted
from D. B. Nikolov et al., 1995, Nature
377:119.]
Molecular Cell Biology
Fourth Edition
Harvey Lodish (Massachusetts Institute of Technology)
Arnold Berk (U. of California, Los Angeles)
Lawrence Zipursky (U. of California, Los Angeles)
Paul Matsudaira (Massachusetts Institute of
Technology)
David Baltimore (California Institute of Technology)
James Darnell (Rockefeller U.)
Molecular Biology of the Cell, 3rd edn. Part II. Molecular Genetics
Chapter 9. Control of Gene Expression. How Genetic Switches Work
Figure 9-34. The gene control region of a typical eucaryotic gene. The promoter is the DNA sequence
where the general transcription factors and the polymerase assemble. The most important feature of the
promoter is the TATA box, a short sequence of T-A and A-T base pairs that is recognized by the general
transcription factor TFIID. The start point of transcription is typically located about 25 nucleotide pairs
downstream from the TATA box. The regulatory sequences serve as binding sites for gene regulatory
proteins, whose presence on the DNA affects the rate of transcription initiation. These sequences can be
located adjacent to the promoter, far upstream of it, or even downstream of the gene. DNA looping is thought
to allow gene regulatory proteins bound at any of these positions to interact with the proteins that assemble at
the promoter. Whereas the general transcription factors that assemble at the promoter are similar for all
polymerase II transcribed genes, the gene regulatory proteins and the locations of their binding sites relative
to the promoter are different for each gene.
A total of 1871 non-redundant human promoter sequences
from the Eukaryotic Promoter Database (EPD) release
75 (http://www.epd.isb-sib.ch) and 8793 human promoters
from the Database of Transcriptional Start Sites (DBTSS)
(http://www.dbtss.hgc.jp/index.html) were used for statistical
analyses as two separate datasets. We also constructed
a small test set of 27 human promoters with multiple start sites (MSS). This set
was utilized to analyze the statistics of core-promoter elements
in MSS promoters. Each promoter was considered several times,
once for each known TSS, so the total number of
sequences in this set is 107.
Molecular Cell Biology
10. Regulation of Transcription Initiation
10.4. Regulatory Sequences in Eukaryotic Protein-Coding Genes
Figure 10-30. Comparison of nucleotide sequences upstream of the start site in 60 different
vertebrate protein-coding genes. Each sequence was aligned to maximize homology in the
region from −35 to −20. The tabulated numbers are the percentage frequency of each base at each
position. Maximum homology occurs over a six-base region, referred to as the TATA box, whose
consensus sequence is shown at the bottom. The initial base in mRNAs encoded by genes
containing a TATA box most frequently is an A. [See R. Breathnach and P. Chambon, 1981, Ann.
Rev. Biochem. 50:349; P. Bucher, 1990, J. Mol. Biol. 212:563.]
To extract a subset of promoter sequences containing the TATA box
or Inr element at their functional positions, the positional weight
matrices (PWM) with optimal cut-off values were applied (Bucher,
1990). We define the TATA or Inr element as being present at a
certain position if the PWM score at this position exceeds the
cut-off value, and define the element to be absent at this position
otherwise. Since there are no matrices for DPE and BRE, we
matched 5 out of 5 letters and 6 out of 7 for the DPE and BRE
consensuses (Smale and Kadonaga, 2003), respectively.
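The presence/absence rule for PWM-scored elements can be sketched as below. The matrix values and cut-off are toy numbers made up for illustration; they are not the Bucher (1990) matrices.

```python
def scan_pwm(seq, pwm, cutoff):
    """Return (start, score) for every window whose PWM score exceeds cutoff.

    pwm is a list of {base: score} dicts, one per motif position.
    """
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(column[base] for column, base in zip(pwm, window))
        if score > cutoff:
            hits.append((start, score))
    return hits

def toy_column(preferred):
    """Hypothetical scoring column: +1 for the preferred base, -1 otherwise."""
    return {base: (1 if base == preferred else -1) for base in "ACGT"}

toy_pwm = [toy_column(b) for b in "TATA"]  # toy 4-position matrix
hits = scan_pwm("GGTATAAGG", toy_pwm, cutoff=3.5)
print(hits)
```

An element is then called present at its functional position if a hit falls inside the window defined for that element relative to the TSS.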
We used the same parameters to extract subsets containing
known synergetic combinations, yet the respective elements
had to be placed at their experimentally defined synergetic
distance from one another. The distances between the elements in
the remaining combinations were chosen based on the positions of
the respective elements in the known combinations.
To estimate the statistical significance of the occurrence
frequency of an element or synergetic combination in the
respective functional window, we calculated a statistical
significance parameter, dS, measured in units of standard deviation
(StD = $\sqrt{N_{out}}$): $dS = (N_{in} - N_{out})/\sqrt{N_{out}}$, where $N_{in}$
is the number of occurrences of an element or combination
inside its functional window and $N_{out}$ is the number of occurrences
of that element or combination in the average interval
of the same length outside the functional window.
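As an illustration of the dS calculation, the sketch below counts motif occurrences inside a functional window and scales the count outside it to an interval of the same length. Averaging uniformly over the rest of the scanned region is an assumption, and the positions are synthetic.

```python
import math

def ds_from_positions(positions, window, region):
    """dS = (Nin - Nout) / sqrt(Nout).

    positions: motif positions relative to the TSS, one per occurrence
    window:    (lo, hi) inclusive functional window
    region:    (lo, hi) inclusive scanned region that contains the window
    """
    w_lo, w_hi = window
    r_lo, r_hi = region
    width = w_hi - w_lo + 1
    n_in = sum(1 for p in positions if w_lo <= p <= w_hi)
    n_outside = sum(1 for p in positions
                    if r_lo <= p <= r_hi and not (w_lo <= p <= w_hi))
    outside_len = (r_hi - r_lo + 1) - width
    # Average count per interval of the window's length outside the window
    n_out = n_outside * width / outside_len
    if n_out == 0:
        raise ValueError("no occurrences outside the window; dS is undefined")
    return (n_in - n_out) / math.sqrt(n_out)

# Synthetic data: 30 occurrences at -30 and 19 scattered upstream.
positions = [-30] * 30 + list(range(-100, -81))
ds = ds_from_positions(positions, window=(-32, -28), region=(-100, -1))
print(ds)
```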
Figure 1. The occurrence frequency (the percentage of sequences having a considered motif
centered at particular position) distribution of the TATA box motifs based on scanning of EPD
(blue curve) and DBTSS (magenta curve) sequences by PWM (Bucher, 1990). The TSS is placed
at position +1. The straight horizontal gray line depicts the average number of TATA motifs
found in the randomly generated sequence with the same percentage of each of four nucleotides
as in the EPD promoter sequences, namely A = 20.7%, C = 29.3%, G = 29.5%, and T = 20.5%.
The shaded rectangles indicate the standard deviation calculated based on 1871 random sequences
(short rectangle) and on 8973 random sequences (long rectangle), respectively.
Figure 2. The occurrence frequency distribution of the Inr motifs based on scanning of EPD
(blue curve) and DBTSS (magenta curve) sequences by PWM (Bucher, 1990). The TSS is placed
at position +1. The straight horizontal gray line depicts the average number of Inr motifs
found in the randomly generated sequence with the same percentage of each of four nucleotides
as in the EPD promoter sequences, namely A = 20.7%, C = 29.3%, G = 29.5%, and T = 20.5%.
The shaded rectangles indicate the standard deviation calculated based on 1871 random sequences
(short rectangle) and on 8973 random sequences (long rectangle), respectively.
According to these data, half of the promoters, 49.0% (48.4%), have the
Inr element at a functional position, only 21.8% (10.4%) have a TATA box,
24.6% (24.6%) contain DPE, and 24.5% (25.5%) have BRE.
The majority of the promoters, 77.3% (74.3%), have at least one of four
core-promoter elements at its functional position and 41.8% (44.1%)
have only one element including TATA – 5.5% (2.9%), Inr – 20.1%
(23.0%), DPE – 6.6% (8.4%), and BRE – 9.6% (9.8%)
Figure 1. Occurrence frequency distribution of combination TATA_Inr for EPD
(blue) and DBTSS (magenta). TSS is placed at position +1.
Figure 2. Occurrence frequency distribution of combination Inr_DPE for EPD
(blue) and DBTSS (magenta).
Figure 3. Occurrence frequency distribution of combination TATA_BRE for EPD
(blue) and DBTSS (magenta).
Figure 4. Occurrence frequency distribution of combination Inr_BRE for EPD
(blue) and DBTSS (magenta).
Figure 5. Occurrence frequency distribution of combination DPE_BRE for EPD
(blue) and DBTSS (magenta). The value at each position is an 11-point sliding average.
Figure 6. Occurrence frequency distribution of combination TATA_DPE for EPD
(blue) and DBTSS (magenta).
Note the common features of the aforementioned combinations:
(1) all of them involve TFIID, and TBP binds to DNA
regardless of the presence/absence of TATA box; (2) TFIID
covers the TSS area; (3) the distance from the TSS to the
edge of the complex is approximately the same (~30–40 bp).
Combinations BRE_Inr, BRE_DPE and TATA_DPE also satisfy
these requirements. These combinations are present in a
number of promoters comparable with the three previous combinations,
with comparable statistical significance (Table 4).
They may therefore also be considered as possible synergetic
combinations of core-promoter elements (Fig. 1D–F).
We found that 83 (76.9%) of the MSS promoters contain
at least one core-promoter element in the functional position
relative to the TSS. This percentage is practically the same
as for all promoters from both the datasets. The statistical
significance of the presence of any one of the four elements
in the functional position is comparatively high for a relatively
small dataset: dS = 3.5 StD, P-value = 0.0005. Remarkably,
the portion of MSS promoters containing BRE (29.6%) is
larger than on average in the EPD/DBTSS datasets. Thus the
presence of the BRE element in the CpG+ and MSS promoters
is comparable with the presence of the TATA box in the
CpG-less promoters.
An example of MSS promoter.
Figure 1. An example of an MSS promoter sequence (36, GenBank Accession #X52601, TSS positions marked
by shading) containing all four core promoter elements at functional positions relative to a TSS (marked by
bold letters in the same color as the respective core element).
Nature Structural & Molecular Biology 11, 1031 - 1033 (2004)
doi:10.1038/nsmb1104-1031
Another piece in the transcription initiation puzzle
Francisco J Asturias
The author is at the Department of Cell Biology, The Scripps Research Institute,
10550 North Torrey Pines Road, La Jolla, California 92037, USA.
A new report provides evidence that the TFIIB-RNAPII interaction depends on
the presence of additional factors and highlights the importance of structural
characterization of the entire preinitiation complex.
Beyond core-promoter
The occurrence frequency (the percentage of sequences having a considered motif
centered at a particular position) distribution of the GC-box sites. The distribution is obtained
by scanning 8973 human promoters from DBTSS (magenta – positive strand, red – negative
strand, dark blue – both strands) and 1871 human promoters from EPD (green – both strands).
The value at each position is an 11-point sliding average. The TSS is placed at
position +1. The straight horizontal line depicts the average number of GC-box sites found in
both strands of the randomly generated sequence with the same percentage of each of four
nucleotides as in the training set of promoter sequences.
The flowchart of optimization process.
The input parameters are promoter database, an
initial PWM (or motif consensus), a set of
experimentally defined sites, and a “functional
window”.
The first step is the extraction of the dataset of
putative sites.
There are two levels of optimization at the beginning:
cutoff value and motif length. The Correlation
Coefficient (CC) is used as optimization parameter.
Each cycle brings in a portion of new sites typical for
this particular window and excludes some atypical
sites, increasing the influence of sites from that
window.
This influence is strongly limited by the requirement to
be as close as possible to the previous matrix
expressed by the definition of CC.
All aforementioned steps should be repeated for each
window from the functional window. As a result we
will have a set of optimal matrices, one matrix for each
considered window.
Each matrix has its own sensitivity and specificity.
$$CC = \frac{(TP \cdot TN) - (FN \cdot FP)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$
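The correlation coefficient can be computed directly from the four confusion-matrix counts. A minimal sketch; returning 0 when the denominator vanishes is a convention assumed here, not something stated on the slide.

```python
import math

def correlation_coefficient(tp, tn, fp, fn):
    """CC = (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for degenerate counts
    return (tp * tn - fn * fp) / denom

print(correlation_coefficient(50, 50, 0, 0))    # perfect agreement
print(correlation_coefficient(25, 25, 25, 25))  # no better than chance
```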
Sensitivity (Sn) - percentage of experimentally confirmed sites
recognized by the respective matrix.
Specificity (Sp). To compare the specificity of two matrices we will
suppose that the majority of sites found by these matrices in the
randomly generated DNA sequences are false positives. If this is true,
the ratio of the occurrence frequencies found by the new and original
matrices is inversely proportional to the ratio of their specificities.
Therefore, we will consider the averaged occurrence frequency of sites
in the randomly generated sequences as a parameter describing the
specificity of the PWM.
4-row mononucleotide versus 16-row dinucleotide matrices
The majority of practically used PWMs are the 4-row mononucleotide
matrices based on the ‘additivity hypothesis’, which considers the
contributions from each position of the binding site as independent and
additive (Berg and von Hippel, 1987).
Some experimental evidence (Man and Stormo, 2001; Bulyk, M.,
Johnson, P., and Church, G., 2002) and theoretical considerations (Zhang and
Marr, 1993) show that a dinucleotide approach (accounting for dependence
between adjacent nucleotides of a TFBS) could in some cases be the more
appropriate approximation. Using the same methodology, we built the 16-row
dinucleotide matrices.
The limitations of small experimental datasets have convinced researchers to
use less accurate, but fairly reliable 4-row matrices (Benos, P., Bulyk, M., and
Stormo, G., 2002). There is no such limitation in our case since we use a
large set of putative sites.
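The difference between the two matrix types is easiest to see on count matrices built from the same aligned sites. A minimal sketch, assuming equal-length sites containing only A, C, G and T; the four example sites are made up for illustration.

```python
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(pair) for pair in product(BASES, repeat=2)]  # 16 rows

def mono_counts(sites):
    """4-row matrix: count of each base at each position."""
    width = len(sites[0])
    matrix = {base: [0] * width for base in BASES}
    for site in sites:
        for pos, base in enumerate(site):
            matrix[base][pos] += 1
    return matrix

def dinuc_counts(sites):
    """16-row matrix: count of each adjacent pair at each of width-1 positions."""
    width = len(sites[0]) - 1
    matrix = {pair: [0] * width for pair in DINUCS}
    for site in sites:
        for pos in range(width):
            matrix[site[pos:pos + 2]][pos] += 1
    return matrix

# Hypothetical aligned sites (not taken from any database).
sites = ["GGGCGG", "GGGCGG", "GGGTGG", "TGGCGG"]
mono = mono_counts(sites)
dinuc = dinuc_counts(sites)
print(mono["G"], dinuc["CG"])
```

The dinucleotide matrix has four times as many rows to estimate, which is why it needs a larger set of sites.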
The sensitivity/specificity ratios for the original and new matrices for GC-box.
Specificity - the averaged occurrence frequency of GC-box sites found by the
original matrix (circle at the left upper corner) and two sets of new 4-row (squares)
and 16-row (diamonds) matrices. The x-axis is sensitivity - the percentage of
recognized sites from a control set of experimentally defined sites.
Figure 3. The occurrence frequency distribution of the HMG1 sites. The rest as for Sp1.
Figure 4. The occurrence frequency distribution of the PAX2 sites. The rest as for Sp1.
Figure 5. The occurrence frequency distribution of the NRF2 sites. The rest as at Figure 2.
A pair of two closely positioned TF binding sites that acquire new
regulatory properties due to direct or indirect interactions between
corresponding transcription factors is called a composite element
(CE).
We performed clustering of putative binding sites predicted by the
MATCH program in the vicinity of putative binding sites for TF
STAT-1, as a case study. Clear over-representation of putative
binding sites was obtained for transcription factors AML-1a, AP-2,
CDX-a, c-Ets-1, c-Myb, c-REL, ELK-1, EN-1, GKLF, HSF-1,
HSF-2, IK-1, IK-2, IK-3, LYF-1, MSX-1, Myo-D, NF-AT, NF-κB,
NRF-2, Oct-1, P300, Pax-4, Pax-6, RFX-1, SRY, TST-1. On the
contrary, putative binding sites for GATA-1, MZF-1, and Sp1 were
clearly under-represented in that area. Although some of the
results might be a mere consequence of shared motifs for
respective binding sites, others warrant different interpretation
and may point to potential CEs.
Influence of variant histone
H2A.Z on local chromatin
dynamics
(In-depth chromatin analysis
by structural modeling)
Gaussian Network Model (Bahar et al., 1997)
• The dynamics of the interactions is controlled by the connectivity (or
Kirchhoff) matrix Γ, by analogy with the statistical mechanical theory of
elasticity originally developed by Flory and coworkers for polymer
networks.
• The elements of Γ are defined as
$$\Gamma_{ij} = \begin{cases} -1 & i \ne j,\ R_{ij} \le r_c \\ 0 & i \ne j,\ R_{ij} > r_c \\ -\sum_{k \ne i} \Gamma_{ik} & i = j \end{cases}$$
• Here $r_c$ is the cutoff distance defining the range of
interaction of residues, each residue being represented by
its α-carbon, and $R_{ij}$ is the distance between the ith and jth
residues.
• The value of $r_c$ = 7 Å includes the neighboring residues
located in the first coordination shell near a central
residue.
• Note that the columns (or rows) of Γ are interdependent (all
sum up to zero), and thus Γ cannot be inverted; instead it is
reconstructed after removal of its zero eigenvalue and
corresponding eigenvector.
Anisotropic Network Model (Atilgan et al., 2001)
Inhibitor binding alters the directions of motions in HIV-1 reverse transcriptase
"Anisotropy of fluctuation dynamics of proteins with an elastic network model" Atilgan, AR,
Durell, SR, Jernigan, RL, Demirel, MC, Keskin, O. & Bahar, I. Biophys. J. 80, 505-515,
2001.
• The anisotropic network model (ANM) is an extension of the GNM to the 3N-d space of
collective modes.
• The inter-residue 'distances' are controlled by harmonic potentials in the GNM; the ANM
adopts the further assumption that the three (x, y and z) components of the inter-residue
separation vectors obey Gaussian dynamics.
Γ is replaced by its 3N × 3N counterpart (1/γ)H, where H is the Hessian matrix of the
second derivatives of the intermolecular potential $V = (\gamma/2)\,\Delta R^T \Gamma\, \Delta R$.
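A connectivity (Kirchhoff) matrix of this kind can be built from α-carbon coordinates as sketched below. It assumes the standard GNM convention: an off-diagonal element is −1 for a residue pair within the cutoff and 0 otherwise, and each diagonal element equals the number of contacts of that residue. The four-residue straight chain is a made-up example.

```python
import math

def kirchhoff(coords, cutoff=7.0):
    """Connectivity matrix from a list of (x, y, z) alpha-carbon coordinates."""
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
    for i in range(n):
        # Diagonal = number of contacts, so that every row sums to zero
        gamma[i][i] = -sum(gamma[i][j] for j in range(n) if j != i)
    return gamma

# Hypothetical chain with 3.8 Angstrom spacing along the x axis.
chain = [(3.8 * k, 0.0, 0.0) for k in range(4)]
gamma = kirchhoff(chain)
for row in gamma:
    print(row)
```

Every row sums to zero, which is the singularity noted above: the matrix has a zero eigenvalue and must be pseudo-inverted rather than inverted.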
Molecular Biology of the Cell, 3rd edn. Part I. Introduction to the Cell
Chapter 2. Small Molecules, Energy, and Biosynthesis
The Chemical Components of a Cell
Panel 2-5: The 20 amino acids involved in the synthesis of proteins
Molecular Biology of the Cell, 3rd edn.
© 1994 by Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, and James D. Watson.
Part I. Introduction to the Cell
Chapter 2. Small Molecules, Energy, and Biosynthesis
The Chemical Components of a Cell
Panel 2-6: A survey of the major types of nucleotides and their derivatives encountered in cells
Going beyond:
• To other species (promoter-chromatin
architecture in Drosophila and Yeast).
• TF regulatory modules.
• Post-transcriptional regulation (RNAi).
• From sequence analysis to molecular
modeling and vice versa.
• Still beyond…
Acknowledgements
• Prof. Ed Trifonov (Weizmann
Institute / University of Haifa)
• Prof. Michael Q. Zhang
(Cold Spring Harbor Lab NY)
• Prof. Ivet Bahar
(University of Pittsburgh)
• Prof. Gary Stormo (Washington
University, St. Louis)
• Prof. Alex Bolshoy
(Weizmann Ins. /Haifa U.)
• Prof. Mark Borodovsky
(Georgia Institute of
Technology, Atlanta)
• K. Derenshteyn (GIT)
Ioshikhes’ group:
• Dr. Naum Gershenzon
• Dr. Li Wang
• Dr. Amutha
Ramaswamy
(Dept. Biomedical
Informatics, Ohio
State University)
Summary
“Do you see anything there?” …
“Just a suggestion, perhaps. But wait an instant!” He stood
upon a chair, and holding up the light in his left hand, he
curved his right arm over the broad hat and round the long
ringlets.
“Good heavens!” I cried in amazement.
The face of Stapleton had sprung out of the canvas.
“The fellow is a Baskerville – that is evident.”
Arthur Conan Doyle
“The Hound of the Baskervilles”