Functional annotation

Comparative genomics:
functional characterization
of new genes and regulatory interactions
using computer analysis
Mikhail Gelfand
Institute for Information Transmission Problems
(The Kharkevich Institute), RAS
Workshop at the Landau Institute of Theoretical Physics, RAS
September 27-28, 2007, Moscow
The genome is deciphered!
Is it?
To intercept a message does not mean to
understand it
Fragment of a genome (0.1% of E. coli)
A typical bacterial genome:
several million nucleotides
~600 through ~9,000 genes
(~90% of the genome encodes proteins)
Propaganda
[Figure: log-scale plot (100 to 10,000,000) of sequences in GenBank (~genes) and articles in PubMed (~experiments) by year, 1982-2000]
More propaganda
Most genes will never be studied in experiment
Even in E.coli: only 20-30 new genes per year
(hundreds are still uncharacterized)
• “Universally missing genes” – not a single known gene
for ~10% of the reactions of central metabolism, and no
genes for >40% of reactions overall.
• “Conserved hypothetical genes” (5-15% of any bacterial
genome) – essential, but of unknown function.
The local goal:
to characterize the genes
• What?
– function (rather, role)
• When?
– regulation (conditions)
• gene expression
• lifetime (mRNA, protein)
• Where?
– Localization
• Cellular/membrane/secreted
• How?
– Mechanism of action
• Specificity, regulation (biochemistry)
Propaganda-2: complete genomes
[Figure: bar chart of the number of complete genomes by year, 1995-2002 (vertical axis 0-90)]
2007: > 1200 bacterial genomes
The global goal:
to predict the organism’s
properties given its genome
(plus some additional information, e.g.
the initial state after cell division)
and “to understand” the
evolution of genomes/organisms
Haemophilus influenzae, 1995
Vibrio cholerae, 2000
The metabolic map, the bird’s view
Metabolic pathways,
the eagle’s view
A submap (metabolism of arginine and proline)
Approaches
• Similarity => homology (common origin)
• Homology => common function
• “The Pearson Principle” (after Karl Pearson):
important features are conserved
– functional sites in proteins
– regulatory (protein-binding) sites in DNA
– not necessarily sequences:
• structure of protein and RNA
• gene localization on chromosomes
• co-expression of genes
• Allows one to annotate 50-75% of genes in a
bacterial genome
• Necessary first step, may be automated (to
some extent)
… but not so simple
• Similarity ≠ homology
– Low complexity regions, unstructured domains,
transmembrane segments and other regions with
non-standard amino acid composition
• The need for correct similarity measures
– Does homology always follow from the structural
similarity?
• What is structural similarity?
How can it be measured?
• Convergent evolution of structures?
Independent emergence of folds?
• Homology ≠ same function
– What is «the same function»?
• Biochemical details and cellular role
“The Fermi principle”
(after Enrico Fermi)
Purely homology-based annotation: boring
(nothing radically new)
It turns out, one can predict something
completely new
Comparative genomics
Positional clustering
• Genes that are located in immediate proximity
tend to be involved in the same metabolic
pathway or functional subsystem
– caused by operon structure, but not only
• horizontal transfer of loci containing several functionally
linked operons
• compartmentalisation of products in the cytoplasm
– very weak evidence
• stronger if observed in many unrelated genomes
• May be measured
– e.g. the STRING database/server (P.Bork, EMBL)
– and other sources
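The idea that positional clustering "may be measured" can be illustrated with a minimal sketch. It is not the STRING scoring scheme: it only counts the genomes in which two gene families are chromosomal neighbors. The function name `neighbor_counts`, the `max_gap` parameter and the toy gene orders are assumptions made for illustration, and the chromosome is treated as linear.

```python
from collections import defaultdict

def neighbor_counts(genomes, max_gap=1):
    """Count, for each pair of gene families, the genomes where they are neighbors.

    genomes: dict mapping genome name -> list of gene-family labels in
             chromosomal order (hypothetical input format).
    max_gap: largest difference in gene index that still counts as "neighbors".
    """
    support = defaultdict(set)
    for name, order in genomes.items():
        for i, a in enumerate(order):
            for b in order[i + 1 : i + 1 + max_gap]:
                if a != b:
                    support[frozenset((a, b))].add(name)
    return {pair: len(names) for pair, names in support.items()}

# Toy, made-up gene orders (not real genome data).
genomes = {
    "genome1": ["trpA", "trpB", "trpC", "x1"],
    "genome2": ["x2", "trpB", "trpA", "x3"],
    "genome3": ["trpA", "x4", "trpB", "x1"],
}
counts = neighbor_counts(genomes)
print(counts[frozenset(("trpA", "trpB"))])
```

A pair seen side by side in many unrelated genomes would get a high count; a real score would also have to weight genomes by their evolutionary distance.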
STRING: trpB – positional clusters
Functionally dependent genes tend to cluster
on chromosomes in many different organisms
Vertical axis: number of gene pairs with association score exceeding a threshold.
Control: same graph, random re-labeling of vertices
More genomes (stronger links)
=> highly significant clustering
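The control described on this slide (same graph, random re-labeling of vertices) can be sketched as a permutation test. This is an illustrative sketch with made-up data: the function names, the toy graph and the functional labels are assumptions, not the analysis behind the slide.

```python
import random

def same_label_edges(edges, labels):
    """Count edges whose two endpoints carry the same functional label."""
    return sum(1 for a, b in edges if labels[a] == labels[b])

def relabeling_control(edges, labels, n_shuffles=1000, seed=0):
    """Compare the observed count with counts after random re-labeling of vertices."""
    rng = random.Random(seed)
    genes = list(labels)
    values = [labels[g] for g in genes]
    observed = same_label_edges(edges, labels)
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(values)  # keep the graph, permute the labels
        if same_label_edges(edges, dict(zip(genes, values))) >= observed:
            as_extreme += 1
    # Add-one estimate so the empirical p-value is never exactly zero.
    return observed, (as_extreme + 1) / (n_shuffles + 1)

# Toy graph: edges are gene pairs whose association score exceeds a threshold.
labels = {"g1": "trp", "g2": "trp", "g3": "trp", "g4": "his", "g5": "his", "g6": "his"}
edges = [("g1", "g2"), ("g2", "g3"), ("g4", "g5"), ("g5", "g6"), ("g3", "g4")]
observed, p_value = relabeling_control(edges, labels)
print(observed, p_value)
```

Shuffling the labels while keeping the edges fixed preserves the degree structure of the graph, so any excess of same-label edges reflects the labels and not the topology.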
Fusions
• If two (or more) proteins form a single
multidomain protein in some organism, they all
are likely to be tightly functionally related
• Very useful for the analysis of eukaryotes
• Sometimes useful for the analysis of
prokaryotes
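A minimal sketch of the fusion argument, assuming each protein has already been decomposed into domain families (the input format, the function name `fusion_links` and the family labels are hypothetical): two families are linked when they occur together in one multidomain protein.

```python
from itertools import combinations

def fusion_links(architectures):
    """Link families that co-occur within a single multidomain protein.

    architectures: dict protein id -> tuple of family labels of its domains.
    Returns dict frozenset({fam1, fam2}) -> list of proteins containing both.
    """
    links = {}
    for protein, families in architectures.items():
        for a, b in combinations(sorted(set(families)), 2):
            links.setdefault(frozenset((a, b)), []).append(protein)
    return links

# Toy data: famX and famY are separate proteins in orgB but fused in orgA.
architectures = {
    "orgA_p1": ("famX", "famY"),
    "orgB_p1": ("famX",),
    "orgB_p2": ("famY",),
    "orgC_p1": ("famZ",),
}
links = fusion_links(architectures)
print(links)
```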
STRING: trpB – fusions
Phyletic patterns
• Functionally linked genes tend to occur
together
• Enzymes with the same function (isozymes)
have complementary phyletic profiles
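Co-occurrence of two genes can be scored by comparing the sets of genomes in which each is present. The sketch below uses the Jaccard index as one possible measure; the choice of measure, the function name and the toy profiles are assumptions made for illustration.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two phyletic profiles given as sets of genomes."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy profiles: the genomes in which each gene is present.
profiles = {
    "geneA": {"g1", "g2", "g3"},
    "geneB": {"g1", "g2", "g3"},
    "geneC": {"g3", "g4"},
}
print(jaccard(profiles["geneA"], profiles["geneB"]))
print(jaccard(profiles["geneA"], profiles["geneC"]))
```

Functionally linked genes would be expected to score close to 1, while isozymes with complementary profiles would score close to 0.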
STRING: trpB – co-occurrence (phyletic patterns)
Phyletic patterns in the Phe/Tyr pathway
shikimate
kinase
Archaeal shikimate-kinase
Chorismate biosynthesis pathway (E. coli)
Arithmetics of phyletic patterns
Shikimate dehydrogenase (EC 1.1.1.25):
  AroE                 COG0169    aompkzyqvdrlbcefghsnuj-i-
5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19):
  AroA                 COG0128    aompkzyqvdrlbcefghsnuj-i-
Chorismate synthase (EC 4.2.3.5):
  AroC                 COG0082    aompkzyqvdrlbcefghsnuj-i--
Shikimate kinase (EC 2.7.1.71):
  Typical (AroK)       COG0703    ------yqvdrlbcefghsnuj-i-
  Archaeal-type        COG1685  + aompkz-------------------
  Two forms combined            = aompkzyqvdrlbcefghsnuj-i-
3-dehydroquinate dehydratase (EC 4.2.1.10):
  Class I (AroD)       COG0710    aompkzyq---lb-e----n---i-
  Class II (AroQ)      COG0757  + ------y-vdr-bcefghs-uj---
  Two forms combined            = aompkzyqvdrlbcefghsnuj-i-
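The "arithmetic" on this slide is a position-wise union of phyletic patterns. The sketch below reproduces it for the two shikimate kinase forms and the two 3-dehydroquinate dehydratase classes, using the pattern strings as they appear in this transcript (25 characters each, with `-` meaning "absent"). The function names are assumptions.

```python
def present(pattern):
    """Set of genome letters present in a phyletic pattern ('-' means absent)."""
    return {c for c in pattern if c != "-"}

def combine(p, q):
    """Position-wise union of two equal-length phyletic patterns."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same length")
    return "".join(a if a != "-" else b for a, b in zip(p, q))

# Patterns copied from the slide text.
arok = "------yqvdrlbcefghsnuj-i-"      # typical shikimate kinase, COG0703
archaeal = "aompkz" + "-" * 19          # archaeal-type shikimate kinase, COG1685
arod = "aompkzyq---lb-e----n---i-"      # dehydroquinate dehydratase class I, COG0710
aroq = "------y-vdr-bcefghs-uj---"      # dehydroquinate dehydratase class II, COG0757

print(combine(arok, archaeal))
print(combine(arod, aroq))
print(sorted(present(arod) & present(aroq)))  # genomes carrying both classes
```

Both sums give the pattern shown for the other enzymes of the pathway. The two kinase forms do not overlap at all, while the two dehydratase classes coexist in a few genomes.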
Distribution of association scores:
monotonic for subunits,
bimodal for isozymes
Comparative analysis of regulation
• Phylogenetic footprinting: regulatory sites
are more conserved than non-coding
regions in general and are often seen as
conserved islands in alignments of gene
upstream regions
• Consistency filtering: regulons (sets of coregulated genes) are conserved =>
– true sites occur upstream of orthologous genes
– false sites are scattered at random
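Consistency filtering can be sketched as follows: a candidate site is kept only if candidate sites are also found upstream of orthologous genes in several genomes. The input format, the function name `consistency_filter`, the `min_genomes` threshold and the toy identifiers are all assumptions for illustration.

```python
from collections import defaultdict

def consistency_filter(candidates, ortholog_group, min_genomes=2):
    """Keep ortholog groups with candidate sites upstream in enough genomes.

    candidates:     dict genome -> set of genes with a candidate site upstream.
    ortholog_group: dict (genome, gene) -> ortholog group id.
    Returns dict group id -> sorted list of supporting genomes.
    """
    support = defaultdict(set)
    for genome, genes in candidates.items():
        for gene in genes:
            group = ortholog_group.get((genome, gene))
            if group is not None:
                support[group].add(genome)
    return {g: sorted(gs) for g, gs in support.items() if len(gs) >= min_genomes}

# Toy data: sites upstream of famR orthologs recur; the other hits are isolated.
candidates = {"gA": {"a1", "a7"}, "gB": {"b1", "b9"}, "gC": {"c1"}}
ortholog_group = {
    ("gA", "a1"): "famR", ("gB", "b1"): "famR", ("gC", "c1"): "famR",
    ("gA", "a7"): "fam7", ("gB", "b9"): "fam9",
}
kept = consistency_filter(candidates, ortholog_group)
print(kept)
```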
Riboflavin (vitamin B2)
biosynthesis pathway
[Figure: pathway diagram.
Inputs: GTP (from the purine biosynthesis pathway) and ribulose-5-phosphate (from the pentose-phosphate pathway).
Enzymes: GTP cyclohydrolase II; pyrimidine deaminase; pyrimidine reductase; 3,4-DHBP synthase; riboflavin synthase (two chains).
Intermediates: 2,5-diamino-6-hydroxy-4-(5'-phosphoribosylamino)pyrimidine; 5-amino-6-(5'-phosphoribosylamino)uracil; 5-amino-6-(5'-phosphoribitylamino)uracil; 3,4-dihydroxy-2-butanone-4-phosphate; 6,7-dimethyl-8-ribityllumazine.
Product: riboflavin.
Gene labels: ribA, ribB, ribD, ribE, ribG, ribH, ypaA.]
5’ UTR regions of riboflavin genes from bacteria
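Genomes, one alignment row each, in the order of the rows below:
BS, BQ, BE, HD, Bam, CA, DF, SA, LLX, PN, TM, DR, TQ, AO, DU, CAU, FN, TFU, SX, BU,
BPS, REU, RSO, EC, TY, KP, HI, VK, VC, YP, AB, BP, AC, Spu, PP, AU, PU, PY, PA, MLO,
SM, BME, BS, BQ, BE, CA, DF, EF, LLX, LO, PN, ST, MN, SA, AMI, DHA, FN, GLU
Block 1: stems 1, 2, 2', 3 (=========> ==> <== ===>)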
TTGTATCTTCGGGG-CAGGGTGGAAATCCCGACCGGCGGT
AGCATCCTTCGGGG-TCGGGTGAAATTCCCAACCGGCGGT
TGCATCCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT
TTTATCCTTCGGGG-CTGGGTGGAAATCCCGACCGGCGGT
TGTATCCTTCGGGG-CTGGGTGAAAATCCCGACCGGCGGT
GATGTTCTTCAGGG-ATGGGTGAAATTCCCAATCGGCGGT
CTTAATCTTCGGGG-TAGGGTGAAATTCCCAATCGGCGGT
TAATTCTTTCGGGG-CAGGGTGAAATTCCCAACCGGCAGT
ATAAATCTTCAGGG-CAGGGTGTAATTCCCTACCGGCGGT
AACTATCTTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT
AAACGCTCTCGGGG-CAGGGTGGAATTCCCGACCGGCGGT
GACCTCTTTCGGGG-CGGGGCGAAATTCCCCACCGGCGGT
CACCTCCTTCGGGG-CGGGGTGGAAGTCCCCACCGGCGGT
AATAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGCGGT
TTTAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGTGGT
GAAGACCTTCGGGG-CAAGGTGAAATTCCTGATCGGCGGT
TAAAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGTGGT
ACGCGTGCTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT
-AGCGCACTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT
GTGCGTCTTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT
GTGCGTCTTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
TTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT
GTACGTCTTCAGGG-CGGGGTGGAATTCCCCACCGGCGGT
GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
TCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT
GCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT
CAATATTCTCAGGG-CGGGGCGAAATTCCCCACCGGTGGT
GCTTATTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT
GCGCATTCTCAGGG-CAGGGTGAAAGTCCCTACCGGTGGT
GTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT
ACATCGCTTCAGGG-CGGGGCGTAATTCCCCACCGGCGGT
AACAATTCTCAGGG-CGGGGTGAAACTCCCCACCGGCGGT
GTCGGTCTTCAGGG-CGGGGTGTAAGTCCCCACCGGCGGT
GGTTGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT
AAACGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT
TAACGTTCTCAGGG-CGGGGTGCAACTCCCCACCGGCGGT
TAACGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT
TAAAGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT
AAGCGTTCTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT
GCTTGTTCTCGGGG-CGGGGTGAAACTCCCCACCGGCGGT
ATCAATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT
GTCTATCTTCGGGG-CAGGGTGAAAATCCCGACCGGCGGT
ATTCATCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT
AATGATCTTCAGGG-CAGGGTGAAATTCCCTACCGGCGGT
GAAGATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT
GTTCGTCTTCAGGGGCAGGGTGTAATTCCCGACCGGTGGT
AAATATCTTCAGGG-CACCGTGTAATTCGGGACCGGCGGT
GTTCATCTTCGGGG-CAGGGTGCAATTCCCGACCGGTGGT
AAGAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGCGGT
AAGTGTCTTCAGGG-CAGGGTGTGATTCCCGACCGGCGGT
AAGTGTCTTCAGGG-CAGGGTGAGATTCCCGACCGGCGGT
ATTCATCTTCGGGG-TCGGGTGTAATTCCCAACCGGCAGT
TCACAGTTTCAGGG-CGGGGTGCAATTCCCCACTGGCGGT
ACGAACCTTCGAGG-TAGGGTGAAATTCCCGACCGGCGGT
AATAATCTTCGGGG-CAGGGTGAAATTCCCGACCGGTGGT
---TGTTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
Block 2: Add., 3' (-><<===)
21 AGCCCGTGAC-
19 AGTCCGTGAC-
20 AGCCCGCGA--
19 AGTCCGTGAC-
23 AGCCCGTGAC-
2 AGCCCGCAA--
2 AGCCCGCG---
6 AGCCTGCGAC-
2 AGCCCGCGA--
2 AGCCCACGA--
3 AGCCCGCGAG-
15 AGCCCGCGAA-
3 AGCCCGCGAA-
2 AGTCCGCGA--
2 AGTCCGCGA--
20 AGCCCGCGA--
2 AGTCCACG---
3 AGTCCGCGAC-
3 AGTCCGCGAC-
30 AGCCCGCGAGCG
21 AGCCCGCGAGCG
31 AGCCCGCGAGCG
21 AGCCCGCGAGCG
17 AGCCCGCGAGCG
67 AGCCCGCGAGCG
20 AGCCCGCGAGCG
2 AGCCCACGAGCG
14 AGCCCACGAGCG
13 AGCCCACGAGCG
40 AGCCCGCGAGCG
25 AGCCCACGAGCG
18 AGCCCGCGAGCG
16 AGCCCGCGAGCA
34 AGCCCGCGAGCG
13 AGCCCGCGAGCG
17 AGCCCGCGAGCG
19 AGCCCGCGAGCG
19 AGCCCGCGAGCG
19 AGCCCGCGAGCG
16 AGCCCGCGAGCG
34 AGCCCGCGAGCG
17 AGCCCGCGAGCG
18 AGCCCGCGA--
27 AGCCCGCGA--
20 AGCCCGCGA--
2 AGCCCGCGAG-
2 AGCCCGCG---
3 AGTCCACGAC-
21 ACTCCGCGAT-
3 AGTCCACGAT-
125 AGTCCGTG---
14 AGTCCGCG---
104 AGTCCGCG---
6 AGCCTGCGAC-
14 AGCCCGCGC--
20 AGCCCGCAAC-
2 AGTCCACG---
28 AGCCCGCGAGCG
Block 3: Variable, 4, 4', 5, 5', 1' (-> <====> <==== ==> <== <=========)
8 4 8 -----TGGATTCAGTTTAA-GCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAT
8 5 8 -----TGGATCTAGTGAAACTCTAGGGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATATG
3 4 3 -----AGGATCCGGTGCGATTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATGCC
10 4 10 ----–TGGACCTGGTGAAAATCCGGGACCGACAGTGAA-AGTCTGGAT-GGGAGAAGGAAACG
8 4 8 ----–TGGATTCAGTGAAAAGCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAG
3 4 3 ------AGATCCGGTTAAACTCCGGGGCCGACAGTTAA-AGTCTGGAT-GAAAGAAGAAATAG
7 6 7 --------ATTTGGTTAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GGAAGAAGATATTT
11 3 11 ----–CTGATCTAGTGAGATTCTAGAGCCGACAGTTAA-AGTCTGGAT-GGGAGAAAGAATGT
4 4 4 -----ATGATTCGGTGAAACTCCGAGGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAATA
3 4 3 -----ATGATTTGGTGAAATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAAAA
5 4 5 ----–TTGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAGAGCGTGA
8 12 9 ----–CCGATGCCGCGCAACTCGGCAGCCGACGGTCAC-AGTCCGGAC-GAAAGAAGGAGGAG
5 4 5 -----CCGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAAGGAGGGC
7 7 7 -----AGGAACCGGTGAGATTCCGGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGATGAAA
13 4 12 -----AGGAACTAGTGAAATTCTAGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGAGCAGA
3 4 3 -----AGGACCCGGTGTGATTCCGGGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTCGGC
5 4 5 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GGGAGAAGAATTAG
8 5 8 -----TGGAACCGGTGAAACTCCGGTACCGACGGTGAA-AGTCCGGAT-GGGAGGTAGTACGTG
8 5 8 -----TTGACCAGGTGAAATTCCTGGACCGACGGTTAA-AGTCCGGAT-GGGAGGCAGTGCGCG
137
GTCAGCAGATCTGGTGAGAAGCCAGAGCCGACGGTTAG-AGTCCGGAT-GGAAGAAGATGTGC
8 4 8 GTCAGCAGATCTGGTCCGATGCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGATGTGC
7 5 7 GTCAGCAGATCTGGTGAGAGGCCAGGGCCGACGGTTAA-AGTCCGGAT-GAAAGAAGATGGGC
11 3 11 GTCAGCAGATCCGGTGAGATGCCGGGGCCGACGGTCAG-AGTCCGGAT-GGAAGAAGATGTGC
8 4 8 GACAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAG-AGTCCGGAT-GGGAGAGAGTAACG
8 3 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGGGTAACG
8 4 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGAGTAACG
26 9 30 GTCAGCAGATTTGGTGAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAAAGAGAATAAAA
11 9 11 GTCAGCAGATTTGGTGAGAATCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAGAATAAGC
5 4 5 GTCAGCAGATCTGGTGAGAAGCCAGGGCCGACGGTTAC-AGTCCGGAT-GAGAGAGAATGACA
16 6 16 GTCAGCAGACCCGGTGTAATTCCGGGGCCGACGGTTAT-AGTCCGGAT-GGGAGAGAGTAACG
16 4 27 GTCAGCAGATTTGGTGCGAATCCAAAGCCGACAGTGAC-AGTCTGGAT-GAAAGAGAATAAAA
10 4 10 GTCAGCAGACCTGGTGAGATGCCAGGGCCGACGGTCAT-AGTCCGGAT-GAGAGAAGATGTGC
10 3 11 ---CGCAGATCTGGTGTAAATCCAGAGCCGACGGT-AT-AGTCCGGAT-GAAAGAAGACGACG
6 6 6 GTCAGCAGATCTGGTG 52 TCCAGAGCCGACGGT 31 AGTCCGGAT-GGAAGAGAATGTAA
7 3 7 GTCAGCAGATCTGGTGCAACTCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGGCGTCA
7 9 7 GTCAGCAGATCCGGTGAGAGGCCGGAGCCGACGGT-AT-AGTCCGGAT-GGAAGAGGACAAGG
19 4 18 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAC-AGTCCGGATGAAGAGAGAACGGGA
15 4 16 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAT-AGTCCGGATGAAGAGAGAGCGGGA
14 4 13 GTCAGCAGACCCGGTGCGATTCCGGGGCCGACGGTCAT-AGTCCGGATAAAGAGAGAACGGGA
8 5 8 GTCAGCAGATCCGGTGTGATTCCGGAGCCGACGGTTAG-AGTCCGGAT-GAAAGAGGACGAAA
8 3 8 GTCAGCAGATCCGGTCGAATTCCGGAGCCGACGGTTAT-AGTCCGGAT-GGAAGAGAGCAAGC
10 15 10 GTCAGCAGATCCGGTGAGATGCCGGAGCCGACGGTTAA-AGTCCGGAT-GGAAGAGAGCGAAT
5 4 5 -----AGGATTCGGTGAGATTCCGGAGCCGACAGT-AC-AGTCTGGAT-GGGAGAAGATGGAG
3 5 3 -----AGGATTTGGTGTGATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
3 4 3 -----AGGATCCGGTGCGAGTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGAAG
3 4 3 ----TATGATCCGGTTTGATTCCGGAGCCGACAGT-AA-AGTCTGGAT-GAAAGAAGATATAT
6 4 6 -------GATTTGGTGAGATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAGAGAAGATATTT
5 3 5 ----ATTGAATTGGTGTAATTCCAATACCGACAGT-AT-AGTCTGGAT—-AAAGAAGATAGGG
4 4 4 ----–TTGAAGCAGTGAGAATCTGCTAGCGACAGT-AA-AGTCTGGAT-GGAAGAAGATGAAC
3 10 3 ----TTGACTCTGGTGTAATTCCAGGACCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGTTG
3 4 3 -------GATGTGGTGAGATTCCACAACCGACAGT-AT-AGTCTGGAT-GGGAGAAGACGAAA
3 4 3 -------GATGTGGTGTAACTCCACAACCGACAGT-AT-AGTCTGGAT-GAGAGAAGACCGGG
3 4 3 -------GATGTGGTGAAATTCCACAACCGACAGT-AA-AGTCTGGAT-GGGAGAAGACTGAG
11 3 11 ----–CTGATCTAGTGAGATTCTAGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
5 5 5 ------TGATCTGGTGCAAATCCAGAGCCAACGGT-AT-AGTCCGGAT-GGAAGAAACGGAGC
11 4 11 --CGACTGACTTGGTGAGACTCCAAGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTACAA
4 6 4 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GAGAGAAGAAAAGA
10 4 10 GTCAGCAGATCCGGTTAAATTCCGGAGCCGACGGTCAT-AGTCCGGAT-GCAAGAGAACC---
Conserved secondary structure of the RFN element
[Figure: consensus secondary structure of the RFN element, drawn from 5' to 3', with base-paired stems 1 to 5, an additional stem-loop and a variable stem-loop]
Capitals: invariant (absolutely conserved) positions.
Lower case letters: strongly conserved positions.
Dashes and stars: obligatory and facultative base pairs.
Degenerate positions: R = A or G; Y = C or U;
K = G or U; B = not A; V = not U.
N: any nucleotide. X: any nucleotide or deletion
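The degenerate codes in this legend translate directly into a regular expression, which gives a simple way to scan a sequence for a consensus motif. The sketch below is an assumption-laden illustration: it treats lower-case (strongly conserved) letters as strictly as capitals, ignores base pairing, and uses a made-up motif `CRGNY` and not the RFN consensus itself.

```python
import re

# Degenerate codes as defined in the legend (RNA alphabet).
CODES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "K": "[GU]",
    "B": "[CGU]",    # not A
    "V": "[ACG]",    # not U
    "N": "[ACGU]",   # any nucleotide
    "X": "[ACGU]?",  # any nucleotide or deletion
}

def consensus_to_regex(consensus):
    """Compile a consensus string into a regex (case is ignored in this sketch)."""
    return re.compile("".join(CODES[c] for c in consensus.upper()))

def scan(consensus, sequence):
    """Return start positions of non-overlapping matches; DNA input is read as RNA."""
    rna = sequence.upper().replace("T", "U")
    return [m.start() for m in consensus_to_regex(consensus).finditer(rna)]

# Hypothetical motif and sequence, for illustration only.
hits = scan("CRGNY", "AACAGTCGG")
print(hits)
```

A sequence-only scan like this misses the point that riboswitch conservation is largely structural; a realistic search would also check that the stems can pair.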
RFN: the mechanism of regulation
• Transcription attenuation
• Translation attenuation
Early observation: an uncharacterized gene (ypaA)
with an upstream RFN element
Phylogenetic tree of RFN-elements
(regulation of riboflavin biosynthesis)
[Figure: phylogenetic tree; marked on the tree: duplications, and lineages with no riboflavin biosynthesis]
YpaA a.k.a. RibU: riboflavin transporter
in Gram-positive bacteria
• 5 predicted transmembrane segments => a transporter
• Upstream RFN element (likely co-regulation with
riboflavin genes) => transport of riboflavin or a
precursor
• S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin
pathway => transport of riboflavin
Prediction: YpaA is a riboflavin transporter (Gelfand et al.,
1999)
Validation:
• YpaA transports flavins (riboflavin, FMN, FAD): by
genetic analysis (Kreneva et al., 2000) and by direct
measurement (Burgess et al., 2006; Vogl et al., 2007)
• ypaA is regulated by riboflavin: by microarray
expression study (Lee et al., 2001)
• … via attenuation of transcription (and to some extent
inhibition of translation) (Winkler et al., 2003)
Conserved structures of riboswitches
(circled: X-ray)
[Figure: consensus secondary structures of the RFN-element, THI-element, B12-element, LYS-element, S-box and G-box riboswitches. Each structure is drawn from its base stem (5' and 3' ends) with numbered helices (P1, P2, ...); additional (Add, Add I, Add II, Add III) and variable (Var) stem-loops are marked.]
Mechanisms
[Figure: three panels (A, B, C), each showing an RNA element in the 5' region upstream of the genes, with (+ Effector) and without (- Effector) the ligand. In panel A, segments 1, 2 and 3 fold either into the antiterminator/antisequestor or into the regulatory hairpin (terminator of transcription and/or RBS-sequestor). The terminator, followed by UUUUUUUU, applies in the case of regulation of transcription; the RBS-sequestor applies in the case of regulation of translation. Panel C also shows a regulatory hairpin formed by segments 1 and 2.]
Side notes:
glmS: ribozyme, cleaves its mRNA (the Breaker group)
THI-box in plants: inhibition of splicing (the Breaker and Hanamoto groups)
Characterized riboswitches (more are predicted)
RFN
  Function: riboflavin biosynthesis and transport
  Effector: FMN (flavin mononucleotide)
  Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, other bacteria
THI
  Function: biosynthesis and transport of thiamin and related compounds
  Effector: TPP (thiamin pyrophosphate)
  Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, other bacteria, archaea (thermoplasmas), plants, fungi
B12
  Function: biosynthesis of cobalamin, transport of cobalt, cobalamin-dependent enzymes
  Effector: coenzyme B12 (adenosylcobalamin)
  Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, spirochaetes, other bacteria
S-box, SAM-II, SAM-III
  Function: metabolism of methionine and cysteine
  Effector: SAM (S-adenosylmethionine)
  Distribution: Bacillus/Clostridium group and some other bacteria; SAM-II (alpha), SAM-III (Streptococci)
LYS
  Function: lysine metabolism
  Effector: lysine
  Distribution: Bacillus/Clostridium group, enterobacteria, other bacteria
G-box
  Function: metabolism of purines
  Effector: purines
  Distribution: Bacillus/Clostridium group and some other bacteria
glmS (ribozyme)
  Function: synthesis of glucosamine-6-phosphate
  Effector: glucosamine-6-phosphate
  Distribution: Bacillus/Clostridium group
gcvT (tandem)
  Function: catabolism of glycine
  Effector: glycine
  Distribution: Bacillus/Clostridium group
Properties of riboswitches
• Direct binding of ligands
• High conservation
– Including “unpaired” regions: tertiary interactions, ligand binding
• Same structure – different mechanisms:
transcription, translation, splicing, (RNA cleavage)
• Distribution in all taxonomic groups
– diverse bacteria
– archaea: thermoplasmas
– eukaryotes: plants and fungi
• Correlation of the mechanism and taxonomy:
– attenuation of transcription (anti-anti-terminator) – Bacillus/Clostridium group
– attenuation of translation (anti-anti-sequestor of translation initiation) –
proteobacteria
– attenuation of translation (direct sequestor of translation initiation) –
actinobacteria
• Evolution: horizontal transfer, duplications, lineage-specific loss
• Sometimes very narrow distribution: evolution from scratch?
Conserved signal upstream of nrd genes
Identification of the candidate regulator by
the analysis of phyletic patterns
COG1327: the only COG with exactly the
same phylogenetic pattern as the signal
– “large scale” on the level of major taxa
– “small scale” within major taxa:
• absent in small parasites among alpha- and gammaproteobacteria
• absent in Desulfovibrio spp. among delta-proteobacteria
• absent in Nostoc sp. among cyanobacteria
• absent in Oenococcus and Leuconostoc among Firmicutes
• present only in Treponema denticola among four
spirochetes
COG1327 “Predicted transcriptional regulator, consists
of a Zn-ribbon and ATP-cone domains”:
regulator of the nrd (ribonucleotide reductase) genes (NrdR)?
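The search for a regulator whose phyletic pattern matches that of the signal can be sketched as a ranking of protein families by the number of genomes in which family and signal disagree. The function name and the toy profiles (`cog_a`, `cog_b`, `cog_c`, genomes `g1` to `g4`) are hypothetical; real input would be the presence/absence of each COG and of the upstream signal across sequenced genomes.

```python
def profile_mismatches(signal_genomes, cog_profiles):
    """Rank families by the number of genomes where family and signal disagree.

    signal_genomes: set of genomes in which the conserved signal is found.
    cog_profiles:   dict family id -> set of genomes containing that family.
    Returns a sorted list of (mismatch count, family id).
    """
    target = set(signal_genomes)
    return sorted((len(target ^ set(genomes)), cog) for cog, genomes in cog_profiles.items())

# Toy data, for illustration only.
signal = {"g1", "g2", "g4"}
cog_profiles = {
    "cog_a": {"g1", "g2", "g4"},
    "cog_b": {"g1", "g2", "g3", "g4"},
    "cog_c": {"g3"},
}
ranking = profile_mismatches(signal, cog_profiles)
exact = [cog for mismatches, cog in ranking if mismatches == 0]
print(ranking)
print(exact)
```

An exact match is the strongest evidence, but ranking by mismatches also surfaces near matches that differ only because of annotation gaps.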
Additional evidence: co-localization
nrdR is sometimes clustered with nrd genes or with replication genes (dnaB, dnaI, polA)
Additional evidence:
co-regulated genes
In some genomes, candidate NrdR-binding sites are found upstream of other replication-related genes:
– dNTP salvage
– topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II
Multiple sites (nrd genes): FNR, DnaA, NrdR
Mode of regulation
• Repressor (overlaps with promoters)
• Co-operative binding:
– most sites occur in tandem (> 90% of cases)
– the distance between the copies (centers
of palindromes) equals an integer number of
DNA turns:
• mainly (94%) 30-33 bp, in 84% 31-32 bp – 3
turns
• 21 bp (2 turns) in Vibrio spp.
• 41-42 bp (4 turns) in some Firmicutes
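The "integer number of DNA turns" argument is simple arithmetic, sketched below under the assumption of about 10.5 bp per helical turn for B-form DNA (this constant is not given on the slide). The function name `turns` is also an assumption.

```python
BP_PER_TURN = 10.5  # assumption: approximate helical repeat of B-form DNA

def turns(distance_bp):
    """Return the nearest whole number of turns and the offset from it in bp."""
    n = round(distance_bp / BP_PER_TURN)
    return n, distance_bp - n * BP_PER_TURN

# Distances between palindrome centers quoted on the slide.
for d in (21, 31, 32, 41, 42):
    n, offset = turns(d)
    print(f"{d} bp -> {n} turns, offset {offset:+.1f} bp")
```

All the quoted distances fall within about 1 bp of a whole number of turns, which places the two sites on the same face of the helix, as expected for co-operative binding.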
Experimental validations
Acknowledgements
• Dmitry Rodionov (comparative genomics)
• Andrei Mironov (software)
• Alexei Vitreschak (riboswitches)
• Funding:
– Howard Hughes Medical Institute
– Russian Foundation for Basic Research
– RAS, program “Molecular and Cellular Biology”
– INTAS