in silico Mikhail Gelfand AlBio06, Moscow, July 2006 Research and Training Center “Bioinformatics”,

Molecular biology in silico
Mikhail Gelfand
Research and Training Center “Bioinformatics”,
Institute for Information Transmission Problems, RAS
AlBio06, Moscow, July 2006
Propaganda
Figure: papers (experiments; red) and sequence fragments (blue) by year, 1982-2000; vertical axis from 100 to 10,000,000.
Figure: complete genomes by year, 1995-2002.
GOLD database (March 2006):
361 complete genomes
Incomplete (in progress):
952 bacteria
58 archaea
607 eukaryotes (incl. ESTs)
46 metagenomes
More propaganda
Most genes will never be studied experimentally
Even in E. coli: only 20-30 new genes per year
(hundreds are still uncharacterized)
Bioinformatics = molecular biology in silico
• ~2% of all recent papers in biological journals
• Essential component of biological research
• Makes predictions about function and regulation of genes
(many quite reliable!)
• Metabolic reconstruction and prediction of phenotype from the
genome
• Identifies really interesting cases, fills gaps in knowledge
– “Universally missing genes”: not a single known gene even for
~10% of reactions of central metabolism. No genes for >40% of reactions
overall
– “Conserved hypothetical genes” (5-15% of any bacterial genome):
essential, but of unknown function
Haemophilus influenzae, 1995
Vibrio cholerae, 2000
How?
Similarity to known proteins
• Useful for many purposes (allows one to
annotate 50-75% of the genes in a bacterial
genome)
• Necessary first step
• May be automated
– … to some extent …
– in particular, care is needed to avoid overly specific
predictions
– Problem: propagation of annotation errors
• Boring (nothing new)
Noradrenaline transporter in an archaeon?
SOURCE      Methanococcus jannaschii.
  ORGANISM  Methanococcus jannaschii
            Archaea; Euryarchaeota; Methanococcales; Methanococcaceae;
            Methanococcus.
FEATURES             Location/Qualifiers
     source          1..492
                     /organism="Methanococcus jannaschii"
                     /db_xref="taxon:2190"
     Protein         1..492
                     /product="sodium-dependent noradrenaline transporter"
     CDS             1..492
                     /gene="MJ1319"
                     /note="similar to EGAD:HI0736 percent identity: 38.5;
                     identified by sequence similarity; putative"
                     /coded_by="U67572:71..1549"
                     /transl_table=11
Now corrected:
Hypothetical sodium-dependent transporter MJ1319.
Similarity to hypothetical proteins:
somebody else’s errors…
The correct annotation
Genes with curious functional assignments
• C75604: Probable head morphogenesis
protein, Deinococcus radiodurans
• O05360: Automembrane protein H, Yersinia
enterocolitica
• Q8TID9: Benzodiazepine (valium) receptor
TspO, Methanosarcina acetivorans
• NP_069403: DR-beta chain MHC class II,
Archaeoglobus fulgidus
Errors in experimental papers
SwissProt:
DEFINITION  Hypothetical 43.6 kDa protein.
ACCESSION   P48012
...
KEYWORDS    Hypothetical protein.
SOURCE      Debaryomyces occidentalis
  ORGANISM  Debaryomyces occidentalis
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Debaryomyces.
[CAUTION] Was originally (Ref.1) thought to be
3-isopropylmalate dehydrogenase (LEU2).
PIR:
DEFINITION  3-isopropylmalate dehydrogenase (EC 1.1.1.85)
            - yeast (Schwanniomyces occidentalis).
ACCESSION   S55845
KEYWORDS    oxidoreductase.
SwissProt entry DSDX_ECOLI
-!- CAUTION: An ORF called dsdC was
originally (Ref.3) assigned to the wrong
DNA strand and thought to be a D-serine
deaminase activator, it was then
resequenced by Ref.2 and still thought to
be "dsdC", but this time to function as a
D-serine permease. It is Ref.1 that showed
that dsdC is another gene and that this
sequence should be called dsdX. It should
also be noted that the C-terminal part of
dsdX (from 338 onward) was also sequenced
(Ref.6 and Ref.7) and was thought to be a
separate ORF (don't worry, we also had
difficulties understanding what happened!).
Positional clustering
• Genes that are located in immediate proximity
tend to be involved in the same metabolic
pathway or functional subsystem
– mainly in prokaryotes, very weak in eukaryotes
– caused by operon structure, but not only
• horizontal transfer of loci containing several functionally linked
operons
• compartmentalisation of products in the cytoplasm
– very weak evidence
• stronger if observed in many unrelated genomes
• May be measured
– e.g. the STRING database/server (P.Bork, EMBL)
– and other sources
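One simple way to measure positional clustering is to count, for every pair of genes, the number of genomes in which the two genes are neighbors. The Python sketch below does only that; it is an illustration, not the scoring scheme used by STRING. The genome names and gene orders are toy data, genes are assumed to be already named by ortholog group, chromosomes are treated as linear, and strand is ignored. A realistic score would also down-weight closely related genomes, since the evidence is stronger when the genomes are unrelated.

```python
from collections import Counter


def neighbor_pairs(gene_order, max_gap=1):
    """Return the unordered gene pairs lying at most max_gap positions apart."""
    pairs = set()
    for i, gene in enumerate(gene_order):
        for other in gene_order[i + 1 : i + 1 + max_gap]:
            if other != gene:
                pairs.add(frozenset((gene, other)))
    return pairs


def clustering_support(genomes, max_gap=1):
    """Count, for every gene pair, the genomes in which the pair is clustered."""
    support = Counter()
    for gene_order in genomes.values():
        # Each genome contributes at most once per pair.
        support.update(neighbor_pairs(gene_order, max_gap))
    return support


# Toy gene orders (hypothetical), one list per genome.
genomes = {
    "genome_A": ["trpA", "trpB", "trpC", "xyz1"],
    "genome_B": ["abc1", "trpB", "trpA", "trpC"],
    "genome_C": ["trpC", "abc1", "trpA", "trpB"],
}
support = clustering_support(genomes)
print(support[frozenset(("trpA", "trpB"))])  # 3
print(support[frozenset(("trpB", "abc1"))])  # 1
```

Pairs supported by many genomes are the candidates for a functional link; pairs seen in a single genome are the "very weak evidence" case.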
STRING: trpB, positional clusters
Functionally dependent genes tend to cluster on
chromosomes in many different organisms
Vertical axis: number of gene pairs with association score exceeding a threshold.
Control: same graph, random re-labeling of vertices
More genomes (stronger links)
=> highly significant clustering
Especially in linear pathways (right)
Fusions
• If two (or more) proteins form a single
multidomain protein in some organism, they
all are likely to be tightly functionally related
• Very useful for the analysis of eukaryotes
• Sometimes useful for the analysis of
prokaryotes
STRING: trpB, fusions
Phyletic patterns
• Functionally linked genes tend to occur
together
• Enzymes with the same function (isozymes)
have complementary phyletic profiles
STRING: trpB, co-occurrence (phyletic profiles)
Phyletic profiles in the Phe/Tyr pathway
shikimate
kinase
Archaeal shikimate-kinase
Chorismate biosynthesis pathway (E. coli)
Arithmetics of phyletic patterns
Shikimate dehydrogenase (EC 1.1.1.25):
  AroE                 COG0169    aompkzyqvdrlbcefghsnuj-i-
5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19):
  AroA                 COG0128    aompkzyqvdrlbcefghsnuj-i-
Chorismate synthase (EC 4.2.3.5):
  AroC                 COG0082    aompkzyqvdrlbcefghsnuj-i-
Shikimate kinase (EC 2.7.1.71):
  Typical (AroK)       COG0703    ------yqvdrlbcefghsnuj-i-
  Archaeal-type        COG1685    aompkz-------------------
  Two forms combined            + aompkzyqvdrlbcefghsnuj-i-
3-dehydroquinate dehydratase (EC 4.2.1.10):
  Class I (AroD)       COG0710    aompkzyq---lb-e----n---i-
  Class II (AroQ)      COG0757    ------y-vdr-bcefghs-uj---
  Two forms combined            + aompkzyqvdrlbcefghsnuj-i-
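The "arithmetic" on this slide is a position-wise union of phyletic patterns: each position stands for one genome, a letter means the gene is present, and a dash means it is absent. The Python sketch below reproduces the two sums using the pattern strings from the slide; the function names `combine` and `overlap` are illustrative choices, not part of any COG tool.

```python
def combine(*patterns):
    """Position-wise union of COG-style phyletic patterns ('-' means absent)."""
    length = len(patterns[0])
    if any(len(p) != length for p in patterns):
        raise ValueError("patterns must have equal length")
    combined = []
    for column in zip(*patterns):
        letters = {c for c in column if c != "-"}
        if len(letters) > 1:
            raise ValueError("conflicting genome codes in one column")
        combined.append(letters.pop() if letters else "-")
    return "".join(combined)


def overlap(p, q):
    """Genome codes for which both forms are present."""
    return "".join(a for a, b in zip(p, q) if a != "-" and b != "-")


aro_k = "------yqvdrlbcefghsnuj-i-"      # typical shikimate kinase, COG0703
archaeal = "aompkz-------------------"   # archaeal-type shikimate kinase, COG1685
aro_d = "aompkzyq---lb-e----n---i-"      # class I dehydroquinase, COG0710
aro_q = "------y-vdr-bcefghs-uj---"      # class II dehydroquinase, COG0757
full = "aompkzyqvdrlbcefghsnuj-i-"       # pattern of AroE, AroA, AroC

print(combine(aro_k, archaeal) == full)  # True
print(combine(aro_d, aro_q) == full)     # True
print(overlap(aro_k, archaeal))          # prints an empty line: no shared genome
print(overlap(aro_d, aro_q))             # ybe
```

The two shikimate kinases are strictly complementary, while the two dehydroquinase classes coexist in three genomes; in both cases the union matches the pattern of the other enzymes of the pathway.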
Distribution of association scores
(monotonic for subunits,
bimodal for isozymes)
E.g. transporters
• Transporters of end products of metabolic
pathways may substitute for the entire pathway
• Transporters of compounds for catabolic pathways
co-occur with pathways
• Transporters for intermediates substitute for upstream
parts of pathways
Example:
bioY
Other approaches to phyletic patterns
• Gene signatures of lifestyles
– e.g. thermophily: reverse gyrase is the only gene specific to all hyperthermophiles (bacterial and archaeal)
– see COGs
• Regulators and signals
Example: bioR
(gene: black arrow; candidate site: red dot)
Comparative analysis of regulation
• Phylogenetic footprinting: regulatory sites
are more conserved than non-coding regions
in general and are often seen as conserved
islands in alignments of gene upstream
regions
• Consistency filtering: regulons (sets of coregulated genes) are conserved =>
– true sites occur upstream of orthologous genes
– false sites are scattered at random
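Consistency filtering can be sketched in a few lines of Python: keep an ortholog group only if a candidate site is found upstream of its members in several genomes. The data layout, the gene identifiers, and the `min_genomes` threshold below are assumptions made for illustration; the site search itself (for example, scanning with a positional weight matrix) is taken as already done.

```python
def consistent_regulon(candidate_sites, orthologs, min_genomes=3):
    """Keep ortholog groups with candidate sites in at least min_genomes genomes.

    candidate_sites: {genome: set of genes with a candidate site upstream}
    orthologs: {group: {genome: gene}}
    """
    kept = {}
    for group, members in orthologs.items():
        hits = sorted(
            genome
            for genome, gene in members.items()
            if gene in candidate_sites.get(genome, set())
        )
        if len(hits) >= min_genomes:
            kept[group] = hits
    return kept


# Toy data (hypothetical gene identifiers).
candidate_sites = {
    "g1": {"g1_0010", "g1_0999"},
    "g2": {"g2_0200"},
    "g3": {"g3_0031", "g3_0500"},
}
orthologs = {
    "ribB": {"g1": "g1_0010", "g2": "g2_0200", "g3": "g3_0031"},
    "yyyX": {"g1": "g1_0999", "g2": "g2_0300", "g3": "g3_0777"},
}
print(consistent_regulon(candidate_sites, orthologs))
# {'ribB': ['g1', 'g2', 'g3']}
```

The site upstream of g1_0999 has no counterpart upstream of its orthologs, so it is treated as a likely false positive.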
Enzymes
• Identification of a gap in a pathway (universal,
taxon-specific, or in individual genomes)
• Search for candidates assigned to the pathway by
co-localization and co-regulation (in many
genomes)
• Prediction of general biochemical function from
(distant) similarity and functional patterns
• Tentative filling of the gap
• Verification by analysis of phylogenetic patterns:
– Absence in genomes without this pathway
– Complementary distribution with known enzymes for the
same function
Transporters
• Identification of candidates assigned to the pathway by
co-localization and co-regulation (in many genomes)
• Prediction of general function by analysis of
transmembrane segments and similarity
• Prediction of specificity by analysis of phylogenetic
patterns:
– End product if present in genomes lacking this pathway
(substituting the biosynthetic pathway for an essential
compound)
– Input metabolite if absent in genomes without the pathway
(catabolic, also precursors in biosynthetic pathways)
– Entry point in the middle if substituting an upper or side part of
the pathway in some genomes
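The three specificity rules above can be written as a small decision function. This is a deliberately crude Python sketch: the input format and the three pathway states are assumptions made for illustration, and real genomes would call for tolerance to annotation errors and incomplete sequences.

```python
def predict_specificity(genomes):
    """Guess what a candidate transporter carries from its phyletic pattern.

    genomes: {name: (has_transporter, pathway_state)}, where pathway_state is
    'complete', 'downstream_only' (upper part missing), or 'absent'.
    """
    states = {state for has_transporter, state in genomes.values() if has_transporter}
    if not states:
        return "no transporter found"
    if "absent" in states:
        # Transporter without the pathway: it replaces the whole pathway.
        return "end product"
    if "downstream_only" in states:
        # Transporter replaces the missing upper part of the pathway.
        return "intermediate"
    # Transporter only ever occurs together with the complete pathway.
    return "input metabolite"


# Toy genomes (hypothetical).
print(predict_specificity({"g1": (True, "complete"), "g2": (True, "absent")}))
# end product
print(predict_specificity({"g1": (True, "complete"), "g2": (False, "absent")}))
# input metabolite
```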
5’ UTR regions of riboflavin genes from bacteria
Row labels (genome abbreviations), top to bottom:
BS, BQ, BE, HD, Bam, CA, DF, SA, LLX, PN, TM, DR, TQ, AO, DU, CAU, FN, TFU, SX, BU, BPS, REU, RSO, EC, TY, KP, HI, VK, VC, YP, AB, BP, AC, Spu, PP, AU, PU, PY, PA, MLO, SM, BME,
BS, BQ, BE, CA, DF, EF, LLX, LO, PN, ST, MN, SA, AMI, DHA, FN, GLU
Helices: 1 2 2’ 3
=========> ==> <== ===>
TTGTATCTTCGGGG-CAGGGTGGAAATCCCGACCGGCGGT
AGCATCCTTCGGGG-TCGGGTGAAATTCCCAACCGGCGGT
TGCATCCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT
TTTATCCTTCGGGG-CTGGGTGGAAATCCCGACCGGCGGT
TGTATCCTTCGGGG-CTGGGTGAAAATCCCGACCGGCGGT
GATGTTCTTCAGGG-ATGGGTGAAATTCCCAATCGGCGGT
CTTAATCTTCGGGG-TAGGGTGAAATTCCCAATCGGCGGT
TAATTCTTTCGGGG-CAGGGTGAAATTCCCAACCGGCAGT
ATAAATCTTCAGGG-CAGGGTGTAATTCCCTACCGGCGGT
AACTATCTTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT
AAACGCTCTCGGGG-CAGGGTGGAATTCCCGACCGGCGGT
GACCTCTTTCGGGG-CGGGGCGAAATTCCCCACCGGCGGT
CACCTCCTTCGGGG-CGGGGTGGAAGTCCCCACCGGCGGT
AATAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGCGGT
TTTAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGTGGT
GAAGACCTTCGGGG-CAAGGTGAAATTCCTGATCGGCGGT
TAAAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGTGGT
ACGCGTGCTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT
-AGCGCACTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT
GTGCGTCTTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT
GTGCGTCTTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
TTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT
GTACGTCTTCAGGG-CGGGGTGGAATTCCCCACCGGCGGT
GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
TCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT
GCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT
CAATATTCTCAGGG-CGGGGCGAAATTCCCCACCGGTGGT
GCTTATTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT
GCGCATTCTCAGGG-CAGGGTGAAAGTCCCTACCGGTGGT
GTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT
ACATCGCTTCAGGG-CGGGGCGTAATTCCCCACCGGCGGT
AACAATTCTCAGGG-CGGGGTGAAACTCCCCACCGGCGGT
GTCGGTCTTCAGGG-CGGGGTGTAAGTCCCCACCGGCGGT
GGTTGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT
AAACGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT
TAACGTTCTCAGGG-CGGGGTGCAACTCCCCACCGGCGGT
TAACGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT
TAAAGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT
AAGCGTTCTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT
GCTTGTTCTCGGGG-CGGGGTGAAACTCCCCACCGGCGGT
ATCAATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT
GTCTATCTTCGGGG-CAGGGTGAAAATCCCGACCGGCGGT
ATTCATCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT
AATGATCTTCAGGG-CAGGGTGAAATTCCCTACCGGCGGT
GAAGATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT
GTTCGTCTTCAGGGGCAGGGTGTAATTCCCGACCGGTGGT
AAATATCTTCAGGG-CACCGTGTAATTCGGGACCGGCGGT
GTTCATCTTCGGGG-CAGGGTGCAATTCCCGACCGGTGGT
AAGAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGCGGT
AAGTGTCTTCAGGG-CAGGGTGTGATTCCCGACCGGCGGT
AAGTGTCTTCAGGG-CAGGGTGAGATTCCCGACCGGCGGT
ATTCATCTTCGGGG-TCGGGTGTAATTCCCAACCGGCAGT
TCACAGTTTCAGGG-CGGGGTGCAATTCCCCACTGGCGGT
ACGAACCTTCGAGG-TAGGGTGAAATTCCCGACCGGCGGT
AATAATCTTCGGGG-CAGGGTGAAATTCCCGACCGGTGGT
---TGTTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT
Add. 3’
-><<===
21 AGCCCGTGAC-
19 AGTCCGTGAC-
20 AGCCCGCGA--
19 AGTCCGTGAC-
23 AGCCCGTGAC-
2 AGCCCGCAA--
2 AGCCCGCG---
6 AGCCTGCGAC-
2 AGCCCGCGA--
2 AGCCCACGA--
3 AGCCCGCGAG-
15 AGCCCGCGAA-
3 AGCCCGCGAA-
2 AGTCCGCGA--
2 AGTCCGCGA--
20 AGCCCGCGA--
2 AGTCCACG---
3 AGTCCGCGAC-
3 AGTCCGCGAC-
30 AGCCCGCGAGCG
21 AGCCCGCGAGCG
31 AGCCCGCGAGCG
21 AGCCCGCGAGCG
17 AGCCCGCGAGCG
67 AGCCCGCGAGCG
20 AGCCCGCGAGCG
2 AGCCCACGAGCG
14 AGCCCACGAGCG
13 AGCCCACGAGCG
40 AGCCCGCGAGCG
25 AGCCCACGAGCG
18 AGCCCGCGAGCG
16 AGCCCGCGAGCA
34 AGCCCGCGAGCG
13 AGCCCGCGAGCG
17 AGCCCGCGAGCG
19 AGCCCGCGAGCG
19 AGCCCGCGAGCG
19 AGCCCGCGAGCG
16 AGCCCGCGAGCG
34 AGCCCGCGAGCG
17 AGCCCGCGAGCG
18 AGCCCGCGA--
27 AGCCCGCGA--
20 AGCCCGCGA--
2 AGCCCGCGAG-
2 AGCCCGCG---
3 AGTCCACGAC-
21 ACTCCGCGAT-
3 AGTCCACGAT-
125 AGTCCGTG---
14 AGTCCGCG---
104 AGTCCGCG---
6 AGCCTGCGAC-
14 AGCCCGCGC--
20 AGCCCGCAAC-
2 AGTCCACG---
28 AGCCCGCGAGCG
Variable 4 4’ 5 5’ 1’
-> <====> <==== ==> <== <=========
8 4 8 -----TGGATTCAGTTTAA-GCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAT
8 5 8 -----TGGATCTAGTGAAACTCTAGGGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATATG
3 4 3 -----AGGATCCGGTGCGATTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATGCC
10 4 10 -----TGGACCTGGTGAAAATCCGGGACCGACAGTGAA-AGTCTGGAT-GGGAGAAGGAAACG
8 4 8 -----TGGATTCAGTGAAAAGCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAG
3 4 3 ------AGATCCGGTTAAACTCCGGGGCCGACAGTTAA-AGTCTGGAT-GAAAGAAGAAATAG
7 6 7 --------ATTTGGTTAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GGAAGAAGATATTT
11 3 11 -----CTGATCTAGTGAGATTCTAGAGCCGACAGTTAA-AGTCTGGAT-GGGAGAAAGAATGT
4 4 4 -----ATGATTCGGTGAAACTCCGAGGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAATA
3 4 3 -----ATGATTTGGTGAAATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAAAA
5 4 5 -----TTGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAGAGCGTGA
8 12 9 -----CCGATGCCGCGCAACTCGGCAGCCGACGGTCAC-AGTCCGGAC-GAAAGAAGGAGGAG
5 4 5 -----CCGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAAGGAGGGC
7 7 7 -----AGGAACCGGTGAGATTCCGGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGATGAAA
13 4 12 -----AGGAACTAGTGAAATTCTAGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGAGCAGA
3 4 3 -----AGGACCCGGTGTGATTCCGGGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTCGGC
5 4 5 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GGGAGAAGAATTAG
8 5 8 -----TGGAACCGGTGAAACTCCGGTACCGACGGTGAA-AGTCCGGAT-GGGAGGTAGTACGTG
8 5 8 -----TTGACCAGGTGAAATTCCTGGACCGACGGTTAA-AGTCCGGAT-GGGAGGCAGTGCGCG
137 GTCAGCAGATCTGGTGAGAAGCCAGAGCCGACGGTTAG-AGTCCGGAT-GGAAGAAGATGTGC
8 4 8 GTCAGCAGATCTGGTCCGATGCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGATGTGC
7 5 7 GTCAGCAGATCTGGTGAGAGGCCAGGGCCGACGGTTAA-AGTCCGGAT-GAAAGAAGATGGGC
11 3 11 GTCAGCAGATCCGGTGAGATGCCGGGGCCGACGGTCAG-AGTCCGGAT-GGAAGAAGATGTGC
8 4 8 GACAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAG-AGTCCGGAT-GGGAGAGAGTAACG
8 3 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGGGTAACG
8 4 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGAGTAACG
26 9 30 GTCAGCAGATTTGGTGAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAAAGAGAATAAAA
11 9 11 GTCAGCAGATTTGGTGAGAATCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAGAATAAGC
5 4 5 GTCAGCAGATCTGGTGAGAAGCCAGGGCCGACGGTTAC-AGTCCGGAT-GAGAGAGAATGACA
16 6 16 GTCAGCAGACCCGGTGTAATTCCGGGGCCGACGGTTAT-AGTCCGGAT-GGGAGAGAGTAACG
16 4 27 GTCAGCAGATTTGGTGCGAATCCAAAGCCGACAGTGAC-AGTCTGGAT-GAAAGAGAATAAAA
10 4 10 GTCAGCAGACCTGGTGAGATGCCAGGGCCGACGGTCAT-AGTCCGGAT-GAGAGAAGATGTGC
10 3 11 ---CGCAGATCTGGTGTAAATCCAGAGCCGACGGT-AT-AGTCCGGAT-GAAAGAAGACGACG
6 6 6 GTCAGCAGATCTGGTG 52 TCCAGAGCCGACGGT 31 AGTCCGGAT-GGAAGAGAATGTAA
7 3 7 GTCAGCAGATCTGGTGCAACTCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGGCGTCA
7 9 7 GTCAGCAGATCCGGTGAGAGGCCGGAGCCGACGGT-AT-AGTCCGGAT-GGAAGAGGACAAGG
19 4 18 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAC-AGTCCGGATGAAGAGAGAACGGGA
15 4 16 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAT-AGTCCGGATGAAGAGAGAGCGGGA
14 4 13 GTCAGCAGACCCGGTGCGATTCCGGGGCCGACGGTCAT-AGTCCGGATAAAGAGAGAACGGGA
8 5 8 GTCAGCAGATCCGGTGTGATTCCGGAGCCGACGGTTAG-AGTCCGGAT-GAAAGAGGACGAAA
8 3 8 GTCAGCAGATCCGGTCGAATTCCGGAGCCGACGGTTAT-AGTCCGGAT-GGAAGAGAGCAAGC
10 15 10 GTCAGCAGATCCGGTGAGATGCCGGAGCCGACGGTTAA-AGTCCGGAT-GGAAGAGAGCGAAT
5 4 5 -----AGGATTCGGTGAGATTCCGGAGCCGACAGT-AC-AGTCTGGAT-GGGAGAAGATGGAG
3 5 3 -----AGGATTTGGTGTGATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
3 4 3 -----AGGATCCGGTGCGAGTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGAAG
3 4 3 ----TATGATCCGGTTTGATTCCGGAGCCGACAGT-AA-AGTCTGGAT-GAAAGAAGATATAT
6 4 6 -------GATTTGGTGAGATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAGAGAAGATATTT
5 3 5 ----ATTGAATTGGTGTAATTCCAATACCGACAGT-AT-AGTCTGGAT--AAAGAAGATAGGG
4 4 4 -----TTGAAGCAGTGAGAATCTGCTAGCGACAGT-AA-AGTCTGGAT-GGAAGAAGATGAAC
3 10 3 ----TTGACTCTGGTGTAATTCCAGGACCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGTTG
3 4 3 -------GATGTGGTGAGATTCCACAACCGACAGT-AT-AGTCTGGAT-GGGAGAAGACGAAA
3 4 3 -------GATGTGGTGTAACTCCACAACCGACAGT-AT-AGTCTGGAT-GAGAGAAGACCGGG
3 4 3 -------GATGTGGTGAAATTCCACAACCGACAGT-AA-AGTCTGGAT-GGGAGAAGACTGAG
11 3 11 -----CTGATCTAGTGAGATTCTAGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
5 5 5 ------TGATCTGGTGCAAATCCAGAGCCAACGGT-AT-AGTCCGGAT-GGAAGAAACGGAGC
11 4 11 --CGACTGACTTGGTGAGACTCCAAGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTACAA
4 6 4 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GAGAGAAGAAAAGA
10 4 10 GTCAGCAGATCCGGTTAAATTCCGGAGCCGACGGTCAT-AGTCCGGAT-GCAAGAGAACC---
Conserved secondary structure of the RFN-element
Figure: consensus secondary structure of the RFN element, with helices numbered 1 to 5, an additional stem-loop, and a variable stem-loop.
Capitals: invariant (absolutely conserved) positions.
Lower case letters: strongly conserved positions.
Dashes and stars: obligatory and facultative base pairs.
Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U.
N: any nucleotide. X: any nucleotide or deletion.
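The degenerate codes in the legend translate directly into a lookup table. The Python sketch below checks an ungapped sequence against a degenerate consensus; the consensus strings in the usage lines are made-up examples, not the RFN consensus. Lower-case (strongly conserved) letters are treated like capitals here, and X (nucleotide or deletion) is rejected because handling it would need an alignment rather than a position-by-position comparison.

```python
# Degenerate RNA codes as defined in the legend.
DEGENERATE = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG",    # A or G
    "Y": "CU",    # C or U
    "K": "GU",    # G or U
    "B": "CGU",   # not A
    "V": "ACG",   # not U
    "N": "ACGU",  # any nucleotide
}


def matches(consensus, sequence):
    """True if the sequence fits the degenerate consensus position by position."""
    consensus = consensus.upper()
    sequence = sequence.upper().replace("T", "U")  # accept DNA spelling
    if "X" in consensus:
        raise ValueError("X (nucleotide or deletion) is not supported in this sketch")
    if len(consensus) != len(sequence):
        return False
    return all(base in DEGENERATE[code] for code, base in zip(consensus, sequence))


print(matches("GRYN", "GACU"))  # True
print(matches("GRYN", "GCCU"))  # False: C is not allowed at an R position
print(matches("KB", "TC"))      # True: T is read as U
```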
RFN: the mechanism of regulation
• Transcription attenuation
• Translation attenuation
Early observation: an uncharacterized gene (ypaA)
with an upstream RFN element
Phylogenetic tree of RFN-elements
(regulation of riboflavin biosynthesis)
no riboflavin biosynthesis
duplications
no riboflavin biosynthesis
YpaA: riboflavin (vitamin B2)
transporter in Gram-positive bacteria
• 5 predicted transmembrane segments => a transporter
• Upstream RFN element (likely co-regulation with riboflavin
genes) => transport of riboflavin or a precursor
• S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin
pathway => transport of riboflavin
Prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)
Validation:
• YpaA transports flavins (riboflavin, FMN, FAD) (by genetic
analysis, Kreneva et al., 2000)
• ypaA is regulated by riboflavin (by microarray expression
study, Lee et al., 2001)
• … via attenuation of transcription (and to some extent
inhibition of translation) (Winkler et al., 2003)
A new family of nickel/cobalt transporters
• No experimental data
• No structural data
• Specificity predicted by comparative genomics
• … and then validated in experiment
• Mutational analysis under way
Conserved signal upstream of nrd genes
Identification of the candidate regulator by the
analysis of phyletic patterns
• COG1327: the only COG with exactly the
same phyletic pattern as the signal
– “large scale” on the level of major taxa
– “small scale” within major taxa:
• absent in small parasites among alpha- and gamma-proteobacteria
• absent in Desulfovibrio spp. among delta-proteobacteria
• absent in Nostoc sp. among cyanobacteria
• absent in Oenococcus and Leuconostoc among Firmicutes
• present only in Treponema denticola among four spirochetes
COG1327 “Predicted transcriptional regulator,
consists of a Zn-ribbon and ATP-cone domains”:
regulator of the nrd genes?
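The search for a regulator whose phyletic pattern matches that of the signal amounts to comparing sets of genomes. The Python sketch below lists the COGs whose genome set differs from the signal's by at most a given number of genomes. The COG names and genome codes are hypothetical toy data, and the optional mismatch tolerance is an assumption added for illustration; the slide describes an exact match.

```python
def same_pattern(signal_genomes, cog_genomes, max_mismatches=0):
    """List COGs whose genome set differs from the signal's by few genomes.

    signal_genomes: set of genomes in which the candidate signal is found
    cog_genomes: {cog_id: set of genomes containing a member of that COG}
    """
    hits = []
    for cog, genomes in cog_genomes.items():
        # Symmetric difference: genomes with the signal but no gene, or vice versa.
        mismatches = len(signal_genomes ^ genomes)
        if mismatches <= max_mismatches:
            hits.append((cog, mismatches))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Toy data (hypothetical COG names and genome codes).
signal = {"a", "b", "c", "e"}
cogs = {
    "COG_X": {"a", "b", "c", "e"},
    "COG_Y": {"a", "b", "c", "d", "e"},
    "COG_Z": {"a", "f"},
}
print(same_pattern(signal, cogs))                    # [('COG_X', 0)]
print(same_pattern(signal, cogs, max_mismatches=1))  # [('COG_X', 0), ('COG_Y', 1)]
```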
Additional evidence
• sometimes clustered with nrd genes or with
replication genes dnaB, dnaI, polA
• candidate signals upstream of other
replication-related genes
• dNTP salvage
• topoisomerase I, replication initiator dnaA,
chromosome partitioning, DNA helicase II
• experimental confirmation in Streptomyces
(Borovok et al., 2004)
Multiple sites (nrd genes): FNR, DnaA, NrdR
Mode of regulation
• Repressor (overlaps with promoters)
• Co-operative binding:
– most sites occur in tandem (> 90% cases)
– the distance between the copies (centers of
palindromes) equals an integer number of
DNA turns:
• mainly (94%) 30-33 bp, in 84% 31-32 bp – 3 turns
• 21 bp (2 turns) in Vibrio spp.
• 41-42 bp (4 turns) in some Firmicutes
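The "integer number of DNA turns" observation can be checked with simple arithmetic. The Python sketch below assumes a helical repeat of 10.5 bp per turn (the usual textbook average for B-DNA, not a value given on the slide) and an arbitrary tolerance of 0.2 turns for calling two sites in phase.

```python
HELICAL_REPEAT_BP = 10.5  # assumed average for B-DNA


def helical_phase(distance_bp, repeat=HELICAL_REPEAT_BP):
    """Return (nearest whole number of turns, offset from it in turns)."""
    turns = distance_bp / repeat
    nearest = round(turns)
    return nearest, abs(turns - nearest)


def in_phase(distance_bp, tolerance=0.2):
    """True if the two site centers lie on roughly the same face of the helix."""
    return helical_phase(distance_bp)[1] <= tolerance


for distance in (21, 31, 32, 42, 26):
    turns, offset = helical_phase(distance)
    print(distance, turns, round(offset, 2), in_phase(distance))
# 21 2 0.0 True
# 31 3 0.05 True
# 32 3 0.05 True
# 42 4 0.0 True
# 26 2 0.48 False
```

Distances of 21, 31-32, and 41-42 bp place both sites on the same face of the DNA, which is what co-operative binding of a repressor would require.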
Combined regulatory network for iron homeostasis genes in α-proteobacteria.
Figure: the regulators RirA, Irr, Fur, and IscR, each marked with the condition [-Fe] or [+Fe], connected to the target gene groups: siderophore uptake, Fe2+/Fe3+ uptake, iron uptake systems, iron storage ferritins, FeS synthesis, heme synthesis, iron-requiring enzymes [iron cofactor], and transcription factors. Other labels: Fe, FeS, heme, degraded, FeS status of cell.
The connecting lines denote regulatory interactions, with the thickness reflecting the frequency of the interaction in the
analyzed genomes. The suggested negative or positive mode of operation is shown by a dead end or an arrow end of the line.
Fe and Mn regulons
Table: distribution of the Irr, Fur/Mur, MntR, RirA, and IscR regulons in α-proteobacteria, with genomes arranged in groups A, B, C, and D.
Columns: Organism, Abb., Irr, MUR / FUR, MntR, RirA, IscR. Entries are "+", "+ +", "-", or "#?".
Taxon labels: Rhizobiales, Rhizobiaceae, Bradyrhizobiaceae, Rhodobacterales, Rhodobacteraceae, Hyphomonadaceae, Caulobacterales, Parvularculales, Sphingomonadales, Rhodospirillales, SAR11 cluster, Rickettsiales.
Organisms (abbreviation):
Sinorhizobium meliloti (SM), Rhizobium leguminosarum (RL), Rhizobium etli (RHE), Agrobacterium tumefaciens (AGR), Mesorhizobium loti (ML), Mesorhizobium sp. BNC1 (MBNC), Brucella melitensis (BME), Bartonella quintana and spp. (BQ),
Bradyrhizobium japonicum (BJ), Rhodopseudomonas palustris (RPA), Nitrobacter hamburgensis (Nham), Nitrobacter winogradskyi (Nwi),
Rhodobacter capsulatus (RC), Rhodobacter sphaeroides (Rsph), Silicibacter sp. TM1040 (STM), Silicibacter pomeroyi (SPO), Jannaschia sp. CC51 (Jann), Rhodobacterales bacterium HTCC2654 (RB2654), Roseobacter sp. MED193 (MED193), Roseovarius nubinhibens ISM (ISM), Roseovarius sp. 217 (ROS217), Loktanella vestfoldensis SKA53 (SKA53), Sulfitobacter sp. EE-36 (EE36), Oceanicola batsensis HTCC2597 (OB2597),
Oceanicaulis alexandrii HTCC2633 (OA2633), Caulobacter crescentus (CC), Parvularcula bermudensis HTCC2503 (PB2503), Erythrobacter litoralis (ELI), Novosphingobium aromaticivorans (Saro), Sphingopyxis alaskensis RB2256 (Sala), Zymomonas mobilis (ZM), Gluconobacter oxydans (GOX), Rhodospirillum rubrum (Rrub), Magnetospirillum magneticum (Amb), Pelagibacter ubique HTCC1002 (PU1002), Rickettsia and Ehrlichia species.
"#?" in the RirA column denotes
the absence of the rirA gene
in an unfinished genomic sequence
and the presence of candidate
RirA-binding sites upstream of
the iron uptake genes.
Phylogenetic tree of the Fur family of transcription factors in α-proteobacteria - I
Figure: phylogenetic tree with the following labeled branches:
• Fur in γ- and β-proteobacteria: Escherichia coli (sp|P0A9A9), Pseudomonas aeruginosa (sp|Q03456), Neisseria meningitidis (sp|P0A0S7)
• Fur in ε-proteobacteria: Helicobacter pylori (sp|O25671)
• Fur in Firmicutes: Bacillus subtilis (sp|P54574)
• Mur in α-proteobacteria: regulator of manganese uptake genes (sit, mntH)
• Fur in α-proteobacteria: regulator of iron uptake and metabolism genes
• Irr in α-proteobacteria
Leaves from α-proteobacteria (identifier: organism):
SM mur: Sinorhizobium meliloti; MBNC03003179: Mesorhizobium sp. BNC1 (I); BQ fur2: Bartonella quintana; BMEI0375: Brucella melitensis; EE36 12413: Sulfitobacter sp. EE-36; MBNC03003593: Mesorhizobium sp. BNC1 (II); RB2654 19538: Rhodobacterales bacterium HTCC2654; AGR C 620: Agrobacterium tumefaciens; RHE_CH00378: Rhizobium etli; RL mur: Rhizobium leguminosarum; Nham 0990: Nitrobacter hamburgensis X14; Nwi 0013: Nitrobacter winogradskyi; RPA0450: Rhodopseudomonas palustris; BJ fur: Bradyrhizobium japonicum; ROS217 18337: Roseovarius sp. 217; Jann 1799: Jannaschia sp. CC51; SPO2477: Silicibacter pomeroyi; STM1w01000993: Silicibacter sp. TM1040; MED193 22541: Roseobacter sp. MED193; OB2597 02997: Oceanicola batsensis HTCC2597; SKA53 03101: Loktanella vestfoldensis SKA53; Rsph03000505: Rhodobacter sphaeroides; ISM 15430: Roseovarius nubinhibens ISM; PU1002 04436: Pelagibacter ubique HTCC1002; GOX0771: Gluconobacter oxydans; ZM01411: Zymomonas mobilis; Saro02001148: Novosphingobium aromaticivorans; Sala 1452: Sphingopyxis alaskensis RB2256; ELI1325: Erythrobacter litoralis; OA2633 10204: Oceanicaulis alexandrii HTCC2633; PB2503 04877: Parvularcula bermudensis HTCC2503; CC0057: Caulobacter crescentus; Rrub02001143: Rhodospirillum rubrum; Amb1009: Magnetospirillum magneticum (I); Amb4460: Magnetospirillum magneticum (II)
Species listed: Erythrobacter litoralis, Caulobacter crescentus, Zymomonas mobilis, Novosphingobium aromaticivorans, Oceanicaulis alexandrii, Sphingopyxis alaskensis, Gluconobacter oxydans, Rhodospirillum rubrum, Parvularcula bermudensis, Magnetospirillum magneticum
Identified Mur-binding sites: the A, B, and C groups of α-proteobacteria
Sequence logos for the identified Fur-binding sites in the D group of α-proteobacteria
Sequence logos for the known Fur-binding sites in Escherichia coli and Bacillus subtilis
Panel labels: Mur, Bacillus subtilis, Escherichia coli
Phylogenetic tree of the Fur family of transcription factors in α-proteobacteria - II
Figure: phylogenetic tree with the following labeled branches:
• Fur in γ- and β-proteobacteria: Escherichia coli (sp|P0A9A9), Pseudomonas aeruginosa (sp|Q03456), Neisseria meningitidis (sp|P0A0S7)
• Fur in ε-proteobacteria: Helicobacter pylori (sp|O25671)
• Fur in Firmicutes: Bacillus subtilis (sp|P54574)
• Mur / Fur in α-proteobacteria
• Irr in α-proteobacteria: regulator of iron homeostasis
Leaves from α-proteobacteria (identifier: organism):
AGR C 249: Agrobacterium tumefaciens; SM irr: Sinorhizobium meliloti; RHE CH00106: Rhizobium etli; RL irr1: Rhizobium leguminosarum (I); RL irr2: Rhizobium leguminosarum (II); MLr5570: Mesorhizobium loti; MBNC03003186: Mesorhizobium sp. BNC1; BQ fur1: Bartonella quintana; BMEI1955: Brucella melitensis (I); BMEI1563: Brucella melitensis (II); BJ blr1216: Bradyrhizobium japonicum (II); RB2654 182: Rhodobacterales bacterium HTCC2654; SKA53 01126: Loktanella vestfoldensis SKA53; ROS217 15500: Roseovarius sp. 217; ISM 00785: Roseovarius nubinhibens ISM; OB2597 14726: Oceanicola batsensis HTCC2597; Jann 1652: Jannaschia sp. CC51; Rsph03001693: Rhodobacter sphaeroides; EE36 03493: Sulfitobacter sp. EE-36; STM1w01001534: Silicibacter sp. TM1040; MED193 17849: Roseobacter sp. MED193; SPOA0445: Silicibacter pomeroyi; RC irr: Rhodobacter capsulatus; RPA2339: Rhodopseudomonas palustris (I); RPA0424*: Rhodopseudomonas palustris (II); BJ irr*: Bradyrhizobium japonicum (I); Nwi 0035*: Nitrobacter winogradskyi; Nham 1013*: Nitrobacter hamburgensis X14; PU1002 04361: Pelagibacter ubique HTCC1002
Sequence logos for the identified Irr binding sites in α-proteobacteria.
The A group (8 species) - Irr
The B group (4 species) - Irr
The C group (12 species) - Irr
Phylogenetic tree of the Rrf2 family of transcription factors in α-proteobacteria
Figure: phylogenetic tree with the following labeled branches:
• NsrR: nitrite/NO-sensing regulator (Nitrosomonas europaea, Escherichia coli)
• RirA: iron repressor (Rhizobium leguminosarum)
• CymR: cysteine metabolism repressor (Bacillus subtilis)
• IscR: iron-sulfur cluster synthesis repressor (Escherichia coli)
• IscR-II
• Rrf2: cytochrome complex regulator (Desulfovibrio vulgaris)
Leaf identifiers:
ROS217_15206, Rsph03001477, RC NsrR, GOX0860, Amb1318, Nwi_0743, SPOA0186, Ricket., Sala_1049, Saro02000305, NE NsrR, OB2597_05195, ROS217_02155, ROS217_14291, SMc00785, RHE CH00735, AGR_C_344, AGR_L_1131, SPO3722, RHE_CH02777, RL_3336, SPO1393, MBNC02000669, MLl1642, SMc02238, AGR_C_872, RHE_CH00547, OA2633_11510, RL RirA, BMEII0707, MLr1147, MBNC02002196, BQ04990, RC 0780, RB2654_19993, Rsph023178, SPO0432, MED193_09800, STM_634, CC0132, SMc01160, BJ blr7974, RL_5159, AGR_L_2343, AGR_C_402, RL_619, ZMO0116, ROS217_16231, GOX0099, BS CymR, Rrub02000219, ZMO0422, Sala_1236, ELI0458, Saro3534, DV Rrf2, OA2633_03246, CC1866, EC IscR, Jann_2366, STM_3629, EE36_14302, SPO2025, Rsph023725, RC_0477, Rrub_1115, Amb0200, GOX1196, RPA0663, Ricket., PB2503_ 09884
Legend:
Positional clustering of rrf2-like genes with:
iron uptake and storage genes;
Fe-S cluster synthesis operons;
genes involved in nitrosative stress protection;
sulfate uptake/assimilation genes;
thioredoxin reductase;
carboxymuconolactone decarboxylase-family genes;
hmc cytochrome operon
proteins with the conserved C-X(6-9)-C(4-6)-C motif within effector-responsive domain
proteins without a cysteine triad motif
Sequence logos for the identified RirA-binding sites in α-proteobacteria
The A group - RirA (8 species)
The C group - RirA (12 species)
Distribution of the conserved members of the Fe- and Mn-responsive regulons
and the predicted RirA, Fur/Mur, Irr, and DtxR binding sites in α-proteobacteria
Gene functions:
Iron uptake
Iron storage
FeS synthesis
Iron usage
Heme biosynthesis
Regulatory genes
Manganese uptake
An attempt to reconstruct the history
Acknowledgements
• Dmitry Rodionov (comparative genomics)
• Andrei Mironov (software)
• Alexei Vitreschak (riboswitches)
• Slides:
– Michael Galperin (NCBI, Bethesda)
– Andrei Osterman (Burnham Institute, San Diego)
• Collaboration:
– Thomas Eitinger (Humboldt University, Berlin) – Co/Ni transporters
– Andy Johnston (University of East Anglia) – Fe in alphas
• Funding:
– Howard Hughes Medical Institute
– Russian Fund of Basic Research
– RAS, program “Molecular and Cellular Biology”
– INTAS