greengenes.lbl.gov 16S rRNA gene database and workbench compatible with ARB


greengenes.lbl.gov
16S rRNA gene database
and workbench compatible
with ARB
Todd DeSantis, Phil Hugenholtz, Niels Larson, Igor
Dubosarskiy, Jordan Moberg, Yvette Piceno, Ingrid
Zubieta, Eoin Brodie, Gary Andersen
LBL - JGI
Andersen Group Program
Aims
• Creating a microarray for the simultaneous
differentiation and quantification of closely
related prokaryotes in complex samples.
The
Biomarker
16S rDNA
16S rDNA - identify and
classify organisms by
gene sequence
variations.
rRNA (functional molecule)
LSU
SSU
The Challenges
• 16S sequence deposit rate is increasing.
• Many are mis-annotated and/or chimeric.
• Sequence taxonomy updates lag years
behind sequence availability (“Bacteria,
Unclassified”).
• Difficult to create and manage MSAs of all
16S seq data (or even thousands) using
Clustal/BioEdit/Arb.
• Probe quality is reliant on excellent MSAs
and taxonomy.
• “Signatures” can erode as more sequences
are discovered.
[Figure: Cumulative NCBI 16S records by year, 1992 to 2005; y-axis from 0 to 200,000. Source: http://www.ncbi.nlm.nih.gov/, query ‘16S NOT 1.16S NOTmitochondr* NOT 18S’]
Stay current
Verify ‘16S-ness’
Fate of NCBI Records:
• short FASTA file (9%)
• short BLAST match length (8%)
• BLAST match to 18S/Mito SSU (1%)
• odd nt insertions (1%)
• passed (81%)
NAST align
step 1: find template
• Hand curated MSA provided by Phil.
• Alignment "template" is top BLAST HSP
– q= -1, Favors long match
• Candidate trimmed of extra-16S seq data
– tRNA, intergenic spacer regions, and 23S rDNA
– based on HSP boundaries
• If HSP paired opposite strands, candidate is reverse complemented.
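The trimming and strand-correction steps of step 1 can be sketched roughly as follows. This is a minimal illustration, not the greengenes code: the function names, the 1-based inclusive HSP coordinates, and the `minus_strand` flag are all assumptions made for the example.

```python
# Illustrative sketch only; names and coordinate conventions are assumptions.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_to_hsp(candidate: str, hsp_start: int, hsp_end: int,
                minus_strand: bool = False) -> str:
    """Keep only the part of the candidate covered by the top HSP.

    hsp_start and hsp_end are assumed to be 1-based, inclusive positions
    on the candidate. Bases outside the HSP (for example tRNA, spacer,
    or 23S sequence) are discarded. If the HSP paired opposite strands,
    the trimmed candidate is reverse complemented.
    """
    lo, hi = sorted((hsp_start, hsp_end))
    core = candidate[lo - 1:hi]
    return reverse_complement(core) if minus_strand else core
```

For example, `trim_to_hsp("TTTAACGGGG", 4, 7)` keeps only the four bases covered by the HSP, and passing `minus_strand=True` flips that fragment to the other strand.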
NAST align
step 2: gap removal
Preserves global MSA positions (columns) by allowing local misalignments.
DEFINE
  St = post-Align0 template sequence.
  Sc = post-Align0 candidate sequence.
  Ht = alignment space (hyphen) inserted into St by Align0.
  Hc = alignment space (hyphen) inserted into Sc by Align0.
WHILE (St contains one or more Ht) DO
  LHt = character index of distal 5' Ht within St
  L5' = character index of Hc within Sc which is 5' proximal to Ht
  L3' = character index of Hc within Sc which is 3' proximal to Ht
  IF ((LHt – L5') > (L3' – LHt))
    Delete Hc found at L3'
  ELSE
    Delete Hc found at L5'
  Delete template gap character.
END WHILE
Result: Largest MSA of full-length (>1250 nt) 16S rDNA genes.
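The gap-removal loop above can be sketched in Python as shown here. This is a hedged reading of the pseudocode, not the NAST source. Two assumptions are made for the example: gaps inserted into the template by the pairwise aligner (Ht) are marked with `~` so they can be told apart from the template's original MSA gaps, and every `-` in the candidate is treated as an Hc.

```python
def nast_gap_removal(template: str, candidate: str,
                     inserted: str = "~", gap: str = "-"):
    """Remove aligner-inserted template gaps while keeping column count.

    For each inserted template gap (5'-most first), the nearest candidate
    gap is deleted, then the template gap itself is deleted, so both
    strings shrink together and stay the same length.
    """
    st, sc = list(template), list(candidate)
    if len(st) != len(sc):
        raise ValueError("template and candidate must be the same length")
    while inserted in st:
        l_ht = st.index(inserted)  # 5'-most inserted template gap
        # Nearest candidate gap on the 5' side and on the 3' side.
        l5 = next((i for i in range(l_ht - 1, -1, -1) if sc[i] == gap), None)
        l3 = next((i for i in range(l_ht + 1, len(sc)) if sc[i] == gap), None)
        if l5 is None and l3 is None:
            raise ValueError("no candidate gap left to absorb the insertion")
        # Same comparison as the pseudocode: delete at L3' only when the
        # 5' gap is farther away (or when there is no 5' gap at all).
        if l5 is None or (l3 is not None and (l_ht - l5) > (l3 - l_ht)):
            del sc[l3]
        else:
            del sc[l5]
        del st[l_ht]  # delete the template gap character
    return "".join(st), "".join(sc)
```

Deleting the nearest candidate gap shifts only the bases between that gap and the insertion, which is the "local misalignment" the slide accepts in exchange for fixed MSA columns.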
Name generator
• NCBI annotations are nonstandardized
– Determine if sequence is from an isolate, environmental amplicon/metagenome
– Concatenate useful terms
• Effort to guide future GenBank submitters in clear record descriptions
– http://www.jgi.doe.gov/16s/
[Flowchart, starting from a GenBank record. Decision nodes: Is sequence from whole genome record? · “Genus species” style name in DEFINITION or source>organism? · Gs result? · Does a source>isolate field exist? · Text glob (from “DEFINITION”, “source”, and “TITLE”) contains “clone” OR “uncultur”? · Text glob contains “symbiont”? · Isolate tag present? Intermediate results: “Gs yes” / “Gs no”; “Isolate tag yes” / “Isolate tag no”; Strain tag is present. Outcomes: Record is from an isolate; Record is from a clone; Record is from a symbiont; Record is from undecided; Record is from a isolate_str.]
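A rough sketch of the text-glob checks named in the flowchart is given here. The order in which the checks are applied, the function name, and the return labels are assumptions for illustration. The extracted slide does not preserve the flowchart's edges, so this should not be read as the actual greengenes decision logic.

```python
def classify_record(definition: str, source: str, title: str,
                    has_isolate_tag: bool = False) -> str:
    """Guess a record's origin from its free-text GenBank fields.

    The text glob is the concatenation of DEFINITION, source, and TITLE.
    Check order is an assumption: clone/uncultured first, then symbiont,
    then the isolate tag, otherwise undecided.
    """
    glob = " ".join((definition, source, title)).lower()
    if "clone" in glob or "uncultur" in glob:
        return "clone"
    if "symbiont" in glob:
        return "symbiont"
    if has_isolate_tag:
        return "isolate"
    return "undecided"
```

Matching on the stem "uncultur" covers both "uncultured" and "unculturable", which is presumably why the slide uses the truncated form.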
Chimera tracking
• Amplicons from complex gDNA can
contain partial sequence from more
than one genome.
• Up to 4% of sequences are deemed
chimeric by Bellerophon2
– Flags are set to avoid using these
questionable sequences in phylogeny
assessments
Maintain Taxonomy
JGI taxonomy organized in ARB using maximum parsimony tree insertions.
Example: http://greengenes.lbl.gov/cgibin/User/show_one_record_v2.pl?prokMSA_id=82172
prokMSA_id: 82172 prokMSAname: termite gut clone Rs-050
GenBank ACCESSION: AB100461.1, GenBank GI: 28971862,
RDP_id: S000122947, NCBI_tax_id: 203524, Study_id: 21358
G2_chip_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; Peptostreptococcaceae; sf_5; otu_2988
JGI_tax_string=Bacteria; Firmicutes (incl. basal lineag; Firmicutes; Peptostreptococcaceae; Mogibacterium
JGI_tax_string_format_2=Bacteria; Firmicutes (incl. basal lineag; Firmicutes; Peptostreptococcaceae; Mogibacterium; otu_415
Pace_tax_string=Bacteria; Firmicutes; Clostridium et al.; Peptostreptococcaceae; Clostridium acidiurici et al.; Clostridium difficile et al.; Clostridium aminobutyricum et
RDP_tax_string= Bacteria; Firmicutes; Clostridia; Clostridiales; unclassified_Clostridiales.
ncbi_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; Eubacteriaceae; environmental samples
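The taxonomy strings in the example record share a simple `name=rank; rank; ...` layout, so they can be split into lists for side-by-side comparison. The helper below is a small sketch; its name and return shape are assumptions, not part of the greengenes tools.

```python
def parse_tax_line(line: str):
    """Split 'key=rank1; rank2; ...' into (key, [rank1, rank2, ...])."""
    key, _, value = line.partition("=")
    ranks = [part.strip() for part in value.split(";") if part.strip()]
    return key.strip(), ranks


# Usage with one line copied from the record above.
key, ranks = parse_tax_line(
    "G2_chip_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; "
    "Peptostreptococcaceae; sf_5; otu_2988"
)
print(key, len(ranks), ranks[-1])
```

Once parsed, the curators' strings (JGI, Pace, RDP, NCBI) can be compared rank by rank to see where they diverge, as they do below the phylum level in this record.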
Tools
• BLAST
• SimRank
• Probe matcher
• Text search
• PCR primer design
• Private NAST aligner
Compatible with ARB
• Entire database downloadable in ARB format.
• Can import new records into personal ARB database.
How we use greengenes
data to get our work
done…..
16S Sequence clustering
• Each sequence reduced to an array (list) of
“probe-friendly” 25-mers which:
– Have high complexity
– Can be synthesized with 75 or fewer masks
– Adequate H-bond potential
• G+C content over 48%
• Or empirical bond stability found in test arrays
• Transitive clustering by fraction of 25mers in
common
– Cluster considered an Operational Taxonomic
Unit (OTU)
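The clustering idea above (reduce each sequence to 25-mers, then link sequences transitively by the fraction they share) can be sketched as follows. Only the G+C filter is implemented. The complexity and mask-count filters are omitted, and the 0.9 linking threshold and the choice of denominator (the smaller k-mer set) are assumptions for illustration, not values from the slides.

```python
from itertools import combinations


def kmers(seq, k=25):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def gc_fraction(kmer):
    """Fraction of G and C bases in a k-mer."""
    return sum(base in "GC" for base in kmer) / len(kmer)


def probe_friendly_kmers(seq, k=25, min_gc=0.48):
    """k-mers with G+C content over min_gc (other filters omitted)."""
    return {m for m in kmers(seq, k) if gc_fraction(m) > min_gc}


def shared_fraction(a, b):
    """Shared k-mers relative to the smaller set (an assumed definition)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def transitive_clusters(seqs, threshold=0.9, k=25, min_gc=0.48):
    """Single-linkage clustering: A~B and B~C puts A, B, C in one OTU."""
    sets = {name: probe_friendly_kmers(s, k, min_gc) for name, s in seqs.items()}
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if shared_fraction(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(c) for c in clusters.values())
```

Because linking is transitive, two sequences can land in the same cluster without sharing many 25-mers directly, as long as a chain of intermediate sequences connects them.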
Extended Bergey’s Taxonomy
Bergey’s v0.9 with added nomenclature
from Hugenholtz tree of environmental
DNA
• Each OTU assigned to one of 455 families
• Families split into subfamilies where >15%
sequence variation existed.
• Results: (considering both domains)
• 63 phyla
• 136 classes
• 262 orders
• 455 families
• 842 subfamilies (~94% identity)
• 8,989 OTUs (~99% identity)
• 30,627 sequences (each belong to only one OTU)
Probe Design
Example of the Location of Probes Used for the Desulfovibrio vulgaris Probe Set
[Figure: aligned sequences Desulfovibrio sp. str. DMB., Desulfovibrio sp. 'Bendigo A', and Desulfovibrio vulgaris DSM 644, marked with sequence discrepancies, regions unique to OTU, and regions not unique to OTU; labels 22/22, 25/25, 20/25]
Bacteria; Proteobacteria; Deltaproteobacteria; Desulfovibrionales; Desulfovibrionaceae; sf_1; otu_10051
Example: Proteobacteria OTU composed of 26 sequences
Locus Specific Prevalence Scoring
Probe selection objectives for each OTU
• Find 11 or more 25mers (targets)
– >90% prevalent in an OTU’s sequences
– dissimilar from sequences outside the OTU
– >48% G+C or empirically responsive
– from more than one locus within the 16S rDNA gene
• Presumed cross-hybridizing probes were those 25-mers that contained a central 17-mer matching sequences in more than one OTU (Urakawa, Stahl et al. 2002)
– avoiding probes that were unique solely due to a mismatch in one of the outer four bases.
• As each PM probe (Perfect Match to target) was chosen, it was paired with a control 25-mer (mismatching probe, MM), identical in all positions except the thirteenth base.
• The MM probe did not contain an internal 17-mer complementary to sequences in any OTU.
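The PM/MM pairing and the central 17-mer screen described above can be illustrated with a short sketch. The slides say only that the MM probe differs at the thirteenth base. Substituting the complementary base there is an assumption (a common array convention), as are the function names and the simplification that probes and OTU sequences are written in the same orientation.

```python
_SWAP = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Build the MM control: same 25-mer with the 13th base changed.

    Assumption: the 13th base is replaced by its complement.
    """
    if len(pm) != 25:
        raise ValueError("probe must be a 25-mer")
    return pm[:12] + _SWAP[pm[12]] + pm[13:]


def central_17mer(probe):
    """Drop the outer four bases on each side of a 25-mer."""
    return probe[4:21]


def cross_hybridizes(probe, otu_sequences):
    """True if the central 17-mer occurs in more than one OTU.

    otu_sequences maps an OTU name to a list of its sequences.
    """
    core = central_17mer(probe)
    hits = [otu for otu, seqs in otu_sequences.items()
            if any(core in s for s in seqs)]
    return len(hits) > 1
```

Screening on the central 17-mer rather than the full 25-mer rejects probes whose only distinguishing bases sit in the outer four positions, where a mismatch discriminates poorly.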
Overview of Sample Preparation
Extract Genomic DNA
PCR Amplify DNA
18 µ
Fractionate DNA
End-label with biotin
18 µ
Hybridize
Image Capture and Data Reduction
• Over 500,000 data points
• Scores for each of 9000 OTUs
[Table: pf values for 60 subgroups across 14 arrays (Airport_B_2.CEL, Airport_B_1.CEL, Airport_A_2.CEL, Airport_A_1.CEL, Airport_6.CEL, Airport_7.CEL, Airport_2.CEL, 90-Los_Alamos.CEL, Airport_1.CEL, 84_M_Miller.CEL, 90-Power_Soil.CEL, 84_Los_Alamos.CEL, 81_Ultra_Soil.CEL, 81_Los_Alamos.CEL), plus a “COUNTIF pf==1” column ranging from 14 down to 9. Row labels survive with three of the pf columns, listed below; the arrays those three columns belong to cannot be determined from the extracted text.]
SUBGROUP  description  pf  pf  pf
022102  CHLOROPLASTS_AND_CYANELLES  1  1  1
02280110  SPHINGOMONAS_GROUP  1  1  1
0228030406  MCH.PURPURATUM_SUBGROUP  1  1  1
021506  CY.AURANTIACA_GROUP  1  1  1
0228010616  MSO.LOTI_SUBGROUP  1  1  1
02280313  PSEUDOMONAS_AND_RELATIVES  0.96  1  1
0230011302  CORYNEBACTERIUM_GROUP  1  1  0.94
021304  ENVIRONMENTAL_CLONE_OPB5_GROUP  1  1  0.91
02150402  FLX.SANCTI_SUBGROUP  1  1  1
0228010608  BL.VIRIDIS_ASSEMBLAGE  1  1  1
02280211  OXALOBACTER_GROUP  1  1  0.89
02280308  XANTHOMONAS_GROUP  1  1  1
0230010901  ARTHROBACTER_AND_RELATIVES  1  1  0.92
02300110  PROPIONIBACTERIUM_GROUP  0.9  1  1
0218  ENVIRONMENTAL_CLONE_WCHB1-31_GROUP  0.94  1  0.89
02250306  ACBT.CAPSULATUM_GROUP  1  1  1
0228010611  METHYLOBACTERIA_SUBGROUP  0.95  1  1
022801061210  BDR.ELKANII_SUBGROUP  1  1  1
022801061214  BLB.DENITRIFICANS_SUBGROUP  1  1  1
022802090401  COM.TERRIGENA_SUBGROUP  0.95  1  0.7
02300710  B.MEGATERIUM_GROUP  0.94  1  0.93
02300901  C.LEPTUM_GROUP  0.89  1  0.94
022801080102  PARACOCCUS_SUBGROUP  0.91  1  1
0228040603  POL.CELLULOSUM_SUBGROUP  0.82  1  1
02280108010101  ROS.DENITRIFICANS_SUBGROUP  0.95  1  1
0228050301  AOB.CRYAEROPHILUS_SUBGROUP  0.95  1  0.95
023001130101  MYB.TUBERCULOSIS_SUBGROUP  0.94  1  1
0228010404  AZS.LIPOFERUM_SUBGROUP  0.96  1  0.96
022801061201  AFIPIA.FELIS_SUBGROUP  1  1  1
022801061204  NTB.WINOGRADSKYI_SUBGROUP  1  1  1
022801061205  RPS.PALUSTRIS_SUBGROUP  1  1  1
022801061208  BDR.LUPINI_SUBGROUP  1  1  1
022801061212  BDR.LIAONINGENSIS_SUBGROUP  1  1  1
0228020403  NSS.MULTIFORMIS_SUBGROUP  0.96  1  0.96
021306  ENVIRONMENTAL_CLONE_III1-8_GROUP  0.93  1  1
02200101  PIRELLULA_SCHLESNER_ISOLATES  0.92  1  0.92
02280410  DESULFOBULBUS_ASSEMBLAGE  0.92  1  1
02300711  B.SUBTILIS_GROUP  0.85  1  0.67
021505  PERSICOBACTER_GROUP  1  1  1
022804010401  DSV.HALOPHILUS_SUBGROUP  0.5  1  0.42
0230011201  PSC.HALOPHOBICA_SUBGROUP  0.92  1  1
0230040105  BTV.FIBRISOLVENS_SUBGROUP  0.68  1  0.78
02280327  ENTERICS_AND_RELATIVES  0.91  1  0.86
0230010602  A.FERROOXIDANS_SUBGROUP  0.96  1  0.96
02250301  MOUNT_COOT-THA_ENVIRONMENTAL_CLONES_III  0.89  1  0.89
0228010609  MSI.TRICHOSPORIUM_SUBGROUP  0.77  1  0.87
0228020804  BRD.BRONCHISEPTICA_SUBGROUP  0.82  1  0.91
02300111  MICROMONOSPORA_GROUP  0.83  1  0.83
0230070903  B.ALCALOPHILUS_SUBGROUP  0.81  1  0.46
0205  ENVIRONMENTAL_CLONE_OPB45_GROUP  0.73  1  0.91
021312  ENVIRONMENTAL_CLONE_RB40_GROUP  0.82  1  0.81
0215010204  CY.FERMENTANS_SUBGROUP  0.94  1  1
02280108010105  OCT.ANTARCTICUS_SUBGROUP  0.71  1  0.75
023001080110  THERMOPHILIC_STREPTOMYCES  0.75  1  0.8
0230040103  EUB.SABURREUM_SUBGROUP  0.4  1  0.53
02300713  B.SPHAERICUS_GROUP  0.86  1  0.71
0230072109  STC.PNEUMONIAE_SUBGROUP  0.92  1  0.92
02250102  ENVIRONMENTAL_CLONE_OCS307_GROUP  0.89  1  1
0230040104  RUC.GNAVUS_SUBGROUP  0.9  1  0.95
021305  ENVIRONMENTAL_CLONE_RB25_GROUP  0.93  1  0.95
Distribution of 16S rDNA Sequences detected
via Cloning or Microarray Analysis
Clone Hits
Only (8)
Clone and Array
Hits (73)
Array Hits
Only (97)
Confirmed by specific PCR and sequencing:
Actinobacteria; Actinosynnemataceae; sf_1
Nitrospira; Nitrospiraceae; sf_1
Clostridia; Syntrophomonadaceae; sf_5
Planctomycetes; Planctomycetaceae; sf_3
Gammaproteobacteria; Pseudoalteromonadaceae; sf_1
Acidobacteria; Ellin6075/11-25; sf_1
Spirochaetes; Spirochaetaceae; sf_1
Spirochaetes; Spirochaetaceae; sf_3
Spirochaetes; Leptospiraceae; sf_3
Array is quantitative
[Figure: log2 HybScore (a.u.) versus log2 Concentration (pM) for spike-in organisms; r = 0.917]
Spike-in                    % G+C sequence   % G+C probes
Mycoplasma neurolyticum     50.0             45.4
Oenococcus oeni             50.9             50.8
Saprospira grandis          51.8             50.9
Fervidobacterium nodosum    58.2             53.8
Caulobacter vibrioides      56.4             58.5
Array is quantitative
~10^11 16S gene copies
~10^7 16S gene copies
Example query against meteorological data:
Does detection of Actinobacterium PENDANT-38 correlate with temperature?
[Figure: log(HybScore) versus Temp. degC; r = 0.64, p=0.026527 (adjusted for multiple testing)]
Real-time quantitative PCR confirmation of array monitoring.
Uranium Bioremediation – is uranium re-oxidation under reducing conditions due to loss of metal reducers?
(a) Array quantitation
Corrected Array Intensity (column labels in source: Area 2, Reduction, Oxidation)
Representative organism     Phylocode           Group               Corrected Array Intensity
Geothrix fermentans         2.13.8.386          Acidobacteriaceae   45    2344   2290
Geobacter metallireducens   2.28.4.7.4.10207    Geobacteraceae      251   2238   2188
Geobacter arculus           2.28.4.7.4.10209    Geobacteraceae      38    1412   1698
(b) qPCR quantitation
Species specific - Geothrix fermentans
Group specific - Geobacteraceae
Real-time quantitative PCR confirmation – Urban Aerosol
Array hybridization signal correlates significantly
with 16S copies in environmental aerosol DNA extract
Pseudomonas oleovorans example
FEMS Letters - pseudoshift
Order                     Class                     Peak Duration (sec)
Phaeophyceae (phylum)     Stramenopiles (no rank)   5
Basidiomycota (phylum)    Fungi (kingdom)           45
Deferribacterales         Cyanobacteria             450
Ascomycota (phylum)       Fungi (kingdom)           450
Vibrionales               Gammaproteobacteria       450
Flavobacteriales          Flavobacteria             450
Clostridiales             Clostridia                45
Rhizobiales               Alphaproteobacteria       45
Rhodospirillales          Alphaproteobacteria       45 n.s.
Lactobacillales           Bacilli                   45
Bacillales                Bacilli                   450
Mycoplasmatales           Mollicutes                5
Xanthomonadales           Gammaproteobacteria       5 n.s.
Burkholderiales           Betaproteobacteria        0
Sphingomonadales          Alphaproteobacteria       0
Sphingobacteriales        Sphingobacteria           0
Acholeplasmatales         Mollicutes                45
Acknowledgements
• Phil Hugenholtz – Taxonomy, Arb Interface, Chimera
• Niels Larson – SimRank
• Igor Dubosarskiy – JSP
• Jordan Moberg – Microarrays, Cloning
• Yvette Piceno – Microarrays, Primer Design
• Ingrid Zubieta – PCR, Cloning
• Eoin Brodie – Microarrays, QPCR
• Gary Andersen – 16S Microarray Group Leader
C. perfringens probe set identified in EPA sample 22 (N.Y. Spring)
[Figure: phylogenetic tree. Labels: Bacteria; CFB; Cyan; High G+C; Proteo; Gram +; Bacil-Strep; Clostridium; C.AURANTIBUTYRICUM; C.THERMOBUTYRICUM_SUBGROUP; C.ALGIDICARNIS; C. BUTYRICUM; C.BOTULINUM_SUBGROUP; C.CADAVERIS; C.BARATI_SUBGROUP; C.PERFRINGENS. Numbers marked on the figure: 27, 420, 1492.]
...CGTAAAGCTCTGTCTTTGGGGAAGATAATGACGGTACCCAAGGAGGAAGCCACGGCTAACT...
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
.................................T.................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
...................................................................
TAAAGCTCTGTCTTTGGGGAAGATA
AAAGCTCTGTCTTTGGGGAAGATAA
AAGCTCTGTCTTTGGGGAAGATAAT
AGCTCTGTCTTTGGGGAAGATAATG
Probes 5 - 8
16S rDNA
469
tacccaaggaggaagccacggctaa
C. perf. str.CPN50
C. perf. resistant
Clostridium sp. AB&J
clone p-4636-2Wa2
C. perf. A
C. perf rrnA
C. perf rrnE
C. perf rrnD
C. perf rrnC
C. perf rrnB
C. perf rrnF
C. perf rrnG
C. perf str.13a
C. perf str.13b
C. perf rrnH
C. perf rrnI
C. perf rrnJ
clone OI1612
C. perf. B
Swine manure 37 -3
Swine manure 37 -4
5 6 7 8
Ave Diff =1891
Probe Properties:
25mer exists in 90% of the taxon’s seqs
Internal 21mer exists only in one taxon.
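The two probe properties listed above can be checked with a short sketch like the one below. The function names are hypothetical, "90%" is read as a minimum fraction of 0.9, and the internal 21-mer is assumed to be the centered one (two bases trimmed from each end of the 25-mer), none of which the slide states explicitly.

```python
def prevalence(kmer, sequences):
    """Fraction of a taxon's sequences that contain the k-mer."""
    if not sequences:
        return 0.0
    return sum(kmer in s for s in sequences) / len(sequences)


def internal_21mer(probe):
    """Centered 21-mer of a 25-mer (assumed: two bases trimmed per side)."""
    return probe[2:23]


def passes_probe_properties(probe, taxon, taxa, min_prevalence=0.9):
    """Check both properties for one probe.

    taxa maps a taxon name to a list of its sequences. The probe passes
    if it is prevalent enough in its own taxon and its internal 21-mer
    is absent from every other taxon.
    """
    if prevalence(probe, taxa[taxon]) < min_prevalence:
        return False
    core = internal_21mer(probe)
    return not any(core in s
                   for other, seqs in taxa.items() if other != taxon
                   for s in seqs)
```

With probe `TAAAGCTCTGTCTTTGGGGAAGATA` (probe 5 above) and a dictionary of taxon sequences, `passes_probe_properties` returns True only when both conditions hold.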