32. Accurate Estimation of Microbial Communities Using Pyrotags

Accurate estimation of microbial
communities using 16S tags
Julien Tremblay, PhD
16S rRNA as phylogenetic marker gene
70S ribosome subunits:
• 30S: 16S rRNA + 21 proteins
• 50S: 5S rRNA, 23S rRNA + 34 proteins
highly conserved between different species
of bacteria and archaea
Escherichia coli
16S rRNA
Primary and Secondary Structure
Falk Warnecke
16S rRNA in environmental microbiology
(Sanger clone libraries)
900-1100 bp length
Falk Warnecke
Next generation sequencing (NGS)
• 454: 0.5M 450bp reads
• Illumina: 10-350M 150bp reads/lane
(Figure: read length vs. throughput and cost, with cost markers $$ and $)
Illumina tags (itags)
(Figure: 454 and Illumina reads compared; length labels ~400 bp and ~250 bp; example sequence ACGTGGTACTACGTGATAGTGTAT)
• 454 = one read per amplicon
• Illumina = two (paired-end) reads per amplicon, which have to be assembled
• Both reads need to be of good quality
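The assembly of the two Illumina reads can be sketched as an overlap search between read 1 and the reverse complement of read 2. This is a minimal illustration only: the function names, the minimum overlap of 10 bp and the 10% mismatch tolerance are assumptions, not parameters from the talk, and quality scores are ignored.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge read1 with the reverse complement of read2.

    Tries the longest possible overlap first and returns the merged
    sequence, or None if no overlap passes the mismatch tolerance.
    """
    r2 = reverse_complement(read2)
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail, head = read1[-olen:], r2[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            return read1 + r2[olen:]
    return None


# Toy amplicon (hypothetical), sequenced from both ends with 25 bp reads
fragment = "ACGTGGTACTACGTGATAGTGTATCCGGAATTGCA"
read1 = fragment[:25]
read2 = reverse_complement(fragment[10:])
merged = merge_pair(read1, read2)
```

Here the two 25 bp reads share a 15 bp overlap, so `merged` reproduces the 35 bp toy amplicon.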
Game plan to survey microbial diversity
16S rRNA variable regions: V1 V2 V3 V4 V5 V6 V7 V8 V9
Generate amplicons of a
given variable region
from bacterial community
(many millions of sequences)
Reduce dataset by
dereplication/clustering
(Figure: clusters with example abundances x10, x1, x1,000, x2,000, x200, x1,200, x800, x10,000)
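The dereplication step can be sketched as collapsing identical sequences while keeping their counts. This is a minimal illustration with made-up reads; the function name is hypothetical and real pipelines also handle near-identical sequences through clustering.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, count) pairs, most abundant first."""
    return Counter(reads).most_common()


# Toy reads (hypothetical)
reads = ["ACGT"] * 3 + ["ACGA"] + ["TTGA"] * 2
unique = dereplicate(reads)
```

Six reads reduce to three unique sequences, each carrying its abundance.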
Why amplicon tags?
Deeper, cheaper, faster
Identification
(BLAST, RDP classifier)
Rare biosphere
• High sequencing depth of NGS reveals “rare” OTUs
• Lots of reads are present only once in a sample…
(Figure: rank-abundance plot; axes: Abundance vs. Rank; label: Rare biosphere)
Sequencing error? Chimeras? Background noise?
Rare bias sphere?
Is the rare biosphere an artifact of NGS error?
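A rank-abundance view and the share of OTUs seen only once can be computed directly from OTU counts. The sketch below uses hypothetical OTU names and counts; it only illustrates the bookkeeping and says nothing about whether the singletons are real.

```python
def rank_abundance(otu_counts):
    """Return abundances sorted from most to least abundant (rank 1 first)."""
    return sorted(otu_counts.values(), reverse=True)


def singleton_fraction(otu_counts):
    """Fraction of OTUs that are represented by a single read."""
    singles = sum(1 for count in otu_counts.values() if count == 1)
    return singles / len(otu_counts)


# Toy OTU table (hypothetical names and counts)
otu_counts = {"OTU_a": 10000, "OTU_b": 2000, "OTU_c": 1, "OTU_d": 1, "OTU_e": 10}
ranks = rank_abundance(otu_counts)
rare_share = singleton_fraction(otu_counts)
```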
Control experiment: estimate rare biosphere
in a single strain of E.coli
• V1 & V2: primers 27F / 342R
• V8: primers 1114F / 1392R
Should not be a sequencing artifact, if relatively stringent
clustering parameters are applied
Subject to controversy – Is rare always real?
Kunin et al., (2009), Environ. Microbiol.
Quince et al., (2009), Nat. Methods
Illumina tags (itags)
• Typical 454 run: 450,000 to 500,000 reads
• “Typical” Illumina run:
• GAIIx: 10,000,000 to 40,000,000 reads/lane
• HiSeq: ~350,000,000 reads/lane
• MiSeq (new): ~8,000,000 reads/lane
• Move 16S tag sequencing to the Illumina platform
• HiSeq = huge output compared to 454 (suitable for big projects with 1000+ indexes (barcodes)/libraries)
• MiSeq = moderately high throughput (more suitable)
• Higher throughput calls for a more efficient clustering algorithm (SeqObs).
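The throughput figures above translate into reads per library with simple division. This sketch assumes reads are spread evenly across barcodes, which real runs only approximate; the 100-library MiSeq case is a hypothetical example, while the other numbers come from the slide.

```python
def reads_per_library(reads_per_lane, n_libraries):
    """Average number of reads per barcoded library, assuming even pooling."""
    return reads_per_lane // n_libraries


hiseq_depth = reads_per_library(350_000_000, 1000)  # HiSeq lane, 1000 barcodes
miseq_depth = reads_per_library(8_000_000, 100)     # MiSeq run, 100 barcodes (assumed)
```

Under these assumptions a HiSeq lane still yields about 350,000 reads per library with 1000 barcodes, which is in the range of a whole 454 run.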
16S tags clustering
Edward Kirton, JGI
Number of reads >> number of clusters
Illumina rRNA Amplicon Sequencing
(Figure: number of sequences, in millions of reads, at each stage: RAW, BARCODE, OVERLAP, ASSEM, CLUSTERS. Annotations: "Clustering happens here!" and "~100,000 clusters".)
Edward Kirton, JGI
Validation of 16S tags on MiSeq
• Quality is superior in MiSeq
(Figure panels: 454; MiSeq reads 1 and 2 separately; MiSeq reads 1 and 2 assembled)
MiSeq validation
• Exploratory experiments using 11 wetland samples.
• Validate reproducibility between runs
MiSeq validation
• Beta diversity (UniFrac distances)
(Figure panels: Run 1, Run 2)
MiSeq validation
• What are the gains of using MiSeq assembled paired-end reads over 454 reads?
Average bootstrap value for all clusters at every taxonomic level.
(Figure axes: Average bootstrap value vs. Taxonomic level)
• 454 clusters show higher confidence than MiSeq clusters
• Better quality in MiSeq reads, but shorter read lengths
Comparing 454 with MiSeq
• What are the gains of using MiSeq assembled paired-end reads over 454?
Clusters having > 0.50 bootstrap value
For instance, ~310,000 reads made it to the class level
MiSeq outperforms 454 in terms of read depth
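The two summaries used in this comparison, the average bootstrap value per taxonomic level and the number of reads in clusters above the 0.50 bootstrap cutoff, can be sketched as follows. The cluster records, their field names and all values are hypothetical; only the 0.50 cutoff comes from the slide.

```python
def mean_bootstrap(clusters, level):
    """Average classifier bootstrap value over all clusters at one level."""
    values = [c["bootstrap"][level] for c in clusters]
    return sum(values) / len(values)


def reads_above_threshold(clusters, level, threshold=0.50):
    """Total reads in clusters whose bootstrap at this level exceeds the cutoff."""
    return sum(c["reads"] for c in clusters if c["bootstrap"][level] > threshold)


# Toy clusters (hypothetical read counts and bootstrap values)
clusters = [
    {"reads": 1000, "bootstrap": {"phylum": 1.0, "class": 0.9, "genus": 0.6}},
    {"reads": 200, "bootstrap": {"phylum": 0.8, "class": 0.4, "genus": 0.2}},
    {"reads": 1, "bootstrap": {"phylum": 0.3, "class": 0.2, "genus": 0.1}},
]
class_reads = reads_above_threshold(clusters, "class")
phylum_mean = mean_bootstrap(clusters, "phylum")
```

Counting reads rather than clusters is what lets a deeper run come out ahead even when its per-cluster confidence is lower.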
itags – rare OTUs
MiSeq wetlands test samples
Low-abundance reads consistently show low confidence in classification.
Low-abundance reads = errors, artifacts?
Are low-abundance reads underrepresented in databases?
Comparing 454 with Illumina
• Compare 454 and MiSeq runs of the same sample
• It is a challenge, though, to compare the V4 region with the V6-V8 region.
• V4: 515 to 806 (~291 bp)
• V6-V8: 926 to 1392 (~466 bp)
Comparing 454 with Illumina
• The choice of primer pair and variable region is likely to affect the results.
In silico PCR on the 16S Greengenes database.
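In silico PCR amounts to locating the forward primer and the reverse complement of the reverse primer in each reference sequence and extracting what lies between them. The sketch below requires exact matches (allowing IUPAC degenerate bases) and uses a made-up template and made-up primers; it is not the procedure used for the talk.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement that also handles IUPAC degenerate codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer)


def in_silico_pcr(template, fwd_primer, rev_primer):
    """Return the amplicon (primers included), or None if a primer fails to match."""
    fwd = re.search(primer_regex(fwd_primer), template)
    if fwd is None:
        return None
    downstream = template[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(rev_primer)), downstream)
    if rev is None:
        return None
    return template[fwd.start():fwd.end() + rev.end()]


# Toy template and primers (hypothetical, not real 16S primers)
template = "GGGACGTACCTTTTTTTTTTGGATTCAAA"
amplicon = in_silico_pcr(template, "ACGTM", "GAATYC")
```

Running this over a reference database with different primer pairs would show which taxa each pair can amplify and the expected amplicon lengths.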
PyroTagger (for 454 amplicons)
Unzip, validate
Remove low-quality reads
Redundancy removal
PyroClust & Uclust
Remove chimeras
Sample comparison, post-processing
pyrotagger.jgi-psf.org
Classification and barcode separation
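The clustering step of such a pipeline can be illustrated with a greedy, centroid-based scheme. This is a simplified stand-in, not the PyroClust or Uclust algorithm: identity is computed position by position on equal-length sequences instead of from an alignment, and the 97% threshold is an assumption.

```python
def identity(a, b):
    """Fraction of matching positions; sequences of unequal length score 0."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(unique_seqs, threshold=0.97):
    """Assign each (sequence, count) to the first centroid it matches.

    unique_seqs should be ordered from most to least abundant, so that
    abundant sequences become cluster centroids.
    """
    clusters = []
    for seq, count in unique_seqs:
        for cluster in clusters:
            if identity(seq, cluster["centroid"]) >= threshold:
                cluster["size"] += count
                break
        else:
            clusters.append({"centroid": seq, "size": count})
    return clusters


# Toy dereplicated input (hypothetical sequences and counts)
seq_a = "ACGT" * 10          # abundant sequence
seq_b = "T" + seq_a[1:]      # one mismatch to seq_a (97.5% identity)
seq_c = "TTTT" + seq_a[4:]   # three mismatches to seq_a (92.5% identity)
clusters = greedy_cluster([(seq_a, 100), (seq_c, 5), (seq_b, 3)])
```

The single-mismatch variant is absorbed into the abundant cluster, while the more divergent sequence starts a cluster of its own.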
• Sequences of cluster (OTU) representatives
• BLAST vs GreenGenes and Silva databases, dereplicated at 99.5%
• Distribution of microbial phyla in the dataset
(Figure: stacked bar chart of phylum distribution per sample, y-axis 0% to 100%. Sample labels: 10GP1, 1PS1, 1PS2, 2A?1. Phyla: Proteobacteria, Metazoa, Firmicutes, Bacteroidetes, Spirochaetes.)
(Table: per-sample read counts and best BLAST hit for each cluster. The per-sample read counts and hit IDs are not legible in the source; [...] marks illegible text. Query start is 1 and query end is 345 for every hit.)

Cluster    % identity  Alignment length  Mismatches  Gaps  Hit start  Hit end  E-value    Score
Cluster1   97.7        345               8           0     1336       992      1.00E-176  620
Cluster2   98.8        345               4           0     1247       903      0          652
Cluster3   100         345               0           0     1345       1001     0          684
Cluster4   97.7        345               8           0     1338       994      1.00E-176  620
Cluster5   99.7        345               1           0     1382       1038     0          676
Cluster7   99.1        345               1           1     1273       931      0          654
Cluster8   99.7        345               1           0     1151       807      0          676
Cluster9   96.5        345               10          2     1365       1023     3.00E-162  573
Cluster10  93.3        345               23          0     1288       944      8.00E-141  502
Cluster13  100         345               0           0     1388       1044     0          684
Cluster15  98.8        345               4           0     1277       933      0          652
Cluster17  100         345               0           0     1326       982      0          684
Cluster18  96.8        347               9           1     1359       1013     8.00E-169  595
Cluster19  99.4        345               2           0     1245       901      0          668
Cluster20  100         345               0           0     1351       1007     0          684
Cluster21  97.7        345               7           1     1362       1019     3.00E-174  613
Cluster22  99.1        345               3           0     1261       917      0          660
Cluster23  98.3        345               6           0     1375       1031     0          636
Cluster24  100         345               0           0     1356       1012     0          684
Cluster29  100         345               0           0     1378       1034     0          684
Cluster31  97.1        345               10          0     1274       930      8.00E-172  605
Cluster33  96.8        345               11          0     1347       1003     2.00E-169  597
Cluster34  100         345               0           0     1376       1032     0          684
Cluster35  96.8        345               11          0     1366       1022     3.00E-159  563
Cluster41  96.8        345               11          0     1379       1035     2.00E-169  597
Cluster44  98.8        346               3           1     1348       1003     0          646
Cluster50  100         345               0           0     1382       1038     0          684
Cluster53  100         345               0           0     1334       990      0          684
Cluster67  97.7        345               8           0     1372       1028     1.00E-176  620

Cluster    Full name | Taxonomy
Cluster1   [...] geographic regions central Tibet geothermal spring mat clone [...] | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae
Cluster2   Microbial consortia fermentor methanogenic bioreactor clone EBR-02E-0436 | Bacteria; Firmicutes; Clostridia; Clostridiales; Coprococcus
Cluster3   Bacteroides sp. str. [...] | Bacteria; Bacteroidetes; Bacteroidetes (class); Bacteroidales; Bacteroidaceae; Bacteroides
Cluster4   Geobacillus sp. [...] | Bacteria; Firmicutes; Bacillales; Bacillaceae; Geobacillus
Cluster5   Electricigen Enrichment MFC full-scale anaerobic bioreactor sludge treating brewery waste clone 31f06 | Bacteria; Bacteroidetes; Bacteroidales; Bacteroidaceae; Prevotellaceae
Cluster7   Thermoanaerobacterium saccharolyticum str. B6A | Bacteria; Firmicutes; Clostridia; Thermoanaerobacterales; Thermoanaerobacterales Family III. Incertae Sedis; Thermoanaerobacterium
Cluster8   Portuguese dry smoked sausages (chouricos) type Ribatejano isolate str. Te16R | Bacteria
Cluster9   Clostridium stercorarium str. DSM 8532 T | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster10  packed-bed reactor clone CFB[...] | Bacteria; Firmicutes; Clostridia
Cluster13  Bacillus circulans str. [...] | Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus
Cluster15  mesophilic anaerobic BSA digester clone BSA1B-05 | Bacteria; Firmicutes; Clostridia; Peptostreptococcaceae
Cluster17  Bacillus sp. str. SL[...] | Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus
Cluster18  Actinobaculum sp. [...] str. P2P_19 | Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales; Actinomycineae; Actinomycetaceae; Actinobaculum
Cluster19  Guguan hot spring isolate str. K1L1 | Bacteria; Firmicutes; Clostridia; Thermoanaerobacteriales
Cluster20  Clostridium cellulosi | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster21  Clostridiaceae bacterium SN021 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae
Cluster22  Clostridiaceae str. [...] | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae
Cluster23  intestinal [...] that activate dietary lignan secoisolariciresinol diglucoside human feces isolate ED-Mt61/PYG-s6 anaerobic str. ED-Mt61/PYG-s6 | Bacteria
Cluster24  Klebsiella pneumoniae str. [...] | Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Klebsiella
Cluster29  Pseudomonas indica | Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Pseudomonadaceae; Pseudomonas
Cluster31  Clostridium sp. str. IMSNU 40011 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster33  Symbiobacterium sp. str. KA13 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiales Family XVIII. Incertae Sedis; Symbiobacterium
Cluster34  human fecal clone SJTU_G_05_26 | Bacteria; Proteobacteria; Gammaproteobacteria; Betaproteobacteria; SJTU_B_02_45
Cluster35  [...] Arctic peninsula Svalbard Norway [...] rumen isolates reindeer fed pelleted concentrates (RF-80) clone AF11 | Bacteria; Firmicutes; Clostridia; Clostridiales; RF30; RF6
Cluster41  human fecal clone SJTU_C_03_72 | Bacteria; Bacteroidetes; Bacteroidales; Bacteroidaceae
Cluster44  Clostridium jejuense str. HY-35-12 T | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster50  Enterococcus casseliflavus str. [...] | Bacteria; Firmicutes; Lactobacillales; Enterococcaceae; Enterococcus
Cluster53  Clostridium sp. str. BG-C[...] | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster67  mesophilic anaerobic digester clone G35_D8_H_B_E11 | Bacteria; Firmicutes; Clostridia; Clostridiales
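For the gap-free hits, the reported % identity follows directly from the alignment length and the number of mismatches. The sketch below assumes the Gaps column counts gap openings rather than gapped positions, so it is only checked against hits without gaps; the function name and the `gap_positions` parameter are illustrative.

```python
def percent_identity(alignment_length, mismatches, gap_positions=0):
    """Percentage of alignment columns that are identical matches."""
    matches = alignment_length - mismatches - gap_positions
    return 100.0 * matches / alignment_length


# Gap-free examples: 345-column alignments with 8, 4 and 23 mismatches
examples = [round(percent_identity(345, m), 1) for m in (8, 4, 23)]
```

These three cases give 97.7, 98.8 and 93.3, matching the % identity values reported for 345-column hits with 8, 4 and 23 mismatches.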
• Also see the QIIME pipeline
Challenges
• Short size of amplicon
• What filtering parameters to use (stringency level)?
• Balance between filter stringency and keeping as much data as we can
• Whole new dimension for rare biosphere?
• Handling large numbers of samples (tens of thousands)
• Sequencing run is fast, but library preparation time is long.
• Cost of barcoded primers (will need lots of barcodes), handling.
• Huge number of samples: statistical models…
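The trade-off between filter stringency and retained data can be made concrete by filtering the same reads at several cutoffs. The sketch below uses an expected-error filter, which is one common choice and not necessarily what was used in this work; the Phred scores and cutoffs are made up.

```python
def expected_errors(quals):
    """Expected number of base-call errors in a read, from its Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def fraction_retained(reads_quals, max_expected_errors):
    """Share of reads whose expected error count is within the cutoff."""
    kept = sum(1 for q in reads_quals if expected_errors(q) <= max_expected_errors)
    return kept / len(reads_quals)


# Toy quality profiles (hypothetical): four 100 bp reads at Q40, Q30, Q20, Q10
reads_quals = [[40] * 100, [30] * 100, [20] * 100, [10] * 100]
retained = {cutoff: fraction_retained(reads_quals, cutoff) for cutoff in (0.05, 0.5, 2.0)}
```

Tightening the cutoff from 2.0 to 0.05 expected errors drops the retained fraction from 75% to 25% of these toy reads.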
Acknowledgments
• Susannah Tringe
• Edward Kirton
• Feng Chen
• Kanwar Singh
• Alison Fern
• And many others!
Thanks!
16S rRNA
Dangl lab, UNC