GASiC_SeqAn - Fachbereich Mathematik und Informatik

Transcript GASiC_SeqAn - Fachbereich Mathematik und Informatik

GASiC: Metagenomic abundance
estimation and diagnostic
testing on species level
Martin Lindner, Bernhard Renard
NG 4, Robert Koch-Institut
Contents
• Motivation
– What is Metagenomics?
– Focus: Abundance Estimation
• GASiC Method
– Mapping
– Genome Similarity Estimation
– Similarity Correction
• Comparison, Application
• Technical Details
– Current Status
– GASiC and SeqAn
What is Metagenomics?
Analysis of genomic material directly taken from environmental samples.
vs.
Purified Escherichia coli
Lake Washington Microbes
[Rocky Mountain Laboratories, NIAID, NIH]
[Dennis Kunkel Microscopy, Inc.]
+
+
+
-
Identify contributors of special functions
Study interaction of microbes
Estimate microbial diversity
Highly complex samples
Mostly unknown organisms
High spatial/temporal variability
Metagenomic Communities
Lake Lanier (USA)
Bioreactor
Famous polar bear
Soil
Hydrothermal vents
Human microbiome
Marine sediments
Acid mine drainage
Low Complexity
Number of Microbial Species:
1
10
High Complexity
100
1000
10000
Bioinformatics in Metagenomics
•
•
•
•
Genome assembly
Gene/function prediction
Taxonomic profiling
Interaction networks
Focus on Taxonomic profiling:
Who is out there? And, how many?
Taxonomic Profiling
Reference based
High accuracy
Narrow focus
Clinical
Applications
Comparative
Metagenomics
Abundance
Estimation
Diversity
Estimation
Low accuracy
Broad focus
Composition based
Exploration
& Assembly
Genome Abundance Estimation
Goal:
Estimate relative abundance of
organisms from metagenomic
sequence reads
Problems:
• (Reference genome unknown)
• Unequal genome lengths
• Genomic Similarity
Buchnera aphidicola:
Streptomyces bingchenggensis:
???
0.64 M bp
11.9 M bp
GASiC Method
1. Read Mapping
• Chose suitable read mapper
• Map reads against reference genomes
– Each genome separately
– Does it match? Yes/No
• Write results to SAM-files
2. Similarity Estimation
j
Similarity matrix:
A =
i
aij
aij = Probability that a read from
genome i can be mapped to genome j
How to obtain aij:
• Simulate N reads from genome i (e.g. with Mason)
• Map reads to genome j with same mapper/settings as in 1.
• Count the number of mapped reads rij
• aij = rij/rii
3. Similarity Correction
Matrix notation:
Linear Model:
Dataset contains ci reads of Organism i
Similarity between Organism i and j: aij
 aij * ci reads will map to genome j
 𝑟𝑗 =
𝑟:
𝑨:
𝑐:
𝑁
𝑖=1 𝑎𝑖𝑗 𝑐𝑖
Number of mapped reads (step 1.)
Similarity matrix (step 2.)
True abundances
Linear Algebra lecture:
𝑐 = 𝑨−𝟏 𝑟
𝑟 = 𝑨𝑐
Solving 𝑟 = 𝑨𝑐
Constraints for 𝑐 :
Approximate solution:
𝑐 = argmin 𝑨𝑐′ − 𝑟
𝑐′
2
𝑐𝑖 ≤ 1
𝑖
𝑐𝑖 ≥ 0 ∀𝑖
Non-negative
LASSO
[Renard et al.]
Solve with standard solver for constrained optimization
GASiC: COBYLA from scipy package
Comparison
Metagenomic FAMeS dataset: [Mavromatis et al.]
•
•
•
•
•
113 microbial species
3 datasets with different complexities
100,000 Sanger reads (1000bp) per dataset
Ground truth available
Comparison by Xia et al.
Tool
MEGAN
GAAS
GRAMMy
GASiC
simLC
low complexity
RRMSE
AVGRE
48.6%
39.3%
433.8%
152.5%
20.0%
14.0%
18.7%
9.1%
simMC
medium complexity
RRMSE
AVGRE
50.0%
40.6%
171.4%
111.6%
25.6%
19.7%
17.5%
10.9%
simHC
high complexity
RRMSE
AVGRE
50.2%
40.8%
507.9%
165.8%
21.6%
14.7%
10.4%
5.8%
Application
Viral recombination data: [Moore et al.]
– 4 viruses with 80%-96% sequence similarity
– Abundance estimates from biological experiments
Technical Details
• Language: Python
– Use scipy/numpy packages
• Platform: Linux (native)
• Interfaces (command line) to:
– Read simulator (e.g. Mason [Holtgrewe])
– Read mapper (e.g. bowtie [Langmead et al.])
Technical Details
Mapping
Genomes
Simulator
write
Mapper
Mapper
write
write
SAM
SAM
read
Sim. Reads
read+write
Similarity
Matrix
read
read
Abundance
Estimates
Similarity Correction
Similarity Estimation
Reads
GASiC & SeqAn
• Avoid disk IO!
• Integrate all modules in one tool
• Abandon dependences on external tools
 SeqAn looks like a suitable framework!
Example: Similarity Matrix
Current implementation:
1.
Simulate 100,000 reads and write to fastq file
2.
Read file and map to ref. genome, write results to SAM file
3.
Read SAM file and count the number of matching reads
The SeqAn way:
1.
Simulate 1 read and map to ref. genomes; count if read mapped
2.
Repeat 100,000 times
References
Method:
•
Lindner,M.S. and Renard,B.Y. (2012) Metagenomic abundance estimation and diagnostic testing on
species level. Nucl. Acids Res., doi: 10.1093/nar/gks803.
•
Renard,B.Y. et al. (2008) NITPICK: peak identification for mass spectrometry data. BMC Bioinformatics, 9,
355.
Datasets:
•
•
Mavromatis,K. et al. (2007) Use of simulated data sets to evaluate the fidelity of metagenomic processing
methods. Nat. Methods, 4, 495–500.
Moore,J. et al. (2011) Recombinants between Deformed wing virus and Varroa destructor virus-1 may
prevail in Varroa destructor-infested honeybee colonies. J. Gen. Virol., 92, pp 156–161.
Related Methods:
•
•
Huson,D. et al. (2007) MEGAN analysis of metagenomic data. Genome Res., 17, 377–386.
Xia,L. et al. (2011) Accurate genome relative abundance estimation based on shotgun metagenomic reads.
PLoS One, 6, e27992.
External Tools:
•
•
Langmead,B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human
genome. Genome Biol., 10, R25.
Holtgrewe,M. (2010) Mason – a read simulator for second generation sequencing data. Technical report
TR-B-10-06. Institut für Mathematik und Informatik, Freie Universität Berlin.
Acknowledgements
Research Group Bioinformatics (NG4)
Bernhard Renard
Franziska Zickmann
Martina Fischer
Robert Rentzsch
Anke Penzlin
Mathias Kuhring
Sven Giese