SeqMap: mapping massive amount of
oligonucleotides to the genome
Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396
The GNUMAP algorithm: unbiased
probabilistic mapping of oligonucleotides
from next-generation sequencing
Nathan Clement et al. Bioinformatics (2010) 26: 38-45
Presented by: Xia Li
Short-read mapping software
Software     Technique                                          Reference
GNUMAP       Hashing refs + base quality + repeated regions     Clement et al., 2010
Novoalign    Hashing refs                                       Novocraft, unpublished
SOAP         Hashing refs                                       Li et al., 2008
SeqMap       Hashing reads                                      Jiang et al., 2008
RMAP         Hashing reads + read quality                       Smith et al., 2008
Eland        Hashing reads                                      Cox, unpublished
Bowtie       BWT                                                Langmead et al., 2009
Slider       Lexicographic sorting + base quality               Malhis et al., 2009
SeqMap
• Motivation
– Hashing the genome usually needs a lot of memory (e.g.
SOAP needs 14 GB of memory when mapping to the
human genome)
– Allow more substitutions and insertions/deletions
SeqMap
• Pigeonhole principle
– Split into 4 parts
– Spaced seed alignment
– ELAND, SOAP, RMAP
• Hash reads
– Short read look up table (indexed by 2 parts)
– All combinations of 2/4 parts
• Insertion/deletion:
– 2/4 combinations with 1/2 shifted one nucleotide to its left or right
Figure labels: Reference Genome, Short Read
Image credit: J. Ruan
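The pigeonhole idea on this slide can be sketched in a few lines of Python. If a read is split into 4 parts and at most 2 substitutions are allowed, at least 2 of the parts must match the reference exactly, so indexing each read under all 6 choices of 2 parts is enough to find every candidate. This is a minimal illustration and not SeqMap's actual implementation: the function names (split_parts, seed_keys, build_read_table, map_reads) are hypothetical, all reads are assumed to have the same length, and only substitutions are handled (the shifted combinations for insertions/deletions are left out).

```python
from collections import defaultdict
from itertools import combinations


def split_parts(seq, n_parts=4):
    """Split seq into n_parts contiguous pieces, returned as (offset, piece)."""
    size, extra = divmod(len(seq), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append((start, seq[start:end]))
        start = end
    return parts


def seed_keys(seq, n_parts=4, n_keep=2):
    """One lookup key per choice of n_keep parts out of n_parts."""
    parts = split_parts(seq, n_parts)
    return [tuple(parts[i] for i in combo)
            for combo in combinations(range(n_parts), n_keep)]


def build_read_table(reads):
    """Short-read lookup table: key (2 exact parts) -> ids of reads."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        for key in seed_keys(read):
            table[key].append(read_id)
    return table


def map_reads(genome, reads, max_mismatches=2):
    """Scan the genome once and verify every candidate found in the table."""
    if not reads:
        return []
    read_len = len(reads[0])  # assumption: all reads share one length
    table = build_read_table(reads)
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for key in seed_keys(window):
            for read_id in table.get(key, ()):
                mismatches = sum(a != b for a, b in zip(window, reads[read_id]))
                if mismatches <= max_mismatches:
                    hits.add((read_id, pos))
    return sorted(hits)
```

Because the table is built from the reads rather than the genome, its size grows with the number of reads and the genome is only streamed through once, which is the memory trade-off the motivation slide points at.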
Experiment & Result
• Deals with more substitutions and
insertions/deletions
Randomly generate a DNA sequence 1Mb long, then add 100Kb random
substitutions, N’s and insertions/deletions
GNUMAP
• Motivation
– Base uncertainty
• E.g. nearly equal or low probabilities for A, C, G or T
• Filtering low-quality reads [RMAP] -> discards up to half of
the reads (Harismendy et al., 2009)
– Repeated regions in the genome
• Discard them -> loss of up to half of the data
(Harismendy et al., 2009)
• Record one -> unequal mapping to some of the repeat
regions
• Record all -> each location has 3 times the correct
score
GNUMAP
• Flow-chart
– Probabilistic Needleman-Wunsch -> Alignment Score
– Read from sequencer: GGGTACAACCATTAC
– Figure sequences: read pieces GGGTAC and AACCAT; reference
ACTGAACCATACGGGTACTGAACCATGAA (AACCAT occurs twice)
– Read is added to both repeat regions proportionally to their
match quality, weighted by its # of occurrences in the genome
Slide credit: N. Clement
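The "added to both repeat regions proportionally" step can be illustrated with a small Python sketch using the reference string from the slide. It hashes the reference k-mers, collects candidate start positions for a read, and shares one unit of read weight among them in proportion to a match score. Several things here are assumptions made for illustration: the function names are hypothetical, the score is a plain fraction of matching bases standing in for the probabilistic Needleman-Wunsch alignment score, and the normalization (score divided by the sum of scores) is a simplification that may differ from GNUMAP's exact formula.

```python
from collections import defaultdict


def index_genome(genome, k):
    """Hash every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def match_score(read, genome, pos):
    """Toy match quality: fraction of read bases equal to the reference."""
    window = genome[pos:pos + len(read)]
    if len(window) < len(read):
        return 0.0
    return sum(a == b for a, b in zip(read, window)) / len(read)


def allocate_read(read, genome, index, k):
    """Share one read among candidate locations in proportion to score."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start >= 0:
                candidates.add(start)
    scores = {pos: match_score(read, genome, pos) for pos in candidates}
    total = sum(scores.values())
    if total == 0:
        return {}
    return {pos: s / total for pos, s in scores.items() if s > 0}


genome = "ACTGAACCATACGGGTACTGAACCATGAA"  # reference from the slide
index = index_genome(genome, 6)
weights = allocate_read("AACCAT", genome, index, 6)
```

With the slide's reference, AACCAT starts at two positions with equal score, so each location receives half of the read instead of a full count at both or a full count at only one.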
Experiment & Result
Comments
• SeqMap
– Pros: deals with more
substitutions/insertions/deletions
– Cons: memory consuming, not fast
• GNUMAP
– Pros: considers base quality and repeated regions ->
generates more useful information and achieves
best performance (~15% increase)
– Cons: memory consuming, slow, more noise