SeqMap: mapping massive amount of oligonucleotides to the genome
Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396

The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing
Nathan Clement et al. Bioinformatics (2010) 26: 38-45

Presented by: Xia Li

Short-read mapping software

Software    Technique                                         Reference
GNUMAP      Hashing refs + base quality + repeated regions    Clement et al., 2010
Novoalign   Hashing refs                                      Novocraft, unpublished
SOAP        Hashing refs                                      Li et al., 2008
SeqMap      Hashing reads                                     Jiang et al., 2008
RMAP        Hashing reads + read quality                      Smith et al., 2008
Eland       Hashing reads                                     Cox, unpublished
Bowtie      BWT                                               Langmead et al., 2009
Slider      Lexicographic sorting + base quality              Malhis et al., 2009

SeqMap
• Motivation
  – Hashing the genome usually needs a lot of memory (e.g. SOAP needs 14GB of memory when mapping to the human genome)
  – Allow more substitutions and insertions/deletions

SeqMap
• Pigeonhole principle
  – Spaced seed alignment
  – ELAND, SOAP, RMAP
• Hash reads
• Insertion/deletion: 2/4 combinations, with 1 of the 2 parts shifted one nucleotide to its left or right
[Figure labels: Reference Genome; Short Read; Split into 4 parts; All combinations of 2/4 parts; Short read look up table (indexed by 2 parts). Image credit: J. Ruan]

Experiment & Result
• Deals with more substitutions and insertions/deletions
  – Randomly generate a DNA sequence of length 1Mb, then add 100Kb of random substitutions, N's and insertions/deletions

GNUMAP
• Motivation
  – Base uncertainty
    • Such as nearly equal or low probabilities for A, C, G or T
    • Filter low-quality reads [RMAP] -> discards up to half of the reads (Harismendy et al., 2009)
  – Repeated regions in the genome
    • Discard them -> loss of up to half of the data (Harismendy et al., 2009)
    • Record one -> unequal mapping to some of the repeat regions
    • Record all -> each location having 3 times the correct score

GNUMAP
• Flow-chart
[Figure labels: Read from sequencer: GGGTACAACCATTAC; Sequences shown: GGGTAC, AACCAT, AACCAT; Genome: ACTGAACCATACGGGTACTGAACCATGAA; Probabilistic Needleman-Wunsch; Alignment Score]
The read is added to both repeat regions in proportion to their match quality, weighted by its number of occurrences in the genome.
Slide credit: N. Clement

Experiment & Result

Comments
• SeqMap
  – Pros: deals with more substitutions/insertions/deletions
  – Cons: memory consuming, not fast
• GNUMAP
  – Pros: considers base quality and repeated regions -> generates more useful information and achieves the best performance (~15% increase)
  – Cons: memory consuming, slow, more noise
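The pigeonhole idea on the SeqMap slide can be illustrated with a small sketch. If a read is split into 4 parts and at most 2 substitutions are allowed, at least 2 of the 4 parts must match the genome exactly, so indexing every read under all 2-of-4 part combinations finds every such hit. The code below is an illustration under simplifying assumptions (substitutions only, equal-length reads whose length is divisible by 4); the names `build_read_index` and `map_reads` are hypothetical and this is not SeqMap's actual implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_read_index(reads, parts=4, keep=2):
    """Hash each read under every choice of `keep` out of `parts` segments."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        seg = len(read) // parts  # assumption: read length divisible by `parts`
        for combo in combinations(range(parts), keep):
            key = (combo, tuple(read[i * seg:(i + 1) * seg] for i in combo))
            index[key].append(read_id)
    return index


def map_reads(genome, reads, max_mismatches=2, parts=4, keep=2):
    """Scan the genome once and verify every candidate found in the read index."""
    if not reads:
        return []
    read_len = len(reads[0])  # assumption: all reads have the same length
    seg = read_len // parts
    index = build_read_index(reads, parts, keep)
    combos = list(combinations(range(parts), keep))
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for combo in combos:
            key = (combo, tuple(window[i * seg:(i + 1) * seg] for i in combo))
            for read_id in index.get(key, ()):
                # Candidate shares `keep` exact segments; count all mismatches.
                mism = sum(a != b for a, b in zip(window, reads[read_id]))
                if mism <= max_mismatches:
                    hits.add((read_id, pos, mism))
    return sorted(hits)


# Usage: a 12-base read copied from genome position 4, with 2 substitutions.
genome = "ACGTACGTTTGACCAGTAGGCTAACG"
read = "TCGTATGACCAG"
print(map_reads(genome, [read]))
```

Hashing the reads rather than the genome keeps the table size proportional to the read set, which is the memory argument made on the motivation slide. Handling insertions/deletions would additionally require the shifted-part lookups described on the slide, which this sketch omits.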
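The three repeat-handling options on the GNUMAP motivation slide (discard, record one, record all) can be contrasted with proportional allocation in a few lines. The sketch below assumes each candidate location already has a non-negative match score and simply normalizes those scores so that one read contributes a total weight of 1; the function name `allocate_read` is hypothetical, and this normalization is a simplification, not the exact formula from Clement et al.

```python
def allocate_read(match_scores):
    """Split one read's unit of coverage across its candidate locations.

    `match_scores` maps genome position -> non-negative match score.
    Each location receives score / sum(scores), so the weights sum to 1.
    """
    total = sum(match_scores.values())
    if total == 0:
        return {loc: 0.0 for loc in match_scores}
    return {loc: score / total for loc, score in match_scores.items()}


# Usage: the slide's genome contains AACCAT at two positions.
genome = "ACTGAACCATACGGGTACTGAACCATGAA"
first = genome.find("AACCAT")
second = genome.find("AACCAT", first + 1)

# Assumption for illustration: both repeat copies match equally well.
weights = allocate_read({first: 0.9, second: 0.9})
print(weights)
```

With equal scores, each repeat copy receives half of the read instead of a full count ("record all") or an arbitrary winner ("record one"), which is the behavior the flow-chart slide describes.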