Transcript ppt
Presentation – Homework 2
Advanced Topics: Current Bioinformatics
Instructor: Dr. Jianhua Ruan
GROUP MEMBERS: JAMIUL JAHID MOHAMMAD IFTEKHARUL ISLAM TANZIR MUSABBIR

NGS Analysis Papers
  PatMaN: Rapid Alignment of Short Sequences to Large Databases
    Kay Prüfer, Udo Stenzel, Michael Dannemann, Richard Green, Michael Lachmann
  ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches
    You Kim, Nikhil Teletia, Victor Ruotti, Maher, James Thomson and Jignesh Patel

PatMaN: Rapid Alignment of Short Sequences to Large Databases
  PatMaN – Pattern Matching in Nucleotide Databases
  A tool for performing exhaustive searches to identify all occurrences of a large number of short sequences within a genome-sized database.
  Reads sequences in FASTA format and reports all hits within the given edit-distance cutoff.
  Advantages:
    Allows a predefined number of gaps and mismatches
    Ambiguity codes can be searched
    Search time is short for perfect matches

ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches
  For matching a large set of oligonucleotide sequences against a genome database using gapped alignments
  Advantages:
    It generates both ungapped and gapped alignments
    It allows up to three errors, including insertions, deletions and mismatches
    It is able to detect multiple classes of mutations: SNVs and indels

ProbeMatch: Background
  High-throughput DNA sequencing technologies: Illumina, 454 Life Sciences
  A large set of short sequences is produced
  These must be mapped to a genome, allowing for only a few errors
  Traditional sequence alignment tools can do this, but are computationally impractical

ProbeMatch: Background
  ELAND (Efficient Local Alignment of Nucleotide Data)
    Searches DNA databases for a large number of short sequences
    Only ungapped alignments, allowing up to two mismatches
  MAQ (Mapping and Assembly with Quality)
    Only ungapped alignments, allowing up to three mismatches
    Measures the error probability of alignments using sequence quality information
  SOAP
  SeqMap

ProbeMatch: Background
  These programs are often faster than BLAST by an order of magnitude or more
  But they usually map only 60-80% of the query sequences to genomes
  Further processing is then needed, using computationally expensive but sensitive alignment methods
  So the overall gain is limited
  ProbeMatch addresses this challenge effectively

ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches
  Allows a richer match model
  Finds gapped and ungapped alignments with up to three errors, in any combination of error types
  Able to detect multiple classes of mutations

ProbeMatch: Methodology
  Takes as input a query sequence set and a database of sequences.
  The database is divided into small segments.
  ProbeMatch loads each segment and builds a q-gram index.
  To find potential hits, ProbeMatch searches the query against the q-gram index and extends hits to find longer alignments.
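
The index-then-extend steps on the Methodology slide can be sketched roughly as below. This is only an illustration under simplifying assumptions, not ProbeMatch's implementation: it uses a contiguous q-gram index, its extension step counts mismatches only (no gaps), and the names build_qgram_index, candidate_hits and hamming_within, as well as q = 4, are made up for the example.

```python
from collections import defaultdict

def build_qgram_index(segment, q):
    """Map every length-q substring of the segment to its start positions."""
    index = defaultdict(list)
    for i in range(len(segment) - q + 1):
        index[segment[i:i + q]].append(i)
    return index

def candidate_hits(query, index, q):
    """Return segment offsets where the whole query might start.

    Each q-gram of the query that occurs in the index proposes the
    offset at which the full query would begin.
    """
    starts = set()
    for j in range(len(query) - q + 1):
        for pos in index.get(query[j:j + q], ()):
            starts.add(pos - j)
    return sorted(starts)

def hamming_within(query, segment, start, k):
    """Extension step (mismatches only): check a candidate offset."""
    if start < 0 or start + len(query) > len(segment):
        return False
    window = segment[start:start + len(query)]
    return sum(a != b for a, b in zip(query, window)) <= k

segment = "ACGTACGTTAGC"
index = build_qgram_index(segment, q=4)
query = "ACGTTTGC"  # differs from segment[4:12] == "ACGTTAGC" at one position

candidates = candidate_hits(query, index, q=4)
hits = [s for s in candidates if hamming_within(query, segment, s, k=1)]
print(candidates)  # [0, 4]
print(hits)        # [4]
```

The filter proposes two offsets, and the extension step discards offset 0 because it would need three mismatches.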

ProbeMatch: Methodology
  If two sequences Q and T match within k errors, and j non-overlapping fragments are taken from Q, then T contains at least one of the fragments with at most ⌊k/j⌋ errors.
  The matched hits are then extended to check whether the entire query sequence and the target sequence can be aligned within k errors.
  A gapped q-gram index ("Better Filtering with Gapped q-Grams", Burkhardt and Kärkkäinen, 2002) provides more efficient filtering than an ungapped q-gram index.

ProbeMatch: Result
  169095 transcriptome short reads from a prostate cell line (RWPE), generated by the Illumina Genome Analyzer, were matched against the human genome using various alignment programs.
  Table: Comparison of execution times and sensitivity

PatMaN: Rapid Alignment of Short Sequences to Large Databases
  Algorithm
    a) A single keyword tree is constructed from all the query sequences.
    b) When the ambiguity flag is set, a match occurs if the base is one of the nucleotides in the ambiguity code.
    c) When the ambiguity flag is omitted, a base aligned to an ambiguity character is counted as a mismatch.
    d) All bases along a query sequence are added as a path from the root of the tree to a leaf, with one edge per base and the leaf labelled with the query sequence id.
    e) Suffix links are also added to the tree.

PatMaN: Rapid Alignment of Short Sequences to Large Databases
  Suppose the query sequences are 'CCC', 'GA' and 'GT'.
  [Figure: basic keyword tree with the paths C-C-C (leaf CCC), G-A (leaf GA) and G-T (leaf GT)]

PatMaN: Rapid Alignment of Short Sequences to Large Databases
  [Figure: the keyword tree after adding the suffix links]

PatMaN: Rapid Alignment of Short Sequences to Large Databases
  [Figure: the completed tree; edge labels in the figure include A, T, N, C and G]

PatMaN: Rapid Alignment of Short Sequences to Large Databases
  Algorithm
  Once the tree is completed, each sequence in the target database is evaluated base by base and compared to a list of partial matches.
  Each partial match consists of:
    A node
    The number of mismatches and gaps so far
  The list is initialized with:
    The root of the tree
    An edit count of zero
  In each iteration of the algorithm, all partial matches are advanced along perfectly matching outgoing edges.

PatMaN: Rapid Alignment of Short Sequences to Large Databases
  Complexity
  Without ambiguity codes, O(L) time and space are required, where L is the total length of all query sequences.
  When ambiguity codes are enabled, both the time and space requirements increase exponentially.
  The running time depends on the target database, and depends heavily on the maximum edit distance as well as the average length of the query sequences.
  For each additional edit operation, an exponentially increasing number of partial matches must be considered.

PatMaN: Rapid Alignment of Short Sequences to Large Databases
  Result
  PatMaN's time constraints mean it is suitable for short sequences with a limited number of edit operations.
  HG-U95 was matched against the chimpanzee genome (panTro2), allowing no gaps and one mismatch.
  PatMaN took 2.5 h and found 15.9 million hits.

Q/A?
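
The PatMaN Algorithm slides describe a keyword tree scanned base by base with a list of partial matches. The sketch below illustrates that idea for the example queries 'CCC', 'GA' and 'GT'. It is a simplified illustration, not PatMaN's code: it handles mismatches only (no gaps, no ambiguity codes), and instead of suffix links it simply starts a fresh partial match at the root for every target position. The names build_keyword_tree and scan and the "$" marker for query ids are assumptions made for the example.

```python
def build_keyword_tree(queries):
    """Build a trie: each node is a dict of child edges; '$' holds query ids."""
    root = {}
    for qid, seq in queries.items():
        node = root
        for base in seq:
            node = node.setdefault(base, {})
        node.setdefault("$", []).append(qid)
    return root

def scan(target, root, max_mismatches):
    """Scan the target base by base, advancing a list of partial matches.

    Each partial match is (node, mismatches so far, start position).
    Returns a sorted list of (query id, start position, mismatches).
    """
    hits = []
    partial = []
    for pos, base in enumerate(target):
        # Simplification: a new partial match starts at the root at every position.
        partial.append((root, 0, pos))
        advanced = []
        for node, mismatches, start in partial:
            for edge, child in node.items():
                if edge == "$":
                    continue  # query-id marker, not a base edge
                cost = mismatches + (edge != base)
                if cost > max_mismatches:
                    continue  # too many edits: drop this partial match
                for qid in child.get("$", ()):
                    hits.append((qid, start, cost))
                advanced.append((child, cost, start))
        partial = advanced
    return sorted(hits)

queries = {"q1": "CCC", "q2": "GA", "q3": "GT"}
tree = build_keyword_tree(queries)

exact = scan("ACCCGT", tree, max_mismatches=0)
one_off = scan("GC", tree, max_mismatches=1)
print(exact)    # [('q1', 1, 0), ('q3', 4, 0)]
print(one_off)  # [('q2', 0, 1), ('q3', 0, 1)]
```

Raising max_mismatches keeps more partial matches alive at every step, which is the growth in work that the Complexity slide refers to.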