Presentation – Homework 2
Advanced Topics: Current Bioinformatics
Instructor: Dr. Jianhua Ruan
GROUP MEMBERS:
JAMIUL JAHID
MOHAMMAD IFTEKHARUL ISLAM
TANZIR MUSABBIR
NGS Analysis Papers
• PatMaN: Rapid Alignment of Short Sequences to
  Large Databases
  Kay Prufer, Udo Stenzel, Michael Dannemann, Richard
  Green, Michael Lachmann
• ProbeMatch: Rapid Alignment of Oligonucleotides
  to Genome Allowing Both Gaps and Mismatches
  You Kim, Nikhil Teletia, Victor Ruotti, Maher, James Thomson
  and Jignesh Patel
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
• PatMaN – Pattern Matching in Nucleotide Databases
• A tool for performing exhaustive searches to identify
  all occurrences of a large number of short sequences
  within a genome-sized database.
• Reads sequences in FastA format and reports all hits
  within the given edit-distance cutoff.
• Advantages:
  - Allows a predefined number of gaps and mismatches
  - Ambiguity codes can be searched
  - Search time is short for perfect matches
ProbeMatch: Rapid Alignment of Oligonucleotides to
Genome Allowing Both Gaps and Mismatches
• Matches a large set of oligonucleotide
  sequences against a genome database using gapped
  alignments
• Advantages:
  - It generates both ungapped and gapped alignments
  - It allows up to three errors, including insertions, deletions and
    mismatches
  - It is able to detect multiple classes of mutations: SNVs and
    indels.
ProbeMatch: Background
• High-throughput DNA sequencing technologies: Illumina,
  454 Life Sciences
• A large set of short sequences is produced
• These must be mapped to a genome, allowing for only a few
  errors
• Traditional sequence alignment tools can do this, but
  are computationally impractical
ProbeMatch: Background
• ELAND (Efficient Local Alignment of Nucleotide Data)
  - Searches DNA databases for a large number of short sequences
  - Only ungapped alignments, allowing up to two mismatches
• MAQ (Mapping and Assembly with Quality)
  - Only ungapped alignments, allowing up to three mismatches
  - Measures the error probability of alignments using sequence quality
    information
• SOAP
• SeqMap
ProbeMatch: Background
• These programs are often faster than BLAST by an order
  of magnitude or more
• But they usually map only 60-80% of the query sequences to
  genomes
• Further processing is needed, using a computationally
  expensive but sensitive alignment method
• The overall gain is therefore limited
• ProbeMatch is designed to address this challenge
ProbeMatch: Rapid Alignment of Oligonucleotides to
Genome Allowing Both Gaps and Mismatches
• Allows a richer match model
• Finds gapped and ungapped alignments with up to
  three errors in any combination
• Able to detect multiple classes of mutations
ProbeMatch: Methodology
• Takes as input a query sequence set and a database of
  sequences.
• The database is divided into small segments
• ProbeMatch loads each segment and builds a q-gram
  index
• To find potential hits, ProbeMatch searches against the
  q-gram index and extends hits to find longer alignments.
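The index-then-lookup step above can be sketched as follows. This is a minimal illustration using a plain (ungapped) q-gram index, not ProbeMatch's actual implementation; the function names `build_qgram_index` and `candidate_starts`, the toy segment, and the choice `q=4` are all assumptions made for the example.

```python
from collections import defaultdict

def build_qgram_index(segment: str, q: int) -> dict:
    """Map every length-q substring of the segment to its start positions."""
    index = defaultdict(list)
    for pos in range(len(segment) - q + 1):
        index[segment[pos:pos + q]].append(pos)
    return index

def candidate_starts(query: str, index: dict, q: int) -> set:
    """Segment offsets where the query might start, found by exact q-gram lookups."""
    starts = set()
    for offset in range(len(query) - q + 1):
        for pos in index.get(query[offset:offset + q], ()):
            # A q-gram at query offset `offset` found at segment position `pos`
            # implies the query would start at pos - offset.
            starts.add(pos - offset)
    return starts

# Hypothetical toy data, for illustration only.
segment = "ACGTACGTTAGC"
index = build_qgram_index(segment, q=4)
hits = candidate_starts("CGTTAG", index, q=4)
print(hits)  # {5}
```

Each candidate start is only a potential hit; it still has to be extended and checked against the full query, as the slide describes.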
ProbeMatch: Methodology
• If two sequences Q and T match within k errors, and j
  non-overlapping fragments are taken from Q, then T contains at
  least one of the fragments with at most ⌊k/j⌋ errors
• The matched hits are then extended to check whether the entire
  query sequence and the target sequence can be aligned
  within k errors
• A gapped q-gram index (“Better Filtering with gapped
  q-grams”, Burkhardt and Kärkkäinen, 2002) provides more
  efficient filtering than an ungapped q-gram index
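A small sketch of the fragment lemma in use, under stated assumptions: taking j = k + 1 fragments gives ⌊k/j⌋ = 0, so at least one fragment must occur exactly in T. The code below uses exact fragment search as the filter and a standard dynamic-programming check as the extension step. It is not ProbeMatch's gapped q-gram filter, all function names are hypothetical, and the query is assumed to be at least k + 1 bases long.

```python
def split_fragments(query: str, j: int) -> list:
    """Cut the query into j non-overlapping fragments; return (offset, fragment) pairs."""
    base, extra = divmod(len(query), j)
    fragments, start = [], 0
    for i in range(j):
        size = base + (1 if i < extra else 0)
        fragments.append((start, query[start:start + size]))
        start += size
    return fragments

def min_edit_distance_in(text: str, query: str) -> int:
    """Smallest edit distance between the query and any substring of text."""
    prev = [0] * (len(text) + 1)      # empty query prefix: a match may start anywhere
    for row, qc in enumerate(query, 1):
        cur = [row]                   # query prefix against empty text
        for col, tc in enumerate(text, 1):
            cur.append(min(prev[col - 1] + (qc != tc),  # match or mismatch
                           prev[col] + 1,               # query base unmatched
                           cur[col - 1] + 1))           # text base unmatched
        prev = cur
    return min(prev)                  # a match may end anywhere

def find_within_k(query: str, target: str, k: int) -> bool:
    """Filter with k + 1 exact fragments, then verify each candidate region."""
    for offset, fragment in split_fragments(query, k + 1):
        pos = target.find(fragment)
        while pos != -1:
            lo = max(0, pos - offset - k)
            hi = min(len(target), pos - offset + len(query) + k)
            if min_edit_distance_in(target[lo:hi], query) <= k:
                return True
            pos = target.find(fragment, pos + 1)
    return False

# Hypothetical example: the target contains the query with one substituted base.
print(find_within_k("ACGTACGT", "GGGGACGTTCGTGGGG", 1))  # True
print(find_within_k("ACGTACGT", "GGGGACGTTCGTGGGG", 0))  # False
```

The filter only ever runs the expensive verification near an exact fragment hit, which is where the speed-up over aligning every query everywhere would come from.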
ProbeMatch: Result
• 169095 transcriptome short reads from a prostate cell
  line (RWPE), generated by the Illumina Genome Analyzer,
  were matched against the human genome using various
  alignment programs
Table: Comparison of execution times and sensitivity
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
• Algorithm
  a) A single keyword tree is constructed from all the query sequences.
  b) When the ambiguity flag is set, a match occurs if the base is one of
     the nucleotides represented by the ambiguity code.
  c) When the ambiguity flag is omitted, a base aligned to an ambiguity
     character is counted as a mismatch.
  d) All bases along a query sequence are added as a path from the
     root of the tree to a leaf, with each edge labeled by a base and the
     leaf labeled with the query sequence id.
  e) Suffix links are also added to the tree.
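The construction steps above resemble the standard Aho-Corasick keyword tree, which can be sketched as below. This is a textbook version offered as an illustration, not PatMaN's own data structure; the `Node` class and `build_keyword_tree` function are hypothetical names, and ambiguity codes are not handled.

```python
from collections import deque

class Node:
    def __init__(self):
        self.children = {}    # base -> child Node
        self.suffix = None    # suffix link
        self.query_ids = []   # ids of the queries that end at this node

def build_keyword_tree(queries):
    """Build a keyword tree over all queries, then add suffix links breadth-first."""
    root = Node()
    for qid, seq in enumerate(queries):
        node = root
        for base in seq:      # one edge per base, from the root towards a leaf
            node = node.children.setdefault(base, Node())
        node.query_ids.append(qid)

    root.suffix = root
    queue = deque()
    for child in root.children.values():
        child.suffix = root   # depth-1 nodes always link back to the root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for base, child in node.children.items():
            link = node.suffix
            # Follow suffix links until a node with a matching edge is found.
            while link is not root and base not in link.children:
                link = link.suffix
            child.suffix = link.children.get(base, root)
            queue.append(child)
    return root

root = build_keyword_tree(["CCC", "GA", "GT"])
c = root.children["C"]
cc = c.children["C"]
print(cc.suffix is c)                               # True
print(root.children["G"].children["A"].suffix is root)  # True
```

The suffix link of a node points to the node for the longest proper suffix of its path that is also a path in the tree, which lets a scan continue after a failed match without starting over.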
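PatMaN: Rapid Alignment of Short Sequences to
Large Databases
• Suppose the query sequences are ‘CCC’, ‘GA’ and ‘GT’.
• The basic keyword tree is:
[Figure: keyword tree. From the root, the path C, C, C ends in leaf CCC; a G edge branches into A (leaf GA) and T (leaf GT).]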
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
• After adding the suffix links
[Figure: the keyword tree for CCC, GA and GT with suffix links added.]
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
• Completing the tree
[Figure: the completed keyword tree for CCC, GA and GT, with added edges carrying labels such as "A, T, N" and "N".]
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
• Algorithm
• Once the tree is completed, each sequence in the target
  database is evaluated base by base and compared to a list of
  partial matches.
• Each partial match consists of
  - A node
  - The number of mismatches and gaps so far.
• The list is initialized with
  - The root of the tree
  - An edit count of zero.
• In each iteration of the algorithm, all partial matches are
  advanced along perfectly matching outgoing edges.
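The partial-match list can be illustrated with a simplified sketch. It makes several assumptions that differ from PatMaN itself: only mismatches are counted (no gaps), no suffix links are used (a fresh partial match is started at the root for every target base instead), and partial matches are also advanced along non-matching edges at the cost of one edit. The names `build_trie` and `search` are hypothetical.

```python
def build_trie(queries):
    """Plain keyword tree: each node is a dict with child edges and query ids."""
    root = {"children": {}, "ids": []}
    for qid, seq in enumerate(queries):
        node = root
        for base in seq:
            node = node["children"].setdefault(base, {"children": {}, "ids": []})
        node["ids"].append(qid)
    return root

def search(target, queries, max_mismatches):
    """Scan the target base by base, keeping a list of (node, edits so far)."""
    root = build_trie(queries)
    hits = []       # (query id, end position in target, mismatches)
    partial = []    # the list of partial matches
    for pos, base in enumerate(target):
        partial.append((root, 0))   # a new match may start at this base
        advanced = []
        for node, edits in partial:
            for edge_base, child in node["children"].items():
                cost = edits + (edge_base != base)   # a mismatching edge costs one edit
                if cost <= max_mismatches:
                    advanced.append((child, cost))
                    for qid in child["ids"]:
                        hits.append((qid, pos, cost))
        partial = advanced
    return hits

queries = ["CCC", "GA", "GT"]
print(search("TGACCG", queries, 0))  # [(1, 2, 0)]
```

With `max_mismatches=1` the same scan also reports `CCC` ending at position 5 with one mismatch, because the partial match on `CC` is allowed to advance along the non-matching `C` edge.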
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
• Complexity
• Without ambiguity codes, O(L) time and space are required, where
  L is the total length of all query sequences.
• When ambiguity codes are enabled, both the time and space
  requirements increase exponentially.
• The search time depends on the target database, but depends heavily
  on the maximum edit distance as well as the average length of
  the query sequences.
• For each additional edit operation, an exponentially increasing
  number of partial matches must be considered.
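To give a rough feel for this growth, the sketch below counts how many distinct sequences lie within a given number of mismatches of a single query. It is only a proxy under simplifying assumptions (mismatches only, a four-letter alphabet, a hypothetical 25-base query); it is not the number of partial matches PatMaN actually tracks.

```python
from math import comb

def hamming_neighborhood_size(length: int, max_mismatches: int, alphabet: int = 4) -> int:
    """Number of sequences within max_mismatches substitutions of one query."""
    return sum(comb(length, i) * (alphabet - 1) ** i
               for i in range(max_mismatches + 1))

for k in range(4):
    print(k, hamming_neighborhood_size(25, k))
# 0 1
# 1 76
# 2 2776
# 3 64876
```

Each extra allowed mismatch multiplies the count by a factor in the tens, which is consistent with the tool being practical mainly for small edit distances.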
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
• Result
• PatMaN's time constraints mean it is suitable for short
  sequences with a limited number of edit operations.
• HG-U95 was matched against the chimpanzee genome (panTro2),
  allowing no gaps and one mismatch.
• PatMaN took 2.5 h and found 15.9 million hits.
Q/A?