Presentation – Homework 2
Advanced Topics: Current Bioinformatics
Instructor: Dr. Jianhua Ruan
GROUP MEMBERS:
JAMIUL JAHID
MOHAMMAD IFTEKHARUL ISLAM
TANZIR MUSABBIR
NGS Analysis Papers
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
Kay Prufer, Udo Stenzel, Michael Dannemann, Richard
Green, Michael Lachmann
ProbeMatch: Rapid Alignment of Oligonucleotides
to Genome Allowing Both Gaps and Mismatches
You Kim, Nikhil Teletia, Victor Ruotti, Maher, James Thomson
and Jignesh Patel
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
PatMaN – Pattern Matching in Nucleotide Databases
A tool for performing exhaustive searches to identify
all occurrences of a large number of short sequences
within a genome-sized database.
Reads sequences in FastA format and reports all hits
within the given edit-distance cutoff.
Advantages:
Allows a predefined number of gaps and mismatches
Ambiguity codes can be searched
Search time is short for perfect matches
ProbeMatch: Rapid Alignment of Oligonucleotides to
Genome Allowing Both Gaps and Mismatches
For matching a large set of oligonucleotide
sequences against a genome database using gapped
alignments
Advantages:
It generates both ungapped and gapped alignments
It allows up to three errors including insertion, deletion and
mismatch
It is able to detect multiple classes of mutations: SNVs and
indels.
ProbeMatch: Background
High-throughput DNA sequencing technologies: Illumina,
454 Life Sciences
Large set of short sequences is produced
Must be mapped to a genome, allowing for only a few
errors
Traditional sequence alignment tools can do this, but are
computationally impractical
ProbeMatch: Background
ELAND (Efficient Local Alignment of Nucleotide Data)
Search DNA databases for a large number of short sequences
Only ungapped alignments allowing up to two mismatches
MAQ (Mapping and Assembly with Quality)
Only ungapped alignments allowing up to three mismatches
Measures error probability of alignments using sequence quality
information
SOAP
SeqMap
ProbeMatch: Background
These programs are often faster than BLAST by an order
of magnitude or more
But usually map only 60-80% of the query sequences to
genomes
Further processing is needed using computationally
expensive but sensitive alignment methods
Overall gain is limited
ProbeMatch effectively approaches this challenge
ProbeMatch: Rapid Alignment of Oligonucleotides to
Genome Allowing Both Gaps and Mismatches
Allows a richer match model
Finds gapped and ungapped alignments with up to
three errors in any combination
Able to detect multiple classes of mutations
ProbeMatch: Methodology
Takes as input a query sequence set and a database of
sequences.
Database is divided into small segments
ProbeMatch loads each segment and builds a q-gram
index
To find potential hits, ProbeMatch searches against the
q-gram index and extends hits to find longer alignments.
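The index-then-lookup step above can be sketched as follows. This is a minimal illustration with a plain (ungapped) q-gram index, not ProbeMatch's actual implementation; the function names `build_qgram_index` and `candidate_positions`, the toy segment, and the choice q = 4 are all assumptions made for the example.

```python
from collections import defaultdict


def build_qgram_index(segment, q):
    """Map every length-q substring of `segment` to its start positions."""
    index = defaultdict(list)
    for i in range(len(segment) - q + 1):
        index[segment[i:i + q]].append(i)
    return index


def candidate_positions(index, query, q):
    """Return candidate start positions of `query` in the indexed segment.

    Each q-gram of the query that is found in the index proposes the
    segment position where the whole query would start if that seed
    were part of a real alignment.
    """
    candidates = set()
    for offset in range(len(query) - q + 1):
        for pos in index.get(query[offset:offset + q], ()):
            start = pos - offset
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


# Hypothetical toy data; a real segment would be a slice of the genome.
segment = "ACGTACGTTAGC"
index = build_qgram_index(segment, q=4)
print(candidate_positions(index, "CGTT", q=4))
```

Each candidate position is only a potential hit; it still has to be extended and checked against the error limit.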
ProbeMatch: Methodology
If two sequences, Q and T, match within k errors and Q is
split into j non-overlapping fragments, then T contains at
least one of the fragments with at most ⌊k/j⌋ errors.
Each matched hit is then extended to check whether the entire
query sequence and the target sequence can be aligned
within k errors.
A gapped q-gram index (“Better Filtering with Gapped q-Grams”,
Burkhardt and Kärkkäinen, 2002) provides more efficient
filtering than an ungapped q-gram index.
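The fragment lemma above suggests a filter-then-verify scheme, sketched below under simplifying assumptions: j is set to k + 1 so that ⌊k/j⌋ = 0 and at least one fragment must occur exactly, fragments are located with a plain string search instead of a gapped q-gram index, and the query is assumed to be at least k + 1 bases long. All function names are hypothetical.

```python
def split_fragments(query, j):
    """Split `query` into j non-overlapping fragments of near-equal length."""
    base, extra = divmod(len(query), j)
    fragments, start = [], 0
    for i in range(j):
        end = start + base + (1 if i < extra else 0)
        fragments.append((start, query[start:end]))
        start = end
    return fragments


def best_edit_distance(query, window):
    """Smallest edit distance between `query` and any substring of `window`."""
    prev = [0] * (len(window) + 1)  # an alignment may start anywhere
    for i, qc in enumerate(query, 1):
        curr = [i]
        for col, wc in enumerate(window, 1):
            curr.append(min(prev[col - 1] + (qc != wc),  # match or mismatch
                            prev[col] + 1,               # query base unaligned
                            curr[col - 1] + 1))          # window base unaligned
        prev = curr
    return min(prev)                # an alignment may end anywhere


def matches_within(query, target, k):
    """True if `query` aligns somewhere in `target` with at most k errors."""
    for offset, fragment in split_fragments(query, k + 1):
        pos = target.find(fragment)
        while pos != -1:
            # Verify only a window around the exact fragment hit.
            lo = max(0, pos - offset - k)
            hi = min(len(target), pos - offset + len(query) + k)
            if best_edit_distance(query, target[lo:hi]) <= k:
                return True
            pos = target.find(fragment, pos + 1)
    return False
```

The expensive dynamic-programming check runs only on windows that already contain an exact fragment hit, which is the point of the filter.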
ProbeMatch: Result
169,095 transcriptome short reads from a prostate cell
line (RWPE), generated by the Illumina Genome Analyzer,
were matched against the human genome using various
alignment programs
Table: Comparison of execution times and sensitivity
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
Algorithm
a) A single keyword tree is constructed from all the query
sequences.
b) When the ambiguity flag is set, a match occurs if the base is
one of the nucleotides in the ambiguity code.
c) When the ambiguity flag is omitted, a base aligned to an
ambiguity character is counted as a mismatch.
d) All bases along a query sequence are added as a path from the
root of the tree to a leaf; each edge is labeled with a base and
the leaf holds the query sequence id.
e) Suffix links are also added to the tree.
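The ambiguity rule described above can be illustrated with a small sketch. It uses the standard IUPAC nucleotide codes and assumes, for simplicity, that ambiguity characters appear only in the query; `base_matches` and `count_mismatches` are hypothetical helper names, not PatMaN functions.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def base_matches(query_char, target_base, ambiguity=False):
    """Compare one query character with one target base.

    With ambiguity=True an ambiguity code matches any base it stands for;
    with ambiguity=False it never matches, so it counts as a mismatch.
    """
    if query_char in "ACGT":
        return query_char == target_base
    if not ambiguity:
        return False
    return target_base in IUPAC[query_char]


def count_mismatches(query, target, ambiguity=False):
    """Count mismatching positions between two equal-length sequences."""
    return sum(not base_matches(q, t, ambiguity)
               for q, t in zip(query, target))
```

For example, the query ANG against the target ACG has no mismatches when ambiguity is enabled and one mismatch when it is not.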
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
Suppose the query sequences are ‘CCC’, ‘GA’, and ‘GT’.
The basic keyword tree is:
[Figure: keyword tree with the paths root-C-C-C (leaf CCC),
root-G-A (leaf GA), and root-G-T (leaf GT)]
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
After adding the suffix links
[Figure: keyword tree for CCC, GA, and GT with suffix links added]
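A keyword tree with suffix links for the example queries can be sketched as below. This follows the textbook Aho-Corasick construction, which is assumed here to correspond to what the slides describe; the list-based node layout and the name `build_keyword_tree` are choices made for the example only.

```python
from collections import deque


def build_keyword_tree(queries):
    """Build a keyword tree (trie) with suffix links.

    Nodes are integers and node 0 is the root.
    children[n] maps a base to a child node, suffix[n] is the node for the
    longest proper suffix of n's path that is also a path from the root,
    and leaf_id[n] holds the query that ends at n (or None).
    """
    children, suffix, leaf_id = [{}], [0], [None]
    for query in queries:
        node = 0
        for base in query:
            if base not in children[node]:
                children.append({})
                suffix.append(0)
                leaf_id.append(None)
                children[node][base] = len(children) - 1
            node = children[node][base]
        leaf_id[node] = query

    # Breadth-first pass: nodes at depth 1 keep their link to the root.
    queue = deque(children[0].values())
    while queue:
        node = queue.popleft()
        for base, child in children[node].items():
            fallback = suffix[node]
            while fallback and base not in children[fallback]:
                fallback = suffix[fallback]
            suffix[child] = children[fallback].get(base, 0)
            queue.append(child)
    return children, suffix, leaf_id


children, suffix, leaf_id = build_keyword_tree(["CCC", "GA", "GT"])
```

In this sketch the node for CC links back to the node for C, and the node for CCC links back to CC, so a failed extension can resume from the longest suffix already matched instead of restarting at the root.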
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
Completing the tree
[Figure: completed keyword tree for CCC, GA, and GT, with
additional edges for the remaining bases, including edges
labeled "A, T, N" and "N"]
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
Algorithm
Once the tree is completed, each sequence in the target
database is evaluated base by base and compared to a list of
partial matches.
Each partial match consists of:
A node
The number of mismatches and gaps so far.
The list is initialized with:
The root of the tree
An edit count of zero.
In each iteration of the algorithm, all partial matches are
advanced along perfectly matching outgoing edges.
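The partial-match list can be sketched in a simplified form. The version below is an assumption-laden illustration, not PatMaN's code: it handles mismatches only (no gaps), it also advances along mismatching edges at the cost of one edit, and it restarts a fresh partial match at every target base instead of using suffix links. The names `build_trie` and `search` are hypothetical.

```python
def build_trie(queries):
    """Keyword tree as nested dicts; the key "$" marks a leaf with its query."""
    root = {}
    for query in queries:
        node = root
        for base in query:
            node = node.setdefault(base, {})
        node["$"] = query
    return root


def search(queries, target, max_mismatches):
    """Return (end_position, query, mismatches) for every hit in `target`."""
    root = build_trie(queries)
    partial = []                    # list of (node, edits so far)
    hits = []
    for pos, base in enumerate(target):
        partial.append((root, 0))   # a new match may start at every base
        advanced = []
        for node, edits in partial:
            for edge, child in node.items():
                if edge == "$":
                    continue
                cost = edits + (edge != base)   # mismatching edge costs 1
                if cost > max_mismatches:
                    continue
                if "$" in child:
                    hits.append((pos, child["$"], cost))
                if any(key != "$" for key in child):
                    advanced.append((child, cost))
        partial = advanced
    return hits


print(search(["CCC", "GA", "GT"], "GACCCT", max_mismatches=0))
```

Partial matches whose edit count would exceed the cutoff are dropped, so the list stays short when the cutoff is small.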
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
Complexity
Without ambiguity codes, the keyword tree requires O(L) time
and space, where L is the total length of all query sequences.
When ambiguity codes are enabled, both the time and space
requirements increase exponentially.
The search time depends on the size of the target database, and
also depends heavily on the maximum edit distance and the
average length of the query sequences.
For each additional edit operation, an exponentially increasing
number of partial matches must be considered.
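As a rough illustration of this growth, the sketch below counts how many distinct DNA sequences lie within a given number of substitutions of a single query. This is only a proxy for the number of partial matches (it ignores gaps and the contents of the target database), and the query length of 25 is an arbitrary assumption.

```python
from math import comb


def mismatch_neighbourhood(length, max_mismatches):
    """Count DNA sequences within `max_mismatches` substitutions of a query.

    Choose i positions to change (comb(length, i)) and one of the 3 other
    bases at each changed position (3 ** i), for i = 0..max_mismatches.
    """
    return sum(comb(length, i) * 3 ** i for i in range(max_mismatches + 1))


for d in range(4):
    print(d, mismatch_neighbourhood(25, d))
```

Each additional allowed mismatch multiplies the count many times over, which is consistent with the restriction to a small number of edit operations.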
PatMaN: Rapid Alignment of Short Sequences to
Large Databases
Result
PatMaN's time constraints mean it is best suited to short
sequences with a limited number of edit operations.
HG-U95 was matched against the chimpanzee genome (panTro2),
allowing one mismatch and no gaps.
PatMaN took 2.5 h and found 15.9 million hits.
Q/A?