Transcript Slides

Assignment 2:
Papers read for this assignment
• Paper 1: PALMA: mRNA to Genome
Alignments using Large Margin
Algorithms
• Paper 2: Optimal spliced alignments of
short sequence reads
• Badil Elhady, Michael Chan
Paper 1: PALMA: mRNA to
Genome Alignments using
Large Margin Algorithms
Motivation
• Question for the study?
– The correct alignment of mRNA sequences to
genomic DNA is still a challenging task. ( Due to
the presence of sequencing errors, micro-exons,
alternative splicing)
Method
• Splice Site Prediction
– SVM with large margin, decided under convex optimization
• Intron Length Model
• Dynamic Programming is used to maximize the scoring
function, leading to Optimal Alignment. (Smith-Waterman
Alignments with Intron Model)
• This leads to:
– Tuning the parameters of scoring function leads to.
• A larger score
• Other alignment would score lower
Slice site prediction
– Accurately differentiates the exon-intron boundaries
– Compartmentalize the local alignment of EST.
– Claim:
• Robust to mutations, insertions and deletions, as well as noise
levels in accurately identifying intron boundaries as well as boundaries of the
optimal local alignment.
Splice Site Predictions
• From a set of ETS, sequences were
extracted of confirmed donor and acceptor
slice sites.
• To recognize acceptor and donor slice
sites, 2 SVM classifiers were trained.
Using “ weighted degree ” kernel.
• kernel computes the similarity between
sequences s and s’.
Intron Length Model
• The main idea of the algorithm is to compute a local alignment
by determining the maximum over all alignments of all prefixes
– SE (1 : i) :=(SE(1), . . . , SE(i))
– SD(1 : j) := (SD(1), . . . , SD(j))
» SE EST Sequences
» SD  DNA Sequences
– Running time is O(m*n*L)
» m length of SE
» n length of SD
» Smith-Waterman does not distinguish between exons
and introns.
Scoring Function
• In generalizing the SmithWaterman algorithm by
including an intron model
taking splice site predictions
as well as intron length into
account.
• The information is then used
to optimize the parameters
used for alignment.
Smith-Waterman Alignments
with Intron Model
Splice site prediction
assisting params
Experimental setup
• Evaluating PALMA vs. exalin, sim4, and blat.
• Alignment of mRNA seq. artificially shorting the middle exon
(3-50)nt as shown.
Experimental setup cont.
• Artificially generating the data : as a control to know
exactly what the correct alignment has to be.
• Add varying amounts of noise (p ¼ 0 ,1 ,5 and 10% of
random mutations, deletions or insertions) to the query
sequence.
• Replace a part of the DNA or mRNA sequence at its
terminal ends with random sequence leading to a
shortened correct alignment.
PALMA
Add noise
vs.
exalin, sim4, and blat.
PALMA
Varying
lengths
vs.
exalin, sim4, and blat.
Conclusion
– Motivation: high sensitivity detection of short
exons in the midst of noise.
– Principles
• Splice Site Prediction
• Intron Length Model
• maximize scoring function, for Optimal Alignment.
– Results: PALMA detects short exons while
exalin, blat, etc, are unsuccessful
Further Topics
• vmatch
• svm
• convex optimization
Paper 2: Optimal spliced alignments
of short sequence reads
Situation
• NGS has short length and inherent high error rate even
compared to Sanger. It is fast but the accuracy?
• Many methods are efficient and accurate if the sequence
blocks (exons) are sufficiently long and are highly similar
to the genomic sequence.
• Reads from NG sequencing techniques do not have either
of 2 properties.
Motivation
• Objective to be able to accurately align the sequence
reads over intron boundaries.
• QPALMA takes the read’s quality information as well
as computational splice site predictions to compute
accurate spliced alignments.
Principles
• Learn, in a supervised manner, how to score quality
information, splice site predictions and sequence identity
based on a representative set of sequence reads with
known alignments.
• Extended Smith-Waterman algorithm:
• Extension 1: Quality Scores
• Extension 2: Splice Sites
• Extension 3: Non-affine Intron Length Model
1) Splice site prediction
• Need to know acceptor and donor splice
sites as well as suitable decoy sequences.
• Extension 1: Quality Scores
• Extension 2: Splice Sites
• Extension 3: Non-affine Intron Length Model
Extension 1: Quality Scores
• same computational complexity as the original
Smith–Waterman algorithm (O(mn)). However, it
uses a more complex scoring that may depend on
the sequencing technology used.
Constant
Extension 2: Splice Sites
• (O(mnL)) operations, where L is the maximal length of the intron.
• The idea is to maintain an additional recurrence matrix W used to keep track
of the intron boundaries.
Extension 3: Non-affine Intron Length Model
• Recurrence can be implemented as follows
• For long introns this approach seem
computationally infeasible.
An alignment pipeline against whole genomes
• !!! optimal alignments is time consuming
• => use vmatch(multi-step approach on
enhanced suffix arrays) + high quality splice
site detection
[Optional]
An alignment pipeline against whole genomes
• vmatch (1st round) finds global alignments of all short reads
(max 2 mismatches) against the genome to identify large
fraction of unspliced reads.
– If there are reads that cannot be aligned (leftover reads) – spliced or low
quality reads
• Yet, there is possibility that the boundary of the reads are the
spliced sites
– Check with QPALMA scoring function as a filter to quickly decide
whether the read is spliced or not.
– all combinations of putative donor splice sites within the read
and acceptor splice sites ≤2000 nt downstream of the read, and
– all combinations of putative acceptor splice sites within the read
and donor splice sites ≤2000 nt upstream of the read.
[Optional]
An alignment pipeline against whole genomes
• leftover reads + spliced (predicted to be by QPALMA) used as
seeds for vmatch (2nd round) and localize the splice sites with a
‘window’
Results
Conclusion
– Motivation: NGS is inaccurate.
– Principles
• 3 extentions to PALMA
• Vmatch pipelining, for boundary precision.
– Results: lower error
• QPALMA + vmatch pipelining
= PALMA + 3extentions – {SVM, large marigin}
References
• A Tutorial on Support Vector Machines for
Pattern Recognition
• NCBI National Center for
Biotechnology Information
http://www.ncbi.nlm.nih.gov/About/primer/genetics_
genome.html
• BioInfoBank Library http://lib.bioinfo.pl/
• High Throughput Short Read Alignment via Bidirectional BWT
Badil_el
ha dy
Mich_a___el__chan