Multiple Genome Alignment by Clustering Pairwise Matches
Shravya Konda
Motivation
- The number of completely sequenced genomes is growing rapidly
- However, annotation of these genomes still relies heavily on human expert knowledge
- Comparing multiple genomes is one of the most effective ways to understand genomic content
Why don’t traditional approaches work?
In general, multiple sequence alignment uses a heuristic approach:
- iterative alignment
-- requires generation of profiles
- greedy progressive alignment
-- this works because of the guide tree
Genome sequences are very long:
- alignment of multiple sequences while combining pairwise matches
- hard to use a guide tree
-- many local regions have a phylogenetic relationship very different from that of the entire genome
BAG Sequence Clustering Algorithm
For a given set of sequences (s1, s2, …, sn):
- Use FASTA to compute the pairwise alignment results
- Build a weighted graph
-- each si is represented as a node
-- create an edge between si and sj when their pairwise alignment score is greater than a preset threshold
- Select a cut-off score Tc
-- generate the bi-connected components G1, G2, G3, …, Gn
-- find the set of articulation points {a1, a2, a3, …, am}
-- perform RANGE-TEST on each bi-connected component
-- generate the different clusters
- After iterative splitting is done,
-- each articulation point is tested
-- perform AP-TEST and build the hypergraph
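
The graph-building and decomposition steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BAG implementation: the function names, the score dictionary, and the threshold value are hypothetical, and RANGE-TEST and AP-TEST are not defined on the slide, so they are left out. The sketch keeps an edge when a pairwise score exceeds the threshold, then finds bi-connected components and articulation points with a standard depth-first search.

```python
def build_graph(nodes, scores, threshold):
    """Keep an edge (a, b) only when its pairwise score exceeds the threshold."""
    graph = {n: set() for n in nodes}
    for (a, b), score in scores.items():
        if score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def biconnected_components(graph):
    """Return (components, articulation_points) found by depth-first search.

    Each component is a set of nodes; isolated nodes produce no component.
    """
    depth, low = {}, {}
    edge_stack, components, cut_points = [], [], set()

    def dfs(u, parent, d):
        depth[u] = low[u] = d
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v not in depth:
                # Tree edge: explore v, then check whether u cuts it off.
                children += 1
                edge_stack.append((u, v))
                dfs(v, u, d + 1)
                low[u] = min(low[u], low[v])
                if low[v] >= depth[u]:
                    # A root is a cut point only with two or more DFS children.
                    if parent is not None or children > 1:
                        cut_points.add(u)
                    component = set()
                    while True:
                        a, b = edge_stack.pop()
                        component.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(component)
            elif depth[v] < depth[u]:
                # Back edge to an ancestor.
                edge_stack.append((u, v))
                low[u] = min(low[u], depth[v])

    for node in graph:
        if node not in depth:
            dfs(node, None, 0)
    return components, cut_points


# Hypothetical pairwise scores for five sequences (made-up numbers).
nodes = ["s1", "s2", "s3", "s4", "s5"]
scores = {
    ("s1", "s2"): 120, ("s2", "s3"): 110, ("s1", "s3"): 105,
    ("s3", "s4"): 90, ("s4", "s5"): 95, ("s3", "s5"): 100,
    ("s1", "s4"): 30,  # below the threshold, so no edge is created
}
graph = build_graph(nodes, scores, threshold=50)
components, cut_points = biconnected_components(graph)
```

In this toy input the two triangles share only s3, so s3 is the single articulation point and each triangle forms one bi-connected component. The search is recursive, so a very large graph would need an iterative version or a raised recursion limit.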
Transforming genomic data - input to BAG
BAG cannot directly handle all pairwise alignments of genomes
- there are numerous edges (pairwise matches) between a pair of nodes (genomes)
Solution:
- generate subsequences with their own identifiers
- convert local alignments to subsequence identifiers, where each subsequence begins at one of the evenly spaced break positions
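
One way to picture this transformation is the Python sketch below. It is a guess at the idea rather than the authors' procedure: the identifier format, the break spacing, the tuple layout of an alignment, and the rule of keeping the best score per pair of subsequences are all assumptions made for illustration. Each alignment start is snapped to the last break position at or before it, and that break position names the subsequence.

```python
def fragment_id(genome, position, spacing):
    """Identifier of the subsequence starting at the last break at or before position."""
    break_pos = (position // spacing) * spacing
    return f"{genome}:{break_pos}"


def to_fragment_edges(alignments, spacing):
    """Map each local alignment to an edge between two subsequence identifiers."""
    edges = {}
    for genome_a, start_a, genome_b, start_b, score in alignments:
        key = (
            fragment_id(genome_a, start_a, spacing),
            fragment_id(genome_b, start_b, spacing),
        )
        # Assumption: when several alignments map to the same pair, keep the best score.
        edges[key] = max(score, edges.get(key, score))
    return edges


# Hypothetical local alignments: (genome, start, genome, start, score).
alignments = [
    ("gA", 1250, "gB", 300, 88),
    ("gA", 1900, "gB", 450, 75),
    ("gA", 5100, "gB", 7020, 93),
]
edges = to_fragment_edges(alignments, spacing=1000)
```

With a spacing of 1000, the first two alignments fall into the same pair of subsequences (gA:1000 and gB:0) and collapse into one edge, which is how many matches between two genomes become single edges between subsequence nodes.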
Example