Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation.
Whole Genome Alignment using
Multithreaded Parallel Implementation
Hyma S Murthy
CMSC 838 Presentation
Talk Overview
Organization of the paper
Motivation
Technique:
Pairwise Sequence Comparison using Dynamic Programming
EARTH Execution Model
Evaluation
Result Graphs
Conclusions
Related Work (MUMmer)
CMSC 838T – Presentation
Motivation
Importance of Genome Alignment :
Identify important matched and mismatched regions
“matches” represent homolog pairs, conserved regions or long repeats
“mismatches” represent foreign fragments inserted by transposition,
sequence reversal or lateral transfer
Detect functional differences between pathogenic/non-pathogenic strains,
evolutionary distance, mutations leading to disease, phenotypes, etc.
Problems
Requires large computational power, memory and execution time
Existing algorithms apply dynamic programming only to subsequences
Computationally intensive to apply to whole sequences (O(n^2))
Thus applicable only to closely related genomes
Solution
Multithreaded parallel implementation of sequence alignment
algorithm to align whole genomes
Parallel implementation of dynamic programming technique
Uses collective memory of several nodes
Uses multithreading to overlap computation and communication
Applicable to closely related as well as less similar genomes
Reliable output in reasonable time
Pairwise Sequence Comparison using Dynamic Programming
Basic Idea:
Quantify the similarity between pairs of symbols of target sequences
Associate score for each possible arrangement
Similarity is given by the highest score
Example :
sequence x   A  T  A  A  G  T
sequence y   A  T  G  C  A  G  T
SCORE        1  1 -1 -1 -1 -1 -1
TOTAL = -3
sequence x   A  T  A  -  A  G  T
sequence y   A  T  G  C  A  G  T
SCORE        1  1 -1 -2  1  1  1
TOTAL = 2
Model mutation by “gaps” (gaps indicate evolution of one sequence into
another)
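The scoring of a gapped arrangement can be sketched in a few lines of Python. This is a minimal illustration, assuming the values implied by the example (match +1, mismatch -1, gap -2); the function name score_alignment and its parameters are hypothetical. It expects both rows padded to the same length, so it applies to the second arrangement, not to the unpadded first one.

```python
def score_alignment(x, y, match=1, mismatch=-1, gap=-2):
    """Sum column scores for two equal-length rows; '-' marks a gap."""
    if len(x) != len(y):
        raise ValueError("rows must be padded to the same length")
    total = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            total += gap        # one symbol aligned against a gap
        elif a == b:
            total += match      # identical symbols
        else:
            total += mismatch   # substitution
    return total


# Second arrangement from the example: the gap is placed in sequence x.
gapped_total = score_alignment("ATA-AGT", "ATGCAGT")
```

With these assumed scores, gapped_total equals the total of 2 shown for the gapped arrangement.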
Dynamic Programming
Smith and Waterman approach:
Aligns subsequences of given sequences
Involves: (a) calculation of scores indicating similarity
(b) identification of alignment(s) corresponding to the score
Build solution using previous solutions for smaller subsequences
Construct a two-dimensional array – “Similarity Matrix” to store scores
corresponding to partial results
Matrix represents all possible alignments of the input sequences
Recurrence equation:
SM[i, j] = max { SM[i, j-1] + gp,
                 SM[i-1, j-1] + ss,
                 SM[i-1, j] + gp,
                 0 }
Contd….
Each element of the matrix is the maximum of the following four values:
the left element + gap penalty (gp), the upper-left element + score (ss) of
replacing the vertical symbol with the horizontal symbol, the upper element +
gap penalty (gp), and 0.
Consider the following example:
       T  G  A  T  G  G  A  G  G  T
    0  0  0  0  0  0  0  0  0  0  0
G   0  0  1  0
A   0  0  0  2
T   0
A   0
G   0
G   0
2 = max{0 + (-2), 1 + (1), 0 + (-2), 0}
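The matrix fill can be sketched in Python as follows. This is an illustration of the recurrence, not the authors' code; it assumes the scores implied by the worked cell (match +1, mismatch -1, gap -2), and the name similarity_matrix is hypothetical. The vertical sequence indexes rows and the horizontal sequence indexes columns.

```python
def similarity_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the similarity matrix; x labels the rows, y labels the columns."""
    # Row 0 and column 0 stay at zero, as in the example.
    sm = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(
                sm[i][j - 1] + gap,     # left element + gap
                sm[i - 1][j - 1] + ss,  # upper-left element + substitution score
                sm[i - 1][j] + gap,     # upper element + gap
                0,
            )
    return sm


sm = similarity_matrix("GATAGG", "TGATGGAGGT")
worked_cell = sm[2][3]  # row "A", column "A": the cell computed by hand above
```

Here worked_cell reproduces the hand-computed value of 2.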
Identifying alignments
Alignments with score above a given threshold are reported
Start at end of the alignment and move backwards to the beginning
       T  G  A  T  G  G  A  G  G  T
    0  0  0  0  0  0  0  0  0  0  0
G   0  0  1  0  0  1  1  0  1  1  0
A   0  0  0  2  0  0  0  2  0  0  0
T   0  1  0  0  3  1  0  0  1  0  1
A   0  0  0  1  1  2  0  1  0  0  0
G   0  0  1  0  0  2  3  1  2  1  0
G   0  0  1  0  0  1  3  2  2  3  1
TGAT-GGAGGT
GATAGG
TGATGGAGGT
GATAGG
TGATGGAGGT
GATAGG
TGATGGAGGT
GATAGG
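The backward walk from the end of each alignment can be sketched as below. This is a minimal illustration under assumed scores (match +1, mismatch -1, gap -2) and an assumed threshold of 3, the highest score in the example matrix; the function names are hypothetical, and the tie-breaking order (diagonal first, then upward, then leftward) is a choice made here, not something the slides specify.

```python
def similarity_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the similarity matrix; x labels the rows, y labels the columns."""
    sm = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                           sm[i - 1][j] + gap, 0)
    return sm


def traceback(sm, x, y, i, j, match=1, mismatch=-1, gap=-2):
    """Walk back from cell (i, j) until a zero score is reached."""
    top, bottom = [], []
    while i > 0 and j > 0 and sm[i][j] > 0:
        ss = match if x[i - 1] == y[j - 1] else mismatch
        if sm[i][j] == sm[i - 1][j - 1] + ss:    # came from the upper-left
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif sm[i][j] == sm[i - 1][j] + gap:     # came from above: gap in y
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:                                    # came from the left: gap in x
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def report_alignments(x, y, threshold):
    """Trace back from every cell whose score reaches the threshold."""
    sm = similarity_matrix(x, y)
    return [traceback(sm, x, y, i, j)
            for i in range(1, len(x) + 1)
            for j in range(1, len(y) + 1)
            if sm[i][j] >= threshold]


alignments = report_alignments("GATAGG", "TGATGGAGGT", threshold=3)
```

On the example sequences this yields four alignments, one of which places a gap in the horizontal sequence (GATAGG against GAT-GG).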
EARTH Execution Model
Program is viewed as a collection of threads
execution order determined by data and control dependencies
Threads further divided into fibers
fibers are non-preemptive and
all data is ready before their execution
Each node in EARTH has
an execution unit
synchronization unit
queues linking the two: ready queue (RQ) and event queue (EQ)
local memory
interface to interconnection network
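The interplay of fibers, the synchronization unit and the two queues can be illustrated with a toy single-node simulation. This is only a sketch of the idea that a fiber runs, without preemption, once all of its inputs have been signalled; the classes Fiber and Node, the pending counter and the fiber names are all hypothetical and do not reflect the actual EARTH runtime interface.

```python
from collections import deque


class Fiber:
    def __init__(self, name, sync_count, body):
        self.name = name
        self.pending = sync_count  # signals still required before it may run
        self.body = body           # callable executed to completion


class Node:
    def __init__(self):
        self.event_queue = deque()  # EQ: signals waiting for the sync unit
        self.ready_queue = deque()  # RQ: fibers whose data is all ready
        self.log = []

    def signal(self, fiber):
        """Post a sync signal addressed to a fiber."""
        self.event_queue.append(fiber)

    def run(self):
        while self.event_queue or self.ready_queue:
            # Synchronization unit: consume events, enable satisfied fibers.
            while self.event_queue:
                fiber = self.event_queue.popleft()
                fiber.pending -= 1
                if fiber.pending == 0:
                    self.ready_queue.append(fiber)
            # Execution unit: run one ready fiber without preemption.
            if self.ready_queue:
                fiber = self.ready_queue.popleft()
                self.log.append(fiber.name)
                fiber.body(self)


node = Node()
add = Fiber("add", 2, lambda n: None)             # waits for two inputs
load_a = Fiber("load_a", 1, lambda n: n.signal(add))
load_b = Fiber("load_b", 1, lambda n: n.signal(add))
node.signal(load_a)
node.signal(load_b)
node.run()
```

In this toy run the "add" fiber executes only after both loads have signalled it, which is the data-driven ordering described above.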
EARTH Architecture
[Figure: EARTH architecture. Labels: node, PE, memory bus, EU, SU, RQ, EQ, local memory, interconnection network]
Multithreaded parallel implementation
Divide the scoring matrix as follows:
into horizontal strips (rows correspond to elements of input sequence X)
each strip into rectangular blocks
Blocks are calculated by two fibers within a thread
Each thread is assigned to one horizontal strip
only one fiber is active at any given time
the computation is done by even/odd fibers within the thread
Initialization delay of reading sequences from the server is minimized
Each thread needs only the piece of the input sequence it grabs, not the whole of
sequence X
After computing a block, a fiber sends the fiber beneath it a piece of sequence Y,
among other information
The computation of the anti-diagonal elements of the matrix is as shown
Computation of similarity matrix on EARTH
[Figure: labels include Thread A (P1), Thread B (P2), E and O fibers, active and inactive fibers, Data, Sync and Ack signals, and processors P1 to P4]
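The strip and block decomposition can be sketched as a serial simulation of the wavefront order. This is not the EARTH implementation: it uses no threads or fibers, the strip height and block width are arbitrary illustrative values, and the scores (match +1, mismatch -1, gap -2) are assumed. It only shows that blocks on the same anti-diagonal depend solely on earlier waves, so they could be computed concurrently.

```python
def cell(sm, x, y, i, j, match=1, mismatch=-1, gap=-2):
    ss = match if x[i - 1] == y[j - 1] else mismatch
    return max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss, sm[i - 1][j] + gap, 0)


def fill_sequential(x, y):
    sm = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            sm[i][j] = cell(sm, x, y, i, j)
    return sm


def fill_blocked(x, y, strip_height=2, block_width=3):
    """Fill the matrix block by block, one anti-diagonal of blocks per wave."""
    sm = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    row_starts = list(range(1, len(x) + 1, strip_height))  # one strip each
    col_starts = list(range(1, len(y) + 1, block_width))   # blocks in a strip
    schedule = []
    for step in range(len(row_starts) + len(col_starts) - 1):
        wave = []
        for s, r0 in enumerate(row_starts):
            b = step - s  # block index of strip s that belongs to this wave
            if not 0 <= b < len(col_starts):
                continue
            c0 = col_starts[b]
            for i in range(r0, min(r0 + strip_height, len(x) + 1)):
                for j in range(c0, min(c0 + block_width, len(y) + 1)):
                    sm[i][j] = cell(sm, x, y, i, j)
            wave.append((s, b))  # (strip, block) pairs independent in a wave
        schedule.append(wave)
    return sm, schedule


blocked, schedule = fill_blocked("GATAGG", "TGATGGAGGT")
```

The blocked fill produces the same matrix as the row-by-row fill, and the schedule lists which (strip, block) pairs fall into each wave.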
Evaluation
Experimental environment
Beowulf implementation of EARTH
Uses Beowulf machine consisting of 64 nodes, each containing two
200MHz Pentium Pro processors (a total of 128 processors and 128MB of
memory)
Sequences of lengths ranging from 30K to 900K were tested
Execution times for the sequential and parallel implementations of the Smith and
Waterman algorithm are given below:
Implementation         Time
Seq. Smith-Waterman    53 hours
ATGC on 16 nodes       3.3 hours
ATGC on 32 nodes       2.1 hours
ATGC on 64 nodes       1.3 hours
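The speedups implied by these timings can be worked out directly. The sketch below uses only the four figures in the table; the per-node efficiency column is an added illustration, and since each node holds two processors and the slides do not say what hardware the sequential run used, it should be read as a rough indicator only. The names are hypothetical.

```python
SEQUENTIAL_HOURS = 53.0
PARALLEL_HOURS = {16: 3.3, 32: 2.1, 64: 1.3}  # nodes -> hours, from the table


def speedup_table(seq_hours, parallel_hours):
    """Return (nodes, speedup, speedup per node) rows, rounded for display."""
    rows = []
    for nodes, hours in sorted(parallel_hours.items()):
        speedup = seq_hours / hours
        rows.append((nodes, round(speedup, 1), round(speedup / nodes, 2)))
    return rows


rows = speedup_table(SEQUENTIAL_HOURS, PARALLEL_HOURS)
for nodes, speedup, per_node in rows:
    print(f"{nodes:>3} nodes: speedup {speedup:>5.1f}, per node {per_node:.2f}")
```

Under these figures the speedup is roughly 16x on 16 nodes, 25x on 32 nodes and 41x on 64 nodes, so the gain per added node shrinks as the node count grows.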
Evaluation
The multithreaded parallel implementation is named ATGC –
Another Tool for Genomic Comparison
Experiments align
human and mouse mitochondrial genomes
human and Drosophila mitochondrial genomes
Reason for selection
human and mouse are closely related, while human and Drosophila are less similar
The results were confirmed with MUMmer – another whole
genome alignment tool
Result graphs show that ATGC is more accurate than MUMmer
(verified by using NCBI BLAST)
Result Graphs
Conclusions
Comparison of whole genomes requires high computation and
memory
Made convenient by using a multithreaded parallel
implementation of dynamic programming on a cluster of PCs
Accurate results obtained in reasonable amount of time
Aligns closely related as well as less similar genomes
Slower, but plays an important role where high accuracy is needed
(as seen in the comparison with MUMmer for the human and Drosophila
mitochondrial genomes)
Related work – MUMmer (Maximal Unique Match)
given genomes A and B
find all maximal, unique, matching subsequences (MUMs)
extract the longest possible set of matches that occur in the same order in
both genomes
close the gaps
output the alignment
maximal unique match (MUM):
occurs exactly once in both genomes A and B
not contained in any longer MUM
key idea in identifying MUMs is to build a suffix tree for
genomes A and B
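The MUM definition can be made concrete with a brute-force search. Note that this deliberately swaps the suffix tree for a quadratic scan, so it is only suitable for toy inputs and is not how MUMmer works internally; the function names, the min_len cutoff and the two test strings are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_in_a, pos_in_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two genomes agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            match = a[i:i + length]
            # Keep the match only if it occurs exactly once in each genome.
            if (length >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, length))
    return mums


mums = find_mums("ACGTTTGGCA", "ACGTCCGGCA")
```

For these two toy strings the search reports ACGT at position 0 and GGCA at position 6 in both inputs; they already occur in the same order, so both would be kept as anchors before the gaps between them are closed.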