Parallelization of mrFAST on GPGPU
Hongyi Xin, Donghyuk Lee
Milestone II
Original Algorithm - mrFAST
Goal: Find the matching coordinates of each fragment on the reference
…ACAGTAACTATT
ACAAAAAAAACACGATTCAGATTAAACATAACATACGACCCTTACACTG…
Address: 1225
Algorithm
Reference DNA sequence
Sample fragment sequence
Create hash table
  AAAA: Coordinate 1, Coordinate 2, Coordinate 3
  AAAC: 1225, Coordinate 2, Coordinate 3
  …
  TTTT: Coordinate 1, Coordinate 2, Coordinate 3
Get coordinate list
Compare the fragment against the reference at each coordinate by edit-distance calculation (expensive!)
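The hash-table steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not mrFAST's actual implementation: the key length of 4 simply follows the example keys on the slide, coordinates are 0-based, the fragment's first 4 characters are used as the lookup key, and the names `build_hash_table` and `candidate_coordinates` are hypothetical.

```python
from collections import defaultdict

K = 4  # key length; assumed from the slide's example keys (AAAA ... TTTT)


def build_hash_table(reference, k=K):
    """Map every k-character substring of the reference to its coordinates."""
    table = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        # Positions are visited in increasing order, so each list stays sorted.
        table[reference[pos:pos + k]].append(pos)
    return table


def candidate_coordinates(table, fragment, k=K):
    """Return the coordinate list stored under the fragment's first k characters."""
    return table.get(fragment[:k], [])


# Reference excerpt taken from the slide; the fragment is a made-up example.
reference = "ACAAAAAAAACACGATTCAGATTAAACATAACATACGACCCTTACACTG"
table = build_hash_table(reference)
coords = candidate_coordinates(table, "AAACATAACATA")
```

Every coordinate in `coords` is only a candidate: each one still has to be verified against the reference, which is the edit-distance step the slide flags as expensive.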
Problem
- High cost of edit-distance calculation (high complexity and many memory accesses)
1 memory access to the hash table, then 188 reference DNA lookups on average.
At least 108 character comparisons and at least 324 additions
On average, 188 edit-distance calculations for each fragment!
Edit-Distance Calculation
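As a reference point, here is a sketch of the textbook dynamic-programming (Levenshtein) edit distance in Python. It is a standard formulation and not necessarily the variant mrFAST uses; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, keeping two table rows."""
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1  # one character comparison per cell
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

Each table cell costs one character comparison and three additions, which is consistent with the 1:3 ratio of comparisons to additions quoted on the Problem slide (108 and 324). For example, `edit_distance("ACCCTTAC", "ACCTTAC")` is 1, because a single deletion turns one string into the other.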
New Idea: Binary Search Filtering
Insight
Look up each substring of the fragment in the hash table and search its coordinate list for the expected coordinate.
Pros.
+ Avoids accessing the reference sequence.
+ Fewer memory accesses.
Individual DNA sequence: ACCCTTACACTAAAAA
Reference DNA sequence: …CAGTACCCTTACACTAAAAAGTMTTCCAAACC…
The fragment lies at coordinate m on the reference; its 4-character substrings start at m, m+4, m+8 and m+12.
Hash table (key: coordinate list)
  AAAA: f, m+12, n+11
  …
  ACCC: m, n, p
  …
  ACTA: d, m+8, n+7
  …
  TTAC: m+4, n+4, t
  …
  TTTT: Coordinate 1, Coordinate 2, Coordinate 3
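The filtering idea on this slide can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: substrings are non-overlapping and 4 characters long as in the slide example, candidates are seeded from the first substring only, expected coordinates use exact offsets (so insertions and deletions are not tolerated), and the names `contains`, `binary_search_filter` and `min_hits` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

K = 4  # substring (key) length, as in the slide example


def build_hash_table(reference, k=K):
    """Map each k-character substring to its sorted list of coordinates."""
    table = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        table[reference[pos:pos + k]].append(pos)
    return table


def contains(sorted_coords, value):
    """Binary search for value in a sorted coordinate list."""
    i = bisect_left(sorted_coords, value)
    return i < len(sorted_coords) and sorted_coords[i] == value


def binary_search_filter(table, fragment, k=K, min_hits=None):
    """Keep candidate start coordinates m whose substrings are found at m + offset."""
    keys = [(off, fragment[off:off + k])
            for off in range(0, len(fragment) - k + 1, k)]
    if min_hits is None:
        min_hits = len(keys)  # by default every substring must agree
    survivors = []
    _, first_key = keys[0]
    for m in table.get(first_key, []):
        # Only the hash table is consulted here, never the reference itself.
        hits = sum(contains(table.get(key, []), m + off) for off, key in keys)
        if hits >= min_hits:
            survivors.append(m)
    return survivors


fragment = "ACCCTTACACTAAAAA"                # fragment from the slide
reference = "CAGT" + fragment + "GTTTCC"     # toy reference built for this example
table = build_hash_table(reference)
survivors = binary_search_filter(table, fragment)
```

Here the fragment sits at coordinate 4 of the toy reference, so it is the only survivor. Lowering `min_hits` lets a candidate pass even when some substrings disagree, which is one way to leave room for mismatches before the edit-distance check.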
Load imbalance of the hash table
These keys have really large entries
New Idea: Prefiltering for load balancing
Insight
In binary search filtering, pick the cheap keys: those with small coordinate lists.
Pros.
+ Reduces the number of binary searches.
+ Balances the computational load of binary search.
Individual DNA sequence: AAAATTACACTAAAAA
Key    # of same pattern   # of coordinates   # of computations
AAAA   Large               Large              Large
TTAC   Small               Small              Small
Balance the load of the binary search computation by selecting keys based on the size of their coordinate lists.
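One way to read the key-selection step is sketched below in Python. It is an assumption-laden illustration rather than the authors' method: the fragment is split into non-overlapping 4-character keys, the keys are ranked by coordinate-list length, and the cheapest few are kept. The function name `pick_cheap_keys`, the parameter `n_keys`, and the list sizes in the toy table are all hypothetical.

```python
def pick_cheap_keys(table, fragment, k=4, n_keys=2):
    """Return the n_keys (offset, key) pairs with the shortest coordinate lists."""
    keys = [(off, fragment[off:off + k])
            for off in range(0, len(fragment) - k + 1, k)]
    # Stable sort by list length; a key missing from the table counts as length 0.
    keys.sort(key=lambda item: len(table.get(item[1], [])))
    return keys[:n_keys]


# Toy table with made-up list sizes: AAAA is a frequent pattern, TTAC is rare.
table = {
    "AAAA": list(range(1000)),
    "TTAC": [17, 250],
    "ACTA": [21, 300, 512],
}
cheap = pick_cheap_keys(table, "AAAATTACACTAAAAA")
```

With these sizes the selection is `[(4, "TTAC"), (8, "ACTA")]`, so both occurrences of the expensive key AAAA are skipped and every fragment costs a similar, small number of binary searches.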
Effectiveness of Binary Search Filtering
We want all dots to fall into the left box.
As far left as possible!
Effectiveness of Binary Search Filtering
Future Work
Implement on GPU
Analyze the load imbalance problem
The number of coordinates that pass binary search filtering may vary
Solve the divergence problem
Edit-distance calculation may diverge
Divergence is bad for GPUs
SIMT model
Q&A
Thank you!