Parallelization of mrFAST on GPGPU
Hongyi Xin, Donghyuk Lee
Milestone II
Original Algorithm - mrFAST

Goal: Find the matching coordinate of each fragment on the reference
…ACAGTAACTATT
ACAAAAAAAACACGATTCAGATTAAACATAACATACGACCCTTACACTG…
Address: 1225

Algorithm

Reference DNA sequence -> Create hash table
    AAAA: Coordinate 1, Coordinate 2, Coordinate 3
    AAAC: 1225, Coordinate 2, Coordinate 3
    -------
    TTTT: Coordinate 1, Coordinate 2, Coordinate 3

Sample fragment sequence -> Get coordinate list

Compare against the reference at each coordinate by
edit-distance calculation (expensive!)
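
The hash-table step above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the mrFAST code: the function names are hypothetical, the key length of 4 is taken from the 4-letter keys on the slide, and only the fragment's first key is looked up.

```python
from collections import defaultdict

K = 4  # assumed key length, matching the 4-letter keys (AAAA ... TTTT) on the slide


def build_hash_table(reference, k=K):
    """Map every k-letter substring of the reference to its start coordinates."""
    table = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        table[reference[pos:pos + k]].append(pos)
    return table  # each list is sorted because pos only increases


def candidate_coordinates(table, fragment, k=K):
    """Return the coordinate list stored under the fragment's first key."""
    return table.get(fragment[:k], [])


# Hypothetical toy reference; real references are far longer.
reference = "CAGTACCCTTACACTAAAAAGT"
table = build_hash_table(reference)
candidates = candidate_coordinates(table, "ACCCTTACACTAAAAA")  # -> [4]
```

Every coordinate returned here would then be verified against the reference with an edit-distance calculation, which is the expensive part.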
Problem

- High cost of edit-distance calculation (high complexity and many memory accesses)

1 memory access to the hash table, followed by 188 reference DNA lookups on average.

At least 108 character compares and at least 324 adds

On average, 188 edit-distance calculations for each fragment!
Edit-Distance Calculation
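
For readers without the slide figure, here is a minimal sketch of a textbook (Levenshtein) edit-distance calculation. It is assumed, not confirmed by the slides, that this is the variant mrFAST uses; the function name is hypothetical.

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance, keeping only two rows."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1          # one character compare
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # match or substitution
        prev = curr
    return prev[-1]


distance = edit_distance("ACGT", "AGT")  # -> 1 (one deleted base)
```

Each table cell costs one character compare and three additions, the same 1:3 ratio as the 108 compares and 324 adds quoted on the Problem slide.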
New Idea: Binary Search Filtering

Insight

Search for the expected coordinate of each of the fragment's substrings in the hash table.

Pros.

+ Avoids accessing the reference sequence.
+ Fewer memory accesses.
Individual DNA sequence: ACCCTTACACTAAAAA

Reference DNA sequence: …CAGTACCCTTACACTAAAAAGTMTTCCAAACC…
(the fragment's four substrings sit at m, m+4, m+8 and m+12)

Hash table
    AAAA: f, m+12, n+11
    -------
    ACCC: m, n, p
    ------
    ACTA: d, m+8, n+7
    ------
    TTAC: m+4, n+4, t
    -------
    TTTT: Coordinate 1, Coordinate 2, Coordinate 3
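
One way to read the diagram is sketched below. It is an interpretation, not the authors' implementation: the function names, the `tolerance` and `max_missing` parameters, and the concrete numbers standing in for m, n, p, d, f and t are all assumptions made for illustration.

```python
from bisect import bisect_left


def has_coordinate_near(sorted_coords, target, tolerance):
    """Binary search: is any stored coordinate within `tolerance` of `target`?"""
    i = bisect_left(sorted_coords, target - tolerance)
    return i < len(sorted_coords) and sorted_coords[i] <= target + tolerance


def binary_search_filter(table, fragment, k=4, tolerance=1, max_missing=1):
    """Keep start coordinates whose later keys appear at the expected offsets."""
    keys = [fragment[o:o + k] for o in range(0, len(fragment) - k + 1, k)]
    survivors = []
    for start in table.get(keys[0], []):
        missing = 0
        for idx, key in enumerate(keys[1:], start=1):
            expected = start + idx * k
            if not has_coordinate_near(table.get(key, []), expected, tolerance):
                missing += 1
        if missing <= max_missing:
            survivors.append(start)
    return survivors


# Hypothetical values: m=100, n=200, p=300, d=20, f=10, t=400.
table = {
    "AAAA": [10, 112, 211],   # f, m+12, n+11
    "ACCC": [100, 200, 300],  # m, n, p
    "ACTA": [20, 108, 207],   # d, m+8, n+7
    "TTAC": [104, 204, 400],  # m+4, n+4, t
}
survivors = binary_search_filter(table, "ACCCTTACACTAAAAA")  # -> [100, 200]
```

Only the hash table's sorted coordinate lists are touched; the reference sequence is never read. Coordinate p is dropped without any edit-distance calculation, while n survives despite its shifted offsets (n+7, n+11) because of the tolerance.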
Load Imbalance of the Hash Table
Figure annotation: these keys have really large entries.
New Idea: Prefiltering for Load Balancing

Insight

In binary search filtering, pick the cheap keys, i.e. the keys with small coordinate lists.

Pros.

+ Reduces the number of binary searches.
+ Balances the computation load of binary search.
Individual DNA sequence: AAAATTACACTAAAAA

Key     # of same pattern    # of coordinates    # of computations
AAAA    Large                Large               Large
TTAC    Small                Small               Small

Balance the load of the binary search computation by selecting keys
based on the size of their coordinate lists.
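
Key selection by list size might look like the sketch below. The function names, the non-overlapping key split and the example coordinate lists are assumptions for illustration only.

```python
def cheapest_key(table, fragment, k=4):
    """Return (offset, key) for the key with the shortest coordinate list."""
    offsets = range(0, len(fragment) - k + 1, k)
    return min(((o, fragment[o:o + k]) for o in offsets),
               key=lambda item: len(table.get(item[1], [])))


def seed_starts(table, fragment, k=4):
    """Candidate fragment start coordinates derived from the cheapest key."""
    offset, key = cheapest_key(table, fragment, k)
    return [coord - offset for coord in table.get(key, [])]


# Hypothetical lists: AAAA is a common pattern, TTAC a rare one.
table = {
    "AAAA": [3, 40, 77, 120, 150, 190],
    "TTAC": [44, 300],
    "ACTA": [48, 90, 210],
}
seed = cheapest_key(table, "AAAATTACACTAAAAA")   # -> (4, "TTAC")
starts = seed_starts(table, "AAAATTACACTAAAAA")  # -> [40, 296]
```

Seeding from TTAC leaves two candidates to check instead of the six that AAAA would produce, so both the number of binary searches and the spread in work per fragment shrink.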
Effectiveness of Binary Search Filtering
We want all dots to fall into the left box, as far left as possible!
Effectiveness of Binary Search Filtering
Future Work

- Implement on the GPU

- Analyze the load imbalance problem
    - The number of coordinates that pass binary search filtering may vary

- Solve the divergence problem
    - Edit-distance calculations may diverge
    - Divergence is bad for the GPU (SIMT model)
Q&A

Thank you!