Sequence Alignment in DNA
Under the Guidance of: Prof. Kolin Paul
Presented By: Lalchand, Gaurav Jain
Agenda
• Application Domain & Objective
• General Alignment Procedure
• Scope of Parallelism in BWT
• Selection sort and quick sort implementation
• BWT Implementation on GPU
• Comparative study
• Time-Line
Application Domain & Objective
• Analyzing gene expression
• Mapping variations between individuals
• Mapping homologous proteins
• Assembling the genome of an organism
Objective: to present an efficient implementation (especially a parallel one) that effectively aids the search for short sequences in DNA.
Basic Alignment Procedure
[Diagram: Genome, Indexing ("To be parallelized"), Intermediate (size: 10^18), Reads, Searching ("Parallelized", O(log G)), output {Location, Occurrence}]
Scope of Parallelism in BWT
• With the BWT, a string of length w can be found in O(w) time.
• The BWT is closely related to the suffix array (SA)
– The suffix array is the lexicographically sorted list of all suffixes in a genome.
• BWT[i] = ref[SA[i] - 1] {BWT[i] = $ when SA[i] = 0, with 0-based indices}
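The O(w) search claim can be illustrated with backward search over the BWT. The following is a minimal sketch, not the presenters' implementation: the names bwt_index and backward_search and the tables c and occ are illustrative assumptions, and the index is built naively with Python's sorted() rather than in parallel.

```python
def bwt_index(text):
    """Build suffix array, BWT and lookup tables for a text ending in '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] with i == 0 wraps to text[-1], which is '$'.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c[ch] = number of characters in text strictly smaller than ch
    c, total = {}, 0
    for ch in alphabet:
        c[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, bwt, c, occ


def backward_search(pattern, sa, c, occ):
    """Return sorted start positions of pattern; one table lookup pair per character."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c:
            return []
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, bwt, c, occ = bwt_index("ACGTA$")
print(bwt)                                 # AT$ACG
print(backward_search("A", sa, c, occ))    # [0, 4]
print(backward_search("GTA", sa, c, occ))  # [2]
```

The loop in backward_search runs once per pattern character, which is where the O(w) bound comes from once the tables exist.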
Initial Step - 1
● Implementation of BWT using Selection Sort
– OpenMP
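A minimal serial sketch of building the suffix array with selection sort is given here for orientation. It is written in Python rather than OpenMP C, the function name suffix_array_selection_sort is hypothetical, and the comment about which loop would be parallelized is an assumption, since the slides do not show the OpenMP code.

```python
def suffix_array_selection_sort(ref):
    """Sort suffix start positions of ref with selection sort (O(n^2) comparisons)."""
    sa = list(range(len(ref)))
    n = len(sa)
    for pos in range(n - 1):
        best = pos
        # Scan for the smallest remaining suffix. Assumption: this scan is the
        # part a parallel version could split across threads as a min-reduction.
        for cand in range(pos + 1, n):
            if ref[sa[cand]:] < ref[sa[best]:]:
                best = cand
        sa[pos], sa[best] = sa[best], sa[pos]
    return sa


print(suffix_array_selection_sort("ACGTA$"))  # [5, 4, 0, 1, 2, 3]
```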
Selection Sort - OpenMP
[Chart: "Bwt Creation using Selection sort". X axis: File Size in KB (0 to 1000); Y axis: Time in Seconds (0 to 7000); series: Proc 1, Proc 2, Proc 4, Proc 8]
CPU
Cores: 8
Data cache: L1: 32K, L2: 6M
DRAM: 12GB
Proc. Clock: 2.9 GHz
Initial Step - 2
● Implementation of BWT using Selection Sort
– OpenMP
● Implementation of BWT using Quick Sort
– OpenMP
Quick Sort - OpenMP
CPU Statistics
Cores: 8
Data cache: L1: 32K, L2: 6M
DRAM: 12GB
Proc. Clock: 2.9 GHz
Initial Step - 3
● Implementation of BWT using Selection Sort
– OpenMP
● Implementation of BWT using Quick Sort
– OpenMP
● Implementing BWT on GPU
– Bitonic sort
Why Bitonic?
• A bitonic sequence is the concatenation of two sub-sequences sorted in opposite directions
– or a cyclic shift of such a sequence
• Implemented by comparator networks
– Work in place
– No communication
• Naturally suitable for SIMD architectures
– Each thread executes the same code on different data
• O(log^2 n) time and O(n log^2 n) work
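The points above can be sketched with a recursive bitonic sort. This is an illustration under stated assumptions, not the GPU code from the talk: the function names are hypothetical, the input length must be a power of two, and the sketch copies lists for readability instead of working in place.

```python
def bitonic_merge(seq, ascending):
    """Sort a bitonic sequence (length a power of two) in the given direction."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    seq = list(seq)
    # One comparator stage: element i is compared only with element i + half,
    # so all n/2 comparisons are independent of each other.
    for i in range(half):
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    return bitonic_merge(seq[:half], ascending) + bitonic_merge(seq[half:], ascending)


def bitonic_sort(seq, ascending=True):
    """Sort halves in opposite directions to form a bitonic sequence, then merge."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    assert n & (n - 1) == 0, "length must be a power of two"
    half = n // 2
    first = bitonic_sort(seq[:half], True)
    second = bitonic_sort(seq[half:], False)
    return bitonic_merge(first + second, ascending)


def comparator_stages(n):
    """Number of sequential comparator stages for n = 2^k inputs: k(k+1)/2."""
    k = n.bit_length() - 1
    return k * (k + 1) // 2


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(comparator_stages(8))                     # 6
```

The stage count k(k+1)/2 with k = log2 n is what gives O(log^2 n) time when each stage runs fully in parallel, and n/2 comparators per stage give O(n log^2 n) work.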
Burrows-Wheeler Transform
Basic String Sorting Algorithm
Input:   A C G T A $
indices: 0 1 2 3 4 5

Rotations (index: rotation):
5: $ A C G T A
4: A $ A C G T
3: T A $ A C G
2: G T A $ A C
1: C G T A $ A
0: A C G T A $

Sorted rotations (index: rotation):
5: $ A C G T A
4: A $ A C G T
0: A C G T A $
1: C G T A $ A
2: G T A $ A C
3: T A $ A C G

indices: 5 4 0 1 2 3
Output:  A T $ A C G
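The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the basic rotation-sorting approach only; the function names are hypothetical and Python's built-in sorted() stands in for whatever sort the implementation uses.

```python
def rotations(text):
    """All cyclic rotations of text, paired with their start index."""
    return [(i, text[i:] + text[:i]) for i in range(len(text))]


def bwt_by_rotation_sort(text):
    """Sort the rotations; the BWT is the last column of the sorted table."""
    table = sorted(rotations(text), key=lambda row: row[1])
    indices = [i for i, _ in table]
    output = "".join(rot[-1] for _, rot in table)
    return indices, output


indices, output = bwt_by_rotation_sort("ACGTA$")
print(indices)  # [5, 4, 0, 1, 2, 3]
print(output)   # AT$ACG
```

Because '$' sorts before A, C, G and T, the sorted rotation order equals the suffix array order for a text that ends in a single '$'.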
Steps Performed
• Copy the genome from host to device memory
• Build an indices array pointing into the reference string
• Compare suffixes via the indices array
– Swap indices accordingly
• Sorts n elements in O(log^2 n) kernel calls
– Each of O(1) time & O(n) work
• One more step to get the BWT from the suffix array
– BWT[i] = ref[SA[i] - 1] {BWT[i] = $ when SA[i] = 0, with 0-based indices}
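The steps above can be simulated on the CPU as follows. This is a hedged sketch, not the CUDA code from the talk: each call to bitonic_stage stands in for one kernel launch, the loop over i stands in for one GPU thread per element, and all function names are hypothetical. Padding entries use the index len(ref) and are treated as larger than every real suffix, which is an assumption of this sketch.

```python
def suffix_less(ref, a, b):
    """True if suffix a sorts strictly before suffix b; index len(ref) is padding."""
    n = len(ref)
    if a >= n or b >= n:
        return a < n  # a real suffix precedes padding; padding precedes nothing
    # Note: comparing two suffixes is not constant time in general.
    return ref[a:] < ref[b:]


def bitonic_stage(ref, idx, k, j):
    """One simulated kernel call: independent compare-and-swap on disjoint pairs."""
    for i in range(len(idx)):  # on a GPU, one thread per i
        partner = i ^ j
        if partner > i:
            ascending = (i & k) == 0
            a, b = idx[i], idx[partner]
            if ascending:
                out_of_order = suffix_less(ref, b, a)
            else:
                out_of_order = suffix_less(ref, a, b)
            if out_of_order:
                idx[i], idx[partner] = b, a  # swap indices, never the text


def suffix_array_bitonic(ref):
    """Return (suffix array, number of simulated kernel launches)."""
    n = len(ref)
    size = 1
    while size < n:
        size *= 2
    idx = list(range(n)) + [n] * (size - n)  # pad to a power of two
    launches = 0
    k = 2
    while k <= size:
        j = k // 2
        while j >= 1:
            bitonic_stage(ref, idx, k, j)
            launches += 1
            j //= 2
        k *= 2
    return idx[:n], launches


def bwt_from_sa(ref, sa):
    """BWT[i] = ref[SA[i] - 1], or '$' when SA[i] == 0."""
    return "".join("$" if s == 0 else ref[s - 1] for s in sa)


sa, launches = suffix_array_bitonic("ACGTA$")
print(sa, launches)          # [5, 4, 0, 1, 2, 3] 6
print(bwt_from_sa("ACGTA$", sa))  # AT$ACG
```

For the 6-character example the index array is padded to 8 entries, so the network needs 3 * 4 / 2 = 6 stages.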
CPU – GPU Interaction (BWT)
[Diagram with elements: Genome, Cuda_Memcpy & kernel call, Suffix Array, Searching; label on the diagram: O(log2G)]
Evaluation
BWT with Bitonic Sort
GPU Statistics
SM: 30
Cores/SM: 8
Cores: 240
Data cache (SM): 16 K
DRAM: 536 M
Proc. Freq: 1.2 GHz
Comparison between Expected (GPU) and Exact Result
Expected GPU time = (Quick_Sort_time * 2) / 240

              CPU                GPU
Cores         2                  240
Data cache    L1: 32K, L2: 6M    16K (SM)
DRAM          12GB               536 M
Proc. Clock   2.9 GHz            1.2 GHz
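The expected-time estimate on this slide is a simple core-count scaling, which can be written as a one-line function. The function name and the sample input of 120 seconds are hypothetical and are not measurements from the talk.

```python
def expected_gpu_time(quick_sort_time, cpu_cores=2, gpu_cores=240):
    """Scale the CPU quick sort time by the ratio of CPU cores to GPU cores."""
    return quick_sort_time * cpu_cores / gpu_cores


# Hypothetical input: a 120-second CPU run would be expected to take 1 second.
print(expected_gpu_time(120.0))  # 1.0
```

This assumes perfect linear scaling across cores and ignores differences in clock rate, cache size and memory transfer cost.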
Thanks
Future Work
• Run in limited-memory environments
– Compute in parts
• Use the memory hierarchy of the GPU
– Sort keys are cached in registers or shared memory
– Long runs of repeated characters
• Position indicating end of run
• Can only sort sequences whose length is a power of 2
– 2^k + 1 → 2^(k+1)
– Padding with the largest symbol
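The power-of-two restriction and its padding cost can be sketched as follows. The function names are hypothetical, and the choice of '~' as the largest symbol in the usage lines is an assumption for illustration only.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    size = 1
    while size < n:
        size *= 2
    return size


def pad_for_bitonic(keys, largest):
    """Append copies of the largest symbol so the length is a power of two."""
    size = next_power_of_two(len(keys))
    return list(keys) + [largest] * (size - len(keys))


# Worst case from the slide: 2^k + 1 keys grow to 2^(k+1), here 9 -> 16.
padded = pad_for_bitonic("ACGTACGTA", "~")
print(len(padded))             # 16
print(sorted(padded)[:9])      # the 9 real keys, in order, ahead of the padding
```

Because the padding symbol is the largest, the padded entries collect at the end after an ascending sort and can be dropped by truncating to the original length.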