Towards High Performance Data Analytics with Java
SALIYA EKANAYAKE
[email protected]
4/1/2013
SALSA PRESENTATION
A Bit of Background
Gene Sequence Clustering and Visualization
◦ Projects
  ◦ Million sequence project http://salsahpc.indiana.edu/millionseq/
  ◦ Work on COG (Protein) sequences http://salsacog.blogspot.com/
  ◦ Work on phylogenetic trees http://salsafungiphy.blogspot.com/
◦ Publications
  ◦ G. L. H. Yang Ruan, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox, “Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions,” in C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, 2014
  ◦ L. Stanberry, R. Higdon, W. Haynes, N. Kolker, W. Broomall, S. Ekanayake, A. Hughes, Y. Ruan, J. Qiu, E. Kolker, and G. Fox, “Visualizing the protein sequence universe,” in Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences, Delft, The Netherlands, 2012, pp. 13-22
  ◦ Y. Ruan, S. Ekanayake, M. Rho, H. Tang, S.-H. Bae, J. Qiu, and G. Fox, “DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences,” in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, Orlando, Florida, 2012, pp. 329-336
  ◦ A. Hughes, Y. Ruan, S. Ekanayake, S. H. Bae, Q. Dong, M. Rho, J. Qiu, and G. Fox, “Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets,” BMC Bioinformatics, vol. 13 Suppl 2, pp. S9, 2012
[Figure: workflow diagram. Input: Gene Sequences in FASTA form (>G0H13NN01D34CL GTCGTTTAAGCCATTACGTC …, >G0H13NN01DK2OZ GTCGTTAAGCCATTACGTC …). Labels: Determine, Represent, and Verify Clusters (with a Sequence / Cluster table); Represent and Verify; Visualize in 3D; Generate Phylogenetic Trees; Compared to Traditional 2D.]
Under the Hood
[Figure: pipeline diagram with data sets D1 to D5 and the stages Alignment and Distance Calculation, Dimension Reduction, Clustering Algorithms, and Visualization. Sample data shown: sequences >G0H13NN01D34CL GTCGTTTAAGCCATTACGTC … and >G0H13NN01DK2OZ GTCGTTAAGCCATTACGTC …; coordinates (#, X, Y, Z) = (0, 0.358, 0.262, 0.295) and (1, 0.252, 0.422, 0.372); cluster assignments (#, Cluster) = (0, 1) and (1, 3).]
◦ Alignment and Distance Calculation
  ◦ SALSA-SWG → C# MPI
  ◦ SALSA-SWG-MBF → C# MPI
  ◦ SALSA-NW-MBF → C# MPI
  ◦ SALSA-SWG-MBF2Java → Java MapReduce
  ◦ SALSA-NW-BioJava → Java MapReduce
◦ Dimension Reduction
  ◦ MDSasChisq → C# MPI
  ◦ DA-SMACOF → C# MPI
  ◦ Twister DA-SMACOF → Java Iterative MapReduce
  ◦ WDA-SMACOF → Java Iterative MapReduce
◦ Clustering
  ◦ DAPWC → C# MPI
  ◦ DAVS → C# MPI
Reality Is More Complex
◦ Study of Biological Sequence Structure
  ◦ http://salsahpc.blogspot.com/2013/05/study-of-biological-sequence-structure.html
◦ Million Sequence Processes
  ◦ http://salsahpc.indiana.edu/millionseq/fungi2/fungi2_index.html
Runs On
◦ Tempest → Windows HPC Cluster
◦ FutureGrid, BigRed II, Quarry → Traditional Linux Based HPC Clusters
Towards Java
Motivation
◦ Immediate → Limited Windows HPC Clusters
◦ Future → Integrate with Apache Big Data Stack (ABDS)
Options
◦ Keep C#
  ◦ Run on Azure cloud → Not the best for MPI because of high latency and low bandwidth
  ◦ Run on Mono → We tried it and it worked, but performance was poor
◦ Convert to Java
  ◦ Time consuming, but gave good results
“Java Ready” Applications
◦ Deterministic Annealing Vector Sponge (DAVS)
◦ Deterministic Annealing Pairwise Clustering (DAPWC)
Evaluations
MPI Frameworks
◦ MPI.NET → A high performance message passing interface for the .NET environment
◦ FastMPJ → A pure Java implementation of the mpiJava 1.2 specification
◦ OpenMPI → Java wrapper for the native MPI implementation
  ◦ Nightly snapshot 1.9a1r28881 (OMPI-nightly) – conforms to the mpiJava 1.2 specification
  ◦ Source tree revision 30301 (OMPI-trunk)
  ◦ Release candidate version 1.7.5rc5 (OMPI-175rc5) – latest of the three
  → Your code was here!!
Kernel Benchmarks
◦ Ohio MicroBenchmark (OMB) Suite
◦ Send and receive
◦ Allreduce
Application Benchmarks
◦ DAVS and DAPWC on Real Data
◦ Parallel Patterns of T x P x N
  ◦ T - # threads per process
  ◦ P - # MPI processes per node
  ◦ N - # nodes
◦ Threads from the Habanero Java Library, mainly for parallel loops
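The T x P x N notation can be made concrete with a small enumeration. The sketch below is illustrative only: it assumes power-of-two values, at most 8 threads times processes per node, and up to 32 nodes. These limits are inferred from the pattern labels on the charts in this deck, not stated by the author. The names `ParallelPatterns`, `Pattern`, `enumerate`, and `withParallelism` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ParallelPatterns {
    /** One T x P x N pattern: threads per process, MPI processes per node, nodes. */
    static final class Pattern {
        final int threads;
        final int procsPerNode;
        final int nodes;

        Pattern(int threads, int procsPerNode, int nodes) {
            this.threads = threads;
            this.procsPerNode = procsPerNode;
            this.nodes = nodes;
        }

        /** Total number of parallel workers, T * P * N. */
        int parallelism() {
            return threads * procsPerNode * nodes;
        }

        @Override
        public String toString() {
            return threads + "x" + procsPerNode + "x" + nodes;
        }
    }

    /** All power-of-two patterns with T * P <= coresPerNode and N <= maxNodes (assumed limits). */
    static List<Pattern> enumerate(int coresPerNode, int maxNodes) {
        List<Pattern> out = new ArrayList<>();
        for (int n = 1; n <= maxNodes; n *= 2) {
            for (int t = 1; t <= coresPerNode; t *= 2) {
                for (int p = 1; t * p <= coresPerNode; p *= 2) {
                    out.add(new Pattern(t, p, n));
                }
            }
        }
        return out;
    }

    /** Patterns whose total parallelism equals the target. */
    static List<Pattern> withParallelism(List<Pattern> all, int target) {
        List<Pattern> out = new ArrayList<>();
        for (Pattern p : all) {
            if (p.parallelism() == target) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Pattern> all = enumerate(8, 32);
        System.out.println(all.size() + " patterns in total");
        System.out.println("Parallelism 16: " + withParallelism(all, 16));
    }
}
```

Grouping patterns by total parallelism makes it easy to compare, for a fixed number of workers, how the split between threads, processes, and nodes affects run time.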
Kernel Benchmarks
MPI Send and Receive
[Figure: two log-scale plots of Average Time (us) against Message Size (bytes), from 0B to 1MB. Panel titles: "Performance with Different MPI Frameworks" and "OMPI-trunk Performance with and without Infiniband". Legend entries: MPI.NET C# in Tempest, FastMPJ Java in FG, OMPI-nightly Java FG, OMPI-nightly C FG, OMPI-trunk Java FG, OMPI-trunk C FG, OMPI-trunk Java Madrid, OMPI-trunk C Madrid.]
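The plotted quantity, average time in microseconds per message size, comes from a simple measurement loop. The following is a minimal single-JVM sketch of that loop, not the OMB code itself: the timed operation is a local array copy standing in for an MPI send and receive, and the names `LatencySweep`, `messageSizes`, and `averageMicros` as well as the warm-up and iteration counts are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

final class LatencySweep {
    /** Message sizes 0, 1, 2, 4, ... up to maxBytes, matching a power-of-two sweep. */
    static List<Integer> messageSizes(int maxBytes) {
        List<Integer> sizes = new ArrayList<>();
        sizes.add(0);
        for (int s = 1; s <= maxBytes; s *= 2) {
            sizes.add(s);
        }
        return sizes;
    }

    /** Average time of one call to op in microseconds, measured after untimed warm-up calls. */
    static double averageMicros(IntConsumer op, int size, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            op.accept(size);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            op.accept(size);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000.0 / iterations;
    }

    public static void main(String[] args) {
        byte[] src = new byte[1 << 20];
        byte[] dst = new byte[1 << 20];
        // Stand-in for the timed operation: a local copy, NOT an MPI call.
        IntConsumer op = n -> System.arraycopy(src, 0, dst, 0, n);
        for (int size : messageSizes(1 << 20)) {
            System.out.printf("%8d B  %10.3f us%n", size, averageMicros(op, size, 100, 1000));
        }
    }
}
```

In a real ping-pong style send and receive test, the operation would be a round trip between two processes, and the one-way time is usually reported as half the round-trip time.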
Kernel Benchmarks
MPI Allreduce
[Figure: two log-scale plots of Average Time (us) against Message Size (bytes), from 4B to 8MB. Panel titles: "Performance with Different MPI Frameworks" and "OMPI-trunk Performance with and without Infiniband". Legend entries: MPI.NET C# in Tempest, FastMPJ Java in FG, OMPI-nightly Java FG, OMPI-nightly C FG, OMPI-trunk Java FG, OMPI-trunk C FG, OMPI-trunk Java Madrid, OMPI-trunk C Madrid.]
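For readers less familiar with the collective being timed: an allreduce combines one buffer from every process with a reduction operation and leaves the combined result on all of them. The sketch below only illustrates that semantics inside a single JVM, with each row of an array playing the role of one rank's buffer. It uses no MPI library, and the choice of sum as the reduction operator is an assumption (the slides do not say which operator the benchmark used). `AllreduceSketch` and `allreduceSum` are hypothetical names.

```java
import java.util.Arrays;

final class AllreduceSketch {
    /**
     * Element-wise sum across all ranks. perRank[r] is the buffer contributed by rank r;
     * the returned array gives every rank its own copy of the summed buffer.
     */
    static double[][] allreduceSum(double[][] perRank) {
        int ranks = perRank.length;
        int length = perRank[0].length;
        double[] sum = new double[length];
        for (double[] contribution : perRank) {
            for (int i = 0; i < length; i++) {
                sum[i] += contribution[i];
            }
        }
        double[][] result = new double[ranks][];
        for (int r = 0; r < ranks; r++) {
            result[r] = sum.clone(); // each rank holds an independent copy
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] buffers = { {1, 2}, {3, 4}, {5, 6} };
        // prints [[9.0, 12.0], [9.0, 12.0], [9.0, 12.0]]
        System.out.println(Arrays.deepToString(allreduceSum(buffers)));
    }
}
```

Because every process both contributes and receives data, the cost of the operation grows with message size and process count, which is why it is a common kernel benchmark alongside point-to-point send and receive.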
DAVS Performance
Mode – Charge5
[Figure: three charts comparing MPI.NET, OMPI-nightly, and OMPI-trunk.
 "Pure MPI": Time (hours) against TxPxN for 1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2, 1x8x1.
 "MPI with Threads": Time (hours) against TxPxN for 2x1x8, 4x1x8, 8x1x8, 1x2x8, 4x2x8, 1x4x8, 2x4x8.
 "Pure MPI Speedup": Speedup against TxPxN for 1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2, 1x8x1.]
DAVS Performance
Mode – Charge2
[Figure: three charts comparing MPI.NET, OMPI-nightly, and OMPI-trunk.
 "Pure MPI": Time (hours) against TxPxN for 1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2.
 "MPI with Threads": Time (hours) against TxPxN for 2x1x8, 4x1x8, 8x1x8, 1x2x8, 4x2x8, 1x4x8, 2x4x8, 1x8x8.
 "Pure MPI Speedup": Speedup against TxPxN for 1x1x1, 1x1x2, 1x2x1, 1x1x4, 1x4x1, 1x1x8, 1x2x4, 1x4x2.]
DAVS Performance
Single Node Charge 2, Charge 5 and Charge 6
[Figure: three charts titled "Charge 2", "Charge 5", and "Charge 6", each plotting run time (axes labeled Time (hours) and Time (s)) against TxPxN for 1x1x1 and 1x4x1. Legend entries: OMPI-trunk Madrid, OMPI-trunk FG, MPI.NET Tempest, MPI.NET Madrid.]
Points
◦ OMPI-trunk performed best, with OMPI-nightly close behind
◦ MPI.NET may be suffering from poor Infiniband performance
◦ FastMPJ had issues that prevented it from running the applications
◦ Threading performance in Java is not up to expectations
DAPWC Performance
OMPI-175 Only (Chosen over OMPI-trunk)
[Figure: "Region 5(10)_2(4) 12579 Points 4 Clusters - OMPI-1.7.5rc5 Performance". Time (hours), axis from 0 to 5, against TxPxN patterns from 1x1x1 to 8x1x32 (the axis also lists 1x8x43).]
DAPWC Performance
Parallelism → 16
[Figure: "Region 5(10)_2(4) 12579 Points 4 Clusters - OMPI-1.7.5rc5 Performance". Time (hours), axis from 0 to 0.8, against TxPxN.]
DAPWC Performance
Speedup
[Figure: "Region 5(10)_2(4) 12579 Points 4 Clusters - OMPI-1.7.5rc5 Speedup". Speedup, axis from 1 to 121, against TxPxN patterns from 1x1x1 to 8x1x32.]
Points
◦ Performance with threads is better than with DAVS, but the Tx1xN patterns behave peculiarly
◦ FastMPJ failed as before
◦ MPI.NET and OMPI-nightly runs are yet to be performed
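Speedup figures like these are normally derived from run times. A minimal sketch of that arithmetic follows, assuming the baseline is the 1x1x1 run (the slides do not state the baseline explicitly). The timings in `main` are made-up placeholders, not values read from the charts, and `SpeedupSketch` and its methods are hypothetical names.

```java
final class SpeedupSketch {
    /** Speedup of a pattern relative to the baseline run time (assumed to be 1x1x1). */
    static double speedup(double baselineHours, double patternHours) {
        return baselineHours / patternHours;
    }

    /** Parallel efficiency: speedup divided by the total parallelism T * P * N. */
    static double efficiency(double speedup, int threads, int procsPerNode, int nodes) {
        return speedup / ((double) threads * procsPerNode * nodes);
    }

    public static void main(String[] args) {
        // Placeholder timings for illustration only, not measurements from the deck.
        double baselineHours = 4.0;  // hypothetical 1x1x1 run
        double patternHours = 0.25;  // hypothetical 1x8x4 run
        double s = speedup(baselineHours, patternHours);
        double e = efficiency(s, 1, 8, 4);
        // prints speedup=16.0 efficiency=0.5
        System.out.println("speedup=" + s + " efficiency=" + e);
    }
}
```

Efficiency close to 1 means the pattern scales almost linearly; comparing efficiency across patterns with the same total parallelism highlights which split of threads, processes, and nodes works best.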
Current Tasks and Future
Current
◦ Complete migration of applications to Java
◦ Evaluate performance
◦ Investigate “not so great” thread performance
Future
◦ How to integrate with ABDS?
◦ Provide SaaS?
Thank you!