Towards High Performance Data Analytics with Java
SALIYA EKANAYAKE [email protected]
4/1/2013 SALSA PRESENTATION

A Bit of Background

Gene Sequence Clustering and Visualization

Gene Sequences
>G0H13NN01D34CL GTCGTTTAAGCCATTACGTC …
>G0H13NN01DK2OZ GTCGTTAAGCCATTACGTC …

◦ Projects
  ◦ Million sequence project http://salsahpc.indiana.edu/millionseq/
  ◦ Work on COG (Protein) sequences http://salsacog.blogspot.com/
  ◦ Work on phylogenetic trees http://salsafungiphy.blogspot.com/
◦ Publications
  ◦ G. L. H. Yang Ruan, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox, “Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions,” in C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, 2014
  ◦ L. Stanberry, R. Higdon, W. Haynes, N. Kolker, W. Broomall, S. Ekanayake, A. Hughes, Y. Ruan, J. Qiu, E. Kolker, and G. Fox, “Visualizing the protein sequence universe,” in Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences, Delft, The Netherlands, 2012, pp. 13-22
  ◦ Y. Ruan, S. Ekanayake, M. Rho, H. Tang, S.-H. Bae, J. Qiu, and G. Fox, “DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences,” in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, Orlando, Florida, 2012, pp. 329-336
  ◦ A. Hughes, Y. Ruan, S. Ekanayake, S. H. Bae, Q. Dong, M. Rho, J. Qiu, and G. Fox, “Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets,” BMC Bioinformatics, vol. 13 Suppl 2, pp. S9, 2012

[Diagram: Gene Sequences; Determine, Represent, and Verify Clusters; Visualize in 3D; Generate Phylogenetic Trees Compared to Traditional 2D]

Sequence  Cluster
0         2
1         1
…         …

Under the Hood

[Diagram: data sets D1 to D5 connected by the steps Alignment and Distance Calculation, Dimension Reduction, Clustering, and Visualization]

Sample data shown in the diagram:

>G0H13NN01D34CL GTCGTTTAAGCCATTACGTC …
>G0H13NN01DK2OZ GTCGTTAAGCCATTACGTC …

#  X      Y      Z
0  0.358  0.262  0.295
1  0.252  0.422  0.372

#  Cluster
0  1
1  3

Algorithms
◦ Alignment and Distance Calculation
  ◦ SALSA-SWG (C# MPI)
  ◦ SALSA-SWG-MBF (C# MPI)
  ◦ SALSA-NW-MBF (C# MPI)
  ◦ SALSA-SWG-MBF2Java (Java MapReduce)
  ◦ SALSA-NW-BioJava (Java MapReduce)
◦ Dimension Reduction
  ◦ MDSasChisq (C# MPI)
  ◦ DA-SMACOF (C# MPI)
  ◦ Twister DA-SMACOF (Java Iterative MapReduce)
  ◦ WDA-SMACOF (Java Iterative MapReduce)
◦ Clustering
  ◦ DAPWC (C# MPI)
  ◦ DAVS (C# MPI)

Reality Is More Complex
◦ Study of Biological Sequence Structure
  ◦ http://salsahpc.blogspot.com/2013/05/study-of-biological-sequence-structure.html
◦ Million Sequence Processes
  ◦ http://salsahpc.indiana.edu/millionseq/fungi2/fungi2_index.html

Runs On
◦ Tempest: Windows HPC cluster
◦ FutureGrid, BigRed II, Quarry: traditional Linux-based HPC clusters

Towards Java

Motivation
◦ Immediate: limited Windows HPC clusters
◦ Future: integrate with Apache Big Data Stack (ABDS)

Options
◦ Keep C#
  ◦ Run on Azure cloud: not the best for MPI because of high latencies and low bandwidths
  ◦ Run on Mono: we tried it and it worked, but performance was poor
◦ Convert to Java
  ◦ Time consuming, but gave good results

“Java Ready” Applications
◦ Deterministic Annealing Vector Sponge (DAVS)
◦ Deterministic Annealing Pairwise Clustering (DAPWC)

Evaluations

MPI Frameworks
◦ MPI.NET: a high performance message passing interface for the .NET environment
◦ FastMPJ: a pure Java implementation of the mpiJava 1.2 specification
◦ OpenMPI: Java wrapper for the native MPI implementation
  ◦ Nightly snapshot 1.9a1r28881 (OMPI-nightly), conforms with the mpiJava 1.2 specification
  ◦ Source tree revision 30301 (OMPI-trunk)
  ◦ Release candidate version 1.7.5rc5 (OMPI-175rc5), latest of the three

“Your code was here!!”

Kernel Benchmarks
◦ Ohio MicroBenchmark (OMB) Suite
  ◦ Send and receive
  ◦ Allreduce

Application Benchmarks
◦ DAVS and DAPWC on real data
◦ Parallel patterns of T x P x N
  ◦ T: number of threads per process
  ◦ P: number of MPI processes per node
  ◦ N: number of nodes
◦ Threads from the Habanero Java library, mainly for parallel loops

Kernel Benchmarks: MPI Send and Receive

[Figure: two panels, “Performance with Different MPI Frameworks” and “OMPI-trunk Performance with and without Infiniband”. x-axis: Message size (bytes), 0B to 1MB; y-axis: Average time (us). Legend entries across both panels: MPI.NET C# in Tempest, FastMPJ Java in FG, OMPI-nightly Java FG, OMPI-nightly C FG, OMPI-trunk Java FG, OMPI-trunk C FG, OMPI-trunk Java Madrid, OMPI-trunk C Madrid]

Kernel Benchmarks: MPI Allreduce

[Figure: two panels, “Performance with Different MPI Frameworks” and “OMPI-trunk Performance with and without Infiniband”. x-axis: Message size (bytes), 4B to 8MB; y-axis: Average time (us). Legend entries across both panels: MPI.NET C# in Tempest, FastMPJ Java in FG, OMPI-nightly Java FG, OMPI-nightly C FG, OMPI-trunk Java FG, OMPI-trunk C FG, OMPI-trunk Java Madrid, OMPI-trunk C Madrid]

DAVS Performance Mode: Charge5

[Figure: three panels, “Pure MPI” (Time (hours) vs. TxPxN), “MPI with Threads” (Time (hours) vs. TxPxN), and “Pure MPI Speedup” (Speedup vs. TxPxN). Series: MPI.NET, OMPI-nightly, OMPI-trunk]

DAVS Performance Mode: Charge2

[Figure: three panels, “Pure MPI” (Time (hours) vs. TxPxN), “MPI with Threads” (Time (hours) vs. TxPxN), and “Pure MPI Speedup” (Speedup vs. TxPxN). Series: MPI.NET, OMPI-nightly, OMPI-trunk]

DAVS Performance Single Node: Charge 2, Charge 5 and Charge 6

[Figure: three panels, “Charge 2”, “Charge 5”, and “Charge 6”. x-axis: TxPxN; y-axes: Time (hours) on one panel and Time (s) on the other two. Legend entries: OMPI-trunk Madrid, OMPI-trunk FG, MPI.NET Tempest, MPI.NET Madrid]

Points
◦ OMPI-trunk performed best, with OMPI-nightly close behind
◦ MPI.NET may be suffering from bad Infiniband
◦ FastMPJ had issues that prevented it from running the applications
◦ Java performance with threading is below expectations

DAPWC Performance: OMPI-175 Only (Chosen over OMPI-trunk)

[Figure: “Region 5(10)_2(4) 12579 Points 4 Clusters - OMPI-1.7.5rc5 Performance”. x-axis: TxPxN; y-axis: Time (hours)]

DAPWC Performance: Parallelism 16

[Figure: “Region 5(10)_2(4) 12579 Points 4 Clusters - OMPI-1.7.5rc5 Performance”. x-axis: TxPxN; y-axis: Time (hours)]

DAPWC Performance: Speedup

[Figure: “Region 5(10)_2(4) 12579 Points 4 Clusters - OMPI-1.7.5rc5 Speedup”. x-axis: TxPxN; y-axis: Speedup]

Points
◦ Performance with threads is better than for DAVS, but the Tx1xN patterns behave peculiarly
◦ FastMPJ failed as before
◦ MPI.NET and OMPI-nightly runs are yet to be performed

Current Tasks and Future

Current
◦ Complete migration of applications to Java
◦ Evaluate performance
◦ Investigate “not so great” thread performance

Future
◦ How to integrate with ABDS?
◦ Provide SaaS?

Thank you!
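
Supplementary sketch: the T x P x N notation from the Application Benchmarks slide can be read as a small data structure, where the total number of concurrent threads is the product of the three values. The Java sketch below is illustrative only and is not taken from the DAVS or DAPWC code. The class name `ParallelPattern`, its methods, and the timing values in `main` are hypothetical. The speedup formula assumes the usual definition (baseline run time divided by the run time of the pattern under test), which the slides do not state explicitly.

```java
import java.util.Locale;

// Hypothetical helper for the T x P x N patterns used in the benchmark slides.
final class ParallelPattern {
    final int threadsPerProcess; // T: threads per MPI process
    final int processesPerNode;  // P: MPI processes per node
    final int nodes;             // N: number of nodes

    ParallelPattern(int t, int p, int n) {
        if (t < 1 || p < 1 || n < 1) {
            throw new IllegalArgumentException("T, P and N must all be >= 1");
        }
        this.threadsPerProcess = t;
        this.processesPerNode = p;
        this.nodes = n;
    }

    // Parses a label such as "2x4x8" into T=2, P=4, N=8.
    static ParallelPattern parse(String label) {
        String[] parts = label.trim().toLowerCase(Locale.ROOT).split("x");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected TxPxN, got: " + label);
        }
        return new ParallelPattern(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]));
    }

    // Total number of concurrent threads across the whole run.
    int totalParallelism() {
        return threadsPerProcess * processesPerNode * nodes;
    }

    // Assumed definition: baseline time divided by the time of the parallel run.
    static double speedup(double baselineTime, double parallelTime) {
        if (parallelTime <= 0.0) {
            throw new IllegalArgumentException("parallelTime must be positive");
        }
        return baselineTime / parallelTime;
    }

    public static void main(String[] args) {
        ParallelPattern pattern = ParallelPattern.parse("2x4x8");
        System.out.println("Total parallelism: " + pattern.totalParallelism());
        // Made-up timings (hours), not measurements from the slides.
        System.out.println("Speedup: " + speedup(4.0, 0.5));
    }
}
```

Under this reading, patterns such as 1x8x2, 2x4x2 and 8x1x2 all have a total parallelism of 16 but split the work differently between threads and MPI processes, which is the comparison the TxPxN axes in the figures lay out.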