SALSA HPC Group, School of Informatics and Computing, Indiana University
http://salsahpc.indiana.edu

Gene sequence clustering pipeline (N = 1 million sequences):
- Select a reference sequence set (M = 100K), leaving the remaining N-M sequence set (900K) for interpolation.
- Reference set: pairwise alignment and distance calculation (an O(N^2) all-pairs computation) builds a distance matrix; multidimensional scaling (MDS) maps it to reference x, y, z coordinates.
- N-M set: interpolative MDS with pairwise distance calculation against the reference set produces x, y, z coordinates for the remaining sequences.
- Visualization: the combined coordinates are rendered as a 3D plot.

Twister4Azure iterative MapReduce job flow: Job Start, then Map and Combine tasks, Reduce, and Merge, followed by an "Add Iteration?" decision. If yes, the next iteration is launched through hybrid scheduling and reuses the data cache; if no, the job finishes. (A K-means driver sketch illustrating this loop appears near the end of this section.)

[Charts: performance with and without data caching; scaling speedup; the speedup gained from the data cache grows with an increasing number of iterations.]
[Chart: BLAST sequence search, parallel efficiency vs. number of query files (128-728) for Twister4Azure, Hadoop-Blast, and DryadLINQ-Blast.]
[Chart: Smith-Waterman sequence alignment, adjusted time (s) vs. number of cores x number of blocks for Twister4Azure, Amazon EMR, and Apache Hadoop.]
[Chart: Cap3 sequence assembly, parallel efficiency vs. number of cores x number of files for Twister4Azure, Amazon EMR, and Apache Hadoop.]

Twister runtime features:
- Configuration program to set up the Twister environment automatically on a cluster
- Full mesh network of brokers for facilitating communication
- New messaging interface for reducing the message serialization overhead
- Memory cache to share data between tasks and jobs

This demo shows real-time visualization of a multidimensional scaling (MDS) calculation. Twister performs the parallel calculation inside the cluster, and PlotViz displays the intermediate results on the user's client computer. The computation and monitoring process is automated by the program.

[Figure: MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children's Research Institute.]

Demo architecture: the Client Node runs the MDS Monitor and PlotViz; the Master Node runs the Twister Driver and Twister-MDS; an ActiveMQ Broker connects them. I. The client sends a message to start the job. II. Twister-MDS sends intermediate results back through the broker. III. The monitor writes the data to local disk. IV. PlotViz reads the data for display.
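As a concrete illustration of step II above, here is one way the master side could publish intermediate x, y, z coordinates to an ActiveMQ topic that the MDS Monitor subscribes to. This is a minimal sketch rather than the project's actual code: the class name, broker URL, topic name, and message layout are assumptions; only the standard JMS/ActiveMQ calls are real.

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.ObjectMessage;
    import javax.jms.Session;
    import javax.jms.Topic;
    import org.apache.activemq.ActiveMQConnectionFactory;

    // Hypothetical publisher used after each MDS iteration (step II in the diagram).
    public class MdsResultPublisher {
        private final Session session;
        private final MessageProducer producer;

        public MdsResultPublisher(String brokerUrl, String topicName) throws Exception {
            Connection connection = new ActiveMQConnectionFactory(brokerUrl).createConnection();
            connection.start();
            session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic(topicName);
            producer = session.createProducer(topic);
        }

        // Publish the current coordinates so the MDS Monitor can write them to disk
        // (step III) and PlotViz can pick them up (step IV).
        public void publish(int iteration, double[][] coordinates) throws Exception {
            ObjectMessage msg = session.createObjectMessage(coordinates); // double[][] is Serializable
            msg.setIntProperty("iteration", iteration);
            producer.send(msg);
        }
    }

A driver would create one publisher, for example new MdsResultPublisher("tcp://broker-host:61616", "MDS.RESULTS"), and call publish() at the end of every iteration; the host name and topic name here are placeholders, while 61616 is the standard ActiveMQ port.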
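The coordinates being streamed above come from an iterative, SMACOF-style MDS computation; the Twister-MDS architecture described next splits each iteration into calculateBC and calculateStress map tasks. As a rough, hypothetical illustration of the kind of work a stress task does over its block of rows (the class name and the row-block partitioning are assumptions, not the project's code):

    // Hypothetical helper: raw MDS stress contributed by rows [rowStart, rowEnd)
    // of the target dissimilarity matrix, given the current 3D embedding.
    public final class StressBlock {

        public static double partialStress(double[][] target, double[][] coords,
                                           int rowStart, int rowEnd) {
            double stress = 0.0;
            for (int i = rowStart; i < rowEnd; i++) {
                for (int j = 0; j < coords.length; j++) {
                    if (i == j) continue;
                    double d = euclidean(coords[i], coords[j]);  // distance in the embedding
                    double diff = d - target[i][j];              // mismatch with the input dissimilarity
                    stress += diff * diff;
                }
            }
            return stress;
        }

        private static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int k = 0; k < a.length; k++) {
                double delta = a[k] - b[k];
                sum += delta * delta;
            }
            return Math.sqrt(sum);
        }
    }

Each map task would evaluate its own row block against the coordinates broadcast for that iteration, and the reduce step would sum the partial values into the global stress reported through the monitoring interface.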
Twister-MDS execution architecture: the Master Node runs the Twister Driver and Twister-MDS; a pub/sub broker network connects it to the Twister Daemons on the Worker Nodes, each of which keeps a worker pool executing map and reduce tasks (calculateBC and calculateStress); an MDS output monitoring interface tracks progress.

Broker network: 7 brokers and 32 computing nodes in total, with Twister Daemon nodes, ActiveMQ Broker nodes, and a Twister Driver node linked by broker-driver, broker-daemon, and broker-broker connections.

Twister-MDS execution time (100 iterations, 40 nodes, under different input data sizes), total execution time in seconds:

  Number of data points   Original (1 broker only)   Current (7 brokers, the best broker number)
  38,400                  189.288                    148.805
  51,200                  359.625                    303.432
  76,800                  816.364                    737.073
  102,400                 1508.487                   1404.431

GroupVPN: GroupVPN credentials obtained from the Web site are copied in to instantiate virtual machines, each of which receives a virtual IP via DHCP (5.5.1.1, 5.5.1.2, ...).

Software stack supporting scientific simulations (data mining and data analysis):
- Applications: Life Sciences, Physics, Information Retrieval, Social Network; Smith-Waterman dissimilarities, CAP3 gene assembly, PhyloD, High Energy Physics, Clustering, Multidimensional Scaling, Generative Topographic Mapping
- Services and Workflow
- High Level Language
- Runtimes: cross-platform iterative MapReduce
- Messaging Middleware
- Infrastructure software: Windows Server HPC (bare-system), Linux HPC (bare-system), Amazon Cloud (virtualization), Azure Cloud (virtualization), Grid Appliance
- Hardware: storage and data parallel file system, CPU nodes, GPU nodes

Next steps:
- Development of a library of collectives to use at the Reduce phase: Broadcast and Gather are needed by current applications; discover other important collectives; implement them efficiently on each platform, especially Azure (see the interface sketch at the end of this section).
- Better software message routing with broker networks, using asynchronous I/O with communication fault tolerance.
- Support nearby location of data and computing using data parallel file systems.
- Clearer application fault tolerance model based on implicit synchronization points at iteration end points.
- Later: investigate GPU support.
- Later: a runtime for data parallel languages such as Sawzall, Pig Latin, and LINQ.

Four application classes:
(a) Map Only: CAP3 analysis, Smith-Waterman distances, parametric sweeps, PolarGrid Matlab data analysis.
(b) Classic MapReduce: High Energy Physics (HEP) histograms, distributed search, distributed sorting, information retrieval.
(c) Iterative MapReduce: expectation maximization clustering (e.g., K-means), linear algebra, multidimensional scaling, PageRank.
(d) Loosely Synchronous: many MPI scientific applications, such as solving differential equations and particle dynamics.
Classes (a)-(c) form the domain of MapReduce and its iterative extensions; class (d) is the domain of MPI.
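To make the iterative pattern in (c) concrete, here is a minimal single-process sketch of the Job Start -> Map/Combine -> Reduce -> Merge -> "Add Iteration?" loop from the flow diagram above, using K-means as the running example. The class and method names are illustrative and are not the Twister or Twister4Azure API; in the real runtimes the per-partition work runs as map tasks over cached data and only the centroids are re-broadcast each iteration.

    import java.util.List;

    // Illustrative sketch only -- not the Twister/Twister4Azure API.
    public class IterativeKMeansSketch {

        // Drives the loop: run map/combine over cached partitions, reduce/merge the
        // partial sums into new centroids, then take the "Add Iteration?" decision.
        public static double[][] run(List<double[][]> cachedPartitions, double[][] centroids,
                                     int maxIterations, double tolerance) {
            for (int iter = 0; iter < maxIterations; iter++) {
                int k = centroids.length;
                int dim = centroids[0].length;
                double[][] sums = new double[k][dim];
                long[] counts = new long[k];

                // Map + Combine: each cached partition assigns its points to the
                // nearest centroid and accumulates per-centroid partial sums.
                for (double[][] partition : cachedPartitions) {
                    for (double[] point : partition) {
                        int nearest = nearestCentroid(point, centroids);
                        counts[nearest]++;
                        for (int d = 0; d < dim; d++) {
                            sums[nearest][d] += point[d];
                        }
                    }
                }

                // Reduce + Merge: average the partial sums into updated centroids.
                double[][] updated = new double[k][dim];
                double shift = 0.0;
                for (int c = 0; c < k; c++) {
                    for (int d = 0; d < dim; d++) {
                        updated[c][d] = counts[c] > 0 ? sums[c][d] / counts[c] : centroids[c][d];
                        shift = Math.max(shift, Math.abs(updated[c][d] - centroids[c][d]));
                    }
                }

                // "Add Iteration?" decision point.
                if (shift < tolerance) {
                    return updated;      // No further change -> Job Finish
                }
                centroids = updated;     // Yes -> schedule the next iteration
            }
            return centroids;
        }

        private static int nearestCentroid(double[] point, double[][] centroids) {
            int best = 0;
            double bestDistance = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double distance = 0.0;
                for (int d = 0; d < point.length; d++) {
                    double delta = point[d] - centroids[c][d];
                    distance += delta * delta;
                }
                if (distance < bestDistance) {
                    bestDistance = distance;
                    best = c;
                }
            }
            return best;
        }
    }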
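For the collectives item under "Next steps" above, the interface below sketches the two operations the current applications already need. Everything here is hypothetical and named only for illustration, not part of the Twister code base: Broadcast delivers the loop-variant data (current centroids or MDS coordinates) to every worker before the map phase, Gather collects one partial result per worker at the Reduce phase, and each platform (broker network, Azure queues/tables, and so on) would supply its own implementation.

    import java.io.Serializable;
    import java.util.List;

    // Hypothetical collectives interface -- not part of the Twister code base.
    public interface Collectives {

        // Driver side: push the loop-variant data to every worker before the map phase.
        <T extends Serializable> void broadcast(T value) throws Exception;

        // Worker side: receive the value broadcast for the current iteration.
        <T extends Serializable> T receiveBroadcast() throws Exception;

        // Worker side: contribute this worker's partial result for the Reduce phase.
        <T extends Serializable> void send(T partialResult) throws Exception;

        // Driver side: gather one partial result per worker.
        <T extends Serializable> List<T> gather() throws Exception;
    }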