SALSA Group Research Activities April 27, 2011 Research Overview MapReduce Runtime Twister Azure MapReduce Dryad and Parallel Applications NIH Projects Bioinformatics Workflow Data Visualization – GTM/MDS/PlotViz Education.
SALSA Group Research Activities
April 27, 2011
Research Overview
MapReduce Runtime
Twister
Azure MapReduce
Dryad and Parallel Applications
NIH Projects
Bioinformatics
Workflow
Data Visualization – GTM/MDS/PlotViz
Education
Twister & Azure MapReduce
What is Twister?
Twister is an Iterative MapReduce framework which supports:
• Customized static input data partition
• Cacheable map/reduce tasks
• Combining operation to converge intermediate outputs to the main program
• Fault recovery between iterations
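The model above (cached static input partitions, map/reduce per iteration, a combine step feeding results back to the main program, iterating until convergence) can be sketched with 1-D k-means in plain Python. This is an illustrative sketch, not Twister's actual API; all names are assumptions.

```python
def kmeans_map(points, centroids):
    """Map: assign each point in a cached partition to its nearest centroid."""
    assignments = {}
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
        assignments.setdefault(idx, []).append(p)
    return assignments

def combine(partials):
    """Combine: converge the per-partition map outputs for the main program."""
    merged = {}
    for part in partials:
        for i, ps in part.items():
            merged.setdefault(i, []).extend(ps)
    return merged

def kmeans_reduce(merged):
    """Reduce: a new centroid is the mean of the points assigned to it."""
    return {i: sum(ps) / len(ps) for i, ps in merged.items()}

def run_iterative(partitions, centroids, tol=1e-6, max_iter=100):
    # The partitions stay cached across iterations (Twister's static data);
    # only the small, changing centroid list is re-broadcast each round.
    for _ in range(max_iter):
        partials = [kmeans_map(part, centroids) for part in partitions]
        means = kmeans_reduce(combine(partials))
        updated = [means.get(i, c) for i, c in enumerate(centroids)]
        if max(abs(a - b) for a, b in zip(centroids, updated)) < tol:
            return updated
        centroids = updated
    return centroids

partitions = [[1.0, 1.2, 0.9], [10.0, 10.5, 9.8]]  # two cached partitions
final = run_iterative(partitions, [0.0, 5.0])
```

The key point is what crosses the network per iteration: the large partitions are loaded and cached once, while only the small centroid vector is redistributed each round.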
Twister Programming Model
Twister Architecture
Applications and Performance
MapReduceRoles for Azure
• MapReduce framework for the Azure cloud
• Built using highly-available and scalable Azure cloud services
  • Distributed, highly scalable & highly available services
  • Minimal management / maintenance overhead
  • Reduced footprint
• Hides the complexity of the cloud & cloud services from the users
• Co-exists with the eventual consistency & high latency of cloud services
• Decentralized control: avoids a single point of failure
MapReduceRoles for Azure
• Supports dynamically scaling the compute resources up and down
• Fault tolerance
• Combiner step
• Web-based monitoring console
• Easy testing and deployment
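One way fault tolerance can fall out of queue-based scheduling on Azure-style cloud queues: a dequeued task message stays invisible for a visibility timeout and reappears if the worker never deletes it, so a task lost to a crashed worker is retried automatically (tasks must therefore be idempotent). A toy simulation of that mechanism (not the Azure SDK; all names are assumptions):

```python
import heapq

class VisibilityQueue:
    """Toy queue with Azure-style visibility timeouts (simulation only)."""
    def __init__(self):
        self.entries = []   # heap of (visible_at, tiebreak, task)
        self.counter = 0

    def put(self, task, now=0):
        heapq.heappush(self.entries, (now, self.counter, task))
        self.counter += 1

    def get(self, now, timeout):
        # Pop a visible task and hide it again until now + timeout.
        if self.entries and self.entries[0][0] <= now:
            _, _, task = heapq.heappop(self.entries)
            heapq.heappush(self.entries, (now + timeout, self.counter, task))
            self.counter += 1
            return task
        return None

    def delete(self, task):
        # A worker deletes the message only after finishing the task.
        self.entries = [e for e in self.entries if e[2] != task]
        heapq.heapify(self.entries)

q = VisibilityQueue()
q.put("map-7")
t1 = q.get(now=0, timeout=30)    # worker A takes the task...
# ...worker A crashes and never calls q.delete("map-7")
t2 = q.get(now=10, timeout=30)   # still hidden: other workers see nothing
t3 = q.get(now=31, timeout=30)   # timeout elapsed: the task reappears
```

Because no worker is special and no master tracks leases, this retry behavior needs no centralized failure detector, which matches the decentralized-control point above.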
Twister for Azure
[Architecture diagram: at job start, map tasks are posted to a scheduling queue and a job bulletin board. Map workers (Map 1 … Map n), each with an in-memory data cache, run the map and combine steps; progress is tracked in a map task table (MapID, status). Reduce workers (Red 1 … Red n) run reduce and merge. After the merge step, if another iteration is needed, the new iteration is hybrid-scheduled; otherwise the job finishes. Worker roles provide task monitoring and role monitoring.]
Iterative MapReduce framework for the Microsoft Azure cloud.
• Merge step
• In-memory caching of static data
• Cache-aware hybrid scheduling using queues as well as a bulletin board
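The cache-aware hybrid scheduling can be sketched: a worker first scans the bulletin board for tasks whose static input partition it already holds in its in-memory cache, and only falls back to the plain FIFO queue on a cache miss. This is an illustrative sketch; the data structures and names are assumptions, not the Twister4Azure implementation.

```python
from collections import deque

def pick_task(worker_cache, bulletin_board, queue):
    """Prefer a task whose input partition this worker has cached."""
    for task in list(bulletin_board):
        if task["partition"] in worker_cache:
            bulletin_board.remove(task)   # claim it on the board
            return task
    # Cache miss: fall back to the regular scheduling queue.
    return queue.popleft() if queue else None

board = [{"id": 1, "partition": "p3"}, {"id": 2, "partition": "p9"}]
queue = deque([{"id": 3, "partition": "p1"}])
t = pick_task(worker_cache={"p9"}, bulletin_board=board, queue=queue)
```

Across iterations this keeps each map task on the worker that already cached its static data, so only the first iteration pays the data-loading cost.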
Performance Comparisons
[Performance charts: Kmeans with and without data caching; Kmeans scaling speedup; Kmeans with an increasing number of iterations (relative parallel efficiency vs. num instances × num data points, 8×16M up to 64×128M); Cap3 sequence assembly (parallel efficiency vs. number of query files, 128–728); Smith-Waterman sequence alignment (adjusted time vs. num cores × num blocks); BLAST sequence search (vs. num cores × num files). Series compared across the charts: Twister4Azure, Hadoop-Blast, DryadLINQ-Blast, Amazon EMR, Apache Hadoop.]
Dryad & Parallel Applications
DryadLINQ CTP Evaluation
The beta version was released in December 2010.
Motivation:
Evaluate the key features and interfaces in DryadLINQ
Study the parallel programming model in DryadLINQ
Three applications:
SW-G bioinformatics application
Matrix-matrix multiplication
PageRank
Parallel programming model
DryadLINQ stores input data as DistributedQuery<T> objects.
It splits distributed objects into partitions with the following APIs: AsDistributed(), RangePartition().
Evaluation ran on a Windows HPC Server 2008 R2 cluster.
Common LINQ providers:
Provider             Base class
LINQ-to-objects      IEnumerable<T>
PLINQ                ParallelQuery<T>
LINQ-to-SQL          IQueryable<T>
LINQ-to-?            IQueryable<T>
DryadLINQ            DistributedQuery<T>

[Architecture diagram: a workstation computer running the DryadLINQ provider, DSC client service, and HPC client utilities submits jobs to the head node, which hosts the HPC Job Scheduler service, the DSC service, and the Dryad graph manager; vertices 1 … n execute on compute nodes that hold the data partitions.]
SW-G bioinformatics application
Workload balance issue:
SW-G tasks are inhomogeneous in CPU time.
A skewed distribution of the input data causes imbalanced workload distribution.
Randomizing the distributed input data can alleviate this issue.
Static and dynamic optimization in Dryad/DryadLINQ.
[Chart: skewed/randomized execution-time ratio (0–3.5) for the previous Dryad vs. the new Dryad, at a mean sequence length of 400 with standard deviations varying from 0 to 250.]
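The effect of randomizing skewed input can be sketched numerically: alignment cost grows roughly quadratically with sequence length, so sorted (skewed) blocks concentrate the long sequences in a few partitions, while shuffling spreads the cost. The quadratic cost model and the numbers below are illustrative assumptions.

```python
import random

def partition_cost_spread(lengths, n_parts):
    """Ratio of max to mean per-partition cost when sequences are
    chunked in order; cost per sequence is modeled as length**2."""
    size = len(lengths) // n_parts
    costs = [sum(l ** 2 for l in lengths[i * size:(i + 1) * size])
             for i in range(n_parts)]
    return max(costs) / (sum(costs) / n_parts)

random.seed(0)
# Sorted lengths stand in for skewed input distribution.
lengths = sorted(random.gauss(400, 100) for _ in range(4000))
skewed = partition_cost_spread(lengths, 8)

shuffled = list(lengths)
random.shuffle(shuffled)
randomized = partition_cost_spread(shuffled, 8)
# The randomized spread stays close to 1; sorted input is clearly worse.
```

A spread near 1 means all partitions finish at about the same time; with sorted input, the partition holding the longest sequences dominates the job's makespan.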
Matrix-Matrix Multiplication
Parallel programming algorithms:
Row split
Row-column split
2-dimensional block decomposition (Fox algorithm)
Multi-core technologies in .NET: TPL, PLINQ, thread pool
Hybrid parallel model: port multi-core technologies into Dryad tasks to improve performance
[Chart: execution time (0–250 s) for Fox-DSC, RowColumn-DSC, and RowSplit-DSC, each with TPL, Thread, Task, and PLINQ variants.]
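The row-split decomposition above can be sketched: each task multiplies one horizontal block of A by all of B, and a thread pool stands in for Dryad vertices running their in-vertex multi-core code. An illustrative sketch with assumed names, not the DryadLINQ implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(a_block, b):
    """Multiply a horizontal block of A by the full matrix B."""
    return [[sum(a_ik * b[k][j] for k, a_ik in enumerate(row))
             for j in range(len(b[0]))]
            for row in a_block]

def row_split_matmul(a, b, n_tasks=2):
    # Row split: partition A into n_tasks row blocks; B is replicated.
    size = (len(a) + n_tasks - 1) // n_tasks
    blocks = [a[i:i + size] for i in range(0, len(a), size)]
    with ThreadPoolExecutor() as pool:   # stands in for Dryad vertices
        results = pool.map(matmul_rows, blocks, [b] * len(blocks))
    # Concatenating the row blocks in order recovers the full product.
    return [row for block in results for row in block]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(row_split_matmul(a, b))  # [[19, 22], [43, 50]]
```

Row split replicates all of B to every task; the row-column and Fox (2-D block) decompositions trade that replication for extra communication rounds, which is what the chart above compares.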
PageRank
Grouped Aggregation
A core primitive of many distributed programming models.
Two stages: 1) partition the data into groups by some keys; 2) perform an aggregation over each group.
DryadLINQ provides two types of grouped aggregation:
GroupBy(), without partial aggregation optimization.
GroupAndAggregate(), with partial aggregation.
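The difference between the two can be sketched generically: plain GroupBy shuffles every record across the network before aggregating, while partial (combiner-style) aggregation pre-reduces within each partition so only one partial per key per partition moves, as GroupAndAggregate does. A generic sketch, not the DryadLINQ API:

```python
from collections import Counter

def group_by_then_aggregate(partitions):
    """Plain GroupBy: move every (key, value) record, then aggregate."""
    shuffled = [kv for part in partitions for kv in part]  # full shuffle
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value
    return totals, len(shuffled)       # also report records moved

def partial_then_aggregate(partitions):
    """Partial aggregation: pre-reduce per partition, shuffle partials."""
    partials = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value        # combine locally first
        partials.extend(local.items())  # far fewer records move
    totals = Counter()
    for key, value in partials:
        totals[key] += value
    return totals, len(partials)

parts = [[("a", 1), ("a", 2), ("b", 3)], [("a", 4), ("b", 5), ("b", 6)]]
full, moved_full = group_by_then_aggregate(parts)
partial, moved_partial = partial_then_aggregate(parts)
# Same totals either way, but partial aggregation shuffles fewer records.
```

This only works when the aggregation is decomposable (associative and commutative, like sum or count), which is why the two operators exist separately.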
[Chart: PageRank execution time in seconds (0–3500) vs. number of am files (320–1280) for GroupAndAggregate, TwoApplyPerPartition, OneApplyPerPartition, GroupBy, and HierarchicalAggregation.]
NIH Projects
Sequence Clustering
[Pipeline diagram: gene sequences → pairwise alignment & distance calculation (Smith-Waterman / Needleman-Wunsch with Kimura2 / Jukes-Cantor / Percent-Identity distances; MPI.NET implementation) → distance matrix. The distance matrix feeds both pairwise clustering (MPI.NET implementation), producing cluster indices, and multidimensional scaling (Chi-Square / deterministic annealing; MPI.NET implementation), producing 3D plot coordinates. Visualization is a C# desktop application based on VTK.]
* Note: the implementations of the Smith-Waterman and Needleman-Wunsch algorithms are from the Microsoft Biology Foundation library.
Scale-up Sequence Clustering with Twister
[Pipeline diagram: gene sequences (N = 1 million; e.g. 25 million) → select a reference sequence set (M = 100K) → pairwise alignment & distance calculation over the reference set, O(M×M) → distance matrix → multidimensional scaling (MDS) → reference coordinates (x, y, z). The remaining N−M sequence set (900K) goes through interpolative MDS with pairwise distance calculation, O(M×(N−M)), to produce the N−M coordinates (x, y, z); all coordinates feed the 3D plot visualization.]
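The interpolation step places each remaining point against the M fixed reference coordinates only, which is where the O(M×(N−M)) cost comes from, instead of the O(N²) of a full MDS. A minimal majorization-style sketch for one new point in 2-D (illustrative; the coordinates, names, and use of exact distances are assumptions, not the production algorithm):

```python
import math

def interpolate_point(refs, deltas, iters=100):
    """Place one new 2-D point, given target distances (deltas) to fixed
    reference coordinates (refs), by iterative majorization over refs only."""
    x = [sum(r[0] for r in refs) / len(refs),
         sum(r[1] for r in refs) / len(refs)]       # start at the centroid
    for _ in range(iters):
        sx = sy = 0.0
        for (rx, ry), d in zip(refs, deltas):
            dist = math.hypot(x[0] - rx, x[1] - ry) or 1e-12
            # Each reference pulls/pushes x toward its target distance d.
            sx += rx + d * (x[0] - rx) / dist
            sy += ry + d * (x[1] - ry) / dist
        x = [sx / len(refs), sy / len(refs)]
    return x

refs = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
# Distances measured to a point that truly sits at (4, 3).
deltas = [5.0, 3.0, 4.0]
pos = interpolate_point(refs, deltas)   # converges close to (4, 3)
```

Since the reference coordinates never move, the N−M interpolations are independent of one another and parallelize trivially across map tasks.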
Services and Support
Web Portal and Metadata Management
CGB work
GTM vs. MDS
                      GTM                                MDS (SMACOF)
Purpose               Non-linear dimension reduction; find an optimal configuration in a lower dimension; iterative optimization method (both)
Input                 Vector-based data                  Non-vector (pairwise similarity matrix)
Objective function    Maximize log-likelihood            Minimize STRESS or SSTRESS
Complexity            O(KN) (K << N)                     O(N²)
Optimization method   EM                                 Iterative Majorization (EM-like)
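SMACOF's iterative majorization can be made concrete: each sweep applies the (unweighted) Guttman transform to the current configuration, and the STRESS value never increases. A pure-Python 2-D toy sketch, with assumed names and no weights:

```python
import math

def stress(delta, X):
    """STRESS: sum of squared (target - embedded) distance errors."""
    n = len(X)
    return sum((delta[i][j] - math.dist(X[i], X[j])) ** 2
               for i in range(n) for j in range(i + 1, n))

def guttman_step(delta, X):
    """One unweighted SMACOF sweep (Guttman transform)."""
    n = len(X)
    newX = []
    for i in range(n):
        px = py = 0.0
        for j in range(n):
            if j == i:
                continue
            d = math.dist(X[i], X[j]) or 1e-12
            px += delta[i][j] * (X[i][0] - X[j][0]) / d
            py += delta[i][j] * (X[i][1] - X[j][1]) / d
        newX.append((px / n, py / n))
    return newX

# Target dissimilarities from a known 2-D configuration (unit square).
truth = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
delta = [[math.dist(p, q) for q in truth] for p in truth]

X = [(0.1, 0.0), (0.9, 0.2), (0.2, 1.1), (1.0, 0.9)]  # perturbed start
for _ in range(500):
    X = guttman_step(delta, X)
# The final configuration reproduces the target distances (STRESS near 0).
```

The "EM-like" label in the table fits: like EM, each sweep minimizes a simpler surrogate function that touches the true objective at the current iterate, guaranteeing monotone descent.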
PlotViz
Light-weight client
Visualization algorithms (parallel dimension reduction algorithms)
Aggregated public databases: DrugBank, CTD, QSAR, PubChem, Chem2Bio2RDF
Education
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrates the concept of Science on Clouds on FutureGrid.
Dynamic Cluster Architecture: SW-G runs using Hadoop (on bare-system Linux and on Linux on Xen) and using DryadLINQ (on bare-system Windows Server 2008), provisioned by the XCAT infrastructure over 32 iDataplex bare-metal nodes, with a monitoring infrastructure on top.
Monitoring & Control Infrastructure: a monitoring interface connected via a pub/sub broker network to the virtual/physical clusters (XCAT infrastructure, iDataplex bare-metal nodes), with summarizer and switcher components.
http://salsahpc.indiana.edu/b534
http://salsahpc.indiana.edu/b534projects