Clouds and Grids, Multicore and all that
GADA Panel, November 14, 2008
Geoffrey Fox, Community Grids Laboratory, School of Informatics, Indiana University
[email protected], http://www.infomall.org
Grids become Clouds
• Grids solve the problem of too little computing: we need to harness all the world's computers to do Science.
• Clouds solve the problem of too much computing: with multicore we have so much power that we need to use it effectively to solve users' problems on "designed (maybe homogeneous)" hardware.
• One new technology: Virtual Machines enable more dynamic, flexible environments but are not clearly essential.
• Is a Virtual Cluster or a Virtual Machine the right way to think?
• Virtualization is pretty inconsistent with parallel computing, as virtualization makes it hard to use correct algorithms and a correct runtime respecting locality and "reality".
• 2 cores in a chip need very different algorithms/software than 2 cores in separate chips.
• Clouds naturally address "embarrassingly parallel" or "pleasingly parallel" workflows/processes – MPI is invoked outside the cloud.
Old Issues
• Essentially all "vastly" parallel applications are data parallel, including the algorithms in Intel's RMS analysis of future multicore "killer apps":
  • Gaming (Physics) and Data mining ("iterated linear algebra")
• So MPI works (Map is normal SPMD; Reduce is MPI_Reduce) but may not be the highest performance or easiest to use – a minimal sketch of the pattern follows below.
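The slides name this pattern but show no code; here is a minimal C# sketch of SPMD "map" work followed by a reduce, using the MPI.NET bindings (the binding choice and exact method names are an assumption, not something the slides specify):

```csharp
using System;
using MPI;   // MPI.NET bindings (assumed); any MPI binding shows the same pattern

class MapThenReduce
{
    static void Main(string[] args)
    {
        using (new MPI.Environment(ref args))
        {
            Intracommunicator comm = Communicator.world;

            // "Map": every rank runs the same SPMD code on its own slice of the data.
            double localSum = 0.0;
            for (int i = comm.Rank; i < 1000000; i += comm.Size)
                localSum += Math.Sqrt(i);          // stand-in for real per-point work

            // "Reduce": one collective combines the partial results at rank 0.
            double globalSum = comm.Reduce(localSum, Operation<double>.Add, 0);

            if (comm.Rank == 0)
                Console.WriteLine("Total = " + globalSum);
        }
    }
}
```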
Some new issues
• Clouds have commercial software; Grids don't.
• There is overhead in using virtual machines (if your cloud, like Amazon's, uses them).
• There are dynamic, fault tolerance features favoring MapReduce, Hadoop and Dryad.
• No new ideas, but several new powerful systems.
• We are developing scientifically interesting codes in C#, C++ and Java and using them to compare cores, nodes, VM vs. no VM, and programming models.
Intel’s Application Stack
Gartner 2006 Technology Hype Curve
Gartner 2007 Technology Hype Curve
No Grids! Sensor Nets and Web 2.0 appear.
Gartner 2008 Technology Hype Curve
Clouds, Microblogs and Green IT appear; basic Web Services, Wikis and SOA are becoming mainstream.
QuakeSpace
QuakeSim built using Web 2.0 and Cloud Technology
• Applications, Sensors, Data Repositories as Services
• Computing via Clouds
• Portals as Gadgets
• Metadata by tagging
• Data sharing as in YouTube
• Alerts by RSS
• Virtual Organizations via Social Networking
• Workflow by Mashups
• Performance by multicore
• Interfaces via iPhone, Android etc.
Sensor Clouds
Note: sensors are any time dependent source of information, and a fixed source of information is just a broken sensor.
• SAR Satellites
• Environmental Monitors
• Nokia N800 pocket computers
• Presentation of a teacher in distance education
• Text chats of students
• Cell phones
• RFID tags and readers
• GPS Sensors
• Lego Robots
• RSS Feeds
• Audio/video: web-cams
Sensor clouds are naturally implemented with dynamic proxies in the Cloud that filter, archive, queue and distribute. We have an initial EC2 implementation.
The Sensors on the Fun Grid
• Laptop for PowerPoint
• 2 Robots used
• Lego Robot
• GPS
• Nokia N800
• RFID Tag
• RFID Reader
Nimbus Cloud – MPI Performance
• Graph 1 (left): Kmeans clustering time vs. the number of 2D data points (both axes in log scale) – MPI implementation of the Kmeans clustering algorithm.
• Graph 2 (right): Kmeans clustering time (for 100,000 data points) vs. the number of iterations of each MPI communication routine – MPI implementation of Kmeans modified to perform each MPI communication up to 100 times.
• Performed using 8 MPI processes running on 8 compute nodes, each with AMD Opteron™ processors (2.2 GHz and 3 GB of memory).
• Note the large fluctuations in VM-based runtime – this implies terrible scaling.
Nimbus Kmeans time in seconds for X=100 MPI calls (figure A):
  Setup 1 (VM):   VM_MIN 4.857,  VM_Average 12.070, VM_MAX 24.255
  Setup 2 (VM):   VM_MIN 5.067,  VM_Average 9.262,  VM_MAX 24.142
  Setup 3 (VM):   VM_MIN 7.736,  VM_Average 17.744, VM_MAX 32.922
  Direct (no VM): MIN 2.058,     Average 2.069,     MAX 2.112
The direct (no VM) times cluster tightly, between roughly 2.05 and 2.13 seconds.
Test setups (cores assigned per node):
  Setup 1: 2 cores to the VM OS (domU), 2 cores to the host OS (dom0)
  Setup 2: 1 core to the VM OS (domU), 2 cores to the host OS (dom0)
  Setup 3: 1 core to the VM OS (domU), 1 core to the host OS (dom0)
MPI on Eucalyptus Public Cloud
Kmeans Time for 100 iterations
• Average Kmeans clustering time vs. the number of iterations of each MPI communication routine.
• 4 MPI processes on 4 VM instances were used.
Configuration (VM):
  CPU and Memory: Intel(R) Xeon(TM) CPU 3.20 GHz, 128 MB memory
  Virtual Machine: Xen virtual machine (VMs)
  Operating System: Debian Etch
  gcc: gcc version 4.1.1
  MPI: LAM 7.1.4 / MPI 2
  Network: –

MPI Time (seconds): VM_MIN 7.056, VM_Average 7.417, VM_MAX 8.152
We will redo this on larger dedicated hardware, used for direct (no VM), Eucalyptus and Nimbus runs.
Consider a Collection of Computers
We can have various hardware:
• Multicore – shared memory, low latency
• High quality cluster – distributed memory, low latency
• Standard distributed system – distributed memory, high latency
We can program the coordination of these units by:
• Threads on cores
• MPI on cores and/or between nodes
• MapReduce/Hadoop/Dryad…/AVS for dataflow
• Workflow linking services
• These can all be considered as some sort of execution unit exchanging messages with some other unit
And there are higher level programming models such as OpenMP, PGAS and the HPCS Languages.
Data Parallel Run Time Architectures
(Figure: three data parallel runtime architectures.) MPI is long running processes with rendezvous for message exchange/synchronization over ports carrying short or long messages. Yahoo Hadoop uses short running processes communicating via disk (HTTP) and tracking processes (Trackers). CGL MapReduce uses long running processes exchanging short messages (pipes).
Is Dataflow the answer?
• For functional parallelism, dataflow is natural as one moves from one step to another.
• For much data parallelism one needs "deltaflow" – send change messages to long running processes/threads, as in MPI or any rendezvous model.
• Potentially a huge reduction in communication cost: for threads there is no difference, but for processes a big one.
• Overhead is Communication/Computation; dataflow overhead is proportional to the problem size N per process.
• For solution of PDEs, deltaflow communication grows only like the surface while computation grows like N, so the overhead falls like N^(-1/3); dataflow's overhead does not fall, which is why dataflow is not popular in scientific computing.
• For matrix multiplication, deltaflow and dataflow are both O(N) communication while computation is N^1.5.
• MapReduce noted that several data analysis algorithms can use dataflow (especially in Information Retrieval).
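For concreteness, here is the standard surface-to-volume argument behind those scalings (a sketch assuming a 3D domain decomposition; the slide itself does not show the derivation):

\[
\text{computation} \propto N, \qquad \text{deltaflow (halo) communication} \propto N^{2/3}, \qquad \text{dataflow communication} \propto N ,
\]
\[
f \;=\; \frac{\text{communication}}{\text{computation}} \;\Rightarrow\; f_{\text{deltaflow}} \sim N^{-1/3} \to 0 \ \text{as the grain size } N \text{ grows}, \qquad f_{\text{dataflow}} \sim 1 .
\]

For matrix multiplication with $N$ elements per process, computation $\propto N^{3/2}$ while both styles move $O(N)$ data, so even dataflow's overhead falls like $N^{-1/2}$ there.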
MapReduce as implemented by Hadoop: map(key, value) and reduce(key, list&lt;value&gt;).
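Hadoop itself is Java, but the signatures are language neutral; a self-contained C# toy (word count, with a trivial in-memory "shuffle") shows the shape of map and reduce. This is illustrative only, not the Hadoop API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class WordCount
{
    // map(key, value): emit (word, 1) for every word in the document.
    static IEnumerable<KeyValuePair<string, int>> Map(string docId, string text)
    {
        foreach (var w in text.Split(' '))
            yield return new KeyValuePair<string, int>(w, 1);
    }

    // reduce(key, list<value>): sum the counts for one word.
    static int Reduce(string word, IEnumerable<int> counts) => counts.Sum();

    static void Main()
    {
        var mapped = Map("doc1", "clouds grids clouds multicore");

        // The runtime's shuffle: group intermediate pairs by key, then reduce each group.
        foreach (var g in mapped.GroupBy(kv => kv.Key, kv => kv.Value))
            Console.WriteLine($"{g.Key}: {Reduce(g.Key, g)}");
    }
}
```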
Kmeans Clustering
MapReduce for Kmeans Clustering:
• Kmeans clustering execution time vs. the number of 2D data points (both axes are in log scale).
• All four implementations perform the same Kmeans clustering algorithm.
• Each test is performed using 5 compute nodes (total of 40 processor cores).
• CGL-MapReduce shows performance close to the MPI and Threads implementations.
• Hadoop's high execution time is due to:
  • Lack of support for iterative MapReduce computation
  • Overhead associated with the file system based communication
(A map/reduce sketch of one Kmeans iteration follows the figure annotations below.)
(Figure annotations: HADOOP; factor of 10³; in-memory MapReduce; MPI; factor of 30; x-axis: number of data points.)
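As referenced above, here is a minimal sketch of one Kmeans iteration expressed as map and reduce, in plain C#. It is illustrative only and is not the benchmarked Hadoop, CGL-MapReduce or MPI code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KmeansStep
{
    // map: assign each 2D point in a split to its nearest current centroid and
    // emit (centroidIndex, (partialSumX, partialSumY, count)).
    public static IEnumerable<(int k, (double x, double y, int n) part)> Map(
        IEnumerable<(double x, double y)> split, (double x, double y)[] centroids)
    {
        foreach (var p in split)
        {
            int best = 0;
            double bestD = double.MaxValue;
            for (int k = 0; k < centroids.Length; k++)
            {
                double dx = p.x - centroids[k].x, dy = p.y - centroids[k].y;
                double d = dx * dx + dy * dy;
                if (d < bestD) { bestD = d; best = k; }
            }
            yield return (best, (p.x, p.y, 1));
        }
    }

    // reduce: combine the partial sums emitted for one centroid into its new position.
    public static (double x, double y) Reduce(int k, IEnumerable<(double x, double y, int n)> parts)
    {
        var list = parts.ToList();
        int n = list.Sum(t => t.n);
        return (list.Sum(t => t.x) / n, list.Sum(t => t.y) / n);
    }
}
```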
CGL-MapReduce
(Figure: Architecture of CGL-MapReduce – the User Program and MR Driver are connected through a Content Dissemination Network to worker nodes, each running an MRDaemon with Map and Reduce workers; data splits are read from and written to the file system.)
• A streaming based MapReduce runtime implemented in Java.
• All communications (control and intermediate results) are routed via a content dissemination network.
• Intermediate results are directly transferred from the map tasks to the reduce tasks – eliminating local files.
• MRDriver maintains the state of the system and controls the execution of map/reduce tasks.
• The User Program is the composer of MapReduce computations.
• Supports both stepped (dataflow) and iterative (deltaflow) MapReduce computations – see the driver-loop sketch below.
• All communication uses publish-subscribe "queues in the cloud", not MPI.
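To illustrate the "iterative (deltaflow)" point, here is a schematic driver loop over the Map/Reduce helpers sketched earlier. The class and method names are hypothetical; this is not the CGL-MapReduce API, whose classes the slide does not spell out:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class KmeansDriver
{
    // Schematic iterative MapReduce driver. In Hadoop every pass would be a new job
    // communicating through the file system; an iterative runtime keeps the map and
    // reduce workers alive and streams intermediate results directly between them.
    public static (double x, double y)[] Run(
        List<List<(double x, double y)>> splits,      // one data split per map worker
        (double x, double y)[] centroids,             // initial cluster centres
        int maxIterations)
    {
        for (int iter = 0; iter < maxIterations; iter++)
        {
            var intermediate = splits.SelectMany(s => KmeansStep.Map(s, centroids));

            centroids = intermediate
                .GroupBy(t => t.k, t => t.part)       // the "shuffle": group by centroid index
                .OrderBy(g => g.Key)
                .Select(g => KmeansStep.Reduce(g.Key, g))
                .ToArray();
        }
        return centroids;
    }
}
```

The point of the design is that the loop reuses the same tasks and centroids stay in memory between iterations, which is exactly what file-system-based MapReduce implementations make expensive.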
Particle Physics (LHC) Data Analysis
• Data: up to 1 terabyte of data, placed in the IU Data Capacitor.
• Processing: 12 dedicated computing nodes from Quarry (total of 96 processing cores).
• MapReduce for LHC data analysis: execution time vs. the volume of data (fixed compute resources).
• Hadoop and CGL-MapReduce both show similar performance.
• The amount of data accessed in each analysis is extremely large, so performance is limited by the I/O bandwidth.
• The overhead induced by the MapReduce implementations has a negligible effect on the overall computation.
LHC Data Analysis Scalability and Speedup
Execution time vs. the number of compute nodes (fixed data)
• 100 GB of HEP data; one core of each node is used.
• Speedup = Sequential Time / MapReduce Time.
• Performance is limited by the I/O bandwidth, so the speed gain diminishes after a certain number of parallel processing units (after around 10 units).
(Chart: Speedup for 100 GB of HEP data.)
MPI outside the mainstream
• Multicore best practice and large scale distributed processing – not scientific computing – will drive the best concurrent/parallel computing environments.
• Party Line Parallel Programming Model: Workflow (parallel-distributed) controlling optimized library calls.
• Core parallel implementations are no easier than before; deployment is easier.
• MPI is wonderful, but it will be ignored in the real world unless simplified; there is competition from thread and distributed system technology.
• CCR from Microsoft – only ~7 primitives – is one possible commodity multicore driver: it is roughly active messages and runs MPI style codes fine on multicore.
Windows Thread Runtime System
We implement thread parallelism using Microsoft CCR (Concurrency and Coordination Runtime), as it supports both MPI rendezvous and dynamic (spawned) threading styles of parallelism: http://msdn.microsoft.com/robotics/
CCR supports exchange of messages between threads using named ports and has primitives like:
• FromHandler: spawn threads without reading ports.
• Receive: each handler reads one item from a single port.
• MultipleItemReceive: each handler reads a prescribed number of items of a given type from a given port. Note items in a port can be general structures, but all must have the same type.
• MultiplePortReceive: each handler reads one item of a given type from multiple ports.
CCR has fewer primitives than MPI but can implement MPI collectives efficiently. One can use DSS (Decentralized System Services), built in terms of CCR, for the service model. DSS has ~35 µs overhead and CCR a few µs.
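A minimal CCR sketch of the port and receive primitives just listed, based on the Microsoft.Ccr.Core API shipped with Microsoft Robotics Studio; treat the exact signatures as an assumption rather than documentation:

```csharp
using System;
using System.Linq;
using Microsoft.Ccr.Core;   // ships with Microsoft Robotics Studio / CCR & DSS Toolkit

class CcrSketch
{
    static void Main()
    {
        using (var dispatcher = new Dispatcher(0, "worker pool"))     // 0 => one thread per core
        using (var queue = new DispatcherQueue("work", dispatcher))
        {
            // Receive: run a handler for a single item posted to a port.
            var single = new Port<double>();
            Arbiter.Activate(queue,
                Arbiter.Receive(false, single, x => Console.WriteLine("got " + x)));

            // MultipleItemReceive: fire only after 4 items of the same type arrive on one
            // port - the building block for barrier / collective style synchronization.
            var partials = new Port<double>();
            Arbiter.Activate(queue,
                Arbiter.MultipleItemReceive(false, partials, 4,
                    items => Console.WriteLine("sum = " + items.Sum())));

            single.Post(3.14);
            for (int i = 0; i < 4; i++) partials.Post((double)i);

            Console.ReadLine();   // keep the process alive while the handlers run
        }
    }
}
```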
Parallel Overhead = 1/efficiency − 1 = P·T(P)/T(1) − 1 on P processors.
Deterministic Annealing Clustering scaled speedup tests on four 8-core systems, on Windows; 1,600,000 points per C# thread; 1, 2, 4, 8, 16, 32-way parallelism.
(Figure: parallel overhead for the 2-, 4-, 8-, 16- and 32-way cases; the x-axis enumerates combinations of nodes, MPI processes per node and CCR threads per process; the y-axis runs from 0 to about 0.20.)
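Spelled out, these are the standard definitions (only the final numerical reading is an illustration):

\[
\varepsilon(P) \;=\; \frac{T(1)}{P\,T(P)}, \qquad
f(P) \;=\; \frac{1}{\varepsilon(P)} - 1 \;=\; \frac{P\,T(P)}{T(1)} - 1 ,
\]

so $f(P)=0$ means perfect scaling, and, for example, an overhead of $f=0.1$ corresponds to a parallel efficiency of $1/1.1 \approx 91\%$.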
Deterministic Annealing for Pairwise Clustering
• Clustering is a well known data mining algorithm, with K-means the best known approach.
• Two ideas lead to new supercomputer data mining algorithms:
  • Use deterministic annealing to avoid local minima.
  • Do not use vectors (which are often not known) – use distances $\delta(i,j)$ between points $i, j$ in the collection. N = millions of points are available in Biology; the algorithms go like $N^2$.
• Developed (partially) by Hofmann and Buhmann in 1997, but with little or no application.
• Minimize
  $H_{PC} = 0.5 \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(i,j) \sum_{k=1}^{K} M_i(k)\,M_j(k)/C(k)$,
  where $K$ is the number of clusters, $M_i(k)$ is the probability that point $i$ belongs to cluster $k$, and $C(k) = \sum_{i=1}^{N} M_i(k)$ is the number of points in the $k$'th cluster.
• $M_i(k) \propto \exp(-\varepsilon_i(k)/T)$ with Hamiltonian $\sum_{i=1}^{N} \sum_{k=1}^{K} M_i(k)\,\varepsilon_i(k)$.
• Reduce T from large to small values to anneal.
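One consistent reading of the annealed assignment rule, since $M_i(k)$ must be a probability, is the standard Gibbs normalization (the slide shows only the unnormalized exponential):

\[
M_i(k) \;=\; \frac{\exp\!\left(-\varepsilon_i(k)/T\right)}{\sum_{k'=1}^{K} \exp\!\left(-\varepsilon_i(k')/T\right)},
\qquad \sum_{k=1}^{K} M_i(k) = 1 ,
\]

where the effective cost $\varepsilon_i(k)$ comes from minimizing $H_{PC}$; lowering $T$ sharpens these soft assignments into a hard clustering.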
N = 3000 sequences, each of length ~1000 features. Only pairwise distances are used. We will repeat with 0.1 to 0.5 million sequences on a larger machine. Implemented in C# with CCR and MPI.
Famous Lolcats
LOL is Internet Slang for Laughing out Loud