Clouds and Grids Multicore and all that. GADA Panel, November 14 2008. Geoffrey Fox, Community Grids Laboratory, School of Informatics, Indiana University. [email protected], http://www.infomall.org


Clouds and Grids Multicore and all that

GADA Panel, November 14 2008
Geoffrey Fox
Community Grids Laboratory, School of Informatics, Indiana University
[email protected]
http://www.infomall.org

1

Grids become Clouds

Grids solve the problem of too little computing: we need to harness all the world's computers to do science.
Clouds solve the problem of too much computing: with multicore we have so much power that we need to use it effectively to solve users' problems on "designed (maybe homogeneous)" hardware.
One new technology: Virtual Machines enable more dynamic, flexible environments, but they are not clearly essential.

Is a Virtual Cluster or a Virtual Machine the right way to think?

Virtualization is pretty inconsistent with parallel computing, as it makes it hard to use correct algorithms and a correct runtime respecting locality and "reality": 2 cores in one chip need very different algorithms/software than 2 cores in separate chips.
Clouds naturally address workflows of "embarrassingly/pleasingly parallel" processes – MPI is invoked outside the cloud.

2

Old Issues

Essentially all "vastly" parallel applications are data parallel, including the algorithms in Intel's RMS analysis of future multicore "killer apps": gaming (physics) and data mining ("iterated linear algebra").
So MPI works (Map is normal SPMD; Reduce is MPI_Reduce), but it may not be the highest performance or the easiest to use.

Some new issues

Clouds have commercial software; Grids don't.
There is overhead in using virtual machines (if your cloud, like Amazon, uses them).
There are dynamic, fault tolerance features favoring MapReduce: Hadoop and Dryad.
No new ideas, but several new powerful systems.
We are developing scientifically interesting codes in C#, C++ and Java, and using them to compare cores, nodes, VM vs. no VM, and programming models.

3

Intel’s Application Stack

Gartner 2006 Technology Hype Curve

5

Cyberinfrastructure Center for Polar Science (CICPS)

Gartner 2007 Technology Hype Curve

No Grids!
Sensor Nets and Web 2.0 appear

6


Gartner 2008 Technology Hype Curve

Clouds, Microblogs and Green IT appear; Basic Web Services, Wikis and SOA are becoming mainstream.

7

QuakeSpace

QuakeSim built using Web 2.0 and Cloud Technology

Applications, Sensors, Data Repositories as Services
Computing via Clouds
Portals as Gadgets
Metadata by tagging
Data sharing as in YouTube
Alerts by RSS
Virtual Organizations via Social Networking
Workflow by Mashups
Performance by multicore
Interfaces via iPhone, Android etc.

8

Sensor Clouds

Note sensors are any time dependent source of information; a fixed source of information is just a broken sensor:
• SAR Satellites
• Environmental Monitors
• Nokia N800 pocket computers
• Presentation of teacher in distance education
• Text chats of students
• Cell phones
• RFID tags and readers
• GPS Sensors
• Lego Robots
• RSS Feeds
• Audio/video: web-cams

Naturally implemented with dynamic proxies in the Cloud that filter, archive, queue and distribute. Have initial EC2 implementation.

9

The Sensors on the Fun Grid

• Laptop for PowerPoint
• 2 Robots used
• Lego Robot
• GPS
• Nokia N800
• RFID Tag
• RFID Reader

10

11

Nimbus Cloud – MPI Performance

Graph 1 (left): Kmeans clustering time vs. the number of 2D data points (both axes in log scale) – MPI implementation of the Kmeans clustering algorithm.
Graph 2 (right): Kmeans clustering time (for 100,000 data points) vs. the number of iterations of each MPI communication routine – MPI implementation of Kmeans modified to perform each MPI communication up to 100 times.

• Performed using 8 MPI processes running on 8 compute nodes, each with AMD Opteron™ processors (2.2 GHz and 3 GB of memory)
• Note the large fluctuations in VM-based runtime – this implies terrible scaling

Kmeans time in seconds for 100 MPI calls (X=100 of figure A):

Setup          | VM_MIN | VM_Average | VM_MAX
Setup 1        | 4.857  | 12.070     | 24.255
Setup 2        | 5.067  | 9.262      | 24.142
Setup 3        | 7.736  | 17.744     | 32.922

Direct (no VM) | MIN 2.058 | Average 2.069 | MAX 2.112

Test setup | # of cores to the VM OS (domU) | # of cores to the host OS (dom0)
1          | 2                              | 2
2          | 1                              | 2
3          | 1                              | 1
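As a hedged illustration (pure Python, MPI ranks emulated serially, data made up – not the benchmarked code), the per-iteration communication pattern these Kmeans tests time looks like this: each rank assigns its points to the nearest center and accumulates partial sums, then an MPI_Allreduce-style sum yields the new centers.

```python
# Sketch of one Kmeans iteration with MPI-style partial sums + Allreduce,
# emulated serially; each inner list of `rank_points` plays one MPI rank.

def local_partials(points, centers):
    """Partial per-cluster coordinate sums and counts for one rank's points."""
    sums = [[0.0, 0.0] for _ in centers]
    counts = [0] * len(centers)
    for x, y in points:
        k = min(range(len(centers)),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
        sums[k][0] += x
        sums[k][1] += y
        counts[k] += 1
    return sums, counts

def kmeans_step(rank_points, centers):
    """Emulated MPI_Allreduce: element-wise sum of every rank's partials."""
    tot_s = [[0.0, 0.0] for _ in centers]
    tot_c = [0] * len(centers)
    for pts in rank_points:
        s, c = local_partials(pts, centers)
        for k in range(len(centers)):
            tot_s[k][0] += s[k][0]
            tot_s[k][1] += s[k][1]
            tot_c[k] += c[k]
    return [[tot_s[k][0] / tot_c[k], tot_s[k][1] / tot_c[k]]
            if tot_c[k] else list(centers[k]) for k in range(len(centers))]

ranks = [[(0.0, 0.0), (0.2, 0.0)], [(4.0, 4.0), (4.2, 4.0)]]  # two "ranks"
new_centers = kmeans_step(ranks, [[0.0, 0.0], [5.0, 5.0]])
```

In the real benchmark the Allreduce is a single collective call whose latency inside a VM shows the large fluctuations noted above.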

MPI on Eucalyptus Public Cloud

Kmeans time for 100 iterations

• Average Kmeans clustering time vs. the number of iterations of each MPI communication routine
• 4 MPI processes on 4 VM instances were used

Configuration (VM):
CPU and Memory   | Intel(R) Xeon(TM) CPU 3.20GHz, 128MB memory
Virtual Machine  | Xen virtual machines (VMs)
Operating System | Debian Etch
gcc              | gcc version 4.1.1
MPI              | LAM 7.1.4 / MPI 2
Network          | –

Results (seconds): VM_MIN 7.056 | VM_Average 7.417 | VM_MAX 8.152

We will redo this on larger dedicated hardware, used for direct (no VM), Eucalyptus and Nimbus.

Consider a Collection of Computers

We can have various hardware:
• Multicore – shared memory, low latency
• High quality cluster – distributed memory, low latency
• Standard distributed system – distributed memory, high latency

We can program the coordination of these units by:
• Threads on cores
• MPI on cores and/or between nodes
• MapReduce/Hadoop/Dryad../AVS for dataflow
• Workflow linking services
These can all be considered as some sort of execution unit exchanging messages with some other unit.

And there are higher level programming models such as OpenMP, PGAS, HPCS Languages.

15

Data Parallel Run Time Architectures

MPI: long running processes with rendezvous for message exchange/synchronization; ports carry short or long messages.
Yahoo Hadoop: short running processes communicating via disk (HTTP between stages), with tracking processes (Trackers).
CGL MapReduce: long running processes exchanging short messages via pipes.

16

Is Dataflow the answer?

For functional parallelism, dataflow is natural as one moves from one step to another.
For much data parallelism one needs "deltaflow" – send change messages to long running processes/threads, as in MPI or any rendezvous model – a potentially huge reduction in communication cost. For threads there is no difference, but for processes there is a big difference.
Overhead is Communication/Computation. Dataflow overhead is proportional to the problem size N per process. For solution of PDEs, deltaflow overhead goes like N^(2/3) while computation goes like N, so dataflow is not popular in scientific computing. For matrix multiplication, deltaflow and dataflow communication are both O(N) while computation goes like N^1.5.
MapReduce noted that several data analysis algorithms can use dataflow (especially in Information Retrieval).
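The scaling claim above can be checked with quick arithmetic, assuming N means data items held per process (a 3D PDE solve ships its whole block under dataflow but only its boundary surface, roughly N^(2/3), under deltaflow):

```python
# Back-of-envelope check of the dataflow vs. deltaflow overhead scaling.

def overhead(comm, comp):
    """Communication/Computation ratio, the slide's definition of overhead."""
    return comm / comp

n = 10 ** 6
dataflow_pde = overhead(n, n)              # constant: never amortized
deltaflow_pde = overhead(n ** (2 / 3), n)  # shrinks as N grows

# Matrix multiplication with N elements per process: communication O(N)
# for both models, but computation N**1.5, so even dataflow amortizes.
dataflow_matmul = overhead(n, n ** 1.5)
```

At a million points per process, deltaflow communication is about 1% of computation while dataflow stays pinned at 100%, which is why rendezvous-style deltaflow dominates scientific computing.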

17

MapReduce implemented by Hadoop

map(key, value)
reduce(key, list)

E.g. Word Count:
map(String key, String value):
  // key: document name
  // value: document contents
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts

[Dryad execution graph figure]
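The word count pseudocode can be run as a minimal serial Python sketch – Hadoop would distribute the map and reduce calls across nodes, and the function names here are invented for illustration:

```python
from collections import defaultdict

# Serial emulation of Hadoop's map(key, value) / reduce(key, list) word count.

def map_fn(key, value):
    # key: document name, value: document contents
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # key: a word, values: a list of counts
    return key, sum(values)

def mapreduce(documents):
    groups = defaultdict(list)            # "shuffle": group counts by word
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    return dict(reduce_fn(w, c) for w, c in groups.items())

counts = mapreduce({"doc1": "the cloud", "doc2": "the grid the cloud"})
```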

18

Kmeans Clustering

• MapReduce for Kmeans clustering: execution time vs. the number of 2D data points (both axes in log scale)
• All four implementations perform the same Kmeans clustering algorithm
• Each test is performed using 5 compute nodes (total of 40 processor cores)
• CGL-MapReduce shows performance close to the MPI and Threads implementations
• Hadoop's high execution time is due to:
  – lack of support for iterative MapReduce computation
  – overhead associated with file system based communication

[Figure annotations: Hadoop vs. MPI – factor of 10^3; in-memory MapReduce vs. MPI – factor of 30; x-axis: number of data points]

CGL-MapReduce

[Architecture of CGL-MapReduce: User Program and MR Driver connected through a Content Dissemination Network to Map workers (M) and Reduce workers (R) on worker nodes; an MRDeamon on each node handles data split, data read/write and file system communication]

• A streaming based MapReduce runtime implemented in Java
• All communications (control/intermediate results) are routed via a content dissemination network
• Intermediate results are transferred directly from the map tasks to the reduce tasks – eliminating local files
• MRDriver maintains the state of the system and controls the execution of map/reduce tasks
• User Program is the composer of MapReduce computations
• Supports both stepped (dataflow) and iterative (deltaflow) MapReduce computations
• All communication uses publish-subscribe "queues in the cloud", not MPI
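A toy, plain-Python contrast (all names invented, costs reduced to counting data loads) of why iterative MapReduce favors CGL-MapReduce's long running workers over Hadoop's restart-per-iteration model:

```python
# CGL-style: a long-running worker loads the static data once and reuses it.
class LongRunningWorker:
    def __init__(self, load_data):
        self.data = load_data()      # static data read once, then cached
        self.loads = 1
    def map_step(self, param):
        return sum(x * param for x in self.data)

# Hadoop-style: every iteration spawns fresh tasks that re-load the data.
def hadoop_style(load_data, params):
    loads = 0
    results = []
    for p in params:                 # each iteration pays the load cost again
        data = load_data()
        loads += 1
        results.append(sum(x * p for x in data))
    return results, loads

load = lambda: [1, 2, 3]
worker = LongRunningWorker(load)
cgl_results = [worker.map_step(p) for p in (1, 2, 3)]
hadoop_results, hadoop_loads = hadoop_style(load, (1, 2, 3))
```

Both produce identical results, but the Hadoop-style loop pays the (file system) load once per iteration, which is exactly the overhead seen in the Kmeans comparison above.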

Particle Physics (LHC) Data Analysis

Data: up to 1 terabyte of data, placed in the IU Data Capacitor.
Processing: 12 dedicated computing nodes from Quarry (total of 96 processing cores).
MapReduce for LHC data analysis: execution time vs. the volume of data (fixed compute resources).

• Hadoop and CGL-MapReduce both show similar performance
• The amount of data accessed in each analysis is extremely large
• Performance is limited by the I/O bandwidth
• The overhead induced by the MapReduce implementations has a negligible effect on the overall computation

Jaliya Ekanayake

22

LHC Data Analysis Scalability and Speedup

Execution time vs. the number of compute nodes (fixed data)

• 100 GB of data
• One core of each node is used (performance is limited by the I/O bandwidth)
• Speedup = Sequential Time / MapReduce Time
• Speedup for 100 GB of HEP data: the speed gain diminishes after a certain number of parallel processing units (after around 10 units)

MPI outside the mainstream

Multicore best practice and large scale distributed processing – not scientific computing – will drive the best concurrent/parallel computing environments.
Party Line Parallel Programming Model: workflow (parallel-distributed) controlling optimized library calls.
Core parallel implementations are no easier than before; deployment is easier.
MPI is wonderful, but it will be ignored in the real world unless simplified; competition comes from thread and distributed system technology.
CCR from Microsoft – only ~7 primitives – is one possible commodity multicore driver:
• It is roughly active messages
• Runs MPI style codes fine on multicore

24

Windows Thread Runtime System

We implement thread parallelism using Microsoft CCR (Concurrency and Coordination Runtime), as it supports both MPI rendezvous and dynamic (spawned) threading styles of parallelism: http://msdn.microsoft.com/robotics/
CCR supports exchange of messages between threads using named ports and has primitives like:
• FromHandler: spawn threads without reading ports
• Receive: each handler reads one item from a single port
• MultipleItemReceive: each handler reads a prescribed number of items of a given type from a given port. Note items in a port can be general structures, but all must have the same type.
• MultiplePortReceive: each handler reads one item of a given type from multiple ports.
CCR has fewer primitives than MPI but can implement MPI collectives efficiently.
Can use DSS (Decentralized System Services), built in terms of CCR, for the service model. DSS has ~35 µs and CCR a few µs overhead.
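CCR itself is a C# library; purely as a rough illustration (class and method names invented, not CCR's API), a CCR-style named port with a MultipleItemReceive-like primitive can be emulated in Python with a queue and a thread:

```python
import queue
import threading

# Emulation of a CCR-style named port: a handler fires once a prescribed
# number of items has arrived on the port, as in MultipleItemReceive.

class Port:
    def __init__(self):
        self._q = queue.Queue()

    def post(self, item):
        self._q.put(item)

    def multiple_item_receive(self, count, handler):
        # Spawn a handler that blocks until `count` items arrive, then runs.
        def wait_and_fire():
            items = [self._q.get() for _ in range(count)]
            handler(items)
        t = threading.Thread(target=wait_and_fire)
        t.start()
        return t

results = []
port = Port()
t = port.multiple_item_receive(3, lambda items: results.append(sum(items)))
for x in (1, 2, 3):
    port.post(x)
t.join()
```

The same "wait for k items, then fire" pattern is what lets CCR build MPI-style collectives (e.g. a reduction over k contributions) from a handful of primitives.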

Parallel Overhead

Parallel Overhead on P processors = P·T(P)/T(1) − 1 = (1/efficiency) − 1

[Figure: parallel overhead (roughly 0.00–0.20) for Deterministic Annealing Clustering scaled speedup tests on 4 eight-core Windows systems, 1,600,000 points per C# thread; 1-, 2-, 4-, 8-, 16- and 32-way parallelism obtained from combinations of nodes × MPI processes per node × CCR threads per process]
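Written out with hypothetical timings (the numbers below are invented, not from the figure), the overhead definition above is:

```python
# Parallel overhead f = P*T(P)/T(1) - 1 = (1/efficiency) - 1;
# f = 0 means perfect scaling.

def parallel_overhead(p, t_p, t_1):
    return p * t_p / t_1 - 1.0

def efficiency(p, t_p, t_1):
    return t_1 / (p * t_p)

t_1 = 32.0   # hypothetical 1-processor time
p, t_p = 8, 5.0
f = parallel_overhead(p, t_p, t_1)           # 8*5/32 - 1 = 0.25
same = 1.0 / efficiency(p, t_p, t_1) - 1.0   # identical by algebra
```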

Deterministic Annealing for Pairwise Clustering

Clustering is a well known data mining algorithm, with K-means the best known approach. Two ideas lead to new supercomputer data mining algorithms:
• Use deterministic annealing to avoid local minima
• Do not use vectors, which are often not known – use distances δ(i,j) between points i, j in the collection; N = millions of points are available in biology, and the algorithms go like N²
Developed (partially) by Hofmann and Buhmann in 1997, but with little or no application since.

Minimize H_PC = 0.5 Σ_{i=1..N} Σ_{j=1..N} δ(i,j) Σ_{k=1..K} M_i(k) M_j(k) / C(k)

where K is the number of clusters, M_i(k) is the probability that point i belongs to cluster k, and C(k) = Σ_{i=1..N} M_i(k) is the number of points in the k'th cluster.

M_i(k) ∝ exp(−ε_i(k)/T) with Hamiltonian Σ_{i=1..N} Σ_{k=1..K} M_i(k) ε_i(k); reduce T from large to small values to anneal.

N = 3000 sequences, each of length ~1000 features. Only pairwise distances are used; we will repeat with 0.1 to 0.5 million sequences on a larger machine. C# with CCR and MPI.
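A minimal numeric sketch of the annealed assignment rule M_i(k) ∝ exp(−ε_i(k)/T): the ε values below are made-up costs for one point, chosen only to show that high T gives a near-uniform (soft) assignment while low T hardens it toward the minimum-cost cluster, which is how lowering T anneals past local minima.

```python
import math

# Softmax form of M_i(k) ∝ exp(-eps_i(k)/T) for a single point i.
def assignments(eps, T):
    w = [math.exp(-e / T) for e in eps]
    z = sum(w)
    return [x / z for x in w]

eps = [1.0, 2.0, 4.0]          # hypothetical eps_i(k) for one point, K = 3
hot = assignments(eps, 100.0)  # high T: nearly uniform over clusters
cold = assignments(eps, 0.05)  # low T: nearly hard assignment to k = 0
```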

28

Famous Lolcats

LOL is Internet Slang for Laughing out Loud

29

I’M IN UR CLOUD

INVISIBLE COMPLEXITY