
(Superficial!) Review of Uniprocessor Architecture, Parallel Architectures, and Related Concepts

CS 433, Laxmikant Kale, Department of Computer Science, University of Illinois at Urbana-Champaign

Parallel Machines: an abstract introduction

• Our main focus will be on three kinds of machines
  – Bus-based shared memory machines
  – Scalable shared memory machines
    • Cache coherent
    • Hardware support for remote memory access
  – Distributed memory machines

Distributed memory m/cs: debate

• Should this machine support a shared address space?
  – If not: coordination by “passing messages”
  – If so: how, and whether, to keep caches “coherent”?

[Figure: p nodes, each with a memory (Mem0 … Memp-1), a cache, and a processing element (PE0 … PEp-1), connected by an interconnection network]

This debate is also tied to the debate over programming models:

Writing parallel programs

• Programming model
  – How should a programmer view the parallel machine?
  – Sequential programming: the von Neumann model
• Parallel programming models:
  – Shared memory (shared address space) model
  – Message passing model
  – Shared objects model
• Common to all these models:
  – Multiple independent entities communicating, synchronizing, and coordinating with each other via specific mechanisms provided by the model
• Special-purpose models:
  – A common case: data-parallel (loop-parallel) models
  – Other “domain-specific” models

Shared Address space model

• Also sometimes called the shared memory model:
  – considered a misnomer by some: shared memory is an architectural concept
• Independent entities are called threads (or processes)
  – All threads use the same common address space
  – When thread i refers to an address A, it is the same location as when thread j refers to address A

• Advantages:
  – Natural extension of the sequential programming model
    • Some people disagree even about this
  – Relatively easy to get a “first parallel version” of an existing sequential code

Shared Address space model:

• Issues:
  – Need hardware support for cache coherence and consistency:
    • But that’s not the concern when we are discussing efficacy of the programming model
  – Data being read by one thread may be being modified by another
    • Need ways of synchronizing access
    • E.g. a producer-consumer relationship between threads
      – The producer stores the result in shared variable X
      – When can the consumer thread read it?

  – Another example: inconsistent modifications
    • Suppose two processes are both trying to add 5 to x:  x := x + 5
      – In reality, this is not one instruction, but 3:  ld r1,x; add r1,r1,5; st r1,x
    • Now, the 6 instructions (3 from each thread)
      – may interleave in many possible ways
      – leading to wrong behavior
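A minimal sketch of this lost-update problem, using C++ threads (the thread count, iteration count, and function names are illustrative; the data race is deliberate, to show the interleaving described above):

    #include <iostream>
    #include <thread>

    int x = 0;                          // shared variable, deliberately unprotected

    void add_five_repeatedly() {
        for (int i = 0; i < 100000; ++i)
            x = x + 5;                  // roughly: ld r1,x; add r1,r1,5; st r1,x
    }

    int main() {
        std::thread t1(add_five_repeatedly);
        std::thread t2(add_five_repeatedly);
        t1.join();
        t2.join();
        // With no synchronization, the load/add/store triples from the two threads
        // can interleave, losing updates: x often ends up below the expected 1,000,000.
        std::cout << "x = " << x << std::endl;
        return 0;
    }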

SAS model: Locks and Barriers

• Solution: locks
  – A lock is a variable
  – You can: create a lock, “lock” a lock, and “unlock” a lock
  – The implementation guarantees that:
    • only one thread can “get” or “lock” a lock at a time
• Using locks:
  – Protect vulnerable shared data using a lock
  – Associate a lock with such a variable
    • Mentally (there is no construct or call to do the association)
  – Before changing the variable, lock its associated lock
    • Unlock it as soon as you have finished using it
  – Remember that this is only a convention
    • Nothing prevents a thread from inadvertently changing a variable that is protected by a lock in another part of the code
    • Analogy: locking a room with a “post-it” on the door
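The same example as above, with a lock added (a sketch using a C++ std::mutex as the lock; as noted above, the association between the lock and x exists only in the programmer’s head):

    #include <iostream>
    #include <mutex>
    #include <thread>

    int x = 0;
    std::mutex x_lock;                  // by convention, protects x (nothing enforces this)

    void add_five_repeatedly() {
        for (int i = 0; i < 100000; ++i) {
            x_lock.lock();              // lock before touching the shared variable
            x = x + 5;
            x_lock.unlock();            // unlock as soon as we are done
        }
    }

    int main() {
        std::thread t1(add_five_repeatedly);
        std::thread t2(add_five_repeatedly);
        t1.join();
        t2.join();
        std::cout << "x = " << x << std::endl;   // now reliably 1,000,000
        return 0;
    }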

Matrix multiplication:

• Why people like the SAS model: the ordinary sequential loop nest (for (i=0; i<M; i++) …) computes the product directly on the shared arrays, with no explicit data movement (see the sketch below)
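A sketch of that point; the dimension names (M, K, N) and array declarations are illustrative assumptions, not from the slide:

    // Plain sequential matrix multiply: C = A * B, with A of size M x K and B of size K x N.
    // In a shared address space every thread sees the same A, B, and C, so this same
    // loop body can be reused when the work is later split across threads.
    const int M = 256, K = 256, N = 256;
    double A[M][K], B[K][N], C[M][N];

    void matmul() {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < K; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }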

SAS matrix multiply

• Each thread knows its “serial number”: myPE()
  – The rows of the result are divided evenly among the threads:
    size = M / numPEs();  myStart = myPE() * size;
  – Each thread then runs the same loop nest, but only over its own block of rows, i = myStart onwards (see the sketch below)
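A sketch of the row-partitioned version, assuming the generic myPE()/numPEs() calls from the slide (stubbed here with a fixed thread count) and that M divides evenly; the scaffolding around the loop is illustrative:

    #include <thread>
    #include <vector>

    const int M = 256, K = 256, N = 256;
    double A[M][K], B[K][N], C[M][N];

    const int NUM_PES = 4;              // stand-in for numPEs()

    // Each "PE" (thread) computes its own contiguous block of rows of C.
    // A and B are only read, and the row blocks of C are disjoint, so no lock is needed.
    void matmul_block(int myPE) {
        int size = M / NUM_PES;         // size = M / numPEs();
        int myStart = myPE * size;      // myStart = myPE() * size;
        for (int i = myStart; i < myStart + size; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < K; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

    int main() {
        std::vector<std::thread> pes;
        for (int pe = 0; pe < NUM_PES; pe++)
            pes.emplace_back(matmul_block, pe);
        for (auto& t : pes) t.join();
        return 0;
    }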

Message passing

• Parallel entities are processes
  – with their own address spaces
• Assume that processors have direct access only to their own memory
• Each processor typically executes the same executable, but may be running a different part of the program at any given time
• Coordination:
  – via sending and receiving “messages”: bytes of data

Message passing basics:

• Basic calls: send and recv
  – send(int proc, int tag, int size, char *buf);
  – recv(int proc, int tag, int size, char *buf);
  – recv may return the actual number of bytes received in some systems
• tag and proc may be wildcarded in a recv:
  – recv(ANY, ANY, 1000, &buf);
• broadcast
• Other global operations (reductions)
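A sketch of how these calls are typically used; the send/recv/myPE declarations below just mirror the generic interface above (a real message-passing library would provide them), and the tag value and buffer size are illustrative:

    // Interface assumed by this sketch, matching the generic calls above.
    int  myPE();                                        // this process's serial number
    void send(int proc, int tag, int size, char *buf);
    void recv(int proc, int tag, int size, char *buf);  // some systems also return the byte count

    const int TAG_DATA = 17;                            // illustrative tag

    // Process 0 sends a block of data to process 1; process 1 receives it.
    void exchange_example() {
        char buf[1000];
        if (myPE() == 0) {
            // ... fill buf with results ...
            send(1, TAG_DATA, 1000, buf);   // explicit copy toward process 1's address space
        } else if (myPE() == 1) {
            recv(0, TAG_DATA, 1000, buf);   // blocks until a matching message arrives
            // ... use buf ...
        }
    }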

Parallel Programming

• Decomposition: what to do in parallel
  – Tasks (loop iterations, functions, …) that can be done in parallel
• Mapping:
  – Which processor does each task
• Scheduling (sequencing):
  – On each processor
• Machine dependent expression:
  – Express the above decisions for the particular parallel machine

Spectrum of parallel Languages

[Figure: the levels above (decomposition, mapping, scheduling/sequencing, machine dependent expression) plotted against systems (parallelizing Fortran compiler, Charm++, MPI/SAS), showing for each what is automated by the system versus left to the programmer, i.e. its level of specialization]

Shared objects model:

• Basic philosophy:
  – Let the programmer decide what to do in parallel
  – Let the system handle the rest:
    • Which processor executes what, and when
    • With some override control for the programmer, when needed
• Basic model:
  – The program is a set of communicating objects
  – Objects only know about other objects (not processors)
  – The system maps objects to processors
    • And may remap the objects dynamically, for load balancing etc.
• Shared objects, not shared memory
  – So, in some ways, in between the “shared nothing” of message passing and the “shared everything” of SAS
  – More disciplined sharing
  – Additional information sharing mechanisms

Charm++

• Data driven objects: called chares
• Asynchronous method invocation
• Prioritized scheduling
• Object arrays
• Object groups:
  – a global object with a “representative” on each PE
• Information sharing abstractions:
  – readonly data
  – accumulators
  – distributed tables

Data Driven Execution

[Figure: on each processor, a scheduler repeatedly picks the next message from that processor’s message queue and delivers it to the local target object; all execution is driven by the arrival of messages]
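To make the picture concrete, here is a small illustrative sketch of a per-processor message-driven scheduler loop (not Charm++ internals, just the idea: a message names a target object and method, and the scheduler dequeues and delivers):

    #include <functional>
    #include <queue>
    #include <utility>

    // In this sketch a "message" is a closure that, when run, invokes an entry
    // method on its target object with its packed arguments.
    struct Message {
        std::function<void()> deliver;
    };

    class Scheduler {
        std::queue<Message> messageQ;   // per-processor message queue
    public:
        void enqueue(Message m) { messageQ.push(std::move(m)); }

        // Data-driven execution: nothing runs except in response to a message.
        void run() {
            while (!messageQ.empty()) {
                Message m = std::move(messageQ.front());
                messageQ.pop();
                m.deliver();            // invoke the target object's entry method
            }
        }
    };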

Object Arrays

• A collection of chares,
  – with a single global name for the collection, and
  – each member addressed by an index
  – Mapping of element objects to processors handled by the system

[Figure: user’s view: one logical array A[0], A[1], A[2], A[3], …; system view: the same elements (A[0], A[3], …) distributed across processors by the runtime]
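A minimal sketch of a 1D chare array with asynchronous method invocation, modeled on the standard Charm++ “hello” example; the module name, element count, and method names are illustrative:

    // ---- hello.ci (interface file) ----
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry void done();
      };
      array [1D] Hello {
        entry Hello();
        entry void sayHi();
      };
    };

    // ---- hello.C ----
    #include "hello.decl.h"

    /*readonly*/ CProxy_Main mainProxy;

    class Main : public CBase_Main {
      int count;
    public:
      Main(CkArgMsg *m) {
        delete m;
        mainProxy = thisProxy;
        int nElems = 4;
        // Create a chare array; the runtime decides which PE holds each element.
        CProxy_Hello arr = CProxy_Hello::ckNew(nElems);
        arr.sayHi();                   // asynchronous invocation on every element (broadcast)
        count = nElems;
      }
      void done() { if (--count == 0) CkExit(); }
    };

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}
      void sayHi() {
        CkPrintf("Hello from element %d on PE %d\n", thisIndex, CkMyPe());
        mainProxy.done();              // asynchronous invocation back on the main chare
      }
    };

    #include "hello.def.h"

Each sayHi() call returns immediately on the caller; the runtime delivers it to whichever processor currently holds that element, which is exactly the user’s-view versus system-view split in the figure.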

Object Groups

• A group of objects (chares)
  – with exactly one representative on each processor
  – A single Id for the group as a whole
  – Invoke methods in a branch (asynchronously), in all branches (broadcast), or in the local branch

Information sharing abstractions

• Observation:
  – Information is shared in several specific modes in parallel programs
• Other models support only a limited set of modes:
  – Shared memory: everything is shared: a sledgehammer approach
  – Message passing: messages are the only mechanism
• Charm++ identifies and supports several modes:
  – Readonly / writeonce data
  – Tables (hash tables)
  – Accumulators
  – Monotonic variables
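As an illustration of what one of these modes restricts (a hand-rolled C++ sketch of the accumulator idea, not the Charm++ API): contributors may only add to the value, and the combined total is read only after all contributions are in.

    #include <mutex>

    // Sketch of the "accumulator" sharing mode: unlike general shared memory,
    // the only operations offered are add() and a final read of the combined value.
    class Accumulator {
        long long total = 0;
        std::mutex m;
    public:
        void add(long long v) {             // contribution from any thread/object
            std::lock_guard<std::mutex> g(m);
            total += v;
        }
        long long value() const {           // meaningful only after all adds have completed
            return total;
        }
    };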

Comparing Programming Models

• What are the advantages and disadvantages of the models?
  – even at this simple/abstract level of introduction?