Scalable Numerical Algorithms and Methods on the ASCI Machines
Department of Computer Science, University of the West Indies
Computing Components
How did we learn to fly?
By constructing a machine that flaps its wings like a bird?
Answer
By applying the aerodynamic principles demonstrated in nature. Likewise, we model parallel processing on the parallelism found in biological systems.
Motivating Factors
1. The aggregate speed with which complex calculations are carried out by neurons.
2. An individual neuron's response is slow (measured in milliseconds).
This demonstrates the feasibility of parallel processing.
Computing Components
[Diagram: layered view of a parallel system. Applications and programming paradigms sit on a threads interface; beneath it, a microkernel and operating system map threads and processes onto the hardware's processors (P).]
Processing Elements
Flynn's simple classification, by the number of instruction and data streams:
SISD - conventional
SIMD - data parallel, vector computing
MISD - multiple instruction streams operating on a single data stream (rarely practical; see below)
MIMD - very general, multiple approaches.
The current focus is on the MIMD model, using general-purpose processors.
SISD: A Conventional Computer
[Diagram: Data Input → Processor → Data Output.]
Speed is limited by the rate at which the computer can transfer information internally.
Examples: PC, Macintosh, Workstations
The MISD Architecture
[Diagram: a single data input stream is operated on by Processors A, B, and C, each executing its own instruction stream (A, B, C), yielding a single data output stream.]
More of an intellectual exercise than a practical configuration. Few have been built, and none are commercially available.
SIMD Architecture
[Diagram: a single instruction stream drives Processors A, B, and C; each processor has its own data input stream (A, B, C) and produces its own data output stream (A, B, C).]
Examples: Cray vector-processing machines, the Thinking Machines CM series, Intel MMX (multimedia support)
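To make the data-parallel idea concrete, here is a minimal sketch in C using x86 SSE intrinsics (a successor to the MMX support mentioned above; the array contents are illustrative). A single vector instruction applies the same operation to four data elements at once:

    #include <stdio.h>
    #include <xmmintrin.h>   /* x86 SSE intrinsics */

    int main(void)
    {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load four floats into one register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four additions */
        _mm_storeu_ps(c, vc);

        printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }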
MIMD Architecture
[Diagram: Processors A, B, and C each execute their own instruction stream (A, B, C) on their own data input stream (A, B, C) and produce their own data output stream (A, B, C).]
Unlike SIMD machines, a MIMD computer works asynchronously.
Two forms: shared memory (tightly coupled) MIMD and distributed memory (loosely coupled) MIMD.
Shared Memory MIMD machine
[Diagram: Processors A, B, and C each connect through a memory bus to a global memory system.]
Communication: a source PE writes data to global memory and the destination PE retrieves it.
Easy to build, and conventional SISD operating systems can easily be ported.
Limitations: reliability and expandability. The failure of a memory component or any processor affects the whole system, and increasing the number of processors leads to scalability problems.
Examples: Silicon Graphics supercomputers, ...
SMM Examples
Dual and quad Pentiums
Power Mac G5: dual processors (2 GHz each)
Quad Pentium Shared Memory Multiprocessor
[Diagram: four processors, each with an L1 cache, an L2 cache, and a bus interface, share a processor/memory bus; a memory controller connects the bus to shared memory, and an I/O interface connects it to the I/O bus.]
Shared memory
Any memory location is accessible by any of the processors.
A single address space exists: each memory location has a unique address within a single range of addresses.
Shared-memory programming is generally more convenient, although it requires the programmer to control access to shared data.
Inter-process communication is done in the memory interface, through reads and writes.
A virtual memory address maps to a real address.
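As a minimal sketch of shared-memory programming, assuming POSIX threads (the thread count and shared counter are illustrative, not from the slides): all threads read and write the same address space, and access to the shared data is controlled by the programmer with a mutex.

    /* Shared-memory sketch with POSIX threads: every thread sees the
       same `counter`; the mutex is the programmer-controlled access. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                              /* shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* control access to shared data */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* expect 400000 */
        return 0;
    }

Without the mutex, the concurrent increments would race and the final count would be unpredictable; compile with -lpthread.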
Shared Memory Address Space
Different processors may have memory locally attached to them, so different memory accesses can take different amounts of time, and collisions are possible.
UMA (uniform memory access, i.e., shared memory) vs. NUMA (non-uniform memory access, i.e., distributed shared memory)
Building Shared Memory systems
Building SMM machines with more than four processors is very difficult and very expensive, e.g. the Sun Microsystems E10000 "Starfire" server: 64 processors, priced at several million US dollars.
Distributed Memory MIMD
[Diagram: Processors A, B, and C, each with its own memory bus and local memory system (A, B, C), communicate over an IPC channel.]
Communication: IPC over a high-speed network.
The network can be configured as a tree, mesh, cube, etc. (see the hypercube sketch below).
Unlike shared-memory MIMD, it is:
• easily and readily expandable
• highly reliable (a single CPU failure does not affect the whole system)
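As an illustrative aside (a hypothetical sketch, not from the slides): in a cube (hypercube) network with 2^d nodes, each node's neighbours are the nodes whose binary IDs differ from its own in exactly one bit, so they can be computed with a single XOR per dimension:

    #include <stdio.h>

    /* Print the neighbours of `node` in a d-dimensional hypercube:
       the nodes whose IDs differ from `node` in exactly one bit. */
    static void hypercube_neighbours(int node, int d)
    {
        printf("node %d:", node);
        for (int bit = 0; bit < d; bit++)
            printf(" %d", node ^ (1 << bit));   /* flip one bit */
        printf("\n");
    }

    int main(void)
    {
        int d = 3;                              /* 2^3 = 8 nodes */
        for (int node = 0; node < (1 << d); node++)
            hypercube_neighbours(node, d);
        return 0;
    }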
Distributed Memory
Decentralized memory (a memory module with each CPU)
• Lower memory latency
Drawbacks
• Longer communication latency
• More complex software model
Decentralized Memory versions
Message-passing "multi-computer" with a separate address space per processor:
• Can invoke software with Remote Procedure Call (RPC)
• Often via a library, such as MPI (Message Passing Interface)
• Also called "synchronous communication", since communication causes synchronization between the two processes
Message Passing System
Inter-process communication is done at the program level, using sends and receives.
Reads and writes refer only to a processor’s local memory.
Data can be packed into long messages before being sent, to compensate for latency.
Global scheduling of messages can help avoid message collisions.
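A minimal message-passing sketch in C with MPI (the buffer size, ranks, and tag are illustrative): each process refers only to its own local memory, and the whole array is packed into one long message to compensate for latency.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000

    int main(int argc, char **argv)
    {
        int rank;
        double buf[N];                    /* each process's local memory */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < N; i++)
                buf[i] = i;
            /* one long message instead of N short ones, to hide latency */
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d doubles; buf[%d] = %.0f\n",
                   N, N - 1, buf[N - 1]);
        }

        MPI_Finalize();
        return 0;
    }

Run with, e.g., mpirun -np 2 ./a.out: rank 0 sends, rank 1 receives, and neither ever reads or writes the other's memory directly.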
MIMD program structure
Multiple Program Multiple Data (MPMD): each processor has its own program to execute.
Single Program Multiple Data (SPMD): a single source program is written, and each processor executes its own copy of the program.
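The MPI sketch above is SPMD in exactly this sense: one source program, with per-processor behaviour selected at run time from the process rank. A minimal skeleton of the pattern (the role names are illustrative):

    /* SPMD skeleton: every process runs this same program and
       picks its role from its rank. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            printf("coordinator: %d workers\n", size - 1);
        else
            printf("worker %d of %d starting\n", rank, size - 1);

        MPI_Finalize();
        return 0;
    }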
Speedup factor
S(n) gives the increase in speed from using a multiprocessor:
S(n) = (execution time on a single processor) / (execution time on a multiprocessor with n processors)
The speedup factor can also be cast in terms of computational steps:
S(n) = (number of steps using one processor) / (number of parallel steps using n processors)
The maximum speedup is n with n processors (linear speedup); this theoretical limit is not always achieved.
Maximum Speedup - Amdahl's Law
[Diagram: on one processor, the total time t_s divides into a serial section f·t_s and parallelizable sections (1-f)·t_s; on n processors, the parallel time is t_p = f·t_s + (1-f)·t_s/n.]
S(n) = t_s / t_p = n / (1 + (n-1)f)
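A worked instance (the serial fraction here is illustrative, not a value from the slides): with f = 0.05 and n = 16,
S(16) = 16 / (1 + 15 × 0.05) = 16 / 1.75 ≈ 9.1
so even a 5% serial section roughly halves the ideal 16× speedup.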
Parallel Architectures
• Data-parallel architectures
• Function-parallel architectures
  • Instruction-level PAs
  • Thread-level PAs
  • Process-level PAs (MIMDs)
    • Shared Memory MIMD
    • Distributed Memory MIMD