Scalable Numerical Algorithms and Methods on the ASCI Machines
Department of Computer Science, University of the West Indies
Computing Components
How did we learn to fly?
By constructing a machine that flaps its wings like a bird?
Answer
By applying the aerodynamic principles demonstrated in nature. Likewise, we model parallel processing on the parallelism found in biological systems.
Motivating Factors
1. The aggregate speed with which complex calculations are carried out by neurons.
2. An individual neuron's response is slow (measured in milliseconds).
This demonstrates the feasibility of parallel processing.
Computing Components
[Diagram: layered view of a parallel system. Applications and programming paradigms sit on a threads interface; beneath it, a microkernel and operating system map threads and processes onto the hardware's processors (P).]
Processing Elements
Flynn's simple classification, by the number of instruction and data streams:
SISD - conventional
SIMD - data parallel, vector computing
MISD - multiple instruction streams operating on a single data stream (rarely practical; see below)
MIMD - very general, multiple approaches.
The current focus is on the MIMD model, using general-purpose processors.
SISD: A Conventional Computer
[Diagram: Data Input → Processor → Data Output.]
Speed is limited by the rate at which the computer can transfer information internally.
Examples: PC, Macintosh, Workstations
The MISD Architecture
[Diagram: a single data input stream is operated on by Processors A, B, and C, each executing its own instruction stream (A, B, C), yielding a single data output stream.]
More of an intellectual exercise than a practical configuration. Few have been built, and none are commercially available.
SIMD Architecture
[Diagram: a single instruction stream drives Processors A, B, and C; each processor has its own data input stream (A, B, C) and produces its own data output stream (A, B, C).]
Examples: Cray vector-processing machines, the Thinking Machines CM series, Intel MMX (multimedia support)
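To make the data-parallel idea concrete, here is a minimal sketch in C using x86 SSE intrinsics (a successor to the MMX support mentioned above; the array contents are illustrative). A single vector instruction applies the same operation to four data elements at once:

    #include <stdio.h>
    #include <xmmintrin.h>   /* x86 SSE intrinsics */

    int main(void)
    {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load four floats into one register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four additions */
        _mm_storeu_ps(c, vc);

        printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }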
MIMD Architecture
[Diagram: Processors A, B, and C each execute their own instruction stream (A, B, C) on their own data input stream (A, B, C) and produce their own data output stream (A, B, C).]
Unlike SIMD machines, a MIMD computer works asynchronously.
Two forms: shared memory (tightly coupled) MIMD and distributed memory (loosely coupled) MIMD.
Shared Memory MIMD machine
[Diagram: Processors A, B, and C each connect through a memory bus to a global memory system.]
Communication: a source PE writes data to global memory and the destination PE retrieves it.
Easy to build, and conventional SISD operating systems can easily be ported.
Limitations: reliability and expandability. The failure of a memory component or any processor affects the whole system, and increasing the number of processors leads to scalability problems.
Examples: Silicon Graphics supercomputers, ...
SMM Examples
Dual and quad Pentiums
Power Mac G5: dual processors (2 GHz each)
Quad Pentium Shared Memory Multiprocessor
[Diagram: four processors, each with an L1 cache, an L2 cache, and a bus interface, share a processor/memory bus; a memory controller connects the bus to shared memory, and an I/O interface connects it to the I/O bus.]
Shared memory
Any memory location is accessible by any of the processors.
A single address space exists: each memory location has a unique address within a single range of addresses.
Shared-memory programming is generally more convenient, although it requires the programmer to control access to shared data.
Inter-process communication is done in the memory interface, through reads and writes.
A virtual memory address maps to a real address.
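As a minimal sketch of shared-memory programming, assuming POSIX threads (the thread count and shared counter are illustrative, not from the slides): all threads read and write the same address space, and access to the shared data is controlled by the programmer with a mutex.

    /* Shared-memory sketch with POSIX threads: every thread sees the
       same `counter`; the mutex is the programmer-controlled access. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                              /* shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* control access to shared data */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* expect 400000 */
        return 0;
    }

Without the mutex, the concurrent increments would race and the final count would be unpredictable; compile with -lpthread.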
Shared Memory Address Space
Different processors may have memory locally attached to them, so different memory accesses can take different amounts of time, and collisions are possible.
UMA (uniform memory access, i.e., shared memory) vs. NUMA (non-uniform memory access, i.e., distributed shared memory)
Building Shared Memory systems
Building SMM machines with more than four processors is very difficult and very expensive, e.g. the Sun Microsystems E10000 "Starfire" server: 64 processors, priced at several million US dollars.
Distributed Memory MIMD
[Diagram: Processors A, B, and C, each with its own memory bus and local memory system (A, B, C), communicate over an IPC channel.]
Communication: IPC over a high-speed network.
The network can be configured as a tree, mesh, cube, etc. (see the hypercube sketch below).
Unlike shared-memory MIMD, it is:
• easily and readily expandable
• highly reliable (a single CPU failure does not affect the whole system)
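As an illustrative aside (a hypothetical sketch, not from the slides): in a cube (hypercube) network with 2^d nodes, each node's neighbours are the nodes whose binary IDs differ from its own in exactly one bit, so they can be computed with a single XOR per dimension:

    #include <stdio.h>

    /* Print the neighbours of `node` in a d-dimensional hypercube:
       the nodes whose IDs differ from `node` in exactly one bit. */
    static void hypercube_neighbours(int node, int d)
    {
        printf("node %d:", node);
        for (int bit = 0; bit < d; bit++)
            printf(" %d", node ^ (1 << bit));   /* flip one bit */
        printf("\n");
    }

    int main(void)
    {
        int d = 3;                              /* 2^3 = 8 nodes */
        for (int node = 0; node < (1 << d); node++)
            hypercube_neighbours(node, d);
        return 0;
    }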
Distributed Memory
Decentralized memory (a memory module with each CPU)
• Lower memory latency
Drawbacks
• Longer communication latency
• More complex software model
Decentralized Memory versions
Message-passing "multi-computer" with a separate address space per processor:
• Can invoke software with Remote Procedure Call (RPC)
• Often via a library, such as MPI (Message Passing Interface)
• Also called "synchronous communication", since communication causes synchronization between the two processes
Message Passing System
Inter-process communication is done at the program level, using sends and receives.
Reads and writes refer only to a processor’s local memory.
Data can be packed into long messages before being sent, to compensate for latency.
Global scheduling of messages can help avoid message collisions.
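A minimal message-passing sketch in C with MPI (the buffer size, ranks, and tag are illustrative): each process refers only to its own local memory, and the whole array is packed into one long message to compensate for latency.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000

    int main(int argc, char **argv)
    {
        int rank;
        double buf[N];                    /* each process's local memory */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < N; i++)
                buf[i] = i;
            /* one long message instead of N short ones, to hide latency */
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d doubles; buf[%d] = %.0f\n",
                   N, N - 1, buf[N - 1]);
        }

        MPI_Finalize();
        return 0;
    }

Run with, e.g., mpirun -np 2 ./a.out: rank 0 sends, rank 1 receives, and neither ever reads or writes the other's memory directly.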
MIMD program structure
Multiple Program Multiple Data (MPMD): each processor has its own program to execute.
Single Program Multiple Data (SPMD): a single source program is written, and each processor executes its own copy of the program.
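The MPI sketch above is SPMD in exactly this sense: one source program, with per-processor behaviour selected at run time from the process rank. A minimal skeleton of the pattern (the role names are illustrative):

    /* SPMD skeleton: every process runs this same program and
       picks its role from its rank. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            printf("coordinator: %d workers\n", size - 1);
        else
            printf("worker %d of %d starting\n", rank, size - 1);

        MPI_Finalize();
        return 0;
    }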
Speedup factor
S(n) gives the increase in speed from using a multiprocessor:
S(n) = (execution time on a single processor) / (execution time on a multiprocessor with n processors)
The speedup factor can also be cast in terms of computational steps:
S(n) = (number of steps using one processor) / (number of parallel steps using n processors)
The maximum speedup is n with n processors (linear speedup); this theoretical limit is not always achieved.
Maximum Speedup - Amdahl's Law
[Diagram: on one processor, the total time t_s divides into a serial section f·t_s and parallelizable sections (1-f)·t_s; on n processors, the parallel time is t_p = f·t_s + (1-f)·t_s/n.]
S(n) = t_s / t_p = n / (1 + (n-1)f)
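A worked instance (the serial fraction here is illustrative, not a value from the slides): with f = 0.05 and n = 16,
S(16) = 16 / (1 + 15 × 0.05) = 16 / 1.75 ≈ 9.1
so even a 5% serial section roughly halves the ideal 16× speedup.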
Parallel Architectures
• Data-parallel architectures
• Function-parallel architectures
  • Instruction-level PAs
  • Thread-level PAs
  • Process-level PAs (MIMDs)
    • Shared Memory MIMD
    • Distributed Memory MIMD