2. Machine Architectures


Parallel and Distributed Systems Architecture
Aug 2007
2. Machine Architectures
Paulo Marques
Dep. Eng. Informática – Universidade de Coimbra
[email protected]
von Neumann Architecture
- Based on the fetch-decode-execute cycle (a minimal sketch follows below)
- The computer executes a single sequence of instructions that act on data. Both program and data are stored in memory.

[Diagram: a flow of instructions A, B, C acting on data in memory]
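To make the cycle concrete, here is a minimal sketch, in C, of a fetch-decode-execute loop for a made-up three-instruction accumulator machine. The opcodes, memory layout, and values are invented for illustration; the point is that program and data live in the same memory, as in the von Neumann model.

```c
#include <stdio.h>

/* Hypothetical accumulator machine: program and data share one memory
 * array (the von Neumann principle). */
enum { LOAD = 0, ADD = 1, HALT = 2 };

int main(void) {
    /* memory[0..4]: the program; memory[6..7]: the data it acts on */
    int memory[8] = { LOAD, 6, ADD, 7, HALT, 0, 40, 2 };
    int pc = 0, acc = 0, running = 1;

    while (running) {
        int instr = memory[pc++];                      /* fetch   */
        switch (instr) {                               /* decode  */
        case LOAD: acc  = memory[memory[pc++]]; break; /* execute */
        case ADD:  acc += memory[memory[pc++]]; break;
        case HALT: running = 0;                 break;
        }
    }
    printf("result = %d\n", acc); /* prints 42 */
    return 0;
}
```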
Flynn's Taxonomy
- Classifies computers according to:
  - the number of execution flows
  - the number of data flows

Flynn's matrix (execution flows × data flows):
- SISD: Single-Instruction, Single-Data
- SIMD: Single-Instruction, Multiple-Data
- MISD: Multiple-Instruction, Single-Data
- MIMD: Multiple-Instruction, Multiple-Data
Single Instruction, Single Data (SISD)
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Examples: most PCs, single-CPU workstations, …
Single Instruction, Multiple Data (SIMD)
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing
- Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions (see the sketch below)
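The slide cites the Pentium MMX instructions as a SIMD example; the sketch below uses the later SSE instruction set through C intrinsics, which follows the same idea: a single add instruction operates on four data elements at once. It assumes an x86 compiler with SSE support (e.g. gcc -msse).

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = { 1.0f,  2.0f,  3.0f,  4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load four floats per register   */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction, FOUR additions */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]); /* 11 22 33 44 */
    return 0;
}
```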
The Connection Machine 2 (SIMD)
The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) 1-bit processors working in parallel.
Multiple Instruction, Single Data (MISD)
- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message
  - the Data Flow Architecture
Multiple Instruction, Multiple Data (MIMD)
- Currently the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, computer clusters, multiprocessor SMP machines (including some types of PCs); a minimal sketch follows below
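A minimal MIMD-style sketch using POSIX threads (one possible choice, not prescribed by the slides): each thread runs a different function, i.e. its own instruction stream, on its own array, i.e. its own data stream. Compile with -pthread.

```c
#include <stdio.h>
#include <pthread.h>

static void *sum_task(void *arg) {          /* instruction stream 1 */
    int *v = arg, s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("sum thread:     %d\n", s);
    return NULL;
}

static void *product_task(void *arg) {      /* instruction stream 2 */
    int *v = arg, p = 1;
    for (int i = 0; i < 4; i++) p *= v[i];
    printf("product thread: %d\n", p);
    return NULL;
}

int main(void) {
    int data_a[4] = {1, 2, 3, 4};           /* data stream 1 */
    int data_b[4] = {5, 6, 7, 8};           /* data stream 2 */
    pthread_t t1, t2;

    pthread_create(&t1, NULL, sum_task, data_a);
    pthread_create(&t2, NULL, product_task, data_b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```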
Earth Simulator Center – Yokohama, NEC SX (MIMD)

The Earth Simulator is a project to develop a 40 TFLOPS system for climate modeling. It performs at 35.86 TFLOPS.

The ES is based on:
- 5,120 (640 8-way nodes) 500 MHz NEC CPUs
- 8 GFLOPS per CPU (41 TFLOPS total)
- 2 GB RAM per CPU (10 TB total)
- Shared memory inside each node
- 640 × 640 crossbar switch between the nodes
- 16 GB/s inter-node bandwidth
What about Memory?
- The interface between CPUs and memory in parallel machines is of crucial importance
- The bottleneck on the bus between memory and CPU is known as the von Neumann bottleneck
- It limits how fast a machine can operate, by setting the relationship between computation and communication
Communication in Parallel Machines
- Programs act on data.
- Quite important: how do processors access each other's data?

[Diagram: the Message Passing Model (each CPU with its own memory, connected by a network) versus the Shared Memory Model (several CPUs sharing one memory)]
Shared Memory
- Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space
- Multiple processors can operate independently but share the same memory resources
- Changes in a memory location made by one processor are visible to all other processors (see the sketch below)
- Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA
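A minimal shared-memory sketch using OpenMP, one of the programming models listed in the summary slide (the thread count of 4 is an arbitrary choice): all threads write into the same array through a single global address space, and after the parallel region every update is visible. Compile with -fopenmp.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int shared_data[8] = {0};   /* one global address space, seen by all threads */

    /* Each thread fills its slice of the same array: no copies, no messages. */
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        shared_data[2 * id]     = id;
        shared_data[2 * id + 1] = 10 * id;
    }   /* implicit barrier: all writes are visible from here on */

    for (int i = 0; i < 8; i++)
        printf("shared_data[%d] = %d\n", i, shared_data[i]);
    return 0;
}
```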
Shared Memory (2)
[Diagram: a single 4-processor machine in which all CPUs share one memory (UMA: Uniform Memory Access), versus a 3-processor NUMA machine in which each CPU has its own memory, joined by a fast memory interconnect (NUMA: Non-Uniform Memory Access)]
Uniform Memory Access (UMA)
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA (Cache Coherent UMA): if one processor updates a location in shared memory, all the other processors know about the update; cache coherency is accomplished at the hardware level
- Very hard to scale
Non-Uniform Memory Access (NUMA)
- Often made by physically linking two or more SMPs; one SMP can directly access the memory of another SMP
- Not all processors have equal access time to all memories
- Sometimes called DSM – Distributed Shared Memory

Advantages:
- User-friendly programming perspective on memory
- Data sharing between tasks is both fast and uniform due to the proximity of memory and CPUs
- More scalable than SMPs

Disadvantages:
- Lack of scalability between memory and CPUs
- The programmer is responsible for the synchronization constructs that ensure "correct" access to global memory (a minimal sketch follows below)
- Expensive: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever-increasing numbers of processors
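A minimal sketch of the synchronization responsibility mentioned above, using a POSIX mutex (an assumed choice): without the lock, concurrent increments of the shared counter could interleave and lose updates. Compile with -pthread.

```c
#include <stdio.h>
#include <pthread.h>

static long counter = 0;                                  /* shared global memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* programmer-supplied  */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                 /* protected access to the shared location */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter); /* 200000 with the lock; unpredictable without it */
    return 0;
}
```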
UMA and NUMA
The new Mac Pro features 2 Intel Core 2 Duo processors that share a common central memory (up to 16 GB).

SGI Origin 3900:
- 16 R14000A processors per brick, each brick with 32 GB of RAM
- 12.8 GB/s aggregated memory bandwidth
- Scales up to 512 processors and 1 TB of memory
Distributed Memory (DM)
- Processors have their own local memory
- Memory addresses in one processor do not map to another processor (no global address space)
- Because each processor has its own local memory, cache coherency does not apply
- Requires a communication network to connect inter-processor memory
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated (a minimal MPI sketch follows below)
- Synchronization between tasks is the programmer's responsibility
- Very scalable
- Cost effective: uses off-the-shelf processors and networking
- Slower than UMA and NUMA machines
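A minimal distributed-memory sketch using MPI, the programming model the summary slide associates with this class of machines: the value exists only in rank 0's local memory until the programmer explicitly sends it to rank 1. Run, for example, with mpirun -np 2.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                            /* exists only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value); /* now a local copy on rank 1 */
    }

    MPI_Finalize();
    return 0;
}
```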
Distributed Memory
[Diagram: several computers, each with its own CPU and memory, joined by a network interconnect]

TITAN@DEI, a PC cluster interconnected by FastEthernet
Hybrid Architectures
- Today, most systems are hybrids featuring both shared and distributed memory:
  - Each node has several processors that share a central memory
  - A fast switch interconnects the several nodes
  - In some cases the interconnect allows memory to be mapped among nodes; in most cases it provides a message-passing interface (see the hybrid sketch below)

[Diagram: four nodes, each with several CPUs sharing a local memory, connected by a fast network interconnect]
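A minimal hybrid sketch combining MPI between nodes with OpenMP threads inside each node (the process and thread counts come from the launcher and OMP_NUM_THREADS, both assumed here). No MPI calls are made inside the parallel region, so plain MPI_Init is sufficient. Compile with an MPI compiler wrapper plus -fopenmp.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank;

    MPI_Init(&argc, &argv);                 /* message passing between nodes  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel                    /* shared memory inside each node */
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```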
ASCI White at the
Lawrence Livermore National Laboratory
- Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e. 16 processors/node)
- Each node has 16 GB of memory
- A total of 512 nodes, interconnected by a 2 GB/s node-to-node network
- The 512 nodes feature a total of 8,192 processors and a total of 8,192 GB of memory
- It currently operates at 13.8 TFLOPS
Summary
Architecture: CC-UMA
- Examples: SMPs, Sun Vexx, SGI Challenge, IBM Power3
- Programming: MPI, Threads, OpenMP, Shmem
- Scalability: <10 processors
- Drawbacks: limited memory bandwidth; hard to scale
- Software availability: great

Architecture: CC-NUMA
- Examples: SGI Origin, HP Exemplar, IBM Power4
- Programming: MPI, Threads, OpenMP, Shmem
- Scalability: <1000 processors
- Drawbacks: new architecture
- Software availability: great

Architecture: Distributed/Hybrid
- Examples: Cray T3E, IBM SP2
- Programming: MPI
- Scalability: ~1000 processors
- Drawbacks: point-to-point communication; costly system administration; programming is hard to develop and maintain
- Software availability: limited