Chapter 5 Array Processors

Introduction
Major characteristics of SIMD architectures
– A single control processor (CP)
– A synchronous array of processing elements (PEs)
– Data-parallel architectures
– Hardware-intensive architectures
– Interconnection network
Associative Processor
An SIMD whose main component is an associative memory (Figure 2.19)
AM (Associative Memory): Figure 2.18
– Used in fast search operations
– Data register
– Mask register
– Word selector
– Result register
Introduction (continued)
Associative processor architectures
also belong to the SIMD
classification.
– STARAN
– Goodyear Aerospace's MPP (massively parallel processor)
The systolic architectures are a
special type of synchronous array
processor architecture.
5.1 SIMD Organization
Figure 5.1 shows a SIMD processing
model. (Compare to Figure 4.1)
 Example 5.1
– SIMDs offer an N-fold throughput
enhancement over SISD provided the
application exhibits a data-parallelism
of degree N.
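As a rough sketch of this point (the array size N and the values below are illustrative, not from the text), the element-wise vector addition in the following C fragment has no dependence between iterations; an SIMD machine with N PEs could assign element i to PE i and perform all N additions in one lockstep step, whereas an SISD processor needs N steps.

#include <stdio.h>

#define N 8   /* degree of data parallelism (illustrative) */

int main(void) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 10 * i; }

    /* Each iteration is independent of the others, so an SIMD machine
     * with N PEs can assign element i to PE i and perform all N
     * additions in a single lockstep step, giving the N-fold
     * throughput enhancement over an SISD processor. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("c[%d] = %d\n", i, c[i]);
    return 0;
}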
5.1 SIMD Organization
(continued)
Memory
– Data are distributed among the memory
blocks
– A data alignment network allows any
data memory to be accessed by any PE.
5.1 SIMD Organization
(continued)
Control processor
– To fetch instructions and decode them
– To transfer instructions to PEs for execution
– To perform all address computations
– To retrieve some data elements from
the memory
– To broadcast them to all PEs as
required.
5.1 SIMD Organization
(continued)

Arithmetic/Logic processors
– To perform the arithmetic and logical
operations on the data
– Each PE corresponds to the data paths and arithmetic/logic units of an SISD processor, capable of responding to control signals from the control unit.
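To make this division of labor concrete, here is a minimal sketch in C, assuming a made-up two-operation instruction set: the control processor fetches and decodes each instruction and broadcasts it, and every PE applies the same operation to its own local data in lockstep. It only illustrates the organization described above; it is not code from the text.

#include <stdio.h>

#define NUM_PE 4

typedef enum { OP_ADD, OP_MUL, OP_HALT } opcode_t;

/* Local state of one arithmetic/logic processor (PE). */
typedef struct { int acc; int operand; } pe_t;

int main(void) {
    pe_t pe[NUM_PE];
    for (int i = 0; i < NUM_PE; i++) { pe[i].acc = i; pe[i].operand = 2; }

    /* A tiny "program" held by the control processor. */
    opcode_t program[] = { OP_ADD, OP_MUL, OP_HALT };

    /* Control processor: fetch/decode each instruction, then broadcast
     * it; every PE executes the same operation on its own data. */
    for (int pc = 0; program[pc] != OP_HALT; pc++) {
        opcode_t op = program[pc];                 /* fetch + decode */
        for (int i = 0; i < NUM_PE; i++) {         /* lockstep execute */
            if (op == OP_ADD) pe[i].acc += pe[i].operand;
            if (op == OP_MUL) pe[i].acc *= pe[i].operand;
        }
    }

    for (int i = 0; i < NUM_PE; i++)
        printf("PE%d acc = %d\n", i, pe[i].acc);
    return 0;
}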
5.1 SIMD Organization
(continued)

Interconnection network (Refer to
Figure 2.9)
– In type 1 and type 2 SIMD architectures, the PE-to-memory interconnection is through an n x n switch.
– In type 3, there is no PE-to-PE interconnection network; there is an n x n alignment switch between the PEs and the memory blocks.
5.1 SIMD Organization
(continued)
Registers, instruction set,
performance considerations
– The instruction set contains two types
of index manipulation instructions, one
set for global registers and the other for
local registers
5.2 Data Storage Techniques
and Memory Organization
Straight storage / skewed storage
GCD
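A minimal sketch of the two storage schemes, assuming an N x N matrix spread over N memory blocks: with straight storage every element of a column falls in the same block, so a column access is serialized on one block, whereas skewing row i by i positions spreads both rows and columns across all N blocks. (A skew distance gives conflict-free access when it is relatively prime to the number of blocks, i.e., their GCD is 1, which is presumably what the GCD item above refers to.)

#include <stdio.h>

#define N 4   /* N x N matrix stored across N memory blocks */

/* Straight storage: element (i,j) goes to memory block j. */
int straight_block(int i, int j) { (void)i; return j; }

/* Skewed storage: row i is rotated by i, so element (i,j) goes to
 * block (i + j) mod N.  A column then touches N distinct blocks,
 * so both row and column accesses are conflict-free. */
int skewed_block(int i, int j) { return (i + j) % N; }

static void show(const char *name, int (*block)(int, int)) {
    printf("%s: blocks used by column 0:", name);
    for (int i = 0; i < N; i++) printf(" %d", block(i, 0));
    printf("\n");
}

int main(void) {
    show("straight", straight_block);  /* all accesses hit block 0 */
    show("skewed  ", skewed_block);    /* accesses spread over all blocks */
    return 0;
}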
5.3 Interconnection
Networks
Terminology and performance measures
– Nodes
– Links
– Messages
– Paths: dedicated / shared
– Switches
– Direct (or indirect) message transfer
– Centralized (or decentralized) indirect message transfer
5.3 Interconnection
Networks (continued)
Terminology and performance measures
– Performance measures
• Connectivity
• Bandwidth
• Latency
• Average distance
• Hardware complexity
• Cost
• Place modularity
• Regularity
• Reliability and fault tolerance
• Additional functionality
5.3 Interconnection
Networks (continued)
Terminology and performance measures
– Design choices (by Feng): refer to Figure 5.9
• Switching mode
• Control strategy
• Topology
• Mode of operation
5.3 Interconnection
Networks (continued)
Routing protocols
– Circuit switching
– Packet switching
– Wormhole switching
Routing mechanism
– Static / dynamic
Switching setting functions
– Centralized / distributed
5.3 Interconnection
Networks (continued)
Static topologies
– Linear array and ring
– Two-dimensional mesh
– Star
– Binary tree
– Complete interconnection
– Hypercube
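As a small illustration of two of the topologies listed above (the sizes are arbitrary), the C sketch below prints each node's neighbors in a ring and in a d-dimensional hypercube; hypercube neighbors are exactly the nodes whose binary addresses differ in one bit.

#include <stdio.h>

#define DIM 3                 /* 3-dimensional hypercube: 8 nodes */
#define NODES (1 << DIM)

int main(void) {
    for (int node = 0; node < NODES; node++) {
        /* Ring neighbors: the two adjacent node numbers (mod NODES). */
        int left  = (node + NODES - 1) % NODES;
        int right = (node + 1) % NODES;

        printf("node %d: ring(%d,%d), hypercube(", node, left, right);

        /* Hypercube neighbors: flip one address bit per dimension. */
        for (int d = 0; d < DIM; d++)
            printf("%d%s", node ^ (1 << d), d == DIM - 1 ? ")\n" : ",");
    }
    return 0;
}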
5.3 Interconnection
Networks (continued)
Dynamic topologies
– Bus networks
– Crossbar network
– Switching networks
• Perfect shuffle
– Single stage
– Multistage
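A minimal sketch of the perfect-shuffle permutation on 2^n inputs: the destination of input i is obtained by rotating its n-bit address left by one position. A single-stage network applies this shuffle (together with exchange switches) repeatedly, while a multistage network cascades log2 N shuffle-exchange stages.

#include <stdio.h>

#define BITS 3                 /* 8 inputs */
#define N    (1 << BITS)

/* Perfect shuffle: rotate the BITS-bit address of i left by one.
 * Example for 8 inputs: 0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7. */
int shuffle(int i) {
    int msb = (i >> (BITS - 1)) & 1;
    return ((i << 1) & (N - 1)) | msb;
}

int main(void) {
    for (int i = 0; i < N; i++)
        printf("input %d -> output %d\n", i, shuffle(i));
    return 0;
}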
5.4 Performance Evaluation
and Scalability
 The speedup S of a parallel computer
system:
S = sequential execution time / parallel execution time
Theoretically, the maximum speedup possible with a p-processor system is p. (A superlinear speedup is an exception.)
– Maximum speedup is not possible in practice,
because all the processors in the system
cannot be kept busy performing useful
computations all the time.
5.4 Performance Evaluation
and Scalability (continued)
 The timing diagram of Figure 5.20
illustrates the operation of a typical SIMD
system.
Efficiency, E, is a measure of the fraction of the time that the processors are busy. In Figure 5.20, s is the fraction of the time spent in serial code, and 0 ≤ E ≤ 1.

E = s(1/p) + (1 - s) = 1 - s(1 - 1/p)
5.4 Performance Evaluation
and Scalability (continued)
The serial execution time in Figure 5.20 is one unit. If the code that can be run in parallel takes N time units on a single-processor system, then

s = 1 / (1 + N/p) = p / (p + N)

The efficiency is also defined as

E = S / p
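As a small numerical check of these formulas (the values of N and p are arbitrary), the sketch below computes the serial fraction s, the efficiency E, and the resulting speedup S = E * p for a job consisting of 1 serial time unit plus N parallelizable time units.

#include <stdio.h>

int main(void) {
    double N = 99.0;            /* parallelizable work, in time units */
    int    procs[] = { 1, 4, 16, 64 };

    for (int k = 0; k < 4; k++) {
        double p = procs[k];
        double s = p / (p + N);                 /* serial fraction of the parallel run time */
        double E = 1.0 - s * (1.0 - 1.0 / p);   /* efficiency */
        double S = E * p;                       /* speedup, since E = S / p */
        printf("p=%3.0f  s=%.3f  E=%.3f  S=%.2f\n", p, s, E, S);
    }
    return 0;
}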
5.4 Performance Evaluation
and Scalability (continued)
The cost is the product of the
parallel run time and the number of
processors.
– Cost optimal: a parallel system is cost optimal if its cost is proportional to the execution time of the fastest sequential algorithm.
Scalability is a measure of a parallel system's ability to increase speedup as the number of processors increases.
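A standard illustration of cost optimality (not taken from this chapter): adding n numbers with n processors arranged as a tree takes time proportional to log n, so the cost grows as n log n, while the fastest sequential algorithm takes time proportional to n; that design is therefore not cost optimal. Using only n/log n processors, each first summing log n numbers locally before the tree reduction, still gives time proportional to log n but reduces the cost to order n, which is cost optimal.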
5.5 Programming SIMDs
The SIMD instruction set contains additional instructions for interconnection network (IN) operations, for manipulating local and global registers, and for setting activity bits based on data conditions.
 Popular high-level languages such as
FORTRAN, C, and LISP have been
extended to allow data-parallel
programming on SIMDs.
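The sketch below uses plain C as a stand-in for such a data-parallel extension (the dialect syntax itself is not shown here): an activity bit is set in each PE from a condition on its own data, and a subsequently broadcast operation is carried out only by the PEs whose activity bit is set.

#include <stdio.h>

#define NUM_PE 8

int main(void) {
    int data[NUM_PE]   = { 5, -3, 7, 0, -8, 2, -1, 9 };
    int active[NUM_PE];                 /* activity bit per PE */

    /* "where (data < 0)": each PE sets its activity bit from its own data. */
    for (int i = 0; i < NUM_PE; i++)
        active[i] = (data[i] < 0);

    /* Broadcast operation: only active PEs update their data (negate);
     * the others sit the step out. */
    for (int i = 0; i < NUM_PE; i++)
        if (active[i])
            data[i] = -data[i];

    for (int i = 0; i < NUM_PE; i++)
        printf("%d ", data[i]);
    printf("\n");                        /* 5 3 7 0 8 2 1 9 */
    return 0;
}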
5.6 Example
Systems
 ILLIAC-IV
– The ILLIAC-IV project was started in 1966 at
the University of Illinois.
– A system with 256 processors controlled by a
CP was envisioned.
– The set of processors was divided into four
quadrants of 64 processors.
– Figure 5.21 shows the system structure.
– Figure 5.22 shows the configuration of a
quadrant.
– The PE array is arranged as an 8x8 torus.
5.6 Example
Systems (continued)
CM-2
– The CM-2, introduced in 1987, is a
massively parallel SIMD machine.
– Table 5.1 summarizes its characteristics.
– Figure 5.23 shows the architecture of
CM-2.
5.6 Example
Systems (continued)
 CM-2
– Processors
• The 16 processors are connected by a 4x4 mesh.
(Figure 5.24)
• Figure 5.25 shows a processing cell.
– Hypercube
• The processors are linked by a 12-dimensional
hypercube router network.
• The following parallel communication operations act on elements of parallel variables: reduce & broadcast, grid (NEWS), general (send, get), scan, spread, sort.
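As an illustration of one of these operations, the sketch below computes an inclusive plus-scan (parallel prefix sum). The doubling-distance pattern shows why N elements need only log2 N lockstep rounds on N processors; the C code is merely a sequential simulation of the idea, not CM-2 code.

#include <stdio.h>

#define N 8

int main(void) {
    int x[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };

    /* Inclusive plus-scan: after the loop, x[i] holds x[0]+...+x[i].
     * Each round doubles the combining distance, so N elements need
     * only log2(N) rounds when each element has its own processor. */
    for (int d = 1; d < N; d *= 2)
        for (int i = N - 1; i >= d; i--)   /* downward so x[i-d] is not yet updated this round */
            x[i] += x[i - d];

    for (int i = 0; i < N; i++) printf("%d ", x[i]);   /* 1 3 6 10 15 21 28 36 */
    printf("\n");
    return 0;
}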
5.6 Example
Systems (continued)
CM-2
– Nexus
• A 4x4 crosspoint switch.
– Router
• Used to transmit data from one processor to another.
– NEWS Grid
• A two-dimensional mesh that allows
nearest-neighbor communication.
5.6 Example
Systems (continued)
 CM-2
– Input/Output system
• Each 8K-processor section is connected to one of the eight I/O channels (Figure 5.26).
• Data is passed along the channels to the I/O controller (Figure 5.27).
– Software
• Assembly language, Paris
• *LISP, CM-LISP, and *C
– Applications: refer to page 211.
5.6 Example
Systems (continued)
 MasPar MP
– The MasPar MP-1 is a data-parallel SIMD whose basic configuration consists of a data parallel unit (DPU) and a host workstation.
– The DPU consists of from 1,024 to 16,384 processing elements.
– The programming environment is UNIX-based. The programming languages are MPF (MasPar FORTRAN) and MPL (MasPar Programming Language).
5.6 Example
Systems (continued)
MasPar MP
– Hardware architecture
• The DPU consists of a PE array and an array control unit (ACU).
• The PE array (Figure 5.28) is configurable from 1 to 16 identical processor boards. Each processor board has 64 PE clusters (PECs) of 16 PEs per cluster. Each processor board thus contains 1,024 PEs.
5.7 Systolic Arrays
A systolic array is a special-purpose planar array of simple processors that features a regular, near-neighbor interconnection network.
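A minimal sketch of the systolic principle (illustrative only, not iWarp or ILLIAC code): a linear array of simple cells computes a matrix-vector product y = A x as the vector elements pulse through the array, one cell per beat, with each cell doing nothing more than a multiply-accumulate on the value passing through and handing that value to its right neighbor.

#include <stdio.h>

#define N 4   /* N cells, N x N matrix (illustrative sizes) */

int main(void) {
    int A[N][N] = { {1,2,3,4}, {5,6,7,8}, {9,10,11,12}, {13,14,15,16} };
    int x[N]    = { 1, 1, 1, 1 };
    int y[N]    = { 0 };          /* one accumulator per cell */
    int reg[N]  = { 0 };          /* the x value currently held in each cell */
    int valid[N] = { 0 };         /* whether the cell holds a real sample yet */

    /* One outer iteration is one systolic beat.  x enters at cell 0 and
     * marches one cell to the right every beat; each cell multiplies the
     * sample passing through by the matching matrix entry and accumulates. */
    for (int t = 0; t < 2 * N - 1; t++) {
        /* shift: cells pass their sample to the right neighbor */
        for (int c = N - 1; c > 0; c--) { reg[c] = reg[c - 1]; valid[c] = valid[c - 1]; }
        reg[0]   = (t < N) ? x[t] : 0;    /* feed the next sample (or nothing) */
        valid[0] = (t < N);

        /* compute: at beat t, cell c sees x[t - c], i.e. column (t - c) of A */
        for (int c = 0; c < N; c++)
            if (valid[c])
                y[c] += A[c][t - c] * reg[c];
    }

    for (int c = 0; c < N; c++) printf("y[%d] = %d\n", c, y[c]);   /* 10 26 42 58 */
    return 0;
}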
Figure 5.31 (iWarp system)
 iWarp (Intel 1991)
– Developed jointly by CMU and Intel Corp.
– A programmable systolic array
– Memory communication & systolic
communication
– The advantages of systolic communication
• Fine-grain communication
• Reduced access to local memory
• Increased instruction-level parallelism
• Reduced size of local memory
Figure 5.31 (iWarp system)
An iWarp system is made up of an array of iWarp cells.
Each iWarp cell consists of an iWarp component and a local memory.
The iWarp component contains independent communication and computation agents.