Chapter 7
Multicores,
Multiprocessors, and
Clusters
Objectives
The student shall be able to:
Define parallel processing, multicore, cluster, vector processing.
Define SISD, SIMD, MISD, MIMD.
Define multithreading, hardware multithreading, coarse-grained multithreading, fine-grained multithreading, simultaneous multithreading.
Draw network configurations: bus, ring, mesh, cube.
Define how vector processing works: how instructions may look, and how they may be processed.
Define GPU and name 3 characteristics that differentiate a CPU and a GPU.
Chapter 7 — Multicores, Multiprocessors, and Clusters — 2
What We’ve Already Covered
§4.10: Parallelism and Advanced Instruction-Level Parallelism
  Pipelines and Multiple Issue
§5.8: Parallelism and Memory Hierarchies
  Associative Memory, Interleaved Memory
§6.9: Parallelism and I/O
  Redundant Arrays of Inexpensive Disks
Chapter 7 — Multicores, Multiprocessors, and Clusters — 3
Parallel Processing
[Diagram: example concurrent tasks — Microsoft Word running Editor, SpellCheck, and GrammarCheck threads, alongside independent Matrix Multiply and Backup jobs]
Chapter 7 — Multicores, Multiprocessors, and Clusters — 4
Parallel Programming
Main:
    Factor factor;
    cout << "Run Factor " << total << ":"
         << numChild << endl;
    // Spawn children
    for (i=0; i<numChild; i++) {
        if (fork() == 0)                  // child process
            factor.child(begin, begin+range);
        begin += range + 1;               // parent moves on to the next range
    }
    // Wait for children to finish
    for (i=0; i<numChild; i++)
        wait(&stat);
    cout << "All Children Done: "
         << numChild << endl;

Factor::child(int begin, int end) {
    int val, i;
    for (val=begin; val<end; val++) {
        for (i=2; i<=val/2; i++)          // trial division
            if (val % i == 0) break;
        if (i > val/2)                    // no divisor found: val is prime
            cout << "Factor:" << val << endl;
    }
    exit(0);                              // child terminates here
}
Chapter 7 — Multicores, Multiprocessors, and Clusters — 5
§7.1 Introduction
Introduction
Goal: connecting multiple computers to get higher performance
  Multiprocessors
  Scalability, availability, power efficiency
Job-level (process-level) parallelism
  High throughput for independent jobs
Parallel processing program
  Single program run on multiple processors
Multicore microprocessors
  Chips with multiple processors (cores)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 6
Hardware and Software
Hardware
  Serial: e.g., Pentium 4
  Parallel: e.g., quad-core Xeon e5345
Software
  Sequential: e.g., traditional program
  Concurrent: e.g., operating system
Sequential/concurrent software can run on serial/parallel hardware
Challenge: making effective use of parallel hardware
Chapter 7 — Multicores, Multiprocessors, and Clusters — 7
Amdahl’s Law
Sequential part can limit speedup
Example: 100 processors, 90× speedup?
  Tnew = Tparallelizable/100 + Tsequential
  Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  Solving: Fparallelizable = 0.999
Need sequential part to be 0.1% of original time
Chapter 7 — Multicores, Multiprocessors, and Clusters — 8
§7.3 Shared Memory Multiprocessors
Shared Memory
SMP: shared memory multiprocessor
  Hardware provides single physical address space for all processors
  Synchronize shared variables using locks
  Memory access time: UMA (uniform) vs. NUMA (nonuniform)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 9
Example: Sum Reduction
half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
Chapter 7 — Multicores, Multiprocessors, and Clusters — 10
§7.4 Clusters and Other Message-Passing Multiprocessors
Cluster - Message Passing
Each processor (or computer) has private physical address space
Hardware sends/receives messages between processors
Chapter 7 — Multicores, Multiprocessors, and Clusters — 11
Loosely Coupled Clusters
Network of independent computers
  Each has private memory and OS
  Connected using I/O system, e.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
  Web servers, databases, simulations, …
  High availability, scalable, affordable
Problems
  Administration cost (prefer virtual machines)
  Low interconnect bandwidth
    c.f. processor/memory bandwidth on an SMP
Chapter 7 — Multicores, Multiprocessors, and Clusters — 12
Grid Computing
Separate computers interconnected by
long-haul networks
E.g., Internet connections
Work units farmed out, results sent back
Can make use of idle time on PCs
E.g., SETI@home, World Community Grid
Chapter 7 — Multicores, Multiprocessors, and Clusters — 13
Multithreading
Hardware Multithreading: each thread has its own register file and PC
Fine-Grained (interleaved): switches between threads on each instruction
Coarse-Grained: switches threads when a stall is required (memory access, wait)
Simultaneous Multithreading: uses dynamic scheduling to schedule multiple threads simultaneously
Chapter 7 — Multicores, Multiprocessors, and Clusters — 14
§7.5 Hardware Multithreading
Multithreading
Performing multiple threads of execution in parallel
  Replicate registers, PC, etc.
  Fast switching between threads
Fine-grain multithreading
  Switch threads after each cycle
  Interleave instruction execution
  If one thread stalls, others are executed
Coarse-grain multithreading
  Only switch on long stall (e.g., L2-cache miss)
  Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 15
Simultaneous Multithreading
In a multiple-issue, dynamically scheduled processor
  Schedule instructions from multiple threads
  Instructions from independent threads execute when function units are available
  Within threads, dependencies handled by scheduling and register renaming
Example: Intel Pentium-4 HT
  Two threads: duplicated registers, shared function units and caches
Chapter 7 — Multicores, Multiprocessors, and Clusters — 16
Multithreading Example
Chapter 7 — Multicores, Multiprocessors, and Clusters — 17
Future of Multithreading
Will it survive? In what form?
Power considerations simplified microarchitectures
  Simpler forms of multithreading
Tolerating cache-miss latency
  Thread switch may be most effective
Multiple simple cores might share resources more effectively
Chapter 7 — Multicores, Multiprocessors, and Clusters — 18
Multithreading Lab
In /home/student/Classes/Cs355/PrimeLab are 2 files: Factor.cpp and runFactor.
Copy them over to one of your directories (below called mydirectory):
    cp Factor.cpp ~/mydirectory
    cp runFactor ~/mydirectory
    cd ~/mydirectory
You want to observe the processor utilization (how busy the processor is).
    Linux: Applications->System Tools->System Monitor->Resources
Now compile Factor.cpp into executable Factor and run it using the command file runFactor (in Linux):
    g++ Factor.cpp -o Factor
    ./runFactor
The file time.dat will contain the start and end time of the program, so you can calculate the duration.
Now change the number of threads in Factor.cpp: numChild. Recompile and run.
Create a matrix in Microsoft Excel or OpenOffice, with one column containing the number of children and a second column containing the seconds to complete. Label the second column: Delay.
    Linux: Applications->Office->LibreOffice Calc to show efficiency of multiple threads.
Find the times for a range of thread counts: 1, 2, 3, 4, 5, 6, 10, 20 (whatever you have time for).
Have your spreadsheet draw a graph with your data.
Show me your data.
Chapter 7 — Multicores, Multiprocessors, and Clusters — 19
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams
An alternate classification:

                                    Data Streams
                                    Single                  Multiple
Instruction Streams   Single        SISD:                   SIMD:
                                    Intel Pentium 4         SSE instructions of x86
                      Multiple      MISD:                   MIMD:
                                    No examples today       Intel Xeon e5345

SPMD: Single Program Multiple Data
  A parallel program on a MIMD computer
  Conditional code for different processors
Chapter 7 — Multicores, Multiprocessors, and Clusters — 20
SIMD
Operate elementwise on vectors of data
  E.g., MMX and SSE instructions in x86
  Multiple data elements in 128-bit wide registers
All processors execute the same instruction at the same time
  Each with different data address, etc.
Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications
Chapter 7 — Multicores, Multiprocessors, and Clusters — 21
Vector Processors
Highly pipelined function units
Stream data from/to vector registers to units
Data collected from memory into registers
Results stored from registers to memory
Example: Vector extension to MIPS
32 × 64-element registers (64-bit elements)
Vector instructions
lv, sv: load/store vector
addv.d: add vectors of double
addvs.d: add scalar to each element of vector of double
Significantly reduces instruction-fetch bandwidth
Chapter 7 — Multicores, Multiprocessors, and Clusters — 22
Example: DAXPY (Y = a × X + Y)
Conventional MIPS code:
        l.d    $f0,a($sp)       ;load scalar a
        addiu  r4,$s0,#512      ;upper bound of what to load
  loop: l.d    $f2,0($s0)       ;load x(i)
        mul.d  $f2,$f2,$f0      ;a × x(i)
        l.d    $f4,0($s1)       ;load y(i)
        add.d  $f4,$f4,$f2      ;a × x(i) + y(i)
        s.d    $f4,0($s1)       ;store into y(i)
        addiu  $s0,$s0,#8       ;increment index to x
        addiu  $s1,$s1,#8       ;increment index to y
        subu   $t0,r4,$s0       ;compute bound
        bne    $t0,$zero,loop   ;check if done
Vector MIPS code:
        l.d     $f0,a($sp)      ;load scalar a
        lv      $v1,0($s0)      ;load vector x
        mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
        lv      $v3,0($s1)      ;load vector y
        addv.d  $v4,$v2,$v3     ;add y to product
        sv      $v4,0($s1)      ;store the result
Chapter 7 — Multicores, Multiprocessors, and Clusters — 23
Vector vs. Scalar
Vector architectures and compilers
  Simplify data-parallel programming
  Speed up processing since no loops
  No data hazard within a vector instruction
  Avoid control hazards by avoiding loops
  Benefit from interleaved and burst memory
Chapter 7 — Multicores, Multiprocessors, and Clusters — 24
Multimedia Improvements
Intel X86 (e.g., 80386) Architecture
MMX: MultiMedia Extensions
SSE: Streaming SIMD Extensions
[Diagram: one 32-bit ALU subdivided into two 16-bit units or four 8-bit units]
A register can be subdivided into smaller
units … or extended and subdivided
Chapter 7 — Multicores, Multiprocessors, and Clusters — 25
§7.8 Introduction to Multiprocessor Network Topologies
Interconnection Networks
Network topologies: arrangements of processors, switches, and links
  Bus
  Ring
  2D Mesh
  N-cube (N = 3)
  Fully connected
Chapter 7 — Multicores, Multiprocessors, and Clusters — 26
Multistage Networks
Chapter 7 — Multicores, Multiprocessors, and Clusters — 27
Network Characteristics
Performance
Latency (delay) per message
Throughput: messages/second
Congestion delays (depending on traffic)
Cost
Power
Routability in silicon
Chapter 7 — Multicores, Multiprocessors, and Clusters — 28
§7.7 Introduction to Graphics Processing Units
History of GPUs
Graphics Processing Units
  Processors oriented to 3D graphics tasks
  Vertex/pixel processing, shading, texture mapping, rasterization
Architecture
  GPU memory optimized for bandwidth, not latency
  Smaller memories, no multilevel cache
  Wider DRAM chips
  Hundreds or thousands of threads, simultaneous execution
  Parallel processing: SIMD + scalar
  No double precision floating point
Chapter 7 — Multicores, Multiprocessors, and Clusters — 29
Graphics in the System
Chapter 7 — Multicores, Multiprocessors, and Clusters — 30
GPU Architectures
Processing is highly data-parallel
GPUs are highly multithreaded
Use thread switching to hide memory latency
Graphics memory is wide and high-bandwidth
Trend toward general purpose GPUs
Less reliance on multi-level caches
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language
(HLSL)
Compute Unified Device Architecture (CUDA)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 31
Example: NVIDIA Tesla
[Diagram: a streaming multiprocessor built from 8 streaming processors]
Chapter 7 — Multicores, Multiprocessors, and Clusters — 32
Example: NVIDIA Tesla
Streaming Processors (SP)
  Single-precision FP and integer units
  Each SP is fine-grained multithreaded
Warp: group of 32 threads
  Executed in parallel, SIMD (or SPMD) style
  8 SPs × 4 clock cycles
  Hardware contexts for 24 warps
    Registers, PCs, …
Chapter 7 — Multicores, Multiprocessors, and Clusters — 33
Classifying GPUs
Don't fit nicely into SIMD/MIMD model
  Conditional execution in a thread allows an illusion of MIMD
    But with performance degradation
    Need to write general purpose code with care

                                       Instruction-Level     Data-Level
                                       Parallelism           Parallelism
Static: Discovered at Compile Time     VLIW                  SIMD or Vector
Dynamic: Discovered at Runtime         Superscalar           Tesla Multiprocessor
Chapter 7 — Multicores, Multiprocessors, and Clusters — 34
Roofline Diagram
Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
Chapter 7 — Multicores, Multiprocessors, and Clusters — 35
Optimizing Performance
Choice of optimization depends on arithmetic intensity of code
Arithmetic intensity is not always fixed
  May scale with problem size
Caching reduces memory accesses
  Increases arithmetic intensity
Chapter 7 — Multicores, Multiprocessors, and Clusters — 36
Comparing Systems
Example: Opteron X2 vs. Opteron X4
2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
Same memory system
To get higher performance on X4 than X2
  Need high arithmetic intensity
  Or working set must fit in X4's 2MB L3 cache
Chapter 7 — Multicores, Multiprocessors, and Clusters — 37
Optimizing Performance
Optimize FP performance
  Balance adds & multiplies
  Improve superscalar ILP and use of SIMD instructions
Optimize memory usage
  Software prefetch
    Avoid load stalls
  Memory affinity
    Avoid non-local data accesses
Chapter 7 — Multicores, Multiprocessors, and Clusters — 38
§7.11 Real Stuff: Benchmarking Four Multicores …
Four Example Systems
2 × quad-core Intel Xeon e5345 (Clovertown)
  Chipset = Bus
  Fully-Buffered DRAM DIMMs
2 × quad-core AMD Opteron X4 2356 (Barcelona)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 39
Four Example Systems
2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
  Fine-grained multithreading
2 × oct-core IBM Cell QS20
  SPE = Synergistic Processing Element
  SPEs have a SIMD instruction set
Chapter 7 — Multicores, Multiprocessors, and Clusters — 40
Pitfalls
Not developing the software to take account of a multiprocessor architecture
Example: using a single lock for a shared composite resource
  Serializes accesses, even if they could be done in parallel
  Use finer-granularity locking
Chapter 7 — Multicores, Multiprocessors, and Clusters — 41
§7.13 Concluding Remarks
Concluding Remarks
Goal: higher performance by using multiple processors
Difficulties
  Developing parallel software
  Devising appropriate architectures
Many reasons for optimism
  Changing software and application environment
  Chip-level multiprocessors with lower latency, higher bandwidth interconnect
An ongoing challenge for computer architects!
Chapter 7 — Multicores, Multiprocessors, and Clusters — 42