Chapter 7
Multicores,
Multiprocessors, and
Clusters
Objectives
The student shall be able to:
Define parallel processing, multicore, cluster, vector processing.
Define SISD, SIMD, MISD, MIMD.
Define multithreading, hardware multithreading, coarse-grained multithreading, fine-grained multithreading, simultaneous multithreading.
Draw network configurations: bus, ring, mesh, cube.
Define how vector processing works: how instructions may look, and how they may be processed.
Define GPU and name 3 characteristics that differentiate a CPU and a GPU.
Chapter 7 — Multicores, Multiprocessors, and Clusters — 2
What We’ve Already Covered
§4.10: Parallelism and Advanced Instruction-Level Parallelism
  Pipelines and Multiple Issue
§5.8: Parallelism and Memory Hierarchies
  Associative Memory, Interleaved Memory
§6.9: Parallelism and I/O
  Redundant Arrays of Inexpensive Disks
Chapter 7 — Multicores, Multiprocessors, and Clusters — 3
Parallel Processing
[Diagram: example concurrent tasks — Microsoft Word running Editor, SpellCheck, and GrammarCheck threads, alongside independent Matrix Multiply and Backup jobs]
Chapter 7 — Multicores, Multiprocessors, and Clusters — 4
Parallel Programming
Main:
    Factor factor;
    cout << "Run Factor " << total << ":"
         << numChild << endl;
    // Spawn children
    for (i=0; i<numChild; i++) {
        if (fork() == 0)                  // child process
            factor.child(begin, begin+range);
        begin += range + 1;               // parent moves on to the next range
    }
    // Wait for children to finish
    for (i=0; i<numChild; i++)
        wait(&stat);
    cout << "All Children Done: "
         << numChild << endl;

Factor::child(int begin, int end) {
    int val, i;
    for (val=begin; val<end; val++) {
        for (i=2; i<=val/2; i++)          // trial division
            if (val % i == 0) break;
        if (i > val/2)                    // no divisor found: val is prime
            cout << "Factor:" << val << endl;
    }
    exit(0);                              // child terminates here
}
Chapter 7 — Multicores, Multiprocessors, and Clusters — 5
§7.1 Introduction
Introduction
Goal: connecting multiple computers to get higher performance
  Multiprocessors
  Scalability, availability, power efficiency
Job-level (process-level) parallelism
  High throughput for independent jobs
Parallel processing program
  Single program run on multiple processors
Multicore microprocessors
  Chips with multiple processors (cores)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 6
Hardware and Software
Hardware
  Serial: e.g., Pentium 4
  Parallel: e.g., quad-core Xeon e5345
Software
  Sequential: e.g., traditional program
  Concurrent: e.g., operating system
Sequential/concurrent software can run on serial/parallel hardware
Challenge: making effective use of parallel hardware
Chapter 7 — Multicores, Multiprocessors, and Clusters — 7
Amdahl’s Law
Sequential part can limit speedup
Example: 100 processors, 90× speedup?
  Tnew = Tparallelizable/100 + Tsequential
  Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  Solving: Fparallelizable = 0.999
Need sequential part to be 0.1% of original time
Chapter 7 — Multicores, Multiprocessors, and Clusters — 8
§7.3 Shared Memory Multiprocessors
Shared Memory
SMP: shared memory multiprocessor
  Hardware provides single physical address space for all processors
  Synchronize shared variables using locks
  Memory access time: UMA (uniform) vs. NUMA (nonuniform)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 9
Example: Sum Reduction
half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
Chapter 7 — Multicores, Multiprocessors, and Clusters — 10
§7.4 Clusters and Other Message-Passing Multiprocessors
Cluster - Message Passing
Each processor (or computer) has private physical address space
Hardware sends/receives messages between processors
Chapter 7 — Multicores, Multiprocessors, and Clusters — 11
Loosely Coupled Clusters
Network of independent computers
  Each has private memory and OS
  Connected using I/O system, e.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
  Web servers, databases, simulations, …
  High availability, scalable, affordable
Problems
  Administration cost (prefer virtual machines)
  Low interconnect bandwidth
    c.f. processor/memory bandwidth on an SMP
Chapter 7 — Multicores, Multiprocessors, and Clusters — 12
Grid Computing
Separate computers interconnected by
long-haul networks
E.g., Internet connections
Work units farmed out, results sent back
Can make use of idle time on PCs
E.g., SETI@home, World Community Grid
Chapter 7 — Multicores, Multiprocessors, and Clusters — 13
Multithreading
Hardware Multithreading: each thread has its own register file and PC
Fine-Grained (interleaved): switches between threads on each instruction
Coarse-Grained: switches threads when a stall is required (memory access, wait)
Simultaneous Multithreading: uses dynamic scheduling to schedule multiple threads simultaneously
Chapter 7 — Multicores, Multiprocessors, and Clusters — 14
§7.5 Hardware Multithreading
Multithreading
Performing multiple threads of execution in parallel
  Replicate registers, PC, etc.
  Fast switching between threads
Fine-grain multithreading
  Switch threads after each cycle
  Interleave instruction execution
  If one thread stalls, others are executed
Coarse-grain multithreading
  Only switch on long stall (e.g., L2-cache miss)
  Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 15
Simultaneous Multithreading
In a multiple-issue, dynamically scheduled processor
  Schedule instructions from multiple threads
  Instructions from independent threads execute when function units are available
  Within threads, dependencies handled by scheduling and register renaming
Example: Intel Pentium-4 HT
  Two threads: duplicated registers, shared function units and caches
Chapter 7 — Multicores, Multiprocessors, and Clusters — 16
Multithreading Example
Chapter 7 — Multicores, Multiprocessors, and Clusters — 17
Future of Multithreading
Will it survive? In what form?
Power considerations simplified microarchitectures
  Simpler forms of multithreading
Tolerating cache-miss latency
  Thread switch may be most effective
Multiple simple cores might share resources more effectively
Chapter 7 — Multicores, Multiprocessors, and Clusters — 18
Multithreading Lab
In /home/student/Classes/Cs355/PrimeLab are 2 files: Factor.cpp and runFactor.
Copy them over to one of your directories (below called mydirectory):
    cp Factor.cpp ~/mydirectory
    cp runFactor ~/mydirectory
    cd ~/mydirectory
You want to observe the processor utilization (how busy the processor is).
    Linux: Applications->System Tools->System Monitor->Resources
Now compile Factor.cpp into executable Factor and run it using the command file runFactor (in Linux):
    g++ Factor.cpp -o Factor
    ./runFactor
The file time.dat will contain the start and end time of the program, so you can calculate the duration.
Now change the number of threads in Factor.cpp: numChild. Recompile and run.
Create a matrix in Microsoft Excel or OpenOffice, with one column containing the number of children and a second column containing the seconds to complete. Label the second column: Delay.
    Linux: Applications->Office->LibreOffice Calc to show efficiency of multiple threads.
Find the times for a range of thread counts: 1, 2, 3, 4, 5, 6, 10, 20 (whatever you have time for).
Have your spreadsheet draw a graph with your data.
Show me your data.
Chapter 7 — Multicores, Multiprocessors, and Clusters — 19
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Instruction and Data Streams
An alternate classification:

                                    Data Streams
                                    Single                  Multiple
Instruction Streams   Single        SISD:                   SIMD:
                                    Intel Pentium 4         SSE instructions of x86
                      Multiple      MISD:                   MIMD:
                                    No examples today       Intel Xeon e5345

SPMD: Single Program Multiple Data
  A parallel program on a MIMD computer
  Conditional code for different processors
Chapter 7 — Multicores, Multiprocessors, and Clusters — 20
SIMD
Operate elementwise on vectors of data
  E.g., MMX and SSE instructions in x86
  Multiple data elements in 128-bit wide registers
All processors execute the same instruction at the same time
  Each with different data address, etc.
Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications
Chapter 7 — Multicores, Multiprocessors, and Clusters — 21
Vector Processors
Highly pipelined function units
Stream data from/to vector registers to units
Data collected from memory into registers
Results stored from registers to memory
Example: Vector extension to MIPS
32 × 64-element registers (64-bit elements)
Vector instructions
lv, sv: load/store vector
addv.d: add vectors of double
addvs.d: add scalar to each element of vector of double
Significantly reduces instruction-fetch bandwidth
Chapter 7 — Multicores, Multiprocessors, and Clusters — 22
Example: DAXPY (Y = a × X + Y)
Conventional MIPS code:
        l.d    $f0,a($sp)       ;load scalar a
        addiu  r4,$s0,#512      ;upper bound of what to load
  loop: l.d    $f2,0($s0)       ;load x(i)
        mul.d  $f2,$f2,$f0      ;a × x(i)
        l.d    $f4,0($s1)       ;load y(i)
        add.d  $f4,$f4,$f2      ;a × x(i) + y(i)
        s.d    $f4,0($s1)       ;store into y(i)
        addiu  $s0,$s0,#8       ;increment index to x
        addiu  $s1,$s1,#8       ;increment index to y
        subu   $t0,r4,$s0       ;compute bound
        bne    $t0,$zero,loop   ;check if done
Vector MIPS code:
        l.d     $f0,a($sp)      ;load scalar a
        lv      $v1,0($s0)      ;load vector x
        mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
        lv      $v3,0($s1)      ;load vector y
        addv.d  $v4,$v2,$v3     ;add y to product
        sv      $v4,0($s1)      ;store the result
Chapter 7 — Multicores, Multiprocessors, and Clusters — 23
Vector vs. Scalar
Vector architectures and compilers
  Simplify data-parallel programming
  Speed up processing since no loops
  No data hazard within a vector instruction
  Avoid control hazards by avoiding loops
  Benefit from interleaved and burst memory
Chapter 7 — Multicores, Multiprocessors, and Clusters — 24
Multimedia Improvements
Intel X86 (e.g., 80386) Architecture
MMX: MultiMedia Extensions
SSE: Streaming SIMD Extensions
[Diagram: one 32-bit ALU subdivided into two 16-bit units or four 8-bit units]
A register can be subdivided into smaller
units … or extended and subdivided
Chapter 7 — Multicores, Multiprocessors, and Clusters — 25
§7.8 Introduction to Multiprocessor Network Topologies
Interconnection Networks
Network topologies: arrangements of processors, switches, and links
  Bus
  Ring
  2D Mesh
  N-cube (N = 3)
  Fully connected
Chapter 7 — Multicores, Multiprocessors, and Clusters — 26
Multistage Networks
Chapter 7 — Multicores, Multiprocessors, and Clusters — 27
Network Characteristics
Performance
Latency (delay) per message
Throughput: messages/second
Congestion delays (depending on traffic)
Cost
Power
Routability in silicon
Chapter 7 — Multicores, Multiprocessors, and Clusters — 28
§7.7 Introduction to Graphics Processing Units
History of GPUs
Graphics Processing Units
  Processors oriented to 3D graphics tasks
  Vertex/pixel processing, shading, texture mapping, rasterization
Architecture
  GPU memory optimized for bandwidth, not latency
  Smaller memories, no multilevel cache
  Wider DRAM chips
  Hundreds or thousands of threads, simultaneous execution
  Parallel processing: SIMD + scalar
  No double precision floating point
Chapter 7 — Multicores, Multiprocessors, and Clusters — 29
Graphics in the System
Chapter 7 — Multicores, Multiprocessors, and Clusters — 30
GPU Architectures
Processing is highly data-parallel
GPUs are highly multithreaded
Use thread switching to hide memory latency
Graphics memory is wide and high-bandwidth
Trend toward general purpose GPUs
Less reliance on multi-level caches
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language
(HLSL)
Compute Unified Device Architecture (CUDA)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 31
Example: NVIDIA Tesla
[Diagram: a streaming multiprocessor built from 8 streaming processors]
Chapter 7 — Multicores, Multiprocessors, and Clusters — 32
Example: NVIDIA Tesla
Streaming Processors (SP)
  Single-precision FP and integer units
  Each SP is fine-grained multithreaded
Warp: group of 32 threads
  Executed in parallel, SIMD (or SPMD) style
  8 SPs × 4 clock cycles
  Hardware contexts for 24 warps
    Registers, PCs, …
Chapter 7 — Multicores, Multiprocessors, and Clusters — 33
Classifying GPUs
Don't fit nicely into SIMD/MIMD model
  Conditional execution in a thread allows an illusion of MIMD
    But with performance degradation
    Need to write general purpose code with care

                                       Instruction-Level     Data-Level
                                       Parallelism           Parallelism
Static: Discovered at Compile Time     VLIW                  SIMD or Vector
Dynamic: Discovered at Runtime         Superscalar           Tesla Multiprocessor
Chapter 7 — Multicores, Multiprocessors, and Clusters — 34
Roofline Diagram
Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
Chapter 7 — Multicores, Multiprocessors, and Clusters — 35
Optimizing Performance
Choice of optimization depends on arithmetic intensity of code
Arithmetic intensity is not always fixed
  May scale with problem size
Caching reduces memory accesses
  Increases arithmetic intensity
Chapter 7 — Multicores, Multiprocessors, and Clusters — 36
Comparing Systems
Example: Opteron X2 vs. Opteron X4
2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
Same memory system
To get higher performance on X4 than X2
  Need high arithmetic intensity
  Or working set must fit in X4's 2MB L3 cache
Chapter 7 — Multicores, Multiprocessors, and Clusters — 37
Optimizing Performance
Optimize FP performance
  Balance adds & multiplies
  Improve superscalar ILP and use of SIMD instructions
Optimize memory usage
  Software prefetch
    Avoid load stalls
  Memory affinity
    Avoid non-local data accesses
Chapter 7 — Multicores, Multiprocessors, and Clusters — 38
§7.11 Real Stuff: Benchmarking Four Multicores …
Four Example Systems
2 × quad-core Intel Xeon e5345 (Clovertown)
  Chipset = Bus
  Fully-Buffered DRAM DIMMs
2 × quad-core AMD Opteron X4 2356 (Barcelona)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 39
Four Example Systems
2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
  Fine-grained multithreading
2 × oct-core IBM Cell QS20
  SPE = Synergistic Processing Element
  SPEs have a SIMD instruction set
Chapter 7 — Multicores, Multiprocessors, and Clusters — 40
Pitfalls
Not developing the software to take account of a multiprocessor architecture
Example: using a single lock for a shared composite resource
  Serializes accesses, even if they could be done in parallel
  Use finer-granularity locking
Chapter 7 — Multicores, Multiprocessors, and Clusters — 41
§7.13 Concluding Remarks
Concluding Remarks
Goal: higher performance by using multiple processors
Difficulties
  Developing parallel software
  Devising appropriate architectures
Many reasons for optimism
  Changing software and application environment
  Chip-level multiprocessors with lower latency, higher bandwidth interconnect
An ongoing challenge for computer architects!
Chapter 7 — Multicores, Multiprocessors, and Clusters — 42