
Chapter 7
Multicores, Multiprocessors, and Clusters
Starting from matrix multiplication…
• If one addition takes time ta and one multiplication takes time tm, then multiplying two n×n matrices takes [(n-1)·ta + n·tm] · n²
• Computational complexity: O(n³)
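For reference, a minimal C sketch of the algorithm this formula describes (the function name is illustrative): each of the n² output entries costs n multiplications and n-1 additions.

    /* Naive n x n matrix multiply, C = A * B (row-major).
     * Each C[i][j] takes n multiplies and n-1 adds, so the total
     * time is n^2 * (n*tm + (n-1)*ta), i.e. O(n^3). */
    void matmul(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double acc = A[i*n] * B[j];         /* k = 0: one multiply */
                for (int k = 1; k < n; k++)         /* n-1 multiply-adds */
                    acc += A[i*n + k] * B[k*n + j];
                C[i*n + j] = acc;
            }
    }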
Systolic array
Band matrix multiplication
• A(m×n) × B(n×p) = C(m×p)
• A(m×n), B(n×p), and C(m×p) are band matrices
Systolic array by H. T. Kung (1978)
§7.1 Introduction
• Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
• Job-level (process-level) parallelism
  - High throughput for independent jobs
• Parallel processing program
  - Single program run on multiple processors
• Multicore microprocessors
  - Chips with multiple processors (cores)
Hardware and Software
• Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
• Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
• Sequential/concurrent software can run on serial/parallel hardware
• Challenge: making effective use of parallel hardware
What We’ve Already Covered
• §2.11 Parallelism and Instructions: Synchronization
• §3.6 Parallelism and Computer Arithmetic: Associativity
• §4.10 Parallelism and Advanced Instruction-Level Parallelism
• §5.8 Parallelism and Memory Hierarchies: Cache Coherence
• §6.9 Parallelism and I/O: Redundant Arrays of Inexpensive Disks (RAID)
§7.2 The Difficulty of Creating Parallel Processing Programs
• Difficulties of parallel software
  - Partitioning
  - Coordination
  - Communications overhead
• Example: 8 reporters writing 1 story, hoping to finish the work 8 times faster
  - Partitioning: the work must be divided into 8 equal parts
  - Coordination
  - Communications overhead
  - The reporters may spend too much time talking to one another, reducing their effectiveness
• Challenges:
  - Scheduling, load balancing
  - Time to synchronize, overhead of communication between the parties
Parallel Programming
• Parallel software is the problem (how do we write parallel software?)
• Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it’s easier!
Amdahl’s Law (Chapter 1, page 49)
• The sequential part can limit speedup (after parallelization, the portion of the program that still runs sequentially limits the speedup)
• Example: the speedup challenge (textbook p. 652)
  - To get a 90× speedup with 100 processors, what fraction of the original computation can be sequential?
  - Tnew = Tparallelizable/100 + Tsequential
  - Speedup = 1 / [(1 - Fparallelizable) + Fparallelizable/100] = 90
  - Solving: Fparallelizable = 0.999 (the fraction that is parallelizable)
  - The sequential fraction is 1 - Fparallelizable = 0.1%
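A quick numeric check of this example as a small C sketch (the names are mine, not the textbook’s):

    /* Amdahl's Law: the serial fraction (1 - f_par) runs as-is,
     * the parallel fraction f_par is split across p processors. */
    double speedup(double f_par, int p) {
        return 1.0 / ((1.0 - f_par) + f_par / p);
    }
    /* speedup(0.999, 100) = 1 / (0.001 + 0.00999) ≈ 91, i.e. hitting
     * the 90x target requires all but ~0.1% of the work to parallelize. */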
Example: the speedup challenge, a bigger problem (textbook p. 653)
• Sum problem 1: the sum of 10 scalars
  - Increasing from 10 to 100 processors cannot improve this part’s speedup (it is sequential)
• Sum problem 2: the sum of a 10 × 10 matrix
  - Increasing from 10 to 100 processors can improve the speedup
• 1 processor: Time = (10 + 100) × tadd
• 10 processors
  - Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  - Speedup = 110/20 = 5.5 (55% of the potential speedup)
• 100 processors
  - Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  - Speedup = 110/11 = 10 (10% of the potential speedup)
Example: the speedup challenge, a bigger problem (continued, textbook p. 653)
• What if the matrix grows to 100 × 100?
• 1 processor: Time = (10 + 10000) × tadd
• 10 processors
  - Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  - Speedup = 10010/1010 = 9.9 (99% of the potential speedup)
• 100 processors
  - Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  - Speedup = 10010/110 = 91 (91% of the potential speedup)
• The textbook has an error here; the last sentence should read: "and with 100 processors it can be more than 90 times faster."
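The same arithmetic as a sketch (illustrative names): the 10 scalar additions stay sequential, while the matrix additions divide across p processors.

    /* Execution time of the example, in units of t_add. */
    double exec_time(int p, int matrix_adds) {
        return 10.0 + (double)matrix_adds / p;
    }
    /* Speedup vs. 1 processor:
     *   10 x 10:   (10 + 100)   / exec_time(100, 100)   = 110/11    = 10
     *   100 x 100: (10 + 10000) / exec_time(100, 10000) = 10010/110 = 91 */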
Scale Up
• Strong scaling:
  - Problem size fixed
  - E.g., the previous example (textbook p. 653)
• Weak scaling:
  - Problem size proportional to the number of processors
• What the textbook wants readers to take away here is simply:
  - As processors are added, strong scaling obtains a significant speedup
  - As processors are added, weak scaling does not obtain a significant speedup
  - Don’t make it harder than it is
Example: the speedup challenge, balancing load (textbook p. 654)
• Sum problem 2:
  - The textbook’s p. 653 example assumes a balanced load: with 100 processors, summing a 100 × 100 matrix achieves a 91× speedup over 1 processor, each processor carrying 1% of the load.
  - If the load is unbalanced, with 1 processor carrying 2% of the load while the remaining 99 processors evenly share the other 98%, what is the speedup?
  - Speedup = 10,010t / 210t ≈ 48×
Example: the speedup challenge, balancing load (textbook p. 654, continued)
• If the load is unbalanced, with 1 processor carrying 5% of the load while the remaining 99 processors evenly share the other 95%, what is the speedup?
• Solution:
  - The processor with 5% of the load must do 5% × 10,000 = 500 additions
  - The other 99 processors split the remaining 9,500 additions
  - Speedup = 10,010t / 510t ≈ 20×
• This example shows the impact of load imbalance:
  - With just 1 processor at twice the balanced load (2%), the speedup drops to about half (48×)
  - With just 1 processor at five times the balanced load (5%), the speedup drops to about a fifth (20×)
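Extending the sketch above to an unbalanced load (again with illustrative names): the busiest processor determines when the parallel phase finishes.

    /* Speedup of the 100 x 100 sum when one processor carries
     * fraction `frac` of the 10,000 matrix adds; the other 99
     * processors all finish earlier, so the busiest one dominates. */
    double imbalanced_speedup(double frac) {
        return 10010.0 / (10.0 + frac * 10000.0);
    }
    /* imbalanced_speedup(0.01) = 10010/110 = 91  (balanced)
     * imbalanced_speedup(0.02) = 10010/210 ≈ 48
     * imbalanced_speedup(0.05) = 10010/510 ≈ 20 */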
§7.3 Shared Memory Multiprocessors
• SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)
• [Figure: multiple processors sharing a single memory through an interconnection network]
Example: Sum Reduction (textbook p. 656)
• Sum 100,000 numbers on a 100-processor UMA shared-memory machine, in two steps: (1) partition, (2) reduction, using divide and conquer to add the partial sums in pairs
• Step (1): partition
  - Each processor is assigned 1,000 numbers to sum
  - Parallel program for processor Pn:

    sum[Pn] = 0;                     /* processor ID Pn: 0 <= Pn <= 99 */
    for (i = 1000*Pn;                /* i is a private variable */
         i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
Example: Sum Reduction (continued)
• Step (2): divide and conquer (parallel program for processor Pn)
  - Need to synchronize between reduction steps

    half = 100;                        /* private variable */
    repeat
      synch();                         /* synchronize */
      if (half%2 != 0 && Pn == 0)      /* when half is odd and Pn == 0, */
        sum[0] = sum[0] + sum[half-1]; /* processor 0 adds in the last element */
      half = half/2;                   /* divide by 2 */
      if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
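For comparison, a runnable OpenMP sketch of the same computation (OpenMP is my substitution, not the textbook’s; compile with e.g. gcc -fopenmp): the runtime performs the partition and pairwise reduction that the pseudocode above spells out by hand.

    #include <stdio.h>

    #define N 100000

    int main(void) {
        static double A[N];
        for (int i = 0; i < N; i++)      /* sample data */
            A[i] = 1.0;

        double sum = 0.0;
        /* Each thread accumulates a private partial sum; OpenMP then
         * combines the partials, like the manual reduction above. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum = sum + A[i];

        printf("sum = %f\n", sum);       /* expect 100000.0 */
        return 0;
    }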
§7.4 Clusters and Other Message-Passing Multiprocessors
• Message passing
  - Each processor has a private physical address space
  - Hardware sends/receives messages between processors
• [Figure: multiple processors, each with its own memory, connected by an interconnection network]
Loosely Coupled Clusters
• Network of independent computers
  - Each has private memory and OS
  - Connected using the I/O system
    - E.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
  - Web servers, databases, simulations, …
  - High availability, scalable, affordable
• Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth
    - c.f. processor/memory bandwidth on an SMP
Example: Sum Reduction (continued)
• Sum 100,000 numbers on 100 processors
• Step 1: distribute the numbers to the 100 processors
  - Then do partial sums:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
      sum = sum + AN[i];

• Step 2: reduction, divide and conquer
  - Use send and receive to pass messages between processors
Sum Reduction (Again)
• Parallel program for processor Pn, given send() and receive() operations:

    limit = 100; half = 100;   /* 100 processors */
    repeat
      half = (half+1)/2;       /* half is the dividing line between send and receive */
      if (Pn >= half && Pn < limit)
        send(Pn - half, sum);  /* the upper half sends */
      if (Pn < (limit/2))
        sum = sum + receive(); /* the lower half receives */
      limit = half;            /* upper limit of the senders */
    until (half == 1);         /* done */

• Send/receive also provide synchronization
• Assumes send/receive take similar time to addition
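On a real cluster this pattern is usually written with MPI; a minimal sketch (MPI is my substitution for the abstract send()/receive() above), where MPI_Reduce implements the pairwise combining tree:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Step 1: each rank sums its own 1000 numbers (sample data). */
        double partial = 0.0;
        for (int i = 0; i < 1000; i++)
            partial = partial + 1.0;

        /* Step 2: reduction; the library does the send/receive tree
         * and leaves the total on rank 0. */
        double total = 0.0;
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", total); /* 1000.0 * number of ranks */
        MPI_Finalize();
        return 0;
    }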
Grid Computing
• Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
• Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid
§7.5 Hardware Multithreading
• Hardware multithreading
  - Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
• Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
• Coarse-grain multithreading
  - Only switch on a long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Simultaneous Multithreading (SMT)
• In a multiple-issue, dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies are handled by scheduling and register renaming
• Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches
• Figure 7.5: multithreading examples on a superscalar processor (issue slots over time): coarse-grain MT, fine-grain MT, SMT
The Future of Multithreading
• Will it survive? In what form?
• Power considerations → simplified microarchitectures
  - Simpler forms of multithreading
• Tolerating cache-miss latency
  - Thread switch may be most effective
• Multiple simple cores might share resources more effectively
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
• Classification scheme (Flynn’s taxonomy):

                                     Data Streams
                            Single                Multiple
  Instruction   Single      SISD:                 SIMD: SSE
  Streams                   Intel Pentium 4       instructions of x86
                Multiple    MISD:                 MIMD:
                            no examples today     Intel Xeon e5345

• SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors
Flynn’s Taxonomy
• A classification scheme for high-performance computers
• Proposed by Michael J. Flynn in 1972 (or 1966?)
• Based on information streams, of which there are two kinds: instruction and data
• Four computer categories:
  1. SISD (single instruction stream, single data stream)
  2. SIMD (single instruction stream, multiple data streams)
     - Similar to a vector processor
  3. MISD (multiple instruction streams, single data stream)
  4. MIMD (multiple instruction streams, multiple data streams)
SIMD
• Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
  - Multiple data elements in 128-bit wide registers
• All processors execute the same instruction at the same time
  - Each with a different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel applications
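To make the x86 SSE style concrete, a small sketch using the intrinsics from <xmmintrin.h> (my example, assuming an x86 compiler and n divisible by 4): one instruction operates on four packed single-precision elements.

    #include <xmmintrin.h>

    /* c[i] = a[i] + b[i]; four floats per SSE add.
     * Assumes n is a multiple of 4. */
    void add4(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);          /* load 4 lanes */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb)); /* 1 add, 4 lanes */
        }
    }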
Vector Processors
• Highly pipelined function units
• Stream data from/to vector registers to the units
  - Data collected from memory into registers
  - Results stored from registers to memory
• Example: vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions
    - lv, sv: load/store vector
    - addv.d: add vectors of double
    - addvs.d: add scalar to each element of a vector of double
• Significantly reduces instruction-fetch bandwidth
Vector vs. Scalar
• Vector architectures and compilers
  - Simplify data-parallel programming
  - Explicit statement of the absence of loop-carried dependences
    - Reduced checking in hardware
  - Regular access patterns benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops
• More general than ad-hoc media extensions (such as MMX, SSE)
  - Better match with compiler technology
Example: DAXPY (Y = a × X + Y)
• Assume X and Y are vectors of 64 double-precision floating-point numbers
• Conventional MIPS code:

          l.d    $f0,a($sp)      ;load scalar a
          addiu  r4,$s0,#512     ;upper bound of what to load
    loop: l.d    $f2,0($s0)      ;load x(i)
          mul.d  $f2,$f2,$f0     ;a × x(i)
          l.d    $f4,0($s1)      ;load y(i)
          add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
          s.d    $f4,0($s1)      ;store into y(i)
          addiu  $s0,$s0,#8      ;increment index to x
          addiu  $s1,$s1,#8      ;increment index to y
          subu   $t0,r4,$s0      ;compute bound
          bne    $t0,$zero,loop  ;check if done

• Vector MIPS code:

          l.d     $f0,a($sp)     ;load scalar a
          lv      $v1,0($s0)     ;load vector x
          mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
          lv      $v3,0($s1)     ;load vector y
          addv.d  $v4,$v2,$v3    ;add y to the product
          sv      $v4,0($s1)     ;store the result
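In C the same DAXPY kernel is one loop (for reference; a vectorizing compiler maps it onto vector code like the above):

    /* DAXPY: y = a*x + y over n doubles (n = 64 in the example). */
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }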
Scalar vs. Vector Processors
• Scalar processor
  - SISD (single instruction, single data)
  - Conventional processors use this architecture
• Vector processor
  - SIMD (single instruction, multiple data)
  - Processes vectors (one-dimensional arrays)
  - ILLIAC IV: University of Illinois (1972)
    - 1 instruction can drive up to 64 ALUs
    - 100 to 150 MFLOPS
  - STAR-100: CDC (1971)
  - Cray-1: Cray Research (1976)
    - US$8.86 million
    - 64-bit word, 800 MFLOPS
  - Cray-2 (1985): 1.9 GFLOPS
  - Cray-3 (1995): the Cray company went bankrupt
Superscalar Processors
• Superscalar processor
  - MIMD (multiple instruction, multiple data)
• CDC 6600: CDC (1965)
  - 60-bit word, 3 MFLOPS
  - Specs: 10 parallel function units, no pipelining
    - floating-point multiply (2 copies)
    - floating-point divide
    - floating-point add
    - "long" integer add
    - incrementers (2 copies; performed memory load/store)
    - shift
    - boolean logic
• CDC 7600: CDC (1971)
  - 60-bit word, 30 MFLOPS
  - Specs: 9 parallel function units, pipelined
  - Easier to program
• The CDC 6600 and 7600 were successful
  - But CDC almost went bankrupt
Superscalar Processors (continued)
• Superscalar pipeline
  - E.g., 2 instructions in a 4-stage pipeline
• Superscalar microprocessors
  - Intel
    - i960CA (1988)
    - P5 Pentium (1993)
  - AMD
    - AMD 29000-series 29050 (1990)
  - Essentially all general-purpose CPUs developed since about 1998 are superscalar
RISC: Reduced Instruction Set Computing
• CISC: complex instruction set computing
  - Complex instructions with various addressing modes
  - Relatively few registers
  - E.g., Intel Pentium
• RISC: reduced instruction set computing
  - Uniform instruction format
  - Many identical general-purpose registers
  - Instructions simple enough to pipeline easily
    → high clock frequencies and high throughput
  - E.g., MIPS, SPARC, PowerPC, ARM, …
VLIW: Very Long Instruction Word
• VLIW is a type of MIMD
• Proposed by Josh Fisher at Yale University in the 1980s
• VLIW CPUs use software (the compiler) to decide which operations can run in parallel
  - Superscalar CPUs use hardware to decide which operations can run in parallel
• VLIW may also refer to Variable Length Instruction Word
  - E.g., Intel i860 (64-bit)
§7.7 Introduction to Graphics Processing Units
• Early video cards
  - Frame buffer memory with address generation for video output
• 3D graphics processing
  - Early high-end 3D graphics cards were expensive
    - Mostly produced by Silicon Graphics (SGI)
  - Moore’s Law → lower cost, higher density
  - 3D graphics cards for PCs and game consoles
• Graphics Processing Units (GPUs)
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization
Graphics in the System
• [Figure: early systems used VGA display cards; an Intel CPU talks to the GPU through the north bridge chip, while an AMD CPU talks to the GPU through its chipset]
How do GPU and CPU architectures differ?
• The image data a GPU processes is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Hence less reliance on multi-level caches
  - The bandwidth between the GPU and graphics memory is very large
    - Graphics memory is usually distinct from main memory
    - Graphics memory is usually smaller than main memory
• The trend is for GPUs to evolve into general-purpose processors
• The CPU-GPU combination is heterogeneous multiprocessing
  - CPU for sequential code, GPU for parallel code
Example: NVIDIA’s Tesla GPU
• [Figure: Tesla chip organization; each streaming multiprocessor contains 8 streaming processors]
Example: NVIDIA’s Tesla GPU (continued)
• Streaming processors (SPs)
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
• Warp: a group of 32 threads
  - Executed in parallel, SIMD style
    - 8 SPs × 4 clock cycles
  - Hardware contexts for 24 warps
    - Registers, PCs, …
Classifying GPUs
• GPUs don’t fit the SIMD/MIMD model (Flynn’s taxonomy) exactly
  - Conditional execution in a thread allows an illusion of MIMD
    - But with performance degradation
    - Need to write general-purpose code with care

                            Instruction-Level     Data-Level
                            Parallelism           Parallelism
  Static: discovered        VLIW                  SIMD or Vector
  at compile time
  Dynamic: discovered       Superscalar           Tesla Multiprocessor
  at runtime
GPU Programming Interfaces
• Programming languages/APIs
  - DirectX (Microsoft)
  - OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)
§7.8 Introduction to Multiprocessor Network Topologies
• Multicore chips need an on-chip network to connect all the cores
• Network cost:
  - Number of switches
  - Number of links from each switch to the network
  - Link width (bits)
  - Link length (the physical length once fabricated on the chip)
• Network performance:
  - Latency: the time to send a message over an unloaded network
  - Throughput: the maximum amount of data that can be transferred in a given time
  - Delay: delays caused by processors contending for parts of the network
  - Variable performance: performance that varies with the communication pattern
• Fault tolerance:
  - The system must keep running when some components fail
Network Topologies (1)
• Arrangements of processors, switches, and links
• Besides network cost, network performance, and fault tolerance, consider scalability
• With many candidate topologies, performance metrics are needed to distinguish the designs:
  - Metric 1: network bandwidth (bandwidth of the entire network)
  - Metric 2: bisection bandwidth (based on the worst case)
  - Metrics 3 and 4: what else?
• Implementing network topologies (textbook p. 681): further practical concerns
  - The longer the links, the more expensive a high clock rate becomes
    - The physical length of each link affects the cost of communicating at high clock rates
  - A 3D topology graph must be laid out on a 2D chip
  - Power: simpler network topologies consume less power
Network Topologies (2)
• Arrangements of processors, switches, and links
• Scalability: consider the following five network topologies. How do their total network bandwidths differ? Their bisection bandwidths (formulas in the textbook, p. 680)? Is the topology symmetric?
• [Figure: bus, ring, 2D mesh (grid), N-cube (N = 3), and fully connected networks; nodes are processor-memory pairs attached to switches]
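As a sketch of the p. 680 formulas for the two extremes, in units of one link’s bandwidth (the function names are mine):

    /* Bandwidth counts for P nodes, in units of link bandwidth
     * (textbook p. 680). A ring has P links; bisecting it cuts 2.
     * A fully connected network has P*(P-1)/2 links; bisecting it
     * cuts (P/2)^2. */
    int ring_total(int p)     { return p; }
    int ring_bisection(int p) { (void)p; return 2; }
    int full_total(int p)     { return p * (p - 1) / 2; }
    int full_bisection(int p) { return (p / 2) * (p / 2); }
    /* e.g. P = 64: ring gives 64 and 2; fully connected, 2016 and 1024 */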
Multistage Networks
• [Figure: multistage network organizations]
Implementing Network Topologies (textbook p. 681)
• Performance
  - Latency per message (unloaded network)
  - Throughput
    - Link bandwidth
    - Total network bandwidth
    - Bisection bandwidth
  - Congestion delays (depending on traffic)
• Cost
• Power
• Routability in silicon
§7.9 Multiprocessor Benchmarks (1)
• Figure 7.11 (textbook p. 682)
  - Stanford (SPLASH): shared memory; emphasizes strong scaling
  - NASA Advanced Supercomputing (NAS Parallel Benchmarks): computational fluid dynamics
§7.9 Multiprocessor Benchmarks (2)
• Figure 7.11 (textbook p. 682), continued
  - Princeton (PARSEC): shared memory; Pthreads and OpenMP; 9 applications and 3 kernels
  - Berkeley: 13 design patterns that researchers claim will be part of future applications
Code or Applications?
• Traditional benchmarks
  - Fixed code and data sets
• Parallel benchmarks
  - Parallel programming is evolving
  - Should algorithms, programming languages, and tools be part of the system?
  - Compare systems, provided they implement a given application
§7.10 Roofline: A Simple Performance Model
• From the paper by Williams and Patterson (2008)
  - Patterson is one of this book’s authors
• Peak floating-point performance
  - The collective peak of all the chip’s cores combined
• Arithmetic intensity of a kernel
  - FLOPs per byte of memory accessed
  - Measured with the Berkeley design-pattern benchmarks (Figure 7.11)
Roofline: A Simple Performance Model (continued)
• The roofline model
  - Ties together floating-point performance, arithmetic intensity, and memory performance in one 2D graph
  - Peak floating-point performance is taken from the hardware specifications
  - Peak memory performance is taken from the Stream benchmark of Chapter 5
  - X-axis: arithmetic intensity (FLOPs/byte)
  - Y-axis: floating-point performance (GFLOPs/sec)
Roofline Diagram
• Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak FP Performance)
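The whole diagram reduces to one line of code (a sketch with made-up sample numbers):

    #include <math.h>

    /* Roofline: attainable GFLOP/s for a kernel with the given
     * arithmetic intensity (FLOPs/byte), peak memory bandwidth
     * (GB/s), and peak FP performance (GFLOP/s). */
    double roofline(double intensity, double bw_peak, double fp_peak) {
        return fmin(bw_peak * intensity, fp_peak);
    }
    /* e.g. bw_peak = 16, fp_peak = 64 (made-up values): a kernel at
     * 0.25 FLOPs/byte attains only 4 GFLOP/s -- it is memory bound. */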
Comparing the Opteron X2 and Opteron X4
• 2-core vs. 4-core; 2× FP performance per core; 2.2 GHz vs. 2.3 GHz
• Same memory system
• To get higher performance on the X4 than the X2:
  - Need high arithmetic intensity
  - Or the working set must fit in the X4’s 2 MB L3 cache
Optimizing Performance
• Optimize FP performance
  - Balance adds and multiplies
  - Improve superscalar ILP and use of SIMD instructions
• Optimize memory usage
  - Software prefetch
    - Avoid load stalls
  - Memory affinity
    - Avoid non-local data accesses
Optimizing Performance (continued)
• The choice of optimization depends on the arithmetic intensity of the code
• Arithmetic intensity is not always fixed
  - May scale with problem size
  - Caching reduces memory accesses
    - Increases arithmetic intensity
§7.11 Real Stuff: Benchmarking Four Multicores with the Roofline Model
Four Example Systems
• 2 × quad-core Intel Xeon e5345 (Clovertown)
• 2 × quad-core AMD Opteron X4 2356 (Barcelona)
• 2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
• 2 × oct-core IBM Cell QS20
And Their Rooflines
• Kernels
  - SpMV (left)
  - LBMHD (right)
• Some optimizations change arithmetic intensity
• x86 systems have higher peak GFLOPs
  - But they are harder to achieve, given memory bandwidth
Performance on SpMV
• Sparse matrix-vector multiply
  - Irregular memory accesses; memory bound
• Arithmetic intensity
  - 0.166 before memory optimization, 0.25 after
• Xeon vs. Opteron
  - Similar peak FLOPS
  - Xeon limited by shared FSBs and chipset
• UltraSPARC/Cell vs. x86
  - 20-30 vs. 75 peak GFLOPs
  - More cores and memory bandwidth
Performance on LBMHD
• Fluid dynamics: structured grid over time steps
  - Each point: 75 FP reads/writes, 1300 FP ops
• Arithmetic intensity
  - 0.70 before optimization, 1.07 after
• Opteron vs. UltraSPARC
  - More powerful cores, not limited by memory bandwidth
• Xeon vs. others
  - Still suffers from memory bottlenecks
Achieving Performance
• Compare naïve vs. optimized code
  - If naïve code performs well, it’s easier to write high-performance code for the system

  System              Kernel   Naïve         Optimized    Naïve as %
                               GFLOPs/sec    GFLOPs/sec   of optimized
  Intel Xeon          SpMV     1.0           1.5          64%
                      LBMHD    4.6           5.6          82%
  AMD Opteron X4      SpMV     1.4           3.6          38%
                      LBMHD    7.1           14.1         50%
  Sun UltraSPARC T2   SpMV     3.5           4.1          86%
                      LBMHD    9.7           10.5         93%
  IBM Cell QS20       SpMV     naïve code    6.4          0%
                      LBMHD    not feasible  16.7         0%
§7.12 Fallacies and Pitfalls
• Fallacy: a plausible but mistaken belief
  1. "Amdahl’s Law doesn’t apply to parallel computers"
     - "Since we can achieve linear speedup"
     - But only on applications with weak scaling
     - => wrong
  2. "Peak performance tracks observed performance"
     - Marketers like this approach!
     - But compare the Xeon with the others in the example
     - Need to be aware of bottlenecks
     - => wrong
Pitfalls
• Pitfall: a mistake that is easy to make
  1. Not developing the software to take account of a multiprocessor architecture
     - Example: using a single lock for a shared composite resource
       - Serializes accesses even where they could easily be done in parallel
       - Use finer-granularity locking
§7.13 Concluding Remarks
• Goal: higher performance by using multiple processors
• Difficulties
  - Developing parallel software
  - Devising appropriate architectures
• Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower-latency, higher-bandwidth interconnect
• An ongoing challenge for computer architects!