Transcript Chapter
Chapter 7: Multicores, Multiprocessors, and Clusters

Starting from matrix multiplication:
• If one addition takes time ta and one multiplication takes time tm, then multiplying two n×n matrices takes [(n-1)*ta + n*tm] * n^2 time
• Computational complexity: O(n^3)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 2

Systolic array
• Band matrix multiplication: Am×n * Bn×p = Cm×p, where A, B, and C are band matrices
• The systolic array was proposed by H. T. Kung (1978)

§7.1 Introduction
• Goal: connecting multiple computers to get higher performance
  Multiprocessors: scalability, availability, power efficiency
• Job-level (process-level) parallelism
  High throughput for independent jobs
• Parallel processing program
  A single program run on multiple processors
• Multicore microprocessors
  Chips with multiple processors (cores)

Hardware and Software
• Hardware
  Serial: e.g., Pentium 4
  Parallel: e.g., quad-core Xeon e5345
• Software
  Sequential: e.g., matrix multiplication
  Concurrent: e.g., operating system
• Sequential/concurrent software can run on serial/parallel hardware
• Challenge: making effective use of parallel hardware

What We've Already Covered
• §2.11: Parallelism & Instructions: Synchronization
• §3.6: Parallelism & Computer Arithmetic: Associativity
• §4.10: Parallelism & Advanced Instruction-Level Parallelism
• §5.8: Parallelism & Memory Hierarchies: Cache Coherence
• §6.9: Parallelism & I/O: Redundant Arrays of Inexpensive Disks (RAID)

§7.2 The Difficulty of Creating Parallel Processing Programs
• Difficulties of parallel software: partitioning, coordination, communications overhead
• Example: have 8 reporters write 1 story, hoping the work goes 8 times faster
  Partitioning:
(the work must be divided into 8 equal parts)
  Coordination
  Communications overhead: the reporters may spend too much time talking to one another, reducing their effectiveness
• Challenges: scheduling, load balancing, synchronization time, and the overhead of communication among the parties

Parallel Programming
• Parallel software is the problem (how do we write parallel software?)
• Worth it when a significant performance improvement is needed
  Otherwise, just use a faster uniprocessor, since it's easier!

Amdahl's Law (Chapter 1, p. 49)
• The sequential part can limit speedup (after parallelization, the part of the program that still runs sequentially limits the speedup)
• Example: the speedup challenge (textbook p. 652). To get 90× speedup from 100 processors, what fraction of the original computation can be sequential?
  Tnew = Tparallelizable/100 + Tsequential
  Speedup = 1 / ((1 - Fparallelizable) + Fparallelizable/100) = 90
  Solving gives Fparallelizable = 0.999 (Fparallelizable is the parallelizable fraction)
  The sequential fraction is 1 - Fparallelizable = 0.1%

Example: the speedup challenge, a bigger problem (textbook p. 653)
• Addition problem 1: sum of 10 scalars; addition problem 2: sum of a 10 × 10 matrix
  (only the matrix sum benefits from more processors; the 10 scalar adds stay sequential, so going from 10 to 100 processors cannot improve that part)
• 1 processor: Time = (10 + 100) × tadd
• 10 processors: Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  Speedup = 110/20 = 5.5 (55% of the potential 10×)
• 100 processors: Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  Speedup = 110/11 = 10 (10% of the potential 100×)

Example: the speedup challenge, a bigger problem (cont., textbook p. 653)
• What if the matrix grows to 100 × 100?
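The 10 × 10 arithmetic above can be checked with a short sketch (assumptions as in the text: one addition costs one time unit, and the 10 scalar adds cannot be parallelized):

```python
def speedup(serial_adds, parallel_adds, p):
    """Time model from the text: the serial adds run on one
    processor; the parallel adds divide evenly among p processors."""
    t1 = serial_adds + parallel_adds        # time on 1 processor
    tp = serial_adds + parallel_adds / p    # time on p processors
    return t1 / tp

# 10 serial scalar adds + a 10x10 matrix sum (100 parallelizable adds):
print(speedup(10, 100, 10))    # 5.5  (55% of the potential 10x)
print(speedup(10, 100, 100))   # 10.0 (10% of the potential 100x)
```

The same function reproduces the 100 × 100 case that follows by passing 10,000 parallelizable adds.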
• 1 processor: Time = (10 + 10000) × tadd
• 10 processors: Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  Speedup = 10010/1010 = 9.9 (99% of the potential 10×)
• 100 processors: Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  Speedup = 10010/110 = 91 (91% of the potential 100×)
• The textbook is in error here; its last sentence should read: "while with 100 processors the speedup is greater than 90."

Scale Up
• Strong scaling: problem size fixed, as in the example above (textbook p. 653)
• Weak scaling: problem size proportional to the number of processors
• The point of strong vs. weak scaling is simply this: with a fixed problem size (strong scaling), the sequential part eventually limits the speedup as processors are added, whereas growing the problem with the machine (weak scaling) makes good speedup easier to sustain. Don't overthink it.

Example: the speedup challenge, balancing load (textbook p. 654)
• The p. 653 example assumed balanced load: 100 processors summing a 100 × 100 matrix achieve 91× speedup over 1 processor, with each processor carrying 1% of the load.
• If the load is unbalanced, with 1 processor carrying 2% and the other 99 evenly sharing the remaining 98%, what is the speedup?
  The loaded processor does 2% × 10,000 = 200 additions, so the speedup is 10,010t / 210t = 48×

Example: the speedup challenge, balancing load (textbook p. 654, cont.)
• If 1 processor carries 5% and the other 99 evenly share the remaining 95%, what is the speedup?
  Solution: the processor with 5% of the load must do 5% × 10,000 = 500 additions, while the 99 others share the remaining 9,500 additions.
  The speedup is 10,010t / 510t = 20×
• This example shows the impact of load imbalance:
  With just 1 processor at double the load (2%), the speedup drops by half (to 48×)
  With just 1 processor at five times the load (5%), the speedup drops to about a fifth (20×)

§7.3 Shared Memory Multiprocessors
SMP: shared memory multiprocessor
• Hardware provides a single physical address space for all processors
  (multiple processors, an interconnection network, and a single shared memory)
• Synchronize shared variables using locks
• Memory access time: UMA (uniform) vs.
NUMA (nonuniform)

Example: Sum Reduction (textbook p. 656)
• Sum 100,000 numbers on a 100-processor UMA shared-memory machine, in two steps:
  (1) partition; (2) reduction: divide and conquer, adding partial sums in pairs
• Step (1), partition: each processor is assigned 1,000 numbers. Parallel program on processor Pn:

  sum[Pn] = 0;                    /* processor ID Pn: 0 <= Pn <= 99 */
  for (i = 1000*Pn;               /* i is a private variable */
       i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];

Example: Sum Reduction (cont.)
• Step (2), divide and conquer, on processor Pn
• Need to synchronize between reduction steps

  half = 100;                         /* private variable */
  repeat
    synch();                          /* synchronize */
    if (half%2 != 0 && Pn == 0)       /* half is odd and processor ID == 0 */
      sum[0] = sum[0] + sum[half-1];  /* processor 0 folds in the last element */
    half = half/2;                    /* divide by 2 */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
  until (half == 1);

§7.4 Clusters and Other Message-Passing Multiprocessors
Message Passing
• Each processor has a private physical address space
  (multiple processors, each with its own memory, joined by an interconnection network)
• Hardware sends/receives messages between processors

Loosely Coupled Clusters
• Network of independent computers
  Each has private memory and OS
  Connected using the I/O system, e.g., Ethernet/switch, Internet
• Suitable for applications with independent tasks
  Web servers, databases, simulations, …
• High availability, scalable, affordable
• Problems
  Administration cost (prefer virtual machines)
  Low interconnect bandwidth, c.f.
processor/memory bandwidth on an SMP

Example: Sum Reduction (cont.)
• Sum 100,000 numbers on 100 processors
• Step 1: distribute 1,000 numbers to each of the 100 processors, then do partial sums

  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];        /* AN[] holds this node's numbers */

• Step 2: reduction, divide and conquer
  Use send and receive to pass messages between processors

Sum Reduction (Again)
• Parallel program on processor Pn, given send() and receive() operations:

  limit = 100; half = 100;    /* 100 processors */
  repeat
    half = (half+1)/2;        /* half is the send/receive dividing line */
    if (Pn >= half && Pn < limit)
      send(Pn - half, sum);   /* processors at or above half send */
    if (Pn < (limit/2))
      sum = sum + receive();  /* processors below half receive */
    limit = half;             /* upper limit of senders */
  until (half == 1);          /* done */

• Send/receive also provide synchronization
• Assumes send/receive take similar time to addition

Grid Computing
• Separate computers interconnected by long-haul networks
  E.g., Internet connections
• Work units farmed out, results sent back
• Can make use of idle time on PCs
  E.g., SETI@home, World Community Grid

§7.5 Hardware Multithreading
• Performing multiple threads of execution in parallel
  Replicate registers, PC, etc.
  Fast switching between threads
• Fine-grained multithreading
  Switch threads after each cycle
  Interleave instruction execution
  If one thread stalls, others are executed
• Coarse-grained multithreading
  Only switch on a long stall, e.g., an L2-cache miss
  Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

Simultaneous Multithreading (SMT)
• In a multiple-issue, dynamically scheduled processor:
  Schedule instructions from multiple threads
  Instructions from independent threads execute whenever function units are available
  Within threads, dependencies are handled by scheduling and register renaming
• Example: Intel Pentium-4 HT
  Two threads: duplicated registers, shared function units and caches

Figure 7.5: multithreading examples on a superscalar processor (issue slots): coarse-grained MT, fine-grained MT, SMT

The Future of Multithreading
• Will it survive? In what form?
• Power considerations favor simplified microarchitectures
  Simpler forms of multithreading
• Tolerating cache-miss latency
  Thread switching may be the most effective form
• Multiple simple cores might share resources more effectively

§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Classification (Flynn's taxonomy), by instruction streams (single/multiple) and data streams (single/multiple):
  SISD: Intel Pentium 4
  SIMD: SSE instructions of x86
  MISD: no examples today
  MIMD: Intel Xeon e5345
• SPMD: Single Program Multiple Data
  A parallel program on a MIMD computer
  Conditional code for different processors

Flynn's Taxonomy
• A way to classify high-performance computers, proposed by Michael J. Flynn in 1966 (elaborated in 1972)
• Based on information streams, split into instruction streams and data streams, giving four machine classes:
  1. SISD (single instruction stream, single data stream) computers
  2. SIMD (single instruction stream, multiple data stream) computers, similar to vector processors
  3. MISD (multiple instruction stream, single data stream) computers
  4.
MIMD (multiple instruction stream, multiple data stream) computers

SIMD
• Operate elementwise on vectors of data
  E.g., MMX and SSE instructions in x86: multiple data elements in 128-bit-wide registers
• All processors execute the same instruction at the same time, each with a different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel applications

Vector Processors
• Highly pipelined function units
• Stream data from/to vector registers to the units
  Data collected from memory into registers
  Results stored from registers to memory
• Example: a vector extension to MIPS
  32 × 64-element registers (64-bit elements)
  Vector instructions:
    lv, sv: load/store vector
    addv.d: add vectors of double
    addvs.d: add scalar to each element of a vector of double
• Significantly reduces instruction-fetch bandwidth

Vector vs.
Scalar
• Vector architectures and compilers
  Simplify data-parallel programming
  Explicit statement of the absence of loop-carried dependences means reduced checking in hardware
  Regular access patterns benefit from interleaved and burst memory
  Avoid control hazards by avoiding loops
• More general than ad hoc media extensions (such as MMX, SSE)
  Better match with compiler technology

Example: DAXPY (Y = a × X + Y)
• Assume X and Y are vectors of 64 double-precision floating-point numbers

Conventional MIPS code:
        l.d    $f0,a($sp)      ;load scalar a
        addiu  r4,$s0,#512     ;upper bound of what to load
  loop: l.d    $f2,0($s0)      ;load x(i)
        mul.d  $f2,$f2,$f0     ;a × x(i)
        l.d    $f4,0($s1)      ;load y(i)
        add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
        s.d    $f4,0($s1)      ;store into y(i)
        addiu  $s0,$s0,#8      ;increment index to x
        addiu  $s1,$s1,#8      ;increment index to y
        subu   $t0,r4,$s0      ;compute bound
        bne    $t0,$zero,loop  ;check if done

Vector MIPS code:
        l.d     $f0,a($sp)     ;load scalar a
        lv      $v1,0($s0)     ;load vector x
        mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
        lv      $v3,0($s1)     ;load vector y
        addv.d  $v4,$v2,$v3    ;add y to product
        sv      $v4,0($s1)     ;store the result

Scalar vs. vector processors
• Scalar processor: SISD (single instruction, single data)
  The architecture of conventional processors
• Vector processor: SIMD (single instruction, multiple data); processes vectors (one-dimensional arrays)
  ILLIAC IV, University of Illinois (1972): one instruction could drive up to 64 ALUs; 100 to 150 MFLOPS
  Star-100, CDC (1971)
  Cray-1, Cray Research (1976): US$8.86 million; 64-bit word; peak 160 MFLOPS
  Cray-2 (1985): 1.9 GFLOPS
  Cray-3 (1993); Cray Computer Corp. went bankrupt in 1995

Superscalar processors
• Superscalar processor: MIMD (multiple instruction, multiple data)
  CDC 6600, CDC (1965): 60-bit word, 3 MFLOPS
    Specs: 10 parallel function units, no pipelining
  CDC 7600, CDC (1971): 60-bit word, 30 MFLOPS
    Specs: 9 parallel function units, pipelined:
    floating-point multiply (2 copies), floating-point divide, floating-point add,
    "long" integer add, incrementers (2 copies; performed memory load/store),
    shift, boolean logic
  Easier to program
  The CDC 6600 and 7600 were successful, but CDC nearly went bankrupt

Superscalar processors (cont.)
• A superscalar pipeline, e.g.,
a 2-instruction, 4-stage pipeline
• Superscalar microprocessors
  Intel: i960CA (1988), P5 Pentium (1993)
  AMD: 29000-series 29050 (1990)
  Essentially all general-purpose CPUs developed since about 1998 are superscalar

CISC and RISC
• CISC: complex instruction set computing
  Complex instructions with various addressing modes
  Relatively few registers
  E.g., Intel Pentium
• RISC: reduced instruction set computing
  Uniform instruction format
  Many identical general-purpose registers
  Instructions simple enough to be easily pipelined, allowing high clock frequencies for throughput
  E.g., MIPS, SPARC, PowerPC, ARM, …

VLIW: Very Long Instruction Word
• VLIW is a type of MIMD
• Proposed by Josh Fisher at Yale University in the 1980s
• VLIW CPUs use software (the compiler) to decide which operations can run in parallel;
  superscalar CPUs use hardware to decide which operations can run in parallel
• VLIW may also refer to Variable Length Instruction Word
• E.g., Intel i860 (64-bit)

§7.7 Introduction to Graphics Processing Units
• Early video cards
  Frame-buffer memory with address generation for video output
• 3D graphics processing
  Early high-end display cards were expensive, mostly produced by Silicon Graphics (SGI)
  Moore's Law: lower cost, higher density; 3D graphics cards for PCs and game consoles
• Graphics Processing Units (GPUs): processors oriented to 3D graphics tasks
  Vertex/pixel processing, shading, texture mapping, rasterization

Graphics in the System
• Intel CPUs communicate with the GPU through the north-bridge chip (early systems used VGA display cards)
• AMD CPUs communicate with the GPU through the chipset

How do GPU and CPU architectures differ?
• The image data a GPU processes is highly data-parallel
  GPUs are highly multithreaded
  They use thread switching to hide memory latency
• The bandwidth between the GPU and graphics memory is very high
  Graphics memory is usually distinct from main memory
  Graphics memory is usually smaller than main memory
• The trend in GPU design is toward identical general-purpose processors, using multi-level caches as little as possible
• A combined CPU and GPU is heterogeneous multiprocessing
  CPU for sequential code, GPU for parallel code

Example: NVIDIA's Tesla GPU
• Streaming multiprocessor: 8 × streaming processors

Example: NVIDIA's Tesla GPU (cont.)
• Streaming processors
  Single-precision FP and integer units
  Each SP is fine-grained multithreaded
• Warp: group of 32 threads
  Executed in parallel, SIMD style: 8 SPs × 4 clock cycles
  Hardware contexts for 24 warps: registers, PCs, …

Classifying GPUs
• The SIMD/MIMD model (Flynn's taxonomy) doesn't quite fit
  Conditional execution in a thread allows an illusion of MIMD, but with performance degradation
  General-purpose code must be written with care

                                  Static (discovered      Dynamic (discovered
                                  at compile time)        at runtime)
  Instruction-level parallelism   VLIW                    Superscalar
  Data-level parallelism          SIMD or Vector          Tesla Multiprocessor

GPU programming interfaces
• Programming languages/APIs
  DirectX (Microsoft), OpenGL
  C for Graphics (Cg), High Level Shader Language (HLSL)
  Compute Unified Device Architecture (CUDA)

§7.8 Introduction to Multiprocessor Network Topologies
• A multicore chip needs an on-chip network linking all of its cores
• Network cost:
  Number of switches
  Number of links from each switch to the network
  Link width (bits)
  Link length (the physical length once fabricated on the chip)
• Network performance:
  Latency: the time to send a message on an unloaded network
  Throughput: the maximum amount of data that can be transmitted in a given time
  Delay: time lost to processors contending for part of the network
  Variable performance: performance differs across communication patterns
• Fault tolerance: the system must keep running when some components fail

Network topologies (1)
• Arrangements of processors, switches, and
links
• Beyond network cost, network performance, and fault tolerance, also consider scalability
• With many topologies to choose from, performance metrics are needed to distinguish the designs:
  Metric 1: network bandwidth (the bandwidth of the whole network)
  Metric 2: bisection bandwidth (based on the worst case)
  Metrics 3 and 4: what else? (left as a question)
• Implementing network topologies (textbook p. 681): further practical concerns
  The longer the link, the more expensive it is to run at a high clock rate
  The physical distance of each link affects communication cost at high clock rates
  A 3-D topology must be laid out on a 2-D planar chip
  Power: simpler network topologies consume less power

Network topologies (2)
• Arrangements of processors, switches, and links
• Scalability: consider the following five topologies. How do total network bandwidth and bisection bandwidth differ (the formulas are on textbook p. 680)? Is each topology symmetric?
  Bus
  Ring
  2-D mesh
  N-cube (shown for N = 3)
  Fully connected

Multistage Networks
• Multistage network (figure)

Implementing network topologies (textbook p. 681)
• Performance
  Latency per message (unloaded network)
  Throughput: link bandwidth, total network bandwidth, bisection bandwidth
  Congestion delays (depending on traffic)
• Cost
• Power
• Routability in silicon

§7.9 Multiprocessor Benchmarks
Parallel benchmarks (1), Figure 7.11 (textbook p. 682)
• Stanford: shared memory; emphasizes strong scaling
• NASA: supercomputing; computational fluid dynamics

Parallel benchmarks (2), Figure 7.11 (textbook p. 682)
• Princeton: shared memory; Pthreads and OpenMP; 9 applications and 3 kernels
• Berkeley: 13 design patterns, which the researchers claim will be part of future applications

Code or Applications?
• Traditional benchmarks: fixed code and data sets
• Parallel benchmarks
  Parallel programming is still evolving
  Should algorithms, programming languages, and tools be part of the system?
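The total-bandwidth and bisection-bandwidth metrics from §7.8 can be made concrete. A sketch of the standard counts for three of the topologies (per the textbook p. 680 formulas), assuming P processors and unit link bandwidth:

```python
def network_bandwidths(p):
    """Total network bandwidth and bisection bandwidth, in units of
    one link's bandwidth, for three topologies from Section 7.8."""
    return {
        # bus: one shared link, so both metrics are 1
        "bus":             {"total": 1,                "bisection": 1},
        # ring: p links in total; cutting the ring in half severs 2 links
        "ring":            {"total": p,                "bisection": 2},
        # fully connected: p*(p-1)/2 links; each node in one half links
        # to every node in the other half, giving (p/2)^2 across the cut
        "fully connected": {"total": p * (p - 1) // 2, "bisection": (p // 2) ** 2},
    }

bw = network_bandwidths(64)
print(bw["ring"])             # {'total': 64, 'bisection': 2}
print(bw["fully connected"])  # {'total': 2016, 'bisection': 1024}
```

The gap between the ring and the fully connected machine illustrates why cost and bandwidth must be traded off as P grows.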
• Compare systems, provided they implement a given application (yes)

§7.10 Roofline: A Simple Performance Model
• From the Williams and Patterson (2008) paper; Patterson is one of this textbook's authors
• Peak floating-point performance
  The collective peak of all the cores on the chip
• Arithmetic intensity of a kernel
  FLOPs per byte of memory accessed
  Measured with the Berkeley design-pattern benchmarks (Figure 7.11)

Roofline model
• Ties together floating-point performance, arithmetic intensity, and memory performance in one 2-D graph
  Peak floating-point performance comes from the hardware specification
  Peak memory performance comes from the STREAM benchmark of Chapter 5
  X-axis: arithmetic intensity (FLOPs/byte); Y-axis: floating-point performance (GFLOPs/sec)

Roofline Diagram
• Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak FP Performance)

Comparing the Opteron X2 and Opteron X4
• 2-core vs. 4-core; 2× FP performance per core; 2.2 GHz vs.
2.3 GHz
• Same memory system
• To get higher performance on the X4 than the X2:
  High arithmetic intensity is needed, or
  The working set must fit in the X4's 2 MB L3 cache

Optimizing Performance
• Optimize FP performance
  Balance adds and multiplies
  Improve superscalar ILP and the use of SIMD instructions
• Optimize memory usage
  Software prefetch: avoid load stalls
  Memory affinity: avoid non-local data accesses

Optimizing Performance (cont.)
• The choice of optimization depends on the arithmetic intensity of the code
• Arithmetic intensity is not always fixed
  It may scale with problem size
  Caching reduces memory accesses and so increases arithmetic intensity

§7.11 Real Stuff: Benchmarking Four Multicores with the Roofline Model
Four example systems:
• 2 × quad-core Intel Xeon e5345 (Clovertown)
• 2 × quad-core AMD Opteron X4 2356 (Barcelona)
• 2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
• 2 × oct-core IBM Cell QS20

And Their Rooflines
• Kernels: SpMV (left), LBMHD (right)
• Some optimizations change arithmetic intensity
• x86 systems have higher peak GFLOPs, but it is harder to achieve given their memory bandwidth

Performance on SpMV
• Sparse matrix-vector multiply: irregular memory accesses, memory bound
• Arithmetic intensity: 0.166 before memory optimization, 0.25 after
• Xeon vs. Opteron
  Similar peak FLOPS; the Xeon is limited by shared FSBs and chipset
• UltraSPARC/Cell vs. x86
  20 to 30 vs. 75 peak GFLOPs
  More cores and memory bandwidth

Performance on LBMHD
• Fluid dynamics: a structured grid over time steps
  Each point: 75 FP reads/writes, 1300 FP ops
• Arithmetic intensity: 0.70 before optimization, 1.07 after
• Opteron vs.
UltraSPARC
  More powerful cores, not limited by memory bandwidth
• Xeon vs. the others
  Still suffers from memory bottlenecks

Achieving Performance
• Compare naïve vs. optimized code
  If naïve code performs well, it's easier to write high-performance code for the system

  System             Kernel  Naïve GFLOPs/sec      Optimized GFLOPs/sec  Naïve as % of optimized
  Intel Xeon         SpMV    1.0                   1.5                   64%
                     LBMHD   4.6                   5.6                   82%
  AMD Opteron X4     SpMV    1.4                   3.6                   38%
                     LBMHD   7.1                   14.1                  50%
  Sun UltraSPARC T2  SpMV    3.5                   4.1                   86%
                     LBMHD   9.7                   10.5                  93%
  IBM Cell QS20      SpMV    (naïve not feasible)  6.4                   0%
                     LBMHD   (naïve not feasible)  16.7                  0%

§7.12 Fallacies and Pitfalls
Fallacies (plausible-sounding but mistaken ideas)
1. "Amdahl's Law doesn't apply to parallel computers, since we can achieve linear speedup."
   Wrong: linear speedup is seen only on applications with weak scaling.
2. "Peak performance tracks observed performance."
   Wrong: marketers like this approach, but compare the Xeon with the others in the example; be aware of bottlenecks.

Pitfalls (mistakes that are easy to make)
1. Not developing the software to take account of a multiprocessor architecture
   E.g., using a single lock for a shared composite resource serializes accesses, even where they could easily run in parallel.
   Use finer-granularity locking.

§7.13 Concluding Remarks
• Goal: higher performance by using multiple processors
• Difficulties
  Developing parallel software
  Devising appropriate architectures
• Many reasons for optimism
  A changing software and application environment
  Chip-level multiprocessors with lower-latency, higher-bandwidth interconnect
• An ongoing challenge for computer architects!
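The roofline formula of §7.10 is a one-line computation; here is a sketch of it with hypothetical machine numbers (the 16 GFLOP/s peak and 10 GB/s memory bandwidth are made up for illustration, not taken from any system above):

```python
def attainable_gflops(peak_gflops, peak_mem_bw_gbs, arithmetic_intensity):
    """Roofline model (Section 7.10): attainable performance is capped
    by either the memory roof (bandwidth x arithmetic intensity) or
    the compute roof (peak FP performance), whichever is lower."""
    return min(peak_mem_bw_gbs * arithmetic_intensity, peak_gflops)

# Hypothetical machine: 16 GFLOP/s peak, 10 GB/s memory bandwidth.
print(attainable_gflops(16, 10, 0.25))  # 2.5 -> memory bound (SpMV-like intensity)
print(attainable_gflops(16, 10, 4.0))   # 16  -> compute bound
```

The crossover point, where peak_mem_bw × intensity equals peak GFLOP/s, is the "ridge" of the roofline; kernels to its left benefit from the memory optimizations listed in §7.10, and kernels to its right benefit from the FP optimizations.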