Transcript Document
Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading -- The EARTH Model (in more detail)
Guang R. Gao
ACM Fellow and IEEE Fellow
Endowed Distinguished Professor
Electrical & Computer Engineering, University of Delaware
[email protected]

Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization: SSB
• The percolation model and its applications
• Summary

The EARTH Multithreaded Execution Model
Two levels of fine-grain threads:
– threaded procedures
– fibers
[Figure: fibers within a frame; an asynchronous function invocation invokes a threaded function; a sync operation delivers a signal token to a sync slot, which records the total number of signals expected and the number that have arrived.]

EARTH vs. CILK
[Figure: in the CILK model, a parallel function invocation forks a procedure into a new frame; in the EARTH model, fibers within a frame are coordinated by SYNC operations.]
Note: EARTH has its origin in the static dataflow model.

The "Fiber" Execution Model
[Animation over several slides: each sync slot carries a pair (total # signals, arrived # signals). As signal tokens arrive, the arrived count is incremented; when the arrived count reaches the total, the corresponding fiber is enabled and may execute, in turn sending its own signals to downstream slots.]

A Loop Example

for (i = 1; i <= N; ++i) {
  S1: ...
  S2: x[i] = ...
  S3: y[i] = ... + x[i-1] ...
  ...
  Sk: ...
}

[Figure: iterations i = 1, 2, 3, ..., N are assigned to threads T1, T2, T3, ...; each iteration executes S1, S2, S3, ..., Sk.]
Note: how are loop-carried dependencies handled, and what is the implication for cross-core software pipelining?
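To make the question above concrete, here is a minimal sketch (plain C11 atomics and POSIX threads, not EARTH's Threaded-C primitives) of one way the x[i-1] -> x[i] dependence can be enforced when iterations are spread across cores: each iteration signals a per-iteration flag after producing x[i] and waits on the flag of iteration i-1 before consuming it. The names ready, worker and NUM_WORKERS are illustrative only.

/* Minimal sketch: enforcing the x[i-1] -> x[i] loop-carried dependence
 * across worker threads with one "signal" flag per iteration, in the
 * spirit of EARTH signal tokens and sync slots. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 16
#define NUM_WORKERS 4

static double x[N + 1];
static atomic_int ready[N + 1];          /* ready[i] == 1 once x[i] is written */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* Iterations are distributed cyclically, as in the figure (T1, T2, T3, ...). */
    for (int i = (int)id + 1; i <= N; i += NUM_WORKERS) {
        /* S2: produce x[i], then "signal" its consumers. */
        x[i] = (double)i;
        atomic_store_explicit(&ready[i], 1, memory_order_release);

        /* S3: consume x[i-1]; wait for the signal from the producing iteration. */
        while (!atomic_load_explicit(&ready[i - 1], memory_order_acquire))
            ;                            /* spin: the cross-iteration sync point */
        double y = 1.0 + x[i - 1];
        (void)y;
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_WORKERS];
    atomic_store(&ready[0], 1);          /* x[0] has no producer in the loop */
    for (long id = 0; id < NUM_WORKERS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < NUM_WORKERS; id++)
        pthread_join(t[id], NULL);
    printf("done\n");
    return 0;
}

In EARTH terms, the store to ready[i] plays the role of a signal token delivered to the sync slot of the fiber that consumes x[i]; the spin-wait is exactly the busy waiting that a hardware-supported sync slot avoids.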
Main Features of EARTH
* Fast thread context switching
• Efficient parallel function invocation
• Good support for fine-grain dynamic load balancing
* Efficient support for split-phase transactions and fibers
(* marks features unique to the EARTH model in comparison to the CILK model.)

Compiling C for EARTH: Objectives
• Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multithreaded architectures (EARTH-C).
• Develop compiler techniques to automatically translate programs written in EARTH-C into multithreaded programs (EARTH-C to Threaded-C).
• Determine whether EARTH-C plus the compiler can compete with hand-coded Threaded-C programs.

Summary of EARTH-C Extensions
• Explicit parallelism
  – parallel versus sequential statement sequences
  – forall loops
• Locality annotation
  – local versus remote memory references (global, local, replicate, ...)
• Dynamic load balancing
  – basic versus remote functions and invocation sites

EARTH-C Compiler Environment
[Figure: the EARTH compilation environment takes EARTH-C through the McCAT-based EARTH-C compiler down to Threaded-C, which the Threaded-C compiler then compiles for EARTH. The EARTH-C compiler itself performs program dependence analysis, split-phase analysis, DDG construction, thread partitioning, thread generation, remote-level computation, statement merging, thread synchronization, thread scheduling and thread code generation, working on the EARTH-SIMPLE-C intermediate form.]

The McCAT/EARTH Compiler
EARTH-C
  PHASE I (standard McCAT analyses and transformations): simplify, goto elimination, local function inlining, points-to analysis, heap analysis, read/write set analysis, array dependence tester
EARTH-SIMPLE-C
  PHASE II (parallelization): forall loop detection, loop partitioning
EARTH-SIMPLE-C
  PHASE III (code generation): build hierarchical DDG, thread generation
THREADED-C

The Fibonacci Example
[Figure: the fib frame holds the return location (result, done) and sync slots for the two recursive invocations.]

if (n < 2)
  DATA_RSYNC (1, result, done);
else {
  TOKEN (fib, n-1, &sum1, slot_1);
  TOKEN (fib, n-2, &sum2, slot_2);
}
END_THREAD ( );

THREAD_1:
DATA_RSYNC (sum1 + sum2, result, done);
END_THREAD ( );
END_FUNCTION

Matrix Multiplication (sequential version)

void main ( )
{
  int i, j, k;
  float sum;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      sum = 0;
      for (k = 0; k < N; k++)
        sum = sum + a[i][k] * b[k][j];
      c[i][j] = sum;
    }
}

The Inner Product Example
[Figure: the inner frame takes a, b and the return location (result, done); its sync slot expects two signals, one from each block move.]

BLKMOV_SYNC (a, row_a, N, slot_1);
BLKMOV_SYNC (b, column_b, N, slot_1);
sum = 0;
END_THREAD ( );

THREAD_1:
for (i = 0; i < N; i++)
  sum = sum + (row_a[i] * column_b[i]);
DATA_RSYNC (sum, result, done);
END_THREAD ( );
END_FUNCTION
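To make this matrix-multiplication / inner-product decomposition concrete, here is a minimal sketch in plain C with POSIX threads (not Threaded-C): main fires one asynchronous inner-product task per result element, in the spirit of TOKEN, and a software "sync slot" (a counter guarded by a mutex and condition variable) collects the N*N completion signals, in the spirit of DATA_RSYNC. The names sync_slot_t, task_t and signal_slot are invented for this sketch.

/* Minimal sketch: matrix multiply decomposed into per-element inner-product
 * tasks, with a counted completion "slot" standing in for an EARTH sync slot. */
#include <pthread.h>
#include <stdio.h>

#define N 4

static double a[N][N], b[N][N], c[N][N];

/* Software stand-in for a sync slot: a counter plus a condition variable. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int             remaining;
} sync_slot_t;

static sync_slot_t done_slot = { PTHREAD_MUTEX_INITIALIZER,
                                 PTHREAD_COND_INITIALIZER, N * N };

static void signal_slot(sync_slot_t *s)
{
    pthread_mutex_lock(&s->mu);
    if (--s->remaining == 0)
        pthread_cond_signal(&s->cv);
    pthread_mutex_unlock(&s->mu);
}

typedef struct { int i, j; } task_t;

/* One "inner" invocation: compute c[i][j], then signal the caller's slot
 * (this plays the role of DATA_RSYNC). */
static void *inner(void *arg)
{
    task_t *t = arg;
    double sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += a[t->i][k] * b[k][t->j];
    c[t->i][t->j] = sum;
    signal_slot(&done_slot);
    return NULL;
}

int main(void)
{
    pthread_t th[N * N];
    task_t    tasks[N * N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j] = 1.0;

    for (int i = 0; i < N; i++)          /* asynchronous invocations, like TOKEN */
        for (int j = 0; j < N; j++) {
            tasks[i * N + j] = (task_t){ i, j };
            pthread_create(&th[i * N + j], NULL, inner, &tasks[i * N + j]);
        }

    pthread_mutex_lock(&done_slot.mu);   /* wait until all N*N signals arrive */
    while (done_slot.remaining > 0)
        pthread_cond_wait(&done_slot.cv, &done_slot.mu);
    pthread_mutex_unlock(&done_slot.mu);

    for (int i = 0; i < N * N; i++)
        pthread_join(th[i], NULL);

    printf("c[0][0] = %g\n", c[0][0]);   /* expect N = 4 with all-ones inputs */
    return 0;
}

The point of the decomposition in the slides is that each inner invocation is a split-phase unit: it first requests its row and column (the two BLKMOV_SYNC operations), and the fiber that does the multiply-accumulate runs only after both blocks have arrived.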
EARTH-C to Threaded-C (Thread Generation)
Given a sequence of statements s1, s2, ..., sn, we wish to create threads such that we:
– maximize thread length (minimize thread-switching overhead)
– retain sufficient parallelism
– issue remote memory requests as early as possible (prefetching)
– compile split-phase remote memory operations and remote function calls correctly

An Example

int f(int *x, int i, int j) {
  int a, b, sum, prod, fact;
  int r1, r2, r3;
  a = x[i];
  fact = 1;
  b = x[j];
  fact = fact * a;
  sum = a + b;
  prod = a * b;
  r1 = g(sum);
  r2 = g(prod);
  r3 = g(fact);
  return (r1 + r2 + r3);
}

Example Partitioned into Four Fibers
[Figure: Fiber-0: a = x[i]; fact = 1. Fiber-1: fact = fact * a; b = x[j]. Fiber-2: sum = a + b; prod = a * b; r1 = g(sum); r2 = g(prod); r3 = g(fact). Fiber-3: return (r1 + r2 + r3). The sync counts 1, 1 and 3 in the figure are the numbers of signals that enable Fiber-1, Fiber-2 and Fiber-3, respectively.]

Better Strategy Using List Scheduling
• Put each instruction in the earliest possible thread.
• Within a thread, execute the remote operations as early as possible.
Build a Data Dependence Graph (DDG) and use a list-scheduling strategy, where the selection of instructions is guided by earliest thread number and statement type.

Instruction Types (in scheduling priority order, from scheduled first to scheduled last)
  – remote_read, remote_write
  – remote_fn_call
  – local_simple
  – remote_compound
  – local_compound
  – basic_fn_call

List Scheduling the Previous Example
Each statement is annotated with (earliest thread number, statement type), where RR = remote read, LS = local simple, LC = local compound, RF = remote function call:
  (0, RR) a = x[i];
  (0, RR) b = x[j];
  (0, LS) fact = 1;
  (1, LS) sum = a + b;
  (1, LS) prod = a * b;
  (1, LC) fact = fact * a;
  (1, RF) r1 = g(sum);
  (1, RF) r2 = g(prod);
  (1, RF) r3 = g(fact);
  (2, LS) return (r1 + r2 + r3);

Resulting List-Scheduled Threads
Thread 0: a = x[i]; b = x[j]; fact = 1;
Thread 1 (enabled by 2 signals): sum = a + b; r1 = g(sum); prod = a * b; r2 = g(prod); fact = fact * a; r3 = g(fact);
Thread 2 (enabled by 3 signals): return (r1 + r2 + r3);

Generating Threaded-C Code

THREADED f (int *ret_parm, SLOT *rsync_parm, int *x, int i, int j)
{
  SLOTS SYNC_SLOTS[2];
  int a, b, sum, prod, fact, r1, r2, r3;
  /* THREAD_0: */
  INIT_SYNC (0, 2, 2, 1);
  INIT_SYNC (1, 3, 3, 2);
  GET_SYNC_L (&x[i], &a, 0);
  GET_SYNC_L (&x[j], &b, 0);
  fact = 1;
  END_THREAD ( );

  THREAD_1:;
  sum = a + b;
  TOKEN (g, &r1, SLOT_ADR(1), sum);
  prod = a * b;
  TOKEN (g, &r2, SLOT_ADR(1), prod);
  fact = fact * a;
  TOKEN (g, &r3, SLOT_ADR(1), fact);
  END_THREAD ( );

  THREAD_2:;
  DATA_RSYNC_L (r1 + r2 + r3, ret_parm, rsync_parm);
  END_FUNCTION ( );
}

Fine-Grain Synchronization: Two Types
Enforce mutual exclusion
  – order: no specific order required
  – fine-grain sync. solutions: software fine-grained locks; lock-free concurrent data structures; full/empty bits
Enforce data dependencies
  – order: uni-directional
  – fine-grain sync. solutions: I-structures; full/empty bits

Enforce Data Dependencies
• A DoAcross loop with a positive, constant dependence distance D; in a parallel execution, iterations are assigned to different threads.

for (i = D; i < N; ++i) {
  A[i] = ...
  ...
  ... = A[i-D];
}

T0 (i = 2):     { A[2] = ...    ...    ... = A[2-D]; }
T1 (i = 2 + D): { A[2+D] = ...  ...    ... = A[2]; }

The data dependence needs to be enforced by synchronization.

Memory-Based Fine-Grain Synchronization
• Full/empty bits (HEP, Tera MTA, etc.) and I-structures (dataflow-based machines).
• Associate a "state" with a memory location (fine granularity); fine-grain synchronization on the location is realized through state transitions on that state.
[Figure: I-structure state transition (ArvindEtAl89 @ TOPLAS): a write moves a location from Empty to Full; a read of an Empty location becomes a Deferred read, satisfied when the write arrives; a reset returns a Full location to Empty.]

With Memory-Based Fine-Grain Sync
The DoAcross loop above becomes:

for (i = D; i < N; ++i) {
  write_sync(&(A[i]), ...);
  ...
  ... = read_sync(&(A[i-D]));
}

so that, per thread:

T0 (i = 2):     { write_sync(&(A[2]), ...);    ...    ... = read_sync(&(A[2-D])); }
T1 (i = 2 + D): { write_sync(&(A[2+D]), ...);  ...    ... = read_sync(&(A[2])); }

• A single atomic operation completes the synchronized write/read directly in memory.
• There is no need to implement the synchronization with other resources, e.g., in shared memory.
• Low overhead: just one memory transaction.
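As a concrete (if much slower) software analogue of write_sync/read_sync, here is a minimal sketch in plain C with POSIX threads: each synchronized cell pairs a value with an Empty/Full flag, write_sync fills the cell, and read_sync blocks (a "deferred read") until it is Full. It models only the Empty/Full/deferred-read part of the I-structure state machine (no reset), and sync_cell_t, write_sync and read_sync are illustrative names here, not the EARTH or HEP/Tera primitives.

/* Minimal sketch of full/empty-style synchronized cells in software. */
#include <pthread.h>
#include <stdio.h>

#define N_CELLS 8
#define D 2

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int             full;        /* 0 = Empty, 1 = Full: the per-word "state" */
    double          value;
} sync_cell_t;

static sync_cell_t A[N_CELLS];

static void cell_init(sync_cell_t *c)
{
    pthread_mutex_init(&c->mu, NULL);
    pthread_cond_init(&c->cv, NULL);
    c->full = 0;
}

/* Store the value and move the cell from Empty to Full. */
static void write_sync(sync_cell_t *c, double v)
{
    pthread_mutex_lock(&c->mu);
    c->value = v;
    c->full  = 1;
    pthread_cond_broadcast(&c->cv);      /* wake any deferred readers */
    pthread_mutex_unlock(&c->mu);
}

/* A "deferred read": block until the cell is Full, then return its value. */
static double read_sync(sync_cell_t *c)
{
    pthread_mutex_lock(&c->mu);
    while (!c->full)
        pthread_cond_wait(&c->cv, &c->mu);
    double v = c->value;
    pthread_mutex_unlock(&c->mu);
    return v;
}

static void *iter_2(void *unused)        /* plays the role of T0, i = 2 */
{
    (void)unused;
    write_sync(&A[2], 2.0);              /* A[2] = ... */
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N_CELLS; i++)
        cell_init(&A[i]);
    write_sync(&A[2 - D], 0.5);          /* seed: A[2-D] has no producer here */

    pthread_t t0;
    pthread_create(&t0, NULL, iter_2, NULL);

    /* T1 (i = 2 + D) consumes A[2]; read_sync blocks until T0 has produced it. */
    double a2 = read_sync(&A[2]);
    printf("A[2] = %g, A[2-D] = %g\n", a2, read_sync(&A[2 - D]));

    pthread_join(t0, NULL);
    return 0;
}

The hardware schemes exist precisely to avoid this mutex and condition-variable machinery: the state lives with the word itself, and a single memory transaction both transfers the data and performs the synchronization.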
An Alternative: Control-Flow-Based Synchronization

for (i = D; i < N; ++i) {
  A[i] = ...
  post(i);        /* no data dependency with the surrounding statements */
  ...
  wait(i-D);      /* no data dependency with the surrounding statements */
  ... = A[i-D];
}

• The post/wait operations need to be implemented in shared memory, in coordination with the underlying memory (consistency) model.
• You may need to worry about this:

A[i] = ...;  fence;  post(i);
wait(i-D);  fence;  ... = A[i-D];

• For computations with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. [ArvindEtAl89 @ TOPLAS]

A Question!
Is it really necessary to tag every word in the entire memory to support memory-based fine-grain synchronization?

Key Observation
At any instant of a "reasonable" parallel execution, only a small fraction of memory locations are actively participating in synchronization.
Solution: the Synchronization State Buffer (SSB): only record and manage the states of actively synchronized data units to support fine-grain synchronization.

What Is SSB?
• A small hardware buffer attached to the memory controller of each memory bank.
• Records and manages the states of actively synchronized data units.
• Hardware cost:
  – Each SSB is a small look-up table: easy to implement.
  – The SSBs are independent of one another: hardware cost increases only linearly with the number of memory banks.

SSB on a Many-Core Chip (IBM C64)
[Figure: the IBM Cyclops-64 chip, designed by Monty Denneau.]

SSB Synchronization Functionalities
Data synchronization: enforce RAW data dependencies
• Supported at word level:
  – two single-writer-single-reader (SWSR) modes
  – one single-writer-multiple-reader (SWMR) mode
Fine-grain locking: enforce mutual exclusion
• Supported at word level:
  – write lock (exclusive lock)
  – read lock (shared lock)
  – recursive lock
SSB is capable of supporting more functionality.

Experimental Infrastructure
[Figure: the software toolchain comprises an OpenMP compiler, a C compiler (GCC/Open64), binutils (assembler, linker), libraries (TiNy Threads library/RTS, OpenMP RTS, standard C/math libraries) and the Cyclops-64 microkernel, running on the FAST simulator (software), the Ms. Clops hardware emulator, or the IBM Cyclops-64 chip.]
IBM Cyclops-64 chip architecture:
• 160 thread units (500 MHz)
• three-level, explicitly addressable memory hierarchy
• efficient thread-level execution support
• an SSB per on-chip SRAM bank: 16 entries, 8-way associative
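To make the "small look-up table" idea concrete, here is a toy, single-threaded C model of an SSB-like buffer. This is not the Cyclops-64 hardware or its instruction set, and the 8-way associativity is simplified to a fully associative 16-entry table; ssb_entry_t, ssb_write_full and ssb_read_full are invented names. It illustrates the key property claimed above: only words that are actively being synchronized occupy entries, an entry is allocated on the first synchronizing access and released when the synchronization completes, and a full table can fall back to another (e.g., software) mechanism.

/* Toy model of an SSB-like buffer: state is kept only for actively
 * synchronized words, in a small fixed-size table. */
#include <stdint.h>
#include <stdio.h>

#define SSB_ENTRIES 16                  /* the slides mention a 16-entry buffer */

typedef struct {
    uintptr_t addr;                     /* address of the synchronized word    */
    int       in_use;                   /* entry currently allocated?          */
    int       full;                     /* SWSR state: has the write happened? */
} ssb_entry_t;

static ssb_entry_t ssb[SSB_ENTRIES];

static ssb_entry_t *ssb_lookup(uintptr_t addr, int allocate)
{
    ssb_entry_t *free_slot = NULL;
    for (int i = 0; i < SSB_ENTRIES; i++) {
        if (ssb[i].in_use && ssb[i].addr == addr)
            return &ssb[i];
        if (!ssb[i].in_use && free_slot == NULL)
            free_slot = &ssb[i];
    }
    if (allocate && free_slot) {        /* allocate on first use of this word  */
        free_slot->in_use = 1;
        free_slot->addr   = addr;
        free_slot->full   = 0;
        return free_slot;
    }
    return NULL;                        /* not found, or table full            */
}

/* Single-writer-single-reader handshake: the write marks the word full ...   */
static int ssb_write_full(int *p, int v)
{
    ssb_entry_t *e = ssb_lookup((uintptr_t)p, 1);
    if (!e) return 0;                   /* no entry free: fall back to software */
    *p = v;
    e->full = 1;
    return 1;
}

/* ... and a successful read consumes the state and frees the entry.          */
static int ssb_read_full(int *p, int *out)
{
    ssb_entry_t *e = ssb_lookup((uintptr_t)p, 0);
    if (!e || !e->full) return 0;       /* data not produced yet: retry later  */
    *out = *p;
    e->in_use = 0;                      /* only active words occupy the SSB    */
    return 1;
}

int main(void)
{
    int x = 0, got = 0;
    printf("read before write: %s\n", ssb_read_full(&x, &got) ? "ok" : "not ready");
    ssb_write_full(&x, 42);
    printf("read after write:  %s (x = %d)\n",
           ssb_read_full(&x, &got) ? "ok" : "not ready", got);
    return 0;
}

A real SSB also provides the SWMR mode, the lock modes and concurrent access from many thread units; this sketch covers only a success/fail single-writer-single-reader handshake.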
SSB Fine-Grain Sync. Is Efficient
• For all the benchmarks, the SSB-based version shows a significant performance improvement over versions based on other synchronization mechanisms.
• For example, with up to 128 threads:
  – Livermore loop 6 (linear recurrence): a 312% improvement over the barrier-based version
  – ordered integer set (hash table): outperforms the software-based fine-grain methods by up to 84%

Research Layout
[Figure: application kernels (scientific computation kernels, high-performance bio-computing kernels, other high-end applications) sit on an advanced execution/programming model (percolation) and future programming models, which in turn build on the base execution model (fine-grain multithreading, e.g., EARTH, CARE; location consistency). Target platforms include HTMT-like architectures, high-end PIM architectures and cellular multithreaded architectures (e.g., BG/C). Infrastructure and tools: system software, simulation/emulation, analytical modeling.]

Percolation Model: A User's Perspective
[Figure: high-speed CPUs with CRAM/SRAM form the primary execution engine; an SRAM-PIM (S-PIM) engine prepares and percolates "parceled threads"; a DRAM-PIM (D-PIM) engine performs intelligent memory operations and global memory management.]

The Percolation Model
• What is percolation? Dynamic, adaptive computation/data movement, migration and transformation, in place or on the fly, to keep system resources usefully busy.
• Features of percolation:
  – both data and threads may percolate
  – computation reorganization and data-layout reorganization
  – asynchronous invocation
[Figure: an example of percolation, Cannon's algorithm, on an HTMT-like architecture (Level 0: fast CPUs; Levels 1-3: PIM). Data percolates across the levels, and the data layout is reorganized during percolation to match Cannon's nearest-neighbor data transfers.]

Performance of SCCA2 Kernel 4
Metric: TEPS (traversed edges per second)

#threads   C64         SMPs       MTA-2
4          2917082     5369740    752256
8          5513257     2141457    619357
16         9799661     915617     488894
32         17349325    362390     482681

• Reasonable scalability: scales well with the number of threads, with linear speedup for fewer than 32 threads.
• Commodity SMPs show poor performance (SMPs: 4-way dual-core Xeon, 2 MB L2 cache).
• Competitive vs. the MTA-2.

Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization: SSB
• The percolation model and its applications
• Summary