Slide 1. Automatic Measurement of Instruction Cache Capacity in X-Ray
Kamen Yotov, [email protected], IBM T. J. Watson Research Center
Joint work with: Tyler Steele, Sandra Jackson, Keshav Pingali, Paul Stodghill
Department of Computer Science, Cornell University
11/7/2015 QEST'05

Slide 2. Motivation: self-optimizing software
- Goal: portable performance
- Self-optimizing software
  - Generates code with parameters whose optimal values depend on the platform (hardware / OS / compiler)
  - Determines optimal parameter values experimentally
  - Uses the native C compiler to produce the library
- Examples: ATLAS, FFTW, SPIRAL, …

Slide 3. Example: Register Blocking for MMM
- Hardware parameters
  - Number of FP registers (NR)
  - I-Cache Capacity (ICC)
- A simple model for the register tile size for MMM (Yotov et al., IEEE'05)
  - MU x NU + MU + NU + Temp ≤ NR
  - KU (unroll of K loop): does not depend on NR, depends on ICC
- Need to know NR and ICC!
[Figure: matrices A, B and C with dimensions labeled MU, NU, K and NB]

Slide 4. Why not consult the manuals?
- Actual hardware values vs. number available for optimization
  - For software optimization, hardware values may not be relevant
  - E.g. the number of hardware registers may not be equal to the number of registers available for holding program values (register 0 on SPARC)
- Incomplete
  - Parameters like capacity and line size of off-chip caches vary from model to model
  - Even the same model of computer may be shipped with different cache organizations
  - Not usually documented in processor manuals
- Moving target
- Self-optimizing systems
  - Require online manuals

Slide 5. Automatic Measurement Tools
- lmbench
  - OS benchmark, some CPU / memory benchmarks
  - Larry McVoy, BitMover, Inc.; Carl Staelin, HP
- Calibrator
  - Memory hierarchy benchmark
  - Stefan Manegold, Centrum voor Wiskunde en Informatica
- MOB
  - Memory hierarchy benchmark
  - Josep Blanquer, Robert Chalmers, University of California Santa Barbara

Slide 6. X-Ray
- Set of micro-benchmarks in ANSI C89
  - Download and compile on any architecture (portable)
  - Deduce hardware parameter values from timing results
- Some amount of O/S specific code
  - High-resolution timing routines
  - Super-page allocation
  - Currently support Linux, Windows and Solaris, IRIX, and AIX in the works
- Paradox
  - Compiler optimizations may contaminate timing results
  - Cannot afford to turn off all optimizations

Slide 7. Example: Latency of Integer ADD (Step by Step)

    t = gettime();
    r1 += r2;
    return gettime() - t;

Problem: hard to measure small time intervals accurately

Slide 8. Step by Step (cont.)

    t = gettime();
    while (--R)            /* R is the number of repetitions */
        r1 += r2;
    return gettime() - t;

Problem: loop overhead

Slide 9. Step by Step (cont.)

    t = gettime();
    i = R / U;
    while (--i)            /* loop unrolled U times */
    {
        r1 += r2;
        r1 += r2;
        ........
        r1 += r2;
    }
    return gettime() - t;

Problem: compiler optimizations

Slide 10. Step by Step (cont.)

    t = gettime();
    i = R / U;
    switch (v)
    {
        case 0:
        loop:
        case 1: r1 += r2;
        case 2: r1 += r2;
        .................
        case U: r1 += r2;
        if (--i) goto loop;
    }
    if (!v)
        return gettime() - t;
    else
        use(r1, r2);

Solution: "volatile int v = 0"

Slide 11. Latency of integer ADD: nano-benchmark C code
- Want to measure r1 += r2
- Generate C code from the specification <r1+=r2, <r1, r2: int>>

    volatile int v = 0;
    volatile int vr = 0;
    register int r1 = vr;
    register int r2 = vr;
    t = gettime();
    i = R / U;
    switch (v)
    {
        case 0:
        loop:
        case 1: r1 += r2;
        case 2: r1 += r2;
        .................
        case U: r1 += r2;
        if (--i) goto loop;
    }
    if (!v)
        return gettime() - t;
    else
    {
        vr = r1;
        vr = r2;
    }

Slide 12. X-Ray architecture
[Figure: block diagram of a micro-benchmark with components Control Engine, Nano-benchmark Generator, and Compile / Execute / Time; the labels on the diagram are Parameters, Nano-benchmark Specification, Nano-benchmark C Code, Execution Time, and Hardware Parameter Value]

Slide 13. Instruction Throughput
- Specification
- Control Engine
- N=3, B=1:
[Figure: content not recoverable from the transcript]

Slide 14. Micro-benchmarks in X-Ray
- CPU
  - Frequency
  - Instruction Latency
  - Instruction Throughput
  - Instruction Existence
    - FPU on embedded processors
    - FMA on general purpose processors
  - SMP and SMT
- Memory Hierarchy
  - Number of registers of various types (int, float, SSE, …)
  - Multilevel Caches, TLB
    - Associativity
    - Block Size
    - Capacity
    - Latency
  - Instruction Cache Capacity

Slide 15. Previous Approaches for Memory Hierarchy Parameters
- Saavedra Benchmark (Hennessy-Patterson)
  - Accesses elements of an array a constant stride apart
  - Measures average memory access time
- Deficiencies
  - Considers all levels simultaneously
  - Works only for capacities that are powers of 2
  - Suffers from a number of implementation-level deficiencies
    - Constant stride accesses
    - Loop overhead problems
    - Overlapping memory operations
    - Prone to compiler "optimizations"
[Figure: array elements accessed at stride S]

Slide 16. Example: Isolation of lower cache levels
- Idea for Ln measurements
  - Use sequences as for L1 measurements
  - Make L1…Ln-1 "transparent" to measurements, so that all higher levels miss
  - Unique in isolating the behavior of Ln
- Approach
  - Use sequences of sequences
  - Convolution of sequences
[Figure: CPU with cache levels L1, L2, L3; convolution of sequences S]

Slide 17. Measuring I-Cache Capacity
- Approach for Data Cache does not work
  - Array of pointers
  - Code sequence with branches
  - Such branches are very predictable
  - Nearly impossible to get precise timing
- Measure time to execute a special code sequence of size N statements
- Find the biggest N for which there is no significant increase in time per statement

Slide 18. Nano-benchmark
- Similar to Instruction Throughput
- Parameters (1, 4)
- Grow length N
- Code size computed: (char *)&&finish - (char *)&&start

Slide 19. Sensitivity Graph for Pentium M
- 9 more in the paper
- Performance oscillates
  - Even after averaging out noise
- Cannot wait for jump
  - Need more robust measurement
[Figure: sensitivity graph, not recoverable from the transcript]

Slide 20. Control Engine Script
- Start with N=256
- Compute
  - Mean
  - Standard deviation
  - For [expression missing from the transcript]
- Binary-search
  - Detect jump when time is more than [threshold missing from the transcript]

Slide 21. Experimental Results
[Results not recoverable from the transcript]

Slide 22. Pentium 4
- Does not cache ISA instructions, but uops
  - Trace cache
  - Measure the number of instructions
- Smoothing in the nano-benchmark: minimum of time in [remainder missing from the transcript]

Slide 23. Conclusions
- X-Ray: a framework and tool
  - Algorithms for precise measurements of some important hardware parameters
  - Experimental results on many modern architectures
  - First to measure instruction cache capacity
- Other X-Ray resources
  - Memory hierarchy parameter measurement appeared at SIGMETRICS'05
  - CPU parameter measurement appeared at QEST'05
- Improving X-Ray is work in progress…

Slide 24. Current and Future Work
- 2-address vs. 3-address code
- Out-of-order execution
  - Number of physical registers
  - Number / type of functional units
- Cache
  - bandwidth
  - write mode
  - sharedness
  - replacement policy

Slide 25. Thank you!
- My E-Mail: [email protected] [email protected]
- Cornell Group homepage: http://iss.cs.cornell.edu
- This work emerged from a joint project with David Padua's group at UIUC: http://polaris.cs.uiuc.edu/newframework.html
- Download X-Ray! http://iss.cs.cornell.edu/software/x-ray.aspx
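
Supplementary sketch for the "Register Blocking for MMM" slide. The constraint MU x NU + MU + NU + Temp ≤ NR bounds the register tile, and a small search makes that bound concrete. The slide does not say which feasible tile to prefer, so maximizing MU x NU here is an illustrative assumption, and the names `struct tile` and `best_register_tile` are hypothetical.

```c
#include <assert.h>

/* Hypothetical result type: a register tile of MU x NU elements. */
struct tile {
    int mu;
    int nu;
};

/*
 * Scan every (MU, NU) pair that satisfies
 *     MU*NU + MU + NU + temp <= nr
 * and return the one with the largest MU*NU. Ties keep the pair found
 * first in scan order. Returns {0, 0} when no tile fits.
 * ASSUMPTION: "largest MU*NU" is an illustrative objective only.
 */
struct tile best_register_tile(int nr, int temp)
{
    struct tile best = {0, 0};
    int mu, nu;

    assert(nr >= 0 && temp >= 0);

    for (mu = 1; mu <= nr; mu++) {
        for (nu = 1; nu <= nr; nu++) {
            /* The left-hand side grows with nu, so stop at the first miss. */
            if (mu * nu + mu + nu + temp > nr)
                break;
            if (mu * nu > best.mu * best.nu) {
                best.mu = mu;
                best.nu = nu;
            }
        }
    }
    return best;
}
```

For example, `best_register_tile(16, 1)` yields a 3 x 3 tile, since 3*3 + 3 + 3 + 1 = 16 just fits in 16 registers.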
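
Supplementary sketch for the "Latency of integer ADD: nano-benchmark C code" slide. It writes the switch / goto pattern out as a compilable function. The slides do not define `gettime()`, so the standard `clock()` stands in for it, and the unroll factor of 8 and the name `seconds_per_add` are assumptions made for illustration. This is not the code X-Ray generates.

```c
#include <assert.h>
#include <time.h>

#define UNROLL 8 /* illustrative unroll factor U; an assumption */

/*
 * Estimate the time in seconds of one `r1 += r2`, using about `reps`
 * dependent additions. clock() stands in for the slides' gettime().
 */
double seconds_per_add(long reps)
{
    volatile int v = 0;  /* read at run time, so every case label stays reachable */
    volatile int vr = 0; /* hides the operand values from the compiler */
    register int r1 = vr;
    register int r2 = vr;
    long i = reps / UNROLL;
    long adds = i * UNROLL;
    clock_t start, stop;

    assert(reps >= UNROLL); /* at least one trip through the unrolled body */

    start = clock();
    switch (v) {
    case 0:
    loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
    case 5: r1 += r2;
    case 6: r1 += r2;
    case 7: r1 += r2;
    case 8: r1 += r2;
        if (--i) goto loop;
    }
    stop = clock();

    /* Store the results so the additions are not treated as dead code. */
    vr = r1;
    vr = r2;

    return ((double)(stop - start) / CLOCKS_PER_SEC) / (double)adds;
}
```

Multiplying the result by the clock frequency gives an estimate in cycles. The number depends on the compiler and its flags, and `clock()` is much coarser than the high-resolution timers the X-Ray slide mentions, so `reps` should be large.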
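
Supplementary sketch for the "Measuring I-Cache Capacity" and "Control Engine Script" slides. One way to find the biggest N with no significant increase in time per statement is to take a baseline at a small N and then binary-search for the first jump. The transcript cuts off the actual threshold, so the rule used here (more than `k` standard deviations above the baseline mean) is an assumption, as are all names. `synthetic_time` is a made-up stand-in for a real measurement.

```c
#include <assert.h>
#include <stddef.h>

#define BASELINE_SAMPLES 8 /* illustrative sample count; an assumption */

/* Hypothetical callback: time per statement for a code sequence of n statements. */
typedef double (*time_fn)(long n);

/* Synthetic stand-in for a measurement: flat up to 4096 statements, slower after. */
double synthetic_time(long n)
{
    return n <= 4096 ? 1.0 : 3.0;
}

/* True if t is more than k standard deviations above the mean.
 * Squares are compared so that no sqrt() call is needed. */
static int is_jump(double t, double mean, double var, double k)
{
    double d = t - mean;
    return d > 0.0 && d * d > k * k * var;
}

/*
 * Return the largest n in [n_start, n_max] whose time shows no jump.
 * ASSUMPTION: times stay flat and then step up once, so binary search is valid.
 */
long largest_n_without_jump(time_fn f, long n_start, long n_max, double k)
{
    double sample[BASELINE_SAMPLES];
    double mean = 0.0, var = 0.0;
    long lo = n_start, hi = n_max, mid;
    size_t s;

    assert(f != NULL && n_start > 0 && n_start <= n_max);

    /* Baseline statistics at the starting size. */
    for (s = 0; s < BASELINE_SAMPLES; s++) {
        sample[s] = f(n_start);
        mean += sample[s];
    }
    mean /= BASELINE_SAMPLES;
    for (s = 0; s < BASELINE_SAMPLES; s++)
        var += (sample[s] - mean) * (sample[s] - mean);
    var /= BASELINE_SAMPLES;

    if (!is_jump(f(hi), mean, var, k))
        return hi; /* no jump anywhere in the range */

    /* Invariant: no jump at lo, jump at hi. */
    while (hi - lo > 1) {
        mid = lo + (hi - lo) / 2;
        if (is_jump(f(mid), mean, var, k))
            hi = mid;
        else
            lo = mid;
    }
    return lo;
}
```

With the synthetic model, `largest_n_without_jump(synthetic_time, 256, 65536, 3.0)` returns 4096. The Pentium M slide notes that real timings oscillate even after averaging, so a real control engine needs a sturdier test than this single-step assumption.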