Performance, Area and Bandwidth Implications on Large-Scale CMP Cache Design
Li Zhao, Ravi Iyer, Srihari Makineni, Jaideep Moses, Ramesh Illikkal, Don Newell
Intel Corporation
Outline
- Motivation
- Overview of LCMP
- Constraint-Aware Analysis Methodology
- Experiment Results
  - Area and Bandwidth Implications
  - Performance Evaluation
- Summary

Motivation
- CMP architectures have been widely adopted:
  - SCMP: a few large out-of-order cores (e.g., the Intel dual-core Xeon processor)
  - LCMP: many small in-order cores (e.g., Sun Niagara, Azul), aimed at high throughput
- Open questions on the cache/memory hierarchy:
  - How do we prune the cache design space for LCMP architectures, and what methodology needs to be put in place?
  - How should the cache be sized and shared at each level of the hierarchy?
  - How much memory and interconnect bandwidth is required for scalable performance?
- The goal of this paper is to accomplish a first level of analysis that narrows the design space

Overview of LCMP
- 16 or 32 lightweight cores on die
- [Block diagram: nodes of cores with private L1 caches share an L2 per node; the nodes connect over an on-die interconnect to a shared L3, a memory interface to DRAM, and an I/O bridge]

Cache Design Considerations
- Die area constraints: only a fraction of the die (40 to 60%) may be available for cache
- On-die and off-die bandwidth: the on-die interconnect carries the communication between cache hierarchy levels; off-die memory bandwidth is limited
- Power consumption
- Overall performance: indicates how effectively the cache design supports many simultaneous threads of execution

Constraint-Aware Analysis Methodology
1. Area constraints: prune the design space by the area constraints; estimate the area required for L2, then apply the overall area constraint to the cache hierarchy
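The area-pruning step can be sketched as a simple filter over candidate configurations. The cache density figure and the candidate list below are illustrative assumptions, not CACTI 3.2 output:

```python
# Sketch of the area-pruning step. All area numbers are hypothetical
# placeholders, not CACTI 3.2 estimates.

DIE_AREA_MM2 = 400                       # total die area assumed in the study
CACHE_BUDGETS = {0.50: 200, 0.75: 300}   # fraction of die -> cache area (mm^2)
MM2_PER_MB = 10                          # assumed cache density (illustrative)

def prune_by_area(configs, budget_mm2):
    """Keep configs whose total L2 leaves room for an inclusive L3 >= 2x L2."""
    viable = []
    for name, l2_total_mb in configs:
        l2_area = l2_total_mb * MM2_PER_MB
        min_l3_area = 2 * l2_total_mb * MM2_PER_MB   # inclusion: L3 >= 2x L2
        if l2_area + min_l3_area <= budget_mm2:
            viable.append(name)
    return viable

configs = [("4 cores/node, 512K L2", 4.0),   # 8 nodes x 512K = 4 MB total L2
           ("4 cores/node, 4M L2", 32.0)]    # 8 nodes x 4M = 32 MB total L2
print(prune_by_area(configs, CACHE_BUDGETS[0.75]))
```

Any option whose L2 plus the smallest acceptable inclusive L3 exceeds the cache-area budget is dropped before bandwidth is even considered.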
2. Bandwidth constraints: further prune the options that survive the area constraints by applying the on-die and off-die bandwidth constraints; estimate the number of requests generated by the caches at each level, which depends on core performance and cache performance for a given workload
3. Overall performance: compare the performance of the pruned options and determine the top two or three design choices

Experimental Setup
- Platform simulator: core model, cache hierarchy, interconnect model, memory model
- Area estimation tool: CACTI 3.2
- Workloads and traces: OLTP (TPC-C), SAP (SAP SD 2-tier workload), JAVA (SPECjbb2005)
- Baseline configuration: 32 cores (4 threads/core), core CPI = 6; nodes of 1 to 4 cores each, with a per-node L2 of 128K to 4M

Area Constraints
- [Chart: L2 area consumed and potential L3 area (for 200 mm^2 and 300 mm^2 cache budgets) versus L2 cache size (128K to 4M) and number of cores per node (1 to 4); feasible L3 sizes range from 8M to 20M, with infeasible points marked X]
- Look for options that support three levels of cache
- Assume a total die area of 400 mm^2
- Two cache-area budgets: 50% (200 mm^2) and 75% (300 mm^2)
- Inclusive caches: L3 >= 2x L2

Sharing Impact
- [Chart: TPC-C MPI versus cache size per core (128K to 1M) for 1, 2 and 4 cores per node]
- MPI reduces when we increase the sharing degree
- 512K per core seems to be a sweet spot

Bandwidth Constraints
- [Charts: on-die and off-die bandwidth demand versus L2 cache size (128K to 1M), for L3 sizes of 16M, 32M and infinite]
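The bandwidth-pruning step boils down to translating per-level miss rates into traffic. A back-of-the-envelope sketch, where the 3 GHz clock and the per-level miss rates are assumptions for illustration (the study's actual demands come from simulation):

```python
# Back-of-the-envelope bandwidth demand for the pruning step. The clock
# frequency and per-level miss rates below are illustrative assumptions.

LINE_BYTES = 64          # assumed cache line size

def bandwidth_gbs(misses_per_instr, instrs_per_sec, line_bytes=LINE_BYTES):
    """Traffic at a cache level = miss rate x instruction rate x line size."""
    return misses_per_instr * instrs_per_sec * line_bytes / 1e9

cores, core_cpi, freq_hz = 32, 6.0, 3e9          # 32-core LCMP, core CPI = 6
instrs_per_sec = cores * freq_hz / core_cpi      # aggregate throughput

on_die = bandwidth_gbs(0.015, instrs_per_sec)    # L2 miss traffic (assumed MPI)
off_die = bandwidth_gbs(0.004, instrs_per_sec)   # L3 miss traffic (assumed MPI)
print(f"on-die: {on_die:.1f} GB/s, off-die: {off_die:.1f} GB/s")
```

Options whose estimated traffic exceeds the sustainable on-die or off-die bandwidth are discarded; the rest move on to the performance comparison.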
- Configuration: 4 cores/node, 8 nodes
- On-die bandwidth demand is around 180 GB/s with an infinite L3, and reduces significantly with a 32M L3 cache
- Off-die memory bandwidth demand is between 40 and 50 GB/s, and reduces as we increase the L3 cache size

Cache Options Summary
- Node: 1 to 4 cores
- An L2 size per core of around 128K to 256K seems viable for a 32-core LCMP
- L3 sizes ranging from 8M to about 20M, depending on the configuration, can be considered

  Cores/node   # of nodes   L2 cache/node   L3 cache size
  1            32           128K            ~12M
  2            16           256K - 512K     8M - 16M
  4            8            512K - 1M       10M - 18M

Performance Evaluation (TPC-C)
- [Chart: normalized CPI broken into core, L2, L3 and memory components, with Perf/Area and Perf^3/Area, for configurations from (1 core/node, 128K L2, 8M L3) to (4 cores/node, 1M L2, 32M L3)]
- On-die bandwidth is 512 GB/s; maximum sustainable memory bandwidth is 64 GB/s
- Performance: the configuration (4 cores/node, 1M L2, 32M L3) is the best
- Performance per unit area: the configuration (4 cores/node, 512K L2, 8M L3) is the best
- Performance^3 per unit area: (4 cores/node, 512K to 1M L2, 8M to 16M L3)

Performance Evaluation (SAP, SPECjbb)
- [Charts: the same normalized CPI and performance/area comparison for SAP and SPECjbb2005]

Implications and Inferences
- Design a 3-level cache hierarchy; each node consists of four cores with 512K to 1M of L2 cache
- The L3 cache size is recommended to be a minimum of 16M
- The platform should support at least 64 GB/s of memory bandwidth and 512 GB/s of interconnect bandwidth

Summary
- Performed the first study of the performance, area and bandwidth implications of LCMP cache design
- Introduced a constraint-aware analysis methodology to explore LCMP cache design options
- Applied this methodology to a 32-core LCMP architecture and quickly narrowed the design space down to a small subset of viable options
- Conducted an in-depth performance/area evaluation of these options and summarized a set of recommendations for architecting efficient LCMP platforms
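The three ranking metrics used in the performance evaluation can be reproduced from a CPI and an area estimate per option. The CPI and area values below are hypothetical placeholders, not the measured TPC-C results:

```python
# Ranking options by the study's three metrics: performance (taken as 1/CPI),
# performance per unit area, and performance^3 per unit area.
# The CPI and area numbers are hypothetical, not measured results.

def metrics(cpi, area_mm2):
    perf = 1.0 / cpi
    return {"perf": perf,
            "perf_per_area": perf / area_mm2,
            "perf3_per_area": perf ** 3 / area_mm2}

options = {
    "4 cores/node, 512K L2, 8M L3": metrics(cpi=1.10, area_mm2=220),
    "4 cores/node, 1M L2, 32M L3":  metrics(cpi=0.95, area_mm2=380),
}

best_perf = max(options, key=lambda k: options[k]["perf"])
best_ppa  = max(options, key=lambda k: options[k]["perf_per_area"])
print(best_perf)   # highest raw performance
print(best_ppa)    # best performance per unit area
```

Cubing performance in the third metric weights throughput more heavily than area, which is why it lands between the raw-performance and performance-per-area winners.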