Performance, Area and Bandwidth Implications on Large-Scale CMP Cache Design Li Zhao, Ravi Iyer, Srihari Makineni, Jaideep Moses, Ramesh Illikkal, Don Newell Intel Corporation.


Performance, Area and Bandwidth
Implications on Large-Scale CMP
Cache Design
Li Zhao, Ravi Iyer,
Srihari Makineni, Jaideep Moses,
Ramesh Illikkal, Don Newell
Intel Corporation
Outline
 Motivation
 Overview of LCMP
 Constraint-aware Analysis Methodology
 Experimental Results
    Area and Bandwidth Implications
    Performance Evaluation
 Summary
Motivation
 CMP architectures have been widely adopted
    SCMP: a few large out-of-order cores (e.g., Intel Dual-core Xeon processor)
    LCMP: many small in-order cores for high throughput (e.g., Sun Niagara, Azul)
 Questions on the cache/memory hierarchy
    How do we prune the cache design space for LCMP architectures? What methodology needs to be put in place?
    How should the cache be sized and shared at each level of the hierarchy?
    How much memory and interconnect bandwidth is required for scalable performance?
 The goal of this paper is to accomplish a first level of analysis that narrows the design space
Outline
 Motivation
 Overview of LCMP
 Constraint-aware Analysis Methodology
 Experimental Results
    Area and Bandwidth Implications
    Performance Evaluation
 Summary
Overview of LCMP
 16 or 32 lightweight cores on-die
[Diagram: the LCMP CPU contains nodes of cores, each core with a private L1; an L2 per node; an on-die interconnect joining the nodes to a shared L3; and memory and IO interfaces connecting to DRAM and an IO bridge]
Outline
 Motivation
 Overview of LCMP
 Constraint-aware Analysis Methodology
 Experimental Results
    Area and Bandwidth Implications
    Performance Evaluation
 Summary
Cache Design Considerations
 Die area constraints
    Only a fraction of the die (40 to 60%) may be available for cache
 On-die and off-die bandwidth
    The on-die interconnect carries the communication between levels of the cache hierarchy
    Off-die memory bandwidth
 Power consumption
 Overall performance
    Indicates the effectiveness of the cache design in supporting many simultaneous threads of execution
Constraint-Aware Analysis Methodology
 Area constraints
    Prune the design space by the area constraints
    Estimate the area required for L2, then apply the overall area constraint to the cache
 Bandwidth constraints
    Further prune the options that survived the area constraints by applying the on-die and off-die bandwidth constraints
    Estimate the number of requests generated by the caches at each level, which depends on core performance and cache performance for a given workload
 Overall performance
    Compare the performance of the pruned options and determine the top two or three design choices
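The area-pruning stage of the methodology reduces to a feasibility filter over the candidate grid. The sketch below assumes an illustrative SRAM density (MM2_PER_MB) and a 200 mm2 cache budget; the paper's actual area numbers come from CACTI, so these constants are placeholders, not its results.

```python
# Area-constraint pruning stage, sketched. Density and budget values
# are illustrative assumptions, not the paper's CACTI-derived figures.
from dataclasses import dataclass
from itertools import product

@dataclass
class Config:
    cores_per_node: int
    l2_kb_per_node: int   # L2 size per node, in KB
    l3_mb: int            # shared L3 size, in MB

NUM_CORES = 32
AREA_BUDGET_MM2 = 200.0   # 50% of an assumed 400 mm2 die
MM2_PER_MB = 10.0         # assumed SRAM density (illustrative)

def cache_area_mm2(cfg):
    nodes = NUM_CORES // cfg.cores_per_node
    total_mb = nodes * cfg.l2_kb_per_node / 1024 + cfg.l3_mb
    return total_mb * MM2_PER_MB

def passes_area(cfg):
    nodes = NUM_CORES // cfg.cores_per_node
    total_l2_mb = nodes * cfg.l2_kb_per_node / 1024
    inclusive_ok = cfg.l3_mb >= 2 * total_l2_mb   # inclusive L3 >= 2x L2
    return inclusive_ok and cache_area_mm2(cfg) <= AREA_BUDGET_MM2

candidates = [Config(c, l2, l3)
              for c, l2, l3 in product((1, 2, 4),
                                       (128, 256, 512, 1024),
                                       (8, 16, 32))]
survivors = [c for c in candidates if passes_area(c)]
```

The surviving configurations would then feed the bandwidth and performance stages in the same spirit: estimate a cost per option, discard those that exceed the constraint.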
Outline
 Motivation
 Overview of LCMP
 Constraint-aware Analysis Methodology
 Experimental Results
    Area and Bandwidth Implications
    Performance Evaluation
 Summary
Experimental Setup
 Platform simulator
    Core model
    Cache hierarchy
    Interconnect model
    Memory model
 Area estimation tool
    CACTI 3.2
 Workloads and traces
    OLTP: TPC-C
    SAP: SAP SD 2-tier workload
    JAVA: SPECjbb2005
 Baseline configuration
    32 cores (4 threads/core), core CPI = 6
    Several nodes (1 to 4 cores/node), one L2 per node (128K to 4M)
[Diagram: simulated platform with cores and their L1 caches grouped into nodes, an L2 per node, the interconnect, a shared L3, and memory]
Area Constraints
[Chart: L2 area consumed and potential L3 area (in sq mm, against 200 and 300 sq mm budgets) for each L2 cache size (128K to 4M) and number of cores per node (1, 2, 4); feasible options are labeled with the resulting L3 size (8M to 20M), infeasible ones with an X]
 Look for options that support 3 levels of cache
 Assume total die area is 400 mm2
 Two constraints on cache area: 50% → 200 mm2, 75% → 300 mm2
 Inclusive cache → L3 >= 2x L2
Sharing Impact
[Chart: TPC-C MPI (0.000 to 0.040) vs. L2 cache size per core (128K to 1M) for 1, 2 and 4 cores/node]
 MPI reduces when we increase the sharing degree
 512K seems to be a sweet spot
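One intuition for the trend above: cores in a node keep only one copy of the data they share, so each core effectively sees a larger cache. The following is a toy capacity model, not the paper's simulation; the shared working-set fraction is an invented parameter purely for illustration.

```python
# Toy model of why MPI drops with higher sharing degree (illustrative;
# not the paper's simulation model). shared_frac is the assumed fraction
# of a core's working set common to all cores in the node.

def effective_kb_per_core(kb_per_core, cores_per_node, shared_frac=0.4):
    node_cache_kb = kb_per_core * cores_per_node
    # Distinct data the node must hold: one private part per core
    # plus a single copy of the shared part.
    footprint_units = cores_per_node * (1 - shared_frac) + shared_frac
    return node_cache_kb / footprint_units

# With 512K per core, sharing among more cores behaves like giving
# each core a larger private cache:
for n in (1, 2, 4):
    print(n, effective_kb_per_core(512, n))
```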
Bandwidth Constraints
[Charts: on-die bandwidth (GB/s) vs. L2 cache size (128K to 1M) with an infinite L3 and a 32M L3; off-die bandwidth vs. L2 cache size with a 16M and a 32M L3]
 4 cores/node, 8 nodes
 On-die BW demand is around 180 GB/s with an infinite L3 and reduces significantly with a 32M L3 cache
 Off-die memory BW demand is between 40 and 50 GB/s and reduces as we increase the L3 cache size
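The request-rate estimate from the methodology converts to a bandwidth figure as instructions per second times misses per instruction times line size. The sketch below assumes a 2 GHz clock, 64-byte lines, and MPI values chosen only for illustration; none of these are calibrated to the paper's simulator.

```python
# Back-of-envelope bandwidth-demand model (illustrative assumptions:
# frequency, line size, and MPI values are not the paper's numbers).

def bw_demand_gbs(num_cores, freq_ghz, cpi, mpi, line_bytes=64):
    """Traffic (GB/s) generated by misses at one cache level:
    instructions/s * misses/instruction * bytes/miss."""
    instr_per_sec = num_cores * freq_ghz * 1e9 / cpi
    return instr_per_sec * mpi * line_bytes / 1e9

# 32 cores at an assumed 2 GHz, core CPI = 6:
on_die = bw_demand_gbs(32, 2.0, 6.0, mpi=0.03)    # L2-miss traffic
off_die = bw_demand_gbs(32, 2.0, 6.0, mpi=0.008)  # L3-miss traffic
```

A larger L3 lowers the L3 MPI term, which is exactly why the off-die curves above fall as the L3 grows.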
Cache Options Summary
 Node: 1 to 4 cores
 L2 size per core: around 128K to 256K seems viable for a 32-core LCMP
 L3 size: 8M to about 20M, depending on the configuration

Cores/node   # of nodes   L2 cache/node   L3 cache size
1            32           128K            ~12M
2            16           256K – 512K     8M – 16M
4            8            512K – 1M       10M – 18M
Performance Evaluation (TPC-C)
[Chart: normalized CPI broken down into Core, L2, L3 and Mem components, with Perf/Area and Perf^3/Area curves, across configurations from (1 core/node, 128K L2, 8M L3) to (4 cores/node, 1M L2, 32M L3)]
 On-die BW is 512 GB/s; max sustainable memory BW is 64 GB/s
 Performance: the configuration (4 cores/node, 1M L2, 32M L3) is the best
 Performance per unit area: the configuration (4 cores/node, 512K L2, 8M L3) is the best
 Performance^3 per unit area: (4 cores/node, 512K to 1M L2, 8M to 16M L3)
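The two area-weighted metrics above can be derived directly from the simulated CPI, treating throughput as 1/CPI at fixed frequency. The CPI and area values below are placeholders, not the paper's measurements; they only show why the metrics can rank the same two configurations differently.

```python
# Perf/Area and Perf^3/Area from CPI. CPI and area values here are
# illustrative placeholders, not the paper's measurements.

def perf_metrics(cpi, area_mm2):
    perf = 1.0 / cpi  # throughput proxy at fixed frequency
    return perf / area_mm2, perf ** 3 / area_mm2

# A bigger cache that lowers CPI can lose on Perf/Area (silicon cost
# dominates) yet win on Perf^3/Area (performance weighted more heavily):
small = perf_metrics(cpi=1.0, area_mm2=100.0)
large = perf_metrics(cpi=0.8, area_mm2=160.0)
```

This is why Perf^3/Area favors the larger (1M L2, 16M L3) options while plain Perf/Area picks the smaller 8M L3 configuration.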
Performance Evaluation (SAP, SPECjbb)
[Charts: normalized CPI (Core, L2, L3, Mem components) with Perf & Area comparison curves for SAP and SPECjbb, across the same configurations as TPC-C]
Implications and Inferences
 Design a 3-level cache hierarchy
 Each node consists of four cores with 512K to 1M of L2 cache
 The L3 cache size is recommended to be a minimum of 16M
 Recommend that the platform support at least 64 GB/s of memory bandwidth and 512 GB/s of interconnect bandwidth
Summary
 Performed the first study of the performance, area and bandwidth implications of LCMP cache design
 Introduced a constraint-aware analysis methodology to explore LCMP cache design options
 Applied this methodology to a 32-core LCMP architecture
 Quickly narrowed down the design space to a small subset of viable options
 Conducted an in-depth performance/area evaluation of these options and summarized a set of recommendations for architecting efficient LCMP platforms