PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE
COMPUTING SYSTEMS
ANIL KRISHNA
Advisor: Dr. YAN SOLIHIN
PhD Defense Examination, August 6th 2013
Image Source: http://en.kioskea.net/faq/372-choosing-the-right-cpu
1
Good Morning!
2
AGENDA
How this talk is organized
o RESEARCH OVERVIEW
  o Questions I have been researching all these years
o SUMMARY – Motivation, Problem, Contribution
  o Quick overview of my latest research
o DETAILS of ReSHAPE
  o A performance estimation tool
o VALIDATION
  o Does this tool work?
o USE CASES
  o Where can it be used?
o CONCLUSIONS and FUTURE DIRECTION
  o Where are we? Where to next?
4
RESEARCH OVERVIEW
In the context of processor chip design trends
Single Core
Multi Core
Scaling the bandwidth wall: challenges in and avenues for CMP scaling
Brian Rogers, Anil Krishna, Gordon Bell, Ken Vu, Xiaowei Jiang, Yan Solihin
International Symposium on Computer Architecture, ISCA 2009
core
cache
Motivation
o Off-chip bandwidth is pin limited, pins are area limited, area not growing
Problem Statement
o To what extent does the bandwidth wall restrict future multi-core scaling?
o To what extent can bandwidth conservation techniques help?
Contributions and Findings
o Developed simple but effective analytical performance model
o Core to cache ratio changes from 50:50 to 10:90 in 4 generations
o Core scaling is only 3x vs. 16x in 4 generations
o Different bandwidth conservation techniques have different benefits
o Combining techniques can delay this problem significantly
  o 3D-stacked DRAM caches + link and cache compression gives >16x scaling
5
RESEARCH OVERVIEW
In the context of processor chip design trends
Single Core
core
Multi Core
Data sharing in multi-threaded applications and its impact on chip design
Anil Krishna, Ahmad Samih, Yan Solihin
Intl. Symp. on Performance Analysis of Systems and Software, ISPASS 2012
cache
Motivation
o Parallel applications moving from SMP to a single chip, but no change in chip design
o No analytical models exist that can capture the effect of data sharing
Problem Statement
o What is the right way to quantify the impact of data sharing on miss rates?
o How can this be incorporated into an analytical performance model?
o Does data sharing impact optimal on-chip core vs. cache ratios?
Contributions and Findings
o Developed novel approach to quantifying the true impact of data sharing
o Developed analytical performance model that incorporates data sharing
o Showed that core area increases 33% to 49%; throughput increases 58%
o Presence of data sharing encourages larger cores over smaller ones
6
RESEARCH OVERVIEW
In the context of processor chip design trends
Homogeneous
Multi Core
Single Core
Hybrid
Multi Core
core
cache
Hardware acceleration in the IBM PowerEN processor: architecture and performance
Anil Krishna, Timothy Heil, Nicholas Lindberg, Farnaz Toussi, Steven VanderWiel
International conference on Parallel Architectures and Compilation Techniques, PACT 2012
Motivation
o Understand driving forces, architectural tradeoffs and performance advantages of hardware accelerators via a detailed case study
Problem Statement
o How were the hardware accelerators in IBM’s PowerEN selected and designed? How well do they perform?
o How did the presence of hardware accelerators impact the architecture of the rest of the chip?
Contributions and Findings
o Analyzed design and performance of each hardware accelerator in PowerEN (Crypto, XML, Compression, RegX, HEA) in detail
o Identified tradeoffs in what to accelerate (vs. execute on general purpose core) and when to accelerate (large vs. small packets)
o Found that reducing communication overhead and easing programmability requires supporting many new features
o shared memory model between cores and accelerators, direct cache injection of data from accelerators, ISA extensions
7
RESEARCH OVERVIEW
In the context of processor chip design trends
Homogeneous
Multi Core
Single Core
Hybrid
Multi Core
Heterogeneous
Multi Core
core
cache
Large design space
Large configuration space
o How many cores/cores-types?
o What cache hierarchy?
o Heterogeneity in caches too?
o How to schedule applications?
o What DVFS settings to use?
o What cores and caches to power-gate?
ReSHAPE: Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Anil Krishna, Ahmad Samih, Yan Solihin
being submitted to Intl. Symposium on High Performance Computer Architecture, HPCA 2013
8
SUMMARY – Motivation
Design and configuration space explosion with multi-core chips
o As the number and types of cores grow, more designs need to be evaluated
o n! static schedules for a single design with n core types
o Very large configuration space with per-core DVFS even in a single design with a single core type
Detailed simulation too slow
o Be it trace or execution driven, be it cycle-by-cycle simulation or discrete-event simulation
Analytical models fast, but existing models lacking
o Too abstract and lacking sufficient fidelity
o Not flexible enough to handle shared caches, heterogeneity across cores, multi-program mixes.
9
SUMMARY – Problem, Contribution
Problem: Need a tool for early design space exploration
o Fast: At least 1000x faster than detailed simulation
o Accurate: < 20% error in performance projection
o Flexible: Able to model shared cache hierarchies, shared memory bandwidth, heterogeneity across cores and caches on chip and multi-programmed workload mixes
Contribution: ReSHAPE (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator)
o Hybrid tool: detailed simulation for key statistics + analytical model + iterative solver
o Flexible
o Typically runs in under a second (10,000x faster than detailed simulation)
o Accuracy is promising – IPC error < 5% and cache miss rate error <15% (validated up to 4 cores)
10
ReSHAPE – Inputs and Outputs
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Inputs
App-Core pair profile
o Base IPC
o Cache accesses per Inst.
o Hit Rate Profiles
[Figure: each application profiled on each core type (Core 0, Core 1) with its L1I and L1D and an ∞ L2]
Chip Configuration
o core counts
o core types
o Frequencies
o Cache hierarchy
o memory bandwidth
o application schedule
[Figure: example chips built from cores C0, C1, C2 with L1I/L1D caches and L2, L3 and L4 levels]
ReSHAPE: Iterative solver of an underlying analytical model
Output
Throughput (Instructions per Second)
11
ReSHAPE – The Analytical Component
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
App-Core pair profile (one per App-Core pair)
o Base IPC
o Cache accesses per Inst.
o Hit Rate Profiles
Chip Configuration
o core counts
o core types
o Frequencies
o Cache hierarchy (sizes, latencies)
o memory bandwidth
o application schedule
12
ReSHAPE – The Analytical Component
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Inputs: App-Core pair profile (Base IPC, Cache accesses per Inst., Hit Rate Profiles); Chip Configuration (core counts, core types, Frequencies, Cache hierarchy (sizes, latencies), memory bandwidth, application schedule)
[Figure: Core 0 with L1I, L1D and L2]
Throughput: instructions (i) per second (s)
$$\frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat}$$
where $\frac{s_{base}}{i}$ is the same as $\frac{1}{baseIPC \times frequency}$
13
ReSHAPE – The Analytical Component
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Inputs: App-Core pair profile (Base IPC, Cache accesses per Inst., Hit Rate Profiles); Chip Configuration (core counts, core types, Frequencies, Cache hierarchy (sizes, latencies), memory bandwidth, application schedule)
[Figure: Core 0 with L1I, L1D, L2 and L3]
Throughput: instructions (i) per second (s)
$$\frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat} + \frac{L3_{acc}}{i} \times L3_{lat}$$
where $\frac{L3_{acc}}{i}$ is the same as $\frac{L2_{acc}}{i} \times L2_{missrate}$
14
ReSHAPE – The Analytical Component
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Inputs: App-Core pair profile (Base IPC, Cache accesses per Inst., Hit Rate Profiles); Chip Configuration (core counts, core types, Frequencies, Cache hierarchy (sizes, latencies), memory bandwidth, application schedule)
[Figure: Core 0 with L1I, L1D, L2 and L3]
Throughput: instructions (i) per second (s)
$$\frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat} + \frac{L3_{acc}}{i} \times L3_{lat} + \frac{Mem_{acc}}{i} \times Mem_{lat}$$
where $Mem_{lat} = Mem_{penalty} + Mem_{queue}$
15
ReSHAPE – The Analytical Component
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Inputs: App-Core pair profile (Base IPC, Cache accesses per Inst., Hit Rate Profiles); Chip Configuration (core counts, core types, Frequencies, Cache hierarchy (sizes, latencies), memory bandwidth, application schedule)
[Figure: Core 0 with L1I, L1D, L2 and L3]
Throughput: instructions (i) per second (s)
$$\frac{s}{i} = \frac{s_{base}}{i} + \frac{L2_{acc}}{i} \times L2_{lat} + \frac{L3_{acc}}{i} \times L3_{lat} + \frac{Mem_{acc}}{i} \times \left(Mem_{penalty} + Mem_{queue}\right)$$
Memory is modeled as an M/D/1 system
$$Mem_{queue} = \frac{\lambda}{2\mu(\mu - \lambda)}$$
$$\lambda = \text{request rate} = \frac{Mem_{acc}}{i} \times \frac{i}{s}$$
$$\mu = \text{service rate} = \frac{bytes}{s} \times \frac{1}{MemAccSzInBytes}$$
16
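The per-instruction time equation and the M/D/1 memory queue from the preceding slides can be sketched in a few lines of Python. This is an illustrative sketch only, not ReSHAPE's code: the function names, the 64-byte access size default, and the step Mem_acc = L3_acc × L3 miss rate (by analogy with the L3 term on the slides) are assumptions, and all latencies are taken to be in seconds.

```python
def mem_queue_delay(requests_per_sec, bandwidth_bytes_per_sec, access_bytes=64):
    """Mean M/D/1 waiting time in seconds: lambda / (2 * mu * (mu - lambda))."""
    mu = bandwidth_bytes_per_sec / access_bytes  # service rate, requests/s
    lam = requests_per_sec                       # request rate, requests/s
    if lam >= mu:
        raise ValueError("request rate must stay below the service rate")
    return lam / (2.0 * mu * (mu - lam))


def seconds_per_instruction(base_ipc, freq_hz, l2_acc, l2_lat, l2_miss_rate,
                            l3_lat, l3_miss_rate, mem_penalty, mem_queue):
    """Sum the per-instruction time terms of the model (latencies in seconds)."""
    s_base = 1.0 / (base_ipc * freq_hz)  # s_base / i
    l3_acc = l2_acc * l2_miss_rate       # L3 accesses per instruction
    mem_acc = l3_acc * l3_miss_rate      # assumption: by analogy with the L3 term
    return (s_base + l2_acc * l2_lat + l3_acc * l3_lat
            + mem_acc * (mem_penalty + mem_queue))


# Hypothetical numbers, chosen only to exercise the formulas.
spi = seconds_per_instruction(base_ipc=1.0, freq_hz=1e9, l2_acc=0.1,
                              l2_lat=10e-9, l2_miss_rate=0.5, l3_lat=30e-9,
                              l3_miss_rate=0.2, mem_penalty=100e-9,
                              mem_queue=0.0)
print(f"{1.0 / spi:.3e} instructions per second")
```

The queueing term depends on the request rate, which in turn depends on the throughput being computed, which is why the model needs an iterative solver.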
ReSHAPE’s Novelty
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Novelty 1
o Separate chip into vertical silos
[Figure: Core 0 and Core 1, each with L1I, L1D and L2, over a shared L3 that ReSHAPE’s partition optimizer divides between the cores]
17
ReSHAPE’s Novelty
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
Novelty 1
o Separate chip into vertical silos
Novelty 2
o Use newly computed IPC as baseIPC
o Re-evaluate traffic and partitions
o Iterate until convergence (IPC change <1%)
[Figure: Core 0 and Core 1 silos (L1I, L1D, L2, L3 partition), with the L3 partitions re-evaluated across iterations]
After convergence
o Use final IPCs to compute throughput
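A minimal sketch of the iterate-until-convergence idea, under strong simplifying assumptions: a single core, cache partitions held fixed, and only the memory queueing delay re-evaluated each round. The function name, parameters, and the clamp that keeps the request rate below saturation are hypothetical and not taken from ReSHAPE.

```python
def solve_throughput(s_no_queue, mem_acc_per_inst, bandwidth_bytes_per_sec,
                     access_bytes=64, tol=0.01, max_iters=100):
    """Fixed-point iteration on throughput (instructions per second).

    s_no_queue: seconds per instruction with zero memory queueing delay.
    Stops when throughput changes by less than `tol` (1% by default).
    """
    mu = bandwidth_bytes_per_sec / access_bytes   # service rate, requests/s
    ips = 1.0 / s_no_queue                        # optimistic starting point
    for _ in range(max_iters):
        # Request rate implied by the current throughput estimate,
        # clamped just below saturation (an assumption of this sketch).
        lam = min(mem_acc_per_inst * ips, 0.99 * mu)
        queue = lam / (2.0 * mu * (mu - lam))     # M/D/1 waiting time
        new_ips = 1.0 / (s_no_queue + mem_acc_per_inst * queue)
        if abs(new_ips - ips) / ips < tol:
            return new_ips
        ips = new_ips
    return ips


ips = solve_throughput(s_no_queue=4.5e-9, mem_acc_per_inst=0.01,
                       bandwidth_bytes_per_sec=6.4e9)
print(f"{ips:.4e} instructions per second")
```

In the full tool, each iteration would also re-run the cache partitioning step, since a change in IPC changes every sharer's access rate.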
18
ReSHAPE’s Cache partitioning strategy
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
[Figure: sharers of an L3 cache, each with a "Hits per sec" vs. "Cache size" curve; how should the L3 be partitioned?]
Greedy Approach
o O(n.k) for n cache slices and k sharers
o May be sub-optimal, but does quite well in practice
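The slides do not spell out the greedy procedure, so the following is one common greedy scheme consistent with the stated O(n·k) cost: hand out the n cache slices one at a time, each to the sharer whose hits per second would rise the most. The curve representation (`hit_curves[s][m]`) and the function name are assumptions made for illustration.

```python
def greedy_partition(hit_curves, n_slices):
    """Assign n_slices cache slices among the sharers, one slice at a time.

    hit_curves[s][m] is the hits per second of sharer s when it owns m slices,
    so every curve needs n_slices + 1 entries. Each of the n steps scans the
    k sharers once, giving O(n * k) work overall.
    """
    alloc = [0] * len(hit_curves)
    for _ in range(n_slices):
        # Marginal gain of one more slice for each sharer.
        gains = [curve[a + 1] - curve[a] for curve, a in zip(hit_curves, alloc)]
        best = max(range(len(gains)), key=gains.__getitem__)
        alloc[best] += 1
    return alloc


# Hypothetical curves: sharer 0 saturates quickly, sharer 1 scales linearly.
curves = [[0, 10, 15, 17, 18],
          [0, 4, 8, 12, 16]]
print(greedy_partition(curves, 4))
```

A greedy pass like this can miss the best split when a curve is not concave (a large gain hidden behind a flat region), which matches the slide's caveat that the approach may be sub-optimal.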
19
ReSHAPE’s Cache partitioning strategy
Resource Sharing and Heterogeneity-aware Analytical Performance Estimator
[Figure: sharers of an L3 cache, each with a "Hits per sec" vs. "Cache size" curve; how should the L3 be partitioned?]
Minimize Misses Strategy
o O(log2n. 2k) for n cache slices and k sharers
o May be too slow for large k
o We use this strategy for all evaluations presented here
20
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Step 1: Analyze benchmark applications
21
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Step 1: Analyze benchmark applications
Step 2: Construct workload mixes
[Table, 2 core: 12 mixes (m00 to m11); applications in extraction order]
xalan, xalan, omnetpp, povray, mcf, milc, omnetpp, leslie3d, xalan, tonto, milc, lib, namd, xalan, lib, povray, namd, milc, tonto, omnetpp, mcf, namd, tonto, mcf
[Table, 4 core: 12 mixes (m00 to m11); applications in extraction order]
povray, povray, mcf, omnetpp, omnetpp, omnetpp, mcf, omnetpp, mcf, povray, mcf, mcf, povray, tonto, tonto, xalan, leslie3d, leslie3d, lib, mcf, lib, namd, milc, xalan, tonto, tonto, namd, leslie3d, leslie3d, xalan, milc, milc, lib, leslie3d, tonto, leslie3d, namd, xalan, namd, povray, xalan, lib, povray, lib, milc, xalan, namd, lib
[Table, 9 core: 7 mixes (m00 to m06); applications in extraction order, in two groups]
astar, gromacs, omnetpp, omnetpp, mcf, bzip, mcf, leslie3d, lib, hmmer, lib, lbm, lbm, lbm, xalan, milc, soplex, milc, sphinx, sphinx, sphinx, omnetpp, mcf, bzip, mcf, gems, gems, gems
povray, deal2, perl, povray, perl, leslie3d, hmmer, tonto, games, calculix, tonto, calculix, xalan, soplex, namd, astar, gromacs, namd, gromacs, omnetpp, bzip, deal2, perl, leslie3d, leslie3d, lib, hmmer, lib, games, calculix, xalan, xalan, milc, soplex, milc
22
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Step 1: Analyze benchmark applications
Step 2: Construct workload mixes
Step 3: Construct configurations to be validated
[Figure: configurations to be validated. One to four cores, each with 32K L1I and 32K L1D; caches of 128KB, 256KB, 512KB, 1MB and 2MB; memory bandwidth of 10Gb/s, 1Gb/s, 100Mb/s and 10Mb/s]
23
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Step 1: Analyze benchmark applications
Step 2: Construct workload mixes
Step 3: Construct configurations to be validated
Step 4: Set up identical configurations in SIMICS and ReSHAPE
Each mix is checkpointed (under SIMICS) after running for 100 Billion instructions per application
At least 1 Billion instructions beyond this are used for validation run
Step 5: Compare projections from SIMICS and ReSHAPE
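The error figures quoted on the validation slides are averages with a standard deviation. A plausible way to compute them is the mean and spread of the per-data-point absolute relative error, sketched below. Whether the original analysis used exactly this definition, or a population rather than sample standard deviation, is not stated in the slides, so both are assumptions here, as is the function name.

```python
from statistics import mean, pstdev


def projection_error(simics, reshape):
    """Mean and population std. dev. of |ReSHAPE - Simics| / Simics."""
    errors = [abs(r - s) / s for s, r in zip(simics, reshape)]
    return mean(errors), pstdev(errors)


# Hypothetical IPC values for two applications, for illustration only.
avg, spread = projection_error(simics=[0.50, 0.40], reshape=[0.51, 0.38])
print(f"average error {avg:.1%} (std. dev. = {spread:.1%})")
```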
24
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 1 core, 32K L1I and 32K L1D, 256KB cache, 10Gb/s memory bandwidth
Average 1-core IPC Error : 1.5% (std. dev. = 1.4%)
[Figure: IPC (Simics) and IPC (ReSHAPE) bar charts for astar, games, lbm, omnetpp, tonto, bwaves, gcc, leslie3d, perl, xalan, bzip, gems, lib, povray, zeusmp, cactus, gromacs, mcf, sjeng, calculix, h264, milc, soplex, deal2, hmmer, namd and sphinx; IPC Comparison scatter plot of ReSHAPE against Simics, Observed vs. Ideal, both axes 0.0 to 1.0]
25
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 2 cores (C0, C1), 32K L1I and 32K L1D per core, 1MB shared cache, 10Gb/s memory bandwidth
Average 2-core IPC Error: 2.7% (std. dev. = 2.1%)
[Figure: IPC (Simics) and IPC (ReSHAPE) per core (C0, C1) for mixes m00 to m11; IPC Comparison scatter plot of ReSHAPE against Simics with the Ideal line, both axes 0.0 to 0.6]
26
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 2 cores (C0, C1), 32K L1I and 32K L1D per core, 1MB shared cache, 10Gb/s memory bandwidth
Average miss rate projection error: 13.4% (std. dev. = 12.6%)
[Figure: Misses Per Access (Simics) and Misses Per Access (ReSHAPE) per core (C0, C1); Miss Rate Comparison scatter plot of ReSHAPE against Simics with the Ideal line, log axes 0.0001 to 1]
27
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 2 cores (C0, C1), 32K L1I and 32K L1D per core, 1MB shared cache, 10Gb/s memory bandwidth
Average partition size projection error: 3.7% (std. dev. = 4.5%)
[Figure: Partitions (Simics) and Partitions (ReSHAPE) per core (C0, C1) for mixes m00 to m11; Partition Comparison scatter plot of ReSHAPE against Simics with the Ideal line]
28
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 4 cores (C0 to C3), 32K L1I and 32K L1D per core, 2MB shared cache, 10Gb/s memory bandwidth
Average 4-core IPC Error: 2.5% (std. dev. = 1.8%)
[Figure: IPC (Simics) and IPC (ReSHAPE) per core (C0 to C3) for mixes m00 to m11; IPC Comparison scatter plot of ReSHAPE against Simics with the Ideal line, both axes 0.0 to 0.6]
29
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 4 cores (C0 to C3), 32K L1I and 32K L1D per core, 2MB shared cache, 10Gb/s memory bandwidth
Average miss rate projection error: 12.8% (std. dev. = 13.1%)
[Figure: Misses Per Access (Simics) and Misses Per Access (ReSHAPE) per core (C0 to C3); Miss Rate Comparison scatter plot of ReSHAPE against Simics with the Ideal line, log axes 0.0001 to 1]
30
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 4 cores (C0 to C3), 32K L1I and 32K L1D per core, 2MB shared cache, 10Gb/s memory bandwidth
Average partition size projection error: 20.9% (std. dev. = 12.8%)
[Figure: Partitions (Simics) and Partitions (ReSHAPE) per core (C0 to C3) for mixes m00 to m11; Partition Comparison scatter plot of ReSHAPE against Simics with the Ideal line]
31
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 4 cores (C0 to C3), 32K L1I and 32K L1D per core, 2MB shared cache; memory bandwidth of 10Gb/s, 1Gb/s, 0.1Gb/s and 0.01Gb/s
Average IPC Error: 17.3% (std. dev. = 5.4%)
[Figure: four IPC Comparison scatter plots of ReSHAPE against Simics with the Ideal line, titled 10 GBps, 1 GBps, 0.1 GBps and 0.01 GBps; log axes 0.001 to 1.000]
32
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 4 cores (C0 to C3), 32K L1I and 32K L1D per core, private caches of 128KB, 128KB, 2MB and 2MB, 10Gb/s memory bandwidth
Private Caches: Average 4-core IPC Error: 3.1% (std. dev. = 1.6%)
[Figure: IPC (Simics) and IPC (ReSHAPE) per core (C0 to C3); IPC Comparison scatter plot of ReSHAPE against Simics with the Ideal line, both axes 0.0 to 0.6]
33
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 4 cores (C0 to C3), 32K L1I and 32K L1D per core, private caches of 128KB, 128KB, 2MB and 2MB, 10Gb/s memory bandwidth
Average miss rate projection error: 7.5% (std. dev. = 7.1%)
[Figure: Misses Per Access (Simics) and Misses Per Access (ReSHAPE) per core (C0 to C3); Miss Rate Comparison scatter plot of ReSHAPE against Simics with the Ideal line, log axes 0.0001 to 1]
34
USE CASES
Putting ReSHAPE to use
Homogeneous
Heterogeneous Caches
Heterogeneous Cores
Heterogeneous Both
Does increasing the sources of heterogeneity buy us performance?
35
USE CASES
Putting ReSHAPE to use
[Figure: applications App0 to App3 (A, B, C, D) and the table of their 24 possible orderings, from A B C D to D C B A; chart legend: Max, Min, Mean]
Up to 4! unique schedules for a 4-application workload mix
What one might expect to see
o Small improvement with heterogeneous caches. Some loss for bad schedules
o Larger improvement with heterogeneous cores
o Even larger improvement with heterogeneous cores + heterogeneous caches
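The schedule space and the Max/Min/Mean summary can be illustrated with a short sketch. The weighted speedup definition used here (sum over applications of IPC when co-scheduled divided by IPC when running alone) is the commonly used one and is an assumption, since the slides do not define the metric; `schedule_stats` and its `score` callback are hypothetical names.

```python
from itertools import permutations
from statistics import mean


def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC when co-scheduled / IPC when run alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def schedule_stats(apps, score):
    """Score every ordering of apps onto cores; return (max, min, mean, count).

    `score` maps one ordering (a tuple, position = core index) to a number,
    for example a weighted speedup predicted by a performance model.
    """
    scores = [score(order) for order in permutations(apps)]
    return max(scores), min(scores), mean(scores), len(scores)


# Placeholder scorer: every schedule scores 1.0, so only the count is informative.
print(schedule_stats("ABCD", lambda order: 1.0))
```

With four distinct applications on four distinct cores this enumerates 4! = 24 schedules; the count shrinks when applications or core types repeat, which is why the slide says "up to".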
[Figure: weighted speedup, normalized to the Homogeneous design, for the Homogeneous, Het. Cache, Het. Core and Het. Both designs (cores C0 to C3)]
Does increasing the sources of heterogeneity buy us performance?
36
USE CASES
Putting ReSHAPE to use
o Smaller cores hurting more than the larger cores helping
o Heterogeneous caches better than heterogeneous cores in this case
Homogeneous
Heterogeneous Caches
Heterogeneous Cores
Heterogeneous Both
Does increasing the sources of heterogeneity buy us performance?
37
USE CASES
Putting ReSHAPE to use
o As core count scales (4->9) benefit of heterogeneity increases significantly
o Heterogeneous cores better than heterogeneous caches in this case; but schedule still crucial
[Callouts: > 350,000 ReSHAPE sims; chart represents 10 million ReSHAPE sims]
Heterogeneous Caches
Heterogeneous Cores
Heterogeneous Both
9-core designs
Homogeneous
38
USE CASES
Putting ReSHAPE to use
9-core designs with 3 core/cache types
o 3 core types and 3 cache sizes do not buy any more performance
o How much and what form of heterogeneity needs careful analysis depending on the design being evaluated
Homogeneous
Heterogeneous Caches
Heterogeneous Cores
Heterogeneous Both
39
USE CASES
Putting ReSHAPE to use
o Different settings for different workload mixes; and not always the fastest setting!
o Not always the slowest setting when optimizing performance/watt
o Somewhere in between when optimizing Energy x Delay product
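The search over per-core DVFS settings can be sketched as an exhaustive enumeration. The three (frequency, power) settings below come from the slide's legend; everything else is an assumption for illustration: raw throughput stands in for weighted speedup, `toy_perf` is a made-up performance model with a fixed memory time per instruction, and 1/(Energy × Delay) is taken as proportional to perf² / power, which follows for a fixed amount of work W from delay = W / perf and energy = power × delay.

```python
from itertools import product

# Setting id -> (frequency in Hz, power in W), taken from the slide's legend.
SETTINGS = {1: (250e6, 0.5), 2: (1e9, 2.0), 3: (4e9, 16.0)}


def best_settings(n_cores, perf_fn, metric):
    """Try every per-core setting combination and return the best one."""
    def score(combo):
        perf = perf_fn(combo)                         # instructions per second
        power = sum(SETTINGS[s][1] for s in combo)    # total watts
        if metric == "perf":
            return perf
        if metric == "perf_per_watt":
            return perf / power
        if metric == "inv_energy_delay":
            return perf ** 2 / power                  # proportional to 1/(E*D)
        raise ValueError(f"unknown metric: {metric}")
    return max(product(SETTINGS, repeat=n_cores), key=score)


def toy_perf(combo, mem_time=1e-9):
    """Hypothetical model: one cycle plus a fixed memory time per instruction."""
    return sum(1.0 / (1.0 / SETTINGS[s][0] + mem_time) for s in combo)


for metric in ("perf", "perf_per_watt", "inv_energy_delay"):
    print(metric, best_settings(4, toy_perf, metric))
```

With 3 settings and 4 cores this evaluates 3^4 = 81 combinations per workload mix, which is only practical because each evaluation is an analytical estimate rather than a detailed simulation.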
Configuration: 4 cores (c0 to c3), 32K L1I and 32K L1D per core, 2MB cache, 10Gb/s memory bandwidth
Per-core settings: 250MHz at 0.5W, 1GHz at 2W, 4GHz at 16W
Per-core setting chosen for each workload mix, by optimization target:

       Weighted Speedup    Perf/Watt           1/(Energy*Delay)
Mix    c0  c1  c2  c3      c0  c1  c2  c3      c0  c1  c2  c3
m00    3   3   1   1       1   1   1   1       3   3   1   1
m01    3   3   3   1       1   1   1   1       3   3   3   1
m02    1   3   3   3       1   1   1   1       1   3   3   3
m03    1   1   1   3       1   1   1   2       1   1   1   3
m04    1   3   3   3       1   1   1   1       1   1   1   1
m05    1   3   3   1       1   1   1   1       1   1   1   1
m06    1   1   1   3       1   1   1   2       1   1   1   3
m07    1   3   1   1       1   1   1   1       1   1   1   1
m08    3   1   1   1       1   1   1   1       1   1   1   1
m09    3   3   1   1       2   1   1   1       3   3   1   1
m10    1   1   3   3       1   1   2   1       1   1   3   3
m11    1   3   3   1       1   1   1   1       1   1   1   1

Legend
1 250MHz, 0.5W
2 1GHz, 2W
3 4GHz, 16W
40
CONCLUSIONS + FUTURE DIRECTION
Rich design/configuration space for multi-core chips
Analytical modeling can be a promising approach to tackling these large search spaces
ReSHAPE extends this classical analytical performance model in novel ways
Accuracy + speed make ReSHAPE a useful tool for early exploration
Future direction – extend ReSHAPE
Validate across unique microarchitectures
Extend key parameters and model - memory level parallelism, writeback traffic, prefetching
Explore the rich constrained optimization problem of cache partitioning
Evaluate more use cases
o best power-gating strategy based on workload mix
o dynamic schedules based on per-phase application statistics
41
Thank you!
42
RELATED WORK
Analytical Modeling of multi-core chips
Wentzlaff et al. (MIT Tech Report 2010), Li et al. (ISPASS 2005), Yavits et al. (CAL 2013) all tackle
different aspects of multicore chip design, but only consider homogeneous cores.
Wu et al. (ISCA 2013) use locality profiles to identify how the application’s cache locality degrades
as the application is spread across more threads – they consider multi-threaded applications.
Several works related to heterogeneous design/scheduling
Navada et al. (PACT 2010, PACT 2013) consider simulation based, criticality driven, design space
exploration and mechanisms for selecting the best way to schedule a single application across
multiple cores.
Kumar et al. (Micro 2003, PACT 2006, ISCA 2004) did most of the seminal work in the area of
heterogeneous multi-core. However, they have typically relied on detailed simulations, private
cache hierarchies and single application scheduling.
43
VALIDATION
Comparing ReSHAPE’s projections against SIMICS full system simulator
Configuration: 1 core, 32K L1I and 32K L1D, 256KB cache, 10Gb/s memory bandwidth
Average miss rate projection error: 7.6% (std. dev. = 12.4%)
[Figure: Misses Per Access (Simics) and Misses Per Access (ReSHAPE) bar charts for astar, games, lbm, omnetpp, tonto, bwaves, gcc, leslie3d, perl, xalan, bzip, gems, lib, povray, zeusmp, cactus, gromacs, mcf, sjeng, calculix, h264, milc, soplex, deal2, hmmer, namd and sphinx; Miss Rate Error scatter plot of ReSHAPE against Simics, Observed vs. Ideal, both axes 0.0 to 1.0]
44