talk - CompArch

Transcript: TLP-Aware Cache Management Policy (HPCA-18)

Hyesoon Kim
2/25
| Introduction
| Background
| TAP (TLP-Aware Cache Management Policy)
  Core sampling
  Cache block lifetime normalization
  TAP-UCP and TAP-RRIP
| Evaluation Methodology
| Evaluation Results
| Conclusion
3/25
| Combining GPU cores with conventional CMPs is a trend.
  Intel's Sandy Bridge
  AMD's Fusion
  NVIDIA's Project Denver
| Various resources are shared between CPU and GPU cores.
  LLC, on-chip interconnect, memory controller, and DRAM
| The shared cache is one of the most important shared resources.
4/25
| Many researchers have proposed various cache mechanisms.
  Dynamic cache partitioning: Suh+[HPCA'02], Kim+[PACT'04], Qureshi+[MICRO'06]
  Dynamic cache insertion policies: Qureshi+[ISCA'07], Jaleel+[PACT'08, ISCA'10], Wu+[MICRO'11, MICRO'11]
  Many other mechanisms
| All of these mechanisms target CMPs.
| They may not be directly applicable to CPU-GPU heterogeneous architectures because CPU and GPU cores have different characteristics.
5/25
| SIMD, massive threading, lack of speculative execution, …
| GPU cores have an order of magnitude more threads.
  CPU: 1-4 way SMT
  GPU: 10s of active threads in a core
| GPU cores have higher TLP (Thread-Level Parallelism) than CPU cores.
| TLP has a significant impact on how caching affects application performance.
6/25
[Figure: MPKI and CPI as a function of cache size for each application type.]
| With low TLP: compute-intensive, and cache-friendly or thrashing application types.
| With high TLP: the TLP-dominant type, whose MPKI decreases with cache size while CPI barely changes. This type is hardly found in CPU applications.
7/25
[Figure: Cache-friendly and TLP-dominant applications show identical MPKI curves but different CPI curves as cache size increases.]
| Cache-oriented metrics cannot differentiate the two types.
  They are unable to recognize the effect of TLP.
| We need to directly monitor the performance effect of caching.
8/25
| Core sampling: samples GPU cores with different cache policies.
[Figure: CPU and GPU cores, each with a private L1, share the last-level cache and DRAM. One sampled GPU core (POL1) bypasses the LLC (no L3); another sampled GPU core (POL2) uses MRU insertion in the LLC; the remaining GPU cores are followers.]
9/25
| Core sampling measures the performance difference between the sampled cores.
[Figure: The core sampling controller collects performance samples IPC1 (from the POL1 core, which bypasses the LLC) and IPC2 (from the POL2 core, which uses MRU insertion in the LLC), calculates ∆(IPC1, IPC2), and makes a decision: if ∆ > threshold, the application is cache-friendly (caching improves performance); otherwise it is not cache-friendly (caching does not affect performance). The follower cores follow the decision.]
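A minimal sketch of this decision logic in C++; all names are illustrative assumptions, since the talk does not specify an interface:

```cpp
// Sketch of the core sampling decision; names are illustrative,
// not taken from the talk or the paper.
struct CoreSamplingController {
    double threshold;  // relative IPC delta that counts as "cache-friendly"

    // ipc_pol1: IPC of the sampled core bypassing the LLC (POL1)
    // ipc_pol2: IPC of the sampled core using MRU insertion (POL2)
    bool is_cache_friendly(double ipc_pol1, double ipc_pol2) const {
        // Measure the benefit of caching relative to the bypassing core:
        // a large positive delta means the LLC improves performance.
        double delta = (ipc_pol2 - ipc_pol1) / ipc_pol1;
        return delta > threshold;
    }
};
```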
10/25
[Figure: Core sampling applied to both types. For a cache-friendly application, the MRU-insertion core (POL2) clearly outperforms the LLC-bypassing core (POL1), so ∆ > threshold: cache-friendly. For a TLP-dominant application, the two sampled cores perform similarly, so ∆ < threshold: not cache-friendly.]
11/25
| Give cores different LLC policies to identify the effect of the last-level cache.
| Main goal: finding cache-friendly GPGPU applications
| Why core sampling is viable
  SPMD (Single Program, Multiple Data) model
    Each GPU core runs the same program.
    GPGPU applications usually behave symmetrically across their GPU cores.
    Performance variance between GPU cores is very small.
12/25
| GPU cores have higher TLP (Thread-Level Parallelism) than CPU cores.
| GPU cores generate an order of magnitude more cache accesses.
| GPUs have a higher tolerance for cache misses due to TLP.
  They generate cache accesses from different threads without stalling.
| SIMD execution: one SIMD instruction can generate multiple memory requests.
13/25
[Figure: Requests per 1000 cycles. A CPU thread stalls on a cache miss, issuing fewer cache accesses (< 100 requests per 1000 cycles, CPU, 1 core). GPU threads keep running across cache misses with no stalls, issuing many more cache accesses (> 500 requests per 1000 cycles, GPU, 6 cores).]
14/25
| Why are much more frequent accesses from GPGPU applications problematic?
  Severe interference by GPGPU applications (e.g., under the base LRU replacement policy)
  The performance impact of a cache hit differs between applications:
  Perf. Penalty_CPU(cache miss) >=? Perf. Penalty_GPU(cache miss)
| We have to account for the different degrees of cache accesses.
| We propose Cache Block Lifetime Normalization.
15/25
| Simple monitoring mechanism
  Monitor the difference in cache access rates between CPU and GPGPU applications, and periodically calculate the ratio from the CPU and GPU cache access counters:
  r = GPU_counter / CPU_counter
  if r > threshold: XSRATIO = r
  otherwise: XSRATIO = 1
| XSRATIO provides the proposed TAP mechanisms with a hint about the access rate differences.
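A minimal sketch of this monitoring loop, assuming illustrative field names and a per-period update hook not given in the talk:

```cpp
#include <cstdint>

// Sketch of cache block lifetime normalization; the field names, the
// threshold, and the period interface are illustrative assumptions.
struct LifetimeNormalizer {
    uint64_t cpu_counter = 0;  // CPU cache-access counter for this period
    uint64_t gpu_counter = 0;  // GPU cache-access counter for this period
    uint64_t xsratio = 1;      // ratio published to the TAP policies

    // Called once per monitoring period.
    void end_period(uint64_t threshold) {
        if (cpu_counter > 0) {
            uint64_t r = gpu_counter / cpu_counter;  // integer ratio
            // Scale only when GPU accesses dominate; otherwise treat
            // CPU and GPU cache blocks equally (XSRATIO = 1).
            xsratio = (r > threshold) ? r : 1;
        }
        cpu_counter = gpu_counter = 0;  // reset for the next period
    }
};
```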
16/25
[Diagram: TAP combines two components: core sampling (to find cache-friendly applications) and cache block lifetime normalization (to consider different degrees of cache accesses). Applied to UCP (Utility-based Cache Partitioning), it yields TAP-UCP (covered in this talk); applied to RRIP (Re-Reference Interval Prediction), it yields TAP-RRIP (covered in the paper).]
17/25
| UCP [Qureshi and Patt, MICRO-2006]
[Diagram: In UCP, each application has an ATD (LRU stack) and per-way hit counters in the LLC; the partitioning algorithm reads the hit counters, computes the optimal partition, and assigns ways to the CPU and GPGPU applications.]
| TAP-UCP extends UCP with:
  UCP-Mask register: set to 1 by the core sampling controller if the GPGPU application is not cache-friendly.
  Cache block lifetime normalization: divide the GPU hit counters by the XSRATIO register value to balance cache space.
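A sketch of the two TAP-UCP adjustments applied before UCP's partitioning algorithm runs. Zeroing the counters to force a minimum allocation is this sketch's simplification of "assign only 1 way," not necessarily the paper's exact mechanism:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the TAP-UCP adjustments on top of UCP's per-way hit counters;
// the surrounding partitioning algorithm is UCP's and is omitted here.
void tap_ucp_adjust(std::vector<uint64_t>& gpu_hit_counters,
                    uint64_t xsratio, bool ucp_mask) {
    // Cache block lifetime normalization: divide the GPU hit counters by
    // XSRATIO so the GPU's far more frequent hits do not dominate.
    for (uint64_t& hits : gpu_hit_counters)
        hits /= xsratio;

    // Core sampling result: if the GPGPU application is not cache-friendly
    // (UCP-Mask == 1), remove its utility so the partitioning algorithm
    // leaves it only the minimum single way.
    if (ucp_mask)
        for (uint64_t& hits : gpu_hit_counters)
            hits = 0;
}
```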
18/25
TAP-UCP Case 1: Non-Cache-Friendly
[Diagram: UCP collects per-way hit counters from MRU to LRU position (CPU: 16 3 8 20 5 8 3 2; GPU: 32 6 16 40 10 16 6 4) and computes each application's marginal utility: how many more hits are expected if N ways are given to an application. Because the GPGPU application has far more hits, plain UCP assigns it more ways (1 CPU : 7 GPU). Core sampling reports ∆ < threshold (not cache-friendly): caching has little effect on GPGPU performance, so TAP-UCP assigns only 1 way to the GPGPU application, and the final partition is 7 CPU : 1 GPU.]
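A minimal sketch of the utility computation behind this example, under the usual stack-property assumption: giving an application N ways is expected to capture the hits recorded at the top N LRU-stack positions of its ATD, and the partition maximizes total expected hits. This is written as a simple exhaustive search over two applications (UCP's lookahead algorithm generalizes it), with illustrative names:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// hits[i] holds the hits seen at LRU-stack position i in the ATD, so an
// application given n ways is expected to capture hits[0] + ... + hits[n-1].
// The marginal utility of the n-th way is therefore hits[n-1].
static uint64_t utility(const std::vector<uint64_t>& hits, size_t ways) {
    uint64_t u = 0;
    for (size_t i = 0; i < ways && i < hits.size(); ++i)
        u += hits[i];
    return u;
}

// Returns {cpu_ways, gpu_ways} maximizing total expected hits, with each
// application guaranteed at least one way.
std::pair<int, int> partition(const std::vector<uint64_t>& cpu_hits,
                              const std::vector<uint64_t>& gpu_hits,
                              int total_ways) {
    int best_cpu = 1;
    uint64_t best_total = 0;
    for (int c = 1; c < total_ways; ++c) {
        uint64_t total = utility(cpu_hits, c)
                       + utility(gpu_hits, total_ways - c);
        if (total > best_total) { best_total = total; best_cpu = c; }
    }
    return {best_cpu, total_ways - best_cpu};
}
```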
19/25
TAP-UCP Case 2: Cache-Friendly
[Diagram: Same hit counters as Case 1, but core sampling reports ∆ > threshold (cache-friendly), so the GPGPU application competes for cache space. Its hit counters are still divided by XSRATIO = 2 (32 6 16 40 10 16 6 4 becomes 16 3 8 20 5 8 3 2) before the partitioning algorithm runs. Plain UCP would assign 1 CPU : 7 GPU; on the normalized counters, TAP-UCP's final partition is 4 CPU : 4 GPU, giving more ways to the CPU.]
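Putting the two sketches together with the hit counters from these example slides reproduces both partitions (this usage example assumes the hypothetical tap_ucp_adjust and partition functions defined above):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Reuses tap_ucp_adjust() and partition() from the sketches above.
int main() {
    // Per-way hit counters from the slides, MRU to LRU position.
    std::vector<uint64_t> cpu_hits = {16, 3, 8, 20, 5, 8, 3, 2};
    std::vector<uint64_t> gpu_hits = {32, 6, 16, 40, 10, 16, 6, 4};

    // On the raw counters, UCP favors the hit-heavy GPGPU application.
    auto [c0, g0] = partition(cpu_hits, gpu_hits, 8);
    std::printf("UCP:     %d CPU : %d GPU\n", c0, g0);  // 1 : 7

    // Case 2: cache-friendly (UCP-Mask == 0), XSRATIO = 2.
    tap_ucp_adjust(gpu_hits, /*xsratio=*/2, /*ucp_mask=*/false);
    auto [c1, g1] = partition(cpu_hits, gpu_hits, 8);
    std::printf("TAP-UCP: %d CPU : %d GPU\n", c1, g1);  // 4 : 4
    return 0;
}
```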
20/25
| Introduction
| Background
| TAP (TLP-Aware Cache Management Policy)
  Core sampling
  Cache block lifetime normalization
  TAP-UCP
| Evaluation Methodology
| Evaluation Results
| Conclusion
21/25
| MacSim simulator (http://code.google.com/p/macsim) [GT]
  Trace-driven timing simulator; x86 + PTX instructions
[Configuration:
  CPU: 1-4 cores, OOO, 4-wide, private L1/L2
  GPU: 6 cores, 16 SIMD width, private L1
  LLC: shared, 32-way, 8MB (base policy: LRU)
  DRAM: DDR3-1333, 41.6GB/s BW, FR-FCFS]
| Workloads
  CPU: SPEC 2006
  GPGPU: CUDA SDK, Parboil, Rodinia, ERCBench

  1-CPU (1 CPU + 1 GPU): 152 workloads
  2-CPU (2 CPUs + 1 GPU): 150 workloads
  4-CPU (4 CPUs + 1 GPU): 75 workloads
  Stream-CPU (Stream CPU + 1 GPU): 25 workloads
22/25
[Figure: Speedup over LRU for UCP vs. TAP-UCP (TAP-UCP: 11%) and for RRIP vs. TAP-RRIP (TAP-RRIP: 12%).]
| UCP is effective with thrashing applications, but less effective with cache-sensitive GPGPU applications.
| RRIP is generally less effective on heterogeneous workloads.
23/25
| Case study: Sphinx3 + Stencil (Stencil is TLP dominant)
[Figure: Normalized MPKI (CPU, GPU, overall) and speedup over LRU, previous mechanisms vs. TAP.]
| MPKI
  CPU: significant decrease
  GPGPU: considerable increase
  Overall MPKI: increased
| Performance
  CPU: huge improvement
  GPU: no change
  Overall: huge improvement
24/25
[Figure: Speedup over LRU for UCP, TAP-UCP, RRIP, and TAP-RRIP. TAP-UCP/TAP-RRIP improve from 11%/12% (1 CPU app + 1 GPGPU app) to 12.5%/14% (2 CPU apps + 1 GPGPU app) and 17.5%/24% (4 CPU apps + 1 GPGPU app).]
| TAP mechanisms show higher benefits with more CPU applications.
25/25
| CPU-GPU heterogeneous architecture is a popular trend, and its resource sharing problem is more significant.
| We propose TAP for CPU-GPU heterogeneous architectures, the first proposal to consider this resource sharing problem.
| We introduce a core sampling technique that samples GPU cores with different policies to identify cache-friendliness.
| The two TAP mechanisms improve system performance significantly:
  TAP-UCP: 11% over LRU and 5% over UCP
  TAP-RRIP: 12% over LRU and 9% over RRIP