Mascar: Speeding up GPU Warps by Reducing Memory Pitstops


Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Ankit Sethia*, D. Anoushe Jamshidi, Scott Mahlke
University of Michigan
GPU usage is expanding
Graphics, Linear Algebra, Data Analytics, Machine Learning, Simulation, Computer Vision
All kinds of applications, both compute and memory intensive, are targeting GPUs.
Performance variation of kernels
[Chart: % of peak IPC for compute intensive vs. memory intensive kernels]
Memory intensive kernels saturate bandwidth and achieve lower performance.
Impact of memory saturation - I
[Diagram: multiple SMs, each with FPUs, an LSU, and an L1 cache, sharing the memory system]
• Memory intensive kernels serialize memory requests
• It is critical to prioritize the order of requests from SMs
Impact of memory saturation
[Chart: fraction of peak IPC and fraction of cycles the LSU is stalled, for compute intensive and memory intensive kernels]
Significant stalls in the LSU correspond to low performance in memory intensive kernels.
Impact of memory saturation - II
[Diagram: warp W1's data is present in the L1 cache blocks, but warp W0's outstanding requests block the LSU]
• Data is present in the cache, but the LSU cannot access it
• The memory system is unable to feed enough data for processing
Increasing memory resources
[Chart: speedup of memory intensive kernels from large MSHRs + queues, full associativity, and a 20% frequency boost, individually and all combined]
A large number of MSHRs + full associativity + a 20% bandwidth boost is UNBUILDABLE.
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS)
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR)
MAS + CAR = Mascar
Memory Aware Scheduling
[Diagram: warps 0, 1, and 2 each issue memory requests until the memory saturates]
Serving one request and switching to another warp (RR):
• No warp is ready to make forward progress
Memory Aware Scheduling
GTO issues instructions from another warp whenever:
• There is no instruction in the i-buffer for that warp
• There is a dependency between instructions
GTO is therefore similar to RR, as multiple warps may issue memory requests.
Serving one request and switching to another warp (RR):
• No warp is ready to make forward progress
Serving all requests from one warp and then switching to another (MAS):
• One warp is ready to begin computation early
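To make the contrast concrete, here is a toy illustration with made-up numbers (three warps, two serialized requests each, a fixed memory latency); it is not from the paper, but it shows why grouping one warp's requests lets that warp start computing sooner:

```cpp
// Toy illustration (hypothetical latencies) of why MAS lets one warp
// start computing earlier than round-robin when requests serialize.
#include <cstdio>

int main() {
    const int warps = 3, reqs_per_warp = 2, latency = 4;  // made-up numbers
    // RR issues W0,W1,W2,W0,W1,W2: W0's last request goes out at slot 3.
    int rr_last_issue = (reqs_per_warp - 1) * warps;       // = 3
    // MAS issues W0,W0,W1,W1,W2,W2: W0's last request goes out at slot 1.
    int mas_last_issue = reqs_per_warp - 1;                // = 1
    printf("RR : W0 can compute at t=%d\n", rr_last_issue + latency);
    printf("MAS: W0 can compute at t=%d\n", mas_last_issue + latency);
}
```

Under RR every warp waits roughly the full serialization delay before any of them can compute; under MAS the owner warp's requests issue back to back, so its computation can overlap with the other warps' memory traffic.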
MAS operation
Each cycle, check whether the kernel is memory intensive (MSHRs or miss queue almost full):
• No: schedule in Equal Priority (EP) mode
• Yes: assign a new owner warp (only the owner's requests can go beyond the L1) and schedule in Memory Priority (MP) mode, where only the owner executes memory instructions and other warps can execute compute instructions
• If the owner's next instruction depends on an already issued load, assign a new owner warp; otherwise remain in MP mode
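A minimal sketch of this decision loop follows; all types, thresholds, and names are illustrative assumptions, not the paper's hardware:

```cpp
// Sketch of the MAS mode decision described above. All types and
// thresholds are illustrative assumptions, not the paper's RTL.
#include <cstdint>

enum class Mode { EqualPriority, MemoryPriority };

struct MemStatus {
    uint32_t mshrs_used = 0, mshrs_total = 64;
    uint32_t missq_used = 0, missq_total = 8;
    // "Memory intensive": MSHRs or the miss queue are almost full.
    bool almost_full() const {
        return mshrs_used + 1 >= mshrs_total ||
               missq_used + 1 >= missq_total;
    }
};

struct WarpState {
    int id = -1;
    bool next_depends_on_issued_load = false;  // from the scoreboard
};

// One scheduling decision per cycle. `next_memory_warp` is whichever
// warp the scheduler would hand ownership to next.
Mode decide(Mode mode, const MemStatus& mem,
            WarpState*& owner, WarpState* next_memory_warp) {
    if (!mem.almost_full())
        return Mode::EqualPriority;        // all warps scheduled equally
    if (mode == Mode::EqualPriority || owner == nullptr)
        owner = next_memory_warp;          // entering MP: pick an owner
    // Only the owner's requests may go past the L1. If its next
    // instruction waits on a load it already issued, hand ownership on
    // so another warp's requests can use the memory system.
    if (owner->next_depends_on_issued_load)
        owner = next_memory_warp;
    return Mode::MemoryPriority;
}
```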
Implementation of MAS
Decode
WST
I-Buffer
Ordered
Warps
Warp Id
WRC
Scoreboard
Stall
bit
..
.....
OPtype
From
RF
Scheduler
.
.
.
Issued
Warp
Mem_Q
Head
Comp_Q
Head
Memory saturation flag
• Divide warps as memory and compute warps in ordered warps
• Warp Readiness Checker (WRC): Tests if a warp should be allowed
to issue memory instructions
• Warp Status Table: Decide if scheduler should schedule from a warp
12
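Putting those pieces together, here is a hedged sketch of how the issue logic could consult the ordered list, the stall bits, and the WRC; the structure and names (WarpEntry, pick, and so on) are invented for illustration:

```cpp
// Hedged sketch of the MAS issue logic built from the WST/WRC
// description above; names are illustrative, not from the paper.
#include <optional>
#include <vector>

struct WarpEntry {
    int warp_id;
    bool is_memory_warp;  // next instruction is a load/store (OPtype)
    bool stall_bit;       // set via the Warp Status Table (WST)
};

struct Scheduler {
    std::vector<WarpEntry> ordered;  // memory warps first, then compute
    bool memory_saturated = false;   // memory saturation flag from L1
    int owner_id = -1;               // current owner warp in MP mode

    // Warp Readiness Checker: may this warp issue a memory instruction?
    bool wrc_allows(const WarpEntry& w) const {
        if (!w.is_memory_warp) return true;  // compute is always allowed
        return !memory_saturated || w.warp_id == owner_id;
    }

    // Pick the next warp to issue from.
    std::optional<int> pick() const {
        for (const auto& w : ordered)        // Mem_Q head before Comp_Q
            if (!w.stall_bit && wrc_allows(w))
                return w.warp_id;
        return std::nullopt;                 // nothing is ready
    }
};
```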
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS)
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR)
MAS + CAR = Mascar
Cache access re-execution
[Diagram: the load-store unit forwards requests the L1 cache cannot accept into a re-execution queue; a queued request from W1 later hits on W1's data already in the cache blocks]
Better than adding more MSHRs:
• More MSHRs cause faster saturation of the memory system
• More MSHRs cause faster thrashing of the data cache
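A minimal sketch of the re-execution queue idea under a toy L1 model follows; the L1Result interface and the MSHR accounting are assumptions for illustration, not GPGPU-Sim code:

```cpp
// Toy model of cache access re-execution (CAR). The L1 interface and
// its reject condition are illustrative assumptions.
#include <queue>

struct MemRequest { int warp_id; unsigned long addr; };

enum class L1Result { Hit, Miss, Reject };  // Reject: no MSHR/queue space

struct ToyL1 {
    int free_mshrs = 0;  // 0 models a saturated cache rejecting misses
    L1Result access(const MemRequest&) {
        // Real logic would probe the tags; here every access misses.
        if (free_mshrs == 0) return L1Result::Reject;
        --free_mshrs;
        return L1Result::Miss;
    }
};

struct LoadStoreUnit {
    ToyL1& l1;
    std::queue<MemRequest> reexec_q;  // re-execution queue

    // Instead of stalling the whole LSU when the L1 rejects a request,
    // park it so later accesses (potential hits) can still proceed.
    void issue(const MemRequest& r) {
        if (l1.access(r) == L1Result::Reject)
            reexec_q.push(r);
    }

    // Each cycle, replay the oldest parked request; by now it may hit
    // in the cache (hit-under-miss) or find a freed MSHR.
    void retry_one() {
        if (reexec_q.empty()) return;
        if (l1.access(reexec_q.front()) != L1Result::Reject)
            reexec_q.pop();
    }
};
```

Parking rejected accesses rather than enlarging the MSHR file avoids admitting more outstanding misses, which is why the slide argues CAR beats simply adding MSHRs.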
Experimental methodology
• GPGPU-Sim 3.2.2 – GTX 480 architecture
• SMs – 15, 32 PEs/SM
• Schedulers – LRR, GTO, OWL and CCWS
• L1 cache – 32kB, 64 sets, 4 way, 64 MSHRs
• L2 cache – 768kB, 8 way, 6 partitions, 200 core cycles
• DRAM – 32 requests/partition, 440 core cycles
Performance of compute intensive kernels
[Chart: speedup w.r.t. RR of GTO, OWL, CCWS, and Mascar on compute intensive kernels]
The performance of compute intensive kernels is insensitive to scheduling policies.
Performance of memory intensive kernels
[Chart: speedup w.r.t. RR of GTO, OWL, CCWS, MAS, CAR, and Mascar on bandwidth intensive and cache sensitive kernels, and overall; individual bars reach 3.0, 4.8, and 4.24]

Speedup w.r.t. RR:
Scheduler   Bandwidth Intensive   Cache Sensitive   Overall
GTO         4%                    24%               13%
OWL         4%                    4%                4%
CCWS        4%                    55%               24%
Mascar      17%                   56%               34%
Conclusion
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS) allows one warp to issue all its requests and begin computation early
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR) exploits more hit-under-miss opportunities through the re-execution queue
34% speedup, 12% energy savings
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Questions?