Adaptive Input-aware Compilation for Graphics Engines


VAST: The Illusion of a Large Memory Space for GPUs
Janghaeng Lee, Mehrzad Samadi, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
August 2014
Spectrum of GPUs
• Wide range of systems
– Supercomputers
– Servers
– Desktops
– Laptops
– Cell phones
• External GPUs
– Work on large data sets
– Have their own memory
– Require the data to be transferred into that memory
GPU Memory Size
• Most products have ≤ 2,048 MB of memory
[Figure: memory size (GB) of NVIDIA GTX-series GPUs by launch date, 2008–2014; GPU memory grows slowly while problem sizes keep growing]
Portability
• A computation with a 1.8 GB footprint (e.g., a 12K × 12K matrix multiplication) works on a 2 GB GPU, but not on a 1 GB GPU
• What if there is more than 2 GB of data?
GPUs Lack Virtual Addressing
• GPU execution is non-preemptive
– Page-fault handling is not available
• High PCIe latency
– > 2,000 ns per transaction, ~20× slower than GDDR memory latency
– On-demand access produces significant overhead (e.g., NVIDIA Unified Virtual Addressing)
Motivating Idea
[Figure: a page table lets GPU threads address a 4 GB host memory through a 1 GB GPU memory, moving pages over PCIe]
Motivating Idea
• Challenges
– Control the execution order of threads
– Gather the exact working set of a partial set of threads
– Translate addresses during execution
OpenCL Execution
• An application submits an OpenCL kernel to the GPU
• Work-items (threads) are grouped into work-groups
• The GPU schedules work-groups in an arbitrary order
• This enables kernel slicing in software, without knowing the algorithm (see the sketch below)
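Because any contiguous range of work-groups may run on its own, a host can slice a kernel using nothing beyond the standard clEnqueueNDRangeKernel global work offset. A minimal 1D sketch, assuming an already-created queue and kernel (illustrative, not the authors' code):

    #include <CL/cl.h>

    /* Enqueue only work-groups [first, first + count) of a 1D kernel. */
    cl_int launch_slice(cl_command_queue queue, cl_kernel kernel,
                        size_t first, size_t count, size_t local_size)
    {
        size_t global_offset = first * local_size;  /* skip earlier groups */
        size_t global_size   = count * local_size;  /* groups in the slice */
        return clEnqueueNDRangeKernel(queue, kernel, 1, &global_offset,
                                      &global_size, &local_size,
                                      0, NULL, NULL);
    }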
VAST System
• An OpenCL library layer between the application's kernel and the GPU memory
• Inspects memory footprints
• Generates a Look-ahead Page Table (LPT)
• Transforms the kernel to perform address translation
VAST System
• Inspector Kernel: inspects footprints and produces the Page Accessed Set (PAS)
• Look-ahead Page Table (LPT): generated from the PAS
• Paged Access Kernel: the transformed kernel with address translation
VAST Execution Timeline
• Per round: (A) PAS generation on the GPU, (B) LPT generation with the frame buffer on the host, then host-to-device transfer, paged-access kernel execution, device-to-host transfer, and array recovery
• Rounds are pipelined: the 2nd round's work, which depends on its own A and B, overlaps with the 1st round's transfers and execution, so only the cost at the two ends of the timeline is exposed (see the sketch below)
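As host pseudocode, the round structure might look like the following sketch (every function name here is hypothetical; the talk only gives the timeline):

    /* One VAST round per slice of work-groups; in practice round r+1's
       PAS/LPT steps run while round r's transfers and kernel execute. */
    for (int r = 0; r < num_rounds; ++r) {
        run_inspector(r);            /* A: PAS generation                 */
        build_lpt_and_frames(r);     /* B: LPT + frame buffer on the host */
        copy_host_to_device(r);
        run_paged_access_kernel(r);
        copy_device_to_host(r);
        recover_arrays(r);           /* scatter frames back into arrays   */
    }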
Key Components
• Page Accessed Set (PAS)
• Inspector Kernel
• Look-ahead Page Table (LPT)
• Paged Access Kernel
Page Accessed Set

    __kernel void
    VectorAdd(__global float *C, __global float *A, __global float *B)
    {
        int tid = get_global_id(0);
        C[tid] = A[tid] + B[tid];
    }

• A 1 GB buffer is split into 4 KB pages, giving 256 K page indices (0x00000–0x3FFFF)
• The PAS records, per page, whether an access such as B[tid] touches it:

    Page Index   Accessed
    0x00000      0
    0x00001      1
    0x00002      0
    …            …
    0x3FFFF      0

• One PAS per work-item is 256 KB; 256 KB × 1M work-items = 256 GB (made concrete in the sketch below)
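The sizing arithmetic as C constants (a sketch; the 1 GB buffer and 4 KB pages are the slide's numbers, one byte per flag is my assumption):

    #define BUF_BYTES (1UL << 30)            /* 1 GB buffer                 */
    #define PGSIZE    4096UL                 /* 4 KB pages                  */
    #define NUM_PAGES (BUF_BYTES / PGSIZE)   /* 256 K page entries          */
    #define PAS_BYTES (NUM_PAGES * 1UL)      /* 1-byte flags -> 256 KB PAS  */
    /* 256 KB x 1M work-items = 256 GB of PAS storage -- impractical,
     * which motivates sharing one PAS across work-groups (next slide). */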
Sharing the Page Accessed Set
• Per work-item: 256 KB × 1M work-items = 256 GB
• Per work-group: the 1M work-items form 1,000 work-groups, so 256 KB × 1,000 = 256 MB
• Per N work-groups: 256/N MB
• Sharing coarsens the PAS: a page is marked if any work-item in the sharing set accesses it:

    Page Index   Accessed
    0x00000      1
    0x00001      1
    0x00002      1
    …            …
    0x3FFFF      0
Inspector Kernel

    __kernel void
    VectorAdd(__global char *C_PAS, __global char *A_PAS, __global char *B_PAS)
    {
        int tid = get_global_id(0);
        // C[tid] = A[tid] + B[tid];   <- original computation, removed by DCE
        B_PAS = PAS_for_this_workgroup(B_PAS);
        int page_idx = compute_Page_Idx(tid);
        B_PAS[page_idx] = 1;           // mark the page as "LIVE"
        // (accesses to A and C are instrumented the same way)
    }

• Derived from the original kernel: buffer arguments become PAS arguments, each access marks the page it would touch, and the computation itself is removed by dead-code elimination (DCE); the helpers are sketched below
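The helper functions are not spelled out in the talk; a minimal sketch, assuming 4 KB pages of float elements and one PAS shared by every N consecutive work-groups (PG_ELEMS, N, and NUM_PAGES are my names):

    #define PG_ELEMS 1024   /* floats per 4 KB page */

    int compute_Page_Idx(int tid)    { return tid / PG_ELEMS; }
    int compute_Page_Offset(int tid) { return tid % PG_ELEMS; }

    /* Select the PAS this work-group shares with its N-1 neighbors. */
    __global char *PAS_for_this_workgroup(__global char *pas_base)
    {
        size_t shared = get_group_id(0) / N;    /* N groups per PAS  */
        return pas_base + shared * NUM_PAGES;   /* one flag per page */
    }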
Look-ahead Page Table
• Generated on the host from the PASes
• The PASes collected for the work-groups of a round are OR-reduced per buffer (A_PAS, B_PAS, C_PAS)
• A SUM-reduce over the OR-reduced PASes counts the live pages and sizes the frame buffer, e.g. (1024 + 512 + 1024) pages × 4 KB = 10 MB (see the host-side sketch below)
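A host-side sketch of both reductions in one pass (my reconstruction, reusing NUM_PAGES from the sizing sketch above; pas[g][p] holds work-group g's flag for page p):

    size_t reduce_pas(const char (*pas)[NUM_PAGES], int ngroups,
                      char merged[NUM_PAGES])
    {
        size_t live = 0;
        for (size_t p = 0; p < NUM_PAGES; ++p) {
            char flag = 0;
            for (int g = 0; g < ngroups; ++g)  /* OR-reduce across groups */
                flag |= pas[g][p];
            merged[p] = flag;
            live += flag;                      /* SUM-reduce: live pages  */
        }
        return live;   /* live * 4 KB = frame-buffer size */
    }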
Look-ahead Page Table
• The OR-reduced PAS is turned into the LPT: 256 K entries × 4 bytes
• Every live page is linked to a frame, and its 4 KB of actual data is copied into that frame of the frame buffer
• The SUM-reduce result (here 1024) gives the number of 4 KB frames to allocate
• Pages that were never accessed get no mapping ("-"); a sketch of this step follows
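Putting the two steps into code, a sketch of LPT construction and frame packing (my reconstruction, reusing the constants and merged PAS from the sketches above; -1 stands for "no mapping"):

    #include <string.h>  /* memcpy */

    int build_lpt(const char merged[], const char *data,
                  int lpt[], char *frame_buf)
    {
        int next_frame = 0;
        for (size_t p = 0; p < NUM_PAGES; ++p) {
            if (merged[p]) {
                lpt[p] = next_frame;                    /* link page -> frame */
                memcpy(frame_buf + (size_t)next_frame * PGSIZE,
                       data + p * PGSIZE, PGSIZE);      /* copy actual data   */
                next_frame++;
            } else {
                lpt[p] = -1;                            /* dead page          */
            }
        }
        return next_frame;  /* equals the SUM-reduce of merged */
    }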
Paged Access Kernel

    __kernel void
    VectorAdd(__global float *C, __global float *A, __global float *B,
              int gid_from, int gid_to)
    {
        int flat_gid = get_flat_group_id();
        if (flat_gid < gid_from || flat_gid > gid_to)
            return;
        int tid = get_global_id(0);
        C[tid] = A[tid] + B[tid];
    }

• Two extra arguments select the range of work-groups that execute in this round; all others return immediately (get_flat_group_id is sketched below)
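get_flat_group_id() is a VAST helper, not an OpenCL built-in; one plausible definition for an up-to-3D NDRange (my sketch):

    int get_flat_group_id(void)
    {
        return (int)(get_group_id(2) * get_num_groups(1) * get_num_groups(0)
                   + get_group_id(1) * get_num_groups(0)
                   + get_group_id(0));
    }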
Paged Access Kernel

    __kernel void
    VectorAdd(__global float *C, __global float *A, __global float *B,
              ... , __global int *B_LPT, __global float *B_BUF,
              int flat_gid_from, int flat_gid_to)
    {
        [ ... CHECKING CODE ... ]
        int tid = get_global_id(0);
        int page_idx = compute_Page_Idx(tid);
        int offset   = compute_Page_Offset(tid);
        int new_idx  = B_LPT[page_idx] * PGSIZE + offset;
        C[tid] = A[tid] + B_BUF[new_idx];
    }

• Every access to a paged buffer goes through the LPT: the page index selects a frame in the frame buffer, and the in-page offset is preserved (worked example below)
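Note that since B_BUF is indexed in floats, PGSIZE in this code must count elements (1,024 per 4 KB page), not bytes. A worked example with made-up numbers: tid = 5000 gives page_idx = 5000 / 1024 = 4 and offset = 5000 % 1024 = 904; if B_LPT[4] == 2, the element lives in frame 2, so new_idx = 2 × 1024 + 904 = 2952 in B_BUF.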
Experiment Setup

    Device               Memory (B/W)              Peak Perf.
    Intel Core i7-3770   32 GB DDR3 (12.8 GB/s)    435.2 GFLOPS
    NVIDIA GTX 760       2 GB GDDR5 (192 GB/s)     2,258 GFLOPS
    PCIe 3.0 x16         15.76 GB/s

• Benchmarks

    Application       Execution Parameters             In/Out Buffer Size
    BlackScholes      256 million options              5.0 GB
    FDTD3d            3D dimsize = 1024, radius = 2    8.0 GB
    MatrixMul         20,480 × 20,480 matrices         4.7 GB
    MedianFilter      30,720 × 17,280 PPM image        4.0 GB
    MersenneTwister   1.15 billion numbers             4.3 GB
    BoxMuller         1.15 billion numbers             8.6 GB
    Nbody             41 million particles             2.6 GB
    Reduction         940 million numbers              3.5 GB
    SobelFilter       30,720 × 17,280 PPM image        4.0 GB
    SpMV              8 M × 64 matrix, 8 M vector      4.0 GB
Performance Results
[Figure: speedup over CPU execution (CPU w/ 32 GB) for VAST with 2K, 4K, and 8K pages across the benchmarks; VAST reaches an average speedup of 2.6×]
Execution Time Breakdown
[Figure: time breakdown for MatrixMul (~120 s) and SpMV (~6 s) across PAS generation, PAS reduction, LPT/frame generation, host-to-device transfer, kernel execution, device-to-host transfer, and recovery; the PAS/LPT stages are the additional cost VAST introduces]
Paged Access Overhead
[Figure: execution time with 2K, 4K, and 8K pages, normalized to normal GPU execution with 1 GB (GPU execution only); the paged-access translation adds about 14% on average]
Summary
• GPUs have a limited amount of memory
• GPUs do not support virtual addressing
• Virtual Address Space for Throughput processors (VAST)
– Provides the illusion of a large memory space
– Fast page accessed set (PAS) generation through an inspector kernel
– Look-ahead page table generation
– Paged access kernel w/ software address translation
– Speedup of 2.6× over execution on the CPU device
Performance Results (backup)
[Figure: speedup over CPU execution for VAST with 2K, 4K, and 8K pages, compared against a GPU with infinite physical memory]