Latency considerations of depth-first GPU ray tracing
Michael Guthe
University of Bayreuth, Visual Computing
Depth-first GPU ray tracing
Based on a bounding box or spatial hierarchy
Recursive traversal, usually using a stack
Threads inside a warp may access different data and may also diverge
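As a sketch, the traversal loop described above looks roughly like the following single-threaded C++ version; the `Node` layout, the 64-entry stack, and all names are illustrative assumptions, not the actual GPU kernel:

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal structures; a real GPU tracer packs nodes for coalesced loads.
struct AABB { float min[3], max[3]; };
struct Node { AABB box; int left, right; int firstTri, triCount; }; // leaf if triCount > 0
struct Ray  { float o[3], d[3]; float tmax; };

// Ray/box slab test against the node's bounding box.
bool intersectBox(const AABB& b, const Ray& r) {
    float t0 = 0.0f, t1 = r.tmax;
    for (int a = 0; a < 3; ++a) {
        float inv = 1.0f / r.d[a];
        float tn = (b.min[a] - r.o[a]) * inv;
        float tf = (b.max[a] - r.o[a]) * inv;
        if (tn > tf) { float tmp = tn; tn = tf; tf = tmp; }
        if (tn > t0) t0 = tn;
        if (tf < t1) t1 = tf;
        if (t0 > t1) return false;
    }
    return true;
}

// Depth-first traversal with an explicit stack. On the GPU every thread of a
// warp runs this loop for its own ray, so pushes, pops, and therefore memory
// accesses differ between threads -- the divergence mentioned above.
bool traverse(const std::vector<Node>& nodes, const Ray& r) {
    int stack[64]; int sp = 0;
    stack[sp++] = 0; // root
    while (sp > 0) {
        const Node& n = nodes[stack[--sp]];
        if (!intersectBox(n.box, r)) continue;
        if (n.triCount > 0) return true; // leaf: triangle tests would go here
        stack[sp++] = n.left;
        stack[sp++] = n.right;
    }
    return false;
}
```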
Performance Analysis
What limits performance of the trace kernel?
Device memory bandwidth? Obviously not!
[Chart: device memory bandwidth utilization (0–12%) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]
Performance Analysis
What limits performance of the trace kernel?
Maximum (warp) instructions per clock? Not really!
[Chart: warp instructions per clock as a fraction of the maximum (0–80%) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]
Performance Analysis
Why doesn’t the kernel fully utilize the cores?
Three possible reasons:
Instruction fetch, e.g. due to branches
Memory latency (a.k.a. data request), mainly due to random access
Read-after-write latency (a.k.a. execution dependency): on Kepler it takes 22 clock cycles until the result is written to a register
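To make the execution-dependency stall concrete, here is a scalar C++ illustration (not GPU code): the first sum is one long read-after-write chain, so every add would wait for the previous result's write-back; the second splits the work into two independent chains whose latencies can overlap.

```cpp
#include <cassert>

// One long dependency chain: each add consumes the previous result, so on
// Kepler every step would stall ~22 cycles waiting for the register write-back.
float dependentSum(const float* x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s = s + x[i];          // s depends on the previous s
    return s;
}

// Two independent accumulators: the latency of one chain can be overlapped
// with arithmetic from the other (or with another eligible warp).
float interleavedSum(const float* x, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += x[i];            // chain 0
        s1 += x[i + 1];        // chain 1, independent of chain 0
    }
    if (n & 1) s0 += x[n - 1]; // odd tail element
    return s0 + s1;
}
```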
Performance Analysis
Why doesn’t the kernel fully utilize the cores?
Profiling shows: Memory & RAW latency limit performance!
[Chart: stall fraction (0–70%) split into instruction fetch, memory latency, and execution dependency for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]
Reducing Latency
Standard solution for latency: increase occupancy
Not an option here due to register pressure
Relocate memory accesses
Automatically performed by the compiler, but not between iterations of a while loop
Loop unrolling for the triangle test
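A sketch of what unrolling the triangle loop might look like, using the standard Möller–Trumbore test; the function names and the batch width of 4 are assumptions for illustration, not the paper's exact kernel:

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };
static V3 sub(V3 a, V3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 cross(V3 a, V3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot(V3 a, V3 b){ return a.x*b.x + a.y*b.y + a.z*b.z; }

// Standard Möller–Trumbore ray/triangle intersection test.
bool hitTri(V3 o, V3 d, V3 v0, V3 v1, V3 v2) {
    V3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    V3 p = cross(d, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < 1e-8f) return false;   // ray parallel to triangle
    float inv = 1.0f / det;
    V3 t = sub(o, v0);
    float u = dot(t, p) * inv;
    if (u < 0.0f || u > 1.0f) return false;
    V3 q = cross(t, e1);
    float v = dot(d, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return false;
    return dot(e2, q) * inv > 0.0f;             // hit in front of the origin
}

// Leaf loop unrolled by 4: the four tests have no dependencies on each other,
// so their loads and arithmetic can overlap instead of stalling sequentially.
bool anyHit4(V3 o, V3 d, const V3* verts, int triCount) {
    int i = 0;
    for (; i + 4 <= triCount; i += 4) {
        bool h0 = hitTri(o, d, verts[3*i+0], verts[3*i+1],  verts[3*i+2]);
        bool h1 = hitTri(o, d, verts[3*i+3], verts[3*i+4],  verts[3*i+5]);
        bool h2 = hitTri(o, d, verts[3*i+6], verts[3*i+7],  verts[3*i+8]);
        bool h3 = hitTri(o, d, verts[3*i+9], verts[3*i+10], verts[3*i+11]);
        if (h0 | h1 | h2 | h3) return true;
    }
    for (; i < triCount; ++i)                    // remainder loop
        if (hitTri(o, d, verts[3*i], verts[3*i+1], verts[3*i+2])) return true;
    return false;
}
```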
Reducing Latency
Instruction-level parallelism
Not directly supported by the GPU
Increases the number of eligible warps, with the same effect as higher occupancy
We might even spend a few more registers
Wider trees
A 4-ary tree means four independent instruction paths
Almost doubles the number of eligible warps during node tests
A higher width increases the number of node tests; 4 is the optimum
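One possible way to lay out a 4-ary node so that the four child tests become independent instruction paths; the SoA `Node4` layout and all names are hypothetical:

```cpp
#include <cassert>

// Hypothetical structure-of-arrays layout for a 4-ary BVH node: the four
// children's slab bounds sit side by side, so one thread tests all four.
struct Node4 {
    float minX[4], minY[4], minZ[4];
    float maxX[4], maxY[4], maxZ[4];
};

// Test a ray against all four child boxes. The four slab tests are fully
// independent, so their arithmetic can be interleaved to hide RAW latency.
// Returns a 4-bit hit mask (bit c set if child c is intersected).
int hitMask4(const Node4& n, const float o[3], const float invD[3], float tmax) {
    const float* mins[3] = { n.minX, n.minY, n.minZ };
    const float* maxs[3] = { n.maxX, n.maxY, n.maxZ };
    int mask = 0;
    for (int c = 0; c < 4; ++c) {       // no cross-iteration dependencies
        float t0 = 0.0f, t1 = tmax;
        for (int a = 0; a < 3; ++a) {
            float tn = (mins[a][c] - o[a]) * invD[a];
            float tf = (maxs[a][c] - o[a]) * invD[a];
            if (tn > tf) { float tmp = tn; tn = tf; tf = tmp; }
            if (tn > t0) t0 = tn;
            if (tf < t1) t1 = tf;
        }
        if (t0 <= t1) mask |= 1 << c;
    }
    return mask;
}
```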
Reducing Latency
Tree construction
Start from the root
Recursively pull the largest child up
Special rules for leaves to reduce memory consumption
Goal: 4 child nodes whenever possible
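The collapse rule could be sketched as follows, with `size` standing in for whatever ranking metric the builder uses (e.g. surface area or primitive count); this is an illustrative host-side sketch, not the paper's builder:

```cpp
#include <cassert>
#include <algorithm>
#include <vector>

// Hypothetical binary BVH node; leaf if left < 0.
struct BNode { int left, right; float size; };

// Collapse one binary node into up to four children: repeatedly replace the
// largest inner child by its own two children ("pull the largest child up").
std::vector<int> collect4(const std::vector<BNode>& t, int root) {
    std::vector<int> kids = { t[root].left, t[root].right };
    while (kids.size() < 4) {
        int best = -1; float bestSize = -1.0f;
        for (size_t i = 0; i < kids.size(); ++i)
            if (t[kids[i]].left >= 0 && t[kids[i]].size > bestSize) {
                best = (int)i; bestSize = t[kids[i]].size;
            }
        if (best < 0) break;        // only leaves left: fewer than 4 children
        int n = kids[best];
        kids[best] = t[n].left;     // replace the node by its two children
        kids.push_back(t[n].right);
    }
    return kids;                    // a real builder recurses on inner children
}
```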
Reducing Latency
Overhead: sorting intersected nodes
Two independent comparison paths are possible with a parallel merge sort
[Figure: parallel merge sort example on hit distances 0.7, 0.3, 0.2]
We don't need sorting for occlusion rays
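For four intersected children, the merge sort can be expressed as a small sorting network; the first two compare-exchanges touch disjoint values, giving the two independent paths mentioned above (illustrative C++, not the kernel):

```cpp
#include <cassert>
#include <utility>

// Compare-exchange: the building block of a sorting network.
static void cswap(float& a, float& b) { if (a > b) std::swap(a, b); }

// Sort four hit distances with the standard 5-comparator network for 4 keys.
void sort4(float d[4]) {
    cswap(d[0], d[1]);   // chain A
    cswap(d[2], d[3]);   // chain B, independent of chain A
    cswap(d[0], d[2]);   // merge step
    cswap(d[1], d[3]);
    cswap(d[1], d[2]);
}
```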
Results
Improved instructions per clock
Doesn’t directly translate to speedup
[Chart: warp instructions per clock after optimization (0–80%) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
[Chart: million rays per second (0–600) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan; Sibenik, 80k tris.]
Results
[Chart: million rays per second (0–450) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan; Fairy forest, 174k tris.]
Results
[Chart: million rays per second (0–700) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan; Conference, 283k tris.]
Results
[Chart: million rays per second (0–300) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan; San Miguel, 11M tris.]
Results
Latency is still the performance limiter
Mostly the memory latency improved
[Chart: remaining stall fraction (0–70%) split into instruction fetch, memory latency, and execution dependency for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]