Latency considerations of depth-first GPU ray tracing
Michael Guthe
University of Bayreuth, Visual Computing
Depth-first GPU ray tracing
Based on a bounding box or spatial hierarchy
Recursive traversal, usually using a stack
Threads inside a warp may access different data and may also diverge
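As a sketch, the traversal loop described above looks roughly like the following single-threaded C++ version; the `Node` layout, the 64-entry stack, and all names are illustrative assumptions, not the actual GPU kernel:

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal structures; a real GPU tracer packs nodes for coalesced loads.
struct AABB { float min[3], max[3]; };
struct Node { AABB box; int left, right; int firstTri, triCount; }; // leaf if triCount > 0
struct Ray  { float o[3], d[3]; float tmax; };

// Ray/box slab test against the node's bounding box.
bool intersectBox(const AABB& b, const Ray& r) {
    float t0 = 0.0f, t1 = r.tmax;
    for (int a = 0; a < 3; ++a) {
        float inv = 1.0f / r.d[a];
        float tn = (b.min[a] - r.o[a]) * inv;
        float tf = (b.max[a] - r.o[a]) * inv;
        if (tn > tf) { float tmp = tn; tn = tf; tf = tmp; }
        if (tn > t0) t0 = tn;
        if (tf < t1) t1 = tf;
        if (t0 > t1) return false;
    }
    return true;
}

// Depth-first traversal with an explicit stack. On the GPU every thread of a
// warp runs this loop for its own ray, so pushes, pops, and therefore memory
// accesses differ between threads -- the divergence mentioned above.
bool traverse(const std::vector<Node>& nodes, const Ray& r) {
    int stack[64]; int sp = 0;
    stack[sp++] = 0; // root
    while (sp > 0) {
        const Node& n = nodes[stack[--sp]];
        if (!intersectBox(n.box, r)) continue;
        if (n.triCount > 0) return true; // leaf: triangle tests would go here
        stack[sp++] = n.left;
        stack[sp++] = n.right;
    }
    return false;
}
```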
Performance Analysis
What limits performance of the trace kernel?
Device memory bandwidth? Obviously not!
[Chart: device memory bandwidth utilization (0–12%) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]
Performance Analysis
What limits performance of the trace kernel?
Maximum (warp) instructions per clock? Not really!
[Chart: warp instructions per clock as a fraction of the maximum (0–80%) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]
Performance Analysis
Why doesn’t the kernel fully utilize the cores?
Three possible reasons:
Instruction fetch, e.g. due to branches
Memory latency (a.k.a. data request), mainly due to random access
Read-after-write latency (a.k.a. execution dependency): on Kepler it takes 22 clock cycles until the result is written to a register
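To make the execution-dependency stall concrete, here is a scalar C++ illustration (not GPU code): the first sum is one long read-after-write chain, so every add would wait for the previous result's write-back; the second splits the work into two independent chains whose latencies can overlap.

```cpp
#include <cassert>

// One long dependency chain: each add consumes the previous result, so on
// Kepler every step would stall ~22 cycles waiting for the register write-back.
float dependentSum(const float* x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s = s + x[i];          // s depends on the previous s
    return s;
}

// Two independent accumulators: the latency of one chain can be overlapped
// with arithmetic from the other (or with another eligible warp).
float interleavedSum(const float* x, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += x[i];            // chain 0
        s1 += x[i + 1];        // chain 1, independent of chain 0
    }
    if (n & 1) s0 += x[n - 1]; // odd tail element
    return s0 + s1;
}
```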
Performance Analysis
Why doesn’t the kernel fully utilize the cores?
Profiling shows: Memory & RAW latency limit performance!
[Chart: stall fraction (0–70%) split into instruction fetch, memory latency, and execution dependency for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]
Reducing Latency
Standard solution for latency: increase occupancy
Not an option here due to register pressure
Relocate memory accesses
Automatically performed by the compiler, but not between iterations of a while loop
Loop unrolling for the triangle test
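A sketch of what unrolling the triangle loop might look like, using the standard Möller–Trumbore test; the function names and the batch width of 4 are assumptions for illustration, not the paper's exact kernel:

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };
static V3 sub(V3 a, V3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 cross(V3 a, V3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot(V3 a, V3 b){ return a.x*b.x + a.y*b.y + a.z*b.z; }

// Standard Möller–Trumbore ray/triangle intersection test.
bool hitTri(V3 o, V3 d, V3 v0, V3 v1, V3 v2) {
    V3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    V3 p = cross(d, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < 1e-8f) return false;   // ray parallel to triangle
    float inv = 1.0f / det;
    V3 t = sub(o, v0);
    float u = dot(t, p) * inv;
    if (u < 0.0f || u > 1.0f) return false;
    V3 q = cross(t, e1);
    float v = dot(d, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return false;
    return dot(e2, q) * inv > 0.0f;             // hit in front of the origin
}

// Leaf loop unrolled by 4: the four tests have no dependencies on each other,
// so their loads and arithmetic can overlap instead of stalling sequentially.
bool anyHit4(V3 o, V3 d, const V3* verts, int triCount) {
    int i = 0;
    for (; i + 4 <= triCount; i += 4) {
        bool h0 = hitTri(o, d, verts[3*i+0], verts[3*i+1],  verts[3*i+2]);
        bool h1 = hitTri(o, d, verts[3*i+3], verts[3*i+4],  verts[3*i+5]);
        bool h2 = hitTri(o, d, verts[3*i+6], verts[3*i+7],  verts[3*i+8]);
        bool h3 = hitTri(o, d, verts[3*i+9], verts[3*i+10], verts[3*i+11]);
        if (h0 | h1 | h2 | h3) return true;
    }
    for (; i < triCount; ++i)                    // remainder loop
        if (hitTri(o, d, verts[3*i], verts[3*i+1], verts[3*i+2])) return true;
    return false;
}
```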
Reducing Latency
Instruction-level parallelism
Not directly supported by the GPU
Increases the number of eligible warps, with the same effect as higher occupancy
We might even spend a few more registers
Wider trees
A 4-ary tree means four independent instruction paths
Almost doubles the number of eligible warps during node tests
A higher width increases the number of node tests; 4 is the optimum
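One possible way to lay out a 4-ary node so that the four child tests become independent instruction paths; the SoA `Node4` layout and all names are hypothetical:

```cpp
#include <cassert>

// Hypothetical structure-of-arrays layout for a 4-ary BVH node: the four
// children's slab bounds sit side by side, so one thread tests all four.
struct Node4 {
    float minX[4], minY[4], minZ[4];
    float maxX[4], maxY[4], maxZ[4];
};

// Test a ray against all four child boxes. The four slab tests are fully
// independent, so their arithmetic can be interleaved to hide RAW latency.
// Returns a 4-bit hit mask (bit c set if child c is intersected).
int hitMask4(const Node4& n, const float o[3], const float invD[3], float tmax) {
    const float* mins[3] = { n.minX, n.minY, n.minZ };
    const float* maxs[3] = { n.maxX, n.maxY, n.maxZ };
    int mask = 0;
    for (int c = 0; c < 4; ++c) {       // no cross-iteration dependencies
        float t0 = 0.0f, t1 = tmax;
        for (int a = 0; a < 3; ++a) {
            float tn = (mins[a][c] - o[a]) * invD[a];
            float tf = (maxs[a][c] - o[a]) * invD[a];
            if (tn > tf) { float tmp = tn; tn = tf; tf = tmp; }
            if (tn > t0) t0 = tn;
            if (tf < t1) t1 = tf;
        }
        if (t0 <= t1) mask |= 1 << c;
    }
    return mask;
}
```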
Reducing Latency
Tree construction
Start from the root
Recursively pull the largest child up
Special rules for leaves to reduce memory consumption
Goal: 4 child nodes whenever possible
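The collapse rule could be sketched as follows, with `size` standing in for whatever ranking metric the builder uses (e.g. surface area or primitive count); this is an illustrative host-side sketch, not the paper's builder:

```cpp
#include <cassert>
#include <algorithm>
#include <vector>

// Hypothetical binary BVH node; leaf if left < 0.
struct BNode { int left, right; float size; };

// Collapse one binary node into up to four children: repeatedly replace the
// largest inner child by its own two children ("pull the largest child up").
std::vector<int> collect4(const std::vector<BNode>& t, int root) {
    std::vector<int> kids = { t[root].left, t[root].right };
    while (kids.size() < 4) {
        int best = -1; float bestSize = -1.0f;
        for (size_t i = 0; i < kids.size(); ++i)
            if (t[kids[i]].left >= 0 && t[kids[i]].size > bestSize) {
                best = (int)i; bestSize = t[kids[i]].size;
            }
        if (best < 0) break;        // only leaves left: fewer than 4 children
        int n = kids[best];
        kids[best] = t[n].left;     // replace the node by its two children
        kids.push_back(t[n].right);
    }
    return kids;                    // a real builder recurses on inner children
}
```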
Reducing Latency
Overhead: sorting intersected nodes
Two independent comparison paths are possible with a parallel merge sort
[Figure: parallel merge sort example on hit distances 0.7, 0.3, 0.2]
We don't need sorting for occlusion rays
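For four intersected children, the merge sort can be expressed as a small sorting network; the first two compare-exchanges touch disjoint values, giving the two independent paths mentioned above (illustrative C++, not the kernel):

```cpp
#include <cassert>
#include <utility>

// Compare-exchange: the building block of a sorting network.
static void cswap(float& a, float& b) { if (a > b) std::swap(a, b); }

// Sort four hit distances with the standard 5-comparator network for 4 keys.
void sort4(float d[4]) {
    cswap(d[0], d[1]);   // chain A
    cswap(d[2], d[3]);   // chain B, independent of chain A
    cswap(d[0], d[2]);   // merge step
    cswap(d[1], d[3]);
    cswap(d[1], d[2]);
}
```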
Results
Improved instructions per clock
Doesn’t directly translate to speedup
[Chart: warp instructions per clock after optimization (0–80%) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
[Chart: million rays per second (0–600) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan; Sibenik, 80k tris.]
Results
[Chart: million rays per second (0–450) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan; Fairy forest, 174k tris.]
Results
[Chart: million rays per second (0–700) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan; Conference, 283k tris.]
Results
[Chart: million rays per second (0–300) for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan; San Miguel, 11M tris.]
Results
Latency is still the performance limiter
Mostly the memory latency improved
[Chart: remaining stall fraction (0–70%) split into instruction fetch, memory latency, and execution dependency for primary, AO, and diffuse rays on the GTX 480, GTX 680, and GTX Titan]