Grass, Fur and all things hairy Nicolas Thibieroz Gaming Engineering Manager, AMD Karl Hillesland Senior Research Engineer, AMD.

Slide 1

Grass, Fur and all things hairy
Nicolas Thibieroz

Gaming Engineering Manager, AMD

Karl Hillesland

Senior Research Engineer, AMD

Slide 2

Next-gen Grass, Fur and Hair
● The time for next-gen quality is now
● Tomb Raider pioneered next-gen hair
  ● Even on PS4/XB1
● Users expect this level of quality for next-gen titles
● You need to start thinking about this
● This talk is about making high-quality fur, grass and hair run at real-time performance

Slide 3

TressFX applied to Grass, Fur and Hair
● Variations of the same technique can be used for all those applications
● In all cases the core principles of next-gen quality are still needed:
  ● Compute simulations
  ● Anti-aliasing
  ● Transparency
  ● Volumetric self-shadowing
  ● A good lighting model

Slide 4

Forward Rendering Pipeline – a refresher
● Consists of three steps:
  ● Hair simulation
  ● Shade and store fragments into buffers
  ● Fetch shaded fragments, sort and render

Slide 5

Per-Pixel Linked Lists
● Head UAV
  ● Each pixel location has a “head pointer” to a linked list in the PPLL UAV
● PPLL UAV
  ● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
  ● A link is created to the fragment pointed to by the head pointer
  ● Head pointer then points to the new fragment

// Retrieve current pixel count and increase counter
uint uPixelCount = LinkedListUAV.IncrementCounter();
uint uOldStartOffset;
// Exchange indices in LinkedListHead texture corresponding to pixel location
InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
// Append new element at the end of the Fragment and Link Buffer
Element.uNext = uOldStartOffset;
LinkedListUAV[uPixelCount] = Element;
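
The insertion logic above can be mimicked on the CPU to see how the head buffer and the node pool evolve. This is a minimal single-threaded Python sketch, not the shader: the `PPLL` class, its method names and the `END` sentinel are hypothetical, and the atomic counter and exchange are replaced by plain assignments.

```python
END = 0xFFFFFFFF  # assumed "empty list" sentinel stored in the head buffer


class PPLL:
    """CPU-side model of a head buffer plus a fragment-and-link node pool."""

    def __init__(self, width, height, capacity):
        self.width = width
        self.head = [END] * (width * height)  # one head pointer per pixel
        self.nodes = [None] * capacity        # node pool (the PPLL UAV)
        self.counter = 0                      # stands in for the UAV counter

    def add_fragment(self, x, y, payload):
        if self.counter >= len(self.nodes):
            return False                      # pool exhausted: fragment is dropped
        index = self.counter                  # IncrementCounter()
        self.counter += 1
        address = y * self.width + x
        old_head = self.head[address]         # InterlockedExchange()
        self.head[address] = index
        self.nodes[index] = (payload, old_head)  # link to the previous head
        return True

    def fragments_at(self, x, y):
        """Walk the list for one pixel, most recently added fragment first."""
        out = []
        index = self.head[y * self.width + x]
        while index != END:
            payload, index = self.nodes[index]
            out.append(payload)
        return out
```

Fragments come back in reverse submission order, not depth order, which is why the resolve pass has to sort them.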


Slide 6

Forward Rendering Pipeline – a refresher
Hair Simulation
[Diagram: input geometry (model space) + simulation parameters → CS → CS → CS → post-simulation geometry (UAV, world space)]

Slide 7

Forward Rendering Pipeline – a refresher
Shade and Store fragments into Buffers
[Diagram: world-space geometry → VS (extrusion from line segments to non-indexed triangles, output in homogeneous clip space) → PS (lighting, shadows) → Head UAV and PPLL UAV (depth, color, coverage, next). Also labeled: coverage, null RT, stencil.]

Slide 8

Forward Rendering Pipeline – a refresher
Fetch shaded fragments, sort and render
[Diagram: full-screen quad → VS → PS (reads Head UAV and PPLL UAV; fragment sorting and manual blending; stencil) → render target]

Slide 9

Forward Rendering Performance
● Main cost in forward rendering mode is in the shading part
  ● All fragments are lit and shadowed before being stored
  ● PPLL storing is typically not the bottleneck!
● Don’t need maximum quality on all fragments
  ● “Tail” fragments need only “good enough” quality
● Solution: Use shader LOD

Slide 10

Forward vs Deferred Rendering Pipeline
Forward rendering pipeline
● Hair simulation
● Full shading and store fragments into buffers
● Fetch shaded fragments, sort and render
Deferred rendering pipeline
● Hair simulation
● Store fragment properties into buffers
● Fetch fragment properties, sort, shade and render
  ● Full shading on K-frontmost fragments
  ● “Tail” fragments are shaded with a simpler light equation and shadowing algorithm

Slide 11

Deferred Rendering Pipeline
Hair Simulation – unchanged!
[Diagram: input geometry (model space) + simulation parameters → CS → CS → CS → post-simulation geometry (UAV, world space)]

Slide 12

Deferred Rendering Pipeline – a refresher
Store Fragment Properties into Buffers
[Diagram: world-space geometry + index buffer (indexed triangle list) → VS (homogeneous clip space) → PS → Head UAV and PPLL UAV (depth, tangent, coverage, next). Also labeled: coverage, null RT, stencil.]

Slide 13

Deferred Rendering Pipeline
Fetch fragments, sort, shade and render
[Diagram: full-screen quad → VS → PS (reads Head UAV and PPLL UAV; lighting, shadows; stencil) → render target]
● K frontmost fragments: full shading, sorting and manual blending
● Tail fragments: cheap shading, no sorting and manual blending

Slide 14

Deferred Rendering Shading LOD Optimization
● Deferred approach allows a reduction in shading cost: “Shader LOD”
  ● Only sort and shade K frontmost fragments at high quality
  ● “Simple” shading and out-of-order rendering on tail fragments
  ● Single-tap shadowing on tail fragments
● Very little quality difference compared to full shading
  ● But much better performance!

Technique                   Cost
Out of order, no shading    1.31 ms
Out of order, shading       2.80 ms
Forward PPLL, shading       3.38 ms
Deferred PPLL, shading      2.13 ms

Shading cost is ~1.5 ms
PPLL cost is ~0.58 ms
Fast!

Fur model with ~130,000 fur strands
Running on AMD Radeon 7970 @ 1080p
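
The two annotated costs appear to be differences between rows of the table. The short Python check below reproduces them from the quoted timings. The dictionary keys are made-up labels, and the reading of which rows are subtracted is an inference from the numbers, not something the slide states.

```python
# Timings from the table above, in milliseconds.
cost_ms = {
    "out_of_order_no_shading": 1.31,
    "out_of_order_shading": 2.80,
    "forward_ppll_shading": 3.38,
    "deferred_ppll_shading": 2.13,
}

# Shading cost: same out-of-order blend, with and without shading.
shading_ms = cost_ms["out_of_order_shading"] - cost_ms["out_of_order_no_shading"]
# PPLL cost: shaded forward PPLL versus shaded out-of-order rendering.
ppll_ms = cost_ms["forward_ppll_shading"] - cost_ms["out_of_order_shading"]
# Saving from deferring the shading so that only the front K get full quality.
deferred_saving_ms = cost_ms["forward_ppll_shading"] - cost_ms["deferred_ppll_shading"]

print(f"shading ~{shading_ms:.2f} ms, PPLL ~{ppll_ms:.2f} ms, "
      f"deferred saves ~{deferred_saving_ms:.2f} ms over forward")
```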

Slide 15

[Image comparison: full-quality shading forced on for all fragments vs. shading LOD]

Slide 16

Geometry Optimizations
● A great portion of time was spent in the GPU front-end
  ● Expansion from line segments to triangles was done in GS and then VS with Draw()
    ● 920,000 line segments for fur model
    ● Each segment would create a quad (two triangles) with 6 vertices
  ● Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!

Draw() method (line segments → expanded quads):
Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };

DrawIndexed() method (line segments → expanded quads):
Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
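
The indexed pattern above can be generated offline with a few lines of code. This Python sketch assumes two extruded vertices per strand point, laid out consecutively per strand. The function name and that vertex layout are illustrative assumptions, not the SDK's actual tooling.

```python
def strand_index_buffer(points_per_strand, strand_count):
    """Indices for an indexed triangle list, assuming two vertices per strand point."""
    indices = []
    for s in range(strand_count):
        base = 2 * points_per_strand * s          # first vertex of this strand
        for seg in range(points_per_strand - 1):  # one quad per line segment
            v = base + 2 * seg
            indices += [v, v + 1, v + 2,          # first triangle of the quad
                        v + 2, v + 1, v + 3]      # second triangle reuses two vertices
    return indices
```

With this layout, neighbouring quads on a strand share vertices. The non-indexed path instead runs the vertex shader for 6 vertices per segment, which is 920,000 × 6 = 5,520,000 invocations for the fur model.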

Slide 17

Distance-based LOD system Optimization
● Input line segments have a random order
● Just render fewer (but thicker) fragments when far away!
● Needs shading adjustments to ensure smooth quality transitions
● Increase alpha threshold for fragment inclusion when far away
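
One way to picture the "fewer but thicker" idea is a distance-driven strand fraction with a compensating width scale. This Python sketch is only an illustration: the linear falloff, the `near`/`far`/`min_fraction` parameters and the rule that fraction × width stays constant are all assumptions. The talk does not give the actual mapping.

```python
def distance_lod(distance, near, far, min_fraction=0.1):
    """Return (fraction of strands to draw, strand width scale) for a distance."""
    t = min(max((distance - near) / (far - near), 0.0), 1.0)
    fraction = 1.0 - t * (1.0 - min_fraction)  # linear falloff: an assumption
    width_scale = 1.0 / fraction               # keep fraction * width constant
    return fraction, width_scale


def strands_to_draw(total_strands, fraction):
    # Strands are stored in random order, so drawing a prefix thins the fur evenly.
    return max(1, int(total_strands * fraction))
```

Because the input order is random, no per-frame selection is needed: the draw call covers a shorter range of the same buffer.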

Slide 18

Other Optimizations
● PPLL Head UAV uses a RWTexture2D instead of a Buffer
  ● Results in more efficient caching for UAV accesses
● Avoid GPR indexing for sorting
  ● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
  ● Used an ALU-based indexing approach to improve performance
● TO DO: compute shader simulation optimizations
  ● Currently a set of multiple compute shaders
  ● Looking at combining some of these, optimizing shaders and output formats

Slide 19

Per-Pixel Linked Lists UAV Memory Considerations
● How much memory is needed?
  ● Guesstimate for a given usage model
  ● Max (hair pixels x average overdraw) fragments
● What happens when I run out?
  ● Missing fragments
● What can be done about it?
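
The sizing rule above is simple enough to turn into a quick estimate. This is a rough Python sketch: the function names are hypothetical, the 12-byte node size is the figure quoted later in the talk for (depth, data, next), and the 4-byte head pointer is an assumption.

```python
def ppll_pool_bytes(hair_pixels, average_overdraw, node_bytes=12):
    """Node pool size for a worst-case frame: one node per expected fragment."""
    return hair_pixels * average_overdraw * node_bytes


def head_buffer_bytes(width, height, pointer_bytes=4):
    """One head pointer per screen pixel (a 32-bit pointer is assumed)."""
    return width * height * pointer_bytes
```

For example, 300,000 hair pixels at an average overdraw of 12 gives 3.6 million fragments, or about 43 MB of nodes, plus roughly 8 MB for a 1080p head buffer.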

Slide 20

k-Buffer in Memory

Slide 21

PP Linked-List (PPLL)
● Node pool holding all fragments
● How big?

k-Buffer
● Fixed-size array of k entries per pixel
● Simple memory bound

Slide 22

The Front k
● Approximation to avoid massive sorting
  ● Only sort the front k fragments per-pixel
  ● Blend the rest out-of-order
● If deferring for shader LOD … also
  ● Full quality shade on front k
  ● Cheap shade on rest
● k is 4, 8, 16
[Image: depth complexity, 20 frags/pixel (ave); red = over 100]

Slide 23

The Front k
● Approximation to avoid massive sorting
  ● Only sort the front k fragments per-pixel
  ● Blend the rest out-of-order
● If deferring for shader LOD … also
  ● Full quality shade on front k
  ● Cheap shade on rest
● Can’t know front k until all fragments processed
[Diagram: k-buffer and tail]

Slide 24

For Each Fragment in Each Pixel
[Diagram: a new fragment arrives at the k-buffer, which tracks the index of its furthest entry; the resulting tail fragment is blended into the tail color]

Slide 25

If New Fragment in k
[Diagram: new fragment, k-buffer with index of furthest, tail fragment, blend, tail color]
If in k:
1. Swap with furthest
2. Find new furthest
3. Blend with tail

Slide 26

If not in k
[Diagram: new fragment, k-buffer with index of furthest, tail fragment, blend, tail color]
If not in k:
1. Blend with tail
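
The two cases above fit in a short per-pixel routine. This is a hedged CPU-side Python sketch, not the shader. It makes several assumptions: scalar colors, standard "over" blending for both the tail and the front k, and an index of the furthest entry that is recomputed on every fragment instead of cached. `resolve_pixel` and `exact` are hypothetical names, and `k` is assumed to be at least 1.

```python
def resolve_pixel(fragments, k, background=0.0):
    """Blend one pixel's fragments given as (depth, color, alpha) tuples.

    Smaller depth is closer. Colors are scalars to keep the sketch short.
    """
    kbuf = []           # the k closest fragments seen so far
    color = background  # running result; tail fragments blend in arrival order
    for frag in fragments:
        if len(kbuf) < k:
            kbuf.append(frag)  # the first k fragments simply fill the buffer
            continue
        # The slides track this index incrementally; it is recomputed here for brevity.
        furthest = max(range(k), key=lambda i: kbuf[i][0])
        if frag[0] < kbuf[furthest][0]:
            # New fragment is in the front k: swap it with the furthest entry.
            frag, kbuf[furthest] = kbuf[furthest], frag
        _, c, a = frag
        color = c * a + color * (1.0 - a)  # blend the tail fragment
    # Sort the surviving front k back to front and blend them over the tail.
    for _, c, a in sorted(kbuf, key=lambda f: f[0], reverse=True):
        color = c * a + color * (1.0 - a)
    return color


def exact(fragments, background=0.0):
    """Reference result: every fragment sorted back to front."""
    return resolve_pixel(fragments, k=len(fragments), background=background)
```

Every fragment that reaches the tail is at least as far away as everything left in the k-buffer, so the only error comes from the order in which tail fragments are blended with each other.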

Slide 27

From PPLL to k-Buffer
For each pixel:
Write frags to mem
For each fragment in each pixel
read fragment from mem
update k-buffer (reg)
(mem)
blend tail fragment (reg)
(mem)
Read k-buffer from mem
Sort and blend k-buffer (reg)

Slide 28

k-Buffer
● Screen width × screen height × k entries (k = 4, 8, 16)
● 8 bytes each (depth and data)
● PPLL nodes were 12 bytes (depth, data, next)
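
The fixed bound is easy to evaluate for a concrete resolution. This small Python sketch uses the entry and node sizes quoted above. The function names and the break-even comparison against a PPLL node pool are illustrative additions.

```python
def kbuffer_bytes(width, height, k, entry_bytes=8):
    """Fixed k-buffer size: k entries of (depth, data) for every screen pixel."""
    return width * height * k * entry_bytes


def breakeven_fragments(width, height, k, entry_bytes=8, node_bytes=12):
    """Fragment count at which a PPLL node pool would match the k-buffer size."""
    return kbuffer_bytes(width, height, k, entry_bytes) // node_bytes
```

At 1920×1080 with k = 4 the k-buffer takes 66,355,200 bytes, which equals a PPLL pool of 5,529,600 nodes. Whether the k-buffer is smaller therefore depends on how many fragments the PPLL has to be provisioned for.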

Slide 29

PPLL: 2nd Pass
[Diagram: k-buffer, index of furthest, tail fragment and tail color all held in registers; each new fragment goes through the k-buffer update and the tail blend]

Slide 30

k-Buffer in Memory: 1st Pass
[Diagram: k-buffer and the per-pixel word (mutex, index of furthest, …) held in memory; new fragment → k-buffer → tail fragment → blend unit → tail color]

Slide 31

Mutex/Count/Index Buffer
● One 32-bit word per pixel (screen width × screen height)
● Layout from the high bit: max index (4 bits), mutex bit, initialized bit, count (remainder)
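
The word layout can be illustrated with a pack and unpack pair. In this Python sketch only the position of the max index (bits 28 to 31) follows from the `<< 28` in the talk's shader code. The exact positions of the mutex and initialized bits, and the values of `RESERVED` and `INITED`, are assumptions.

```python
MAX_INDEX_SHIFT = 28        # matches the `<< 28` used in the talk's shader code
RESERVED = 1 << 27          # mutex bit (position assumed)
INITED = 1 << 26            # initialized bit (position assumed)
COUNT_MASK = (1 << 26) - 1  # the remaining low bits hold the count


def pack(max_index, locked, inited, count):
    """Build the 32-bit per-pixel word from its four fields."""
    if not (0 <= max_index < 16 and 0 <= count <= COUNT_MASK):
        raise ValueError("field out of range")
    word = (max_index << MAX_INDEX_SHIFT) | count
    if locked:
        word |= RESERVED
    if inited:
        word |= INITED
    return word


def unpack(word):
    """Return (max_index, locked, inited, count)."""
    return ((word >> MAX_INDEX_SHIFT) & 0xF,
            bool(word & RESERVED),
            bool(word & INITED),
            word & COUNT_MASK)
```

Four bits for the max index are enough to address any slot of a k-buffer with k up to 16.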

Slide 32

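Spinlock Mutex

[allow_uav_condition]
for(; i …)   // loop header truncated in the source
{
    uint oldID;
    // Try: atomically mark the pixel as reserved and read back the previous state
    InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID );
    if( (oldID & RESERVED) != RESERVED )
    {
        // … Do work
        DeviceMemoryBarrier();
        // Release: store the new max index and clear the mutex bit
        tRWMutex[vScreenAddress] = (new_max_id << 28) + INITED;
        bStop = true;
    } // end mutex check
} // end spinlock loop

Annotations on the slide: Paranoia, Try, Do Work, Release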

Slide 33

Find New Max Depth

uint new_max_depth = u_inDepth;
[unroll] for(int t=0; t …)   // loop header truncated in the source
{
    // Scan the k-buffer entries of this pixel for the furthest one
    uint element_depth = DEPTH( vScreenAddress, t );
    if( element_depth > new_max_depth )
    {
        new_max_depth = element_depth;
        new_max_id = t;
    }
}

Generally more memory traffic than PPLL

Slide 34

Initialization: The first k
Options
● Clear k-buffer fullscreen (0,1)
● Clear k-buffer stenciled, 3rd pass
● Clear on first fragment
● Count
[Diagram: 32-bit word, from the high bit: max index (4 bits), mutex bit, initialized bit, count (remainder)]

Slide 35

The first k

InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount);
[allow_uav_condition]
if(oldCount < KBUFFER_SIZE)
{
    DATA(vScreenAddress,oldCount) = u_inData;
    DEPTH(vScreenAddress,oldCount) = u_inDepth;
    return uint2(u_outDepth,u_outData);
}

[Diagram: 32-bit word, from the high bit: max index (4 bits), mutex bit, initialized bit, count (remainder)]

Slide 36

Models
● Stats
  ● 2k polygons
  ● 2-3.5 M fragments
  ● 200-300k pixels
● Shading
  ● One point light & shadow
  ● 2 shifted specular lobes
[Images: two models, ~130k hairs and ~20k hairs]

Slide 37

Depth Complexity

Color    Fragments per pixel
Grey     1
Blue     8
Green    50
Red      100+

Slide 38

Contention
Max attempts per pixel, k=4

Color          Attempts
Dark Blue      1
Aqua           <=4
Bright Aqua    <=8

Slide 39

Performance
Time ratio to out-of-order blending
● Forward PPLL: 1.02 to 1.4
● Forward k-Buffer: 1.2 to 1.4
● Deferred PPLL: 0.7 to 0.9
● Deferred k-Buffer: 0.9 to 1.6

Slide 40

K-Buffer in Memory
● Simple memory bound
● Can be less memory
● Usually slower
  ● Increased memory traffic

Slide 41

Simulation

Slide 42

Hair Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)

Slide 43

Fur Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)

Slide 44

Grass Simulation
● Length Constraint
● Local Constraint (1D)
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)

Slide 45

Constraint Method (iterative)
[Diagram: strand vertices p0 … pn-1 joined by constraints C0 … Cn-2]
● Used for length, local and global constraints
● Length is most difficult to converge
  ● particularly under large movement
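
The convergence problem is easy to reproduce with a generic iterative length solver. This Python sketch is a position-based relaxation written for illustration. Pinning the root, splitting each correction equally between the two endpoints, and the function names are assumptions, not the TressFX implementation.

```python
import math


def enforce_lengths(points, rest_lengths, iterations=10):
    """Iteratively restore segment lengths; points[0] is the pinned root."""
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i, rest in enumerate(rest_lengths):
            a, b = pts[i], pts[i + 1]
            delta = [bj - aj for aj, bj in zip(a, b)]
            dist = math.sqrt(sum(d * d for d in delta)) or 1e-12
            corr = [(dist - rest) / dist * d for d in delta]
            if i == 0:
                # Root is fixed, so the child absorbs the whole correction.
                pts[1] = [bj - cj for bj, cj in zip(b, corr)]
            else:
                # Otherwise split the correction between both endpoints.
                pts[i] = [aj + 0.5 * cj for aj, cj in zip(a, corr)]
                pts[i + 1] = [bj - 0.5 * cj for bj, cj in zip(b, corr)]
    return pts


def max_stretch(pts, rest_lengths):
    """Largest absolute length error over all segments."""
    return max(abs(math.dist(pts[i], pts[i + 1]) - r)
               for i, r in enumerate(rest_lengths))
```

Starting from a strand stretched to twice its rest length, a single pass still leaves a full unit of error on some segment. The error shrinks only gradually with further passes, which is the behaviour the slide describes for large movements.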

Slide 46

Tridiagonal Matrix Formulation
● Direct solve for length constraint
  ● Almost zero stretch
  ● Limited to smaller time steps (stability)
● Still cheap
  ● Leverages matrix structure of strands
  ● Two sweeps of strand
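
A direct tridiagonal solve needs only one forward pass and one backward pass, which is presumably what "two sweeps" refers to. The Python sketch below is the textbook Thomas algorithm, shown for orientation only. It is not the specific system the authors assemble for hair strands, and the function name is hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with one forward and one backward sweep.

    lower[i] multiplies x[i-1] (lower[0] is unused) and upper[i] multiplies
    x[i+1] (upper[-1] is unused). No pivoting: assumes a well-conditioned,
    for example diagonally dominant, matrix.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):           # forward sweep: eliminate the lower diagonal
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # backward sweep: back-substitute
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The cost is linear in the number of strand vertices, in line with the "still cheap" claim above.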

Slide 47

Tridiagonal Matrix Formulation
“Tridiagonal Matrix Formulation for
Inextensible Hair Strand Simulation”,
VRIPHYS, 2013

Slide 48

Demos

Slide 49

Summary
● Next-gen look is possible now!
● Deferred Rendering for shading LOD is fastest
● k-buffer in memory is an option for memory-constrained situations
● High-quality grass and fur simulation with compute
● Upcoming TressFX 2 SDK sample update with fur scenario at
http://developer.amd.com/tools-and-sdks/graphicsdevelopment/amd-radeon-sdk/

Slide 50

Questions?

Slide 51

Extras

Slide 52

Isoline Tessellation for hair/fur? 1/2
● Isoline tessellation has two tess factors
  ● First is line density (lines per invocation)
  ● Second is line detail (segments per line)
● In theory provides easy LOD system
  ● Variable line density and detail by increasing both tessellation factors based on distance
[Images: Tess = (1,1), Tess = (2,1), Tess = (2,2), Tess = (2,3), Tess = (3,3)]

Slide 53

Isoline Tessellation for hair/fur? 2/2
● In practice isoline tessellation is not cost effective for this scenario
● Lines are always 1-pixel thick
  ● Need GS to extrude them into triangles for smooth edges
    ● Major impact on performance!
  ● Alternative is to enable MSAA
    ● Most engines are deferred so this causes a large performance impact
  ● No extrusion for smoothing edges and no MSAA = poor quality!
● Bottom line: a pure Vertex Shader solution is faster
  ● LOD benefit is easily done in VS (more on this later)
  ● Curvature is rarely a problem (dependent on vertices/strands at authoring time)

Slide 54

AA, Self-shadowing and Transparency
[Image comparison: Basic Rendering; Antialiasing; Antialiasing + Self Shadowing; Antialiasing + Self Shadowing + Transparency]
