Grass, Fur and all things hairy Nicolas Thibieroz Gaming Engineering Manager, AMD Karl Hillesland Senior Research Engineer, AMD.


Slide 1

Grass, Fur and all things hairy
Nicolas Thibieroz

Gaming Engineering Manager, AMD

Karl Hillesland

Senior Research Engineer, AMD


Slide 2

Next-gen Grass, Fur and Hair



The time for next-gen quality is now
Tomb Raider pioneered next-gen hair


Even on PS4/XB1

Users expect this level of quality for next-gen titles
● You need to start thinking about this
● This talk is about making high-quality fur,
grass and hair run at real-time performance



Slide 3

TressFX applied to Grass, Fur and Hair
Variations of the same technique can be used for all those applications
● In all cases the core principles of next-gen quality are still needed:
  ● Compute simulations
  ● Anti-aliasing
  ● Transparency
  ● Volumetric self-shadowing
  ● A good lighting model


Slide 4

Forward Rendering Pipeline – a refresher


Consists of three steps:
1. Hair simulation
2. Shade and store fragments into buffers
3. Fetch shaded fragments, sort and render


Slide 5

Per-Pixel Linked Lists


Head UAV
● Each pixel location has a “head pointer” to a linked list in the PPLL UAV

PPLL UAV
● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
● A link is created to the fragment pointed to by the head pointer
● Head pointer then points to the new fragment

// Retrieve current pixel count and increase counter
uint uPixelCount = LinkedListUAV.IncrementCounter();
uint uOldStartOffset;
// Exchange indices in LinkedListHead texture corresponding to pixel location
InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
// Append new element at the end of the Fragment and Link Buffer
Element.uNext = uOldStartOffset;
LinkedListUAV[uPixelCount] = Element;
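
The insertion above can be mirrored on the CPU to see how the head pointers and the node pool interact. This is a minimal single-threaded sketch, not the shader itself: the class name, the END_OF_LIST sentinel and the drop-on-overflow behaviour are assumptions made for illustration.

```python
END_OF_LIST = 0xFFFFFFFF  # assumed sentinel meaning "no fragment"


class PerPixelLinkedLists:
    """CPU stand-in for the Head UAV plus the PPLL UAV."""

    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [END_OF_LIST] * (width * height)  # Head UAV
        self.nodes = [None] * max_nodes               # PPLL UAV (node pool)
        self.counter = 0                              # UAV counter

    def insert(self, x, y, depth, data):
        # Take the next open slot, like IncrementCounter().
        if self.counter >= len(self.nodes):
            return False  # pool exhausted: this fragment is lost
        index = self.counter
        self.counter += 1
        # Swap the head pointer, like InterlockedExchange().
        address = y * self.width + x
        old_head = self.head[address]
        self.head[address] = index
        # The new node links to whatever the head pointed at before.
        self.nodes[index] = (depth, data, old_head)
        return True

    def fragments(self, x, y):
        # Walk the list from the head pointer (newest fragment first).
        result = []
        index = self.head[y * self.width + x]
        while index != END_OF_LIST:
            depth, data, next_index = self.nodes[index]
            result.append((depth, data))
            index = next_index
        return result
```

Fragments come back in reverse submission order, not depth order, which is why the later pass has to sort them.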



Slide 6

Forward Rendering Pipeline – a refresher
Hair Simulation

[Diagram: input geometry (model space) and simulation parameters feed a chain of compute shaders (CS), which write post-simulation geometry (world space) to a UAV]


Slide 7

Forward Rendering Pipeline – a refresher
Shade and Store fragments into Buffers

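[Diagram: the VS takes world-space geometry to homogeneous clip space, extruding line segments into non-indexed triangles. The PS computes coverage, lighting and shadows, and stores depth, color, coverage and next in the PPLL UAV, updating the Head UAV. Also shown: null RT and stencil.]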


Slide 8

Forward Rendering Pipeline – a refresher
Fetch shaded fragments, sort and render
[Diagram: a full-screen quad passes through the VS; the PS reads the Head UAV and PPLL UAV, performs fragment sorting and manual blending, and writes to the render target. Also shown: stencil.]


Slide 9

Forward Rendering Performance
Main cost in forward rendering mode is in the shading part
● All fragments are lit and shadowed before being stored
● PPLL storing is typically not the bottleneck!
Don’t need maximum quality on all fragments
● “Tail” fragments need only “good enough” quality
Solution: Use shader LOD


Slide 10

Forward vs Deferred Rendering Pipeline
Forward rendering pipeline
● Hair simulation
● Full shading and store fragments into buffers
● Fetch shaded fragments, sort and render

Deferred rendering pipeline
● Hair simulation
● Store fragment properties into buffers
● Fetch fragment properties, sort, shade and render
  ● Full shading on K-frontmost fragments
  ● “Tail” fragments are shaded with a simpler light equation and shadowing algorithm


Slide 11

Deferred Rendering Pipeline
Hair Simulation – unchanged!
[Diagram: input geometry (model space) and simulation parameters feed a chain of compute shaders (CS), which write post-simulation geometry (world space) to a UAV]


Slide 12

Deferred Rendering Pipeline – a refresher
Store Fragment Properties into Buffers

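[Diagram: the VS takes world-space geometry to homogeneous clip space, drawing an indexed triangle list from an index buffer. The PS stores depth, tangent, coverage and next in the PPLL UAV, updating the Head UAV. Also shown: coverage, null RT and stencil.]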


Slide 13

Deferred Rendering Pipeline

Fetch fragments, sort, shade and render
[Diagram: a full-screen quad passes through the VS; the PS reads the Head UAV and PPLL UAV, applies lighting and shadows, and writes to the render target. Also shown: stencil.]
● K frontmost fragments: full shading, sorting and manual blending
● Tail fragments: cheap shading, no sorting and manual blending


Slide 14

Deferred Rendering Shading LOD Optimization


Deferred approach allows a reduction in shading cost (“Shader LOD”)
● Only sort and shade K frontmost fragments at high quality
● “Simple” shading and out-of-order rendering on tail fragments
● Single-tap shadowing on tail fragments
Very little quality difference compared to full shading
● But much better performance!

Technique                   Cost
Out of order, no shading    1.31 ms
Out of order, shading       2.80 ms
Forward PPLL, shading       3.38 ms
Deferred PPLL, shading      2.13 ms   (Fast!)

Shading cost is ~1.5 ms (2.80 ms - 1.31 ms)
PPLL cost is ~0.58 ms (3.38 ms - 2.80 ms)

Fur model with ~130,000 fur strands
Running on AMD Radeon 7970 @ 1080p
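
The shader LOD idea can be sketched per pixel on the CPU. This is only an illustration under assumptions: fragments are dicts with "depth" and "alpha", colors are single floats, and full_shade and cheap_shade are hypothetical callbacks standing in for the expensive and simplified lighting and shadowing paths.

```python
import heapq


def resolve_pixel(fragments, k, full_shade, cheap_shade, background=0.0):
    """Blend one pixel: full shading on the k frontmost, cheap on the tail."""
    # Indices of the k fragments closest to the camera (smallest depth).
    front = heapq.nsmallest(k, range(len(fragments)),
                            key=lambda i: fragments[i]["depth"])
    front_set = set(front)
    color = background
    # Tail fragments: cheap shading, blended in arrival order (no sort).
    for i, frag in enumerate(fragments):
        if i not in front_set:
            a = frag["alpha"]
            color = color * (1.0 - a) + cheap_shade(frag) * a
    # Front k: full shading, blended back to front on top of the tail.
    for i in sorted(front, key=lambda i: fragments[i]["depth"], reverse=True):
        frag = fragments[i]
        a = frag["alpha"]
        color = color * (1.0 - a) + full_shade(frag) * a
    return color
```

With an average of about 20 fragments per pixel and k of 4, 8 or 16, only a fraction of the fragments go through the expensive path. That is consistent with the deferred timing above, though the sketch itself measures nothing.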


Slide 15

Full quality shading forced on
for all fragments

Shading LOD


Slide 16

Geometry Optimizations


A great portion of time was spent in the GPU front-end
● Expansion from line segments to triangles was done in GS and then VS with Draw()
● 920,000 line segments for fur model
● Each segment would create a quad (two triangles) with 6 vertices
Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
Draw() method
[Diagram: line segments 0, 1, 2 expanded into quads, each quad with its own six vertices]
Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };

DrawIndexed() method
[Diagram: line segments expanded into quads that share vertices 0 to 5 along the strand]
Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
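
The indexed triangle list follows a regular pattern, so the offline index buffer can be generated with a few lines. A minimal sketch, assuming two extruded vertices per strand vertex and one independent polyline per strand; the function names are illustrative only.

```python
def build_strand_indices(num_segments, base_vertex=0):
    """Index buffer for one strand: two triangles per line segment."""
    indices = []
    for s in range(num_segments):
        v = base_vertex + 2 * s  # first extruded vertex of this segment
        indices += [v, v + 1, v + 2,      # first triangle of the quad
                    v + 2, v + 1, v + 3]  # second triangle, sharing an edge
    return indices


def vertex_counts(num_segments, num_strands):
    """Vertices processed without and with sharing (independent strands assumed)."""
    non_indexed = 6 * num_segments
    indexed_unique = 2 * (num_segments + num_strands)
    return non_indexed, indexed_unique
```

For two segments this reproduces the list on the slide: ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ).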


Slide 17

Distance-based LOD system Optimization





Input line segments have a random order
Just render fewer (but thicker) fragments when far away!
Needs shading adjustments to ensure smooth quality transitions
Increase alpha threshold for fragment inclusion when far away
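
One way to picture the distance-based LOD is below. It is a sketch under assumptions: the linear falloff, the near/far distances and min_fraction are hypothetical tuning choices, not values from the talk. Because the input is in random order, drawing only the first N strands gives an even thinning of the model.

```python
def strand_lod(distance, near, far, min_fraction=0.1):
    """Return (fraction of strands to draw, thickness multiplier)."""
    if far <= near:
        raise ValueError("far must be greater than near")
    # 0 at the near distance, 1 at the far distance.
    t = min(max((distance - near) / (far - near), 0.0), 1.0)
    fraction = 1.0 - t * (1.0 - min_fraction)
    # Thicken the remaining strands so overall coverage stays similar.
    thickness_scale = 1.0 / fraction
    return fraction, thickness_scale


def strands_to_draw(total_strands, fraction):
    """Number of strands to submit, taken from the start of the shuffled buffer."""
    return max(1, int(total_strands * fraction))
```

The alpha threshold adjustment mentioned above would be driven by the same distance factor; it is left out here to keep the sketch short.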


Slide 18

Other Optimizations


PPLL Head UAV uses a RWTexture2D instead of a Buffer
● Results in more efficient caching for UAV accesses
Avoid GPR indexing for sorting
● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
● Used an ALU-based indexing approach to improve performance
TO DO: compute shader simulation optimizations
● Currently a set of multiple compute shaders
● Looking at combining some of these, optimizing shaders and output formats


Slide 19

Per-Pixel Linked Lists UAV Memory Considerations


How much memory is needed?
● Guesstimate for a given usage model
● Max (hair pixels x average overdraw) fragments
What happens when I run out?
● Missing fragments
What can be done about it?


Slide 20

k-Buffer in Memory


Slide 21

PP Linked-List (PPLL)
● Node pool holding all fragments
● How big?
k-Buffer
● Fixed size array of k entries per pixel
● Simple memory bound


Slide 22

The Front k
Approximation to avoid massive sorting
● Only sort the front k fragments per-pixel
● Blend the rest out-of-order
If deferring for shader LOD … also

Full quality shade on front k

Cheap shade on rest

k is 4, 8, 16

20 frags/pixel (ave)
Red = over 100


Slide 23

The Front k
Approximation to avoid massive sorting
● Only sort the front k fragments per-pixel
● Blend the rest out-of-order
If deferring for shader LOD … also

Full quality shade on front k

Cheap shade on rest
Can’t know front k
until all fragments processed

k-Buffer

Tail


Slide 24

For Each Fragment in Each Pixel
[Diagram: new fragment, k-buffer with index of furthest entry, tail fragment, blend, tail color]


Slide 25

If New Fragment in k
[Diagram: new fragment, k-buffer with index of furthest entry, tail fragment, blend, tail color]
If in k:
1. Swap with furthest
2. Find new furthest
3. Blend with tail


Slide 26

If not in k
[Diagram: new fragment, k-buffer with index of furthest entry, tail fragment, blend, tail color]
If not in k:
1. Blend with tail
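
The two cases above can be combined into one per-fragment update. The sketch below is a CPU illustration, not the shader: fragments are (depth, color, alpha) tuples with scalar colors, and the tail is accumulated as a color plus a transmittance, which is an assumed representation of out-of-order blending.

```python
from dataclasses import dataclass, field


@dataclass
class PixelState:
    k: int
    frags: list = field(default_factory=list)  # the k-buffer
    furthest: int = -1                          # index of furthest entry
    tail_color: float = 0.0
    tail_transmittance: float = 1.0


def find_furthest(state):
    return max(range(len(state.frags)), key=lambda i: state.frags[i][0])


def blend_tail(state, frag):
    _, color, alpha = frag
    state.tail_color = state.tail_color * (1.0 - alpha) + color * alpha
    state.tail_transmittance *= (1.0 - alpha)


def add_fragment(state, frag):
    if len(state.frags) < state.k:            # the first k just fill the buffer
        state.frags.append(frag)
        state.furthest = find_furthest(state)
        return
    if frag[0] < state.frags[state.furthest][0]:
        # In k: swap with the furthest entry, then find the new furthest.
        frag, state.frags[state.furthest] = state.frags[state.furthest], frag
        state.furthest = find_furthest(state)
    blend_tail(state, frag)                   # either way, one fragment joins the tail


def resolve(state, background=0.0):
    color = background * state.tail_transmittance + state.tail_color
    for _, c, a in sorted(state.frags, reverse=True):  # back to front
        color = color * (1.0 - a) + c * a
    return color
```

Whatever the arrival order, the k-buffer ends up holding the k nearest fragments; only the blend order within the tail depends on arrival order.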


Slide 27

From PPLL to k-Buffer
For each pixel:
    Write frags to mem
For each fragment in each pixel:
    read fragment from mem
    update k-buffer (reg) (mem)
    blend tail fragment (reg) (mem)
Read k-buffer from mem
Sort and blend k-buffer (reg)


Slide 28

k-Buffer: Screen Width x Screen Height x k entries, k = 4, 8, 16
● k-buffer entries are 8 bytes each (depth and data)
● PPLL nodes were 12 bytes (depth, data, next)
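
The two footprints can be compared with simple arithmetic. A rough sketch using the entry sizes above; the resolution and fragment count in the test values are example inputs, and the head-pointer and mutex buffers are left out.

```python
def kbuffer_bytes(width, height, k, entry_bytes=8):
    """Fixed cost: k entries for every pixel on screen."""
    return width * height * k * entry_bytes


def ppll_bytes(max_fragments, node_bytes=12):
    """Cost of the node pool, which must be sized for the worst expected frame."""
    return max_fragments * node_bytes
```

The k-buffer figure depends only on resolution and k, which is the "simple memory bound". The PPLL figure depends on how many fragments a frame produces, so it has to be estimated in advance.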


Slide 29

PPLL: 2nd Pass
[Diagram: k-buffer with index of furthest entry, new fragment, tail fragment, blend, tail color. Label: Registers]


Slide 30

k-Buffer in Memory: 1st Pass
[Diagram: k-buffer, new fragment, tail fragment, tail color, and a value holding the mutex and index of furthest. Labels: Memory, Blend Unit]


Slide 31

Mutex/Count/Index Buffer: one 32-bit value per pixel (Screen Width x Screen Height)
From the high bit down:
● Max Index (4 bits)
● Mutex Bit
● Initialized Bit
● Count (remainder)
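
The layout can be made concrete with pack and unpack helpers. The 28-bit shift for the max index matches the (new_max_id<<28) expression in the spinlock code later in the talk; the exact positions chosen here for the mutex and initialized bits (27 and 26) are assumptions.

```python
MAX_INDEX_SHIFT = 28        # top 4 bits hold the index of the furthest entry
RESERVED = 1 << 27          # mutex bit (position assumed)
INITED = 1 << 26            # initialized bit (position assumed)
COUNT_MASK = INITED - 1     # remaining low bits hold the count


def pack(max_index, locked, inited, count):
    if not 0 <= max_index < 16:
        raise ValueError("max_index must fit in 4 bits")
    if not 0 <= count <= COUNT_MASK:
        raise ValueError("count does not fit in the remaining bits")
    word = (max_index << MAX_INDEX_SHIFT) | count
    if locked:
        word |= RESERVED
    if inited:
        word |= INITED
    return word


def unpack(word):
    return ((word >> MAX_INDEX_SHIFT) & 0xF,
            bool(word & RESERVED),
            bool(word & INITED),
            word & COUNT_MASK)
```

Four bits for the max index are enough for k up to 16, the largest k the talk uses.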


Slide 32

Spinlock Mutex

Paranoia

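[allow_uav_condition]
// The loop header was truncated in this transcript; the bound
// MAX_SPIN_ITERATIONS and the bStop test are assumed.
for( ; i < MAX_SPIN_ITERATIONS && !bStop; ++i )
{
    uint oldID;
    // Try to take the lock: mark the pixel reserved and fetch its previous state
    InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID );
    if( (oldID & RESERVED) != RESERVED )
    {
        // [[ … Do work ]]
        DeviceMemoryBarrier();
        // Release the lock, storing the new max index and the initialized bit
        tRWMutex[vScreenAddress] = (new_max_id << 28) + INITED;
        bStop = true;
    } // end mutex check
} // end spinlock loop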

Try

Do Work
Release


Slide 33

Find New Max Depth
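uint new_max_depth = u_inDepth;
// The loop bound was truncated in this transcript; KBUFFER_SIZE
// (used on a later slide) is assumed.
[unroll] for( int t = 0; t < KBUFFER_SIZE; t++ )
{
    uint element_depth = DEPTH( vScreenAddress, t );
    if( element_depth > new_max_depth )
    {
        new_max_depth = element_depth;
        new_max_id = t;
    }
}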

Generally more
memory traffic
than PPLL


Slide 34

Initialization: The first k
Options
● Clear k-buffer fullscreen (0,1)
● Clear k-buffer stenciled, 3rd pass
● Clear on first fragment
● Count

32-bit value, from the high bit down: Max Index (4 bits), Mutex Bit, Initialized Bit, Count (remainder)


Slide 35

The first k

InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount );
[allow_uav_condition]
if( oldCount < KBUFFER_SIZE )
{
    DATA( vScreenAddress, oldCount ) = u_inData;
    DEPTH( vScreenAddress, oldCount ) = u_inDepth;
    return uint2( u_outDepth, u_outData );
}

32-bit value, from the high bit down: Max Index (4 bits), Mutex Bit, Initialized Bit, Count (remainder)


Slide 36

Models
Stats

2k polygons

2-3.5 M fragments
200-300k pixels

Shading

One point light & shadow
2 shifted specular lobes
~130k hairs

~20k hairs


Slide 37

Depth Complexity

Grey: 1
Blue: 8
Green: 50
Red: 100+


Slide 38

Contention

Max attempts per pixel, k=4
Dark Blue: 1
Aqua: <=4
Bright Aqua: <=8


Slide 39

Performance
Time ratio to out-of-order blending





Forward PPLL:       1.02 to 1.4
Forward k-Buffer:   1.2 to 1.4
Deferred PPLL:      0.7 to 0.9
Deferred k-Buffer:  0.9 to 1.6


Slide 40

K-Buffer in Memory




Simple memory bound
Can be less memory
Usually slower
● Increased memory traffic


Slide 41

Simulation


Slide 42

Hair Simulation







Length Constraint
Local Constraint
Global Constraint
Model Transform
Collision Shapes
External Forces (wind, gravity, etc.)


Slide 43

Fur Simulation







Length Constraint
Local Constraint
Global Constraint
Model Transform
Collision Shapes
External Forces (wind, gravity, etc.)


Slide 44

Grass Simulation







Length Constraint
Local Constraint (1D)
Global Constraint
Model Transform
Collision Shapes
External Forces (wind, gravity, etc.)


Slide 45

Constraint Method (iterative)
[Diagram: strand vertices p0 … pn-1 joined by constraints C0 … Cn-2]
Used for length, local and global constraints
Length is most difficult to converge
● particularly under large movement
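
To show why the length constraint converges slowly, here is a minimal position-based sketch of one possible iterative scheme. It is an assumed formulation for illustration, not the TressFX shader: the root vertex is pinned, interior corrections are split evenly between the two endpoints, and points are 3-component sequences.

```python
import math


def enforce_length(points, rest_lengths, iterations=1):
    """Iteratively pull each segment of a strand back to its rest length."""
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts) - 1):
            a, b = pts[i], pts[i + 1]
            delta = [b[j] - a[j] for j in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue  # degenerate segment, no direction to correct along
            stretch = (dist - rest_lengths[i]) / dist
            if i == 0:
                # Root is attached to the model: move only the child vertex.
                for j in range(3):
                    b[j] -= delta[j] * stretch
            else:
                for j in range(3):
                    a[j] += 0.5 * delta[j] * stretch
                    b[j] -= 0.5 * delta[j] * stretch
    return pts
```

Fixing one segment disturbs its neighbour, so a single pass over a badly stretched strand leaves visible error and many iterations are needed. That is the motivation for a direct solve.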


Slide 46

Tridiagonal Matrix Formulation


Direct solve for length constraint
● Almost zero stretch
● Limited to smaller time steps (stability)
Still cheap
● Leverages matrix structure of strands
● Two sweeps of strand
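
The "two sweeps" correspond to the standard way of solving a tridiagonal system: one forward elimination pass and one back substitution pass. The sketch below is the generic Thomas algorithm, shown only to illustrate the cost structure. How the length constraints are assembled into such a system is described in the referenced paper and is not reproduced here.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system; all inputs have length n.

    lower[i] is the entry left of the diagonal in row i (lower[0] unused),
    upper[i] is the entry right of the diagonal in row i (upper[n-1] unused).
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    # Sweep 1: forward elimination from root to tip.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Sweep 2: back substitution from tip to root.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Each sweep touches every vertex once, so the cost grows linearly with strand length.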


Slide 47

Tridiagonal Matrix Formulation
“Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation”, VRIPHYS, 2013


Slide 48

Demos


Slide 49

Summary
Next-gen look is possible now!
● Deferred Rendering for shading LOD is fastest
● k-buffer in memory is an option for memory-constrained
situations
● High-quality grass and fur simulation with compute


Upcoming TressFX 2 SDK sample update with fur scenario at
http://developer.amd.com/tools-and-sdks/graphicsdevelopment/amd-radeon-sdk/


Slide 50

Questions?


Slide 51

Extras


Slide 52

Isoline Tessellation for hair/fur? 1/2


Isoline tessellation has two tess factors
● First is line density (lines per invocation)
● Second is line detail (segments per line)
In theory provides easy LOD system
● Variable line density and detail by increasing both tessellation factors based on distance

[Figure: isoline output at Tess = (1,1), (2,1), (2,2), (2,3) and (3,3)]


Slide 53

Isoline Tessellation for hair/fur? 2/2



In practice isoline tessellation is not cost effective for this scenario
Lines are always 1-pixel thick
● Need GS to extrude them into triangles for smooth edges
  ● Major impact on performance!
● Alternative is to enable MSAA
  ● Most engines are deferred so this causes a large performance impact
● No extrusion for smoothing edges and no MSAA = poor quality!
Bottom line: a pure Vertex Shader solution is faster
● LOD benefit is easily done in VS (more on this later)
● Curvature is rarely a problem (dependent on vertices/strands at authoring time)


Slide 54

AA, Self-shadowing and Transparency

[Figure: four renders compared: Basic Rendering; Antialiasing; Antialiasing + Self Shadowing; Antialiasing + Self Shadowing + Transparency]