Grass, Fur and all things hairy Nicolas Thibieroz Gaming Engineering Manager, AMD Karl Hillesland Senior Research Engineer, AMD.
Slide 1
Grass, Fur and all things hairy
Nicolas Thibieroz
Gaming Engineering Manager, AMD
Karl Hillesland
Senior Research Engineer, AMD
Slide 2
Next-gen Grass, Fur and Hair
● The time for next-gen quality is now
  ● Tomb Raider pioneered next-gen hair
  ● Even on PS4/XB1
● Users expect this level of quality for next-gen titles
● You need to start thinking about this
● This talk is about making high-quality fur, grass and hair run at real-time performance
Slide 3
TressFX applied to Grass, Fur and Hair
● Variations of the same technique can be used for all those applications
● In all cases the core principles of next-gen quality are still needed:
  ● Compute simulations
  ● Anti-aliasing
  ● Transparency
  ● Volumetric self-shadowing
  ● A good lighting model
Slide 4
Forward Rendering Pipeline – a refresher
● Consists of three steps:
  ● Hair simulation
  ● Shade and store fragments into buffers
  ● Fetch shaded fragments, sort and render
Slide 5
Per-Pixel Linked Lists
● Head UAV
  ● Each pixel location has a “head pointer” to a linked list in the PPLL UAV
● PPLL UAV
  ● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
  ● A link is created to the fragment pointed to by the head pointer
  ● Head pointer then points to the new fragment
// Retrieve current pixel count and increase counter
uint uPixelCount = LinkedListUAV.IncrementCounter();
uint uOldStartOffset;
// Exchange indices in LinkedListHead texture corresponding to pixel location
InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
// Append new element at the end of the Fragment and Link Buffer
Element.uNext = uOldStartOffset;
LinkedListUAV[uPixelCount] = Element;
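The insertion steps above can be modelled on the CPU. The following is a minimal single-threaded Python sketch, not the shader itself: the class name, the `END_OF_LIST` sentinel and the overflow check are assumptions added for illustration, and plain assignments stand in for `IncrementCounter()` and `InterlockedExchange()`.

```python
END_OF_LIST = 0xFFFFFFFF  # assumed sentinel meaning "no next node"


class PerPixelLinkedList:
    """CPU model of the head buffer plus node pool (hypothetical helper)."""

    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [END_OF_LIST] * (width * height)  # models the Head UAV
        self.nodes = [None] * max_nodes               # models the PPLL UAV
        self.counter = 0                              # models the UAV counter

    def insert(self, x, y, depth, data):
        # IncrementCounter() returns the value before the increment.
        index = self.counter
        self.counter += 1
        if index >= len(self.nodes):
            return False  # pool exhausted: this fragment is dropped
        address = y * self.width + x
        # Models InterlockedExchange: read the old head, write the new one.
        old_head = self.head[address]
        self.head[address] = index
        # The new node links to whatever the head pointed to before.
        self.nodes[index] = (depth, data, old_head)
        return True

    def fragments(self, x, y):
        """Walk the list for one pixel, newest fragment first."""
        index = self.head[y * self.width + x]
        while index != END_OF_LIST:
            depth, data, next_index = self.nodes[index]
            yield depth, data
            index = next_index
```

Walking a pixel's list returns fragments in reverse submission order, which is why the resolve pass still has to sort them by depth.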
[Diagram: Head UAV entries pointing into the PPLL UAV]
Slide 6
Forward Rendering Pipeline – a refresher
Hair Simulation
[Diagram: simulation parameters and input geometry (model space) -> CS -> CS -> CS -> post-simulation geometry (UAV, world space)]
Slide 7
Forward Rendering Pipeline – a refresher
Shade and Store fragments into Buffers
[Diagram: post-simulation geometry (world space) -> VS (extrusion from line segments to non-indexed triangles, homogeneous clip space) -> PS (lighting, shadows, coverage; null RT, stencil) -> Head UAV and PPLL UAV nodes (depth, color, coverage, next)]
Slide 8
Forward Rendering Pipeline – a refresher
Fetch shaded fragments, sort and render
[Diagram: full screen quad -> VS -> PS (reads Head UAV and PPLL UAV; fragment sorting and manual blending; stencil) -> render target]
Slide 9
Forward Rendering Performance
● Main cost in forward rendering mode is in the shading part
  ● All fragments are lit and shadowed before being stored
  ● PPLL storing is typically not the bottleneck!
● Don’t need maximum quality on all fragments
  ● “Tail” fragments need only “good enough” quality
● Solution: Use shader LOD
Slide 10
Forward vs Deferred Rendering Pipeline
Forward rendering pipeline
● Hair simulation
● Full shading and store fragments into buffers
● Fetch shaded fragments, sort and render
Deferred rendering pipeline
● Hair simulation
● Store fragment properties into buffers
● Fetch fragment properties, sort, shade and render
  ● Full shading on K-frontmost fragments
  ● “Tail” fragments are shaded with a simpler light equation and shadowing algorithm
Slide 11
Deferred Rendering Pipeline
Hair Simulation – unchanged!
[Diagram: simulation parameters and input geometry (model space) -> CS -> CS -> CS -> post-simulation geometry (UAV, world space)]
Slide 12
Deferred Rendering Pipeline – a refresher
Store Fragment Properties into Buffers
[Diagram: post-simulation geometry (world space) and index buffer (indexed triangle list) -> VS (homogeneous clip space) -> PS (coverage; null RT, stencil) -> Head UAV and PPLL UAV nodes (depth, tangent, coverage, next)]
Slide 13
Deferred Rendering Pipeline
Fetch fragments, sort, shade and render
[Diagram: full screen quad -> VS -> PS (reads Head UAV and PPLL UAV; lighting, shadows; stencil) -> render target]
● K frontmost fragments: full shading, sorting and manual blending
● Tail fragments: cheap shading, no sorting and manual blending
Slide 14
Deferred Rendering Shading LOD Optimization
● Deferred approach allows a reduction in shading cost (“Shader LOD”)
  ● Only sort and shade K frontmost fragments at high quality
  ● “Simple” shading and out-of-order rendering on tail fragments
  ● Single-tap shadowing on tail fragments
● Very little quality difference compared to full shading
  ● But much better performance!

Technique                  Cost
Out of order, no shading   1.31 ms
Out of order, shading      2.80 ms
Forward PPLL, shading      3.38 ms
Deferred PPLL, shading     2.13 ms   (Fast!)

Shading cost is ~1.5 ms (2.80 ms - 1.31 ms)
PPLL cost is ~0.58 ms (3.38 ms - 2.80 ms)
Fur model with ~130,000 fur strands
Running on AMD Radeon 7970 @ 1080p
Slide 15
[Image comparison: full quality shading forced on for all fragments vs. shading LOD]
Slide 16
Geometry Optimizations
● A great portion of time was spent in the GPU front-end
  ● Expansion from line segments to triangles was done in GS and then VS with Draw()
    ● 920,000 line segments for fur model
    ● Each segment would create a quad (two triangles) with 6 vertices
  ● Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!

Draw() method: line segments are expanded to quads whose vertices are not shared
Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };

DrawIndexed() method: line segments are expanded to quads that share vertices with their neighbours
Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
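The indexed triangle list above follows a regular pattern that can be generated offline. Below is a small Python sketch of one way to build it for a single strand; the function names are hypothetical, and it assumes two extruded vertices per strand point, numbered in order along the strand.

```python
def strand_index_buffer(num_segments):
    """Indices for one strand extruded into a strip of quads.

    Assumes vertices 2*s and 2*s + 1 are the two extruded corners at
    strand point s, so neighbouring quads reuse two vertices.
    """
    indices = []
    for s in range(num_segments):
        v = 2 * s
        indices += [v, v + 1, v + 2]      # first triangle of the quad
        indices += [v + 2, v + 1, v + 3]  # second triangle of the quad
    return indices


def draw_vertex_count(num_segments):
    """Vertices processed by the non-indexed Draw() path: 6 per segment."""
    return 6 * num_segments


def indexed_unique_vertex_count(num_segments):
    """Unique vertices referenced by the indexed path for one strand."""
    return 2 * (num_segments + 1)
```

For the 920,000 segments quoted on the slide, the non-indexed path processes 6 x 920,000 = 5,520,000 vertices, while the indexed path lets the post-transform cache reuse the two vertices shared by adjacent quads.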
Slide 17
Distance-based LOD system Optimization
● Input line segments have a random order
● Just render fewer (but thicker) fragments when far away!
● Needs shading adjustments to ensure smooth quality transitions
● Increase alpha threshold for fragment inclusion when far away
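As a rough illustration of the distance-based idea, the Python sketch below picks how many strands to draw and how much to widen them. Everything in it is an assumption made for the example (the linear falloff, the parameter names, and scaling width by the inverse of the drawn fraction); the slides do not give the actual formula.

```python
def strand_lod(distance, lod_start, lod_end, min_fraction, total_strands):
    """Return (strands_to_draw, width_scale) for a given camera distance.

    Hypothetical policy: draw everything up to lod_start, then fade
    linearly down to min_fraction of the strands at lod_end, widening the
    remaining strands so the covered area stays roughly constant.
    """
    if lod_end <= lod_start:
        raise ValueError("lod_end must be greater than lod_start")
    if not 0.0 < min_fraction <= 1.0:
        raise ValueError("min_fraction must be in (0, 1]")
    t = (distance - lod_start) / (lod_end - lod_start)
    t = min(max(t, 0.0), 1.0)
    fraction = 1.0 + t * (min_fraction - 1.0)
    strands_to_draw = max(1, int(total_strands * fraction))
    width_scale = 1.0 / fraction
    return strands_to_draw, width_scale
```

Because the input segments are in random order, drawing only the first `strands_to_draw` strands gives an evenly distributed subset without any extra selection pass.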
Slide 18
Other Optimizations
● PPLL Head UAV uses a RWTexture2D instead of a Buffer
  ● Results in more efficient caching for UAV accesses
● Avoid GPR indexing for sorting
  ● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
  ● Used an ALU-based indexing approach to improve performance
● TO DO: compute shader simulation optimizations
  ● Currently a set of multiple compute shaders
  ● Looking at combining some of these, optimizing shaders and output formats
Slide 19
Per-Pixel Linked Lists UAV Memory Considerations
● How much memory is needed?
  ● Guesstimate for a given usage model
  ● Max (hair pixels x average overdraw) fragments
● What happens when I run out?
  ● Missing fragments
● What can be done about it?
Slide 20
k-Buffer in Memory
Slide 21
● PP Linked-List (PPLL): node pool holding all fragments. How big?
● k-Buffer: fixed size array of k entries per pixel. Simple memory bound
Slide 22
The Front k
Approximation to avoid massive sorting
● Only sort the front k fragments per-pixel
● Blend the rest out-of-order
If deferring for shader LOD … also
● Full quality shade on front k
● Cheap shade on rest
k is 4, 8, 16
[Image: depth complexity, 20 frags/pixel (ave); red = over 100]
Slide 23
The Front k
Approximation to avoid massive sorting
● Only sort the front k fragments per-pixel
● Blend the rest out-of-order
If deferring for shader LOD … also
● Full quality shade on front k
● Cheap shade on rest
Can’t know front k until all fragments processed
[Diagram: k-Buffer and tail]
Slide 24
For Each Fragment in Each Pixel
[Diagram: a new fragment is compared against the k-Buffer using the index of the furthest entry; the resulting tail fragment is blended into the tail color]
Slide 25
If New Fragment in k
[Diagram: k-Buffer, index of furthest, new fragment, tail fragment, blend, tail color]
If in k:
1. Swap with furthest
2. Find new furthest
3. Blend with tail
Slide 26
If not in k
[Diagram: k-Buffer, index of furthest, new fragment, tail fragment, blend, tail color]
If not in k:
1. Blend with tail
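The two cases above (new fragment in k, or not in k) can be put together in a short CPU sketch. This Python version is illustrative only: the function name, the fragment tuple layout and the simple alpha blend used for the tail are assumptions, and it assumes k is at least 1.

```python
def resolve_pixel(fragments, k, background=(0.0, 0.0, 0.0)):
    """Resolve one pixel from (depth, (r, g, b), alpha) fragments.

    Fragments arrive in arbitrary order. The k nearest are kept and
    sorted at the end; everything else is blended out of order.
    """
    def blend(dst, src, alpha):
        return tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(src, dst))

    kbuf = []          # up to k nearest fragments seen so far
    furthest = -1      # index in kbuf of the entry with the largest depth
    color = background

    for frag in fragments:
        if len(kbuf) < k:
            # The first k fragments simply fill the buffer.
            kbuf.append(frag)
            if furthest < 0 or frag[0] > kbuf[furthest][0]:
                furthest = len(kbuf) - 1
            continue
        tail = frag
        if frag[0] < kbuf[furthest][0]:
            # In k: swap with furthest, then find the new furthest.
            tail, kbuf[furthest] = kbuf[furthest], frag
            furthest = max(range(k), key=lambda i: kbuf[i][0])
        # Either way, one fragment is blended into the tail out of order.
        color = blend(color, tail[1], tail[2])

    # Sort the front k back to front and blend them over the tail color.
    for _, rgb, alpha in sorted(kbuf, key=lambda f: f[0], reverse=True):
        color = blend(color, rgb, alpha)
    return color, sorted(f[0] for f in kbuf)
```

The function also returns the depths that ended up in the k-buffer, which makes it easy to check that they are the k smallest regardless of arrival order.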
Slide 27
From PPLL to k-Buffer
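For each pixel:
  Write frags to mem
For each fragment in each pixel:
  Read fragment from mem
  Update k-buffer (in registers for PPLL; in memory for the k-buffer in memory)
  Blend tail fragment (in registers for PPLL; in memory for the k-buffer in memory)
Read k-buffer from mem
Sort and blend k-buffer (reg)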
Slide 28
k-Buffer: Screen Width x Screen Height x k entries, K = 4, 8, 16
● 8 bytes each (depth and data)
● PPLL nodes were 12 bytes (depth, data, next)
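The node sizes above give a quick way to compare memory budgets. The Python sketch below is a back-of-the-envelope estimate under stated assumptions: a 4-byte head pointer per pixel for the PPLL, a 4-byte mutex/count/index word per pixel for the k-buffer, and no accounting for the tail color target. The function names are hypothetical.

```python
PPLL_NODE_BYTES = 12       # depth, data, next (from the slide)
KBUFFER_ENTRY_BYTES = 8    # depth and data (from the slide)
PER_PIXEL_WORD_BYTES = 4   # assumed 32-bit head pointer or mutex word


def ppll_bytes(max_fragments, width, height):
    """Node pool sized for max_fragments plus one head pointer per pixel."""
    return max_fragments * PPLL_NODE_BYTES + width * height * PER_PIXEL_WORD_BYTES


def kbuffer_bytes(width, height, k):
    """k entries per pixel plus one mutex/count/index word per pixel."""
    return width * height * (k * KBUFFER_ENTRY_BYTES + PER_PIXEL_WORD_BYTES)
```

The PPLL figure depends on a guess for the fragment count and fails by dropping fragments when the guess is too low, while the k-buffer figure is fixed by resolution and k. Which one is smaller depends on the scene.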
Slide 29
PPLL: 2nd Pass
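[Diagram: in the PPLL second pass, the k-Buffer, index of furthest and tail color are held in registers; the new fragment updates the k-Buffer and the tail fragment is blended into the tail color]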
Slide 30
k-Buffer in Memory: 1st Pass
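[Diagram: in the first pass of the k-Buffer in memory, the k-Buffer and the per-pixel word (mutex, index of furthest, …) are held in memory; the tail fragment is blended into the tail color by the blend unit]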
Slide 31
Mutex/Count/Index Buffer: Screen Width x Screen Height, 32 bits per pixel
Fields, starting from the high bit:
● Max Index (4 bits)
● Mutex Bit
● Initialized Bit
● Count (remainder)
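To make the field layout concrete, here is a Python sketch that packs and unpacks such a 32-bit word. The shift of 28 for the max index matches the `(new_max_id<<28)` expression used in the spinlock code; the exact positions chosen for the mutex bit (`RESERVED`) and the initialized bit (`INITED`) are assumptions based on the order in which the slide lists them.

```python
MAX_INDEX_SHIFT = 28           # top 4 bits hold the index of the furthest entry
RESERVED = 1 << 27             # assumed position of the mutex bit
INITED = 1 << 26               # assumed position of the initialized bit
COUNT_MASK = INITED - 1        # remaining low 26 bits hold the count


def pack_word(max_index, locked, inited, count):
    """Build the per-pixel word from its four fields."""
    if not 0 <= max_index < 16:
        raise ValueError("max_index must fit in 4 bits")
    if not 0 <= count <= COUNT_MASK:
        raise ValueError("count does not fit in the remaining bits")
    word = max_index << MAX_INDEX_SHIFT
    if locked:
        word |= RESERVED
    if inited:
        word |= INITED
    return word | count


def unpack_word(word):
    """Return (max_index, locked, inited, count)."""
    return (
        (word >> MAX_INDEX_SHIFT) & 0xF,
        bool(word & RESERVED),
        bool(word & INITED),
        word & COUNT_MASK,
    )
```

Four bits for the max index are enough for the largest k mentioned in the talk (16 entries, indices 0 to 15).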
Slide 32
Spinlock Mutex
Paranoia
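[allow_uav_condition]
for(; i /* … rest of the loop header was lost in the transcript */ )
{
    uint oldID;
    // Try: attempt to take the per-pixel mutex
    InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID);
    if( (oldID&RESERVED) != RESERVED )
    {
        // Do work
        [[ … Do work ]]
        // Release: make the writes visible, then clear the mutex bit
        DeviceMemoryBarrier();
        tRWMutex[vScreenAddress] = (new_max_id<<28)+INITED;
        bStop = true;
    } // end mutex check
} // end spinlock loop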
Slide 33
Find New Max Depth
uint new_max_depth = u_inDepth;
// The loop header was truncated in the transcript; the KBUFFER_SIZE bound is assumed
[unroll] for(int t=0; t < KBUFFER_SIZE; t++)
{
    uint element_depth = DEPTH( vScreenAddress, t );
    if(element_depth > new_max_depth )
    {
        new_max_depth = element_depth;
        new_max_id = t;
    }
}
Generally more memory traffic than PPLL
Slide 34
Initialization: The first k
Options
● Clear k-buffer fullscreen (0,1)
● Clear k-buffer stenciled, 3rd pass
● Clear on first fragment
● Count
[Diagram: per-pixel word, from the high bit: Max Index (4 bits), Mutex Bit, Initialized Bit, Count (remainder)]
Slide 35
The first k
InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount);
[allow_uav_condition]
if(oldCount < KBUFFER_SIZE)
{
    DATA(vScreenAddress,oldCount) = u_inData;
    DEPTH(vScreenAddress,oldCount) = u_inDepth;
    return uint2(u_outDepth,u_outData);
}
[Diagram: per-pixel word, from the high bit: Max Index (4 bits), Mutex Bit, Initialized Bit, Count (remainder)]
Slide 36
Models
[Images: two models, ~130k hairs and ~20k hairs]
Stats
● 2k polygons
● 2-3.5 M fragments
● 200-300k pixels
Shading
● One point light & shadow
● 2 shifted specular lobes
Slide 37
Depth Complexity
Color   Fragments per pixel
Grey    1
Blue    8
Green   50
Red     100+
Slide 38
Contention
Max attempts per pixel, k=4
Color         Max attempts
Dark Blue     1
Aqua          <=4
Bright Aqua   <=8
Slide 39
Performance
Time ratio to out-of-order blending
● Forward PPLL: 1.02 to 1.4
● Forward k-Buffer: 1.2 to 1.4
● Deferred PPLL: 0.7 to 0.9
● Deferred k-Buffer: 0.9 to 1.6
Slide 40
K-Buffer in Memory
● Simple memory bound
● Can be less memory
● Usually slower
  ● Increased memory traffic
Slide 41
Simulation
Slide 42
Hair Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Slide 43
Fur Simulation
● Length Constraint
● Local Constraint
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Slide 44
Grass Simulation
● Length Constraint
● Local Constraint (1D)
● Global Constraint
● Model Transform
● Collision Shapes
● External Forces (wind, gravity, etc.)
Slide 45
Constraint Method (iterative)
[Diagram: strand vertices p0, p2, Pn-2, Pn-1 linked by constraints C0, C1, Cn-2]
● Used for length, local and global constraints
● Length is most difficult to converge
  ● particularly under large movement
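The convergence issue is easy to see with a small CPU experiment. The Python sketch below applies a generic iterative length constraint along one strand; it is a common position-based formulation chosen for illustration, not necessarily the one used in TressFX, and the function names and the pinned-root rule are assumptions.

```python
import math


def enforce_lengths(points, rest_length, iterations):
    """Iteratively push each segment of a strand back to rest_length.

    points[0] is treated as the pinned root. Each pass visits the
    segments in order and splits the correction between both ends.
    """
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts) - 1):
            a, b = pts[i], pts[i + 1]
            delta = [bj - aj for aj, bj in zip(a, b)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue  # degenerate segment, no direction to correct along
            stretch = (dist - rest_length) / dist
            wa = 0.0 if i == 0 else 0.5   # the root does not move
            wb = 1.0 - wa
            for j in range(len(a)):
                a[j] += wa * stretch * delta[j]
                b[j] -= wb * stretch * delta[j]
    return pts


def max_length_error(points, rest_length):
    """Largest deviation of any segment from rest_length."""
    return max(
        abs(math.dist(points[i], points[i + 1]) - rest_length)
        for i in range(len(points) - 1)
    )
```

Fixing one segment disturbs its neighbours, so a strand that has been stretched by a large movement needs many passes before every segment is back at its rest length.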
Slide 46
Tridiagonal Matrix Formulation
● Direct solve for length constraint
  ● Almost zero stretch
  ● Limited to smaller time steps (stability)
  ● Still cheap
● Leverages matrix structure of strands
  ● Two sweeps of strand
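For readers unfamiliar with tridiagonal systems, the standard direct solver is the Thomas algorithm, which makes one forward pass and one backward pass over the unknowns. The Python sketch below is that generic algorithm only; it does not reproduce the specific matrix built for the hair length constraint, and the argument convention is an assumption for the example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row i reads: lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is done, so the
    matrix is assumed to be well conditioned (e.g. diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n

    # Forward sweep: eliminate the lower diagonal.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Backward sweep: substitute from the last unknown to the first.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The cost is linear in the number of unknowns, which fits the slide's description of a cheap direct solve done in two sweeps along the strand.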
Slide 47
Tridiagonal Matrix Formulation
“Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation”, VRIPHYS, 2013
Slide 48
Demos
Slide 49
Summary
● Next-gen look is possible now!
● Deferred Rendering for shading LOD is fastest
● k-buffer in memory is an option for memory-constrained situations
● High-quality grass and fur simulation with compute
● Upcoming TressFX 2 SDK sample update with fur scenario at
http://developer.amd.com/tools-and-sdks/graphicsdevelopment/amd-radeon-sdk/
Slide 50
Questions?
Slide 51
Extras
Slide 52
Isoline Tessellation for hair/fur? 1/2
● Isoline tessellation has two tess factors
  ● First is line density (lines per invocation)
  ● Second is line detail (segments per line)
● In theory provides easy LOD system
  ● Variable line density and detail by increasing both tessellation factors based on distance
[Images: Tess = (1,1), Tess = (2,1), Tess = (2,2), Tess = (2,3), Tess = (3,3)]
Slide 53
Isoline Tessellation for hair/fur? 2/2
● In practice isoline tessellation is not cost effective for this scenario
● Lines are always 1-pixel thick
  ● Need GS to extrude them into triangles for smooth edges
    ● Major impact on performance!
  ● Alternative is to enable MSAA
    ● Most engines are deferred so this causes a large performance impact
  ● No extrusion for smoothing edges and no MSAA = poor quality!
● Bottom line: a pure Vertex Shader solution is faster
  ● LOD benefit is easily done in VS (more on this later)
  ● Curvature is rarely a problem (dependent on vertices/strands at authoring time)
Slide 54
AA, Self-shadowing and Transparency
[Images: basic rendering; antialiasing; antialiasing + self-shadowing; antialiasing + self-shadowing + transparency]