Transcript GDC 2012

Technology Behind AMD’s “Leo Demo”
Jay McKee
MTS Engineer, AMD
Why Forward Rendering?
● Complex materials
● Multiple light types
● Supports hardware anti-aliasing
● Efficient memory usage
● Supports transparency
● BUT, previously could not support a large number of lights
Forward+ Rendering
● Modified forward renderer. Add compute shader for light culling. Modify main light loop.
● Lighting and shading done in the same place, all information is preserved.
Forward+ Rendering (continued)
● No limits on parameters for lights and materials
  ● Omni
  ● Spot
  ● Cinematic (arbitrary falloffs, barndoor)
  ● BRDF per material instance
● Simple design, concentrate on rendering, not engine maintenance.
Important DX11 features
● Compute Shaders
● UAV support.
Compute Shaders
● In Leo demo we use two compute shaders:
  ● One for culling lights.
  ● Another for spawning Virtual Point Lights (VPLs) for indirect lighting.
● Culling 3,072 lights takes 1.7 ms on high end GPU.
UAVs
● Array(s) of scene light information.
● Array of u32 light indices for storing start/end lights per-tile.
● Array of material instance data
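The per-tile light list can be pictured as flat arrays. The following is a minimal CPU-side sketch in Python, not the demo's code: the names `lights`, `light_indices`, `light_start`, `light_end`, and `lights_for_tile`, and all sample values, are assumptions made for illustration.

```python
# Hypothetical CPU-side picture of the buffers; names and values are illustrative.

# Scene light array: one entry per light (position and radius only, for brevity).
lights = [
    {"pos": (0.0, 1.0, 5.0), "radius": 2.0},
    {"pos": (3.0, 1.0, 8.0), "radius": 1.5},
    {"pos": (-2.0, 0.5, 4.0), "radius": 3.0},
]

# One flat array of light indices shared by all tiles.
light_indices = [0, 2, 1, 2]

# Per-tile [start, end) range into light_indices.
light_start = [0, 2, 2]  # tile 1 is empty: start == end
light_end = [2, 2, 4]


def lights_for_tile(tile_idx):
    """Return the indices of the lights that touch the given tile."""
    return light_indices[light_start[tile_idx]:light_end[tile_idx]]
```

With these sample values, tile 0 sees lights 0 and 2, tile 1 sees none, and tile 2 sees lights 1 and 2. Storing only a start and an end per tile keeps the per-tile cost fixed no matter how many lights a tile holds.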
Algorithm summary
● Depth Pre-Pass
● Light Culling
  ● Screen divided into tiles. Launch one compute shader thread group per tile.
  ● Light info such as position, radius, direction, length passed to light culling compute shader.
  ● Light culling shader projects light bounds to screen-space tiles. Uses scene depth from z pre-pass for z testing against light volumes.
  ● Outputs to UAV describing per-tile light list start/end along with a large UAV of u32 array of light indices.
  ● Output UAVs are passed to main light shaders for looping through lights per-pixel.
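To make "projects light bounds to screen-space tiles" concrete, here is a small Python sketch that maps one light's screen-space bounding box to the tiles it touches. It is a scatter-style CPU illustration under stated assumptions: the function name `tiles_overlapped`, its parameters, and the use of a bounding box are hypothetical, and the shader on the slides instead tests every light against each tile's frustum.

```python
import math


def tiles_overlapped(cx, cy, radius_px, tile_res, width, height):
    """Return indices of tiles overlapped by a light's screen-space bounding box.

    (cx, cy) is the projected light centre in pixels and radius_px its
    projected radius. The test is conservative: it uses the box, not the circle.
    """
    num_x = (width + tile_res - 1) // tile_res
    num_y = (height + tile_res - 1) // tile_res
    # Tile range covered by the box, clamped to the screen.
    x0 = max(math.floor((cx - radius_px) / tile_res), 0)
    x1 = min(math.floor((cx + radius_px) / tile_res), num_x - 1)
    y0 = max(math.floor((cy - radius_px) / tile_res), 0)
    y1 = min(math.floor((cy + radius_px) / tile_res), num_y - 1)
    # An off-screen light yields an empty range and therefore an empty list.
    return [ty * num_x + tx
            for ty in range(y0, y1 + 1)
            for tx in range(x0, x1 + 1)]
```

On a 64x32 screen with 16-pixel tiles, a light centred at (32, 16) with a 1-pixel radius straddles four tiles, while a light far off-screen touches none.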
Algorithm summary continued
● Render scene materials
  ● Base light accumulation function
    ● Use screen x, y location to determine tileID
    ● From tileID, get light start and end indices
    ● From start index to end index, loop
    ● Entry is index into light array.
    ● Accumulate light hitting pixel
    ● Returns total direct and indirect light hitting pixel.
Algorithm summary continued
● Material shader
  ● Decides what to do with total incoming light
  ● Passed into material’s BRDF for example
  ● Uses light accumulation building blocks
    ● Env. lighting, base light accumulation, BRDF, etc. are put together for final pixel color.
Light Culling Shader Details (1/3)
// 1. prepare
float4 frustum[4];
float minZ, maxZ;
{
    ConstructFrustum( frustum );
    minZ = thread_REDUCE( MIN, depth );
    maxZ = thread_REDUCE( MAX, depth );
    ldsMinZ = SIMD_REDUCE( MIN, minZ );
    ldsMaxZ = SIMD_REDUCE( MAX, maxZ );
    minZ = ldsMinZ;
    maxZ = ldsMaxZ;
}
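The "prepare" step reduces the depth buffer to one min/max pair per tile. A sequential Python stand-in for that two-level reduction might look like the sketch below. The function name `tile_depth_bounds` and the row-major depth layout are assumptions; on the GPU the same result comes from per-thread and then per-group reductions rather than a double loop.

```python
def tile_depth_bounds(depth, width, height, tile_res):
    """Return (min_z, max_z) lists holding one depth bound per tile.

    depth is a row-major list of width * height values from the depth pre-pass.
    """
    num_x = (width + tile_res - 1) // tile_res
    num_y = (height + tile_res - 1) // tile_res
    min_z = [float("inf")] * (num_x * num_y)
    max_z = [float("-inf")] * (num_x * num_y)
    for y in range(height):
        for x in range(width):
            tile = (y // tile_res) * num_x + (x // tile_res)
            d = depth[y * width + x]
            min_z[tile] = min(min_z[tile], d)
            max_z[tile] = max(max_z[tile], d)
    return min_z, max_z
```

The tighter these bounds are, the more lights the later depth-range test can reject for a tile.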
Light Culling Shader Details (2/3)
__local u32 ldsNLights = 0;
__local u32 ldsLightBuffer[MAX];
// 2. overlap check, accumulate in LDS
for(int i=threadIdx; i<nLights; i+=WG_SIZE)
{
    Light light = fetchAndTransform( lightBuffer[ i ] );
    if( overlaps( light, frustum ) && overlaps( light, minZ, maxZ ) )
    {
        AtomicAppend( ldsLightBuffer, i );
    }
}
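The two `overlaps` tests are not spelled out on the slides. One common way to write them, sketched here in Python for a spherical light, is a sphere-versus-plane test for the tile frustum plus an interval test for the depth range. Everything in this block is an assumption for illustration: the plane format `(nx, ny, nz, d)` with unit normals pointing into the frustum, view-space z increasing into the screen, and the function names.

```python
def signed_distance(plane, point):
    """Distance from point to plane; positive on the inside of the frustum."""
    nx, ny, nz, d = plane
    x, y, z = point
    return nx * x + ny * y + nz * z + d


def overlaps_frustum(center, radius, planes):
    """A sphere is rejected only if it lies fully behind one of the planes."""
    return all(signed_distance(p, center) >= -radius for p in planes)


def overlaps_depth(center, radius, min_z, max_z):
    """Interval test of the sphere's z extent against the tile's depth range."""
    z = center[2]
    return z + radius >= min_z and z - radius <= max_z


def cull_lights(lights, planes, min_z, max_z):
    """Return indices of lights (center, radius) that pass both tests."""
    visible = []
    for i, (center, radius) in enumerate(lights):
        if overlaps_frustum(center, radius, planes) and \
           overlaps_depth(center, radius, min_z, max_z):
            visible.append(i)
    return visible
```

The plane test is conservative: a sphere near a frustum corner can pass even though it misses the tile, which costs some shading work but never drops a light that matters.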
Light Culling Shader Details (3/3)
// 3. export to global
__local u32 ldsOffset;
if( threadIdx == 0 )
{
    // reserve ldsNLights slots in the global index buffer
    ldsOffset = AtomAdd( ldsNLights );
    globalLightStart[tileIdx] = ldsOffset;
    globalLightEnd[tileIdx] = ldsOffset + ldsNLights;
}
for(int i=threadIdx; i< ldsNLights; i+=WG_SIZE)
{
    int dstIdx = ldsOffset + i;
    globalLightIndexBuffer[dstIdx] = ldsLightBuffer[i];
}
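The export step can be read as each tile reserving a contiguous range in one shared index buffer. The Python sketch below replays that sequentially. The name `export_tile_lists` and the variable `global_counter` are assumptions: the slide does not show which counter the atomic add targets, and on the GPU tiles reserve their ranges in no particular order.

```python
def export_tile_lists(per_tile_lists):
    """Pack per-tile light lists into (light_start, light_end, index_buffer)."""
    global_counter = 0  # stand-in for a counter updated with an atomic add
    light_start, light_end, index_buffer = [], [], []
    for tile_list in per_tile_lists:
        offset = global_counter           # value an atomic add would return
        global_counter += len(tile_list)  # reserve the tile's slots
        light_start.append(offset)
        light_end.append(offset + len(tile_list))
        index_buffer.extend(tile_list)    # copy the tile's indices to its range
    return light_start, light_end, index_buffer
```

For example, `export_tile_lists([[0, 2], [], [1, 2]])` gives start offsets `[0, 2, 2]`, end offsets `[2, 2, 4]`, and the packed index buffer `[0, 2, 1, 2]`.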
Light Accumulation Pseudo-code
// BaseLighting.inc
// THIS INC FILE IS ALL THE COMMON LIGHTING CODE
StructuredBuffer<float4> LightParams      : register(u0);
StructuredBuffer<uint>   LowerBoundLights : register(u1);
StructuredBuffer<uint>   UpperBoundLights : register(u2);
StructuredBuffer<int2>   LightIndexBuffer : register(u3);
uint GetTileIndex(float2 screenPos)
{
    float tileRes = (float)m_tileRes;
    uint numCellsX = (m_width + m_tileRes - 1)/m_tileRes;
    uint tileIdx = floor(screenPos.x/tileRes)+floor(screenPos.y/tileRes)*numCellsX;
    return tileIdx;
}
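As a worked check of the tile index arithmetic, here is the same calculation in Python. The 8-pixel tile size and the screen widths used in the examples are assumptions chosen for illustration; the slides do not state the demo's values.

```python
import math


def get_tile_index(screen_x, screen_y, width, tile_res):
    """Map a pixel position to its tile index, row by row."""
    # Round up so a partial tile at the right edge still gets a column.
    num_cells_x = (width + tile_res - 1) // tile_res
    tile_x = math.floor(screen_x / tile_res)
    tile_y = math.floor(screen_y / tile_res)
    return tile_x + tile_y * num_cells_x
```

With a 1280-pixel-wide target and 8-pixel tiles there are 160 tile columns, so pixel (8, 0) falls in tile 1 and pixel (0, 8) in tile 160. A width of 1283 rounds up to 161 columns, and the last three pixel columns share tile 160.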
Light Accumulation (2):
StartHLSL BaseLightLoopBegin // THIS IS A MACRO, INCLUDED IN MATERIAL SHADERS
uint tileIdx = GetTileIndex( pixelScreenPos );
uint startIdx = LowerBoundLights[tileIdx];
uint endIdx = UpperBoundLights[tileIdx];
[loop]
for ( uint lightListIdx = startIdx; lightListIdx < endIdx; lightListIdx++ )
{
    int lightIdx = LightIndexBuffer[lightListIdx];
    // Set common light parameters
    float ndotl = max(0, dot(normal, lightVec));
    float3 directLight = 0;
    float3 indirectLight = 0;
Light Accumulation (3):
    if( lightIdx >= numDirectLightsThisFrame ) {
        CalculateIndirectLight(lightIdx , indirectLight);
    } else
    {
        if( IsConeLight( lightIdx ) )
        {
            // <<== Can add more light types here
            CalculateDirectSpotlight(lightIdx , directLight);
        }
        else
        {
            CalculateDirectSpherelight(lightIdx , directLight);
        }
    }
    float3 incomingLight = (directLight + indirectLight)*ndotl;
    float shadowTerm = CalcShadow();
EndHLSL
StartHLSL BaseLightLoopEnd
}
EndHLSL
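The loop above can be mirrored on the CPU to see how one index list serves direct lights and VPLs alike. This Python sketch is an illustration under assumptions, not the demo's lighting model: light intensity is a scalar, `linear_falloff` is a placeholder attenuation, the cone test uses a hypothetical `cone_dir` and `cos_cutoff`, and shadows are left out.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_falloff(dist, radius):
    # Placeholder attenuation (assumption): 1 at the light, 0 at its radius.
    return max(0.0, 1.0 - dist / radius)


def accumulate_lights(pixel_pos, normal, tile_lights, lights, num_direct):
    """Return (direct, indirect) scalar light reaching one pixel.

    tile_lights holds the light indices of the pixel's tile; lights with an
    index >= num_direct are treated as VPLs, as in the loop above.
    """
    direct = 0.0
    indirect = 0.0
    for idx in tile_lights:
        light = lights[idx]
        to_light = [l - p for l, p in zip(light["pos"], pixel_pos)]
        dist = math.sqrt(dot(to_light, to_light))
        if dist == 0.0 or dist >= light["radius"]:
            continue  # no usable direction, or pixel outside the light's range
        light_vec = [c / dist for c in to_light]
        ndotl = max(0.0, dot(normal, light_vec))
        amount = light["intensity"] * linear_falloff(dist, light["radius"]) * ndotl
        if idx >= num_direct:
            indirect += amount
        elif "cone_dir" in light:
            # Spot light: only pixels inside the cone receive light.
            to_pixel = [-c for c in light_vec]
            if dot(to_pixel, light["cone_dir"]) >= light["cos_cutoff"]:
                direct += amount
        else:
            direct += amount
    return direct, indirect
```

Keeping direct and indirect totals separate matches the slides' statement that the base function returns both, so a material can weight them differently.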
Material Shader Template:
#include "BaseLighting.inc"
float4 PS ( PSInput i ) : SV_TARGET
{
    float3 totalDiffuse = 0;
    float3 totalSpec = GetEnvLighting();
    $include BaseLightLoopBegin
    // unique material code goes here!! Light accumulation on the pixel for a given light
    // we have total incoming light and direct/indirect light components as well as material params and shadow term
    // use these building blocks to integrate lighting terms
    totalDiffuse += GetDiffuse(incomingLight);
    totalSpec += CalcPhong(incomingLight);
    $include BaseLightLoopEnd
    float3 finalColor = totalDiffuse + totalSpec;
    return float4( finalColor, 1 );
}
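The template's pattern, with specular seeded by environment lighting and each light adding a diffuse and a specular term, can be sketched in Python with scalar light values. The bodies of `GetDiffuse` and `CalcPhong` are not shown on the slides, so `phong_specular` and `shade_pixel` below are assumed stand-ins using a plain Lambert term and a classic Phong lobe.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phong_specular(normal, light_vec, view_vec, shininess):
    """Classic Phong lobe: reflect the light vector about the normal."""
    n_dot_l = dot(normal, light_vec)
    reflected = [2.0 * n_dot_l * n - l for n, l in zip(normal, light_vec)]
    return max(0.0, dot(reflected, view_vec)) ** shininess


def shade_pixel(per_light, normal, view_vec, albedo, env_spec, shininess):
    """Combine per-light terms into a final scalar colour.

    per_light is a list of (incoming_light, light_vec) pairs, one per light in
    the pixel's tile, with incoming_light already scaled by N.L.
    """
    total_diffuse = 0.0
    total_spec = env_spec  # specular starts from the environment lighting
    for incoming, light_vec in per_light:
        total_diffuse += albedo * incoming
        total_spec += incoming * phong_specular(normal, light_vec, view_vec, shininess)
    return total_diffuse + total_spec
```

A different material would swap only the two lines inside the loop, which is the point of splitting the light loop into begin and end macros.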
Debug Mode Demo
Benchmark
3k dynamic lights
Compute-based Deferred vs. Forward+
[Chart: time (ms) for Deferred(H), Deferred(L), Forward+(H), and Forward+(L), split into prepass, light processing, and final shading]
Takahiro Harada, Jay McKee, Jason C. Yang, Forward+: Bringing Deferred Lighting to the Next Level, Eurographics Short Paper (2012)
Depth Pre-Pass Critical
● Pixel overdraw cripples this technique so depth prepass is required.
● Depth pre-pass is good opportunity to use MRT to generate other full-screen data needed for post-fx and other render fx (optional).
Other important points
● XBOX 360 has good bandwidth so given limitations on forward rendering, deferred makes a lot of sense.
● However, ALU computation growing at faster rate than bandwidth. More and more feasible to just do the calculations than to read/write so much data.
● Dynamic branching penalties not nearly as bad as before. As an optimization, compute shader can sort by light-type for example to minimize penalties.
● All that "light management" CPU side code to decide which lights hit each object for setting constant registers can be ditched!
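The sort-by-light-type optimization is only mentioned in passing, so the following Python sketch is an assumption about what it could look like: reorder each tile's index list so that lights of the same type sit together, which reduces how often the per-light branch changes direction. The names `sort_by_type` and `count_type_switches` and the type labels are hypothetical.

```python
def sort_by_type(tile_lights, light_types):
    """Group a tile's light indices by light type (stable within a type)."""
    return sorted(tile_lights, key=lambda i: light_types[i])


def count_type_switches(tile_lights, light_types):
    """Count how often consecutive lights in the list differ in type."""
    types = [light_types[i] for i in tile_lights]
    return sum(1 for a, b in zip(types, types[1:]) if a != b)
```

With alternating spot and omni lights, sorting cuts the number of type switches in a four-light tile from three to one.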
Summary
● Modified forward renderer that handles scenes with 1000s of lights.
● Hardware anti-aliasing (MSAA) “automatic”
● Bandwidth friendly.
● Makes the most of the GPU's ALU power (which is growing faster than bandwidth)
Thanks!
Contact:
[email protected]
[email protected]
[email protected]
Leo Demo website:
http://developer.amd.com/samples/demos/pages/AMDRadeonHD7900SeriesGraphicsReal-TimeDemos.aspx
Eurographics 2012: 'Forward+: Bringing Deferred Lighting to the Next Level'