Firaxis LORE
And other uses of D3D11
Low Overhead Rendering Engine
• Or, how I learned to render 15,000+ batches at 60 FPS
Overview
• Civ 5 is a big game that covers 6000 years of history
• The entire map can be populated/polluted with all sorts of things the user creates
• Need to be able to render a huge number of possibly disparate object types
Early Goals
• Build brand new Engine for Civilization V
• Like the game, we wanted the graphics engine to be able to ‘stand the test of time’
• Decided, while D3D11 was still in alpha, to build the engine natively for the D3D11 architecture and map backwards to DX9
Step 1: Cutting the overhead down
• Shaders start in Firaxis Shading Language (FSL), a superset of HLSL
• FSL compiles into a CPP and header file: all shader constants are mapped to structs, grouped into packages where all packages have the same bindings
• Model code is templated: the FSL-generated header is then bound with template code
• Result is a tiny amount of code that fills out the required shading and barely shows up on profiling
[Diagram: FSL Files, CPP / H, Template Code, Compile Time Glue Code]
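The generated-header-plus-template idea can be sketched roughly as follows. This is a minimal illustration only, not Firaxis's actual generated code: `TerrainConstants`, `TerrainShaderDef`, `ConstantPayload` and `PackConstants` are hypothetical names standing in for whatever the FSL compiler and the engine templates really produce.

```cpp
#include <cstring>
#include <vector>

// Hypothetical stand-in for an FSL-generated header:
// shader constants mapped onto a plain struct (one constant package).
struct TerrainConstants {
    float worldViewProj[16];
    float tint[4];
};

// Hypothetical generated shader description: names the constant struct
// and the binding slot shared by every package of this kind.
struct TerrainShaderDef {
    using Constants = TerrainConstants;
    static constexpr unsigned kConstantSlot = 0;
};

// Raw bytes destined for one constant binding slot.
struct ConstantPayload {
    unsigned slot;
    std::vector<unsigned char> bytes;
};

// Template "glue": the same code works for any generated shader definition,
// and the binding is resolved entirely at compile time.
template <typename ShaderDef>
ConstantPayload PackConstants(const typename ShaderDef::Constants& constants) {
    ConstantPayload payload;
    payload.slot = ShaderDef::kConstantSlot;
    payload.bytes.resize(sizeof(constants));
    std::memcpy(payload.bytes.data(), &constants, sizeof(constants));
    return payload;
}
```

Model code would then call `PackConstants<TerrainShaderDef>(constants)` and attach the payload to a draw; the per-draw cost is a single memory copy, which fits the claim that this code barely shows up on profiling.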
Step 2: Abstracting the Rendering
• Still have to support DX9, might have to support consoles in future
• Might have to write a ‘driver’
• Our solution: make DX9 ‘look like’ DX11
• Started with as restricted a design as possible, and expanded as we needed to
Packetized Rendering
• Stateless rendering, much simpler than D3D
• Command based: all rendering is performed by self-contained commands
• A command set may contain a list of surfaces to render, each with a shader constant payload
• A surface is an immutable bundle of an IB, VB, textures, shader def, etc.
• All state is bundled into packages: Alpha State, Z State, etc. Commands reference one of these state packages
• Entire Frame is queued up
• Minimal per frame allocation
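The data layout described above might look something like the sketch below. All type and field names here are assumptions made for illustration (`Handle` stands in for real GPU resources); the slides do not show the engine's actual structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opaque handle standing in for a GPU resource.
using Handle = std::uint32_t;

// State packages: each bundles one category of pipeline state.
struct AlphaState { bool blendEnabled; };
struct ZState { bool depthTest; bool depthWrite; };

// A surface bundles everything needed to draw one piece of geometry.
// Commands only ever hold a pointer-to-const, so it is never modified.
struct Surface {
    Handle indexBuffer;
    Handle vertexBuffer;
    Handle textureBundle;
    Handle shaderDef;
};

// One draw: a surface plus its per-draw shader constant payload.
struct Batch {
    const Surface* surface;
    std::vector<unsigned char> constants;
};

// A self-contained command: it names every piece of state it needs,
// so nothing is inherited from whatever was rendered before it.
struct RenderBatchesCommand {
    std::vector<Handle> renderTargets;
    const AlphaState* alpha;
    const ZState* z;
    std::vector<Batch> batches;
};

// Commands are queued here rather than executed immediately.
struct CommandStream {
    std::vector<RenderBatchesCommand> commands;

    std::size_t BatchCount() const {
        std::size_t total = 0;
        for (const RenderBatchesCommand& c : commands) total += c.batches.size();
        return total;
    }
};
```

Because each command carries its full state by reference, two commands can be built in any order, or on different threads, without affecting each other's output.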
Only 5 Types of commands
• COMMAND_RENDER_BATCHES
– A list of surfaces to render into 1 or more rendertargets, with alpha and Z state bundles
– Surfaces have IB, VB, sampler and texture bundles. All required state is specified
• COMMAND_GENERATE_MIPS
• COMMAND_RESOLVE_RENDERTEXTURE
• COMMAND_COPY_RENDERTEXTURE
• COMMAND_COPY_RESOURCE
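With so few command types, a backend only has to implement five entry points, which is one plausible way to make DX9 'look like' DX11. The sketch below assumes a hypothetical `Backend` interface and a logging implementation for demonstration; the real engine's driver layer is not shown in the slides.

```cpp
#include <string>
#include <vector>

// The five command types named on the slide.
enum class CommandType {
    RenderBatches,
    GenerateMips,
    ResolveRenderTexture,
    CopyRenderTexture,
    CopyResource,
};

// Payloads are omitted here; only the type tag matters for dispatch.
struct Command { CommandType type; };

// Hypothetical backend interface: a DX9 or DX11 'driver' would each
// implement these five functions and nothing else.
class Backend {
public:
    virtual ~Backend() = default;
    virtual void RenderBatches(const Command& c) = 0;
    virtual void GenerateMips(const Command& c) = 0;
    virtual void ResolveRenderTexture(const Command& c) = 0;
    virtual void CopyRenderTexture(const Command& c) = 0;
    virtual void CopyResource(const Command& c) = 0;
};

// Walks a queued stream in order and hands each command to the backend.
void Execute(const std::vector<Command>& stream, Backend& backend) {
    for (const Command& c : stream) {
        switch (c.type) {
            case CommandType::RenderBatches:        backend.RenderBatches(c); break;
            case CommandType::GenerateMips:         backend.GenerateMips(c); break;
            case CommandType::ResolveRenderTexture: backend.ResolveRenderTexture(c); break;
            case CommandType::CopyRenderTexture:    backend.CopyRenderTexture(c); break;
            case CommandType::CopyResource:         backend.CopyResource(c); break;
        }
    }
}

// Demonstration backend that records what it was asked to do.
class LoggingBackend : public Backend {
public:
    std::vector<std::string> log;
    void RenderBatches(const Command&) override { log.push_back("render_batches"); }
    void GenerateMips(const Command&) override { log.push_back("generate_mips"); }
    void ResolveRenderTexture(const Command&) override { log.push_back("resolve_rendertexture"); }
    void CopyRenderTexture(const Command&) override { log.push_back("copy_rendertexture"); }
    void CopyResource(const Command&) override { log.push_back("copy_resource"); }
};
```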
Packetized Rendering
[Diagram: Rendering System, multiple Command Streams, Rendering Engine, D3D/Driver]
Step 3: Threading
[Diagram: Job Manager, multiple Jobs each with its own Command Stream, Rendering System]
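The job and command stream arrangement in this step could be sketched as below. It is a simplification under stated assumptions: one `std::thread` per job stands in for a real job manager with a worker pool, and `Command`, `BuildStream` and `BuildFrame` are hypothetical names.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Minimal command record: which job produced it and in what order.
struct Command { int jobIndex; int sequence; };
using CommandStream = std::vector<Command>;

// Each job writes only to its own stream, so jobs share no mutable state
// and need no locks.
void BuildStream(int jobIndex, int commandCount, CommandStream& out) {
    out.reserve(commandCount);
    for (int i = 0; i < commandCount; ++i) {
        out.push_back(Command{jobIndex, i});
    }
}

// Runs all jobs in parallel and returns the streams in job order.
// The rendering system can then consume them sequentially.
std::vector<CommandStream> BuildFrame(int jobCount, int commandsPerJob) {
    std::vector<CommandStream> streams(jobCount);
    std::vector<std::thread> workers;
    workers.reserve(jobCount);
    for (int j = 0; j < jobCount; ++j) {
        workers.emplace_back(BuildStream, j, commandsPerJob, std::ref(streams[j]));
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return streams;
}
```

The output order is fixed by the stream index rather than by thread timing, so the frame is deterministic no matter how the jobs are scheduled.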
Why do we queue up entire Frame?
• Would seem like additional overhead, but perf analysis shows it is a net win
– Internal command setup is super-cheap, just some mem copies
– Engine cache coherency is vastly better
– D3D driver cache coherency is much better with one giant dump
– Very low % of total CPU time spent in submission
– Allows us to filter redundant D3D calls. Call overhead adds up
– Fast even in DX9
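Filtering redundant calls becomes straightforward once the whole frame is queued, because submission is a single loop that can remember what was last set. A minimal sketch, assuming a hypothetical `DrawState` record and counting calls instead of issuing real D3D calls:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-draw state: IDs of the objects each draw needs bound.
struct DrawState {
    std::uint32_t shader;
    std::uint32_t texture;
    std::uint32_t alphaState;
    std::uint32_t zState;
};

// Counts how many API calls would actually be issued.
struct SubmitStats {
    int shaderSets = 0;
    int textureSets = 0;
    int alphaSets = 0;
    int zSets = 0;
    int draws = 0;
};

// Walks the queued draws and only 'sets' state that differs from the
// previous draw. A real backend would call into D3D where counters are bumped.
SubmitStats Submit(const std::vector<DrawState>& queued) {
    SubmitStats stats;
    bool first = true;
    DrawState last{};
    for (const DrawState& d : queued) {
        if (first || d.shader != last.shader) ++stats.shaderSets;
        if (first || d.texture != last.texture) ++stats.textureSets;
        if (first || d.alphaState != last.alphaState) ++stats.alphaSets;
        if (first || d.zState != last.zState) ++stats.zSets;
        ++stats.draws;
        last = d;
        first = false;
    }
    return stats;
}
```

Every draw still declares its complete state, so the filter is purely an optimization at submission time and cannot cause state to leak between draws.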
Implementation advantages
• Once the ‘stateless’ concept is grasped, code maintenance is easy
• Next to no state-leaking (flickering alpha, textures etc)
• Because rendering is packetized, individual jobs need
little or no communication between each other
• NO THREADING BUGS
Threaded D3D11 submission
• Top issues:
– Generally high driver overhead for batch submission
– But: D3D11 has multithreaded submission
– Command Streams do not necessarily map 1:1 to CommandLists
– Civilization V can change how it submits via settings in the config files
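One way a non-1:1 mapping could work is to group consecutive command streams into a configurable number of command lists. The sketch below only models that grouping; `AssignStreams` and the `listCount` setting are hypothetical, and the slides do not say how Civilization V actually chooses the mapping. In real D3D11 code each group would be recorded on a deferred context (created with `ID3D11Device::CreateDeferredContext`), closed with `FinishCommandList`, and played back on the immediate context with `ExecuteCommandList`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns, for each command list, the indices of the command streams
// recorded into it. Streams stay in order and groups are contiguous,
// so playing the lists back in order preserves the frame's draw order.
std::vector<std::vector<std::size_t>> AssignStreams(std::size_t streamCount,
                                                    std::size_t listCount) {
    std::vector<std::vector<std::size_t>> lists;
    if (streamCount == 0) {
        return lists;
    }
    // Never use more lists than streams, and always use at least one.
    listCount = std::max<std::size_t>(1, std::min(listCount, streamCount));
    lists.resize(listCount);
    for (std::size_t s = 0; s < streamCount; ++s) {
        lists[s * listCount / streamCount].push_back(s);
    }
    return lists;
}
```

Setting `listCount` to 1 collapses everything into a single submission, which might suit a driver without native command list support, while a larger value spreads recording across more threads.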
Step 4: Gloating over results
• Wildly surpassed commonly held beliefs on # of
batches possible, especially with threading
Test        Driver with native CL support   Driver without CL support
Units       1686*                           931
Landmarks   1152*                           673
Lategame    3616*                           2052
*Believed to be GPU limited
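As a rough reading of the table, and assuming both columns are measured in the same unit (the transcript does not state it), the ratio of the column with native CL support to the column without works out per test to:

```latex
\frac{1686}{931} \approx 1.81, \qquad
\frac{1152}{673} \approx 1.71, \qquad
\frac{3616}{2052} \approx 1.76
```

If higher is better, which is an assumption here, the starred values being GPU limited would mean these ratios understate what native command list support could deliver on the CPU side.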
Conclusions
• High-throughput rendering is possible IF:
– Care is taken to reduce application overhead
– Rendering is job based and payload based
– Redundant state and calls are filtered
– D3D11 command lists are used
– Engine can peg 12 threads at 97% (sans driver)
D3D11 Features: Tessellation
• Major addition to the D3D11 API
[Screenshot]
Terrain
• Civ5 contains one of the most complex terrain
systems ever made
• Complete procedural process
• Use GPU to raytrace and anti-alias shadows
• Caching system to deal with cases where
terrain is too big
Tessellation
• Terrain very high detail, roughly 64x64
heightmap data per hex
• Triangle count, when zoomed out, can be in
the millions
• Used Tessellation as a ‘drop-in’
Tessellation Cont
• Simple bicubic Beta-spline patches
• Adjusted global tessellation as camera moved in and out
• A strict performance increase: 10%-40% faster, on both AMD and Nvidia hardware
• More adaptive techniques would work even better, but didn’t have time to implement them
Leaders
Leader Rendering
• Largely done with DX10.1 rendering tech
• New variable bit rate compression technology implemented for D3D11
• 2.5 GB of texture data reduced to 150 MB, can be decompressed on the GPU
• Details forthcoming, research is in publication submission process; makes extensive use of UAVs
Future Stuff, NO AO
Future Stuff (CS), AO
Q&A