Hpg09_multiGPU_paper_slides.ppt


Scaling of 3D Game Engine Workloads on
Modern Multi-GPU Systems
Jordi Roca Monfort (Universitat Politècnica Catalunya)
Mark Grossman (Microsoft)
Outline
• Introduction to multi-GPU rendering
• RTT surface synchronization alternatives
• Multi-GPU performance models
• Scaling results
• Conclusion
Multi-GPU rendering
• Main purpose: several GPUs collaborate to render frames of the same 3D scene.
  • The rendering task is massively parallel.
  • High-resolution scenes require a lot of pixel fillrate and memory BW.
  • The GPUs' partial renders are composited into the final frame sequence.
• Other usages:
  • Multi-display rendering: each GPU renders a different viewport/screen.
  • GPUs take on different tasks: graphics rendering, physics, and AI processing.
A choice for enthusiast gaming
• Performance/scaling:
  • Usually high with few GPUs at high resolutions (1600x1200 minimum).
  • Greatly depends on driver maturity and the game workload: Crysis Warhead hits 1.6x and Lost Planet hits 1.9x on 2-GPU systems.
• Power:
  • Two graphics cards consume more than an equivalent single-GPU solution targeting the same performance.
• Upgrade cost:
  • Double the performance for the next game generation by acquiring a second graphics card (same GPU family, similar range counterpart).
  • Extra cost of a high-end motherboard with multiple PCIe ports.
Rendering workload balance
[Figure: workload-balance modes — Split Frame Rendering with a dynamic split line (geometry scaling problem), Alternate Frame Rendering (decreased interactivity problem), fixed 32x32 tiles, and SuperAA (play at high AA modes, x16).]
What do GPUs communicate to each other?
• Today's 3D engines don't just render the main backbuffer:
  • Draw commands can render to special surfaces used later as textures: reflections, shadow maps, lens flares, post-filtering ops, ...
  • This creates a draw dependency chain with render-to-texture surfaces as synchronization points.
• GPUs must exchange updated surface contents at these points to ensure data integrity → inter-GPU syncs.
Render-to-Use sync analysis
[Figure: command-buffer timelines for two GPUs in SFR and AFR modes across Frames 0 and 1. RTT copies diverge after each GPU renders its part, a sync operation makes the copies coherent (costing sync cycles) before use, and they diverge again after the frame swap. A Render[i]→Use[i] within the same frame is an intraframe RtU dependency; a repeated Use of a surface rendered in an earlier frame is an interframe RtU dependency.]
Render-to-Use sync analysis

Game/Timedemo                Engine       Release  Resolution     Frames  RTT surfaces  % RTT time
3DMark06/Canyon Flight       Proprietary  2006/01  16 x 12 (1AA)    3344       23         69.11%
FEAR/PerformanceTest         LithTech     2005/10  16 x 12 (1AA)    4130        5          4.48%
Call Of Duty 2/carentan      Proprietary  2005/10  16 x 12 (1AA)    1355        8         23.47%
Call Of Duty 2/demo5         Proprietary  2005/10  16 x 12 (1AA)    1210        8         18.87%
Company Of Heroes/Intro      Essence      2006/09  16 x 12 (1AA)   12195        5         85.74%
Half Life 2 Lost Coast/VST   Source       2005/10  25 x 16 (8AA)    9712        5         15.88%
BattleField 2142/suez canal  Refractor2   2006/10  25 x 16 (1AA)    7400        8         80.05%
BattleField 2/abl-chini      Refractor2   2005/06  16 x 12 (1AA)    8217        3          4.15%
Render-to-Use sync analysis
[Table garbled in extraction: per-RTT-surface dependency counts (surface Ids 306-30e) for each game/timedemo, listing intraframe (SFR) RtU deps and syncs, and interframe (AFR) RtU deps and syncs on 2-GPU systems.]
• Track per-surface RtU dependencies and the syncs they require.
• The more intraframe syncs, the worse SFR performs.
• The more interframe syncs, the worse AFR performs.
The contributions of this work
• Which characteristics make 3D games suitable for multi-GPU systems?
  • Render-to-texture sync analysis.
  • Do they enable any optimization? (See next section.)
• Can we measure multi-GPU scaling from 3D game workload characteristics?
  • Using a simplified model.
  • Using real 3D workload data.
  • Evaluating SFR, AFR, and combined modes (4+ GPUs).
RTT surface synchronization alternatives
Leverage the RtU gap: Early Copy
[Figure: command-buffer timelines for GPU0 and GPU1. Instead of delaying the RTT copy until the Use, the sync operation starts early, during the gap between the last Render/Draw to the surface and its first Use, so the copy overlaps with other work before the frame swap.]
Determine the last render by:
• a prediction table, or
• looking ahead in the command buffer.

Game    RtU syncs  RtU gap wrt frame duration  % of time the RTT draw is pixel shading bound
3DM06     123691       0.42%                     52.85%
FEAR        2426      83.68%                     65.08%
COD2c      83780       1.70%                     96.28%
COD2d      48175       1.65%                     94.63%
COH        36626       0.01%                     96.80%
HL2        19788      36.45%                     21.95%
BF2142     77363       3.13%                     87.16%
BF2        22553      30.40%                     44.44%

Penalization cost = sync cycles (delayed copy) − RtU gap cycles
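The early-copy cost formula above can be sketched as a tiny helper. This is a minimal sketch in Python; the function name is chosen here for illustration, and clamping the result at zero (a gap longer than the copy hides it completely) is an assumption, since the slide only gives the raw subtraction:

```python
def early_copy_penalty(sync_cycles, rtu_gap_cycles):
    """Cycles the copy still costs after hiding work in the RtU gap.

    Implements the slide's formula: penalization cost = sync cycles of
    the delayed copy minus the RtU gap cycles, clamped at zero
    (assumption: a gap longer than the copy hides it entirely).
    """
    return max(0, sync_cycles - rtu_gap_cycles)
```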
Pixel Shading bounds: Concurrent Update
[Figure: the local GPU's pixel shaders forward their shader outputs not only to the local ROPs and local memory but also, over the PCIe 2.0 x16 link (6 GB/s), to the ROPs of the remote GPUs, updating all copies concurrently. R600 shader: 750 MHz clock, 4 x 16 SIMD ALUs.]
• Shading a thousand fragments (35 instructions each) in the R600 SPUs takes 1000 x 35 / 64 = 547 cycles.
• Sending the corresponding color outputs through the PCIe 2.0 bus takes 1000 x 4 bytes x 750 MHz / 6 GB/s = 500 cycles.
Penalization cost = remote send cycles − pixel shading cycles
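The two cycle counts above can be reproduced from the slide's own parameters (750 MHz R600 clock, 4 x 16 SIMD ALUs retiring 64 ops/cycle, 4-byte colors, 6 GB/s PCIe 2.0 x16). A minimal sketch, with function names chosen here for illustration:

```python
def shading_cycles(fragments, instructions, alu_ops_per_cycle=64):
    # 4 x 16 SIMD ALUs retire 64 ops per clock, so shading time is
    # total ALU operations divided by that throughput.
    return fragments * instructions / alu_ops_per_cycle

def remote_send_cycles(fragments, bytes_per_color=4, clock_hz=750e6, bus_bw=6e9):
    # Time to push the shaded color outputs over the PCIe link,
    # expressed in shader-clock cycles.
    return fragments * bytes_per_color * clock_hz / bus_bw

shade = shading_cycles(1000, 35)   # ~547 cycles, as on the slide
send = remote_send_cycles(1000)    # 500 cycles, as on the slide
penalty = max(0, send - shade)     # remote send cycles - shading cycles
```

Here the penalty is zero: when the draw is pixel-shading bound, shading time fully hides the remote transfer, which is exactly why the per-game "% pixel shading bound" figures matter.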
Multi-GPU Performance Models
EMPATHY analysis tool
[Diagram: EMPATHY flow. The game executes on real hardware, collecting vertex and pixel HW counters, API state changes, downscaled pixel counters, and the real application fps (FRAPS) for correlation against a GPU architecture description file modeling the same real hardware (R600). A proprietary analytical tool estimates single-GPU execution time from the execution trace, API state changes, and the R600 architecture file. The workload is then split across N/I GPUs in SFR mode, adding the inter-GPU sync cost (SFR) given the GPU interconnection BW and taking the minimum SFR-scaled execution time, and across I GPU clusters in AFR mode, adding the AFR sync cost, to yield the multi-GPU execution time. N: total GPUs; I: AFR interleaving.]
Multi-GPU interconnection network
CPU
CPU
PCIe x16 Link
PCIe x16 Link
FSB
FSB
NorthBridge
SouthBridge
NorthBridge
SouthBridge
GPU 0
GDDR
GDDR
MemBus
GPU 1
GPU 0
GDDR
FB
FB
Board 1
PCIe
bridge
GDDR
GDDR
FB
GPU 1
GPU 2
GDDR
GDDR
GDDR
FB
FB
Display Link
Board 0
MemBus
DDR
DDR
GPU 3
GDDR
FB
FB
PCIe
bridge
FB
Display Link
Board 0
Board 1
• SFR sync: all-to-all simultaneous GPU transfers.
• The NorthBridge PCIe ports become the bottleneck.
• Each GPU "sees" a reduced peak BW as a function of the number of GPUs.
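The slide's exact bandwidth expression is not recoverable from this transcript. As a loudly-labeled assumption (not the paper's formula), one simple model of an all-to-all simultaneous sync through the shared NorthBridge ports is to split each GPU's peak link bandwidth evenly across its N−1 peers:

```python
def effective_peer_bw(peak_link_bw_gb_s, num_gpus):
    """ASSUMED model, not the paper's formula: during an all-to-all SFR
    sync, each GPU divides its peak PCIe bandwidth across the other
    num_gpus - 1 GPUs."""
    if num_gpus < 2:
        return peak_link_bw_gb_s
    return peak_link_bw_gb_s / (num_gpus - 1)

# e.g. a 6 GB/s PCIe 2.0 x16 link shared among 4 GPUs leaves 2 GB/s per peer.
```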
Performance scaling models
[Figure: timelines of total estimated multi-GPU cycles. SFR: both GPUs work on every frame, with penalization cycles added per frame. AFR: the GPUs alternate frames after a half-frame initialization interval; each frame inserts sync cycles (S) and penalization cycles between its Render (R) and Use (U) events.]
Early copy + concurrent cost (SFR)
[Chart: penalization cycles (0 to 160000) per frame for BF2, surface 31d, over frames ~3232-4307, comparing delayed copy, early copy, and concurrent update.]
• For each surface, choose every frame the sync alternative that incurs the minimum penalization cycles.
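The per-surface selection described above — pick, each frame, the alternative with the fewest penalization cycles — can be sketched as follows. The helper name is illustrative, and the delayed-copy cost (the full sync cycles) is inferred from the earlier cost formulas rather than stated outright:

```python
def pick_sync_alternative(sync_cycles, rtu_gap_cycles, send_cycles, shade_cycles):
    """Return the sync alternative with the lowest penalization cycles
    for one RTT surface in one frame, per the slides' cost formulas."""
    costs = {
        "delayed copy": sync_cycles,                              # pay the full copy
        "early copy": max(0, sync_cycles - rtu_gap_cycles),       # RtU gap hides part
        "concurrent update": max(0, send_cycles - shade_cycles),  # shading hides transfer
    }
    return min(costs, key=costs.get)
```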
Scaling Results
SFR scaling (early copy + concurrent update)
[Charts: SFR performance gain for early copy + concurrent update on 2 GPUs, as speed-up normalized to delayed copy (0.6 to 2.2), per game (3DM06, FEAR, COD2c, COD2d, COH, HL2, BF2142, BF2); and average performance gain (0% to 60%) of early copy and early + concurrent over delayed copy for the same games. The accompanying per-game RtU syncs / RtU gap / % pixel-shading-bound table repeats the one on the Early Copy slide.]
Combined SFR/AFR modes scaling results
[Chart: average multi-GPU efficiency wrt perfect scaling (0% to 100%) for 2, 3, 4, 6, 8, 12, and 16 GPUs, ranging from pure SFR through SFR>AFR, SFR=AFR, and SFR<AFR combinations to pure AFR.]
• SFR scaling was tested using the Early Copy + Concurrent Update optimization.
• Few interframe syncs mostly benefit AFR configurations.

Game    Interframe syncs (i2)  Interframe syncs (i4)
3DM06            0                      0
FEAR             0                      0
COD2c            0                      0
COD2d            0                      0
COH             41                     83
HL2           1442                   1945
BF2142           9                     26
BF2              1                      3
Conclusion
• The inter-GPU synchronization requirements of render-to-texture surfaces in 3D games impact multi-GPU performance/scaling.
• This work has evaluated the potential benefits of two proposed sync alternatives, based on anticipating RTT updates, for a set of popular DX9 games.
  • Leveraging the RtU gap and the pixel shading cost increases SFR scaling.
• This work has presented a simple multi-GPU analytic performance model, based on real 3D game execution data, that allows evaluating SFR, AFR, and combined rendering modes.
  • As observed, few interframe syncs mostly benefit AFR configurations.
Thank you!