GPU evolution
Will Graphics Morph Into Compute?
Norm Rubin, Fellow, Graphics Products Group (GPG), AMD. PACT 2008.
Performance is King
Cinematic world: on average, studios use about 100,000 minutes of compute per minute of image. Blinn's Law (the flip side of Moore's Law): time per frame stays roughly constant, while audience expectation of quality per frame rises every year!
Cars. Courtesy of Pixar Animation Studios
Expectation of quality increases as compute increases. GPUs are real time (1 minute of compute per minute of image), so users want machines 100,000 times faster.
Pact 2008 | Oct, 2008
GPU Chip Design Focus Point
CPU:
- Lots of instructions, little data (out-of-order execution, branch prediction)
- Reuse and locality
- Task parallel
- Needs an OS
- Complex sync
- Latency machines
- Instructions stay the same

GPU:
- Few instructions, lots of data (SIMD, hardware threading)
- Little reuse
- Data parallel
- No OS
- Simple sync
- Throughput machines
- Instructions keep changing
Effects of Hardware Changes
Call of Juarez
Why do ISA instructions change rapidly?
1) A new game becomes popular.
2) API designers (Microsoft) add new functions to make the game simpler to write.
3) Hardware vendors go through the game line by line and add new hardware to speed it up.
4) New hardware ships.
5) Game developers look at the new hardware and think of interesting new effects (more realism), pushing past what anyone thought the instructions could do.
6) New games. Repeat!
No backward compatibility at the hardware level
Co-evolution in action
Photograph courtesy of Peter Chew (www.brisbaneInsects.com)
Is the GPU just a lot of CPU devices?
GPU / CPU have been co-evolving
Not a process of radical change
Current GPUs have rendering-specific hardware: transitional features, e.g. fixed-function systems:
- Memory systems with filtering (== texture)
- Depth buffer
- Rasterizer
- ...
These are crucial for graphics performance!
Latency or throughput
CPU and GPU are fundamentally different. CPU strength is single-thread performance (a latency machine); GPU strength is massive-thread performance (a throughput machine). What classes of problems can be solved by massively parallel processing?
What exactly does latency or throughput mean?
GPU vs. CPU performance
thread:
    r1 = load(index)   // load
    r1 = r1 + r1       // series of adds
    r1 = r1 + r1
    ...

Run lots of threads. Can you get peak performance on a single core, a multi-core, or a cluster? Peak performance = doing ALU ops every cycle.
Typical CPU Operation
- One iteration at a time on a single CPU unit
- Fetch, then ALU: waits for memory leave gaps that prevent peak performance
- Gap size varies dynamically, so latency is hard to tolerate
- Hard to prefetch data; limited number of outstanding fetches
- Cannot reach 100% utilization; multi-core does not help; a cluster does not help
GPU Threads: Throughput (Lower Clock, Different Scale)
- Lots of threads; separate fetch unit and ALU unit; fast thread switch; in-order finish
- Overlapped fetch and ALU, many outstanding fetches
- ALU units reach 100% utilization
- Hardware sync for final output
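The latency-hiding argument on this slide can be sketched with a toy model. The `LATENCY` and `ALU_OPS` numbers below are illustrative assumptions, not measured hardware figures: with few threads the ALU idles while fetches are outstanding, and with enough threads the gaps fill in.

```python
# Toy latency-hiding model; LATENCY and ALU_OPS are illustrative
# assumptions, not measured hardware numbers.
LATENCY = 400   # assumed cycles for a memory fetch
ALU_OPS = 20    # assumed ALU cycles of work per thread per fetch

def alu_utilization(num_threads):
    """Fraction of cycles the ALU stays busy when a thread that hits a
    fetch is swapped out and another thread's ALU work fills the gap."""
    return min(1.0, num_threads * ALU_OPS / (LATENCY + ALU_OPS))

print(alu_utilization(1))    # a single thread leaves the ALU mostly idle
print(alu_utilization(64))   # enough threads hide the latency completely
```

The model makes the slide's point directly: single-thread (CPU-style) execution is capped far below peak by memory gaps, while a throughput machine buys 100% ALU utilization with thread count rather than with caches or prediction.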
[Slide diagram: PS/VS/filter resource and wavefront-status counters select which wavefronts execute; a vertex-fetch sequencer, a texture-fetch sequencer, SIMD engines (2 wavefronts each), and an output block]
- One wavefront is 64 threads
- Two wavefronts per SIMD are running; 16 processing elements per SIMD
- 10 SIMD engines, 20 program counters
- Once enough resources are available, a thread goes into the run queue
- 16 instructions finish per SIMD; each instruction is a 5-way VLIW
Threads in Run Queue
Each SIMD has 256 sets of registers, with 64 registers in a set (each holds 128 bits). If each thread needs 5 (128-bit) registers, then 256/5 = 51 wavefronts can get into the run queue. 51 wavefronts = 3,264 threads per SIMD, or 32,640 running or waiting threads. 256 × 64 × 10 = 163,840 vector registers; 256 × 64 × 10 × 4 = 655,360 (32-bit) registers.
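The occupancy arithmetic above can be checked directly. This sketch uses only the slide's own numbers; one register set supplies each of a wavefront's 64 threads with one register, so a thread needing 5 registers consumes 5 sets per wavefront.

```python
# Reproduce the slide's occupancy arithmetic using only its own numbers.
REG_SETS_PER_SIMD = 256   # register sets per SIMD
REGS_PER_SET = 64         # 128-bit registers per set: one per wavefront thread
SIMD_ENGINES = 10
WAVEFRONT = 64            # threads per wavefront

regs_per_thread = 5       # the slide's example shader
wavefronts = REG_SETS_PER_SIMD // regs_per_thread          # 51 wavefronts
threads_per_simd = wavefronts * WAVEFRONT                  # 3,264 threads
total_threads = threads_per_simd * SIMD_ENGINES            # 32,640 threads
vector_regs = REG_SETS_PER_SIMD * REGS_PER_SET * SIMD_ENGINES
regs_32bit = vector_regs * 4                               # 655,360 registers
```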
Implications
CPU: loads determine performance. The compiler works hard to use prefetch and other magic to reduce the time spent waiting for memory.
GPU: threads determine performance; minimize ALU code and reduce memory overhead. The compiler works hard to minimize ALU code, maximize threads, and reorder instructions to reduce synchronization, plus other magic to reduce the time spent waiting for threads.
Graphics programming model
Pipeline: input assembler → vertex shader → geometry shader → rasterizer → pixel shader → output merger
- Vertex shader: parallel loop over all input points
- Geometry shader: parallel loop over all shapes; combines points (vertices) into shapes
- Rasterizer: generates one thread per pixel in the shape (one round of nested parallelism)
- Pixel shader: parallel loop over all pixels
- Output merger: combines the outputs
- Multiple passes
Parallelism Model
All parallel operations are hidden behind domain-specific API calls. Developers write sequential code plus kernels; a kernel operates on one vertex or pixel. Developers never deal with parallelism directly, so there is no need for auto-parallelizing compilers.
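A minimal sketch of this model (hypothetical names; real kernels are shader programs, not Python functions): the developer supplies only the per-element kernel, and the API/runtime owns the parallel loop, so every iteration is independent and the hardware is free to run them all at once.

```python
# Hypothetical sketch of the hidden-parallelism model (names made up;
# real kernels are shader programs, not Python functions).
def shade_pixel(x, y):
    """Developer-written kernel: sees one pixel, never the loop."""
    return (x + y) % 256          # toy color value

def run_pixel_shader(kernel, width, height):
    """Stand-in for the API/runtime: the parallel loop lives here.
    Every iteration is independent, so hardware may run all at once."""
    return [[kernel(x, y) for x in range(width)] for y in range(height)]

image = run_pixel_shader(shade_pixel, 4, 2)   # 4x2 image of toy colors
```

Because the kernel cannot see or write any pixel but its own, races are impossible by construction, which is exactly the property the next slide's observations rely on.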
Observations
- Developers only write a small part of the program; the rest of the code comes from libraries
- 200-300 small kernels, each < 100 lines
- No race conditions are possible
- No error reporting (just keep going)
- Can view the program as serial (per vertex/shape/pixel)
- No developer knows the number of processors; not like pthreads
Result: lots of success, simple enough to program.
Power and cost have appeared
Vendors always release a family of cards (varying the number of SIMD engines per chip), so programs need to scale; avoid huge monolithic cores. Goal: the best performance comes from two GPU chips on a card, mid-range performance from one GPU, and the low end from just removing SIMD engines.
Change in the design metric (GPU Efficiency)
Processor Efficiency
[Graph: GFLOPS/watt and GFLOPS/$ vs. release date, 2003-2008, based on the historical performance of ATI Radeon™ GPUs]
Changes in the last generation ATI Radeon™ HD 38XX vs ATI Radeon™ HD 48XX
ATI Radeon™ HD 4870 GPU
- 800 stream processors = 160 × 5-way VLIW units = 10 SIMD cores
- New SIMD core layout
- New memory architecture
- Optimized render back-ends for faster anti-aliasing performance
- Enhanced geometry shader & tessellator performance
- 1.2 teraflops performance (2.4 teraflops for the ATI Radeon™ HD 4870 X2)
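A back-of-envelope check of the 1.2 teraflops figure. The 750 MHz engine clock is an assumption taken from the HD 4870 product specs, not stated on the slide; each stream processor is counted as one single-precision multiply-add per cycle.

```python
# Back-of-envelope check of the 1.2 teraflops figure (the 750 MHz engine
# clock is an assumption from the HD 4870 specs, not stated on the slide).
STREAM_PROCESSORS = 800          # = 160 VLIW units x 5 lanes = 10 SIMD cores
FLOPS_PER_SP_PER_CYCLE = 2       # one single-precision multiply-add
CLOCK_GHZ = 0.75                 # assumed engine clock

gflops = STREAM_PROCESSORS * FLOPS_PER_SP_PER_CYCLE * CLOCK_GHZ
print(gflops)                    # GFLOPS, i.e. ~1.2 teraflops
```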
ATI Radeon™ HD 4800 Series Architecture
[Block diagram: SIMD cores, texture units, GDDR5 memory interface, PCI Express bus interface, UVD & display controllers]
- SIMD cores: changed from 4 to 10
- Memory bandwidth: changed from 72 GB/s to 115 GB/s (GDDR3 to GDDR5)
What is new for GPGPU?
- Double precision: the 5-way VLIW lets pairs of units do a double add and 4 units do a double multiply (4 normal functional units + 1 fat unit)
- Local memory on the SIMD (communication)
- Global memory on chip (more communication)
- Better thread scheduling
- General scatter/gather operations
Percent programmable area
[Bar chart: programmable (pixel/combined, vertex, GPGPU) vs. fixed-function area for the ATI Radeon™ 9700, X1800, X1900, HD3850, and HD4850; data from ATI engineering]
Transitional applications
Written by graphics programmers, these apps' real connection with graphics is that the result is rendered and looks cute; they are really programming physics and AI. Evaluating physics, simulations, and artificial intelligence on the GPU is becoming an element of future game programs: massively parallel algorithm formulations combined with responsive gameplay and rendering.
How is software evolving?
Graphics APIs have added compute shaders:
- DX11 compute shaders: another stage in the pipeline
Some programming languages that are "sort of C":
- OpenCL, CUDA, Ct
Streaming languages:
- Brook+
Two transitional apps: Toyshop
Two transitional apps: Froblins
ToyShop Demo
Dynamic, realistic wave motion of interacting ripples over the water surface. Treat the water surface as a thin elastic membrane and simulate its response due to surface tension: a numeric solution to a PDE on the GPU.
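A minimal sketch of such a solver (the coefficients are assumptions; the demo's actual solver ran as a pixel shader over the whole surface at once): an explicit finite-difference step of a damped wave equation on a grid of heights.

```python
# Illustrative explicit finite-difference step for a thin elastic membrane
# (damped wave equation); coefficients below are assumptions, not the demo's.
def laplacian(h, x, y):
    """Discrete Laplacian of the height field at (x, y)."""
    return h[y][x-1] + h[y][x+1] + h[y-1][x] + h[y+1][x] - 4 * h[y][x]

def step(h, h_prev, c2dt2=0.2, damping=0.99):
    """One time step: h_new = damping * (2h - h_prev + c^2 dt^2 * lap(h))."""
    n = len(h)
    h_new = [row[:] for row in h]
    for y in range(1, n - 1):            # interior points; edges stay fixed
        for x in range(1, n - 1):
            h_new[y][x] = damping * (2 * h[y][x] - h_prev[y][x]
                                     + c2dt2 * laplacian(h, x, y))
    return h_new

n = 9
h_prev = [[0.0] * n for _ in range(n)]
h = [[0.0] * n for _ in range(n)]
h[4][4] = 1.0                            # drop a ripple in the middle
h = step(h, h_prev)                      # the disturbance starts to spread
```

Every grid point's update reads only its neighbors from the previous step, so each time step is one fully data-parallel pass, which is why this maps naturally onto a pixel shader.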
Rain on window
Rain on window
Physics-based movement of drops on the window surface. The droplet shape and motion are influenced by gravity, interfacial tension forces, and air resistance. The surface of the glass is represented by a lattice of cells.
Rain
Looks great, but do not try to predict rain using this random number technique.
Random number generator: a load from a small table, indexed by screen xy coordinate plus an offset that changes each frame. GPU generation allowed only 16 persistent outputs per thread, so a seed could not be saved across frames.
Reference: Tatarchuk, N., Isidoro, J. R. 2006. Artist-Directable Real-Time Rain Rendering in City Environments. In Proceedings of the Eurographics Workshop on Natural Phenomena, Vienna, Austria.
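A sketch of the stateless trick (the table contents and index mixing below are made up for illustration): since a thread cannot carry a seed across frames, "random" values come from a fixed table indexed by screen position plus the per-frame offset, so the same inputs always reproduce the same value.

```python
# Sketch of the stateless random trick; TABLE contents and the index
# mixing below are made up for illustration.
TABLE = [(i * 2654435761) % 997 / 997.0 for i in range(64)]  # toy noise table

def rain_random(x, y, frame_offset):
    """No per-thread seed survives between frames, so randomness is a
    pure function of screen position and a per-frame offset."""
    return TABLE[(x + 17 * y + frame_offset) % len(TABLE)]
```

This is also why the slide warns against using the technique for prediction: it is repeatable screen-space noise, not a statistically sound random stream.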
Programming
256 MB of memory on the card; aggressive compression to fit the texture data into 250 MB (originally 478 MB). ~500 small shader programs, about half for the rain:
- Misty objects in rain
- Halos around light sources
- Water surface simulation for ripples/splashes
- Streaming water
- Warped reflections
- ...
The ToyShop Team
Lead Artist: Dan Roeger
Lead Programmers: Natalya Tatarchuk and David Gosselin
Artists: Daniel Szecket, Eli Turner, and Abe Wiley
Engine / Shader Programming: John Isidoro, Dan Ginsburg, Thorsten Scheuermann, and Chris Oat
Producer: Lisa Close
Manager: Callan McInally
Froblin demo
Simulation and Rendering Massive Crowds of Intelligent and Detailed Creatures on GPU
A Smörgåsbord of Features
- Dynamic pathfinding AI computations on the GPU
- Massive crowd rendering with LOD management
- Tessellation for high-quality close-ups and stable performance
- HDR lighting and post-processing effects with gamma-correct rendering
- Terrain system
- Cascaded shadows for large-range environments
- Advanced global illumination system
- Actual 0.9 teraflops performance
Run Froblins demo
Tessellation is enabled by hardware support for limited nested parallelism
Pathfinding on GPU
Numerically solve a 2nd-order PDE on the GPU with an iterative computational approach (an eikonal solver):
- Represent the environment as a cost field
- Discretize the eikonal equation
Applicable to many general algorithms and areas.
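In the same spirit (a 4-neighbor sketch with made-up names, not the demo's actual discretization), the solver can be viewed as repeated data-parallel relaxation of a distance-to-goal field over the cost grid until it settles:

```python
# Sketch of an iterative grid solver in the spirit of the slide's eikonal
# approach: relax a distance-to-goal field over a per-cell cost grid.
# Names and the 4-neighbor stencil are illustrative assumptions.
INF = float('inf')

def solve_distance(cost, goal, sweeps=50):
    """Each sweep updates every cell from its neighbors; on a GPU every
    cell's update within a sweep is independent, i.e. data parallel."""
    h, w = len(cost), len(cost[0])
    d = [[INF] * w for _ in range(h)]
    d[goal[1]][goal[0]] = 0.0
    for _ in range(sweeps):
        for y in range(h):
            for x in range(w):
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < w and 0 <= ny < h:
                        d[y][x] = min(d[y][x], d[ny][nx] + cost[y][x])
    return d

cost = [[1.0] * 3 for _ in range(3)]      # uniform terrain cost
dist = solve_distance(cost, goal=(0, 0))  # agents walk downhill on this field
```

Agents then path-find by descending the resulting field, which is why one solve serves an entire crowd at once.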
Froblin Land: Terrain Rendering
- Smooth, crack-free LOD without degenerates
- Tessellation and instancing
- Leverages Direct3D® 10.1 functionality to help minimize memory footprint
- Complex material system
All slides © 2008 Advanced Micro Devices, Inc. Used with permission.
Froblin demo
512 MB of memory on card, aggressive compression to fit the data.
One giant shader program for AI and animation logic (executed per agent): 3,200 instructions.
- ~6-8 million triangles rendered each frame
- Atmospheric scattering to convey a sense of scene depth
- Cascaded shadows with soft shadow edges
- GPU scene management using stream-out
- High dynamic range imaging with post-processing for light blooms and tone mapping
- Characters are tessellated and displaced dynamically, giving higher detail near the viewer
More information about the Froblins demo
Reference: Shopf, J., Barczak, J., Oat, C., and Tatarchuk, N. 2008. March of the Froblins: simulation and rendering massive crowds of intelligent and detailed creatures on GPU. In ACM SIGGRAPH 2008 Classes (Los Angeles, California, August 11 - 15, 2008). SIGGRAPH '08.
Acknowledgements: Froblins
Game Computing Applications Group: Natalya Tatarchuk, Jeremy Shopf, Christopher Oat, Josh Barczak, Abe Wiley
Chaingun Studios, Exigent, Allegorithmic
What did it take to program these demos?
ToyShop:
- ~120 engineer and artist man-months
- 256 MB of video memory
- Over 500 individual shaders (explicit permutations)
Froblins:
- ~56 engineer and artist man-months
- 512 MB of video memory
- More flexible shader programming model: quicker development
- Many shaders, with more dynamic permutations
The Froblin character-control shader alone has around 3,200 lines of code!
Architecture/software issue: Improve the Video Interface
To lower power, the GPU has a dedicated processor that does video decode, offloading work from the CPU. The programmable cores are used for de-interlacing, scaling, color-space conversion, and composition. Today this is hand-coded; the challenge is to design a programming model for heterogeneous compute.
Challenge
Most of the successful parallel applications seem to have dedicated languages for limited domains (DirectX® 11, map-reduce, Sawzall): small programs can build interesting applications, programmers are not super experts in the hardware, and programs survive machine generations. Can you replicate this success in other domains?
Disclaimer and Attribution
DISCLAIMER The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION © 2008 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, CrossFireX, PowerPlay and Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.
Questions?