Don’t Dread Threads Orion Granatir Omar Rodriguez GDC 3/12/10 Agenda • Threading is worthwhile • Data decomposition is a good place to start • Think tasks!! • Intel tools.


Don’t Dread Threads
Orion Granatir
Omar Rodriguez
GDC 3/12/10
Agenda
• Threading is worthwhile
• Data decomposition is a good place to start
• Think tasks!!
• Intel tools help make things easy
2
Threading is important!!
3
Threading is required to maximize performance
[Chart: performance over time, from the GHz era to the multi-core era. Callout: multi-core needs parallel applications. Data points: 33 FPS and 104 FPS in our demo.]
4
Follow these steps to add threading…
1. Use data decomposition
2. Use tasks
5
Functional decomposition is limited
• Potential latency with pipelining
• Poor load balancing
• Doesn’t scale across varying core counts
Core
Core
Core
Core
8
Data decomposition can scale to n-cores
Core
Core
Core
Core
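To make "scale to n-cores" concrete, here is a minimal sketch of the arithmetic behind data decomposition: splitting a range of items evenly across however many workers are available. The helper name WorkerRange and its parameters are assumptions made for illustration; they are not from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not from the talk): returns the half-open index range
// [begin, end) that worker `worker` (0-based) should process when `total`
// items are split across `numWorkers` workers.
std::pair<std::size_t, std::size_t> WorkerRange(std::size_t total,
                                                std::size_t numWorkers,
                                                std::size_t worker)
{
    assert(numWorkers > 0 && worker < numWorkers);
    const std::size_t base  = total / numWorkers;  // items every worker gets
    const std::size_t extra = total % numWorkers;  // first `extra` workers get one more
    const std::size_t begin = worker * base + (worker < extra ? worker : extra);
    const std::size_t end   = begin + base + (worker < extra ? 1 : 0);
    return {begin, end};
}
```

Because the worker count is a parameter, the same code covers 2, 4, or 6 cores: with 10 items and 4 workers the ranges are [0,3), [3,6), [6,8), and [8,10).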
9
Big loops are ideal cases for data decomposition
// Loop through each AI
for( int Index = 0; Index < g_NumAI; Index++ )
{
    // Update each AI for this frame
    g_AI[ Index ].Update();
}
10
Minimize interactions
// Loop through each AI
for( int Index = 0; Index < g_NumAI; Index++ )
{
    // Update each AI for this frame
    g_AI[ Index ].Update();
}
AI 0
AI 1
Set m_HP to 10
11
Avoid locking
// Loop through each AI
for( int Index = 0; Index < g_NumAI; Index++ )
{
    // Update each AI for this frame
    g_AI[ Index ].Update();
}
AI 0
AI 1
Set m_HP to 10
13
Read global data, don’t write
// Loop through each AI
for( int Index = 0; Index < g_NumAI; Index++ )
{
    // Update each AI for this frame
    g_AI[ Index ].Update();
}
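One common way to follow "read global data, don't write" is to keep two copies of the state: every AI reads last frame's snapshot and writes only its own slot in the next one. The sketch below assumes a hypothetical AIState struct; m_HP echoes the slides, while m_Damage, UpdateOne, and UpdateAll are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical AI state; m_HP echoes the slides, m_Damage is an assumption.
struct AIState
{
    int m_HP;      // current hit points
    int m_Damage;  // damage this AI deals to every other AI per frame
};

// Computes the next state of AI `self`. It only READS the shared snapshot
// `prev` and returns a value, so it never writes to another AI's data.
AIState UpdateOne(const std::vector<AIState>& prev, std::size_t self)
{
    AIState next = prev[self];
    for (std::size_t other = 0; other < prev.size(); ++other)
    {
        if (other != self)
            next.m_HP -= prev[other].m_Damage;
    }
    return next;
}

// Each iteration writes only next[i], so iterations are independent and the
// loop could be split across threads without locks.
void UpdateAll(const std::vector<AIState>& prev, std::vector<AIState>& next)
{
    assert(&prev != &next);  // the snapshot and the output must be distinct
    next.resize(prev.size());
    for (std::size_t i = 0; i < prev.size(); ++i)
        next[i] = UpdateOne(prev, i);
}
```

Instead of AI 0 reaching into AI 1 and setting its m_HP, AI 1 works out its own new m_HP from what it can read. The cost is a second copy of the state and a swap at the end of each frame.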
14
OpenMP is a great way to get started
// Loop through each AI
#pragma omp parallel for
for( int Index = 0; Index < g_NumAI; Index++ )
{
    // Update each AI for this frame
    g_AI[ Index ].Update();
}
[Chart: speedup relative to serial. Labels: Serial, 6 Core, Algorithm. Values: 1.00x, 2.31x, ~12.0x.]
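For anyone who wants to try the pragma outside the demo, here is a self-contained sketch. The AI struct is a hypothetical stand-in for the class on the slide, and its Update() only touches its own data, which is what makes the loop safe to parallelize.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the AI class on the slides.
struct AI
{
    int m_UpdateCount = 0;
    void Update() { ++m_UpdateCount; }  // touches only this object's own data
};

void UpdateAllAI(std::vector<AI>& ai)
{
    const int numAI = static_cast<int>(ai.size());

    // OpenMP divides the iterations among its worker threads. If OpenMP is
    // not enabled at compile time, the pragma is ignored and the loop simply
    // runs serially.
    #pragma omp parallel for
    for (int index = 0; index < numAI; index++)
    {
        ai[index].Update();
    }
}
```

OpenMP has to be switched on in the compiler: -fopenmp for GCC and Clang, /openmp for MSVC. The loop body must not write to data shared between iterations, or the result is a data race.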
15
The next step is to use tasks
• Needed for load balancing (avoid oversubscription)
• Support large chunks of work
• Better utilization of cache
Core
Core
Core
Core
19
Tasks can be used to parallelize complex problems
Setup
Texture Lookup
Processing
Data Parallelism
20
Tasks can be arranged in a dependency graph
Setup
Texture Lookup
Processing
Data Parallelism
21
Dependency graph can be mapped to a thread pool
Core
Core
Core
Core
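The bookkeeping behind a dependency graph is small: each task counts its unfinished prerequisites and becomes ready when that count reaches zero. The sketch below runs the ready tasks serially to keep the logic visible; a real scheduler would hand each ready task to a pool thread instead. TaskNode, RunGraph, and the example graph shape are assumptions made for illustration, not the demo's code.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical task node: a name plus the indices of tasks that depend on it.
struct TaskNode
{
    std::string              m_Name;
    std::vector<std::size_t> m_Dependents;
};

// Returns the task names in an order that respects every dependency.
// "Running" a task here just records its name.
std::vector<std::string> RunGraph(const std::vector<TaskNode>& tasks)
{
    // Count how many unfinished prerequisites each task has.
    std::vector<std::size_t> pending(tasks.size(), 0);
    for (const TaskNode& task : tasks)
        for (std::size_t dependent : task.m_Dependents)
            ++pending[dependent];

    // Tasks with no prerequisites are ready immediately.
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < tasks.size(); ++i)
        if (pending[i] == 0)
            ready.push(i);

    std::vector<std::string> order;
    while (!ready.empty())
    {
        const std::size_t current = ready.front();
        ready.pop();
        order.push_back(tasks[current].m_Name);  // "run" the task

        // Finishing a task may make its dependents ready.
        for (std::size_t dependent : tasks[current].m_Dependents)
            if (--pending[dependent] == 0)
                ready.push(dependent);
    }
    assert(order.size() == tasks.size());  // fails if the graph has a cycle
    return order;
}
```

With a Setup task feeding two texture lookups that both feed a Processing task, the two lookups become ready at the same moment, which is exactly where a thread pool can run them on different cores.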
23
Think of a task as a unit of work
• It’s run on a thread pool
• It runs to completion
• It pays a heavy penalty if it blocks
• It’s an efficient way to avoid oversubscription
• It adapts to any number of threads/cores, regardless of CPU topology
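To show what "run on a thread pool" and "runs to completion" mean in practice, here is a minimal pool sketch using only the C++11 standard library. TaskPool and its methods are hypothetical names; this is not the pool used in the demo and it leaves out everything a production scheduler needs (work stealing, priorities, dependencies).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed set of workers drains a shared queue.
class TaskPool
{
public:
    explicit TaskPool(std::size_t numThreads)
    {
        assert(numThreads > 0);
        for (std::size_t i = 0; i < numThreads; ++i)
            m_Workers.emplace_back([this] { WorkerLoop(); });
    }

    ~TaskPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            m_Quit = true;
        }
        m_WorkReady.notify_all();
        for (std::thread& worker : m_Workers)
            worker.join();
    }

    // Queue a task; it will run to completion on one of the workers.
    void Submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            m_Tasks.push(std::move(task));
            ++m_Unfinished;
        }
        m_WorkReady.notify_one();
    }

    // Block the caller until every submitted task has finished.
    void WaitAll()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_AllDone.wait(lock, [this] { return m_Unfinished == 0; });
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_Mutex);
                m_WorkReady.wait(lock, [this] { return m_Quit || !m_Tasks.empty(); });
                if (m_Tasks.empty())
                    return;  // quitting and nothing left to run
                task = std::move(m_Tasks.front());
                m_Tasks.pop();
            }
            task();  // run outside the lock so other workers can dequeue
            {
                std::lock_guard<std::mutex> lock(m_Mutex);
                --m_Unfinished;
            }
            m_AllDone.notify_all();
        }
    }

    std::mutex                        m_Mutex;
    std::condition_variable           m_WorkReady;
    std::condition_variable           m_AllDone;
    std::queue<std::function<void()>> m_Tasks;
    std::size_t                       m_Unfinished = 0;
    bool                              m_Quit = false;
    std::vector<std::thread>          m_Workers;
};
```

The number of threads is fixed when the pool is created, typically from std::thread::hardware_concurrency() (which may return 0, so have a fallback), no matter how many tasks are submitted. That is how tasks avoid oversubscription. It is also why blocking inside a task is expensive: a blocked task ties up one of the few workers.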
24
Data decomposition makes defining tasks easy
// Update all AI
void UpdateAI( float DeltaTime )
{
    for( int Index = 0; Index < g_NumAI; Index++ )
    {
        // Update each AI for this frame
        g_AI[ Index ].Update();
    }
}
25
Data decomposition makes defining tasks easy
// Update all AI
void UpdateAI( float DeltaTime )
{
    // Determine the number of AI tasks we want to create,
    // rounding up so a final partial group is not skipped
    unsigned int AIGroups = ( g_NumAI + MAX_AI_PER_GROUP - 1 ) / MAX_AI_PER_GROUP;
    for( unsigned int Index = 0; Index < AIGroups; Index++ )
    {
        // Build the task specific data
        AITaskData* pData = new AITaskData();
        pData->m_Start = Index * MAX_AI_PER_GROUP;
        pData->m_DeltaTime = DeltaTime;
        // Submit task
        SubmitTask( Task_UpdateAI, (void*)pData );
    }
}
26
Individual tasks are run by the thread pool
void Task_UpdateAI( void* pTaskData )
{
    // Read data
    AITaskData* pData = (AITaskData*)pTaskData;
    unsigned int Start = pData->m_Start;
    unsigned int End = pData->m_Start + MAX_AI_PER_GROUP;
    // Cap End at the total number of AI
    End = ( End > g_NumAI ) ? g_NumAI : End;
    // Loop through all of our AI and update
    for( unsigned int Index = Start; Index < End; Index++ )
    {
        g_AI[ Index ].Update();
    }
    // Cleanup: the task owns the data allocated for it in UpdateAI
    delete pData;
}
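The slide code depends on SubmitTask and a thread pool that are not shown. As a rough, self-contained approximation of the same idea, the sketch below lets a fixed number of worker threads claim groups of AI from a shared atomic counter. The AI struct, UpdateAIInGroups, and the value chosen for MAX_AI_PER_GROUP are assumptions for illustration only.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the globals on the slides.
struct AI
{
    int m_UpdateCount = 0;
    void Update() { ++m_UpdateCount; }  // touches only this object's own data
};

const unsigned int MAX_AI_PER_GROUP = 64;  // assumed group size

// Updates every AI using `numWorkers` threads (including the caller). Each
// worker repeatedly claims the next unprocessed group, so faster workers
// naturally end up taking more groups.
void UpdateAIInGroups(std::vector<AI>& ai, unsigned int numWorkers)
{
    assert(numWorkers > 0);
    const unsigned int numAI = static_cast<unsigned int>(ai.size());
    // Round up so a final partial group is not dropped.
    const unsigned int numGroups = (numAI + MAX_AI_PER_GROUP - 1) / MAX_AI_PER_GROUP;

    std::atomic<unsigned int> nextGroup(0);

    auto worker = [&]()
    {
        for (;;)
        {
            const unsigned int group = nextGroup.fetch_add(1);
            if (group >= numGroups)
                return;  // no groups left
            const unsigned int start = group * MAX_AI_PER_GROUP;
            const unsigned int end   = std::min(start + MAX_AI_PER_GROUP, numAI);
            for (unsigned int index = start; index < end; index++)
                ai[index].Update();
        }
    };

    std::vector<std::thread> threads;
    for (unsigned int i = 0; i + 1 < numWorkers; ++i)
        threads.emplace_back(worker);
    worker();  // the calling thread helps too
    for (std::thread& t : threads)
        t.join();
}
```

Each group plays the role of one Task_UpdateAI call. Because groups are claimed on demand rather than assigned up front, a worker that gets cheap groups just picks up more of them, which is the load-balancing benefit tasks are meant to provide. Creating threads on every call is wasteful; a real engine would keep the workers alive in a pool.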
27
Intel Threading Building Blocks is a good fit for tasks
Intel® Threading Building Blocks (Intel® TBB) has a low-level API to create and process trees of work; each node is a task.
[Diagram: a task tree in which a Root spawns Task nodes, which spawn more. Labels: Spawn, Spawn & Wait, Wait, Callback. Blocking calls go down; continuations go up.]
28
Learn more about tasking…
Task-based Multithreading – How to Program for 100 Cores
Presented by Ron Fosner
Friday, March 12 @ 4:30PM
South 300
… or get Game Engine Gems 1* and read Brad Werth’s article.
29
* Other names and brands may be claimed as the property of others.
Time to look at our example…
30
Hotspots are good candidates for threading
• Use tools like Intel® VTune™ and Intel® Parallel Studio to locate hotspots.
• Intel® Parallel Studio inspector shows that Flock() is the main bottleneck. This is a good place to investigate threading.
32
Validate threading results with Parallel Amplifier
33
Use Parallel Amplifier to validate concurrency
• We have “ideal” CPU utilization for Flocking.
• Now we can start looking for other hotspots to optimize.
• There is still a lot of serial code…
36
Use Parallel Inspector to find threading errors
• Have a lot of system memory
• Use a reduced data set
• Workload should be repeatable
39
Use other tools as needed… I like Intel® GPA
• Intel® Graphics Performance Analyzer is designed for games.
• System Analyzer gives a complete view of system resources (CPU, GPU, Bus).
• Frame Analyzer allows you to dive into a DX frame.
• Platform View allows you to instrument code to analyze workload balance and execution time.
40
Conclusion
• Threading is required to maximize your game’s performance
• Use data decomposition to scale to n-cores
• Use tasks for load balancing and to be platform independent
• Use Intel tools to make your life easier
• Attend: “Task-based Multithreading – How to Program for 100 Cores” this Friday.
41
Contact Information
Email: [email protected]
[email protected]
http://www.intel.com/software/gdc
See Intel at GDC:
Intel Booth at Expo, North Hall
Intel Interactive Lounge
42
Other Sessions
A Visual Guide to Game and Task Performance
on Mass-market PC Game Platforms
Thursday, March 11 @ 4:30PM
North 122
Building Games for Netbooks
Friday, March 12 @ 9AM
South 310
Simpler Better Faster Vector
Friday, March 12 @ 1:30PM
North 122
43
Other Sessions
Tuning Your Game for Next Generation Intel
Graphics
Friday, March 12 @ 1:30PM
South 302
Task-based Multithreading – How to
Program for 100 Cores
Friday, March 12 @ 4:30PM
South 300
44
Please fill out an evaluation form
… it’ll help us win a bet
Thank you
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF
INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN
MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
Intel may make changes to specifications and product descriptions at any time, without notice.
All products, dates, and figures specified are preliminary based on current expectations, and are subject to
change without notice.
Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which
may cause the product to deviate from published specifications. Current characterized errata are available on
request.
Performance tests and ratings are measured using specific computer systems and/or components and reflect
the approximate performance of Intel products as measured by those tests. Any difference in system
hardware or software design or configuration may affect actual performance.
Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other
countries.
Any software source code reprinted in this document is furnished under a software license and may only be
used or copied in accordance with the terms of that license.
*Other names and brands may be claimed as the property of others.
Copyright © 2010 Intel Corporation.