Opportunistic Computing Technology
Multiple C[h]ores on Multicores: Beyond Parallel Computing
Santosh Pande
CERCS
College of Computing
Georgia Institute of Technology
Industry Trends
Very hard to make uni-processor systems faster
Industry has moved to Multi-core chips & platforms
Desktop: Intel Core 2 Duo (80+ cores by 2011)
Gaming: IBM/Sony/Toshiba Cell Processor (Sony PS3)
Potential for tremendously rich applications for users
Immersive worlds: visually stunning, realistic interactions
Soft real-time applications experiencing explosive growth
Multimedia: creation, editing, consumption
YouTube phenomenon: Everyone wants to create content
Fight for digital home entertainment hub: TV/gaming/music
Microsoft, Apple, Intel, Dell
Killer App: Interactive Multi-media
"You can expect games starting to take advantage of multi-core in late 2006 as games and engines also targeting next-generation consoles start making their way onto the PC."
-- Tim Sweeney, Co-Founder of Epic Games, primary driver behind the widely licensed Unreal game engine
-- From an interview with AnandTech in March 2005
The Catch
Multi-cores need parallel applications
Programmers need to find rigid structure to usefully parallelize
their applications
Programmer must explicitly manage all thread-interactions
Combinatorial growth in complexity
of thread-interactions
Current tools and programming semantics are
severely limiting the programmer
Management of multiple programming c[h]ores on multi-cores is full of caveats
Leads to poor programmer productivity
Low performance
Developers happily trade 20% performance for 5-10% productivity gains
Multi-threading Blues
"Writing multithreaded software is very hard; it's about as unnatural to support multithreading in C++ as it was to write object-oriented software in assembly language. … pretty clear that a new programming model is needed if we're going to scale to ever more parallel architectures. …"
-- Tim Sweeney, from an interview with AnandTech in 2005
Parallel Programming Revisited
Automatic Parallelization – Good but very limited
OpenMP: Does not really promote a "threaded way of thinking"
Parallelism Discovery: Conservative analysis
Mapping: Complex memory hierarchies/sharing
Not good at a whole-application level
Unsuitable for streaming computations from multiple media data sources
Mostly designed as an alternative for the advanced scientific programming community – iterative spaces, localized parallelism
PThreads
Good for threaded thinking, popular
Threads are a library
Can lead to lots of bugs – data races, deadlocks; no way to manage underlying resources
So what should be done?
A better threaded programming model
Should capture richer properties of shared
data than current transactional memory
Current research proposals focus on the "atomicity" part or on clock-based (barrier) synchronization – a step forward from lock-based solutions [PLDI 2006 papers, X10 HPC language]
Should allow capturing richer “semantic”
interactions between threads than atomicity
and synchronization
Thread contexts and states may help us reason better about thread interactions – but they are not exported
But this is expensive…
A lot of run-time information to be shared
between threads
contexts and states
A lot of run-time activity to be orchestrated
between threads
memory transactions
This can get very expensive and quickly get out of hand, especially if techniques have to scale
Why not use hardware assists?
Why not design mini-cores that compiler could
use to pack and ship information across?
Compiler-Architecture Synergy
Fundamental key to scalable solution
A Layered solution
Layers manage and scale well with respect to
amount of information
Layered solutions have worked e.g.
networking
Programmatic needs expressed/managed
through the compiler
Underlying mini-cores support dissemination
of information and its update
Thread interactions
Questions Raised:
- What level of interactions should be exported in the
programming model?
- What hardware support should be leveraged to share thread
state/contexts?
Critical to be analyzable
- How can I know what program point a thread is at?
Critical for real performance – shared resources
- How do I orchestrate between different performance needs
of different threads for shared resources? [Zhuang and
Pande, ACM PLDI 2004]
Critical for dynamic monitoring/fine tuning
- Can I throttle a thread more/less to meet a soft-real-time
goal? [Zhuang and Pande, ACM LCTES 2006]
Multi-threading/Context Switch
(Example – IXP network processor)

Thread code ("ctx" marks the context switch point):

    A = 1
    X = 6
    B = 4
    read Y
    read C
    D = A + B
    -- ctx --
    W = X - Y
    Z = 2 * W
    A = C + D
    write Z
    write A
    ...

ctx occurs only when a long-latency instruction is executed
Non-preemptive; all thread code is known and analyzable
Switches happen very frequently – about every 20 cycles
Lightweight context switch: only the PC is saved, takes 1 cycle
Our Approach
[Figure: a register file shared among Thread 1, Thread 2, Thread 3, and Thread 4]
What to Put in Shared Registers?
Live ranges placed in shared registers must not be live across context switches: upon a context switch they are already dead, so other threads can reuse those registers.
Categorize live ranges into two types, i.e., those live across context switches and those that are not; allocate them separately.
Runtime Constraints on Approaches

Two approaches: Fixed Context Switch (FCS) and Dynamic Context Switch (DCS)

Category          | Constraint                                | Approach
CPU Scheduling    | (Weighted) Round Robin—(W)RR              | FCS
CPU Scheduling    | Priority Sharing—PS                       | FCS
CPU Scheduling    | Real-time Scheduling: Rate Monotonic—RM   | FCS
CPU Scheduling    | Earliest Deadline First—EDF               | DCS
Packet Scheduling | Priority Class—PC                         | FCS
Packet Scheduling | First Come First Serve—FCFS               | DCS
Packet Scheduling | (Weighted) Fair Queueing—(W)FQ            | FCS
Self Scheduled Code
[Figure: code partitioned into segments C1–C5 interleaved with scheduling points P1–P5]
Conclusion
As the number of cores increases, the threading problem on multi-cores will become worse and worse.
Scalable solutions need to be devised that go well beyond current research approaches, which deal with atomicity and synchronization in STMs (mostly Java).
Solutions need to come from a threading model extending C++ and Pthreads if one is to benefit a large user community.
Exposing thread interactions gives the compiler control over resource sharing as well as scheduling.
Self-managed code is expected to scale better.