Opportunistic Computing Technology


Transcript: Opportunistic Computing Technology

Multiple C[h]ores on Multicores: Beyond Parallel Computing
Santosh Pande
CERCS
College of Computing
Georgia Institute of Technology
Industry Trends
- Very hard to make uni-processor systems faster
- Industry has moved to multi-core chips & platforms
  - Desktop: Intel Core 2 Duo (80+ cores by 2011)
  - Gaming: IBM/Sony/Toshiba Cell processor (Sony PS3)
- Potential for tremendously rich applications for users
  - Immersive worlds: visually stunning, realistic interactions
- Soft real-time applications experiencing explosive growth
  - Multimedia: creation, editing, consumption
  - YouTube phenomenon: everyone wants to create content
  - Fight for the digital home entertainment hub (TV/gaming/music):
    Microsoft, Apple, Intel, Dell
Killer App: Interactive Multimedia
"You can expect games starting to take advantage of multi-core in
late 2006 as games and engines also targeting next-generation
consoles start making their way onto the PC."
-- Tim Sweeney, co-founder of Epic Games and primary driver behind
the widely licensed Unreal game engine, from an interview with
AnandTech in March 2005
The Catch
- Multi-cores need parallel applications
  - Programmers need to find rigid structure to usefully
    parallelize their applications
  - Programmers must explicitly manage all thread interactions
- Combinatorial growth in the complexity of thread interactions
- Current tools and programming semantics severely limit the
  programmer
  - Management of multiple programming c[h]ores on multi-cores is
    full of caveats
  - Leads to poor programmer productivity and low performance
- Developers happily trade 20% performance for 5-10% productivity
  gains
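The "explicitly manage all thread interactions" burden can be made concrete with a minimal pthreads sketch (the names `worker_locked` and `run_two_threads` are hypothetical, not from the talk): two threads increment a shared counter, and it is entirely on the programmer to insert the mutex. Remove the lock/unlock pair and the read-modify-write sequences interleave, silently losing updates.

```c
#include <pthread.h>

#define INCREMENTS 100000

static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker_locked(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);    /* explicit interaction management */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return 0;
}

/* Runs two workers and returns the final count; with the mutex this
   is always 2 * INCREMENTS. Dropping the lock calls makes the result
   unpredictable -- the combinatorial interleavings the slide warns
   about. */
long run_two_threads(void) {
    counter = 0;
    pthread_t t1, t2;
    pthread_create(&t1, 0, worker_locked, 0);
    pthread_create(&t2, 0, worker_locked, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return counter;
}
```

Note that nothing in the language stops the unlocked variant from compiling; the correctness of every interleaving rests on programmer discipline alone.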
Multi-threading Blues
"Writing multithreaded software is very hard; it's about as
unnatural to support multithreading in C++ as it was to write
object-oriented software in assembly language. ... pretty clear
that a new programming model is needed if we're going to scale to
ever more parallel architectures. ..."
-- Tim Sweeney, from an interview with AnandTech in 2005
Parallel Programming Revisited
- Automatic parallelization: good but very limited
  - Parallelism discovery: conservative analysis
  - Mapping: complex memory hierarchies/sharing
  - Not good at the whole-application level
  - Unsuitable for streaming computations from multiple media data
    sources
- OpenMP: does not really promote a "threaded way of thinking"
  - Mostly designed as an alternative for the advanced scientific
    programming community (iterative spaces, localized parallelism)
- PThreads
  - Good for threaded thinking; popular
  - Threads are a library
  - Can lead to lots of bugs: data races, deadlocks, no way to
    manage underlying resources
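The "conservative analysis" limitation above can be illustrated with a sketch (the function name `scale` is hypothetical): an automatic parallelizer may only run loop iterations in parallel if it can prove them independent, and with plain pointers it must assume `dst` and `src` might overlap. C99's `restrict` qualifier is the programmer's promise that they do not.

```c
/* Without 'restrict', the compiler must assume dst and src may
   alias (e.g. dst == src + 1 creates a loop-carried dependence),
   so conservative analysis keeps the loop sequential. With
   'restrict', every iteration is provably independent and the loop
   is safely parallelizable or vectorizable. */
void scale(double *restrict dst, const double *restrict src,
           int n, double k) {
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

The point of the slide is that such proofs rarely succeed across a whole application, where data flows through many functions and media sources.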
So what should be done?
- A better threaded programming model
- Should capture richer properties of shared data than current
  transactional memory
  - Current research proposals focus on the "atomicity" part or on
    clock-based synchronization (barriers), a step forward from
    lock-based solutions [PLDI 2006 papers, X10 HPC languages]
- Should allow capturing richer "semantic" interactions between
  threads than atomicity and synchronization
- Thread contexts and states may help to reason better about
  thread interactions, but they are not exported
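One way to read "thread contexts and states are not exported" is that today a peer thread has no sanctioned way to ask where another thread is in its execution. A minimal sketch of exporting such state (all names here, `publish_phase`, `peek_phase`, `has_passed`, are hypothetical, not an API from the talk) could have each thread publish the program phase it has reached:

```c
#include <stdatomic.h>

#define MAX_THREADS 4

/* One published "program point" per thread; atomics make the
   publish/peek pair safe without locks. */
static atomic_int phase[MAX_THREADS];

void publish_phase(int tid, int p) {   /* called by the thread itself */
    atomic_store(&phase[tid], p);
}

int peek_phase(int tid) {              /* called by peers or a runtime */
    return atomic_load(&phase[tid]);
}

/* A peer can now reason semantically: has thread tid passed phase p? */
int has_passed(int tid, int p) {
    return peek_phase(tid) > p;
}
```

This captures only a single integer of context; the slide's argument is that exporting richer contexts and states quickly becomes expensive, which motivates the hardware assists discussed next.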
But this is expensive...
- A lot of run-time information to be shared between threads
  - contexts and states
- A lot of run-time activity to be orchestrated between threads
  - memory transactions
- This can get very expensive and quickly out of hand, especially
  if techniques have to scale
- Why not use hardware assists? Why not design mini-cores that the
  compiler could use to pack and ship information across?
Compiler-Architecture Synergy
- The fundamental key to a scalable solution
- A layered solution
  - Layers manage and scale well with respect to the amount of
    information
  - Layered solutions have worked before, e.g., in networking
- Programmatic needs expressed and managed through the compiler
- Underlying mini-cores support dissemination of information and
  its update
Thread Interactions
Questions raised:
- What level of interactions should be exported in the programming
  model?
- What hardware support should be leveraged to share thread
  state/contexts?
- Critical to be analyzable: how can I know what program point a
  thread is at?
- Critical for real performance (shared resources): how do I
  orchestrate between the different performance needs of different
  threads for shared resources? [Zhuang and Pande, ACM PLDI 2004]
- Critical for dynamic monitoring/fine tuning: can I throttle a
  thread more or less to meet a soft-real-time goal? [Zhuang and
  Pande, ACM LCTES 2006]
Multi-threading/Context Switch
(Example: IXP network processor)

Thread code:
    A = 1
    X = 6
    B = 4
    read Y
    read C
    D = A + B
    --- ctx ---
    W = X - Y
    Z = 2 * W
    A = C + D
    write Z
    write A
    ...

- Context switch (ctx) occurs only when a long-latency instruction
  is executed
- Non-preemptive; all thread code is known and analyzable
- Switches happen very frequently, about every 20 cycles
- Lightweight context switch: only the PC is saved, takes 1 cycle
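The execution model above can be sketched as a small simulation (a hypothetical model of the IXP behavior, not vendor code): threads run non-preemptively, yield only at long-latency memory operations, and the only context saved is the program counter.

```c
/* Instruction kinds in a thread's straight-line code. */
enum { OP_ALU, OP_MEM, OP_HALT };

typedef struct {
    const int *ops;   /* the thread's code: a sequence of op kinds */
    int pc;           /* the ONLY saved context, as on the IXP     */
    int done;
} Thread;

/* Round-robin over threads; each runs until its next long-latency
   (OP_MEM) instruction or until it halts. Returns the number of
   context switches taken. */
int run(Thread *ts, int n) {
    int switches = 0, live = n, cur = 0;
    while (live > 0) {
        Thread *t = &ts[cur];
        if (!t->done) {
            for (;;) {
                int op = t->ops[t->pc];
                if (op == OP_HALT) { t->done = 1; live--; break; }
                t->pc++;              /* retire the instruction */
                if (op == OP_MEM) {   /* long latency: yield here */
                    switches++;
                    break;
                }
            }
        }
        cur = (cur + 1) % n;          /* hand off to the next thread */
    }
    return switches;
}
```

Because switch points are fixed by the code itself (every `OP_MEM`), a compiler that sees all thread code can statically enumerate every interleaving point, which is what makes the model analyzable.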
Our Approach
(Figure: a register file shared among Thread 1, Thread 2, Thread 3,
and Thread 4)

What to put in shared registers?
- Live ranges placed in shared registers must not be used across
  context switches: upon a context switch they are already dead, so
  other threads can use those registers.
- Categorize live ranges into two types, i.e., those live across
  context switches and those that are not; allocate them
  separately.
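The categorization above reduces to one test per live range (a minimal sketch; the `LiveRange` type and `crosses_ctx_switch` name are hypothetical): since switch points are statically known in this non-preemptive model, we can check whether any of them falls strictly inside a value's live range.

```c
/* A live range spans from its defining instruction to its last use
   (instruction indices in the thread's straight-line code). */
typedef struct { int def, last_use; } LiveRange;

/* ctx[] holds the instruction indices after which a context switch
   occurs. A range that covers any switch point must stay in a
   thread-private register; a range dead by every switch may go in
   a shared register. */
int crosses_ctx_switch(LiveRange lr, const int *ctx, int nctx) {
    for (int i = 0; i < nctx; i++)
        if (lr.def <= ctx[i] && ctx[i] < lr.last_use)
            return 1;   /* live across a switch: thread-private */
    return 0;           /* shareable across threads */
}
```

With the IXP example's ctx after `D = A + B`, a value like `B` (defined early, last used at `D = A + B`) is shareable, while `A` (redefined and written out after the switch) is not.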
Runtime Constraints on Approaches

Two approaches:
- Fixed Context Switch (FCS)
- Dynamic Context Switch (DCS)

Category           Constraint                         Approach
CPU Scheduling     (Weighted) Round Robin ((W)RR)     FCS
                   Priority Sharing (PS)              FCS
                   Rate Monotonic real-time (RM)      FCS
                   Earliest Deadline First (EDF)      DCS
Packet Scheduling  Priority Class (PC)                FCS
                   First Come First Serve (FCFS)      DCS
                   (Weighted) Fair Queueing ((W)FQ)   FCS
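The FCS/DCS distinction can be sketched in a few lines (hypothetical helper names, assuming FCS means the switch order is fixed ahead of time while DCS picks the next thread from runtime state): a fixed rotation serves policies like round robin, whereas EDF needs a runtime decision over current deadlines.

```c
/* Fixed Context Switch: the successor is determined statically,
   independent of runtime state (fits (W)RR / RM-style policies). */
int next_fcs(int cur, int nthreads) {
    return (cur + 1) % nthreads;
}

/* Dynamic Context Switch: the successor is chosen at runtime, here
   by Earliest Deadline First over each thread's current deadline. */
int next_dcs_edf(const int *deadline, int nthreads) {
    int best = 0;
    for (int i = 1; i < nthreads; i++)
        if (deadline[i] < deadline[best])
            best = i;
    return best;
}
```

The cost asymmetry is visible even at this scale: FCS is a constant-time increment the hardware can hard-wire, while DCS must inspect per-thread state on every switch.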
Self Scheduled Code
(Figure: self-scheduled code blocks C1 through C5 interleaved with
points P1 through P5)
Conclusion
- As the number of cores increases, the threading problem on
  multi-cores will become worse and worse.
- Scalable solutions need to be devised that go well beyond current
  research approaches dealing with atomicity and synchronization in
  STMs (mostly Java).
- Solutions need to come from a threading model extending C++ and
  Pthreads if one is to benefit a large user community.
- Exposing thread interactions gives the compiler control over
  resource sharing as well as scheduling.
- Self-managed code is expected to scale better.