Many-Core Software
Burton Smith
Microsoft
Computing is at a Crossroads
Continual performance improvement is our field’s lifeblood
  It encourages people to buy new hardware
  It opens up new software possibilities
Single-thread performance is nearing the end of the line
But Moore’s Law will continue for some time to come
What can we do with all those transistors?
Computation needs to become as parallel as possible
  Henceforth, serial means slow
Systems must support general purpose parallel computing
  The alternative is commoditization
New many-core chips will need new software
  Our programming models will have to change
  The von Neumann premise is broken
The von Neumann Premise
Simply put, “instruction instances are totally ordered”
This notion has created artifacts:
  Variables
  Interrupts
  Demand paging
And caused major problems:
  The ILP wall
  The power wall
  The memory wall
What software changes will we need for many-core?
  New languages?
  New approaches for compilers, runtimes, tools?
  New (or perhaps old) operating system ideas?
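As an illustration of why the premise matters, the toy simulation below (all names hypothetical, not from the talk) runs the same pair of read-increment-write sequences under two interleavings: once instruction instances are no longer totally ordered, the final value depends on the schedule.

```python
# Two "instruction streams" each read a shared counter, increment it, and
# write it back. Under the von Neumann premise their instructions would be
# totally ordered; with parallelism, interleavings change the result.

def run(interleaving):
    """Execute two read-then-write sequences in the order given by
    `interleaving` (a list of thread ids); return the final counter."""
    counter = 0
    regs = {0: None, 1: None}  # per-thread register holding the read value
    pc = {0: 0, 1: 0}          # per-thread step: 0 = read, 1 = write
    for t in interleaving:
        if pc[t] == 0:
            regs[t] = counter        # read the shared counter
        else:
            counter = regs[t] + 1    # write back the incremented value
        pc[t] += 1
    return counter

serial = run([0, 0, 1, 1])  # thread 0 finishes before thread 1 starts
racy   = run([0, 1, 0, 1])  # the reads overlap, so one increment is lost
# serial == 2, racy == 1
```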
Do We Really Need New Languages?
Mainstream languages schedule values into variables
Introducing parallelism exposes weaknesses in:
  Passing values between unordered instructions
  Updating state consistently
Our “adhesive bandage” attempts have proven insufficient
  To orchestrate the flow of values in the program
  To incrementally but consistently update state
  Not general enough
  Not productive enough
So my answer is “Absolutely!”
Parallel Programming Languages
There are (at least) two promising approaches:
  Functional programming
  Atomic memory transactions
Neither is completely satisfactory by itself
  Functional programs don’t allow mutable state
  Transactional programs implement data flows awkwardly
Many people think functional languages must be inefficient
  Sisal and NESL are excellent counterexamples
  Both competed strongly with Fortran on Cray systems
Others think memory transactions must be inefficient also
  This remains to be seen; we have only just begun to optimize
Database applications show the synergy of these two ideas
  SQL is a “mostly functional” language
  Transactions allow Consistency via Atomicity and Isolation
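A minimal sketch of the two approaches working together, using a lock as a stand-in for an atomic memory transaction (the names `process`, `commit`, and the order data are illustrative, not from the talk): the per-partition computation is purely functional, and only the commit touches shared mutable state.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

totals = {}           # shared mutable state
totals_lock = Lock()  # stand-in for an atomic memory transaction

def process(orders):
    """Functional part: a pure computation over one partition of orders;
    it mutates no shared state."""
    summary = {}
    for customer, amount in orders:
        summary[customer] = summary.get(customer, 0) + amount
    return summary

def commit(summary):
    """Transactional part: merge one summary into shared state atomically.
    Atomicity and Isolation give Consistency, as in a database."""
    with totals_lock:
        for customer, amount in summary.items():
            totals[customer] = totals.get(customer, 0) + amount

def worker(partition):
    commit(process(partition))

partitions = [[("ann", 3), ("bob", 1)], [("ann", 2)]]
with ThreadPoolExecutor() as pool:
    list(pool.map(worker, partitions))
# totals == {"ann": 5, "bob": 1}
```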
Transactions and Invariants
Invariants are a program’s conservation laws
Relationships among values in iteration and recursion
Rules of data structure (state) integrity
If statements p and q preserve the invariant I and they do not
“interfere”, their parallel composition { p || q } also preserves I †
If p and q are performed atomically, i.e. as transactions, then they
will not interfere ‡
Although operations seldom commute with respect to state,
transactions give us commutativity with respect to the invariant
It would help if the invariants were available to the compiler
Can we ask programmers to supply them?
† Susan Owicki and David Gries. Verifying properties of parallel programs:
An axiomatic approach. CACM 19(5):279−285, May 1976.
‡ Leslie Lamport and Fred Schneider. The “Hoare Logic” of CSP, And All That.
ACM TOPLAS 6(2):281−296, Apr. 1984.
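The Owicki-Gries point can be made concrete with a toy example (a lock standing in for the transaction mechanism; the bank-account setting is illustrative): each transfer preserves the invariant that balances sum to 100, so the parallel composition of two transfers preserves it too.

```python
from threading import Lock, Thread

accounts = {"a": 50, "b": 50}
lock = Lock()  # each transfer runs atomically, i.e. as a transaction

def invariant():
    # The conservation law I: total money never changes
    return sum(accounts.values()) == 100

def transfer(src, dst, amount):
    # p and q are each a transfer; performed atomically, they do not
    # interfere, so { p || q } also preserves I
    with lock:
        accounts[src] -= amount   # I is briefly false here...
        accounts[dst] += amount   # ...and restored here
        assert invariant()        # I holds at every transaction boundary

p = Thread(target=transfer, args=("a", "b", 10))
q = Thread(target=transfer, args=("b", "a", 5))
p.start(); q.start(); p.join(); q.join()
assert invariant()
# accounts == {"a": 45, "b": 55}
```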
Styles of Parallelism
We probably need to support multiple programming styles
  Both functional and transactional
  Both data parallel and task parallel
  Both message passing and shared memory
  Both declarative and imperative
  Both implicit and explicit
We may need several languages to accomplish this
  After all, we do use multiple languages today
  Language interoperability (e.g. .NET) will help greatly
It is essential that parallelism be exposed to the compiler
  So that the compiler can adapt it to the target system
It is also essential that locality be exposed to the compiler
  For the same reason
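A small sketch of two of these styles side by side (illustrative only): the same thread pool can serve a data-parallel map, one operation over a whole collection, and a pair of independent task-parallel jobs.

```python
from concurrent.futures import ThreadPoolExecutor

data = [1, 2, 3, 4]

with ThreadPoolExecutor() as pool:
    # Data parallel: one operation applied across a collection
    squares = list(pool.map(lambda x: x * x, data))

    # Task parallel: distinct, independent tasks running concurrently
    total = pool.submit(sum, data)
    largest = pool.submit(max, data)
    results = (total.result(), largest.result())

# squares == [1, 4, 9, 16] and results == (10, 4)
```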
Compiler Optimization for Parallelism
Some say automatic parallelization is a demonstrated failure
  What failed is parallelism discovery, especially in-the-large
  Dependence analysis is chiefly a local success
Vectorizing and parallelizing compilers (especially for the right architecture) have been a tremendous success
  They have enabled machine-independent languages
  What they do can be termed parallelism packaging
  Even manifestly parallel programs need it
Locality analysis is another word for dependence analysis
  Locality discovery in-the-large has also been a non-starter
Local locality packaging works pretty well
  The jury is still out on in-the-large locality packaging
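Parallelism packaging, at its simplest, means chunking an iteration space to match the target's granularity. The helper below (the name `package` is hypothetical) is one minimal way to sketch what a compiler or runtime does when it packages a manifestly parallel loop for a fixed number of workers.

```python
def package(n_iterations, n_workers):
    """Split an iteration space into contiguous, nearly equal chunks,
    one per worker. Any leftover iterations go to the first workers."""
    base, extra = divmod(n_iterations, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

# package(10, 4) -> chunks of sizes 3, 3, 2, 2 covering 0..9
```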
Fine-grain Parallelism
Exploitable parallelism grows as task granularity shrinks
  But dependences among tasks become more numerous
Inter-task dependence enforcement demands scheduling
  A task needing a value from elsewhere must wait for it
User-level work scheduling is needed
  No privilege change to stop or restart a task
  Locality (e.g. cache content) can be better preserved
Today’s OSes and hardware don’t encourage waiting
  OS thread preemption makes blocking dangerous
  Instruction sets encourage non-blocking approaches
  Busy-waiting wastes instruction issue opportunities
We need better support for blocking synchronization
  In both instruction set and operating system
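One way to sketch user-level scheduling with blocking synchronization (a toy, using Python generators as tasks and a hypothetical `Future` type; none of this is from the talk): a task that needs a value that isn't ready parks itself on the future's wait list, with no OS thread, no privilege change, and no busy-waiting.

```python
from collections import deque

class Future:
    def __init__(self):
        self.value, self.done, self.waiters = None, False, []

class Scheduler:
    """Toy user-level scheduler: tasks are generators that yield a Future
    when they need a value that is not ready yet."""
    def __init__(self):
        self.ready = deque()

    def spawn(self, task):
        self.ready.append(task)

    def resolve(self, fut, value):
        fut.value, fut.done = value, True
        self.ready.extend(fut.waiters)   # wake blocked tasks, no spinning
        fut.waiters.clear()

    def run(self):
        while self.ready:
            task = self.ready.popleft()
            try:
                fut = task.send(None)    # run until the task awaits a Future
            except StopIteration:
                continue                 # task finished
            if fut.done:
                self.ready.append(task)  # value already available
            else:
                fut.waiters.append(task) # block entirely at user level

sched = Scheduler()
f, result = Future(), []

def consumer():
    while not f.done:
        yield f                  # wait for the producer's value
    result.append(f.value * 2)

def producer():
    yield from ()                # no suspensions; just marks this as a task
    sched.resolve(f, 21)

sched.spawn(consumer())
sched.spawn(producer())
sched.run()
# result == [42]
```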
Resource Management Consequences
Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
  An asynchronous OS API is a necessary corollary
  The user-exposed API should be synchronous
Scheduling memory via demand paging is also problematic
Instead, the application and OS should negotiate
  The application tells the OS its resource needs & desires
  The OS makes decisions based on the big picture:
    Requirements for quality of service
    Availability of resources
    Appropriateness of power level
Resources should be time- and space-shared in chunks
The OS can preempt resources to reclaim them
  But with notification, so the application can rearrange work
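The negotiation idea might be sketched as follows (a hypothetical `ToyOS` API, not any real interface): the application states a minimum need and a desired amount, the OS grants what the big picture allows, and preemption always notifies the application first so it can rearrange its work.

```python
class ToyOS:
    """Hypothetical negotiated resource manager for one resource (cores)."""
    def __init__(self, cores):
        self.free = cores
        self.grants = {}  # app -> (cores granted, notify callback)

    def request(self, app, need, desire, notify):
        """The app states its needs & desires; the OS grants what
        availability allows, or nothing if the need cannot be met."""
        grant = min(desire, self.free) if self.free >= need else 0
        if grant:
            self.free -= grant
            self.grants[app] = (grant, notify)
        return grant

    def preempt(self, app, cores):
        """Reclaim cores, but with notification so the app can adapt."""
        granted, notify = self.grants[app]
        taken = min(cores, granted)
        notify(taken)                        # app rearranges work first
        self.grants[app] = (granted - taken, notify)
        self.free += taken
        return taken

os_ = ToyOS(cores=8)
events = []
got = os_.request("app1", need=2, desire=6, notify=events.append)
# got == 6; later the OS reclaims two cores, notifying the application:
os_.preempt("app1", 2)
# events == [2] and os_.free == 4
```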
Bin Packing
[Figure: resource allocations drawn as blocks on axes of quantity of resource versus time]
The more resources allocated, the more swapping overhead
  It would be nice to amortize it
  The more resources you get, the longer you may keep them
Roughly, this means scheduling = packing squarish blocks
  QOS applications might need long rectangles instead
When the blocks don’t fit, the OS can morph them a little
  Or cut corners when absolutely necessary
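A greedy first-fit sketch of this packing (illustrative only; a real scheduler would be far more sophisticated): each job is a block of some core count and duration, placed at the earliest start time at which enough cores stay free for its whole duration.

```python
def schedule(jobs, total_cores):
    """Place each (cores, duration) block at the earliest start time where
    `cores` are free for the entire duration. Returns (start, cores,
    duration) triples in input order. Time advances in unit steps."""
    placed = []

    def free_at(t):
        return total_cores - sum(c for s, c, d in placed if s <= t < s + d)

    for cores, duration in jobs:
        t = 0
        while any(free_at(u) < cores for u in range(t, t + duration)):
            t += 1   # slide the block right until it fits
        placed.append((t, cores, duration))
    return placed

# Two 4-core/2-step blocks fill an 8-core machine; the 2-core/4-step
# block must start after they finish:
plan = schedule([(4, 2), (4, 2), (2, 4)], total_cores=8)
# plan == [(0, 4, 2), (0, 4, 2), (2, 2, 4)]
```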
Parallel Debugging and Tuning
Today, debugging relies on single-stepping and printf()
  Single-stepping a parallel program is a bit less effective
Conditional program and data breakpoints are helpful
  To stop when an invariant fails to be true
Debugging is data mining
  A good way is to log perf counters and a timestamp at events
  Support for ad-hoc data perusal is also very important
Serial program tuning tries to discover where the program counter spends its time
  The answer is usually found by sampling the PC
In contrast, parallel program tuning tries to discover where there is insufficient parallelism
Visualization is a big deal for both debugging and tuning
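The logging idea can be sketched as a tiny event log (the `EventLog` class and its counter names are hypothetical): each event records a timestamp plus a snapshot of counters, and "debugging is data mining" becomes ad-hoc queries over the records.

```python
from time import perf_counter_ns

class EventLog:
    """Log a timestamp and a snapshot of 'performance counters' at each
    named event; tuning and debugging then query the log after the run."""
    def __init__(self):
        self.records = []
        self.counters = {"tasks_run": 0, "tasks_blocked": 0}

    def bump(self, counter):
        self.counters[counter] += 1

    def event(self, name):
        self.records.append((perf_counter_ns(), name, dict(self.counters)))

log = EventLog()
log.event("start")
log.bump("tasks_run")
log.bump("tasks_blocked")
log.event("after_step")

# Ad-hoc perusal: how many tasks had blocked by each event?
blocked = [snap["tasks_blocked"] for _, _, snap in log.records]
# blocked == [0, 1]
```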
Conclusions
It is time to rethink some of the basics
There is lots of work for everyone to do
  I’ve left out lots of things, e.g. applications
We need basic research as well as industrial development
  Research in computer systems is deprecated these days
  In the USA, NSF and DOD need to take the initiative