Many-Core Software
Burton Smith
Microsoft
1
Computing is at a Crossroads

- Continual performance improvement is our field’s lifeblood
  - It encourages people to buy new hardware
  - It opens up new software possibilities
- Single-thread performance is nearing the end of the line
  - Henceforth, serial means slow
- But Moore’s Law will continue for some time to come
  - What can we do with all those transistors?
- Computation needs to become as parallel as possible
  - Systems must support general purpose parallel computing
  - The alternative is commoditization
- New many-core chips will need new software
  - Our programming models will have to change
  - The von Neumann premise is broken
2
The von Neumann Premise

- Simply put, “instruction instances are totally ordered”
- This notion has created artifacts:
  - Variables
  - Interrupts
  - Demand paging
- And caused major problems:
  - The ILP wall
  - The power wall
  - The memory wall
- What software changes will we need for many-core?
  - New languages?
  - New approaches for compilers, runtimes, tools?
  - New (or perhaps old) operating system ideas?
3
Do We Really Need New Languages?

- Mainstream languages schedule values into variables
  - To orchestrate the flow of values in the program
  - To incrementally but consistently update state
- Introducing parallelism exposes weaknesses in:
  - Passing values between unordered instructions
  - Updating state consistently (see the sketch below)
- Our “adhesive bandage” attempts have proven insufficient
  - Not general enough
  - Not productive enough
- So my answer is “Absolutely!”
4
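To make the weakness concrete, here is a minimal sketch (mine, not from the talk) of what goes wrong when state is updated without coordination: two threads perform a non-atomic read-modify-write on a shared variable, so increments can be lost. Haskell is used only as convenient notation; the iteration counts are illustrative.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM_)
import Data.IORef (newIORef, readIORef, writeIORef)

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  done <- newEmptyMVar
  let bump = replicateM_ 100000 $ do
        v <- readIORef counter        -- read ...
        writeIORef counter (v + 1)    -- ... then write: not atomic
  _ <- forkIO (bump >> putMVar done ())
  bump
  takeMVar done                       -- wait for the other thread
  readIORef counter >>= print         -- may print less than 200000
```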
Parallel Programming Languages

- There are (at least) two promising approaches:
  - Functional programming
  - Atomic memory transactions (see the sketch below)
- Neither is completely satisfactory by itself
  - Functional programs don’t allow mutable state
  - Transactional programs implement data flows awkwardly
- Data base applications show synergy of these two ideas
  - SQL is a “mostly functional” language
  - Transactions allow Consistency via Atomicity and Isolation
- Many people think functional languages must be inefficient
  - Sisal and NESL are excellent counterexamples
  - Both competed strongly with Fortran on Cray systems
- Others think memory transactions must be inefficient also
  - This remains to be seen; we have only just begun to optimize
5
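As a hedged sketch of the atomic-memory-transactions approach (assuming GHC and its stm package; the counts are again illustrative), the same shared counter becomes race-free when each read-modify-write is a single transaction:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM (atomically, modifyTVar', newTVarIO, readTVarIO)
import Control.Monad (replicateM_)

main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)
  done <- newEmptyMVar
  let bump = replicateM_ 100000 $ atomically (modifyTVar' counter (+ 1))
  _ <- forkIO (bump >> putMVar done ())
  bump
  takeMVar done
  readTVarIO counter >>= print        -- always 200000: no lost updates
```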
Transactions and Invariants

- Invariants are a program’s conservation laws:
  - Relationships among values in iteration and recursion
  - Rules of data structure (state) integrity
- If statements p and q preserve the invariant I and they do not “interfere”, their parallel composition { p || q } also preserves I †
- If p and q are performed atomically, i.e. as transactions, then they will not interfere ‡
- Although operations seldom commute with respect to state, transactions give us commutativity with respect to the invariant
- It would help if the invariants were available to the compiler
  - Can we ask programmers to supply them? (see the sketch below)

† Susan Owicki and David Gries. Verifying properties of parallel programs: An axiomatic approach. CACM 19(5):279–285, May 1976.
‡ Leslie Lamport and Fred Schneider. The “Hoare Logic” of CSP, And All That. ACM TOPLAS 6(2):281–296, Apr. 1984.
6
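A hedged sketch of “commutativity with respect to the invariant” (again assuming GHC’s stm package; the accounts and amounts are made up): the two transfers below do not commute with respect to the raw balances, but each is atomic and each preserves the invariant that the sum of the balances is constant, so their parallel composition preserves it too.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM

-- Each transfer preserves the invariant: the sum of the two balances is unchanged.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to n = do
  modifyTVar' from (subtract n)
  modifyTVar' to (+ n)

main :: IO ()
main = do
  a <- newTVarIO (100 :: Int)
  b <- newTVarIO (100 :: Int)
  done <- newEmptyMVar
  _ <- forkIO (atomically (transfer a b 30) >> putMVar done ())  -- p
  atomically (transfer b a 10)                                   -- q, in parallel
  takeMVar done
  total <- atomically ((+) <$> readTVar a <*> readTVar b)
  print total    -- the invariant holds: always 200
```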
Styles of Parallelism

- We probably need to support multiple programming styles
  - Both functional and transactional
  - Both data parallel and task parallel (see the sketch below)
  - Both message passing and shared memory
  - Both declarative and imperative
  - Both implicit and explicit
- We may need several languages to accomplish this
  - After all, we do use multiple languages today
  - Language interoperability (e.g. .NET) will help greatly
- It is essential that parallelism be exposed to the compiler
  - So that the compiler can adapt it to the target system
- It is also essential that locality be exposed to the compiler
  - For the same reason
7
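As one small illustration of two styles from the list, here is a sketch assuming GHC with the parallel and async packages; the workloads are placeholders, not from the talk.

```haskell
import Control.Concurrent.Async (concurrently)
import Control.Exception (evaluate)
import Control.Parallel.Strategies (parMap, rdeepseq)

expensive :: Int -> Int
expensive n = sum [1 .. n]    -- stand-in for real work

main :: IO ()
main = do
  -- Data parallel: one pure function over many data; the runtime packages the work.
  print (sum (parMap rdeepseq expensive [100000, 200000 .. 1000000]))
  -- Task parallel: two distinct activities composed explicitly.
  (x, y) <- concurrently (evaluate (expensive 300000))
                         (evaluate (expensive 700000))
  print (x + y)
```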
Compiler Optimization for Parallelism

- Some say automatic parallelization is a demonstrated failure
  - Vectorizing and parallelizing compilers (especially for the right architecture) have been a tremendous success
  - They have enabled machine-independent languages
  - What they do can be termed parallelism packaging
  - Even manifestly parallel programs need it
- What failed is parallelism discovery, especially in-the-large
  - Dependence analysis is chiefly a local success
- Locality discovery in-the-large has also been a non-starter
  - Locality analysis is another word for dependence analysis
- The jury is still out on in-the-large locality packaging
  - Local locality packaging works pretty well
8
Fine-grain Parallelism

- Exploitable parallelism grows as task granularity shrinks
  - But dependences among tasks become more numerous
- Inter-task dependence enforcement demands scheduling
  - A task needing a value from elsewhere must wait for it
- User-level work scheduling is needed
  - No privilege change to stop or restart a task
  - Locality (e.g. cache content) can be better preserved
- Today’s OSes and hardware don’t encourage waiting
  - OS thread preemption makes blocking dangerous
  - Instruction sets encourage non-blocking approaches
  - Busy-waiting wastes instruction issue opportunities
- We need better support for blocking synchronization (see the sketch below)
  - In both instruction set and operating system
9
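A hedged sketch of user-level blocking synchronization: the consuming task below blocks on an MVar instead of busy-waiting, and GHC’s lightweight threads let the runtime schedule other work in the meantime. MVar stands in for the kind of full/empty synchronization the slide argues for; it is not the talk’s specific proposal.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

main :: IO ()
main = do
  result <- newEmptyMVar
  _ <- forkIO $ do                 -- producer task
    threadDelay 100000             -- pretend to compute for 0.1 s
    putMVar result (42 :: Int)
  -- Consumer task: takeMVar blocks only this lightweight thread;
  -- the user-level scheduler keeps the core busy with other tasks.
  value <- takeMVar result
  print value
```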
Resource Management Consequences

- Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
  - An asynchronous OS API is a necessary corollary
  - The user-exposed API should be synchronous (see the sketch below)
- Scheduling memory via demand paging is also problematic
- Instead, the application and OS should negotiate
  - The application tells the OS its resource needs & desires
  - The OS makes decisions based on the big picture:
    - Requirements for quality of service
    - Availability of resources
    - Appropriateness of power level
- The OS can preempt resources to reclaim them
  - But with notification, so the application can rearrange work
- Resources should be time- and space-shared in chunks
10
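A hedged sketch of “asynchronous OS API underneath, synchronous user API on top”: asyncRead below is a made-up, callback-style system call (not a real OS interface), and syncRead wraps it so the caller simply blocks as a lightweight task until the result arrives.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Hypothetical asynchronous OS call: start the operation, invoke the
-- callback when it completes.
asyncRead :: FilePath -> (String -> IO ()) -> IO ()
asyncRead path callback =
  () <$ forkIO (threadDelay 50000 >> callback ("contents of " ++ path))

-- The synchronous API the application programmer sees.
syncRead :: FilePath -> IO String
syncRead path = do
  slot <- newEmptyMVar
  asyncRead path (putMVar slot)    -- issue the asynchronous call
  takeMVar slot                    -- block this task until the callback fires

main :: IO ()
main = syncRead "example.txt" >>= putStrLn
```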
Bin Packing

- The more resources allocated, the more swapping overhead
  - It would be nice to amortize it
- The more resources you get, the longer you may keep them (see the sketch below)
- Roughly, this means scheduling = packing squarish blocks
  - QOS applications might need long rectangles instead
- When the blocks don’t fit, the OS can morph them a little
  - Or cut corners when absolutely necessary

[Figure: allocations drawn as blocks in a plane with “Quantity of resource” on one axis and “Time” on the other]
11
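A back-of-the-envelope sketch of why amortization leads to squarish blocks; the 2 ms per core figure and the 1% overhead budget are made-up numbers, not from the talk. Since the cost of swapping a grant in and out grows with its size, the lease time should grow in proportion, giving allocations a roughly constant aspect ratio in the resource-time plane.

```haskell
-- Seconds of swapping overhead to acquire and release a grant of this size
-- (assume 2 ms of state motion per core; illustrative only).
swapCost :: Double -> Double
swapCost cores = 0.002 * cores

-- Shortest lease that keeps swapping below the given overhead fraction.
minLease :: Double -> Double -> Double
minLease overheadBudget cores = swapCost cores / overheadBudget

main :: IO ()
main =
  mapM_ (\c -> putStrLn (show c ++ " cores -> lease at least "
                         ++ show (minLease 0.01 c) ++ " s"))
        [1, 4, 16, 64]    -- lease time grows with grant size: squarish blocks
```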
Parallel Debugging and Tuning

- Today, debugging relies on single-stepping and printf()
  - Single-stepping a parallel program is a bit less effective
- Conditional program and data breakpoints are helpful
  - To stop when an invariant fails to be true
- Debugging is data mining
  - Support for ad-hoc data perusal is also very important
- Serial program tuning tries to discover where the program counter spends its time
  - The answer is usually found by sampling the PC
- In contrast, parallel program tuning tries to discover where there is insufficient parallelism
  - A good way is to log perf counters and a timestamp at events (see the sketch below)
- Visualization is a big deal for both debugging and tuning
12
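A hedged sketch of logging a timestamp and some counters at events, as input for later visualization. getMonotonicTime comes from GHC’s base library; the “counters” here are placeholders rather than real hardware performance counters.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import GHC.Clock (getMonotonicTime)

data Event = Event
  { when     :: Double            -- seconds since an arbitrary origin
  , label    :: String            -- what happened
  , counters :: [(String, Int)]   -- counter snapshot at that moment
  } deriving Show

logEvent :: IORef [Event] -> String -> [(String, Int)] -> IO ()
logEvent trace name cs = do
  t <- getMonotonicTime
  modifyIORef' trace (Event t name cs :)   -- prepend; reverse when dumping

main :: IO ()
main = do
  trace <- newIORef []
  logEvent trace "task.spawn"  [("queued", 8), ("finished", 0)]
  logEvent trace "task.finish" [("queued", 7), ("finished", 1)]
  readIORef trace >>= mapM_ print . reverse
```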
Conclusions

- It is time to rethink some of the basics
- There is lots of work for everyone to do
  - I’ve left out lots of things, e.g. applications
- We need basic research as well as industrial development
  - Research in computer systems is deprecated these days
  - In the USA, NSF and DOD need to take the initiative
13