Programmability and Portability Problems?
Time for Hardware Upgrades
~2003, Wall Street-traded companies gave up the safety of the
only paradigm that worked for them in favor of parallel computing.
Yet to see: an easy-to-program, fast general-purpose many-core
computer for single-task completion time.
Uzi Vishkin
2009
Develop application SW in 2009 for 2010s many-cores, or wait?
Portability/investment questions:
Will 2009 code be supported in the 2010s?
Development hours in 2009 vs. the 2010s? Maintenance in the 2010s?
Performance in the 2010s?
Good news: Vendors are opening up to ~40 years of parallel computing.
Also SW to match vendors' HW (2009 acquisitions). Also: new starts.
However, they picked the wrong part: parallel architectures are a
disaster area for programmability. In any case, their programming
is too constrained. Contrast with general-purpose serial computing,
which "set the serial programmer free". The current direction drags
general-purpose computing toward an unsuccessful paradigm.
My main point: We need to reproduce the serial success for many-core
computing.
The business food chain: SW developers serve customers, NOT
machines. If HW developers do not get used to the idea of serving
SW developers, guess what will happen to the customers of their HW.
Technical points
This talk will overview:
- What does it mean to "set free" parallel algorithmic thinking?
- Architecture functions/abilities that achieve that
- HW features supporting them
Vendors must provide such functions.
Simple way: just add these features.
Example of a HW feature: Prefix-Sum
• 1500 cars enter a gas station with 1000 pumps.
• Direct, in unit time, a car to EVERY pump.
• Then direct, in unit time, a car to EVERY pump
becoming available.
Proposed HW solution: a prefix-sum functional unit
(a HW enhancement of Fetch&Add).
SPAA'97 + US Patent
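To make the primitive concrete, here is a minimal C11 sketch (my illustration, not the XMT hardware interface): an atomic fetch-and-add plays the role of the prefix-sum unit, handing each arriving car a unique pump index. The proposed HW unit would answer all simultaneous requests in a single time unit, whereas software fetch-and-add serializes them.

```c
#include <stdatomic.h>
#include <stdio.h>

#define PUMPS 1000
#define CARS  1500

/* Models the prefix-sum (enhanced Fetch&Add) primitive: each caller
 * atomically receives the old counter value and bumps the counter.
 * The proposed HW unit would serve all concurrent requests in one
 * time unit; software fetch-and-add serializes them. */
static atomic_int next_pump = 0;

/* Returns a pump index for one arriving car, or -1 if all are taken. */
int assign_pump(void) {
    int ticket = atomic_fetch_add(&next_pump, 1);
    return ticket < PUMPS ? ticket : -1;
}

int main(void) {
    int queued = 0;
    for (int car = 0; car < CARS; car++)
        if (assign_pump() < 0)
            queued++;                   /* car waits for a pump to free up */
    printf("%d cars queued\n", queued); /* prints: 500 cars queued */
    return 0;
}
```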
Objective for programmer’s model
• Emerging: not yet certain, but the analysis should be work-depth. Why
not design for your analysis? (as in serial)
[Figure: Serial paradigm vs. natural (parallel) paradigm. Serial: one
op per time step, so Time = Work. Natural (parallel): "what could I do
in parallel at each step assuming unlimited hardware?"; Work = total
#ops, Time << Work.]
• [SV82] conjectured that the rest (filling in the full PRAM algorithm)
is just a matter of skill.
• Lots of evidence that this "work-depth methodology" works.
Used as the framework in PRAM algorithms textbooks: JaJa-92,
Keller-Kessler-Traeff-01.
• The only really successful parallel algorithmic theory. A latent,
though not widespread, knowledge base. (A concrete work-depth count
follows below.)
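As a concrete illustration of counting work and depth (my example, not from the talk): summing n numbers by a balanced tree takes Work = n-1 additions but only Depth = ceil(log2 n) time steps with unlimited hardware, so Time << Work. A minimal C sketch exposing that schedule:

```c
#include <stdio.h>

/* Balanced-tree summation, written to expose the work-depth schedule:
 * each pass over `stride` is one parallel time step (all its additions
 * are independent), so Depth = ceil(log2 n) steps while Work = n-1
 * additions in total. A serial loop stands in for "unlimited hardware". */
int tree_sum(int a[], int n) {
    for (int stride = 1; stride < n; stride *= 2)   /* one depth step */
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];                  /* independent ops */
    return a[0];
}

int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%d\n", tree_sum(a, 8)); /* 36: Work = 7 adds, Depth = 3 steps */
    return 0;
}
```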
Hardware prototypes of PRAM-On-Chip
• 64-core, 75MHz FPGA prototype
[SPAA'07, Computing Frontiers'08].
Original explicit multi-threaded (XMT) architecture [SPAA'98]
(Cray started to use the name "XMT" 7+ years later).
• Interconnection network for 128 cores: 9mm×5mm, IBM 90nm
process, 400MHz prototype [HotInterconnects'07].
• Same design as the 64-core FPGA: 10mm×10mm,
IBM 90nm process, 150MHz prototype.
The design scales to 1000+ cores on-chip.
XMT big idea in a nutshell: Design for work-depth.
1) 1 operation now; any #ops in the next time unit.
2) No need to program for locality beyond the use of local thread
variables, post work-depth.
3) Enough interconnection-network bandwidth.
XMT: A PRAM-On-Chip Vision
• IF you could program a current manycore → great
speedups. XMT: fix the IF.
• XMT: designed from the ground up to address that
for on-chip parallelism.
• Unlike approaches that match current HW.
• Today's position: replicate its functions.
• Tested HW & SW prototypes.
• Software release of the full XMT environment.
• SPAA'09: ~10X speedup relative to an Intel Core 2 Duo.
• For more info: Google "XMT".
Programmer's Model: Workflow Function
• Arbitrary CRCW work-depth algorithm.
- Reason about correctness & complexity in the synchronous model.
• SPMD reduced synchrony
– Main construct: the spawn-join block. Can start any number of processes at
once. Threads advance at their own speed, not in lockstep.
– Prefix-sum (ps). Independence of order semantics (IOS).
– Establish correctness & complexity by relating back to the WD analysis.
– Circumvents "the problem with threads", e.g., [Lee]. (A compaction
sketch illustrating spawn-join, ps, and IOS follows this slide.)
[Diagram: execution alternates serial segments with parallel
spawn ... join blocks.]
• Tune (compiler or expert programmer): (i) length of the sequence
of round trips to memory, (ii) QRQW, (iii) WD. [VCL07]
• Trial & error contrast: a similar start, then
while (insufficient inter-thread bandwidth) do { rethink the
algorithm to take better advantage of cache }.
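A hedged illustration (my C sketch, not actual XMTC code) of the spawn-join + ps pattern: array compaction, where each virtual thread that finds a nonzero element grabs a unique output slot via prefix-sum. A serial loop stands in for spawn-join, and an atomic fetch-and-add stands in for ps; thanks to IOS, the body is correct under ANY execution order of the iterations.

```c
#include <stdatomic.h>
#include <stdio.h>

#define N 8

/* Array compaction in the spawn-join + prefix-sum style:
 * conceptually, spawn(0, N-1) starts one virtual thread per index,
 * and ps hands each thread with a nonzero element a unique output
 * slot. Here atomic_fetch_add models ps; the body is written so that
 * executing iterations in ANY order (or concurrently) is correct --
 * the independence-of-order semantics (IOS). */
int main(void) {
    int a[N] = {0, 5, 0, 7, 1, 0, 0, 3};
    int b[N];
    atomic_int base = 0;          /* plays the role of the ps base register */

    for (int i = 0; i < N; i++) { /* stand-in for spawn(0, N-1) ... join */
        if (a[i] != 0) {
            int slot = atomic_fetch_add(&base, 1); /* models ps(1, base) */
            b[slot] = a[i];
        }
    }

    for (int i = 0; i < base; i++)
        printf("%d ", b[i]);      /* 5 7 1 3 (order may vary if parallel) */
    printf("\n");
    return 0;
}
```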
Ease of Programming
• Benchmark: Can any CS major program your manycore?
- You cannot really avoid it.
Teachability demonstrated so far:
- To a freshman class with 11 non-CS students. Some programming
assignments: merge-sort, integer-sort & sample-sort.
Other teachers:
- A magnet-HS teacher downloaded the simulator, assignments, and
class notes from the XMT page. Self-taught. Recommends:
teach XMT first. Easiest to set up (simulator), program, and
analyze: the ability to anticipate performance (as in serial). Works
not just for embarrassingly parallel problems. He also teaches OpenMP,
MPI, and CUDA. Look up the keynote at CS4HS'09@CMU + the interview
with the teacher.
- High-school & middle-school students (some 10 years old)
from underrepresented groups, taught by a HS math teacher.
Conclusion
• XMT provides a viable answer to the biggest challenges for
the field:
– Ease of programming
– Scalability (up & down)
– Facilitates code portability
• Preliminary evaluation shows good results for the XMT
architecture versus a state-of-the-art Intel Core 2
platform.
• An ICPP'08 paper compares with GPUs.
• Easy to build: one student completed the hardware design and an
FPGA-based XMT computer in slightly more than two
years → time to market; implementation cost.
• Replicate functions, perhaps by replicating solutions.
Software release
Allows you to use your own computer for programming in an XMT
environment and experimenting with it, including:
a) A cycle-accurate simulator of the XMT machine
b) A compiler from XMTC to that machine
Also provided: extensive material for teaching or self-studying
parallelism, including:
(i) Tutorial + manual for XMTC (150 pages)
(ii) Class notes on parallel algorithms (100 pages)
(iii) Video recording of the 9/15/07 HS tutorial (300 minutes)
(iv) Video recordings of the graduate Parallel Algorithms lectures
(30+ hours)
www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html,
or just Google "XMT".
Q&A
Question: Why do PRAM-type parallel algorithms matter, when we
can get by with existing serial algorithms and parallel
programming methods like OpenMP on top of them?
Answer: With the latter, you need a strong-willed Comp. Sci. PhD
to come up with an efficient parallel program at the
end. With the former (the study of parallel algorithmic thinking and
PRAM algorithms), high-school kids can write efficient (more
efficient if fine-grained & irregular!) parallel programs.