Transcript Document 7557120

Implementing Tomorrow's
Programming Languages
Rudi Eigenmann
Purdue University
School of ECE
Computing Research Institute
Indiana, USA
1
How to find Purdue University
2
Computing Research Institute
(CRI)
CRI is the high-performance computing branch of Discovery Park.
Other DP Centers:
Bioscience
Nanotechnology
E-Enterprise
Entrepreneurship
Learning
Advanced Manufacturing
Environment
Oncology
3
Compilers are the Center of
the Universe
The compiler translates the programmer’s view into the machine’s view.

Today:
  DO I=1,n
    a(I)=b(I)
  ENDDO
is translated into
  Subr doit
  Loop: Load 1,R1
  . . .
  Move R2, x
  . . .
  BNE loop

Tomorrow:
  Do Weather forecast
is translated into
  Compute on machine x
  Remote call doit
4
Why is Writing Compilers Hard?
… a high-level view
• Translation passes are complex algorithms
• Not enough information at compile time
– Input data not available
– Insufficient knowledge of architecture
– Not all source code available
• Even with sufficient information, modeling
performance is difficult
• Architectures are moving targets
5
Why is Writing Compilers Hard?
… from an implementation angle
• Interprocedural analysis
• Alias/dependence analysis
• Pointer analysis
• Information gathering and propagation
• Link-time, load-time, run-time optimization
– Dynamic compilation/optimization
– Just-in-time compilation
– Autotuning
• Parallel/distributed code generation
6
It’s Even Harder Tomorrow
Because we want:
• All our programs to work on multicore processors
• Very High-level languages
– Do weather forecast …
• Composition: Combine weather forecast with
energy-reservation and cooling manager
• Reuse: warn me if I’m writing a module that exists
“out there”.
7
How Do We Get There?
Paths towards tomorrow’s programming language
Addressing the (new) multicore challenge:
• Automatic Parallelization
• Speculative Parallel Architectures
• SMP languages for distributed systems
Addressing the (old) general software
engineering challenge:
• High-level languages
• Composition
• Symbolic analysis
• Autotuning
8
The Multicore Challenge
• We have finally reached the long-expected
“speed wall” for the processor clock.
– (this should not be news to you!)
• “… one of the biggest disruptions in the
evolution of information technology.”
• “Software engineers who do not know parallel
programming will be obsolete in no time.”
9
Automatic Parallelization
Can we implement standard languages on multicore?
Polaris – A Parallelizing Compiler
… more specifically: a source-to-source restructuring compiler
[Diagram: Standard Fortran -> Polaris -> Fortran + directives (OpenMP) -> OpenMP backend compiler]
Research issues in such a compiler:
– Detecting parallelism
– Mapping parallelism onto the machine
– Performing compiler techniques at runtime
– Compiler infrastructure
10
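To make the source-to-source idea above concrete, here is a minimal sketch of the kind of transformation such a compiler performs once it has proven the loop iterations independent. It is written in C for brevity (Polaris itself operates on Fortran), and the array names and size are illustrative only.

#include <stdio.h>
#define N 1000

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++)          /* set up some input data */
        b[i] = i;

    /* original serial loop:  DO I=1,N  a(I)=b(I)  ENDDO               */
    /* after parallelization the compiler inserts an OpenMP directive: */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

Compiled with an OpenMP-capable compiler (e.g., cc -fopenmp), the directive distributes the loop iterations across the available cores.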
State of the Art in Automatic Parallelization
• Advanced optimizing compilers perform well in 50% of all science/engineering applications.
• Caveats: this is true
  – in research compilers
  – for regular applications, written in Fortran or C without pointers
[Chart: speedup (0–5) achieved by automatic parallelization on the benchmarks ARC2D, FLO52Q, HYDRO2D, MDG, SWIM, TOMCATV, and TRFD]
• Wanted: heroic, black-belt programmers who know the “assembly language of HPC”
11
Can Speculative Parallel
Architectures Help?
Basic idea:
• Compiler splits program into sections
(without considering data dependences)
• The sections are executed in parallel
• The architecture tracks data
dependence violations and takes
corrective action.
12
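The tracking itself happens in hardware, but the following C sketch mimics the idea in software under simplifying assumptions (two fixed sections, explicit read/write logs that stand in for the hardware’s dependence tracking): the sections run speculatively in parallel on private copies, and a detected violation causes the results to be squashed and the sections to be re-executed in program order.

#include <stdio.h>
#include <string.h>
#define N 8

static int x[N];                     /* shared data                              */
static int wr1[N], rd2[N];           /* write/read logs: the "tracking" a        */
                                     /* speculative architecture does in hardware */

static void section1(int *d) {
    for (int i = 0; i < N/2; i++) { d[i] = i + 1; wr1[i] = 1; }
}
static void section2(int *d) {       /* reads d[N/2-1], which section1 writes */
    for (int i = N/2; i < N; i++) { rd2[i-1] = 1; d[i] = d[i-1] + 1; }
}

int main(void) {
    int copy1[N], copy2[N];          /* private copies for speculative execution */
    memcpy(copy1, x, sizeof x);
    memcpy(copy2, x, sizeof x);

    #pragma omp parallel sections    /* run both sections speculatively in parallel */
    {
        #pragma omp section
        section1(copy1);
        #pragma omp section
        section2(copy2);
    }

    int violation = 0;               /* did section2 read something section1 wrote? */
    for (int i = 0; i < N; i++)
        if (wr1[i] && rd2[i]) violation = 1;

    if (violation) {                 /* squash: re-execute in program order */
        section1(x);
        section2(x);
    } else {                         /* commit the private copies */
        memcpy(x, copy1, (N/2) * sizeof(int));
        memcpy(x + N/2, copy2 + N/2, (N/2) * sizeof(int));
    }
    for (int i = 0; i < N; i++) printf("%d ", x[i]);
    printf("\n");
    return 0;
}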
Performance of Speculative Multithreading
[Chart: speedup (0–4.5) of the Implicit-Only, Multiplex-Naïve, Multiplex-Selective, and Multiplex-Profile schemes for SPEC CPU2000 FP programs (MGRID, TRFD, ARC2D, FLO52, SWIM, HYDRO2D, TOMCATV, SU2COR, APPLU, TURB3D, APSI, FPPPP, WAVE5) executed on a 4-core speculative architecture.]
13
We may need
Explicit Parallel Programming
Shared-memory architectures:
OpenMP: proven model for science/engineering programs
Suitability for non-numerical programs?
Distributed computers:
MPI: the assembly language of parallel/distributed systems. Can we do better?
14
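To illustrate the gap between the two models, here is the same (made-up) reduction written both ways in C: the OpenMP version adds a single directive to the serial loop, while the MPI version spells out ranks, loop bounds, and communication by hand. Compile with -DUSE_MPI for the MPI variant.

#include <stdio.h>
#define N 1000000

#ifndef USE_MPI
int main(void) {                        /* OpenMP: one directive on the serial loop */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += 1.0 / (i + 1);
    printf("sum = %f\n", sum);
    return 0;
}
#else
#include <mpi.h>
int main(int argc, char **argv) {       /* MPI: explicit ranks, bounds, and messages */
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int chunk = (N + size - 1) / size;
    int lo = rank * chunk, hi = (lo + chunk < N) ? lo + chunk : N;
    double local = 0.0, sum = 0.0;
    for (int i = lo; i < hi; i++)
        local += 1.0 / (i + 1);
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}
#endif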
Beyond Science&Engineering
Applications
7+ Dwarfs:
1. Structured Grids (including locally structured grids, e.g., Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
8. Search/Sort
9. Filter
10. Combinational logic
11. Finite State Machine
15
Shared-Memory Programming
for Distributed Applications?
• Idea 1:
– Use an underlying software distributed shared-memory system (e.g., TreadMarks).
• Idea 2:
– Direct translation into message-passing
code
16
OpenMP for Software DSM Challenges
• In S-DSMs, such as TreadMarks, the stacks are not in the shared address space
• S-DSM maintains coherency at a page level
• The compiler must identify shared stack variables -> interprocedural analysis
• Optimizations that reduce false sharing and increase page affinity are very important
[Diagram: shared memory with per-processor stacks versus distributed memories with a shared address space. Processor 1 executes A[50]= and reaches the barrier; P1 tells P2 “I have written page x”; Processor 2’s subsequent read of A[50] requests the page “diff” from P1.]
17
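As a small illustration of the false-sharing point (shown here at cache-line rather than page granularity, with an assumed 64-byte line size), the C sketch below pads per-thread counters so that each thread writes only its own line; in a page-based S-DSM the same layout transformation would be applied at page size.

#include <stdio.h>
#include <omp.h>
#define NTHREADS 4
#define PAD 64                       /* bytes per cache line (a page in the S-DSM case) */

/* one padded counter per thread, so no two threads share a line */
struct { long count; char pad[PAD - sizeof(long)]; } counter[NTHREADS];

int main(void) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++)
            counter[t].count++;      /* each thread touches only its own line */
    }
    long total = 0;
    for (int t = 0; t < NTHREADS; t++) total += counter[t].count;
    printf("total = %ld\n", total);
    return 0;
}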
Optimized Performance of SPEC OMPM2001 Benchmarks on a TreadMarks S-DSM System
[Chart: SPEC OMPM2001 performance (baseline versus optimized, speedup 0–6) on 1, 2, 4, and 8 processors for wupwise, swim, mgrid, applu, equake, and art.]
18
Direct Translation of OpenMP
into Message Passing
A question often asked: How is this different from HPF?
• HPF: emphasis is on data distribution
OpenMP: the starting point is explicit parallel regions.
• HPF: implementations apply strict data distribution and owner-computes schemes
Our approach: partial replication of shared data.
Partial replication leads to
– Synchronization-free serial code
– Communication-free data reads
– Communication for data writes amenable to collective message
passing.
– Irregular accesses (in our benchmarks) amenable to compile-time
analysis
Note: partial replication is not necessarily “data scalable”
19
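For illustration only, here is a hand-written C sketch (not actual translator output) of what partial replication can look like for a simple OpenMP parallel loop: the shared arrays exist on every rank, reads are communication-free, and the writes of each iteration chunk are made globally visible with a single collective. N is assumed to be divisible by the number of processes.

#include <stdio.h>
#include <mpi.h>
#define N 1024

double a[N], b[N];                       /* shared data, replicated on each rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) b[i] = i;   /* replicated serial code, no messages */

    /* OpenMP source:  #pragma omp parallel for
                       for (i = 0; i < N; i++) a[i] = 2.0 * b[i];       */
    int chunk = N / size, lo = rank * chunk;
    for (int i = lo; i < lo + chunk; i++)   /* each rank executes its iterations */
        a[i] = 2.0 * b[i];                  /* reads of b are local, communication-free */

    /* make every rank's writes to a[] visible everywhere (collective) */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  a, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0) printf("a[N-1] = %f\n", a[N - 1]);
    MPI_Finalize();
    return 0;
}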
Performance of OpenMP-to-MPI
Translation
[Chart: performance comparison of our OpenMP-to-MPI translated versions versus hand-coded MPI versions of the same programs; higher is better.]
Hand-coded MPI represents a practical “upper bound”.
“Speedup” is relative to the serial version.
20
How does the performance
compare to the same programs
optimized for Software DSM?
[Chart: performance of the OpenMP-to-MPI translated versions versus the same programs optimized for software DSM (Project 2); higher is better.]
21
How Do We Get There?
Paths towards tomorrow’s programming language
The (new) multicore challenge:
• Automatic Parallelization
• Speculative Parallel Architectures
• SMP languages for distributed systems
The (old) general software engineering challenge:
• High-level languages
• Composition
• Symbolic analysis
• Autotuning
22
(Very) High-Level Languages
Observation: “The number of programming errors is roughly proportional to the number of program lines.”
[Figure: increasing abstraction, from Assembly and Fortran through object-oriented languages to scripting languages and Matlab, and on to “?”]
• Probably domain-specific
• How efficient?
  – Potentially very efficient, because there is much flexibility in translating VHLLs
  – In practice, often inefficient
23
Composition
Can we compose software from existing modules?
• Idea:
Add an “abstract algorithm” (AA) construct
to the programming language
– the programmer defines the AA’s goal
– called like a procedure
Compiler replaces each AA call with a
sequence of library calls
– How does the compiler do this?
It uses a domain-independent planner that accepts procedure specifications as operators
24
Motivation: Programmers often
Write Sequences of Library Calls
Example: A Common BioPerl Call Sequence
“Query a remote database and save the result to local storage:”
Query q = bio_db_query_genbank_new("nucleotide",
  "Arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]");
DB db = bio_db_genbank_new( );
Stream stream = get_stream_by_query(db, q);
SeqIO seqio = bio_seqio_new(">sequence.fasta", "fasta");
Seq seq = next_seq(stream);
write_seq(seqio, seq);
5 data types, 6 procedure calls
Example adapted from
http://www.bioperl.org/wiki/HOWTO:Beginners
25
Defining and Calling an AA
• AA (goal) defined using the glossary...
algorithm save_query_result_locally(db_name, query_string, filename, format)
  => { query_result(result, db_name, query_string),
       contains(filename, result),
       in_format(filename, format) }
...and called like a procedure
Seq seq = save_query_result_locally("nucleotide",
  "Arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]",
  ">sequence.fasta", "fasta");
1 data type, 1 AA call
26
“Ontological Engineering”
• Library author provides a domain glossary
– query_result(result, db, query) – result is the outcome
of sending query to the database db
– contains(filename, data) – file named filename
contains data
– in_format(filename, format) – file named filename is in
format format
27
Implementing the Composition Idea
[Diagram: the library specifications serve as operators, the call context as the initial state, and the AA definition as the goal state; the planner (the compiler) produces a plan (the executable), whose actions the plan user applies to the world.]
A domain-independent planner: borrowing AI technology (planners)
-> for details, see PLDI 2006
28
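To show how a domain-independent planner can assemble a call sequence, here is a minimal C sketch; it is an illustration, not the system described in the PLDI 2006 paper. Propositions are bits, each library procedure becomes an operator with a precondition mask and an effect mask, and a breadth-first search from the call context (initial state) to the AA definition (goal state) yields the plan. The propositions and operators are a simplified, made-up rendering of the BioPerl example.

#include <stdio.h>

enum { HAVE_QUERY = 1, HAVE_DB = 2, HAVE_STREAM = 4, HAVE_SEQIO = 8,
       HAVE_SEQ = 16, SAVED_LOCALLY = 32 };

typedef struct { const char *name; unsigned pre, add; } Op;   /* operator = procedure spec */

static const Op ops[] = {
    { "bio_db_query_genbank_new", 0,                      HAVE_QUERY },
    { "bio_db_genbank_new",       0,                      HAVE_DB },
    { "get_stream_by_query",      HAVE_DB | HAVE_QUERY,   HAVE_STREAM },
    { "bio_seqio_new",            0,                      HAVE_SEQIO },
    { "next_seq",                 HAVE_STREAM,            HAVE_SEQ },
    { "write_seq",                HAVE_SEQIO | HAVE_SEQ,  SAVED_LOCALLY },
};
enum { NOPS = sizeof ops / sizeof ops[0], MAXSTATES = 64 };

int main(void) {
    unsigned initial = 0, goal = SAVED_LOCALLY;       /* call context and AA goal */
    unsigned state[MAXSTATES];
    int parent[MAXSTATES], via[MAXSTATES], n = 0;

    state[n] = initial; parent[n] = -1; via[n] = -1; n++;
    for (int cur = 0; cur < n; cur++) {
        if ((state[cur] & goal) == goal) {            /* goal reached: print the plan */
            int seq[MAXSTATES], len = 0;
            for (int s = cur; parent[s] != -1; s = parent[s]) seq[len++] = via[s];
            printf("plan:\n");
            for (int i = len - 1; i >= 0; i--) printf("  %s\n", ops[seq[i]].name);
            return 0;
        }
        for (int o = 0; o < NOPS; o++) {              /* expand with every applicable operator */
            if ((state[cur] & ops[o].pre) != ops[o].pre) continue;
            unsigned next = state[cur] | ops[o].add;
            int seen = 0;
            for (int s = 0; s < n; s++) if (state[s] == next) seen = 1;
            if (!seen && n < MAXSTATES) {
                state[n] = next; parent[n] = cur; via[n] = o; n++;
            }
        }
    }
    printf("no plan found\n");
    return 0;
}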
Symbolic Program Analysis
• Today: many compiler techniques assume numerical constants
• Needed: techniques that can reason about the program in symbolic terms.
  – differentiate: ax^2 -> 2ax
  – analyze ranges: y=exp; if (c) y+=5; -> y=[exp:exp+5]
  – recognize algorithms:
    c=0
    DO j=1,n
      if (t(j)<v) c+=1
    ENDDO
    -> c = COUNT(t[1:n]<v)
29
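A toy C sketch of the range-analysis example above, using an assumed representation (a symbolic base expression plus a constant interval): the conditional increment widens the interval, reproducing y=[exp:exp+5].

#include <stdio.h>

typedef struct { const char *base; int lo, hi; } SymRange;   /* base + [lo, hi] */

static SymRange assign(const char *expr)     { SymRange r = { expr, 0, 0 }; return r; }
static SymRange add_const(SymRange r, int k) { r.lo += k; r.hi += k; return r; }
static SymRange merge(SymRange a, SymRange b) {              /* join of two branches */
    SymRange r = { a.base, a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;                                                /* assumes the same symbolic base */
}
static void print_range(const char *var, SymRange r) {
    printf("%s = [%s%+d : %s%+d]\n", var, r.base, r.lo, r.base, r.hi);
}

int main(void) {
    SymRange y = assign("exp");          /* y = exp                     */
    SymRange taken = add_const(y, 5);    /* then-branch: y += 5         */
    y = merge(y, taken);                 /* after the if                */
    print_range("y", y);                 /* prints y = [exp+0 : exp+5]  */
    return 0;
}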
Autotuning
(dynamic compilation/adaptation)
• Moving compile-time decisions to runtime
• A key observation:
Compiler writers “solve” difficult decisions by
creating a command-line option
-> finding the best combination of options
means making the difficult compiler decisions.
30
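A minimal C sketch of that observation, with placeholder names (a generic cc command, an app.c source file, a small flag list): enumerate the flag combinations, time each compiled executable, and keep the fastest. The actual PEAK system shown on the next slides is far more efficient than this kind of whole-program search.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static const char *flags[] = { "-O2", "-O3", "-funroll-loops", "-ffast-math" };
enum { NFLAGS = 4 };

/* compile app.c with the given options, run it once, and return the runtime */
static double run_once(const char *opts) {
    char cmd[512];
    snprintf(cmd, sizeof cmd, "cc %s app.c -o app_tuned", opts);
    if (system(cmd) != 0) return 1e30;               /* compilation failed */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (system("./app_tuned > /dev/null") != 0) return 1e30;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    double best = 1e30;
    char bestopts[256] = "";
    for (unsigned mask = 0; mask < (1u << NFLAGS); mask++) {
        char opts[256] = "";
        for (int i = 0; i < NFLAGS; i++)             /* build one flag combination */
            if (mask & (1u << i)) { strcat(opts, flags[i]); strcat(opts, " "); }
        double t = run_once(opts);
        printf("%-40s %.3f s\n", opts, t);
        if (t < best) { best = t; snprintf(bestopts, sizeof bestopts, "%s", opts); }
    }
    printf("best options: %s(%.3f s)\n", bestopts, best);
    return 0;
}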
Tuning Time
[Chart: normalized tuning time of whole-program tuning versus PEAK for ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and their geometric mean.]
PEAK is 20 times as fast as whole-program tuning.
On average, PEAK reduces tuning time from 2.19 hours to 5.85 minutes.
31
Program Performance
[Chart: relative performance improvement (%) under Whole_Train, Whole_Ref, PEAK_Train, and PEAK_Ref for ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and their geometric mean.]
The performance is the same.
32
Conclusions
Advanced compiler capabilities are crucial for
implementing tomorrow’s programming
languages:
• The multicore challenge -> parallel programs
– Automatic parallelization
– Support for speculative multithreading
– Shared-memory programming support
• High-level constructs
– Composition pursues this goal
• Techniques to reason about programs in symbolic
terms
• Dynamic tuning
33