Transcript Document

INSPIRE
The Insieme Parallel Intermediate Representation
Herbert Jordan, Peter Thoman, Simone Pellegrini, Klaus Kofler, and Thomas Fahringer
University of Innsbruck
PACT'13 - 9 September
Programming Models

User:
C / C++
void main(…) {
  int sum = 0;
  for(i = 1..10)
    sum += i;
  print(sum);
}

Compiler:
PL → Assembly
• instruction selection
• register allocation
• optimization
• loops & latency

IR:
.START ST
ST: MOV R1,#2
    MOV R2,#1
M1: CMP R2,#20
    BGT M2
    MUL R1,R2
    INC R2
    JMP M1

HW:
[Diagram: a single core (C) with a cache ($) attached to memory, a 10-year-old architecture]
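For reference, the slide's loop pseudocode corresponds to the following plain C (a minimal runnable sketch; the printf formatting is an editorial choice, not from the slide):

#include <stdio.h>

int main(void) {
    int sum = 0;
    for (int i = 1; i <= 10; i++)   /* the slide's "for(i = 1..10)" */
        sum += i;
    printf("%d\n", sum);            /* prints 55 */
    return 0;
}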
Parallel Architectures

Multicore:
[Diagram: several cores (C) sharing caches ($) and a common memory] → OpenMP/Cilk

Accelerators:
[Diagram: a host core (C) with its memory (M) beside a GPU (G) with its own memory (M)] → OpenCL/CUDA

Clusters:
[Diagram: multiple nodes, each with cores (C), cache ($), and memory (M), connected by a network] → MPI/PGAS
Compiler Support

C / C++:
void main(…) {
  int sum = 0;
  #omp pfor
  for(i = 1..10)
    sum += i;
}

Frontend → IR → Backend: the pipeline itself stays sequential; the parallel loop is outlined into a call to a runtime library.

IR:
.START ST
ST: MOV R1,#2
    MOV R2,#1
    CALL _GOMP_PFOR
M1: CMP R2,#20
    BGT M2
    MUL R1,R2
    ...

lib:
pfor:
  mov eax,-2
  cmp eax, 2
  xor eax, eax
  ...
Start:
  ...
  mov eax,2
  mov ebx,1
  call "pfor"
Label 1:
  lea esi, Str
  push esi
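In actual OpenMP syntax the slide's "#omp pfor" sketch would read as follows; compiled with gcc -fopenmp, the pragma is lowered to calls into the runtime library while the compiler's own analyses stay sequential (a minimal sketch):

#include <stdio.h>

int main(void) {
    int sum = 0;
    /* The pragma is not compiled "into" the loop: the compiler outlines
       the loop body and emits calls into the OpenMP runtime (libgomp for
       GCC), analogous to the _GOMP_PFOR call on the slide. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 10; i++)
        sum += i;
    printf("%d\n", sum);   /* prints 55 */
    return 0;
}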
Situation

• Compilers
  - unaware of thread-level parallelism
  - magic happens in libraries
• Libraries
  - limited perspective / scope
  - no static analysis, no transformations
• User
  - has to manage and coordinate parallelism
  - no performance portability
Compiler Support?

HW:
[Diagram: a modern heterogeneous machine with many cores (C), GPUs (G), caches ($), and memories (M)]

Compiler:
PL → Assembly
• instruction selection
• register allocation
• optimization
• loops & latency
• vectorization

IR: (the same sequential IR as before)

User:
C / C++
void main(…) {
  int sum = 0;
  for(i = 1..10)
    sum += i;
  print(sum);
}
Our approach: Insieme

HW:
[Diagram: the same heterogeneous machine with cores (C), GPUs (G), caches ($), and memories (M)]

Compiler:
PL → Assembly (as before: instruction selection, register allocation, optimization, loops & latency, vectorization)

Insieme:
PL → PL + extras
• coordinate parallelism
• high-level optimization
• auto tuning
• instrumentation

User:
C / C++
void main(…) {
  int sum = 0;
  #omp pfor
  for(i = 1..10)
    sum += i;
}

INSPIRE:
unit main(...) {
  ref<int> v1 = 0;
  pfor(..., (){
    ...
  });
}
The Insieme Project

Goal: to establish a research platform for hybrid, thread-level parallelism

[Architecture diagram:]
• Input: C/C++ with OpenMP, Cilk, OpenCL, MPI, and extensions
• Compiler: Frontend, Static Optimizer, Backend, built around INSPIRE and the IR Toolbox
• Runtime: Exec. Engine, Scheduler, Monitoring, Dyn. Optimizer, IRSM
Parallel Programming

• OpenMP: pragmas (+ API)
• Cilk: keywords
• MPI: library
• OpenCL: library + JIT

Objective: combine those using a unified formalism and provide an infrastructure for analysis and manipulation
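To make the contrast concrete: under MPI the same sum is expressed purely through library calls, which a conventional compiler cannot see through (a minimal sketch; the cyclic work split among ranks is an illustrative choice):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank sums a cyclic slice of 1..10. */
    int local = 0;
    for (int i = rank + 1; i <= 10; i += size)
        local += i;

    /* The parallelism lives entirely in library calls like this one. */
    int sum = 0;
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d\n", sum);   /* prints 55 on rank 0 */

    MPI_Finalize();
    return 0;
}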
INSPIRE Requirements

OpenMP / Cilk / OpenCL / MPI / others
        ↓
     INSPIRE
        ↓
OpenCL / MPI / Insieme Runtime / others

Requirements:
• complete
• unified
• explicit
• analyzable
• transformable
• compact
• high level
• whole program
• open system
• extensible
INSPIRE

• Functional Basis
  - first-class functions and closures
  - generic (function) types
  - program = 1 expression
• Imperative Constructs
  - loops, conditions, mutable state
• Explicit Parallel Constructs
  - to model parallel control flow
Parallel Model

• Parallel Control Flow
  - defined by jobs: job(e_l, e_u, …, f) (e_l, e_u: bounds on the thread-group size; f: the job body)
  - processed cooperatively by thread groups
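No standard C construct corresponds to a job directly; as an illustration only, the following pthreads sketch spawns a fixed-size thread group whose members cooperatively execute the same body f, each picking its share via its id and the group size (the worker_ctx type and the fixed group size are assumptions, not INSPIRE API):

#include <stdio.h>
#include <pthread.h>

#define GROUP_SIZE 4   /* assumed fixed size, standing in for a value in [e_l, e_u] */

typedef struct { int id; int group_size; } worker_ctx;

/* The job body f: every member of the thread group executes it and uses
   its id and the group size to select its share of the work. */
static void *f(void *arg) {
    worker_ctx *ctx = arg;
    printf("member %d of %d processing its share\n", ctx->id, ctx->group_size);
    return NULL;
}

int main(void) {
    pthread_t threads[GROUP_SIZE];
    worker_ctx ctx[GROUP_SIZE];
    for (int i = 0; i < GROUP_SIZE; i++) {
        ctx[i] = (worker_ctx){ .id = i, .group_size = GROUP_SIZE };
        pthread_create(&threads[i], NULL, f, &ctx[i]);
    }
    for (int i = 0; i < GROUP_SIZE; i++)
        pthread_join(threads[i], NULL);
    return 0;
}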
Parallel Model (2)

• one work-sharing construct
• one data-sharing construct
• point-to-point communication
  - abstract channels, type: channel⟨α, s⟩ (element type α, capacity s)
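As a rough analogue of the channel⟨α, s⟩ type, here is a bounded blocking channel in C using pthreads, with α = int and s = 4; this is a sketch of the concept only, since INSPIRE's actual channel semantics are defined in the IR, not by this code:

#include <stdio.h>
#include <pthread.h>

#define CAP 4   /* the "s" in channel<alpha, s>: buffered capacity */

typedef struct {
    int buf[CAP];   /* alpha = int in this sketch */
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} channel;

static void channel_init(channel *c) {
    c->head = c->tail = c->count = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_full, NULL);
    pthread_cond_init(&c->not_empty, NULL);
}

/* Blocking send: waits while the channel already holds s elements. */
static void channel_send(channel *c, int v) {
    pthread_mutex_lock(&c->lock);
    while (c->count == CAP)
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = v;
    c->tail = (c->tail + 1) % CAP;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

/* Blocking receive: waits while the channel is empty. */
static int channel_recv(channel *c) {
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->lock);
    int v = c->buf[c->head];
    c->head = (c->head + 1) % CAP;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return v;
}

static channel ch;

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= 10; i++)
        channel_send(&ch, i);
    channel_send(&ch, -1);   /* end-of-stream marker */
    return NULL;
}

int main(void) {
    channel_init(&ch);
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    for (int v; (v = channel_recv(&ch)) != -1; )
        printf("received %d\n", v);
    pthread_join(p, NULL);
    return 0;
}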
Evaluation

• What inherent impact does the INSPIRE detour impose?

C Input Code → Insieme Compiler (FE → INSPIRE → BE, no optimization!) → C Target Code (IRT) → GCC 4.6.3 (-O3) → Binary B (Insieme)
C Input Code → GCC 4.6.3 (-O3) → Binary A (GCC)
Performance Impact

[Chart: relative execution time (t_insieme / t_original) across benchmark codes]
Derived Work (subset)

• Adaptive Task Granularity Control
  P. Thoman, H. Jordan, T. Fahringer, Adaptive Granularity Control in Task Parallel Programs Using Multiversioning, Euro-Par 2013
• Multiobjective Auto-Tuning
  H. Jordan, P. Thoman, J. J. Durillo et al., A Multi-Objective Auto-Tuning Framework for Parallel Codes, SC 2012
• Compiler-aided Loop Scheduling
  P. Thoman, H. Jordan, S. Pellegrini et al., Automatic OpenMP Loop Scheduling: A Combined Compiler and Runtime Approach, IWOMP 2012
• OpenCL Kernel Partitioning
  K. Kofler, I. Grasso, B. Cosenza, T. Fahringer, An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning, ICS 2013
• Improved Usage of MPI Primitives
  S. Pellegrini, T. Hoefler, T. Fahringer, On the Effects of CPU Caches on MPI Point-to-Point Communications, Cluster 2012
Conclusion

• INSPIRE is designed to
  - represent and unify parallel applications
  - analyze and manipulate parallel codes
  - provide the foundation for researching parallel language extensions
• based on a comprehensive parallel model
  - sufficient to cover leading standards for parallel programming
• Practicality has been demonstrated by a variety of derived work
Thank You!
Visit: http://insieme-compiler.org
Contact: [email protected]
Types
7 type constructors

Expressions
8 kinds of expressions

Statements
9 types of statements