Early Progress on SEJITS in Many Core and Cloud Environments

Efficiency Programming for the (Productive) Masses

Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson, Krste Asanovic, Dave Patterson, Kurt Keutzer
UC Berkeley Parallel Computing Lab / UPCRC

Make productivity programmers efficient, and efficiency programmers productive?
• Productivity-level languages (PLLs): Python, Ruby
  • high-level abstractions well matched to the application domain => 5x faster development and 3-10x fewer lines of code
  • >90% of programmers
• Efficiency-level languages (ELLs): C/C++, CUDA, OpenCL
  • >5x longer development time
  • potential 10x-100x performance by exposing the hardware model
  • <10% of programmers, yet their work is poorly reused
• The trade-off: 5x the development time for 10x-100x the performance!
• Can we raise the level of abstraction and still get performance?
Capture patterns instead of "domains"?

• Efficiency programmers know how to target computation patterns to hardware:
  • stencil/SIMD codes => GPUs
  • sparse matrix => communication-avoiding algorithms on multicore
  • "big finance" Monte Carlo simulation => MapReduce
• Libraries? Useful, but they don't raise the abstraction level
• How do we make ELL work accessible to more PLL programmers?
"Stovepipes": Connect Pattern to Platform

[Figure: two software stacks compared. The traditional layered stack maps application domains (virtual worlds, data visualization, robotics, music) onto computation domains (rendering, probabilistic, physics, linear algebra), then onto a common language substrate, a thick runtime & OS, and finally hardware (OOO, GPU, SIMD, FPGA). The "stovepipes" alternative connects applications through motifs/patterns (dense matrix, sparse matrix, stencil) and a thin runtime directly to the hardware (OOO, GPU, SIMD, FPGA, cloud). Humans must produce these stovepipes.]
SEJITS: Selective, Embedded Just-in-Time Specialization

• Productivity programmers write in a general-purpose, modern, high-level PLL
• The SEJITS infrastructure selectively specializes computation patterns at runtime
• Specialization uses runtime information to generate and JIT-compile ELL code targeted to the hardware
• Embedded because the PLL's own machinery enables it (vs. extending the PLL interpreter)
Specifically...

• When a "specializable" function is called (sketched below):
  • determine whether a specializer is available for the current platform
  • if not: continue executing normally in the PLL
• If a specializer is found, it can:
  • manipulate/traverse the AST of the function
  • emit & JIT-compile ELL source code
  • dynamically link the compiled code into the PLL interpreter
• Specializers are themselves written in the PLL
• The necessary features are present in modern PLLs, but absent from older, widely used PLLs
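A minimal Python sketch of that dispatch, assuming hypothetical helpers (specialize, find_specializer, detect_platform, and the specializer's emit_and_compile are illustrative names, not the actual SEJITS API):

import ast
import functools
import inspect

def detect_platform():
    # Hypothetical platform probe; a real system would inspect hardware/toolchains.
    return "multicore"

def find_specializer(fn, platform):
    # Hypothetical lookup; returns None when no specializer exists for this platform.
    return None

def specialize(fn):
    # Hypothetical decorator marking fn as "specializable".
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        spec = find_specializer(fn, detect_platform())
        if spec is None:
            return fn(*args, **kwargs)             # no specializer: run normally in the PLL
        tree = ast.parse(inspect.getsource(fn))    # traverse/manipulate the function's AST
        lib = spec.emit_and_compile(tree)          # emit ELL source and JIT-compile it
        return lib.call(*args, **kwargs)           # dynamically link into the interpreter
    return wrapper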
SEJITS makes tuning decisions per-function (not per-app)

[Figure, shown twice with annotations added the second time: a productivity app (.py) contains a plain function f() and decorated functions @g() and @h(); a call to a decorated function is routed to the SEJITS specializer, which emits ELL source (.c), compiles it with cc/ld into a shared object (.so), dynamically links it into the PLL interpreter, and caches ($) the result. The annotations spell out the name: Selective, Embedded, JIT, Specialization.]
Example: Stencil Computation in Ruby

class LaplacianKernel < Kernel
  def kernel(in_grid, out_grid)
    in_grid.each_interior do |point|
      point.neighbors(1).each do |x|
        out_grid[point] += 0.2*x.val
      end
    end
  end
end
Use introspection to grab parameters and inspect the AST of the computation:

VALUE kern_par(int argc, VALUE* argv, VALUE self) {
  unpack_arrays into in_grid and out_grid;
  #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
  for (t_8=1; t_8<256-1; t_8++) {
    for (t_7=1; t_7<256-1; t_7++) {
      for (t_6=1; t_6<256-1; t_6++) {
        int center = INDEX(t_6,t_7,t_8);
        out_grid[center] = (out_grid[center]
            + (0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
        ...
        out_grid[center] = (out_grid[center]
            + (0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
  }}}
  return Qtrue;
}

• The specializer emits OpenMP
• 1000x-2000x faster than Ruby
Example: Sparse Matrix-Vector Multiply in Python

# "Gather nonzero entries,
#  multiply them by vector,
#  do for each column"

• The specializer outputs CUDA for nvcc
• SEJITS leverages downstream toolchains

B. Catanzaro et al., joint work with NVIDIA Research.
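The Python source itself is elided in this transcript; as a rough illustration of the style of data-parallel code being specialized, here is a minimal pure-Python SpMV sketch assuming a CSR-like layout (function and variable names are illustrative, not the original slide code):

def spvv_csr(vals, cols, x):
    # Gather the vector entries this row's nonzeros touch, multiply, and reduce.
    return sum(a * x[j] for a, j in zip(vals, cols))

def spmv_csr(Ax, Acols, x):
    # One sparse dot product per row.
    # Example: spmv_csr([[2.0], [1.0, 3.0]], [[0], [0, 1]], [1.0, 2.0]) == [2.0, 7.0]
    return [spvv_csr(vals, cols, x) for vals, cols in zip(Ax, Acols)]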
SEJITS in the Cloud

[Figure: the same productivity app (.py, with f() and @g()), but the specializer emits Scala source (.scala, @h()), compiles it with scalac, and runs it on Spark workers managed by Nexus on Eucalyptus or EC2.]

Spark & Nexus
• Spark provides cloud-distributed, persistent, fault-tolerant shared parallel data structures
• Relies on the Scala runtime and its data-parallel abstractions
• Relies on the Nexus (cloud resource management) layer

Example: logistic regression using Spark/Scala (in progress; sketched below).

M. Zaharia et al., Spark: Cluster Computing with Working Sets, HotCloud '09.
B. Hindman et al., Nexus: A Common Substrate for Cluster Computing, HotCloud '09.
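To make the example concrete, here is a pure-Python sketch of the logistic regression computation the cited Spark paper uses as its running example; in the SEJITS setting, the per-point map and the gradient reduction are what a specializer would lower to Spark/Scala (illustrative Python, not the Spark version):

import math
import random

def logistic_regression(points, dims, iters):
    # points: list of (x, y) pairs, x a dims-length vector, label y in {-1, +1}.
    w = [random.uniform(-1.0, 1.0) for _ in range(dims)]    # random initial plane
    for _ in range(iters):
        grad = [0.0] * dims
        for x, y in points:                                  # the data-parallel map...
            dot = sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            for d in range(dims):                            # ...and the reduction
                grad[d] += scale * x[d]
        w = [wi - gi for wi, gi in zip(w, grad)]
    return w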
SEJITS in the Cloud

[Figure: the same structure targeting Hadoop: the specializer emits Java source (.java, @h()), compiles it with javac, and runs it on a Hadoop master managed by Nexus on the cloud.]
SEJITS for Cloud Computing

Idea: the same Python app runs on the desktop, on manycore, and in the cloud.
• Cloud/multicore synergy: specialize intra-node as well as generate cloud code
  • Cloud: emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...
  • Single node: emit JIT-able code for OpenCL, CUDA, OpenMP, ...
• Combine abstractions in one app
• Remember: the app can always fall back to the PLL (see the dispatch sketch below)
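A minimal sketch of how pattern/platform dispatch might be organized; the registry and its entries are hypothetical, purely to illustrate the fallback behavior:

# Hypothetical (pattern, platform) -> backend registry; illustrative only.
SPECIALIZER_REGISTRY = {
    ("stencil", "multicore"): "OpenMP",
    ("stencil", "gpu"): "CUDA",
    ("mapreduce", "spark-cluster"): "Spark/Scala",
    ("mapreduce", "hadoop-cluster"): "Hadoop/Java",
}

def backend_for(pattern, platform):
    # Returns None when no specializer matches, i.e. fall back to plain PLL execution.
    return SPECIALIZER_REGISTRY.get((pattern, platform))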
Questions

• Won't we need lots and lots of specializers?
  • If the ParLab "motifs" bet is correct, tens of specializers will go a long way
• What about libraries, frameworks, etc.?
  • SEJITS is complementary to frameworks
  • Most libraries target ELLs, and ELLs lack features that promote code reuse and don't raise the abstraction level
• Why isn't this just as hard as a "magic compiler"?
  • Specializers are written by human experts
  • SEJITS allows "crowdsourcing" them
• Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?
Conclusion

• SEJITS enables a code-generation strategy per function, not per app
• A uniform approach to productive programming
  • the same app runs on the cloud, on multicore, and on autotuned libraries
• Combine multiple frameworks/abstractions in the same app
• A research enabler
  • incrementally develop specializers for different motifs or prototype hardware
  • no need for a full compiler & toolchain just to get started