Early Progress on SEJITS in Many Core and Cloud Environments
BERKELEY PAR LAB
Efficiency Programming for the (Productive) Masses
Armando Fox, Bryan Catanzaro, Shoaib Kamil,
Yunsup Lee, Ben Carpenter, Erin Carson,
Krste Asanovic, Dave Patterson, Kurt Keutzer
UC Berkeley Parallel Computing Lab/UPCRC
Make productivity programmers efficient, and efficiency programmers productive?
• Productivity level language (PLL): Python, Ruby
  • high-level abstractions well matched to application domain => 5x faster development and 3-10x fewer lines of code
  • >90% of programmers
• Efficiency level language (ELL): C/C++, CUDA, OpenCL
  • >5x longer development time
  • potential 10x-100x performance by exposing HW model
  • <10% of programmers, yet their work is poorly reused
• 5x development time
• 10x-100x performance!
• Raise level of abstraction and get performance?
Capture patterns instead of “domains”?
• Efficiency programmers know how to target computation patterns to hardware
  • stencil/SIMD codes => GPUs
  • sparse matrix => communication-avoiding algorithms on multicore
  • “Big finance” Monte Carlo simulation => MapReduce
• Libraries? Useful, but don’t raise abstraction level
• How to make ELL work accessible to more PLL programmers?
“Stovepipes”: Connect Pattern to Platform
[Figure: two stacks side by side.
Traditional layers: applications (virtual worlds, data visualization, robotics, music); computation domains (rendering, probabilistic, physics, linear algebra); common language substrate; runtime & OS; hardware (OOO, GPU, SIMD, FPGA).
“Stovepipes”: applications (robotics, data visualization, virtual worlds, music); patterns (dense matrix, sparse matrix, stencil); runtime & OS; hardware (OOO, GPU, SIMD, FPGA, cloud). Callout: “Humans must produce these.”
Layer legend: app domains / computation domains / language / thick runtime / hardware, versus applications / motifs/patterns / thin runtime / hardware.]
SEJITS: Selective, Embedded Just-in-Time Specialization
• Productivity programmers write in a general-purpose, modern, high-level PLL
• SEJITS infrastructure specializes computation patterns selectively at runtime
• Specialization uses runtime info to generate and JIT-compile ELL code targeted to hardware
• Embedded because the PLL’s own machinery enables it (vs. extending the PLL interpreter)
Specifically...
• When a “specializable” function is called:
  • determine if a specializer is available for the current platform
  • if no: continue executing normally in the PLL
• If a specializer is found, it can:
  • manipulate/traverse the AST of the function
  • emit & JIT-compile ELL source code
  • dynamically link compiled code to the PLL interpreter
• Specializers are written in the PLL
  • Necessary features are present in modern PLLs, but absent from older widely used PLLs
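The call-time decision described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the SEJITS implementation: `SPECIALIZERS`, `current_platform`, and `specializable` are hypothetical names, and the specializer here is just a callable that returns either a faster function or `None`.

```python
import functools

# Hypothetical registry: platform name -> specializer for that platform.
# A specializer takes the Python function and returns a faster callable,
# or None if it cannot handle that function.
SPECIALIZERS = {}


def current_platform():
    """Stand-in for real platform detection (an assumption of this sketch)."""
    return "cpu-openmp"


def specializable(func):
    """Mark func as specializable; decide what to run on its first call."""
    chosen = None  # cached per-function decision

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        nonlocal chosen
        if chosen is None:
            specializer = SPECIALIZERS.get(current_platform())
            fast = specializer(func) if specializer else None
            # No specializer, or it declined: keep running the plain Python.
            chosen = fast if fast is not None else func
        return chosen(*args, **kwargs)

    return wrapper


@specializable
def scale(xs, k):
    return [k * x for x in xs]


# No specializer is registered, so this runs as ordinary Python.
print(scale([1, 2, 3], 2))
```

Because the decision is cached inside each wrapper, it is made per function, and a function with no matching specializer simply keeps running in the interpreter.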
SEJITS makes tuning decisions per-function (not per-app)
[Diagram labels: productivity app (.py) containing f(), @g(), @h(); PLL interpreter; SEJITS; specializer; .c source; cc/ld; .so; $; OS/HW]
[Same diagram, annotated with the four parts of the name: Selective, Embedded, JIT, Specialization]
Example: Stencil Computation in Ruby
class LaplacianKernel < Kernel
  def kernel(in_grid, out_grid)
    in_grid.each_interior do |point|
      point.neighbors(1).each do |x|
        out_grid[point] += 0.2*x.val
      end
    end
  end
end
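For readers more at home in Python, the same computation can be written as plain nested loops. This sketch assumes that `neighbors(1)` means the six face neighbors of a point in a 3D grid, which is what the generated C code suggests; the function name and the nested-list grid layout are illustrative only.

```python
def laplacian_kernel(in_grid, out_grid):
    """Add 0.2 * each face neighbor of every interior point into out_grid."""
    nx, ny, nz = len(in_grid), len(in_grid[0]), len(in_grid[0][0])
    # Assumed meaning of neighbors(1): the six points at distance 1.
    offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0),
               (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                for di, dj, dk in offsets:
                    out_grid[i][j][k] += 0.2 * in_grid[i + di][j + dj][k + dk]
    return out_grid


n = 4
in_grid = [[[1.0] * n for _ in range(n)] for _ in range(n)]
out_grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
laplacian_kernel(in_grid, out_grid)
# Interior points receive six contributions of 0.2; boundary points stay 0.
print(out_grid[1][1][1], out_grid[0][0][0])
```

This is the slow path that remains available when no specializer applies.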
Use introspection to grab parameters, inspect AST of computation
VALUE kern_par(int argc, VALUE* argv, VALUE self) {
  /* unpack arrays into in_grid and out_grid */
  #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
  for (t_8=1; t_8<256-1; t_8++) {
    for (t_7=1; t_7<256-1; t_7++) {
      for (t_6=1; t_6<256-1; t_6++) {
        int center = INDEX(t_6,t_7,t_8);
        out_grid[center] = (out_grid[center]
            +(0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
        ...
        out_grid[center] = (out_grid[center]
            +(0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
      }
    }
  }
  return Qtrue;
}
• Specializer emits OpenMP
• 1000x-2000x faster than Ruby
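The code-generation step can be illustrated with a small Python function that bakes the runtime grid size and the stencil weight into C source text. This is a sketch under stated assumptions, not the specializer from the talk: `emit_stencil_c` and the `IDX` macro are hypothetical, the six-neighbor pattern is assumed, and compiling and linking the result (the cc/ld step) is left out.

```python
def emit_stencil_c(shape, weight, func_name="kern_par"):
    """Return C/OpenMP source for a weighted face-neighbor stencil on a 3D grid."""
    nx, ny, nz = shape
    offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0),
               (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    # One update statement per neighbor, with the offsets written out literally.
    updates = "\n".join(
        f"        out[IDX(i, j, k)] += {weight!r} * in[IDX(i{di:+d}, j{dj:+d}, k{dk:+d})];"
        for di, dj, dk in offsets
    )
    # Grid dimensions known only at runtime become constants in the emitted code.
    return f"""\
#define IDX(i, j, k) ((i) + {nx} * ((j) + {ny} * (k)))
void {func_name}(const double *in, double *out) {{
  int i, j, k;
  #pragma omp parallel for default(shared) private(i, j, k)
  for (k = 1; k < {nz} - 1; k++)
    for (j = 1; j < {ny} - 1; j++)
      for (i = 1; i < {nx} - 1; i++) {{
{updates}
      }}
}}
"""


src = emit_stencil_c((256, 256, 256), 0.2)
print(src)
```

Fixing the loop bounds at generation time is one way runtime information can reach the downstream C compiler, which can then unroll or vectorize with full knowledge of the sizes.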
Example: Sparse Matrix-Vector Multiply in Python
# “Gather nonzero entries,
# multiply them by vector,
# do for each column”
Specializer outputs CUDA for nvcc:
SEJITS leverages downstream toolchains
B. Catanzaro et al., joint work with NVIDIA Research
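The Python and CUDA listings from this slide did not survive in the transcript. As a stand-in, the three comment lines can be read as a column-by-column sparse matrix-vector product over an ELLPACK-style layout. The sketch below is an assumption about that structure, written in plain Python; `spmv_ell`, `data`, and `idx` are illustrative names, not the original code.

```python
def spmv_ell(data, idx, x):
    """Compute y = A x, with A stored ELLPACK-style, one stored column at a time.

    data[c][i] is the c-th stored entry of row i (0.0 where padded);
    idx[c][i] is the matrix column that entry belongs to.
    """
    y = [0.0] * len(data[0])
    for vals, cols in zip(data, idx):            # do for each stored column
        gathered = [x[j] for j in cols]          # gather the vector entries
        products = [a * xj for a, xj in zip(vals, gathered)]  # multiply
        y = [yi + p for yi, p in zip(y, products)]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
data = [[1.0, 3.0, 4.0], [2.0, 0.0, 5.0]]  # second slot of row 1 is padding
idx = [[0, 1, 0], [2, 0, 2]]
x = [1.0, 2.0, 3.0]
print(spmv_ell(data, idx, x))  # [7.0, 6.0, 19.0]
```

Each step is a whole-list operation with no loop-carried dependence inside it, which is the property a specializer would need in order to map the work onto GPU threads.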
SEJITS in the Cloud
[Diagram labels: productivity app (.py) containing f(), @g(), @h(); PLL interpreter; SEJITS; specializer; .scala source; scalac; $; Spark worker; Nexus on Eucalyptus or EC2]
Spark & Nexus
• Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures
• Relies on Scala runtime and data-parallel abstractions
• Relies on Nexus (cloud resource management) layer
Example: Logistic regression using Spark/Scala (in progress)
M. Zaharia et al., Spark: Cluster Computing With Working Sets, HotCloud’09
B. Hindman et al., Nexus: A Common Substrate for Cluster Computing, HotCloud’09
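The logistic regression code itself is not in the transcript. The sketch below shows the data-parallel shape such a program typically has: a map over data points followed by a reduce that sums per-point gradients. It runs on a single node with Python's built-in `map` and `functools.reduce` standing in for distributed operations; the function names, step size, and toy data are all assumptions.

```python
import math
from functools import reduce


def gradient(w, point):
    """Logistic-loss gradient for one (features, label) pair, label in {-1, +1}."""
    x, y = point
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def train(points, dims, iterations=100, step=0.5):
    w = [0.0] * dims
    for _ in range(iterations):
        # map: one gradient per point; reduce: element-wise sum
        total = reduce(add, map(lambda p: gradient(w, p), points))
        w = [wi - step * gi for wi, gi in zip(w, total)]
    return w


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


# Toy data: first feature is a constant bias term, second decides the label.
points = [([1.0, 2.0], 1), ([1.0, 1.5], 1), ([1.0, -1.0], -1), ([1.0, -2.5], -1)]
w = train(points, dims=2)
print([predict(w, x) for x, _ in points])
```

The data set is reused on every iteration, which is the access pattern that makes a persistent, in-memory distributed data structure attractive for this workload.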
SEJITS in the Cloud
[Diagram labels: productivity app (.py) containing f(), @g(), @h(); PLL interpreter; SEJITS; specializer; .java source; javac; $; Hadoop master; Nexus on Cloud]
SEJITS for Cloud Computing
• Idea: same Python app runs on desktop, on manycore, and in cloud
• Cloud/multicore synergy: specialize intra-node as well as generate cloud code
  • Cloud: emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...
  • Single node: emit JIT-able code for OpenCL, CUDA, OpenMP, ...
• Combine abstractions in one app
• Remember... can always fall back to PLL
Questions
• Won’t we need lots & lots of specializers?
  • if the ParLab “motifs” bet is correct, ~10s of specializers will go a long way
• What about libraries, frameworks, etc.?
  • SEJITS is complementary to frameworks
  • Most libraries are written for ELLs; ELLs lack features that promote code reuse, and libraries don’t raise the abstraction level
• Why isn’t this just as hard as a “magic compiler”?
  • Specializers are written by human experts
  • SEJITS allows “crowdsourcing” them
• Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?
Conclusion
• SEJITS enables code-generation strategy per-function, not per-app
• Uniform approach to productive programming
  • same app on cloud, multicore, autotuned libraries
• Combine multiple frameworks/abstractions in same app
• Research enabler
  • Incrementally develop specializers for different motifs or prototype HW
  • Don’t need full compiler & toolchain just to get started
Questions