Distributed Programming in Scala with APGAS


Distributed Programming in Scala with APGAS
Philippe Suter, Olivier Tardieu, Josh Milthorpe
IBM Research
Picture by Simon Greig
APGAS - Context
Asynchronous Partitioned Global Address Space
• Model for concurrency + distribution in X10.
• X10 is a general-purpose language
– Developed at IBM Research for 10+ years.
– Focus/bias towards distributed HPC tasks.
– JVM + native back-ends (through Java & C++).
– Some X10 apps ran on >50K cores.
http://x10-lang.org and X10’15 @ PLDI (tomorrow)
APGAS in Scala
• Goal: expose the concurrent/distributed core
of X10 as a library.
– In Java 8 and as a Scala DSL.
• This contribution:
– Introduction to programming w/ APGAS in Scala.
– Illustrated through two benchmarks:
• K-means clustering
• Unbalanced Tree Search (see paper)
– Contrasting model with Akka (see paper).
– Preliminary experimental scaling results.
APGAS Primer
• Concurrent tasks run at distributed places.
• The environment exposes the available places.
def places : Seq[Place]
def here : Place
• Tasks can be remote or local.
• Tasks are asynchronous by default.
def asyncAt(p : Place)(body: =>Unit) : Unit
def async(body: =>Unit) : Unit
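A minimal illustration of the two spawning forms (a hedged sketch; it assumes the APGAS runtime is up and more than one place is available). Both calls return immediately:
async {
  println(s"local task running at $here")      // spawned at the current place
}
asyncAt(places.last) {
  println(s"remote task running at $here")     // spawned at the last available place
}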
APGAS Primer
• The termination of tasks is controlled by the
finish construct.
def finish(body: =>Unit) : Unit
• Blocks until enclosed tasks have completed,
including all nested tasks, local or remote.
• Distributed termination detection is challenging; finish is
a powerful contribution of APGAS.
Hello World
• finish completes when all places have completed their task.
• asyncAt returns immediately.
finish {
  for (p <- places) {
    asyncAt(p) {
      println(s"Hello from $here.")
    }
  }
}
$> …
Hello from place(0).
Hello from place(3).
Hello from place(1).
Hello from place(2).
“Academic” Fibonacci
• finish guards a single async… but recursive invocations enclose many more.
• finish completes exactly when the computation of all dependencies is complete.
def fibonacci(i: Int): Long = {
  if (i <= 1) i else {
    var a, b = 0L
    finish {
      async {
        a = fibonacci(i - 2)
      }
      b = fibonacci(i - 1)
    }
    a + b
  }
}
Messages and Memory
• The default mechanism for transferring memory
between places is to capture it in the closure
of the body of asyncAt (see the sketch after this slide).
• APGAS lets the programmer define global
symbols for memory local to places.
class Worker(…) extends PlaceLocal
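A minimal sketch of that default capture mechanism (partialSum is a hypothetical value): anything the body of asyncAt refers to is captured in the closure, serialized, and copied to the destination place.
val partialSum: Long = 42L            // hypothetical value computed at this place
asyncAt(places(1)) {
  // partialSum was captured by the closure, serialized with the task,
  // and copied to place 1 before the body runs there
  println(s"received $partialSum at $here")
}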
Place-local Objects
• All instances of PlaceLocal resolve to objects that are place-specific.
class Worker(…) extends PlaceLocal
val w : Worker = PlaceLocal.forPlaces(places) {
  new Worker(…)   // one distinct instance is created at each place
}
for (p <- places) {
  asyncAt(p) { w.work() }   // here, w resolves to the worker at place p
}
Global and Shared References
• For objects that cannot extend PlaceLocal,
APGAS provides a wrapper (“pointer”); see the sketch after this slide.
trait GlobalRef[T] { def apply(): T }
• Shared references refer to an object at a
particular place and can only be dereferenced
there.
– Useful to “call back” from an asynchronous task.
trait SharedRef[T] { def apply(): T }
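A hedged sketch of GlobalRef in use. The forPlaces-style constructor below is an assumption, mirroring PlaceLocal.forPlaces; it is not quoted from the library.
// one array is allocated at each place; the single handle `data`
// resolves to the local copy wherever it is dereferenced
val data: GlobalRef[Array[Double]] =
  GlobalRef.forPlaces(places) { new Array[Double](1000) }

for (p <- places) {
  asyncAt(p) {
    data()(0) = 1.0   // apply() yields the array owned by place p
  }
}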
Global and Shared References
// at place p1
val largeArray : Array[Double] = …
val ref = SharedRef.make(largeArray)
asyncAt(p2) {
  // Dereferencing ref() here would be an error.
  …
  asyncAt(p1) {
    val array = ref()   // the dereference at p1 resolves to largeArray
    array(…) = …
  }
  …
}
// largeArray is never captured, therefore never serialized.
Distributed K-means Clustering
• Goal: iteratively divide a set of points into K
disjoint clusters.
• Distribute the points among workers.
• In each iteration:
– the workers:
• compute the new centroids for their own points (see the sketch after this list).
• communicate their view of the centroids to the master.
– the master:
• aggregates all workers’ data and checks for convergence.
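A plain-Scala sketch of the per-worker step (names and signatures are illustrative, not the paper’s code): assign each local point to its nearest centroid and accumulate per-cluster coordinate sums and counts, from which the new local centroids are derived.
// hypothetical per-worker k-means step
def localStep(points: Array[Array[Double]],
              centroids: Array[Array[Double]]): (Array[Array[Double]], Array[Long]) = {
  val k = centroids.length
  val dim = centroids(0).length
  val sums = Array.ofDim[Double](k, dim)
  val counts = Array.ofDim[Long](k)
  for (point <- points) {
    // find the nearest centroid (squared Euclidean distance)
    var best = 0
    var bestDist = Double.MaxValue
    for (c <- 0 until k) {
      var d = 0.0
      for (j <- 0 until dim) { val diff = point(j) - centroids(c)(j); d += diff * diff }
      if (d < bestDist) { bestDist = d; best = c }
    }
    counts(best) += 1
    for (j <- 0 until dim) sums(best)(j) += point(j)
  }
  (sums, counts)
}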
Distributed K-Means: Memory
• Each worker needs to hold (GlobalRef[WorkerData]):
– its set of points.
– its local view of the centroids.
• In addition, the master holds (SharedRef[MasterData]):
– the aggregated centroids.
• In our implementation, the workers write their results directly into the master’s data structures.
– This requires a synchronized data structure (sketched below).
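A hedged sketch of the two data holders named above; the field names and the coarse lock are assumptions, not the paper’s code.
class WorkerData(val points: Array[Array[Double]],
                 var centroids: Array[Array[Double]])

class MasterData(k: Int, dim: Int) {
  val sums   = Array.ofDim[Double](k, dim)
  val counts = Array.ofDim[Long](k)
  // workers at other places call back into the master's place and merge
  // concurrently, hence the synchronization
  def merge(localSums: Array[Array[Double]], localCounts: Array[Long]): Unit =
    this.synchronized {
      for (i <- 0 until k) {
        counts(i) += localCounts(i)
        for (j <- 0 until dim) sums(i)(j) += localSums(i)(j)
      }
    }
  // once every worker has merged, new centroid i is sums(i)(j) / counts(i)
}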
Distributed K-Means: Structure
while (!converged) {
  finish {
    for (p <- places) {
      asyncAt(p) {
        // compute new local centroids
        asyncAt(masterRef.home()) {
          // merge local centroids in master
        }
      }
    }
  }
}
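A hedged sketch filling in the loop body above, reusing the WorkerData/MasterData and localStep sketches from the previous slides; workerData (a GlobalRef[WorkerData]) and masterRef (a SharedRef[MasterData]) are assumed to have been set up beforehand.
def runIteration(workerData: GlobalRef[WorkerData],
                 masterRef: SharedRef[MasterData]): Unit =
  finish {
    for (p <- places) {
      asyncAt(p) {
        val data = workerData()                  // resolves to this place's WorkerData
        val (sums, counts) = localStep(data.points, data.centroids)
        asyncAt(masterRef.home()) {
          masterRef().merge(sums, counts)        // the dereference is legal at the master
        }
      }
    }
  }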
Unbalanced Tree Search
• Counts nodes in a dynamically generated tree.
• Each node:
– Has an associated SHA1 hash.
– Has a number of children determined by a
probabilistic law.
• Trees are unbalanced in an unpredictable but
deterministic way (illustrated below).
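A hedged illustration of how such a tree can be generated deterministically (the helper names, branching parameters, and probability law are illustrative, not the benchmark’s exact ones): a node’s SHA-1 digest decides how many children it has, and each child’s digest is the hash of the parent’s digest plus the child index.
import java.security.MessageDigest

// illustrative only: a node has `m` children with probability `q`, else none;
// the choice is derived deterministically from the node's SHA-1 digest
def childCount(digest: Array[Byte], m: Int = 4, q: Double = 0.24): Int = {
  val v = ((digest(0) & 0xffL) << 24) | ((digest(1) & 0xffL) << 16) |
          ((digest(2) & 0xffL) << 8)  |  (digest(3) & 0xffL)
  if (v.toDouble / (1L << 32).toDouble < q) m else 0
}

// child indices stay small (< m), so a single byte suffices here
def childDigest(parent: Array[Byte], index: Int): Array[Byte] =
  MessageDigest.getInstance("SHA-1").digest(parent :+ index.toByte)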
Unbalanced Tree Search
• The algorithm combines work-stealing and work-dealing among workers.
• Workers are modeled as state machines.
• Termination:
– in APGAS: a single, top-level finish (sketched below).
– in Akka: requires a counting protocol.
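A hedged sketch of the APGAS side of that comparison (rootWorker and run() are hypothetical): one finish tracks the root task and every task it transitively spawns at any place, however work is later stolen or dealt, so no explicit counting protocol is needed.
finish {
  asyncAt(places.head) {
    rootWorker.run()   // may deal work to other places via further asyncAt calls
  }
}
// reaching this line means all tasks at all places have terminated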
APGAS Implementation
• APGAS implementation:
– ~2000 lines Java 8
– ~200 lines Scala (definitions, helpers, serialization)
• Tasks are scheduled using fork/join.
• Distribution built on top of Hazelcast.
• Benchmarks are ~1200 Scala lines
– 1/3 APGAS, 1/3 Akka, 1/3 common.
Performance Evaluation
• For both benchmarks, we ran a fixed problem
using 1, 2, 4, 8, 16, and 32 workers.
• Measured “unit of work” per second per
worker.
• All experiments ran on a single 48-core machine.
– Akka benchmarks use akka-remote.
Performance Evaluation
• Experiments are meant to:
– be a sanity check,
– provide evidence of scalability potential.
• Please do not interpret these results as a claim that X is
better than Y.
“Comparable performance and scalability
for comparable complexity.”
K-Means
[Chart: iterations per second per worker vs. number of workers (1–32), comparing APGAS and Akka.]
Unbalanced Tree Search
[Chart: millions of nodes per second per worker vs. number of workers (1–32), comparing APGAS and Akka.]
Conclusion
• Made the APGAS programming model
accessible to Scala programmers.
• Programming style is different, but a good fit
for some problems.
• In particular, finish concisely solves hard
distributed termination problems.
• Complexity is similar to that of equivalent Akka implementations.
• Promising preliminary scaling results.
Thank you!