The Galois Project
Keshav Pingali
University of Texas, Austin
Joint work with Milind Kulkarni, Martin Burtscher, Patrick Carribault,
Donald Nguyen, Dimitrios Prountzos, Zifei Zhong
Overview of Galois Project
• Focus of Galois project:
  – parallel execution of irregular programs
    • pointer-based data structures like graphs and trees
  – raise abstraction level for “Joe programmers”
    • explicit parallelism is too difficult for most programmers
    • performance penalty for abstraction should be small
• Research approach:
  a) study algorithms to find common patterns of parallelism and locality
     – Amorphous Data-Parallelism (ADP)
  b) design programming constructs for expressing these patterns
  c) implement these constructs efficiently
     – Abstraction-Based Speculation (ABS)
• For more information:
  – papers in PLDI 2007, ASPLOS 2008, SPAA 2008
  – website: http://iss.ices.utexas.edu
Organization
• Case study of amorphous data-parallelism
– Delaunay mesh refinement
• Galois system (PLDI 2007)
– Programming model
– Baseline implementation
• Galois system optimizations
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Delaunay Mesh Refinement
• Iterative refinement to remove badly shaped triangles:

  while there are bad triangles do {
      pick a bad triangle;
      find its cavity;
      retriangulate cavity;  // may create new bad triangles
  }

• Order in which bad triangles should be refined:
  – final mesh depends on order in which bad triangles are processed
  – but all bad triangles will be eliminated ultimately, regardless of order

Sequential code:

  Mesh m = /* read in mesh */
  WorkList wl;
  wl.add(mesh.badTriangles());
  while (true) {
      if ( wl.empty() ) break;
      Triangle e = wl.get();
      if (e no longer in mesh) continue;
      Cavity c = new Cavity(e);
      c.expand();            // determine cavity
      c.retriangulate();     // re-triangulate cavity
      m.update(c);           // update mesh
      wl.add(c.badTriangles());
  }
Delaunay Mesh Refinement
• Parallelism:
  – triangles with non-overlapping cavities can be processed in parallel
  – if the cavities of two triangles overlap, they must be done serially
  – in practice, lots of parallelism
• Exploiting this parallelism:
  – compile-time parallelization techniques like points-to and shape analysis cannot expose this parallelism (property of the algorithm, not the program)
  – runtime dependence checking is needed
• Galois approach: optimistic parallelization
Take-away lessons
• Amorphous data-parallelism
  – data structures: graphs, trees, etc.
  – iterative algorithm over an unordered or ordered work-list
  – elements can be added to the work-list during computation
  – complex patterns of dependences between computations on different work-list elements (possibly input-sensitive)
  – but many of these computations can be done in parallel
• Contrast: crystalline (regular) data-parallelism
  – data structures: dense matrices
  – iterative algorithm over a fixed integer interval
  – simple dependence patterns: affine subscripts in array accesses (mostly input-insensitive)

  for i = 1, N
    for j = 1, N
      for k = 1, N
        C[i,j] = C[i,j] + A[i,k]*B[k,j]
Take-away lessons (contd.)
• Amorphous data-parallelism is ubiquitous
  – Delaunay mesh generation: points to be inserted into mesh
  – Delaunay mesh refinement: list of bad triangles
  – Agglomerative clustering: priority queue of points from data-set
  – Boykov-Kolmogorov algorithm for image segmentation
  – Reduction-based interpreters for λ-calculus: list of redexes
  – Iterative dataflow analysis algorithms in compilers
  – Approximate SAT solvers: survey propagation, WalkSAT
  – …
Take-away lessons (contd.)
• Amorphous data-parallelism is obscured within while loops, exit conditions, etc. in conventional languages
  – need transparent syntax, similar to FOR loops for crystalline data-parallelism
• Optimistic parallelization is necessary in general
  – compile-time approaches using points-to analysis or shape analysis may be adequate for some cases
  – in general, runtime dependence checking is needed
  – this is a property of algorithms, not programs
• Handling of dependence conflicts depends on the application
  – Delaunay mesh generation: roll back any conflicting computation
  – Agglomerative clustering: must respect priority queue order
Organization
• Case study of amorphous data-parallelism
– Delaunay mesh refinement
• Galois system (PLDI 2007)
– Programming model
– Baseline implementation
• Galois system optimizations
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Galois Design Philosophy
• Do not worry about “dusty decks” (for now)
  – restructuring existing code to expose amorphous data-parallelism: not our focus (cf. Google map/reduce)
• Evolution, not revolution
  – modification of existing programming paradigms: OK
  – radical solutions like functional programming: not OK
• No reliance on parallelizing compiler technology
  – will not work for many of our applications anyway
  – parallelizing compilers are very complex software artifacts
• Support two classes of programmers:
  – domain experts (Joe):
    • should be shielded from complexities of parallel programming
    • most programmers will be Joes
  – parallel programming experts (Steve):
    • small number of highly trained people
  – analogs:
    • industry model even for sequential programming
    • norm in domains like numerical linear algebra
      – Steves implement BLAS libraries
      – Joes express their algorithms in terms of BLAS routines
Galois system
• Application program
  – has well-defined sequential semantics
    • current implementation: sequential Java
  – uses optimistic iterators to highlight opportunities for exploiting parallelism to the runtime system
• Class libraries
  – like the Java collections library, but with additional information for concurrency control
• Runtime system
  – manages optimistic parallelism
Optimistic set iterators
• for each e in Set S do B(e)
  – evaluate block B(e) for each element in set S
  – sequential semantics:
    • set elements are unordered, so no a priori order on iterations
    • there may be dependences between iterations
  – set S may get new elements during execution
• for each e in OrderedSet S do B(e)
  – evaluate block B(e) for each element in set S
  – sequential semantics:
    • perform iterations in the order specified by the OrderedSet
    • there may be dependences between iterations
  – set S may get new elements during execution
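To make the unordered iterator’s sequential semantics concrete, here is a minimal sketch in plain Java (the talk’s implementation language). The Body interface and UnorderedForeach class are hypothetical stand-ins for illustration, not the actual Galois API:

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Hypothetical stand-ins for illustration; not the actual Galois API.
  interface Body<T> { void apply(T e, Deque<T> workset); }

  class UnorderedForeach {
      // Sequential meaning of "for each e in Set S do B(e)": elements are
      // unordered, and the body may add new elements to S while it runs.
      static <T> void run(Deque<T> workset, Body<T> body) {
          while (!workset.isEmpty()) {
              T e = workset.poll();    // any element; no a priori order
              body.apply(e, workset);  // body may call workset.add(...)
          }
      }

      public static void main(String[] args) {
          Deque<Integer> ws = new ArrayDeque<>();
          ws.add(16);
          // Example body: keep splitting a work item until the pieces are
          // small, adding new work as it goes (like new bad triangles).
          run(ws, (n, s) -> {
              if (n > 1) { s.add(n / 2); s.add(n - n / 2); }
              else System.out.println("done: " + n);
          });
      }
  }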
Galois version of mesh refinement
Mesh m = /* read in mesh */
Set wl;
wl.add(mesh.badTriangles());   // initialize the Set wl
for each e in Set wl do {      // unordered Set iterator
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    m.update(c);
    wl.add(c.badTriangles());  // add new bad triangles to Set
}

• Scheduling policy for the iterator:
  – controlled by the implementation of the Set class
  – good choice for temporal locality: stack
Parallel execution model
• Object-based shared-memory model
• Master thread and some number of worker threads
  – master thread begins execution of the program and executes code between iterators
  – when it encounters an iterator, worker threads help by executing iterations concurrently with the master
  – threads synchronize by barrier synchronization at the end of the iterator
• Threads invoke methods to access internal state of objects
  – how do we ensure the sequential semantics of the program are respected?

[Figure: the master thread runs main(), which contains two “for each” iterators; program threads access shared objects in shared memory]
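A minimal sketch of this execution model in Java, assuming hypothetical names (IteratorExecution, runIterator); it shows the master helping with iterations and the end-of-iterator barrier, but omits conflict detection and rollback:

  import java.util.concurrent.ConcurrentLinkedQueue;
  import java.util.function.Consumer;

  // Illustrative sketch of the master/worker model; not the actual
  // Galois runtime. The final join acts as the end-of-iterator barrier.
  class IteratorExecution {
      static <T> void runIterator(ConcurrentLinkedQueue<T> workset,
                                  Consumer<T> body,
                                  int numWorkers) throws InterruptedException {
          Runnable drain = () -> {
              T e;
              while ((e = workset.poll()) != null) body.accept(e);
          };
          Thread[] workers = new Thread[numWorkers];
          for (int w = 0; w < numWorkers; w++) {
              workers[w] = new Thread(drain);
              workers[w].start();
          }
          drain.run();                       // master helps execute iterations
          for (Thread t : workers) t.join(); // barrier at end of iterator
          // NOTE: this termination test is simplified; the real runtime must
          // handle work added after the queue momentarily empties, and it
          // performs conflict detection and rollback, omitted here.
      }
  }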
Baseline solution: PLDI 2007
• An iteration must lock an object to invoke a method
• Two types of objects:
  – catch and keep policy
    • lock is held even after the method invocation completes
    • all locks released at end of iteration
    • poor performance for programs with collections and accumulators
  – catch and release policy
    • like the Java locking policy
    • permits method invocations from different concurrent iterations to be interleaved
    • how do we make sure this is safe?
Catch and keep: iteration rollback
• What does iteration j do if an object is already locked by some other iteration i?
  – one possibility: wait and try to acquire the lock again
    • but this might lead to deadlock
  – our implementation: the runtime system rolls back one of the iterations by undoing its updates to shared objects
• Undoing updates: any “copy-on-write” solution works
  – make a copy of the entire object when you acquire the lock (wasteful for large objects)
  – the runtime system maintains an “undo log” that holds information for undoing side-effects to objects as they are made (cf. software transactional memory)
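As a concrete illustration of the undo-log option, here is a minimal sketch in Java; the names (UndoLog, Counter) are illustrative assumptions, not the runtime system’s actual interfaces. On conflict the losing iteration’s log is replayed in reverse; on success it is discarded:

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Minimal undo-log sketch (illustrative; not the actual Galois runtime).
  // Each side-effect registers a closure that reverses it; on abort, the
  // log is replayed newest-first (cf. software transactional memory).
  class UndoLog {
      private final Deque<Runnable> log = new ArrayDeque<>();

      void record(Runnable inverse) { log.push(inverse); }

      void rollback() {                 // undo side-effects, newest first
          while (!log.isEmpty()) log.pop().run();
      }

      void commit() { log.clear(); }    // iteration finished; drop the log
  }

  class Counter {
      int value;
      // A forward method logs its semantic inverse before mutating state.
      void add(int n, UndoLog undo) {
          undo.record(() -> value -= n);
          value += n;
      }
  }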
Problem with catch and keep
• Poor performance for programs that deal with mutable collections and accumulators
  – work-sets are mutable collections
  – accumulators are ubiquitous
• Example: Delaunay refinement
  – the work-set is a (mutable) collection of bad triangles
  – some thread grabs the lock on the work-set object, gets a bad triangle, and removes it from the work-set
  – that thread must retain the lock on the work-set until its iteration completes, which shuts out all other threads (the same problem arises with transactional memory)
• Lesson:
  – for some objects, we need to interleave method invocations from different iterations
  – but we must not lose serializability
Galois solution: selective catch and release
• Example: accumulator
  – two methods:
    • add(int)
    • read()   // return value
  – adds commute with other adds, and reads commute with other reads
• Interleaving of commuting method invocations from different iterations ⇒ OK
• Interleaving of non-commuting method invocations from different iterations ⇒ trigger abort
• Rolling back side-effects: the programmer must provide “inverse” methods for forward methods
  – the inverse method for add(n) is subtract(n)
  – a semantic inverse, not a representational inverse

[Figure: two iterations each invoke a.add(5), a.add(3), a.read(), a.add(8), a.add(-4) on a shared accumulator; commuting invocations may interleave]
This solution works for sets as well.
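A minimal sketch, assuming a hypothetical conflict-checking scheme, of how the accumulator’s commutativity conditions could be enforced at run time: adds commute with adds, reads with reads, and an add against a read from another uncommitted iteration triggers an abort (here, an exception). The actual Galois class library differs in detail:

  import java.util.HashSet;
  import java.util.Set;

  class ConflictException extends RuntimeException {}

  class Accumulator {
      private int value;
      private final Set<Integer> pendingAdds = new HashSet<>();  // uncommitted adders
      private final Set<Integer> pendingReads = new HashSet<>(); // uncommitted readers

      // add(n) commutes with other adds, not with reads by other iterations.
      synchronized void add(int n, int iter) {
          for (int r : pendingReads) if (r != iter) throw new ConflictException();
          pendingAdds.add(iter);
          value += n;
      }

      // read() commutes with other reads, not with adds by other iterations.
      synchronized int read(int iter) {
          for (int a : pendingAdds) if (a != iter) throw new ConflictException();
          pendingReads.add(iter);
          return value;
      }

      // Semantic inverse of add(n), invoked on abort.
      synchronized void subtract(int n) { value -= n; }

      // Called when iteration iter commits or finishes aborting.
      synchronized void release(int iter) {
          pendingAdds.remove(iter);
          pendingReads.remove(iter);
      }
  }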
Abstraction-based Speculation
• Library writer:
– specifies commutativity and inverse information for some classes
• Runtime system:
  – catch and release locking for these classes
  – keeps track of forward method invocations
  – checks commutativity of forward method invocations
  – invokes appropriate inverse methods on abort
• More details: PLDI 2007
• Related work:
– logical locks in database systems
– Herlihy et al: PPoPP 2008
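One way the library writer’s specification could look, sketched here with hypothetical Java annotations (@Commutes, @Inverse); the actual specification syntax is described in the PLDI 2007 paper:

  import java.lang.annotation.Retention;
  import java.lang.annotation.RetentionPolicy;

  // Hypothetical annotations for commutativity/inverse information;
  // illustrative only.
  @Retention(RetentionPolicy.RUNTIME)
  @interface Commutes { String[] with(); }

  @Retention(RetentionPolicy.RUNTIME)
  @interface Inverse { String method(); }

  interface WorkSet<T> {
      // For an unordered set, add() commutes with other adds and with
      // get(); its inverse on abort is remove().
      @Commutes(with = {"add", "get"})
      @Inverse(method = "remove")
      void add(T x);

      // get() removes and returns an arbitrary element; because no order
      // is promised, get() commutes with adds and with other gets.
      @Commutes(with = {"add", "get"})
      @Inverse(method = "add")
      T get();

      void remove(T x);
  }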
Organization
• Case study of amorphous data-parallelism
– Delaunay mesh refinement
• Galois system (PLDI 2007)
– Programming model
– Baseline implementation
• Galois system optimizations
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Scheduling iterators
• Control scheduling by changing the implementation of the work-set class
  – stack/queue/etc.
• Scheduling can have a profound effect on performance
• Example: Delaunay mesh refinement
  – 10,156 triangles, of which 4,837 were bad
  – sequential code, work-set is a stack:
    • 21,918 iterations completed + 0 aborted
  – 4-processor Itanium 2, work-set implementations:
    • stack: 21,736 iterations completed + 28,290 aborted
    • array + random choice: 21,908 iterations completed + 49 aborted
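Because the scheduling policy lives in the work-set class, swapping the data structure changes the schedule without touching the application. A minimal sketch with hypothetical classes (WorkSet, StackWorkSet, RandomWorkSet):

  import java.util.ArrayDeque;
  import java.util.ArrayList;
  import java.util.Random;

  // The scheduling policy is whatever order poll() hands out elements.
  interface WorkSet<T> { void add(T x); T poll(); boolean isEmpty(); }

  // LIFO: good temporal locality, but nearby (conflicting) triangles
  // tend to run back-to-back on different threads.
  class StackWorkSet<T> implements WorkSet<T> {
      private final ArrayDeque<T> s = new ArrayDeque<>();
      public void add(T x) { s.push(x); }
      public T poll() { return s.pop(); }
      public boolean isEmpty() { return s.isEmpty(); }
  }

  // Random choice: spreads concurrent work across the mesh, which is
  // what cut aborts from 28,290 to 49 in the experiment above.
  class RandomWorkSet<T> implements WorkSet<T> {
      private final ArrayList<T> a = new ArrayList<>();
      private final Random rng = new Random();
      public void add(T x) { a.add(x); }
      public T poll() {
          int i = rng.nextInt(a.size());
          T last = a.remove(a.size() - 1);             // O(1) removal:
          return i < a.size() ? a.set(i, last) : last; // swap with last
      }
      public boolean isEmpty() { return a.isEmpty(); }
  }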
Scheduling iterators (SPAA 2008)
• Crystalline data-parallelism: DO-ALL loops
  – main scheduling concerns are locality and load-balancing
  – OpenMP: static, dynamic, guided, etc.
• Amorphous data-parallelism: many more issues
  – conflicts
  – dynamically created work
  – algorithmic issues: efficiency of data structures
• SPAA 2008 paper:
  – scheduling framework for exploiting amorphous data-parallelism
  – generalizes the OpenMP DO-ALL loop scheduling constructs
Data Partitioning (ASPLOS 2008)
[Figure: mesh partitioned among cores]

• Partition the graph between cores
• Data-centric assignment of work:
  – each core gets bad triangles from its own partitions
  – improves locality
  – can dramatically reduce conflicts
• Lock coarsening:
  – associate locks with partitions
• Over-decomposition:
  – improves core utilization
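A minimal sketch, with hypothetical names, of lock coarsening: one lock per partition rather than per element, so an iteration working inside a partition takes a single lock. Here a simple hash stands in for the graph partitioner’s actual assignment:

  import java.util.concurrent.locks.ReentrantLock;

  // Illustrative sketch of partition locks; the ASPLOS 2008
  // implementation differs in detail.
  class PartitionedLocks {
      private final ReentrantLock[] locks;
      private final int numPartitions;

      PartitionedLocks(int numPartitions) {
          this.numPartitions = numPartitions;
          locks = new ReentrantLock[numPartitions];
          for (int i = 0; i < numPartitions; i++) locks[i] = new ReentrantLock();
      }

      // Map an element to its partition (stand-in for the partitioner).
      int partitionOf(Object element) {
          return Math.floorMod(element.hashCode(), numPartitions);
      }

      // One lock acquisition covers every element in the partition.
      void lockFor(Object element)   { locks[partitionOf(element)].lock(); }
      void unlockFor(Object element) { locks[partitionOf(element)].unlock(); }
  }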
Organization
• Case study of amorphous data-parallelism
– Delaunay mesh refinement
• Galois system (PLDI 2007)
– Programming model
– Baseline implementation
• Galois system optimizations
– Scheduling (SPAA 2008)
– Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Small-scale multiprocessor results
• 4-processor Itanium 2
  – 16 KB L1, 256 KB L2, 3 MB L3 cache
• Versions:
  – GAL: using stack as work-list
  – PAR: partitioned mesh + data-centric work assignment
  – LCO: locks on partitions
  – OVD: over-decomposed version (factor of 4)
Large-scale multiprocessor results
• Maverick@TACC
  – 128-core Sun Fire E25K, 1 GHz
  – 64 dual-core processors
  – Sun Solaris
• First “out-of-the-box” results
• Speed-up of 20 on 32 cores for refinement
• Mesh partitioning is still sequential
  – time for mesh partitioning starts to dominate after 8 processors (32 partitions)
• Need parallel mesh partitioning
  – Par-Metis (Karypis et al.)
Galois version of mesh refinement
Mesh m = /* read in mesh */
Set wl;
wl.add(mesh.badTriangles());   // initialize the Set wl
for each e in Set wl do {      // unordered Set iterator
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e);
    c.expand();
    c.retriangulate();
    m.update(c);
    wl.add(c.badTriangles());  // add new bad triangles to Set
}

[Figure: the application runs over a partitioned work-set and a partitioned graph, on top of the Galois runtime system]
Results for BK Image Segmentation
• Versions:
  – GAL: standard Galois version
  – PAR: partitioned graph
  – LCO: locks on partitions
  – OVD: over-decomposed version
Related work
• Transactions
  – programming model is explicitly parallel
  – assumes someone else is responsible for parallelism, locality, load-balancing, and scheduling, and focuses only on synchronization
  – Galois: main concerns are parallelism, locality, load-balancing, and scheduling
  – “catch and keep” classes can use TM for roll-back, but this is probably overkill
• Thread-level speculation
  – not clear where to speculate in C programs
    • wastes power in useless speculation
  – many schemes require extensive hardware support
  – no notion of abstraction-based speculation
  – no analogs of data partitioning or scheduling
  – overall results are disappointing
Ongoing work
• Case studies of irregular programs
  – understand “parallelism and locality patterns” in irregular programs
• Lonestar benchmark suite for irregular programs
  – joint work with Calin Cascaval’s group at IBM Yorktown Heights
• Optimizing the Galois runtime system
  – improve performance for iterators in which work per iteration is relatively low
• Compiler analysis to reduce the overheads of optimistic parallel execution
• Scalability studies
  – larger numbers of cores
• Distributed-memory implementation
  – billion-element meshes?
• Program analysis to verify assertions about class methods
  – needs a semantic specification of the class
Summary
• Irregular applications have amorphous data-parallelism
  – work-list based iterative algorithms over unordered and ordered sets
• Amorphous data-parallelism may be inherently data-dependent
  – pointer/shape analysis cannot work for these apps
• Optimistic parallelization is essential for such apps
  – analysis might be useful to optimize parallel program execution
• Exploiting abstractions and high-level semantics is critical
  – Galois knows about sets, ordered sets, accumulators, …
• Galois approach provides a unified view of data-parallelism in regular and irregular programs
  – baseline is optimistic parallelism
  – use compiler analysis to make decisions at compile-time whenever possible