
Challenges in Combinatorial Scientific Computing

John R. Gilbert
University of California, Santa Barbara

Grand Challenges in Data-Intensive Discovery, October 28, 2010

Support: NSF, DOE, Intel, Microsoft
Combinatorial Scientific Computing
“I observed that most of the coefficients in our matrices were zero; i.e., the nonzeros were ‘sparse’ in the matrix, and that typically the triangular matrices associated with the forward and back solution provided by Gaussian elimination would remain sparse if pivot elements were chosen with care”

- Harry Markowitz, describing the 1950s work on portfolio theory that won the 1990 Nobel Prize for Economics
Graphs and Sparse Matrices: Cholesky factorization

Fill: new nonzeros in factor.

Symmetric Gaussian elimination:

for j = 1 to n
    add edges between j’s higher-numbered neighbors

[Figure: the graph G(A) and its filled graph G+(A), which is chordal]
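To make the rule concrete, here is a minimal Python sketch of symbolic elimination on a graph stored as adjacency sets; the dict-of-sets representation and the name symbolic_fill are just for this illustration, and vertices are assumed to be numbered 1..n in elimination order.

def symbolic_fill(adj, n):
    # adj: dict mapping each vertex 1..n to the set of its neighbors in G(A).
    # Returns the filled graph G+(A): eliminating vertex j connects all of
    # j's higher-numbered neighbors to one another (these are the fill edges).
    filled = {v: set(adj[v]) for v in range(1, n + 1)}
    for j in range(1, n + 1):
        higher = [v for v in filled[j] if v > j]
        for a in higher:
            for b in higher:
                if a != b:
                    filled[a].add(b)      # new nonzero in the Cholesky factor
    return filled

The edges of the result that are not in adj are exactly the fill; choosing the elimination order carefully, as Markowitz observed, is what keeps that set small.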
Large graphs are everywhere…

• Internet structure
• Social interactions
• Scientific datasets: biological, chemical, cosmological, ecological, …

[Figures: WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong]
The Challenge of the Middle
An analogy?

[Diagram: Continuous physical modeling → Linear algebra → Computers]

As the “middleware” of scientific computing, linear algebra has supplied or enabled:
• Mathematical tools
• “Impedance match” to computer operations
• High-level primitives
• High-quality software libraries
• Ways to extract performance from computer architecture
• Interactive environments
An analogy?

[Diagram: Continuous physical modeling → Linear algebra → Computers, alongside Discrete structure analysis → Graph theory → Computers]
An analogy? Well, we’re not there yet ….

[Diagram: Discrete structure analysis → Graph theory → Computers]

✓ Mathematical tools
? “Impedance match” to computer operations
? High-level primitives
? High-quality software libs
? Ways to extract performance from computer architecture
? Interactive environments
The Case for Primitives
All-Pairs Shortest Paths on a GPU
[Buluc et al.]

Based on the R-Kleene algorithm, with the distance matrix partitioned into 2×2 blocks [A B; C D] on the semiring where + is “min” and × is “add”:

A = A*;            % recursive call
B = AB;  C = CA;
D = D + CB;
D = D*;            % recursive call
B = BD;  C = DC;
A = A + BC;

Well suited for GPU architecture:
• In-place computation => low memory bandwidth
• Few, large MatMul calls => low GPU dispatch overhead
• Recursion stack on host CPU, not on multicore GPU
• Careful tuning of GPU code
• Fast matrix-multiply kernel
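A minimal NumPy sketch of the recursion above, assuming a dense distance matrix with zeros on the diagonal and np.inf for missing edges; minplus and rkleene are illustrative names, and the real GPU code replaces the (min,+) product with a tuned matrix-multiply kernel.

import numpy as np

def minplus(X, Y):
    # (min,+) matrix product: Z[i,j] = min_k X[i,k] + Y[k,j].
    # Broadcasting over k is fine for a small demonstration, not for production sizes.
    return np.min(X[:, :, None] + Y[None, :, :], axis=1)

def rkleene(W):
    # All-pairs shortest paths by the recursive R-Kleene scheme on the
    # (min,+) semiring, updating the four blocks of W in place.
    n = W.shape[0]
    if n <= 16:                                   # base case: Floyd-Warshall
        for k in range(n):
            W = np.minimum(W, W[:, [k]] + W[[k], :])
        return W
    m = n // 2
    A, B = W[:m, :m], W[:m, m:]
    C, D = W[m:, :m], W[m:, m:]
    A[:] = rkleene(A)                             # A = A*   (recursive call)
    B[:] = minplus(A, B); C[:] = minplus(C, A)    # B = AB;  C = CA
    D[:] = np.minimum(D, minplus(C, B))           # D = D + CB
    D[:] = rkleene(D)                             # D = D*   (recursive call)
    B[:] = minplus(B, D); C[:] = minplus(D, C)    # B = BD;  C = DC
    A[:] = np.minimum(A, minplus(B, C))           # A = A + BC
    return W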
APSP: Experiments and observations

[Figure: runtime vs. matrix dimension, log-log; annotations: “Lifting Floyd-Warshall to GPU”, “Unorthodox R-Kleene algorithm”, “The right primitive!”, “480x”]

• High performance is achievable but not simple
• Carefully chosen and optimized primitives are key
• Matching the architecture and the algorithm is key
The Case for Sparse Matrices

Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.

Traditional graph computations vs. graphs in the language of linear algebra:
• Data-driven, unpredictable communication vs. fixed communication patterns
• Irregular and unstructured, with poor locality of reference vs. operations on matrix blocks that exploit the memory hierarchy
• Fine-grained data accesses, dominated by latency vs. coarse-grained parallelism, limited by bandwidth
Identification of sparse array-based primitives:

• Sparse matrix-matrix multiplication (SpGEMM)
• Element-wise operations (.*)
• Sparse matrix-dense vector multiplication
• Sparse matrix indexing

Matrices on various semirings: (×, +), (and, or), (+, min), …
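For readers who think in code, the same four primitives can be mimicked with scipy.sparse (a stand-in here, not the library interface discussed in this talk, and only on the ordinary (+, ×) semiring):

import numpy as np
import scipy.sparse as sp

# Small random sparse matrices standing in for graph adjacency data.
A = sp.random(6, 6, density=0.3, format="csr", random_state=0)
B = sp.random(6, 6, density=0.3, format="csr", random_state=1)
x = np.arange(6, dtype=float)

C = A @ B            # SpGEMM: sparse matrix-matrix multiplication
E = A.multiply(B)    # element-wise operation (.*)
y = A @ x            # sparse matrix-dense vector multiplication
S = A[1:4, 1:4]      # sparse matrix indexing (extract a subgraph)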
Multiple-source breadth-first search

[Figure, built up over three slides: the frontiers of several searches are the columns of a sparse matrix X; one product A^T X yields the next frontier of every search at once.]

• Sparse array representation => space efficient
• Sparse matrix-matrix multiplication => work efficient
• Three levels of available parallelism: searches, vertices, edges
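A minimal scipy.sparse sketch of the idea, assuming an unweighted directed graph with adjacency matrix A (A[v, w] nonzero for an edge v -> w); multi_bfs is an illustrative name, not an API from the libraries discussed later.

import numpy as np
import scipy.sparse as sp

def multi_bfs(A, sources):
    # Level-synchronous BFS from several sources at once: the frontiers of
    # all searches are the columns of a sparse matrix X, and one product
    # with A^T advances every search by one level.
    n = A.shape[0]
    k = len(sources)
    AT = sp.csr_matrix(A, dtype=np.float64).T.tocsr()
    cols = np.arange(k)
    X = sp.csr_matrix((np.ones(k), (sources, cols)), shape=(n, k))
    level = np.full((n, k), -1, dtype=int)
    level[sources, cols] = 0
    d = 0
    while X.nnz:
        d += 1
        Y = AT @ X                      # candidate next frontiers, one column per search
        rows, ycols = Y.nonzero()
        keep = level[rows, ycols] < 0   # discard vertices already reached
        rows, ycols = rows[keep], ycols[keep]
        level[rows, ycols] = d
        X = sp.csr_matrix((np.ones(len(rows)), (rows, ycols)), shape=(n, k))
    return level                        # level[v, j] = depth of v from sources[j], or -1

Each column is an independent search (one level of parallelism), each column of the product exposes the vertices of a frontier, and each nonzero within it an edge, matching the three levels listed above.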
A Few Examples
Combinatorial BLAS
[Buluc, G]

A parallel graph library based on distributed-memory sparse arrays and algebraic graph primitives. [Figure: typical software stack]

Betweenness Centrality (BC): what fraction of shortest paths pass through this node? Computed with Brandes’ algorithm.
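To show what such code can look like, here is a minimal single-source sketch of the Brandes sweeps written with sparse matrix-vector products in NumPy/SciPy; the name bc_from_source is hypothetical, the graph is assumed unweighted and directed, and the full algorithm (like the library version) sums these dependencies over all sources, in batches.

import numpy as np
import scipy.sparse as sp

def bc_from_source(A, s):
    # Forward sweep: level-synchronous BFS that counts shortest paths from s.
    n = A.shape[0]
    A = sp.csr_matrix(A, dtype=np.float64)
    AT = A.T.tocsr()
    sigma = np.zeros(n); sigma[s] = 1.0       # number of shortest paths from s
    depth = np.full(n, -1); depth[s] = 0      # BFS level of each vertex
    frontier = np.zeros(n); frontier[s] = 1.0
    level = 0
    while frontier.any():
        level += 1
        nxt = AT @ frontier                   # path counts arriving one level deeper
        nxt[depth >= 0] = 0.0                 # drop vertices already settled
        depth[nxt > 0] = level
        sigma += nxt
        frontier = nxt
    # Backward sweep: accumulate dependencies level by level.
    delta = np.zeros(n)
    for d in range(int(depth.max()), 0, -1):
        w = depth == d
        t = np.zeros(n)
        t[w] = (1.0 + delta[w]) / sigma[w]
        contrib = sigma * (A @ t)             # push back along edges v -> w
        contrib[depth != d - 1] = 0.0
        delta += contrib
    return delta                              # delta[v] = dependency of s on v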
BC performance in distributed memory

RMAT power-law graph, 2^Scale vertices, average degree 8.

[Figure: BC performance, TEPS score in millions (0 to 250) vs. number of cores (25 to 484), for Scale 17, 18, 19, and 20]

• TEPS = Traversed Edges Per Second
• One page of code using the Combinatorial BLAS
KDT: A toolbox for graph analysis and pattern discovery
[G, Reinhardt, Shah]

Layer 1: Graph Theoretic Tools
• Graph operations
• Global structure of graphs
• Graph partitioning and clustering
• Graph generators
• Visualization and graphics
• Scan and combining operations
• Utilities
Star-P architecture

[Diagram: the MATLAB® client, holding ordinary Matlab variables, connects through the client manager to the server manager, package manager, and matrix manager running on processors #0 … #n-1, which hold the distributed matrices; server-side packages include dense/sparse operations, sort, ScaLAPACK, FFTW, an FPGA interface, MPI user code, UPC user code, …]
Landscape connectivity modeling

• Habitat quality, gene flow, corridor identification, conservation planning
• Pumas in southern California: 12 million nodes, < 1 hour
• Targeting larger problems: Yellowstone-to-Yukon corridor

Figures courtesy of Brad McRae
From semirings to computational patterns

Sparse matrix times vector as a semiring operation:
– Given vertex data x_i and edge data a_{i,j}
– For each vertex j of interest, compute
  y_j = a_{i1,j} ⊗ x_{i1} ⊕ a_{i2,j} ⊗ x_{i2} ⊕ ··· ⊕ a_{ik,j} ⊗ x_{ik}
  where i1, …, ik are the vertices with edges into j
– User specifies: the definitions of the operations ⊕ and ⊗
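A minimal Python sketch of this pattern, with the user passing ⊕ and ⊗ as plain functions; semiring_spmv is an illustrative name and the loop nest is written for clarity, not speed.

import scipy.sparse as sp

def semiring_spmv(A, x, oplus, otimes, identity):
    # For each vertex j, combine edge values a[i,j] with vertex values x[i]
    # using otimes, then reduce the results with oplus (starting from the
    # identity of oplus).  A is converted to CSC so that column j lists the
    # edges into vertex j.
    A = sp.csc_matrix(A)
    n = A.shape[1]
    y = [identity] * n
    for j in range(n):
        lo, hi = A.indptr[j], A.indptr[j + 1]
        for i, a_ij in zip(A.indices[lo:hi], A.data[lo:hi]):
            y[j] = oplus(y[j], otimes(a_ij, x[i]))
    return y

# On the (min, +) semiring this is one relaxation step of shortest paths:
# y = semiring_spmv(A, x, oplus=min, otimes=lambda a, b: a + b,
#                   identity=float("inf"))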
From semirings to computational patterns

Sparse matrix times vector as a computational pattern:
– Given vertex data and edge data
– For each vertex of interest, combine data from neighboring vertices and edges
– User specifies: the desired computation on data from neighbors
SpGEMM as a computational pattern

• Explore length-two paths that use specified vertices
• Possibly do some filtering, accumulation, or other computation with vertex and edge attributes
• E.g. “friends of friends” (think Facebook)
• May or may not want to form the product graph explicitly
• Formulation as semiring matrix multiplication is often possible but sometimes clumsy
• Same data flow and communication patterns as in SpGEMM
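A small scipy.sparse sketch of the “friends of friends” case under the ordinary (+, ×) semiring; the function name is made up for this example, and the filtering step is the kind of attribute-dependent computation that the pure semiring formulation handles only clumsily.

import numpy as np
import scipy.sparse as sp

def friends_of_friends(A):
    # A[u, v] nonzero means u and v are friends (assume A is symmetric, 0/1).
    A = sp.csr_matrix(A, dtype=np.float64)
    F = A @ A                        # SpGEMM: F[u, w] = number of length-two paths u -> v -> w
    F = F - F.multiply(A)            # drop pairs that are already friends
    F = sp.csr_matrix(F - sp.diags(F.diagonal()))   # drop u -> v -> u paths
    F.eliminate_zeros()
    return F                         # F[u, w] = number of common friends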
Graph BLAS: A pattern-based library

• User-specified operations and attributes give the performance benefits of algebraic primitives with a more intuitive and flexible interface.
• Common framework integrates algebraic (edge-based), visitor (traversal-based), and map-reduce patterns.
• 2D compressed sparse block structure supports user-defined edge/vertex/attribute types and operations.
• “Hypersparse” kernels tuned to reduce data movement.
• Initial target: manycore and multisocket shared memory.
The Challenge of Architecture and Algorithms
The Architecture & Algorithms Challenge

[Images: two Nvidia 8800 GPUs, > 1 TFLOPS; Oak Ridge / Cray Jaguar, > 1.75 PFLOPS; Intel 80-core chip, > 1 TFLOPS]

• Parallelism is no longer optional…
• … in every part of a computation.
High-performance architecture

• Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.
• Originally, because linear algebra is the middleware of scientific computing.
• Nowadays, largely for bragging rights.

[Figure: dense LU factorization, P A = L × U]
Strongly connected components

[Figure: the graph G(A) and the symmetrically permuted matrix P A P^T in block triangular form]

• Symmetric permutation to block triangular form
• Diagonal blocks are strong Hall (irreducible / strongly connected)
• Sequential: linear time by depth-first search [Tarjan]
• Parallel: divide & conquer; work and span depend on input [Fleischer, Hendrickson, Pinar]
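A minimal NumPy/SciPy sketch of the forward-backward divide & conquer idea, with reachability computed by repeated sparse matrix-vector products; the names reach and scc_fw_bw are made up for this illustration, and a parallel implementation would run the two sweeps and the three recursive subproblems concurrently.

import numpy as np
import scipy.sparse as sp

def reach(M, start, mask):
    # Vertices within `mask` reachable from `start`, where one step is the
    # sparse matrix-vector product frontier -> M @ frontier (Boolean semiring in spirit).
    n = M.shape[0]
    seen = np.zeros(n, dtype=bool)
    seen[start] = True
    frontier = seen.copy()
    while frontier.any():
        nxt = (M @ frontier.astype(np.float64)) > 0
        nxt &= mask & ~seen
        seen |= nxt
        frontier = nxt
    return seen & mask

def scc_fw_bw(A, mask=None, out=None):
    # Pick a pivot, intersect its forward- and backward-reachable sets to get
    # one strongly connected component, then recurse on the three leftover parts.
    A = sp.csr_matrix(A, dtype=np.float64)   # A[v, w] nonzero for an edge v -> w
    n = A.shape[0]
    mask = np.ones(n, dtype=bool) if mask is None else mask
    out = [] if out is None else out
    if not mask.any():
        return out
    pivot = int(np.flatnonzero(mask)[0])
    fwd = reach(A.T.tocsr(), pivot, mask)    # vertices the pivot can reach
    bwd = reach(A, pivot, mask)              # vertices that can reach the pivot
    comp = fwd & bwd
    out.append(np.flatnonzero(comp))
    for part in (fwd & ~comp, bwd & ~comp, mask & ~fwd & ~bwd):
        scc_fw_bw(A, part, out)
    return out

scc_fw_bw(A) returns a list of index arrays, one per strongly connected component.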
The memory wall blues

• Most of memory is hundreds or thousands of cycles away from the processor that wants it.
• You can buy more bandwidth, but you can’t buy less latency. (Speed of light, for one thing.)
• You can hide latency with either locality or parallelism.
• Most interesting graph problems have lousy locality.
• Thus the algorithms need even more parallelism!
Architectural impact on algorithms

Full matrix multiplication: C = A * B

C = 0;
for i = 1 : n
    for j = 1 : n
        for k = 1 : n
            C(i,j) = C(i,j) + A(i,k) * B(k,j);

O(n^3) operations
Architectural impact on algorithms

Naïve 3-loop matrix multiply [Alpern et al., 1992]:

[Figure, diagram from Larry Carter: log cycles/flop vs. log problem size; the measured curve grows as T = N^4.7. Size 2000 took 5 days; size 12000 would take 1095 years.]

The naïve algorithm is O(N^5) time under the UMH model. BLAS-3 DGEMM and recursive blocked algorithms are O(N^3).
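To illustrate the point about blocking, here is a minimal Python sketch of a cache-blocked matrix multiply (blocked_matmul and the block size bs are just for this example); tiling does not change the O(N^3) flop count, it changes which running time you actually observe, by reusing each block while it is resident in fast memory. BLAS-3 DGEMM does this, and much more, in tuned native code.

import numpy as np

def blocked_matmul(A, B, bs=64):
    # C = A * B with the three loops tiled into bs-by-bs blocks, so each block
    # of A, B, and C is reused many times once it has been brought into cache.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C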
The architecture & algorithms challenge

• A big opportunity exists for computer architecture to influence combinatorial algorithms.
• (Maybe even vice versa.)
A novel architectural approach: Cray MTA / XMT

• Hide latency by massive multithreading
• Per-tick context switching
• Uniform (sort of) memory access time
• But the economic case is still not completely clear.
A Few Other Challenges
The Productivity Challenge

Raw performance isn’t always the only criterion. Other factors include:
• Seamless scaling from desktop to HPC
• Interactive response for data exploration and viz
• Rapid prototyping
• Just plain programmability
The Education Challenge

• How do you teach this stuff?
• Where do you go to take courses in
  – Graph algorithms …
  – … on massive data sets …
  – … in the presence of uncertainty …
  – … analyzed on parallel computers …
  – … applied to a domain science?
Final thoughts

• Combinatorial algorithms are pervasive in scientific computing and will become more so.
• Linear algebra and combinatorics can support each other in computation as well as in theory.
• A big opportunity exists for computer architecture to influence combinatorial algorithms.
• This is a great time to be doing research in combinatorial scientific computing!