Parallel Graph Algorithms
Kamesh Madduri
[email protected]

Talk Outline
• Applications
• Parallel algorithm building blocks
  – Kernels
  – Data structures
• Parallel algorithm case studies
  – Connected components
  – BFS/Shortest paths
  – Betweenness centrality
• Performance on current systems
  – Software
  – Architectures
  – Performance trends

Routing in transportation networks
• Road networks, point-to-point shortest paths: from 15 seconds (naïve) to 10 microseconds.
H. Bast et al., "Fast Routing in Road Networks with Transit Nodes", Science, 27 April 2007.

Internet and the WWW
• The world-wide web can be represented as a directed graph
  – Web search and crawl: traversal
  – Link analysis, ranking: PageRank and HITS
  – Document classification and clustering
• Internet topologies (router networks) are naturally modeled as graphs

Scientific Computing
• Reorderings for sparse solvers
  – Fill-reducing orderings: partitioning, traversals, eigenvectors
  – Heavy diagonal to reduce pivoting (matching)
• Data structures for efficient exploitation of sparsity
• Derivative computations for optimization
  – Matroids, graph colorings, spanning trees
• Preconditioning
  – Incomplete factorizations
  – Partitioning for domain decomposition
  – Graph techniques in algebraic multigrid: independent sets, matchings, etc.
  – Support theory: spanning trees and graph embedding techniques
B. Hendrickson, "Graphs and HPC: Lessons for Future Architectures", http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf

Large-scale data analysis
• Graph abstractions are very useful for analyzing complex data sets.
• Sources of data: petascale simulations, experimental devices, the Internet, sensor networks
• Challenges: data size, heterogeneity, uncertainty, data quality
  – Astrophysics: massive datasets, temporal variations
  – Bioinformatics: data quality, heterogeneity
  – Social informatics: new analytics challenges, data uncertainty
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2, 3) www.visualComplexity.com

Data Analysis and Graph Algorithms in Systems Biology
• Study of the interactions between the various components in a biological system
• Graph-theoretic formulations are pervasive:
  – Predicting new interactions: modeling
  – Functional annotation of novel proteins: matching, clustering
  – Identifying metabolic pathways: paths, clustering
  – Identifying new protein complexes: clustering, centrality
Image source: Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003.

Graph-theoretic problems in social networks
• Community identification: clustering
• Targeted advertising: centrality
• Information spreading: modeling
Image source: Nexus (Facebook application)

Network Analysis for Intelligence and Surveillance
• [Krebs '04] Post-9/11 terrorist network analysis from public-domain information
• Plot masterminds correctly identified from interaction patterns: centrality
• A global view of entities is often more insightful
• Detect anomalous activities by exact/approximate graph matching
Image sources: http://www.orgnet.com/hijackers.html; T. Coffman, S. Greenblatt, S. Marcus, "Graph-based technologies for intelligence analysis", CACM 47(3), March 2004, pp. 45-47.
Characterizing Graph-theoretic computations
• Input: a graph abstraction
• Problem: find paths, clusters, partitions, matchings, patterns, orderings, ...
• Graph kernels: traversal, shortest path algorithms, flow algorithms, spanning tree algorithms, topological sort, ...
• Factors that influence the choice of algorithm:
  – graph sparsity (m/n ratio)
  – static/dynamic nature
  – weighted/unweighted, weight distribution
  – vertex degree distribution
  – directed/undirected
  – simple/multi/hyper graph
  – problem size
  – granularity of computation at nodes/edges
  – domain-specific characteristics
• Graph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations.

Talk Outline (repeated from above)

Parallel Computing Models: A Quick PRAM Review
• Objectives
  – Bridge between software and hardware: general-purpose, scalable hardware and transportable software
  – Abstract architecture for algorithm development
• Why is it so important?
  – Uniprocessor: the von Neumann model of computation
  – Parallel processors and multicore: which model?
• Requirements (an inherent tension):
  – Simple enough to make analysis of interesting problems tractable
  – Detailed enough to reveal the important bottlenecks
• Models, e.g.:
  – PRAM: rich collection of parallel graph algorithms
  – BSP: some CGM algorithms (cgmGraph)
  – LogP

PRAM
• An ideal model of a parallel computer for analyzing the efficiency of parallel algorithms.
• A PRAM is composed of:
  – P unmodifiable programs, each composed of optionally labeled instructions
  – a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer
  – P accumulators, one associated with each program
  – a read-only input tape
  – a write-only output tape
• No local memory in each RAM.
• Synchronization, communication, and parallel overhead are zero.

PRAM Data Access Forms
• EREW (Exclusive Read, Exclusive Write)
  – A memory cell can be read or written by at most one processor per cycle
  – Ensures no read or write conflicts
• CREW (Concurrent Read, Exclusive Write)
  – Ensures there are no write conflicts
• CRCW (Concurrent Read, Concurrent Write)
  – Requires some conflict resolution scheme

PRAM Pros and Cons
• Pros
  – Simple and clean semantics
  – The majority of theoretical parallel algorithms are specified with the PRAM model
  – Independent of the communication network topology
• Cons
  – Not realistic: the communication model is too powerful
  – The algorithm designer is misled into using inter-processor communication without hesitation
  – Synchronized processors
  – No local memory
  – Big-O notation is often misleading
Analyzing Parallel Graph Algorithms
• Problem parameters: n (vertices), m (edges), D (graph diameter)
• Worst-case running time: T
• Total number of operations (work): W
• Nick's Class (NC): the complexity class of problems that can be solved in polylogarithmic time using a polynomial number of processors
• P-complete: inherently sequential

The Helman-JaJa model
• An extension of the PRAM model for shared-memory algorithm design and analysis.
• T(n, p) is measured by the triplet (T_M(n, p), T_C(n, p), B(n, p)):
  – T_M(n, p): maximum number of non-contiguous main memory accesses required by any processor
  – T_C(n, p): upper bound on the maximum local computational complexity of any of the processors
  – B(n, p): number of barrier synchronizations

Building blocks of classical PRAM graph algorithms
• Prefix sums
• List ranking
  – Euler tours, pointer jumping, symmetry breaking
• Sorting
• Tree contraction

Prefix Sums
• Input: A, an array of n elements, and an associative binary operation ⊕
• Output: the prefix sums A(1) ⊕ A(2) ⊕ ... ⊕ A(i), for 1 ≤ i ≤ n
• O(n) work, O(log n) time with n processors
[Figure: the balanced binary tree of the work-efficient prefix-sums computation on 8 elements, with up-sweep partial sums B(h, i) and down-sweep prefix values C(h, i).]

Parallel Prefix
• X: an array of n elements stored in arbitrary order.
• For each element i, let X(i).value be its value and X(i).next be the index of its successor.
• For a binary associative operator Θ, compute X(i).prefix such that
  – X(head).prefix = X(head).value, and
  – X(i).prefix = X(i).value Θ X(predecessor).prefix,
  where head is the first element, i is not equal to head, and predecessor is the node preceding i.
• List ranking: the special case of parallel prefix in which all values are initially set to 1 and addition is the associative operator.

List ranking illustration
[Figure: an ordered list, where the X.next values follow array order (2 3 4 5 6 7 8 9), versus a random list with scattered X.next values (4 6 5 7 8 3 2 9).]

List Ranking key idea
1. Chop X randomly into s pieces.
2. Traverse each piece using a serial algorithm.
3. Compute the global rank of each element using the results of the second step.
• Locality (list ordering) determines performance.
• In the Helman-JaJa model, T_M(n, p) = O(n/p).
(A pointer-jumping code sketch of list ranking follows the next slide.)

An example higher-level algorithm
The Tarjan-Vishkin biconnected components algorithm: O(log n) time, O(m+n) work.
1. Compute a spanning tree T for the input graph G.
2. Compute an Eulerian circuit for T.
3. Root the tree at an arbitrary vertex.
4. Compute a preorder numbering of all the vertices.
5. Label edges using the vertex numbering.
6. Find connected components using the Shiloach-Vishkin algorithm.
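To make the list-ranking kernel concrete, the following is a minimal C sketch of list ranking by pointer jumping (Wyllie's PRAM algorithm), written with OpenMP loops. It performs O(n log n) work, unlike the work-optimal splitter-based scheme outlined on the "List Ranking key idea" slide; the function name, the OpenMP pragmas, and the -1 end-of-list sentinel are illustrative assumptions rather than anything taken from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* next[i]: index of the successor of element i, or -1 for the tail.
 * rank[i]: on return, the number of links from i to the tail
 *          (list ranking with all values set to 1 and + as the operator). */
void list_rank_pointer_jumping(int n, const int *next, int *rank)
{
    int *succ  = malloc(n * sizeof(int));
    int *nsucc = malloc(n * sizeof(int));
    int *nrank = malloc(n * sizeof(int));

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        succ[i] = next[i];
        rank[i] = (next[i] == -1) ? 0 : 1;   /* one hop per live link */
    }

    /* ceil(log2 n) rounds; each round doubles the distance spanned by succ[i]. */
    for (int span = 1; span < n; span *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (succ[i] != -1) {
                nrank[i] = rank[i] + rank[succ[i]];
                nsucc[i] = succ[succ[i]];
            } else {
                nrank[i] = rank[i];
                nsucc[i] = -1;
            }
        }
        /* Copy back so that every round reads a consistent snapshot (EREW-style). */
        memcpy(rank, nrank, n * sizeof(int));
        memcpy(succ, nsucc, n * sizeof(int));
    }

    free(succ); free(nsucc); free(nrank);
}
```

The double-buffering with nrank/nsucc mirrors the synchronous PRAM rounds. The Helman-JaJa splitter scheme of the slide replaces the log n global rounds with one serial traversal per sublist plus a small amount of global bookkeeping, which is why its T_M(n, p) is O(n/p).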
Data structures: graph representation
• Dense graphs (m = O(n²)): an adjacency matrix is commonly used.
• Sparse graphs: adjacency lists, similar to the CSR sparse matrix format.
• Dynamic sparse graphs: we need to support edge and vertex membership queries, insertions, and deletions.
  – The representation should be space-efficient, with low synchronization overhead.
• Several different representations are possible:
  – Resizable adjacency arrays
  – Adjacency arrays, sorted by vertex identifiers
  – Adjacency arrays for low-degree vertices, heap-based structures for high-degree vertices (for sparse graphs with skewed degree distributions)

Data structures in (Parallel) Graph Algorithms
• A wide range of ADTs appear in graph algorithms: array, list, queue, stack, set, multiset, tree.
• ADT implementations are typically array-based for performance reasons.
• Key data structure considerations in parallel graph algorithm design:
  – Practical parallel priority queues
  – Space efficiency
  – Parallel set/multiset operations, e.g., union, intersection, etc.

Talk Outline (repeated from above)

Connected Components
• A building block for many graph algorithms
  – Minimum spanning tree, spanning tree, planarity testing, etc.
• Representative of the "graft-and-shortcut" approach
• CRCW PRAM algorithms
  – [Shiloach & Vishkin '82]: O(log n) time, O((m+n) log n) work
  – [Gazit '91]: randomized, optimal, O(log n) time
• CREW algorithms
  – [Han & Wagner '90]: O(log² n) time, O((m + n log n) log n) work

Shiloach-Vishkin algorithm
• Input: n isolated vertices and m PRAM processors.
• Each processor P_i grafts a tree rooted at vertex v_i onto the tree containing one of its neighbors u, under the constraint u < v_i.
• Grafting creates k ≥ 1 connected subgraphs; each subgraph is then shortcut so that the depth of its trees is reduced by at least half.
• Repeat graft and shortcut until no more grafting is possible.
• Runs on an arbitrary CRCW PRAM in O(log n) time with O(m) processors.
• Helman-JaJa model: T_M = (3m/p + 2) log n, B = 4 log n.

SV pseudo-code
• Input: (1) a set of m edges (i, j) given in arbitrary order; (2) array D[1..n] with D[i] = i.
• Output: array D[1..n], with D[i] the component to which vertex i belongs.

  begin
    while true do
      1. for (i, j) ∈ E in parallel do
           if D[i] = D[D[i]] and D[j] < D[i] then D[D[i]] = D[j];
      2. for (i, j) ∈ E in parallel do
           if i belongs to a star and D[j] ≠ D[i] then D[D[i]] = D[j];
      3. if all vertices are in rooted stars then exit;
         for all i in parallel do D[i] = D[D[i]];
  end

SV Illustration
[Figure: a four-vertex example. In the first iteration, edges (1,4) and (2,3) trigger grafts and the resulting trees are shortcut; in the second iteration the remaining trees collapse so that each component is a single rooted star.]
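A minimal shared-memory rendering of the SV pseudo-code above is sketched below, assuming the edge list is given as two arrays src[]/dst[] and using OpenMP for the parallel loops. It follows the common practical graft-and-shortcut variant (hook roots toward smaller labels, then fully compress) and omits the explicit star test of the PRAM formulation, so it should be read as Shiloach-Vishkin in spirit rather than a literal transcription.

```c
#include <stdbool.h>

/* D[v] converges to the smallest vertex label in v's component. */
void connected_components_sv(int n, long m, const int *src, const int *dst, int *D)
{
    #pragma omp parallel for
    for (int v = 0; v < n; v++)
        D[v] = v;                                   /* n singleton trees */

    bool grafted = true;
    while (grafted) {
        grafted = false;

        /* Graft step: hook the root of the higher-labelled tree onto the lower
         * label, mirroring "if D[i] = D[D[i]] and D[j] < D[i] then D[D[i]] = D[j]". */
        #pragma omp parallel for reduction(||:grafted)
        for (long e = 0; e < m; e++) {
            int i = src[e], j = dst[e];
            int di = D[i], dj = D[j];
            if (di < dj && dj == D[dj]) { D[dj] = di; grafted = true; }
            if (dj < di && di == D[di]) { D[di] = dj; grafted = true; }
        }

        /* Shortcut step: pointer-jump every vertex to its current root. */
        #pragma omp parallel for
        for (int v = 0; v < n; v++)
            while (D[v] != D[D[v]])
                D[v] = D[D[v]];
    }
}
```

Concurrent grafts may overwrite one another, but every write strictly decreases a root's label, so the loop still terminates with D[v] equal to the minimum label in v's component; the original PRAM algorithm uses the star test and the unconditional shortcut to bound the number of graft-and-shortcut iterations by O(log n).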
Talk Outline (repeated from above)

Parallel Single-source Shortest Paths (SSSP) algorithms
• No PRAM algorithm is known that runs in sub-linear time and O(m + n log n) work.
• Parallel priority queues: relaxed heaps [DGST88], [BTZ98]
• Ullman-Yannakakis randomized approach [UY90]
• Meyer et al.'s ∆-stepping algorithm [MS03]
• Distributed-memory implementations based on graph partitioning
• Heuristics for load balancing and termination detection
K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, "An Experimental Study of a Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances", Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.

∆-stepping algorithm [MS03]
• A label-correcting algorithm: edges may be relaxed from unsettled vertices as well.
• ∆-stepping is an "approximate bucket implementation of Dijkstra's algorithm".
• ∆: the bucket width.
• Vertices are ordered using buckets, each representing a priority range of size ∆.
• Each bucket may be processed in parallel.
• Edges are classified as "heavy" (weight > ∆) and "light" (weight ≤ ∆).
• Light edges are relaxed in phases, repeated until the current bucket B[i] is empty.
• Heavy edges are then relaxed once; there are no reinsertions into the current bucket in this step.

One parallel phase, while the current bucket is non-empty:
  i)   Inspect light edges.
  ii)  Construct a set of "requests" (R).
  iii) Clear the current bucket.
  iv)  Remember the deleted vertices (S).
  v)   Relax the request pairs in R.
After the bucket empties, relax the heavy request pairs (from S), then go on to the next bucket.

∆-stepping illustration
[Figure sequence: a seven-vertex example with ∆ = 0.1 and edge weights such as 0.01, 0.02, 0.05, 0.07, 0.13, 0.15, 0.18, 0.23, and 0.56. Starting from the initialization d(s) = 0 and s inserted into bucket 0, each step shows the tentative-distance array d, the bucket contents, the request set R, and the deleted-vertex set S, ending with the final distances d = (0, .03, .01, .06, .16, .29, .62) for vertices 0 through 6.]
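To pin down the bucket mechanics, here is a compact sequential C sketch of ∆-stepping. It serializes the phases that the parallel algorithm would execute concurrently (all light-edge requests of a bucket, then all heavy-edge requests), uses lazy deletion for stale bucket entries, and assumes edge weights normalized to [0, 1] (as in the experiments that follow) so the bucket array can be sized up front; the struct and function names are illustrative assumptions.

```c
#include <stdlib.h>

#define INF 1e30

/* CSR-style weighted graph: off[] has n+1 entries, adj[]/w[] have one entry
 * per directed edge.  Weights are assumed to lie in [0, 1]. */
typedef struct { int n; const int *off; const int *adj; const double *w; } wgraph_t;

typedef struct { int *v; int size, cap; } bucket_t;

static void bucket_push(bucket_t *b, int x) {
    if (b->size == b->cap) {
        b->cap = b->cap ? 2 * b->cap : 4;
        b->v = realloc(b->v, b->cap * sizeof(int));
    }
    b->v[b->size++] = x;
}

/* A "request" (v, dist): move v to a lower bucket if dist improves d[v].
 * Stale copies left behind in old buckets are skipped when scanned. */
static void relax(int v, double dist, double *d, bucket_t *B, double delta) {
    if (dist < d[v]) {
        d[v] = dist;
        bucket_push(&B[(int)(dist / delta)], v);
    }
}

void delta_stepping(const wgraph_t *g, int src, double delta, double *d) {
    int nb = (int)(g->n / delta) + 2;              /* distances stay below n */
    bucket_t *B = calloc(nb, sizeof(bucket_t));
    char *in_S = calloc(g->n, 1);                  /* membership flags for S */
    int *S = malloc(g->n * sizeof(int));

    for (int v = 0; v < g->n; v++) d[v] = INF;
    relax(src, 0.0, d, B, delta);

    for (int i = 0; i < nb; i++) {
        int ns = 0;
        while (B[i].size > 0) {                    /* one light-edge "phase" per pass */
            bucket_t cur = B[i];
            B[i].v = NULL; B[i].size = B[i].cap = 0;          /* clear the bucket */
            for (int k = 0; k < cur.size; k++) {
                int u = cur.v[k];
                if ((int)(d[u] / delta) != i) continue;       /* stale entry */
                if (!in_S[u]) { in_S[u] = 1; S[ns++] = u; }   /* remember deletion */
                for (int e = g->off[u]; e < g->off[u + 1]; e++)
                    if (g->w[e] <= delta)          /* light edges may re-enter B[i] */
                        relax(g->adj[e], d[u] + g->w[e], d, B, delta);
            }
            free(cur.v);
        }
        for (int k = 0; k < ns; k++) {             /* heavy edges: once, no reinsertion */
            int u = S[k];
            in_S[u] = 0;
            for (int e = g->off[u]; e < g->off[u + 1]; e++)
                if (g->w[e] > delta)
                    relax(g->adj[e], d[u] + g->w[e], d, B, delta);
        }
    }
    free(B); free(S); free(in_S);
}
```

In the parallel formulation of [MS03] and the ALENEX'07 study above, the loops over the current bucket and over the request set R become parallel loops; ∆ trades off extra work (re-relaxations of light edges) against the number of phases measured in the performance counts that follow.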
No. of phases (machine-independent performance count)
[Chart: number of phases for the graph families Rnd-rnd, Rnd-logU, Scale-free, LGrid-rnd, LGrid-logU, SqGrid, USAd NE, and USAt NE; high-diameter families require far more phases than low-diameter families.]

Average shortest path weight for various graph families
(~2^20 vertices, 2^22 edges, directed graphs, edge weights normalized to [0, 1])
[Chart: average shortest path weight across the same graph families, spanning roughly 0.01 to 100000.]

Last non-empty bucket (machine-independent performance count)
[Chart: index of the last non-empty bucket across the graph families; fewer buckets means more parallelism.]

Number of bucket insertions (machine-independent performance count)
[Chart: number of bucket insertions across the graph families, up to roughly 12 million.]

Talk Outline (repeated from above)

Betweenness Centrality
• Centrality: a quantitative measure of the importance of a vertex or edge in a graph
  – degree, closeness, eigenvalue, betweenness
• Betweenness centrality:
  BC(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st
  (σ_st: number of shortest paths between s and t; σ_st(v): number of those paths passing through v)
• Applied to several real-world networks:
  – Social interactions
  – WWW
  – Epidemiology
  – Systems biology

Algorithms for Computing Betweenness
• All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n³) time), then sum up the fractional dependency values (O(n²) space).
• Brandes' algorithm (2001): augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
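For reference, the quantities used on the following slides can be written out in full. This is the standard Brandes formulation; the recurrence in the second line is exactly what the dependency-accumulation step later implements.

```latex
\[
  BC(v) \;=\; \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}},
  \qquad
  \delta_s(v) \;=\; \sum_{t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}
  \quad\text{(the dependency of $s$ on $v$)},
\]
\[
  \delta_s(v) \;=\; \sum_{w \,:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}}\,\bigl(1 + \delta_s(w)\bigr),
  \qquad
  BC(v) \;=\; \sum_{s \neq v} \delta_s(v),
\]
```

where P_s(w) is the set of predecessors of w on shortest paths from s, and σ_sv is the number of shortest paths from s to v.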
Our New Parallel Algorithms
• Madduri and Bader (2006): parallel algorithms for computing exact and approximate betweenness centrality
  – for low-diameter sparse graphs (diameter D = O(log n), m = O(n log n))
  – Exact algorithm: O(mn) work, O(m+n) space, O(nD) time in the PRAM model, or O(nD + nm/p) time.
• Madduri et al. (2009): a new parallel algorithm with lower synchronization overhead and fewer non-contiguous memory references
  – In practice, 2-3x faster than the previous algorithm
  – Lock-free, so it scales better on large parallel systems

Parallel BC Algorithm
• Consider an undirected, unweighted graph.
• High-level idea: a level-synchronous parallel breadth-first search, augmented to compute centrality scores.
• Two steps:
  – traversal and path counting
  – dependency accumulation: δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) (1 + δ(w))

Parallel BC Algorithm Illustration: data structures
• G (size m+n): read-only graph in adjacency-array representation
• BC (size n): centrality score of each vertex
• S (n): stack of visited vertices
• Visited (n): marks used to check whether a vertex has been visited
• D (n): distance of each vertex from the source s
• Sigma (n): number of shortest paths through each vertex
• Delta (n): partial dependency score for each vertex
• P (m+n): multiset of predecessors of each vertex along shortest paths
• Space requirement: 8(m + 6n) bytes
[Figure: the ten-vertex example graph (vertices 0-9) used in the following traversal illustration.]
(A C sketch of these working arrays follows below.)

Parallel BC Algorithm Illustration: traversal step
1. Traversal step: visit adjacent vertices, update distances and path counts.
[Figure sequence: level-synchronous BFS from the source vertex on the example graph. Frontier by frontier, newly visited vertices are pushed onto the stack S, their distances D and path counts Sigma are set, and their predecessor multisets P are filled in; the adjacencies of all vertices in the current frontier can be visited in parallel. At the end of the traversal we have all reachable vertices, their corresponding predecessor multisets, and their D values.]
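As a concrete counterpart to the data-structure list above, the following C sketch declares a CSR-style graph and allocates the per-source working arrays (the successor-list variant described a few slides later replaces the predecessor multiset P with successor storage). The struct and field names are illustrative assumptions; only the array sizes come from the slide.

```c
#include <stdint.h>
#include <stdlib.h>

/* Read-only CSR-style adjacency-array graph, size O(m + n). */
typedef struct {
    int64_t n, m;          /* number of vertices and stored directed edges */
    int64_t *offsets;      /* n + 1 entries */
    int64_t *adj;          /* m entries */
} graph_t;

/* Working arrays for one source vertex, following the slide's list. */
typedef struct {
    int64_t *S;            /* stack of visited vertices, n entries */
    int64_t *Visited;      /* visited marks, n entries */
    int64_t *D;            /* distance from the source, n entries */
    int64_t *sigma;        /* number of shortest paths through each vertex, n entries */
    double  *delta;        /* partial dependency scores, n entries */
    int64_t *Scount;       /* per-vertex successor counts, n entries */
    int64_t *Succ;         /* successor multiset, m + n entries */
} bc_workspace_t;

bc_workspace_t bc_workspace_alloc(const graph_t *g)
{
    bc_workspace_t w;
    w.S       = malloc(g->n * sizeof(int64_t));
    w.Visited = calloc(g->n,  sizeof(int64_t));
    w.D       = malloc(g->n * sizeof(int64_t));
    w.sigma   = calloc(g->n,  sizeof(int64_t));
    w.delta   = calloc(g->n,  sizeof(double));
    w.Scount  = calloc(g->n,  sizeof(int64_t));
    w.Succ    = malloc((g->m + g->n) * sizeof(int64_t));
    return w;
}
```

With 8-byte entries this is the same order of space as the 8(m + 6n) bytes quoted on the slide; before each new source, D is reset to -1 and sigma, delta, and Scount to 0.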
Step 1 (traversal) pseudo-code

  for all vertices u at level d in parallel do
    for all adjacencies v of u in parallel do
      dv = D[v];
      if (dv < 0)                      // v is visited for the first time
        vis = fetch_and_add(&Visited[v], 1);
        if (vis == 0)                  // v is added to a stack only once
          D[v] = d + 1;
          pS[count++] = v;             // add v to the local thread stack
        fetch_and_add(&sigma[v], sigma[u]);
        fetch_and_add(&Pcount[v], 1);  // add u to the predecessor list of v
      if (dv == d + 1)
        fetch_and_add(&sigma[v], sigma[u]);
        fetch_and_add(&Pcount[v], 1);  // add u to the predecessor list of v

Graph traversal step analysis
• Exploit concurrency in visiting adjacencies, since we assume the graph diameter is small: O(log n).
• Upper bound on the size of each predecessor multiset: the in-degree of the vertex.
• Potential performance bottlenecks: atomic updates to the predecessor multisets, atomic increments of the path counts.
• New algorithm: based on the observation that we do not need to store "predecessor" vertices; instead, we store successor edges along shortest paths.
  – simplifies the accumulation step
  – removes one atomic operation from the traversal step
  – cache-friendly!

Modified Step 1 pseudo-code

  for all vertices u at level d in parallel do
    for all adjacencies v of u in parallel do
      dv = D[v];
      if (dv < 0)                      // v is visited for the first time
        vis = fetch_and_add(&Visited[v], 1);
        if (vis == 0)                  // v is added to a stack only once
          D[v] = d + 1;
          pS[count++] = v;             // add v to the local thread stack
        fetch_and_add(&sigma[v], sigma[u]);
        fetch_and_add(&Scount[u], 1);  // add v to the successor list of u
      if (dv == d + 1)
        fetch_and_add(&sigma[v], sigma[u]);
        fetch_and_add(&Scount[u], 1);  // add v to the successor list of u

Graph traversal step: locality analysis
• The vertices of the current level are in a contiguous block (the stack), and all the adjacencies of a vertex are stored compactly (graph representation), so the two parallel loops stream through contiguous memory.
• The read of D[v] and the fetch_and_add operations on Visited[v] and sigma[v] are non-contiguous memory accesses.
• The fetch_and_add on Scount[u] is a store local to the current frontier vertex u.
• Better cache utilization is likely if D[v], Visited[v], and sigma[v] are stored contiguously.
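As a sketch of how the modified Step 1 pseudo-code maps onto a real shared-memory loop nest, the function below expands one BFS level using OpenMP and the GCC/Clang __sync_fetch_and_add builtin in place of fetch_and_add. It reuses the graph_t and workspace arrays from the earlier sketch and, to stay short, gathers the next frontier into a single shared array with an atomic counter instead of the per-thread stacks (pS) of the slides; the successor lists themselves are reduced to the Scount increments.

```c
/* Expand all vertices at distance d (the current frontier); returns the size of
 * the next frontier.  D[] must be -1 for unvisited vertices and sigma[src] = 1. */
static int64_t expand_level(const graph_t *g, int64_t d,
                            const int64_t *frontier, int64_t frontier_size,
                            int64_t *next_frontier,
                            int64_t *D, int64_t *Visited,
                            int64_t *sigma, int64_t *Scount)
{
    int64_t next_size = 0;

    #pragma omp parallel for schedule(dynamic, 64)
    for (int64_t i = 0; i < frontier_size; i++) {
        int64_t u = frontier[i];
        for (int64_t e = g->offsets[u]; e < g->offsets[u + 1]; e++) {
            int64_t v = g->adj[e];
            int64_t dv = D[v];
            if (dv < 0) {                                   /* v seen for the first time? */
                if (__sync_fetch_and_add(&Visited[v], 1) == 0) {
                    D[v] = d + 1;
                    int64_t pos = __sync_fetch_and_add(&next_size, 1);
                    next_frontier[pos] = v;                 /* v enters the stack only once */
                }
                __sync_fetch_and_add(&sigma[v], sigma[u]);  /* u's paths extend to v */
                __sync_fetch_and_add(&Scount[u], 1);        /* v is a successor of u */
            } else if (dv == d + 1) {                       /* another shortest path to v */
                __sync_fetch_and_add(&sigma[v], sigma[u]);
                __sync_fetch_and_add(&Scount[u], 1);
            }
        }
    }
    return next_size;
}
```

The non-contiguous accesses flagged on the locality slide are exactly the D[v], Visited[v], and sigma[v] touches in the inner loop; Scount[u] is a store local to the frontier vertex, which is why switching from predecessor to successor storage turns one scattered atomic per edge into a local one.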
Parallel BC Algorithm Illustration: accumulation step
2. Accumulation step: pop vertices from the stack and update dependency scores using
   δ(v) = Σ_{w : v ∈ P(w)} (σ(v) / σ(w)) (1 + δ(w)).
   The accumulation can also be done in a level-synchronous manner.
[Figure sequence: the example graph with the stack S, the predecessor multisets P, and the Delta values as dependencies are accumulated from the deepest level back toward the source.]

Step 2 (accumulation) pseudo-code

  for level d = GraphDiameter to 2 do
    for all vertices w at level d in parallel do
      for all v in P[w] do
        acquire_lock(v);
        delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
        release_lock(v);
      BC[w] = delta[w]

Modified Step 2 pseudo-code (with successor lists)

  for level d = GraphDiameter-2 to 1 do
    for all vertices v at level d in parallel do
      for all w in S[v] in parallel do   // reduction on delta
        delta[v] = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];
      BC[v] = delta[v]

  i.e., δ(v) = Σ_{w ∈ S(v)} (σ(v) / σ(w)) (1 + δ(w)).

Accumulation step: locality analysis
• The vertices at each level are in a contiguous block (the stack), and each successor list S[v] is a contiguous block.
• The inner loop over S[v] is a parallel reduction into delta[v].
• The multiply-add delta[v] + (1 + delta[w]) * sigma[v]/sigma[w] is the only floating-point operation in the code.

Digression: centrality analysis applied to protein interaction networks
[Scatter plot: degree vs. betweenness centrality for the human genome core protein interaction network. The highlighted protein, Ensembl ID ENSG00000145332.2 (Kelch-like protein 8), has only 43 interactions yet very high betweenness centrality.]

Talk Outline (repeated from above)

Graph topology matters
• Information networks are very different from the graph topologies and computations that arise in scientific computing.
  – Informatics: dynamic, high-dimensional data
  – Scientific computing: static networks, Euclidean topologies
• Classical graph algorithms typically assume a uniform random graph topology.
Image sources: visualcomplexity.com (1, 2), MapQuest (3)

"Small-world" complex networks
• Information networks are typically dynamic graph abstractions drawn from diverse data sources
• High-dimensional data
• Skewed ("power law") degree distribution of the number of neighbors
• Low graph diameter (Kevin Bacon and six degrees of separation)
• Massive networks (billions of entities)
[Chart: degree distribution of the human protein interaction data set (18669 proteins, 43568 interactions), frequency vs. degree on log-log axes.]
Image source: Seokhee Hong

Implementation Challenges
• Execution time is dominated by latency to main memory
  – Large memory footprint
  – Large number of irregular memory accesses
• Essentially no computation to hide memory costs
• Poor performance on current cache-based architectures (< 5-10% of peak achieved)
• Memory access pattern is dependent on the graph topology
• Variable degrees of concurrency in the parallel algorithm

Desirable HPC Architectural Features
• A global shared memory abstraction
  – no need to partition the graph
  – supports dynamic updates
• A high-bandwidth, low-latency network
• Ability to exploit fine-grained parallelism
• Support for light-weight synchronization
• HPC systems with these characteristics:
  – Massively multithreaded architectures
  – Symmetric multiprocessors

Performance Results: Test Platforms
Cray XMT
• Latency tolerance through massive multithreading
  – hardware support for 128 threads on each processor
  – globally hashed address space
  – no data cache
  – single-cycle context switch
  – multiple outstanding memory requests
• Support for fine-grained, word-level synchronization
• 16 x 500 MHz processors, 128 GB RAM
Sun Fire T5120
• Sun Niagara2: cache-based multicore server with chip multithreading
• 1 socket x 8 cores x 8 threads per core
• 8 KB private L1 cache per core, 4 MB shared L2 cache
• 1167 MHz processor, 32 GB RAM

Betweenness Centrality Performance
• Approximate betweenness computation on a synthetic small-world network of 256 million vertices and 2 billion edges.
• TEPS (traversed edges per second): the performance rate for the centrality computation.
[Chart: betweenness TEPS rate (millions of edges per second) vs. number of processors/cores (1-16) for the Cray XMT, Sun UltraSPARC T2, and a 2.0 GHz quad-core Xeon; annotations note that BC-new is 2.2x-2.4x faster than the previous (BC-old) approach.]

SNAP: Small-world Network Analysis and Partitioning
• A new parallel framework for complex graph analysis
• 10-100x faster than existing approaches
• Can process graphs with billions of vertices and edges
• Open source: snap-graph.sourceforge.net
Image source: visualcomplexity.com

SNAP: Compact graph representations for dynamic network analysis
• New graph representations for dynamically evolving small-world networks in SNAP.
• We support fast, parallel structural updates to low-diameter scale-free and small-world graphs.
[Chart: execution time per update (nanoseconds) and relative speedup vs. number of threads (1-32) for a graph with 25M vertices and 200M edges on a Sun Fire T2000.]
SNAP: Induced Subgraphs Performance
• Induced subgraph extraction is a key kernel for dynamic graph computations.
• We reduce the execution time of linear-work kernels from minutes to seconds for massive small-world networks (billions of vertices and edges).
[Chart: execution time (seconds) and relative speedup vs. number of threads (1-32) for a graph with 500M vertices and 2B edges on an IBM p5 570 SMP.]

Large-scale Graph Traversal
• Multithreaded BFS [BM06]: random graph, 256M vertices, 1B edges; 2.3 sec (40 processors), 73.9 sec (1 processor) on the MTA-2; processes all low-diameter graph families.
• External-memory BFS [ADMO06]: random graph, 256M vertices, 1B edges; 8.9 hrs (3.2 GHz Xeon); state-of-the-art external-memory BFS.
• Multithreaded SSSP [MBBC06]: random graph, 256M vertices, 1B edges; 11.96 sec (40 processors) on the MTA-2; works well for all low-diameter graph families.
• Parallel Dijkstra [EBGL06]: random graph, 240M vertices, 1.2B edges; 180 sec on a 96-processor 2.0 GHz cluster; the best known distributed-memory SSSP implementation for large-scale graphs.

Optimizations for real-world graphs
• Preprocessing kernels (connected components, biconnected components, sparsification) significantly reduce computation time.
  – e.g., a high number of isolated and degree-1 vertices
• Store BFS/shortest-path trees from high-degree vertices and reuse them.
• Typically a 3-5x performance improvement.
• Exploit small-world network properties (low graph diameter):
  – Load balancing in the level-synchronous parallel BFS algorithm
  – SNAP data structures are optimized for unbalanced degree distributions

Faster Community Identification Algorithms in SNAP
• Speedups from algorithm engineering (approximate BC) and from parallelization (Sun Fire T2000) are multiplicative.
• 100-300x overall performance improvement over the Girvan-Newman approach.
[Chart: performance improvement over the Girvan-Newman approach, split into algorithm-engineering and parallelization contributions, for the PPI, Citations, DBLP, NDwww, and Actor networks (real-world networks with millions of vertices and edges), on a Sun Fire T2000.]

Large-scale graph analysis: current research
• A synergistic combination of novel graph algorithms, high-performance computing, and algorithm engineering.
[Diagram: graph problems on dynamic network abstractions at the center, surrounded by novel algorithmic approaches (classical graph algorithms, data stream algorithms, spectral techniques, dynamic graph algorithms), complex network analysis, empirical studies, and realistic modeling, and enabling technologies (many-core processors, affordable exascale data storage, stream computing).]

Review of lecture
• Applications: Internet and WWW, scientific computing, data analysis, surveillance
• Parallel algorithm building blocks
  – Kernels: PRAM algorithms, prefix sums, list ranking
  – Data structures: graph representations, priority queues
• Parallel algorithm case studies
  – Connected components: graft and shortcut
  – BFS: level-synchronous approach
  – Shortest paths: parallel priority queues
  – Betweenness centrality: parallel algorithm with pseudo-code
• Performance on current systems
  – Software: SNAP, the Boost Graph Library, igraph, NetworkX, Network Workbench
  – Architectures: Cray XMT, cache-based multicore, SMPs
  – Performance trends

Thank you!
• Questions?