
Distributed Graph Pattern Matching
Shuai Ma, Yang Cao, Jinpeng Huai, Tianyu Wo
Graphs are everywhere, and quite a few are huge graphs!
File systems
Databases
World Wide Web
Social Networks
Graph searching is a key to social search engines!
2
Graph Pattern Matching
• Given two graphs G1 (pattern graph) and G2 (data graph),
– decide whether G1 “matches” G2 (Boolean queries);
– identify “subgraphs” of G2 that match G1
• Applications
– Web mirror detection / Web site classification
– Complex object identification
– Software plagiarism detection
– Social network/biology analyses
– …
• Matching Semantics
– Traditional: Subgraph Isomorphism
– Emerging applications: Graph Simulation and its extensions, etc.
A variety of emerging real-life applications!
3
Distributed Graph Pattern Matching
• Real-life graphs are typically way too large:
– Yahoo! web graph: 14 billion nodes
– Facebook: over 0.8 billion users
It is NOT practical to handle large graphs on single machines
• Real-life graphs are naturally distributed:
– Google, Yahoo and Facebook have large-scale data centers
Distributed graph processing is inevitable
It is natural to study “distributed graph pattern matching”!
4
Distributed Graph Pattern Matching
• Given pattern graph Q(Vq, Eq) and fragmented data graph
F = (F1, … , Fk) of G(V, E) distributed over k sites,
• the distributed graph pattern matching problem is to find the
maximum match in G for Q, via graph simulation.
There exists a unique maximum match for graph simulation!
5
Graph Simulation
• Given pattern graph Q(Vq, Eq) and data graph G(V, E), a
binary relation R ⊆ Vq × V is said to be a match if
– (1) for each (u, v) ∈ R, u and v have the same label; and
– (2) for each (u, v) ∈ R and each edge (u, u′) ∈ Eq, there exists an edge (v, v′) in E
such that (u′, v′) ∈ R.
• Graph G matches pattern Q via graph simulation, if there
exists a total match relation M
– for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.
– Intuitively, simulation preserves the labels and the child relationship
of a graph pattern in its match.
– Simulation was initially proposed for the analyses of programs; and
simulation and its extensions were recently introduced for social
networks.
Subgraph isomorphism (NP-complete) vs. graph simulation (O(n²))!
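As a concrete illustration (not the algorithm used in this work), the following minimal Python sketch computes the maximum match relation by the standard naive fixpoint; the dict-based graph encoding and the name maximum_simulation are illustrative assumptions.

```python
# Minimal sketch of the maximum graph simulation relation, computed as a
# fixpoint. Graphs are dicts: label[v] gives a node's label, adj[v] the list
# of its children. This naive version is polynomial but not optimal; the
# O(n^2) bound on the slide refers to optimized algorithms such as HHK.

def maximum_simulation(q_label, q_adj, g_label, g_adj):
    # Start from all label-compatible pairs (condition 1).
    sim = {u: {v for v in g_label if g_label[v] == q_label[u]}
           for u in q_label}
    changed = True
    while changed:
        changed = False
        for u in q_label:
            for u_child in q_adj.get(u, []):
                # Keep v only if some child v' of v can match u' (condition 2).
                keep = {v for v in sim[u]
                        if any(vc in sim[u_child] for vc in g_adj.get(v, []))}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    return sim  # sim[u] is empty for some u iff G does not match Q


if __name__ == "__main__":
    # Tiny example: pattern A -> B matched against a 3-node data graph.
    q_label = {"u1": "A", "u2": "B"}
    q_adj = {"u1": ["u2"]}
    g_label = {"v1": "A", "v2": "B", "v3": "A"}
    g_adj = {"v1": ["v2"], "v3": []}
    print(maximum_simulation(q_label, q_adj, g_label, g_adj))
    # {'u1': {'v1'}, 'u2': {'v2'}}  -- v3 has no B-child, so it is removed
```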
6
Graph Simulation
Set up a team to develop a new software product
Graph simulation returns F3, F4 and F5;
Subgraph isomorphism returns empty!
Subgraph Isomorphism is too strict for emerging applications!
7
Properties of Graph Simulation
Impacts of connected components (CCs)
• Let pattern Q = {Q1, . . . , Qh} (h CCs). For any data graph G,
– if Mi is the maximum match in G for Qi,
– then M1 ∪ … ∪ Mh is the maximum match in G for Q.
• Let data graph G = {G1, . . . , Gh} (h CCs). For any pattern graph Q,
– if Mi is the maximum match in Gi for Q,
– then M1 ∪ … ∪ Mh is the maximum match in G for Q.
Even if data graph G is connected, R(G), obtained by removing useless nodes and edges from G, might be highly disconnected.
• Let R ⊆ Vq × V be any binary relation on pattern graph Q(Vq, Eq) and data graph G(V, E) that contains the maximum match M in G for Q, and let R(G) consist of h CCs R(G)1, . . . , R(G)h.
– If Mi is the maximum match in R(G)i for Q,
– then M1 ∪ … ∪ Mh is exactly the maximum match in G for Q.
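A minimal sketch of how the CC property can be exploited (names and graph encoding are illustrative, not from the paper): decompose the pattern into weakly connected components and union the per-component maximum matches, computed by any matcher such as the maximum_simulation sketch above.

```python
# Split a pattern into weakly connected components, match each one
# independently, and union the per-component maximum matches; by the
# property above the union is the maximum match of the whole pattern.

def weakly_connected_components(adj):
    """CCs of a directed graph {node: [children]}, ignoring edge direction."""
    undirected = {v: set() for v in adj}
    for v, children in adj.items():
        for w in children:
            undirected.setdefault(v, set()).add(w)
            undirected.setdefault(w, set()).add(v)
    seen, components = set(), []
    for start in undirected:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(undirected[v] - comp)
        seen |= comp
        components.append(comp)
    return components

def match_by_components(q_adj, match_one_component):
    """Union of the maximum matches of the pattern's CCs.

    match_one_component is any per-CC matcher, e.g. a closure over the data
    graph that calls the maximum_simulation sketch shown earlier."""
    match = {}
    for comp in weakly_connected_components(q_adj):
        sub_adj = {u: list(q_adj.get(u, [])) for u in comp}
        match.update(match_one_component(sub_adj))
    return match
```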
8
Properties of Graph Simulation
What can be computed locally?
• The matched subgraph of Q1 and G1 is Gs = F3 ∪ F4 ∪ F5;
• Removing any node or edge from Gs makes Q1 NOT match Gs.
Graph simulation has poor data locality
9
Properties of Graph Simulation
We turn to the data locality of single nodes
• Checking whether data node v in G matches pattern node u in Q can
be determined locally iff subgraph desc(Q, u) is a DAG.
desc(Q1, SA) is the subgraph in
Q1 with nodes SA, SD and ST
What have we learned from the static analysis?
• Treat each connected component in Q and G separately;
• Use data locality to check whether the match status of a node in G can be
determined locally.
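A small sketch of the locality test (graph encoding and names are illustrative): desc(Q, u) is collected by reachability from u, and a DFS cycle check decides whether it is a DAG.

```python
# Locality test from the slide: whether data node v's match status can be
# decided locally hinges on desc(Q, u), the subgraph of Q reachable from
# pattern node u, being a DAG.

def desc_nodes(q_adj, u):
    """Nodes of Q reachable from u (including u itself)."""
    seen, stack = {u}, [u]
    while stack:
        for w in q_adj.get(stack.pop(), []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def desc_is_dag(q_adj, u):
    """True iff desc(Q, u) contains no cycle."""
    nodes = desc_nodes(q_adj, u)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GRAY
        for w in q_adj.get(n, []):
            if color[w] == GRAY:            # back edge => cycle
                return False
            if color[w] == WHITE and not dfs(w):
                return False
        color[n] = BLACK
        return True

    return all(dfs(n) for n in nodes if color[n] == WHITE)

if __name__ == "__main__":
    # Pattern with a cycle between B and C: desc from A is not a DAG,
    # so a match of an A-labelled data node cannot be decided purely locally.
    q_adj = {"A": ["B"], "B": ["C"], "C": ["B"]}
    print(desc_is_dag(q_adj, "A"))   # False
```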
10
Complexity Analysis of Distributed Algorithms
Model of Computation:
• A cluster of identical machines (with one acting as the coordinator);
• Each machine can directly send an arbitrary number of messages to another one;
• All machines cooperate with one another via local computations and message-passing.
Complexity measures:
1. Visit times: the maximum number of times any single machine is visited (interactions)
2. Makespan: the evaluation completion time (efficiency)
3. Data shipment: the total size of the messages shipped among distinct machines (network bandwidth consumption)
11
Complexity Analysis of Distributed Algorithms
Specifications for the distributed algorithms:
• For each machine Si (1 ≤ i ≤ k),
– Local information: (1) pattern graph Q; (2) subgraph Gs,i of G; and (3) a marked binary relation Ri ⊆ Vq × V, where
 each match (u, v) ∈ Ri is marked as true, false or unknown; and
 Ri can be updated by either messages or local computations.
– Messages: only local information is allowed to be exchanged.
– Local computations: update Ri by utilizing the semantics of graph simulation;
 local algorithms execute only local computations, without message-passing during the computation, and
 run in time polynomial in |Q| and |Gs,i|.
Complexity bounds:
1. The optimal data shipment is |G| - 1, and it is tight.
2. The optimal visit times are 1, and it is tight.
3. The minimum makespan problem is NP-complete.
Remarks:
1. Data shipment, visit times and makespan are in conflict with one another.
2. A well-balanced strategy is needed between makespan and the other two measures.
12
Distributed Evaluation of Graph Simulation
• Stage 1: Coordinator SQ broadcasts Q to all k sites;
• Stage 2: All sites, in parallel, partially evaluate Q on their local fragments (partial matches);
• Stage 3: Ship those CCs spanning different machines to single machines, while minimizing data shipment and makespan;
• Stage 4: Compute, in parallel, the maximum matches in those CCs that originally spanned multiple machines;
• Stage 5: Collect and assemble partial matches in the coordinator.
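The sketch below outlines only this five-stage control flow; the per-stage logic is passed in as callables, and all names are illustrative rather than the paper's.

```python
# Control-flow outline only: local partial evaluation, scheduling, per-CC
# resolution, and assembly are supplied as callables, since those are the
# paper's contributions; every name here is an illustrative assumption.

def distributed_simulation(Q, fragments, partial_eval, schedule, eval_cc, assemble):
    k = len(fragments)
    # Stage 1: the coordinator broadcasts Q to all k sites
    # (simulated here by handing Q to each fragment below).
    # Stage 2: all sites evaluate Q on their local fragments in parallel,
    # yielding local partial matches and the CCs that cross fragment borders.
    results = [partial_eval(Q, frag) for frag in fragments]
    local_matches = [r["local_match"] for r in results]
    cross_ccs = [cc for r in results for cc in r["cross_ccs"]]
    # Stage 3: assign each cross-fragment CC to a single machine,
    # balancing data shipment against makespan (see the scheduling slide).
    assignment = schedule(cross_ccs, k)
    # Stage 4: the chosen machines resolve their assigned CCs in parallel.
    resolved = [eval_cc(Q, cc) for ccs in assignment.values() for cc in ccs]
    # Stage 5: the coordinator collects and assembles all partial matches.
    return assemble(local_matches, resolved)
```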
Performance guarantees:
1. The total computational complexity is the same as that of the best-known centralized algorithm, while invoking only 4 rounds of message-passing and local evaluation;
2. Total data shipment is bounded by |G| + 4|B| + |Q||G| + (k - 1)|Q|;
3. Each machine except coordinator SQ is visited g + 2 times (g is the maximum number of machines at which a CC resides in Stage 2), and SQ is visited 2(k - 1) times.
Sacrifice data shipment and visit times for makespan!
13
Scheduling Data Shipment - Stage 3
The Scheduling Problem:
Given h connected components, C1, … , Ch, and an integer k, find an assignment of
the connected components to k identical machines, so that both the makespan and
the total data shipment are minimized.
Approximation Hardness (data shipment, makespan):
The scheduling problem is not approximable within (ε, max(k − 1, 2)) for any ε > 1.
Performance guarantees of algorithm dSchedule:
Algorithm dSchedule produces an assignment of the scheduling problem such that
the makespan is within a factor (2 − 1/k) of the optimal one.
Remarks:
1. A heuristic is used to minimize the data shipment;
2. A greedy approach is adopted to guarantee the performance on the makespan;
3. The algorithm runs in O(kh) time and is very efficient; hence, it does not become a bottleneck.
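For intuition, here is a standard greedy list-scheduling sketch (not the paper's dSchedule): assigning each CC, largest first, to the currently least-loaded machine stays within the (2 − 1/k) makespan factor quoted above; CC sizes stand in for processing times, and all names are illustrative.

```python
# Greedy list scheduling: repeatedly give the next connected component to
# the currently least-loaded machine, using the CC's size as its time.
import heapq

def greedy_schedule(cc_sizes, k):
    """cc_sizes: {cc_id: size}; returns ({machine: [cc_id, ...]}, makespan)."""
    # (current load, machine id) min-heap over the k machines.
    heap = [(0, m) for m in range(k)]
    heapq.heapify(heap)
    assignment = {m: [] for m in range(k)}
    # Sorting by decreasing size (LPT) only improves the guarantee; plain
    # arrival order already stays within a factor 2 - 1/k of optimal.
    for cc_id, size in sorted(cc_sizes.items(), key=lambda x: -x[1]):
        load, m = heapq.heappop(heap)
        assignment[m].append(cc_id)
        heapq.heappush(heap, (load + size, m))
    return assignment, max(load for load, _ in heap)

if __name__ == "__main__":
    sizes = {"C1": 7, "C2": 5, "C3": 4, "C4": 3, "C5": 2}
    print(greedy_schedule(sizes, k=2))
    # ({0: ['C1', 'C4'], 1: ['C2', 'C3', 'C5']}, 11)
```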
14
Optimization Techniques
Using data locality
Determine whether (u, v) belongs to the maximum match M in G for Q:
Case 1: when there are no boundary nodes in desc(G, v) of fragment Gj;
Case 2: when there are boundary nodes in fragment Gj, but subgraph desc(Q, u) of Q is a DAG.
(SA, SA2): Case 1
(BA, BA2): Case 2
Minimization
Minimizing pattern graphs (Q ≡ Qm)
Given pattern graph Q, we compute a minimized equivalent pattern graph Qm such
that for any data graph G, G matches Q iff G matches Qm, via graph simulation.
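One equivalence-preserving step consistent with this minimization (a sketch only, not necessarily the algorithm used in the paper) is to merge pattern nodes that simulate each other; the quotient pattern matches exactly the same data graphs via graph simulation. Names and encoding below are illustrative.

```python
# Merge simulation-equivalent pattern nodes: for any data graph G,
# G matches the quotient pattern iff it matches Q.
# label[u] gives a pattern node's label; adj[u] its list of children.

def self_simulation(label, adj):
    """sim[u] = nodes of Q that simulate u (same label, matching children)."""
    sim = {u: {w for w in label if label[w] == label[u]} for u in label}
    changed = True
    while changed:
        changed = False
        for u in label:
            keep = {w for w in sim[u]
                    if all(any(wc in sim[uc] for wc in adj.get(w, []))
                           for uc in adj.get(u, []))}
            if keep != sim[u]:
                sim[u], changed = keep, True
    return sim

def merge_equivalent_nodes(label, adj):
    """Quotient Q by simulation equivalence (u ~ w iff each simulates the other)."""
    sim = self_simulation(label, adj)
    rep = {u: min(w for w in label if w in sim[u] and u in sim[w])
           for u in label}                      # canonical representative per class
    new_label = {r: label[r] for r in sorted(set(rep.values()))}
    new_adj = {r: set() for r in new_label}
    for u, children in adj.items():
        for c in children:
            new_adj[rep[u]].add(rep[c])
    return new_label, {r: sorted(cs) for r, cs in new_adj.items()}

if __name__ == "__main__":
    # Two interchangeable B-children collapse into one.
    label = {"u1": "A", "u2": "B", "u3": "B"}
    adj = {"u1": ["u2", "u3"]}
    print(merge_equivalent_nodes(label, adj))
    # ({'u1': 'A', 'u2': 'B'}, {'u1': ['u2'], 'u2': []})
```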
15
Experimental Study
Real life datasets:
Google Web graph: 875,713 nodes and 5,105,039 edges
Amazon product co-buy network: 548,552 nodes and 1,788,725 edges
Synthetic graph generator: (10^8 nodes and 3,981,071,706 edges)
Three parameters:
1. The number n of nodes;
2. The number n^α of edges; and
3. The number l of node labels
Algorithms:
Algorithm disHHK and its optimized version disHHK+
Optimal algorithms naiveMatchds(data shipment) and naiveMatchvt (visit times)
Machines:
The experiments were run on a cluster of 16 machines, all with 2 Intel Xeon E5620
CPUs and 64GB memory
16
Experimental Study
1. All algorithms scale well except naiveMatchds and naiveMatchvt
2. disHHK+ consistently reduces the running time of disHHK by about 1/5 to 1/4
17
Experimental Study
1. All algorithms ship about 1/10000 of the data graphs
2. disHHK+ and disHHK even ship less data than naiveMatchds when
data graphs are large and sparse
18
Experimental Study
disHHK+ and disHHK incur 30% to 53% more visit times than naiveMatchds,
as expected
19
Conclusion
We have formulated and investigated the distributed graph pattern
matching problem, via graph simulation.
We have given a static analysis of graph simulation
– Utility of connected components
– Study of data locality
We have studied the complexity of a large class of distributed algorithms
for graph simulation.
– A message-passing computation model
– Makespan, data shipment, and visit times (in conflict with one another)
We have proposed a distributed algorithm for graph simulation
– The scheduling problem
– Optimization techniques
– Experimental verification
A first step towards the big picture of distributed graph pattern matching
20