
Distributed Graph Pattern Matching
Shuai Ma, Yang Cao, Jinpeng Huai, Tianyu Wo
Graphs are everywhere, and quite a few are huge graphs!
File systems
World Wide Web
Social Networks
Graph search is key to social search engines!
Graph Pattern Matching
• Given two graphs G1 (pattern graph) and G2 (data graph),
– decide whether G1 “matches” G2 (Boolean queries);
– identify “subgraphs” of G2 that match G1
• Applications
Web mirror detection/ Web site classification
Complex object identification
Software plagiarism detection
Social network/biology analyses
• Matching Semantics
– Traditional: Subgraph Isomorphism
– Emerging applications: Graph Simulation and its extensions, etc.
A variety of emerging real-life applications!
Distributed Graph Pattern Matching
• Real-life graphs are typically way too large:
– Yahoo! web graph: 14 billion nodes
– Facebook: over 0.8 billion users
It is NOT practical to handle large graphs on single machines
• Real-life graphs are naturally distributed:
– Google, Yahoo and Facebook have large-scale data centers
Distributed graph processing is inevitable
It is natural to study “distributed graph pattern matching”!
Distributed Graph Pattern Matching
• Given pattern graph Q(Vq, Eq) and fragmented data graph
F = (F1, … , Fk) of G(V, E) distributed over k sites,
• the distributed graph pattern matching problem is to find the
maximum match in G for Q, via graph simulation.
There exists a unique maximum match for graph simulation!
Graph Simulation
• Given pattern graph Q(Vq, Eq) and data graph G(V, E), a
binary relation R ⊆ Vq × V is said to be a match if
– (1) for each (u, v) ∈ R, u and v have the same label; and
– (2) for each (u, v) ∈ R and each edge (u, u′) ∈ Eq, there exists an edge
(v, v′) ∈ E such that (u′, v′) ∈ R.
• Graph G matches pattern Q via graph simulation, if there
exists a total match relation M
– for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.
– Intuitively, simulation preserves the labels and the child relationship
of a graph pattern in its match.
– Simulation was initially proposed for the analyses of programs; and
simulation and its extensions were recently introduced for social network analyses.
Subgraph isomorphism (NP-complete) vs. graph simulation (O(n²))!
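To make the semantics concrete, here is a minimal Python sketch of graph simulation (not the authors' implementation). It assumes graphs are given as adjacency-list dicts with a separate node-to-label dict, and it uses a naive fixpoint refinement rather than the optimized quadratic algorithm the slides refer to; all identifiers are illustrative.

```python
def match_simulation(q_adj, q_label, g_adj, g_label):
    """Maximum match in G for Q via graph simulation, as a dict mapping
    each pattern node u to the set of data nodes that match u."""
    # Condition (1): start from the label-compatible candidates.
    sim = {u: {v for v in g_adj if g_label[v] == q_label[u]} for u in q_adj}
    # Condition (2): refine to a fixpoint -- v may match u only if, for every
    # pattern edge (u, u2), some data edge (v, v2) has v2 still matching u2.
    changed = True
    while changed:
        changed = False
        for u, u_children in q_adj.items():
            for v in list(sim[u]):
                if any(not (set(g_adj[v]) & sim[u2]) for u2 in u_children):
                    sim[u].discard(v)
                    changed = True
    # Q matches G iff the relation is total on the pattern nodes.
    return sim if all(sim[u] for u in q_adj) else {}

if __name__ == "__main__":
    # Toy usage: pattern A -> B against the data cycle A1 -> B1 -> A1.
    Q, QL = {"a": ["b"], "b": []}, {"a": "A", "b": "B"}
    G, GL = {"a1": ["b1"], "b1": ["a1"]}, {"a1": "A", "b1": "B"}
    print(match_simulation(Q, QL, G, GL))  # {'a': {'a1'}, 'b': {'b1'}}
```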
Graph Simulation
Set up a team to develop a new software product
Graph simulation returns F3, F4 and F5;
Subgraph isomorphism returns empty!
Subgraph Isomorphism is too strict for emerging applications!
Properties of Graph Simulation
Impacts of connected components (CCs)
• Let pattern Q = {Q1, . . . , Qh} (h CCs). For any data graph G,
– if Mi is the maximum match in G for Qi,
– then M1 ∪ … ∪ Mh is the maximum match in G for Q.
• Let data graph G = {G1, . . . , Gh} (h CCs). For any pattern graph Q,
– if Mi is the maximum match in Gi for Q,
– then M1 ∪ … ∪ Mh is the maximum match in G for Q.
Even if data graph G is connected, R(G), obtained by removing useless nodes
and edges from G, might be highly disconnected.
• For any binary relation R ⊆ Vq × V on pattern graph Q(Vq, Eq) and data
graph G(V, E) that contains the maximum match M in G for Q:
– if Mi is the maximum match in R(G)i for Q,
– then M1 ∪ … ∪ Mh is exactly the maximum match in G for Q,
where R(G) consists of h CCs R(G)1, . . . , R(G)h.
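These CC properties suggest a simple divide-and-union evaluation: compute the maximum match per (weakly) connected component of G and union the results. A minimal sketch, assuming the match_simulation function from the previous block is saved as graph_simulation.py:

```python
from collections import deque
from graph_simulation import match_simulation  # earlier sketch, assumed saved

def weakly_connected_components(adj):
    """Weakly connected components of a directed adjacency-dict graph."""
    undirected = {v: set() for v in adj}
    for v, children in adj.items():
        for w in children:
            undirected[v].add(w)
            undirected[w].add(v)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.add(v)
            for w in undirected[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps

def match_by_components(q_adj, q_label, g_adj, g_label):
    """Union of the per-CC maximum matches = maximum match in G for Q."""
    result = {u: set() for u in q_adj}
    for comp in weakly_connected_components(g_adj):
        sub_adj = {v: [w for w in g_adj[v] if w in comp] for v in comp}
        sub_label = {v: g_label[v] for v in comp}
        for u, vs in match_simulation(q_adj, q_label, sub_adj, sub_label).items():
            result[u] |= vs
    return result
```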
Properties of Graph Simulation
What can be computed locally?
• The matched subgraph of Q1 and G1 is Gs = F3 ∪ F4 ∪ F5;
• Removing any node or edge from Gs makes Q1 NOT match Gs.
Graph simulation has poor data locality
Properties of Graph Simulation
We turn to the data locality of single nodes
• Whether data node v in G matches pattern node u in Q can be determined
locally iff subgraph desc(Q, u) is a DAG.
desc(Q1, SA) is the subgraph in
Q1 with nodes SA, SD and ST
What have we learned from the static analysis?
• Treat each connected component in Q and G separately;
• Use data locality to check whether a match for a node in G can be
determined locally.
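The locality test above reduces to a cycle check: desc(Q, u) is the subgraph of Q induced by u and all its descendants, and it is a DAG iff a DFS from u finds no back edge. A small illustrative sketch, using the same adjacency-dict representation as before:

```python
def desc_is_dag(q_adj, u):
    """True iff desc(Q, u) -- u plus all its descendants in Q -- is a DAG.
    Three-color DFS from u; revisiting a GRAY node closes a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def dfs(v):
        color[v] = GRAY
        for w in q_adj[v]:
            if color.get(w, WHITE) == GRAY:    # back edge => cycle
                return False
            if color.get(w, WHITE) == WHITE and not dfs(w):
                return False
        color[v] = BLACK
        return True

    return dfs(u)

# A hypothetical desc(Q1, SA) over the slide's nodes SA, SD and ST:
Q1 = {"SA": ["SD", "ST"], "SD": ["ST"], "ST": []}
print(desc_is_dag(Q1, "SA"))  # True: no cycle among SA, SD, ST
```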
Complexity Analysis of Distributed Algorithms
Model of Computation:
• A cluster of identical machines (with one acted as coordinator);
• Each machine can directly send arbitrary number of messages to
another one;
• All machines cooperate via local computations and message passing.
Complexity measures:
1. Visit times: the maximum number of times a machine is visited (interactions)
2. Makespan: the evaluation completion time (efficiency)
3. Data shipment: the size of the total messages shipped among distinct
machines (network band consumption)
Complexity Analysis of Distributed Algorithms
Specifications for the distributed algorithms:
• For each machine Si (1 ≤ i ≤ k),
– Local information: (1) pattern graph Q; (2) subgraph Gs,i of G; and (3) a marked
binary relation Ri ⊆ Vq × V, where
 each match (u, v) ∈ Ri is marked as true, false or unknown; and
 Ri can be updated by either messages or local computations.
– Message: only local information is allowed to be exchanged
– Local computations: update Ri by utilizing the semantics of graph simulation.
 local algorithms execute only local computations, without message passing
during the computation, and
 run in time polynomial in |Q| and |Gs,i|.
Complexity bounds:
1. The optimal data shipment is |G| - 1, and it is tight.
2. The optimal visit times are 1, and it is tight.
3. The minimum makespan problem is NP-complete.
1. Data shipment, visit times and makespan are in conflict with one another.
2. A well-balanced strategy between makespan and the other two measures is needed.
Distributed Evaluation of Graph Simulation
• Stage 1: Coordinator SQ broadcasts Q to all k sites;
• Stage 2: All sites, in parallel, partially evaluate Q on local fragments –
partial match;
• Stage 3: Ship those CCs that span different machines to single
machines, while minimizing data shipment and makespan;
• Stage 4: Compute, in parallel, the maximum matches in those CCs that
originally spanned multiple machines;
• Stage 5: Collect and assemble partial matches in the coordinator.
Performance guarantees:
1. The total computational complexity is the same as the best-known centralized
algorithm, while invoking only 4 rounds of message passing and local evaluation;
2. Total data shipment is bounded by |G| + 4|B| + |Q||G| + (k − 1)|Q|;
3. Each machine except coordinator SQ is visited g + 2 times, where g is the
maximum number of machines at which a CC resides in Stage 2; SQ is visited
2(k − 1) times.
Sacrifice data shipment and visit times for makespan!
Scheduling Data Shipment - Stage 3
The Scheduling Problem:
Given h connected components C1, … , Ch and an integer k, find an assignment of
the connected components to k identical machines such that both the makespan and
the total data shipment are minimized.
Approximation Hardness (data shipment, makespan):
The scheduling problem is not approximable within (ε, max(k − 1, 2)) for any ε > 1.
Performance guarantees of algorithm dSchedule:
Algorithm dSchedule produces an assignment of the scheduling problem such that
the makespan is within a factor (2 − 1/k) of the optimal one.
1. A heuristic is used to minimize the data shipment;
2. A greedy approach is adopted to guarantee the makespan bound;
3. The algorithm runs in O(kh) time and is very efficient, so it does not become
a bottleneck.
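For the makespan side, greedy assignment of each CC to the currently least-loaded machine is the classical list-scheduling strategy, whose makespan is within a factor 2 − 1/k of optimal (processing CCs largest-first only improves this). The sketch below is illustrative only: it models each CC by its size and omits the data-shipment heuristic of dSchedule.

```python
import heapq

def greedy_schedule(cc_sizes, k):
    """Assign CCs (name -> size) to k machines, largest first, each to the
    currently least-loaded machine; makespan <= (2 - 1/k) x optimal."""
    loads = [(0, m) for m in range(k)]          # (current load, machine id)
    heapq.heapify(loads)
    assignment = {}
    for cc, size in sorted(cc_sizes.items(), key=lambda x: -x[1]):
        load, m = heapq.heappop(loads)          # least-loaded machine
        assignment[cc] = m
        heapq.heappush(loads, (load + size, m))
    return assignment

# Example: five cross-machine CCs scheduled onto three machines.
print(greedy_schedule({"C1": 7, "C2": 5, "C3": 4, "C4": 3, "C5": 2}, k=3))
```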
Optimization Techniques
Using data locality
Determine whether (u, v) belongs to the maximum match M in G for Q:
Case 1: when there are no boundary nodes in desc(G, v) within fragment Gj;
Case 2: when there are boundary nodes in fragment Gj, but subgraph
desc(Q, u) of Q is a DAG
(SA, SA2): Case 1
(BA, BA2): Case 2
Minimizing pattern graphs (Q ≡ Qm)
Given pattern graph Q, we compute a minimized equivalent pattern graph Qm such
that for any data graph G, G matches Q iff G matches Qm, via graph simulation.
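One core step of such minimization can be sketched by quotienting: pattern nodes that simulate each other can be merged, and the quotient pattern is matched by the same data graphs via simulation. The sketch below is illustrative, reuses the match_simulation function from the earlier block (assumed saved as graph_simulation.py), and does not claim to reproduce the paper's full Qm construction.

```python
from graph_simulation import match_simulation  # earlier sketch, assumed saved

def quotient_by_sim_equivalence(q_adj, q_label):
    """Merge pattern nodes u, v that simulate each other."""
    sim = match_simulation(q_adj, q_label, q_adj, q_label)  # Q against itself
    rep = {}                                    # node -> class representative
    for u in q_adj:
        for v in list(rep):
            if u in sim[v] and v in sim[u]:     # mutual simulation
                rep[u] = rep[v]
                break
        else:
            rep[u] = u
    # Rebuild the quotient pattern over the representatives.
    m_adj = {r: set() for r in set(rep.values())}
    for u, children in q_adj.items():
        for w in children:
            m_adj[rep[u]].add(rep[w])
    m_label = {r: q_label[r] for r in m_adj}
    return {r: sorted(c) for r, c in m_adj.items()}, m_label

# Example: the two-node cycle a <-> b with one label minimizes to a single
# node carrying a self-loop.
Q, QL = {"a": ["b"], "b": ["a"]}, {"a": "A", "b": "A"}
print(quotient_by_sim_equivalence(Q, QL))  # ({'a': ['a']}, {'a': 'A'})
```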
Experimental Study
Real life datasets:
Google Web graph: 875,713 nodes and 5,105,039 edges
Amazon product co-buy network: 548,552 nodes and 1,788,725 edges
Synthetic graph generator (10^8 nodes and 3,981,071,706 edges)
Three parameters:
1. The number n of nodes;
2. The number n^α of edges; and
3. The number l of node labels
Algorithm disHHK and its optimized version disHHK+
Optimal algorithms naiveMatchds (data shipment) and naiveMatchvt (visit times)
The experiments were run on a cluster of 16 machines, all with 2 Intel Xeon E5620
CPUs and 64GB memory
Experimental Study
1. All algorithms scale well except naiveMatchds and naiveMatchvt
2. disHHK+ consistently reduces the running time of disHHK by about 1/5 to 1/4
Experimental Study
1. All algorithms ship about 1/10000 of the data graphs
2. disHHK+ and disHHK even ship less data than naiveMatchds when
data graphs are large and sparse
Experimental Study
disHHK+ and disHHK incur 30% to 53% more visits than naiveMatchds,
as expected
Conclusions
We have formulated and investigated the distributed graph pattern
matching problem, via graph simulation.
We have given a static analysis of graph simulation
– Utility of connected components
– Study of data locality
We have studied the complexity of a large class of distributed algorithms
for graph simulation.
– A message-passing computation model
– Makespan, data shipment, and visit times (in conflict with one another)
We have proposed a distributed algorithm for graph simulation
– The scheduling problem
– Optimization techniques
– Experimental verification
A first step towards the big picture of distributed graph pattern matching