Wenfei Fan, Shuai Ma, Nan Tang, Yinghui Wu
Graph Pattern Matching
• Find all matches of a pattern in a data graph
• Identify all suspects in the drug ring
– captured by a relation, not a function
– not allowed by bijection
– (bounded) edge-to-path mapping
[Figure: Drug trafficking: Pattern and Data Graph, with nodes labeled B, A1, Am, AM, S, FW and W]
Subgraph isomorphism is too strict for emerging applications
Cows Don’t Fly and Other Known Facts
• Subgraph isomorphism
NP-complete
Exponential number of subgraphs
• Graph simulation
PTIME
Bounded by |Vp||V|
What is the touchstone?
Subgraph isomorphism is intractable, and has large query results
Touchstone: Meet Practical Needs
• Capture the semantics in emerging applications
• Real-life graphs can be large
– Efficient algorithms are necessary
– Simulation-based solution is promising
– Cubic-time complexity is a milestone
• Real-life graphs are dynamic
– Compute from scratch?
– Previous effort in vain
– The changes are typically small
– Incremental algorithms are expected
The quest for new graph pattern queries and solutions
Pearls of This Work (Outline)
• Revised data graphs and pattern graphs
– bounded simulation
• Graph pattern matching algorithm
• Incremental graph pattern matching
– performance guarantees
• Experimental study
– Three real-life datasets and synthetic data
• Conclusion and future work
New practical model for graph pattern matching in emerging applications
Data Graphs
• A data graph is a directed graph G = (V, E, fA)
– fA(u) is a tuple (A1 = a1, ..., An = an)
– attribute: label, keywords, blogs, comments …
[Figure: example data graph with nodes AI, DB, Med, Gen, Chem, Soc and Eco; attribute tuples shown: (‘dept’=CS, ‘field’=AI), (‘dept’=CS, ‘field’=DB), (‘dept’=Bio, ‘field’=Gen), (‘dept’=Bio, ‘field’=Eco)]
Enriched models on data graphs
Pattern Graphs
• A pattern graph is defined as P = (Vp, Ep, fv, fe)
– fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥
• a search condition
– fe(u,u’): k or ∗
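[Figure: example pattern graph with nodes CS, Med, Bio and Soc; fv(CS): ‘dept’=CS; edges are labeled either with a bound (2 or 3: bounded) or with ∗ (unbounded)]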
The new graph patterns are more flexible and expressive
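As a rough illustration of the search condition fv(u), the sketch below evaluates a conjunction of atoms "A op a" against an attribute tuple. The names `OPS` and `satisfies`, the dict encoding of tuples, and the edge bound in `fe` are assumptions made for this example; they do not come from the slides.

```python
import operator

# Comparison operators allowed in an atom "A op a" (ASCII spellings of ≤, ≠, ≥).
OPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">": operator.gt, ">=": operator.ge,
}

def satisfies(attrs, condition):
    """Return True if the attribute tuple `attrs` (a dict) satisfies every
    atom (A, op, a) in the conjunction `condition`."""
    for name, op, const in condition:
        # A node that lacks attribute A cannot satisfy an atom on A.
        if name not in attrs or not OPS[op](attrs[name], const):
            return False
    return True

# Hypothetical pattern fragment: node conditions fv and edge bounds fe (k or "*").
fv = {"CS": [("dept", "=", "CS")], "Bio": [("dept", "=", "Bio")]}
fe = {("CS", "Bio"): 2}  # made-up bound, for illustration only

node_attrs = {"dept": "CS", "field": "DB"}
print(satisfies(node_attrs, fv["CS"]))   # True
print(satisfies(node_attrs, fv["Bio"]))  # False
```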
Bounded Simulation
• G=(V, E) matches P=(Vp, Ep) via bounded simulation if there exists a binary relation S ⊆ Vp × V such that:
– for each u∈ Vp, there exists v∈ V such that (u,v)∈ S
– for each (u,v)∈ S, the attributes fA(v) satisfy the predicate fv(u)
– each (u,u’) in Ep is mapped to a bounded path from v to v’ in G, with (u’,v’)∈ S
• In traditional graph simulation, (v, v’) is an edge
[Figure: pattern P, data graph G and the relation S, with matches (CS, DB), (Med, Med), (Bio, Gen), (Bio, Eco), (Soc, Soc)]
A departure from traditional graph simulation
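The three conditions of bounded simulation can be checked directly on a candidate relation S. The following is a minimal sketch, assuming that paths must be nonempty, that pattern edges are given as a dict from (u, u’) to a bound k or "*", and that `node_ok(u, v)` stands in for "fA(v) satisfies fv(u)". All function and parameter names are hypothetical.

```python
from collections import deque

def path_lengths(adj, src):
    """Shortest nonempty-path length from src to each reachable node (BFS)."""
    dist, queue = {}, deque()
    for w in adj.get(src, ()):
        if w not in dist:
            dist[w] = 1
            queue.append(w)
    while queue:
        x = queue.popleft()
        for w in adj.get(x, ()):
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return dist

def is_bounded_simulation(S, p_nodes, p_edges, adj, node_ok):
    """Check whether the relation S (a set of (u, v) pairs) is a match."""
    # Condition 1: every pattern node has at least one match.
    if set(p_nodes) - {u for u, _ in S}:
        return False
    # Condition 2: every matched data node satisfies the search condition.
    if not all(node_ok(u, v) for u, v in S):
        return False
    # Condition 3: every pattern edge (u, u2) is mapped to a bounded path.
    for u, v in S:
        dist = path_lengths(adj, v)
        for (u1, u2), bound in p_edges.items():
            if u1 != u:
                continue
            if not any(w == u2 and v2 in dist
                       and (bound == "*" or dist[v2] <= bound)
                       for w, v2 in S):
                return False
    return True

# Toy data (not from the slides): a -> x -> b, pattern edge A -> B.
adj = {"a": ["x"], "x": ["b"], "b": []}
label = {"a": "A", "x": "X", "b": "B"}
S = {("A", "a"), ("B", "b")}
node_ok = lambda u, v: label[v] == u
print(is_bounded_simulation(S, ["A", "B"], {("A", "B"): 2}, adj, node_ok))  # True
print(is_bounded_simulation(S, ["A", "B"], {("A", "B"): 1}, adj, node_ok))  # False
```

With bound 1 the check degenerates to the edge-to-edge requirement of traditional graph simulation, which the toy graph does not meet.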
Maximum Match and Result Graph
• Maximum match: if G matches P, then there is a unique maximum match
– The problem of graph pattern matching is well defined
• Result graph: a graph representation of S
– SQL queries return relations, but these are graph queries
[Figure: pattern P, the match relation S = {(CS, DB), (Med, Med), (Bio, Gen), (Bio, Eco), (Soc, Soc)} and the corresponding result graph]
A unique result and a clear-cut result presentation
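One way to turn a match relation S into a result graph is sketched below. The rule used here (add an edge (v, v’) whenever some pattern edge (u, u’) with (u, v) and (u’, v’) in S is witnessed by a path within the bound) is an assumption for illustration; the slides only state that the result graph is a graph representation of S. The function name and the `dist` encoding are hypothetical.

```python
def result_graph(S, p_edges, dist):
    """Build (nodes, edges) of a result graph from the match relation S.

    p_edges maps a pattern edge (u, u2) to its bound (an int or "*");
    dist maps a pair (v, v2) to the shortest path length, when a path exists.
    """
    nodes = {v for _, v in S}
    edges = set()
    for (u1, u2), bound in p_edges.items():
        sources = [v for u, v in S if u == u1]
        targets = [v for u, v in S if u == u2]
        for v1 in sources:
            for v2 in targets:
                d = dist.get((v1, v2))
                if d is not None and (bound == "*" or d <= bound):
                    edges.add((v1, v2))
    return nodes, edges

# Toy input (not from the slides).
S = {("A", "a"), ("B", "b")}
nodes, edges = result_graph(S, {("A", "B"): 2}, {("a", "b"): 2})
print(sorted(nodes), sorted(edges))  # ['a', 'b'] [('a', 'b')]
```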
Identify the Maximum Match
Algorithm: Match
• input: a data graph G and a pattern graph P
• output: the maximum match S
Main ideas:
– Initialize the match set of each pattern node, and a distance matrix
– Recursively remove nodes that cannot make a match (the match sets decrease monotonically)
– Return the maximum match S, or an empty set otherwise
The algorithm is well designed with low complexity: O(|V|^3), i.e., cubic time
Subgraph isomorphism is NP-complete! We are in cubic time!!
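The main ideas above can be sketched as a naive fixpoint computation. This is not the Match algorithm from the slides (it does not maintain premv() sets and makes no attempt to reach the stated bound); it only illustrates the initialize, remove, return structure. The names `max_match`, `path_lengths` and `node_ok` are hypothetical, and `adj` is assumed to list every data node as a key.

```python
from collections import deque

def path_lengths(adj, src):
    """Shortest nonempty-path length from src to each reachable node (BFS)."""
    dist, queue = {}, deque()
    for w in adj[src]:
        if w not in dist:
            dist[w] = 1
            queue.append(w)
    while queue:
        x = queue.popleft()
        for w in adj[x]:
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return dist

def max_match(p_nodes, p_edges, adj, node_ok):
    """Return the maximum match as a set of (u, v) pairs, or an empty set."""
    # Initialization: distance matrix and candidate match sets.
    dist = {v: path_lengths(adj, v) for v in adj}
    mat = {u: {v for v in adj if node_ok(u, v)} for u in p_nodes}
    # Refinement: drop candidates that cannot map some pattern edge.
    changed = True
    while changed:
        changed = False
        for (u1, u2), bound in p_edges.items():
            for v in list(mat[u1]):
                if not any(v2 in dist[v]
                           and (bound == "*" or dist[v][v2] <= bound)
                           for v2 in mat[u2]):
                    mat[u1].discard(v)
                    changed = True
    if any(not mat[u] for u in p_nodes):
        return set()
    return {(u, v) for u in p_nodes for v in mat[u]}

# Toy data (not the example from the slides): db -> x -> gen, ai isolated.
adj = {"db": ["x"], "x": ["gen"], "gen": [], "ai": []}
label = {"db": "CS", "ai": "CS", "gen": "Bio", "x": "Other"}
node_ok = lambda u, v: label[v] == u
print(sorted(max_match(["CS", "Bio"], {("CS", "Bio"): 2}, adj, node_ok)))
# [('Bio', 'gen'), ('CS', 'db')]
```

Because candidates are only ever removed, the loop terminates, and the relation that remains is the largest one satisfying all three conditions.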
How It Works
• Initialization
P      mat()        premv()
CS     {DB, AI}     {DB,AI,Gen,Chem,Eco}
Med    {Med}        {Med,Chem}
Bio    {Gen, Eco}   {Med,Gen,Eco,Chem}
Soc    {Soc}        {AI,Med,Chem}
[Figure: pattern P and data graph G with nodes AI, DB, Med, Gen, Chem, Soc and Eco, and matches (CS, DB), (Med, Med), (Bio, Gen), (Bio, Eco), (Soc, Soc)]
• step1: AI is removed from mat(CS), and premv(Soc)=null
• step2: nothing can be removed from mat(CS), and premv(Bio)=null
• step3: nothing can be removed from mat(CS), mat(Bio), premv(Med)=null
• step4: nothing can be removed from mat(Med), premv(CS)=null
• Return the maximum match
Bad news: in practice, data graphs are dynamic
The maximum match can be identified efficiently
Incremental Graph Pattern Matching
• A pattern P, a graph G, the maximum match S, updates δ, find S’
• Affected area AFF: the changes in the input and the output
• The problem is unbounded even when pattern graphs are DAGs
– No performance guarantees
• A revised AFF, taking the distance matrix M as an input
– AFF1: the set of node pairs in G whose distance is changed
– AFF2: the difference between S’ and S
– With performance guarantees
• Unit updates: edge deletions, edge insertions
• Batch updates: a sequence of edge deletions and insertions
Minimizing unnecessary computation
Incremental Algorithm for Unit Update
Algorithm: Match−
• input: a graph G, a pattern P, the maximum match S, the distance matrix M and an edge e to be deleted from G
• output: the new maximum match S’
Main ideas:
– Compute affected area AFF1 by incrementally deriving M’
– Identify matches in S that are directly affected by AFF1
– Recursively find all matches that are affected by AFF1
– Return S’ and M’, and constitute AFF2
For unit insertion, w.r.t. DAG patterns and data graphs, Match+ runs in O(|AFF1||AFF2|^2)
For unit deletion, Match− runs in O(|AFF1||AFF2|^2)
How Incremental Algorithm Works
• Given P, G, the maximum match S, and an edge (n3, n5) to be removed
[Figure: pattern P with nodes A, SE, HR and DM (‘golf’); data graph G with nodes n1 to n6; matches (A, n1), (SE, n3), (SE, n5), (HR, n2), (HR, n5), (DM, n4), (DM, n6)]
• step1: identify AFF1: (n3,n5), (n3,n6), (n4, n5), (n4, n6)
• step2: identify those in S that are affected by AFF1, (n4, n1) and (n3, n1)
• step3: check the parent of n4, and remove n3 from mat(SE)
Unit update can be incrementally computed efficiently
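To make AFF1 concrete, the sketch below computes the set of node pairs whose distance changes when one edge is deleted. It recomputes all distances before and after the deletion, so it is a naive baseline for illustration rather than the incremental derivation of M’ described on the slides. The function names and the adjacency-dict encoding are assumptions.

```python
from collections import deque

def all_distances(adj):
    """Map each pair (v, w) to the shortest nonempty-path length from v to w."""
    dist = {}
    for src in adj:
        seen, queue = {}, deque()
        for w in adj[src]:
            if w not in seen:
                seen[w] = 1
                queue.append(w)
        while queue:
            x = queue.popleft()
            for w in adj[x]:
                if w not in seen:
                    seen[w] = seen[x] + 1
                    queue.append(w)
        for w, d in seen.items():
            dist[(src, w)] = d
    return dist

def aff1_after_deletion(adj, edge):
    """Node pairs whose distance changes when `edge` is deleted from adj."""
    before = all_distances(adj)
    new_adj = {x: [y for y in ys if (x, y) != edge] for x, ys in adj.items()}
    after = all_distances(new_adj)
    # A deletion can only lengthen or disconnect paths, so every affected
    # pair already has an entry in `before`.
    return {pair for pair, d in before.items() if after.get(pair) != d}

# Toy graph (not from the slides): a -> b -> c plus a shortcut a -> c.
adj = {"a": ["b", "c"], "b": ["c"], "c": []}
print(aff1_after_deletion(adj, ("a", "c")))  # {('a', 'c')}
print(aff1_after_deletion(adj, ("b", "c")))  # {('b', 'c')}
```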
Incremental Algorithms for Batch Updates
• Given P, G, the maximum match S, and updates δ, find S’
– Straightforward way: process updates in δ one by one
– IncMatch: incrementally computes AFF1 and updates M by taking the entire δ as a batch
– It is in O(|AFF1||AFF2|^2) time, for DAG patterns and general data graphs.
Batch updates can be handled efficiently
Incremental Graph Simulation
• The problem is unbounded even for unit updates and general patterns.
• It is bounded for:
– single-edge deletions and general patterns;
– single-edge insertions and DAG patterns; within an optimal time O(|AFF|).
• In O(|δ|(|P||AFF| + |AFF|^2)) time for batch updates and general patterns.
Incremental Graph Simulation
• Main idea for single edge (v',v) deletion:
– delete the edge from G; check if v' still serves as a match for some pattern node u';
– if not, remove v' from sim(u'); push all the edges (v'',v') into a stack;
– while the stack is not empty, repeat the above check for v''.
• Single edge insertion is similar to the above; only the first step is a little different.
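The deletion procedure above can be sketched as follows for plain graph simulation, where a pattern edge must be matched by a single data edge. This is an illustrative sketch under assumed encodings: `sim` maps each pattern node to its current set of matches, `succ` and `pred` map each data node to its sets of children and parents, and the stack holds parent nodes rather than edges. None of these names come from the slides.

```python
def delete_edge(sim, p_edges, succ, pred, edge):
    """Update the maximum simulation `sim` in place after deleting `edge`.

    p_edges is a collection of pattern edges (u, u2).
    """
    src, dst = edge
    succ[src].discard(dst)
    pred[dst].discard(src)
    stack = [src]  # only the source of the deleted edge can lose a witness
    while stack:
        v = stack.pop()
        for u in [u for u in sim if v in sim[u]]:
            # v remains a match of u only if every pattern edge (u, u2)
            # is still witnessed by some child of v in sim(u2).
            if all(succ[v] & sim[u2] for (u1, u2) in p_edges if u1 == u):
                continue
            sim[u].discard(v)
            stack.extend(pred[v])  # parents of v must be re-checked
    return sim

# Toy data (not from the slides): pattern A -> B -> C, graph a -> b -> c.
sim = {"A": {"a"}, "B": {"b"}, "C": {"c"}}
succ = {"a": {"b"}, "b": {"c"}, "c": set()}
pred = {"a": set(), "b": {"a"}, "c": {"b"}}
delete_edge(sim, [("A", "B"), ("B", "C")], succ, pred, ("b", "c"))
print(sim)  # {'A': set(), 'B': set(), 'C': {'c'}}
```

Each push onto the stack follows a removal from some sim set, so the work is tied to the matches that actually change rather than to the size of the whole graph.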
Experimental Study
• Datasets
– Synthetic data: C++ boost graph generator
– Real-life data
        Matter   Pblog   YouTube
|V|     16726    1490    14829
|E|     47594    19090   58901
• Pattern generator: |Vp|, |Ep|, G, k (k-c ≤ k’ ≤k+c)
• Algorithms implemented
– Match and IncMatch
– BFS and 2-hop
– SubIso and VF2
• Experimental results
– Effectiveness and flexibility
– Efficiency and scalability
– Incremental performance
Effectiveness and Flexibility
[Figure: YouTube pattern with nodes n1 to n8 and bounded edges; search conditions shown: C=“Music” and R > 3, U=“FWPB”, U=“ascro” and A < 500]
• Identify meaningful results
• More matches
• Much faster
Efficiently identify more communities than its counterpart
Efficiency and Scalability
[Figure: three panels, (|V|=20K, |E|=20K), (|V|=20K, |E|=40K) and (|V|=20K, |E|=60K)]
• Match scales well with |E|
• The use of distance matrix is effective
Patterns can be evaluated efficiently, with good scalability
Incremental Performance
• IncMatch is better when |δ| ≤ 2800
• Edge insertion has a stronger impact than edge deletion
The updates can be incrementally handled with high efficiency
Conclusion and Future Work
• Graph pattern queries:
– revised data graphs and pattern graphs using bounded simulation;
– a cubic-time algorithm for the proposed graph pattern queries;
– incremental algorithms with performance guarantees.
• Future work
– a bounded incremental algorithm for cyclic patterns;
– queries supporting edge colors that capture relationships;
– optimization techniques.
A promising graph pattern matching model for emerging applications
1. Revisited Graph homomorphism: edge-to-edge to edge-to-path (PVLDB 2010)
2. Bounded Simulation: Graph isomorphism to simulation (PVLDB 2010)
3. Adding (restrictive) Regular Expression to Bounded Simulation (ICDE 2011)
4. Incremental Graph Pattern Matching (in submission)