Transcript ln3

QSX: Querying Social Graphs
Graph Pattern Matching
 Graph pattern matching via subgraph isomorphism
 Graph pattern matching via graph simulation
 Revisions of graph simulation for social network analysis
1
The need for studying graph pattern matching
 Applications
• pattern recognition
• knowledge discovery
• intelligence analysis
• transportation network analysis
• Web site classification
• social position and community detection
• social media marketing
• knowledge fusion
• ...
Prevalent use in traditional and emerging applications
2
Subgraph isomorphism: complexity and algorithm
3
Social Graphs
Directed graph G = (V, E, fA)
 attributes fA(u): label
Assume fA(u) has a unique attribute: label
[Figure: example graph with node labels AI, Med, DB, Gen, Soc, Eco, Eco, Chem]
Simplification: node labels
4
Subgraph isomorphism
An injective function f from the nodes of Q to the nodes of G (a bijection onto the nodes of the matched subgraph):
 For each node u in Q, u and f(u) have the same label;
 There exists an edge (u, u’) in Q if and only if there exists an edge (f(u), f(u’)) in G
[Figure: pattern Q and data graph G with nodes labeled A, B, D, E; G contains nodes v1 and v2]
A bijection: identical label matching, edge-to-edge relations
5
Matching by subgraph isomorphism

Input: A directed graph G, and a graph pattern Q

Output: all subgraphs of G that are isomorphic to Q


Complexity
• NP-complete (decision problem)
• Exponentially many matches
• Remains NP-hard even when
  • Q is a forest and G is a tree
  • Q is acyclic and G is a tree
• PTIME if Q is a tree and G is a forest
The lower bound is rather robust: intractable
6
Algorithms for computing subgraph isomorphism
Input: pattern Q and graph G
Output: all isomorphic mappings P from Q to G
P: partial mappings, initially empty
Match(P)
• if P covers all nodes in Q then output P;
• else compute the set S(P) of all candidate pairs for inclusion in P
  (nodes that are directly connected to those already in P, with the same labels)
• for each pair p = (u, v) in S(P)
  • if p passes feasibility check (guarantee correctness)
  • then P’ ← P ∪ {p}; call Match(P’);
• restore data structures
for each pair p = (u, v) in S(P):
 enumerate all possible extensions, for refinement
 if the feasibility test is not successful, drop it and try the next
Recursion, refinement
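A minimal Python sketch of the Match(P) recursion above. The dictionary encoding (q_adj and g_adj map each node to its set of successors, q_label and g_label map nodes to labels) and all function names are assumptions for illustration. As a simplification, candidates are drawn from all nodes of G rather than only from neighbors of already-matched nodes.

```python
def subgraph_isomorphisms(q_adj, q_label, g_adj, g_label):
    """Enumerate mappings f from nodes of Q to nodes of G that are injective,
    label-preserving, and satisfy: (u, u') in Q iff (f(u), f(u')) in G."""
    q_nodes = list(q_adj)
    results = []

    def feasible(mapping, u, v):
        # Same label, and v not already used (injectivity).
        if q_label[u] != g_label[v] or v in mapping.values():
            return False
        # Compare edges between (u, v) and every pair already in P,
        # plus (u, v) itself to cover self-loops.
        for pu, pv in list(mapping.items()) + [(u, v)]:
            if (pu in q_adj[u]) != (pv in g_adj[v]):
                return False
            if (u in q_adj[pu]) != (v in g_adj[pv]):
                return False
        return True

    def match(mapping):
        if len(mapping) == len(q_nodes):      # P covers all nodes in Q
            results.append(dict(mapping))
            return
        u = q_nodes[len(mapping)]             # next unmapped pattern node
        for v in g_adj:
            if feasible(mapping, u, v):
                mapping[u] = v                # P' <- P + {(u, v)}
                match(mapping)
                del mapping[u]                # restore data structures

    match({})
    return results


# Hypothetical example: pattern A -> B, data graph with two B-children of a.
q_adj = {"u1": {"u2"}, "u2": set()}
q_label = {"u1": "A", "u2": "B"}
g_adj = {"a": {"v1", "v2"}, "v1": set(), "v2": set(), "d": set()}
g_label = {"a": "A", "v1": "B", "v2": "B", "d": "D"}
matches = subgraph_isomorphisms(q_adj, q_label, g_adj, g_label)
```

Here the pattern edge can be mapped to either (a, v1) or (a, v2), so two mappings are reported.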
7
VF2
Match(P)
• if P covers all nodes in Q then output P;
• else compute the set S(P) of all candidate pairs for inclusion in P
• for each pair p = (u, v) in S(P)
  • if p passes feasibility check
  • then P’ ← P ∪ {p}; call Match(P’);
• restore data structures
Five k-look-ahead rules, to make sure that P is a partial isomorphic mapping (guarantee correctness and reduce backtracking)
Feasibility rules: for each pair (u, v) in P
 their predecessors are already included (mapped) in P
 their successors can possibly be mapped in P
 Certain conditions on cardinalities of predecessors and successors to ensure correctness and expandability
L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs, IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004
VF2: a popular algorithm for subgraph isomorphism
8
Ullmann’s algorithm
Use adjacency matrices of G and Q, their transposes, and a form of permutation matrices
Backtrack(P)
• if P covers all nodes in Q then output P and return;
• for each node u in Q that is not yet in P
  • find a node v in G; p ← (u, v); P’ ← P ∪ {p}; (expanding permutation matrices representing P)
  • if P’ makes a partial mapping (injective function, preserving edges)
  • then call Backtrack(P’);
for each candidate pair p = (u, v):
 enumerate all possible extensions, for refinement
 Backtracking: no matter whether the test is successful or not, go back to the previous level and try another p
J. R. Ullmann. An Algorithm for Subgraph Isomorphism. JACM 1976
An algorithm that is still being used
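The matrix view can be illustrated with a small check, sketched here in Python under stated assumptions: m is a |VQ| x |V| 0/1 matrix with m[i][k] = 1 when pattern node i is mapped to graph node k, a_q and a_g are adjacency matrices, and the function names are hypothetical. The check tests the edge-preserving condition (every edge of Q has an image in G) by computing C = M A_G M^T; it is not the full search procedure.

```python
def matmul(x, y):
    """Plain list-of-lists matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def transpose(x):
    return [list(row) for row in zip(*x)]


def is_embedding(m, a_q, a_g):
    """True if m encodes an injective, edge-preserving mapping of Q into G."""
    rows_ok = all(sum(row) == 1 for row in m)        # each u maps to one v
    cols_ok = all(sum(col) <= 1 for col in zip(*m))  # no v is used twice
    if not (rows_ok and cols_ok):
        return False
    c = matmul(matmul(m, a_g), transpose(m))         # c[i][j] = a_g[f(i)][f(j)]
    n = len(a_q)
    return all(c[i][j] >= a_q[i][j] for i in range(n) for j in range(n))


# Hypothetical example: Q is the edge 0 -> 1; G is the path 0 -> 1 -> 2.
a_q = [[0, 1], [0, 0]]
a_g = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
good = is_embedding([[0, 1, 0], [0, 0, 1]], a_q, a_g)  # 0 -> 1, 1 -> 2
bad = is_embedding([[0, 0, 1], [1, 0, 0]], a_q, a_g)   # 0 -> 2, 1 -> 0
```

The first mapping sends the pattern edge to the edge (1, 2) of G and passes; the second would need an edge (2, 0) in G and fails.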
9
Graph simulation: complexity and algorithm
10
Graph Simulation
A binary relation R on the nodes of Q and the nodes of G:
 For each node u in Q, there exists a node v in G such that (u, v) is in R, and u and v have the same label;
 For each pair (u, v) in R and each edge (u, u’) in Q, there exists an edge (v, v’) in G such that (u’, v’) is in R
[Figure: pattern Q and data graph G with nodes labeled A, B, D, E; G contains nodes v1 and v2]
Relations as opposed to functions
A relation: identical label matching, edge-to-edge mapping
11
Matching by graph simulation



Input: A directed graph G, and a graph pattern Q
Output: the maximum simulation relation R
Maximum simulation relation: always exists and is unique
• If a match relation exists, then there exists a maximum one
• Otherwise, it is the empty set – still maximum
 Complexity: O((|V| + |VQ|)(|E| + |EQ|)) (quadratic time)
 The output is a unique relation, possibly of size |Q||V|
Use relations instead of functions
12
Data locality
Given a pattern Q, a graph G and a node v in G, can we decide
whether v matches some node in Q by inspecting only nodes within
d hops of v, where d is determined by Q only?

 Subgraph isomorphism has the data locality
  • d: the diameter of Q
  • We only need to inspect the d-neighborhood of v
 Graph simulation does not have the data locality
[Figure: pattern Q and data graph G]
Graph simulation: a recursive computation
13
Algorithm for computing graph simulation
Input: pattern Q and graph G
Output: for each u in Q, sim(u): the matches w in G
Similarity(P)
• for all nodes u in Q do
  • sim(u) ← the set of candidate matches w in G (with the same label; moreover, if u has an outgoing edge, so does w);
• while there exist an edge (u, v) in Q and w in sim(u) (in G) that violate the simulation condition, i.e., successor(w) ∩ sim(v) = ∅
  • sim(u) ← sim(u) \ {w}; (refinement)
• output sim(u) for all u in Q
successor(w) ∩ sim(v) = ∅
• There exists an edge from u to v in Q, but the candidate w of u has no corresponding edge to a node w’ that matches v
Correct, but not in quadratic time
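The refinement loop above can be sketched in Python as a naive fixpoint. The dictionary encoding of Q and G (node to set of successors, node to label) and the function name are assumptions for illustration; the final step returns the empty relation when some pattern node has no match, following the definition of the maximum simulation.

```python
def naive_simulation(q_adj, q_label, g_adj, g_label):
    """Maximum simulation as sim(u) for each pattern node u (naive fixpoint)."""
    # Candidates: same label; if u has an outgoing edge, so does w.
    sim = {u: {w for w in g_adj
               if g_label[w] == q_label[u] and (not q_adj[u] or g_adj[w])}
           for u in q_adj}
    changed = True
    while changed:
        changed = False
        for u in q_adj:
            for v in q_adj[u]:                   # edge (u, v) in Q
                for w in list(sim[u]):
                    if not (g_adj[w] & sim[v]):  # successor(w) has no match of v
                        sim[u].discard(w)        # sim(u) <- sim(u) \ {w}
                        changed = True
    if any(not s for s in sim.values()):         # some pattern node unmatched
        return {u: set() for u in q_adj}
    return sim


# Hypothetical example: a 2-cycle pattern against a 4-cycle plus a dead end.
q_adj = {"A": {"B"}, "B": {"A"}}
q_label = {"A": "A", "B": "B"}
g_adj = {"a1": {"b1"}, "b1": {"a2"}, "a2": {"b2"}, "b2": {"a1"},
         "a3": {"b3"}, "b3": set()}
g_label = {n: n[0].upper() for n in g_adj}
result = naive_simulation(q_adj, q_label, g_adj, g_label)
```

In this example a3 is dropped because its only successor b3 has no outgoing edge and so was never a candidate for B; the 4-cycle matches the 2-cycle pattern in full.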
14
Speedup
 For each node u in pattern Q, prevsim(u): a superset of sim(u)
  • nodes once considered as candidate matches of u
  • invariant: for each edge (u, v) in Q and each w in sim(u), successor(w) ∩ prevsim(v) ≠ ∅
  • terminate if prevsim(u) = sim(u) for all nodes u in Q (can’t be refined further)
  • prevsim(u) \ sim(u): invalid candidates
 If successor(w) ∩ prevsim(v) = ∅
  • w should be removed from sim(u); u: a predecessor of v
  • Propagate violations upward
Once w is removed, it is never put back
Each node in prevsim(u) is looked up only once
15
Algorithm
Similarity(P)
• for all nodes v in Q do
  • sim(v) ← the set of candidate matches w in G (with the same label; moreover, if v has an outgoing edge, so does w);
  • prevsim(v) ← the set of all the nodes in G;
• while there exists a node v in Q such that sim(v) ≠ prevsim(v)
  • remove ← predecessor(prevsim(v)) \ predecessor(sim(v)); (a dynamically maintained remove)
  • for all u in predecessor(v) do (propagate up)
    • sim(u) ← sim(u) \ remove; (refinement)
  • prevsim(v) ← sim(v);
• output sim(v) for all v in Q
For each w ∈ prevsim(v) \ sim(v), w is checked only once, hence |VQ| |V| in total
Can be implemented in O((|V| + |VQ|)(|E| + |EQ|)) time
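A Python sketch of the prevsim/remove refinement, under the same assumed dictionary encoding (node to set of successors, node to label). It recomputes predecessor sets on each round for readability, so it illustrates the control flow rather than the stated time bound. One deliberate deviation, labeled in the code: prevsim(v) is set to a snapshot of sim(v) taken before the update, so that a pattern node with a self-loop is revisited.

```python
def simulation(q_adj, q_label, g_adj, g_label):
    """Maximum simulation via the prevsim / remove refinement (sketch)."""
    def reverse(adj):
        pred = {n: set() for n in adj}
        for n, succs in adj.items():
            for m in succs:
                pred[m].add(n)
        return pred

    q_pred, g_pred = reverse(q_adj), reverse(g_adj)

    def pre(nodes):
        # predecessor(S): all nodes of G with an edge into S
        return set().union(*(g_pred[w] for w in nodes))

    sim = {v: {w for w in g_adj
               if g_label[w] == q_label[v] and (not q_adj[v] or g_adj[w])}
           for v in q_adj}
    prevsim = {v: set(g_adj) for v in q_adj}
    while True:
        v = next((x for x in q_adj if sim[x] != prevsim[x]), None)
        if v is None:
            break
        current = set(sim[v])                    # snapshot (assumption, see lead-in)
        remove = pre(prevsim[v]) - pre(current)  # lost every successor in sim(v)
        for u in q_pred[v]:                      # propagate up
            sim[u] -= remove
        prevsim[v] = current
    if any(not s for s in sim.values()):
        return {v: set() for v in q_adj}
    return sim


# Hypothetical example: a 2-cycle pattern against a 4-cycle plus a dead end.
q_adj = {"A": {"B"}, "B": {"A"}}
q_label = {"A": "A", "B": "B"}
g_adj = {"a1": {"b1"}, "b1": {"a2"}, "a2": {"b2"}, "b2": {"a1"},
         "a3": {"b3"}, "b3": set()}
g_label = {n: n[0].upper() for n in g_adj}
result = simulation(q_adj, q_label, g_adj, g_label)
```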
16
Graph simulation revised for social network analysis
17
Graph pattern matching: The conventional
 Input: a query Q and a data graph G,
 Output: all the matches of Q in G.
• subgraph isomorphism
a bijective function f on nodes: (u,u’ )
∈ Q iff (f(u), f(u’)) ∈ G
•
graph simulation
a binary relation S on nodes
for each (u,v)∈ S, each edge (u,u’)
in Q is mapped to an edge (v, v’ )
in G, such that (u’,v’ )∈ S
18
Can we use the conventional notions for social network analysis?
Example query: graph pattern matching
Find all matches of a pattern in a graph
Identify suspects in a drug ring
[Figure: pattern graph with nodes B, AM, S, FW and edge bounds 1, 3, 3; data graph with nodes B, A1, ..., Am, S, W]
“Understanding the structure of drug trafficking organizations”
19
Pattern matching in social graphs
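[Figure: the drug-ring pattern (nodes B, AM, S, FW; edge bounds 1, 3, 3) matched against the data graph (nodes B, A1, ..., Am, S, W)]
• relation instead of function (not allowed by bijection)
• edges to paths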
For both scalability and effectiveness
Neither subgraph isomorphism nor graph simulation works
20
Social Graphs
Directed graph G = (V, E, fA)
 attributes fA(u): a tuple (A1 = a1, ..., An = an), e.g., label, keywords, blogs, comments, rating …
[Figure: example graph with nodes AI, Med, DB, Gen, Soc, Chem, Eco, Eco; e.g., AI: (‘dept’=CS, ‘field’=AI), DB: (‘dept’=CS, ‘field’=DB), Gen: (‘dept’=Bio, ‘field’=Gen), Eco: (‘dept’=Bio, ‘field’=Eco)]
Social graphs: modeling attributes
21
Bounded patterns
Pattern graph: Q = (VQ, EQ, fv, fe)
 fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥ (search condition)
 fe(u,u’): a constant k or a symbol ∗ (bound: within k hops; ∗: unbounded)
[Figure: pattern with nodes CS, Med, Bio, Soc and edge bounds 3, 2, ∗; e.g., fv(): ‘dept’=CS]
Incorporating search conditions and bounds on the number of hops
22
Bounded Simulation
G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via bounded simulation, if there exists a binary relation S ⊆ VQ × V such that S
 is a total mapping: for each u ∈ VQ, there exists v ∈ V such that (u,v) ∈ S,
 satisfies search conditions and bounds on edge-to-path mappings: for each (u,v) ∈ S,
  • attributes fA(v) satisfies predicate fv(u)
  • each (u,u’) in EQ is mapped to a path from v to v’ in G whose length is within the bound fe(u,u’), with (u’,v’) ∈ S
[Figure: pattern with nodes CS, Med, Bio, Soc and edge bounds 3, 2, ∗, matched via S against a data graph with nodes AI, Med, DB, Gen, Chem, Soc, Eco]
There exists a unique
maximum match
Mapping edges to bounded paths
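A naive Python sketch of bounded simulation as defined above. Assumptions for illustration: q_edges maps each pattern edge (u, u') to an integer bound k or "*", q_pred maps each pattern node to a predicate over attribute dictionaries (standing in for fv), paths are nonempty, and all names are hypothetical. It precomputes shortest path lengths by BFS and then refines; it does not aim at the cubic bound.

```python
from collections import deque


def path_lengths(g_adj, src):
    """Shortest nonempty-path length from src to each reachable node."""
    dist = {}
    queue = deque((w, 1) for w in g_adj[src])
    while queue:
        w, d = queue.popleft()
        if w in dist:
            continue
        dist[w] = d
        queue.extend((x, d + 1) for x in g_adj[w])
    return dist


def bounded_simulation(q_edges, q_pred, g_adj, g_attr):
    dist = {v: path_lengths(g_adj, v) for v in g_adj}
    # Candidates: nodes whose attributes satisfy the search condition fv(u).
    sim = {u: {v for v in g_adj if pred(g_attr[v])}
           for u, pred in q_pred.items()}
    changed = True
    while changed:
        changed = False
        for (u, u2), bound in q_edges.items():
            for v in list(sim[u]):
                # Need some v2 in sim(u2) reachable from v within the bound.
                ok = any(v2 in sim[u2] and (bound == "*" or d <= bound)
                         for v2, d in dist[v].items())
                if not ok:
                    sim[u].discard(v)
                    changed = True
    if any(not s for s in sim.values()):
        return {u: set() for u in sim}
    return sim


# Hypothetical data: c1 reaches a Bio node in 2 hops, c2 only in 3 hops.
g_adj = {"c1": {"x"}, "x": {"b1"}, "b1": set(),
         "c2": {"x2"}, "x2": {"y2"}, "y2": {"b2"}, "b2": set()}
g_attr = {"c1": {"dept": "CS"}, "c2": {"dept": "CS"},
          "x": {"dept": "Med"}, "x2": {"dept": "Med"}, "y2": {"dept": "Med"},
          "b1": {"dept": "Bio"}, "b2": {"dept": "Bio"}}
q_pred = {"cs": lambda a: a.get("dept") == "CS",
          "bio": lambda a: a.get("dept") == "Bio"}
within_two = bounded_simulation({("cs", "bio"): 2}, q_pred, g_adj, g_attr)
unbounded = bounded_simulation({("cs", "bio"): "*"}, q_pred, g_adj, g_attr)
```

With the bound 2, only c1 matches the CS pattern node; with ∗ both c1 and c2 do.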
23
Bounded simulation in social graphs
[Figure: the drug-ring pattern (nodes B, AM, S, FW; edge bounds 1, 3, 3) matched against the data graph (nodes B, A1, ..., Am, S, W)]
• relation instead of function
• edges to paths
The set of all suspects involved in a drug ring
24
Complexity
 Input: Pattern Q and data graph G
 Output: Q(G), the unique maximum match relation (always exists)
 Bounded simulation: cubic time, O(|V||E| + |EQ||V|^2 + |VQ||V|)
 Subgraph isomorphism: intractable
 Graph simulation: O((|V| + |VQ|)(|E| + |EQ|)) (comparable: Q is small in practice)
Query driven approximation: use bounded simulation instead of subgraph isomorphism. Criteria:
 Lower complexity
 Effectiveness: the query answers are sensible
Algorithm? The reading list
To identify sensible matches and be computable in low PTIME
25
Bounded simulation vs. graph simulation
Graph simulation: a special case of bounded simulation
 The same bound 1 on all pattern edges (edge-to-edge mapping)
 Unique attributes vs. search conditions: label equality
 O((|VG| + |VQ|)(|EG| + |EQ|)) vs. O(|VG||EG| + |EQ||VG|^2 + |VQ||VG|)
Process calculus
Web site classification
Social position detection, …
Capture more sensible matches in social graphs (by 80%)
26
Homeomorphism and monomorphism
Graph homeomorphism: G = (V, E) matches Q = (VQ, EQ)
 an injective function from VQ to V (function rather than relation)
 edges to pairwise node-disjoint simple paths in G (constraints on paths)
Monomorphism revised: G = (V, E) matches Q = (VQ, EQ)
 an injective function from VQ to V
 edges to nonempty paths in G
Intractable, even when Q is a tree and G is a DAG
Strike a balance between expressive power and complexity
27
Graph pattern matching:
• Incorporating edge relationships
28
Edge relationships
S: supervise
C: co-author
[Figure: pattern with nodes CS, DB, Bio, Bio and edges labeled C and S+; data graph with nodes (Ann, CS), (Mat, DB), (John, DB), (Pat, DB), (Bill, Bio), (Tom, Bio), (Don, Gen) and edges labeled S and C]
What is this pattern to find?
29
Edge relation
Facebook
Mikhail
Twitter
Alice
Sunita
Jose
(Alice, Facebook)
(Alice, Sunita)
(Jose, Twitter)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Twitter)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)
30
Graph encodings: Adding edge types
[Figure: graph over Facebook, Twitter, Alice, Sunita, Jose, Mikhail with edges labeled fan-of and friend-of]
Adding edge labels
(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Twitter)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Twitter)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)
31
Graph encodings: Adding weights
[Figure: graph over Facebook, Twitter, Alice, Sunita, Jose, Mikhail with edges labeled fan-of and friend-of and weighted 0.3 to 0.9]
Even further, you can add weights and others
(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Twitter)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Twitter)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
(Sunita, friend-of, 0.3, Jose)
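The weighted, typed edge list above can be loaded into a nested index for lookups by source node and edge type. This is a minimal Python sketch; the index layout and the helper names build_index and fans_of are assumptions for illustration, and the tuples are the ones listed on the slide.

```python
from collections import defaultdict

# (source, edge type, weight, target), as listed above.
EDGES = [
    ("Alice", "fan-of", 0.5, "Facebook"),
    ("Alice", "friend-of", 0.9, "Sunita"),
    ("Jose", "fan-of", 0.5, "Twitter"),
    ("Jose", "friend-of", 0.3, "Sunita"),
    ("Mikhail", "fan-of", 0.8, "Facebook"),
    ("Mikhail", "fan-of", 0.7, "Twitter"),
    ("Sunita", "fan-of", 0.7, "Facebook"),
    ("Sunita", "friend-of", 0.9, "Alice"),
    ("Sunita", "friend-of", 0.3, "Jose"),
]


def build_index(edges):
    """index[source][edge type][target] = weight"""
    index = defaultdict(lambda: defaultdict(dict))
    for src, label, weight, dst in edges:
        index[src][label][dst] = weight
    return index


def fans_of(edges, target, min_weight=0.0):
    """Sources with a fan-of edge to target of at least min_weight."""
    return sorted(src for src, label, weight, dst in edges
                  if label == "fan-of" and dst == target and weight >= min_weight)


index = build_index(EDGES)
sunita_friends = index["Sunita"]["friend-of"]
strong_fans = fans_of(EDGES, "Facebook", min_weight=0.7)
```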
32
Regular patterns
Pattern: Q = (VQ, EQ, fv, fe)
 fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥
 fe(u,u’): a regular expression of the form
F ::= c | c^k | c^+ | FF
[Figure: pattern with nodes CS, DB, Bio, Bio and edges labeled C (bounded) and S+ (unbounded)]
Simple regular expressions:
 fairly common
 optimizing patterns (checking containment in linear-time)
 low complexity in matching
Mapping edges to paths satisfying associated regular expressions
33
Complexity
 Input: Pattern Q and data graph G
 Output: Q(G)
 O(|V||E| + m|EQ||V|^2 + |VQ||V|), where m is the number of distinct colors in Q
 bounded simulation: a special case
  • single color c (hence m = 1)
  • fe(u,u’) = c
General regular expressions?
Adding edge colors does not incur extra complexity
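To illustrate how a pattern edge with a regular expression F ::= c | c^k | c^+ | FF constrains paths, here is a small Python sketch that computes the nodes reachable from a start node along a path matching F. Assumptions for illustration: F is encoded as a list of (color, max_len) atoms, where max_len is 1 for c, k for c^k (read here as one to k consecutive c-edges, which is an interpretation), and None for c^+; the graph maps each node to a list of (color, target) pairs; the sample edges among Ann, Mat, Pat, Bill, Don are hypothetical.

```python
def step(g, nodes, color):
    """Nodes reachable from `nodes` via one edge of the given color."""
    return {y for x in nodes for (c, y) in g.get(x, ()) if c == color}


def expand(g, start, color, max_len):
    """Nodes reachable from `start` via 1..max_len edges of one color
    (max_len=None means unbounded, i.e. c^+)."""
    reached, frontier, n = set(), set(start), 0
    while frontier and (max_len is None or n < max_len):
        frontier = step(g, frontier, color) - reached
        reached |= frontier
        n += 1
    return reached


def reachable(g, v, regex):
    """Nodes v' such that some path from v to v' matches the atom sequence."""
    frontier = {v}
    for color, max_len in regex:
        frontier = expand(g, frontier, color, max_len)
    return frontier


# Hypothetical edges: S = supervise, C = co-author.
g = {"Ann": [("S", "Mat"), ("C", "Bill")],
     "Mat": [("S", "Pat")],
     "Pat": [("C", "Don")]}
s_plus = reachable(g, "Ann", [("S", None)])               # S+
s_one = reachable(g, "Ann", [("S", 1)])                   # S
s_plus_c = reachable(g, "Ann", [("S", None), ("C", 1)])   # S+ C
```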
34
Graph pattern matching:
• Capturing graph topology
35
Limitations of graph simulation
pattern
graph
 A disconnected graph matches a connected pattern
 The yellow node in the pattern has 3 “parents”, in contrast to 1
in the data graph
 An undirected cycle matches a tree
Simulation does not preserve the topology in matching
36
Limitations of graph simulation
pattern
graph
 A cycle with two nodes matches a cycle of unbounded length
 The match relation may be excessively large
When social distances increase, the closeness of relationships decreases
The need for revising simulation to enforce locality
37
Dual simulation
G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via dual simulation, if there exists a binary relation S ⊆ VQ × V such that S
 is a total mapping,
 satisfies search conditions, and
 preserves both “child” and “parent” relationships: for each (u,v) ∈ S,
  • each (u,u’) in EQ is mapped to an edge (v, v’) in G, (u’, v’) ∈ S
  • each (u’, u) in EQ is mapped to an edge (v’, v) in G, (u’, v’) ∈ S
 Q(G): a unique maximum match relation
Preserve “parent” relationships and connectivity
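A naive Python sketch of dual simulation, assuming node labels as the search condition and the dictionary encoding used for illustration (node to set of successors, node to label); the function names are hypothetical. The only change from plain simulation is the extra parent check.

```python
def reverse(adj):
    """Predecessor sets for an adjacency dictionary."""
    pred = {n: set() for n in adj}
    for n, succs in adj.items():
        for m in succs:
            pred[m].add(n)
    return pred


def dual_simulation(q_adj, q_label, g_adj, g_label):
    q_pred, g_pred = reverse(q_adj), reverse(g_adj)
    sim = {u: {v for v in g_adj if g_label[v] == q_label[u]} for u in q_adj}
    changed = True
    while changed:
        changed = False
        for u in q_adj:
            for v in list(sim[u]):
                # child relationships: each (u, u') needs an edge (v, v')
                child_ok = all(g_adj[v] & sim[c] for c in q_adj[u])
                # parent relationships: each (u', u) needs an edge (v', v)
                parent_ok = all(g_pred[v] & sim[p] for p in q_pred[u])
                if not (child_ok and parent_ok):
                    sim[u].discard(v)
                    changed = True
    if any(not s for s in sim.values()):
        return {u: set() for u in q_adj}
    return sim


# Hypothetical example: pattern node B has two parents, A and C.
q_adj = {"A": {"B"}, "C": {"B"}, "B": set()}
q_label = {"A": "A", "B": "B", "C": "C"}
g_adj = {"a1": {"b1"}, "c1": {"b2"}, "a2": {"b3"}, "c2": {"b3"},
         "b1": set(), "b2": set(), "b3": set()}
g_label = {n: n[0].upper() for n in g_adj}
result = dual_simulation(q_adj, q_label, g_adj, g_label)
```

Plain graph simulation would keep b1 and b2 as matches of B in this example; dual simulation removes them because each lacks one of the two required parents, and then a1 and c1 lose their only child match.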
38
Locality
 diameter dQ: the maximum shortest distance (undirected paths) between any two nodes in Q
 dQ-radius subgraph G[v, dQ]: centered at v, within dQ hops
[Figure: subgraph centered at v with hop distances 1 and 2 marked; an excessive match is highlighted]
Locality: matches contained in G[v, dQ] for some v
39
Strong simulation
 G matches Q via strong simulation, if there exists a node v in G
such that G[v, dQ] matches Q via dual simulation
– duality
– local
 Match: the subgraph GS of G[v, dQ] representing the maximum
match S
for each (u,v) in the maximum match S,
 v is in GS
 for each edge (u,u’ ) in Q, (v, v’ ) is in GS
if (u’,v’ )∈ S
Matching: given Q and G, find the set Q(G) of all matches
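A Python sketch of ball-by-ball strong simulation under stated assumptions: Q is connected, labels serve as the search condition, graphs are dictionaries from node to successor set, and all function names are hypothetical. For each center v it runs dual simulation on G[v, dQ] and, if v itself is matched, reports the connected part of the match graph that contains v. The connected-component step is an assumption about how a match subgraph is extracted, not something stated on the slide.

```python
from collections import deque


def reverse(adj):
    pred = {n: set() for n in adj}
    for n, succs in adj.items():
        for m in succs:
            pred[m].add(n)
    return pred


def dual_simulation(q_adj, q_label, g_adj, g_label):
    q_pred, g_pred = reverse(q_adj), reverse(g_adj)
    sim = {u: {v for v in g_adj if g_label[v] == q_label[u]} for u in q_adj}
    changed = True
    while changed:
        changed = False
        for u in q_adj:
            for v in list(sim[u]):
                ok = (all(g_adj[v] & sim[c] for c in q_adj[u]) and
                      all(g_pred[v] & sim[p] for p in q_pred[u]))
                if not ok:
                    sim[u].discard(v)
                    changed = True
    return sim if all(sim.values()) else {u: set() for u in q_adj}


def hops(adj, pred, src):
    """Undirected BFS distances from src."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        n = queue.popleft()
        for m in adj[n] | pred[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist


def diameter(adj):
    pred = reverse(adj)
    return max(max(hops(adj, pred, n).values()) for n in adj)


def strong_simulation(q_adj, q_label, g_adj, g_label):
    """Map each center v with a match to the node set of its match subgraph."""
    d_q, g_pred = diameter(q_adj), reverse(g_adj)
    matches = {}
    for center in g_adj:
        ball = {n for n, d in hops(g_adj, g_pred, center).items() if d <= d_q}
        ball_adj = {n: g_adj[n] & ball for n in ball}       # G[v, dQ]
        sim = dual_simulation(q_adj, q_label, ball_adj, g_label)
        matched = set().union(*sim.values())
        if center not in matched:
            continue
        # Match graph: edges of the ball that realize some pattern edge.
        edges = {n: set() for n in matched}
        for u in q_adj:
            for u2 in q_adj[u]:
                for v in sim[u]:
                    edges[v] |= ball_adj[v] & sim[u2]
        matches[center] = set(hops(edges, reverse(edges), center))
    return matches


# Hypothetical example: a 2-cycle pattern; G has a 2-cycle and a 6-cycle.
q_adj = {"A": {"B"}, "B": {"A"}}
q_label = {"A": "A", "B": "B"}
g_adj = {"a1": {"b1"}, "b1": {"a1"},
         "a2": {"b2"}, "b2": {"a3"}, "a3": {"b3"},
         "b3": {"a4"}, "a4": {"b4"}, "b4": {"a2"}}
g_label = {n: n[0].upper() for n in g_adj}
matches = strong_simulation(q_adj, q_label, g_adj, g_label)
```

In this example only the 2-cycle {a1, b1} is reported: inside any ball of radius dQ = 1 on the 6-cycle, the boundary nodes lose a required parent or child, so the long cycle is no longer a match.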
40
Preserving the topology of patterns
 Child and parent relationships
 connectivity: if Q is connected (via undirected path), so is GS
 cycles: a directed (resp. undirected) cycle in Q matches a
directed (resp. undirected) cycle in GS
 bounded matches:
– the diameter of GS is at most 2 * dQ
– |M(Q, G)| ≤ |V|
What about graph simulation?
41
Strong simulation vs. graph simulation
Hierarchy (each notion implies the next):
• G matches Q via subgraph isomorphism
• G matches Q via strong simulation
• G matches Q via dual simulation (preserves topology, but not bounded match)
• G matches Q via graph simulation (does not preserve parents, connectivity, undirected cycles, bounded match)
Complexity of strong simulation
 Input: Pattern Q and data graph G
 Output: Q(G)
 cubic time: O(|V| (|V| + (|VQ| + |EQ|)(|V| + |E|)))
A balance between the complexity and the ability to preserve topology
42
Making strong simulation stronger?
 Bounded cycles
If G matches Q, then the longest
simple cycle in G is no longer
than its counterpart in Q
 Bisimulation instead of simulation: find all subgraphs that are
bisimilar to a pattern
for each (u,v)∈ S,
 each (u,u’ ) in EQ is mapped to an edge
(v, v’ ) in Gs, (u’,v’ )∈ S
 each edge (v, v’ ) in Gs is mapped to an
edge (u,u’ ) in EQ, (u’, v’ )∈ S
Both extensions take matching from PTIME to intractable
43
Summing up
44
Various notions for graph pattern matching
matching                 complexity        |M(Q, G)|
subgraph isomorphism     NP-complete       |V|^|VQ|
graph simulation         quadratic time    |V| |VQ|
bounded simulation       cubic time        |V| |VQ|
regular matching         cubic time        |V| |VQ|
strong simulation        cubic time        |V|
Query driven approximation: from subgraph isomorphism (intractable)
to strong simulation or bounded simulation (cubic-time)
45
Summary
 Graph pattern matching
– Subgraph isomorphism
– Graph simulation
– Bounded simulation
– Regular matching
– Strong simulation
– ...
A uniform framework for these
 Querying both topology and data content
• What query language should we use for social data analysis?
• Strike a balance between the expressivity and complexity
Reading: W. Fan. Graph Pattern Matching Revised for Social Network
Analysis, ICDT 2012. (survey of graph pattern matching)
The study has raised as many questions as it has answered
46
Summary and review
 What is subgraph isomorphism? Complexity? Algorithm? Name
a few applications
 What is graph simulation? Complexity? Understand its
algorithm. Name a few applications

 Why do we need to revise conventional graph pattern matching
for social network analysis? How should we do it? Why?
 Understand bounded simulation. Read its algorithm.
Complexity?
 What is strong simulation? Complexity? Name a few
applications in which strong simulation is useful.
 Find other revisions of conventional graph pattern matching that
are not covered in the lecture.
47
Project (1)
Recall bounded graph simulation
 Implement an algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via bounded simulation
 Develop optimization strategies
 Experimentally evaluate your algorithm, especially its scalability with the size of G
 Write a survey on revisions of conventional graph simulation, as related work
A development project
48
Project (2)
Recall graph simulation
 Develop a MapReduce algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via graph simulation
 Develop optimization strategies
 Experimentally evaluate your algorithm, especially its scalability with the size of G
 Write a survey on revisions of conventional graph simulation, as part of the related work
A research and development project
49
Project (3)
Recall subgraph isomorphism
 Develop two algorithms that, given a pattern Q and a graph G, compute the matches of Q in G via subgraph isomorphism, in
  • MapReduce (see Lecture 4)
  • BSP (see Lecture 5)
 Develop optimization strategies to reduce parallel computational cost and data shipment cost
 Experimentally evaluate your algorithms, especially their scalability with the size of G
 Write a survey on parallel algorithms for subgraph isomorphism
A development project
50
Papers for you to review
•
M. R. Henzinger, T. Henzinger, and P. Kopke. Computing simulations on
finite and infinite graphs. FOCS, 1995.
http://infoscience.epfl.ch/record/99332/files/HenzingerHK95.pdf
•
L. P. Cordella, P. Foggia, C. Sansone, M. Vento. A (Sub)Graph
Isomorphism Algorithm for Matching Large Graphs, IEEE Trans. Pattern
Anal. Mach. Intell. 26, 2004 (search Google scholar)
•
A. Fard, M. U. Nisar, J. A. Miller, L. Ramaswamy. Distributed and
scalable graph pattern matching: models and algorithms. Int. J. Big
Data. http://cobweb.cs.uga.edu/~ar/papers/IJBD_final.pdf
•
W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph pattern matching:
From intractable to polynomial time, VLDB, 2010.
•
W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions
to Graph Reachability and Pattern Queries, ICDE 2011.
•
S. Ma, Y. Cao, W. Fan, J. Huai, T. Wo: Strong simulation: Capturing
topology in graph pattern matching. TODS 39(1): 4, 2014.
51