LLNL Graph X-Ray: Fast Best-Effort Pattern Matching in Large Attributed Graphs Hanghang Tong, Brian Gallagher, Christos Faloutsos, Tina Eliassi-Rad 8/13/2007 KDD 2007, San Jose.


LLNL
Graph X-Ray:
Fast Best-Effort Pattern Matching
in Large Attributed Graphs
Hanghang Tong, Brian Gallagher,
Christos Faloutsos, Tina Eliassi-Rad
8/13/2007
KDD 2007, San Jose
Input: Query Graph (CEO, SEC, Accountant, Manager) and Attributed Data Graph
Output: Matching Subgraph
Terminology: “Conform”
Matching Subgraph conforms to Query Graph
Terminology: “Interception”
[Figure: Matching Subgraph and Query Graph, with four matching nodes and one intermediate node labeled]
Path 12-13-4 is an Interception
Terminology: “Instantiate”
Matching Subgraph Ht
Query Graph Hq
Node 11 instantiates SEC node
Ht instantiates Hq
Roadmap
• Introduction
– Problem Definition
– Motivations
• How to: Graph X-Ray
• Experimental Results
• Conclusion
Motivation: Why Not SQL?
• Case 1: Exact match does not exist
– Q: How to find an approximate answer?
• Case 2: Too many exact matches
– Q: How to rank them?
Motivation: Why Not SQL? (Cont.)
• Case 3: Exact match might not be the best answer
– “Find CEO who has heavy contact with Accountant”
• Q: How to find the right answer?
[Figure: two candidate matches]
Exact match: 1 direct connection
Inexact match: many indirect connections
Motivation: Efficiency
• Why Not Subgraph Isomorphism?
– Polynomial for a fixed-size pattern query
• Q1: How to scale up linearly?
• Q2: … and with a small slope?
Wish List
• Effectiveness
– Both exact match & inexact match
– Ranking among multiple results
– “Best” answer (proximity-based)
• Efficiency
– Scale linearly
– Scale with small slope
G-Ray meets all!
Roadmap
• Introduction
– Problem Definition
– Motivations
• How to: Graph X-Ray
• Experimental Results
• Conclusion
Preliminary: Center-Piece Subgraph [Tong+]
[Figure: original graph with query nodes A, B, C (black: query nodes) and the resulting center-piece subgraph, CePS]
= CePS(A, B, C)
opt. in G-Ray!
Preliminary: Augmented Graph
• Data nodes
– 1, …, 13
• Attribute nodes
–a
[Figure: augmented graph over data nodes 1-13]
Footnote
Aug. Graph is crucial for computation!
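A minimal sketch of how an augmented graph could be built, assuming each attribute value becomes one extra node linked to every data node that carries it. The function name `build_augmented_graph`, the tagged-tuple encoding of attribute nodes, and the toy data are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def build_augmented_graph(edges, attributes):
    """Return an undirected adjacency map over data nodes plus attribute nodes.

    edges:      iterable of (u, v) pairs between data nodes
    attributes: dict mapping each data node to its attribute value
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for node, attr in attributes.items():
        # A tagged tuple keeps attribute nodes distinct from data node ids.
        attr_node = ("attr", attr)
        adj[node].add(attr_node)
        adj[attr_node].add(node)
    return adj

# Hypothetical toy data: four data nodes and three attribute values.
edges = [(1, 2), (2, 3), (3, 4)]
attributes = {1: "CEO", 2: "SEC", 3: "Accountant", 4: "SEC"}
aug = build_augmented_graph(edges, attributes)
print(sorted(aug[("attr", "SEC")]))
```

With attribute nodes in the graph, "closeness to an attribute" can be measured with the same proximity computation that is used between data nodes.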
G-Ray: quick overview (for loop-query)
Step 1: SF
Step 2: NE
Step 3: BR
Step 4: NE
Step 5: BR
Step 6: NE
Step 7: BR
Step 8: BR
SF: Seed-Finder
NE: Neighborhood-Expander
BR: Bridge
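The step sequence above can be reproduced with a short scheduling sketch. It assumes the query graph is walked breadth-first from the seed, that an uninstantiated neighbor triggers NE followed by BR, and that an already instantiated neighbor needs only BR. The function `gray_schedule`, the traversal order, and the edges of the loop query are assumptions made for illustration; the slide lists only the step labels.

```python
from collections import deque

def gray_schedule(query_edges, seed):
    """List the SF / NE / BR steps for a query graph, starting from `seed`."""
    neighbors = {}
    for u, v in query_edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)

    steps = [("SF", seed)]          # Seed-Finder instantiates the first query node
    instantiated = {seed}
    bridged = set()                 # query edges that already have a bridge
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in neighbors.get(u, []):
            edge = frozenset((u, v))
            if edge in bridged:
                continue
            if v not in instantiated:
                steps.append(("NE", v))   # Neighborhood-Expander instantiates v
                instantiated.add(v)
                queue.append(v)
            steps.append(("BR", (u, v)))  # Bridge connects the two matching nodes
            bridged.add(edge)
    return steps

# Hypothetical loop query over the four roles used in the slides.
loop_query = [("SEC", "CEO"), ("SEC", "Manager"),
              ("CEO", "Accountant"), ("Manager", "Accountant")]
schedule = gray_schedule(loop_query, "SEC")
for number, (kind, target) in enumerate(schedule, start=1):
    print(f"Step {number}: {kind} {target}")
```

The last query edge closes the loop between two nodes that are already instantiated, which is why the sequence ends with two BR steps in a row.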
Seed-Finder (SF)
• Q: How to instantiate SEC node?
• A: 11 = CePS( ) [arguments shown graphically on the slide]
[Figure: data graph with nodes 1-13]
Footnote
‘11’ is close to some unknown data nodes for ‘CEO’, ‘Accountant’ and ‘Manager’
Neighborhood-Expander (NE)
• Q: How to instantiate CEO node?
– Step 1 → Step 2?
• A: 12 = CePS(11)
• Footnote:
– Step 3 → Step 4? 7 = CePS(11)
– Step 5 → Step 6? 4 = CePS(7, 12)
[Figure: data graph with nodes 1-13]
Bridge (BR)
• Q: Step 6: NE → Step 7: BR?
• A: Prim-like Alg.
– To maximize
– Should block nodes 11 and 7
• Footnote
– Connection subgraph, or one single path?
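One way to picture the bridge step is a greedy best-first search that grows outward from one matching node until it reaches the other, skipping blocked nodes. The sketch below is a Dijkstra-style search, not the authors' Prim-like algorithm. It assumes the quantity to maximize is the product of edge scores along a single path, with scores in (0, 1]. The function `best_bridge`, the toy graph, and all scores are hypothetical.

```python
import heapq

def best_bridge(adj, source, target, blocked=()):
    """Return (score, path) for the source-to-target path with the largest
    product of edge scores that avoids every node in `blocked`.

    adj maps each node to a dict {neighbor: score}, with scores in (0, 1].
    """
    blocked = set(blocked) - {source, target}
    best = {source: 1.0}
    parent = {source: None}
    heap = [(-1.0, source)]          # negated scores turn heapq into a max-heap
    done = set()
    while heap:
        neg_score, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in adj.get(u, {}).items():
            if v in blocked or v in done:
                continue
            score = -neg_score * w
            if score > best.get(v, 0.0):
                best[v] = score
                parent[v] = u
                heapq.heappush(heap, (-score, v))
    if target not in parent:
        return 0.0, []               # no admissible path
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]

# Hypothetical edge scores on a small graph containing nodes 12, 11, 7, 13, 4.
adj = {
    12: {11: 0.9, 13: 0.5},
    11: {12: 0.9, 7: 0.9},
    7: {11: 0.9, 4: 0.9},
    13: {12: 0.5, 4: 0.5},
    4: {7: 0.9, 13: 0.5},
}
free_score, free_path = best_bridge(adj, 12, 4)
blocked_score, blocked_path = best_bridge(adj, 12, 4, blocked={11, 7})
print(free_path, blocked_path)
```

In this toy graph the unblocked search routes through nodes 11 and 7, which are already matching nodes for other query nodes. Blocking them forces the bridge onto the intermediate node 13, in the spirit of the interception 12-13-4 shown earlier.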
Roadmap
• Introduction
– Problem Definition
– Motivation
• How to: Graph X-Ray
• Experimental Results
• Conclusion
Experimental Results
• Datasets
– DBLP
– Node: author (315k)
– Edge: co-authorship (1,800k)
– Attribute: conference & year (13k)
• KDD-2001, SIGMOD…
Effectiveness: star-query
[Figure: query and result]
Effectiveness: line-query
[Figure: query and result]
Effectiveness: loop-query
[Figure: query and result]
Efficiency
[Figure: Response Time. Average response time (seconds, 0-80) vs. # of edges (0 to 2 x 10^6, ~2 M edges); series: Fast FSGM, Iterative method]
• Scale linearly
• Small slope
• 3-5 Seconds
Roadmap
• Introduction
– Problem Definition
– Motivation
• How to: Graph X-Ray
• Experimental Results
• Conclusion
Conclusion
• Graph X-Ray (G-Ray)
– Best effort pattern match
• in large attributed graphs
– Scale linearly
• with small slope
• More details in Poster Session
– Monday (tonight)
– board number 8
[Figure: example graph with nodes 1-13; labels: G-Ray, X-Ray]
Thank you!
www.cs.cmu.edu/~htong
Backup-slides
Proximity on Graph
a.k.a. relevance, closeness
[Figure: example graph with nodes 1-12]
• Multi-faceted
• Punish long path
• Edge weight
How to:
---- random walk with restart
Random walk with restart
[Figure: example graph with nodes 1-12, each labeled with its proximity score with respect to node 4]
Ranking vector r4
Node 1    0.13
Node 2    0.10
Node 3    0.13
Node 4    0.22
Node 5    0.13
Node 6    0.05
Node 7    0.05
Node 8    0.08
Node 9    0.04
Node 10   0.03
Node 11   0.04
Node 12   0.02
Nearby nodes, higher scores
More red, more relevant
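A ranking vector of this kind can be approximated by simple power iteration. The sketch below assumes an unweighted, undirected graph in which every node has at least one neighbor, and a walker that jumps back to the start node with probability `restart` at each step. The function name, the default restart value, and the toy graph are illustrative assumptions and do not reproduce the 12-node example above.

```python
def random_walk_with_restart(adj, start, restart=0.15, tol=1e-10, max_iter=1000):
    """Return {node: score}, the steady-state visiting probabilities of a
    random walk that restarts at `start` with probability `restart`.

    adj maps each node to a non-empty list of neighbors.
    """
    nodes = sorted(adj)
    scores = {n: 0.0 for n in nodes}
    scores[start] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in nodes}
        nxt[start] = restart                      # mass returned by restarts
        for u in nodes:
            # The remaining mass at u is split evenly among its neighbors.
            share = (1.0 - restart) * scores[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores

# Hypothetical toy graph: a simple path 1 - 2 - 3 - 4.
path_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
r1 = random_walk_with_restart(path_graph, start=1)
for node in sorted(r1):
    print(node, round(r1[node], 3))
```

The scores sum to 1, and a node reachable only through long paths receives little mass, which matches the "punish long path" property listed for proximity.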
How to rank the results
• Our goodness function
– Measure the proximity between any two matching nodes if they are required to be connected (two-way)
– Multiply them together
• In G-Ray, we approximately optimize this goodness function
• If we have multiple matching subgraphs, we can rank them according to this goodness function
How to rank the results
[Figure: matching subgraph with four matching nodes]
Goodness = Prox (12, 4) x Prox (4, 12) x
Prox (7, 4) x Prox (4, 7) x
Prox (11, 7) x Prox (7, 11) x
Prox (12, 11) x Prox (11, 12)
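The product above can be written as a small function. In this sketch, `goodness`, the `prox` callback, the role-to-node mapping for nodes 7 and 4, and every proximity value are hypothetical; only the pairing of nodes follows the formula on the slide.

```python
from math import prod

def goodness(query_edges, match, prox):
    """Multiply the two-way proximities of every pair of matching nodes
    that the query graph requires to be connected."""
    return prod(
        prox(match[a], match[b]) * prox(match[b], match[a])
        for a, b in query_edges
    )

# Query edges and the data nodes that instantiate each query node.
# SEC -> 11 and CEO -> 12 follow the slides; the other two roles are assumed.
query_edges = [("CEO", "Accountant"), ("Manager", "Accountant"),
               ("SEC", "Manager"), ("CEO", "SEC")]
match = {"CEO": 12, "SEC": 11, "Manager": 7, "Accountant": 4}

# Made-up directed proximity values; Prox(i, j) need not equal Prox(j, i).
prox_table = {
    (12, 4): 0.5, (4, 12): 0.25,
    (7, 4): 0.5, (4, 7): 0.5,
    (11, 7): 0.25, (7, 11): 0.5,
    (12, 11): 0.5, (11, 12): 0.5,
}
score = goodness(query_edges, match, lambda i, j: prox_table[(i, j)])
print(score)
```

Because the score is a product, a single weakly connected pair of matching nodes pulls the whole subgraph down, so candidate subgraphs can be ranked by sorting on this value.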