Graph Search: a New Paradigm for Social Computing
Shuai Ma
Graphs are everywhere, and quite a few are huge graphs!
2
Graph Search - Why Bother?
• File systems - 1960s: very simple search functionalities
• Databases - mid 1960s: SQL language
• World Wide Web - 1990s: keyword search engines
• Social networks - late 1990s:
1. Graphs have more expressive power, compared with RDB & XML.
2. Relationships become important for search – Google Knowledge Graph
Graph search is a new paradigm for social computing!
3
Interesting Coincidence!
[Chart: SIGMOD + VLDB + ICDE, per year 2000–2012; vertical axis 0–40; annotation: Social computing & Web 2.0]
DB people started working on graphs at around the same time!
4
Outline
• Application scenarios
• What is graph search?
• Three types of graph search
• Problems and challenges
• Related techniques
• Summary
5
Application Scenarios
6
Application Scenarios
Complex object identification
• Data quality
– Real-life data is often dirty: 1%–5% of business data contains errors
– Dirty data costs US businesses 600 billion dollars each year
– Wrong price data in retail databases alone costs US consumers $2.5
billion annually
– Data cleaning tools deliver an overall business value of more than
“600 million GBP” each year at BT.
• Data cleaning
– Data repairing
– Record matching (a.k.a. object identification, entity resolution, data
deduplication)
• Complex object identification
– Modeling complex objects as graphs
7
Application Scenarios
Software plagiarism detection [13]
• Traditional plagiarism detection tools may not be
applicable for serious software plagiarism problems.
• A new tool based on graph pattern matching
– Represent source code as program dependence graphs [14].
– Use graph pattern matching to detect plagiarism.
8
Application Scenarios
Transport routing [16]
• Graph search is a common practice in transportation networks, due to
the wide application of Location-Based Services.
• Example: Mark, a driver in the U.S., wants to go from Irvine to
Riverside in California.
– If Mark wants to reach Riverside by car in the shortest time, the problem
can be expressed as a shortest path problem. Existing methods then
return the shortest path from Irvine, CA to Riverside, CA, traveling
along State Route 261.
– If Mark drives a truck delivering hazardous materials, he may not be
allowed to cross some bridges or railroad crossings. In this case we
can use a pattern graph containing specific route constraints (such as
regular expressions) to find the optimal transport routes.
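The second case can be illustrated with a minimal sketch, assuming the route constraint is simply a set of forbidden edge labels (a much weaker constraint than the regular expressions mentioned above). The road network, travel times, and labels below are invented for illustration; only the city names come from the example.

```python
import heapq


def constrained_shortest_path(graph, source, target, forbidden_labels):
    """Dijkstra's algorithm that skips edges whose label is forbidden.

    graph: dict mapping node -> list of (neighbor, travel_time, label).
    Returns (total_time, path), or (None, []) if target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w, label in graph.get(u, []):
            if label in forbidden_labels:
                continue  # the constraint rules this edge out
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical toy road network (junction names, times and labels are assumptions).
ROADS = {
    "Irvine": [("JunctionA", 10, "highway"), ("JunctionB", 15, "highway")],
    "JunctionA": [("Riverside", 20, "bridge")],
    "JunctionB": [("Riverside", 25, "highway")],
}

print(constrained_shortest_path(ROADS, "Irvine", "Riverside", set()))
print(constrained_shortest_path(ROADS, "Irvine", "Riverside", {"bridge"}))
```

With no restriction the car takes the faster route over the bridge; forbidding "bridge" edges forces the truck onto the slower route.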
9
Application Scenarios
Recommender systems [13]
• Recommendation has found use in many emerging
applications, such as social matching systems.
• Graph search is a useful tool for recommendations.
– A headhunter wants to find a biologist
(Bio) to help a group of software
engineers (SEs) analyze genetic data.
– To do this, (s)he uses an expertise
recommendation network G, where
▪ a node denotes a person labeled
with expertise, and
▪ an edge indicates recommendation,
e.g., HR1 recommends Bio1, and
AI1 recommends DM1
10
Application Scenarios
Biological data analysis [17]
• A large amount of biological data can be represented by
graphs, and analyzing biological data with graph search
techniques is important.
– “Protein-interaction network (PIN) analysis provides valuable
insight into an organism’s functional organization and evolutionary
behavior.”
– For example, PIN analysis can reveal the topological
properties of a PIN formed by high-confidence human protein interactions
obtained from various public interaction
databases.
11
What is Graph Search?
12
What is Graph Search?
A unified definition [3] (in the name of graph matching):
• Given a pattern graph Gp and a data graph G:
– check whether Gp “matches” G; and
– identify all “matched” subgraphs.
Remarks:
– Two classes of queries:
– Boolean queries (Yes or No)
– Functional queries, which may use Boolean queries as a
subroutine
– Graphs contain a set of nodes and a set of edges, typically with labels
– Pattern graphs are typically small (e.g., 10), but data graphs are
usually huge (e.g., 10^8)
13
What is Graph Search?
Different semantics of “match” imply different “types” of
graph search, including, but not limited to, the following:
• Shortest paths/distances [11]
• Subgraph isomorphism [12]
• Graph homomorphism and its extensions [9]
• Graph simulation and its extensions [7, 8]
• Graph keyword search [2]
• Neighborhood queries [10]
• …
Graph search is a very general concept!
14
Three Types of Graph Search
• Cohesive subgraphs
• Keyword search on graphs
• Graph pattern matching
15
Cohesive Subgraphs
• Cohesive subgroups are subsets of actors among whom
there are relatively strong, direct, intense, frequent or
positive ties [1].
– Different cohesive subgroups are formed according to different
cohesive relations, which are further specified by application
needs.
• Social networks can be represented as graphs, such that
we formalize cohesive subgroups as cohesive subgraphs.
– Correspondingly, the problem of finding cohesive subgraphs in
graphs is referred to as cohesive subgraph search.
16
Cohesive Subgraphs
• Various cohesive subgraphs (clique, n-clan, k-plex, k-core)
Maximal clique: a maximal clique
is a maximal complete subgraph.
• Main issues:
– Cliques can overlap
– Too many or too few cliques
emerge
– The problem is NP-complete
“Padgett's Florentine Families”
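As a rough illustration of maximal clique search, here is a minimal sketch of the basic Bron-Kerbosch enumeration (without pivoting). The slides do not name an algorithm, so this choice is an assumption, and the four-node graph is a made-up example rather than the Florentine families network.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj: dict mapping node -> set of neighbours (symmetric, no self-loops).
    Basic Bron-Kerbosch without pivoting; exponential in the worst case.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))  # nothing can extend r: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
ADJ = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

for clique in maximal_cliques(ADJ):
    print(sorted(clique))
```

The two cliques found here share node c, which is the overlap issue listed above.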
17
Cohesive Subgraphs
• Various cohesive subgraphs (clique, n-clan, k-plex, k-core)
N-clique: an n-clique is a
maximal subgraph in which the
largest distance between any two
nodes is no greater than n.
N-clan: an n-clan is an n-clique in
which the diameter is no greater
than n.
K-core: a k-core is a maximal
subgraph in which the nodal
degree of each node is no
smaller than k.
“Padgett's Florentine Families”
The cohesive relations become gradually looser
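The k-core definition above can be computed by repeatedly peeling off nodes of degree smaller than k. The following is a minimal sketch of that peeling idea on a made-up graph; the function name and the example data are assumptions.

```python
def k_core(adj, k):
    """Return the node set of the k-core of an undirected graph.

    adj: dict mapping node -> set of neighbours (symmetric).
    Repeatedly removes nodes whose remaining degree is below k.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    stack = [v for v in alive if degree[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already removed through another entry on the stack
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return alive


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
ADJ = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(sorted(k_core(ADJ, 2)))
```

Unlike clique enumeration, this runs in time linear in the size of the graph, which reflects the looser cohesion requirement.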
18
Keyword Search on Graphs
• Given a set of keywords and a data graph, the problem is
to determine a group of densely linked nodes in the graph
such that the nodes together
– contain all the keywords, and
– satisfy some structural constraints [2]
Remarks:
1. Different “structure constraints” imply different types of
keyword search.
2. Keyword search is a very simple but user-friendly information
retrieval mechanism.
19
Keyword Search on Graphs
Given keywords: {A, B}
Minimum spanning tree [2]
𝑝1 : {B}
𝑝2 : {C, E}
𝑝3 : {D}
𝑝4 : {A}
𝑝5 : {B, G}
𝑝6 : {A, E}
𝑝7 : {D, F}
20
Keyword Search on Graphs
r-clique [18]
𝑝1 : {B}
𝑝2 : {C, E}
𝑝3 : {D}
𝑝4 : {A}
𝑝5 : {B, G}
𝑝6 : {A, E}
𝑝7 : {D, F}
Without structural constraints in the input, the results require ranking.
The use of particular structural constraints lacks justification.
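A brute-force sketch of r-clique keyword search may help make the idea concrete. It assumes the usual reading of an r-clique: a set of nodes that together cover all query keywords and are pairwise within distance r. The keyword sets are those of the example nodes, but the edges of the figure are not recoverable from the transcript, so the edge list below is invented.

```python
from collections import deque
from itertools import product


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def r_cliques(adj, keywords_of, query, r):
    """Pick one node per keyword; keep choices with pairwise distance <= r."""
    query = sorted(query)
    candidates = [[v for v in adj if kw in keywords_of[v]] for kw in query]
    dist = {v: bfs_distances(adj, v) for group in candidates for v in group}
    answers = set()
    for choice in product(*candidates):
        nodes = frozenset(choice)
        if all(dist[a].get(b, float("inf")) <= r for a in nodes for b in nodes):
            answers.add(nodes)
    return answers


# Keyword sets as on the slide; the edges are an assumption made for this sketch.
KEYWORDS = {
    "p1": {"B"}, "p2": {"C", "E"}, "p3": {"D"}, "p4": {"A"},
    "p5": {"B", "G"}, "p6": {"A", "E"}, "p7": {"D", "F"},
}
EDGES = [("p1", "p2"), ("p2", "p4"), ("p4", "p5"),
         ("p5", "p6"), ("p2", "p3"), ("p6", "p7")]
ADJ = {p: set() for p in KEYWORDS}
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)

for answer in r_cliques(ADJ, KEYWORDS, {"A", "B"}, 1):
    print(sorted(answer))
```

Even this tiny query returns several answers, which is why the results need ranking. The enumeration is exponential in the number of keywords and only meant to show the semantics.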
21
Graph Pattern Matching
• Given two directed graphs G1 (pattern graph) and
G2 (data graph),
– decide whether G1 “matches” G2 (Boolean queries);
– identify “subgraphs” of G2 that match G1
• Matching Semantics
– Traditional: Subgraph Isomorphism
– Emerging applications: Graph Simulation and its extensions, etc.
22
Subgraph Isomorphism
• Given a pattern graph Q and a subgraph Gs of a data graph G
– Q matches Gs if there exists a bijective function f: V_Q → V_Gs such that
▪ for each node u in Q, u and f(u) have the same label, and
▪ (u, u′) is an edge in Q if and only if (f(u), f(u′)) is an edge in Gs
• Strengths:
Keeps the exact structural topology between Q and Gs
• Weaknesses:
The decision problem is NP-complete
May return exponentially many matched subgraphs
In certain scenarios, too restrictive to find matches
These hinder its usability in emerging applications, e.g., social networks
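The definition above can be turned into a plain backtracking search. This is a minimal sketch, not an optimised matcher: it looks for injective, label-preserving mappings that send every pattern edge to a data edge, so the matched subgraph Gs is the image of Q. The pattern and data graphs are invented for illustration.

```python
def subgraph_isomorphisms(q_labels, q_edges, g_labels, g_edges):
    """Yield every injective, label- and edge-preserving mapping from Q into G.

    *_labels: dict node -> label; *_edges: set of directed (src, dst) pairs.
    Plain backtracking; worst-case exponential.
    """
    q_nodes = list(q_labels)

    def consistent(mapping, u, v):
        if q_labels[u] != g_labels[v] or v in mapping.values():
            return False
        trial = {**mapping, u: v}
        # Every pattern edge whose endpoints are both mapped must exist in G.
        for a, b in q_edges:
            if a in trial and b in trial and (trial[a], trial[b]) not in g_edges:
                return False
        return True

    def extend(mapping):
        if len(mapping) == len(q_nodes):
            yield dict(mapping)
            return
        u = q_nodes[len(mapping)]
        for v in g_labels:
            if consistent(mapping, u, v):
                mapping[u] = v
                yield from extend(mapping)
                del mapping[u]

    yield from extend({})


# Hypothetical pattern: a 2-cycle between an A-node and a B-node.
Q_LABELS = {"u": "A", "w": "B"}
Q_EDGES = {("u", "w"), ("w", "u")}
# Hypothetical data graphs: a 2-cycle, and a 4-cycle alternating A and B.
SHORT_LABELS = {"a1": "A", "b1": "B"}
SHORT_EDGES = {("a1", "b1"), ("b1", "a1")}
LONG_LABELS = {"a1": "A", "b1": "B", "a2": "A", "b2": "B"}
LONG_EDGES = {("a1", "b1"), ("b1", "a2"), ("a2", "b2"), ("b2", "a1")}

print(list(subgraph_isomorphisms(Q_LABELS, Q_EDGES, SHORT_LABELS, SHORT_EDGES)))
print(list(subgraph_isomorphisms(Q_LABELS, Q_EDGES, LONG_LABELS, LONG_EDGES)))
```

The 4-cycle yields no match at all, a small instance of the "too restrictive" weakness.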
23
Graph Simulation
• Given a pattern graph Q(Vq, Eq) and a data graph G(V, E), a
binary relation R ⊆ Vq × V is said to be a match if, for each (u, v) ∈ R,
– (1) u and v have the same label; and
– (2) for each edge (u, u′) ∈ Eq, there exists an edge (v, v′) in E such
that (u′, v′) ∈ R.
• Graph G matches pattern Q via graph simulation if there
exists a total match relation M, i.e.,
– for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.
– Intuitively, simulation preserves the labels and the child relationship
of a graph pattern in its match.
– Simulation was initially proposed for the analysis of programs;
simulation and its extensions were recently introduced for social
networks.
Subgraph isomorphism (NP-complete) vs. graph simulation (O(n^2))!
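The maximum match relation can be computed by a simple fixpoint refinement: start from all label-compatible pairs and discard pairs that violate the child condition. The sketch below is a naive version of that idea, not the optimised quadratic-time algorithm, and the example graphs are invented.

```python
def graph_simulation(q_labels, q_edges, g_labels, g_edges):
    """Return the maximum simulation relation as dict: pattern node -> data nodes.

    *_labels: dict node -> label; *_edges: set of directed (src, dst) pairs.
    Returns None when some pattern node has no match (G does not match Q).
    """
    g_succ = {v: set() for v in g_labels}
    for v, w in g_edges:
        g_succ[v].add(w)
    # Start from every label-compatible pair.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v may simulate u only if some child of v simulates u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    if any(not matches for matches in sim.values()):
        return None
    return sim


# Hypothetical pattern: a 2-cycle between an A-node and a B-node.
Q_LABELS = {"u": "A", "w": "B"}
Q_EDGES = {("u", "w"), ("w", "u")}
# Hypothetical data graph: a 4-cycle alternating A and B labels.
LONG_LABELS = {"a1": "A", "b1": "B", "a2": "A", "b2": "B"}
LONG_EDGES = {("a1", "b1"), ("b1", "a2"), ("a2", "b2"), ("b2", "a1")}

print(graph_simulation(Q_LABELS, Q_EDGES, LONG_LABELS, LONG_EDGES))
```

Here the 2-cycle pattern is matched by a 4-cycle, so simulation finds a match where an exact structural match does not exist, at the price of losing the cycle length.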
24
Subgraph Isomorphism
Set up a team to develop a new software product
Graph simulation returns F3, F4 and F5;
Subgraph isomorphism returns empty!
Subgraph isomorphism is too strict for emerging applications
25
Terrorist Collaboration Network
“Those who were trained to fly didn’t know the others.
One group of people did not know the other group.”
(Osama Bin Laden, 2001)
26
Strong Simulation [6]
• Subgraph isomorphism
– Strengths
▪ Keeps (strong) structural topology
– Weaknesses
▪ May return an exponential number of matched subgraphs
▪ Decision problem: NP-complete
▪ In certain scenarios, too restrictive to find sensible matches
• Graph simulation
– Strengths
▪ Solvable in quadratic time
– Weaknesses
▪ Loses structural topology (how much? open question)
▪ Only returns a single matched subgraph
Balance between complexity and the ability to capture topology!
27
Strong Simulation
• Graph simulation loses graph structures
[Figure panels: Disconnected, Tree, Long cycle]
28
Strong Simulation
• Duality (dual simulation)
– Both child and parent relationships
– Simulation considers only child relationships
• Locality
– Restricting matches within a ball
– When social distance increases, the closeness of
relationships decreases and the relationships may become
irrelevant
• The semantics of strong simulation is well defined
– The results are unique
Strong simulation: brings duality and locality into graph simulation
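Duality and locality can each be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not the strong simulation algorithm of [6]: `dual_simulation` adds the parent condition to the fixpoint refinement, `ball` collects the nodes within a given undirected hop distance, and `dual_simulation_in_ball` (a hypothetical helper) evaluates the pattern on one ball only. The radius is left as a parameter, and the further steps a full strong simulation needs (such as extracting the connected match graph per ball) are omitted.

```python
from collections import deque


def dual_simulation(q_labels, q_edges, g_labels, g_edges):
    """Maximum dual simulation: child AND parent relationships must be preserved."""
    succ = {v: set() for v in g_labels}
    pred = {v: set() for v in g_labels}
    for v, w in g_edges:
        succ[v].add(w)
        pred[w].add(v)
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            keep_u = {v for v in sim[u] if succ[v] & sim[u2]}    # child condition
            if keep_u != sim[u]:
                sim[u], changed = keep_u, True
            keep_u2 = {v for v in sim[u2] if pred[v] & sim[u]}   # parent condition
            if keep_u2 != sim[u2]:
                sim[u2], changed = keep_u2, True
    return None if any(not m for m in sim.values()) else sim


def ball(g_labels, g_edges, center, radius):
    """Nodes within `radius` hops of `center`, ignoring edge direction."""
    nbrs = {v: set() for v in g_labels}
    for v, w in g_edges:
        nbrs[v].add(w)
        nbrs[w].add(v)
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue  # do not expand beyond the ball
        for w in nbrs[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)


def dual_simulation_in_ball(q_labels, q_edges, g_labels, g_edges, center, radius):
    """Restrict G to one ball and run dual simulation on that fragment only."""
    nodes = ball(g_labels, g_edges, center, radius)
    sub_labels = {v: g_labels[v] for v in nodes}
    sub_edges = {(v, w) for v, w in g_edges if v in nodes and w in nodes}
    return dual_simulation(q_labels, q_edges, sub_labels, sub_edges)


# Hypothetical example: pattern A -> B; data graph a1 -> b1 plus an isolated B-node b2.
Q_LABELS = {"u": "A", "w": "B"}
Q_EDGES = {("u", "w")}
G_LABELS = {"a1": "A", "b1": "B", "b2": "B"}
G_EDGES = {("a1", "b1")}

print(dual_simulation(Q_LABELS, Q_EDGES, G_LABELS, G_EDGES))
```

Plain graph simulation would accept b2 as a match for the B-node because it only checks children; the parent condition removes it.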
29
Strong Simulation
[Figure: Subgraph Isomorphism, Strong Simulation, Dual Simulation, Graph Simulation]
Topology preservation and bounded matches
30
Strong Simulation
• A new matching model referred to as strong simulation
• A cubic time algorithm
• Three main optimization techniques
– Query minimization
▪ An O(n^2) algorithm
– Dual simulation filtering
▪ First compute the match graph of dual simulation, then project on each ball of the
data graph
– Connectivity pruning
▪ Based on the connectivity theorem
• A distributed algorithm
– Data locality property
– Boundary nodes and radius
Towards revising conventional notions of graph matching
31
Problems and Challenges
32
Problems
Analyses:
Graph search            User-friendliness   Result accuracy
Cohesive subgraphs      (blank)             (blank)
Keyword search          Keywords            Result ranking
Graph pattern matching  Pattern graphs      More accurate (well structure constrained)
Needed: a novel approach that combines the advantages and overcomes the
shortcomings of existing graph search methods.
33
Challenges
Some facts:
– Facebook: over 0.8 billion users; 7.9 new users join per second,
i.e., more than 600 thousand new users every day.
– Twitter: over 0.1 billion users; more than 300 thousand new users
join every day.
– Data are often dirty due to missing and uncertain data [19, 20]
34
Challenges
– Data volumes have reached the order of hundreds of millions.
Graph search must be highly efficient, striking a balance between
performance and accuracy.
– The data are updated all the time, and the amount of data updated
daily reaches the order of hundreds of thousands.
Consider the dynamic changes and temporal characteristics of data.
– As with traditional relational data, the new applications have data
quality problems such as uncertain and missing data.
Solve the data quality problems.
35
Related Techniques
36
Distributed Processing
• Real-life graphs are typically way too large:
– Yahoo! web graph: 14 billion nodes
– Facebook: over 0.8 billion users
It is NOT practical to handle large graphs on single machines
• Real-life graphs are naturally distributed:
– Google, Yahoo! and Facebook have large-scale data centers
Distributed graph processing is inevitable
It is natural to study “distributed graph search”!
37
Distributed Processing
Model of computation [15]:
• A cluster of identical machines (with one acting as coordinator);
• Each machine can directly send an arbitrary number of messages to
another one;
• All machines cooperate through local computations and
message passing.
Complexity measures:
1. Visit times: the maximum number of times a machine is visited (interactions)
2. Makespan: the evaluation completion time (efficiency)
3. Data shipment: the total size of the messages shipped among distinct
machines (network bandwidth consumption)
38
Incremental Techniques
Google Percolator [21]:
• Converted the indexing system to an incremental system
• Reduced the average document processing latency by a
factor of 100
• Processes the same number of documents per day, while
reducing the average age of documents in Google search
results by 50%.
It is a great waste to compute everything from scratch!
39
Data Preprocessing
• Data sampling
– Instead of dealing with entire data graphs, sampling reduces the size
of data graphs and accepts a certain loss of precision.
– The sampling process should ensure that the sampled data reflect
the characteristics and information of the original data graphs
as much as possible.
• Data compression
– Generates small graphs from the original data graphs that preserve
only the information relevant to queries.
– A compression method is specific to a class of queries, so
data graph compression is not universal across all
query applications.
– Examples: reachability queries, neighbor queries
40
Data Preprocessing
• Indexing
– There are three main criteria for measuring the quality of an
indexing method:
▪ The space of a graph index
▪ The construction time of a graph index
▪ The query time with a graph index
• Data partitioning
– Partition a data graph into relatively “small” graphs
– A hash function is a simple approach to random partitioning.
– There are well-established tools, e.g., METIS.
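Hash-based random partitioning can be sketched as follows. This is only an illustration under assumptions: the function name, the use of CRC32 as the stable hash, and the example graph are not from the slides. Counting the edges cut between fragments hints at why dedicated partitioning tools are often preferred over plain hashing.

```python
import zlib


def hash_partition(nodes, edges, num_parts):
    """Assign each node to a fragment by a stable hash of its identifier.

    Returns (part, fragments, cut_edges):
      part       dict node -> fragment index
      fragments  list of node sets, one per fragment
      cut_edges  edges whose endpoints land in different fragments
    """
    # zlib.crc32 is deterministic across runs, unlike Python's built-in hash() on str.
    part = {v: zlib.crc32(str(v).encode("utf-8")) % num_parts for v in nodes}
    fragments = [set() for _ in range(num_parts)]
    for v, p in part.items():
        fragments[p].add(v)
    cut_edges = [(v, w) for v, w in edges if part[v] != part[w]]
    return part, fragments, cut_edges


# Hypothetical example graph: a directed ring over six nodes.
NODES = ["n0", "n1", "n2", "n3", "n4", "n5"]
EDGES = [("n0", "n1"), ("n1", "n2"), ("n2", "n3"),
         ("n3", "n4"), ("n4", "n5"), ("n5", "n0")]

part, fragments, cut = hash_partition(NODES, EDGES, 3)
print([sorted(f) for f in fragments])
print(len(cut), "of", len(EDGES), "edges cross fragments")
```

Every cut edge corresponds to potential message passing between machines, so the number of cut edges is a rough proxy for data shipment.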
41
Summary
42
We have introduced graph search: a new paradigm for social computing
We have discussed the history and applications of graph search
We have introduced and analyzed three types of graph search:
– Cohesive subgraphs
– Keyword search on graphs
– Graph pattern matching
We have pointed out the problems and challenges
We have presented some useful techniques to solve the problems
43
References
[1] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications.
Cambridge University Press, 1994.
[2] C. C. Aggarwal and H. Wang. Managing and Mining Graph Data. Springer, 2010.
[3] Shuai Ma, Yang Cao, Tianyu Wo, and Jinpeng Huai, Social Networks and Graph
Matching. Communications of CCF, 2012.
[4] Shuai Ma, Jia Li, Xudong Liu, and Jinpeng Huai, Graph Search: A New Searching
Approach to the Social Computing Era. Communications of CCF, 2012.
[5] Wenfei Fan, Graph Pattern Matching Revised for Social Network Analysis. ICDT 2012.
[6] Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, and Tianyu Wo, Capturing Topology in
Graph Pattern Matching. VLDB 2012.
[7] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu, Adding Regular
Expressions to Graph Reachability and Pattern Queries. ICDE 2011.
[8] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu, Graph Pattern
Matching: From Intractable to Polynomial Time. VLDB 2010.
[9] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu, Graph
Homomorphism Revisited for Graph Matching. VLDB 2010.
[10] Hossein Maserrat and Jian Pei, Neighbor query friendly compression of social networks.
KDD 2010.
[11] Rice, M. and Tsotras, V.J., Graph indexing of road networks for shortest path queries
with label restrictions. VLDB 2010.
44
References
[12] Brian Gallagher, Matching structure and semantics: A survey on graph-based pattern
matching. AAAI FS. 2006.
[13] Chao Liu, Chen Chen, Jiawei Han and Philip S. Yu, GPLAG: detection of software
plagiarism by program dependence graph analysis. KDD 2006.
[14] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its
use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, 1987.
[15] Shuai Ma, Yang Cao, Jinpeng Huai, and Tianyu Wo, Distributed Graph Pattern
Matching, WWW 2012.
[16] Rice, M. and Tsotras, V.J., Graph indexing of road networks for shortest path queries
with label restrictions. VLDB 2010.
[17] David A. Bader and Kamesh Madduri, A graph-theoretic analysis of the human protein-interaction network using multicore parallel algorithms. Parallel Computing 2008.
[18] Mehdi Kargar, Aijun An: Keyword Search in Graphs: Finding r-cliques. In VLDB
Conference, 2011.
[19] Eytan Adar and Christopher Re, Managing Uncertainty in Social Networks, IEEE Data
Eng. Bull., pp.15-22, 30(2), 2007.
[20] Gueorgi Kossinets, Effects of missing data in social networks. Social Networks 28:247–268, 2006.
[21] Daniel Peng, Frank Dabek: Large-scale Incremental Processing Using Distributed
Transactions and Notifications. OSDI 2010.
45
Book Recommendation
46
Databases and Logic
47
Computational Complexity
48
Algorithms
49
Formal Languages
50
Statistics and Social Networks
51
Graph Theory
52
Acknowledgement:
Yang Cao, Wenfei Fan, Kaiyu Feng, Jinpeng Huai, Jia Li, Jianzhong Li,
Xudong Liu, Nan Tang, Tianyu Wo, Yinghui Wu, …
Homepage: http://mashuai.buaa.edu.cn
Email: [email protected]
Address: Room G1122,
New Main Building,
Beihang University
53