Mining, Indexing and
Searching Graphs in
Biological Databases
Jiawei Han
Department of Computer Science
&
Institute of Genomic Biology
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
In collaboration with Xifeng Yan (UIUC Ph.D.’06 and
IBM Watson), Philip S. Yu (IBM Watson), et al.
(Core material for tutorials at ICDM’05 & KDD’06)
July 27, 2016
References: “Covering” Five Papers

X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, Proc. 2002
Int. Conf. on Data Mining (ICDM'02) (Google Scholar: ranked #3 out of 83,800 entries
on “Graph Pattern Mining” on July 27, 2016)

X. Yan and J. Han, CloseGraph: Mining Closed Frequent Graph Patterns, Proc.
2003 ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03) (Google
Scholar: ranked #1 out of 83,800 entries on “Graph Pattern Mining” on July 27, 2016)

X. Yan, P. S. Yu, and J. Han, Graph Indexing: A Frequent Structure-based
Approach, Proc. 2004 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'04)
(invited to TODS and published 2005, Google Scholar: ranked #1 out of 39,300 entries
on “Graph Indexing” on July 27, 2016)

X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”,
Proc. 2005 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'05) (invited and
published in ACM TODS’06)

H. Hu, X. Yan, H. Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs
across Massive Biological Networks for Functional Discovery”, Proc. 2005 Int.
Conf. Intelligent Systems for Molecular Biology (ISMB'05) (Also in Bioinformatics, 2005)
Graph, Graph, Everywhere
[Figures: Aspirin; Yeast protein interaction network; An Internet Web; Co-author network]
from H. Jeong et al., Nature 411, 41 (2001)
Why Graph Mining and Searching?

Graphs are ubiquitous
  Chemical compounds (Cheminformatics)
  Protein structures, biological pathways/networks (Bioinformatics)
  Program control flow, traffic flow, and workflow analysis
  XML databases, Web, and social network analysis

Graph is a general model
  Trees, lattices, sequences, and items are degenerate graphs

Diversity of graphs
  Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)

Complexity of algorithms: many problems are of high complexity!
Outline

Mining frequent graph patterns

Graph indexing methods

Similarity search in graph databases

Biological network analysis

Some recent progress on graph mining
Graph Pattern Mining

Frequent subgraphs
  A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold

Applications of graph pattern mining
  Mining biochemical structures
  Program control flow analysis
  Mining XML structures or Web communities
  Building blocks for graph classification, clustering, comparison, and correlation analysis
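The definition of support can be made concrete with a small sketch. The graph representation (a label dictionary plus a set of undirected edges), the helper names, and the three-graph toy dataset below are all assumptions made for illustration; the containment test is brute force and only meant for tiny examples.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Hypothetical representation: (vertex -> label, set of undirected edges)."""
    return labels, {frozenset(e) for e in edges}

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (not necessarily induced) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_vertices = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        labels_ok = all(p_labels[v] == g_labels[mapping[v]] for v in p_vertices)
        edges_ok = all(frozenset(mapping[v] for v in e) in g_edges for e in p_edges)
        if labels_ok and edges_ok:
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

def is_frequent(dataset, pattern, min_support):
    return support(dataset, pattern) >= min_support

# Toy dataset (illustrative only): C-C-O, C-C-N, C-O
dataset = [
    make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    make_graph({0: "C", 1: "C", 2: "N"}, [(0, 1), (1, 2)]),
    make_graph({0: "C", 1: "O"}, [(0, 1)]),
]
cc = make_graph({0: "C", 1: "C"}, [(0, 1)])
cn = make_graph({0: "C", 1: "N"}, [(0, 1)])
```

With a minimum support of 2, the C-C pattern is frequent in this toy dataset while C-N is not.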
Example: Frequent Subgraphs
[Figure: a graph dataset of three graphs (A), (B), (C) and two frequent patterns (1), (2); min support is 2]
Frequent Subgraph Mining Approaches


Apriori-based approach

AGM/AcGM: Inokuchi, et al. (PKDD’00)

FSG: Kuramochi and Karypis (ICDM’01)

PATH: Vanetik and Gudes (ICDM’02, ICDM’04)

FFSM: Huan, et al. (ICDM’03)
Pattern growth-based approach

MoFa, Borgelt and Berthold (ICDM’02)

gSpan: Yan and Han (ICDM’02)

Gaston: Nijssen and Kok (KDD’04)
Properties of Graph Mining Algorithms

Search order
  breadth vs. depth

Generation of candidate subgraphs
  apriori vs. pattern growth

Elimination of duplicate subgraphs
  passive vs. active

Support calculation
  embedding store or not

Discover order of patterns
  path → tree → graph
Apriori-Based Approach
[Figure: k-edge graphs are JOINed to produce (k+1)-edge candidate graphs; node labels G, G’, G’’, G1, G2, …, Gn]
Pattern Growth-Based Span and Pruning
[Figure: search tree growing patterns from 1-edge to 2-edge to 3-edge graphs; a duplicate occurrence of G1 is marked PRUNED]
If redundant, prune it!
gSpan (Yan and Han ICDM’02)
Right-Most Extension
Theorem (Completeness): The enumeration of graphs using right-most extension is COMPLETE
DFS Code

Flatten a graph into a sequence using depth-first search

[Figure: a graph with vertices 0, 1, 2, 3, 4 and its DFS code]
e0: (0,1)
e1: (1,2)
e2: (2,0)
e3: (2,3)
e4: (3,1)
e5: (2,4)
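A minimal sketch of how such an edge sequence can be produced, assuming an undirected graph given as an adjacency dictionary (the function name and the representation are illustrative, not taken from the gSpan paper). It emits one DFS code for a fixed neighbour order; it does not search for the minimum DFS code that gSpan uses as a canonical label, and it ignores vertex and edge labels.

```python
def dfs_code(adjacency, start):
    """Return a list of (i, j) edges over DFS discovery indices.

    Forward edges are emitted when a vertex is discovered; backward edges
    from the new vertex to its already-discovered ancestors follow at once.
    """
    index = {}    # vertex -> discovery index
    parent = {}   # vertex -> DFS tree parent
    code = []

    def visit(v):
        index[v] = len(index)
        # Backward edges: neighbours discovered earlier, excluding the parent.
        back = sorted(
            index[w] for w in adjacency[v]
            if w in index and w != v and w != parent.get(v)
        )
        code.extend((index[v], j) for j in back)
        for w in adjacency[v]:
            if w not in index:
                parent[w] = v
                code.append((index[v], len(index)))  # forward edge to new vertex
                visit(w)

    visit(start)
    return code

# The six-edge graph from the slide, with neighbours listed in increasing order.
slide_graph = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1, 3, 4],
    3: [1, 2],
    4: [2],
}
slide_code = dfs_code(slide_graph, 0)
```

For this neighbour order the result reproduces the sequence e0 to e5 listed on the slide.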
Graph Pattern Explosion Problem

If a graph is frequent, all of its subgraphs are
frequent ─ the Apriori property

An n-edge frequent graph may have 2^n subgraphs

Among 422 chemical compounds which are
confirmed to be active in an AIDS antiviral
screen dataset, there are 1,000,000 frequent
graph patterns if the minimum support is 5%
Closed Frequent Graphs

Motivation: Handling graph pattern explosion problem

Closed frequent graph

A frequent graph G is closed if there exists no
supergraph of G that carries the same support as G

If some of G’s subgraphs have the same support, it is
unnecessary to output these subgraphs (nonclosed
graphs)

Lossless compression: still ensures that the mining result
is complete
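The closedness test can be sketched as a filter over mined patterns. As a simplifying assumption, each pattern here is a frozenset of named edges over a shared vertex naming, so "supergraph" reduces to "proper superset of edges"; general labeled graphs would need a subgraph isomorphism test instead. The function name and the toy supports are illustrative.

```python
def closed_patterns(supports):
    """Keep patterns with no proper super-pattern of equal support.

    supports: dict mapping frozenset-of-edges -> support count.
    """
    closed = {}
    for pattern, count in supports.items():
        absorbed = any(
            pattern < other and count == other_count
            for other, other_count in supports.items()
        )
        if not absorbed:
            closed[pattern] = count
    return closed

# Toy supports (illustrative only).
ab = frozenset({"a-b"})
bc = frozenset({"b-c"})
abc = frozenset({"a-b", "b-c"})
toy = {ab: 3, bc: 2, abc: 2}
toy_closed = closed_patterns(toy)
```

Here the pattern {b-c} is dropped because its supergraph {a-b, b-c} has the same support, while {a-b} survives because its support is strictly higher.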
CLOSEGRAPH (Yan & Han, KDD’03)
A Pattern-Growth Approach
[Figure: a k-edge graph G is extended to (k+1)-edge graphs G1, G2, …, Gn]
Under what condition can we stop searching their children, i.e., terminate early?
If G and G’ are frequent and G is a subgraph of G’, and if G’ also occurs wherever G occurs in the graphs of the dataset, then we need not grow G, since none of G’s children will be closed except those of G’.
Experimental Result

The AIDS antiviral screen compound dataset
from NCI/NIH

The dataset contains 43,905 chemical
compounds

Among these 43,905 compounds, 423 belong to class CA, 1,081 to class CM, and the remaining are in class CI
Discovered Patterns
[Figure: patterns discovered at minimum support 20%, 10%, and 5%]
Number of Patterns: Frequent vs. Closed
[Plot, CA: number of patterns (log scale, 1.0E+02 to 1.0E+06) vs. minimum support (0.05 to 0.1), for frequent graphs and closed frequent graphs]
Runtime: Frequent vs. Closed
[Plot, CA: runtime in seconds (log scale, 1 to 10000) vs. minimum support (0.05 to 0.1), for FSG, gSpan, and CloseGraph]
Do the Odds Beat the Curse of Complexity?

Potentially exponential number of frequent patterns

The worst case complexity vs. the expected probability

Ex.: Suppose Walmart has 10^4 kinds of products

The chance to pick up one product: 10^-4

The chance to pick up a particular set of 10 products: 10^-40

What is the chance for this particular set of 10 products to be frequent 10^3 times in 10^9 transactions?

Have we solved the NP-hard problem of subgraph isomorphism testing?

No. But the real graphs in bio/chemistry are not so bad

A carbon has only 4 bonds and most proteins in a network have distinct labels
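The Walmart arithmetic can be checked with exact fractions. This sketch assumes, as the slide's numbers imply, that each of the 10 products is picked independently with probability 1 in 10,000; the variable names are illustrative.

```python
from fractions import Fraction

p_item = Fraction(1, 10**4)        # chance to pick one particular product
p_set = p_item ** 10               # chance to pick a particular set of 10 products
transactions = 10**9
expected_hits = transactions * p_set   # expected number of transactions containing the set
```

Under this independence assumption the expected count is 10^-31, vastly below the 10^3 occurrences needed, which is the sense in which the odds work against the worst case.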
Outline

Mining frequent graph patterns

Graph indexing methods

Similarity search in graph databases

Biological network analysis

Some recent progress on graph mining
Graph Search: Querying Graph Databases

Querying graph databases:
  Given a graph database and a query graph, find all graphs containing this query graph
[Figure: a query graph and a graph database]
Scalability Issue

Sequential scan
  Disk I/O
  Subgraph isomorphism testing

An indexing mechanism is needed
  DayLight: Daylight.com (commercial)
  GraphGrep: Dennis Shasha, et al. PODS'02
  Grace: Srinath Srinivasa, et al. ICDE'03

[Figure: a query graph and a sample database of graphs (a), (b), (c)]
Indexing Strategy
[Figure: query graph (Q), graph (G), and a substructure]
If graph G contains query graph Q, G should contain any substructure of Q
Remarks
  Index substructures of a query graph to prune graphs that do not contain these substructures
Framework

Two steps in processing graph queries

Step 1. Index Construction
  Enumerate structures in the graph database, build an inverted index between structures and graphs

Step 2. Query Processing
  Enumerate structures in the query graph
  Calculate the candidate graphs containing these structures
  Prune the false positive answers by performing subgraph isomorphism test
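The two steps can be sketched as follows. The feature extractor (label pairs of edges), the graph representation, and the `verify` callback standing in for a subgraph isomorphism test are all assumptions for illustration; a real index would enumerate richer structures.

```python
from collections import defaultdict

def edge_features(labels, edges):
    """Hypothetical feature extractor: sorted label pairs of the edges."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def build_index(database):
    """Step 1: inverted index from each feature to the ids of graphs containing it."""
    index = defaultdict(set)
    for gid, (labels, edges) in database.items():
        for feature in edge_features(labels, edges):
            index[feature].add(gid)
    return index

def candidates(index, database, query):
    """Step 2: intersect the posting lists of all features of the query."""
    result = set(database)
    for feature in edge_features(*query):
        result &= index.get(feature, set())
    return result

def answer(index, database, query, verify):
    """Step 2 (last part): remove false positives with a caller-supplied test."""
    return {g for g in candidates(index, database, query) if verify(database[g], query)}

# Toy database (illustrative only): a = C-C-O, b = C-C-N, c = C-O
database = {
    "a": ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    "b": ({0: "C", 1: "C", 2: "N"}, [(0, 1), (1, 2)]),
    "c": ({0: "C", 1: "O"}, [(0, 1)]),
}
index = build_index(database)
query_co = ({0: "C", 1: "O"}, [(0, 1)])
query_cco = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
```

The candidate set is a superset of the true answers, which is why the verification step is still required after the index lookup.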
Cost Analysis
Query Response Time
$T_{index} + |C_q| \times (T_{io} + T_{isomorphism\_testing})$
$T_{index}$: graph index access time
$|C_q|$: size of candidate answer set
$T_{io}$: disk I/O time
$T_{isomorphism\_testing}$: isomorphism testing time
Remark: make $|C_q|$ as small as possible
Path-Based Approach
[Figure: sample database of graphs (a), (b), (c)]
Paths
0-length: C, O, N, S
1-length: C-C, C-O, C-N, C-S, N-N, S-O
2-length: C-C-C, C-O-C, C-N-C, ...
3-length: ...
Build an inverted index between paths and graphs
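Enumerating the label paths of one graph can be sketched like this. The sketch assumes simple paths (no repeated vertex) and treats a path and its reverse as the same feature; both choices, and the function name, are assumptions rather than details taken from GraphGrep.

```python
def label_paths(labels, adjacency, max_length):
    """Label sequences of all simple paths with at most max_length edges."""
    found = set()

    def extend(path):
        seq = tuple(labels[v] for v in path)
        found.add(min(seq, seq[::-1]))   # canonical direction for the path
        if len(path) - 1 < max_length:
            for w in adjacency[path[-1]]:
                if w not in path:
                    extend(path + [w])

    for v in labels:
        extend([v])
    return found

# Toy graph (illustrative only): C - C - O
toy_labels = {0: "C", 1: "C", 2: "O"}
toy_adjacency = {0: [1], 1: [0, 2], 2: [1]}
toy_paths = label_paths(toy_labels, toy_adjacency, 2)
```

Each returned tuple would become a key in the inverted index, with the graph's id added to its posting list.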
Problems of Path-Based Approach
[Figure: sample database of graphs (a), (b), (c), and a query graph]
Only graph (c) contains this query graph. However, if we only index paths: C, C-C, C-C-C, C-C-C-C, we cannot prune graphs (a) and (b).
gIndex: Indexing Graphs by Data Mining

Our methodology on graph index:

Identify frequent structures in the database, i.e., subgraphs that appear quite often in the graph database

Prune redundant frequent structures to maintain a
small set of discriminative structures

Create an inverted index between discriminative
frequent structures and graphs in the database
IDEAS: Indexing with Two Constraints
[Figure: nested sets of structures: all structures (>10^6), frequent (~10^5), discriminative (~10^3)]
Why Discriminative Subgraphs?
[Figure: sample database of graphs (a), (b), (c)]
All graphs contain structures: C, C-C, C-C-C
Why bother indexing these redundant frequent structures?
  Only index structures that provide more information than existing structures
Discriminative Structures

Pinpoint the most useful frequent structures
  Given a set of structures f1, f2, …, fn and a new structure x, we measure the extra indexing power provided by x,
  $P(x \mid f_1, f_2, \ldots, f_n), \quad f_i \subset x.$
  When P is small enough, x is a discriminative structure and should be included in the index

Index discriminative frequent structures only
  Reduce the index size by an order of magnitude
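One way to estimate this conditional probability from posting lists is sketched below: the fraction of graphs containing all sub-structures f_i that also contain x. The estimator, the function name, and the toy posting lists (keyed by illustrative strings) are assumptions; the slide does not spell out how P is computed.

```python
def conditional_containment(index, x, subfeatures):
    """Estimate P(x | f1..fn) as |D_x| / |D_f1 ∩ ... ∩ D_fn|.

    index maps each structure to the set of graph ids containing it.
    """
    covered = set.intersection(*(index[f] for f in subfeatures))
    if not covered:
        return 0.0
    return len(index[x]) / len(covered)

# Toy posting lists (illustrative only).
postings = {
    "C": {1, 2, 3, 4},
    "C-C": {1, 2, 3, 4},
    "C-C-C": {1, 2, 3, 4},
    "ring": {3},
}
p_chain = conditional_containment(postings, "C-C-C", ["C", "C-C"])
p_ring = conditional_containment(postings, "ring", ["C", "C-C", "C-C-C"])
```

In this toy data the chain C-C-C has P = 1.0 and adds no pruning power beyond its sub-structures, whereas the ring has P = 0.25 and would be worth indexing.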
Why Frequent Structures?

We cannot index (or even search) all of substructures
Large structures will likely be indexed well by their substructures
Size-increasing support threshold
[Plot: minimum support threshold (support) increasing with structure size]
Experimental Setting

The AIDS antiviral screen compound dataset from
NCI/NIH, containing 43,905 chemical compounds

Query graphs are randomly extracted from the
dataset.

GraphGrep: maximum length (edges) of paths is
set at 10

gIndex: maximum size (edges) of structures is
set at 10
Experiments: Index Size
[Plot: # of features (0.0E+00 to 1.4E+05) vs. database size (1k to 16k), for Path, Frequent Structure, and Discriminative Frequent Structure]
Experiments: Answer Set Size
[Plot: # of candidates (0 to 140) vs. query size (4 to 24), for GraphGrep, gIndex, and Actual Match]
Experiments: Incremental Maintenance
[Plot: From scratch vs. Incremental over database sizes 2k to 10k; vertical axis ticks from 20 to 80]
Frequent structures are stable to database updating
Index can be built based on a small portion of a graph database, but be used for the whole database
Outline

Mining frequent graph patterns

Graph indexing methods

Similarity search in graph databases

Biological network analysis

Some recent progress on graph mining
Structure Similarity Search
[Figure: chemical compounds (a) caffeine, (b) diurobromine, (c) viagra, and a query graph]
Some “Straightforward” Methods

Method 1: Directly compute the similarity between the graphs in the DB and the query graph
  Sequential scan
  Subgraph similarity computation

Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search
  Costly: If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs
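The figure of 1,140 is the number of ways to choose which 3 of the 20 query edges are missed. A one-line check, with an illustrative function name:

```python
from math import comb

def relaxed_subqueries(num_edges, missed):
    """Number of ways to choose which `missed` edges of the query are dropped."""
    return comb(num_edges, missed)

n_subqueries = relaxed_subqueries(20, 3)   # 20 * 19 * 18 / 6 = 1140
```

The count grows quickly with the number of relaxed edges, which is why issuing one exact search per relaxed subgraph is costly.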
Index: Precise vs. Approximate Search

Precise Search
  Use frequent patterns as indexing features
  Select features in the database space based on their selectivity
  Build the index

Approximate Search
  Hard to build indices covering similar subgraphs: explosive number of subgraphs in databases
  Idea: (1) keep the index structure
        (2) select features in the query space
Substructure Similarity Measure

Query relaxation measure
  The number of edges that can be relabeled or missed; the positions of these edges are not fixed
[Figure: query graph]
Substructure Similarity Measure

Feature-based similarity measure
  Each graph is represented as a feature vector X = {x1, x2, …, xn}
  The similarity is defined by the distance of their corresponding vectors
  Advantages
    Easy to index
    Fast
    Rough measure
Intuition: Feature-Based Similarity Search
[Figure: query (Q), graphs (G1) and (G2), and a substructure; at least one of them should be contained]
If graph G contains the major part of a query graph Q, G should share a number of common features with Q
Given a relaxation ratio, calculate the maximal number of features that can be missed!
Feature-Graph Matrix
(rows: features; columns: graphs in database)

      G1  G2  G3  G4  G5
f1     0   1   0   1   1
f2     0   1   0   0   1
f3     1   0   1   1   1
f4     1   0   0   0   1
f5     0   0   1   1   0

Assume a query graph has 5 features and at most 2 features to miss due to the relaxation threshold
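The filtering implied by this matrix can be sketched as below. The sketch assumes the query's five features are exactly f1 to f5, each occurring once, and counts a feature as missed when the matrix entry is 0; the function and variable names are illustrative.

```python
graphs = ["G1", "G2", "G3", "G4", "G5"]
matrix = {
    "f1": [0, 1, 0, 1, 1],
    "f2": [0, 1, 0, 0, 1],
    "f3": [1, 0, 1, 1, 1],
    "f4": [1, 0, 0, 0, 1],
    "f5": [0, 0, 1, 1, 0],
}

def filter_candidates(matrix, graphs, query_features, max_miss):
    """Keep graphs that miss at most max_miss of the query's features."""
    kept = []
    for col, name in enumerate(graphs):
        missed = sum(1 for f in query_features if matrix[f][col] == 0)
        if missed <= max_miss:
            kept.append(name)
    return kept

survivors = filter_candidates(matrix, graphs, ["f1", "f2", "f3", "f4", "f5"], 2)
```

Under these assumptions G1, G2, and G3 each miss three features and are pruned, leaving G4 and G5 as candidates for verification.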
Edge Relaxation – Feature Misses

If we allow k edges to be relaxed, J is the maximum number of features to be hit by k edges; it becomes the maximum coverage problem
NP-complete
A greedy algorithm exists
$J_{greedy} \ge \left[ 1 - \left( 1 - \frac{1}{k} \right)^{k} \right] \cdot J$
We design a heuristic to refine the bound of feature misses
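A minimal sketch of the greedy algorithm for maximum coverage, assuming each query edge comes with the set of features it hits (the `edge_hits` mapping and both function names are hypothetical). Since the greedy count satisfies the bound above, dividing it by the factor 1 - (1 - 1/k)^k gives an upper bound on J.

```python
def greedy_feature_hits(edge_hits, k):
    """Pick up to k edges greedily, each time maximizing newly hit features.

    edge_hits: dict mapping an edge to the set of features it hits.
    Returns (chosen edges, number of distinct features hit).
    """
    covered = set()
    chosen = []
    remaining = dict(edge_hits)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda e: len(remaining[e] - covered))
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, len(covered)

def greedy_guarantee(k):
    """Approximation factor 1 - (1 - 1/k)^k of the greedy algorithm."""
    return 1 - (1 - 1 / k) ** k

# Toy input (illustrative only).
toy_hits = {"e1": {"f1", "f2"}, "e2": {"f2", "f3"}, "e3": {"f3"}}
toy_chosen, toy_count = greedy_feature_hits(toy_hits, 2)
```

For k = 2 the factor is 0.75, so a greedy count of 3 implies J is at most 4 in this toy case.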
Query Processing Framework

Three steps in processing approximate graph queries
Step 1. Index Construction
  Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database
Framework (cont.)
Step 2. Feature Miss Estimation
  Determine the indexed features belonging to the query graph
  Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J
  On the query graph, not the graph database
Framework (cont.)
Step 3. Query Processing
  Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG – FQ
  If FG – FQ > J, discard G. The remaining graphs constitute a candidate answer set
Performance Study

Database
  Chemical compounds of Anti-Aids Drug from NCI/NIH, randomly select 10,000 compounds

Query
  Randomly select 30 graphs with 16 and 20 edges as query graphs

Competitive algorithms
  Grafil: Graph Filter (our algorithm)
  Edge: use edges only
  All: use all the features
Comparison of the Three Algorithms
[Plot: # of candidates (log scale, 10 to 10000) vs. edge relaxation (1 to 4), for Grafil, Edge, and All]
Outline

Mining frequent graph patterns

Graph indexing methods

Similarity search in graph databases

Biological network analysis

Some recent progress on graph mining
Biological Networks

Protein-protein interaction network
Metabolic network
Transcriptional regulatory network
Co-expression network
Genetic Interaction network
…
Data Mining Across Multiple Networks
[Figure: several networks drawn over the same set of nodes a to k]
Identify Frequent Co-expression Clusters across Multiple Microarray Data Sets
[Figure: multiple microarray data sets (genes g1, g2, … by conditions c1, c2, …, cm), shown alongside networks over nodes a to k]
Our Solution
We develop a novel algorithm, called CODENSE, to mine frequent coherent dense subgraphs.
The target subgraphs have three characteristics:
(1) All edges occur in >= k graphs (frequency)
(2) All edges should exhibit correlated occurrences in the given graph set (coherency)
(3) The subgraph is dense, where density d is higher than a threshold and d = 2m/(n(n-1)) (density)
    m: #edges, n: #nodes
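The density formula d = 2m/(n(n-1)) is the fraction of possible node pairs that are connected. A direct transcription, with an illustrative function name and a guard for graphs with fewer than two nodes (the guard is an assumption, not part of the slide):

```python
def density(num_nodes, num_edges):
    """d = 2m / (n(n-1)) for an undirected simple graph with n nodes and m edges."""
    if num_nodes < 2:
        return 0.0   # assumption: treat trivial graphs as having zero density
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

clique_density = density(4, 6)   # complete graph on 4 nodes
path_density = density(4, 3)     # path on 4 nodes
```

A complete graph reaches density 1.0, while a 4-node path reaches 0.5.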
CODENSE: Mine Coherent Dense Subgraphs
(1) Builds a summary graph by eliminating infrequent edges
[Figure: six graphs G1 to G6 over nodes a to i, and the resulting summary graph Ĝ]
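Step (1) can be sketched as an edge count across the input graphs, assuming all graphs share one node naming (as relation graphs over the same genes do) and are given as lists of undirected edges. The function name, the threshold argument k, and the toy input are illustrative.

```python
from collections import Counter

def summary_graph(graphs, k):
    """Edges that occur in at least k of the input graphs."""
    counts = Counter()
    for edges in graphs:
        # Count each undirected edge at most once per graph.
        counts.update({frozenset(e) for e in edges})
    return {edge for edge, count in counts.items() if count >= k}

# Toy input (illustrative only).
toy_graphs = [
    [("c", "e"), ("c", "f")],
    [("e", "c"), ("a", "b")],
    [("c", "e"), ("c", "f")],
]
toy_summary = summary_graph(toy_graphs, 2)
```

Here c-e occurs three times and c-f twice, so both survive with k = 2, while a-b is eliminated.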
CODENSE: Mine Coherent Dense Subgraphs
(2) Identify dense subgraphs of the summary graph
[Figure: Step 2 applies MODES to the summary graph Ĝ to extract a dense subgraph Sub(Ĝ)]
Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true.
CODENSE: Mine Coherent Dense Subgraphs
(3) Construct the edge occurrence profiles for each dense summary subgraph
[Figure: Step 3 maps the dense summary subgraph Sub(Ĝ) to its edge occurrence profiles]

E     G1  G2  G3  G4  G5  G6
c-e    0   0   1   1   0   1
c-f    0   1   0   1   1   1
c-h    0   0   0   1   1   1
c-i    0   0   1   1   1   0
e-f    0   0   0   1   1   1
…      …   …   …   …   …   …
edge occurrence profiles
CODENSE: Mine Coherent Dense Subgraphs
(4) Builds a second-order graph for each dense summary subgraph

E     G1  G2  G3  G4  G5  G6
c-e    0   0   1   1   1   1
c-f    0   1   0   1   1   1
c-h    0   0   0   1   1   1
c-i    0   0   1   1   1   0
e-f    0   0   0   1   1   1
…      …   …   …   …   …   …
edge occurrence profiles

[Figure: Step 4 turns the edge occurrence profiles into the second-order graph S, whose vertices are the edges g-h, f-i, e-i, h-i, e-g, g-i, e-h, c-h, f-h, c-f, e-f, c-e, c-i]
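A sketch of step (4) using the five profiles shown in the table. The slides do not state which similarity measure links two edges, so the Jaccard similarity and the 0.7 threshold used here are assumptions for illustration only, as are the function and variable names.

```python
from itertools import combinations

def second_order_graph(profiles, min_similarity):
    """Connect two first-order edges when their occurrence profiles are similar.

    profiles: dict mapping an edge name to a 0/1 list over the graph set.
    Similarity is Jaccard over occurrences (an assumption, see lead-in).
    """
    def jaccard(p, q):
        both = sum(a and b for a, b in zip(p, q))
        either = sum(a or b for a, b in zip(p, q))
        return both / either if either else 0.0

    return {
        (e1, e2)
        for e1, e2 in combinations(sorted(profiles), 2)
        if jaccard(profiles[e1], profiles[e2]) >= min_similarity
    }

profiles = {
    "c-e": [0, 0, 1, 1, 1, 1],
    "c-f": [0, 1, 0, 1, 1, 1],
    "c-h": [0, 0, 0, 1, 1, 1],
    "c-i": [0, 0, 1, 1, 1, 0],
    "e-f": [0, 0, 0, 1, 1, 1],
}
second_order = second_order_graph(profiles, 0.7)
```

Edges with identical profiles, such as c-h and e-f, are always linked; pairs that co-occur rarely, such as c-f and c-i, are not.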
CODENSE: Mine Coherent Dense Subgraphs
(5) Identify dense subgraphs of the second-order graph
[Figure: dense subgraphs Sub(S) extracted from the second-order graph S]
Observation: If a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its 2nd-order graph must be dense
CODENSE: Mine Coherent Dense Subgraphs
(6) Identify the coherent dense subgraphs
[Figure: each dense subgraph Sub(S) of the second-order graph is mapped back to a subgraph Sub(G)]
CODENSE: Mine Coherent Dense Subgraphs
[Figure: overview of the pipeline. Step 1: graphs G1 to G6 to summary graph Ĝ; Step 2: MODES (Add/Cut) to Sub(Ĝ); Step 3: edge occurrence profiles; Step 4: second-order graph S; Step 5: MODES to Sub(S); Step 6: restore G and MODES to Sub(G)]
Applying CoDense to 39 Yeast Microarray Data Sets
[Figure: multiple microarray data sets (genes g1, g2, … by conditions c1, c2, …, cm), shown alongside networks over nodes a to k]
Discovery of New Genes Based on Similar Genes
[Figure: network over the genes YDR115W, MRP49, PHB1, MRPL51, PET100, ATP12, ATP17, MRPL37, MRPL38, ACN9, MRPL32, MRPL39, MRPS18, FMC1]
Network of Known Similar Genes
[Figure: network over the genes ATP17, MRP49, MRPL51, PHB1, ATP12, PET100, YDR115W, MRPL38, ACN9, MRPL32, MRPL39, MRPS18, FMC1]
Brown: YDR115W, FMC1, ATP12, MRPL37, MRPS18
GO:0019538 (protein metabolism; p-value = 0.001122)
Network Involved in the New Genes
[Figure: network over the genes YDR115W, MRP49, PHB1, MRPL51, PET100, ATP12, MRPL37, ATP17, MRPL38, ACN9, MRPL32, MRPL39, MRPS18, FMC1]
Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100
GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)
Outline

Mining frequent graph patterns

Graph indexing methods

Similarity search in graph databases

Biological network analysis

Some recent progress on graph mining
Recent Developments: Graph Mining



Colossal pattern mining: F. Zhu, X. Yan, J. Han, P. S. Yu,
and H. Cheng, “Mining Colossal Frequent Patterns by Core
Pattern Fusion”, in Proc. 2007 Int. Conf. on Data
Engineering (ICDE'07), April 2007 (Best student paper
award)
Constraint-based mining: F. Zhu, X. Yan, J. Han, and P. S.
Yu, “gPrune: A Constraint Pushing Framework for Graph
Pattern Mining”, in Proc. 2007 Pacific-Asia Conf. on
Knowledge Discovery and Data Mining (PAKDD'07), May
2007 (Best student paper award)
Approximate graph mining: C. Chen, X. Yan, F. Zhu, and J.
Han, “gApprox: Mining Frequent Approximate Patterns
from a Massive Network”, Proc. 2007 Int. Conf. on Data
Mining (ICDM'07), Oct. 2007
Recent Developments: Graph Mining



Graph-containment indexing: C. Chen, X. Yan, P. S. Yu, J.
Han, D. Zhang, and X. Gu, “Towards Graph Containment
Search and Indexing”, in Proc. 2007 Int. Conf. on Very
Large Data Bases (VLDB'07), Vienna, Austria, Sept. 2007
Pattern-based classification: H. Cheng, X. Yan, J. Han,
and C.-W. Hsu, “Discriminative Frequent Pattern Analysis
for Effective Classification”, in Proc. 2007 Int. Conf. on
Data Engineering (ICDE'07), Istanbul, Turkey, April 2007
DDPMine: H. Cheng, X. Yan, J. Han, and P. S. Yu, "Direct
Discriminative Pattern Mining for Effective Classification",
Proc. 2008 Int. Conf. on Data Engineering (ICDE'08),
Cancun, Mexico, April 2008
Discriminative Frequent Pattern Analysis
for Effective Classification [ICDE’07]
Conclusions

Graph mining has wide applications

Frequent and closed subgraph mining methods
  gSpan and CloseGraph: pattern-growth depth-first search approach

Graph indexing techniques:
  Frequent and discriminative subgraphs as indexing features

Similarity search in graph databases
  Indexing and approximate matching help similar subgraph search

Biological network analysis
  Mining coherent, dense, multiple biological networks

Many new developments along the line of graph pattern mining
Thanks and Questions