Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University of Illinois at Urbana-Champaign November 7,


Generating Semantic Annotations
for Frequent Patterns with
Context Analysis
Qiaozhu Mei, Dong Xin, Hong Cheng,
Jiawei Han, ChengXiang Zhai
University of Illinois at Urbana-Champaign
November 7, 2015
1
Frequent Pattern Mining
([Agrawal & Srikant 94] and many others)
Database:
  A B E
  C D E
  A B F
  C D E F
  A B E F
  ……
Frequent Patterns:
  A, B, C, D, E, F
  AB, AE, AF, BE, BF, CD, CE, DE, EF
  ABE, ABF, CDE
Itemsets: {camera, film}; {diaper, milk}; …
Sequential Patterns: "... Mining Closed Frequent Graph Patterns …"; "… Mining Graph and Structured Patterns in ..."
Subgraph Patterns: …
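The itemset part of this slide can be checked with a brute-force sketch. The five transactions are read off the slide; the support threshold of 2 transactions (40%) is an assumption, chosen because it appears to reproduce the two- and three-item patterns listed. `frequent_itemsets` is a hypothetical helper for illustration, not the Apriori algorithm cited above.

```python
from itertools import combinations

# Toy database read off the slide: five transactions over items A..F.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def frequent_itemsets(database, min_count):
    """Return {itemset_string: support_count} for itemsets meeting min_count."""
    items = sorted(set().union(*database))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            # Support = number of transactions containing every item of combo.
            count = sum(1 for t in database if set(combo) <= t)
            if count >= min_count:
                result["".join(combo)] = count
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, so none larger either
    return result

# min_count=2 is an assumed threshold; the slide does not state one.
FREQUENT = frequent_itemsets(DATABASE, min_count=2)
PAIRS = sorted(p for p in FREQUENT if len(p) == 2)
TRIPLES = sorted(p for p in FREQUENT if len(p) == 3)
print(PAIRS)
print(TRIPLES)
```

Exhaustive enumeration is only workable for a toy alphabet like this one; real miners prune the search space.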
2
Toward Understanding the Patterns
-- Find Canonical Patterns
Database:
  A B E
  C D E
  A B F
  C D E F
  A B E F
  ……
Frequent Patterns:
  A, B, C, D, E, F
  AB, AE, AF, BE, BF, CD, CE, DE, EF
  ABE, ABF, CDE
[Figure: a pattern summary showing items C, D, E, F with values 1.0, 1.0, 0.9, 0.8]
(Yan et al ‘05)
(Xin et al ‘05)
3
Toward Understanding the Patterns
-- How to Interpret Patterns?
diaper
beer
• Do they all make sense?
• What do they mean?
• How are they useful?
female sterile (2) tekele
morphological info. and simple statistics
Semantic Information
Not all frequent patterns are useful, only those with meanings…
Our goal: Annotate patterns with semantic information
4
Challenges
• How can we represent the semantics of a
frequent pattern? (Annotate a pattern with what?)
• How can we infer pattern semantics? (How to
annotate?)
• How can we do it in a general way? (Do it for all
kinds of patterns)
• Once such annotations are generated, what can
we use them for? (Applications)
5
A Dictionary Analogy
Word: “pattern” – from Merriam-Webster
Non-semantic info.
Definitions indicating
semantics
Examples of Usage
Synonyms
Related Words
6
What about a “Pattern Dictionary”?
-- Semantic Pattern Annotation (SPA)
Word: “pattern”
  Non-Semantic: function; pronunciation; date; etc.
  Definitions: A form or model proposed for …
  Related words: original, constellation …
  Examples: a dressmaker’s pattern; a pattern of dissent
  Synonyms: design, device, motif, motive…
Pattern: “latent semantic analysis”
  Non-Semantic: sequential; close; sup = 0.1%
  Context Indicators (CI): “indexing”, “semantic”, “S. Dumais”, “singular value decomposition”, …
  Representative Transactions: index by latent semantic analysis; probablist latent semantic analysis
  Semantically similar Patterns (SSP): “latent semantic indexing”, “LSA”, “PLSA”
7
How Can We Generate Such an Entry?
Database:
  A B E
  C D E
  A B F
  C D E F
  A B E F
Frequent Patterns:
  P1: AB
  P2: CD
  P3: …
  Pn: …
? (from frequent patterns to semantic annotations)
Semantic Annotations:
  Pattern: AB
    Non-semantic: Sup = 60%
    CI: AB, E, F, EF …
    Trans.: ABE; ABEF
    SSPs: CD; …
  Pattern: CD
    …
How to infer the semantics of a frequent pattern?
8
Continue the Analogy…
“You shall know a word by the company it keeps.”
- Firth 1957
Data … association … pattern … MINE … algorithm …
mountain … Africa … diamond … MINE … weight …
You’ll know the meaning of a pattern by its context
Pattern
Context
{A,B}:
{ … Baby, Milk, Diaper, Toy, Soymilk… }
{C,D}:
{ … Printer, Film, Camera, Lens, … }
9
Our Approach: Model the Context
Database:
  A B E
  C D E
  A B F
  C D E F
  A B E F
Frequent Patterns:
  P1: AB    Context Units: <E, F, …, EF, …, ABE>
  P2: CD    Context Units: <E, F, …, EF, …, CDEF>
  …
  Pn: …
Semantic Annotations:
  Pattern: AB
    Non-semantic: Sup = 60%
    CI: AB, E, F, EF
    Trans.: ABE; ABEF
    SSPs: CD; …
  Pattern: CD
    …
Context Units = Objects co-occurring with p
10
Semantic Analysis with Context Models
• Task1: Model the context of a frequent pattern
Based on the Context Model…
• Task2: Extract strongest context indicators
• Task3: Extract representative transactions
• Task4: Extract semantically similar patterns
11
Task1: Context Modeling
- A Vector Space Model
Database:
  A B E
  C D E
  A B F
  C D E F
  A B E F
Frequent Patterns:
  P1: AB
    Context Units:  <E, F, …, EF, …, ABE>
    Context vector: <2.0, 2.0, …, 1.0, …, 1.0>
  P2: CD
    Context Units:  <E, F, …, EF, …, CDEF>
  …
  Pn: …
Context Unit Weight: Co-occurrence; Mutual Information; ……
Context Similarity: Cosine Similarity; Pearson Coefficient; ……
Semantic Annotations:
  Pattern: AB
    Non-semantic: Sup = 60%
    CI: AB, E, F, EF
    Trans.: ABE; ABEF
    SSPs: CD; …
  Pattern: CD
    …
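A minimal sketch of the context vector with co-occurrence weights, using the five transactions from the slide. The helper names and the choice of context units E, F, EF are assumptions for illustration. With these units, pattern AB gets the values 2.0, 2.0, 1.0, which agree with the leading entries of the vector shown on the slide.

```python
# Toy database from the slide: five transactions over items A..F.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def cooccurrence(pattern, unit, database):
    """Count transactions that contain both the pattern and the context unit."""
    needed = set(pattern) | set(unit)
    return sum(1 for t in database if needed <= t)

def context_vector(pattern, units, database):
    """One co-occurrence weight per context unit, in the order of `units`."""
    return [float(cooccurrence(pattern, u, database)) for u in units]

# Assumed context units for this sketch; the slide elides most of them.
UNITS = ["E", "F", "EF"]
AB_VECTOR = context_vector("AB", UNITS, DATABASE)
CD_VECTOR = context_vector("CD", UNITS, DATABASE)
print(AB_VECTOR, CD_VECTOR)
```

Swapping `cooccurrence` for another weighting function, such as mutual information, leaves the rest of the vector construction unchanged.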
12
Context Unit Selection
[Figure: example transactions t1 (diaper, milk, babywear, lotion, …) and t2 (camera, memory stick, printer, …)]
Valid Context Units: single items; itemsets; transactions
In general, Context Units are frequent patterns
13
Context Unit Selection:
Redundancy Removal
• Problem: too many valid context units,
most are redundant
– { Diaper, milk, babywear }: “diaper”, “diaper,
milk”, “milk, babywear”, “milk, lotion”, …
• Solution:
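– use closed patterns
– micro-clustering: (hierarchical, one-pass)
• Jaccard Distance (γ: threshold to stop clustering):
$$D(p_\alpha, p_\beta) = 1 - \frac{|D_\alpha \cap D_\beta|}{|D_\alpha \cup D_\beta|}$$
where D_α and D_β are the sets of transactions containing p_α and p_β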
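The Jaccard distance and a one-pass micro-clustering can be sketched as follows. This is a simple leader-style scheme under stated assumptions: each pattern joins the first cluster whose first member lies within γ, otherwise it starts a new cluster. The procedure used in the talk may differ in detail, and the helper names and γ = 0.2 are hypothetical.

```python
# Toy database from the slides: five transactions over items A..F.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def support_set(pattern, database):
    """Indices of the transactions that contain the pattern."""
    return frozenset(i for i, t in enumerate(database) if set(pattern) <= t)

def jaccard_distance(p1, p2, database):
    """1 - |D1 & D2| / |D1 | D2|, where Di is the support set of pi."""
    d1, d2 = support_set(p1, database), support_set(p2, database)
    union = d1 | d2
    if not union:
        return 0.0  # both patterns absent: treat as identical
    return 1.0 - len(d1 & d2) / len(union)

def one_pass_clusters(patterns, database, gamma):
    """Assign each pattern to the first cluster whose leader is within gamma."""
    clusters = []
    for p in patterns:
        for cluster in clusters:
            if jaccard_distance(p, cluster[0], database) <= gamma:
                cluster.append(p)
                break
        else:
            clusters.append([p])  # no close leader found: start a new cluster
    return clusters

CLUSTERS = one_pass_clusters(["AB", "A", "B", "CD", "C", "EF"], DATABASE, gamma=0.2)
print(CLUSTERS)
```

Keeping one representative per cluster as a context unit removes units such as A and B, which occur in exactly the same transactions as AB.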
14
Task2: Extract Context Indicators
Database:
  A B E
  C D E
  A B F
  C D E F
  A B E F
Frequent Patterns: P1: AB; P2: CD; …; Pn: …
Context Units: <AB, CD, …, EF, …, ABE, …>
Context Unit Weighting (P1: AB): <3.0, 0, …, 2.0, …, 1.0, …>
Ranked context units for AB:
  AB   3.0
  EF   2.0
  ABE  1.0
  …
Semantic Annotations:
  Pattern: AB
    Non-semantic: Sup = 60%
    CI: AB, EF, ABE..
    Trans.: ABE; ABEF
    SSPs: CD; …
  Pattern: CD
    …
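Ranking context units by weight can be sketched with mutual information as the weight. This is an illustrative version without smoothing that skips empty cells of the joint distribution; the function names and the candidate units are assumptions. Note that mutual information also rewards strong negative association, so a unit that never co-occurs with the pattern can score highly.

```python
from math import log2

# Toy database from the slides: five transactions over items A..F.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def contains(transaction, pattern):
    return set(pattern) <= transaction

def mutual_information(pattern, unit, database):
    """MI between the presence of `pattern` and the presence of `unit`."""
    n = len(database)
    mi = 0.0
    for x in (True, False):
        for y in (True, False):
            joint = sum(1 for t in database
                        if contains(t, pattern) == x and contains(t, unit) == y) / n
            px = sum(1 for t in database if contains(t, pattern) == x) / n
            py = sum(1 for t in database if contains(t, unit) == y) / n
            if joint > 0:  # empty cells contribute nothing
                mi += joint * log2(joint / (px * py))
    return mi

def top_indicators(pattern, units, database, k):
    """Return the k context units with the highest weight, best first."""
    scored = [(u, mutual_information(pattern, u, database)) for u in units]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

RANKED = top_indicators("AB", ["F", "E", "A"], DATABASE, k=3)
print(RANKED)
```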
15
Task3: Extract Representative
Transactions
Database:
  T1: A B E
  T2: C D E
  T3: A B F
  T4: C D E F
  T5: A B E F
Context Units: <AB, CD, …, EF, …, ABE, …>
Frequent Patterns:
  P1: AB  <3.0, 0, …, 2.0, …, 1.0>
Transactions:
  T1: <1.0, 0, …, 1.0, …, 1.0>
  …
  T5: …
Semantic Similarity to P1:
  T5  0.8
  T1  0.6
  T3  0.6
  …
Semantic Annotations:
  Pattern: AB
    Non-semantic: Sup = 60%
    CI: AB, E, F, EF
    Trans.: ABEF; ABE
    SSPs: CD; …
  Pattern: CD
    …
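One way to sketch this step: represent each transaction as a binary vector over the same context units as the pattern, then rank transactions by cosine similarity to the pattern's context vector. The unit list, the binary transaction weights, and the helper names are assumptions for illustration, and the scores do not reproduce the illustrative numbers on the slide.

```python
from math import sqrt

# Toy database from the slides, keyed by transaction id.
DATABASE = {"T1": set("ABE"), "T2": set("CDE"), "T3": set("ABF"),
            "T4": set("CDEF"), "T5": set("ABEF")}
UNITS = ["AB", "CD", "E", "F", "EF", "ABE"]  # assumed context units

def pattern_vector(pattern, units, database):
    """Co-occurrence count of the pattern with each context unit."""
    return [float(sum(1 for t in database.values()
                      if (set(pattern) | set(u)) <= t)) for u in units]

def transaction_vector(transaction, units):
    """1.0 where the transaction contains the context unit, else 0.0."""
    return [1.0 if set(u) <= transaction else 0.0 for u in units]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def representative_transactions(pattern, units, database, k):
    """Top-k transactions by cosine similarity to the pattern's context."""
    pv = pattern_vector(pattern, units, database)
    scored = [(tid, cosine(pv, transaction_vector(t, units)))
              for tid, t in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

TOP = representative_transactions("AB", UNITS, DATABASE, k=3)
print(TOP)
```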
16
Task4: Extract Semantically Similar
Patterns
Database:
  A B E
  C D E
  A B F
  C D E F
  A B E F
Context Units: <AB, CD, …, EF, …, ABE, …>
Frequent Patterns:
  P1: AB  <3.0, 0, …, 2.0, …, 1.0>
  P2: CD  <0, 3.0, …, 2.0, …, 0.5>
  …
  Pk: EF
Semantic Similarity to AB:
  CD  0.7
  BF  0.5
  EF  0.3
  …
Semantic Annotations:
  Pattern: AB
    Non-semantic: Sup = 60%
    CI: AB, E, F, EF
    Trans.: ABEF; ABE
    SSPs: CD; …
  Pattern: CD
    …
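Semantically similar patterns can be sketched by comparing context vectors directly: build a co-occurrence vector for every candidate pattern and rank the candidates by cosine similarity to the query pattern. The context units, candidate list, and helper names are assumptions, and the resulting scores and order are those of this toy setup, not the illustrative values on the slide.

```python
from math import sqrt

# Toy database from the slides: five transactions over items A..F.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]
UNITS = ["AB", "CD", "E", "F", "EF"]  # assumed context units

def context_vector(pattern, units, database):
    """Co-occurrence count of the pattern with each context unit."""
    return [float(sum(1 for t in database if (set(pattern) | set(u)) <= t))
            for u in units]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_patterns(query, candidates, units, database):
    """Candidates sorted by context similarity to the query, best first."""
    qv = context_vector(query, units, database)
    scored = [(c, cosine(qv, context_vector(c, units, database)))
              for c in candidates if c != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

SIMILAR = similar_patterns("AB", ["CD", "BF", "EF"], UNITS, DATABASE)
print(SIMILAR)
```

Replacing `cosine` with a Pearson correlation would give the other similarity measure named earlier in the talk.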
17
Experiments
• Three different real world applications
– Annotating DBLP title/authors Patterns
– Motif/Gene-Ontology (GO) matching
– Gene Synonyms extraction
• Study the effectiveness of the proposed
SPA methods
• Explore applications of SPA to different
real world tasks
18
Annotating DBLP Co-authorship
and Title Pattern
Database:
  Authors: X.Yan, P. Yu, J. Han
  Title: Substructure Similarity Search in Graph Databases
  …
Frequent Patterns:
  P1: { x_yan, j_han } (Frequent Itemset)
  P2: “substructure search” (Frequent Sequential Pattern)
Context Units:
  < { p_yu, j_han}, { d_xin }, … , “graph pattern”, … “substructure similarity”, … >
Semantic Annotations:
  Pattern: { x_yan, j_han}
    Non-semantic: Sup = …
    CI: {p_yu}, graph pattern, …
    Trans.: gSpan: graph-base……
    SSPs: { j_wang }, {j_han, p_yu}, …
19
DBLP Results: Frequent Itemset
Pattern = {xifeng_yan, jiawei_han}
Annotations:
Context Indicator (CI):
  graph; {philip_yu}; mine close; graph pattern; index approach; sequential pattern; …
Representative Transactions (Trans):
  > gSpan: graph-base substructure pattern mining;
  > mining close relational graph connect constraint; …
Semantically Similar Patterns (SSP):
  {jiawei_han, philip_yu}; {jian_pei, jiawei_han}; {jiong_yang, philip_yu, wei_wang}; …
20
DBLP Results: Freq. Seq. Pattern
Pattern = “Information … retrieval”
Annotations:
Context Indicator (CI):
  {w_bruce_croft}; web information; full text; {monika_rauch_hezinger}; {james_p_callan}; …
Representative Transactions (Trans):
  > web information retrieval
  > language model information retrieval
Semantically Similar Patterns (SSP):
  information use; web information; probabilistic information; information filter; text information; …
21
Motif-GO Matching
[Figure: Sequences 1 to 3, each containing motifs (motif1 … motif5; motif2 recurs) and annotated with GO terms (GO term 1 … GO term 5); “?” marks the unknown match between a motif and GO terms]
Motif: a subsequence pattern in the sequences
Gene Ontology (GO) terms: annotate the functionality of sequences and motifs
22
Motif-GO Matching (Cont.)
Database:
  Protein Sequence (containing Motif 1) | GO terms: GOTerm1; GOTerm2; GOTerm3
  … | GOTerm3
  … | …
Frequent Patterns:
  P1: Motif1 (Sequential Pattern)
  P2: GOTerm2 (Single Item Pattern)
Context Units:
  < Motif1, Motif3, …, GOTerm1, GOTerm2, … >
Semantic Annotations:
  Pattern: Motif1
    Non-semantic:
    CI: GOTerm1, GOTerm3, …
    Trans.:
    SSPs: GOTerm1, GOTerm2, …
Motif-GO matching: Motif1 → GOTerm1, GOTerm2
23
Motif/GO Matching: Evaluation
• Gold standard generated by human experts
• Measure: Mean reciprocal rank (MRR)
– Reflects ranking accuracy (the higher the better)
– Reciprocal rank = 1/Rank of the correct answer (0.5 means the correct answer is ranked 2nd)
• Results:
MRR by ranking strategy and context unit weight:
  Ranking Strategy      Mutual Information   Co-occurrence
  Random Selection      0.0023               0.0023
  Context Indicators    0.5877               0.6064
  SSPs                  0.4017               0.4681
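The MRR measure can be sketched as below: for each query the reciprocal rank is 1 divided by the position of the first correct answer, or 0 if none is returned, and MRR is the mean over queries. The motif and GO term names in the example are placeholders and the rankings are invented for illustration; they are not the experimental data.

```python
def mean_reciprocal_rank(rankings, gold):
    """rankings: {query: ranked candidates}; gold: {query: set of correct answers}."""
    total = 0.0
    for query, ranked in rankings.items():
        reciprocal = 0.0
        for position, candidate in enumerate(ranked, start=1):
            if candidate in gold[query]:
                reciprocal = 1.0 / position  # first correct answer decides the score
                break
        total += reciprocal
    return total / len(rankings)

# Hypothetical rankings: Motif1's correct term is ranked 2nd, Motif2's is ranked 1st.
RANKINGS = {"Motif1": ["GOTerm2", "GOTerm1"], "Motif2": ["GOTerm3"]}
GOLD = {"Motif1": {"GOTerm1"}, "Motif2": {"GOTerm3"}}
MRR = mean_reciprocal_rank(RANKINGS, GOLD)
print(MRR)
```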
24
Gene Synonym Extraction
• Gene Synonyms:
– A Sequential Pattern in the textual database
  Gene_id: FBgn0001000
  Gene Synonyms: female sterile 2 tekele; fs 2 sz 10; tek; fs 2 tek; tekele; …
– Matching gene synonyms: a challenging and
important new problem in mining biology data
– Analogy: thesaurus or synonyms in dictionary
25
Gene Synonym Extraction (Cont.)
Database (Biomedical Sentences):
  … D. melanogaster gene Female sterile (2) Tekele …
  … Female sterile (2) Tekele, abbreviated as Fs(2)Tek …
  …
Frequent Patterns:
  P1: female sterile (2) tekele (Sequential Pattern)
  P2: Fs(2)Tek (Sequential Pattern)
Context Units:
  < gene, female, …, d. melanogaster gene, … >
  Context units can be single words or sequential patterns
Semantic Annotations:
  Pattern: female sterile (2) tekele
    Non-semantic:
    CI:
    Trans.:
    SSPs: Fs(2)Tek, female sterile, fs 2 sz 10, …
Matched Synonyms: Fs(2)Tek; fs 2 sz 10; female sterile …
26
Gene Synonym Extraction: Results
[Figure panels: MRR (one-pass); MRR (hierarchical); Running time (hierarchical); Running time (one-pass)]
• Effective! MRR > 0.5
• frequent pattern >> single words
• Micro-clustering is useful
27
Conclusions
• A novel problem: semantic pattern annotation
• A structured annotation for frequent patterns
• A general method based on context modeling
• A general post-processing procedure for frequent pattern mining, on any type of pattern
• Applicable to and effective for quite different tasks
• Future work:
– Tune for specific tasks
– Better context unit weights, redundancy removal, etc.
28
Thanks and Questions
29