Web Document Clustering using Phrase


Web Mining:
Phrase-based Document Indexing
and Document Clustering
Khaled Hammouda, Ph.D. Candidate
Mohamed Kamel, Supervisor, PI
PAMI Research Group
University of Waterloo
Waterloo, Ontario, Canada
Phrase-based Document Indexing
Document Index Graph Structure
A model based on a digraph representation of the phrases in the document set
Nodes correspond to unique terms
Edges maintain phrase representation
A phrase is a path in the graph
The model is an inverted list (terms → documents)
Nodes carry term weight information for each document in which they appear
Shared phrases can be matched efficiently
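The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `DocumentIndexGraph`, its methods, the per-edge (sentence, position) bookkeeping, and the use of a raw term count as the node weight are all assumptions made for the example.

```python
from collections import defaultdict


class DocumentIndexGraph:
    """Toy digraph index: one node per unique term, one edge per adjacent term pair."""

    def __init__(self):
        # term -> {doc_id: term count}; this is the inverted list (terms -> documents)
        self.nodes = defaultdict(lambda: defaultdict(int))
        # (term, next_term) -> {doc_id: [(sentence_no, position), ...]}
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_sentence(self, doc_id, sentence_no, sentence):
        terms = sentence.lower().split()
        for term in terms:
            self.nodes[term][doc_id] += 1
        for pos, pair in enumerate(zip(terms, terms[1:])):
            self.edges[pair][doc_id].append((sentence_no, pos))

    def documents_with_phrase(self, phrase):
        """Return ids of documents that contain `phrase` as a path of consecutive edges."""
        terms = phrase.lower().split()
        if not terms:
            return set()
        if len(terms) == 1:
            return set(self.nodes.get(terms[0], {}))
        result = None
        for offset, pair in enumerate(zip(terms, terms[1:])):
            # Shift each occurrence back to the position where the phrase would start.
            starts = {
                (doc, sent, pos - offset)
                for doc, hits in self.edges.get(pair, {}).items()
                for sent, pos in hits
            }
            result = starts if result is None else result & starts
        return {doc for doc, _, _ in result}


dig = DocumentIndexGraph()
dig.add_sentence(1, 0, "river rafting trips")
dig.add_sentence(2, 0, "river rafting vacation plan")
dig.add_sentence(3, 0, "fishing trips")
print(dig.documents_with_phrase("river rafting"))
```

Because every edge remembers where it occurred, looking up a phrase only touches the nodes and edges along its path rather than scanning whole documents.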
Phrase-based Features
Phrases: more informative features than individual words → local context matching
Represent sentences rather than words
Facilitate phrase matching between documents
Achieve accurate pairwise document similarity
Avoid the high dimensionality of the vector space model
Allow incremental processing
[Figure: Document Index Graph built from three documents. Graph nodes: booking, mild, fishing, river, trips, vacation, plan, rafting, wild, adventures. Phrases in Documents 1 to 3: river rafting; mild river rafting; river rafting trips; wild river adventures; river rafting vacation plan; fishing trips; fishing vacation plan; booking fishing trips; river fishing.]
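To make the example concrete, the sketch below lists the multi-word phrases shared by each pair of the three example documents. Two assumptions are made: the nine phrases are grouped as three for Document 1, two for Document 2, and four for Document 3 (the extracted slide does not preserve the grouping), and a brute-force comparison of word n-grams stands in for the graph traversal. `ngrams` and `shared_phrases` are hypothetical helper names.

```python
from itertools import combinations

# Assumed grouping of the example phrases into documents.
documents = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan", "booking fishing trips", "river fishing"],
}


def ngrams(sentence, min_len=2):
    """All contiguous word sequences of at least `min_len` words."""
    words = sentence.split()
    return {
        tuple(words[i:j])
        for i in range(len(words))
        for j in range(i + min_len, len(words) + 1)
    }


def shared_phrases(doc_a, doc_b):
    """Maximal multi-word phrases that occur in both documents."""
    grams_a = set().union(*(ngrams(s) for s in doc_a))
    grams_b = set().union(*(ngrams(s) for s in doc_b))
    common = grams_a & grams_b

    def contained(short, long):
        n = len(short)
        return any(long[i:i + n] == short for i in range(len(long) - n + 1))

    # Keep only phrases that are not part of a longer shared phrase.
    return {
        " ".join(p)
        for p in common
        if not any(p != q and contained(p, q) for q in common)
    }


for a, b in combinations(documents, 2):
    print(a, b, sorted(shared_phrases(documents[a], documents[b])))
```

Under the assumed grouping, Documents 1 and 2 share "river rafting", Documents 2 and 3 share "vacation plan", and Documents 1 and 3 share no multi-word phrase.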
Phrase-based Document Indexing
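[Figure: Document Index Graph (internal structure). Nodes such as fishing, river, rafting, and adventures are connected by edges e0, e1, and e2. Each node holds a Document Table with columns doc, TF, and ET, whose rows in the figure read 1 {0,0,3}, 2 {0,0,2}, 3 {0,0,1}. Edge Table entries shown include s1(1), s2(2), s3(1) for e0, and s2(1), s1(2), s4(1).]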
Document Index Graph (size scalability)
Document Index Graph (time performance)
Document Clustering using
Cluster Similarity Histograms
Similarity Histogram-based Clustering (SHC).
Clusters are represented using a concise statistical representation called similarity histograms.
Maximize cluster coherency by maintaining high similarity distributions in the cluster histograms.
Enhance a cluster at any time by redistributing documents among clusters.
Both the original and the receiving clusters benefit from tighter similarity distributions.
The SHC algorithm is incremental.
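A rough sketch of the idea, under stated assumptions: cluster quality is measured here by the fraction of pairwise similarities at or above a threshold (`histogram_ratio` below), and a new document joins a cluster only if that fraction does not drop by more than a small tolerance. The function names, the 0.5 threshold, the tolerance `eps`, the bin count, and the Jaccard word-overlap similarity are illustrative choices, not values taken from the slides.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for a phrase-based measure."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)


def pairwise_sims(docs, sim):
    """Similarities of all document pairs in a cluster."""
    return [sim(a, b) for a, b in combinations(docs, 2)]


def histogram(sims, bins=10):
    """Count of pairwise similarities in each of `bins` equal intervals of [0, 1]."""
    counts = [0] * bins
    for s in sims:
        counts[min(int(s * bins), bins - 1)] += 1
    return counts


def histogram_ratio(sims, threshold=0.5):
    """Fraction of similarities at or above `threshold` (1.0 when there are no pairs)."""
    if not sims:
        return 1.0
    return sum(s >= threshold for s in sims) / len(sims)


def assign(doc, clusters, sim, eps=0.1):
    """Add `doc` to each cluster whose ratio it keeps within `eps`; else open a new cluster."""
    placed = False
    for cluster in clusters:
        before = histogram_ratio(pairwise_sims(cluster, sim))
        after = histogram_ratio(pairwise_sims(cluster + [doc], sim))
        if after >= before - eps:
            cluster.append(doc)
            placed = True
    if not placed:
        clusters.append([doc])
    return clusters


clusters = []
for d in ["river rafting trips", "river rafting vacation",
          "fishing trips booking", "river rafting"]:
    assign(d, clusters, jaccard)
print(clusters)
```

Documents are processed one at a time and earlier assignments are never recomputed from scratch, which is what makes the procedure incremental.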
[Figure: Web Mining Agent, new documents (New Doc.), a document cluster (Doc. Cluster) and its similarity histogram, with documents labeled +ve or -ve.]
+ve documents: contribute to cluster cohesiveness
-ve documents: contribute to cluster looseness
-ve documents in one cluster could be +ve documents in another
Redistribute documents among clusters so that the number of -ve documents is reduced in each cluster
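One way to make the +ve / -ve distinction operational, offered only as an assumption for illustration: call a document -ve in a cluster if its average similarity to the other members falls below a threshold, and move each -ve document to the cluster where its average similarity is highest. The helper names, the 0.5 threshold, and the Jaccard similarity are hypothetical; the slides do not give the exact criterion.

```python
def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for a phrase-based measure."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)


def avg_sim(doc, members, sim):
    """Mean similarity between `doc` and the other members (0.0 if there are none)."""
    others = [m for m in members if m is not doc]
    return sum(sim(doc, m) for m in others) / len(others) if others else 0.0


def negative_docs(cluster, sim, threshold=0.5):
    """Documents whose mean similarity to the rest of the cluster is below `threshold`."""
    if len(cluster) < 2:
        return []
    return [d for d in cluster if avg_sim(d, cluster, sim) < threshold]


def redistribute(clusters, sim, threshold=0.5):
    """Move each -ve document to the cluster it resembles most, if that is another cluster."""
    for source in clusters:
        for doc in negative_docs(source, sim, threshold):
            best = max(clusters, key=lambda c: avg_sim(doc, c, sim))
            if best is not source and avg_sim(doc, best, sim) > avg_sim(doc, source, sim):
                source.remove(doc)
                best.append(doc)
    return [c for c in clusters if c]


clusters = [
    ["river rafting trips", "river rafting", "fishing trips booking"],
    ["booking fishing trips", "fishing vacation"],
]
print(redistribute(clusters, jaccard))
```

In this toy run "fishing trips booking" is -ve in the first cluster but fits the second, so moving it tightens both clusters.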
Document Clustering using
Cluster Similarity Histograms (cont’d)
SHC (time performance)
Phrases as Document Features
Effect of Phrase Similarity (F-measure)
Effect of Phrase Similarity (Entropy)
Document Clustering using Similarity Histograms
[Figure: F-measure (axis from 0 to 1) on data sets DS1 and DS2 for SHC (with Re-assignment), SHC (no Re-assignment), HAC, Single Pass, and K-NN.]
SHC Clustering Improvement (F-measure)
SHC Clustering Improvement (Entropy)
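The slides do not define the two quality measures. Assuming the definitions commonly used in the document clustering literature (overall F-measure as the class-size-weighted best F score per class, and entropy as the cluster-size-weighted entropy of the class labels inside each cluster), they can be computed as in the sketch below. The function names and the toy labels are made up for illustration.

```python
from collections import Counter
from math import log2


def f_measure(classes, clusters):
    """For each true class take its best-matching cluster's F score, weighted by class size."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint[(c, k)]
            if n_ck:
                precision, recall = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * precision * recall / (precision + recall))
        total += n_c / n * best
    return total


def entropy(classes, clusters):
    """Size-weighted average entropy (in bits) of the class labels inside each cluster."""
    n = len(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for (_, k), n_ck in joint.items():
        p = n_ck / cluster_sizes[k]
        total -= cluster_sizes[k] / n * p * log2(p)
    return total


classes = ["a", "a", "b", "b"]   # true labels (toy data)
perfect = [0, 0, 1, 1]           # clustering that matches the classes
mixed = [0, 1, 0, 1]             # clustering that splits every class
print(f_measure(classes, perfect), entropy(classes, perfect))
print(f_measure(classes, mixed), entropy(classes, mixed))
```

Under these definitions a higher F-measure and a lower entropy both indicate a better clustering.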
Current Research
Web Mining Multi-Agent System
[Figure: Web Mining Agents with component labels Alg, Comm, Stor, and Proc; action labels include Retrieve, Represent, Manipulate, Find Relation, Communicate, and Exchange.]
Cooperative agents work on mining web content
Agents can negotiate and exchange data to achieve better solutions
Implemented distributed clustering
Based on multiple standards, including XML and Web Services. Later will incorporate XML Topic Maps (XTM), the Semantic Web, and ontologies to represent discovered clusters.
[Figure labels: The Web, DModel, KModel]
Publications
Journal Publications


K. Hammouda and M. Kamel, “Efficient Phrase-based Document Indexing for Web
Document Clustering”, IEEE Transactions on Knowledge and Data Engineering. Accepted,
September 2003.
K. Hammouda and M. Kamel, “Document Similarity Using a Phrase Indexing Graph Model”,
Knowledge and Information Systems. Springer. Accepted, May 2003.
Conference Publications


K. Hammouda and M. Kamel, “Incremental Document Clustering Using Cluster Similarity
Histograms”, The 2003 IEEE/WIC International Conference on Web Intelligence (WI 2003),
pp. 597-601, Halifax, Canada, October 2003.
K. Hammouda and M. Kamel, “Phrase-based Document Similarity Based on an Index Graph
Model”, The 2002 IEEE International Conference on Data Mining (ICDM'02), pp. 203-210,
Maebashi, Japan, December 2002.
Available at: http://pami.uwaterloo.ca/nav.php?site=pub&action=list&researcher=hammouda