
Model of Web Clustering Engine
Enrichment with a Taxonomy,
Ontologies and User Information
Carlos Cobos-Lozada MSc. Ph.D. (c)
[email protected] / [email protected]
Advisor: Elizabeth León Ph.D.
[email protected]
Visiting scholar, Modern Heuristic Research Group
LISI-MIDAS: Universidad Nacional de Colombia, Sede Bogotá
GTI: Universidad del Cauca
Idaho Falls, October 5, 2011
Agenda
- Preliminaries
- Latent Semantic Indexing
- Web Clustering Engines
- Proposed Model
Preliminaries
Preliminaries
Information Retrieval System
[Diagram: a user issues a query, which auto-complete expands into an extended query; the retrieval process matches it against indexes built by the indexing process over the document collection; results are returned for visualization and browsing, and user feedback flows back into the query.]
Preliminaries
Information Retrieval Models
[Taxonomy of retrieval models]
- Classic Models: Boolean, Vector Space, Probabilistic
- Set Theoretic: Fuzzy, Extended Boolean
- Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
- Probabilistic: Inference Network, Belief Network
- Structured Models: Non-Overlapping Lists, Proximal Nodes
Preliminaries
Classic Models – Basic Concepts
- Each document is represented by a set of representative keywords or index terms.
- An index term is a document word useful for remembering the document's main themes.
- Usually, index terms are nouns, because nouns have meaning by themselves.
- However, some search engines assume that all words are index terms (full-text representation).
- Not all terms are equally useful for representing the document contents; e.g., less frequent terms identify a narrower set of documents.
- The importance of an index term is represented by a weight associated with it.
Preliminaries
Indexing Process
[Diagram: indexing pipeline. A document undergoes structure recognition (yielding its document structure) and is reduced from a full-text representation through tokenization, filters, stop-word removal, noun-group removal, vocabulary restriction, and stemming, down to its key words.]
Preliminaries
Indexing Process - Sample
Original: WASHINGTON - The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again.

Tokens: WASHINGTON The House of Representatives on Tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again

Filters: washington the house of representatives on tuesday passed a bill that puts the government on stable financial footing for six weeks but does nothing to resolve a battle over spending that is likely to flare again

Stop words removed: washington house representatives tuesday passed bill puts government stable financial footing weeks resolve battle spending flare

Stemmed: washington hous repres tuesdai pass bill put govern stabl financi foot week resolv battl spend flare
Preliminaries
Indexing Process - Sample
Original: TRENTON, New Jersey - New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race, in a move that sets up a battle between Mitt Romney and Rick Perry.

Tokens: TRENTON New Jersey New Jersey Governor Chris Christie dashed hopes on Tuesday he might make a late leap into the 2012 Republican presidential race in a move that sets up a battle between Mitt Romney and Rick Perry

Filters: trenton new jersey new jersey governor chris christie dashed hopes on tuesday he might make a late leap into the 2012 republican presidential race in a move that sets up a battle between mitt romney and rick perry

Stop words removed: trenton jersey jersey governor chris christie dashed hopes tuesday make late leap 2012 republican presidential race move sets battle mitt romney rick perry

Stemmed: trenton jersei jersei governor chri christi dash hope tuesdai make late leap 2012 republican presidenti race move set battl mitt romnei rick perri
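A minimal Python sketch of this pipeline, assuming NLTK's Porter stemmer and a small illustrative stop-word list (the deck does not specify its actual list):

```python
import re
from nltk.stem import PorterStemmer  # Porter stemming, as in the samples above

# Tiny illustrative stop-word list; a real system uses a much larger one.
STOP_WORDS = {"the", "of", "on", "a", "that", "for", "but", "does", "to", "is",
              "nothing", "he", "might", "into", "in", "and", "up", "between", "new"}

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[A-Za-z0-9]+", text)           # tokenization (drop punctuation)
    tokens = [t.lower() for t in tokens]                 # filters: lower-casing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]             # stemming

print(preprocess("The House of Representatives on Tuesday passed a bill"))
# -> ['hous', 'repres', 'tuesdai', 'pass', 'bill']
```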
Preliminaries
TF-IDF or Term-Document Matrix
The observed frequencies are stored in an inverted index as a Term-Document Matrix (TDM): one row per document d1 .. dN, one column per term t1 .. tF, and entry f(i,j) holding the observed frequency of term j in document i.

The weight of term j in document i is

$$ w_{i,j} = \frac{f_{i,j}}{\max(f_i)}\,\log\!\left(\frac{N}{1+n_j}\right) $$

where max(f_i) is the largest term frequency in document i, N is the number of documents, and n_j is the number of documents that contain term j.
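The weighting scheme above is straightforward to vectorize; a minimal NumPy sketch (note that the worked examples on the next slides use log(N / n_j), without the +1 smoothing):

```python
import numpy as np

def tf_idf(F: np.ndarray) -> np.ndarray:
    """Weight an observed-frequency matrix F (documents x terms) with
    w_ij = f_ij / max(f_i) * log(N / (1 + n_j)), as in the formula above."""
    N = F.shape[0]                                       # number of documents
    max_f = np.maximum(F.max(axis=1, keepdims=True), 1)  # max(f_i) per document
    n = (F > 0).sum(axis=0)                              # n_j: docs containing term j
    return (F / max_f) * np.log(N / (1 + n))
```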
Preliminaries
Cosine Similarity

$$ \mathrm{Sim}(d,q) = \frac{\sum_{i=1}^{M} W_{i,d}\,W_{i,q}}{\sqrt{\sum_{i=1}^{M} W_{i,d}^{2}}\;\sqrt{\sum_{i=1}^{M} W_{i,q}^{2}}} $$

where M is the number of terms in the vocabulary.
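And the similarity itself, as a small function over two weight vectors:

```python
import numpy as np

def cosine_similarity(w_d: np.ndarray, w_q: np.ndarray) -> float:
    """Cosine of the angle between a document and a query weight vector."""
    return float(w_d @ w_q / (np.linalg.norm(w_d) * np.linalg.norm(w_q)))
```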
Preliminaries
Sample 1: Vector Space Model

[Figure: documents d1-d7 and the query q plotted in the term space t1, t2, t3.]

Observed frequencies, weights, vector lengths, similarities, and ranking (N = 7):

        f(t1) f(t2) f(t3) max(f) | w(t1)      w(t2)      w(t3)      | |dj|        sim(dj,q)   ranking
d1        1     0     1     1    | 0.3364722  0          0.8472979  | 0.9116618   0.85224481     3
d2        1     0     0     1    | 0.3364722  0          0          | 0.33647224  0.31454287     6
d3        0     1     1     1    | 0          0.5596158  0.8472979  | 1.01542282  0.94924327     2
d4        1     0     0     1    | 0.3364722  0          0          | 0.33647224  0.31454287     7
d5        1     1     1     1    | 0.3364722  0.5596158  0.8472979  | 1.06971822  1              1
d6        1     1     0     1    | 0.3364722  0.5596158  0          | 0.65298039  0.61042281     4
d7        0     1     0     1    | 0          0.5596158  0          | 0.55961579  0.52314318     5

n_i:   5          4          3          (documents containing each term)
idf_i: 0.3364722  0.5596158  0.8472979  (idf_i = ln(N / n_i))

q         1     1     1     1    | 0.5047084  0.8394237  1.2709468  | |q| = 1.60457732

The query weights use the augmented form w(i,q) = (0.5 + f(i,q) / max(f_q)) · idf_i.
Preliminaries
Sample 2: Vector Space Model

[Figure: documents d1-d7 and the new query q plotted in the term space t1, t2, t3.]

The document frequencies, weights, and lengths are the same as in Sample 1; only the query changes:

q: f = (1, 2, 3), max(f) = 3 | weights (0.2803935, 0.6528851, 1.2709468) | |q| = 1.45608558

        sim(dj,q)    ranking
d1      0.88229947      3
d2      0.19256666      6
d3      0.97544391      2
d4      0.19256666      7
d5      0.98650404      1
d6      0.48349989      4
d7      0.44838373      5

(As before: n_i = 5, 4, 3; idf_i = 0.3364722, 0.5596158, 0.8472979; N = 7.)
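A short NumPy sketch that reproduces Sample 1's similarities and ranking (the tie between d2 and d4 can break either way), using the query weighting stated above:

```python
import numpy as np

# Observed frequencies from Sample 1 (rows d1..d7, columns t1..t3).
F = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 0, 0],
              [1, 1, 1], [1, 1, 0], [0, 1, 0]], dtype=float)
fq = np.array([1.0, 1.0, 1.0])              # query term frequencies

N = F.shape[0]
idf = np.log(N / (F > 0).sum(axis=0))       # idf_i = ln(N / n_i)
W = F / F.max(axis=1, keepdims=True) * idf  # document weights
wq = (0.5 + fq / fq.max()) * idf            # query weights, as on the slides

sim = W @ wq / (np.linalg.norm(W, axis=1) * np.linalg.norm(wq))
for rank, d in enumerate(np.argsort(-sim), start=1):
    print(f"rank {rank}: d{d + 1}  sim = {sim[d]:.6f}")  # d5, d3, d1, d6, d7, ...
```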
Preliminaries
Vector Space Model
Advantages:
• Simple model based on linear algebra
• Term weights are not binary
• Allows computing a continuous degree of similarity between queries and documents
• Allows ranking documents according to their possible relevance
• Allows partial matching

Limitations:
• Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
• Word substrings might result in a "false positive match"
• Semantic sensitivity: documents with a similar context but a different term vocabulary won't be associated, resulting in a "false negative match"
• The order in which the terms appear in the document is lost in the vector space representation
• Assumes terms are statistically independent
Latent Semantic Indexing
Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

SVD:
◦ It can also be used to reduce noise in the data (SVD projects the data onto a reduced dimension).
Latent Semantic Indexing
SVD

Let A denote an m × n matrix of real-valued data with rank r, where without loss of generality m ≥ n, and therefore r ≤ n.

$$ A_{m\times n} = U_{m\times n}\,\Sigma_{n\times n}\,V^{T}_{n\times n} $$

Where:
◦ The columns of U are called the left singular vectors and form an orthonormal basis for the original columns.
  - U holds the eigenvectors of A·Aᵀ (orthogonal).
◦ The rows of Vᵀ contain the elements of the right singular vectors and form an orthonormal basis for the original rows.
  - V holds the eigenvectors of Aᵀ·A (orthogonal).
◦ Σ is a diagonal matrix whose entries are the square roots of the eigenvalues of Aᵀ·A (the singular values), sorted in decreasing order: Σ(i,i) ≥ Σ(j,j) for i < j, and Σ(i,i) = 0 for i > r (r ≤ n).
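As a quick check of the decomposition, a minimal NumPy sketch (here the 10 × 8 matrix is random rather than the actual values used on the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(1, 6, size=(10, 8)).astype(float)  # 10 x 8 matrix, values 1..5

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U @ diag(s) @ Vt
assert np.allclose(A, U @ np.diag(s) @ Vt)          # exact reconstruction

print("singular values:", np.round(s, 2))           # sorted in decreasing order
```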
Latent Semantic Indexing
$$ A_{10\times 8} = U_{10\times 8}\,\Sigma_{8\times 8}\,V^{T}_{8\times 8} $$

[Worked example: a 10 × 8 term-document matrix A (terms as rows, documents as columns, integer frequencies from 1 to 5) and its full SVD factors U, Σ, and Vᵀ. The diagonal of Σ holds the singular values 34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, and 0.35.]
Latent Semantic Indexing

Using SVD to reduce noise:
◦ Keep r singular values instead of n in the matrix Σ.
◦ What value of r? e.g., 90% of the Frobenius norm.

$$ A_{10\times 8} \approx U_{10\times 5}\,\Sigma_{5\times 5}\,V^{T}_{5\times 8} $$

In this case r = 5, where r < n (n = 8).
Latent Semantic Indexing
$$ A_{10\times 8} \approx U_{10\times 5}\,\Sigma_{5\times 5}\,V^{T}_{5\times 8} $$

[The same worked example truncated to rank r = 5: only the first 5 columns of U (10 × 5), the 5 largest singular values (34.89, 4.63, 3.36, 2.33, 2.21) in Σ, and the first 5 rows of Vᵀ (5 × 8) are kept.]
Latent Semantic Indexing
Value of r?

Sum ← 0
For i ← 1 to n do
    Sum ← Sum + Σ(i, i)
End for
Threshold ← Sum × 0.9   // retain 90% of the norm (here, of the sum of singular values)
r ← 0
Temp ← 0
For i ← 1 to n do
    Temp ← Temp + Σ(i, i)
    r ← r + 1
    If Temp ≥ Threshold then
        break
    End if
End for
Return r
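The same rule as a small Python function; like the pseudocode, it accumulates the singular values themselves rather than their squares:

```python
import numpy as np

def choose_r(s: np.ndarray, fraction: float = 0.9) -> int:
    """Smallest r whose leading singular values reach `fraction` of their
    total sum (s must be sorted in decreasing order, as SVD returns it)."""
    cumulative = np.cumsum(s)
    return int(np.searchsorted(cumulative, s.sum() * fraction) + 1)

s = np.array([34.89, 4.63, 3.36, 2.33, 2.21, 1.73, 1.22, 0.35])
print(choose_r(s))  # -> 5, matching the r = 5 of the worked example
```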
Latent Semantic Indexing

Representing retrieved documents in the latent space:
◦ Documents in the latent space:

$$ D'_{m\times r} = A_{m\times n}\,V_{n\times r}\,\Sigma^{-1}_{r\times r} $$

◦ Terms in the latent space:

$$ T'_{n\times r} = A^{T}_{m\times n}\,U_{m\times r}\,\Sigma^{-1}_{r\times r} $$
Latent Semantic Indexing

Query in the latent space:

$$ q'_{1\times r} = q_{1\times n}\,V_{n\times r}\,\Sigma^{-1}_{r\times r} $$

Cosine similarity (now over the r latent dimensions):

$$ \mathrm{Sim}(d,q) = \frac{\sum_{i=1}^{r} W_{i,d}\,W_{i,q}}{\sqrt{\sum_{i=1}^{r} W_{i,d}^{2}}\;\sqrt{\sum_{i=1}^{r} W_{i,q}^{2}}} $$
Web Clustering Engines
Web Clustering Engines

The search aspects where WCEs can be most useful in complementing the output of plain search engines are:
◦ Fast subtopic retrieval: documents can be accessed in logarithmic rather than linear time.
◦ Topic exploration: clusters provide a high-level view of the whole query topic, including terms for query reformulation (particularly useful for informational searches in unknown or dynamic domains).
◦ Alleviating information overlook: users may review hundreds of potentially relevant results without the need to download and scroll to subsequent pages.
Web Clustering Engines

Web document clustering (WDC) poses new requirements and challenges for clustering technology:
◦ Meaningful labels
◦ Computational efficiency (response time)
◦ Short input data description (snippets)
◦ Unknown number of clusters
◦ Robustness to noisy data
◦ Overlapping clusters
General Model
[Pipeline: a query (e.g., "acquisitions") is sent to a search engine; the snippets of the search results go through preprocessing to produce features; cluster construction and labeling turn these into clusters, which are shown to the user in the visualization.]
Proposed Model
[Pipeline: the same stages as the general model (query, search results, snippets, preprocessing, features, cluster construction and labeling, clusters, visualization), enriched at four points: query expansion on the input side; concepts instead of terms as features; an evolutionary approach (online and offline) for cluster construction and labeling; and user feedback from the visualization. The enrichment is based on a taxonomy, ontologies, and user information.]
Query Expansion Process
1. A registered user issues a query (based on keywords, in a familiar graphical interface like Google's) and receives online help (auto-complete) based on his/her user profile.

[Diagram: the keyword query enters the Query Expansion Process, which (1) pre-processes it and finds semantic relationships, (2) relates concepts to the user profile through the Inverted Index of Concepts, and (3) can call an external service; an auto-complete drop-down list assists the user, and the output is an extended query. Each node of the General Taxonomy of Knowledge is associated with 0…* specific ontologies, which hold concepts, relations (is-a, is-part-of), and instances.]
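A heavily simplified sketch of the expansion idea; the dictionary below is a hypothetical stand-in for the Inverted Index of Concepts, and `profile_nodes` stands in for the GTK nodes selected from the user profile:

```python
# Hypothetical concept index; the real model builds this from the GTK and
# the specific ontologies, not from a hard-coded dictionary.
CONCEPT_INDEX = {
    "acquisitions": ["merger", "takeover", "buyout"],
    "clustering": ["grouping", "partitioning"],
}

def expand_query(keywords: list[str], profile_nodes: list[str]) -> list[str]:
    """Extended query = original keywords + related concepts + selected GTK nodes."""
    expanded = list(keywords)
    for kw in keywords:
        expanded += CONCEPT_INDEX.get(kw.lower(), [])  # related concepts
    expanded += profile_nodes                          # nodes chosen from the GTK
    return expanded

print(expand_query(["acquisitions"], profile_nodes=["business/finance"]))
# -> ['acquisitions', 'merger', 'takeover', 'buyout', 'business/finance']
```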
Query Expansion Process (B)
[Diagram: the user's keyword query passes through the Query Expansion Process, supported by the General Taxonomy of Knowledge (GTK), and produces the extended query.]

1. The GTK and the specific ontologies are multilingual (maintained through a collaborative editing process).
2. The user profile contains:
• the GTK nodes used by the user, and
• a relation to the Inverted Index of Concepts (ontologies) that supports the rating process: it keeps track of concepts previously evaluated as good or bad for a specific ontology.
Term-Document Matrix - Observed Frequency (TDM-OF) Building Process

[Diagram: independent threads call the Google, Yahoo!, and Bing APIs with the extended query, and their results feed the TDM-OF Building Process.]

Extended query: original keywords + other concepts + selected nodes from the GTK (ontologies).

In parallel, each web search result is processed:
1. Pre-processing:
• Tokenization
• Filters (special characters and lower-casing)
• Stop-word removal
• Language detection
• Stemming (English/Spanish)
2. For each document, accumulate the observed frequency of each term.
3. Mark the document as processed.

Output: the Term-Document Matrix (Observed Frequency).
Concept-Document Matrix - Observed Frequency (CDM-OF) Building Process

In parallel, for each document marked as processed (a sketch of the term-merging step follows below):
1. Join the terms belonging to the same concept in the selected specific ontologies (taken from the extended query).
2. Accumulate the observed frequencies of the terms merged into the same concept.
3. End this process when all web search results have been processed (thread synchronization).

Output: the Concept-Document Matrix (Observed Frequency).
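The term-merging step, sketched in NumPy; the term-to-concept mapping here is a hypothetical example, whereas the real mapping comes from the specific ontologies in the extended query:

```python
import numpy as np

terms = ["merger", "takeover", "bill", "spending"]          # TDM-OF columns
concept_of = {"merger": "acquisition", "takeover": "acquisition",
              "bill": "legislation", "spending": "budget"}  # hypothetical mapping

def build_cdm(tdm: np.ndarray) -> tuple[np.ndarray, list[str]]:
    """Sum the observed frequencies of all terms mapped to the same concept."""
    concepts = sorted(set(concept_of.values()))
    cdm = np.zeros((tdm.shape[0], len(concepts)))
    for j, term in enumerate(terms):
        cdm[:, concepts.index(concept_of[term])] += tdm[:, j]
    return cdm, concepts

tdm = np.array([[2, 1, 0, 0],
                [0, 0, 3, 1]], dtype=float)  # 2 documents x 4 terms
cdm, concepts = build_cdm(tdm)
print(concepts)  # ['acquisition', 'budget', 'legislation']
print(cdm)       # [[3. 0. 0.]  [0. 1. 3.]]
```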
Concept-Document Matrix (CDM) Building Process

1. Calculate the TF-IDF weight of each concept in each document:

$$ w_{i,j} = \frac{freq_{i,j}}{\max(freq_i)}\,\log\!\left(\frac{N}{1+n_j}\right) $$

Output: the Concept-Document Matrix (CDM).
Clustering Process

Output: clustered documents.

Three algorithms of our own:
1. A hybridization of Global-Best Harmony Search with the K-means algorithm
2. A memetic algorithm with niching techniques (restricted competition replacement and restrictive mating)
3. A memetic algorithm (roulette-wheel selection, K-means, and replace-the-worst)

All algorithms:
1. Define the number of clusters automatically (BIC); see the sketch after this list
2. Can use a standard Term-Document Matrix (TDM), a Frequent Term-Document Matrix (FTDM), a Concept-Document Matrix (CDM), or a Frequent Concept-Document Matrix (FCDM)
3. Were tested on data sets based on Reuters-21578 and DMOZ
4. Were tested by users
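These are not the thesis' algorithms; as an illustration of choosing the number of clusters automatically, here is a BIC-style selection wrapped around scikit-learn's KMeans (the exact BIC formulation in the model may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_k_by_bic(X: np.ndarray, k_max: int = 10) -> int:
    """Choose k with a simple BIC-style score: a spherical-Gaussian
    log-likelihood term plus a k*d*ln(n) complexity penalty."""
    n, d = X.shape
    best_k, best_bic = 1, np.inf
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        rss = km.inertia_                    # within-cluster sum of squares
        bic = n * np.log(rss / n + 1e-12) + k * d * np.log(n)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```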
Labeling Process

Output: clustered and labeled documents.

Statistically Representative Terms (a simplified sketch follows after this slide):
1. Initialize the algorithm parameters
2. Build the "Others" label and cluster
3. Candidate label induction
4. Eliminate repeated terms
5. Visual improvement of labels

Frequent Phrases:
1. Conversion of the representation
2. Document concatenation
3. Complete phrase discovery
4. Final selection
5. Build the "Others" label and cluster
6. Cluster label induction

Both strategies work with overlapping clusters.
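A simplified sketch of label induction by statistically representative terms: label a cluster with the terms whose mean TF-IDF weight over the cluster's documents is highest (the vocabulary and weights below are hypothetical):

```python
import numpy as np

def representative_terms(W: np.ndarray, cluster: list[int],
                         vocab: list[str], top: int = 3) -> list[str]:
    """Pick the `top` terms with the highest mean weight over a cluster."""
    mean_w = W[cluster].mean(axis=0)         # average weight of each term
    return [vocab[i] for i in np.argsort(-mean_w)[:top]]

W = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.0, 0.2],
              [0.1, 0.7, 0.6]])              # 3 documents x 3 terms
print(representative_terms(W, cluster=[0, 1], vocab=["merger", "senate", "budget"]))
# -> ['merger', 'budget', 'senate']
```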
Visualization and Rating Process

Input: the clustered and labeled documents; the ratings are stored in the user profile.

During experimentation, for each cluster the user answered:
• (Q1) whether the cluster label is in general representative of the cluster (a lot, a little, or not at all)
• (Q2) whether the cluster is useful, moderately useful, or useless.
Then, for each document in each cluster, the user answered:
• (Q3) whether the document matches the cluster (matches very well, matches moderately, or does not match)
• (Q4) whether the document's relevance (position) in the cluster was adequate (adequate, moderately adequate, or inadequate).
Visualization and Rating Process

In production, the user can rate whether each document is useful (relevant) or not.

[Diagram: the clustered and labeled documents feed the Visualization and Rating Process; the ratings update the User Profile, which is linked to the General Taxonomy of Knowledge (each node with 0…* specific ontologies) and to the Inverted Index of Concepts.]
Proposed Model
Collaborative Editing Process of Ontologies

1. The editor selects a node of the General Taxonomy of Knowledge (which has an associated ontology).
2. The editor edits the ontology: concepts, synonyms in different languages, relations, and instances.
3. The editing is supported by general ontologies (WordNet).
4. It is also supported by the concepts already used by users (from the user profiles).
5. The Inverted Index of Concepts is updated automatically when the ontology is saved.
Model of Web Clustering Engine
Enrichment with a Taxonomy,
Ontologies and User Information
Carlos Cobos-Lozada MSc. Ph.D. (c)
[email protected] / [email protected]
Questions?