4bslides/cs466-lecture-x-wsk.ppt

Thesaurus Creation / Term Clustering
Two major applications:
1. Query expansion: fleshing out sparse queries with related words
   Improves recall (at the possible expense of reduced precision)
2. Term-set dimensionality reduction
   Similar outcome with a smaller model
CS466-10
1
Term Dimensionality Reduction
Query: Water Spaniel Diseases
Original vector VQ, over the terms: Spaniel, Spaniels, disease, diseases, collie, illness, ant
Reduced vector VQ’, over the term classes: DOG = 1, ILL = 1, INSECT = 0
The reduced query also matches:
   Collie illnesses
   Poodle sickness
Problem: reduced flexibility in partial weighting of the synonym set
• Synonyms get as much weight as the original term
• Equivalent to query expansion when the weight of every synonym is 1
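A minimal sketch of the reduction above, assuming a hand-built term-to-class table. The class names DOG, ILL and INSECT come from the slide; the table `TERM_CLASS` and the function `reduce_vector` are hypothetical names introduced for illustration.

```python
# Hypothetical term-to-class table. The class names follow the slide;
# the individual term assignments are assumed for illustration.
TERM_CLASS = {
    "spaniel": "DOG", "spaniels": "DOG", "collie": "DOG", "poodle": "DOG",
    "disease": "ILL", "diseases": "ILL", "illness": "ILL",
    "illnesses": "ILL", "sickness": "ILL",
    "ant": "INSECT",
}
CLASSES = ["DOG", "ILL", "INSECT"]


def reduce_vector(text):
    """Map a bag of words to a binary vector over the term classes."""
    present = {TERM_CLASS[w] for w in text.lower().split() if w in TERM_CLASS}
    return [1 if c in present else 0 for c in CLASSES]


query = reduce_vector("Water Spaniel Diseases")
doc_a = reduce_vector("Collie illnesses")
doc_b = reduce_vector("Poodle sickness")
print(query, doc_a, doc_b)
```

All three texts collapse to the same reduced vector, which is the loss of flexibility noted above: every member of a class counts exactly as much as the original query term.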
Query Expansion
Query: Water Spaniel Diseases

Term       Original   Expanded   Relation
Water      1          1
Spaniel    1          1
diseases   1          1
Spaniels   0          weight_1   stem
disease    0          weight_2   stem
illness    0          weight_3   syn
collie     0          weight_4   syn
ant        0          0

Related document set:
D1: Water Spaniels (stem)
D2: Water Spaniel illnesses (syn)
D3: Collie diseases (syn)

weight_i depends on the semantic distance dist(wi, t)
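The expansion can be sketched as follows, under stated assumptions: the table `RELATED`, its weight values, and the functions `expand_query` and `score` are hypothetical and not taken from the lecture. Original query terms keep weight 1, and related terms receive a partial weight.

```python
# Hypothetical related-term table with illustrative partial weights
# (stem variants weighted higher than synonyms; the values are assumed).
RELATED = {
    "spaniel": {"spaniels": 0.75, "collie": 0.25},
    "diseases": {"disease": 0.75, "illness": 0.5, "illnesses": 0.5},
}


def expand_query(query_terms, related):
    """Return {term: weight}: original terms get 1, related terms a partial weight."""
    weights = {t: 1.0 for t in query_terms}
    for t in query_terms:
        for rel, w in related.get(t, {}).items():
            # Never lower the weight of a term that is already present.
            weights[rel] = max(weights.get(rel, 0.0), w)
    return weights


def score(weights, doc_terms):
    """Sum the query weights of the distinct terms found in the document."""
    return sum(weights.get(t, 0.0) for t in set(doc_terms))


query = ["water", "spaniel", "diseases"]
expanded = expand_query(query, RELATED)
docs = {
    "D1": ["water", "spaniels"],
    "D2": ["water", "spaniel", "illnesses"],
    "D3": ["collie", "diseases"],
}
scores = {name: score(expanded, terms) for name, terms in docs.items()}
print(scores)
```

Setting every related-term weight to 1 makes the expanded query treat all members of a synonym set alike, which is the case the dimensionality-reduction slide describes as equivalent.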
Query Expansion
Query : Water Spaniel diseases
document1 : … water spaniels …..
…
Simplest Term Clustering
• Stemming is a clustering method
Original Term Set: computing, flies, computers, houses, computation, flown, compute, flew, house
Clustered Term Set (after stemming): comput*, fly*
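A toy illustration of stemming as clustering. The suffix list and the irregular-form table are hypothetical and far cruder than a real stemmer; they are chosen only so that the slide's term set groups as shown.

```python
from collections import defaultdict

# Hypothetical toy rules, not a real stemming algorithm.
IRREGULAR = {"flies": "fly", "flown": "fly", "flew": "fly"}
SUFFIXES = ["ation", "ing", "ers", "es", "er", "e", "s"]


def toy_stem(term):
    """Strip the first matching suffix, keeping at least three characters."""
    term = term.lower()
    if term in IRREGULAR:
        return IRREGULAR[term]
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def cluster_by_stem(terms):
    """Group terms that share a stem; each stem names one cluster."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[toy_stem(term)].append(term)
    return dict(clusters)


terms = ["computing", "flies", "computers", "houses", "computation",
         "flown", "compute", "flew", "house"]
clusters = cluster_by_stem(terms)
print(clusters)
```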
Another simple clustering method:
Pre-existing thesauri (e.g. Roget’s)
illness → disease, sickness (same part of speech, pos); unwell, sick, ill, … (different pos)
Term equivalence classes:
PhD → Ph.D, PhD, Phd, Ph.D., …
Loosely related topic sets:
DOG → Spaniel, Collie, Schnauzer, bulldog, Poodle, …
Term Clustering
Non-hierarchical methods: single pass (Salton, ’71)
Given a clustering threshold (or target size) and a similarity function sim(i, j):
• Pick a random document Dj
• For a document di: if sim(Dj, di) exceeds the threshold, assign di to cluster Cj and recalculate the centroid;
  else create a new cluster Ck with di as its centroid dk
• Exclude di from the document list
• Repeat until the document list is empty
[Figure: scatter plot of documents falling into three clusters, labeled D1, D2 and D3]
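The single-pass procedure can be sketched as below. Several choices are assumptions rather than part of the slide: cosine is used as the similarity function, documents are visited in list order instead of at random (to keep the run reproducible), and when several clusters pass the threshold the most similar one is chosen.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(docs, threshold):
    """Return clusters as dicts with member indices and a running centroid."""
    clusters = []
    for i, d in enumerate(docs):
        best, best_sim = None, threshold
        for c in clusters:
            s = cosine(c["centroid"], d)
            if s >= best_sim:  # similar enough, and the best seen so far
                best, best_sim = c, s
        if best is None:
            # No cluster is close enough: the document seeds a new cluster.
            clusters.append({"members": [i], "centroid": list(d)})
        else:
            best["members"].append(i)
            n = len(best["members"])
            # Running mean: recalculate the centroid with the new member.
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], d)]
    return clusters


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = single_pass(docs, threshold=0.8)
print([c["members"] for c in result])
```

Because each document is examined once, the outcome depends on the visiting order and on the threshold; a different order can yield different clusters.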
Types of Clustering
Behavior/Criterion                                     Sim(ti, tj)
Document level: co-occurrence in the same document
Verb-object
  Syntagmatic similarity                               sim(drink, wine)
  (appear together in a region)                        sim(eat, meat)
                                                       sim(drink, water)
  Paradigmatic similarity                              sim(wine, water)
  (appear as objects of the same verb;                 sim(wine, drink)
   based on the object of "drink" or of all verbs)     sim(wine, meat)
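A small sketch of the two verb-object measures, assuming a hypothetical list of (verb, object) pairs such as a parser might produce. Using a raw pair count for syntagmatic similarity and Jaccard overlap of verb sets for paradigmatic similarity is one possible choice, not necessarily the measure used in the lecture.

```python
from collections import Counter, defaultdict

# Hypothetical (verb, object) pairs, invented for illustration.
PAIRS = [("drink", "wine"), ("drink", "water"), ("drink", "wine"),
         ("eat", "meat"), ("pour", "wine"), ("pour", "water")]

pair_counts = Counter(PAIRS)
verbs_of = defaultdict(set)
for verb, obj in PAIRS:
    verbs_of[obj].add(verb)


def syntagmatic(verb, obj):
    """How often the verb and the object appear together."""
    return pair_counts[(verb, obj)]


def paradigmatic(obj_a, obj_b):
    """Jaccard overlap of the sets of verbs that take each object."""
    a, b = verbs_of[obj_a], verbs_of[obj_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(syntagmatic("drink", "wine"))
print(paradigmatic("wine", "water"))
print(paradigmatic("wine", "meat"))
```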
N-gram
Syntagmatic similarity          sim(Hong, Kong)
(occur together)                sim(soap, opera)
                                sim(soap, suds)
Paradigmatic similarity         sim(opera, suds)
(occur in the same context)     sim(tall, short)
                                sim(long, short)
                                sim(Hong, Kong)

Contexts of "soap":
soap + {opera, suds, residue}
{Ivory, Dial, Lye} + soap
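The n-gram version of the distinction can be sketched on a toy token stream. The corpus, the use of "." as a separator token, and the functions `bigram_count`, `context_vectors` and `cosine` are assumptions made for illustration: syntagmatic similarity is approximated by a bigram count, paradigmatic similarity by the cosine of left/right neighbour counts.

```python
import math
from collections import Counter


def bigram_count(tokens, first, second):
    """Number of times `first` is immediately followed by `second`."""
    return sum(1 for x, y in zip(tokens, tokens[1:]) if (x, y) == (first, second))


def context_vectors(tokens):
    """Count left (L:) and right (R:) neighbours for each token."""
    ctx = {}
    for i, tok in enumerate(tokens):
        c = ctx.setdefault(tok, Counter())
        if i > 0:
            c["L:" + tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            c["R:" + tokens[i + 1]] += 1
    return ctx


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy corpus built from the slide's words; "." is kept as a plain token.
tokens = "ivory soap suds . dial soap opera . lye soap residue .".split()
ctx = context_vectors(tokens)

print(bigram_count(tokens, "soap", "opera"))   # occur together
print(cosine(ctx["opera"], ctx["suds"]))       # occur in the same context
print(cosine(ctx["soap"], ctx["opera"]))
```

In this toy corpus "soap" and "opera" occur together but share no neighbours, while "opera" and "suds" never occur together yet have identical neighbours.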