4bslides/cs466-lecture-x-wsk.ppt
Thesaurus Creation / Term Clustering
Two major applications:
1. Query expansion
   Fleshing out sparse queries with related words
   Improves recall (at the possible expense of reduced precision)
2. Termset dimensionality reduction
   Similar outcome with a smaller model
CS466-10
1
Term Dimensionality Reduction
Query : Water Spaniel Diseases
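Original terms: Spaniel, Spaniels, disease, diseases, collie, illness, ant
[Figure: the original query vector V_Q over individual terms is collapsed into a reduced vector V_Q' over term classes]
Reduced V_Q': DOG = 1, ILL = 1, INSECT = 0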
Collie illnesses
Poodle sickness
Problem: reduced flexibility in partial weighting of the synonym set
Synonyms get as much weight as the original term
Equivalent to query expansion when the weight of every synonym is 1
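A minimal sketch of the reduction, assuming a hand-written term-to-class table. The class names DOG, ILL and INSECT come from the slide; the exact membership of each class, the `TERM_CLASS` table and the `reduce_vector` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical term-to-class thesaurus. Class names follow the slide;
# the membership of each class is an assumption made for illustration.
TERM_CLASS = {
    "spaniel": "DOG", "spaniels": "DOG", "collie": "DOG", "poodle": "DOG",
    "disease": "ILL", "diseases": "ILL", "illness": "ILL",
    "illnesses": "ILL", "sickness": "ILL",
    "ant": "INSECT",
}
CLASSES = ["DOG", "ILL", "INSECT"]


def reduce_vector(text):
    """Map a bag of words onto a binary vector over term classes."""
    present = {TERM_CLASS[w] for w in text.lower().split() if w in TERM_CLASS}
    return [1 if c in present else 0 for c in CLASSES]


query = reduce_vector("Water Spaniel Diseases")
doc_a = reduce_vector("Collie illnesses")
doc_b = reduce_vector("Poodle sickness")
```

In this sketch the query and both texts collapse to the same reduced vector, which shows the problem named above: every member of a class counts exactly as much as the original query term.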
Query Expansion
Query : Water Spaniel Diseases
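Terms:     Water  Spaniel  diseases  Spaniels    diseases    illness     collie      ant
Original:  1      1        1         0           0           0           0           0
Expanded:  1      1        1         [weight 1]  [weight 2]  [weight 3]  [weight 4]  0
Added terms are related to the query terms by stemming (stem) or synonymy (syn)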
Related document set:
D1 : Water Spaniels (stem)
D2 : Water Spaniel illnesses (syn)
D3 : Collie diseases (syn)
The weight of related term i is based on the semantic distance dist(wi, t)
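Weighted expansion can be sketched as follows. The `RELATED` table, its weight values and the `expand_query` and `score` helpers are all assumptions made for illustration; the slides do not give concrete weights or a scoring function.

```python
def expand_query(query_terms, related):
    """Return {term: weight}: original terms get 1.0, related terms a partial weight."""
    expanded = {t: 1.0 for t in query_terms}
    for term in query_terms:
        for rel, weight in related.get(term, {}).items():
            # Keep the largest weight if a term is reachable in several ways.
            expanded[rel] = max(expanded.get(rel, 0.0), weight)
    return expanded


# Hypothetical relatedness weights (assumed values, not from the slides).
RELATED = {
    "spaniel": {"spaniels": 0.9, "collie": 0.3},
    "diseases": {"disease": 0.9, "illness": 0.6, "illnesses": 0.6},
}

expanded = expand_query(["water", "spaniel", "diseases"], RELATED)


def score(doc):
    """Sum the expanded-query weights of the distinct words in a document."""
    return sum(expanded.get(w, 0.0) for w in set(doc.lower().split()))


scores = {d: score(d) for d in
          ["Water Spaniels", "Water Spaniel illnesses", "Collie diseases"]}
```

Setting every weight in `RELATED` to 1.0 would make each related term count as much as an original query term, which is the behavior of the reduced-vector model.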
Query Expansion
Query : Water Spaniel diseases
document1 : … water spaniels …..
…
Simplest Term Clustering
Stemming is a clustering method
Original Term Set: computing, flies, computers, houses, computation, flown, compute, flew, house
(stemming)
Clustered Term Set: comput*, fly*
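The idea that stemming groups terms into clusters can be sketched with a deliberately crude stemmer. The suffix list, the `IRREGULAR` table and the `crude_stem` function are hypothetical stand-ins written for this example; they are not the stemmer used in the course, and a real system would use an established algorithm such as Porter's.

```python
from collections import defaultdict

# Irregular forms that suffix stripping cannot handle (hand-written, illustrative).
IRREGULAR = {"flies": "fly", "flown": "fly", "flew": "fly"}
# Suffixes are tried in this order; longer ones come first.
SUFFIXES = ["ation", "ing", "ers", "er", "es", "e", "s"]


def crude_stem(word):
    """Return a rough stem: irregular lookup first, then strip one suffix."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Only strip if at least three letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


terms = ["computing", "flies", "computers", "houses", "computation",
         "flown", "compute", "flew", "house"]

# Each distinct stem becomes one cluster of the original terms.
clusters = defaultdict(list)
for term in terms:
    clusters[crude_stem(term)].append(term)
```

Each key of `clusters` plays the role of one entry in the clustered term set, and the list under it holds the original terms that were merged.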
Another simple clustering method:
Pre-existing thesauri (e.g. Roget's)
illness: disease, sickness (same part of speech (pos)); unwell, sick, ill (different pos), …
Term equivalence classes
PhD: Ph.D, PhD, Phd, Ph.D., …
Loosely related topic sets
DOG: Spaniel, Collie, Schnauzer, bulldog, Poodle, …
Term Clustering
Non-hierarchical methods: single pass (Salton, '71)
Given a clustering threshold/target size and a similarity function sim(i, j):
Pick a random document Dj
Assign a document di to cluster Cj if sim(Dj, di) meets the threshold, and recalculate the centroid
Otherwise create a new cluster Ck with di as its centroid
Exclude di from the document list
Repeat until the document list is empty
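The steps above can be sketched as follows. This is an illustration under stated assumptions, not Salton's original formulation: cosine similarity stands in for sim, a document joins the most similar cluster when the similarity is at least the threshold, and the centroid is a running mean. The function names and the toy vectors are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold, seed=0):
    """Cluster document vectors in one pass over a randomly ordered list."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)  # random starting document and order
    clusters = []  # each cluster: {"members": [doc indices], "centroid": vector}
    for i in order:
        best, best_sim = None, threshold
        for cluster in clusters:
            s = cosine(cluster["centroid"], docs[i])
            if s >= best_sim:  # assumption: higher similarity means closer
                best, best_sim = cluster, s
        if best is None:
            # No cluster is similar enough: the document seeds a new cluster.
            clusters.append({"members": [i], "centroid": list(docs[i])})
        else:
            best["members"].append(i)
            n = len(best["members"])
            # Update the centroid as the running mean of its members.
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], docs[i])]
    return clusters


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = single_pass(docs, threshold=0.8)
groups = sorted(sorted(c["members"]) for c in result)
```

Because each document is seen only once, the outcome can in general depend on the processing order; the toy data here is separated well enough that any order yields the same two groups.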
[Figure: scatter plot of documents shown as points, with three labeled documents D1, D2 and D3]
Types of Clustering
Behavior/Criterion for Sim(ti, tj)
Document level: co-occurrence in same document
Verb-object:
Syntagmatic similarity (appears together in region): sim(drink, wine), sim(eat, meat), sim(drink, water)
Paradigmatic similarity (based on appearing as objects of the same verb, e.g. object of drink, or of all verbs): sim(wine, water), sim(wine, drink), sim(wine, meat)
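A small sketch of the verb-object case, using the three pairs from the slide. Measuring paradigmatic similarity as the Jaccard overlap of the verbs that take each noun as object is an assumption made for this example; the slides do not name a specific measure, and the helper names are hypothetical.

```python
from collections import defaultdict

# Verb-object pairs taken from the slide examples.
PAIRS = [("drink", "wine"), ("eat", "meat"), ("drink", "water")]


def verbs_by_object(pairs):
    """Collect, for each object noun, the set of verbs it appears with."""
    table = defaultdict(set)
    for verb, obj in pairs:
        table[obj].add(verb)
    return dict(table)


def syntagmatic_sim(verb, obj, pairs):
    """1 if the verb and object are observed together, else 0."""
    return 1 if (verb, obj) in pairs else 0


def paradigmatic_sim(a, b, table):
    """Jaccard overlap of the verb sets of two nouns (assumed measure)."""
    va, vb = table.get(a, set()), table.get(b, set())
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0


table = verbs_by_object(PAIRS)
wine_water = paradigmatic_sim("wine", "water", table)
wine_meat = paradigmatic_sim("wine", "meat", table)
```

With only these three pairs, wine and water share their single verb while wine and meat share none, so the measure separates the two cases completely; a larger corpus would give graded values.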
N-gram
Syntagmatic similarity (occur together): sim(Hong, Kong), sim(soap, opera), sim(soap, suds)
Paradigmatic similarity (occur in same context): sim(opera, suds), sim(tall, short), sim(long, short), sim(Hong, Kong)
Example contexts: soap + {opera, suds, residue}; {Ivory, Dial, Lye} + soap
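The n-gram case can be sketched with bigram counts over the phrases from the slide. Counting adjacent occurrences for syntagmatic similarity and comparing sets of left and right neighbors for paradigmatic similarity are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Two-word phrases built from the slide's examples.
PHRASES = ["soap opera", "soap suds", "soap residue",
           "Ivory soap", "Dial soap", "Lye soap"]


def bigram_counts(phrases):
    """Count adjacent word pairs across all phrases (lowercased)."""
    counts = Counter()
    for phrase in phrases:
        tokens = phrase.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def neighbor_contexts(counts):
    """For each word, record which words precede and follow it."""
    ctx = defaultdict(set)
    for left, right in counts:
        ctx[left].add(("next", right))
        ctx[right].add(("prev", left))
    return dict(ctx)


def syntagmatic(a, b, counts):
    """How often the two words occur next to each other, in either order."""
    return counts[(a, b)] + counts[(b, a)]


def paradigmatic(a, b, ctx):
    """Jaccard overlap of the neighbor contexts of two words (assumed measure)."""
    ca, cb = ctx.get(a, set()), ctx.get(b, set())
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


counts = bigram_counts(PHRASES)
ctx = neighbor_contexts(counts)
```

Here `syntagmatic("soap", "opera", counts)` is nonzero because the two words are adjacent, while `paradigmatic("opera", "suds", ctx)` is high because both words follow soap without ever appearing next to each other.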