Automatic Acquisition of
Paradigmatic Relations
using
Iterated Co-occurrences
Chris Biemann, Stefan Bordag, Uwe Quasthoff
University of Leipzig, NLP Department
LREC 2004, Learning & Acquisition (II), 27th of May 2004
Sets of Words
• Our goal is the automatic extension of homogeneous
word sets, e.g. WordNet synsets or small subtrees of
some hierarchy
• We collect methods and apply them, possibly in
combination
• Thought experiment: the computer as „associator“:
Input: some example concepts
- Detection of the relation
- Output of additional instances
This can be done in a semi-supervised way
• Necessary:
- very large text corpus
- features
- methods
Chris Biemann
2
Statistical Co-occurrences
• occurrence of two or more words within a well-defined
unit of information (sentence, nearest neighbors)
• Significant Co-occurrences reflect relations between
words
• Significance Measure (log-likelihood):
- k is the number of sentences containing a and b together
- ab is (number of sentences with a)*(number of sentences with b)
- n is total number of sentences in corpus
$$\mathrm{sig}(A, B) = x - k \log x + \log k!, \qquad x = \frac{ab}{n}$$
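A minimal sketch of the measure, assuming it has the Poisson-style form sig = x - k log x + log k! with x = ab/n and natural logarithms. The function name and the example counts are hypothetical, not taken from the slides.

```python
import math

def significance(k, a, b, n):
    """Co-occurrence significance x - k*log(x) + log(k!), with x = a*b/n.

    k: sentences containing both words
    a, b: sentences containing the first / second word
    n: total number of sentences in the corpus
    """
    if min(k, a, b, n) <= 0:
        raise ValueError("all counts must be positive")
    x = a * b / n
    # lgamma(k + 1) equals log(k!) and avoids huge factorials
    return x - k * math.log(x) + math.lgamma(k + 1)

# Hypothetical counts: same word frequencies, different joint counts
weak = significance(k=2, a=100, b=200, n=100_000)
strong = significance(k=20, a=100, b=200, n=100_000)
```

With these counts the expected joint frequency x is 0.2, so 20 joint sentences score much higher than 2.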
Iterating Co-occurrences
• (sentence-based) co-occurrences of first order:
words that co-occur significantly often together in sentences
• co-occurrences of second order:
words that co-occur significantly often in collocation sets of first order
• co-occurrences of n-th order:
words that co-occur significantly often in collocation sets of (n-1)th order
When calculating a higher order, the significance values of the
preceding order are not relevant. A co-occurrence set
consists of the N highest ranked co-occurrences of a word.
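The iteration can be sketched as follows: the co-occurrence sets of one order are treated as the "sentences" of the next order. This is an illustrative sketch under assumptions, not the authors' implementation. The function names, the `min_count` cut-off, and the toy corpus are hypothetical, and the significance is assumed to be x - k log x + log k! with x = ab/n.

```python
import math
from collections import Counter
from itertools import combinations

def significance(k, a, b, n):
    x = a * b / n
    return x - k * math.log(x) + math.lgamma(k + 1)

def cooccurrence_sets(units, top_n=3, min_count=2):
    """Map each word to its top_n co-occurrences within the given units.

    units: sets of words (sentences, or co-occurrence sets of the
    preceding order). min_count is an assumed frequency cut-off.
    """
    units = [set(u) for u in units]
    n = len(units)
    freq = Counter(w for u in units for w in u)
    pairs = Counter(frozenset(p) for u in units
                    for p in combinations(sorted(u), 2))
    ranked = {}
    for pair, k in pairs.items():
        if k < min_count:
            continue
        w1, w2 = tuple(pair)
        sig = significance(k, freq[w1], freq[w2], n)
        ranked.setdefault(w1, []).append((sig, w2))
        ranked.setdefault(w2, []).append((sig, w1))
    # Only the ranking is passed on; significance values are dropped
    return {w: {v for _, v in sorted(c, reverse=True)[:top_n]}
            for w, c in ranked.items()}

def iterated_cooccurrences(sentences, order, top_n=3):
    sets = cooccurrence_sets(sentences, top_n)
    for _ in range(order - 1):
        sets = cooccurrence_sets(list(sets.values()), top_n)
    return sets

# Toy corpus: "dog" and "terrier" never share a sentence
sentences = [
    {"dog", "barking"}, {"dog", "barking"}, {"dog", "bite"}, {"dog", "bite"},
    {"terrier", "barking"}, {"terrier", "barking"},
    {"terrier", "bite"}, {"terrier", "bite"},
    {"cat", "mouse"}, {"cat", "mouse"},
]
first = iterated_cooccurrences(sentences, order=1)
second = iterated_cooccurrences(sentences, order=2)
```

In this toy corpus "terrier" is absent from the first-order set of "dog" but appears in its second-order set, because both words share the first-order co-occurrences "barking" and "bite".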
Constructed Example I
[Tables: first-order (Ord 1) and second-order (Ord 2) co-occurrence matrices over the words dog, terrier, cat, mouse, barking, bite, yelp]
Constructed Example II
[Tables: second-order (Ord 2) and third-order (Ord 3) co-occurrence matrices over the words dog, terrier, cat, mouse, barking, bite, yelp]
Properties of
Iterated Co-occurrences
• after some iterations the sets remain more or less stable
• the sets are somewhat semantically homogeneous
• sometimes they have nothing to do with the reference word
• calculations were performed up to the 10th order
• Example for TOP 20 NB-collocations of 10th order for
„erklärte“ [explained]:
sagte, schwärmte, lobt, schimpfte, meinte, jubelte, lobte,
resümierte, schwärmt, Reinhard Heß, ärgerte, kommentierte,
urteilte, analysierte, bilanzierte, freute, freute sich,
Bundestrainer, freut, gefreut
[said, enthused, praises, grumbled, meant, was jubilant, praised,
summarized, dreamt, Reinhard Hess, annoyed, commentated, judged,
analyzed, balanced, made happy, was pleased, coach of the national
team, is pleased, been pleased]
Mapping co-occurrences to graphs
• For all words having co-occurrences, form nodes in a
graph.
• Connect them all by edges, initialize edge weight with 0
• For every co-occurrence of two words in a sentence,
increase edge weight by significance
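The graph construction above can be sketched with a sparse adjacency map, where a missing edge stands for weight 0. The function name and the example triples are hypothetical. The significance values are assumed to have been computed beforehand.

```python
from collections import defaultdict

def build_graph(cooccurrences):
    """Build an undirected weighted graph from (word, word, significance) triples.

    Edges that never receive weight are simply absent, which plays the
    role of the initial weight 0.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for a, b, sig in cooccurrences:
        graph[a][b] += sig
        graph[b][a] += sig
    return graph

# Hypothetical co-occurrence observations
graph = build_graph([
    ("dog", "barking", 2.0),
    ("dog", "barking", 1.5),
    ("cat", "mouse", 3.0),
])
```

Repeated observations of the same pair accumulate on one edge, so `graph["dog"]["barking"]` holds the summed significance.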
First Iteration Step
• The two black nodes A and B get connected in the
step if there are many nodes C which are connected
to both A and B
• The more Cs, the higher the weight of the new edge
[Figure legend: existing connection, new connection]
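One way to sketch a single iteration step is to count, for every pair of nodes, the nodes connected to both. The slides only say that "many" shared nodes are needed, so the threshold `min_shared`, the function name, and the toy neighbourhoods are assumptions.

```python
from itertools import combinations

def iteration_step(neighbors, min_shared=2):
    """Return new edges {(a, b): weight} for one iteration step.

    neighbors: dict mapping each node to the set of nodes it is
    connected to. The new weight is the number of shared nodes C.
    """
    new_edges = {}
    for a, b in combinations(sorted(neighbors), 2):
        shared = (neighbors[a] & neighbors[b]) - {a, b}
        if len(shared) >= min_shared:
            new_edges[(a, b)] = len(shared)
    return new_edges

# Hypothetical first-order neighbourhoods
neighbors = {
    "dog": {"barking", "bite", "yelp"},
    "terrier": {"barking", "bite", "yelp"},
    "barking": {"dog", "terrier"},
    "bite": {"dog", "terrier"},
    "yelp": {"dog", "terrier"},
}
new_edges = iteration_step(neighbors)
```

Here "dog" and "terrier" share three nodes and get a new edge of weight 3, while "barking" and "dog" share none and stay unconnected in the new graph.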
Second Iteration Step
• The two black nodes A and B get connected in the
step if there are many (dark grey) nodes Ds which
are connected to both A and B.
• The connections between the nodes Ds and the
nodes A and B were constructed because of (light
gray) nodes Es and Fs, respectively
[Figure: nodes A, B, Ds, Es, Fs; legend: former connection, existing connection, new connection]
Collapsing bridging nodes
• Upper bound for path length in iteration n is 2^n.
• However, some of the bridging nodes collapse, giving
rise to self-sustaining clusters of arbitrary path length,
which are invariant under iteration.
Upper 5 nodes: invariant cluster
A, B are being absorbed by this cluster
Examples of
Iterated Co-occurrences
Order | Reference word | TOP-10 collocations
N2 | wine | wines, champagne, beer, water, tea, coffee, Wine, alcoholic, beers, cider
S10 | wine | wines, grape, sauvignon, chardonnay, noir, pinot, cabernet, spicy, bottle, grapes
S1 | ringing | phone, bells, phones, hook, bell, endorsement, distinctive, ears, alarm, telephone
S2 | ringing | rung, Centrex, rang, phone, sounded, bell, ring, FaxxMaster, sound, tolled
S4 | ringing | sounded, rung, rang, tolled, tolling, sound, tone, toll, ring, doorbell
S10 | pressing | Ctrl, Shift, press, keypad, keys, key, keyboard, you, cursor, menu, PgDn, keyboards, numeric, Alt, Caps, CapsLock, NUMLOCK, NumLock, Scroll
Intersection of Co-occurrence
Sets: resolving ambiguity
[Figure: overlapping co-occurrence sets of the reference words Herz-Bube, Becker and Stich, showing the following word groups]

bedient - folgenden - gereizt - Karo-Buben - Karo-Dame - Karo-König - Karte - Karten - Kreuz-Ass - Kreuz-Dame - Kreuz-Hand - Kreuz-König - legt - Mittelhand - Null ouvert - Pik - Pik-Ass - Pik-Dame - schmiert - Skat - spielt - Spielverlauf - sticht - übernimmt - zieht

Agassi - Australian Open - Bindewald - Boris - Break - Chang - Dickhaut - gewann - Ivanisevic - Kafelnikow - Kiefer - Komljenovic - Leimen - Matchball - Michael Stich - Monte Carlo - Prinosil - Sieg - Spiel - spielen - Steeb - Teamchef

Achtelfinale - Aufschlag - Boris Becker - Daviscup - Doppel - DTB - Edberg - Finale - Graf - Haas - Halbfinale - Match - Pilic - Runde - Sampras - Satz - Tennis - Turnier - Viertelfinale - Weltrangliste - Wimbledon

Alleinspieler - Herz - Herz-Dame - Herz-König - Hinterhand - Karo - Karo-As - Karo-Bube - Kreuz-As - Kreuz-Bube - Pik-As - Pik-Bube - Pik-König - Vorhand

Becker - Courier - Einzel - Elmshorn - French Open - Herz-As - ins - Kafelnikow - Karbacher - Krajicek - Kreuz-As - Kreuz-Bube - Michael Stich - Mittelhand - Pik-As - Pik-Bube - Pik-König
Example: NB-collocations of 2nd order
warm, kühl, kalt
warm: abgekühlt, abkühlen, angestiegen, anzeigt, aufgeheizt, eingefroren, erhitzt, erwärmt, fertig, gebrannt, gefallen, gehalten, geklettert, gekühlt, gelagert, gemessen, gesenkt, gestiegen, gesunken, gut, Heiß, heruntergekühlt, hoch, höher, kalt, kalte, kalten, ...
kühl: abgeklärt, abgekühlt, abkühlen, ablehnend, abstrakt, aggressiv, ähnlich, altmodisch, anders, archaisch, aufgeheizt, aushalten, bedrohlich, bescheiden, bitter, blaß, blutleer, distanziert, eingefroren, empfindlich, empört, entrüstet, entsetzt, entspannt, erhitzt, erleichtert, erschöpft, ...
kalt: abgekühlt, abkühlen, angestiegen, anzeigt, aufgeheizt, aushalten, eingefroren, einstellen, erhitzt, ernst, erwärmt, frei, gebrannt, gefallen, gehalten, geklettert, gekühlt, gelagert, gemessen, genug, gesenkt, gestiegen, hart, heiß, heruntergekühlt, hoch, höher, ...
• Intersecting the collocation sets for
warm, kühl, kalt [warm, cool, cold]
and filtering for adjectives results in:
abgekühlt, aufgeheizt, eingefroren,
erhitzt, erwärmt, gebrannt, gelagert,
heiß, heruntergekühlt, verbrannt,
wärmer
[cooled down, heated, frozen, heated up,
warmed up, burned, stored, hot, down-cooled, burned, warmer]
• the emotional reading „abweisend“
[dismissive] of kühl, kalt is eliminated
Detection of X-onyms
synonyms, antonyms, (co)-hyponyms...
• Idea: Intersection of co-occurrence sets of two X-onyms as
reference words should contain X-onyms
• lexical ambiguity of one reference word does not deteriorate the
result set
• Method:
- Detect word class for reference words
- calculate co-occurrences for reference words
- filter co-occurrences w.r.t the word class of the reference
words (by means of POS tags)
- intersect the co-occurrence sets
- output result
• ranking can be realized over significance values of the
co-occurrences
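The method steps can be sketched as below, assuming POS tags and significance values are already available as dictionaries. The slides do not say how significance values are combined for ranking, so the minimum across the sets is an assumption. The function name, tags, and scores are hypothetical.

```python
def expand_word_set(reference_words, cooc, pos_tags, word_class):
    """Intersect POS-filtered co-occurrence sets of the reference words.

    cooc: word -> {co-occurring word: significance}
    pos_tags: word -> POS tag
    word_class: tag shared by the reference words
    Ranking uses the minimum significance across sets (an assumption).
    """
    filtered = [
        {w: s for w, s in cooc[ref].items() if pos_tags.get(w) == word_class}
        for ref in reference_words
    ]
    common = set(filtered[0])
    for f in filtered[1:]:
        common &= set(f)
    common -= set(reference_words)

    def score(word):
        return min(f[word] for f in filtered)

    return sorted(common, key=lambda w: (-score(w), w))

# Hypothetical significance values and POS tags
cooc = {
    "warm": {"abgekühlt": 5.0, "abkühlen": 7.0, "aufgeheizt": 6.0, "fertig": 3.0},
    "kalt": {"abgekühlt": 4.0, "abkühlen": 6.0, "aufgeheizt": 8.0, "ernst": 2.0},
}
pos_tags = {"abgekühlt": "ADJ", "abkühlen": "VERB", "aufgeheizt": "ADJ",
            "fertig": "ADJ", "ernst": "ADJ"}
result = expand_word_set(["warm", "kalt"], cooc, pos_tags, "ADJ")
```

The verb "abkühlen" is removed by the word-class filter, and "fertig" and "ernst" drop out because each occurs in only one set.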
Mini-Evaluation
• Experiments for different data sources, NB-collocations of 2nd and
3rd order
• fraction of X-onyms in TOP 5 higher than in TOP 10: the ranking
method makes sense
• intersection of 2nd-order and 3rd-order collocations almost always
empty: different orders exhibit different relations
• satisfactory quantity, more through larger corpora
• quality: not precise enough for unsupervised extension
Word Sets for Thesaurus
Expansion
Application: thesaurus expansion
start set: [warm, kalt] [warm, cold]
result set: [heiß, wärmer, kälter, erwärmt, gut, heißer, hoch,
höher, niedriger, schlecht, frei] [hot, warmer, colder, warmed,
good, hotter, high, higher, lower, bad, free]
start set: [gelb, rot] [yellow, red]
result set: [blau, grün, schwarz, grau, bunt, leuchtend, rötlich,
braun, dunkel, rotbraun, weiß] [blue, green, black, grey, colorful,
bright, reddish, brown, dark, red-brown, white]
start set: [Mörder, Killer] [murderer, killer]
result set: [Täter, Straftäter, Verbrecher, Kriegsverbrecher,
Räuber, Terroristen, Mann, Mitglieder, Männer, Attentäter]
[offender, delinquent, criminal, war criminal, robber, terrorists,
man, members, men, assassin]
More Examples in English
Intersection of N2-Order collocation sets
Questions?
THANK YOU !