Negra corpus: German newspaper texts from the Frankfurter

COOC
Practicum / Software Project, SS 2000
Final Report
Tanja von den Berg, Tilman Jäger, Kerstin Klöckner,
Stephan Lesch, Holger Neis, Norbert Pfleger,
Diana Raileanu, Hubert Schlarb
Supervisors: Jan Alexandersson, Paul Buitelaar
Contents
• Intro
• Theoretical foundations
• At the outset
• Project aspects
– Preprocessing
– Training
– Application
– Evaluation
• Outlook
30.11.00
COOC
2
Intro
• Word Sense Disambiguation (WSD) as preparation for
semantic analysis of text documents
• Application areas: translation systems, information retrieval
systems, document classification, etc.
• Machine learning approaches:
- supervised (semantically tagged corpora)
- unsupervised (untagged corpora)
• COOC: the first unsupervised, corpus-based approach for
German
Theoretical Foundations
WSD (Word Sense Disambiguation) in context:
E.g. German Bank: bench (place to sit) vs. financial institution
I'm going to the bank to get some money.
COOC: cooccurrence of words in a given context
GermaNet: (WordNet for German)
WordNet:
- lexical and semantic database
- semantic net, ontology
- lexical and conceptual relations
(antonymy, hyponymy)
Theoretical Foundations (II)
Method:
- knowledge sources (WordNet, Thesaurus)
- the possibility of finding relations between words and
meanings
supervised:
- requires already disambiguated data
- requires large amounts of data
unsupervised:
- requires even more data
- data need not be disambiguated
Theoretical Foundations (III)
Examples of unsupervised methods:
• Lesk (1986): comparison among dictionary entries
• Yarowsky (1992):
- Roget's Thesaurus, Grolier's Encyclopedia
- collections of contexts for a thesaurus category
- identification of characteristic words
• Resnik (1997):
- Penn Treebank corpus, POS-tagged, syntactically annotated
- selectional preference (predicate arguments)
At the outset
Approach of Seligman (94):
• Japanese dialogues (direction finding, hotel
reservations in spontaneous speech)
• thesaurus with 4 fixed abstraction levels
• explicit semantic smoothing
COOC project:
• Tiger corpus (Frankfurter Rundschau)
• GermaNet with varying number of abstraction
levels (up to 26)
• implicit semantic smoothing
Flow diagram
Preprocessing
• Conversion of the training corpus (plain text) into
the COOC format
• Statistics on GermaNet categories
Resources
• Tiger corpus (1,051,446 tokens)
- German newspaper text from the Frankfurter Rundschau
• TnT tagger (Brants 2000)
- statistical part-of-speech tagger
• Mmorph (Petitpierre & Russell, 1995)
- morphological analysis tool
• GermaNet:
- lexico-semantic network for German (about 25,000 nouns, 6,000 verbs, 3,500 adjectives)
COOC-Format
Philip Glass wurde auf seinen weltweiten Tourneen mit Kassetten und Tonbändern überschüttet.
(Philip Glass was showered with cassettes and audio tapes during his worldwide tours.)
...
166 seinen NA PPOSAT
167 weltweiten weltweit ADJA [ 113815 113669 111763 111559 ]
...
172 Tonbändern Tonband NN [ 75749 ... 1749365 ] ... [ 75749 ... 144863 ]
173 überschüttet überschütten VVPP [ 353400 ... 226602 ] [ 353400 ... 2266023 ]
...
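A line in this format can be read mechanically. The Python sketch below assumes four whitespace-separated fields (token ID, word form, lemma, POS tag) followed by zero or more bracketed chains of GermaNet offsets. The function name `parse_cooc_line` and the dictionary layout are illustrative choices, not part of the COOC tools, and elided chains (written with "..." on the slide) are not handled.

```python
import re

def parse_cooc_line(line):
    """Split one COOC-format line into its fields (assumed layout)."""
    # Everything before the first '[' holds the plain fields.
    head = line.split("[", 1)[0]
    token_id, word, lemma, pos = head.split()[:4]
    # Each bracketed group is read as one chain of GermaNet offsets.
    chains = [
        [int(number) for number in group.split()]
        for group in re.findall(r"\[([^\]]*)\]", line)
    ]
    return {"id": int(token_id), "word": word, "lemma": lemma,
            "pos": pos, "chains": chains}

entry = parse_cooc_line(
    "167 weltweiten weltweit ADJA [ 113815 113669 111763 111559 ]"
)
plain = parse_cooc_line("166 seinen NA PPOSAT")
print(entry["lemma"], entry["chains"])
print(plain["pos"], plain["chains"])
```

Words without GermaNet categories, such as token 166, simply come back with an empty chain list.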
GermaNet Hierarchy
Statistics on GermaNet Categories
• Omission of higher-frequency categories
• Reduction of computational complexity
• Format: Frequency ID(Offset) Synset
• Example:
70725 1749365 Objekt_0
43450 369009 Situation_0
...........
2 843903 Kofferraum_0
1 695036 Intellekt_0_Genius_0
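The frequency list can drive the omission of very frequent categories. A minimal Python sketch, assuming the list has been read into (frequency, offset, synset) tuples. The helper name `keep_categories` and the threshold `max_freq` are hypothetical, since the slides do not state the actual cut-off.

```python
def keep_categories(stats, max_freq):
    """Return the offsets of categories whose frequency is at most max_freq."""
    return {offset for freq, offset, _synset in stats if freq <= max_freq}

# The four rows shown in the example above.
stats = [
    (70725, 1749365, "Objekt_0"),
    (43450, 369009, "Situation_0"),
    (2, 843903, "Kofferraum_0"),
    (1, 695036, "Intellekt_0_Genius_0"),
]

# 10000 is an arbitrary illustrative threshold, not a project setting.
kept = keep_categories(stats, max_freq=10000)
print(sorted(kept))
```

With this threshold the two very general categories (Objekt_0, Situation_0) are dropped and the two rare ones remain.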
Segmentation...
...at sentence boundaries:
Landesbank schlägt Verträge zwischen Stadt und privaten Investoren vor
Überall wird gebuddelt und gemauert.
Hamburg erlebt den größten Geschäftsbau-Boom.
Jährlich hinzukommen rund 300 000 Quadratmeter an Büroräumen.
...or e.g. after every 3 significant words:
Landesbank schlägt Verträge zwischen Stadt und privaten
Investoren vor Überall wird gebuddelt und gemauert. Hamburg erlebt
den größten Geschäftsbau-Boom. Jährlich hinzukommen rund
300 000 Quadratmeter an Büroräumen.
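The second segmentation method can be sketched in Python as follows. The sketch assumes each token already carries a flag saying whether it is significant (has GermaNet categories). The flags in the example are made up for illustration, and attaching insignificant words to the segment currently being filled is an assumption of this sketch.

```python
def segment_by_significant(tokens, k):
    """Close a segment after every k significant tokens.

    tokens: list of (word, is_significant) pairs.
    Trailing tokens that do not complete a segment form a final, shorter one.
    """
    segments, current, seen = [], [], 0
    for word, is_significant in tokens:
        current.append(word)
        if is_significant:
            seen += 1
        if seen == k:
            segments.append(current)
            current, seen = [], 0
    if current:
        segments.append(current)
    return segments

# Significance flags are illustrative, not taken from the corpus.
tokens = [("Landesbank", True), ("schlägt", True), ("Verträge", True),
          ("zwischen", False), ("Stadt", True), ("und", False),
          ("privaten", True), ("Investoren", True), ("vor", False)]
result = segment_by_significant(tokens, 3)
for segment in result:
    print(segment)
```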
Windows
Text window: n segments with the current segment in the middle
wider scope than n-grams
[Figure: segments S(i) to S(i+4) and windows W(t), W(t+1), W(t+2), for n=3]
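The window idea can be sketched as a generator over a list of segments. This is an illustration only: it assumes an odd window width n and does not pad the first and last segments, neither of which is stated on the slides.

```python
def windows(segments, n):
    """Yield (window, current) pairs: n consecutive segments and the middle one."""
    half = n // 2
    for start in range(len(segments) - n + 1):
        window = segments[start:start + n]
        yield window, window[half]

segments = ["S0", "S1", "S2", "S3", "S4"]
result = list(windows(segments, 3))
for window, current in result:
    print(window, "current:", current)
```

Five segments and n=3 give three windows, matching W(t), W(t+1), W(t+2) in the figure.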
Training: unsupervised
Compare "Peter goes by train" with "Diana goes by bike":
train and bike should both come out as VEHICLES, although each word has different ambiguities
Statistics
For a pair of categories:
• conditional probability:
$\Pr(c_2 \mid c_1) = \dfrac{\Pr(c_2, c_1)}{\Pr_S(c_1)}$
• mutual information:
$MI(c_2, c_1) = \log_2 \dfrac{\Pr(c_2 \mid c_1)}{\Pr_W(c_2)}$
Effect: correct category combinations emerge statistically
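The two statistics translate directly into code. The Python sketch below takes the probabilities as given inputs; how Pr_S and Pr_W are estimated from the corpus is not stated on the slide, so the numbers used here are made up and chosen only to keep the arithmetic exact.

```python
import math

def conditional_probability(p_joint, p_s_c1):
    """Pr(c2 | c1) = Pr(c2, c1) / Pr_S(c1)."""
    return p_joint / p_s_c1

def mutual_information(p_cond, p_w_c2):
    """MI(c2, c1) = log2( Pr(c2 | c1) / Pr_W(c2) )."""
    return math.log2(p_cond / p_w_c2)

# Illustrative values only (powers of two, so the results are exact).
p_cond = conditional_probability(p_joint=0.125, p_s_c1=0.5)  # 0.25
mi = mutual_information(p_cond, p_w_c2=0.0625)               # log2(4) = 2.0
print(p_cond, mi)
```

A positive value means c2 appears with c1 more often than its overall probability would suggest.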
Training: Parameters
• Segmentation methods
• Window width
Limiting computation time and memory requirements:
• exclusion of certain POS combinations
• only categories in certain frequency intervals
• only pairs with frequency > minimum
Application
• Actual disambiguation process
– input: sentences/text in COOC format,
containing ambiguous words
– output: disambiguated sentences/words
– requires training results
Procedure
• Connection to the training database
– selection of parameters (window and segment size) of the training database
• Text processing
– construction of the initial windows
– disambiguation of the current segment
– results are written to the output data
Procedure (II)
• Window handling:
– the middle (current) segment is disambiguated word by word
– at the last segment, the window is moved one segment to the right
[Figure: segments S(i), S(i+1), S(i+2), S(i+3), S(i+4)]
Procedure (III)
• Handling the words in the middle (current) segment
– distinguish significant vs. insignificant words (with and without GermaNet categories)
– for significant words, the most probable meaning is computed and output
– insignificant words are written unchanged to the output data
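The handling of the current segment can be sketched as a simple loop. In this Python sketch a word is a (word, meanings) pair, with an empty list for insignificant words. The names `process_segment` and `pick_first` are hypothetical, and `pick_first` is only a stand-in for the real scoring step.

```python
def process_segment(segment, choose_meaning):
    """Disambiguate significant words; pass insignificant ones through unchanged."""
    output = []
    for word, meanings in segment:
        if meanings:
            # Significant word: it has GermaNet categories, so pick one meaning.
            output.append((word, choose_meaning(word, meanings)))
        else:
            # Insignificant word: no categories, written out unchanged.
            output.append((word, None))
    return output

def pick_first(word, meanings):
    """Placeholder chooser; a real run would use the trained statistics."""
    return meanings[0]

# Meaning labels are invented for illustration.
segment = [("sie", []), ("nutzen", ["sense_a", "sense_b"]), ("die", [])]
result = process_segment(segment, pick_first)
print(result)
```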
Probability of the Appearance of a Category in Context
$\sum_{i=1}^{n} MI(c_i, c_0) \, PR(c_0 \mid c_i)$
• where:
– MI: mutual information
– PR: conditional probability
– c_0: current category
– c_i: context category
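Reading the formula as a sum of MI(c_i, c_0) times PR(c_0 | c_i) over the context categories, the score can be sketched in Python as below. The lookup tables `mi` and `pr`, their key order, and the rule that unseen pairs contribute nothing are assumptions of this sketch; the category names and values are invented.

```python
def context_score(c0, context, mi, pr):
    """Sum MI(c_i, c0) * PR(c0 | c_i) over all context categories c_i.

    mi is keyed by (c_i, c0) and pr by (c0, c_i).
    Pairs missing from either table contribute 0 (an assumption).
    """
    return sum(
        mi.get((ci, c0), 0.0) * pr.get((c0, ci), 0.0)
        for ci in context
    )

# Invented training statistics for one current category "X".
mi = {("A", "X"): 2.0, ("B", "X"): 1.0}
pr = {("X", "A"): 0.5, ("X", "B"): 0.25}

score = context_score("X", ["A", "B", "C"], mi, pr)
print(score)  # 2.0 * 0.5 + 1.0 * 0.25 + 0 = 1.25
```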
Calculation of the most probable meaning
$\max \left( \dfrac{\sum_{i=1}^{n} PC_i}{n} \right)$
• where:
– PC: probability of the appearance of a category given a context
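One way to read this step: each candidate meaning comes with a list of PC values, their average is taken, and the meaning with the highest average wins. The Python sketch below implements that reading. Treating the PC values as one per category of a meaning is an assumption, and the meaning names and numbers are invented.

```python
def most_probable_meaning(candidates):
    """Return the meaning whose PC values have the highest average.

    candidates: dict mapping a meaning name to a non-empty list of PC values.
    """
    def average(values):
        return sum(values) / len(values)

    return max(candidates, key=lambda name: average(candidates[name]))

# Invented PC values for two competing meanings of one word.
candidates = {
    "sense_a": [0.5, 0.25],        # average 0.375
    "sense_b": [1.0, 0.5, 0.75],   # average 0.75
}
best = most_probable_meaning(candidates)
print(best)
```

Dividing by n keeps meanings with long category chains from winning simply because they have more terms in the sum.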
Example: Disambiguation
Folklore, Rock, Klassik und Jazz zu vermischen reicht ihnen nicht, sie nutzen die Elektronik und sind sogar dazu übergegangen, Instrumente selbst zu bauen.
(Not satisfied to merely mix up Folk, Rock, Classical, and Jazz, they make use of Electronic Music as well, and go so far as to build their own instruments.)
3002 Rock Rock NN 2 Rock_0
3002 Rock Rock NN [ 39981 ... 3228 ] [ 39981 ... 3228 ]
3004 Klassik Klassik NN 1 Klassik_0
3004 Klassik Klassik NN [ 221503 ... 221266 ]
3008 vermischen vermischen VVINF 1 vermengen_0_vermischen_0
3008 vermischen vermischen VVINF [ 643704 643048 ]
3009 reicht reichen VVFIN 7 reichen_0
3009 reicht reichen VVFIN [ 21538 ] [ 339847 307402 ] [ 581324 ... 568361 ]
3014 nutzen nutzen VVFIN 2 nutzen_2_nützen_2
[ 581324 ... 862674 ] [ 581324 ... 912753 ] [ 586102 585849 ] [ 588150 ... 586261 ]
3016 Elektronik Elektronik NN 1 Elektronik_0
3016 Elektronik Elektronik NN [ 405356 ... 383322 ]
3023 Instrumente Instrument NN 2 Musikinstrument_0_Instrument_2
3023 Instrumente Instrument NN [ 5357 3228 ] [ 142311 ... 3228 ]
3026 bauen bauen VVINF 4 bauen_3
3026 bauen bauen VVINF [ 650176 647379 ] [ 742021 ... 734399 ] [ 743571 ... 734399 ] [ 743710 735354 734399 ]
Evaluation: Comparison
Test corpus
1017 Komponisten Komponist NN 1 Komponist_0_Komponistin_0
2010 Möglichkeiten Möglichkeit NN 2 Möglichkeit_2_Eventualität_0
14011 verfügbar verfügbar ADJD 0
14014 machen machen VVINF 6 betätigen_0_treiben_0_machen_0
24006 wirkt wirken VVFIN 6 wirken_2
Evaluation corpus (Negra/Lexsem corpus)
1017 Komponisten Komponist NN Komponist_0_Komponistin_0
2010 Möglichkeiten Möglichkeit NN Möglichkeit_2_Chance_0_Gelegenheit_0
14011 verfügbar verfügbar ADJD unknown
14014 machen machen VVINF unspec
24006 wirkt wirken VVFIN wirken_2
Meanings in the test corpus
2346 words annotated with 3.1 meanings per word on average,
1366 of these ambiguous, with an average of 4.6 meanings
[Bar chart: number of words (0 to 1000) by number of meanings per word (1 to 29); the tallest bar, 980 words, is at 1 meaning]
Results (3 segments per window)
count: 1882
trivial: 773
Segment size              0 (sentence)      2      5      7     10     15
hitcount                           703    586    688    718    724    720
incorrect                          347    523    421    349    357    366
not disambiguated                   59    210     52     42     28     23
Precision (all) [32.3%]          80.97  81.28  79.84  81.03  80.74  80.31
Precision (amb.) [21.7%]         66.95  52.84  62.04  67.30  66.98  66.30
Recall                           96.51  88.51  96.88  97.41  98.15  98.41
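The slides do not say how the precision figures are computed. One reading that reproduces the reported values for the sentence-segment column is precision (amb.) = hitcount / (hitcount + incorrect) and precision (all) = (trivial + hitcount) / (trivial + hitcount + incorrect). The Python check below is an inference from the numbers, not a definition taken from the project.

```python
# Values from the sentence-segment column of the results.
trivial, hit, incorrect = 773, 703, 347

# Assumed formulas (inferred from the reported figures).
precision_amb = 100 * hit / (hit + incorrect)
precision_all = 100 * (trivial + hit) / (trivial + hit + incorrect)

print(f"{precision_amb:.2f} {precision_all:.2f}")  # 66.95 80.97
```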
Summary
COOC:
• is the first unsupervised, corpus-based method of
disambiguating semantically ambiguous words for
German
• goes beyond n-gram statistics
• uses plain text, GermaNet, MMorph and a POS tagger
• is a tool for unsupervised learning, semantic tagging,
and evaluation
• first evaluation gives 67.3% precision on ambiguous words (81% on all words) and 97.4% recall
Outlook
• Use of GermaNet 2 (but still need a hand-labeled
evaluation corpus)
• Repeat experiment with WordNet and Penn
Treebank Corpus
• Several experiments to determine optimal
parameters
• Two theses:
– lexical disambiguation
– general predictions