Automatic Assignment of Domain Labels to WordNet GWC 2004 Mauro Castillo V.


Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Automatic Assignment of
Domain Labels to WordNet
Mauro Castillo V.
Francis Real V.
German Rigau C.
GWC 2004
Outline
•Introduction
•WordNet
•WN Domains
•Experimentation
•Evaluation and results
•Discussion
•Conclusions
Introduction
• To semantically enrich any WN version with the semantic
domain labels of MultiWordNet Domains
• WN is a standard resource for semantic processing
• Effectiveness of Word Domain Disambiguation
• The work presented explores the automatic and systematic
assignment of domain labels to glosses
• The proposed method can be used to correct and verify the
suggested labeling
WordNet
• The version WN1.6 was used because of the
availability of WN Domains
WN Domains
WordNet Domain hierarchy developed at IRST (Magnini and Cavagliá, 2000)
TOP
pure_science
mathematics
geometry
statistics
biology
botany
zoology
entomology
anatomy
... ... ...
WN Domains
• The synsets have been annotated semiautomatically
with one or more labels
• Most synsets have a single label

Distribution of domain labels per synset
# labels   noun    verb    adj     adv    %
1          56458   11287   16681   3460   88.2020
2           8104     743    1113    109   10.1050
3           1251      88     113      6    1.4632
4            210       8       8      0    0.2268
5              2       1       0      0    0.0030

Average labels per synset
noun = 1.170
verb = 1.078
adj = 1.076
adv = 1.033
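
The averages can be recomputed from the per-category counts in the distribution table. A minimal Python sketch; the dictionary layout and the function name are illustrative and not part of the original work:

```python
# Synset counts per number of domain labels (1..5), copied from the table.
LABEL_COUNTS = {
    "noun": [56458, 8104, 1251, 210, 2],
    "verb": [11287, 743, 88, 8, 1],
    "adj": [16681, 1113, 113, 8, 0],
    "adv": [3460, 109, 6, 0, 0],
}


def average_labels(counts):
    """Mean number of labels per synset; counts[k] = synsets with k+1 labels."""
    total_synsets = sum(counts)
    total_labels = sum((k + 1) * n for k, n in enumerate(counts))
    return total_labels / total_synsets


averages = {pos: average_labels(c) for pos, c in LABEL_COUNTS.items()}
for pos, avg in averages.items():
    print(f"{pos}: {avg:.4f}")
```

The recomputed values agree with the reported averages to within 0.001.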
WN Domains
• A domain may include synsets of different syntactic
categories, e.g. MEDICINE
doctor#1 (n)
operate#7 (v)
medical#1 (a)
clinically#1 (r)
• A domain label may also contain senses from different
WN subhierarchies, e.g. SPORT
athlete#1 → life-form#1
game-equipment#1 → physical-object#1
sport#1 → act#2
playing-field#1 → location#1
WN Domains
• Synsets that have more than one label do not seem to
follow any pattern
• sultana#n#1 (pale yellow seedless grape used for raisins and wine)
Botany Gastronomy
• morocco#n#2 (a soft pebble-grained leather made from goatskin; used for
shoes and book bindings etc.)
Anatomy Zoology
• canicola_fever#n#1(an acute feverish disease in people and in dogs
marked by gastroenteritis and mild jaundice)
Medicine Physiology Zoology
• blue#n#1, blueness#n#1 (the color of the clear sky in the daytime; "he had
eyes of bright blue")
Color Quality
WN Domains
• FACTOTUM : Used to mark the senses of WN that do
not have a specific domain
• STOP Senses: The synsets that appear frequently in
different contexts, for instance: numbers, colours, etc.
Applications of WN Domains
• Word Sense Disambiguation
• Word Domain Disambiguation
• Text Categorization, etc.
Experimentation
• Process to automatically assign domain labels to WN1.6
glosses
• Procedures to validate the consistency of the domain
assignment in WN1.6 and, especially, the automatic
assignment of the factotum label

Distribution of synsets with and without the domain label factotum in WN1.6
POS    with FAC (all synsets)   no FAC   %FAC
noun   66025                    58252    11.77
verb   12127                     4425    63.51
adj    17915                     6910    61.42
adv     3575                     1039    70.93
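
The %FAC figures follow from the two counts if the "FAC" figure is read as the full synset count of each category and "no FAC" as what remains once factotum synsets are removed. That reading is an assumption of this sketch, as are the names used:

```python
# (all synsets, synsets without the factotum label), copied from the table.
# Assumption: the "FAC" figure is the full synset count for the category.
SYNSETS = {
    "noun": (66025, 58252),
    "verb": (12127, 4425),
    "adj": (17915, 6910),
    "adv": (3575, 1039),
}


def factotum_share(total, without_factotum):
    """Percentage of synsets that carry the factotum label."""
    return 100.0 * (total - without_factotum) / total


shares = {pos: factotum_share(*pair) for pos, pair in SYNSETS.items()}
for pos, share in shares.items():
    print(f"{pos}: {share:.2f}%")
```

Under this reading the noun and verb figures match exactly; the adj and adv figures come out 0.01 higher when rounded, which is consistent with the reported values having been truncated rather than rounded.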
Experimentation
The test set was randomly selected (around 1% of the synsets) and the
remaining synsets were used as the training set

Test corpus for nouns and verbs
POS    with FAC   no FAC   %FAC
noun   647        572      11.90
verb   121         43      60.33
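
A random hold-out of roughly 1% can be sketched as follows. The identifiers, the seed and the function name are hypothetical; the slides do not say how the sample was actually drawn:

```python
import random


def split_synsets(synset_ids, test_fraction=0.01, seed=0):
    """Randomly hold out about test_fraction of the synsets as a test set."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    test_size = max(1, round(len(synset_ids) * test_fraction))
    test = set(rng.sample(sorted(synset_ids), test_size))
    train = [s for s in synset_ids if s not in test]
    return train, sorted(test)


# Hypothetical identifiers standing in for the WN1.6 noun synsets.
all_ids = [f"synset-{i:05d}" for i in range(66025)]
train_ids, test_ids = split_synsets(all_ids)
print(len(train_ids), len(test_ids))
```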
Experimentation
castle#n#4, castling#n#1
Labels: CHESS SPORT
castle castling | interchanging the positions of the king and a rook

Word-domain pairs:
castle         chess
castle         sport
castling       chess
castling       sport
interchanging  chess
interchanging  sport
king           chess
king           sport
rook           chess
rook           sport

Calculation of frequency
castle   chess          68
castle   sport          27
castle   history        18
castle   architecture   57
castle   law            12
castle   tourism        24
…
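
The pairing and counting step can be sketched in a few lines of Python. This is an illustration under assumptions: tokenization and stop-word removal are done by hand, c(w) and c(D) are taken as sums over pairs, and the second training item is invented.

```python
from collections import Counter


def gloss_pairs(words, labels):
    """Pair every word of a synset (variants and gloss) with every label."""
    return [(w, d) for w in words for d in labels]


def count_pairs(training_synsets):
    """Accumulate c(w, D), c(w) and c(D) over (words, labels) training items."""
    pair_counts, word_counts, domain_counts = Counter(), Counter(), Counter()
    for words, labels in training_synsets:
        for w, d in gloss_pairs(words, labels):
            pair_counts[(w, d)] += 1
            word_counts[w] += 1
            domain_counts[d] += 1
    return pair_counts, word_counts, domain_counts


# Toy training data: the castling synset plus one invented synset.
training = [
    (["castle", "castling", "interchanging", "positions", "king", "rook"],
     ["chess", "sport"]),
    (["castle", "fortified", "building"], ["architecture"]),  # invented example
]
pairs, words, domains = count_pairs(training)
print(pairs[("castle", "chess")], words["castle"], domains["chess"])
```

Run over the whole training set, counts of this kind would yield frequency tables like the one for "castle".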
Experimentation
Measures
M1: Square root formula
$M1(w,D) = \dfrac{c(w,D) - \frac{1}{N}\, c(w)\, c(D)}{\sqrt{c(w,D)}}$
M2: Association Ratio
$Ar(w,D) = \Pr(w \mid D)\, \log_2 \dfrac{\Pr(w \mid D)}{\Pr(w)}$
M3: Logarithm formula
$M3(w,D) = \log_2 \dfrac{N\, c(w,D)}{c(w)\, c(D)}$
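
The three measures can be written directly from counts. This sketch assumes that M1 divides by the square root of c(w,D), that Pr(w|D) is estimated as c(w,D)/c(D), and that Pr(w) is estimated as c(w)/N; the function names and the sample counts are invented for illustration.

```python
import math


def m1_square_root(c_wd, c_w, c_d, n):
    """M1: (c(w,D) - c(w) * c(D) / N) / sqrt(c(w,D))."""
    return (c_wd - c_w * c_d / n) / math.sqrt(c_wd)


def m2_association_ratio(c_wd, c_w, c_d, n):
    """M2: Pr(w|D) * log2(Pr(w|D) / Pr(w))."""
    p_w_given_d = c_wd / c_d  # assumed estimate of Pr(w|D)
    p_w = c_w / n             # assumed estimate of Pr(w)
    return p_w_given_d * math.log2(p_w_given_d / p_w)


def m3_logarithm(c_wd, c_w, c_d, n):
    """M3: log2(N * c(w,D) / (c(w) * c(D)))."""
    return math.log2(n * c_wd / (c_w * c_d))


# Invented counts, chosen only to exercise the three functions.
c_wd, c_w, c_d, n = 8, 16, 64, 1024
print(m1_square_root(c_wd, c_w, c_d, n))
print(m2_association_ratio(c_wd, c_w, c_d, n))
print(m3_logarithm(c_wd, c_w, c_d, n))
```

All three functions require c(w,D) > 0. Under the assumed estimates, the logarithm inside M2 equals M3, so M2 is M3 scaled by Pr(w|D).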
Experimentation
TRAINING → CALCULATION → MATRIX OF WEIGHTS → VALIDATION

Matrix of weights, entries for "orange":
orange   botany       10.1739451057135
orange   gastronomy   4.98225066954225
orange   color        3.28232334801756
orange   jewellery    1.49369255002054
orange   entomology   1.23243498322359
orange   quality      1.17822271128967
orange   hunting      0.412524764820793
orange   geology      0.293707167933641
orange   chemistry    0.166183492890361
orange   biology      0.110492358490017
Experimentation
06950891 leader#n#1 PERSON
leader | a person who rules or guides or inspires others
variant | gloss

Domain weights per word:
leader:   politics 4.30, history 3.33, religion 2.19, person 1.78, mythology 1.17, commerce 1.11
person:   person 19.94, law 8.01, economy 4.74, religion 4.24, anthropology 3.74, sexuality 3.53, politics 3.49
rules:    law 2.70, factotum 2.09, computer_science 2.05, mathematics 1.83, grammar 1.68, play 1.57, linguistics 1.54, politics 1.35
guides:   tourism 1.64, industry 1.54, person 1.46, mechanics 1.26, factotum 1.24
inspires: occultism 0.98, psychology 0.96, pedagogy 0.93, factotum 0.82

$VD = \sum_{i} \mathrm{weight}(w_i, d_j) \cdot \mathrm{percentage}$

POSITION 1: person = 30.23
POSITION 2: politics = 13.40
POSITION 3: law = 11.08
...
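
The ranking step can be sketched as a weighted sum per domain. The slides do not give the percentage values, so this sketch assumes a percentage of 1.0 for every word; the pairing of weight lists with gloss words and all names are likewise assumptions, and only a few weights are included.

```python
from collections import defaultdict

# Word-to-domain weights read off the slide (a subset); the pairing of each
# weight list with a gloss word is an assumption of this sketch.
WEIGHTS = {
    "leader": {"politics": 4.30, "history": 3.33, "religion": 2.19,
               "person": 1.78},
    "person": {"person": 19.94, "law": 8.01, "economy": 4.74,
               "politics": 3.49},
    "rules": {"law": 2.70, "factotum": 2.09, "politics": 1.35},
}


def rank_domains(words, weights, percentage=None):
    """Sum weight(w, d) * percentage(w) per domain and sort best first."""
    percentage = percentage or {}  # missing words default to 1.0 (assumed)
    scores = defaultdict(float)
    for w in words:
        for domain, weight in weights.get(w, {}).items():
            scores[domain] += weight * percentage.get(w, 1.0)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_domains(["leader", "person", "rules"], WEIGHTS)
for position, (domain, score) in enumerate(ranking[:3], start=1):
    print(f"POSITION {position}: {domain} = {score:.2f}")
```

With uniform percentages the top label is still person, but the totals and the order of the following positions differ from the reported ones, which depend on percentage values not shown in the slides.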
Evaluation and Results: nouns
AP : Accuracy of the first label
AT : Accuracy of all labels
P : Precision
R : Recall
F1 : 2PR/(P+R)
MiA : Measures the success of each formula
(M1, M2 or M3) when the first proposed label
is correct
MiD : Measures the success of each formula
(M1, M2 or M3) when the first proposed label
is correct or subsumed by a correct one in the
domain hierarchy

Results for nouns with factotum (CF)
N    M1A     M1D     M2A     M2D     M3A     M3D
AP   70.94   74.50   45.75   52.09   66.77   71.56
AT   79.75   84.85   50.39   57.50   74.50   81.45
P    64.74   68.88   42.73   48.75   60.86   66.54
R    68.25   72.62   43.12   49.21   63.76   69.71
F1   66.45   70.70   42.92   48.98   62.27   68.09
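
The F1 row can be checked against the P and R rows with the formula F1 = 2PR/(P+R). A small Python sketch; the function and variable names are illustrative:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs for nouns with factotum, copied from the results table.
NOUN_CF = {"M1A": (64.74, 68.25), "M1D": (68.88, 72.62), "M2A": (42.73, 43.12)}
f1_values = {name: round(f1_score(p, r), 2) for name, (p, r) in NOUN_CF.items()}
print(f1_values)
```

For these three columns the recomputed values agree with the reported F1 figures.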
Results for nouns without factotum (SF)
N    M1A     M1D     M2A     M2D     M3A     M3D
AP   73.95   78.50   52.45   59.44   74.48   78.85
AT   81.82   87.24   57.52   65.21   82.69   88.64
P    66.81   71.24   49.32   55.94   68.41   73.33
R    68.68   73.24   48.24   54.71   69.41   74.41
F1   67.73   72.23   48.77   55.32   68.91   73.87
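
The difference between the MiA and MiD scoring can be illustrated with a strict and a hierarchy-relaxed check. This sketch assumes "subsumed" means that the proposed label and a correct label lie on the same branch of the domain hierarchy; the child-to-parent fragment and the function names are illustrative only.

```python
# Illustrative child -> parent fragment; the real WN Domains hierarchy is
# larger, and these links are assumed here for the sake of the example.
PARENT = {
    "geometry": "mathematics",
    "statistics": "mathematics",
    "mathematics": "pure_science",
}


def ancestors(label, parent=PARENT):
    """Yield the label itself and every ancestor up the hierarchy."""
    while label is not None:
        yield label
        label = parent.get(label)


def strict_hit(proposed, gold_labels):
    """MiA-style check: the first proposed label is one of the gold labels."""
    return proposed in gold_labels


def relaxed_hit(proposed, gold_labels, parent=PARENT):
    """MiD-style check: exact match, or proposed and gold share a branch."""
    for gold in gold_labels:
        if (proposed in ancestors(gold, parent)
                or gold in ancestors(proposed, parent)):
            return True
    return False


print(strict_hit("geometry", {"mathematics"}))
print(relaxed_hit("geometry", {"mathematics"}))
print(relaxed_hit("geometry", {"statistics"}))
```

Sibling labels such as geometry and statistics do not count as a relaxed hit under this reading, because neither subsumes the other.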
Evaluation and Results: verbs
AP : Accuracy of the first label
AT : Accuracy of all labels
P : Precision
R : Recall
F1 : 2PR/(P+R)
MiA : Measures the success of each formula
(M1, M2 or M3) when the first proposed label
is correct
MiD : Measures the success of each formula
(M1, M2 or M3) when the first proposed label
is correct or subsumed by a correct one in the
domain hierarchy

Results for verbs with factotum (CF)
V    M1A     M1D     M2A     M2D     M3A     M3D
AP   51.24   51.24   13.22   16.53   23.14   24.79
AT   57.02   57.02   14.88   19.83   28.10   29.75
P    47.26   47.26   12.68   16.90   21.94   23.23
R    50.74   50.74   13.24   17.65   25.00   26.47
F1   48.94   48.94   12.95   17.27   23.37   24.74
Results for verbs without factotum (SF)
V    M1A     M1D     M2A     M2D     M3A     M3D
AP   69.77   74.72   20.93   41.86   41.86   53.49
AT   76.74   83.72   25.58   51.16   55.81   67.44
P    64.71   69.23   19.64   38.60   39.34   46.77
R    55.93   61.02   18.64   37.29   40.68   49.15
F1   60.00   64.86   19.13   37.93   40.00   47.93
Evaluation and Results
• On average, the method assigns:
Noun : 1.23 domain labels (1.170 in WN Domains)
Verb : 1.20 domain labels (1.078 in WN Domains)
• We obtain better results with nouns
• The best average results were obtained with the M1
measure
• The first proposed label (noun): 70% accuracy
• The results for verbs are worse than for nouns; one of the
reasons may be the high number of verbal synsets
labelled with the factotum domain
Discussion
Monosemic words:
credit application#n#1 (an application for a line of credit)
Domains: SCHOOL
Proposal 1. Banking
Proposal 2. Economy
Hierarchy fragment: economy, banking
Discussion
Relation between labels:
academic_program#n#1 (a program of education in liberal arts and
sciences (usually in preparation for higher education))
Domains: PEDAGOGY
Proposal 1. School
Proposal 2. University
Hierarchy fragment: pedagogy, school, university
Discussion
Relation between labels:
shopping#n#1 (searching for or buying goods or services: "went
shopping for a reliable plumber"; "does her shopping at the mall
rather than down town")
Domains: ECONOMY
Proposal 1. Commerce
Hierarchy fragment: social_science, commerce, economy
Discussion
Relation between labels:
fire_control_radar#n#1 (radar that controls the delivery of fire on a
military target)
Domains: MERCHANT_NAVY
Proposal 1. Military
Hierarchy fragment: social_science, transport, merchant_navy, military
Discussion
Uncertain cases:
birthmark#n#1 (a blemish on the skin formed before birth)
Domains: QUALITY
Proposal 1. Medicine
bardolatry#n#1 (idolization of William Shakespeare)
Domains: RELIGION
Proposal 1. History
Proposal 2. Literature
Conclusions
• Automatically assigning domain labels to WN glosses
seems to be difficult
• The proposed process is very reliable for the first
proposed labels
• The proposed labels are ordered by priority
• It is possible to add new correct labels or validate
the old ones