Semiautomatic Extension of CoreNet - uni


Semiautomatic Extension of CoreNet using a Bootstrapping Mechanism on Corpus-based Co-occurrences

Chris Biemann (University of Leipzig) Sa-Im Shin (KORTERM, KAIST) Key-Sun Choi (KORTERM, KAIST)

Friday, 27th of August Coling 2004, Genève

Outline

• The necessity of the extension of lexical-semantic word nets
• CoreNet – a WordNet for Korean, Japanese and Chinese
• Co-occurrence statistics on large corpora
• The Pendulum Algorithm
• Results and Evaluation

Why extend WordNet?

• Manual construction is done by experts
  - time-consuming
  - expensive
• General-purpose WordNet often does not fit specialized domains
• Existing resources have coverage problems

Bootstrapping of lexical items

For learning by bootstrapping, two things are needed: a start set of some known items with classes, and a rule set that states how more information can be obtained using known items.

Generic bootstrapping algorithm:

    Knowledge=0
    New=Start_set
    While New>0
        Knowledge+=New
        New=0
        New=find new items using Knowledge and Rule_set

[Figure: number of known items and new items per iteration, with a phase of growth followed by a phase of exhaustion]
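The generic loop can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the rule set is modelled as a function `find_new` that maps the current knowledge to candidate items, and the toy relation `NEIGHBOURS` is a made-up stand-in for a real rule set.

```python
from typing import Callable, Set


def bootstrap(start_set: Set[str],
              find_new: Callable[[Set[str]], Set[str]]) -> Set[str]:
    """Grow a knowledge set until one iteration yields no new items."""
    knowledge: Set[str] = set()
    new = set(start_set)
    while new:
        knowledge |= new
        # Apply the rule set to everything known; keep only unseen items.
        new = find_new(knowledge) - knowledge
    return knowledge


# Hypothetical rule set: an item leads to its neighbours in a toy relation.
NEIGHBOURS = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "x": {"y"}}


def toy_rule(known: Set[str]) -> Set[str]:
    return {n for item in known for n in NEIGHBOURS.get(item, set())}


result = bootstrap({"a"}, toy_rule)
```

The loop stops on its own once the rule set produces nothing unseen, which corresponds to the phase of exhaustion.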

Benefits and Drawbacks of Bootstrapping

Pros:
• Only small start sets (seeds) are needed; those can be rapidly prepared
• Process needs no further supervision (weakly supervised learning)

Cons:
• Danger of error propagation
• When to stop is unclear

CoreNet – ontology for Korean, Japanese and Chinese

Size of Korean part: 2,954 concepts

Word class   Lemmas   Senses
NOUN         28,823   56,523
VERB          1,757    4,717
ADJECTIVE       804    1,392

Features
• Rather large groups of words per concept as opposed to fine-grained WordNet structure
• Same concept hierarchy is used for all word classes

KAIST Corpus and Co-occurrences

Size of KAIST corpus (unannotated version):
• 38 million tokens
• 2.3 million sentences
• 3.8 million types

Co-occurrence statistics (sentence-based):
• occurrence of two or more words within a well-defined unit of information (sentence)
• significant co-occurrences reflect semantic relations between words
• significance measure (log-likelihood):

[Formula not recoverable from the transcript; its surviving legend defines n as the number of sentences and k as the number of sentences containing both words]
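To make the idea of a log-likelihood significance score concrete, here is a sketch that assumes Dunning's widely used G² statistic over sentence counts. It may differ in detail from the formula shown on the slide. The names `a` and `b` (number of sentences containing each word) are assumptions introduced for illustration; `n` and `k` follow the slide's legend.

```python
import math


def log_likelihood(a: int, b: int, k: int, n: int) -> float:
    """Dunning's G^2 for two words A and B.

    a, b: number of sentences containing A, respectively B
    k:    number of sentences containing both A and B
    n:    total number of sentences
    """
    # Observed cells of the 2x2 table: both, A only, B only, neither.
    observed = [k, a - k, b - k, n - a - b + k]
    # Row and column totals belonging to each cell.
    rows = [a, a, n - a, n - a]
    cols = [b, n - b, b, n - b]
    g2 = 0.0
    for o, r, c in zip(observed, rows, cols):
        if o > 0:  # empty cells contribute nothing
            g2 += o * math.log(o * n / (r * c))
    return 2.0 * g2
```

When the two words are independent (k equals a·b/n), the score is zero; it grows as the observed joint count departs from that expectation.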

Co-occurrence set examples

Reference word, followed by its top 25 co-occurrences ordered by significance:

연필 (pencil): 지우개 (eraser) (25), 만년필 (fountain pen) (22), 볼펜 (ball pen) (14), 국어 (Korean) (14), 쥐는 (grasping) (14), 한 자루도 (a pen) (14), 한쪼가리 (a part of) (14), 문구세트 (stationery set) (13), 문화연필은 (Mun-Hwa pencil) (13), 자루 (the measure of numbering pencils) (11), 필통 (pencil box) (11), 한토막 (a part) (11), 공책 (notebook) (10), 기념품을 (souvenir) (9), 노트 (notebook) (9), 시간 (time) (9), 그린 (drawing) (8), 사진 (picture) (8), 한글을 (Korean) (8), 가방 (bag) (7), 쓰던 (writing) (7), 쓰면 (writing) (7), 아이들은 (children) (7), 종이 (paper) (7), 줄은 (decreasing) (7) [..]

jurisdiction: over (305), court (188), under (183), courts (145), federal (121), Court (95), case (73), court's (68), state (45), within (43), Appeals (38), ruled (38), Circuit (36), SEC (36), law (36), Commission (34), GSBCA (34), appeals (34), House (33), committees (33), Judge (31), Act (29), CFTC (29), Committee (29), subcommittee (28) [...]

Co-occurrence sets alone exhibit too many different relations to the reference word to be used directly for CoreNet extension.

Pendulum-Algorithm: Bootstrapping with verification

    LastLearned=StartSet;
    Knowledge=StartSet;
    NewLearned=0;
    while (LastLearned>0) {
        for all i in LastLearned {
            Candidates=getCooccurrences(i);             // Search step
            for all c in Candidates {
                VerifySet=getCooccurrences(c);          // Verification step
                if |VerifySet ∩ Knowledge| > threshold {
                    NewLearned+=c;
                    Knowledge+=c;
                }
            }
        }
        LastLearned=NewLearned;
        NewLearned=0;
    }
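A runnable reading of the Pendulum pseudocode might look like the sketch below. It assumes that co-occurrence sets are available as a plain dictionary `cooc` (standing in for `getCooccurrences`) and iterates in sorted order only to keep runs reproducible; the toy data is invented for illustration and is not taken from the KAIST corpus.

```python
from typing import Dict, Set


def pendulum(start_set: Set[str],
             cooc: Dict[str, Set[str]],
             threshold: int) -> Set[str]:
    """Return the words learned beyond the start set."""
    knowledge = set(start_set)
    last_learned = set(start_set)
    while last_learned:
        new_learned: Set[str] = set()
        for item in sorted(last_learned):
            # Search step: co-occurrences of a known item are candidates.
            for cand in sorted(cooc.get(item, set())):
                if cand in knowledge:
                    continue
                # Verification step: the candidate's own co-occurrences
                # must contain more than `threshold` known words.
                verify_set = cooc.get(cand, set())
                if len(verify_set & knowledge) > threshold:
                    new_learned.add(cand)
                    knowledge.add(cand)
        last_learned = new_learned
    return knowledge - start_set


# Hypothetical co-occurrence sets for a "parts of the face" concept.
toy_cooc = {
    "eye": {"nose", "mouth", "cheek", "poem"},
    "nose": {"eye", "mouth", "cheek"},
    "mouth": {"eye", "nose", "cheek", "word"},
    "cheek": {"eye", "nose", "mouth"},
    "poem": {"eye", "word"},
    "word": {"mouth", "poem"},
}

learned = pendulum({"eye", "nose", "mouth"}, toy_cooc, threshold=2)
```

In this toy run "cheek" is accepted because three known words occur in its co-occurrence set, while "poem" and "word" each share only one known word and are rejected.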

Pendulum Example

Seed: 관자놀이 (temple), 눈 (eye), 뺨 (cheek), 시 (poem), 쌍꺼풀 (double eyelid), 부위마다 (part of face), 아랫입술 (lower lip), 오관 (the five sensory organs), 입 (mouth), 코 (nose), 혀 (tongue)

Search with 관자놀이 (temple): … , 복사뼈 (malleolus bone), …

Verify for 복사뼈 (malleolus bone): 부위마다 (part of face), 안면부 (part of the face), 인당 (ligament), 인중 (philtrum), 경골 (tibial), 관자놀이 (temple), 경혈을 (spots on the body suitable for acupuncture), 손끝으로 (with fingertip), 용천 (spring), 청명 (serenity), 4차례씩 (per 4 times), 두드릴 (tapping), 발바닥 (the sole of the foot), 코와 (with nose), 등 (back), 오리 (duck), 영향 (influence), 상부 (high part), 위쪽 (front part), 신체 (body), 예방하는 (preparing), 중간 (middle), 입 (mouth), 질병을 (disease), 코 (nose), 한가운데 (center), 가볍게 (lightly), 곳 (place), 누르고 (pressing), 지정된 (appointed).
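The verification step for this example can be checked by hand or with a few lines of Python. The sketch below simply intersects the seed with the verify set listed for 복사뼈; the threshold of 3 is an assumption (the lowest value named on the evaluation slide), since the slide does not say which threshold was used for this run.

```python
seed = {"관자놀이", "눈", "뺨", "시", "쌍꺼풀", "부위마다",
        "아랫입술", "오관", "입", "코", "혀"}

# Co-occurrence set of the candidate 복사뼈 (malleolus bone), as listed.
verify_set = {"부위마다", "안면부", "인당", "인중", "경골", "관자놀이",
              "경혈을", "손끝으로", "용천", "청명", "4차례씩", "두드릴",
              "발바닥", "코와", "등", "오리", "영향", "상부", "위쪽",
              "신체", "예방하는", "중간", "입", "질병을", "코",
              "한가운데", "가볍게", "곳", "누르고", "지정된"}

overlap = seed & verify_set

ASSUMED_THRESHOLD = 3  # assumption, not stated for this example
accepted = len(overlap) > ASSUMED_THRESHOLD
```

Four seed words occur in the verify set (부위마다, 관자놀이, 입, 코). The inflected form 코와 does not count as a match for 코, because the comparison is on surface forms.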

Evaluation

• Selection of concepts performed by a non-Korean speaker
• Evaluation performed manually, only new words counted
• Heuristics for avoiding result set infection
  - iteratively lower the threshold for verification from 8 down to 3 until the result set is too large
  - take the lowest threshold that gives a result set of reasonable size (not exceeding the start set)
• A typical run needed 3-7 iterations to converge
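The threshold heuristic can be sketched as a small search loop. This is one possible reading, assuming that "too large" means larger than the start set; `run` is a hypothetical callable that performs one full bootstrapping run for a given threshold, and `toy_run` is an invented stand-in for it.

```python
from typing import Callable, Optional, Set, Tuple


def pick_threshold(start_set: Set[str],
                   run: Callable[[Set[str], int], Set[str]],
                   highest: int = 8,
                   lowest: int = 3) -> Tuple[Optional[int], Set[str]]:
    """Lower the threshold step by step; keep the last acceptable result."""
    best: Tuple[Optional[int], Set[str]] = (None, set())
    for threshold in range(highest, lowest - 1, -1):
        result = run(start_set, threshold)
        if len(result) > len(start_set):
            break  # result set too large: stop lowering
        best = (threshold, result)
    return best


def toy_run(start: Set[str], threshold: int) -> Set[str]:
    # Stand-in behaviour: lower thresholds admit more words.
    return {f"w{i}" for i in range(10 - threshold)}


chosen, words = pick_threshold({"s1", "s2", "s3", "s4"}, toy_run)
```

With a start set of four words, the toy run is accepted down to threshold 6 and rejected at threshold 5, so threshold 6 is chosen.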

Results

CoreNet ID  Name of Concept       # new  # ok  precision
50          human good/bad           36     5     13.89%
111         human relation            3     2     66.67%
113         partner / co-worker      23     8     34.78%
114         partner / member          5     3     60.00%
181         human ability             7     2     28.57%
430         store                    12    11     91.67%
471         land, area               10     2     20.00%
548         insect, bug              43     6     13.95%
552         part of animal           10     6     60.00%
553         head                      7     4     57.14%
577         forehead                  4     2     50.00%
590         legs and arms             7     3     42.86%
672         plant (vegetation)       30    15     50.00%
817         cloths                   34    18     52.94%
Sum                                 231           37.67%

Size (concept sizes as extracted; their assignment to rows is not recoverable from the transcript): 736, 139, 72, 86, 461, 246, 119, 274, 123, 71, 213, 128, 260, 75; Sum: 3439

Not enough for automatic extension, but a good source for candidates
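The precision figures can be re-derived as ok / new per concept. In the sketch below the "# new" values are in row order as given; the "# ok" values arrive scrambled in the transcript, so their row order here is an inference obtained by matching them against the listed precision percentages.

```python
# "# new" per concept, in row order (CoreNet IDs 50 ... 817).
new = [36, 3, 23, 5, 7, 12, 10, 43, 10, 7, 4, 7, 30, 34]
# "# ok" per concept; row order inferred from the precision column.
ok = [5, 2, 8, 3, 2, 11, 2, 6, 6, 4, 2, 3, 15, 18]

# Per-concept precision in percent, rounded to two decimals.
precision = [round(100 * o / n, 2) for o, n in zip(ok, new)]

total_new = sum(new)
total_ok = sum(ok)
overall = 100 * total_ok / total_new
```

Under this assignment every per-concept value reproduces the listed percentage, and the overall figure comes out at about 37.66%, which agrees with the reported 37.67% up to rounding.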

Problems and possible solutions

• "Coverage is low"
  - increase corpus size for relevant domains
  - make use of other features, e.g. patterns
• "Precision is not satisfactory"
  - obtain multiple concepts simultaneously
  - meta-level bootstrapping
  - make use of other features, e.g. POS tags for word class information

This work gives a baseline of what is reachable without employing language-dependent features.

Summary

Language-independent method for semi-automatic extension of lexical-semantic word nets using
• Co-occurrence data on the basis of a plain text corpus
• Pendulum Algorithm for keeping precision high in bootstrapping

Questions?

THANK YOU!

Local Ontology Engineering

• Bottom-up approach: Given an existing (small) ontology, how can it be (semi)automatically extended?

• Top nodes of ontologies are scarcely lexicalized → focus on leaves rather than on branch and trunk nodes
• Local view: extension does not take global structure into account but operates within sub-trees

[Figure: Focus on local areas]

Using word class information: German Example

Algorithm:

    Word set W
    As long as new words w are found:
        candidates C = co-occurrences of w of different word class
        for all c in C:
            if co-occurrence set of c contains enough words of W with different class of c:
                add c to W
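The word-class variant can be sketched in the same style. The sketch assumes a dictionary `pos` that gives a word class per word and reads "enough" as a hypothetical parameter `min_support`; the co-occurrence data is invented for illustration, using a few words from the German example.

```python
from typing import Dict, Set


def extend_with_word_class(words: Set[str],
                           cooc: Dict[str, Set[str]],
                           pos: Dict[str, str],
                           min_support: int) -> Set[str]:
    """Return new words found via co-occurrences of a different class."""
    result = set(words)
    frontier = set(words)
    while frontier:
        found: Set[str] = set()
        for w in sorted(frontier):
            for c in sorted(cooc.get(w, set())):
                # Candidates must be unknown and of a different class than w.
                if c in result or pos.get(c) == pos.get(w):
                    continue
                # Known words of a different class than c that co-occur with c.
                support = {v for v in cooc.get(c, set())
                           if v in result and pos.get(v) != pos.get(c)}
                if len(support) >= min_support:
                    result.add(c)
                    found.add(c)
        frontier = found
    return result - set(words)


# Hypothetical data: N = noun, V = verb.
toy_pos = {"Schwert": "N", "Säbel": "N", "Degen": "N", "Ritter": "N",
           "Tanzbein": "N", "schwingen": "V", "zückt": "V", "fechten": "V"}
toy_cooc = {
    "Schwert": {"schwingen", "zückt", "Ritter"},
    "Säbel": {"schwingen", "zückt"},
    "Degen": {"fechten", "zückt"},
    "schwingen": {"Schwert", "Säbel", "Tanzbein"},
    "zückt": {"Schwert", "Säbel", "Degen"},
    "fechten": {"Degen"},
    "Ritter": {"Schwert"},
    "Tanzbein": {"schwingen"},
}

new_words = extend_with_word_class({"Schwert", "Säbel", "Degen"},
                                   toy_cooc, toy_pos, min_support=2)
```

In this toy run the verbs "schwingen" and "zückt" are added because at least two known nouns co-occur with each, while "fechten" and "Tanzbein" have only one supporting word and are left out.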

19.23 Hieb- und Stichwaffe (cut-and-thrust weapons) (DORNSEIFF 2003)

Waffe
• Stichwaffe · Bajonett · Damaszener · Degen · Dolch · Florett · Lanze · Säbel · Schwert · Sense · Speer · Spieß
• Messer · Fahrtenmesser · Jagdmesser · Klinge · Stilett
• Hiebwaffe · Baseballschläger · Faustkeil · Keule · Knüppel · Morgenstern · Prügel · Schlagring · Schlagstock · Stock · Totschläger
• Bumerang · Hellebarde · Streitaxt · Tomahawk
• Armatur · Bewaffnung · Rüstung · Wehr
• Arsenal · Rüstkammer · Waffenkammer · Waffenlager · Zeughaus
• bewaffnen · rüsten · wappnen
• einprügeln · einschlagen · einstechen · erschlagen · erstechen · prügeln · schlagen · stechen · verprügeln · zuschlagen · zustechen

New for 19.23

Abrißbirne · Axt · Drahtesel · Eisenstange · Fäuste · Golfschläger · Hüften · Lüfte · Peitsche · Pendel · Racket · Sattel · Schläger · Skins · Takt · Tanzbein · Unterleib · Zepter · einschlug · ersticht · fechten · ficht · kreuzen · rammt · schwang · schwangen · schwingen · schwingt · traktiert · zückt · zückte