An Integrated Approach to Measuring Semantic Similarity

A Common Concept Description
of Natural Language Texts as
a Foundation of Semantic Computing
on the Web
Mitsuru Ishizuka
Dept. of Creative Informatics &
Dept. of Info. and Communication Eng.
School of Information Science and Technology
Semantic Computing Initiative
lay a foundation that allows computers to understand the semantic meaning
of Web contents so that they can perform semantic computing on the Web.
The aims of CDL are
1) to realize machine understandability of Web text contents, and
2) to overcome the language barrier on the Web.
2
Major Differences from Semantic Web
Semantic Web



Target of representation:
Meta-data extracted from
Web contents.
Domain-dependent
ontologies (which cause the
difficulty of wide interboundary usage)
RDF / OWL (description
logic is hard for ordinary
people to understand)
Tim Berners-Lee says that
“the Data Web” is a more adequate name
than “the Semantic Web” (2007).
Semantic Computing
Initiative



Target of representation:
Semantic concepts expressed in
texts.
Universal vocabulary (+
additional specific vocabulary
in a domain if necessary), and
pre-defined relation set.
CDL.nl (richer than RDF)
Main body:
Institute of Semantic Computing (ISeC)
in Japan
3
Int’l Standardization Activity:
W3C Common Web Language(CWL)-XG
Incubator Group Activity at W3C
from Oct. 2006 to March 2008
4
2nd Incubator Group at W3C
from May 2008
5
CDLs and Semantic Web
6
Tim Berners-Lee (2007): The Semantic Web → The Data Web (more adequate)
Another Broader View of
CDL Development

In 1960s – 1970s
The foundation of the common representation
and manipulation (retrieval) of Data.
Database

In 2000s – 2010s
The foundation of the common representation
and manipulation of Semantic Information.
Common Concept Base
It is preferable that this is language independent; in other words,
Computer Esperanto Language which is understandable by
computers.
7
Functions and Supporting Standards
8
From Machine Translation
English
Japanese
Chinese
Transfer
method
Pivot
method
Pivot
Language
UNL (Universal Networking Language)
CDL (Concept Description Language)
could be
Minimal sufficient relations have been
chosen to represent the surface-level
concept meaning of texts.
Computer
Esperanto
Language
9
UNL (Universal Networking Language)




The development started in 1997 at the United Nations Univ. (Tokyo).
The chief scientist has been Dr. Hiroshi Uchida. It is now continuously
developed under the UNDL foundation.
The purpose is to let people in the world exchange and share
textual info. on the Web beyond the language barrier.
The design is based on the results of Machine Translation
(especially, the Pivot method) and Electronic Dictionaries.
There have been activities wrt English, Japanese, Chinese,
Spanish, French, Arabic, etc.
10
The defining method of one unique
sense of a word in UW （Patent of UN Univ.）

Defining category
swallow(icl>bird): the bird, “One swallow does not make a summer”
swallow(icl>action): the action of swallowing, “at one swallow”
swallow(icl>quantity): the quantity, “take a swallow of water”

Defining possible case relations
spring(agt>thing,obj>wood): bending or dividing something
spring(agt>thing,obj>mine): blasting something
spring(agt>thing,obj>person,src>prison): escaping (from) prison
spring(agt>thing,gol>place): jumping up, “to spring up”
spring(agt>thing,gol>thing): jumping on, “to spring on”
spring(obj>liquid): gushing out, “to spring out”
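
The UW notation above (a headword followed by a parenthesized list of `relation>target` constraints) can be split mechanically. The following is a minimal sketch, not part of UNL tooling: `parse_uw` is a hypothetical helper, and it assumes flat constraint lists only, so nested forms such as `or>uw(aoj>thing)` are out of scope.

```python
import re

# Hypothetical helper for illustration: headword, then "(rel>target, ...)".
UW_PATTERN = re.compile(r"^(?P<head>[^()]+)\((?P<body>.*)\)$")

def parse_uw(uw: str):
    """Return (headword, [(relation, target), ...]) for a flat UW string."""
    match = UW_PATTERN.match(uw.strip())
    if match is None:
        # A bare headword without a constraint list.
        return uw.strip(), []
    constraints = []
    for part in match.group("body").split(","):
        relation, _, target = part.strip().partition(">")
        constraints.append((relation, target))
    return match.group("head"), constraints

print(parse_uw("swallow(icl>bird)"))
print(parse_uw("spring(agt>thing,obj>person,src>prison)"))
```

Keeping the constraints as an ordered list rather than a dictionary preserves the order in which they are written, which may matter if the same relation appears twice.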
11
UW (Universal Words) in UNL
Universal Word
uw{(equ>Universal Word)}
adjective concept{(icl>uw)}
uw(aoj>thing{,and>uw,ben>thing,cao>thing,cnt>uw,cob>thing,con>uw,coo>uw,dur>period,man>how,obj>thing,or>uw(aoj>thing),plc>thing,plf>thing,plt>thing,rsn>uw(aoj>thing),rsn>do,icl>adjective concept})
Achaean({icl>uw(}aoj>thing{)})
Afghan({icl>uw(}aoj>thing{)})
African({icl>uw(}aoj>thing{)})
African-American({icl>uw(}aoj>thing{)})
Ainu({icl>uw(}aoj>thing{)})
Alaskan({icl>uw(}aoj>thing{)})
Albanian({icl>uw(}aoj>thing{)})
Aleutian({icl>uw(}aoj>thing{)})
Alexandrian({icl>uw(}aoj>thing{)})
Algerian({icl>uw(}aoj>thing{)})
Altaic({icl>uw(}aoj>thing{)})
American({icl>uw(}aoj>thing{)})
Anglian({icl>uw(}aoj>thing{)})
Anglo-American({icl>uw(}aoj>thing{)})
Anglo-Catholic({icl>uw(}aoj>thing{)})
Anglo-French({icl>uw(}aoj>thing{)})
Anglo-Indian({icl>uw(}aoj>thing{)})
Anglo-Irish({icl>uw(}aoj>thing{)})
Anglo-Norman({icl>uw(}aoj>thing{)})
Arab({icl>uw(}aoj>thing{)})
Arab-Israeli({icl>uw(}aoj>thing{)})
Arabian({icl>uw(}aoj>thing{)})
Arabic({icl>uw(}aoj>thing{)})

40,000 lexicons are open to public.
The full vocabulary includes 200,000 lexicons as of 2007.
12
CDLs

CDL.core


defines the basic format.
CDL.nl (or Common Web Language)


describes every possible concept expressed in natural language
text in any language. It provides a concept description scheme
wrt words, phrases, sentences and documents in any language,
based on the CDL concept model.
A basic vocabulary set including relation vocabulary is given.

CDL.jpn, CDL.eng, CDL.chi, CDL.spa, etc.

Articulate Japanese (明晰日本語)

defines a style of Japanese sentence expression that suppresses
ambiguity, with reference to CDL.nl
13
CDL.core

Basic Description Elements：
Entity
Elementary Entity: Node
Composite Entity: Hyper-Node
Relation: Link
Attribute-Value pairs can be added to the entity (and
the relation).

These are quasi-relations; if the value takes a complex entity, then we can
treat it as a relation.
14
Representation with CDL.nl

<John reported to Alice that he bought a computer yesterday.>

{#A01 Event tmp=‘past’;
{#B01 Event tmp=‘past’;
{#b01 buy;} {#b02 computer ral=‘def’;} {#b03 yesterday;}
[#b01 agt John] [#b01 obj #b02] [#b01 tim #b03] }
{#John John;} {#Alice Alice;} {#a01 report;}
[#a01 agt #John] [#a01 gol #Alice] [#a01 obj #B01] }
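[Figure: graph form of the description above. Event#A01 (tmp='past') contains report#a01 with agt John#, gol Alice# and obj Event#B01. Event#B01 (tmp='past') contains buy#b01 with agt John#, obj computer#b02 (ral='def') and tim yesterday#b03.]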
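
The nesting of entities and relations in this example can be mirrored in a small data structure. This is a sketch under stated assumptions: the `Entity` class and `all_relations` function are hypothetical names, not part of CDL.core, and the agent of `#b01` is written as `#John` here for uniformity.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    ident: str                                      # e.g. "#b01"
    label: str                                      # e.g. "buy" or "Event"
    attrs: dict = field(default_factory=dict)       # attribute-value pairs
    children: list = field(default_factory=list)    # inner entities (composite only)
    relations: list = field(default_factory=list)   # (head id, relation, tail id)

# Inner event: "he bought a computer yesterday".
inner = Entity(
    "#B01", "Event", {"tmp": "past"},
    children=[Entity("#b01", "buy"),
              Entity("#b02", "computer", {"ral": "def"}),
              Entity("#b03", "yesterday")],
    relations=[("#b01", "agt", "#John"), ("#b01", "obj", "#b02"), ("#b01", "tim", "#b03")],
)

# Outer event: "John reported to Alice that ...".
outer = Entity(
    "#A01", "Event", {"tmp": "past"},
    children=[inner, Entity("#John", "John"), Entity("#Alice", "Alice"), Entity("#a01", "report")],
    relations=[("#a01", "agt", "#John"), ("#a01", "gol", "#Alice"), ("#a01", "obj", "#B01")],
)

def all_relations(entity):
    """Collect relations of a composite entity and of everything nested inside it."""
    found = list(entity.relations)
    for child in entity.children:
        found.extend(all_relations(child))
    return found

for head, relation, tail in all_relations(outer):
    print(head, relation, tail)
```

Because a composite entity (hyper-node) can itself be the tail of a relation, as `#B01` is for `obj`, the structure is recursive rather than a flat edge list.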
15
Semantic Role Labels in PropBank
The focus is on Predicate-Argument Structure.

















Arg0 (prototypical agent)
Arg1 (prototypical patient)
Arg2 (indirect object/benefactive/instrument/attribute/end state)
Arg3 (start point/benefactive/instrument/attribute)
Arg4 (end point)
Arg5 ( )
TMP (time)
LOC (location)
DIR (direction)
MNR (manner)
PRP (purpose)
CAU (cause)
MOD (modal verb)
NEG (negative marker)
ADV (general-purpose modifier)
DIS (discourse particle and clause)
PRD (secondary predication)

These are defined wrt each word sense.
Ex) buy::
Arg0: buyer
Arg1: thing bought
Arg2: seller (bought-from)
Arg3: price paid
Arg4: benefactive (bought-for)

This set is not sufficient for representing every concept expressed in natural
language texts.
It cannot be used for every language due to its language (English) dependency.
16
CDL.nl Relations (1)

RELATION

ElementalRelation 要素関係
FunctionalRelation 機能的観点
[AgentRelation 主体関係]
- agt（agent：動作主）
- cag（co-agent：並行動作主）
- aoj（thing with attribute：属性主）
- cao（co-thing with attribute：並行属性主）
- ptn（partner：相手）
[PatientRelation 被行為体関係]
- obj（affected thing：対象）
- cob（affected co-thing：並行対象）
- opl（affected place：場所対象）
- ben（beneficiary：受益者）
[InstrumentRelation 道具関係]
- ins（instrument：道具）
- mat (material：材料)
- met（method or means：方法）
[PlaceRelation 場所関係]
- plc（place：場所）
- plf（initial place：起点）
- plt（final place：終点）
- vip (intermediate place, via place：場所経由）
[StateRelation 状態関係]
- sta (state：状態)
- src（source, initial state：始状態）
- gol（goal, final state：終状態）
- vis（intermediate place or state：経由）
[TimeRelation 時間関係]
- tim（time：時間）
- tmf（initial time：始時間）
- tmt（final time：終時間）
- dur（duration：期間）
17

CDL.nl Relations (2)
[SceneRelation 場面関係]
- scn (scene：場面)
- vic （via scene：場面経由)
[CausalRelation 原因関係]
- con（condition：条件）
- pur（purpose or objective：目的）
- rsn（reason：理由）
[OrderRelation 順序関係]
- coo（co-occurrence：同起）
- seq（sequence：先行）
[MannerRelation 様態関係]
- mal (qualitative manner：質的仕方）
- mat (quantitative manner：量的仕方）
- bas（basis for expressing a standard：基準）
[ModificationRelation 限定関係]
- mod (modification：限定）
- qua (quantity：量的限定）
- pos（possessor：所有者）
- cnt（content, namely：内容）
- nam（name：名前）
- per（proportion, rate or distribution：単位）
- fmt (range/from to：範囲）
- frm (origin：起源点）
- to (destination：目的点）
[LogicalRelation 論理関係]
- and（conjunction：連言的）
- or (disjunction, alternative：選言的）
- not (complement：補集合）
[ConceptRelation 概念関係]
- equ (equivalent：同義）
- icl (included / a kind of：上位）
- tof (type-of：具体化）
- pof（part-of：部分）
18
CDL.nl Relations (3)

InterThingOrInterEventRelation 間事物・間事象関係
[ConnectingRelation 接続関係]
- cau (causal：順接）
- adv (adversative：逆接）
- adt (additive：添加）
- cot (contrastive：対比）
- par (parallel：同列）
- att (attached：補足）
[ReferringRelation 参照関係]
- rfi (referred by identically：同一参照)
- rfp (referred by partially：部分参照）
- rfw (referred by wholly：包含参照）

AttensionRelation 注目関係
- ent (entry：主概念）
- foc (focus：焦点）
- qfo (question focus：質問焦点）
- tpc (topic：トピック）
- com (comment：コメント）
19
Rough Correspondence between Semantic
Relations of PropBank and CDL.nl (1)
















Arg0 (prototypical agent) → agt (agent), cag (co-agent), aoj (thing with attribute), cao (co-thing with attribute)
Arg1 (prototypical patient) → obj (affected thing), cob (affected co-thing)
Arg2 (indirect object/benefactive/instrument/attribute/end state) → ---, ben (beneficiary), ins (instrument), mat (material), met (method or means), sta (state), gol (goal, final state)
Arg3 (start point/benefactive/instrument/attribute) → plf (initial place), ben (beneficiary), ins (instrument), mat (material), met (method or means), sta (state)
Arg4 (end point) → plt (final place), to (destination)
TMP (time) → tim (time), tmf (initial time), tmt (final time), dur (duration)
LOC (location) → plc (place)
DIR (direction) → to (destination)
MNR (manner) → mal (qualitative manner), mat (quantitative manner)
PRP (purpose) → pur (purpose or objective)
CAU (cause) → rsn (reason)
MOD (modal verb) → an attribute in CDL.nl
NEG (negative marker) → an attribute in CDL.nl
ADV (general-purpose modifier) → mod (modification), qua (quantity), pos (possessor), cnt (content), nam (name), per (proportion, rate or distribution), fmt (range/from to), frm (origin)
DIS (discourse particle and clause) → [inter-sentence relation]
PRD (secondary predication) → [unique in English]
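
The rough correspondence above is one-to-many, so a lookup from a PropBank label yields candidate CDL.nl relations rather than a single answer. A minimal sketch of that lookup follows, covering only some rows; `PROPBANK_TO_CDL`, `cdl_candidates` and `propbank_sources` are hypothetical names chosen for illustration, and `tmt` is used for final time as in the CDL.nl relation list.

```python
# Partial, illustrative copy of the correspondence table (not exhaustive).
PROPBANK_TO_CDL = {
    "Arg0": ["agt", "cag", "aoj", "cao"],
    "Arg1": ["obj", "cob"],
    "Arg4": ["plt", "to"],
    "TMP": ["tim", "tmf", "tmt", "dur"],
    "LOC": ["plc"],
    "DIR": ["to"],
    "MNR": ["mal", "mat"],
    "PRP": ["pur"],
    "CAU": ["rsn"],
}

def cdl_candidates(propbank_label: str) -> list:
    """Candidate CDL.nl relations for a PropBank label (empty if unmapped)."""
    return PROPBANK_TO_CDL.get(propbank_label, [])

def propbank_sources(cdl_relation: str) -> list:
    """PropBank labels that may correspond to a given CDL.nl relation."""
    return [label for label, rels in PROPBANK_TO_CDL.items() if cdl_relation in rels]

print(cdl_candidates("TMP"))
print(propbank_sources("to"))
```

Choosing among the candidates still needs context (for example, `agt` versus `aoj` depends on whether the predicate is an action or a state), which is why the mapping alone cannot convert PropBank annotation into CDL.nl.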
20
Rough Correspondence between Semantic
Relations of PropBank and CDL.nl (2)
Other CDL.nl Relations

[AgentRelation]
ptn (partner): an indispensable non-focused initiator of an action
Ex) He competes with John. Mary collaborates with him.

[PatientRelation]
opl (affected place): a place in focus affected by an event.
Ex) He cut the paper in the middle in the room.

[PlaceRelation]
vip (intermediate place, via place)

[StateRelation]
src (source, initial state)
vis (intermediate place or state)

[SceneRelation]
scn (scene): a scene where an event occurs, or a state is true, or a thing exists.
A scene is different from plc in that plc is the real place where something
happens, whereas scn is an abstract or metaphorical world.
Ex) He won a prize in a contest. He played in the movie.
vic (via scene)

CDL.nl’s Relations other than Predicate-Argument Relations

[OrderRelation]
coo (co-occurrence) seq (sequence)
[LogicalRelation]
and (conjunction) or (disjunction, alternative)
not (complement)
[ConceptRelation]
equ (equivalent) icl (included/a kind of)
tof (type of) pof (part of)
[ConnectingRelation]
cau (causal) adv (adversative) adt (additive)
cot (contrastive) par (parallel) att (attached)
[ReferringRelation]
rfi (referred by identically)
rfp (referred by partially) rfw (referred by wholly)
[AttensionRelation]
ent (entry) → main (main element) in Connexor
foc (focus) qfo (question focus) tpc (topic)
com (comment)
21
Rich Attributes in UNL and CDL.nl
Express the subjective evaluation of the writer/speaker for the sentence.
- Ex.) tense, aspect, mood, etc.

Writer’s feelings and judgements
@ability @get-benefit @give-benefit
@conclusion @consequence @sufficient @grant
@grant-not @although @discontented
@expectation @wish
@insistence @intention @want @will @need
@obligation @obligation-not @should
@unavoidable @certain @inevitable @may
@possible @probable @rare @regret @unreal
@admire @blame @contempt @regret
@surprised @troublesome

Time with respect to speaker
@past @present @future

Writer’s view on aspect of event
@begin @complete @continue @custom
@end @experience @progress @repeat @state

Writer’s view of reference
@generic @def @indef @not @ordinal

Writer’s view of emphasis, focus and topic
@emphasis @entry @qfocus @theme
@title @topic

Describing logical characters and properties of concepts
@transitive @symmetric @identifiable
@disjoint

Writer’s attitudes
@affirmative @confirmation @exclamation
@imperative @interrogative @invitation
@politeness @respect @vocative

Modifying attribute on aspect
@just @soon @yet @not

Attribute for convention
@passive @pl @angle_bracket @brace
@double_parenthesis @double_quote
@parenthesis @single_quote @square_bracket
22
Discourse (Inter-sentence) Relations
are missing in current CDL.nl
Discourse Relations at ISO/TC37/SC4/TDG3 (34 types)






derivation
causes
conditional
inference
purpose
trigger











compromise
conflict
contrast
unconditional

comparison
disjunction
dissimilar
manner
otherwise
proportion
similar
strongComparison














detail
element
example
extraction
general-specific
minimum
part
process-step
Restatement
constraint
supplement
background
content
evaluation
23
Concept Description Levels
Surface Level
Concept Description
Deep Semantic Level


There are several choices for the deep semantic-level description depending on
applications. On the other hand, a certain consensus has been reached wrt
“Concept Description”, which is slightly below the surface level, through
decades of research on NLP, machine translation and electronic dictionaries.
Although a complete consensus has not yet been achieved regarding the Concept
Description level and its description scheme, it is meaningful to set up a common
concept description format as an international standard today.
24
Hierarchical Construction of
Concept Representation in CDL.nl
situation (discourse): temporal and causal relations, etc., and coreference
composite concept/event (complex sentence): agent-patient relation, phrasal relation, etc.
single event (single sentence): consisting of proposition and modality components
composite entity: predicate, case components, predicate-modification components, etc.
elementary thing/entity: corresponding to disambiguated word sense
25
Current Major Issues in CDL.nl

Semi-automatic Conversion from Text.
(Text generation from CDL.nl is not so difficult.)

Semantic Retrieval of CDL Data
(The design of a CDL Query Language (CDQL)
and its processing mechanism)

Killer Application(s)
-- information exchange and sharing beyond the language
barrier.
-- semantic patent document retrieval.
2015/7/17
26
Approaches for Generating CDL Data

Manual Coding & Editing

Even in this case, a graphical input editor is necessary.

Graphical Input & Editing （Hasida’s Semantic Authoring)

Some Manual Tagging to Text, then Conversion into
CDL.

Semi-automatic Conversion from Text (1): our current approach

Graphical interface for selecting the right one among possible
candidates.

Semi-automatic Conversion from Text (2): an approach taken at UNDL foundation

Manual disambiguation of the word sense (a pull-down menu
selection), then automatic conversion into CDL.
Full Automatic Conversion (ultimate goal)
27
Semantic Authoring (by K. Hasida):
A Graphical Input Approach
Coarse Grain
Fine Grain
28
Semantic Parsing

Language processing is going through:
- Syntactic parsing
- Dependency parsing
- Shallow semantic parsing

Semantic Role Labeling

Given a sentence:
The brave soldiers fought with their enemies for their country in the War

Assign predicate-argument roles to sentence elements.
(Who did What to Whom, When, Where, Why, How, etc.)
[ARG0 The brave soldiers] [rel fought] with [ARG1-with their enemies]
for [ARG2-for their country] in [ARGM-loc the War] (in PropBank)

Corpora: PropBank, FrameNet, …
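
The bracketed PropBank-style annotation of the example sentence can be read back into (role, span) pairs with a single regular expression. This is only an illustrative sketch: `ROLE_PATTERN` is a hypothetical name, and the annotation is assumed to be written with a space after each role label.

```python
import re

# "[LABEL some words]" -> ("LABEL", "some words")
ROLE_PATTERN = re.compile(r"\[(\S+)\s+([^\]]+)\]")

annotated = ("[ARG0 The brave soldiers] [rel fought] with [ARG1-with their enemies] "
             "for [ARG2-for their country] in [ARGM-loc the War]")

roles = ROLE_PATTERN.findall(annotated)
for label, span in roles:
    print(f"{label:10s} {span}")
```

Only the bracketed spans are labeled, so the prepositions left outside the brackets ("with", "for", "in") are dropped by this reading.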
29
Dependency Parser as
a lower basis of our Semantic Analysis
Connexor Machinese
Text Analyser

Dependency Functions close to the semantic role
main (main element)
agt (agent): The agent by-phrase in passive sentences. Ex) The dog was chased by the boy.
ins (instrument)
tmp (time)
dur (duration) Ex) ...experience in the past 10 years.
man (manner)
loc (location)
sou (source) Ex) ... move away from the street.
goa (goal) Ex) ... shift to a full power.
pth (path) Ex) ... travel from Tokyo to Beijing.
cnt (contingency (purpose or reason)) Ex) ... unable to say why he was too ....
cod (condition)
qn (quantifier)

Syntactic Functions
pcomp (prepositional complement) Ex) They are in that red car.
phr (verb particle) Ex) She looked up the word in the dictionary.
subj (subject)
obj (object)
comp (subject complement) Ex) John remains a boy.
dat (indirect object) Ex) John gave her an apple.
oc (object complement) Ex) John called him a fool.
copred (copredicative) Ex) John regards him as foolish.
com (comitative) Ex) Drinking with you is nice.
voc (vocative) Ex) John, come here!
frq (frequency)
qua (quantity)
meta (clause adverbial) Ex) So far, he has been ….
cla (clause initial adverbial) Ex) Under his guidance, they can ....
ha (heuristic prepositional phrase attachment) Ex) escape through ..., fight for ...
det (determiner)
neg (negator) Ex) not
attr (attributive nominal) Ex) industrial editor
mod (other postmodifier) Ex) … of …
ad (attributive adverbial) Ex) So much for modern technology,
cc (coordination) Ex) and
30
Named Entity Recognition

In Connexor Machinese Text Analyser





+ org (organization, company)
+ loc (location)
+ ind (individual)
+ name (name)
+ role (occupation, title)
This info. is useful as a lexical feature for the semantic parsing.
31
Conversion of text into CDL.nl
through Shallow Semantic Parsing

Original text
The records retrieved in answer to queries become information that can be
used to make decisions . #

Separate each word with an ID
The \ w37 records \ w38 retrieved \ w39 in \ w40 answer \ w41 to \ w42 queries \ w43
become \ w44 information \ w45 that \ w46 can \ w47 be \ w48 used \ w49 to \ w50 make
\ w51 decisions \ w52 . \ w53 #

Connexor Analyser’s output
det: ( w38 w37 ) subj: ( w44 w38 ) mod: ( w38 w39 ) loc: ( w39 w40 )
pcomp: ( w40 w41 ) mod: ( w41 w42 ) pcomp: ( w42 w43 ) main: ( w36 w44 )
comp: ( w44 w45 ) subj: ( w47 w46 ) v-ch: ( w48 w47 ) v-ch: ( w49 w48 )
mod: ( w45 w49 ) pm: ( w51 w50 ) cnt: ( w49 w51 ) obj: ( w51 w52 ) obj: ( w51 w53 ) #

Hand-coded partial CDL.nl relations
obj(w39 w38) gol(w39 w41) pur(w39 w43) aoj(w44 w38) obj(w44 w45)
obj(w49 w45) pur(w49 w51) obj(w51 w52) #
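
The intermediate formats shown on this slide are regular enough to parse with two patterns. The sketch below is an assumption-laden illustration: `parse_word_ids` and `parse_dependencies` are hypothetical helpers, and reading each pair as (head, dependent) is inferred from the example (for instance `det: ( w38 w37 )` links "records" to "The").

```python
import re

WORD_ID_PATTERN = re.compile(r"(\S+)\s*\\\s*(w\d+)")                 # word \ w37
DEP_PATTERN = re.compile(r"([\w-]+):\s*\(\s*(w\d+)\s+(w\d+)\s*\)")   # label: ( wA wB )

def parse_word_ids(text: str) -> dict:
    """Map each word ID to its surface word."""
    return {word_id: word for word, word_id in WORD_ID_PATTERN.findall(text)}

def parse_dependencies(text: str) -> list:
    """Return (label, head_id, dependent_id) triples from the analyser output."""
    return [match.groups() for match in DEP_PATTERN.finditer(text)]

words = parse_word_ids(r"The \ w37 records \ w38 retrieved \ w39 in \ w40 answer \ w41")
deps = parse_dependencies("det: ( w38 w37 ) subj: ( w44 w38 ) mod: ( w38 w39 ) v-ch: ( w48 w47 ) #")

for label, head, dependent in deps:
    # Fall back to the raw ID when the word is outside the excerpt.
    print(label, words.get(head, head), words.get(dependent, dependent))
```

The hand-coded CDL.nl relations on the slide use the same word IDs, so pairs parsed this way can be aligned with them directly when building training data.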
32
Relation/Role Set Comparison

PropBank

describes how a verb relates to its arguments.
FrameNet
describes how words relate to their arguments within a common scenario (frame).
Common disadvantages of the FrameNet & PropBank role sets:
- The sets cover only predicate-argument roles; they don’t consider any other types
of relationships between entities.



CDL.nl
CDL relation set describes how words are correlated and what the meanings of
their relationships are.
Advantages of CDL.nl relation set
- Each relation in the set is pre-defined, along with the information that
distinguishes it from other similar relations.
- It describes not only predicate-argument relations, but also relations between any
pair of entities that have a meaningful relationship. Thus it has better coverage.
- The set has been chosen so that every concept expressed in texts can be
sufficiently encoded.
- It is universal, i.e., language independent, and can be applied to any language.


33
CDL Relation Set

Used to describe the shallow semantic structure of text.

Relations have been chosen to be able to sufficiently represent the
semantic concepts of texts, and are predefined.

The set of relations contains all relation types which are organized
roughly into three groups:

intra-event relations (22)
agt(agent), aoj(thing with attribute), cag(co-agent), cao(co-thing with
attribute), ptn(partner), …..

inter-entity relations (13)
and(conjunction), con(condition), seq(sequence), …..

qualification relations (9)
mod(modification), pos(possessor), qua(quantity), …..
34
Frequencies of CDL Relations

Data sparseness :

Total number of relations: 13487
Relation types: 44

Average number per relation type: 306.5

Relation  #rel   Relation  #rel   Relation  #rel   Relation  #rel
Mod       3128   Pos       86     Ben       27     Icl       10
Obj       2697   Scn       71     Tmt       25     Via       9
Aoj       2069   Rsn       65     Pof       24     Coo       8
And       1122   Src       63     Frm       23     Per       8
Agt       1046   Cnt       61     Or        21     Ins       8
Man       788    Dur       58     Fmt       20     Plt       7
Plc       446    Bas       49     Tmf       19     Ptn       6
Gol       395    Met       47     Seq       17     Plf       4
Tim       321    Equ       46     To        12     Cao       2
Pur       289    Nam       41     Iof       11     Opl       1
Qua       269    Con       41     Cag       10     Cob       0
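
The data-sparseness point can be made concrete with a few lines of arithmetic over the listed frequencies. This is a sketch only: the pairing of relation names with counts is read off the slide in listed order (an assumption), and the average uses the reported total of 13487.

```python
NAMES = ("Mod Obj Aoj And Agt Man Plc Gol Tim Pur Qua "
         "Pos Scn Rsn Src Cnt Dur Bas Met Equ Nam Con "
         "Ben Tmt Pof Frm Or Fmt Tmf Seq To Iof Cag "
         "Icl Via Coo Per Ins Plt Ptn Plf Cao Opl Cob").split()
COUNTS = [3128, 2697, 2069, 1122, 1046, 788, 446, 395, 321, 289, 269,
          86, 71, 65, 63, 61, 58, 49, 47, 46, 41, 41,
          27, 25, 24, 23, 21, 20, 19, 17, 12, 11, 10,
          10, 9, 8, 8, 8, 7, 6, 4, 2, 1, 0]
FREQ = dict(zip(NAMES, COUNTS))

REPORTED_TOTAL = 13487
average = REPORTED_TOTAL / len(FREQ)

# Relation types with more instances than the average.
above_average = [name for name, n in FREQ.items() if n > average]
# Share of all instances taken by the five most frequent types.
top5_share = sum(sorted(COUNTS, reverse=True)[:5]) / REPORTED_TOTAL

print(f"average per type: {average:.1f}")
print(f"types above average: {len(above_average)} of {len(FREQ)}")
print(f"share of top 5 types: {top5_share:.1%}")
```

A distribution this skewed leaves most relation types with very few training instances, which is the motivation for training only on the more frequent types.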
35
Feature spaces
Combine information from different language processes:



syntactic analysis
- tells the details of the word forms used in the text and the syntactic
structures among words.
dependency analysis
- A dependency relation specifies an asymmetric relationship
between words, where one word is a dependent of the other word,
which is called its governor.
lexical construction
- Lexical meaning contains two parts of information: word sense and
semantic behavior, which is all the semantic relationships the word
may have.
36
Syntax and Dependency Features

Connexor Machinese Text Analyser
based on a functional dependency grammar

Syntax Features
Morphology features
Syntactic features

Dependency Features
Relation type
Dependency Path
37
Identification of Entity Pair
with a Semantic Relation
Testing all possible pairs is not efficient.

Step 1: For each input sentence, generate
a dependency tree that specifies the
syntactic head in the sentence.

Step 2: Find a headNode set from the
dependency tree. Each can be a
headword of a head entity that governs a
relation. We select nodes that have a
subtree, and omit those that cannot be
headNodes by creating a head stoplist.

Step 3: For each headNode, check its
subtrees to find those that can be tail
entities to the headNode. We create a tail
stoplist containing those that cannot be
root nodes of subtrees of tail entities.
Repeat this process.

Step 4: A simple post-processing is
applied to correct entity boundaries
where the dependency tree does not show
the correct relationship.
A dependency tree generated from
Connexor Machinese Analyser
Entity pairs
[fought, (the brave soldiers)]
[fought, (their enemies)]
[fought, (their country)]
[fought, (the War)]
[soldiers, brave]
[enemies, their]
[country, their]
……..
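
Steps 2 and 3 can be sketched on the example sentence as follows. This is an illustration under assumptions, not the authors' implementation: the dependency edges are hand-written to resemble the tree on the next slide, the two stoplists (`HEAD_STOPLIST`, `TAIL_STOPLIST`) are hypothetical, prepositions are stepped over to reach their complement, and the Step 4 post-processing is omitted.

```python
TOKENS = "The brave soldiers fought with their enemies for their country in the War".split()

# (head index, label, dependent index); hand-written for illustration.
EDGES = [(3, "subj", 2), (2, "attr", 1), (2, "det", 0),
         (3, "phr", 4), (4, "pcomp", 6), (6, "attr", 5),
         (3, "ha", 7), (7, "pcomp", 9), (9, "attr", 8),
         (3, "loc", 10), (10, "pcomp", 12), (12, "det", 11)]

HEAD_STOPLIST = {"with", "for", "in"}   # words that cannot head an entity pair
TAIL_STOPLIST = {"det"}                 # labels whose subtree cannot be a tail entity

def subtree(node, children):
    """All token indices in the subtree rooted at node."""
    nodes = [node]
    for _, child in children.get(node, []):
        nodes.extend(subtree(child, children))
    return nodes

def entity_pairs(tokens, edges):
    children = {}
    for head, label, dep in edges:
        children.setdefault(head, []).append((label, dep))
    pairs = []
    for head in sorted(children):                 # Step 2: nodes that have a subtree
        if tokens[head] in HEAD_STOPLIST:
            continue
        for label, dep in children[head]:         # Step 3: candidate tail subtrees
            if label in TAIL_STOPLIST:
                continue
            if tokens[dep] in HEAD_STOPLIST:
                # Step over a preposition to its complement(s).
                roots = [child for _, child in children.get(dep, [])]
            else:
                roots = [dep]
            for root in roots:
                words = [tokens[i] for i in sorted(subtree(root, children))]
                pairs.append((tokens[head], " ".join(words)))
    return pairs

for head_word, tail_entity in entity_pairs(TOKENS, EDGES):
    print(f"[{head_word}, ({tail_entity})]")
```

Restricting candidates to head nodes and their subtrees keeps the number of pairs linear in the size of the tree, instead of testing every pair of words in the sentence.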
38
Features for CDL Relation Recognition
[Figure: dependency tree of "The brave soldiers fought with their enemies for their country in the War", rooted at main: fought, with nodes labeled subj, attr, det, phr, ha, loc and pcomp.]
Syntactic and Dependency-path features are taken from the tree.
Lexical features from WordNet, VerbNet and UNLKB.
Some labels of Connexor Machinese Analyser:
ha (prepositional phrase attachment), phr (verb particle), pcomp (prepositional complement)
39
Experimental Setting

Datasets:

Manually annotated dataset (1700 sentences)
- Contains 13487 CDL.nl relations (44 types of relations)

We train on the top 36 relation types, which have a large number of
training instances.

Tools



SVM-light software
Connexor Machinese Text Analyser
Scheme


10-fold cross validation
One-vs-all
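
The evaluation scheme (one-vs-all classification with 10-fold cross validation) can be outlined in plain Python. This sketch does not call SVM-light; `one_vs_all_labels` and `k_fold_indices` are hypothetical helpers that only show how the binary labels and the folds could be prepared.

```python
def one_vs_all_labels(labels, target):
    """Binary labels for one classifier: +1 for the target relation, -1 otherwise."""
    return [1 if label == target else -1 for label in labels]

def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

gold = ["agt", "obj", "agt", "mod", "obj"]
print(one_vs_all_labels(gold, "agt"))

for train, test in k_fold_indices(len(gold), k=5):
    print("train:", train, "test:", test)
```

With one binary classifier per relation type, the predicted relation for an entity pair would typically be the type whose classifier gives the highest score.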
40
Results
The table shows the performance of the SVM using
individual kernels incrementally.

Kernel    Precision  Recall  F-value
KS        80.10      86.35   83.11
KD        85.43      83.57   84.49
KL        73.98      82.75   78.12
KS+D      87.19      86.27   86.73
KS+D+L    87.35      88.07   87.71

S: syntactic features
D: dependency features
L: lexical features
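
The F-values reported here are consistent with the harmonic mean of precision and recall. A small check, with `f_value` as an illustrative helper name:

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per kernel, as reported in the results.
RESULTS = {
    "KS": (80.10, 86.35),
    "KD": (85.43, 83.57),
    "KL": (73.98, 82.75),
    "KS+D": (87.19, 86.27),
    "KS+D+L": (87.35, 88.07),
}

for kernel, (precision, recall) in RESULTS.items():
    print(f"{kernel:8s} {f_value(precision, recall):.2f}")
```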
41
Supporting Input Editor

Selection among possible candidates proposed by a
computer analysis.
(Like Japanese Input Front-end Processor.)

Graphical verification and editing.
42
Semantic Search with CDL.nl
beyond Keyword-based Search







Baseline: Combination of keywords
(The use of bi-gram, tri-gram,… does not lead to the improvement of
performance.)
A pair of dependent words leads to a slight improvement.
Search using natural language queries is preferable in a sense.
However, when the search result is unsatisfactory, it is not obvious
for a user how to modify the query sentence.
The CDL.nl-based Search allows a more specified search based
on a set of words with named dependency relations, rather than
with the simple (non-named) dependency.
It also allows a search using more specific word concepts
such as one modified by attributes and/or larger concept units
than a single word.
It allows a search taking account of semantic relevancy, such as
similarity between two words, a relation between words in a
sentence, etc.
43
Interface for CDL.nl Data Retrieval
(Query)

Query by Natural Language

SQL-like Query Language: CDQL

Graphical Query Interface for CDL.nl data
44
Approach to Implementing CDQL

1st Step
- It is not easy to implement it from scratch.
- Thus we utilize SPARQL, which is the query language
for RDF data.
- SPARQL is backed by Jena RDB.

Next Step
- Possibly an original implementation of the CDQL processing.
45
CDL.nl Data Retrieval System
46
CDL to RDF
47
CDL.nl Data converted into
RDF Graphical Form
48
CDL Data Retrieval via SPARQL ::
a simple case
Query (the RealizationLabel
of) a person to whom John
reported.
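
The idea behind this query can be illustrated without an RDF store by matching triple patterns against CDL relations directly. This is a stand-in sketch, not SPARQL or CDQL: the `label` predicate, the triple layout and the `solve` function are all assumptions made for illustration.

```python
# CDL.nl relations of "John reported to Alice that ..." as plain triples.
TRIPLES = [
    ("#a01", "label", "report"),
    ("#a01", "agt", "#John"),
    ("#a01", "gol", "#Alice"),
    ("#a01", "obj", "#B01"),
    ("#John", "label", "John"),
    ("#Alice", "label", "Alice"),
]

def solve(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every pattern (terms starting with '?' are variables)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        matched = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            yield from solve(rest, triples, candidate)

# "A person to whom John reported."
QUERY = [
    ("?event", "label", "report"),
    ("?event", "agt", "?agent"),
    ("?agent", "label", "John"),
    ("?event", "gol", "?person"),
    ("?person", "label", "?name"),
]

answers = [b["?name"] for b in solve(QUERY, TRIPLES)]
print(answers)
```

A SPARQL basic graph pattern works in the same spirit: each pattern line constrains the variables further, and the goal (`gol`) relation is what distinguishes "to whom" from "who" or "what".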
49
CDL.nl Data Retrieval
via CDQL (an Extended SPARQL)
Query::
What did John report?
50
Semantically Flexible Matching
Query::
What did John take?
51
Toward the Foundation
of Next-generation Web
52
Immediate Applications of
Relation Extraction from Texts
Jie Yang, Dat Nguyen and Mitsuru Ishizuka
Dept. of Creative Informatics &
Dept. of Info. and Communication Eng.
School of Information Science and Technology
Relation Extraction from Wikipedia
William Henry Gates III (born
October 28, 1955) is the co-founder,
chairman, former chief software
architect, and former CEO of Microsoft
Corporation. He is also the founder of
Corbis, a digital image archiving
company…
(Microsoft, founder, Bill Gates)
(Microsoft, chairman, Bill Gates)
(Microsoft, CEO, Bill Gates)
(Corbis, founder, Bill Gates)
…
Microsoft Corporation,…
Headquartered in Redmond, Washington,
USA, its best selling products are the
Microsoft Windows operating system and
the Microsoft Office suite of productivity
software.
…
(Microsoft, location, Redmond)
(Microsoft, product, MS Windows)
(Microsoft, product, MS Office)
…
54
System Framework
[Figure: system framework of the Relation Extractor. Components shown: Wikipedia input; Pre-processing (Tag & link extractor, Sentence Splitter, Tokenizer, Phrase Chunker); Principal Entity Detector; Secondary Entity Detector; Sentence Detector; Keyword Extractor (e.g. FOUNDER: found, establish…; LOCATION: headquartered, situated…); Entity Classifier (Microsoft: ORG, Bill Gates: PER, Albuquerque: LOC); Dependency trees and Core Trees; SVM classifiers using the entity type feature and the sub tree feature for relations such as FOUNDER and LOCATION; output: Structured knowledge. Example sentence fragments: "Microsoft was…", "The company was", "MS co-founded… Bill Gates", "MS … in Albuquerque…".]
55
Triple Tagging allowing rich info.
in social tagging (folksonomy)
56
Triple Tag Extraction
57
TripleTag Editor
58
Semantic Retrieval of Patent Documents


Represent patent texts in CDL.nl (started in 2008).
Also contributes to the translation of patents.
59
I have introduced our research on
Semantic Computing
centered around
CDL (Concept Description Language).
Thank You
Mitsuru Ishizuka
Univ. of Tokyo
60
