Transcript Slide 1

Lexical knowledge schemes
for modeling words and
expressions in communication
Computational Lexicology & Terminology Lab
Wauter Bosma
Isa Maks
Roxane Segers
Hennie van der Vliet
Piek Vossen
LCC meeting, October 9th, 2008, VU University Amsterdam
Overview
• Genre as a knowledge scheme
• What do we do at CLTL?
• How does it relate to genre?
• Projects at CLTL
• Discussion
A view on genre
• Genre is an abstract knowledge scheme
that natural language speakers can apply
to effectively structure communication.
– How and where is such a scheme stored?
– How is this knowledge activated and applied
in a communicative setting?
– How can we benefit from these insights in
computerized information and communication
systems?
Diagram labels:
– Social behaviour: targets, intentions, participants, attitudes
– Communication: genre, strategy, medium, form
– Text: structure & content
– Language: grammar, lexicon
– World Knowledge: entities, relations
– Ontology: objects, relations
Focus of Computational Lexicology
and Terminology Lab (CLTL)
• Lexicon = model of abstract knowledge to efficiently
process and produce natural language in communicative
settings
• Symbolic & abstract representation of forms related to
concepts:
– forms are variants that can refer to more-or-less the same
semantic content:
• shoot (V) – shooting (N) – aggression (N) – fight (N) – conflict (N) – war (N) – WOII (Name)
• pay (V) – exchange (V) – buy (V) – sell (V) – merchandise (N) – trade (N) – business (N)
• Also encode pragmatic aspects of use
– Sentiment, subjectivity & attitude
– Perspective
– Domain restrictions
Focus of Computational Lexicology
and Terminology Lab (CLTL)
• Broad notion of knowledge:
• words & expressions (what is a word, what is a concept?)
• phrases, sentences and text (incorporating grammar)
• genres
• Abstract symbolic representations related to
statistical expectation patterns
• Tagged corpus represents an 'experience' of language use:
– "X drinks beer", "Y drinks wine", "Z drinks milk"
• Lexicon is the highest abstraction of these experiences that
gives the most effective prediction of how words and
expressions behave:
– "XYZ drink beverages"
• Corpus-based lexicon or corpus data represented as a
lexicon
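The step from tagged observations to a lexicon entry can be sketched as follows. This is a minimal illustration, not the lab's method: the hypernym table and the function name `generalize` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical hypernym table; a real lexicon would supply these links.
HYPERNYM = {"beer": "beverage", "wine": "beverage", "milk": "beverage"}

def generalize(observations):
    """Abstract (subject, verb, object) observations to (verb, object class)."""
    patterns = defaultdict(set)
    for subject, verb, obj in observations:
        # Fall back to the object itself when no hypernym is known.
        patterns[(verb, HYPERNYM.get(obj, obj))].add(subject)
    return dict(patterns)

# The three 'experiences' from the slide.
corpus = [("X", "drink", "beer"), ("Y", "drink", "wine"), ("Z", "drink", "milk")]
lexicon = generalize(corpus)
print(lexicon)
```

The three observations collapse into one entry keyed by ("drink", "beverage"), which is the "XYZ drink beverages" abstraction.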
Focus of Computational Lexicology
and Terminology Lab (CLTL)
• Validation of models and databases with lexical
knowledge:
– Can we define types of structures (lexical and compositional
expressions) that correctly predict their behavior in language
use? For example:
• pluriform-object-count-noun: police
• object-count-noun: police officer
• group-object-count-noun: eikenbos (oak forest)
• mass-object-uncount-noun: bos (forest)
– Can we build a comprehensive database using these types?
• Use the database in corpus research and analysis:
– import corpus data into the lexical database
– apply the database to textual corpora in computer applications:
• Automatic tagging of corpora with features
• Automatically mine textual data using the lexicon as a background
knowledge resource, e.g. to find facts of causal relations for
environmental phenomena
Lexical database
-generic list of words and terms
-abstracts from various text corpora
-differentiation for different domains
and genres
-most generic representation
-in a language community
Map
Ontology
-concepts instead of words
-identity criteria
-language neutral
-domain and perspective neutral
-no genre dependency
-logically valid
-for inferencing
Validate
Integrate
Term database:
-generic list of terms
-derived from text corpus
Derive
-patterns and features that
are dominant in domain and genre
Text corpus with empirical data
-linear text
-every word occurrence is unique
-domain and genre specific
Projects at CLTL
• Cornetto (Stevin project: STE05039)
• Kyoto (FP7 ICT Work Programme 2007 under
Challenge 4 - Digital libraries and Content,
project ICT-211423)
• Camera projects:
– From sentiments and opinions in text to positions of
political parties
– The semantics of history
• A term bank for the Belastingdienst (Steunpunt
Terminologie)
• DutchSemCor (NWO investment grant)
Cornetto
• COmbinatorial Relational NEtwork voor Taal
TOepassingen (combinatorial relational network
for language applications)
• Goal: to develop a lexical semantic database
for Dutch:
– 90K Entries: generic and central part of the
language
– Rich horizontal and vertical semantic relations
– Combinatoric information
– Ontological information
Lexical Unit & Synsets
• Lexical Unit = form-meaning relation, such that:
– form = abstract representation of certain realizations;
– part-of-speech is the same;
– meaning is the same, where meaning is defined by a
reference to a unique Synset;
• Synset = Set of synonyms (LUs) that refer to the
same entities in most contexts.
– Defined by lexical semantic relations;
– Defined by reference to ontology Terms or logical
expressions involving Terms from the ontology;
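A minimal sketch of these two definitions as data structures. The class and field names are assumptions made for illustration and do not reproduce the Cornetto schema; the sense identifiers and glosses follow the band example used later in the slides.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LexicalUnit:
    form: str        # abstract form covering its realizations
    pos: str         # part of speech
    synset_id: str   # meaning = reference to exactly one synset

@dataclass
class Synset:
    synset_id: str
    gloss: str
    relations: dict = field(default_factory=dict)       # lexical semantic relations
    ontology_terms: list = field(default_factory=list)  # links into the ontology

def senses(units, form, pos):
    """All lexical units sharing a form and part of speech."""
    return [u for u in units if u.form == form and u.pos == pos]

synsets = {
    "band#1": Synset("band#1", "music group"),
    "band#2": Synset("band#2", "tire"),
    "band#5": Synset("band#5", "bond"),
}
units = [LexicalUnit("band", "noun", sid) for sid in synsets]
band_senses = senses(units, "band", "noun")
```

One form with three lexical units gives three senses, each resolved through its own synset.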
Data Organization
Lexical Unit: corresponds to a word-meaning pair
– form
– morphology
– syntax
– semantics
– pragmatics
– usage examples
Internal relations: synonyms
Synset: models meaning relations
Ontology (SUMO, MILO): collection of Terms and Axioms
Wordnets: Princeton Wordnet, Czech Wordnet, Spanish Wordnet,
German Wordnet, Korean Wordnet, Arabic Wordnet, French Wordnet
Wordnet Domains
Data overview
                      ALL       NOUNS    VERBS    ADJ.     ADV.   Other
Synsets               70,434    52,888   9,053    7,703    220    570
Lexical Units         118,466   85,278   17,363   15,731   73     21
Lemmas (form+pos)     91,991    70,556   9,055    12,307   73     n.a.
Synonyms in synsets   102,572   74,893   14,091   12,899   84     605
CID records           103,668   75,812   14,093   13,089   484    190
Synonyms per synset   1.46      1.42     1.56     1.67     0.38   1.06
Senses per lemma      1.29      1.21     1.92     1.28     1.00   n.a.
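The two ratio rows follow from the count rows: synonyms per synset is synonyms in synsets divided by synsets, and senses per lemma is lexical units divided by lemmas. A short check, with the counts copied from the slide (the dictionary and function names are only for this example):

```python
COLUMNS = ["ALL", "NOUNS", "VERBS", "ADJ.", "ADV.", "Other"]

synsets = dict(zip(COLUMNS, [70434, 52888, 9053, 7703, 220, 570]))
synonyms = dict(zip(COLUMNS, [102572, 74893, 14091, 12899, 84, 605]))
lexical_units = dict(zip(COLUMNS, [118466, 85278, 17363, 15731, 73, 21]))
# No lemma count is given for "Other" (n.a. on the slide).
lemmas = dict(zip(COLUMNS[:5], [91991, 70556, 9055, 12307, 73]))

def ratio(numerator, denominator):
    """Column-wise ratio, rounded to two decimals, for columns present in both."""
    return {c: round(numerator[c] / denominator[c], 2)
            for c in numerator if c in denominator}

synonyms_per_synset = ratio(synonyms, synsets)
senses_per_lemma = ratio(lexical_units, lemmas)
```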
Combinatorics
band#1 (band):
– in een band spelen (to play in a band)
– een band oprichten (to start a band)
– de band speelt (the band plays)
band#2 (tire):
– de band oppompen (to pump air into a tire)
– een band plakken (to fix a hole in a tire)
– een lekke band (flat tire)
– de band springt (the tire explodes)
band#3/geluidsband (audio tape):
– de band starten (to start a tape)
– op de band opnemen (to record on a tape)
– de band afspelen (to play from a tape)
band#5 (bond):
– een goede/sterke band (a good/strong bond)
– de banden verbreken (to break all bonds)
– een band hebben met iemand (to have a bond with s.o.)
Concepts in the network around each sense:
– band#1: muziekgezelschap (music group), gezelschap (group of people),
groep (group), popgroep (pop group), jazzband (jazz band), artiest (artist),
muzikant (musician), muziek (music), musiceren (to make music)
– band#2: ring (ring), fietsband (bike tire), zwemband (swim ring),
binnenband (inner tire), buitenband (outer tire), autoband (car tire)
– band#3: geluidsdrager (audio carrier), informatiedrager (data carrier),
lezen (read), schrijven (write), cassettebandje (audio cassette)
– band#5: verhouding (relation), relatie (relation), toestand (state),
familieband (family bond), moederband (mother bond), bloedband (blood bond)
– other nodes: voorwerp (object), middel (device)
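How combinatoric information can separate the senses of band is sketched below. The collocate sets are taken from the phrases on this slide; the overlap-counting heuristic and the name `guess_sense` are assumptions for illustration, not the Cornetto approach.

```python
# Collocates per sense, read off the example phrases on the slide.
COLLOCATES = {
    "music group": {"spelen", "oprichten", "speelt"},
    "tire": {"oppompen", "plakken", "lekke", "springt"},
    "audio tape": {"starten", "opnemen", "afspelen"},
    "bond": {"goede", "sterke", "verbreken"},
}

def guess_sense(phrase):
    """Pick the sense whose collocates overlap most with the phrase, if any."""
    tokens = set(phrase.lower().split())
    scores = {sense: len(tokens & words) for sense, words in COLLOCATES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_sense("een lekke band"))
print(guess_sense("op de band opnemen"))
```

A phrase without any known collocate, such as "de band", returns None rather than a guessed sense.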
Integrating the ontology:
Sumo terms and axioms
Lexicon versus Ontology
LABELS for ROLES:
{bluswater}
{theewater}
{koffiewater}
Ontology

Abstract
Process
{buy}
subj
obj
ind obj
Possession
Transaction
receiver
giver
goods
{sell}
subj
obj
ind obj
Physical
Element
H2O
Organism
Dog
CO2
PoodleDog
NAMES for TYPES:
{poodle}EN
{poedel}NL
{pudoru}JP
(instance x Poodle)
LABELS for ROLES:
{watchdog}EN,
{waakhond}NL,
{banken}JP
((instance x Canine)
(role x GuardingProcess))
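The difference between names for types and labels for roles can be sketched as a mapping from words to ontology expressions. The tuple encoding and the function names are assumptions for this example; the words and ontology terms are those on the slide.

```python
# Each word maps to a tuple of (predicate, ontology term) pairs.
MAPPINGS = {
    ("poodle", "EN"): (("instance", "Poodle"),),
    ("poedel", "NL"): (("instance", "Poodle"),),
    ("pudoru", "JP"): (("instance", "Poodle"),),
    ("watchdog", "EN"): (("instance", "Canine"), ("role", "GuardingProcess")),
    ("waakhond", "NL"): (("instance", "Canine"), ("role", "GuardingProcess")),
    ("banken", "JP"): (("instance", "Canine"), ("role", "GuardingProcess")),
}

def is_role_label(word, lang):
    """A word labels a role if its expression contains a role predicate."""
    return any(pred == "role" for pred, _ in MAPPINGS[(word, lang)])

def equivalents(word, lang):
    """Words in any language that share the same ontology expression."""
    target = MAPPINGS[(word, lang)]
    return {key for key, expr in MAPPINGS.items()
            if expr == target and key != (word, lang)}
```

Under this encoding, watchdog needs no type of its own in the ontology: it is a Canine playing a role in a GuardingProcess, and its translations are found by comparing expressions.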
Kyoto
• Knowledge Yielding Ontologies for Transition-Based Organization
• Funded:
– 7th Framework Program-ICT of the European Union: Intelligent Content
and Semantics
• Goal:
– Platform for knowledge sharing across languages and cultures
– Enables knowledge transition and information search across different
target groups, transgressing linguistic, cultural and geographic
boundaries.
– Open text mining and deep semantic search
– Wiki environment that allows people in the field to maintain their
knowledge and agree on meaning without knowledge engineering skills
• URL: http://www.kyoto-project.eu/
• Duration: March 2008 – March 2011
• Effort: 364 person months of work
KYOTO (ICT-211423) Overview
• Languages:
– English, Dutch, Italian, Spanish, Basque, Chinese, Japanese
• Domain:
– Environmental domain, BUT usable in any domain
• Global:
– Both European and non-European languages
• Available:
– Free: as open source system and data (GPL)
• Future perspective:
– Content standardization that supports world wide communication
– Global Wordnet Grid
Citizens
Governors
Companies
Environmental
organizations
Domain
Wikyoto
Environmental
organizations
Global Wordnet Grid
Universal Ontology
Wordnets

Top
Concept
Mining
Abstract Physical
Process
Capture
Tybots
Fact
Mining
Substance
Docs
Sudden increase
of CO2 emissions
URLs
in 2008 in Europe
Middle
water CO2
Domain water CO2
pollution emission
Index
Kybots
Experts
Images
Dialogue
Search
User perspective
• Ecosystem services
– nature as a resource: food, transport,
recreation, medicine, material
– nature for waste absorption
– economic dependency
– state of nature
– footprint
– poverty
Lexicon versus Ontology
Ecosystem services
Ontology
-Nature as a resource
-Nature for waste absorption

-State of nature
-Threats to nature
Abstract
Physical
Possession
Transaction
species migration
Artifacts
Element
Process
H2O
qualifies
Organism
Spider
CO2
alien invasive species
qualifies
greenhouse gas
ecosystem-based drinking water production
green roof
branding rural products
sustainable products
System components
• Wikyoto = wiki environment for a social group:
– to model the terms and concepts of a domain and agree on their
meaning, within group, across languages and cultures
– to define the types of knowledge and facts of interest
• Tybots = Term extraction robots, extract term data from
text corpus
• Kybots = Knowledge yielding robots, extract facts from a
text corpus
• Linguistic processors:
– tokenizers, segmentizers, taggers, grammars
– named entity recognition
– word sense disambiguation
– generate a layered text annotation in Kyoto Annotation Format
(KAF)
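The idea of a layered annotation, where each layer refers to the one below it by identifier, can be sketched as follows. The field names and identifiers are assumptions for illustration and do not reproduce the actual KAF schema.

```python
text = "the emission of greenhouse gases in agricultural areas"

# Layer 1: word tokens with identifiers w1, w2, ...
tokens = [{"id": f"w{i}", "form": w} for i, w in enumerate(text.split(), start=1)]

# Layer 2: terms that point at token identifiers instead of copying text.
terms = [
    {"id": "t1", "lemma": "emission", "span": ["w2"]},
    {"id": "t2", "lemma": "greenhouse gas", "span": ["w4", "w5"]},
    {"id": "t3", "lemma": "agricultural area", "span": ["w7", "w8"]},
]

def surface(term):
    """Recover the surface string of a term through the token layer."""
    forms = {t["id"]: t["form"] for t in tokens}
    return " ".join(forms[i] for i in term["span"])

print([surface(t) for t in terms])
```

Because higher layers only hold references, each processor can add its own layer without altering the output of the previous one.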
Capture
Server
Document Base
Linear KAF
Tybot server
(Term Extraction)
Semantic
Annotation
Fact User
Concept User
Domain
Wordnet
K-LMF
Extracted Terms
Generic K-TMF
Document Base
Linear KAF
Kybot Editor
Term Editor
(Wikyoto)
Kybot Server
(Fact
Extraction)
Kybot
Profiles
Domain Ontology
OWL_DL
Document Base
Linear Generic
KAF
Conceptual modeling
Linguistic
Processors
Source
Documents
[[the emission]NP
[of greenhouse gases]PP
[in agricultural areas]PP] NP
TYBOT
Concept
Miners
Morpho-syntactic analysis
English Wordnet
Ontology

Abstract
Process
Chemical
Reaction
location:3
Ontologize
Physical
substance:1
naturalprocess:1
of
Synthesize
region:3
area emission gas
emission:3
Substance
H2O
Term hierarchy
geographical area:1
area:1
gas:1 CO2
CO2
GreenhouseGas
emission:2
agricultural
area
greenhouse gas:1
rural area:1
CO2Emission
GlobalWarming
WaterPollution
greenhouse
gas
farmland:2
(instance s1 Substance) (instance e1 Warming) (catalyst s1 e1)
Axiomatize
in
CO2
Fact mining by Kybots
Morpho-syntactic analysis
Source
Documents
Ontology
Logical
Wordnets &
Expressions Linguistic Expressions

Abstract
Generic
Physical
Fact analysis
Patient
Process
Chemical
Reaction
[[the emission]NP ] Process: e1
[of greenhouse gases]PP Patient: s2
[in agricultural areas]PP] Location: a3
Substance
H2O
CO2
Patient
CO2
emission
[[the emission]NP
[of greenhouse gases]PP
[in agricultural areas]PP] NP
Linguistic
Processors
Domain
water
pollution
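The fact analysis shown on this slide, where chunks of the example phrase receive the roles Process, Patient and Location, can be sketched as a small pattern. The preposition-to-role table and the function name are assumptions for this example, not an actual Kybot profile.

```python
# Chunked example phrase from the slide: (chunk label, text).
chunks = [
    ("NP", "the emission"),
    ("PP", "of greenhouse gases"),
    ("PP", "in agricultural areas"),
]

# Hypothetical pattern: which role a preposition introduces.
PREP_ROLE = {"of": "Patient", "in": "Location"}
DETERMINERS = {"the", "a", "an"}

def assign_roles(chunks):
    """Head NP becomes the Process; PPs get a role from their preposition."""
    fact = {}
    for label, text in chunks:
        words = text.split()
        if label == "NP" and "Process" not in fact:
            if words[0] in DETERMINERS:
                words = words[1:]
            fact["Process"] = " ".join(words)
        elif label == "PP" and words[0] in PREP_ROLE:
            fact[PREP_ROLE[words[0]]] = " ".join(words[1:])
    return fact

fact = assign_roles(chunks)
```

A real profile would test ontology types as well (for example that the head noun denotes a Process) rather than rely on prepositions alone.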
Wordnets in LMF
Facts in RDF
FactAF
KAF
G-WN
Kyoto
Server
pdf
A..
...
decline
...
G-KON
SUMO DOLCE
plugin
plugin
Kybots
Ontologies in OWL
DE-WN
GEO
DE-KON
WIKIPEDIA
FRAMENET
KAF
Tybots
DE-TN
Simplified
Term Fragment
Hidden
Simplified
Ontology Fragment
population
Group
terrestrial marine
species species
?Population
Shown
population
Interview
...
..Z
Are terrestrial
species
a type of
populations?
Are terrestrial species
never
marine species?
Interview
Smart Kytext
.... populations
such as
terrestrial and
marine species .....
.... populations
declined
.....terrestrial and
marine species..
in forests
.....declined
Do populations
consist of
marine species?
Do populations
always consist of
marine species?
substance:1
Abstract Physical
natural process:1
Process
Ontologize
CO2 gas:1 emission:2 emission:3
H2O
CO2
Emission
Maximal
abstraction&
integrity
Synthesize
Substance
ChemicalReaction
Axiomatize
greenhouse gas:1

Ontology
Lexical database: wordnet
Global
Warming
Greenhouse CO2
Gas
Language
neutral
integrity
Text mining
(instance s1 Substance) (instance
e1 Warming) (catalyst s1 e1)
by Kybots
Term database
gas
greenhouse gas -> gas
-increase (AG)
-in 2003 (TIME)
CO2 -> greenhouse gas
-emission (PA)
-in European countries (LO)
Text corpus
Concept
Mining
by Tybots
Generic
text based
Sudden increase of greenhouse
gases in 2003 ... CO2 emission
in European countries ... Greenhouse
gases such as CO2, ...
Linear
text
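The term relation "CO2 -> greenhouse gas" can be derived from the "such as" phrase in the corpus text. A minimal sketch with a regular expression; the pattern and the function name are assumptions for illustration, not the Tybot implementation.

```python
import re

# "<class> such as <member>": the class is the text before "such as".
SUCH_AS = re.compile(r"([A-Za-z][A-Za-z ]*?) such as ([A-Za-z0-9]+)")

def hyponym_pairs(sentence):
    """Return (member, class) pairs found by the 'such as' pattern."""
    return [(member, cls.lower()) for cls, member in SUCH_AS.findall(sentence)]

pairs = hyponym_pairs("Greenhouse gases such as CO2 increased in 2003.")
print(pairs)
```

The pattern takes all words before "such as" as the class, so in running text it would need the noun phrase boundaries from the morpho-syntactic analysis to avoid overlong matches.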
From sentiments and opinions in
text to positions of political parties
• Most language use does not express facts but personal
opinions and positions with respect to facts or issues,
often disguised for some communicative or
manipulative goal.
• CAMERA project involving 2 PhD students (AIOs) from FdL and
1 PhD student (AIO) from Political Sciences
• Combines contemporary theories and methods in
linguistics and political science to develop an automated
research tool for rich text-mining:
– Complexity of language use, the linguistic modeling of
subjectivity and the representation of this knowledge in a lexicon.
– Complex dimensionality of competition between political parties.
• Mining tool for language-meaning research can be
applied to enhance the Kieskompas (Electoral
Compass).
aio-1
Modeling
Lexical Analysis
Lexical database
Derivation
Lexical acquisition
aio-2
Co-occurrence
Corpus Linguistics
Manual
Coding & Tagging
Quantitative
Text Analysis
Linguistic rules
Search
Concordance
Omstreden democratie:
-Jan Kleinnijenhuis
-Wouter van Atteveldt
Political Text
Corpus
Automated
Tagging &
Analysis
Political Database
Search
aio-3
Quantitative
Data Analysis
Political Analysis
Morpho-syntactic
Parsers
system integrator-4
Manual
Coding
Interpretation rules
AIO-1: Lexical model and acquisition
for sentiment and opinion analysis
in Dutch text
• Words & expressions in political text
• Model sentiment, subjectivity, lexical
framing and attitudinal implications
• Build a lexicon encoding these layers
• Validate the lexicon in the mining
application applied to the text corpus
Levels of subjectivity
• sentiment orientation, e.g.
– small (neutral), splendid (positive), dull (negative)
– funeral (negative), birthday party (positive), meeting
(neutral)
• explicit attitudinal and deontic implications
– hate, love, favour, desire, want
– impossible, possible, can, cannot
– demand, beg, hope, wish
• implicit attitudinal and deontic implications
– neutral: describe, cite, quote
– subjective: tell my story, shout, cry out, suggest
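The first level, sentiment orientation, can be sketched as a lexicon lookup. The entries and their polarity are the examples from this slide; the numeric encoding and the summing of scores are assumptions for illustration only.

```python
import re

# Orientation per entry: 1 = positive, 0 = neutral, -1 = negative.
ORIENTATION = {
    "small": 0, "splendid": 1, "dull": -1,
    "funeral": -1, "birthday party": 1, "meeting": 0,
}

def orientation_score(text):
    """Sum the orientation of every lexicon entry found as a whole word or phrase."""
    text = text.lower()
    total = 0
    for entry, value in ORIENTATION.items():
        if re.search(r"\b" + re.escape(entry) + r"\b", text):
            total += value
    return total

print(orientation_score("A splendid birthday party"))
print(orientation_score("a dull meeting"))
```

The attitudinal and deontic levels would need more than a polarity value per entry, for example who holds the attitude and towards what.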
Some concepts of saying
The reporter expresses attitude towards the subject (is not aware)
nazeggen:1, herhalen:4, echoën:2
meesmuilen:1
herkauwen:2
toesnauwen:1, aanblaffen:2, sissen:2, toebijten:1, toeblaffen:1
toesmijten:2,toevoegen:4
uitputten:3
verzuchten:1
pretenderen:1, beweren:1
Subject of speech act has attitude towards (is aware):
afzeggen:1, cancellen:1
ontkennen:1, miskennen:1, ontveinzen:1
toewensen:1, wensen:2
verbieden:1
aanzetten:12, beklemtonen:2, hameren:2, tamboereren:2 onderstrepen:2,
onderlijnen:1, accentueren:1
toezeggen:1, beloven:1
uitlaten:5, beoordelen:1
distantiëren:1
erkennen:2, toegeven:1
opmerken:2, aantekenen:4
Synsets or lexical units
• {brilliant:3, glorious:4, magnificent:1,
splendid:2}
• {bus:4, jalopy:1, heap:3}
– has_hyperonym: {car:1, auto:1, automobile:1,
machine:4, motorcar:1}
• {fiets:1, brik:7, kar:3, karretje:2, rijwiel:1,
velo:1}
The semantics of history
• Camera project involving 1 PhD student (AIO) from FdL
and 1 PhD student (AIO) from FEW (Exact Science)
• Goal: an ontology and lexicon for a
historical multimedia archive of the
Rijksmuseum.
• Applied to an innovative information
system for accessing the historical
archive.
The semantics of history =
semantics of change
• Represent different realities:
– related through causal changes over time
– representing different views or perspectives
on the same reality, e.g. from a different
historical angle or from different geographical
or social parties.
• Changes are typed as events
Events as key notions
• Historical events:
– events considered from a distance in time and abstraction of detail.
– referenced by names (WOII, de Val van Srebrenica), nouns (war) or
nominalizations (the violation of human rights)
• News events:
– Reports on (the same) reality but more in the active verbal form: US
soldiers shoot Iraqi citizens.
– Close to the actual event
– lacking a historical abstraction and filtering.
• Both news and historical accounts imply subjectivity and perspective on
these events, but they probably make different selections and use different
genres to convey this information.
• News becomes history over time, and we therefore expect a smooth
transition in the use of language to refer to the same events, adding
more and more historical perspective.
“Val van Srebrenica” in Wikipedia
• Headings:
– 1992 ethnic cleansing campaign
– The conflict in eastern Bosnia
– Struggle for Srebrenica
• Text:
– A fierce struggle for territorial control then ensued among the three
major groups in Bosnia: Bosniak (commonly known as 'Bosnian
Muslims'), Serb and Croat. In the eastern part of Bosnia, close to
Serbia, conflict was particularly fierce between Serbs and Bosniaks
– Serb military and paramilitary forces from the area and neighboring
parts of eastern Bosnia and Serbia gained control of Srebrenica for
several weeks in early 1992, killing and expelling Bosniak civilians. In
May 1992, Bosnian government forces under the leadership of Naser
Orić recaptured the town
– thus proceeded with the ethnic cleansing of Bosniaks from Bosniak
ethnic territories in Eastern Bosnia and Central Podrinje
Letter from the Dutch minister of
defense
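• De afgelopen zes maanden werd de uitvoering van deze
taken aanzienlijk bemoeilijkt door de Bosnisch-Servische
weigering de enclave voldoende te laten bevoorraden.
Door een gebrek aan brandstof moesten patrouilles te
voet worden uitgevoerd. Ook blokkeerden de Bosnische
Serviërs sinds mei jl. de rotatie van het personeel van
Dutchbat, waardoor de bezetting werd teruggebracht van
630 naar 430 blauwhelmen. De vijandelijkheden namen
geleidelijk toe, waardoor op 3 juni jl. een observatiepost
in het zuidoostelijke deel van de enclave moest worden
opgegeven.
(Over the past six months, carrying out these tasks was
considerably hampered by the Bosnian Serb refusal to allow
the enclave to be adequately supplied. Due to a lack of fuel,
patrols had to be carried out on foot. Since last May the
Bosnian Serbs also blocked the rotation of Dutchbat personnel,
which reduced the strength from 630 to 430 blue helmets.
Hostilities gradually increased, so that on 3 June an
observation post in the south-eastern part of the enclave
had to be abandoned.)
• Historical terms: blokkade (blockade), val (fall), opgave
(abandonment), overgave (surrender)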
Event
Ont.
Historic
Ont.
Alignment
Data model
Structured
Data
Conversion
Semi
Structured
Data
Term
Extraction
Ontolization
Terms
&
Relations
Ontology
Lexical
mapping
Lexicalization
Lexicon
Free
Text
Smart
Indexing
Objects
Events
Locations
Smart Retrieval
Validation
People
conflict
struggle
ethnic cleansing
….
killing
expelling
gain control
AIO at FdL
• Lexical framing of events in news reporting
and historical descriptions.
• Use historical thesaurus to group all the
words and expressions in a lexicon
relative to the same events
• Differentiate implications of the lexical
variation: packaging of events
• Classification of news
Thank you for your attention