Automatic event extraction from text
on the basis of linguistic and
semantic annotation
Thierry Declerck
DFKI – Language Technology Lab
JRC 2005/05/10
Events …
• Involve entities and relations between them
• Imply a change of state
– Example: The striker of Liverpool scored a
wonderful goal in the 87th minute.
• 1 event (goal-shot)
• 2 entities (person and team)
• 1 change of state (the scoring)
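The breakdown above can be held in a small data structure. The following is only a sketch: the class and field names (Entity, Event, state_change) are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    kind: str   # e.g. "person" or "team"
    name: str


@dataclass
class Event:
    kind: str                                      # e.g. "goal-shot"
    entities: list = field(default_factory=list)   # participants in the event
    state_change: str = ""                         # what changes because of the event


# The example sentence, encoded by hand (no automatic analysis involved here).
goal = Event(
    kind="goal-shot",
    entities=[Entity("person", "the striker"), Entity("team", "Liverpool")],
    state_change="the scoring",
)
```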
Events in textual documents
• Various types of text
– Structured: Example and Example_2
• Processing requires pattern-matching techniques. Very
little linguistic knowledge is needed
– Semi-structured: Example
• Requires a mixture of pattern matching and more linguistic
knowledge
– Unstructured: Example
• Requires a mixture of layout analysis and linguistic knowledge
• All types of text require a domain-specific
knowledge base (ontology) for event extraction
Domain Knowledge
• Domain knowledge can be organised in
terminologies, thesauri, taxonomies or
ontologies. Example of a (non-formal)
multilingual ontology for the soccer domain.
• More on ontology engineering in the talk by
Borislav
Automatic Event Extraction from
Text is
• A combination of human language
technology (HLT) and semantic web
technologies (ontologies)
• Can also be done on the basis of purely
statistical means (with minimal linguistic
knowledge), but we concentrate here on the
HLT-based approach
What is Human Language
Technology?
Linguistic Analysis
Language technology tools are needed to support the upgrade of
the current web to the Semantic Web (SW) by providing an
automatic analysis of the linguistic structure of textual
documents. Free text documents undergoing linguistic analysis
become available as semi-structured documents, from which
meaningful units can be extracted automatically (information
extraction) and organized through clustering or classification
(text mining). Here we focus on the following linguistic analysis
steps that underlie the extraction tasks: tokenization,
morphological analysis, part-of-speech tagging, chunking,
dependency structure analysis, semantic tagging.
Tokenisation
Tokenisation deals with the detection of the word units in a text
and with the detection of sentence boundaries.
The markets acknowledge the measures taken on the 24th of
September by the CEO of XYZ Corp.
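The example sentence is tricky because the final "Corp." is both an abbreviation and the end of the sentence. A minimal sketch of a tokeniser and sentence splitter follows; the abbreviation list and the "next word is capitalised" heuristic are assumptions made for illustration, not the method used in the talk.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real system needs far more.
ABBREVIATIONS = {"Corp.", "Inc.", "Ltd."}


def tokenize(text):
    """Split text into word and punctuation tokens, keeping known abbreviations whole."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            # Words (with internal hyphens or apostrophes) or single punctuation marks.
            tokens.extend(re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", chunk))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences with a crude boundary heuristic."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # An abbreviation ends a sentence only at the end of the text
        # or when the next token starts with a capital letter.
        ends_here = tok in {".", "!", "?"} or (
            tok in ABBREVIATIONS and (nxt is None or nxt[0].isupper())
        )
        if ends_here:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


text = ("The markets acknowledge the measures taken on the 24th of "
        "September by the CEO of XYZ Corp. Shares rose.")
sentences = split_sentences(tokenize(text))
```

The second sentence ("Shares rose.") is added here only to show the boundary decision; it is not part of the original example.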
Morphological Analysis
Morphological analysis is concerned with the inflectional,
derivational, and compounding processes in word formation in
order to determine properties such as stem and inflectional
information. Together with part-of-speech (PoS) information this
process delivers the morpho-syntactic properties of a word.
When processing the German word Häusern (houses), the analysis
should deliver the following morphological information:
[PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS]
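One simple way to obtain such an analysis is a full-form lexicon lookup. The sketch below assumes a hypothetical lexicon with a single entry copied from the example; real morphological analysers rely on rules or finite-state methods rather than a hand-written table.

```python
# Hypothetical full-form lexicon; the single entry is taken from the example above.
FULL_FORM_LEXICON = {
    "häusern": [
        {"PoS": "N", "NUM": "PL", "CASE": "DAT", "GEN": "NEUT", "STEM": "HAUS"},
    ],
}


def analyse(word):
    """Return all stored morphological readings of a word form (empty if unknown)."""
    return FULL_FORM_LEXICON.get(word.lower(), [])


def render(reading):
    """Format a reading in the bracketed attribute=value notation used on the slide."""
    return "[" + " ".join(f"{key}={value}" for key, value in reading.items()) + "]"


readings = analyse("Häusern")
```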
Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of determining the
correct syntactic class (a part-of-speech, e.g. noun, verb, etc.) for
a particular word given its current context. The word “works” in
the following sentences will be either a verb or a noun:
He works [N,V] the whole day for nothing.
His works [N,V] have all been sold abroad.
PoS tagging involves disambiguation between multiple part-of-speech
tags, as well as guessing the correct part-of-speech tag for
unknown words on the basis of context information.
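The disambiguation of "works" can be sketched with a toy rule-based tagger. The lexicon, the tag names and the two context rules below are assumptions made for illustration; practical taggers are usually trained on annotated corpora.

```python
# Hypothetical mini-lexicon: each word form maps to its possible tags.
LEXICON = {
    "he": {"PRON"},
    "his": {"DET"},
    "works": {"N", "V"},
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tags = []
    for tok in tokens:
        candidates = LEXICON.get(tok.lower(), {"N"})  # crude guess for unknown words
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        prev = tags[-1] if tags else None
        if prev == "PRON" and "V" in candidates:
            tags.append("V")      # a pronoun is typically followed by a verb
        elif prev in {"DET", "ADJ"} and "N" in candidates:
            tags.append("N")      # a determiner or adjective is followed by a noun
        else:
            tags.append(sorted(candidates)[0])  # arbitrary but deterministic fallback
    return tags
```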
Chunking
Chunks are sequences of words which are grouped on the basis
of linguistic properties, such as nominal, prepositional, adjectival
and adverbial phrases and verb groups.
[NP His works] [VG have] [NP all] [VG been sold] [AdvP abroad].
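The bracketing above can be reproduced by matching patterns over the PoS tag sequence. This is a sketch only: the tag set, the one-letter codes and the three chunk rules are hypothetical and cover just this example.

```python
import re

# One-letter codes so that chunk rules can be written as regular expressions.
TAG_CODES = {"DET": "D", "ADJ": "A", "N": "N", "PRON": "P",
             "AUX": "X", "V": "V", "ADV": "R"}

# Rules are tried in order at each position; every rule consumes at least one tag.
CHUNK_RULES = [
    ("NP", re.compile(r"D?A*N|P")),   # optional determiner, adjectives, noun; or a pronoun
    ("VG", re.compile(r"X*V|X+")),    # auxiliaries plus main verb; or auxiliaries alone
    ("AdvP", re.compile(r"R")),       # a single adverb
]


def chunk(tagged):
    """Group (word, tag) pairs into labelled chunks; unmatched words get label 'O'."""
    codes = "".join(TAG_CODES.get(tag, "?") for _, tag in tagged)
    chunks, i = [], 0
    while i < len(codes):
        for label, pattern in CHUNK_RULES:
            match = pattern.match(codes, i)
            if match:
                chunks.append((label, [word for word, _ in tagged[i:match.end()]]))
                i = match.end()
                break
        else:
            chunks.append(("O", [tagged[i][0]]))
            i += 1
    return chunks


tagged_sentence = [("His", "DET"), ("works", "N"), ("have", "AUX"), ("all", "PRON"),
                   ("been", "AUX"), ("sold", "V"), ("abroad", "ADV")]
chunks = chunk(tagged_sentence)
```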
Named Entity Detection
Related to chunking is the recognition of so-called named
entities (names of institutions and companies, date expressions,
etc.). The extraction of named entities is mostly based on a
strategy that combines look up in gazetteers (lists of companies,
cities, etc.) with the definition of regular expression patterns.
Named entity recognition can be included as part of the
linguistic chunking procedure. The following sentence
fragment:
“…the secretary-general of the United Nations, Kofi Annan,…”
will be annotated as a nominal phrase, including two named
entities: United Nations with named entity class: organization,
and Kofi Annan with named entity class: person.
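The combination of gazetteer lookup and regular expression patterns can be sketched as follows. The gazetteer contents and the date pattern are assumptions chosen to fit the examples in this talk.

```python
import re

# Hypothetical gazetteer: known names mapped to their named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches date expressions such as "24th of September".
DATE_PATTERN = re.compile(rf"\b\d{{1,2}}(?:st|nd|rd|th)? of (?:{MONTHS})\b")


def find_named_entities(text):
    """Return (surface string, class) pairs in order of appearance."""
    found = []
    for name, entity_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), entity_class))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(surface, cls) for _, surface, cls in sorted(found)]


entities = find_named_entities(
    "the secretary-general of the United Nations, Kofi Annan, "
    "spoke on the 24th of September"
)
```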
Dependency Structure Analysis
A dependency structure consists of two or more linguistic units
linked by immediate dominance in a syntax tree. The
detection of such structures is generally not provided by
chunking but builds on top of it.
There are two main types of dependencies that are relevant for
our purposes: on the one hand, the internal dependency
structure of phrasal units or chunks, and on the other hand the
so-called grammatical functions (like subject and direct object).
Internal Dependency Structure
To describe the internal structure of a phrase, we use the terms head,
complements and modifiers, where the head is the dominating
node in the syntax tree of a phrase (chunk), complements are
necessary qualifiers thereof, and modifiers are optional
qualifiers. Consider the following example:
“The shot by Christian Ziege goes over the goal.”
The prepositional phrase “by Christian Ziege” (containing the named
entity Christian Ziege) depends on (and modifies) the head noun “shot”.
Grammatical Functions
Grammatical functions determine the role (function) of each of the linguistic chunks in
the sentence and make it possible to identify the actors involved in certain
events. For example, in the following sentence, the syntactic
(and also the semantic) subject is the NP constituent “The shot
by Christian Ziege”:
“The shot by Christian Ziege goes over the goal.”
This nominal phrase depends on (and complements) the verb
“goes”, whereas the noun “shot” is the head of the NP (it is
the shot that goes over the goal, not Christian Ziege!).
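The analysis of this sentence can be written down as a small data structure from which the head of the subject and the actor are read off. The sketch below encodes the analysis by hand; the class Chunk, its fields and the "by"-phrase rule are hypothetical, and no parser is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    label: str                    # phrase type, e.g. "NP" or "PP"
    text: str                     # surface string of the chunk
    head: str                     # head word of the chunk
    function: str = ""            # grammatical function relative to the verb
    modifiers: list = field(default_factory=list)


# Hand-built analysis of "The shot by Christian Ziege goes over the goal."
subject = Chunk(
    label="NP", text="The shot by Christian Ziege", head="shot", function="SUBJ",
    modifiers=[Chunk(label="PP", text="by Christian Ziege", head="by")],
)
sentence = [
    subject,
    Chunk(label="VG", text="goes", head="goes", function="PRED"),
    Chunk(label="PP", text="over the goal", head="over", function="ADJUNCT"),
]


def subject_head(chunks):
    """Return the head word of the chunk that functions as subject."""
    for chunk in chunks:
        if chunk.function == "SUBJ":
            return chunk.head
    return None


def agent_of(chunk):
    """Return the content of a modifying 'by' phrase, taken here as the actor."""
    for modifier in chunk.modifiers:
        if modifier.label == "PP" and modifier.text.lower().startswith("by "):
            return modifier.text[3:]
    return None
```

Keeping the head ("shot") separate from the modifier ("by Christian Ziege") is what prevents the extraction from concluding that the player went over the goal.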
Semantic Tagging
Automatic semantic annotation has developed within language
technology in recent years in connection with more integrated
tasks like information extraction, which require a certain level of
semantic analysis. Semantic tagging consists in the annotation of
each content word in a document with a semantic category.
Semantic categories are assigned on the basis of a semantic
resource like WordNet for English or EuroWordNet, which
links words between many European languages through a
common inter-lingua of concepts.
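In its simplest form, semantic tagging is a lookup of each content word in a semantic resource. The sketch below uses a hypothetical three-entry lexicon with invented category names; it stands in for a resource such as WordNet and ignores word sense ambiguity.

```python
# Only words with these PoS tags count as content words in this sketch.
CONTENT_TAGS = {"N", "V", "ADJ", "ADV"}

# Hypothetical semantic lexicon; the category names are invented for illustration.
SEMANTIC_LEXICON = {
    "shot": "attempt",
    "goes": "motion",
    "goal": "target",
}


def semantic_tag(tagged):
    """Attach a semantic category (or None) to every (word, PoS) pair."""
    result = []
    for word, pos in tagged:
        category = SEMANTIC_LEXICON.get(word.lower()) if pos in CONTENT_TAGS else None
        result.append((word, pos, category))
    return result


annotated = semantic_tag([("The", "DET"), ("shot", "N"), ("goes", "V"),
                          ("over", "PREP"), ("the", "DET"), ("goal", "N")])
```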
Semantic Resources
Semantic resources are captured in dictionaries, thesauri, and semantic
networks, all of which express, either implicitly or explicitly, an ontology of
the world in general or of more specific domains, such as medicine.
They can be roughly distinguished into the following three groups:
Thesauri: Semantic resources that group together similar words or terms
according to a standard set of relations, including broader term, narrower
term, sibling, etc. (like Roget)
Semantic Lexicons: Semantic resources that group together words (or
more complex lexical items) according to lexical semantic relations like
synonymy, hyponymy, meronymy, and antonymy (like WordNet)
Semantic Networks: Semantic resources that group together objects
denoted by natural language expressions (terms) according to a set of
relations that originate in the nature of the domain of application (like UMLS
in the medical domain)
The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus for indexing articles
and books in the medical domain, which may then be used for searching
MeSH-indexed databases. MeSH provides for each term a number of
term variants that refer to the same concept. It currently includes a
vocabulary of over 250,000 terms. The following is a sample entry for
the term gene library (MH is the term itself, ENTRY are term variants):
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
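Because every term variant refers to the same concept, a record like this can be turned into a lookup table from variants to the main heading. The sketch assumes the record is laid out as one "FIELD = value" pair per line; the function names are hypothetical.

```python
RECORD_TEXT = """\
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
"""


def parse_record(text):
    """Read the main heading (MH) and the term variants (ENTRY) from a record."""
    record = {"MH": None, "ENTRY": []}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "MH":
            record["MH"] = value
        elif key == "ENTRY":
            record["ENTRY"].append(value)
    return record


def build_index(record):
    """Map the heading and every variant (lower-cased) to the main heading."""
    index = {record["MH"].lower(): record["MH"]}
    for variant in record["ENTRY"]:
        index[variant.lower()] = record["MH"]
    return index


record = parse_record(RECORD_TEXT)
index = build_index(record)
```

With such an index, a mention of "gene bank" in a text can be normalised to the concept Gene Library.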
The WordNet Semantic Lexicon
WordNet has primarily been designed as a computational
account of the human capacity for linguistic categorization
and covers an extensive set of semantic classes (called
synsets). Synsets are collections of synonyms, grouping
together lexical items according to meaning similarity.
Synsets are actually not made up of lexical items, but rather
of lexical meanings (i.e. senses)
WordNet: An Example
The word 'tree' has two meanings that roughly correspond to the classes
of plants and that of diagrams, each with their own hierarchy of classes
that are included in more general super-classes:
09396070 tree 0
09395329 woody_plant 0 ligneous_plant 0
09378438 vascular_plant 0 tracheophyte 0
00008864 plant 0 flora 0 plant_life 0
00002086 life_form 0 organism 0 being 0 living_thing 0
00001740 entity 0 something 0
10025462 tree 0 tree_diagram 0
09987563 plane_figure 0 two-dimensional_figure 0
09987377 figure 0
00015185 shape 0 form 0
00018604 attribute 0
00013018 abstraction 0
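The two hierarchies can be stored as hypernym links between synset identifiers and walked upwards. The sketch assumes that each line in the listing is the super-class of the line above it; the identifiers and words are copied from the listing, while the dictionary and function names are hypothetical.

```python
# Synset identifier -> words in the synset (copied from the listing above).
SYNSETS = {
    "09396070": ["tree"],
    "09395329": ["woody_plant", "ligneous_plant"],
    "09378438": ["vascular_plant", "tracheophyte"],
    "00008864": ["plant", "flora", "plant_life"],
    "00002086": ["life_form", "organism", "being", "living_thing"],
    "00001740": ["entity", "something"],
    "10025462": ["tree", "tree_diagram"],
    "09987563": ["plane_figure", "two-dimensional_figure"],
    "09987377": ["figure"],
    "00015185": ["shape", "form"],
    "00018604": ["attribute"],
    "00013018": ["abstraction"],
}

# Synset identifier -> identifier of its immediate super-class (hypernym).
HYPERNYM = {
    "09396070": "09395329", "09395329": "09378438", "09378438": "00008864",
    "00008864": "00002086", "00002086": "00001740",
    "10025462": "09987563", "09987563": "09987377", "09987377": "00015185",
    "00015185": "00018604", "00018604": "00013018",
}

# Word -> identifiers of the synsets (senses) it belongs to.
SENSES = {"tree": ["09396070", "10025462"]}


def hypernym_chain(synset_id):
    """Follow hypernym links from a synset up to the top of its hierarchy."""
    chain = [synset_id]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


top_classes = [SYNSETS[hypernym_chain(sense)[-1]][0] for sense in SENSES["tree"]]
```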
What is the Semantic Web?
• “The Semantic Web is a new initiative to
transform the web into a structure that supports
more intelligent querying and browsing, both by
machines and by humans. This transformation is
to be supported through the generation and use of
metadata constructed via web annotation tools
using user-defined ontologies that can be related to
one another.”
Somewhere on the web
[Figure: Semantic Web architecture diagram with the components End User, Community Portal, Agents, Inference Engine, Ontologies, Ontology Construction Tool, Ontology Articulation Toolkit, Web-Page Annotation Tool, Annotated Web Pages and Metadata Repository.]
Based on www.semanticweb.org
Extracting Events from
Structured Documents
• Detecting Metadata in our Example:
– Type of game: N/A
– Teams involved: England - Deutschland
– Players: Deutschland: Kahn (2) - Matthaeus (3) - Babbel (3,5),
– Final (and intermediate) score: 1:0 (0:0)
– Referee: Schiedsrichter: Collina, Pierluigi (Viareggio)
– Date: N/A
– Etc…
Extracting Events from
Structured Documents (2)
• Detecting Events in our Example:
– Substitution: Eingewechselt: 61. Gerrard fuer
Owen,
– Goal: Tore: 1:0 Shearer (53., Kopfball,
Vorarbeit Beckham)
– Cards: Gelbe Karten: Beckham - Babbel,
Jeremies
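For structured text of this kind, extraction comes down to pattern matching on the labelled lines. The sketch below handles only the substitution and goal lines shown above; the regular expressions, the field names and the function extract_events are assumptions, not the patterns used in the actual system.

```python
import re

# "61. Gerrard fuer Owen": minute, player coming on, player going off.
SUBSTITUTION = re.compile(
    r"(?P<minute>\d+)\.\s+(?P<player_in>\w+)\s+fuer\s+(?P<player_out>\w+)"
)
# "1:0 Shearer (53., Kopfball, Vorarbeit Beckham)": score, scorer, minute, kind, assist.
GOAL = re.compile(
    r"(?P<score>\d+:\d+)\s+(?P<scorer>\w+)\s+"
    r"\((?P<minute>\d+)\.,\s*(?P<kind>\w+),\s*Vorarbeit\s+(?P<assist>\w+)\)"
)

# German line label -> (event type, pattern); hypothetical mapping for this sketch.
LINE_PATTERNS = {
    "Eingewechselt": ("substitution", SUBSTITUTION),
    "Tore": ("goal", GOAL),
}


def extract_events(line):
    """Return a list of event dictionaries found in one labelled report line."""
    label, _, body = line.partition(":")
    if label.strip() not in LINE_PATTERNS:
        return []
    event_type, pattern = LINE_PATTERNS[label.strip()]
    return [{"type": event_type, **match.groupdict()} for match in pattern.finditer(body)]


substitutions = extract_events("Eingewechselt: 61. Gerrard fuer Owen,")
goals = extract_events("Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)")
```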
Results in XML
• Events (and entities and relations) are extracted
automatically from structured text, on the basis of
patterns (DTD) of typical expressions and the
soccer ontology. Example and Example_2
• Since the various results are available as XML files,
they can be merged automatically, guided
by the ontology. Example. This supports
incremental and dynamic extraction.
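Merging two XML result files can be sketched with the standard library. The element and attribute names (match, event, type, minute) are hypothetical, and the rule that two events are identical when type and minute agree is a simple stand-in for the guidance the ontology would provide.

```python
import xml.etree.ElementTree as ET


def to_xml(events):
    """Serialise event dictionaries as <event> elements under a <match> root."""
    root = ET.Element("match")
    for event in events:
        node = ET.SubElement(root, "event", type=event["type"], minute=event["minute"])
        for key, value in event.items():
            if key not in ("type", "minute"):
                ET.SubElement(node, key).text = value
    return root


def merge(root_a, root_b):
    """Merge two result trees, combining the details of events with the same key."""
    merged = ET.Element("match")
    seen = {}
    for root in (root_a, root_b):
        for event in root.findall("event"):
            key = (event.get("type"), event.get("minute"))
            if key not in seen:
                seen[key] = ET.SubElement(merged, "event", dict(event.attrib))
            target = seen[key]
            for child in event:
                # Add only details that the merged event does not have yet.
                if target.find(child.tag) is None:
                    ET.SubElement(target, child.tag).text = child.text
    return merged


first = to_xml([{"type": "goal", "minute": "53", "scorer": "Shearer"}])
second = to_xml([
    {"type": "goal", "minute": "53", "assist": "Beckham"},
    {"type": "substitution", "minute": "61", "player_in": "Gerrard"},
])
combined = merge(first, second)
```

Running the merge again whenever a new result file arrives gives the incremental behaviour described above.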
Extracting Events from Semi-Structured Documents
• Linguistic processing is needed to provide
a basic structure for the document, which
then allows domain-specific annotation.
Example.
Extracting Events from Semi-Structured Documents (2)
• The results of the semantic annotation of
the structured documents are reused as well,
supporting incremental extraction:
Example.
Current Developments
• Extracting information from multilingual balance
sheets (WINS eTen project), extending this to
unstructured text and extracting relations and
events from annexes to balance sheets (upcoming
Project MUSING).
• Detecting positive/negative mentions of entities
in news documents (project Direct-Info on Media
Monitoring). Example.
Further Challenge for HLT
• Use HLT not only for the semantic
annotation of web pages (or other
documents), but also for supporting
ontology extraction/learning from the web
(or other documents)
Example of semantic relation
extraction in bio-medicine
• [Rheumatoid arthritis] [is characterized] [by progressive synovial inflammation
and joint destruction] [.]
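Once the sentence is chunked, a semantic relation can be read off with a pattern over the chunk sequence. The sketch below handles only the "X is characterized by Y" construction; the relation name characterized_by and the function name are hypothetical.

```python
import re

# A chunk is any bracketed stretch of text.
CHUNK = re.compile(r"\[([^\]]+)\]")


def extract_relation(chunked_sentence):
    """Return (subject, relation, object) for 'X is characterized by Y', else None."""
    chunks = CHUNK.findall(chunked_sentence)
    for i in range(len(chunks) - 2):
        if chunks[i + 1] == "is characterized" and chunks[i + 2].startswith("by "):
            return (chunks[i], "characterized_by", chunks[i + 2][len("by "):])
    return None


relation = extract_relation(
    "[Rheumatoid arthritis] [is characterized] "
    "[by progressive synovial inflammation and joint destruction] [.]"
)
```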
Open issues for HLT and SW
• Achieving better coordination to
improve semantic annotation results
• Development and use of standards for
interrelated linguistic and semantic
annotation (see eContent Project LIRICS
for standards for language resources)
Interoperable Standards?
Thank you!