Human Language Technology
for the Semantic Web
http://gate.ac.uk/
http://nlp.shef.ac.uk/
Hamish Cunningham
Kalina Bontcheva
Diana Maynard
Valentin Tablan
ESWS, Crete, May 2004
[This work has been supported by
AKT (http://aktors.org/) and
SEKT (http://sekt.semanticweb.org/)]
Are you wasting your time?
2(130)
Structure of the Tutorial
1. Motivation, background
2. Information Extraction - definition
3. Evaluation – corpora & metrics
4. IE approaches – some examples
– Rule-based approaches
– Learning-based approaches
5. Semantic Tagging
– Using “traditional” IE
– Ontology-based IE
– Platforms for large-scale processing
6. Language Generation
[Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt]
3(130)
The Knowledge Economy and
Human Language
Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and
indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer
information input will involve textual language
A contradiction: formal knowledge in semantics-based
systems vs. ambiguous informal natural language
The challenge: to reconcile these two opposing
tendencies
4(130)
HLT & Knowledge: Closing the Language Loop
[Figure: diagram linking Human Language, Controlled Language and Formal Knowledge (ontologies and instance bases) via (M)NLG, OBIE, (A)IE and CLIE; labelled "Semantic Web; Semantic Grid; Semantic Web Services"]
KEY
MNLG: Multilingual Natural Language Generation
OBIE: Ontology-Based Information Extraction
AIE: Adaptive & Mixed-Initiative IE
CLIE: Controlled Language IE
5(130)
Background and Examples (1)
• Like other areas of computer science, HLT has typical
data structures and infrastructure requirements
• Annotation: associating arbitrary data with areas of
text or speech
• De facto standard: Stand-off Markup (e.g. TEI/XCES,
NITE, ATLAS, GATE)
• Other issues: visualisation and editing; persistence
and search; metrics; component model; baseline NLP
tools; ...
• To cut a long story short: HLT has a lot of T
underneath it which comes in many shapes and sizes
6(130)
Background and Examples (2)
Infrastructure & (many) examples in this tutorial:
• GATE, a General Architecture for Text
Engineering: architecture, framework & IDE
Why?
• I happen to know a little about it
• Free software, relatively comprehensive, widely
used, has extensive Semantic Web support
• It means we can ignore the infrastructural
issues
Not a claim that it is the best or only in all cases!
7(130)
Information Extraction (1)
• Information Extraction (IE) pulls facts and
structured information from the content of
large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE
(if you can’t score, why not move the
goalposts?)
9(130)
Information Extraction (2)
• When you can measure what you are speaking about,
and express it in numbers, you know something about
it; but when you cannot measure it, when you cannot
express it in numbers, your knowledge is of a meager
and unsatisfactory kind: it may be the beginning of
knowledge, but you have scarcely in your thoughts
advanced to the stage of science. (Kelvin)
• Not everything that counts can be counted, and not
everything that can be counted counts. (Einstein)
• IE progress driven by quantitative measures
• MUC: Message Understanding Conferences
• ACE: Automatic Content Extraction
10(130)
MUC-7 tasks
Held in 1997, around 15 participants inc. 2
UK. Broke IE down into component tasks:
• NE: Named Entity recognition and typing
• CO: co-reference resolution
• TE: Template Elements
• TR: Template Relations
• ST: Scenario Templates
11(130)
An Example
“The shiny red rocket was fired on Tuesday. It is the
brainchild of Dr. Big Head. Dr. Head is a staff scientist at
We Build Rockets Inc.”
• NE: "rocket", "Tuesday", "Dr. Head", "We Build
Rockets"
• CO: "it" = rocket; "Dr. Head" = "Dr. Big Head"
• TE: the rocket is "shiny red" and Head's
"brainchild".
• TR: Dr. Head works for We Build Rockets Inc.
• ST: rocket launch event with various participants
12(130)
Performance levels
• Vary according to text type, domain,
scenario, language
• NE: up to 97% (tested in English, Spanish,
Japanese, Chinese, etc. etc.)
• CO: 60-70% resolution
• TE: 80%
• TR: 75-80%
• ST: 60% (but: human level may be only
80%)
13(130)
What are Named Entities?
• NE involves identification of proper names
in texts, and classification into a set of
predefined categories of interest
• Person names
• Organizations (companies, government
organisations, committees, etc)
• Locations (cities, countries, rivers, etc)
• Date and time expressions
14(130)
What are Named Entities (2)
• Other common types: measures (percent,
money, weight etc), email addresses, Web
addresses, street addresses, etc.
• Some domain-specific entities: names of drugs,
medical conditions, names of ships,
bibliographic references etc.
• MUC-7 entity definition guidelines [Chinchor’97]
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
15(130)
What are NOT NEs (MUC-7)
• Artefacts – Wall Street Journal
• Common nouns, referring to named entities –
the company, the committee
• Names of groups of people and things named
after people – the Tories, the Nobel prize
• Adjectives derived from names – Bulgarian,
Chinese
• Numbers which are not times, dates,
percentages, and money amounts
16(130)
Basic Problems in NE
• Variation of NEs – e.g. John Smith, Mr
Smith, John.
• Ambiguity of NE types: John Smith
(company vs. person)
– May (person vs. month)
– Washington (person vs. location)
– 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"
17(130)
More complex problems in NE
• Issues of style, structure, domain, genre
etc.
• Punctuation, spelling, spacing, formatting,
... all have an impact:
Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom
• Tell me more about Leonardo
• Da Vinci
18(130)
Corpora and System Development
• “Gold standard” data created by manual annotation
• Corpora are typically divided into a training and a testing portion
• Rules and/or learning algorithms are developed or trained on
the training part
• Tuned on the testing portion in order to optimise
– Rule priorities, rules effectiveness, etc.
– Parameters of the learning algorithm and the features used
(typical routine: 10-fold cross validation)
• Evaluation set – the best system configuration is run on this
data and the system performance is obtained
• No further tuning once evaluation set is used!
20(130)
Some NE Annotated Corpora
• MUC-6 and MUC-7 corpora - English
• CONLL shared task corpora
http://cnts.uia.ac.be/conll2003/ner/ NEs in English and German
http://cnts.uia.ac.be/conll2002/ner/ NEs in Spanish and Dutch
• TIDES surprise language exercise (NEs in
Cebuano and Hindi)
• ACE – English http://www.ldc.upenn.edu/Projects/ACE/
21(130)
The MUC-7 corpus
• 100 documents in SGML
• News domain
Named Entities:
• 1880 Organizations (46%)
• 1324 Locations (32%)
• 887 Persons (22%)
• Inter-annotator agreement very high (~97%)
• http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
22(130)
The MUC-7 Corpus (2)
<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>,
<ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in
chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX>
<TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX
TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied
the space shuttle Endeavour for launch on a Japanese satellite
retrieval mission.
<p>
Endeavour, with an international crew of six, was set to blast off from
the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy
Space Center</ENAMEX> on <TIMEX
TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18
a.m. EST</TIMEX>, the start of a 49-minute launching period. The
<TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be
the 12th launched in darkness.
23(130)
ACE – Towards Semantic Tagging
of Entities
• MUC NE tags segments of text whenever that
text represents the name of an entity
• In ACE (Automatic Content Extraction), these
names are viewed as mentions of the underlying
entities. The main task is to detect (or infer) the
mentions in the text of the entities themselves
• Rolls together the NE and CO tasks
• Domain- and genre-independent approaches
• ACE corpus contains newswire, broadcast news
(ASR output and cleaned), and newspaper
reports (OCR output and cleaned)
24(130)
ACE Entities
• Dealing with
– Proper names – e.g., England, Mr. Smith, IBM
– Pronouns – e.g., he, she, it
– Nominal mentions – the company, the spokesman
• Identify which mentions in the text refer to which
entities, e.g.,
– Tony Blair, Mr. Blair, he, the prime minister, he
– Gordon Brown, he, Mr. Brown, the chancellor
25(130)
ACE Example
<entity ID="ft-airlines-27-jul-2001-2"
GENERIC="FALSE"
entity_type = "ORGANIZATION">
<entity_mention ID="M003"
TYPE = "NAME"
string = "National Air Traffic Services">
</entity_mention>
<entity_mention ID="M004"
TYPE = "NAME"
string = "NATS">
</entity_mention>
<entity_mention ID="M005"
TYPE = "PRO"
string = "its">
</entity_mention>
<entity_mention ID="M006"
TYPE = "NAME"
string = "Nats">
</entity_mention>
</entity>
26(130)
Annotation Tools (1): GATE
27(130)
Annotation Tools (2): Alembic
28(130)
Performance Evaluation
• Evaluation metric – mathematically defines
how to measure the system’s performance
against human-annotated gold standard
• Scoring program – implements the metric
and provides performance measures
– For each document and over the entire
corpus
– For each type of NE
29(130)
Evaluation Metrics
• Most common are “Precision” and “Recall”
• Precision = correct answers/answers produced
• Recall = correct answers/total possible correct
answers
• Trade-off between precision and recall
• F-Measure = (β² + 1)PR / (β²R + P)
[van Rijsbergen 75]
• β reflects the weighting between precision and recall,
typically β=1
• Some tasks sometimes use other metrics, e.g.:
– false positives (not sensitive to doc richness)
– cost-based (good for application-specific adjustment)
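A minimal sketch of the three metrics above, computed from raw counts. The function and parameter names are illustrative assumptions, not taken from any MUC or GATE scorer. The F-measure follows the formula as written on this slide; with β=1 it reduces to the harmonic mean of P and R.

```python
def precision(correct: int, produced: int) -> float:
    """Correct answers divided by answers produced."""
    return correct / produced if produced else 0.0


def recall(correct: int, possible: int) -> float:
    """Correct answers divided by total possible correct answers."""
    return correct / possible if possible else 0.0


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """F = (beta^2 + 1) * P * R / (beta^2 * R + P), as written on the slide."""
    denom = beta ** 2 * r + p
    return (beta ** 2 + 1) * p * r / denom if denom else 0.0


# Hypothetical counts: 80 correct out of 100 produced, 120 possible.
p = precision(80, 100)
r = recall(80, 120)
f1 = f_measure(p, r)  # harmonic mean of p and r when beta == 1
```

The zero-denominator guards are a design choice for the sketch, so that an empty system output scores 0 rather than raising an error.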
30(130)
The Evaluation Metric (2)
• We may also want to take account of partially
correct answers:
• Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
• Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
• Why: NE boundaries are often misplaced, so
some results are only partially correct
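The partial-credit variant can be sketched in the same way. This is an illustration under assumed names (`lenient_scores` and its parameters are hypothetical); the counts in the usage line are made up.

```python
def lenient_scores(correct, partial, incorrect, missing):
    """Precision and recall with half credit for partially correct answers."""
    credit = correct + 0.5 * partial
    denom_p = correct + incorrect + partial  # everything the system produced
    denom_r = correct + missing + partial    # everything in the gold standard
    p = credit / denom_p if denom_p else 0.0
    r = credit / denom_r if denom_r else 0.0
    return p, r


# Hypothetical counts: 6 correct, 2 partial, 2 incorrect, 2 missing.
p, r = lenient_scores(correct=6, partial=2, incorrect=2, missing=2)
```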
31(130)
The GATE Evaluation Tool
32(130)
Corpus-level Regression Testing
• Need to track system’s performance over
time
• When a change is made we want to know
implications over whole corpus
• Why: because an improvement in one
case can lead to problems in others
• GATE offers automated tool to help with
the NE development task over time
33(130)
Regression Testing (2)
At corpus level – GATE’s corpus benchmark tool –
tracking system’s performance over time
34(130)
SW IE Evaluation tasks
• Detection of entities and events, given a target
ontology of the domain.
• Disambiguation of the entities and events from the
documents with respect to instances in the given
ontology. For example, measuring whether the IE
correctly disambiguated “Cambridge” in the text to
the correct instance: Cambridge, UK vs Cambridge,
MA.
• Decision when a new instance needs to be added to
the ontology, because the text contains a new
instance, that does not already exist in the ontology.
35(130)
Challenge:
Evaluating Richer NE Tagging
• Need for new metrics when evaluating hierarchy/ontology-based NE tagging
• Need to take into account distance in the hierarchy
• Tagging a company as a charity is less wrong than tagging it as a person
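One way to make "distance in the hierarchy" concrete is to count the edges between the predicted class and the gold class. The sketch below assumes a toy hierarchy (`PARENT` is invented for illustration) and is not a metric proposed in the tutorial.

```python
# Toy class hierarchy (an assumption): child -> parent.
PARENT = {
    "Organization": "Entity",
    "Person": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(cls):
    """Return [cls, parent, grandparent, ...] up to the root."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def tree_distance(a, b):
    """Number of edges on the path between two classes."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("classes are not in the same hierarchy")
```

With this hierarchy, `tree_distance("Company", "Charity")` is smaller than `tree_distance("Company", "Person")`, matching the intuition on the slide.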
36(130)
Two kinds of IE approaches
Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• requires only small amount of training data
• development could be very time consuming
• some changes may be hard to accommodate
Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• requires large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
• annotators are cheap (but you get what you pay for!)
38(130)
A) NE Baseline: list lookup approach
• System that recognises only entities
stored in its lists (gazetteers).
• Advantages - Simple, fast, language
independent, easy to retarget (just create
lists)
• Disadvantages – impossible to enumerate
all names, collection and maintenance of
lists, cannot deal with name variants,
cannot resolve ambiguity
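A list-lookup baseline can be sketched in a few lines. The gazetteer contents and the function name below are assumptions chosen for illustration; a real system would use much larger lists and token-aware matching.

```python
# Tiny illustrative gazetteer (an assumption): NE type -> known names.
GAZETTEER = {
    "location": {"London", "Sherwood Forest"},
    "organization": {"We Build Rockets Inc."},
}


def lookup(text):
    """Return sorted (start, end, type, string) for every list entry found."""
    hits = []
    for ne_type, names in GAZETTEER.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                hits.append((start, start + len(name), ne_type, name))
                start = text.find(name, start + 1)
    return sorted(hits)


hits = lookup("Dr. Head works for We Build Rockets Inc. in London.")
```

Anything not in the lists, such as the person name in the usage line, is simply missed, which is the main disadvantage noted above.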
39(130)
B) Shallow parsing approach
using internal structure
• Internal evidence – names often have internal structure.
These components can be either stored or guessed, e.g.
location:
• Cap. Word + {City, Forest, Center, River}
• e.g. Sherwood Forest
• Cap. Word + {Street, Boulevard, Avenue, Crescent,
Road}
• e.g. Portobello Street
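The "Cap. Word + key word" idea maps naturally onto a regular expression. The sketch below is a simplification under assumed names (`LOCATION_KEYS`, `find_locations`); it merges both key-word sets from the slide into one pattern and handles only a single capitalised word before the key.

```python
import re

# Key words taken from the two sets on the slide.
LOCATION_KEYS = ["City", "Forest", "Center", "River",
                 "Street", "Boulevard", "Avenue", "Crescent", "Road"]

# One capitalised word followed by one of the key words.
PATTERN = re.compile(r"\b[A-Z][a-z]+ (?:" + "|".join(LOCATION_KEYS) + r")\b")


def find_locations(text):
    """Return all substrings matching 'Cap. Word + location key word'."""
    return PATTERN.findall(text)
```

For example, `find_locations("We walked from Sherwood Forest to Portobello Street.")` picks out the two location names.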
40(130)
Problems ...
• Ambiguously capitalised words (first word in
sentence)
[All American Bank] vs. All [State Police]
• Semantic ambiguity
"John F. Kennedy" = airport (location)
"Philip Morris" = organisation
• Structural ambiguity
[Cable and Wireless] vs. [Microsoft] and [Dell];
[Center for Computational Linguistics] vs.
message from [City Hospital] for [John Smith]
41(130)
C) Shallow parsing with context
• Use of context-based patterns is helpful in
ambiguous cases
• "David Walton" and "Goldman Sachs" are
indistinguishable
• But with the phrase "David Walton of Goldman
Sachs" and the Person entity "David Walton"
recognised, we can use the pattern "[Person] of
[Organization]" to identify "Goldman Sachs"
correctly.
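The "[Person] of [Organization]" pattern can be sketched as follows, assuming the person has already been recognised by an earlier step. The helper name and the definition of a capitalised sequence are illustrative assumptions.

```python
import re

# A run of capitalised words, e.g. "Goldman Sachs" (a simplifying assumption).
CAP_SEQ = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def organizations_from_context(text, known_persons):
    """Label the capitalised sequence after '<known person> of' as an organization."""
    found = []
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_SEQ + r")")
        found.extend(m.group(1) for m in pattern.finditer(text))
    return found


orgs = organizations_from_context(
    "David Walton of Goldman Sachs said rates may rise.",
    ["David Walton"],
)
```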
42(130)
Examples of context patterns
• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]
43(130)
Example Rule-based System - ANNIE
• ANNIE – A Nearly-New IE system
• A version distributed as part of GATE
• GATE automatically deals with document
formats, saving of results, evaluation, and
visualisation of results for debugging
• GATE has a finite-state pattern-action rule
language, used by ANNIE
• A reusable and easily extendable set of
components
44(130)
NE Components
45(130)
Gazetteer lists for rule-based NE
• Needed to store the indicator strings for
the internal structure and context rules:
• Internal location indicators – e.g., {river,
mountain, forest} for natural locations;
{street, road, crescent, place, square,
…} for address locations
• Internal organisation indicators – e.g.,
company designators {GmbH, Ltd, Inc, …}
• Produces Lookup results of the given kind
46(130)
The Named Entity Transducers
• Phases run sequentially and constitute a
cascade of FSTs over the pre-processing results
• Hand-coded rules applied to annotations to
identify NEs
• Annotations from format analysis, tokeniser,
sentence splitter, POS tagger, and gazetteer
modules
• Use contextual information
• Finds person names, locations, organisations,
dates, addresses.
47(130)
NE Rule in JAPE
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+  // from tokeniser
  {Lookup.kind == companyDesignator}        // from gazetteer lists
):match
-->
:match.NamedEntity =
  { kind = company, rule = "Company1" }
48(130)
49(130)
Named Entities in GATE
Using co-reference to classify
ambiguous NEs
• Orthographic co-reference module matches
proper names in a document
• Improves NE results by assigning entity type to
previously unclassified names, based on
relations with classified NEs
• May not reclassify already classified entities
• Classification of unknown entities very useful for
surnames which match a full name, or
abbreviations,
e.g. [Bonfield] will match [Sir Peter Bonfield];
[International Business Machines Ltd.] will
match [IBM]
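The two matching cases in the example (a surname against a full name, an abbreviation against an organisation name) can be sketched as below. This is not GATE's orthomatcher; the rules, the `DESIGNATORS` set and the function names are simplifying assumptions.

```python
# Company designators ignored when building initials (an assumption).
DESIGNATORS = {"Ltd", "Inc", "GmbH"}


def words_of(name):
    """Split a name into words, dropping leading and trailing full stops."""
    return [w.strip(".") for w in name.split()]


def matches(unknown, full_name):
    """True if `unknown` is the last word or the initials of `full_name`."""
    words = words_of(full_name)
    if words and unknown == words[-1]:       # surname vs. full name
        return True
    core = [w for w in words if w and w not in DESIGNATORS]
    initials = "".join(w[0] for w in core)
    return unknown == initials               # abbreviation vs. full name


def classify_unknown(unknown, classified):
    """Return the NE type of the first classified name that matches, else None."""
    for full_name, ne_type in classified.items():
        if matches(unknown, full_name):
            return ne_type
    return None


classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
```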
50(130)
Named Entity Coreference
51(130)
Machine Learning Approaches
• Approaches:
– Train ML models on manually annotated text
– Mixed initiative learning
• Used for producing training data
• Used for producing working systems
• ML Methods
– Symbolic learning: rules/decision trees
induction
– Statistical models: HMMs, Bayesian methods,
Maximum Entropy
53(130)
ML Terminology
• Instances (tokens, entities)
Occurrences of a phenomenon
• Attributes (features)
Characteristics of the instances
• Classes
Sets of similar instances
54(130)
Methodology
• The task can be broken into several
subtasks (that can use different methods):
– Boundary detection
– Entity classification into NE types
– Different models for different entity types
• Several models can be used in
competition.
– Some algorithms perform better on little data while
others are better when more training is
available
55(130)
Methodology (2)
Boundaries (and entity types) notations
– S(-XXX), E(-XXX)
<S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> heads for
<S-LOC/>Baghdad<E-LOC/>.
– IOB notation (Inside, Outside, Beginning_of)
U.N. I-ORG
official O
Ekeus I-PER
heads O
for O
Baghdad I-LOC
. O
– Translations between the two conventions are
straightforward
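A sketch of the translation, working on token indices rather than on the inline markup. It assumes the IOB variant suggested by the example above, where I- also marks the first token of an entity and B- is used only to separate two adjacent entities of the same type; the function names and the span format are illustrative.

```python
def spans_to_iob(n_tokens, spans):
    """spans: non-overlapping (start, end, type) token spans, end exclusive."""
    tags = ["O"] * n_tokens
    for start, end, ne_type in sorted(spans):
        for i in range(start, end):
            tags[i] = "I-" + ne_type
        # B- is only needed directly after another entity of the same type.
        if start > 0 and tags[start - 1] in ("I-" + ne_type, "B-" + ne_type):
            tags[start] = "B-" + ne_type
    return tags


def iob_to_spans(tags):
    """Inverse of spans_to_iob for well-formed tag sequences."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a final entity
        ne_type = tag[2:] if tag != "O" else None
        continues = (start is not None and tag.startswith("I-")
                     and ne_type == current)
        if start is not None and not continues:
            spans.append((start, i, current))
            start = None
        if tag != "O" and start is None:
            start, current = i, ne_type
    return spans


tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
spans = [(0, 1, "ORG"), (2, 3, "PER"), (5, 6, "LOC")]
tags = spans_to_iob(len(tokens), spans)
```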
56(130)
Features
• Document structure
– Original markup
– Paragraph/sentence structure
• Surface features
– Token length
– Capitalisation
– Token type (word, punctuation, symbol)
• Linguistic features
– POS
– Morphology
– Syntax
– Lexicon data
• Semantic features
– Ontological class
• Etc.
– …
• Feature selection – the most difficult part
• Some automatic scoring methods can be used
Mixed Initiative Learning
• Human – computer interaction
• Speeds up the creation of training data
• Can be used for corpus/system creation
• Example implementations:
– Alembic [Day et al’97] and later
– Amilcare [Ciravegna’03]
58(130)
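Mixed Initiative Learning (2)
[Figure: annotation loop with the stages "User annotates", "System learns", "System annotates", "User corrects", "System learns", "System annotates"; transitions are labelled with the thresholds P>t1 and P>t2]
59(130)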
Example 1: Alembic, Day et al 1997
• Mixed initiative approach implemented in
Alembic Workbench
• Bootstrapping procedure – use already tagged
data to pre-annotate new documents
• Transforms the process from tagging to review
• Finally, the trained system can be used on its
own
60(130)
Mixed-Initiative text annotation
• User can also edit the induced rules and write new ones
• Brill-style learning – generate-and-test
Considerations
• Too high recall and the human might
become over-reliant on the system
annotations
• Too high precision might have similar
effect
• “Theory-creep” – the choices of the human
annotator are increasingly influenced by
the machine’s and might deviate from the
task definition – measure inter-annotator
agreement
62(130)
Example 2: Amilcare & Melita
• Amilcare: rule-learning algorithm
– Tagging rules – learn to insert tags in the text,
given training examples
– Correction rules – learn to move already
inserted tags to their correct place in the text
• Learns begin and end tags independently
• Melita supports adaptive IE
• Applied in SemWeb context (see below)
• Being extended as part of the EU-funded
DOT.KOM project towards KM and
SemWeb applications
63(130)
Comparison of Alembic & Melita
• The life cycle of user tagging, the machine
learning, then making suggestions which user
corrects, is very similar in Melita and Alembic
• Alembic is more oriented towards NLP
developers, while Melita – more towards end-users
• Melita considers timeliness and intrusiveness as
criteria, while Alembic does not (possibly due to
performance bottleneck from old hardware)
• Both acknowledge but do not address problems
with over-reliance on machine annotations
• From ML perspective the two are very similar –
rule-learning
64(130)
Eg. 3: GATE Machine Learning support
• Uses classification.
[Attr1, Attr2, Attr3, … Attrn] → Class
• Classifies annotations.
(Documents can be classified as well using a 1-to-1
relation with annotations.)
• Annotations of a particular type are selected as
instances.
• Attributes refer to features of the instance annotations or
their context.
• Generic implementation for attribute collection – can be
linked to any ML engine.
• ML engines currently integrated: WEKA and Ontotext’s
HMM.
65(130)
Implementation
Machine Learning PR in GATE.
Has two functioning modes:
– training
– application
Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
<DATASET> … </DATASET>
<ENGINE>…</ENGINE>
</ML-CONFIG>
66(130)
Attributes Collection
Instances type: Token
67(130)
Dataflow
[Figure: plain text documents pass through an NLP pipeline (Tokeniser, Gazetteer, POS Tagger, Lexicon Lookup, Semantic Tagger, etc.); the GATE ML Library (Feature Collection, Engine Interface, Results Converter) connects the pipeline to the Machine Learning Engine; the output is annotated documents]
68(130)
Towards Semantic Tagging of
Entities
• The MUC NE task tags selected segments
of text whenever that text represents the
name of an entity.
• Semantic tagging - view as mentions of
the underlying instances from the ontology
• Identify which mentions in the text refer to
which instances in the ontology, e.g.,
– Tony Blair, Mr. Blair, he, the prime minister, he
– Gordon Brown, he, Mr. Brown, the chancellor
70(130)
Tasks
• Identify entity mentions in the text
• Reference disambiguation
– Add new instances if needed
– Disambiguate wrt instances in the ontology
• Identify instances of attributes and relations
– take into account what are allowed given the
ontology, using domain&range as constraints
71(130)
Example
XYZ was established on 03 November 1978
in London. It opened a plant in Bulgaria
in …
Ontology & KB
[Figure: ontology and KB graph. Class labels: Company, Location, City, Country. Instance labels: XYZ, London, UK, Bulgaria. Edge labels: type, partOf, HQ, establOn (“03/11/1978”)]
72(130)
Classes, instances & metadata
“Gordon Brown met George Bush during his
two day visit.”
<metadata>
<DOC-ID>http://… 1.html</DOC-ID>
<Annotation>
<s_offset> 0 </s_offset>
<e_offset> 12 </e_offset>
<string>Gordon Brown</string>
<class>…#Person</class>
<inst>…#Person12345</inst>
</Annotation>
<Annotation>
<s_offset> 18 </s_offset>
<e_offset> 32 </e_offset>
<string>George Bush</string>
<class>…#Person</class>
<inst>…#Person67890</inst>
</Annotation>
</metadata>
[Figure: classes + instances before and after. Hierarchy labels: …, Entity, Person, Job-title (president, minister, chancellor), …; instance labels: Bush, G.Brown]
73(130)
Classes, instances & metadata (2)
“Gordon Brown met Tony Blair to discuss the
university tuition fees.”
<metadata>
<DOC-ID>http://… 2.html</DOC-ID>
<Annotation>
<s_offset> 0 </s_offset>
<e_offset> 12 </e_offset>
<string>Gordon Brown</string>
<class>…#Person</class>
<inst>…#Person12345</inst>
</Annotation>
<Annotation>
<s_offset> 18 </s_offset>
<e_offset> 30 </e_offset>
<string>Tony Blair</string>
<class>…#Person</class>
<inst>…#Person26389</inst>
</Annotation>
</metadata>
[Figure: classes + instances before and after. Hierarchy labels: …, Entity, Person, Job-title (president, minister, chancellor), …; instance labels: T. Blair, G. Brown, G. Bush]
74(130)
Why not put metadata in
ontologies?
• Can be encoded in RDF/OWL, etc. but does it need to be
put as instances in the ontology?
• Typically we do not need to reason with it
– Reasoning happens in the ontology when the new instances of
classes and properties are added, but the metadata statements are
different from them, they only refer to them
• A lot more metadata than instances
– Millions of metadata statements, thousands of instances, hundreds
of concepts
• Different access required:
– By offset (give me all metadata of the first paragraph)
– Efficient metadata-wide statistics based on strings – not an
operation that people would do on other concepts
– Mixing with keyword-based search using IR-style indexing
75(130)
Metadata Creation with IE
• Semantic tagging creates metadata
• Stand-off or part of document
• Semi-automatic
– One view (given by the user, one ontology)
– More reliable
• Automatic metadata creation
– Many views – change with ontology, re-train IE engine
for each ontology
– Always up to date, if ontology changes
– Less reliable
76(130)
Problems with “traditional” IE for
metadata creation
• S-CREAM – Semi-automatic CREAtion of
Metadata [Handschuh et al’02]
• Semantic tags from IE need to be mapped to
instances of concepts, attributes or relations
• Most ML-based IE systems handle mainly entities and
do not deal well with relations
• Amilcare does not handle anaphora resolution;
GATE has such a component, but it is not used here
• Implemented a discourse model with
logical rules
– LASIE used discourse model with domain
ontology – problem is robustness and
domain portability
77(130)
Example
[Handschuh et al’02] S-CREAM, EKAW’02
78(130)
S-CREAM: Discourse Rules
• Rules to attach instances only when the
ontology allows that (e.g., prices)
• Attach tag values to the nearest preceding
compatible entity (e.g., prices and rooms)
• Create a complex object between two
concept instances if they are adjacent
(e.g., rate – number followed by currency)
• Experienced users can write new rules
79(130)
Challenges for IE for SemWeb
• Portability – different and changing ontologies
• Different text types – structured, free, etc.
• Utilise ontology information where available
• Train from small amount of annotated text
• Output results wrt the given ontology
– bridge the gap demonstrated in S-CREAM
• Learn/Model at the right level
– ontologies are hierarchical and data will get sparser
the lower we go
[DOT.KOM http://nlp.shef.ac.uk/dot.kom/]
80(130)
Eg. 1: GATE & metadata extraction
• Combines learning and rule-based
methods
• Allows combination of IE and IR
• Enables use of large-scale linguistic
resources for IE, such as WordNet
• Supports ontologies as part of IE
applications - Ontology-Based IE (OBIE)
82(130)
Ontology Management in GATE
83(130)
Information Retrieval
Currently based on the Lucene IR engine – useful for
combining semantic and keyword-based search
84(130)
85(130)
WordNet support
Populating Ontologies with IE
86(130)
Example 2: OBIE in h-TechSight
• h-TechSight project – using Ontology-Based IE for
semantic tagging of job adverts, news and reports in
the chemical engineering domain
• Aim is to track technological change over time through
terminological analysis
• Fundamental to the application is a domain-specific
ontology
• Terminological gazetteer lists are linked to classes in the
ontology
• Rules classify the mentions in the text wrt the domain
ontology
• Annotations output into a database or as an
ontology
87(130)
88(130)
89(130)
Exported Database
90(130)
Platforms for Large-Scale Metadata
Creation
• Allow use of corpus-wide statistics to improve
metadata quality, e.g., disambiguation
• Automated alias discovery
• Generate SemWeb output (RDF, OWL)
• Stand-off storage and indexing of metadata
• Use large instance bases to disambiguate against
• Ontology servers for reasoning and access
• Architecture elements:
– Crawler, onto storage, doc indexing, query, annotators
– Apps: sem browsers, authoring tools, etc.
92(130)
Example 1: SemTag
• Lookup of all instances from the ontology (TAP) –
65K instances
• Disambiguate the occurrences as:
– One of those in the taxonomy
– Not present in the taxonomy
• Not very high ambiguity of instances with the same
label in TAP – concentrate on the second problem
• Use bag-of-words approach for disambiguation
• 3 people evaluated 200 labels in context – agreed
on only 68.5% (metonymy)
• Placing labels in the taxonomy is hard
Dill et al, SemTag and Seeker. WWW’03
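A generic bag-of-words disambiguation step can be sketched as follows. This is not the algorithm from the SemTag paper; the Jaccard overlap, the threshold, the instance identifiers and the descriptions are all assumptions made for illustration.

```python
def bag(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def disambiguate(context, candidates, threshold=0.1):
    """Pick the candidate instance whose description best overlaps the context.

    candidates: dict mapping instance id -> descriptive text.
    Returns the best id, or None if nothing reaches the threshold
    (i.e. the mention is treated as not present in the taxonomy).
    """
    ctx = bag(context)
    best_id, best_score = None, 0.0
    for inst_id, description in candidates.items():
        desc = bag(description)
        union = ctx | desc
        score = len(ctx & desc) / len(union) if union else 0.0  # Jaccard overlap
        if score > best_score:
            best_id, best_score = inst_id, score
    return best_id if best_score >= threshold else None


# Hypothetical instances for the label "Cambridge".
candidates = {
    "Cambridge_UK": "university city england river cam",
    "Cambridge_MA": "city massachusetts boston harvard mit",
}
choice = disambiguate(
    "students at harvard and mit near boston visited cambridge", candidates
)
```

Returning `None` below the threshold corresponds to the second outcome listed above: the occurrence is judged not to be present in the taxonomy.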
93(130)
Seeker
• High-performance distributed infrastructure
• 128 dual-processor machines with separate
½ terabyte of storage
• Each node runs approx. 200 documents per sec.
• Service-oriented architecture – Vinci (SOAP)
Dill et al, SemTag and Seeker. WWW’03
94(130)
Example 2: OBIE in KIM
• The ontology (KIMO) and 86K/200K instances KB
• High ambiguity of instances with the same label –
need for disambiguation step
• Lookup phase marks mentions from the ontology
• Combined with rule-based IE system to recognise
new instances of concepts and relations
• Special KB enrichment stage where some of these
new instances are added to the KB
• Disambiguation uses an Entity Ranking algorithm,
i.e., priority ordering of entities with the same label
based on corpus statistics (e.g., Paris)
Popov et al. KIM. ISWC’03
95(130)
OBIE in KIM (2)
Popov et al. KIM. ISWC’03
96(130)
Comparison between SemTag & KIM
• SemTag only aims for accuracy (precision) of
classification of the annotated entities
• KIM also aims for coverage (recall) – whether all
possible mentions of entities were found
• Trade-off – sometimes finding some is enough
• SemTag does not attempt to discover and expand the
KB with new instances (e.g., new company) – the reason
why KIM uses IE, not simple KB lookup
• i.e. OBIE is often needed for ontology population, not
just metadata creation
97(130)
Two Annotation Scenarios (1)
• Getting the instances and the relations between them is enough;
not all mentions in the text may be covered, but this is compensated
by giving access to this info from the annotated text
98(130)
Example
“Gordon Brown met president Bush during his
two day visit. Afterwards George Bush said…
…
The system:
– Entity
– Person: G.Brown, Bush, …
– Job-title: president, minister, chancellor, …
Benchmark:
– Entity
– Person: G.Brown, Bush, …
– Job-title: president, minister, chancellor, …
Score: 100%
99(130)
Two Annotation Scenarios (2)
• Exhaustive annotation is required, so all
occurrences of all instances and relations are
needed
• Allows sentence and paragraph-level exploration,
rather than document-level as in the previous
scenario
• Harder to achieve
• Distinction between these scenarios needs to be
made in the metadata annotation tools/KM tools
using IE
100(130)
Example
“Gordon Brown met president Bush during his
two day visit. Afterwards George Bush said…
<metadata>
<Annotation>
<s_offset> 0 </s_offset>
<e_offset> 12 </e_offset>
<class>…#Person</class>
<inst>…#Person12345</inst>
</Annotation>
<Annotation>
<s_offset> 61 </s_offset>
<e_offset> 72 </e_offset>
<class>…#Person</class>
<inst>…#Person1267</inst>
</Annotation>
</metadata>
Score: 66%
<metadata>
<Annotation>
<s_offset> 0 </s_offset>
<e_offset> 12 </e_offset>
<class>…#Person</class>
<inst>…#Person12345</inst>
</Annotation>
<Annotation>
<s_offset> 18 </s_offset>
<e_offset> 32 </e_offset>
<class>…#Person</class>
<inst>…#Person1267</inst>
</Annotation>
<Annotation>
<s_offset> 61 </s_offset>
<e_offset> 72 </e_offset>
<class>…#Person</class>
<inst>…#Person1267</inst>
</Annotation>
</metadata>
101(130)
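The score in this example can be reproduced with a short sketch. It assumes that the first metadata block is the system output and the second is the benchmark (as in the previous example), that annotations are matched exactly, and that the score shown is recall. The Annotation type is hypothetical and the elided namespace prefixes (…#) are left out.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int   # s_offset
    end: int     # e_offset
    cls: str     # class
    inst: str    # instance identifier


def score(system, benchmark):
    """Return (precision, recall) under exact matching of annotations."""
    system, benchmark = set(system), set(benchmark)
    correct = len(system & benchmark)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(benchmark) if benchmark else 0.0
    return precision, recall


system = [
    Annotation(0, 12, "Person", "Person12345"),
    Annotation(61, 72, "Person", "Person1267"),
]
benchmark = [
    Annotation(0, 12, "Person", "Person12345"),
    Annotation(18, 32, "Person", "Person1267"),
    Annotation(61, 72, "Person", "Person1267"),
]
precision, recall = score(system, benchmark)
print(f"precision={precision:.0%} recall={recall:.1%}")
```

Two of the three benchmark mentions are found, so recall is 2/3 (the 66% on the slide), while every annotation the system produced is correct. In the first scenario only the set of instances matters, which is why the same output scores 100% there.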
Eg. 3: SWAN: a Semantic Web
Annotator
• Collaboration between DERI/NUIG, OntoText
and USFD, hosted at DERI
• GATE + KIM + SECO
• Custom indexing of news or other web fractions
• Quantitative media reporting
• Annotated web workbench service
• Custom knowledge services
• Demo and poster at ESWS
102(130)
SWAN Logical Architecture
[Figure: Web; Focussed crawling (multiple crawlers); IE (64 bit)
and IE (32 bit); Annotation (Oracle); Knowledge base (Sesame);
Web UI, Web services; Service Users]
103(130)
[Figure, continued: UI; Users; Cluster Controller]
104(130)
Semantic Reference
Disambiguation
• Possible approaches:
– Vector-space models – compare context
similarity – runs over a corpus
• SemTag
• Bagga’s cross-document coreference work
– Communities of practice approach from KM
– Identity criteria from the ontology based on
properties, e.g., date_of_birth, name
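The identity-criteria approach can be sketched as a comparison of identifying properties. The property names name and date_of_birth come from the slide. The records and the rule that all identity properties must be present and equal are assumptions made for illustration.

```python
# Properties that the ontology is assumed to declare as identity criteria.
IDENTITY_PROPERTIES = ("name", "date_of_birth")


def same_entity(a, b, properties=IDENTITY_PROPERTIES):
    """Two descriptions denote the same entity only if every identity
    property is known for both and has the same value."""
    return all(p in a and p in b and a[p] == b[p] for p in properties)


# Hypothetical KB instance and two extracted mentions with the same name.
kb_instance = {"name": "John Smith", "date_of_birth": "1950-01-01", "job": "president"}
mention_1 = {"name": "John Smith", "date_of_birth": "1950-01-01"}
mention_2 = {"name": "John Smith", "date_of_birth": "1972-06-30"}

print(same_entity(kb_instance, mention_1))
print(same_entity(kb_instance, mention_2))
```

A mention that lacks one of the identity properties is treated as unresolved rather than merged, which is the conservative choice when not all knowledge is explicit in the text.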
105(130)
Why disambiguation is hard –
not all knowledge is explicit in text
Paris fashion week underway as cancellations continue
By Jo Johnson and Holly Finn - Oct 07 2001 18:48:17 (FT)
Even as Paris fashion week opened at the weekend, the
cancellations and reschedulings were still trickling in over the
fax machines: Loewe, the leather specialists owned by LVMH
empire, is not showing, Cerruti, the Italian tailor, is
downscaling to private viewings, Helmut Lang, master of the
sharp suit, is cancelling his catwalk.
The Oscar de la Renta show, for example, which had been
planned for September 11th in New York, and which might
easily enough have moved over to Paris instead, is not on the
schedule. When the Dominican Republic-born designer
consulted America Vogue's influential editor, Anna Wintour,
she reportedly told him it would be unpatriotic to decamp.
106(130)
Structure of the Tutorial
1. Motivation, background
2. Information Extraction - definition
3. Evaluation – corpora & metrics
4. IE approaches – some examples
– Rule-based approaches
– Learning-based approaches
5. Semantic Tagging
– Using “traditional” IE
– Ontology-based IE
– Platforms for large-scale processing
6. Language Generation
[Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt]
107(130)
Natural Language Generation
• NLG is:
– “subfield of AI and CL that is concerned with
the construction of computer systems that can
produce understandable texts in English or
other human languages from some underlying
non-linguistic representation of information”
[Reiter&Dale’97]
– NLG techniques are applied also for
producing speech, e.g., in speech dialogue
systems
108(130)
Natural Language Generation
[Figure: Ontology/KB/Database + Lexicons + Grammars → Text]
109(130)
Requirements Analysis
• Create a corpus of target texts and (if
possible) their input representations
• Analyse the information content
– Unchanging texts: thank you, hello, etc.
– Directly available data: timetable of buses
– Computable data: number of buses
– Unavailable data: not in the system’s KB/DB
110(130)
NLG Tasks
1. Content determination
2. Discourse planning
3. Sentence aggregation
4. Lexicalisation
5. Referring expression generation
6. Linguistic realisation
111(130)
Content determination
• What information to include in the text –
filtering and summarising input data into a
formal knowledge representation
• Application dependent
• Example
[ project: AKT
start_date: October-2000
end_date: October-2006
participants: {A,E,OU,So,Sh}]
112(130)
Discourse Planning
• Determine ordering and structure over the
knowledge to be generated
• Theories of discourse – how texts are
structured
• Influences text readability
• Result: tree structure imposing ordering
over the predicates and possibly providing
discourse relations
113(130)
Example
Root
  SEQUENCE
    [project: AKT, duration: 6 yrs]
    Project participants
      LIST
        Participant descr.
          ELABORATION
            [project: AKT, participant: Shef]
            [univ: Shef, Web-page: URL]
        Participant descr.
          ELABORATION
            [project: AKT, participant: OU]
            …
        …
114(130)
Planning-Based Approaches
• Use AI-style planners
(e.g., [Moore & Paris 93])
– Discourse relations (e.g., ELABORATION) are
encoded as planning operators
– Preconditions specify when the relation can
apply
– Planning starts from a top-level goal, e.g.,
define-project(X)
• Computationally expensive and require a lot
of knowledge – a problem for
real-world systems
115(130)
Schema-Based Approaches
• Capture typical text structuring patterns in
templates (derived from corpus), e.g.,
[McKeown 85]
• Typically implemented as recursive transition networks (RTNs)
• Variety comes from different available
knowledge for each entity
• Reusable ones available: Exemplars
• Example:
Describe-Project-Schema ->
Sequence([duration], ProjParticipants-Schema)
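A minimal sketch of how such a schema might expand into a discourse plan. Plain Python functions stand in for the schemas here rather than an RTN, and the KB layout and function names are hypothetical. The output mirrors the SEQUENCE / LIST / ELABORATION tree from the discourse planning example.

```python
# Hypothetical knowledge base for the AKT example.
KB = {
    "project": "AKT",
    "duration": "6 yrs",
    "participants": [
        {"univ": "Shef", "web_page": "URL"},
        {"univ": "OU"},   # no extra knowledge available for this one
    ],
}


def participant_schema(project, participant):
    """ELABORATION: name the participant, then add whatever else is known."""
    children = [{"project": project, "participant": participant["univ"]}]
    extra = {k: v for k, v in participant.items() if k != "univ"}
    if extra:
        children.append({"univ": participant["univ"], **extra})
    return ("ELABORATION", children)


def proj_participants_schema(kb):
    """LIST: one participant description per participant."""
    return ("LIST", [participant_schema(kb["project"], p) for p in kb["participants"]])


def describe_project_schema(kb):
    """Sequence([duration], ProjParticipants-Schema)."""
    return ("SEQUENCE", [
        {"project": kb["project"], "duration": kb["duration"]},
        proj_participants_schema(kb),
    ])


plan = describe_project_schema(KB)
print(plan)
```

The Sheffield description gets two predicates and the OU description only one, which shows how variety comes from the knowledge available for each entity.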
116(130)
Sentence Aggregation
• Determine which predicates should be grouped
together in sentences
• A less well understood process
• Default: each predicate can be expressed as a
sentence, so optional step
• SPOT: trainable planner
• Example:
AKT is a 6-year project with 5 participants:
• Sheffield (URL)
• OU …
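A minimal sketch of aggregation, assuming a simple heuristic: consecutive predicates that share a subject are merged into one sentence. This is not how SPOT works (SPOT is trainable), and the predicate format is hypothetical.

```python
from itertools import groupby


def aggregate(predicates):
    """Merge consecutive (subject, phrase) predicates with the same subject."""
    sentences = []
    for subject, group in groupby(predicates, key=lambda p: p[0]):
        phrases = [phrase for _, phrase in group]
        if len(phrases) > 1:
            body = ", ".join(phrases[:-1]) + " and " + phrases[-1]
        else:
            body = phrases[0]
        sentences.append(f"{subject} {body}.")
    return sentences


predicates = [
    ("AKT", "is a 6-year project"),
    ("AKT", "has 5 participants"),
    ("Sheffield", "is a participant in AKT"),
]
sentences = aggregate(predicates)
print(sentences)
```

Without this step each predicate would be realised as its own sentence, which is the default mentioned above.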
117(130)
Lexicalisation
• Choosing words and phrases to express
the concepts and relations in predicates
• Trivial solution: 1-1 mapping between
concepts/relations and lexical entries
• Variation is useful to avoid repetitiveness
and also convey pragmatic distinctions
(e.g. formality)
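A minimal sketch of lexical choice with variation. The lexicon entries, the two registers and the cycling strategy are assumptions made for illustration. The trivial 1-1 mapping corresponds to a lexicon with a single entry per concept.

```python
# Hypothetical lexicon: concept -> register -> alternative lexical entries.
LEXICON = {
    "participant": {"formal": ["participant", "partner"], "informal": ["member"]},
    "project": {"formal": ["project"], "informal": ["project"]},
}


def lexicalise(concept, register="formal", use_count=0):
    """Pick a word for the concept, cycling through the alternatives so that
    repeated mentions do not always use the same word."""
    options = LEXICON[concept][register]
    return options[use_count % len(options)]


first = lexicalise("participant")
second = lexicalise("participant", use_count=1)
casual = lexicalise("participant", register="informal")
print(first, second, casual)
```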
118(130)
Referring Expression Generation
• Choose pronouns/phrases to refer to the
entities in the text
• Example: he vs Mr Smith vs John Smith,
the president of XXX Corp.
• Depends on what has been said previously
– He is only appropriate if the person has already
been introduced in the text
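A minimal sketch of this choice, assuming three levels: full description on first mention, pronoun when the entity was also the previous mention, short name otherwise. The rule and the second entity are hypothetical.

```python
def referring_expressions(mentions):
    """Return one referring expression per mention, in discourse order."""
    introduced = set()   # entities already mentioned in the text
    previous = None      # entity mentioned immediately before
    expressions = []
    for entity in mentions:
        if entity["id"] == previous:
            expression = entity["pronoun"]       # unambiguous: just mentioned
        elif entity["id"] in introduced:
            expression = entity["short_name"]    # known, but not the last one
        else:
            expression = f'{entity["full_name"]}, {entity["description"]}'
        introduced.add(entity["id"])
        previous = entity["id"]
        expressions.append(expression)
    return expressions


smith = {"id": "smith", "pronoun": "he", "short_name": "Mr Smith",
         "full_name": "John Smith", "description": "the president of XXX Corp."}
jones = {"id": "jones", "pronoun": "she", "short_name": "Ms Jones",
         "full_name": "Mary Jones", "description": "a hypothetical colleague"}

expressions = referring_expressions([smith, smith, jones, smith])
print(expressions)
```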
119(130)
Linguistic Realisation
• Use grammar to generate text which is
grammatical, i.e., syntactically and
morphologically correct
• Domain-independent
• Reusable components are available – e.g.,
RealPro, FUF/SURGE
• Example:
– Morphology: participant -> participants
– Syntactic agreement: AKT starts on …
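A toy sketch of the two examples above. It only handles regular nouns and regular third-person present-tense verbs. Realisers such as RealPro and FUF/SURGE rely on full grammars, so the helper functions below are purely illustrative.

```python
def pluralise(noun, count):
    """Morphology: regular English plural, e.g. participant -> participants."""
    return noun if count == 1 else noun + "s"


def agree(verb, plural_subject):
    """Syntactic agreement: add -s for a singular third-person subject."""
    return verb if plural_subject else verb + "s"


def realise(subject, verb, rest, plural_subject=False):
    """Build a simple subject-verb-rest sentence."""
    return f"{subject} {agree(verb, plural_subject)} {rest}."


sentence_1 = realise("AKT", "start", "in October 2000")
noun = pluralise("participant", 5)
sentence_2 = realise(f"The 5 {noun}", "start", "in October 2000", plural_subject=True)
print(sentence_1)
print(sentence_2)
```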
120(130)
Example: a GATE-based generator
• Input
– The MIAKT ontology
– The RDF file for the given case
– The MIAKT lexicon
• Output
– GATE document with the generated
text
121(130)
Lexicalising Concepts and
Instances
122(130)
Example RDF Input
<rdf:Description
rdf:about='c:\breast_cancer_ontology.daml#01401_patient'>
<rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Patient'/>
<NS2:has_age>68</NS2:has_age>
<NS2:involved_in_ta rdf:resource='c:\breast_cancer_ontology.daml#tasoton-1069861276136'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_mammography'>
<rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Mammography'/>
<NS2:carried_out_on rdf:resource='c:\breast_cancer_ontology.daml#01401_patient'/>
<NS2:has_date>22 9 1995</NS2:has_date>
<NS2:produce_result rdf:resource='c:\breast_cancer_ontology.daml#image_01401_right_cc'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#image_01401_right_cc'>
<NS2:image_file>cancer/case0140/C_0140_1.RIGHT_CC.LJPEG</NS2:image_file>
<rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Right_CC_Image'/>
<NS2:has_lateral rdf:resource='c:\breast_cancer_ontology.daml#lateral_right'/>
<NS2:view_of_image rdf:resource='c:\breast_cancer_ontology.daml#craniocaudal_view'/>
<NS2:contains_entity rdf:resource='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'>
<rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Abnormality'/>
<NS2:is_finding rdf:resource='c:\breast_cancer_ontology.daml#mass_01401_right_cc_abnor_1'/>
<NS2:has_morph_feature rdf:resource='c:\breast_cancer_ontology.daml#shape_mammo_round'/>
<NS2:has_morph_feature
rdf:resource='c:\breast_cancer_ontology.daml#margin_mammo_microlobulated'/>
<NS2:has_overall_impression
rdf:resource='c:\breast_cancer_ontology.daml#assessment_probably_malignant'/>
</rdf:Description>
123(130)
CASE0140.RDF
The 68 years old patient is involved in a triple
assessment procedure. The triple assessment
procedure contains a mammography exam. The
mammography exam is carried out on the
patient on 22 9 1995. The mammography exam
produced a right CC image. The right CC image
contains an abnormality and it has a right lateral
side and a craniocaudal view. The abnormality
has a mass, a microlobulated margin , a round
shape, and a probably malignant assessment.
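The mapping from RDF statements to sentences can be sketched with a property lexicon of phrase templates. The triples are copied by hand from a subset of the RDF input above, with the file prefix dropped. The lexicon entries and function names are hypothetical and are not the MIAKT lexicon, and no aggregation is applied, so the output is more repetitive than the generated text shown.

```python
# A hand-copied subset of the RDF input as (subject, property, object) triples.
TRIPLES = [
    ("01401_patient", "rdf:type", "Patient"),
    ("01401_patient", "has_age", "68"),
    ("01401_mammography", "rdf:type", "Mammography"),
    ("01401_mammography", "carried_out_on", "01401_patient"),
    ("01401_mammography", "has_date", "22 9 1995"),
]

# Hypothetical lexicon entries for classes and properties.
CLASS_LEXICON = {"Patient": "patient", "Mammography": "mammography exam"}
PROPERTY_LEXICON = {
    "has_age": "is {obj} years old",
    "carried_out_on": "is carried out on {obj}",
    "has_date": "takes place on {obj}",
}


def noun_phrase(resource, types):
    """Describe an instance by its class; literals are used as they are."""
    if resource in types:
        return "the " + CLASS_LEXICON[types[resource]]
    return resource


def generate(triples):
    """Produce one sentence per non-type triple."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    sentences = []
    for s, p, o in triples:
        if p == "rdf:type":
            continue
        predicate = PROPERTY_LEXICON[p].format(obj=noun_phrase(o, types))
        sentence = f"{noun_phrase(s, types)} {predicate}."
        sentences.append(sentence[0].upper() + sentence[1:])
    return sentences


sentences = generate(TRIPLES)
print(" ".join(sentences))
```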
124(130)
Further Reading on IE for SemWeb
• Requirements for Information Extraction for Knowledge Management.
http://nlp.shef.ac.uk/dot.kom/publications.html
• Information Extraction as a Semantic Web Technology: Requirements and
Promises. Adaptive Text Extraction and Mining workshop, 2003.
• A. Kiryakov, B. Popov, et al. Semantic Annotation, Indexing, and Retrieval. 2nd
International Semantic Web Conference (ISWC2003),
http://www.ontotext.com/publications/index.html#KiryakovEtAl2003
• S. Handschuh, S. Staab, R. Volz. On Deep Annotation. WWW’03.
http://www.aifb.unikarlsruhe.de/WBS/sha/papers/p273_handschuh.pdf
• S. Dill, N. Eiron, et al. SemTag and Seeker: Bootstrapping the semantic web
via automated semantic annotation. WWW’03.
http://www.tomkinshome.com/papers/2Web/semtag.pdf
• E. Motta, M. Vargas-Vera, et al. MnM: Ontology Driven Semi-Automatic and
Automatic Support for Semantic Markup. Knowledge Engineering and Knowledge
Management (Ontologies and the Semantic Web), (EKAW02),
http://www.aktors.org/publications/selected-papers/06.pdf
• K. Bontcheva, A. Kiryakov, H. Cunningham, B. Popov, M. Dimitrov. Semantic Web
Enabled, Open Source Language Technology. Language Technology and the
Semantic Web, Workshop on NLP and XML (NLPXML-2003).
http://www.gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf
• Handschuh, Staab, Ciravegna. S-CREAM - Semi-automatic CREAtion of Metadata
(2002) http://citeseer.nj.nec.com/529793.html
125(130)
Further Reading on “traditional” IE
• [Day et al’97] D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain.
Mixed-Initiative Development of Language Processing Systems. In Proceedings of the Fifth
Conference on Applied Natural Language Processing (ANLP’97). 1997.
• [Ciravegna’02] F. Ciravegna, A. Dingli, D. Petrelli, Y. Wilks: User-System Cooperation in
Document Annotation based on Information Extraction. Knowledge Engineering and
Knowledge Management (Ontologies and the Semantic Web), (EKAW02), 2002.
• N. Kushmerick, B. Thomas. Adaptive information extraction: Core technologies for
information agents (2002). http://citeseer.nj.nec.com/kushmerick02adaptive.html
• H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical
Development Environment for Robust NLP Tools and Applications. 40th Anniversary Meeting
of the Association for Computational Linguistics (ACL'02). 2002.
• D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named
entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
• Califf and Mooney: Relational Learning of Pattern Matching Rules for Information Extraction
http://citeseer.nj.nec.com/6804.html
• Borthwick A. A Maximum Entropy Approach to Named Entity Recognition.
PhD Dissertation. 1999
• Bikel D., Schwartz R., Weischedel R. An algorithm that learns what’s in a name. Machine
Learning 34, pp.211-231, 1999
• Riloff, E. (1996) "Automatically Generating Extraction Patterns from Untagged Text"
Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996,
pp. 1044-1049. http://www.cs.utah.edu/%7Eriloff/psfiles/aaai96.pdf
• Daelemans W. and Hoste V. Evaluation of Machine Learning Methods for Natural Language
Processing Tasks. In LREC 2002 Third International Conference on Language Resources
and Evaluation, pages 755–760
126(130)
Further Reading on “traditional” IE
• Black W.J., Rinaldi F., Mowatt D. Facile: Description of the NE System Used For
MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19
April - 1 May, 1998.
• Collins M., Singer Y. Unsupervised models for named entity classification.
In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural
Language Processing and Very Large Corpora, 1999
• Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted
Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp.
489-496, July 2002
• Gotoh Y., Renals S. Information extraction from broadcast news,
Philosophical Transactions of the Royal Society of London, series A: Mathematical,
Physical and Engineering Sciences, 2000.
• Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the
MUC-6 workshop, Washington. November 1995.
• Krupka G. R., Hausman K. IsoQuest Inc.: Description of the NetOwl™ Extractor
System as Used for MUC-7. Proceedings of 7th Message Understanding
Conference, Fairfax, VA, 19 April - 1 May, 1998.
• McDonald D. Internal and External Evidence in the Identification and Semantic
Categorization of Proper Names. In B. Boguraev and J. Pustejovsky editors: Corpus
Processing for Lexical Acquisition. Pages 21-39. MIT Press. Cambridge, MA. 1996
• Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-7.
Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1
May, 1998
• Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7.
Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1
May, 1998
127(130)
Further Reading on multilingual IE
• Palmer D., Day D.S. A Statistical Profile of the Named Entity Task.
Proceedings of the Fifth Conference on Applied Natural Language
Processing, Washington, D.C., March 31- April 3, 1997.
• Sekine S., Grishman R. and Shinou H. A decision tree method for finding
and classifying names in Japanese texts. Proceedings of the Sixth
Workshop on Very Large Corpora, Montreal, Canada, 1998
• Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity
Identification Using Class-based Language Model. In Proceedings of the
19th International Conference on Computational Linguistics
(COLING2002), pp.967-973, 2002.
• Takeuchi K., Collier N. Use of Support Vector Machines in Extended
Named Entity Recognition. The 6th Conference on Natural Language
Learning. 2002
• D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic
extraction of named entities. Recent Advances in Natural Language
Processing, Bulgaria, 2003.
• M. M. Wood and S. J. Lydon and V. Tablan and D. Maynard and H.
Cunningham. Using parallel texts to improve recall in IE. Recent Advances
in Natural Language Processing, Bulgaria, 2003.
• D. Maynard, V. Tablan and H. Cunningham. NE recognition without training
data on a language you don't speak. ACL Workshop on Multilingual and
Mixed-language Named Entity Recognition: Combining Statistical and
Symbolic Models, Sapporo, Japan, 2003.
128(130)
Further Reading on multilingual IE
• H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, Y. Wilks.
Multimedia Indexing through Multisource and Multilingual Information
Extraction; the MUMIS project. Data and Knowledge Engineering, 2003.
• D. Manov and A. Kiryakov and B. Popov and K. Bontcheva and D. Maynard, H.
Cunningham. Experiments with geographic knowledge for information
extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03,
Canada, 2003.
• H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework
and Graphical Development Environment for Robust NLP Tools and
Applications. Proceedings of the 40th Anniversary Meeting of the Association
for Computational Linguistics (ACL'02). Philadelphia, July 2002.
• H. Cunningham. GATE, a General Architecture for Text Engineering.
Computers and the Humanities, volume 36, pp. 223-254, 2002.
• D. Maynard, H. Cunningham, K. Bontcheva, M. Dimitrov. Adapting A Robust
Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th
International Conference on Artificial Intelligence: Methodology, Systems,
Applications (AIMSA 2002), 2002.
• K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is
the reuse of grammars for Named Entity Recognition? Language Resources
and Evaluation Conference (LREC'2002), 2002.
129(130)
THANK YOU!
(for not snoring)
The slides:
http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt
[This work has been supported by
AKT (http://aktors.org/) and
SEKT (http://sekt.semanticweb.org/)]
130(130)