
After OWL: de facto standards
for semantic technologies
(or: what do you get for €40m
EU research money?)
http://gate.ac.uk/
http://nlp.shef.ac.uk/
Hamish Cunningham,
Kalina Bontcheva, Valentin Tablan, Diana Maynard,
Wim Peters, Niraj Aswani, Milena Yankova,
Yaoyong Li, Akshay Java, Michael Dowman
ILASH workshop, March 2004
Structure of the talk
• Context:
• increasing use of “semantic” technology in IT
• the role(s) of human language technology
• substantial investment in the next phase of semantic web
research
• Semantic Web: moving on from formal standards
• Acronym soup:
• GATE: HLT API 4 SDK SW & KT
• An application: Ontology-Based IE in KIM
• Issues in API design, next steps
The Knowledge Economy and
Human Language
Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing
will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer
information input will involve textual language
A contradiction:
• to deal with the information deluge we need formal knowledge
in semantics-based systems
• our information spaces are in informal and ambiguous natural
language
The challenge: to reconcile these two phenomena
HLT: Closing the Loop
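[Diagram: a loop between Human Language and Formal Knowledge
(ontologies and instance bases). (A)IE and OIE link human language
to formal knowledge; (M)NLG links formal knowledge back to human
language; CLIE links Controlled Language to formal knowledge.
Application areas shown: Semantic Web; Semantic Grid; Semantic
Web Services.]
KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE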
SEKT: Semantic Knowledge Technology
• 6th framework IP project
• Duration: 36 months from 1 January 2004, €12.5m
• http://sekt.semanticweb.org/
• Improve automation of ontology and metadata
generation
• Develop highly-scalable solutions
• Research sound inferencing despite
inconsistent models
• Develop semantic knowledge access tools
• Develop methodology for deployment
PrestoSpace (20th Century Rot)
• 20th Century audio-visual media is rapidly
disappearing
• Preservation and restoration are high cost
• The costs must be justified by increased access
• “Metadata”: descriptive information about
content
• PrestoSpace (€9m IP, 40 months from 02/04):
– rich metadata and semantic access
– cross-lingual access
– syndicated delivery
– repurposeable content
The “SDK” research cluster
• “Building the European Research Area” in KM through
collaboration with related IP and NoE projects in this
area for a coordinated impact strategy
• SEKT, DIP, KnowledgeWeb – SDK cluster:
http://sdk.semanticweb.org/
• Other related projects:
• AceMedia IP (semantic knowledge systems)
• PrestoSpace IP (cultural heritage / digital libraries)
• BRICKS IP (cultural heritage / digital libraries)
• Total EU/6FP investment in semantic tech. research:
€40m, with the potential to influence the emergence of
de facto standards
Next step for semantic tech: from
formal to de facto standards?
• Computer scientists love standards, so we have many
• For any given problem there are usually 3 “standards”
• OWL is no exception: Lite, DL, Full
• There are good reasons, but cf. RDF(S)
implementation history: applications will of necessity
mix and match
• If we can achieve standard practice and libraries in
applications we will have made a next step and will
promote takeup
• (Pathological) example: TCP/IP vs. OSI
HLT API 4 SDK SW & KT
• What sorts of software do we need?
• Ontology and metadata management: storage;
versioning; caching, inferencing; etc. (below)
• Human language technology components and
services (not monolithic systems, not unproven
research prototypes)
• The role of measurement in scaling and
robustness: in HLT this means MUC, TREC,
ACE, TIDES, ...
• Here’s one we baked earlier....
GATE (the Volkswagen Beetle
of Language Processing) is:
• Eight years old, with the largest user constituency of its type
• An architecture: a macro-level organisational picture for LE
software systems.
• A framework: for programmers, GATE is an object-oriented
class library that implements the architecture.
• A development environment: for language engineers,
computational linguists et al., a graphical development
environment.
• Some free components, and wrappers for other people's
components
• Tools for: evaluation; visualise/edit; persistence; IR; IE;
dialogue; ontologies; etc.
• Free software (LGPL). Download at
http://gate.ac.uk/download/
Critical mass: 000s people 00s sites
GATE team projects. Past:
• Conceptual indexing: MUMIS: automatic semantic indices for
sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity
research e-science
• EMILLE: S. Asian language corpus
• ACE / TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging
Present:
• Advanced Knowledge Technologies: €12m UK five site
collaborative project
• ETCSL: Sumerian digital library
• MiAKT: medical informatics / AKT
• SEKT: Semantic Knowledge Tech
• PrestoSpace: AV Preservation
• KnowledgeWeb; h-TechSight
GATE users = significant proportion of community. A small sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs: Melandra, SG-MediaStyle, ...
• Imperial College, London, the University of Manchester, UMIST,
the University of Karlsruhe, Vassar College, the University of
Southern California and a large number of other UK, US and EU
Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES,
CubReporter, Poesia...
Architectural principles
• Non-prescriptive, theory neutral
(strength and weakness)
• Re-use, interoperation, not reimplementation
(e.g. diverse XML support, integration of Protégé,
Jena, Weka, interoperation with SCHUG in MUMIS)
• (Almost) everything is a component, and
component sets are user-extendable
• (Almost) all operations are available both from API
and GUI
• Why does this matter? It means that GATE works
well with other tools, embeds easily, and achieves
robustness through focus (API requirements)
All the world’s a Java Bean....
CREOLE: a Collection of REusable Objects for
Language Engineering:
• GATE components: modified Java Beans with
XML configuration
• The minimal component = 10 lines of Java, 10
lines of XML, 1 URL
Why bother?
• Allows the system to load arbitrary language
processing components
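A minimal sketch of the idea of configuration-driven component loading. It is written in Python for brevity and does not reproduce GATE's actual Java Bean or XML format: the config layout, the `UpperCaseTokeniser` class and the `load_components` helper are all hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical config format; GATE's real CREOLE XML schema is not reproduced here.
CONFIG = """
<components>
  <component name="tokeniser" class="UpperCaseTokeniser"/>
</components>
"""


class UpperCaseTokeniser:
    """A toy processing component: splits on whitespace and upper-cases."""

    def execute(self, text):
        return [tok.upper() for tok in text.split()]


# Classes the loader may instantiate, keyed by the name used in the config.
KNOWN_CLASSES = {"UpperCaseTokeniser": UpperCaseTokeniser}


def load_components(xml_text, known_classes):
    """Instantiate every component declared in the config, keyed by its name."""
    root = ET.fromstring(xml_text)
    loaded = {}
    for node in root.findall("component"):
        component_class = known_classes[node.get("class")]
        loaded[node.get("name")] = component_class()
    return loaded


components = load_components(CONFIG, KNOWN_CLASSES)
tokens = components["tokeniser"].execute("all the world's a bean")
```

The host system only knows the shared `execute` convention, so a new component can be added by declaring it in the config rather than by changing the loader.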
GATE APIs
[Architecture diagram; labels recovered from the figure, grouped by layer]
• Input formats: PDF docs, RTF docs, HTML docs, XML docs, email, …
• Document Format Layer (LRs): XML Document Format, HTML Document
Format, PDF Document Format, … Document Format
• Corpus Layer (LRs): Corpus, Document, Document Content,
Annotation Set, Annotation, Feature Map
• Language Resource Layer (LRs): Gazetteers, Ontology, Protégé
Ontology, WordNet, ...
• Processing Layer (PRs): NE, Co-ref, POS, …
• Application Layer: ANNIE, OBIE, …
• IDE GUI Layer (VRs): ADiff, OntolVR, DocVR, ...
• DataStore Layer: XML, Oracle, PostgreSQL, .ser
• Web Services
• Other labels: TRs, TEs
NOTES
• everything is a replaceable bean
• all communication via fixed APIs
• low coupling, high modularity,
high extensibility
NOTES (2)
• e.g. Protégé LR & VR both
wrapped in Res. (bean) API
• ontology repositories and
inference are the same:
KAON + Sesame +
Orenge + ?
Issues (1): a common HLT API
• OGSA, WSMO in the web services layer?
• Eclipse: less code for us, more services for
users? (A free OWL/UML drawing tool, for
example)
• ISO TC37/SC4: JNLE special; LIRICS
consortium
API Application: Ontology-based IE
XYZ was established on 03 November 1978
in London. It opened a plant in Bulgaria
in …
Ontology & KB
[Diagram: ontology and knowledge base graph.
Classes: Company, Location, City, Country.
Instances: XYZ, London, UK, Bulgaria; literal “03/11/1978”.
Edge labels: type, HQ, partOf, establOn.]
Classes, instances & metadata
“Gordon Brown met George Bush during his
two day visit.”
<metadata>
<DOC-ID>http://… 1.html</DOC-ID>
…
<Annotation>
<s_offset> 0 </s_offset>
<e_offset> 12 </e_offset>
<string>Gordon Brown</string>
<class>…#Person</class>
<inst>…#Person12345</inst>
</Annotation>
<Annotation>
<s_offset> 18 </s_offset>
<e_offset> 32 </e_offset>
<string>George Bush</string>
<class>…#Person</class>
<inst>…#Person67890</inst>
…
</Annotation>
</metadata>
[Diagram: classes + instances before and after the text.
Classes: Entity, Person, Job-title.
Instances: G.Brown, Bush (Person); president, minister,
chancellor (Job-title).]
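A minimal sketch of stand-off annotation as used in the metadata above: each mention is stored as a pair of character offsets plus an ontology class and an instance identifier, and the text itself is left untouched. The `Annotation` dataclass and `annotate` helper are hypothetical names, and the offsets are computed on the plain Python string, so they need not match the illustrative values on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character of the mention
    end: int     # offset one past the last character of the mention
    cls: str     # ontology class (abbreviated, as on the slide)
    inst: str    # knowledge base instance identifier


def annotate(text, mention, cls, inst):
    """Locate the first occurrence of `mention` and wrap it as an annotation."""
    start = text.find(mention)
    if start < 0:
        raise ValueError(f"mention not found: {mention!r}")
    return Annotation(start, start + len(mention), cls, inst)


TEXT = "Gordon Brown met George Bush during his two day visit."

annotations = [
    annotate(TEXT, "Gordon Brown", "#Person", "#Person12345"),
    annotate(TEXT, "George Bush", "#Person", "#Person67890"),
]

# The mention strings can always be recovered from the offsets alone.
mentions = [TEXT[a.start:a.end] for a in annotations]
```

Because the annotations only point into the text, several annotation sets (for example before and after KB enrichment) can coexist over the same document.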
OBIE in KIM
• An ontology (KIMO) and 200K instances KB
• High ambiguity of instances with the same label –
uses disambiguation step
• Lookup phase marks mentions from the ontology
• Combined with GATE-based IE system to
recognise new instances of concepts and relations
• KB enrichment stage where some of these new
instances are added to the KB
• Disambiguation uses an Entity Ranking algorithm,
i.e., priority ordering of entities with the same label
based on corpus statistics (e.g., Paris)
Popov et al. KIM. ISWC’03
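A minimal sketch of ranking entities that share a label by corpus statistics. It is not KIM's actual algorithm: the instance names, the mention counts and the `rank_candidates` helper are made up for illustration, and raw frequency stands in for whatever statistics the real system uses.

```python
from collections import Counter

# Hypothetical mention counts per KB instance; the numbers are invented.
CORPUS_COUNTS = Counter({
    "Paris_France": 950,
    "Paris_Texas": 40,
    "Paris_Kentucky": 10,
})

# Hypothetical label index: surface label -> instances carrying that label.
LABEL_INDEX = {"Paris": ["Paris_Texas", "Paris_Kentucky", "Paris_France"]}


def rank_candidates(label, label_index, counts):
    """Order the instances sharing `label` from most to least frequent."""
    candidates = label_index.get(label, [])
    return sorted(candidates, key=lambda inst: counts[inst], reverse=True)


ranking = rank_candidates("Paris", LABEL_INDEX, CORPUS_COUNTS)
best_guess = ranking[0] if ranking else None
```

The top-ranked instance serves as the default reading of an ambiguous mention; a fuller system would presumably also use the surrounding context.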
OBIE in KIM (2)
Popov et al. KIM. ISWC’03
KIM demo...
Next steps in OBIE
• Continue to exploit the pluggability and community effects
of GATE (and Sesame, Lucene, ...)
• SWAN: Semantic Web Annotator at DERI/Galway
• Syndication
• Social networking
• Evaluation (below)
(The “P” in OLP) Challenge:
Evaluating Richer NE Tagging
• Need for new metrics
when evaluating
hierarchy/ontology-based NE tagging
• Need to take into
account distance in
the hierarchy
• Tagging a company as
a charity is less wrong
than tagging it as a
person
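A minimal sketch of one way a metric could take distance in the hierarchy into account. The toy class hierarchy, the `tree_distance` helper and the `1 / (1 + distance)` scoring formula are all assumptions made for illustration, not a metric proposed in the talk.

```python
# Hypothetical class hierarchy, stored as child -> parent.
PARENT = {
    "Entity": None,
    "Organization": "Entity",
    "Person": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(cls):
    """Return the path from `cls` up to the root, starting with `cls` itself."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = PARENT[cls]
    return path


def tree_distance(a, b):
    """Number of edges between two classes via their lowest common ancestor."""
    steps_from_b = {cls: i for i, cls in enumerate(ancestors(b))}
    for steps_from_a, cls in enumerate(ancestors(a)):
        if cls in steps_from_b:
            return steps_from_a + steps_from_b[cls]
    raise ValueError("classes share no common ancestor")


def graded_score(gold, predicted):
    """1.0 for an exact match, decreasing as the predicted class gets further away."""
    return 1.0 / (1.0 + tree_distance(gold, predicted))


charity_error = graded_score("Company", "Charity")
person_error = graded_score("Company", "Person")
```

Under these assumptions, tagging a company as a charity scores higher than tagging it as a person, whereas exact-match precision and recall would treat both as equally wrong.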
SW IE Evaluation tasks
• Detection of entities and events, given a target
ontology of the domain.
• Disambiguation of the entities and events from the
documents with respect to instances in the given
ontology. For example, measuring whether the IE
correctly disambiguated “Cambridge” in the text to
the correct instance: Cambridge, UK vs Cambridge,
MA.
• Decision when a new instance needs to be added to
the ontology, because the text contains an
instance that does not already exist in the ontology.
Issues (2): a common OMM API
• Two design approaches:
A. the “richest set of features” approach
pool experience, cover all the bases, be relevant to very many users
(“top-down”)
B. the “highest common factors” approach
analyse software, pick common features, create pluggability layer
(“bottom-up”)
• Both useful; can be combined
• Approach B. has some key advantages:
– leads to quicker version 1.0
– minimises arguments (criterion: a feature exists in several systems, not that it is “good”)
• Problems:
– features present in several places but not all – “operation not supported”?
– new work not prefigured in version 1.0 – roadmaps, placeholders
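A minimal sketch of approach B and its “operation not supported” problem. The `OntologyStore` interface, the two back ends and the `subClassOf` predicate name are hypothetical and do not correspond to the API of KAON, Sesame or any other real repository.

```python
from abc import ABC, abstractmethod


class OperationNotSupported(Exception):
    """Raised when a back end lacks a feature that only some systems offer."""


class OntologyStore(ABC):
    """Common layer: features shared by all back ends are abstract methods."""

    @abstractmethod
    def add_triple(self, subject, predicate, obj):
        ...

    @abstractmethod
    def triples(self):
        ...

    def infer_subclasses(self, cls):
        # Optional feature: present in several systems but not all.
        raise OperationNotSupported(type(self).__name__)


class PlainStore(OntologyStore):
    """Back end with storage only."""

    def __init__(self):
        self._triples = set()

    def add_triple(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def triples(self):
        return set(self._triples)


class InferencingStore(PlainStore):
    """Back end that also computes the transitive closure of subClassOf."""

    def infer_subclasses(self, cls):
        found, frontier = set(), [cls]
        while frontier:
            current = frontier.pop()
            for s, p, o in self._triples:
                if p == "subClassOf" and o == current and s not in found:
                    found.add(s)
                    frontier.append(s)
        return found


def subclasses_or_none(store, cls):
    """Client code must be ready for the optional operation to be missing."""
    try:
        return store.infer_subclasses(cls)
    except OperationNotSupported:
        return None
```

Clients written against `OntologyStore` run on either back end, at the price of handling the unsupported case explicitly wherever an optional feature is used.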
The end
• Tutorial on HLT for the Semantic Web at
European Semantic Web Symposium:
http://www.esws2004.org/
• These slides:
http://gate.ac.uk/sale/talks/ilash-semweb-mar2004.ppt
• More information:
http://gate.ac.uk/
http://nlp.shef.ac.uk/