The Semantic Web and Language Technology
BT Exact, Martlesham
Hamish Cunningham
Department of Computer Science,
University of Sheffield
Friday October 11th 2002
• Next generation web
• GATE, language technology infrastructure
1(19)
A Ubiquitous Permeable Web
The next generation of the web must be:
• ubiquitous: semantics for every device, every organisation, every individual;
• permeable: allow contextual data to penetrate and persist;
• companionable: able to engage with us via multiple natural modalities.
Roles for Language Technology:
• discovery of semantics (ubiquity);
• mediating between context and personal semantic memories (permeability);
• conversing with people and the semantic web (companionableness).
2(19)
Critical Mass for the Semantic Web
The SW: machine-processable, repurposable data to complement hypertext
But: semantics = 0.0000000...% of the Web
How to achieve critical mass? Huge scale automatic annotation. Requirements:
• Huge scale:
– freely available to all EU citizens
– distributed (over a Grid)
– re-purposeable (delivered as Web Services)
• Portability and robustness via:
– simple and therefore shallow HLT methods
– learning from positive and negative examples
– analogs of IPSEs for computer-literate users
3 (19)
Motivation for Software Infrastructure
for Language Engineering
• Need for scalable, reusable, and portable HLT solutions
• Support for large data, in multiple media, languages, formats, and locations
• Lowering the cost of creation of new language processing components
• Promoting quantitative evaluation metrics via tools and a level playing field
4 (19)
Motivation (II):
5 (19)
GATE, a General Architecture for Text Engineering
• An architecture
A macro-level organisational picture for LE software systems.
• A framework
For programmers, GATE is an object-oriented class library that implements the
architecture.
• A development environment
For language engineers, computational linguists et al., GATE is a graphical
development environment bundled with a set of tools for doing e.g. Information
Extraction.
• Some free components, and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/
6 (19)
Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support,
integration of tools like Protégé, Jena and Weka)
• (Almost) everything is a component, and component sets are user-extendable
Component-based development
• An OO way of chunking software: Java Beans
• GATE components: CREOLE = modified Java Beans (Collection of REusable
Objects for Language Engineering)
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.
7 (19)
GATE Language Resources
GATE LRs are documents, ontologies, corpora, lexicons, ...
Documents / corpora:
• GATE documents loaded from local files or the web...
• Diverse document formats: text, html, XML, email, RTF, SGML.
Processing Resources
Algorithmic components known as PRs – beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE
• e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer;
DAML+OIL export; Information Retrieval based on Lucene
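The idea of a PR as a bean with an execute method can be illustrated with a minimal sketch. The types below (SimpleDocument, Annotation, ProcessingResource, WhitespaceTokeniser) are hypothetical stand-ins written for this illustration, not GATE's own classes, whose interfaces differ in detail. The sketch shows a parameter set through a bean-style setter, processing triggered through execute(), and results stored as annotations beside the text rather than inside it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not GATE classes.
final class PrSketch {

    /** A typed span of text, stored separately from the text itself. */
    static final class Annotation {
        final String type;
        final int start;
        final int end;

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** A document: fixed content plus a growing list of annotations. */
    static final class SimpleDocument {
        final String content;
        final List<Annotation> annotations = new ArrayList<>();

        SimpleDocument(String content) {
            this.content = content;
        }
    }

    /** A processing resource: configured via setters, run via execute(). */
    interface ProcessingResource {
        void setDocument(SimpleDocument document);

        void execute();
    }

    /** Example PR that marks each whitespace-delimited run as a Token. */
    static final class WhitespaceTokeniser implements ProcessingResource {
        private SimpleDocument document;

        @Override
        public void setDocument(SimpleDocument document) {
            this.document = document;
        }

        @Override
        public void execute() {
            if (document == null) {
                throw new IllegalStateException("document parameter not set");
            }
            String text = document.content;
            int i = 0;
            while (i < text.length()) {
                // Skip whitespace between tokens.
                while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                int start = i;
                // Consume one run of non-whitespace characters.
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                if (i > start) {
                    document.annotations.add(new Annotation("Token", start, i));
                }
            }
        }
    }

    public static void main(String[] args) {
        SimpleDocument doc = new SimpleDocument("GATE is free software");
        ProcessingResource pr = new WhitespaceTokeniser();
        pr.setDocument(doc);
        pr.execute();
        for (Annotation a : doc.annotations) {
            System.out.println(a.type + " [" + a.start + "," + a.end + ") "
                    + doc.content.substring(a.start, a.end));
        }
    }
}
```

Because the tokeniser only reads the document and adds annotations, the same component can be pointed at any document without changing its code, which is the kind of repurposing the code/data distinction is meant to allow.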
8 (19)
A Language Analysis Example
[Diagram; labels only: HTML docs, XML docs, RTF docs; GATE Format Handlers; document content, document metadata, document format data, linguistic data; processing components (ANNIE, Custom application 1): POS tagger, named entity, coreference, event extraction, ...; storage: relational database (Oracle/PostgreSQL), file storage.]
10(11)
Building IE Components in GATE (1)
The ANNIE system – a reusable and easily extendable set of components
11 (19)
Building IE Components in GATE (2)
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
Rule: Company1
Priority: 25
(
( {Token.orthography == upperInitial} )+
{Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = { kind = company, rule = "Company1" }
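What the Company1 rule matches can be mimicked in plain Java. This is only a sketch of the pattern's meaning under simplifying assumptions, not how JAPE is implemented: the Token class, its orthography and lookupKind fields, and the findCompanies method are invented for this illustration, and a Lookup is modelled as a property of a single token rather than as a separate annotation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the Company1 pattern; not part of GATE or JAPE.
final class Company1Sketch {

    static final class Token {
        final String text;
        final String orthography; // e.g. "upperInitial" or "lowercase"
        final String lookupKind;  // e.g. "companyDesignator", or null if no Lookup

        Token(String text, String orthography, String lookupKind) {
            this.text = text;
            this.orthography = orthography;
            this.lookupKind = lookupKind;
        }
    }

    /** Finds spans of one or more upperInitial tokens followed by a company designator. */
    static List<String> findCompanies(List<Token> tokens) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            // j = first index at or after i whose token is not upperInitial.
            int j = i;
            while (j < tokens.size() && "upperInitial".equals(tokens.get(j).orthography)) {
                j++;
            }
            // Rightmost designator that has at least one upperInitial token before it.
            int end = -1;
            for (int k = Math.min(j, tokens.size() - 1); k > i; k--) {
                if ("companyDesignator".equals(tokens.get(k).lookupKind)) {
                    end = k;
                    break;
                }
            }
            if (end < 0) {
                i++; // no match starting here; try the next token
                continue;
            }
            StringBuilder span = new StringBuilder();
            for (int t = i; t <= end; t++) {
                if (t > i) {
                    span.append(' ');
                }
                span.append(tokens.get(t).text);
            }
            matches.add(span.toString());
            i = end + 1; // resume after the matched span
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("shares", "lowercase", null),
                new Token("of", "lowercase", null),
                new Token("Acme", "upperInitial", null),
                new Token("Widgets", "upperInitial", null),
                new Token("Ltd", "upperInitial", "companyDesignator"),
                new Token("rose", "lowercase", null));
        for (String company : findCompanies(tokens)) {
            System.out.println(company); // prints: Acme Widgets Ltd
        }
    }
}
```

The sketch takes the longest match from each start position and then resumes after it. It does not model rule priorities or a cascade of phases, where the annotations produced by one grammar become input to the next.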
12 (19)
The Semantic Web and GATE
GATE is being used for development of (semi-)automatic
methods for:
• linking web pages to Ontologies using Information
Extraction;
• learning and evolving Ontologies via IE and lexical
semantic network traversal.
13 (19)
Populating Ontologies with IE
Protégé and Ontology Management
Information Retrieval Support
Based on the Lucene IR engine
16 (19)
Displaying Multilingual Data
All the visualisation and editing tools for ML LRs use enhanced Java facilities:
17 (19)
Applications
GATE has been used for a variety of applications, including:
• MUMIS: automatic creation of semantic indexes for multimedia programme material
• MUSE: a multi-genre IE system
• Metadata for Medline (at Merck)
• ACE: participation in the Automatic Content Extraction programme
• HSE: summarisation of health and safety information from company reports
• OldBaileyIE: NE recognition on 17th-century Old Bailey court reports
• Various Medical Informatics and database technology projects
• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and
French (Arabic, Chinese and Russian this autumn)
18 (19)
Conclusion
GATE: an infrastructure that lowers the overhead of
creating & embedding robust NLP components
Further information: http://gate.ac.uk/
• Online demos, tutorials and documentation
• Software downloads
• Talks and papers
19 (19)