
GATE, a General Architecture for Text Engineering
http://gate.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Valentin Tablan,
Diana Maynard, Yorick Wilks
Department of Computer Science,
University of Sheffield
UMIST
Friday November 29th 2002
Motivation for Software Infrastructure for Language Engineering
• Need for scalable, reusable, and portable HLT solutions
• Support for large data, in multiple media, languages, formats, and locations
• Lowering the cost of creation of new language processing components
• Promoting quantitative evaluation metrics via tools and a level playing field
2/29
Motivation (II): software lifecycle in collaborative research
Project Proposal: We love each other. We can work so well together. We can hold
workshops on Santorini together. We will solve all the problems of AI that our
predecessors were too stupid to.
Analysis and Design: Stop work entirely, for a period of reflection and
recuperation following the stress of attending the kick-off meeting in Luxembourg.
Implementation: Each developer partner tries to convince the others that program
X that they just happen to have lying around on a dusty disk-drive meets the
project objectives exactly and should form the centrepiece of the demonstrator.
Integration and Testing: The lead partner gets desperate and decides to hardcode the results for a small set of examples into the demonstrator, and have a failsafe crash facility for unknown input ("well, you know, it's still a prototype...").
Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard
problems, and how if we had another grant we could go on to transform information
processing the World over (or at least the European business travel industry).
3/29
GATE, a General Architecture for Text Engineering
• An architecture
A macro-level organisational picture for LE software systems.
• A framework
For programmers, GATE is an object-oriented class library that implements the
architecture.
• A development environment
For language engineers, computational linguists et al, GATE is a graphical
development environment bundled with a set of tools for doing e.g. Information
Extraction.
• Some free components... ...and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/
4/29
Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support,
integration of tools like Protégé, Jena and Weka)
• (Almost) everything is a component, and component sets are user-extendable
Component-based development
• An OO way of chunking software: Java Beans
• GATE components: CREOLE = modified Java Beans (Collection of REusable
Objects for Language Engineering)
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.
5/29
GATE Language Resources
GATE LRs are documents, ontologies, corpora, lexicons, ...
Documents / corpora:
• GATE documents loaded from local files or the web...
• Diverse document formats: text, html, XML, email, RTF, SGML.
Processing Resources
Algorithmic components known as PRs – beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE
• e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer;
DAML+OIL export; Information Retrieval based on Lucene
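
The split between code (processing resources) and data (language resources) can be illustrated with a minimal sketch. This is not GATE's API: GATE components are Java beans, and the names `Annotation`, `Document`, `WhitespaceTokeniser`, `GazetteerLookup` and `run_pipeline` are hypothetical, chosen only to show the idea of components that expose an execute method and add annotations to a document.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text, with a feature map."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Data only: the text plus its annotations (hypothetical model)."""
    text: str
    annotations: list = field(default_factory=list)


class WhitespaceTokeniser:
    """Toy processing resource: holds code, no document data."""

    def execute(self, doc):
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            end = start + len(word)
            doc.annotations.append(
                Annotation("Token", start, end, {"string": word}))
            pos = end


class GazetteerLookup:
    """Toy gazetteer: the word list is a parameter, so swapping the
    list repurposes the component without touching its code."""

    def __init__(self, entries):
        self.entries = entries  # maps surface string -> kind

    def execute(self, doc):
        tokens = [a for a in doc.annotations if a.type == "Token"]
        for tok in tokens:
            kind = self.entries.get(tok.features["string"])
            if kind is not None:
                doc.annotations.append(
                    Annotation("Lookup", tok.start, tok.end, {"kind": kind}))


def run_pipeline(resources, doc):
    """Run each resource's execute method over the document, in order."""
    for pr in resources:
        pr.execute(doc)
    return doc


doc = run_pipeline(
    [WhitespaceTokeniser(), GazetteerLookup({"Ltd": "companyDesignator"})],
    Document("Sirma AI Ltd"),
)
for ann in doc.annotations:
    print(ann.type, ann.start, ann.end, ann.features)
```

Because Python strings are Unicode, the same toy components work unchanged on non-Latin text; the sketch does not attempt to model GATE's document formats or persistence.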
6/29
7/29
Visual Resources
Displaying Coreference Information
8/29
Displaying Syntactic Information
9/29
Lexicon Support – WordNet example
10/29
A Language Analysis Example
[Diagram labels: HTML docs, XML docs, RTF docs, ...; GATE Format Handlers; Document content; Document metadata; Document format data; Linguistic data; ANNIE (POS tagger, Named entity, Coreference, ...); Custom application 1 (Named entity, Event extraction, ...); File storage; Relational Database (Oracle/PostgreSQL)]
11/29
Building IE Components in GATE (1)
The ANNIE system – a reusable and easily extendable set of components
12/29
Building IE Components in GATE (2)
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = { kind = company, rule = "Company1" }
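
To make the pattern in the Company1 rule concrete, here is a rough sketch of the same match written by hand. It is not how JAPE is implemented: it ignores rule priorities and JAPE's matching styles, and it flattens Token and Lookup annotations into one dictionary per token. The function names and the `lookup_kind` key are assumptions made for illustration.

```python
def match_company(tokens):
    """Return (start, end) token index pairs, end exclusive, for
    one or more upperInitial tokens followed by a company designator."""
    matches = []
    i = 0
    while i < len(tokens):
        best_end = None
        j = i
        # ( {Token.orthography == upperInitial} )+
        while j < len(tokens) and tokens[j].get("orthography") == "upperInitial":
            j += 1
            # {Lookup.kind == companyDesignator} must follow the run
            if j < len(tokens) and tokens[j].get("lookup_kind") == "companyDesignator":
                best_end = j + 1  # keep the longest match seen so far
        if best_end is None:
            i += 1
        else:
            matches.append((i, best_end))
            i = best_end
    return matches


def annotate_companies(tokens):
    """Build one NamedEntity record per match, as the rule's right-hand side does."""
    return [
        {
            "type": "NamedEntity",
            "kind": "company",
            "rule": "Company1",
            "text": " ".join(t["string"] for t in tokens[start:end]),
        }
        for start, end in match_company(tokens)
    ]


tokens = [
    {"string": "shares", "orthography": "lowercase"},
    {"string": "of", "orthography": "lowercase"},
    {"string": "Canon", "orthography": "upperInitial"},
    {"string": "Europe", "orthography": "upperInitial"},
    {"string": "Ltd", "orthography": "upperInitial",
     "lookup_kind": "companyDesignator"},
    {"string": "rose", "orthography": "lowercase"},
]
print(annotate_companies(tokens))
```

The hand-written version is several times longer than the rule it imitates, which is the point of the "low-overhead development" claim: the declarative rule states the pattern and leaves the traversal to the engine.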
13/29
Performance Evaluation
• At document level – annotation diff
• At corpus level – corpus benchmark tool – tracking system’s performance over time
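
As an illustration of what a document-level annotation diff measures, the sketch below compares a set of hand-made "key" annotations with a system's "response" and reports precision, recall and F1. It assumes strict matching (same type and same offsets) and micro-averaging at corpus level; these choices and the function names are assumptions for the example, not a description of GATE's own tools.

```python
def annotation_diff(key, response):
    """Compare two collections of (type, start, end) annotations."""
    key_set, response_set = set(key), set(response)
    correct = len(key_set & response_set)
    precision = correct / len(response_set) if response_set else 0.0
    recall = correct / len(key_set) if key_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"correct": correct, "precision": precision,
            "recall": recall, "f1": f1}


def corpus_scores(documents):
    """Micro-average over (key, response) pairs, one pair per document."""
    correct = n_key = n_response = 0
    for key, response in documents:
        key_set, response_set = set(key), set(response)
        correct += len(key_set & response_set)
        n_key += len(key_set)
        n_response += len(response_set)
    precision = correct / n_response if n_response else 0.0
    recall = correct / n_key if n_key else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


key = [("Company", 0, 5), ("Person", 10, 15), ("Company", 20, 30)]
response = [("Company", 0, 5), ("Person", 10, 14)]
print(annotation_diff(key, response))
```

Storing the corpus-level scores from each run and comparing them with the previous run is one simple way to track a system's performance over time.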
14/29
Regression Testing – Corpus Benchmark Tool
15/29
The Semantic Web and GATE
GATE is being used for development of (semi-)automatic
methods for:
• linking web pages to Ontologies using Information
Extraction;
• learning and evolving Ontologies via IE and lexical
semantic network traversal.
16/29
Populating Ontologies with IE
17/29
Protégé and Ontology Management
18/29
Information Retrieval Support
Based on the Lucene IR engine
19/29
Editing Multilingual Data
GATE Unicode Kit (GUK)
Java provides no special support for text input (this may change)
• Support for defining additional
Input Methods (IMs)
• currently 30 IMs
for 17 languages
• Pluggable in other
applications
20/29
Processing Multilingual Data
All the visualisation and editing tools for ML LRs use enhanced Java facilities:
21/29
Dialogue Systems
• GATE is being used in the Amities project for automating
call centres
• Creation of dialogue processing server components to run
in the Galaxy Communicator architecture
• Easy adaptation of the portable IE components to work on
noisy ASR output
• Robustness and speed of GATE components vital for realtime dialogue systems
22/29
Applications
GATE has been used for a variety of applications, including:
• MUMIS: automatic creation of semantic indexes for multimedia programme material
• MUSE: a multi-genre IE system
• EMILLE: a 70 million word corpus of Indic languages
• Metadata for Medline (at Merck)
• ACE: participation in the Automatic Content Extraction programme
• HSE: summarisation of health and safety information from company reports
• OldBaileyIE: NE recognition on 17th century Old Bailey Court reports.
• AKT: language technology in knowledge management
• AMITIES: call centre automation
• Various Medical Informatics and database technology projects
• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and
French (Arabic, Chinese and Russian next year)
23/29
Some users…
At time of writing a representative fraction of GATE users
includes:
• Longman Pearson publishing, UK;
• Merck KGaA, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN;
• Sirma AI Ltd., Bulgaria;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester,
the University of Karlsruhe, Vassar College, the University
of Southern California and a large number of other UK, US
and EU Universities;
• the Perseus Digital Library project, Tufts University, US.
24/29
The MUMIS project
• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme
from multiple sources in different languages
• ASR, video processing, information extraction
(Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield,
University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• Yorick Wilks, Hamish Cunningham, Horacio Saggion,
Kalina Bontcheva, Diana Maynard, Oana Hamza, Cristian
Ursu
25/29
The Whole Picture
[Diagram labels: Ontology & Lexicon; Sources (Formal Text and Text, in DE, NL and EN); IE; Annotations; Merging; Final Annotations; Video & Audio Signal; Speech Signals; ASR; Transcriptions; Multimedia Data Base; Query; Results; User Interface]
26/29
User Interface
27/29
Play
28/29
Conclusion
GATE: an infrastructure that lowers the overhead of
creating & embedding robust NLP components
Further information: http://gate.ac.uk/
• Online demos, tutorials and documentation
• Software downloads
• Talks and papers
29/29