
GATE technical workshop: introduction
http://gate.ac.uk/
http://nlp.shef.ac.uk/
Hamish Cunningham
Sheffield, March 17/18, 2004
Agenda
Wednesday (G22)
• 10.15: arrival, setup
• 10.30: introductions, summary of background / skills
• 10.40: mission, conventions, internal pages, GATE intro (hc)
• 11.30: tools: cvs, jbuilder, tkdiff, building GATE (vt)
• 12.00: break
• 12.15: intro to the GUI (dm)
• 1.30: lunch
• 2.30: annie, jape (dm)
• 4.00: break
• 4.15: summary of projects (hc)
• 5.30: close
Thursday (G30)
• 10.30: API, CREOLE lifecycle, java for jape [1] (vt)
• 12.00: break
• 12.15: tests, writing, running; API etc. [2] (hc, vt)
• 1.30: lunch
• 2.30: corpora, evaluation tools (dm, kb)
• 3.00: machine learning (vt)
• 4.00: break
• 4.15: ontologies (kb)
• 5.15: wrapup
• 5.30: close
Blah
• mission
• conventions
• mailing lists
• roles and responsibilities
GATE (the Volkswagen Beetle of Language Processing) is:
• Eight years old (!), with thousands of users at hundreds of sites
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., a graphical development environment.
• Some free components, and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/
A bit of a nuisance (our users)
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs: Melandra, SG-MediaStyle, ...
• Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, CubReporter, Poesia...
GATE team projects
Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• EMILLE: S. Asian language corpus
• ACE / TIDES: Arabic, Chinese NE
• JHU summer workshop on semantic tagging
Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• ETCSL: Sumerian digital library
• MiAKT: medical informatics / AKT
• SEKT: Semantic Knowledge Tech
• PrestoSpace: AV Preservation
• KnowledgeWeb; h-TechSight
Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI
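As a rough illustration of the "everything is a component" principle, here is a minimal Java sketch. It is not GATE code: the Component interface, the two toy components and the Pipeline class are hypothetical names invented for this example. The only point it makes is that the pipeline depends on the interface alone, so a user can add components without changing it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical interface (not part of GATE): one fixed entry point per component.
interface Component {
    String name();
    String process(String text);
}

// Toy component: lower-cases the text.
class Lowercaser implements Component {
    public String name() { return "lowercaser"; }
    public String process(String text) { return text.toLowerCase(Locale.ROOT); }
}

// Toy component: trims the text and collapses runs of whitespace.
class WhitespaceNormaliser implements Component {
    public String name() { return "whitespace-normaliser"; }
    public String process(String text) { return text.trim().replaceAll("\\s+", " "); }
}

// The pipeline knows only the Component interface, so the set of
// components can be extended without modifying this class.
class Pipeline {
    private final List<Component> components = new ArrayList<>();

    Pipeline add(Component c) {
        components.add(c);
        return this;
    }

    // Runs every component in the order it was added.
    String run(String text) {
        String result = text;
        for (Component c : components) {
            result = c.process(result);
        }
        return result;
    }

    int size() { return components.size(); }
}
```

Usage might look like `new Pipeline().add(new Lowercaser()).add(new WhitespaceNormaliser()).run(text)`; a graphical front end could call the same `run` method, which is one way to read "available both from API and GUI".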
All the world’s a Java Bean....
CREOLE: a Collection of REusable Objects for Language Engineering:
• GATE components: modified Java Beans with XML configuration
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
Why bother?
• Allows the system to load arbitrary language processing components
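To make the "load arbitrary components" idea concrete, here is a minimal sketch of bean-style loading in plain Java. It does not use the CREOLE API or its XML format: ToyTokeniser and ToyLoader are hypothetical names, and a Map of property names to values stands in for the XML configuration. The assumption is only that a component is a class with a no-argument constructor and setters, as with Java Beans.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical bean-style component: no-arg constructor plus a setter/getter pair.
class ToyTokeniser {
    private String separator = " ";

    public void setSeparator(String separator) { this.separator = separator; }
    public String getSeparator() { return separator; }

    // Splits on the literal separator (quoted so regex metacharacters are safe).
    public String[] tokenise(String text) {
        return text.split(Pattern.quote(separator));
    }
}

class ToyLoader {
    // Instantiates a class by name and applies String-valued properties
    // through the matching setX(String) methods. The params map is a
    // stand-in for configuration that a real system would read from a file.
    static Object load(String className, Map<String, String> params) {
        try {
            Object bean = Class.forName(className).getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String> e : params.entrySet()) {
                String key = e.getKey();
                String setter = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
                Method m = bean.getClass().getMethod(setter, String.class);
                m.invoke(bean, e.getValue());
            }
            return bean;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Could not load component " + className, ex);
        }
    }
}
```

Because the loader only needs a class name and a set of properties, it never has to know about ToyTokeniser at compile time; that is the property the slide's "Why bother?" bullet is pointing at.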
GATE APIs
[Figure: layered architecture diagram. Labels recovered from the slide; grouping by layer is approximate.]
• Document Format Layer (LRs): XML, HTML, PDF, … document formats, reading PDF, RTF, HTML and XML docs, email, …
• IDE GUI Layer (VRs): ADiff, OntolVR, DocVR, ...
• Application Layer: ANNIE, OBIE, …
• Processing Layer (PRs): NE, Co-ref, POS, TEs, TRs, …
• Language Resource Layer (LRs): Gazetteers, Ontology, Protégé, WordNet, ...
• Corpus Layer (LRs): Corpus, Document, Document Content, Annotation Set, Annotation, Feature Map
• DataStore Layer: XML, Oracle, PostgreSql, .ser
NOTES
• everything is a replaceable bean
• all communication via fixed APIs
• low coupling, high modularity, high extensibility
NOTES (2)
• e.g. Protégé LR & VR both wrapped in Resource (bean) API
• ontology repositories and inference should be the same: KAON + Sesame + Orenge + ?
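The Corpus Layer labels (Document, Document Content, Annotation Set, Annotation, Feature Map) suggest a stand-off annotation model: the text is stored once, and annotations point into it by character offset. The following is a minimal sketch of that idea only. ToyDocument and ToyAnnotation are hypothetical classes, not the GATE classes, and the real API differs in detail.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-off annotation: a type, a character span, and a feature map.
class ToyAnnotation {
    final String type;
    final int start;   // inclusive character offset
    final int end;     // exclusive character offset
    final Map<String, Object> features = new HashMap<>();

    ToyAnnotation(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

// Toy document: immutable content plus a list of annotations over it.
class ToyDocument {
    private final String content;
    private final List<ToyAnnotation> annotations = new ArrayList<>();

    ToyDocument(String content) { this.content = content; }

    // Adds an annotation after checking that the span lies inside the content.
    ToyAnnotation annotate(String type, int start, int end) {
        if (start < 0 || end > content.length() || start > end) {
            throw new IllegalArgumentException("Span out of range: " + start + "-" + end);
        }
        ToyAnnotation a = new ToyAnnotation(type, start, end);
        annotations.add(a);
        return a;
    }

    // Returns the text covered by an annotation.
    String textOf(ToyAnnotation a) { return content.substring(a.start, a.end); }

    // Returns all annotations of the given type, in insertion order.
    List<ToyAnnotation> byType(String type) {
        List<ToyAnnotation> result = new ArrayList<>();
        for (ToyAnnotation a : annotations) {
            if (a.type.equals(type)) {
                result.add(a);
            }
        }
        return result;
    }
}
```

Keeping annotations separate from the content means several processing components can add their own annotation types over the same text without rewriting it.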
Happy Birthday Valy!