GATE: an AKT success story
[GATE: open source language technology component architecture and many tools, with a number of AKT roles]
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Yorick Wilks
Southampton, January 2004
1. New GATE-related projects
2. Current state of the system
3. Future plans
New Projects
• SEKT: €9m IP with BT, AIFB, JSI, Empolis, SAI, OntoPrise, ISOCO, UB, Kea-Pro
• PrestoSpace: €9m IP with BBC, RAI, ORF, INA, ... (preservation of audio-visual media)
• KnowledgeWeb: NoE successor to OntoWeb
• ETCSL: GATE for humanities scholars
• hTechSight: petrochem tech oversight
• SWAN: large-scale semantic annotation
SEKT: large-scale DM + robust HLT for NGKM
KEY
• MNLG: Multilingual Natural Language Generation
• OBIE: Ontology-Based Information Extraction
• (MI)IE: Mixed-Initiative IE
• CLIE: Controlled Language IE
[Diagram labels: Human Language; Controlled Language; Formal Knowledge (ontologies and instance bases); Semantic Web; Semantic Grid; Semantic Web Services; processes: OBIE, (MI)IE, CLIE, (M)NLG]
SEKT: Evaluating Semantic Tagging
• Need for new metrics when evaluating hierarchy/ontology-based NE tagging
• Need to take into account distance in the hierarchy
• Tagging a company as a charity is less wrong than tagging it as a person
• Several SEKT-related initiatives (w/s at ECAI; Pascal network)
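To make the "distance in the hierarchy" point concrete, here is a minimal sketch of a distance-aware score. The toy class hierarchy and the 1/(1+distance) formula are illustrative assumptions, not a metric defined by SEKT or GATE.

```python
# Toy class hierarchy (child -> parent); hypothetical, for illustration only.
PARENT = {
    "Organization": "Entity",
    "Person": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(cls):
    """Return the path from cls up to the root, starting with cls itself."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges on the shortest path between two classes in the tree."""
    path_a = ancestors(a)
    path_b = ancestors(b)
    steps_in_b = {cls: i for i, cls in enumerate(path_b)}
    # Walk up from a until we reach a class that is also an ancestor of b.
    for steps_a, cls in enumerate(path_a):
        if cls in steps_in_b:
            return steps_a + steps_in_b[cls]
    raise ValueError("classes do not share a root")


def partial_credit(gold, predicted):
    """Assumed scoring rule: 1.0 for an exact match, lower as classes move apart."""
    return 1.0 / (1.0 + tree_distance(gold, predicted))
```

Under this assumed rule, partial_credit("Company", "Charity") is higher than partial_credit("Company", "Person"), which matches the intuition that the first error is less wrong than the second.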
PrestoSpace
• Cultural Heritage / Digital Libraries IP
• BBC, RAI, ORF, INA, B&G, USFD, and 23 others (!)
• 20th Century Rot: rapid disappearance of audio-visual media
• Preservation and digitisation is high cost
• Therefore we need rich metadata and semantic access
• Little training data, open domain: FSTs for users
• Follows MUMIS and other projects
• Evaluation: TRECVID, OBIE
GATE Status (version 2½)
• Stable core since end 2002
• Increasing numbers of users (next slide)
• Increasing numbers of languages (most recently: Chinese, Arabic, Russian, German system from DotKom)
• Increasing numbers of 3rd-party components (e.g. Medline and UMLS work, OBIE/KIM, QA, summarisation, ...)
• Embedded in KM applications
A bit of a nuisance (GATE users)
Thousands of users at hundreds of sites (based on survey of 4,700 downloaders).
A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Greenstone digital library, NZ
• Longman Pearson publishing, UK
• Merck KgAa, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Imperial College, London, the University of Manchester, UMIST, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, DotKom, AMITIES, Cub Reporter, EMILLE, Poesia...
GATE team projects.
Past:
• MUMIS: semantic index of sports video
• MUSE: cross-genre entity finder
• HSL: Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• EMILLE: S. Asian languages corpus
• ACE / TIDES: Arabic, Chinese NE
Present:
• Advanced Knowledge Technologies
• SEKT: next-generation KM
• PrestoSpace: audiovisual preservation
• KnowledgeWeb: semantic web network
• h-TechSight: technology oversight
• ETCSL: Sumerian language corpus
• SWAN: Semantic Web Annotator
• MiAKT: medical informatics KM
Some new stuff
• Johns Hopkins w/s on Semantic Annotation: BNC-based corpus, ME expts
• WEKA 2 release (JSI library integration soon)
• papers: RANLP, ISWC, Journal of Digital Libraries, Journal of Data and Knowledge Eng.
• JWS editorial board; co-editor JNLE special
• RANLP IE tutorial, tutorial on HLT/SW at ESWS
• HLT/SW evaluation workshop at ECAI
• OBIE in Multiflora, hTechSight
• SW NLG in MiAKT (below)
MIAKT – NLG for SW
RDF input from image annotation GUI... ...generated text
MIAKT has important productivity and accuracy implications
hTechSight tech oversight
• Ontology-Based IE (OBIE) for semantic tagging of job adverts, news and reports in chemical engineering domain
• Aim is to track technological change over time
• Centred around domain-specific ontology
• Terminological gazetteer lists are linked to classes in the ontology
• Rules classify the mentions in the text wrt. the domain ontology
• Annotations output to DB or RDF
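The gazetteer-to-ontology-to-RDF flow above can be sketched roughly as follows. This is not hTechSight or GATE code: the gazetteer entries, class names and example.org namespaces are hypothetical, and the rule stage is reduced to a direct lookup from term to class.

```python
import re

# Hypothetical gazetteer: term -> ontology class (not taken from hTechSight).
GAZETTEER = {
    "catalyst": "Material",
    "distillation": "Process",
    "process engineer": "JobRole",
}
ONTOLOGY_NS = "http://example.org/chem-onto#"  # placeholder namespace
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def annotate(text):
    """Find gazetteer terms in text; return sorted (start, end, term, class) tuples."""
    annotations = []
    lowered = text.lower()  # case-insensitive matching; offsets unchanged for ASCII
    for term, cls in GAZETTEER.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            annotations.append((match.start(), match.end(), term, cls))
    return sorted(annotations)


def to_ntriples(doc_id, annotations):
    """Serialise annotations as N-Triples lines, one rdf:type triple per mention."""
    lines = []
    for start, end, _term, cls in annotations:
        # Assumed IRI scheme: document id plus character offsets of the mention.
        subject = f"<http://example.org/doc/{doc_id}#char={start},{end}>"
        lines.append(f"{subject} {RDF_TYPE} <{ONTOLOGY_NS}{cls}> .")
    return lines
```

Calling annotate() on a job advert and passing the result to to_ntriples() yields one typed mention per line. A real system would replace the lookup with contextual rules and a proper RDF library.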
OBIE in MultiFlora 2
Combining Information Extraction and Knowledge Representation for Biodiversity Informatics
[Diagram labels: Varying plant taxa; Merged RDF]
BBSRC project led by Mary McGee Wood, U. Mcr.
GATE 4: the Final Conflict
• (GATE 3 release happening soonish)
• Continuity guaranteed for AKT phase 2 (€2 million GATE-related work 2004-2007)
• Some future elements:
  – more and better OBIE, inc. cross-doc co-reference
  – pluggable OWL repository support (now only Sesame; soon 3Store, KAON)
  – large- and huge-scale processing
  – standardisation of the component integration model (ECLIPSE)
  – service-based integration (“SDK” SW API)
• This talk: http://gate.ac.uk/sale/talks/akt-jan04.ppt
• What else? You tell us...