
GATE: an AKT success story

[GATE: open source language technology component architecture and many tools, with a number of AKT roles]

http://gate.ac.uk/
http://nlp.shef.ac.uk/

Hamish Cunningham, Kalina Bontcheva, Yorick Wilks

Southampton, January 2004

1. New GATE-related projects
2. Current state of the system
3. Future plans


New Projects

• SEKT: €9m IP with BT, AIFB, JSI, Empolis, SAI, OntoPrise, ISOCO, UB, Kea-Pro
• PrestoSpace: €9m IP with BBC, RAI, ORF, INA, ...: preservation of audio-visual media
• KnowledgeWeb: NoE successor to OntoWeb
• ETCSL: GATE for humanities scholars
• hTechSight: petrochem tech oversight
• SWAN: large-scale semantic annotation

SEKT: large-scale DM + robust HLT for NGKM

[Diagram: Human Language and Controlled Language, linked to Formal Knowledge (ontologies and instance bases) by the components OBIE, (MI)IE, CLIE and (M)NLG. Formal Knowledge is labelled: Semantic Web; Semantic Grid; Semantic Web Services.]

KEY
• MNLG: Multilingual Natural Language Generation
• OBIE: Ontology-Based Information Extraction
• (MI)IE: Mixed-Initiative IE
• CLIE: Controlled Language IE

SEKT: Evaluating Semantic Tagging

• Need for new metrics when evaluating hierarchy/ontology-based NE tagging
• Need to take into account distance in the hierarchy
• Tagging a company as a charity is less wrong than tagging it as a person
• Several SEKT-related initiatives (workshop at ECAI; Pascal network)
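The idea of hierarchy-aware scoring can be sketched as follows. This is a minimal illustration, not the metric used in SEKT: the toy class hierarchy, the function names and the 1/(1 + distance) scoring formula are all assumptions chosen to show how a near miss (company tagged as charity) can earn more credit than a far miss (company tagged as person).

```python
# Toy class hierarchy (hypothetical, for illustration only): child -> parent.
PARENT = {
    "Organization": "Entity",
    "Person": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(cls):
    """Return the path from cls up to the root, starting with cls itself."""
    path = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        path.append(cls)
    return path


def hierarchy_distance(a, b):
    """Count the edges between a and b via their lowest common ancestor."""
    path_a = ancestors(a)
    path_b = ancestors(b)
    depth_in_b = {cls: i for i, cls in enumerate(path_b)}
    for i, cls in enumerate(path_a):
        if cls in depth_in_b:
            return i + depth_in_b[cls]
    raise ValueError(f"{a!r} and {b!r} share no ancestor")


def tag_score(gold, predicted):
    """Partial credit: 1.0 for an exact match, decaying with distance (assumed formula)."""
    return 1.0 / (1.0 + hierarchy_distance(gold, predicted))


if __name__ == "__main__":
    for predicted in ("Company", "Charity", "Person"):
        print(predicted, tag_score("Company", predicted))
```

Under these assumptions a flat exact-match metric would score both errors as zero, whereas the distance-based score ranks the charity error above the person error.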

PrestoSpace

• Cultural Heritage / Digital Libraries IP
• BBC, RAI, ORF, INA, B&G, USFD, and 23 others (!)
• 20th Century Rot: rapid disappearance of audio-visual media
• Preservation and digitisation are high cost
• Therefore we need rich metadata and semantic access
• Little training data, open domain: FSTs for users
• Follows MUMIS and other projects
• Evaluation: TRECVID, OBIE

GATE Status (version 2½)

• Stable core since end 2002
• Increasing numbers of users (next slide)
• Increasing numbers of languages (most recently: Chinese, Arabic, Russian, German system from DotKom)
• Increasing numbers of 3rd-party components (e.g. Medline and UMLS work, OBIE/KIM, QA, summarisation, ...)
• Embedded in KM applications

A bit of a nuisance (GATE users)

Thousands of users at hundreds of sites (based on survey of 4,700 downloaders).

A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Greenstone digital library, NZ
• Longman Pearson publishing, UK
• Merck KgAa, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Imperial College, London, the University of Manchester, UMIST, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, DotKom, AMITIES, Cub Reporter, EMILLE, Poesia...

GATE team projects

Past:
• MUMIS: semantic index of sports video
• MUSE: cross-genre entity finder
• HSL: Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• EMILLE: S. Asian languages corpus
• ACE / TIDES: Arabic, Chinese NE

Present:
• Advanced Knowledge Technologies
• SEKT: next-generation KM
• PrestoSpace: audiovisual preservation
• KnowledgeWeb: semantic web network
• h-TechSight: technology oversight
• ETCSL: Sumerian language corpus
• SWAN: Semantic Web Annotator
• MiAKT: medical informatics KM

Some new stuff

• Johns Hopkins workshop on Semantic Annotation: BNC-based corpus, ME experiments
• WEKA 2 release (JSI library integration soon)
• Papers: RANLP, ISWC, Journal of Digital Libraries, Journal of Data and Knowledge Eng.
• JWS editorial board; co-editor JNLE special
• RANLP IE tutorial, tutorial on HLT/SW at ESWS
• HLT/SW evaluation workshop at ECAI
• OBIE in Multiflora, hTechsight
• SW NLG in MiAKT (below)

MIAKT – NLG for SW

RDF input from image annotation GUI...

...generated text

MIAKT has important productivity and accuracy implications

hTechSight tech oversight

• Ontology-Based IE (OBIE) for semantic tagging of job adverts, news and reports in the chemical engineering domain
• Aim is to track technological change over time
• Centred around a domain-specific ontology
• Terminological gazetteer lists are linked to classes in the ontology
• Rules classify the mentions in the text with respect to the domain ontology
• Annotations output to DB or RDF
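The gazetteer-to-ontology flow described above can be sketched roughly as follows. This is an illustrative stand-in only: it does not use GATE's gazetteer or rule components, and the gazetteer entries, class names, `example.org` namespaces and mention-URI scheme are all invented for the example. The only "rule" here is a direct term lookup; real rules would also use context.

```python
import re

# Hypothetical gazetteer: surface term -> ontology class (entries are invented).
GAZETTEER = {
    "catalytic cracking": "Technology",
    "process engineer": "JobRole",
    "distillation": "Technology",
}

ONT = "http://example.org/chem-ontology#"  # placeholder namespace (assumption)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def annotate(text):
    """Find gazetteer terms in text; return (start, end, term, class) sorted by offset."""
    found = []
    lowered = text.lower()
    for term, cls in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            found.append((match.start(), match.end(), term, cls))
    return sorted(found)


def to_ntriples(doc_id, annotations):
    """Serialise each annotation as one N-Triples line typing a mention URI."""
    lines = []
    for start, end, _term, cls in annotations:
        # Mention URI scheme is an assumption made for this sketch.
        mention = f"<http://example.org/doc/{doc_id}#char={start},{end}>"
        lines.append(f"{mention} <{RDF_TYPE}> <{ONT}{cls}> .")
    return lines


if __name__ == "__main__":
    advert = "Process engineer needed with distillation experience."
    for line in to_ntriples("ad1", annotate(advert)):
        print(line)
```

Because each annotation carries a class from the ontology rather than a flat label, the resulting triples could be aggregated by class and date to follow how often a technology or job role is mentioned over time.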

OBIE in MultiFlora 2

Combining Information Extraction and Knowledge Representation for Biodiversity Informatics

[Diagram labels: Varying plant taxa; Merged RDF]

BBSRC project led by Mary McGee Wood, University of Manchester

GATE 4: the Final Conflict

• (GATE 3 release happening soonish)
• Continuity guaranteed for AKT phase 2 (€2 million GATE-related work 2004-2007)
• Some future elements:
  – more and better OBIE, inc. cross-doc co-reference
  – pluggable OWL repository support (now only Sesame; soon 3Store, KAON)
  – large- and huge-scale processing
  – standardisation of the component integration model (ECLIPSE)
  – service-based integration (“SDK” SW API)
• This talk: http://gate.ac.uk/sale/talks/akt-jan04.ppt
• What else? You tell us...