GATE technical workshop: introduction
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
Sheffield, March 17/18, 2004

Agenda

Wednesday (G22)
• 10.15: arrival, setup
• 10.30: introductions, summary of background / skills
• 10.40: mission, conventions, internal pages, GATE intro (hc)
• 11.30: tools: cvs, jbuilder, tkdiff, building GATE (vt)
• 12.00: break
• 12.15: intro to the GUI (dm)
• 1.30: lunch
• 2.30: annie, jape (dm)
• 4.00: break
• 4.15: summary of projects (hc)
• 5.30: close

Thursday (G30)
• 10.30: API, CREOLE lifecycle, java for jape [1] (vt)
• 12.00: break
• 12.15: tests, writing, running; API etc. [2] (hc, vt)
• 1.30: lunch
• 2.30: corpora, evaluation tools (dm, kb)
• 3.00: machine learning (vt)
• 4.00: break
• 4.15: ontologies (kb)
• 5.15: wrapup
• 5.30: close

Blah
• mission
• conventions
• mailing lists
• roles and responsibilities

GATE (the Volkswagen Beetle of Language Processing) is:
• Eight years old (!), with 000s of users at 00s of sites
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al, a graphical development environment.
• Some free components... ...and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/

A bit of a nuisance (our users)
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs: Melandra, SG-MediaStyle, ...
• Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, CubReporter, Poesia...

GATE team projects

Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research escience
• EMILLE: S. Asian language corpus
• ACE / TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging

Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• ETCSL: Sumerian digital library
• MiAKT: medical informatics / AKT
• SEKT: Semantic Knowledge Tech
• PrestoSpace: AV Preservation
• KnowledgeWeb; h-TechSight

Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI

All the world’s a Java Bean....
CREOLE: a Collection of REusable Objects for Language Engineering:
• GATE components: modified Java Beans with XML configuration
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
Why bother?
• Allows the system to load arbitrary language processing components

GATE APIs
[Diagram: the GATE APIs as a stack of layers. Labels recovered from the slide, grouping inferred from the flattened text:]
• Inputs: PDF docs, RTF docs, HTML docs, XML docs, email, …
• Document Format Layer (LRs): XML Document Format, HTML Document Format, PDF Document Format, …
• GUI Layer (VRs): ADiff, OntolVR, DocVR, ...
• Application Layer: IDE, ANNIE, OBIE, ...
• Processing Layer (PRs): NE, Co-ref, POS, TEs, TRs, …
• Language Resource Layer (LRs): Gazetteers, Ontology, Protégé Ontology, Wordnet, ...
• Corpus Layer (LRs): Corpus, Document, Document Content, Annotation Set, Annotation, Feature Map, …
• DataStore Layer: XML, Oracle, PostgreSql, .ser

NOTES
• everything is a replaceable bean
• all communication via fixed APIs
• low coupling, high modularity, high extensibility

NOTES (2)
• e.g. Protégé LR & VR both wrapped in Resource (bean) API
• ontology repositories and inference should be the same: KAON + Sesame + Orenge + ?

Happy Birthday Valy!
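
As a footnote to the CREOLE slide ("modified Java Beans with XML configuration", loaded by the system at run time), here is a minimal sketch of the general idea in plain Java. It does not use the GATE class library: the `Resource` interface, the `ShoutingTagger` component, and the `load` helper are all hypothetical names invented for this illustration, and the parameter map stands in for what an XML configuration file would supply.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class CreoleSketch {

    /** Hypothetical fixed component API (an assumption, not GATE's real interface). */
    public interface Resource {
        void setParameter(String name, String value);
        String execute(String text);
    }

    /** A toy component: upper-cases its input and prepends a configurable prefix. */
    public static class ShoutingTagger implements Resource {
        private String prefix = "";

        public ShoutingTagger() { }  // bean-style no-argument constructor

        @Override
        public void setParameter(String name, String value) {
            if ("prefix".equals(name)) {
                prefix = value;
            }
        }

        @Override
        public String execute(String text) {
            return prefix + text.toUpperCase(Locale.ROOT);
        }
    }

    /**
     * Instantiates a component from its class name and applies its parameters.
     * The caller only ever sees the Resource interface, so any class that
     * implements it can be plugged in without changing this code.
     */
    public static Resource load(String className, Map<String, String> params)
            throws Exception {
        Object instance = Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
        Resource resource = (Resource) instance;
        for (Map.Entry<String, String> e : params.entrySet()) {
            resource.setParameter(e.getKey(), e.getValue());
        }
        return resource;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for configuration that would normally come from a file.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("prefix", ">> ");

        Resource r = load(ShoutingTagger.class.getName(), params);
        System.out.println(r.execute("hello gate"));  // prints ">> HELLO GATE"
    }
}
```

The point of the design is that the loader depends only on the fixed interface and a class name, which is what lets a system of this kind pick up arbitrary components it was not compiled against.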