GATE and Unicode


A Unicode-based Environment for the Creation and Use of LRs
Valentin Tablan, Cristian Ursu, Kalina Bontcheva,
Hamish Cunningham, Diana Maynard, Oana Hamza,
Tony McEnery (1), Paul Baker (1), Mark Leisher (2)
Department of Computer Science, University of Sheffield
(1) University of Lancaster
(2) New Mexico State University
GATE (a General Architecture for Text Engineering) and ML LRs
1. Motivation (history of men’s underwear)
2. Short definition of GATE
3. GATE, Unicode and Java
4. EMILLE
1(11)
Motivation for Software Infrastructure for Language Engineering
Analogy with recent history of men’s underwear – also supportive infrastructure:
• The bad old days: Y-fronts: supportive, yes, but tended to be too constrictive
• The brave new world: boxer shorts: still supportive, but less constraining
The purpose of our work (the boxer shorts ideal):
freedom within a supportive environment
2(11)
GATE is:
• An architecture
A macro-level organisational picture for LE software systems.
• A framework
For programmers, GATE is an object-oriented class library that implements the
architecture.
• A development environment
For language engineers, computational linguists et al, GATE is a graphical
development environment bundled with a set of tools for doing e.g. Information
Extraction.
• Some free components... ...and wrappers for other people's components
• Tools for: evaluation; visualisation/edit; persistence; IR; IE; dialogue; ontologies;
etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/
3(11)
Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. v1 used LT-NSL for SGML
input; v2 talks to other XML-based systems, APIs and standards)
• (Almost) everything is a component, and component sets are user-extendable
Component-based development
• An OO way of chunking software: Java Beans
• GATE components: CREOLE = modified Java Beans (Collection of REusable
Objects for Language Engineering)
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.
4(11)
GATE Language Resources
GATE LRs are documents, ontologies, corpora, lexicons.
Documents / corpora:
• GATE documents loaded from local files or the web...
• Diverse document formats: text, html, XML, email, RTF, SGML.
Multilinguality:
• New internationalised versions of the JVM support >100 different encodings.
• Other encodings: a system for user entry of mapping tables is under development.
• LR persistence through XML, file datastore or databases (Oracle, PostgreSQL).
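As a minimal sketch of what the JVM's encoding support provides, the example below lists the encodings available on the running JVM and decodes the same word from two different byte encodings into one Unicode string. It uses only the standard `java.nio.charset` API; the class name `EncodingSketch` and the helper `decode` are illustrative and are not part of GATE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSketch {
    // Decode raw bytes into a Java String using a named encoding.
    // Charset.forName throws an exception if the JVM does not support the name.
    static String decode(byte[] raw, String encodingName) {
        return new String(raw, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // The number of encodings varies between JVMs and versions.
        System.out.println("Encodings on this JVM: " + Charset.availableCharsets().size());

        // "caf\u00e9" stored as ISO-8859-1 bytes and as UTF-8 bytes.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Once decoded, both byte sequences yield the same Unicode string.
        System.out.println(decode(latin1, "ISO-8859-1").equals(decode(utf8, "UTF-8")));
    }
}
```

After decoding, downstream components only ever see Unicode strings, so the original file encoding matters only at load time.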
5(11)
Processing Resources
Algorithmic components known as PRs: beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE
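To illustrate the idea of "beans with execute methods", here is a hedged sketch of a bean-style component: a no-argument constructor, setters for parameters, and an `execute()` method that does the work. The interface `SketchResource` and the class `WhitespaceSplitter` are hypothetical stand-ins written for this example; they are not GATE's actual interfaces, whose signatures are not given here.

```java
import java.util.ArrayList;
import java.util.List;

final class PrSketchDemo {
    public static void main(String[] args) {
        WhitespaceSplitter pr = new WhitespaceSplitter();
        // The data is supplied from outside; the code does not change per language.
        pr.setText("\u0939\u093f\u0928\u094d\u0926\u0940 text");
        pr.execute();
        System.out.println(pr.getTokens().size() + " pieces found");
    }
}

// Hypothetical contract for this sketch only (not GATE's API).
interface SketchResource {
    void execute();
}

// Bean-style component: parameters set through setters, results read after execute().
final class WhitespaceSplitter implements SketchResource {
    private String text = "";
    private final List<String> tokens = new ArrayList<>();

    public void setText(String text) { this.text = text; }

    public List<String> getTokens() { return tokens; }

    @Override
    public void execute() {
        tokens.clear();
        // Java strings are Unicode, so no script-specific handling is needed here.
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
    }
}
```

Keeping parameters in setters and logic in `execute()` is one way to get the code/data separation mentioned above: the same component can be repurposed by changing what it is configured with.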
Unicode Tokeniser
• splits text into typed tokens using a finite-state machine (FSM)
• the FSM is constructed dynamically from a set of rules based on the character
categories defined by the Unicode standard.
UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)*
> Token;orth=upperInitial;kind=word;
• output can be localised by a later module (e.g. “don’t” … “do” “n’t”)
• current status:
• 23 rules seem able to handle Indo-European languages without changes.
• the English tokeniser: Unicode tokeniser + pattern grammar FST.
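The rule shown above, UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)*, can be approximated with the Unicode general categories that Java exposes through `Character.getType`. The sketch below hand-codes a matcher for that single rule; it is an illustration under that assumption and does not reproduce GATE's rule compiler or FSM. The class and method names are hypothetical.

```java
final class UpperInitialRuleSketch {
    // Returns the end offset (exclusive) of a match starting at 'start',
    // or -1 if the text at 'start' does not begin with an uppercase letter.
    static int matchUpperInitial(String text, int start) {
        if (start < 0 || start >= text.length()) {
            return -1;
        }
        int cp = text.codePointAt(start);
        if (Character.getType(cp) != Character.UPPERCASE_LETTER) {
            return -1;
        }
        int pos = start + Character.charCount(cp);
        // Consume any run of lowercase letters or dash punctuation.
        while (pos < text.length()) {
            cp = text.codePointAt(pos);
            int type = Character.getType(cp);
            if (type != Character.LOWERCASE_LETTER && type != Character.DASH_PUNCTUATION) {
                break;
            }
            pos += Character.charCount(cp);
        }
        return pos;
    }

    public static void main(String[] args) {
        String english = "Sheffield-based tools";
        String russian = "\u041c\u043e\u0441\u043a\u0432\u0430"; // "Moskva" in Cyrillic
        System.out.println(english.substring(0, matchUpperInitial(english, 0)));
        System.out.println(matchUpperInitial(russian, 0));
    }
}
```

Because the matcher tests categories rather than specific characters, the same code accepts a Cyrillic upper-initial word as readily as an English one, which is the property that lets a small rule set cover many languages.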
6(11)
Displaying Multilingual Data (1)
GATE uses the standard (and imperfect) Java rendering engine to display text.
7(11)
Displaying Multilingual Data (2)
All the visualisation and editing tools for ML LRs use the same facilities.
8(11)
Editing Multilingual Data
• Java provides no special support for text input (this may change)
• GATE Unicode Kit (GUK) plugs this hole
• Support for defining additional Input Methods (IMs); currently 30 IMs for 17 languages
• Pluggable in other applications (e.g. MPI’s EUDICO)
• Can use virtual keyboard or standard layouts over QWERTY
• IMs defined in plain text files
• GUK comes with a standalone Unicode editor
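To make the idea of an input method defined in a plain text file concrete, here is a small sketch that loads a key-to-character table and applies it to typed keys. The file format used here (one `<key> <hex code point>` pair per line, with `#` comments) is invented for this example and is not GUK's actual format; the class and method names are likewise hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class InputMethodSketch {
    // Parses lines of the hypothetical form "<key> <hex code point>".
    // Blank lines, '#' comments and malformed lines are skipped;
    // a non-hex code point raises NumberFormatException.
    static Map<Character, String> parse(String definition) {
        Map<Character, String> table = new LinkedHashMap<>();
        for (String rawLine : definition.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2 || parts[0].length() != 1) {
                continue;
            }
            int codePoint = Integer.parseInt(parts[1], 16);
            table.put(parts[0].charAt(0), new String(Character.toChars(codePoint)));
        }
        return table;
    }

    // Maps each typed key through the table; unmapped keys pass through unchanged.
    static String type(Map<Character, String> table, String keys) {
        StringBuilder out = new StringBuilder();
        for (char key : keys.toCharArray()) {
            out.append(table.getOrDefault(key, String.valueOf(key)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String definition = "# toy Devanagari layout over QWERTY\nk 0915\na 093E\n";
        Map<Character, String> table = parse(definition);
        System.out.println(type(table, "ka").length() + " characters produced");
    }
}
```

Keeping the table in a text file means a new layout can be added by editing data rather than code.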
9(11)
EMILLE: Enabling Minority LE
3-year EPSRC project at Lancaster University and Sheffield University.
Corpus development:
• written language corpora of at least 9,000,000 words for Bengali, Gujarati,
Hindi, Panjabi, Singhalese, Tamil and Urdu.
• spoken corpora of at least 500,000 words per language.
Unicode developments for GATE:
• Indic keyboard layouts.
• encodings for Indic languages.
Development of basic LE tools:
• POS tagging.
• alignment tools for parallel corpora.
10(11)
Encore
http://gate.ac.uk/
Other GATE-related stuff at LREC:
• Saggion et al.: Extracting Information for MM Indexing [Weds, 19.05]
• Baker et al.: EMILLE [Thurs, 10.25]
• Demo and poster [Thurs, 11.00-12.20, session D1]
• Pastra et al.: Reuse of NE pattern grammars [Thurs, 16.20]
• Fliers
11(11)