Experiences with UIMA from a User’s Perspective

Dietmar Rösner,
Manuela Kunze,
Hany Mahgoub
University of Magdeburg, Knowledge Based Systems and Document Processing
Overview
• Introduction
• GATE
• UIMA
• Conclusion
Rösner, Kunze, Mahgoub: Experiences with UIMA from a User’s Perspective
Introduction
• November 2005; Version 1.2.3 of UIMA is available
"IBM’s Unstructured Information Management Architecture
(UIMA) is an architecture and software framework for
creating, discovering, composing and deploying a broad
range of multi-modal analysis capabilities and integrating
them with search technologies."
Introduction
really?
Introduction
• similarity/comparison of GATE and UIMA
– frameworks
– results are documents + annotations
– pipeline processing
• steps:
– task definition
– one corpus
Evaluation Topics/Points
• ease of getting acquainted with the system?
– quality of documentation: completeness, clarity, up to date, …?
– tutorials, use cases, …?
• processing and linguistic resources?
– lexica, gazetteer lists, tools
• tools for resource maintenance and extension?
– quality: self-explanatory, robust, comfortable
• speed of processing?
• single docs vs. large corpora?
• limitations, suggestions for improvement?
• support for im-/export of a variety of document formats?
Task of the Experiment
• process a corpus of websites
– to detect and extract information relevant for tourists
• opening times of museum, prices of hotels,…
• corpus:
– 30 tourism web sites of Egypt
– additional 20 web sites of Washington, New York, London
• output:
– Prolog facts for a reasoner
– Questions:
• Which museum is now open?
• …
Excerpts from the Corpus
• The Egyptian Museum is open the hours: 9am-5pm daily
• The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm
• Palace Museum is open the hours: 8am-5:30pm
(summer) 8am-4:30pm (winter)
• 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri
• …
Overview
• Introduction
• GATE
• UIMA
• Conclusion
GATE: General Architecture for Text
Engineering
• a suite of tools for language processing and information extraction
• rule-based modular IE system (ANNIE)
• language and domain-independent processing resources
• open and extensible architecture
• aims to provide uniform access to various linguistic and ontological
resources
• http://gate.ac.uk/
GATE: General Architecture for Text
Engineering
• a software infrastructure for NLP researchers; based on
three main elements:
– an architecture
• describing the components composing a language processing
system
– a framework
• could be used as a basis for building such systems
– a graphical development environment
• a set of tools and
• components for language engineers
GATE: General Architecture for Text
Engineering
• GATE is distributed with an IE system called ANNIE
– relies on finite state algorithms and the Java Annotation Pattern Engine (JAPE) language
– comprises a set of core Processing Resources (PRs):
• Tokeniser
• Gazetteers
• POS tagger
• Sentence Splitter
• Semantic Tagger (JAPE transducer)
• Orthomatcher (orthographic coreference)
• …
GATE: ANNIE
[Cunningham et al.: Developing Language Processing Components with GATE; Version 3 (a User Guide)]
GATE Application
• several Processing Resources: Tokenizer, Hash Gazetteer (with new/extended Gazetteer lists), JAPE Transducer
• input text (excerpt):
...
*The Military Museum*
Summer: 8am-5:30pm; Winter: 9pm-5pm …
• pipeline: ANNIE English Tokenizer → Gazetteer lists → JAPE Transducer
• JAPE rules annotate:
– names of museums, fragments of times and restrictions
– intervals of times and restrictions
– museum
Museum information in JAPE
Rule: egyptmuseums
(
  ({SpaceToken})
  ({Token.kind == word})
  ({SpaceToken})
  {Lookup.majorType == org_base} // from gazetteer lists
  ({SpaceToken})?
  (({Token.kind == punctuation})|({Token.kind == word})|({SpaceToken}))*
  ({timeinfo}) // annotation by JAPE rules
)
:museum
-->
:museum.sight = {rule = "egyptmuseums"}

The JAPE transducer detects timeinfo, defined by JAPE patterns like:
• 9am-5pm, 6pm-9pm
• 8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm
• 5:00PM-7:00PM, 10:00am-5:00pm
• …
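The time patterns on this slide can also be approximated outside JAPE with an ordinary regular expression. The following is a minimal Python sketch, not the authors' JAPE grammar: the names TIME, INTERVAL, to_24h and find_intervals are hypothetical, and the pattern only covers the am/pm notations listed above.

```python
import re

# One clock time such as "9am", "4:30pm" or "5:00PM".
TIME = r"\d{1,2}(?::\d{2})?\s*[ap]m"
# Two clock times joined by a hyphen or an en dash.
INTERVAL = re.compile(rf"({TIME})\s*[-\u2013]\s*({TIME})", re.IGNORECASE)


def to_24h(text):
    """Convert a time like '4:30pm' to 'HH:MM:SS' in 24-hour notation."""
    m = re.fullmatch(r"(\d{1,2})(?::(\d{2}))?\s*([ap]m)", text.strip(), re.IGNORECASE)
    hour, minute, half = int(m.group(1)), int(m.group(2) or 0), m.group(3).lower()
    if half == "pm" and hour != 12:
        hour += 12
    elif half == "am" and hour == 12:
        hour = 0
    return f"{hour:02d}:{minute:02d}:00"


def find_intervals(text):
    """Return (start, end) pairs in 24-hour notation for every interval found."""
    return [(to_24h(a), to_24h(b)) for a, b in INTERVAL.findall(text)]
```

Calling find_intervals on a corpus line such as "10am-2pm, 6pm-9pm Sat-Wed" yields one pair per interval; day restrictions such as "Sat-Wed" are ignored by this sketch and would need a pattern of their own.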
GATE: Presentation of Results
[Screenshot: GATE shows the type and location of every extracted annotation in the document; highlighted areas: museum information, annotations]
GATE: Results
• information annotated in the documents:
– names of museums, hotels
– names of tourist places in Egypt
– times, time intervals
– time restrictions
– prices, intervals of prices (hotel prices and museum prices)
– names of pharaohs, queens
GATE: Evaluation
documentation?
- good
- illustrative examples (tutorial), but not enough, especially about JAPE rules
- usable without knowledge of Java programming
- but experience with Java programming is an advantage for using Java in JAPE rules
GATE: Evaluation
processing and linguistic resources?
- many processing resources available (ANNIE):
  - tokenisers
  - POS taggers
  - parsers
  - gazetteers
  - sentence splitter
  - …
- additional PRs:
  - gazetteer collector
  - PRs for Machine Learning
  - various exporters
  - annotation set transfer etc.
GATE: Evaluation
tools for resource maintenance and extension?
- editor for gazetteer lists
- corpus manager
- text editor and debugger for JAPE rules
GATE: Evaluation
speed of processing?
- there is no measurement of processing time in the GATE tool
GATE: Evaluation
single docs vs. large corpora?
- corpus pipeline vs. document pipeline
GATE: Evaluation
limitations, suggestions for improvement?
- no limitations:
  - everything is possible, but it is not necessary to implement it yourself
- for getting started:
  - processing and linguistic resources are available within the distribution
GATE: Evaluation
im-/export of document formats?
- import:
  - supports a variety of document formats: HTML, RTF, email, SGML and plain text
  - in all cases the format is analysed and converted into a single unified model of annotation
- export:
  - documents, corpora and annotations in databases of various sorts
  - required: Java application (CREOLE)
Overview
• Introduction
• GATE
• UIMA
• Conclusion
UIMA: Unstructured Information
Management Architecture
• a software architecture for developing and deploying
unstructured information management (UIM) applications
• UIM application: a software system
– analyse large volumes of unstructured information to
• discover,
• organize, and
• deliver relevant knowledge to the end user
• software architecture which specifies
– component interfaces, data representations, …
• http://www.research.ibm.com/UIMA/
UIMA: Unstructured Information Management
Architecture
[Architecture diagram with one callout per component:]
• Collection Reader: … interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
• CAS Initializer: … may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS.
• Analysis Engine: … takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
• CAS Consumer: … consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.
CAS: Common Analysis Structure
CPM: Collection Processing Manager
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
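The division of labour between the four component roles can be illustrated with a small stand-alone sketch. It is written in plain Python and does not use the UIMA Java API: the CAS is modelled as a simple dictionary, and all names (DetagInitializer, collection_reader, museum_annotator, cas_consumer, run_pipeline) are hypothetical.

```python
import re
from html.parser import HTMLParser


class DetagInitializer(HTMLParser):
    """Plays the CAS Initializer role: strips tags, records <p> spans."""

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments in document order
        self.paragraphs = []   # paragraph annotations as offset dicts
        self._length = 0       # length of the text collected so far
        self._start = None     # begin offset of the currently open <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._start = self._length

    def handle_endtag(self, tag):
        if tag == "p" and self._start is not None:
            self.paragraphs.append(
                {"type": "Paragraph", "begin": self._start, "end": self._length})
            self._start = None

    def handle_data(self, data):
        self.parts.append(data)
        self._length += len(data)


def collection_reader(documents):
    """Collection Reader role: yields one CAS (here a dict) per document."""
    for html in documents:
        init = DetagInitializer()
        init.feed(html)
        init.close()
        yield {"text": "".join(init.parts), "annotations": list(init.paragraphs)}


MUSEUM = re.compile(r"(?:[A-Z][a-z]+ )+Museum")


def museum_annotator(cas):
    """Analysis Engine role: enriches the CAS with Museum annotations."""
    for match in MUSEUM.finditer(cas["text"]):
        cas["annotations"].append(
            {"type": "Museum", "begin": match.start(), "end": match.end()})
    return cas


def cas_consumer(cas):
    """CAS Consumer role: turns annotations into (type, covered text) pairs."""
    return [(a["type"], cas["text"][a["begin"]:a["end"]]) for a in cas["annotations"]]


def run_pipeline(documents):
    """Runs reader, annotator and consumer in sequence over all documents."""
    index = []
    for cas in collection_reader(documents):
        index.extend(cas_consumer(museum_annotator(cas)))
    return index
```

Annotations are kept as offsets into the de-tagged text rather than as copies of the text, which is the stand-off style both frameworks use for their results (documents plus annotations).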
UIMA: Unstructured Information Management
Architecture
• Analysis Engine (AE):
– a component that analyzes artifacts (e.g. documents) and infers
information about them
– consists of two parts:
• Java classes (typically packaged as one or more JAR files) and
• AE descriptors (one or more XML files) containing
– the configuration settings for the Analysis Engine as well as
– a description of the AE’s input and output requirements.
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
UIMA Application
• several annotators (like a pipeline)
• input text (excerpt):
...
*Fraunces Tavern Museum*
54 Pearl St. - 1-212-425-1778
Tuesday-Friday, 12pm-5pm; …
• annotators (diagram labels): restrictions, time pattern, museum pattern, interval of times, museum information
• techniques noted in the diagram:
– regular expressions
– window covering two time intervals and a restriction
– window covering a museum and opening hours
Prolog facts:
museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00','2005-12-01T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00','2005-12-02T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00','2005-12-03T17:00:00').
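Facts of the museumopen/3 form shown on this slide could be generated from a detected name and time interval roughly as follows. This is a hedged Python sketch, not the authors' CAS Consumer: prolog_atom and museumopen_facts are hypothetical names, and the slides do not show how a restriction such as "Tuesday-Friday" is mapped to calendar dates, so the dates are passed in explicitly.

```python
from datetime import date


def prolog_atom(text):
    """Quote a string as a Prolog atom, doubling backslashes and single quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "''") + "'"


def museumopen_facts(name, days, start, end):
    """Build one museumopen/3 fact per date; start and end are 'HH:MM:SS'."""
    facts = []
    for day in days:
        opens = f"{day.isoformat()}T{start}"
        closes = f"{day.isoformat()}T{end}"
        facts.append(
            f"museumopen({prolog_atom(name)}, {prolog_atom(opens)}, {prolog_atom(closes)}).")
    return facts
```

For example, museumopen_facts("Fraunces Tavern Museum", [date(2005, 12, 1)], "12:00:00", "17:00:00") returns a single fact string that a Prolog reasoner can load with consult/1 once written to a file.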
UIMA: Results
• information annotated in the documents:
– names of museums, hotels
– times, time intervals
– time restrictions
– prices, intervals of prices (hotel prices)
– keywords for museum category
– names of pharaohs (annotated with a correction of misspellings)
• hotel and museum information is exported into Prolog facts and into a short textual summary
– templates filled with the detected information
• hotels: Price information about Cosmopolitan Hotel : $157
• museums:
*** *Fraunces Tavern Museum* ***
Open from 12:00:00 to 17:00:00;
Restriction: Tuesday-Friday
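Filling such templates amounts to plain string formatting over the detected values. A minimal Python sketch, assuming the two layouts shown on this slide; the function names hotel_summary and museum_summary are hypothetical and the authors' actual implementation is not given in the slides.

```python
def hotel_summary(name, price):
    """Fill the hotel template with a detected hotel name and price."""
    return f"Price information about {name} : {price}"


def museum_summary(name, start, end, restriction=None):
    """Fill the museum template; the restriction line is optional."""
    lines = [f"*** {name} ***", f"Open from {start} to {end};"]
    if restriction:
        lines.append(f"Restriction: {restriction}")
    return "\n".join(lines)
```

Keeping the restriction optional lets the same template serve museums for which only an opening interval was detected.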
UIMA: Evaluation
documentation?
- good
- illustrative examples (tutorial)
- completeness: some topics are described only very briefly
- prior knowledge of Java and Eclipse is helpful
UIMA: Evaluation
processing and linguistic resources?
- annotators only from the tutorial:
  - sentence annotation
  - word annotation
  - date/time annotators
  - examples for using regular expressions etc.
- external resources can be integrated:
  - lexical resources as external resources (text files)
  - existing processing resources
  - implementation of an interface is necessary
UIMA: Evaluation
tools for resource maintenance and extension?
- specific Eclipse component editors or
- simple text editors
UIMA: Evaluation
speed of processing?
- faster than GATE?
- in CPE detailed information about processing time for each module
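Per-module timing of this kind can be approximated for any pipeline by wrapping each stage in a timer. The sketch below is plain Python and not part of UIMA or GATE; run_timed and the (name, function) stage list are assumptions made for illustration.

```python
import time


def run_timed(stages, document):
    """Run (name, function) stages in order; return the result and seconds per stage."""
    timings = {}
    data = document
    for name, stage in stages:
        started = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - started
    return data, timings
```

A call such as run_timed([("lower", str.lower), ("split", str.split)], "Egyptian Museum") returns the final result together with a dictionary of elapsed seconds keyed by stage name, which makes slow modules visible when processing a larger corpus.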
UIMA: Evaluation
single docs vs. large corpora?
- Collection Reader:
  - document(s) from a directory
- adapt extensions into preprocessing (CAS Initializer):
  - e.g., extraction of text fragments from an HTML document
UIMA: Evaluation
limitations, suggestions for improvement?
- no limitations:
  - everything is possible, but implementation or interfacing is left to the user
- wish:
  - more processing and linguistic resources within the distribution
UIMA: Evaluation
im-/export of document formats?
- import: CAS Initializer
- export: CAS Consumer
  - transform annotations into any other format
  - export of
    - document + annotations
    - only annotations
  - required: Java application
Overview
• Introduction
• GATE
• UIMA
• Conclusion
Conclusion
• intended use
– GATE: academic/scientific application
• tools available
• comfortable GUI
– UIMA: more commercial
• plain framework
• simplified definition of (complex) results structures
• simplified pre- and postprocessing of annotations
• in sum: incommensurable
Conclusion
• both are extensible
• no final judgement about: use GATE or UIMA
– depends on
• your task
– task description
– expected results
– which processing resources are necessary
• your preferences for interface
– prefer the Eclipse environment (or other Java editors)
– prefer a comfortable GUI
• or use both
Conclusion
• found in the UIMA Forum:
I see UIMA and GATE as complementary rather than competitive, and each
can gain from the strengths of the other.
GATE was originally developed as a research tool, and has features suited
to rapid prototyping of text processing code, like JAPE (a language for
defining finite-state transducers over annotations on a document).
UIMA is more targetted at robust deployment of applications, with strong
typing of feature structures and better support for distributed processing.
We're currently working on writing a translation layer to allow UIMA analysis
components to be used in GATE and vice-versa. It's not in a releasable
state just yet, but we hope to release something in the near future. Keep
your eye on http://gate.ac.uk/ for details.
Ian Roberts (GATE developer)