GATE Overview and Demo University of Washington CLMA Treehouse Presentation October 8, 2010 Prescott Klassen.


GATE
Overview and Demo
University of Washington
CLMA Treehouse Presentation
October 8, 2010
Prescott Klassen
Overview
• Summary of GATE information and
documentation found at gate.ac.uk
• GATE Developer features, components, and
plug-ins
• IDE Demo
• Embedded GATE
• Using GATE with Condor on Patas
• GATE code samples
Background
• Sheffield Natural Language Processing Group at the
University of Sheffield
• Released 1996 – re-written and re-released 2002
• Latest Release GATE 5.2.1 (May 6, 2010) – Windows, Linux,
Solaris, and Mac OS
• Beta Release GATE 6.0 (Beta 1 – August 21, 2010)
• 100% Java Reference Implementation
• Compatible with IBM Unstructured Information
Management Architecture (UIMA)
• Open Source (GNU Library General Public License)
• XML Corpus Encoding Standard (XCES) format, used by the
American National Corpus
What is GATE?
• An architecture describing how language
processing systems are made up of
components.
• A framework (class library) written in Java and
tested on Linux, Windows and Solaris.
• A graphical development environment built on
the framework (IDE for NLP)
GATE Products
• GATE Developer
– IDE for language processing components, bundled with ANNIE (A
Nearly-New Information Extraction system) and plug-ins
• GATE Teamware
– Web app for collaborative semantic annotation projects incorporating
a workflow engine and a backend service infrastructure
• GATE Embedded
– Object library optimized for inclusion in applications
• GATE Services
– Hosted services for cloud application development
• GATE Wiki
– Wiki/CMS
• GATE Cloud
– Cloud computing solution for hosted large-scale text processing
GATE Components
• Language Resources (LRs)—documents,
corpora and ontologies
• Processing Resources (PRs)—parsers,
stemmers, co-reference resolvers, ML
components, etc.
• Visual Resources (VRs)—IDE components that
provide a visual interface (GUI) to GATE
components and plug-ins
Language Resources
• Documents, corpora, and ontologies
• Can persist in Java Serial Store or Lucene Serial Data
Store
• Document = content + annotations + features
• “Stand-off” Markup
• Annotations as Directed Acyclic Graphs (start Node,
end Node, ID, type, Feature Map, pointers into the
source document as character offsets)
• Input Formats: Plain Text, HTML, SGML, XML, RTF, Email,
PDF, Microsoft Word
• Ontology support (Sesame 2, OWLIM 3)
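The "document = content + annotations + features" and stand-off markup points can be illustrated with a minimal sketch in plain Java. The classes `Doc` and `Ann` below are hypothetical and are not part of the GATE API; they only show how an annotation can refer to text by character offsets without altering the text itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-off model for illustration; NOT the GATE classes.
class StandoffSketch {

    // An annotation points into the text by offsets and never modifies it.
    static final class Ann {
        final int id;
        final String type;
        final long start; // offset of the first character covered
        final long end;   // offset just past the last character covered
        final Map<String, Object> features = new HashMap<>();

        Ann(int id, String type, long start, long end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Document = content + annotations + features.
    static final class Doc {
        final String content;
        final List<Ann> annotations = new ArrayList<>();
        final Map<String, Object> features = new HashMap<>();

        Doc(String content) {
            this.content = content;
        }

        // Add an annotation over the half-open range [start, end).
        Ann annotate(String type, long start, long end) {
            Ann a = new Ann(annotations.size(), type, start, end);
            annotations.add(a);
            return a;
        }

        // Recover the covered text from the offsets alone.
        String coveredText(Ann a) {
            return content.substring((int) a.start, (int) a.end);
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc("GATE was built in Sheffield.");
        Ann org = doc.annotate("Organization", 0, 4);
        org.features.put("source", "manual");
        Ann loc = doc.annotate("Location", 18, 27);
        System.out.println(org.type + ": " + doc.coveredText(org));
        System.out.println(loc.type + ": " + doc.coveredText(loc));
    }
}
```

Because the content string is never edited, any number of overlapping annotation layers can sit over the same text.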
Processing Resources
• ANNIE (a Nearly-New Information Extraction
System)
– Document Reset
– Tokeniser
– Gazetteer
– Sentence Splitter
– RegEx Sentence Splitter
– Part of Speech Tagger
– Semantic Tagger
– Orthographic Coreference (OrthoMatcher)
– Pronominal Coreference
Processing Resources
• JAPE (Java Annotation Pattern Engine):
– Regular expressions over annotations
– Finite state transduction over annotations based on
regular expressions
– Not against strings but against annotation graphs
– Non-deterministic
• ANNIC: ANNotations-In-Context
– full-featured annotation indexing and retrieval system
– Searchable Serial DataStore
– Based on Lucene
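To make "not against strings but against annotation graphs" concrete, here is a minimal sketch in plain Java. It is not JAPE syntax and does not use the JAPE engine; the `Token` and `Span` classes and the rule itself are hypothetical. It applies one hard-coded rule, "one or more consecutive Tokens whose category is NNP become a Name", with a simple greedy left-to-right scan over token annotations rather than over characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the JAPE engine or any GATE class.
class AnnotationPatternSketch {

    // A token annotation with offsets and a part-of-speech category feature.
    static final class Token {
        final long start;
        final long end;
        final String category;

        Token(long start, long end, String category) {
            this.start = start;
            this.end = end;
            this.category = category;
        }
    }

    // A new annotation produced by the rule.
    static final class Span {
        final String type;
        final long start;
        final long end;

        Span(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Rule: (Token.category == NNP)+ --> Name. Tokens must be in document order.
    static List<Span> matchProperNounRuns(List<Token> tokens) {
        List<Span> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!"NNP".equals(tokens.get(i).category)) {
                i++;
                continue;
            }
            int j = i; // extend the run as far as the pattern keeps matching
            while (j + 1 < tokens.size() && "NNP".equals(tokens.get(j + 1).category)) {
                j++;
            }
            out.add(new Span("Name", tokens.get(i).start, tokens.get(j).end));
            i = j + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens for the text "Prescott Klassen presented GATE".
        List<Token> tokens = Arrays.asList(
                new Token(0, 8, "NNP"),
                new Token(9, 16, "NNP"),
                new Token(17, 26, "VBD"),
                new Token(27, 31, "NNP"));
        for (Span s : matchProperNounRuns(tokens)) {
            System.out.println(s.type + " [" + s.start + ", " + s.end + ")");
        }
    }
}
```

The pattern never inspects the characters of the text; it only tests a feature on each annotation, which is the point the slide makes about JAPE.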
Processing Resources
• The Annotation Diff Tool
– enables two sets of annotations in one or two
documents to be compared
– figures are generated for precision, recall, F-measure
• Corpus Benchmark Tool
– Apply evaluation across an entire corpus
• Balanced Distance Metric (BDM) Ontology
Tool
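The precision, recall, and F-measure figures mentioned for the Annotation Diff Tool can be sketched in plain Java as follows. This is not the tool's implementation. It assumes strict matching only: an annotation counts as correct when its type and both offsets equal those of a key annotation, and a partial overlap counts as wrong. The `"type:start:end"` string encoding is a convenience of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of strict precision/recall/F1; not the GATE tool.
class AnnotationDiffSketch {

    // Each annotation is encoded as "type:start:end".
    // Returns {precision, recall, f1}.
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<>(key);
        correct.retainAll(response); // annotations present in both sets

        double precision = response.isEmpty() ? 0.0
                : (double) correct.size() / response.size();
        double recall = key.isEmpty() ? 0.0
                : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0) ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<>(Arrays.asList(
                "Person:0:16", "Organization:27:31", "Location:40:49"));
        Set<String> response = new HashSet<>(Arrays.asList(
                "Person:0:16", "Organization:27:31", "Location:41:49", "Date:50:54"));
        double[] s = score(key, response);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```

In the example, two of the four response annotations match the three key annotations exactly, so precision is 2/4 and recall is 2/3.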
Processing Resources (PlugIns)
• OntoGazetteer
• HashGazetteer
• Gazetteer List Collector
• Large KB Gazetteer
• Ontology-Aware JAPE Transducer
• Batch Learning PR (LibSVM, PAUM algorithm,
Weka interface)
• Machine Learning PR (Maxent, Weka and SVM
Light)
Resources on the Web site
gate.ac.uk
• User Guide
• Movie Tutorials
• Developer’s Guide/API docs
• NLP Application Programmer’s Guide
• Research Papers
• GATE project descriptions
• Demos
• Plug-in Info
• Commercial/Academic partnerships
• Etc…
IDE Demo
What is GATE Embedded?
• Everything in GATE IDE without the GUI
• A Java framework for many different types of
NLP solutions
• A complex assortment of core functionality
and plug-ins
• Extensible and Composable
– GATE can be included as a component in other
Java Frameworks and vice-versa
Example Application with a GATE
Embedded Component
Running GATE (“Hello World”)
import gate.*;
import gate.creole.*;
import java.io.File;

public class Main {
  public static void main(String[] args) throws Exception {
    // Replace the two placeholders with the local GATE and plug-in directories.
    Gate.setGateHome(new File("<Path to GATE>"));
    Gate.setPluginsHome(new File("<Path to Plugins>"));
    Gate.init(); // start GATE
  }
}
Registering Directories
// File.toURL() is deprecated; toURI().toURL() also escapes special characters.
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "Information_Retrieval").toURI().toURL());
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "Stemmer_Snowball").toURI().toURL());
Creating Processing Resources
SerialAnalyserController annieController =
    (SerialAnalyserController) Factory.createResource(
        "gate.creole.SerialAnalyserController",
        Factory.newFeatureMap(),
        Factory.newFeatureMap(), "ANNIE");

FeatureMap params = Factory.newFeatureMap();
annieController.add((ProcessingResource)
    Factory.createResource("gate.creole.annotdelete.AnnotationDeletePR", params));
annieController.add((ProcessingResource)
    Factory.createResource("gate.creole.tokeniser.DefaultTokeniser", params));
annieController.add((ProcessingResource)
    Factory.createResource("stemmer.SnowballStemmer", params));
annieController.add((ProcessingResource)
    Factory.createResource("gate.creole.gazetteer.DefaultGazetteer", params));
annieController.add((ProcessingResource)
    Factory.createResource("gate.creole.splitter.RegexSentenceSplitter", params));
annieController.add((ProcessingResource)
    Factory.createResource("gate.creole.POSTagger", params));
annieController.add((ProcessingResource)
    Factory.createResource("gate.creole.ANNIETransducer", params));
annieController.add((ProcessingResource)
    Factory.createResource("gate.creole.orthomatcher.OrthoMatcher", params));

FeatureMap coRefParams = Factory.newFeatureMap();
coRefParams.put("resolveIt", "true");
annieController.add((ProcessingResource)
    Factory.createResource("gate.creole.coref.Coreferencer", coRefParams));
Creating Language Resources
Corpus corpus = Factory.newCorpus("DUC Queries");

@SuppressWarnings("static-access")
File topicsFile = new File(ConfigMgr.getTopicFilePath() + "topics.xml");
gate.Document topicDoc = Factory.newDocument(topicsFile.toURI().toURL());
corpus.add(topicDoc);

// Run the ANNIE pipeline over every document in the corpus.
annieController.setCorpus(corpus);
annieController.execute();
Iteration and Cleanup
AnnotationSet defaultAnnotations = topicDoc.getAnnotations();
AnnotationSet originalMarkup = topicDoc.getAnnotations("Original markups");
AnnotationSet topicAnnotationSet = originalMarkup.get("TOPIC");

for (Annotation topicAnnotation : topicAnnotationSet) {
  ArrayList<Query> topicQueryArrayList;
  if (ConfigMgr.isQueryBreakdown()) {
    topicQueryArrayList = Utilities.buildTopicMultiQuery(topicAnnotation,
        originalMarkup, defaultAnnotations, config);
  } else {
    topicQueryArrayList = Utilities.buildTopicQuery(topicAnnotation,
        originalMarkup, defaultAnnotations, config);
  }
  String topicKey = topicQueryArrayList.get(0).getDucTopicName();
  globalQueryHash.put(topicKey, topicQueryArrayList);
}

// Release the GATE resources once the queries have been built.
topicDoc.cleanup();
Factory.deleteResource(topicDoc);
corpus.cleanup();
Factory.deleteResource(corpus);
Iterating through Annotations
public static AnnotationSet getChildAnnotationSet(
    String childAnnotationSetName,
    Annotation annotation,
    AnnotationSet parentAnnotationSet)
    throws NullPointerException {
  AnnotationSet childAnnotationSet = null;
  // Traverse the nested annotation set for the named annotation,
  // using the parent's offsets to delimit the range.
  try {
    childAnnotationSet = parentAnnotationSet.get(childAnnotationSetName,
        annotation.getStartNode().getOffset(),
        annotation.getEndNode().getOffset());
    if (childAnnotationSet == null) {
      throw new NullPointerException();
    }
  } catch (Exception e) {
    // Note: this also catches the NullPointerException thrown above,
    // so the caller may still receive null and should check for it.
    System.err.println(e.getMessage());
  }
  return childAnnotationSet;
}
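The idea behind the method above, selecting the annotations of one type that fall within a parent annotation's offsets, can be sketched without GATE in plain Java. The `Ann` class and `childrenOf` helper are hypothetical. The sketch keeps only annotations that lie entirely inside the parent span, which is an assumption of the sketch and may differ from how the GATE call treats partial overlaps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of offset-delimited child selection; not GATE code.
class ChildSelectionSketch {

    static final class Ann {
        final String type;
        final long start;
        final long end;

        Ann(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Return annotations of childType fully contained in the parent's span.
    static List<Ann> childrenOf(Ann parent, String childType, List<Ann> all) {
        List<Ann> out = new ArrayList<>();
        for (Ann a : all) {
            boolean inside = a.start >= parent.start && a.end <= parent.end;
            if (a != parent && inside && a.type.equals(childType)) {
                out.add(a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Ann topic = new Ann("TOPIC", 0, 100);
        List<Ann> all = Arrays.asList(
                topic,
                new Ann("TITLE", 5, 20),    // inside the topic
                new Ann("NARR", 30, 90),    // inside, but a different type
                new Ann("TITLE", 120, 140)); // belongs to a later topic
        for (Ann child : childrenOf(topic, "TITLE", all)) {
            System.out.println(child.type + " [" + child.start + ", " + child.end + ")");
        }
    }
}
```

Returning an empty list when nothing matches avoids the null check that the caller of the original method has to perform.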
Example Script for Compiling on Patas
#! /bin/bash
javac -classpath .:/NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/activation.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-contrib-1.0b2.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-launcher.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/antlr.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-nodeps.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-trax.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/Bib2HTML.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-discovery-0.2.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-fileupload-1.0.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-logging.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/concurrent.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gateHmm.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/geronimo-ws-metadata_2.0_spec-1.1.1.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/GnuGetOpt.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/icu4j.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jakarta-oro-2.0.5.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/javacc.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxb-api-2.0.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxen-1.1.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxws-api-2.0.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jwnl.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lubm.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/mail.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/pg73jdbc3.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/poi-2.5.1-final-20040804.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/spring-beans-2.0.8.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/spring-core-
GATE Condor Script
universe = java
executable = ling573extractive/Main.class
arguments = ling573extractive.Main
output = ling573extractive.output
error = ling573extractive.error
jar_files = /NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/wstx-lgpl-3.2.3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xercesImpl.jar,edu.mit.jwi_2.1.5.jar
java_vm_args = -Xmn100M -Xms500M -Xmx500M
+RequiresWholeMachine = True
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )
queue
Discussion