Transcript wyner.info

Natural Language Processing Techniques
for Managing Legal Resources
JURIX 2009 Tutorial
Erasmus University, School of Law
Rotterdam, The Netherlands
December 16, 2009
Adam Wyner
University College London
[email protected]
www.wyner.info/LanguageLogicLawSoftware
Overview
• Preliminary comments.
• Natural Language Processing elements.
• GATE introduction.
• Samples of GATE applied to legal resources - legislation, case-based reasoning, and gazettes.
• From GATE to ontologies and logical representations.
Main Point
Legal text expressed in natural language can be
automatically annotated with semantic mark ups using
natural language processing systems such as the General
Architecture for Text Engineering (GATE). Using the
annotations, we can (in principle) extract the information
from the texts then use it for answering queries or
reasoning.
Outcome from the Tutorial
• Overview of key issues and objectives of NLP with respect to legal resources.
• Idea of how one NLP tool (GATE) works and can be used.
• Idea of what some initial results might be.
• Sense of what can be done.
Audience
• Law school students, legal professionals, public administrators. Get things done that are relevant to them.
• AI and Law researchers. A collaborative framework for research and development.
What the Tutorial is....
• A report of learning and working with this material. A faster way in.
• An invitation to collaborate as a research and development community using a common framework.
• A presentation of initial elements of a work in progress, not a thoroughgoing prototype or full-fledged system.
Open Data Lists, Rules, and Development Environment
• Contribute to the research community and build on past developments.
• Teaching and learning.
• On balance, academic research ought to contribute to the common good rather than be proprietary. If you need to own it, work at a company.
• Distributive research, stream results.
• Interchange. The Semantic Web chokes on different formats.
• No publication without replication. Text analytics has an experimental aspect.
Sample Texts
• Legislation (EU and UK) for structure and rules.
• Case decisions (US on intellectual property and crime) for details and CBR factors.
• Gazettes (UK).
• On paper:
  - What information do you want to identify?
  - How can you identify it (e.g. how do you know what you know)?
  - What do you want to do with it?
Semantic Web
• Want to be able to do web-based information extraction and reasoning with legal documents, e.g. find the attorneys who get decisions for plaintiffs in a corpus of case law.
• Machine only "sees" strings of characters, while we "see" and use meaning.
  - John Smith, for plaintiff..... The decision favours plaintiff.
• How can we do this? Semantic annotation of documents, then extraction of those annotations in meaningful relations.
• "Self-describing" documents.
What is the Problem?
Natural language supports implicit information, multiple forms with the same meaning, the same form with multiple meanings (context), and dispersed meanings:
  - Entity ID: Jane Smith, for plaintiff.
  - Relation ID: Edgar Wilson disclosed the formula to Mary Hays.
  - Jane Smith, Jane R. Smith, Smith, Attorney Smith....
  - Jane Smith in one case decision need not be the same Jane Smith in another case decision.
  - Jane Smith represented Jones Inc. She works for Dewey, Chetum, and Howe. To contact her, write to [email protected]
As with names, so too with sentences.
Knowledge Light v. Heavy Approaches
• Statistical approaches - compare and contrast large bodies of textual data, identifying regularities and similarities. Sparse data problem. No rules extracted. Useful for ranking documents for relevance.
• Machine learning - apply learning algorithms to known material to extend results to unknown material. Needs known, annotated material. Useful for text classification. Black box - cannot really know the rules that are learned or use them further.
• Lists, rules, and processes - know what we are looking for. Know the rules and can further use and develop them. Labour and knowledge intensive.
Knowledge Light v. Heavy Approaches
• Some mix of the approaches.
• The importance of humanly accessible explanation and justification in some domains of the law warrants a knowledge-heavy approach.
Overview
• Motivations and objectives of NLP in this context.
• General Architecture for Text Engineering (GATE).
• Processing and marking up text.
• Other technologies for parsing and semantic interpretation (C&C/Boxer).
Motivation
• Annotate large legacy corpora.
• Address growth of corpora.
• Reduce number of human annotators and tedious work.
• Make annotation systematic, automatic, and consistent.
• Annotate fine-grained information:
  - Names, locations, addresses, web links, organisations, actions, argument structures, relations between entities.
• Map from well-drafted documents in NL to RDF/OWL.
Approaches
• Top-down vs. Bottom-up approaches:
• Both do initial (and iterative) analysis of the texts in the target corpora.
• Top-down defines the annotation system, which is applied manually to texts. Knowledge intensive in development and application.
• Bottom-up: the annotation system is 'defined' in terms of parsing, lists of basic components, ontologies, and rules to construct complex mark ups from simpler ones. Apply the annotation system to text, which outputs annotated text. Knowledge intensive in development.
• Convergent/complementary/integrated approaches.
• Bottom-up reconstructs and implements linguistic knowledge. However, there are limits....
Objectives of NLP
• Generation – convert information in a database into natural language.
• Understanding – convert natural language into a machine-readable form. Support inference?
• Information Retrieval – gather documents which contain key words or phrases. Preindex the corpus (a list of which documents a word appears in) to speed retrieval (e.g. Google). Rank documents in terms of "relevance". Documents must be read to identify information.
Objectives of NLP
• Text Summarization – summarize (in a paragraph) the main meaning of a text or corpus.
• Question Answering – queries made and answers given in natural language with respect to some corpus of texts.
• Information Extraction – identify and extract information from documents which is then reused or represented. The information should be meaningfully related. Information extraction can be used to improve information retrieval.
Objectives of NLP
• Automatic mark up to overcome the annotation bottleneck.
• Develop ontologies.
• Semantic representation for modelling and inference.
• Semantic representation as an 'interlanguage' for translation.
• Provide gold-standard corpora.
• To understand and work with human language capabilities.
Subtasks of NLP
• Syntactic parsing into phrases/chunks (prepositional, nominal, verbal,....).
• Identify semantic roles (agent, patient,....).
• Entity recognition (organisations, people, places,....).
• Resolve pronominal anaphora and co-reference.
• Address ambiguity.
• Focus on entity recognition (parsing happens, anaphora can be shown, others are working on semantic roles, etc).
Computational Linguistic Cascade
• Sentence segmentation – divide text into sentences.
• Tokenisation – words identified by spaces between them.
• Part-of-speech tagging – noun, verb, adjective.... Determined by lookup and relationships among words.
• Morphological analysis – singular/plural, tense, nominalisation, ...
• Shallow syntactic parsing/chunking – noun phrase, verb phrase, subordinate clause, ....
• Named entity recognition – the entities in the text.
• Dependency analysis – subordinate clauses, pronominal anaphora,...
• Each step guided by pattern matching and rule application.
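The first steps of the cascade can be sketched in a few lines. This is a minimal, illustrative Python sketch, not GATE code; the tiny lexicon and tag names are invented for the example.

```python
import re

# A toy version of the first cascade steps: sentence segmentation,
# tokenisation, and part-of-speech tagging by lexicon lookup.
# The lexicon and tag set are hypothetical, not GATE's.

LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
           "chased": "VERB", "ran": "VERB"}

def split_sentences(text):
    # Naive segmentation: split after sentence-final punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    # Words identified by spaces, with punctuation split off as its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

def pos_tag(tokens):
    # Tag by lexicon lookup; unknown words get 'UNK'.
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens]

text = "The dog ran. The cat chased the dog."
for sent in split_sentences(text):
    print(pos_tag(tokenise(sent)))
```

Each stage feeds the next, which is the "pipeline" idea the later GATE slides rely on.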
Development Cycle
•
Text -> Linguistic Analysis -> Knowledge Extraction
•
Cycle back to text and linguistic analysis to improve
knowledge extraction.
GATE
• General Architecture for Text Engineering (GATE): an open source framework which supports plug-in NLP components to process a corpus of text. Is "open" open?
• A GUI to work with the tools.
• A Java library to develop further applications.
• Where to get it? Lots of tutorial information.
  - http://gate.ac.uk/
• Components and sequences of processes, each process feeding the next in a "pipeline".
• Annotated text output.
Loading and Running GATE with ANNIE
• Start GUI.
• LC on File > Load ANNIE System > Choose with Defaults. Adds Processing Resources and an Application.
• RC on Language Resources > New > Select GATE document > Browse to document > OK.
• When added, RC on the document (BNA sample) > New Corpus with this document.
• RC on ANNIE under applications to see the pipeline.
• At Corpus, select the Corpus created.
• Run.
GATE Example
Inspecting the Result
•
RC on document (not Corpus) > Show, which shows
the text.
•
•
LC on Annotation Sets, LC on Annotations List.
•
Selecting an annotation highlights the relevant text in
colour. In the List box below, we get detailed
information about location, Id, and features
On right see coloured list with check boxes for
annotations; below see a box with headings.
Inspecting the Result
• For Location, we have "United Kingdom", with locType = country, matching Ids, and the rules that have been applied.
• Similarly for JobTitle, Lookup (from Gazetteers), Sentence, SpaceToken, Split (for sentences), and Token (every "word").
• Note the different information provided by Lookup and Token, which is useful for writing rules.
• Will remark on Type, Start/End, Id, and features.
GATE Example
GATE Example
GATE Example
XML -- Inline
XML is a flexible, extensible framework for mark up languages. The mark ups have beginnings/endings. Inline XML is strictly structured in a tree (note contains body, body could contain date, no overlap) and is "inline" with the text. Compare to standoff, which allows overlap and sets the text off from the annotations. Standoff allows reprocessing since the text is constant.
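The standoff idea can be illustrated concretely: the text never changes, and each annotation just records a type, character offsets, and features. This is a sketch of the concept, not GATE's actual serialisation format.

```python
# Standoff mark-up sketch: annotations reference start/end offsets into an
# unchanged text, so they may overlap and the text can be reprocessed freely.
# The annotation structure here is illustrative, not GATE's XML schema.

text = "John Smith ran."

annotations = [
    {"type": "Person", "start": 0, "end": 10, "features": {"gender": "male"}},
    {"type": "Verb", "start": 11, "end": 14, "features": {"tense": "past"}},
]

def covered_text(text, ann):
    # Recover the string an annotation covers from its offsets.
    return text[ann["start"]:ann["end"]]

for ann in annotations:
    print(ann["type"], "->", covered_text(text, ann))
```

Because annotations only point into the text, two annotations may cover overlapping spans, which inline XML's tree structure forbids.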
XML -- Standoff
GATE Output Inline
In the GATE Document Editor, the Annotations can be deleted
(RC > Delete). We have left just Location and JobTitle. To output
text with annotations that are XML compatible, RC on the
document in Language Resources, then Save preserving
document format. Further processing can be done using XSLT.
GATE Output Offset - part 1a
In the GATE Document Editor, the Annotations can be deleted
(RC > Delete). We have left just Location and JobTitle. To output
text with annotations that are in XML, RC on the document in
Language Resources, then Save as XML. This is the top part.
The text is serialized, and annotations relate to positions in the
text.
GATE Output - part 1b
GATE ANNIE Annotations
GATE ANNIE Annotations
Organisations and Quotations. Case references.
GATE
• Language Resources: corpora of documents.
• Processing Resources: lexicons, ontologies, parsers, taggers.
• Visual Resources: visualisation and editing.
• The resources are plug ins, so can be added, removed, or modified. See the latter with ANNIC (Annotations in Context) and Onto Root Gazetteer (using ontologies as gazetteers).
GATE
A source document contains all its original mark up and format.
• John Smith ran.
A GATE document is:
• Document = text + (annotations + features)
  - <Person, gender = "male">John Smith</Person>
  - <Verb, tense = "past">ran</Verb>
Not really the way it appears in GATE, but the idea using XML.
GATE Annotations
• Have types (e.g. Token, Sentence, Person, or whatever is designed for the annotation).
• Belong to annotation sets (see later).
• Relate to start and end offset nodes (earlier).
• Have features and values that store a range of information, as in (not GATE, but XML-ish):
  - <Person, gender = "male">John Smith</Person>
  - <Verb, tense = "past">ran</Verb>
GATE
Construction:
From smaller units, compose larger, derivative units.
Gazetteers:
Lists of words (or abbreviations) that fit an annotation:
first names, street locations, organizations....
JAPE (Java Annotation Patterns Engine):
Build other annotations out of previously given/defined
annotations. Use this where the mark up is not given by a
gazetteer. Rules have a syntax.
GATE – A Linguistic Example
Lists:
• List of Verb: like, run, jump, ....
• List of Common Noun: dog, cat, hamburger, ....
• List of Proper Name: Cyndi, Bill, Lisa, ....
• List of Determiner: the, a, two, ....
Rules:
  - (Determiner + Common Noun) | Proper Name => Noun Phrase
  - Verb + Noun Phrase => Verb Phrase
  - Noun Phrase + Verb Phrase => Sentence
Input:
• Cyndi likes the dog.
Output:
• [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]].
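The lists and rules above can be sketched as a tiny bottom-up chunker. This is illustrative Python, not how GATE composes annotations; 'likes' is added to the verb list since the sketch does no morphological analysis.

```python
# Toy chunker implementing the slide's three rules over its word lists.
# Bracket labels match the slide's output notation.

VERBS = {"like", "likes", "run", "jump"}
COMMON_NOUNS = {"dog", "cat", "hamburger"}
PROPER_NAMES = {"Cyndi", "Bill", "Lisa"}
DETERMINERS = {"the", "a", "two"}

def parse(tokens):
    # Tag each token by list lookup; Proper Name => Noun Phrase directly.
    tagged = []
    for t in tokens:
        if t in PROPER_NAMES:
            tagged.append(("np", t))
        elif t in VERBS:
            tagged.append(("v", t))
        elif t in DETERMINERS:
            tagged.append(("det", t))
        elif t in COMMON_NOUNS:
            tagged.append(("cn", t))
    # Rule: Determiner + Common Noun => Noun Phrase.
    i, out = 0, []
    while i < len(tagged):
        if (tagged[i][0] == "det" and i + 1 < len(tagged)
                and tagged[i + 1][0] == "cn"):
            out.append(("np", f"[det {tagged[i][1]}] [cn {tagged[i+1][1]}]"))
            i += 2
        else:
            out.append(tagged[i])
            i += 1
    # Rules: Verb + Noun Phrase => Verb Phrase; NP + VP => Sentence.
    np1, (_, v), np2 = out
    return f"[s [np {np1[1]}] [vp [v {v}] [np {np2[1]}]]]"

print(parse("Cyndi likes the dog".split()))
```

Running it on the slide's input reproduces the slide's bracketed output.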
Lists, Lists of Lists, Rules
• Coalesce diverse yet related information in a list, e.g. organisation.lst. What is included here depends on.... What is Looked Up from the list is associated with the "master category".
• Make a master list of the lists in lists.def, which contains organisation.lst, date.lst, legalrole.lst.....
• The master list indicates the majorType of things looked up in the list, e.g. organisation, and minorType, e.g. private, public (and potentially more features). Two lists may have the same majorType, but different minorTypes. Useful so rules can apply similarly or differently according to major or minor types.
GATE organisation.lst
GATE Gazetteer – a list of lists
What Goes into a List?
• A 'big' question. Depends on what one is looking for, how one wants to find it, and what rules one wants to apply.
• Every difference in character is a difference in the string even if the 'meaning' is the same.
  - B.C.  b.c.  bc  b.c  bC.
  - May01,1950  May 01 1950  01 May 1950
• More examples later.
• By list or by rule....
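The variant problem can be shown in a couple of lines: each surface form is a distinct string, so either the gazetteer lists every form or a rule normalises them. The canonical form chosen here is a hypothetical choice for illustration.

```python
# Sketch: 'B.C.', 'b.c.', 'bc', 'b.c', and 'bC.' are five distinct strings,
# though the 'meaning' is the same. A simple normalisation rule collapses
# them to one form; the choice of canonical form is ours, not GATE's.

def normalise_bc(s):
    # Strip dots and lowercase so all surface variants match one entry.
    return s.replace(".", "").lower()

variants = ["B.C.", "b.c.", "bc", "b.c", "bC."]

# Raw string comparison treats every variant as a distinct entity.
print(len(set(variants)), "distinct strings, but",
      len({normalise_bc(v) for v in variants}), "normalised form")
```

By-list and by-rule are the two options the slide names: enumerate the variants in a gazetteer, or write one rule that maps them to a common form.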
Token, Lookup, Feature, Annotation
• Token - a string of characters delimited by spaces. In "The big brown dog chased the lazy cat" there are eight tokens. Token information includes syntactic part of speech (noun, verb,....) and string details (orthography, kind, position, length,....).
• Lookup - look up a string in a list and assign it major or minor types. The "bottom semantic" layer of the cascade.
• Annotation - subsequent mark ups which depend on Token, Lookup, or prior annotations.
• Feature - additional Token, Lookup, or Annotation information.
Rolling your Own
• Create lists and a gazetteer.
• Add processing resources.
• Add documents and make a corpus.
• Construct the pipeline - an ordered sequence of processes.
• Run the pipeline over the corpus.
• Inspect the results.
GATE JAPE
JAPE rule idea (not the real thing).
<FirstName>aaaa</FirstName><LastName>bbbb</LastName>
=>
<WholeName><FirstName>aaaa</FirstName>
<LastName>bbbb</LastName></WholeName>
FirstName and LastName we get from the Gazetteer.
WholeName we construct using the rule. For
complex constructions, must have a range of
alternative elements in the rule.
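The WholeName idea above can be mimicked with gazetteer sets and a pattern match. This is a plain-regex sketch, not JAPE syntax; the names and lists are invented for illustration.

```python
import re

# Sketch of the JAPE rule idea: FirstName and LastName come from gazetteer
# lists; the rule composes a WholeName annotation out of adjacent pairs.
# Name lists and the annotation dict format are illustrative only.

FIRST_NAMES = {"Jane", "John"}
LAST_NAMES = {"Smith", "Wilson"}

def whole_names(text):
    # Find adjacent capitalised word pairs and keep those whose parts
    # are in the FirstName and LastName gazetteer lists.
    matches = []
    for m in re.finditer(r"\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b", text):
        first, last = m.group(1), m.group(2)
        if first in FIRST_NAMES and last in LAST_NAMES:
            matches.append({"type": "WholeName",
                            "start": m.start(), "end": m.end(),
                            "text": m.group(0)})
    return matches

print(whole_names("Jane Smith, for plaintiff."))
```

Real JAPE rules do the same composition over annotations rather than raw strings, which is what lets them combine alternative elements (titles, initials, and so on).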
GATE JAPE
• Header - rule name, annotations to use, processing features, processing priority....
• Left-hand side of rule (LHS) - refers to various mark ups that are found in the text, relies on order, uses expressions of combination or iteration, and identifies what portion is to be annotated as per the rule.
• Right-hand side of rule (RHS) - annotates as per the rule (plus some information).
• Can have Java on the RHS, but will not cover this.
GATE JAPE
? means optional
GATE JAPE
GATE JAPE
Other GATE Components
• Plug in other parsers or work with other languages.... (no)
• Machine learning component. (no)
• Search for annotations in context to refine gazetteers and JAPE rules (ANNIC). (yes)
• Develop an ontology, import it into GATE, then mark up elements automatically or manually (Onto Root Gazetteer). (yes)
GATE – Problems and Issues
• Any difference in the characters of the basic text or in annotations is an absolute difference.
• theatre and theater are different strings for entities. Variants in Gazetteers.
• Organisation and Organization are different annotations.
• Output in XML is possible, but GATE mark up allows overlapping tags, which are barred in standard XML. Must rework GATE XML with XSLT to make it standard XML.
• Hard to get 100% accuracy for a variety of reasons (depends on consistency of input), but 85-95% is achievable.
GATE – Extraction
• So far we have really only covered annotation. Where is the extraction bit?
• Currently, GATE has no plug in to support extraction of information with respect to a rich schema template, e.g. give cases, parties, attorneys, factors, and outcomes.
• With further processing using tools outside GATE, this can be done:
  - XSLT, Java, .... Example....?
  - Use ontologies (I think the direction to go...)
• Yet, can output as presented earlier.
GATE on Legal Resources
• Legislative structure for rule book (structure identification).
• Rule detection for inference (general, UK Nationality Act).
• Elements of Cases (CATO intellectual property, California criminal cases).
• Gazette/Notices information (TSO/OPSI).
Legislative Structure
• Legislative structure for rule book that is used for compliance.
• Identify and annotate the structure of legislation.
• Show what, then how.
• Look for "posts" which can help one identify "content".
• RuleBookTest.xgapp
Insurance and Reinsurance
(Solvency II)
Desired Output
• Reference Code: Article 1
• Title: Subject Matter
• Level: 1.0
• Description: This Directive lays down rules concerning the following:
• Level: 1.1
• Description: the taking-up and pursuit, within the Community, of the self-employed activities of direct insurance and reinsurance;
• Level: 1.2
• Description: the supervision in the case of insurance and reinsurance groups;
• Level: 1.3
• Description: the reorganisation and winding-up of direct insurance undertakings.
GATE Annotation
Comments
• The article is not a logical statement, but identifies the matters which the directive is concerned with.
• Each statement of the article may be understood as a conjunct: the rules concern a, b, and c. However, we have not (yet) represented this.
• The JAPE rules work for this example, but need to be further refined to work with the whole legislation.
• Break down the text into useful segments that can support identification.
Lists
• roman_numerals_i-xx.lst: It has majorType = roman_numeral. A list of roman numerals from i to xx.
• rulebooksectionlabel.lst: It has majorType = rulebooksection. A list of section headings such as: Subject matter, Scope, Statutory systems, Exclusion from scope due to size, Operations, Assistance, Mutual undertakings, Institutions, Operations and activities.
JAPE Rules
• ArticleSection.jape: What is annotated with Article (from the lookup) and a number is annotated ArticleFlag.
• ListFlagLevel1.jape: A number followed by a period or closed parenthesis is annotated ListFlagLevel1.
• ListFlagLevel1sub.jape: A number followed by a letter followed by a period is annotated ListFlagLevel1sub.
• ListFlagLevel2.jape: A string of lower case letters followed by a closed parenthesis is annotated ListFlagLevel2.
• ListFlagLevel3.jape: A roman numeral from a lookup list followed by a closed parenthesis is annotated ListFlagLevel3.
JAPE Rules
• RuleBookSectionLabel.jape: Looks up section labels from a list and annotates them SectionType. For example, Subject matter, Scope, and Statutory systems.
• ListStatement01.jape: A string which occurs between a SectionType annotation and a colon is annotated ListStateTop.
• ListStatement02.jape: A string which occurs between a ListFlagLevel1 and a semicolon is annotated SubListStatementPrefinal.
• ListStatement03.jape: A string which occurs between a ListFlagLevel1 and a period is annotated SubListStatementFinal.
JAPE Rules
JAPE Rules
Note the use of “or”.
JAPE Rules
JAPE Rules
Repeat getting tokens so long as they are not punctuation. + means one or more tokens. Negation.
Rule Detection
• Rule detection with a general example and a specific one (UK Nationality Act).
• Sentence classification.
• From extraction almost to executable logical form (Haley's manual translation and proprietary logical form).
• Conditional.xgapp
Rule Detection
Rule Detection
Problems with: list recognition "(x)", use of ";", use of "--", and use of "or".
Lists and Rules
• No particular lists. Used the list detection from the previous exercise (so particular to that context).
• AntecedentInd01: annotates the token "if" or "If" in the text as a conditional flag.
• AntecedentInd02: annotates a sentence as a conditional if it includes the conditional flag. Another (better?) way: use "if" to identify antecedents and consequents; a sentence is conditional if it has one or more antecedent sentences and one consequent sentence.
Lists and Rules
• ConditionalParts 01 and 05: annotate a sentence portion as an antecedent between a conditional flag and some punctuation.
• ConditionalParts 02, 03, and 04: annotate a sentence portion as a consequent where it appears: between the start of a sentence and a conditional flag; after "then" and before a period; or before a conditional flag and a list indicator (e.g. colon).
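The conditional-detection idea can be sketched in a few lines: flag "if"/"If", classify the sentence as conditional, and split antecedent from consequent at the punctuation. A rough regex illustration only; the actual rules work over sentence and flag annotations and handle more patterns.

```python
import re

# Sketch of conditional detection: a sentence is conditional if it carries
# an if-flag; the comma splits antecedent from consequent. Illustrative only.

def analyse(sentence):
    m = re.match(
        r"[Ii]f\s+(?P<antecedent>[^,]+),\s*(?:then\s+)?(?P<consequent>.+?)\.?$",
        sentence)
    if not m:
        return {"conditional": False}
    return {"conditional": True,
            "antecedent": m.group("antecedent").strip(),
            "consequent": m.group("consequent").strip()}

print(analyse("If a person is born in the UK, then the person is a citizen."))
print(analyse("The person is a citizen."))
```

The lists-and-semicolons problems noted above are exactly what such a flat pattern misses, which is why the JAPE version needs the list-flag machinery.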
Rule Detection
Rule Detection
Rule Detection
* is zero or more tokens,
but should be +.
Rule Detection
Case Factors and Elements
•
Factors in CATO. Relate to ANNIC.
•
Case parts in California criminal cases. Relate to
Onto Root Gazetteer.
Case Based Reasoning
• The CATO case base is a set of cases concerning intellectual property.
• Given a current undecided case, compare and contrast the "factors" of the current case against factors of decided cases in the case base. Decide the current case following decisions of decided cases.
• If a current case has exactly the same factors as a decided case, the decision in the current case is decided as it was in the decided case.
• A complex counter-balancing of various factors (and their circumstances and weightings...)
Case Based Reasoning
The Factors are abstracts of fact patterns that favour
one side or the other. Suppose you have a product
and a secret way of making it.
•


Secrets disclosed to outsiders: You announce the
method, thereby divulging the secret. If you try
to sue someone who uses the method, you are
likely not to win.
Security measures: You have lots of security
methods to protect your method and never
publicly divulge it. If you try to sue someone who
uses the method, you are likely to win.
Case Based Reasoning
•
Task is simply to find linguistic indicators for the
factors in case texts.
•
This is currently done “manually”.
•
We do this “roughly”, then look at ANNIC, which is a
tool we can use to look at the matter more closely.
•
CATOCaseFactors.xgapp
Case Based Reasoning
From Aleven 1997
Case Based Reasoning - Creating a Concept
• We are looking for a "concept", which is an abstraction over particular forms.
• Looked up "disclose" and "secret" in WordNet and made a list for each "concept":
  - disclosure.lst: announce, betray, break, bring out, communicate, confide, disclose, discover, divulge, expose, give away, impart, inform, leak, let on, let out, make known, pass on, reveal, tell
  - secret.lst: confidential, confidentiality, hidden, private, secrecy, secret
• The majorType of the disclose list is "disclose" and that of the secret list is "secret".
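Using the two concept lists together gives a rough factor indicator: a sentence mentioning both a "disclose" term and a "secret" term is a candidate for the Secrets-disclosed-to-outsiders factor. A crude illustrative sketch; the prefix match stands in for GATE's morphological analyser and will overgenerate (e.g. "information" matches "inform").

```python
# Sketch: flag candidate factor sentences by co-occurrence of terms from
# the disclosure and secret concept lists (subsets shown for brevity).

DISCLOSE = {"announce", "betray", "communicate", "confide", "disclose",
            "divulge", "expose", "impart", "inform", "leak", "reveal", "tell"}
SECRET = {"confidential", "confidentiality", "hidden", "private",
          "secrecy", "secret"}

def candidate_factor(sentence):
    words = [w.strip(".,;").lower() for w in sentence.split()]
    # Prefix match is a crude stand-in for morphological analysis,
    # so 'disclosed' matches 'disclose'.
    has_disclose = any(w.startswith(d) for w in words for d in DISCLOSE)
    has_secret = any(w.startswith(s) for w in words for s in SECRET)
    return has_disclose and has_secret

print(candidate_factor("The defendant disclosed the secret formula."))
print(candidate_factor("The parties signed the agreement."))
```

Refining such rough hits against their contexts is exactly what ANNIC, introduced next, is for.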
Case Based Reasoning - Rules
Similarly
for Secret
Case Based Reasoning - Examples
Want to refine these results by looking at context
ANNIC - Annotations in Context
• A plug in tool which helps in searching for annotations, visualising them, and inspecting features. Useful for JAPE rule development.
• How to plug in, load, and run.
• CATOCaseFactors.xgapp
ANNIC - Instantiating an SSD
• RC on Datastores > Create datastore > Lucene Based Searchable DataStore.
• At the input window, provide the following parameters:
  - DataStore URL: Select an empty folder where the data store is created.
  - Index Location: Select an empty folder. This is where the index will be created.
  - Annotation Sets: Provide the annotation sets that you wish to include or exclude from being indexed. Make this list empty (CHECK).
  - Base-Token Type: The tokens which your documents must have to get indexed.
ANNIC - Instantiating an SSD
At the input window, provide the following parameters:
  - Index Unit Type: The unit of index (e.g. Sentence). We use the Sentence unit.
  - Features: Specify the annotation types and features that should be included or excluded from being indexed (e.g. exclude SpaceToken, Split, or Person.matches).
  - Click OK. A new empty searchable SSD will be created.
• Create an empty corpus and save it to the SSD.
• Populate the corpus with some documents. Each document in the corpus is automatically indexed and saved to the data store.
• Load some processing resources and then a pipeline. Run the pipeline over the corpus.
• Once the pipeline has finished (and there are no errors), save the corpus in the SSD by RC on the corpus, then "Save to its datastore".
• Double click on the SSD file under Datastores. Click on the "Lucene DataStore Searcher" tab to activate the search GUI.
• Now you can specify a search query over your annotated documents in the SSD.
ANNIC - The GUI
• Top - area to write a query, select corpus, annotation set, number of results, and size of context.
• Middle - visualisation of annotations and values given the search query.
• Bottom - a list of the matches to the query across the corpus, giving the left and right contexts relative to the search results.
• Annotations can be added (green) or removed (red).
ANNIC - The GUI
ANNIC - Queries (a subset of JAPE)
• String
• {AnnotationType}
• {AnnotationType == String}
• {AnnotationType.feature == featureValue}
• {AnnotationType1, AnnotationType2.feature == featureValue}
• {AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue}
• Trandes - returns all occurrences of the string where it appears in the corpus.
ANNIC - Queries (a subset of JAPE)
• {Person} - returns annotations of type Person.
• {Token.string == "Microsoft"} - returns all occurrences of "Microsoft".
• {Person}({Token})*2{Organization} - returns Person followed by zero to two tokens followed by Organization.
• {Token.orth=="upperInitial", Organization} - returns Token with feature orth with value set to "upperInitial" and which is also annotated as Organization.
• {Token.string=="Trandes"}({Token})*10{Secret} - returns the string "Trandes" followed by zero to ten tokens followed by Secret.
• {Token.string =="not"}({Token})*4{Secret}
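What a window query like {Token.string=="not"}({Token})*4{Secret} computes can be sketched directly: find an anchor token, allow up to n intervening tokens, then require a token carrying the target annotation. The token list and "Secret" terms below are illustrative, not ANNIC internals.

```python
# Sketch of an ANNIC-style window query: anchor token, then at most
# max_gap intervening tokens, then a token from the Secret concept list.

SECRET_TERMS = {"secret", "confidential", "secrecy"}

def window_matches(tokens, anchor, max_gap):
    # Collect (start, end) index pairs: anchor at start, Secret term at end,
    # with at most max_gap tokens in between.
    spans = []
    for i, tok in enumerate(tokens):
        if tok.lower() != anchor:
            continue
        for j in range(i + 1, min(i + 1 + max_gap + 1, len(tokens))):
            if tokens[j].lower() in SECRET_TERMS:
                spans.append((i, j))
                break
    return spans

tokens = "The formula was not in fact kept secret by Trandes".split()
print(window_matches(tokens, "not", 4))
```

Queries like this help spot negated contexts ("not ... secret") that a plain co-occurrence rule would wrongly count as a factor hit.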
ANNIC - Example
ANNIC - Example
Case Details
We would like to annotate a range of case details such as:
•

Case citation

Names of parties

Roles of parties

Sort of court

Names of judges

Names of attorneys

Final decision....
•
Look at some of this and relate to ontologies.
•
DSACaseInfo.xgapp
California Criminal Cases
California Criminal Cases
California Criminal Cases
Onto Root Gazetteer
• A plug in tool which uses an ontology as a gazetteer. The ontology can be created and modified in GATE. Can add individuals. Some steps in automating ontology creation and population.
• Using the ontology, one can query, draw inferences, and write rules with another tool (Protege).
• How to plug in, load, and run.
• An example -- CBR-OWL.xgapp
• Check ontology and add individuals.
Onto Root Gazetteer
• Links text to an ontology by creating Lookup annotations which come from the ontology.
• Richly structured.
• Relates textual and ontological information by adding instances.
• One gets richer annotations that can be used for further processing.
Onto Root Gazetteer
• Add the Onto Root Gazetteer plug in.
• Add the Ontology Tools.
• Create (or load) an ontology with OWLIM. This is the ontology that is the language resource that is then used by Onto Root Gazetteer. Suppose this ontology is called myOntology.
• OWLIM can only use OWL-Lite ontologies.
• Create processing resources with default parameters:
  - Document Reset PR
  - RegEx Sentence Splitter
  - ANNIE English Tokeniser
  - ANNIE POS Tagger
  - GATE Morphological Analyser
Onto Root Gazetteer
Create an Onto Root Gazetteer PR and initialise as:
  - Ontology: select previously created myOntology.
  - Tokeniser: select previously created Tokeniser.
  - POSTagger: select previously created POS Tagger.
  - Morpher: select previously created Morpher.
Create a Flexible Gazetteer PR. Select previously created OntoRootGazetteer for gazetteerInst. For inputFeatureNames, click on the button on the right and, when prompted with a window, add 'Token.root' in the provided text box, then click the Add button. Click OK, give a name to the new PR (optional) and then click OK.
Onto Root Gazetteer
• Create an application: right click on Applications, New -> Pipeline (or Corpus Pipeline).
• Add the following PRs to the application in this order:
  - Document Reset PR
  - RegEx Sentence Splitter
  - ANNIE English Tokeniser
  - ANNIE POS Tagger
  - GATE Morphological Analyser
  - Flexible Gazetteer
• Run the application over the selected corpus.
• Inspect the results. Look at the Annotation Set with Lookup and also the Annotation List to see how the annotations appear. NB: this is not currently working this way.
Onto Root Gazetteer
• Editing the ontology (using the tools in GATE to add classes, subclasses, etc).
• Annotating the texts manually with respect to the ontology (highlighting a string and hovering brings up a menu).
• Adding instances to the ontology (have a choice to add instances).
• The ontology can then be exported into an ontology editor (Protege) and used for reasoning. Not shown.
Onto Root Gazetteer
Onto Root Gazetteer
Content in Gazette Notices
Not glamorous, but useful.
www.london-gazette.co.uk - search "insolvency"
Content in Notices
C&C/Boxer – Motivations and Objectives
• Fine-grained syntactic parsing – can identify not only
parts of speech, but grammatical roles (subject, object)
and phrases (e.g. verb plus direct object is verb phrase).
• Contributes to NL to RDF/OWL translation – individual
entities, data and object properties?
• Input to semantic interpretation in FOL – test for
consistency, support inference, allow rule extraction.
C&C/Boxer
• C&C is a parser based on combinatory categorial grammar (CCG).
• Boxer provides a semantic interpretation, given the parse. The semantic interpretation is a form of first order logic - discourse representation theory.
• Needs some manipulation. The parser outputs the 'best' parse, but that might not be what one wants; the semantic representation might need to be selected.
C&C/Boxer
• Try it out at: http://svn.ask.it.usyd.edu.au/trac/candc
• Various representations - C&C, Graphic, XML Parse, Prolog.
• Not perfect (or even clear), but a step in the right direction and something concrete to build on.
C&C/Boxer - Parse
C&C/Boxer - DRT
∀x [man'(x) → happy'(x)]
Dynamic, so the assignment function can grow with context.
C&C/Boxer - Prolog
A woman who is born in the United Kingdom after commencement of the act is
happy.
A woman who is born in the United Kingdom after commencement of the act is
a British citizen if her mother is a British citizen when she is born.
Other Topics
• Controlled Languages
  - An expressive subset of grammatical constructions and lexicon.
  - Guided input so only well-formed, unambiguous expressions result.
  - Translation to FOL.
• Machine Learning
  - Annotating a set of documents to make a 'gold standard'.
  - Train the system on the gold standard and unannotated documents.
  - Test accuracy and adjust.
  - No information on how the algorithm works.
Evaluation
• An evaluation sheet was given out at the start. It would be helpful to get feedback with comments, questions, suggestions, ideas....
Conclusions
• Different approaches to mark up.
• Burdens of initial analysis, coding, and labour.
• Top-down is far ahead of bottom-up, but this is a matter of focus of research effort.
• Converging, complementary, integrated approaches.
• Potential to enrich annotations further for finer-grained information.
References
Manu Konchady (2008) “Building Search Applications:
Lucene, LingPipe, and Gate”.
Graham Wilcock (2009) “Introduction to Linguistic
Annotation and Analytics Technologies”.
Bransford-Koons, “Dynamic semantic annotation of
California case law”, MSc Thesis, San Diego State
University.
Thakker, Osman, and Lakin JAPE Tutorial.
Thanks
• For your attention!
• To Phil Gooch, Hazzaz Imtiaz, and Emile de Maat for discussion.
• To the London GATE User's Group.
• To the GATE Community and discussion list.