Alias I Linguistic Pipeline
Architecture, Algorithms & Applications
Bob Carpenter
Alias I, Inc.
[email protected]
Who is Alias-i?
• Spun out of 1995 U Penn Message Understanding
Conference (MUC-5) projects on coreference
• Founded in 2000 by Breck Baldwin as Baldwin
Language Technologies
• I’m the other technical employee as of 2003.
• Funded by the Defense Advanced Research Projects Agency
(DARPA) through the Translingual Information Detection,
Extraction and Summarization (TIDES) program and the Total,
er, Terrorist Information Awareness (TIA) program
• Targeting Research Analysts with Text Mining
• Based in Brooklyn (we love visitors)
Application: Threat Tracker Interface
• Intended for use by Information Analysts
– Analysts typically get short-term assignments and are asked to do
thorough reviews, producing reports at the end.
– Some analysts are assigned to track situations longer term.
– Use unstructured news feeds and standing collections as sources
– Basically, a lot like legal, medical or biological research
• Trackers Specify Structured Searchers & Gatherers
– Entities, Sub-trackers, Sample Documents, Saved Keyword
Searches, Alerts
– Allow addition of annotated documents making up a case
• Entities Specify
– Aliases
• Spelling, Language, Coreference Properties
– Properties
• Person (Gender), Place, Thing, Other
• Trackers Evaluated against real-time feeds and/or standing
collections
Tracker Example(s)
• Tracker: New York Yankees
– Entity: New York Yankees
• Aliases: Bronx bombers, …
• Properties: Organization
– Tracker: Yankee Players
• Entity: Joe Dimaggio
• Aliases: Joltin’ Joe, The Yankee Clipper, Joe D
• Properties: Person/male
• Entity: Babe Ruth
• …
– Entity: Yankee Stadium
• Aliases: The stadium, The house that Ruth built, …
• Properties: Facility
– Document: (The Onion) Steinbrenner corners free-agent market.
• Tracker: Sports
– Tracker: Baseball
• Tracker: Teams
– Tracker: NY Yankees
ThreatTracker Interface: Screenshot
• Screenshot callouts: ‘…’ indicates sentences removed because they
don’t mention the Entity; translation of the excerpt summary;
mentions of Vajpayee and Pakistan found by ThreatTrackers
ThreatTracker Architecture
• Client and web-container architecture
• Flexible Model-View-Controller (MVC)
• Components: ThreatTracker and Document Analysis
• Throughput: 20k words/sec; 250k docs / 1.5GB
LingPipe Document Analysis
• LingPipe implements (most of) Document Analysis
– XML, HTML and Plain Text input; (well-formed) XML output
– Tokenization
– Named-entity Extraction
– Sentence Boundary Detection
– Within-document Coreference
– Not yet released: cross-document coreference
• Dual Licensing
– “Open Source”
– Commercial
• 100% Pure Java (runs anywhere that runs Java)
• Quick start-up with sample scripts & Ant tasks
• Extensive JavaDoc
• API & command-line resources
• Production-quality code & unit testing
XML Handling: SAX Filters
• All input/output is handled through SAX filters
• Streams all I/O at the element level
• An org.xml.sax.ContentHandler receives callbacks:
– startElement(Element, Attributes); endElement(Element)
– startDocument(); endDocument()
– characters(char[] cs, int start, int length)
– And a whole lot more
– Not event-based, despite what everyone calls it
• SAX filters
– Same pattern as the Java stream filters (e.g. java.io.FilterInputStream)
– Allow chains of handlers to be combined
• Full XML Processing
– Entities, DTD validation, character sets, etc.
• Supplied filters tunable to input elements, or can be run on all
text content
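As a concrete illustration of the filter pattern, here is a minimal sketch of a SAX filter built on the standard org.xml.sax.helpers.XMLFilterImpl; the class and its behavior are illustrative only, not LingPipe’s own filters:

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical filter that counts characters() callbacks while streaming events downstream.
public class CharCountFilter extends XMLFilterImpl {
    private long charCount = 0L;

    public void characters(char[] cs, int start, int length) throws SAXException {
        charCount += length;                    // inspect content at the element level
        super.characters(cs, start, length);    // pass the event along the filter chain
    }

    public long charCount() { return charCount; }

    public static void main(String[] args) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        CharCountFilter filter = new CharCountFilter();
        filter.setParent(parser);               // chain: parser -> filter -> downstream handlers
        filter.parse(new InputSource(args[0]));
        System.out.println("characters seen: " + filter.charCount());
    }
}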
HTML & Plain Text Handling
• HTML run through CyberNeko’s HTML parser (NekoHTML)
– Implements org.xml.sax.XMLReader over HTML input
– HTML’s a mess, so you’re taking chances
• Plain Text Input
– Handled with SAX filter, with wrapper elements
– Text just sent to characters()
Tokenization
• General interface streams output
– Tokenizer(char[], int, int)
– String nextToken()
– String nextWhitespace()
– Whitespaces critical for reconstructing the original document
with tags in place
• Implementation for Indo-European
– Very fine-grained tokenization
– But try to keep numbers, alphanumerics, and compound
symbols together
• 555-1212; 100,000; ---; 40R
– Not “cheating” as in many pre-tokenized evaluations
– Break on most punctuation
• “Mr. Smith-Jones.” yields 6 tokens
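The whitespace methods matter because the token and whitespace streams interleave to recover the original text exactly. A hedged sketch of the idea (interface and class names here are illustrative, not LingPipe’s actual API):

// Alternating whitespace/token calls let a consumer rebuild the original character
// sequence exactly, inserting tags around tokens where needed.
public interface SimpleTokenizer {
    String nextToken();       // null when no tokens remain
    String nextWhitespace();  // whitespace before the next token ("" if none)
}

class Reconstructor {
    static String reconstruct(SimpleTokenizer tokenizer) {
        StringBuilder sb = new StringBuilder();
        while (true) {
            sb.append(tokenizer.nextWhitespace());   // keep whitespace so offsets line up
            String token = tokenizer.nextToken();
            if (token == null) break;                // trailing whitespace already appended
            sb.append(token);
        }
        return sb.toString();
    }
}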
Interfaces & “Abstract” Factories
• Interfaces allow flexible implementations of tokenizers
• Factories allow reflectively specified tokenizer creation
• TokenizerFactory interface (not an “abstract class”)
– Tokenizer createTokenizer(char[] cs, int start, int length);
• All APIs accept tokenizer factories for flexibility
• Reflection allows command-line specification
– -tokenizerFactory=fee.fi.fo.fum.TokenizerFactory
– Java’s Reflection API used to create the tokenizer factory
– Assumes nullary constructor for factory
• Named-entity extraction and string-matching also handled
with factories for flexible implementations
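A rough sketch of the reflective factory creation described above, reusing the SimpleTokenizer sketch from the previous slide (names are illustrative; LingPipe’s own interfaces differ in detail):

// Factory interface with a nullary-constructor requirement, instantiated by class name.
public interface TokenizerFactory {
    SimpleTokenizer createTokenizer(char[] cs, int start, int length);
}

class Factories {
    // className might come from a flag such as -tokenizerFactory=fee.fi.fo.fum.TokenizerFactory
    static TokenizerFactory createByName(String className) throws Exception {
        return (TokenizerFactory) Class.forName(className)
                                       .getDeclaredConstructor()
                                       .newInstance();   // assumes a nullary constructor
    }
}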
Named Entity Detection
• Balancing Speed With Efficiency
– 100K tokens/second runtime
• Windows XP
• 3GHz P4, 800MHz FSB, 2*10K ATA disks in RAID-0
• Sun’s JDK 1.4.2 on Windows XP
• -server mode
– .93 MUC7 F-score (more on scores later)
– Very low dynamic memory requirements due to streamed output
– Train 500K tokens, decode & score 50K tokens in 20-30 seconds
• Pipelined Extraction of Entities
– Speculative
– User-defined
– Pronouns
– Stop-list Filtering (not in LingPipe, but in ThreatTracker)
• User-defined Mentions, Pronouns & Stop list
– Specified in a dictionary
– Left-to-right, longest match (sketched in code below)
– Removes overlapping speculative mentions
– Stop list just removes complete matches
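A toy sketch of left-to-right, longest-match dictionary matching over tokens (a simple scan for clarity; the Aho-Corasick and prefix-hashing approaches discussed later are the efficient versions):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Toy left-to-right, longest-match matcher: dictionary maps space-joined token
// phrases to entity types; returns matched spans as {start, end} token offsets.
class DictionaryMatcher {
    static List<int[]> match(String[] tokens, Map<String, String> dictionary, int maxPhraseLen) {
        List<int[]> spans = new ArrayList<int[]>();
        int i = 0;
        while (i < tokens.length) {
            int matchEnd = -1;
            // try the longest candidate phrase first, shrinking until a dictionary hit
            for (int end = Math.min(tokens.length, i + maxPhraseLen); end > i; --end) {
                String phrase = String.join(" ", Arrays.copyOfRange(tokens, i, end));
                if (dictionary.containsKey(phrase)) { matchEnd = end; break; }
            }
            if (matchEnd > 0) {
                spans.add(new int[] { i, matchEnd });
                i = matchEnd;      // the longest match consumes its tokens
            } else {
                ++i;               // no dictionary entry starts here; advance one token
            }
        }
        return spans;
    }
}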
Speculative Named Entity Tagging
• Chunking as Tagging
– Convert a “parsing” problem to a tagging problem
– Assign ST_TAG, TAG and OUT to tokens
– INPUT: John Smith is in Washington.
– OUTPUT: John:ST_PERSON Smith:PERSON is:OUT in:OUT Washington:ST_LOCATION .:OUT
Statistical Named Entity Model
• Generative Statistical Model
– Find most likely tags given words
– ARGMAX_Ts P(Ts|Ws) = ARGMAX_Ts P(Ts,Ws)/P(Ws)
= ARGMAX_Ts P(Ts,Ws)
– Predict next word/tag pair based on previous word/tag pairs
– word trigram, tag bigram history
– Decompose into tag and lexical model
– P(w[n], t[n] | t[n-1], w[n-1], w[n-2])
  = P(t[n] | t[n-1], w[n-1], w[n-2])   [tag model]
  * P(w[n] | t[n], t[n-1], w[n-1])   [lexical model]
• State Tying for Lexical Model
– In P(w[n] | t[n], t[n-1], …), t[n-1] doesn’t differentiate TAG and ST_TAG
– Bigram model within category: P(w[n] | t[n], t[n-1], w[n-1], w[n-2]) ~ P(w[n] | t[n], w[n-1]) if t[n] = t[n-1]
– Unigram model cross category: P(w[n] | t[n], t[n-1], w[n-1], w[n-2]) ~ P(w[n] | t[n], t[n-1]) if t[n] != t[n-1]
Smoothing the Named Entity Model
• Witten-Bell smoothing
– Not as accurate as held-out estimation, but much simpler
– P’(E|C1,C2) = lambda(C1,C2) * P_ml(E|C1,C2) + (1 – lambda(C1,C2)) * P’(E|C1)   (sketched in code below)
– lambda(x) = events(x) / (events(x) + K * outcomes(x))
– Lexical Model: smooth to uniform vocab estimate
– Tag Model: tag given tag for well-formedness
• Category-based Smoothing of Unknown Tokens
– Assign lexical tokens to categories
• Capitalized, all-caps, alpha-numeric, number+period, etc.
– Replace unknown words with categories
• Result is not joint model of P(Ws,Ts)
• OK for maximizing P(Ts|Ws)
– No category-based smoothing of known tokens in history
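A small sketch of the Witten-Bell interpolation formula above (the count maps and the K value here are illustrative stand-ins, not LingPipe’s internal structures):

import java.util.Map;

// P'(E|C) = lambda(C) * P_ml(E|C) + (1 - lambda(C)) * P'(E|shorter C),
// with lambda(C) = events(C) / (events(C) + K * outcomes(C)).
class WittenBell {
    static final double K = 1.0;   // illustrative value for the K in the slide's formula

    static double estimate(String context, String outcome,
                           Map<String, Integer> contextCounts,       // events(context)
                           Map<String, Integer> jointCounts,         // count(context, outcome)
                           Map<String, Integer> outcomeTypeCounts,   // outcomes(context): distinct outcomes seen
                           double backoffEstimate) {                 // P'(outcome | shorter context)
        int events = contextCounts.getOrDefault(context, 0);
        if (events == 0) return backoffEstimate;                     // lambda = 0: back off entirely
        int outcomes = outcomeTypeCounts.getOrDefault(context, 0);
        double lambda = events / (events + K * outcomes);
        double mle = jointCounts.getOrDefault(context + "\t" + outcome, 0) / (double) events;
        return lambda * mle + (1.0 - lambda) * backoffEstimate;
    }
}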
Blending Dictionaries/Gazetteers
• Lexical and Tag models
• Given “John”:PERSON
– P(John|ST_PERSON) ++
• Given “John Smith”:PERSON
– P(Smith|PERSON,ST_PERSON,John) ++
– P(PERSON|ST_PERSON,John) ++
• Given “John Smith Junior”:PERSON
– P(Junior|PERSON,PERSON,Smith,John) ++
– P(PERSON|PERSON,Smith,John) ++
• Easier with pure language-model based system
Multi-lingual & Multi-genre Models
• Based on language segmentation for SpeechWorks
• Trained models for Hindi & English
– TIDES Surprise Language 2003
– Ported our ThreatTracker interface
– About ½-1% f-score hit for using multilingual model
• Models don’t interfere much
– P(w[n] | t[n], t[n-1], w[n-1])
• Until smoothing backs off to P(w[n] | t[n]), only Hindi context is used for a Hindi token when t[n], w[n-1] is known
– P(t[n] | t[n-1], w[n-1], w[n-2])
• Until smoothing to P(t[n] | t[n-1])
– Would probably help to model transitions on multi-lingual data and
expected quantity of each if not uniform
– As is, we just trained with all the data we had (400K toks/language)
– Not nearly as bad as HMMs for pronunciation variation
Named Entity Algorithms
• See Dan Gusfield’s book: Algorithms on Strings and Trees
– Must read for non-statistical string algorithms
– Also great intro to suffix trees and computational biology
• Theoretically linear in input text size * tag set size
– Beam search greatly reduces the dependence on tag set size
– Smoothing ST_TAG and TAG reduces contexts by half
• Dictionary-based tagging
– Aho-Corasick Algorithm is linear asymptotically
• Trie with suffix-to-prefix matching
– Actually more efficient to just hash prefixes for short strings
• Statistical Model Decoding
– Simple dynamic programming (often called “Viterbi”)
• Only keep best analysis for outcome given history
• Outcomes are tags, and only bigram tag history
– Lattice slicing for constant memory allocation (vs. full lattice)
• Allocate a pair of arrays sized by tags and re-use per token
• Still need backpointers, but in practice, very deterministic
– Rely on Java’s Garbage Collection
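A compact sketch of the dynamic program described above: bigram tag histories, a pair of reused score arrays sliced per token, and backpointers for the traceback. The Scorer stands in for the smoothed model, so this is a generic decoder, not LingPipe’s:

// Viterbi over a bigram tag history, keeping only two score arrays that are
// reused (sliced) per token, plus backpointers for recovering the best path.
class ViterbiSketch {

    interface Scorer {
        // log P(tag[n]=t, word[n] | tag[n-1]=prev, ...); a stand-in for the real model
        double logScore(int position, int tag, int prevTag);
    }

    static int[] decode(int numTokens, int numTags, Scorer scorer) {
        double[] prev = new double[numTags];
        double[] curr = new double[numTags];
        int[][] backpointers = new int[numTokens][numTags];

        for (int n = 0; n < numTokens; ++n) {
            for (int t = 0; t < numTags; ++t) {
                double best = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int p = 0; p < numTags; ++p) {
                    double score = (n == 0 ? 0.0 : prev[p]) + scorer.logScore(n, t, p);
                    if (score > best) { best = score; bestPrev = p; }
                }
                curr[t] = best;
                backpointers[n][t] = bestPrev;
            }
            double[] tmp = prev; prev = curr; curr = tmp;   // reuse the two arrays per token
        }

        // trace back from the best final tag
        int bestTag = 0;
        for (int t = 1; t < numTags; ++t)
            if (prev[t] > prev[bestTag]) bestTag = t;
        int[] tags = new int[numTokens];
        for (int n = numTokens - 1; n >= 0; --n) {
            tags[n] = bestTag;
            bestTag = backpointers[n][bestTag];
        }
        return tags;
    }
}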
So why’s it so slow?
• Limiting factor is memory to CPU bandwidth
– aka frontside bus (FSB)
– Determined by chipset, motherboard & memory
– Best Pentium FSB: 800MHz (vs 3.2GHz CPU)
– Best Xeon FSB: 533MHz
• Models are 2-15 MB, even pruned & packed
• CPU L2 cache sizes are 512K to 1MB
• Thus, most model lookups are cache misses
• Same issue as database paging, only closer to the CPU
Packing Models into Memory
• Based on SpeechWorks Language ID work
– Had to run on a handheld with multiple models
• Prune Low Counts
– Better to do relative-entropy-based pruning: eliminate estimates that are close to their smoothed values
• Symbol tables for tokens & 32-bit floating point
– At SPWX, mapped floats to 16-bit integers
• Trie-structure from general to specific contexts
– Only walk down until the context is found (lambda != 0.0)
– P(w[n] | t[n], t[n-1], w[n-1])
– Contexts: t[n], then t[n-1], then w[n-1]; each context node stores log(1 – lambda(context))
– Outcomes: w[n] at each context node, storing log P(w[n] | context)
• Array-based with binary search
– Binary search is very hard on memory with large arrays
– Better to hash low-order contexts, OK for smaller contexts
– I’m going to need the board for this one
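A rough sketch of one array-backed trie node matching the layout above, using binary search over sorted symbol IDs (the representation is illustrative; LingPipe’s packed models differ in detail):

import java.util.Arrays;

// Sketch of an array-backed trie node: symbol IDs are kept sorted so children
// and outcomes can be found by binary search; each node stores log(1 - lambda)
// for its context and log-probabilities for the outcomes seen in that context.
class TrieNode {
    final float oneMinusLambdaLog;   // log(1 - lambda(context)), 32-bit float to save space
    final int[] outcomeIds;          // sorted symbol IDs of outcomes w[n] seen in this context
    final float[] outcomeLogProbs;   // log P(w[n] | context), parallel to outcomeIds
    final int[] childIds;            // sorted symbol IDs extending the context
    final TrieNode[] children;       // parallel to childIds

    TrieNode(float oneMinusLambdaLog, int[] outcomeIds, float[] outcomeLogProbs,
             int[] childIds, TrieNode[] children) {
        this.oneMinusLambdaLog = oneMinusLambdaLog;
        this.outcomeIds = outcomeIds;
        this.outcomeLogProbs = outcomeLogProbs;
        this.childIds = childIds;
        this.children = children;
    }

    // Binary search for the outcome's log-probability; NaN if unseen in this context.
    double outcomeLogProb(int outcomeId) {
        int i = Arrays.binarySearch(outcomeIds, outcomeId);
        return i >= 0 ? outcomeLogProbs[i] : Double.NaN;
    }

    // Walk one symbol deeper (more specific context); null if the context is unseen.
    TrieNode child(int symbolId) {
        int i = Arrays.binarySearch(childIds, symbolId);
        return i >= 0 ? children[i] : null;
    }
}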
Named Entity Models and Accuracy
• Spanish News (CoNLL): P=.95, R=.96, F=.95
• English News (MUC7): P=.95, R=.92, F=.93
• Hindi News (TIDES SL): P=.89, R=.84, F=.86
• English Genomics (GENIA): P=.79, R=.79, F=.79
• Dutch News (CoNLL): P=.90, R=.68, F=.77
• All tested without Gazetteers
• All Caps models only 5-10% less accurate
Within-Document Coreference
• Mentions merged into mention chains
– Greedy left-to-right algorithm over mentions
• Computes match of mention vs. all previous mention chains
• No-match creates new mention chain
• Ties cause new mention chain (or can cause tighter match)
– Matching functions determined by entity type (PERSON,
ORGANIZATION, etc.)
• Generic matching functions for token-sensitive edit distance
– Next step is soundex style spelling variation
• Specialized matching for pronouns and gender
• Matching functions may depend on user-defined entities
providing thesaurus-like expansion (“Joe Dimaggio” and “Joltin’
Joe” or “the Yankee Clipper”)
• User-configurable matching based on entity type (e.g.
PROTEIN)
– Next step is to add contextual information
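A bare-bones sketch of the greedy left-to-right chaining just described; the Matcher interface and the threshold are placeholders for the type-specific matching functions:

import java.util.ArrayList;
import java.util.List;

// Greedy left-to-right within-document coreference: each mention is compared
// against all existing chains; the best-scoring chain above a threshold absorbs it,
// otherwise (no match, or a tie) a new chain is started.
class GreedyCoref {

    interface Matcher {
        // higher is better; implementations depend on entity type (PERSON, ORGANIZATION, ...)
        double score(String mention, List<String> chain);
    }

    static List<List<String>> chain(List<String> mentions, Matcher matcher, double threshold) {
        List<List<String>> chains = new ArrayList<List<String>>();
        for (String mention : mentions) {
            List<String> best = null;
            double bestScore = threshold;
            boolean tie = false;
            for (List<String> candidate : chains) {
                double score = matcher.score(mention, candidate);
                if (score > bestScore) { best = candidate; bestScore = score; tie = false; }
                else if (score == bestScore && best != null) { tie = true; }
            }
            if (best == null || tie) {           // no match, or a tie: start a new chain
                List<String> chainForMention = new ArrayList<String>();
                chainForMention.add(mention);
                chains.add(chainForMention);
            } else {
                best.add(mention);
            }
        }
        return chains;
    }
}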
Cross-Document Coreference
• Mention Chains merged into entities
– Greedy order-independent algorithm over mention chains
– Matching functions involve complex reasoning over the set of
mentions in a chain versus the sets of mentions in candidate entities.
– Matching involves properties of the mentions in the whole database
and degree of overlap
• “Joe” or “Bush” show up in too many entities to be good distinguishing
matchers
• Chain: “John Smith”, “Mr. Smith”, “Smith”
– Entity1: John Smith Jr., John Smith, John, Smith
– Entity 2: John Smith Sr., John Smith, Jack Smith, Senior
• Chain: “John James Smith”, “John Smith”
– Entity: John Smith, Smith, John K. Smith
• Only pipeline component that must run synchronously.
– Only takes 5% of pipeline processing time.
• Next Step (recreating Bagga/Baldwin): Contextual Information
Document Feed Web Service for DARPA
• HTTP Implementation of Publish/Subscribe.
– Implemented as Servlets.
• Subscribers submit URL to receive documents.
• Publishers submit binary documents.
– May be validated if the form is known; e.g. against an XML DTD.
• Subscribers receive all published documents via
HTTP.
– A more general implementation allows reception by topic.
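A minimal sketch of the publish/subscribe servlets (the /subscribe path, the url parameter, and the in-memory subscriber list are assumptions for illustration, not the DARPA service’s actual interface):

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative publish/subscribe over HTTP: subscribers register a callback URL,
// and every published document is re-POSTed to each subscriber.
public class FeedServlet extends HttpServlet {

    private static final List<String> SUBSCRIBERS = new CopyOnWriteArrayList<String>();

    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        if ("/subscribe".equals(req.getPathInfo())) {
            SUBSCRIBERS.add(req.getParameter("url"));   // subscriber submits its callback URL
            resp.setStatus(HttpServletResponse.SC_OK);
            return;
        }
        // otherwise treat the request body as a published binary document
        byte[] doc = req.getInputStream().readAllBytes();
        for (String subscriber : SUBSCRIBERS) {
            HttpURLConnection conn = (HttpURLConnection) new URL(subscriber).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            OutputStream out = conn.getOutputStream();
            out.write(doc);                              // deliver the document via HTTP
            out.close();
            conn.getResponseCode();                      // force the request; ignore the result here
        }
        resp.setStatus(HttpServletResponse.SC_OK);
    }
}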
What’s next?
• Goal is total recall, with highest possible precision
• Finding “spelling” variations of names
– Suffix Trees
– Edit Distance (weighted by spelling variation; sketched below)
– Cross-linguistically (pronunciation transduction)
– Context (weighted keyword in context)
– Over 100K newswire articles
• Name structure
– Nicknames: Robert:Bob
– Acronyms: International Business Machines:IBM
– Abbreviations: Bob Co:Bob Corporation
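For the edit-distance item above, a sketch of the classic unit-cost dynamic program; a weighted variant would replace the unit costs with costs tuned to observed spelling variation:

// Classic dynamic-programming edit distance; a weighted variant would replace the
// unit costs below with costs learned from observed spelling variation.
class EditDistance {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); ++i) d[i][0] = i;    // deletions
        for (int j = 0; j <= b.length(); ++j) d[0][j] = j;    // insertions
        for (int i = 1; i <= a.length(); ++i) {
            for (int j = 1; j <= b.length(); ++j) {
                int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // delete from a
                                            d[i][j - 1] + 1),    // insert into a
                                   d[i - 1][j - 1] + subst);     // substitute or match
            }
        }
        return d[a.length()][b.length()];
    }
}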
Analyzed Document Format
<!ELEMENT DOCUMENT (P)*>
<!ATTLIST DOCUMENT
uri CDATA #REQUIRED
source CDATA #REQUIRED
language CDATA #REQUIRED
title CDATA #REQUIRED
classification CDATA "UNCLASSIFIED"
date CDATA #REQUIRED>
<!ELEMENT P (S)*>
<!-- Analysis adds rest of data to input document -->
<!ELEMENT S (#PCDATA | ENAMEX)*>
<!ELEMENT ENAMEX (#PCDATA)>
<!ATTLIST ENAMEX
id CDATA #REQUIRED
type CDATA #REQUIRED>