SProUT
Shallow Processing with Unification
and Typed Feature Structures
Jakub Piskorski
Language Technology Lab
DFKI GmbH
Jakub Piskorski
Warszawa, 10.01.2003
Shallow Text Processing
[Figure: Shallow Text Processing Components, the structures they extract, and the tasks and application areas they support]
Extracted structures: tokens, word stems, phrases, clause structure, named entities, domain-specific patterns
Tasks: text mining, information extraction, Q/A systems, document indexing/retrieval, text classification, automatic database construction, template generation, term association extraction, building ontologies
Benefits noted in the figure: concept indices and more accurate queries, fine-grained concept matching, semi-structured data
Outer labels: text documents, multi-agents, executive information systems, e-commerce, workflow management, data warehousing
Finite-State based approaches
SPPC – purely finite-state-based STP with a small number of basic predicates
SMES – predicates inspect arbitrary properties of the input tokens/fragments
FASTUS – uses CPSL (Common Pattern Specification Language)
GATE – uses JAPE (Java Annotation Patterns Engine)
Motivation for SProUT
One System for Multilingual and Domain Adaptive Shallow Text Processing
Trade-off between efficiency and expressiveness
Modularity
Flexible integration of different processing modules
Portability
Industrial standards
SProUT is a joint work by:
Markus Becker, Witold Drożdżyński, Ulrich Krieger
Jakub Piskorski, Ulrich Schäfer, Feiyu Xu
SProUT Architecture
[Figure: SProUT architecture. Components shown: input data; linguistic processing resources; lexical resources; JTFS; stream of text items; XTDL grammar; regular compiler; extended optimized finite-state network; XTDL interpreter; structured output data; finite-state machine toolkit; grammar development environment; online processing]
Core Components – FSM Toolkit
Finite-state Machine Toolkit for building, combining,
and optimizing finite-state devices
Finite-state Machine model: FSA, WFSA, FST, WFST
Arbitrary real-valued semirings
Some new crucial STP-relevant operations
(e.g., incremental construction of minimal deterministic FSAs)
Functionality similar to AT&T tools
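To make the role of semirings in weighted automata concrete, here is a minimal Java sketch. It is not the toolkit's API: the names SemiringSketch, Semiring, pathWeight and totalWeight are hypothetical, and paths are given directly as arrays of arc weights instead of being read from an automaton. The weight of a path is the semiring product of its arc weights, and the weight of a string is the semiring sum over all of its accepting paths.

```java
/** Hypothetical sketch of real-valued semirings for weighted automata. */
final class SemiringSketch {

    /** A semiring over doubles: sum, product, and their neutral elements. */
    interface Semiring {
        double plus(double a, double b);
        double times(double a, double b);
        double zero();
        double one();
    }

    /** Tropical semiring (min, +): the total weight is the cheapest path. */
    static final Semiring TROPICAL = new Semiring() {
        public double plus(double a, double b) { return Math.min(a, b); }
        public double times(double a, double b) { return a + b; }
        public double zero() { return Double.POSITIVE_INFINITY; }
        public double one() { return 0.0; }
    };

    /** Probability semiring (+, *): the total weight sums path probabilities. */
    static final Semiring PROBABILITY = new Semiring() {
        public double plus(double a, double b) { return a + b; }
        public double times(double a, double b) { return a * b; }
        public double zero() { return 0.0; }
        public double one() { return 1.0; }
    };

    /** Semiring product of the arc weights along one path. */
    static double pathWeight(Semiring s, double[] arcWeights) {
        double w = s.one();
        for (double arc : arcWeights) {
            w = s.times(w, arc);
        }
        return w;
    }

    /** Semiring sum of the weights of all given paths. */
    static double totalWeight(Semiring s, double[][] paths) {
        double total = s.zero();
        for (double[] path : paths) {
            total = s.plus(total, pathWeight(s, path));
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] paths = { {1.0, 2.0}, {0.5, 1.0} };
        System.out.println(totalWeight(TROPICAL, paths));
        System.out.println(totalWeight(PROBABILITY, paths));
    }
}
```

Swapping the semiring changes what the same automaton computes, which is the reason a toolkit would keep the semiring as a parameter.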
Core Components – Regular Compiler
Definition and configuration via XML
Unicode compatible
Extendible set of circa 20 operations
Scanner definitions vs. general regular expressions
Biasing optimization process
Various ways of handling ambiguities
Direct database connection for flexible pattern-based transformation of
linguistic resources into optimized FS representation
Regular expressions over TFSs (SProUT) with restrictions
Core Components – Typed Feature Structure Package
JAVA implementation of TFSs
Efficient unification operations
Dynamic extension of the type hierarchy
Other operations: subsumption checking, deep copying, path selection,
feature iteration, and various printers
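Unification and subsumption can be sketched in a few lines of Java. This is a strong simplification and not the JTFS implementation: feature structures are untyped nested maps with string atoms, there is no type hierarchy and no coreference, and the names FeatureStructureSketch, unify and subsumes are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: untyped feature structures as nested maps. */
final class FeatureStructureSketch {

    /**
     * Unifies two values. Atoms are Strings, complex structures are Maps.
     * Returns the unified value, or null if the two values clash.
     * The result shares unmodified substructures with its inputs.
     */
    @SuppressWarnings("unchecked")
    static Object unify(Object a, Object b) {
        if (a instanceof Map && b instanceof Map) {
            Map<String, Object> result = new LinkedHashMap<>((Map<String, Object>) a);
            for (Map.Entry<String, Object> e : ((Map<String, Object>) b).entrySet()) {
                if (result.containsKey(e.getKey())) {
                    // Feature present on both sides: unify the values recursively.
                    Object u = unify(result.get(e.getKey()), e.getValue());
                    if (u == null) {
                        return null;
                    }
                    result.put(e.getKey(), u);
                } else {
                    // Feature only in b: take it over unchanged.
                    result.put(e.getKey(), e.getValue());
                }
            }
            return result;
        }
        // Two atoms (or an atom and a structure) unify only if they are equal.
        return a.equals(b) ? a : null;
    }

    /** True if every piece of information in general is also in specific. */
    @SuppressWarnings("unchecked")
    static boolean subsumes(Object general, Object specific) {
        if (general instanceof Map && specific instanceof Map) {
            Map<String, Object> g = (Map<String, Object>) general;
            Map<String, Object> s = (Map<String, Object>) specific;
            for (Map.Entry<String, Object> e : g.entrySet()) {
                if (!s.containsKey(e.getKey())
                        || !subsumes(e.getValue(), s.get(e.getKey()))) {
                    return false;
                }
            }
            return true;
        }
        return general.equals(specific);
    }

    public static void main(String[] args) {
        Object left = Map.of("POS", "Adjective", "INFL", Map.of("CASE", "dat"));
        Object right = Map.of("INFL", Map.of("NUMBER", "sing"));
        Object unified = unify(left, right);
        System.out.println(unified);
        System.out.println(subsumes(right, unified));
        System.out.println(unify(left, Map.of("INFL", Map.of("CASE", "nom"))));
    }
}
```

In this sketch a successful unification result is subsumed by both inputs, which is the property the interpreter relies on when it merges rule and input information.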
XTDL Formalism
Combines typed feature structures (TFS) and regular expressions, including
coreferences and functional application
XTDL grammar rules – production part on LHS, and output description on RHS
TDL used for establishment of a type hierarchy of linguistic entities
Example type definition:
morph := sign & [POS atom,
                 STEM atom,
                 INFL infl]
[Figure: excerpt of the type hierarchy below *top*, containing the types atom, *avm*, *rule*, tense, present, sign, token, infl, morph, index-avm, lang, de, en, tokentype, separator, url]
XTDL Formalism
A couple of standard regular operators:
concatenation          (juxtaposition)
disjunction            |
optionality            ?
Kleene star            *
Kleene plus            +
n-fold repetition      {n}
m-n span repetition    {m,n}
Unidirectional coreference under Kleene star (and restricted iteration)
[POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]
XTDL Formalism
loc-pp :>
    morph & [POS Prep & #preposition,
             INFL [CASE #1, NUMBER #2, GENDER #3]]
    morph & [POS Determiner,
             INFL [CASE #1, NUMBER #2, GENDER #3]] ?
    morph & [POS Adjective,
             INFL [CASE #1, NUMBER #2, GENDER #3]] *
    gazetteer & [TYPE general-location,
                 SURFACE #location]
    -> [CAT location-pp,
        PREP #preposition,
        LOCATION #location].
XTDL Interpreter
1. Matching of regular patterns using unifiability (LHS)
2. LHS Pattern instance creation
3. Unification of the rule instance and matched input
Longest match strategy
Ambiguities allowed
Interpreter generates TFSs as output (cascaded architecture)
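The longest-match strategy can be illustrated with a minimal Java sketch. It is not the SProUT interpreter: tokens are flat attribute maps instead of typed feature structures, a token matches an element when it carries every required feature value (a simple stand-in for unifiability), coreferences are ignored, and the names LongestMatchSketch, Element, Quant and longestEnd are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: longest match of a simple pattern over a token stream. */
final class LongestMatchSketch {

    /** Quantifiers for: exactly one, optional (?), and Kleene star (*). */
    enum Quant { ONE, OPTIONAL, STAR }

    /** One pattern element: required feature values plus a quantifier. */
    static final class Element {
        final Map<String, String> required;
        final Quant quant;

        Element(Map<String, String> required, Quant quant) {
            this.required = required;
            this.quant = quant;
        }
    }

    /** A token matches if it carries every required feature value. */
    static boolean matches(Map<String, String> token, Map<String, String> required) {
        for (Map.Entry<String, String> e : required.entrySet()) {
            if (!e.getValue().equals(token.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the largest token index at which a match of the pattern
     * (from patPos on) starting at tokenPos can end, or -1 if none exists.
     */
    static int longestEnd(List<Element> pattern, int patPos,
                          List<Map<String, String>> tokens, int tokenPos) {
        if (patPos == pattern.size()) {
            return tokenPos; // whole pattern consumed
        }
        Element el = pattern.get(patPos);
        int best = -1;
        if (tokenPos < tokens.size() && matches(tokens.get(tokenPos), el.required)) {
            // Consume one token; a starred element may match again.
            int nextPat = el.quant == Quant.STAR ? patPos : patPos + 1;
            best = longestEnd(pattern, nextPat, tokens, tokenPos + 1);
        }
        if (el.quant != Quant.ONE) {
            // Optional and starred elements may also match nothing.
            best = Math.max(best, longestEnd(pattern, patPos + 1, tokens, tokenPos));
        }
        return best;
    }

    /** Runs a pattern shaped like loc-pp on the tokens of "im sonnigen Rom". */
    static int demo() {
        List<Element> pattern = List.of(
            new Element(Map.of("POS", "Prep"), Quant.ONE),
            new Element(Map.of("POS", "Determiner"), Quant.OPTIONAL),
            new Element(Map.of("POS", "Adjective"), Quant.STAR),
            new Element(Map.of("TYPE", "general-location"), Quant.ONE));
        List<Map<String, String>> tokens = List.of(
            Map.of("SURFACE", "im", "POS", "Prep"),
            Map.of("SURFACE", "sonnigen", "POS", "Adjective"),
            Map.of("SURFACE", "Rom", "TYPE", "general-location"));
        return longestEnd(pattern, 0, tokens, 0);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 3: all three tokens are covered
    }
}
```

The recursion explores both the "consume" and the "skip" branch of each quantified element and keeps the larger end position, so the result is the longest match even where a greedy scan would stop short.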
XTDL Interpreter
Matched input sequence “im sonnigen Rom” (in sunny Rome)
[Figure: the matched input as a sequence of three feature structures under the IN feature of a rule structure:
 morph [SURFACE im, STEM im, POS Prep, INFL infl [CASE, NUMBER, GENDER]]
 morph [SURFACE sonnigen, STEM sonnig, POS Adjective, INFL infl [CASE, NUMBER, GENDER]]
 gazetteer [SURFACE Rom, TYPE general-location]
 INFL values appearing in the figure: nom, case, plural, number, fem, gender]
XTDL Interpreter
Rule with an instantiated pattern on the LHS
rule
[IN  < morph [POS #5 Prep,
              INFL infl [CASE #1, NUMBER #2, GENDER #3]],
       morph [POS Adjective,
              INFL infl [CASE #1, NUMBER #2, GENDER #3]],
       gazetteer [TYPE general-location,
                  SURFACE #4] >,
 OUT phrase [CAT location-np,
             PREP #5,
             LOCATION #4]]
XTDL formalism
Unified result
rule
[IN  < morph [SURFACE im, STEM im, POS #5 Prep,
              INFL infl [CASE #1 dat, NUMBER #2 sing, GENDER #3 neut]],
       morph [SURFACE sonnigen, STEM sonnig, POS Adjective,
              INFL infl [CASE #1, NUMBER #2, GENDER #3]],
       gazetteer [SURFACE #4 Rom, TYPE general-location] >,
 OUT phrase [CAT location-np,
             PREP #5,
             LOCATION #4]]
Linguistic Processing Resources
Tokenization with fine-grained token classification
Gazetteer (static named-entity lexica)
Morphology
Full-form lexica obtained from ‘compactified’ MMORPH:
English   200,000 entries
German    830,000 entries
French    225,000 entries
Spanish   570,000 entries
Italian   330,000 entries
+ Shallow Compound Recognition
Asian Languages:
Chinese – Shanxi
Japanese – Chasen
Other:
Czech – HMM-based Part-of-Speech Tagging + Morphology
System Description Language
Construction of a concrete system instance via definition of a
regular expression of module specifications
Sequence M1 M2: output of M1 serves as the input to M2
Iteration M*: fixpoint computation
Parallelism M1 M2: quasi-parallel computation of independent modules
All linguistic modules must implement a specific JAVA interface
Automatic compilation of system description into a single JAVA class
System Description Language
(M1 M2)(input)
    M1.clearState();
    M1.setInput(input);
    M1.setOutput(M1.computeOutput(M1.getInput()));
    M2.clearState();
    M2.setInput(mediateSeq(M1,M2));
    M2.setOutput(M2.computeOutput(M2.getInput()));
    return M2.getOutput();
(M*)(input)
    M.clearState();
    M.setInput(input);
    M.setOutput(mediateFix(M));
    return M.getOutput();
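The two expansions above can be approximated by a small, self-contained Java sketch. It makes several assumptions that the slides do not confirm: the interface ProcessingModule and the combinators seq and fix are hypothetical names, modules are stateless functions, input and output share one type, and mediation between modules is the identity. The fixpoint loop also assumes that the module eventually stops changing its input.

```java
/** Hypothetical sketch of sequence and fixpoint combinators for modules. */
final class SdlSketch {

    /** Assumed module interface; the actual SProUT interface is not shown. */
    interface ProcessingModule<T> {
        T computeOutput(T input);
    }

    /** Sequence: the output of m1 serves as the input to m2. */
    static <T> ProcessingModule<T> seq(ProcessingModule<T> m1, ProcessingModule<T> m2) {
        return input -> m2.computeOutput(m1.computeOutput(input));
    }

    /** Iteration: apply m repeatedly until the output no longer changes. */
    static <T> ProcessingModule<T> fix(ProcessingModule<T> m) {
        return input -> {
            T current = input;
            T next = m.computeOutput(current);
            while (!next.equals(current)) {
                current = next;
                next = m.computeOutput(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        // Toy modules on strings stand in for linguistic modules.
        ProcessingModule<String> trim = String::trim;
        ProcessingModule<String> squeeze = s -> s.replace("  ", " ");
        ProcessingModule<String> system = seq(trim, fix(squeeze));
        System.out.println(system.computeOutput("  im    sonnigen  Rom "));
    }
}
```

Because each combinator returns a module again, a whole system description collapses into one composed object, which mirrors the idea of compiling the description into a single JAVA class.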
Future Work
Optimization of grammar interpretation
Various search strategies
Additional linguistic processing resources
Real data testing: large grammars and real-world texts