SProUT
Shallow Processing with Unification
and Typed Feature Structures
Jakub Piskorski
Language Technology Lab
DFKI GmbH
Jakub Piskorski
Warszawa, 6.10.2003
Information Extraction
Munich, February 18, 1997, Siemens AG and The General Electric Company (GEC),
London, have merged their UK private communication systems and networks
activities to form a new company, Siemens GEC Communication Systems Limited.
JOINT-VENTURE FOUNDATION EVENT
VENTURE: Siemens GEC Communication Systems Limited
PARTNERS: Siemens AG, The General Electric
TIME: February 18 1997
PRODUCT/SERVICE: communication systems, networks activities
LOCATION: Munich
Finite-State based approaches
 SPPC - pure finite-state based STP, small number of basic predicates
 SMES – predicates inspect arbitrary properties of the input tokens/fragments
 FASTUS – uses CPSL (Common Pattern Specification Language)
 GATE – uses JAPE (Java Annotation Patterns Engine)
Motivation for SProUT
 One System for Multilingual and Domain Adaptive Shallow Text Processing
 Trade-off between efficiency and expressiveness
 Modularity
 Flexible integration of different processing modules
 Portability
 Industrial standards
Credits
SProUT is a joint work by:
Witold Drożdżyński,
Ulrich Krieger,
Jakub Piskorski,
Ulrich Schäfer,
Feiyu Xu
SProUT Architecture
[Architecture diagram. Recoverable components: input data; linguistic processing resources; lexical resources; JTFS; stream of text items; XTDL grammar; regular compiler; extended optimized finite-state network; XTDL interpreter; structured output data; finite-state machine toolkit; grammar development environment; online processing.]
Core Components – FSM Toolkit
 Finite-state Machine Toolkit for building, combining,
and optimizing finite-state devices
 Finite-state Machine model: FSA, WFSA, FST, WFST
 Arbitrary real-valued semirings
 Some new crucial STP-relevant operations
(e.g., incremental construction of minimal deterministic FSAs)
 Various memory models
 Functionality similar to AT&T tools
Core Components – Regular Compiler
 Definition and configuration via XML
 Unicode compatible
 Extendible set of circa 20 operations
 Scanner definitions vs. general regular expressions
 Biasing optimization process
 Various ways of handling ambiguities
 Direct database connection for flexible pattern-based transformation of
linguistic resources into optimized FS representation
 Regular expressions over TFSs (SProUT) with restrictions
Core Components – Typed Feature Structure Package
 JAVA implementation of TFSs
 Efficient unification operations
 Dynamic extension of the type hierarchy
 Other operations: subsumption checking, deep copying, path selection,
feature iteration, and various printers
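To make the unification and subsumption operations listed above more concrete, here is a minimal sketch in Java. It is not the JTFS API: the class `TfsSketch`, its toy tree-shaped type hierarchy, and the method names are assumptions made purely for illustration, and coreferences are left out.

```java
import java.util.Map;
import java.util.TreeMap;

final class TfsSketch {
    // Hypothetical toy hierarchy (child -> parent); "top" is the root.
    static final Map<String, String> PARENT = Map.of(
        "case", "top", "dat", "case", "nom", "case",
        "number", "top", "sing", "number", "plural", "number");

    /** True if type a is equal to, or more general than, type b. */
    static boolean subsumes(String a, String b) {
        for (String t = b; t != null; t = PARENT.get(t)) {
            if (t.equals(a)) return true;
        }
        return false;
    }

    /** Greatest lower bound in a tree-shaped hierarchy, or null if the types clash. */
    static String glb(String a, String b) {
        if (subsumes(a, b)) return b;
        if (subsumes(b, a)) return a;
        return null;
    }

    /** A typed feature structure without coreferences (a simplification). */
    static final class Fs {
        final String type;
        final Map<String, Fs> feats = new TreeMap<>();
        Fs(String type) { this.type = type; }
        Fs with(String feature, Fs value) { feats.put(feature, value); return this; }
        @Override public String toString() {
            return feats.isEmpty() ? type : type + feats;
        }
    }

    /** Non-destructive unification; returns null on failure. */
    static Fs unify(Fs a, Fs b) {
        String type = glb(a.type, b.type);
        if (type == null) return null;
        Fs result = new Fs(type);
        result.feats.putAll(a.feats); // sub-structures are shared, not deep-copied
        for (Map.Entry<String, Fs> e : b.feats.entrySet()) {
            Fs existing = result.feats.get(e.getKey());
            Fs value = existing == null ? e.getValue() : unify(existing, e.getValue());
            if (value == null) return null;
            result.feats.put(e.getKey(), value);
        }
        return result;
    }
}
```

Unifying `infl[CASE dat, NUMBER number]` with `infl[CASE case, NUMBER sing]` in this sketch yields a structure with CASE dat and NUMBER sing, while `dat` and `nom` fail to unify because neither subsumes the other.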
XTDL Formalism
 Combines typed feature structures (TFS) and regular expressions, including
coreferences and functional application
 XTDL grammar rules – production part on LHS, and output description on RHS
 TDL used for establishment of a type hierarchy of linguistic entities
morph := sign & [POS atom,
                 STEM atom,
                 INFL infl]
[Type hierarchy diagram. Types shown: *top*, atom, *avm*, *rule*, sign, token, morph, infl, tense, present, index-avm, lang, de, en, tokentype, separator, url.]
XTDL Formalism
 Standard regular operators:
concatenation (patterns written side by side)
disjunction |
Kleene star *
Kleene plus +
optionality ?
n-fold repetition {n}
m-n span repetition {m,n}
 Unidirectional coreference under Kleene star (and restricted iteration)
[POS Det, ...] ([POS Adj, ..., RELN %LIST])* [POS Noun, ...] -> [..., RELN %LIST]
XTDL Formalism
loc-pp :>
morph & [POS Prep & #preposition,
INFL [CASE #1, NUMBER #2, GENDER #3]]
morph & [POS Determiner,
INFL [CASE #1, NUMBER #2, GENDER #3]] ?
morph & [POS Adjective,
INFL [CASE #1, NUMBER #2, GENDER #3]] *
gazetteer & [TYPE general-location,
SURFACE #location]
-> [CAT location-pp,
PREP #preposition,
LOCATION #location].
XTDL Interpreter
1. Matching of regular patterns using unifiability (LHS)
2. LHS Pattern instance creation
3. Unification of the rule instance and matched input
 Longest match strategy
 Ambiguities allowed
 Interpreter generates TFSs as output (cascaded architecture)
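The longest-match strategy with ambiguities allowed can be sketched as follows. This is a simplified illustration, not the SProUT interpreter: the `Match` class and the method `longestAt` are hypothetical names, and candidate matches are assumed to be already computed as token spans.

```java
import java.util.ArrayList;
import java.util.List;

final class LongestMatchSketch {
    /** A candidate rule match over the token span [start, end); end is exclusive. */
    static final class Match {
        final String rule;
        final int start;
        final int end;
        Match(String rule, int start, int end) {
            this.rule = rule;
            this.start = start;
            this.end = end;
        }
        int length() { return end - start; }
    }

    /** Returns every match of maximal length that starts at the given position. */
    static List<Match> longestAt(List<Match> candidates, int position) {
        int best = 0;
        List<Match> result = new ArrayList<>();
        for (Match m : candidates) {
            if (m.start != position || m.length() < best) continue;
            if (m.length() > best) {
                best = m.length();
                result.clear(); // a strictly longer match supersedes earlier ones
            }
            result.add(m); // equally long matches are kept as ambiguities
        }
        return result;
    }
}
```

Returning a list rather than a single match is the design choice that keeps ambiguities: two rules covering the same longest span both survive.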
XTDL Interpreter
 Matched input sequence “im sonnigen Rom” (in sunny Rome)
rule & [IN < morph & [SURFACE im,
                      STEM im,
                      POS Prep,
                      INFL infl & [CASE nom, NUMBER plural, GENDER fem]],
             morph & [SURFACE sonnigen,
                      STEM sonnig,
                      POS Adjective,
                      INFL infl & [CASE case, NUMBER number, GENDER gender]],
             gazetteer & [SURFACE Rom,
                          TYPE general-location] >]
XTDL Interpreter
 Rule with an instantiated pattern on the LHS
rule & [IN < morph & [POS #5 Prep,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             morph & [POS Adjective,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             gazetteer & [TYPE general-location,
                          SURFACE #4] >,
        OUT phrase & [CAT location-np,
                      PREP #5,
                      LOCATION #4]]
XTDL formalism
 Unified result
rule & [IN < morph & [SURFACE im,
                      STEM im,
                      POS #5 Prep,
                      INFL infl & [CASE #1 dat, NUMBER #2 sing, GENDER #3 neut]],
             morph & [SURFACE sonnigen,
                      STEM sonnig,
                      POS Adjective,
                      INFL infl & [CASE #1, NUMBER #2, GENDER #3]],
             gazetteer & [SURFACE #4 Rom,
                          TYPE general-location] >,
        OUT phrase & [CAT location-np,
                      PREP #5,
                      LOCATION #4]]
Linguistic Processing Resources
 Tokenization
 Gazetteer
 Extended Gazetteer
 Morphology
 Sentence Splitter
 Reference Matcher
Tokenization
 Text segmentation into tokens
 Fine-grained token classification (ca. 30 types)
complex_compound_first_capital : AT&T-Chief
 Token postsegmentation
‘<a,b>’ → ‘<’ ‘a’ ‘,’ ‘b’ ‘>’
 Token subclassification information
contains_position_sufix : AT&T-Chief
[START : 25,
 END : 34,
 MAIN : first_capital,
 SUB : < [LANG : german, DOM : any, SPEC : has_noun_ending],
         [LANG : english, DOM : any, SPEC : has_noun_ending] >]
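The postsegmentation step, which splits a token such as ‘<a,b>’ into its parts, might look roughly like the sketch below. The separator set and the name `postSegment` are assumptions for illustration; the slides do not say which characters the SProUT tokenizer actually treats as separators.

```java
import java.util.ArrayList;
import java.util.List;

final class PostSegmentationSketch {
    // Assumed separator characters; the real tokenizer's set is not given in the slides.
    static final String SEPARATORS = "<>,;()[]";

    /** Splits a token at separator characters, keeping each separator as its own token. */
    static List<String> postSegment(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (SEPARATORS.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    parts.add(current.toString());
                    current.setLength(0);
                }
                parts.add(String.valueOf(c));
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) parts.add(current.toString());
        return parts;
    }
}
```

With this separator set, `postSegment("<a,b>")` gives the five parts shown on the slide, while `AT&T-Chief` stays a single token because neither `&` nor `-` is treated as a separator here.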
Gazetteer/Extended Gazetteer
 for storing static named entities (e.g. locations) or keywords (e.g. company
designators, month names, etc.)
 Extended Gazetteer allows for associating entries with a list of arbitrary
attribute-value pairs (and uses path compression)
...
Warsaw | gaz_type:city | concept:Warsaw
Warszawa | gaz_type:city | concept:Warsaw
Varsovie | gaz_type:city | concept:Warsaw
...
 Case-sensitive/insensitive mode
 Unicode compatibility
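An extended gazetteer entry of the form shown above (surface form followed by attribute-value pairs) could be parsed and looked up as in this sketch. It swaps in a plain hash map for the finite-state representation with path compression that the slides mention, and `GazetteerSketch`, `addLine` and `lookup` are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class GazetteerSketch {
    private final Map<String, Map<String, String>> entries = new HashMap<>();
    private final boolean caseSensitive;

    GazetteerSketch(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    /** Normalizes the lookup key according to the chosen case mode. */
    private String key(String surface) {
        return caseSensitive ? surface : surface.toLowerCase(Locale.ROOT);
    }

    /** Parses a line such as "Warsaw | gaz_type:city | concept:Warsaw". */
    void addLine(String line) {
        String[] fields = line.split("\\|");
        Map<String, String> attributes = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String[] pair = fields[i].trim().split(":", 2);
            if (pair.length == 2) attributes.put(pair[0], pair[1]);
        }
        entries.put(key(fields[0].trim()), attributes);
    }

    /** Returns the attribute-value pairs of an entry, or an empty map if absent. */
    Map<String, String> lookup(String surface) {
        return entries.getOrDefault(key(surface), Collections.emptyMap());
    }
}
```

In case-insensitive mode, looking up `WARSZAWA` after adding the three lines above returns the attributes of the `Warszawa` entry, so all three spellings resolve to the same `concept` value.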
Morphology
 compactification of available full-form lexica
 external components implemented as server
Full-form lexica obtained from ‘compactified’ MMORPH:
English – 200,000 entries
German – 830,000 entries
French – 225,000 entries
Spanish – 570,000 entries
Italian – 330,000 entries
Dutch – ? entries
+ Shallow Compound Recognition
Shallow Compound Recognition
(under development)
Asian Languages:
Chinese – Shanxi
Japanese – Chasen
Other:
Czech – 600,000 entries + HMM-based Part-of-Speech Tagging
Polish – 120,000 lexemes (Morfeusz)
Lithuanian – Lemouklis
Russian – under acquisition
Morphology
 Compound Recognition & Segmentation for German
“Biergartenfest”
[Bier [garten fest]] vs. [[Bier garten] fest]
“Wein“ + “sorten“ (wine types)
“Wein” + “s“ + “orten“ (wine places)
“Autoradiozubehör” (car radio equipment)
[The slide repeats this word six times; the segmentation markup is not preserved.]
 Next: Adaptation for processing Dutch compounds
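The segmentation ambiguity in the “Wein” + “sorten” versus “Wein” + “s” + “orten” example can be reproduced with a small exhaustive search. This is only a sketch under simplifying assumptions: the lexicon is a plain set of lower-cased words, “s” is the only linking element considered, and the names `CompoundSketch` and `segment` are hypothetical. It enumerates flat segmentations and does not compute the bracketing shown for “Biergartenfest”.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class CompoundSketch {
    /** Enumerates all ways to split word into lexicon entries, allowing a linking "s". */
    static List<List<String>> segment(String word, Set<String> lexicon) {
        List<List<String>> results = new ArrayList<>();
        collect(word, 0, lexicon, new ArrayList<>(), results);
        return results;
    }

    private static void collect(String word, int from, Set<String> lexicon,
                                List<String> prefix, List<List<String>> results) {
        if (from == word.length()) {
            results.add(new ArrayList<>(prefix));
            return;
        }
        for (int to = from + 1; to <= word.length(); to++) {
            String part = word.substring(from, to);
            if (!lexicon.contains(part)) continue;
            prefix.add(part);
            collect(word, to, lexicon, prefix, results);
            // Try a linking "s" only if at least one character follows it.
            if (to < word.length() - 1 && word.charAt(to) == 's') {
                prefix.add("s");
                collect(word, to + 1, lexicon, prefix, results);
                prefix.remove(prefix.size() - 1);
            }
            prefix.remove(prefix.size() - 1);
        }
    }
}
```

With the lexicon {wein, sorten, orten}, `segment("weinsorten", ...)` returns both readings from the slide, which is exactly the ambiguity a compound recognizer has to rank or filter.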
System Description Language
 Construction of a concrete system instance via definition of a
regular expression of module specifications
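M1 M2 (sequence): the output of M1 serves as the input to M2
M* (iteration): fixpoint computation
M1 M2 (parallelism): quasi-parallel computation of independent modules
[The infix operator symbols for sequence and parallelism are not preserved in this transcript.]
 All linguistic modules must implement a specific JAVA interface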
 Automatic compilation of system description into a single JAVA class
System Description Language
(M1 M2)(input)
M1.clearState();
M1.setInput(input);
M1.setOutput(M1.computeOutput(M1.getInput()));
M2.clearState();
M2.setInput(mediateSeq(M1,M2));
M2.setOutput(M2.computeOutput(M2.getInput()));
return M2.getOutput();
(M*)(input)
M.clearState();
M.setInput(input);
M.setOutput(mediateFix(M));
return M.getOutput();
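The effect of the generated code above can be approximated with two higher-order functions. This is a sketch, not the SDL compiler output: modules are modelled as `UnaryOperator` values instead of the module interface used by SProUT, and `seq` and `fix` are hypothetical stand-ins for what `mediateSeq` and `mediateFix` appear to do.

```java
import java.util.function.UnaryOperator;

final class SdlSketch {
    /** Sequence: the output of the first module is the input of the second. */
    static <T> UnaryOperator<T> seq(UnaryOperator<T> first, UnaryOperator<T> second) {
        return input -> second.apply(first.apply(input));
    }

    /**
     * Fixpoint: reapply the module until its output equals its input.
     * Does not terminate if the module never reaches a fixpoint.
     */
    static <T> UnaryOperator<T> fix(UnaryOperator<T> module) {
        return input -> {
            T current = input;
            while (true) {
                T next = module.apply(current);
                if (next.equals(current)) return next;
                current = next;
            }
        };
    }
}
```

For example, a module that replaces two blanks by one reaches its fixpoint once no double blank is left, and `seq` composes such a module with any other module of the same type.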
Optimization of Grammar processing
 Problem: TFSs treated as symbolic values by FSM Toolkit
 Sorting outgoing transitions from selected states
(transition hierarchy under subsumption)
- flat trees for bad-style grammars
 Extending transition hierarchy via additional nodes
[TOP]
[MORPH stem: ‘Prof.’]
[GAZETTEER type: X]
[TOKEN]
Optimization of Grammar processing
 Input text consisting of 32 520 words, 157 080 characters, 22 pages
+ English Grammar for NE (circa 700 transitions from the initial state)
 Run-time behaviour with Tokenizer/Gazetteer/Morphology:
before: overall: 17.7 seconds
candidate pattern selection: 11.6 seconds
now: overall: 13.2 seconds
candidate pattern selection: 6.9 seconds
Optimization of Grammar processing
 Using restrictions during compilation of XTDL grammars into FS-format
 ’Determinization under subsumption’ -> Approximation
 ’Expansion’ techniques for highly recursive grammars
Adapting SProUT to processing Polish
 Tokenization – trivial
 Morphology – integration of Morfeusz (Marcin Woliński)
 Part-of-speech Disambiguation - ?
 Gazetteer - several strategies:
- list all inflectional variants with additional morphological information
- interplay between gazetteer and morphology
- component for guessing morphological information of unknown words
 Grammar Adaptation
- provide additional information to control inflection by using
STEM attribute instead of SURFACE
Future Work
 Further work concerning optimization of grammar processing
 Various search strategies
 Additional linguistic processing resources
 Adapting to processing new languages
 Real data testing: large grammars and real-world texts
 Utilization in research and industrial projects
Examples – Simple grammar for person names
;; dummy rule for title
title :/ gazetteer & [SURFACE #title, GTYPE gaz_title] -> #title.
;; dummy rule for position
position :/ gazetteer & [SURFACE #position, GTYPE gaz_position] -> #position.
;; dummy rule for complex position, e.g. Direktor und CEO
complex_position :/
(gazetteer & [GTYPE gaz_position, SURFACE #pos1]
token & [SURFACE "und"]
gazetteer & [GTYPE gaz_position, SURFACE #pos2])
-> #position, where #position = Append(#pos1," ","und"," ",#pos2).
Examples – Simple grammar for person names
;; dummy rule for given name
given_name :/ gazetteer & [SURFACE #name, GTYPE gaz_given_name] -> #name.
;; dummy rule for name-suffix such as "Jr."
name_suffix :/
(token & [ SURFACE ","] ?)
(token & [ SURFACE "Jr" & #suffix ] | token & [ SURFACE "jr" & #suffix ])
(token & [ SURFACE "." ] ?)
-> #suffix.
;; dummy rule for initial "M." and middle name
initial :/
(gazetteer & [GTYPE gaz_initial, SURFACE #initial]
token & [SURFACE "."] ?)
-> #middle, where #middle = Append(#initial, ".").
Examples – Simple grammar for person names
;; dummy rule for infix like "van", "van der"
infix :/ gazetteer & [GTYPE gaz_name_infix, SURFACE #infix] -> #infix.
;; dummy rule for last name
last_name :/
token & [TYPE first_capital_word, SURFACE #name]
| token & [TYPE mixed_word_first_capital, SURFACE #name]
| token & [TYPE word_with_hyphen_first_capital, SURFACE #name]
| token & [TYPE word_with_apostrophee_first_capital, SURFACE #name]
-> #name.
;; dummy rule for last name with infix
last_name_with_infix :/
@seek(infix) & #infix
@seek(last_name) & #last_name
-> #last, where #last=Append(#infix," ",#last_name).
Examples – Simple grammar for person names
;; rule for person names, example: Direktor und CTO Prof. Dr. hab. Witold P. van der Berg, Jr.
person :>
((@seek(position) & #pos | @seek(complex_position) & #pos) token & [TYPE comma] ?)?
@seek(title) & #title ?
(@seek(given_name) & #given_name (@seek(given_name) & #given_name_extra ?)
| (@seek(initial) & #given_name))
@seek(initial) & #middle1 ?
@seek(initial) & #middle2 ?
(@seek(last_name) & #last_name | @seek(last_name_with_infix) & #last_name)
@seek(name_suffix) & #suffix ?
-> ne-person & [GIVEN_NAME #first_name,
TITLE #title,
SURNAME #last_name,
P-POSITION #pos,
NAME-SUFFIX #suffix],
where #first_name = ConcWithBlanks(#given_name,#given_name_extra,#middle1,#middle2).
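The functional operators used on the right-hand sides of these rules, `Append` and `ConcWithBlanks`, might behave like the Java sketch below. Their actual SProUT implementation is not shown in the slides; in particular, skipping unbound (here: null or empty) arguments in `concWithBlanks` is an assumption, made because several of the rule variables are bound only by optional pattern elements.

```java
import java.util.StringJoiner;

final class FunctionalOperatorsSketch {
    /** Concatenates the arguments in the given order without separators. */
    static String append(String... parts) {
        return String.join("", parts);
    }

    /** Joins all bound (non-null, non-empty) arguments with single blanks. */
    static String concWithBlanks(String... parts) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String part : parts) {
            if (part != null && !part.isEmpty()) joiner.add(part);
        }
        return joiner.toString();
    }
}
```

Under that assumption, a given name with one initial and no second given name, such as `concWithBlanks("Witold", null, "P.", null)`, produces `Witold P.` without doubled or trailing blanks.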
Examples – Embedding rules
simple_noun_phrase :> .................
-> phrase & [CAT np,
SURFACE #info,
AGR [N #n,
C #c,
G #g]], where #info=..........
simple_event :> @seek(person) & #person
morph & [POS verb, STEM #action]
@seek(simple_noun_phrase) & [SURFACE #info]
-> [PERSON #person, ACTION #action, OBJECT #info].