A Robust Shallow Parser for Swedish

Ola Knutsson, Johnny Bigert, Viggo Kann
Royal Institute of Technology, Sweden
Introduction
What is robustness?
Robust against noisy, ill-formed and partial natural language data
Shallow parsing
Many NLP applications do not need full parsing
Shallow parsing:
• A parsing approach
• Pre-processing for full parsing
A collection of techniques
Abney - finite state cascades (1991)
Currently, much attention is on machine learning (ML)
Well suited to modularization
Chunking and phrase identification
Common modules in a shallow parser:
• Tokenizer
• PoS-tagger
• Chunker
• Phrase identifier
• Grammatical function identifier
Chunking
[NP Den mycket gamla mannen][VC gillade][NP mat]
Phrase identification
[NP Den [AP mycket gamla] mannen][VC gillade][NP mat]
Parsers for Swedish
Full parsers: UCP (Sågvall Hein) and SLE (Gambäck)
Shallow parsers (phrase structure): CassSwe (Kokkinakis) and Megyesi using machine learning
Dependency: CG (Birn) and FDG (Voutilainen)
Granska Text Analyzer (GTA)
Hand-crafted rules
Context-free backbone
Partly object-oriented notation
Major Phrase Categories
NP: Han såg den lilla mannen på bänken
VC: Han har spelat kort hela natten
PP: Han såg spår i sanden
AP: Han ogillade små vita lögner
ADVP: Han vill inte gå på bio.
INFP: Han tycker om att spela
Clause Boundary Identification
Based on Ejerhed’s algorithm
Context-sensitive rules
Using only PoS information
Different kinds of rules
GTA contains 260 rules:
200 identify phrase structure
20 identify clause boundaries
40 are selection rules (disambiguation)
Example rule, [NP den lilla bilen]
NPmin@
{
X(wordcl=dt | wordcl=hd | wordcl=rg),
X2(wordcl=ab | wordcl=rg)?,
Y(wordcl=jj | wordcl=ro | wordcl=pc)*,
Z(wordcl=nn)
-->
action(help, wordcl:=Z.wordcl, pnf:=undef,
gender:=Z.gender, num:=Z.num,
spec:=Z.spec, case:=Z.case)
}
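The pattern part of the NPmin rule reads as: one determiner-like word, an optional adverb-like word, any number of adjective-like words, then a noun. The Python sketch below mimics only that pattern over a list of word-class tags. It is not GTA's rule engine; the function name `match_npmin` and the set names are hypothetical, and the feature copying done by the rule's action is not modelled.

```python
# Word classes copied from the rule; the set names are illustrative only.
FIRST = {"dt", "hd", "rg"}     # X: required first word
PRE = {"ab", "rg"}             # X2: optional
MOD = {"jj", "ro", "pc"}       # Y: zero or more
HEAD = {"nn"}                  # Z: required noun


def match_npmin(wordclasses, start=0):
    """Return the end index (exclusive) of a match beginning at `start`, or None."""
    i, n = start, len(wordclasses)
    if i >= n or wordclasses[i] not in FIRST:
        return None
    i += 1
    if i < n and wordclasses[i] in PRE:      # X2?
        i += 1
    while i < n and wordclasses[i] in MOD:   # Y*
        i += 1
    if i < n and wordclasses[i] in HEAD:     # Z
        return i + 1
    return None


# "den lilla bilen", assuming the tags dt jj nn
print(match_npmin(["dt", "jj", "nn"]))       # 3
```

Greedy matching is enough here because the word classes allowed for X2, Y and Z do not overlap, so no backtracking is needed.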
Clause boundary rule
cl@
{
V(sed!=sen & text!="som" & wordcl!=sn),
X((wordcl=pn & pnf=sub)| (wordcl=pm & case=nom) |
(wordcl=nn & case=nom & V.case!=gen) |
wordcl=ab),
---endleftcontext---,
Y(wordcl=kn),
---beginrightcontext---,
Y2(((wordcl=pn & pnf=sub) |
(wordcl=pm & case=nom) |
(wordcl=nn & case=nom) |
wordcl=ab) &
wordcl=X.wordcl),
Z(wordcl=vb &
(vbf=prs | vbf=prt | vbf=imp))
-->
action(help, wordcl:=Y.wordcl)
}
The Tetris Algorithm
[PP till general]
[PP till general Claes]
[NP general Claes Olsson]
[NP Fänrik Ax]
[VC gav]
[NP boken]
[PP till general Claes Olsson]
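The slide lists overlapping candidate phrases for "Fänrik Ax gav boken till general Claes Olsson" but does not spell out how the final ones are chosen. The sketch below shows one plausible reading only, a greedy longest-match selection; GTA's actual Tetris algorithm may differ (for example in how it treats nested phrases). The function name `select_longest` and the (label, start, end) span encoding are assumptions made for illustration.

```python
def select_longest(candidates):
    """Keep the longest spans first and skip any span overlapping a kept one."""
    kept = []
    # Longest first; ties broken by start position to stay deterministic.
    ordered = sorted(candidates, key=lambda c: (-(c[2] - c[1]), c[1]))
    for label, start, end in ordered:
        if all(end <= s or start >= e for _, s, e in kept):
            kept.append((label, start, end))
    return sorted(kept, key=lambda c: c[1])


words = "Fänrik Ax gav boken till general Claes Olsson".split()
candidates = [
    ("PP", 4, 6),   # till general
    ("PP", 4, 7),   # till general Claes
    ("NP", 5, 8),   # general Claes Olsson
    ("NP", 0, 2),   # Fänrik Ax
    ("VC", 2, 3),   # gav
    ("NP", 3, 4),   # boken
    ("PP", 4, 8),   # till general Claes Olsson
]

for label, start, end in select_longest(candidates):
    print(f"[{label} {' '.join(words[start:end])}]")
```

Under these assumptions the four printed phrases are [NP Fänrik Ax], [VC gav], [NP boken] and [PP till general Claes Olsson]; the shorter PP and NP candidates are dropped because they overlap the longest PP.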
The IOB format
Ramshaw and Marcus (1995)
A phrase/clause tag contains two parts:
1. Phrase/clause type, e.g. NP, PP
2. One of two position tags:
I = Inside a phrase/clause
B = Beginning of a phrase/clause
A word that does not belong to any phrase is tagged:
O = Outside
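To make the tag scheme concrete, the sketch below converts the bracket notation used earlier in these slides into per-word tags. It assumes, based on the example tables in this deck, that a word inside nested phrases gets one tag per level, innermost first, joined with "|". The function name `brackets_to_iob` is hypothetical and this is not GTA code.

```python
import re


def brackets_to_iob(text):
    """Turn '[NP Den [AP mycket gamla] mannen]' into (word, tag) pairs."""
    tokens = re.findall(r"\[\w+|\]|[^\s\[\]]+", text)
    stack = []     # open phrases, outermost first: [label, first_word_seen]
    result = []
    for tok in tokens:
        if tok.startswith("["):
            stack.append([tok[1:], False])
        elif tok == "]":
            stack.pop()
        else:
            parts = []
            for frame in reversed(stack):        # innermost phrase first
                parts.append(frame[0] + ("I" if frame[1] else "B"))
                frame[1] = True
            result.append((tok, "|".join(parts) or "O"))
    return result


for word, tag in brackets_to_iob("[NP Den [AP mycket gamla] mannen][VC gillade][NP mat]"):
    print(word, tag)
# Den NPB
# mycket APB|NPI
# gamla API|NPI
# mannen NPI
# gillade VCB
# mat NPB
```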
Disagreement error
Word         PoS tag                         Phrase tag  Clause tag
De           dt.utr/neu.plu.def              NPB         CLB
gamla        jj.pos.utr/neu.plu.ind/def.nom  APB|NPI     CLI
äppelträdet  nn.neu.sin.def.nom              NPI         CLI
kan          vb.prs.akt.mod                  VCB         CLI
bli          vb.inf.akt.kop                  VCI         CLI
som          kn                              O           CLI
nya          jj.pos.utr/neu.plu.ind/def.nom  APB         CLI
.            mad                             O           CLI
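The phrase tags in this example can be read back into bracket notation. The sketch below does that for the phrase column only (clause tags are ignored), assuming that tags joined with "|" list the innermost phrase first. The function name `iob_to_brackets` is hypothetical and the decoding strategy is an illustration, not GTA's own code.

```python
def iob_to_brackets(rows):
    """rows: (word, tag) pairs, with tags such as 'NPB', 'APB|NPI' or 'O'."""
    pieces = []    # output fragments, joined with spaces at the end
    stack = []     # labels of the currently open phrases, outermost first

    def close_down_to(depth):
        while len(stack) > depth:
            stack.pop()
            pieces[-1] += "]"

    for word, tag in rows:
        # Reverse so that the outermost phrase comes first.
        levels = [] if tag == "O" else [(t[:-1], t[-1]) for t in reversed(tag.split("|"))]
        keep = 0   # number of open phrases that this word continues
        for label, flag in levels:
            if keep < len(stack) and stack[keep] == label and flag == "I":
                keep += 1
            else:
                break
        close_down_to(keep)
        for label, _ in levels[keep:]:
            stack.append(label)
            pieces.append("[" + label)
        pieces.append(word)
    close_down_to(0)
    return " ".join(pieces)


rows = [
    ("De", "NPB"), ("gamla", "APB|NPI"), ("äppelträdet", "NPI"),
    ("kan", "VCB"), ("bli", "VCI"), ("som", "O"), ("nya", "APB"), (".", "O"),
]
print(iob_to_brackets(rows))
# [NP De [AP gamla] äppelträdet] [VC kan bli] som [AP nya] .
```

The output shows that the noun phrase is still recognized as a single NP even though the determiner and adjective are plural and the noun is singular.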
Partial input
Word                 PoS tag             Phrase tag  Clause tag
Arrangör             nn.utr.sin.ind.nom  NPB         CLB
var                  vb.prt.akt.kop      VCB         CLI
Järfälla             pm.gen              NPB|NPB     CLI
naturskyddsförening  nn.utr.sin.ind.nom  NPB|NPI     CLI
där                  ab                  ADVPB       CLI
är                   vb.prs.akt.kop      VCB         CLI
medlem               nn.utr.sin.ind.nom  NPB         CLI
.                    mad                 O           CLI
Noisy data
Word         PoS tag                 Phrase tag     Clause tag
Inte         ab                      APB            CLB
så           ab                      ADVPB|APB|API  CLI
tjck         jj.pos.utr.sin.ind.nom  APB|API|API    CLI
som          ha                      O              CLB
det          pn.neu.sin.def.sub/obj  NPB            CLI
ofta         ab.pos                  ADVPB          CLI
står         vb.prs.akt              VCB            CLI
i            pp                      PPB            CLI
lärobökerna  nn.utr.plu.def.nom      NPB|PPI        CLI
;            mid                     O              CLI
Word order violation
Word        PoS tag                 Phrase tag  Clause tag
Ympkvisten  nn.utr.sin.def.nom      NPB         CLB
inte        ab                      ADVPB       CLI
ska         vb.prs.akt.mod          VCB         CLI
vara        vb.inf.akt.kop          VCI         CLI
sådär       ab                      ADVPB|APB   CLI
lång        jj.pos.utr.sin.ind.nom  APB         CLI
,           mid                     O           CLI
Evaluation
Manually corrected output from GTA
Untuned GTA in the evaluation
15 000 words from SUC
5 genres
F-scores for individual phrase types
Type   Accuracy  Count
ADVP   81.9      1008
AP     91.3      1332
INFP   81.9      512
NP     91.4      6895
O      94.4      2449
PP     95.3      3886
VC     92.9      2562
Total  88.7
F-score for clause boundary identification
Tagger   F-score
UNIGRAM  84.2
BRILL    87.3
TNT      88.3
F-score for a baseline identifier was 69.0%
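The slides do not say exactly how the F-scores were computed. As a reminder of the measure, the sketch below assumes the standard balanced F-score (harmonic mean of precision and recall) over sets of predicted and gold clause-start positions; the function name `f_score` and the use of word indices as items are assumptions for illustration.

```python
def f_score(gold, predicted):
    """Balanced F-score over two collections of items, e.g. clause-start indices."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# In the "Noisy data" example, clauses begin at word 0 (Inte) and word 3 (som).
gold_starts = {0, 3}
print(f_score(gold_starts, {0, 3}))   # 1.0
print(f_score(gold_starts, {0}))      # precision 1.0, recall 0.5
```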
Applications of GTA
We are using GTA in:
Grammar checking, statistical and rule-based
Clustering of medical texts
CALL-systems
What do you want to do with GTA?
More information
www.nada.kth.se/theory/projects/xcheck
Contact: Ola Knutsson