Transcript Dependency Annotation of English
The Prague (Czech-)English Dependency Treebank
Jan Hajič
Charles University in Prague, Computer Science School, Institute of Formal and Applied Linguistics
Major contributions by:
E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký)
C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek
Today...
• The family of Prague Dependency Treebanks
  – Incl. the Prague (Czech-)English Dependency Treebank
• English “Tectogrammatical Representation” (TR)
  – Annotation layers
  – From Penn Treebank (et al.) to PDT-style English tectogrammatics
  – TR annotation of 5 interesting English phenomena
• The annotation process
  – TrEd, EngVallex and the current status
• To take home + pointers
June 8, 2009 Dependency Workshop Boulder, CO Czech-English Dependency Treebank
The Family of Prague Dependency Treebanks
• Prague Dependency Treebank (Czech)
  – 2001: version 1.0 (no deep syntax/semantics)
  – 2006: version 2.0 (w/deep syntax, semantics)
• Prague Czech-English Dependency TB 1.0
  – 2004: automatic annotation
  – English: PTB, Czech: 1/3rd of PTB translated
• Prague Arabic Dependency Treebank 1.0
  – 2004: ~ PDT 1.0 (no deep syntax)
The Prague Czech-English Dependency Treebank
• Penn Treebank
  + PropBank
  + BBN (co-reference and Named Entities)
  + NP structure (D. Vadas, J. R. Curran, ACL’07)
  + “Czech-like” tectogrammatics
• Translation to Czech
  – Manual annotation (with auto pre-annotation)
    • Morphology, Syntax, Tectogrammatics (TR)
Example: English TR
• Words
• Dependencies
• Sem. function
• Valency (predicates)
• Coref (BBN)
• Named Entities (BBN)
Layers of Annotation
• t-layer – tectogrammatics
• a-layer – (surface) syntax
• m-layer – Morphology (POS)
• w-layer – words (tokens)
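One way to picture the four layers is as records that each point one layer down. The sketch below is only an illustration: the class and field names (WNode, MNode, ANode, TNode, a_nodes, surface_forms) and the labels AuxV and Pred used in the example are assumptions, not the treebank's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WNode:
    """w-layer: one token as it appears in the text."""
    form: str


@dataclass
class MNode:
    """m-layer: lemma and POS tag for one token."""
    w: WNode
    lemma: str
    pos: str


@dataclass
class ANode:
    """a-layer: one surface-syntax node per token."""
    m: MNode
    afun: str                              # syntactic function label
    parent: Optional["ANode"] = None


@dataclass
class TNode:
    """t-layer: one tectogrammatical node, linked down to the a-layer."""
    t_lemma: str
    functor: str                           # semantic function, e.g. "PAT"
    a_nodes: List[ANode] = field(default_factory=list)
    parent: Optional["TNode"] = None


def surface_forms(t: TNode) -> List[str]:
    """Follow the links t -> a -> m -> w and return the token forms."""
    return [a.m.w.form for a in t.a_nodes]


# Example (assumption): "will" and "join" both link to a single t-node.
will = ANode(MNode(WNode("will"), "will", "MD"), "AuxV")
join = ANode(MNode(WNode("join"), "join", "VB"), "Pred")
will.parent = join
pred = TNode("join", "PRED", [will, join])
print(surface_forms(pred))
```

Keeping the links explicit means any t-layer node can be traced back to the tokens it covers.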
English Surface Syntax
• From PTB:
  – Form
  – POS Tag
  – Function label
  – (Structure)
• Added
  – Lemma
  – Heads
Head Determination Rules
• Exhaustive set of rules
  – By J. Eisner + M. Čmejrek/J. Cuřín
  – 4000 rules (non-terminal based)
    • Ex.: (S (NP-SBJ VP .)) → VP
  – Additional rules
    • Coordination, Apposition
    • Punctuation (end-of-sentence, internal)
• Original idea (possibility of conversion)
  – J. Robinson (1960s)
Example: Head Determination Rules
[Figure: phrase-structure tree for “will join the board”, with the head word (join), (will), (the), (board) percolated up to each node]
Rules:
(NP (DT NN)) → NN
(VP (VB NP)) → VB
(VP (MD VP)) → VP
(S (… VP …)) → VP
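The way such rules turn a phrase-structure tree into dependencies can be sketched in a few lines. This is a minimal illustration, not the actual rule engine: the tuple encoding of trees, the names HEAD_RULES and find_head, and the exact-match lookup are assumptions, and only the three fully specified rules from the slide are used.

```python
# A tree is either a (POS, word) leaf or a (label, [children]) node.
# Rules map (parent label, child labels) to the label of the head child.
HEAD_RULES = {
    ("NP", ("DT", "NN")): "NN",
    ("VP", ("VB", "NP")): "VB",
    ("VP", ("MD", "VP")): "VP",
}


def is_leaf(tree):
    return isinstance(tree[1], str)


def find_head(tree, edges):
    """Return the head word of `tree`; append (dependent, head) pairs to `edges`.

    Words stand in for nodes here, so repeated words in one sentence would
    be ambiguous; a real converter would use token positions instead.
    """
    label, rest = tree
    if is_leaf(tree):
        return rest
    child_labels = tuple(child[0] for child in rest)
    head_label = HEAD_RULES[(label, child_labels)]  # KeyError if no rule matches
    child_heads = [find_head(child, edges) for child in rest]
    head_index = child_labels.index(head_label)
    head_word = child_heads[head_index]
    # Every non-head child's head word becomes a dependent of the head word.
    for i, word in enumerate(child_heads):
        if i != head_index:
            edges.append((word, head_word))
    return head_word


tree = ("VP", [("MD", "will"),
               ("VP", [("VB", "join"),
                       ("NP", [("DT", "the"), ("NN", "board")])])])
edges = []
root = find_head(tree, edges)
print(root, edges)
```

Each rule picks one child as the head; all other children attach to it, which is why a single head table suffices to fix the whole dependency structure.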
Conversion: Analytic Structure, Functions
• Syntactic Function assignment (conversion)
• Rules
  – based on PTB functional tags:
      -SBJ → Sb
      -LGS → Obj
      -LOC → Adv
      -TMP → Adv
      -PRD → Pnom
      -ADV → Adv
      -MNR → Adv
      -BNF → Obj
      -DIR → Adv
      -PRP → Adv
      -DTV → Obj
      -EXT → Adv
      -PUT → Adv
  – Ad-hoc rules (if functional tags missing)
  – Lemmatization (years → year)
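The tag-based part of this conversion amounts to a table lookup. A minimal sketch follows, assuming node labels of the form category-TAG (for example NP-SBJ); the names PTB_TO_AFUN and afun_from_label are hypothetical, and taking the first listed tag when a label carries several is also an assumption.

```python
from typing import Optional

# PTB functional tag -> syntactic function, as listed on the slide.
PTB_TO_AFUN = {
    "SBJ": "Sb", "LGS": "Obj", "LOC": "Adv", "TMP": "Adv", "PRD": "Pnom",
    "ADV": "Adv", "MNR": "Adv", "BNF": "Obj", "DIR": "Adv", "PRP": "Adv",
    "DTV": "Obj", "EXT": "Adv", "PUT": "Adv",
}


def afun_from_label(label: str) -> Optional[str]:
    """Map a PTB node label such as 'NP-SBJ' or 'PP-LOC-2' to a function.

    Returns None when no listed functional tag is present; the ad-hoc
    rules that cover that case are not given on the slide.
    """
    for part in label.split("-")[1:]:   # skip the category, e.g. 'NP'
        if part in PTB_TO_AFUN:
            return PTB_TO_AFUN[part]    # first listed tag wins (assumption)
    return None


print(afun_from_label("NP-SBJ"), afun_from_label("PP-LOC-2"), afun_from_label("NP"))
```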
Syntactic Structure, Functions: PTB to P(E)DT
[Figure: three panels for “will join the board”: Penn Treebank structure (with heads added); PDT-like Analytic Representation; PDT-like Tectogrammatic Representation (automatic pre-annotation), with node labels PRED.Fut and PAT]
English TR I: Predicative Complement
• Free (non-valency) modification (of both a noun and a verb)
• attribute compl.rf (green arrow to the noun)
English TR II: Which + Relative Clause
We have not answered your question completely, for which we apologize.
English TR III: Coordination
English TR IV: Comparison
English TR V: Restriction (“Exclusion”)
except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides
English TR: (manual) annotation
• TrEd
  – Pre-annotated
  – Graphical
    • TR dep. tree is primary
  – Text + TR
  – Czech translation
• Valency (a.k.a. “propbanking”)
  – During TR annotation
  – Propbank origins and examples
    • Linked, displayed
EngVallex (give)
EngVallex Format (admit)
Interannotator Agreement
2007-2009:
- New annotators (lower numbers)
Annotation “by phenomenon”
- Restarting now
Prague English Dependency Treebank
• Availability
  – Version 1.0 now (PTB license needed)
    • 250k words
  – Full version (parallel with Czech): late 2010
• Size
  – Full WSJ portion of PTB (2312 files)
  – 49208 sentences, 1253013 tokens
  – Now: 17210 sentences (34.97%), 439983 tokens (35.11%)
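The coverage percentages follow directly from the counts given. A quick check, using only the figures from the slide (the variable names are arbitrary):

```python
total_sentences, total_tokens = 49208, 1253013   # full WSJ portion of PTB
done_sentences, done_tokens = 17210, 439983      # annotated so far

sentence_share = round(100 * done_sentences / total_sentences, 2)
token_share = round(100 * done_tokens / total_tokens, 2)
print(sentence_share, token_share)
```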
Czech PDT-style Annotation
• All layers
  – morphology, syntax, tectogrammatical
• So far…
  – Automatic (many tools by many authors)
• Manual annotation
  – In progress (28124 sentences/639326 words)
  – Top-down
    • Tectogrammatical first (lower layers automatically)
    • … then syntactic structure and morphology
Summary
• PDT is/has (a)…
  – (Family of) dependency-based treebanking project(s)
    • Czech (English, Arabic, ...)
  – ~ 1mil. words
    • sufficient size for ML experiments
  – 4 interlinked layers of annotation
    • token, morphology, syntax, deep syntax/semantics++
    • independent and “full” information at all levels
    • interlinked (for the development of parsers/generators)
  – Parallel corpus Cze <-> Eng -> Machine Translation
Pointers, Acknowledgements
• http://ufal.mff.cuni.cz/pedt
• http://ufal.mff.cuni.cz/pdt2.0
• http://ufal.mff.cuni.cz/~pajas/tred
• Acknowledgements
  – FP6-IST “Euromatrix”, FP7-IST “Euromatrix+”
  – LC536 (Center for Computational Linguistics)
  – GAČR 405/06/0589 (Speech and deep syntax)
  – MŠMT: MSM0021620838, ME838, ME09008