Dependency Annotation of English

The Prague (Czech-)English Dependency Treebank

Jan Hajič

Charles University in Prague
Computer Science School
Institute of Formal and Applied Linguistics

Major contributions by:
E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký)
C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek

Today...

• The family of Prague Dependency Treebanks
  – Incl. the Prague (Czech-)English Dependency Treebank
• English “Tectogrammatical Representation” (TR)
  – Annotation layers
  – From Penn Treebank (et al.) to PDT-style English tectogrammatics
  – TR annotation of 5 interesting English phenomena
• The annotation process
  – TrEd, EngVallex and the current status
• To take home + pointers

June 8, 2009 Dependency Workshop Boulder, CO Czech-English Dependency Treebank

The Family of Prague Dependency Treebanks

• Prague Dependency Treebank (Czech)
  – 2001: version 1.0 (no deep syntax/semantics)
  – 2006: version 2.0 (w/ deep syntax, semantics)
• Prague Czech-English Dependency TB 1.0
  – 2004: automatic annotation
  – English: PTB, Czech: 1/3 of PTB translated
• Prague Arabic Dependency Treebank 1.0
  – 2004: ~ PDT 1.0 (no deep syntax)

The Prague Czech-English Dependency Treebank

• Penn Treebank
  + PropBank
  + BBN (co-reference and Named Entities)
  + NP structure (D. Vadas, J. R. Curran, ACL’07)
  + “Czech-like” tectogrammatics
• Translation to Czech
  – Manual annotation (with auto pre-annotation)
    • Morphology, Syntax, Tectogrammatics (TR)

Example: English TR

[Figure: example English TR tree showing:]
• Words
• Dependencies
• Sem. function
• Valency (predicates)
• Coref (BBN)
• Named Entities (BBN)

Layers of Annotation

• t-layer – tectogrammatics
• a-layer – (surface) syntax
• m-layer – morphology (POS)
• w-layer – words (tokens)
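
The four interlinked layers can be pictured as records in which each layer points to the one below it. The sketch below is illustrative only: the class names, field names and the `surface_forms` helper are hypothetical and do not reflect the treebank's actual file format, and the analytic functions given to the function words are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WToken:
    """w-layer: one token of the original text."""
    form: str


@dataclass
class MNode:
    """m-layer: lemma and POS tag for one w-layer token."""
    w: WToken
    lemma: str
    pos: str


@dataclass
class ANode:
    """a-layer: surface-syntax node linked to one m-layer node."""
    m: MNode
    afun: str
    parent: Optional["ANode"] = None


@dataclass
class TNode:
    """t-layer: deep-syntax node; may link to several a-layer nodes."""
    t_lemma: str
    functor: str
    a_nodes: List[ANode] = field(default_factory=list)
    parent: Optional["TNode"] = None


def surface_forms(t: TNode) -> List[str]:
    """Follow the links t -> a -> m -> w and return the token forms."""
    return [a.m.w.form for a in t.a_nodes]


# "will join the board"; afun values for the function words are assumptions.
will, join, the, board = (
    MNode(WToken(form), lemma, pos)
    for form, lemma, pos in [("will", "will", "MD"), ("join", "join", "VB"),
                             ("the", "the", "DT"), ("board", "board", "NN")]
)
a_join = ANode(join, "Pred")
a_will = ANode(will, "AuxV", parent=a_join)
a_board = ANode(board, "Obj", parent=a_join)
a_the = ANode(the, "AuxA", parent=a_board)

# Function words get no t-node of their own; they hang off a content word.
t_join = TNode("join", "PRED", [a_will, a_join])
t_board = TNode("board", "PAT", [a_the, a_board], parent=t_join)

print(surface_forms(t_join))
print(surface_forms(t_board))
```

Keeping the links explicit is what lets a tool move from a deep-syntax node back to the exact tokens it covers.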

English Surface Syntax

• From PTB:
  – Form
  – POS Tag
  – Function label
  – (Structure)
• Added
  – Lemma
  – Heads

Head Determination Rules

• Exhaustive set of rules
  – By J. Eisner + M. Čmejrek/J. Cuřín
  – 4000 rules (non-terminal based)
    • Ex.: (S (NP-SBJ VP .)) → VP
  – Additional rules
    • Coordination, Apposition
    • Punctuation (end-of-sentence, internal)
• Original idea (possibility of conversion)
  – J. Robinson (1960s)

Example: Head Determination Rules

[Figure: constituent tree over the words (will) (join) (the) (board), with the head word, (join) or (board), marked at each node]

Rules:
(NP (DT NN)) → NN
(VP (VB NP)) → VB
(VP (MD VP)) → VP
(S (… VP …)) → VP
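
The four rules on this slide are enough to turn the example fragment into dependencies. The following is a minimal sketch, not the actual 4000-rule converter: the tuple-based tree encoding, the `RULES` and `WILDCARD` tables and the function names are all assumptions made for illustration.

```python
# Toy head-rule conversion using only the four rules shown on the slide.
# Trees are nested tuples: (label, child, ...) and leaves are (POS, word).

RULES = {
    ("NP", ("DT", "NN")): "NN",
    ("VP", ("VB", "NP")): "VB",
    ("VP", ("MD", "VP")): "VP",
}
# (S (… VP …)) → VP: whatever else S contains, its VP child is the head.
WILDCARD = {"S": "VP"}


def head_child(label, children):
    """Return the index of the head child according to the rule tables."""
    kinds = tuple(child[0] for child in children)
    target = RULES.get((label, kinds)) or WILDCARD.get(label)
    if target is None:
        raise KeyError(f"no head rule for ({label} {' '.join(kinds)})")
    return kinds.index(target)


def to_dependencies(tree, deps):
    """Return the lexical head of `tree`; append (dependent, head) to `deps`."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]  # leaf: the word is its own head
    heads = [to_dependencies(child, deps) for child in children]
    h = head_child(label, children)
    for i, word in enumerate(heads):
        if i != h:
            deps.append((word, heads[h]))  # non-head children depend on the head
    return heads[h]


tree = ("S",
        ("VP", ("MD", "will"),
               ("VP", ("VB", "join"),
                      ("NP", ("DT", "the"), ("NN", "board")))))
deps = []
root = to_dependencies(tree, deps)
print(root, deps)
```

Words serve as node identifiers here only because the toy fragment has no repeated words; a real converter would use token positions.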

Conversion: Analytic Structure, Functions

• Syntactic Function assignment (conversion)
• Rules
  – based on PTB functional tags:
      -SBJ → Sb
      -LGS → Obj
      -LOC → Adv
      -TMP → Adv
      -PRD → Pnom
      -ADV → Adv
      -MNR → Adv
      -BNF → Obj
      -DIR → Adv
      -PRP → Adv
      -DTV → Obj
      -EXT → Adv
      -PUT → Adv
  – Ad-hoc rules (if functional tags missing)
  – Lemmatization (years → year)
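
The tag-based part of this conversion amounts to a table lookup. The sketch below copies the mapping from the slide; the function name, the choice to take the first matching tag when a label carries several, and the `default` fallback (standing in for the ad-hoc rules, which the slide does not spell out) are assumptions.

```python
# PTB functional tag -> analytic function, as listed on the slide.
PTB_TO_AFUN = {
    "SBJ": "Sb", "LGS": "Obj", "LOC": "Adv", "TMP": "Adv", "PRD": "Pnom",
    "ADV": "Adv", "MNR": "Adv", "BNF": "Obj", "DIR": "Adv", "PRP": "Adv",
    "DTV": "Obj", "EXT": "Adv", "PUT": "Adv",
}


def analytic_function(label, default=None):
    """Map a PTB node label such as "NP-SBJ-1" to an analytic function.

    The first functional tag found in the table wins (an assumption);
    `default` is returned when no tag matches, where the real conversion
    would fall back to its ad-hoc rules.
    """
    for tag in label.split("-")[1:]:  # skip the category itself
        if tag in PTB_TO_AFUN:
            return PTB_TO_AFUN[tag]
    return default


for label in ["NP-SBJ", "PP-LOC-CLR", "NP-SBJ-1", "NP"]:
    print(label, analytic_function(label))
```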

Syntactic Structure, Functions: PTB to P(E)DT

[Figure: the nodes (will) (join) (the) (board) in three panels: Penn Treebank structure (with heads added); PDT-like Analytic Representation; PDT-like Tectogrammatic Representation (automatic pre-annotation). Tectogrammatical labels shown: PRED.Fut, PAT.]

English TR I: Predicative Complement

• Free (non-valency) modification (of both a noun and a verb)
• attribute compl.rf (green arrow to the noun)

English TR II: Which + Relative Clause

We have not answered your question completely, for which we apologize.

English TR III: Coordination

English TR IV: Comparison

English TR V: Restriction (“Exclusion”)

except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides

English TR: (manual) annotation

• TrEd
  – Pre-annotated
  – Graphical
    • TR dep. tree is primary
  – Text + TR
  – Czech translation
• Valency (a.k.a. “propbanking”)
  – During TR annotation
  – Propbank origins and examples
    • Linked, displayed

EngVallex (give)

EngVallex Format (admit)

Interannotator Agreement

2007-2009:
- New annotators (lower numbers)
Annotation “by phenomenon”
- Restarting now

Prague English Dependency Treebank

• Availability
  – Version 1.0 now (PTB license needed)
    • 250k words
  – Full version (parallel with Czech): late 2010
• Size
  – Full WSJ portion of PTB (2312 files)
  – 49208 sentences, 1253013 tokens
  – Now:
    – 17210 sentences (34.97%), 439983 tokens (35.11%)
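
The coverage percentages follow directly from the counts on this slide. A small check, with hypothetical variable and function names:

```python
# Counts taken from the slide: full WSJ portion vs. annotated so far.
total_sentences, total_tokens = 49208, 1253013
done_sentences, done_tokens = 17210, 439983


def percent(part, whole):
    """Share of `whole` covered by `part`, in percent, to two decimals."""
    return round(100 * part / whole, 2)


print(percent(done_sentences, total_sentences))
print(percent(done_tokens, total_tokens))
```

Both values round to the figures quoted above (34.97 and 35.11).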

Czech PDT-style Annotation

• All layers
  – morphology, syntax, tectogrammatical
• So far…
  – Automatic (many tools by many authors)
• Manual annotation
  – In progress (28124 sentences/639326 words)
  – Top-down
    • Tectogrammatical first (lower layers automatically)
    • … then syntactic structure and morphology

Summary

• PDT is/has (a)…
  – (Family of) dependency-based treebanking project(s)
    • Czech (English, Arabic, ...)
  – ~1 mil. words
    • sufficient size for ML experiments
  – 4 interlinked layers of annotation
    • token, morphology, syntax, deep syntax/semantics++
    • independent and “full” information at all levels
    • interlinked (for the development of parsers/generators)
  – Parallel corpus Cze <-> Eng -> Machine Translation

Pointers, Acknowledgements

• http://ufal.mff.cuni.cz/pedt
• http://ufal.mff.cuni.cz/pdt2.0
• http://ufal.mff.cuni.cz/~pajas/tred
• Acknowledgements
  – FP6-IST “Euromatrix”, FP7-IST “Euromatrix+”
  – LC536 (Center for Computational Linguistics)
  – GAČR 405/06/0589 (Speech and deep syntax)
  – MŠMT: MSM0021620838, ME838, ME09008