Transcript Slide 1
Corpus informed approach to Canonical Typology
Jiajin Xu, Anna Siewierska & Richard Xiao
Dept. of Linguistics & English Language
Lancaster University

Corpus informed approach to Canonical Typology: With special reference to separable verb compounds in Mandarin
Part of the project A corpus-based study of split words in Chinese: Interfacing morphology, syntax and pragmatics
PI: Anna Siewierska
RA: Jiajin Xu
Consultant: Richard Xiao

Outline
1 The issue: Separable verb compounds (SVCs)
2 Canonical typological strategies
3 How can corpus method inform CT?
4 A case study: Mandarin SVCs

The issue: Separable verb compounds
• In some languages, there are verb compounds made up of two parts: a verbal stem and a movable element that stands before or after the verb root, either adjacent to it or in close proximity.
• Such verbs are called separable (compound) verbs in German (e.g. abfahren, ‘to drive off/depart’) and Dutch (e.g. binnenkomen, ‘to come in’).

• Separable prefixes in some German and Dutch verbs can stand alone as words. They are moved from the verb to the end of the sentence in the present and past tense.
• E.g. Kommen Sie herein! ‘Come in.’

SVCs in other languages
• Parallels can be found in other languages (e.g. Afrikaans, Danish, Mandarin, Swedish, etc.), although they are not exactly the same type of verbs or constructions.

Why are SVCs interesting?
• SVCs are a large class of verbs in, say, German, Dutch and Mandarin.
• They violate one of the most fundamental principles of the lexicalist theory of word formation (Booij 1990: 45): the Principle of Lexical Integrity.

Why are SVCs interesting?
• The Principle of Lexical Integrity predicts that word-internal structures are not accessible to rules of syntax (ibid: 45).
• SVCs straddle the boundary of morphology and syntax.
• SVCs are listed as words, but they apparently have some phrasal properties and are sensitive to, for example, tense and clause type.

SVCs in Mandarin
• The 2002 edition of the Modern Chinese Dictionary includes 3,236 SVCs. Among the four subtypes of SVCs, verb-object compounds constitute 97% of the category, while the other three subtypes combined make up the remaining 3% (Zhu 2006: 29).
• No grammar of Mandarin can turn a blind eye to the “verb-object paradox” (Packard 2003: 108), because SVCs are by no means a marginal morphological phenomenon.

An example of Mandarin SVC
• dan1xin1, carry heart, ‘to worry’
• dan1-le yi1shang4wu3 xin1, carry ASP one morning heart, ‘to be worried the whole morning’
• xin1 yi4zhi2 dan1-zhe, heart all the time carry ASP, ‘to have been worried ever since’

Research objectives
• What are the common insertion patterns and external patterns that keep the Mandarin SVCs apart?
• How can the study of Mandarin SVCs contribute to the study of similar phenomena in other languages?

Canonical typological strategies
• Canonical typology has been recognised as a strategy for studying fuzzy and cross-border grammatical categories (Bond 2007; Corbett 2007; Nikolaeva 2008).
• It has been applied to suppletive forms, agreement, negation, syncretism, etc.

Standard strategy in typological research (Croft 2003: 14)
I. Determine the particular structure or situation type we want to explore;
II. Examine the morpho-syntactic construction or strategies used to encode that situation type;
III. Search for dependencies between the structures used for that situation and other linguistic factors: other structural features, external functions expressed by the construction in question, or both.

Steps of canonical typology approach
• Start with a linguistic phenomenon
• Establish a general definition for identifying the linguistic category in question
• Construct a set of features or criteria for the canonical case of the category
• Investigate the categories in languages against the criteria

How can corpus method inform CT?
• Usually the features are summarised from the literature.
• The collection of features could be selective and arbitrary.
• Can the selection of features be more objective and reliable?
• We seek to answer this question from a corpus linguistics perspective.

Corpus-driven vs. corpus-based
• For discontinuous constructions, we can consider a corpus-driven approach and a corpus-based approach.
• The difference: with or without prior assumptions or input information.

Corpus-driven approach
• A recent development in Corpus Linguistics called “concgram” or “concgramming”
• It automatically identifies all frequently co-occurring lexical items, whether contiguous or non-contiguous, and in either forward or backward order.

2-word concgram examples
• family support
• support…family
• family…support
• take over
• over take
• take…over

• the * of
• a * of
• one of *
• the * is

Limitation
• This approach is so powerful that it searches continuous and discontinuous sequences of any type.
• Separable compounds are but a small fraction of all discontinuous constituents.
• Concgram generation based on POS-tagged data can yield a more accurate search of verbal constructions.
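
The 2-word concgram idea above can be illustrated with a minimal sketch. It is not the concgramming software itself: the function name two_word_concgrams and the parameters max_span and min_freq are assumptions made for this illustration. It counts word pairs that co-occur within a fixed span, treating "take … over" and "over … take" as the same pair, whether or not other tokens intervene.

```python
from collections import Counter


def two_word_concgrams(tokens, max_span=4, min_freq=2):
    """Count unordered word pairs co-occurring within max_span positions.

    max_span=1 means adjacent tokens only; larger values also admit
    non-contiguous pairs. Both parameters are illustrative assumptions.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        stop = min(i + max_span + 1, len(tokens))
        for j in range(i + 1, stop):
            right = tokens[j]
            if right == left:
                continue  # skip a word paired with itself
            # frozenset ignores order, so forward and backward
            # occurrences are counted under one key
            counts[frozenset((left, right))] += 1
    # keep only pairs that reach the frequency threshold
    return {pair: n for pair, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    sample = ["take", "over", "x", "over", "take", "y", "take", "it", "over"]
    for pair, n in two_word_concgrams(sample, max_span=2, min_freq=2).items():
        print(sorted(pair), n)
```

Because every pair within the span is counted, the output mixes verbal constructions with unrelated co-occurrences, which is the limitation noted above; filtering on POS tags before counting would narrow the result.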

Corpus-based approach
• Variational parameters of SVCs can be summarised from corpus searches based on formally fixed search terms (very often drawn from grammar books, dictionaries, etc.).
• “Corpus-informed” is a more cautious term for studies which use corpus data or methods but do not rely strictly on them. Such studies very often combine them with other qualitative methods.

Mandarin SVCs as a case study
• What are common types of insertions in the discontinuous use of SVCs?
• How can typical (canonical) features be identified on a frequency basis?

Corpora
• Two corpora will be used in this study: the Lancaster Corpus of Mandarin Chinese (LCMC) for written Chinese and the Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) for the spoken register.
• The LCMC is a balanced corpus of written Chinese composed of one million words proportionally sampled from fifteen genres, ranging from news and fiction to academic prose, published in mainland China within two years of 1991 (see McEnery, Xiao & Mo 2003).

Corpora
• The LLSCC comprises one million words of dialogues (55%) and monologues (45%) in Chinese. These represent both spontaneous (57%) and scripted (43%) speech.
• The two corpora are also tokenised and annotated with part-of-speech tags. They form the empirical basis for our analysis of SVCs.

Procedures
• First, we will use the 1,738 commonly used split words listed in A Dictionary of Split Word Usage in Modern Chinese (Yang 1995) as seeds to automatically extract all instances of SVCs in which the two parts are separated, in either forward or backward direction, by 1-10 tokens.

• 2,793 concordance lines of SVCs were returned from the 2-million-word corpora,
• among which are 1,348 instances of crude SVCs (609 forward + 739 backward) in the written corpus, the LCMC,
• and 1,445 instances of crude SVCs (686 forward + 798 backward) in the spoken corpus, the LLSCC.

SVCs according to a general definition
• Similar noise cases were deleted manually.
• After manual filtering, only 565 legitimate SVCs were left for further morpho-syntactic and pragmatic annotation and analysis.

Syntagmatic pattern
SVCH + NEG + ASP/RVC + MC + CL + MOD + SVCT

Lexical & grammatical patterning of insertions
• Verbal satellites of SVC constructions

Pattern          SVC types (%)    SVC tokens (%)
SVCH-le SVCT     42 (25%)         74 (13%)
SVCH-guo SVCT    15 (9%)          22 (4%)
SVCH-zhe SVCT    12 (7%)          35 (6%)
Total            69 (42%)         131 (23%)

• Verbal satellites of SVC constructions

Pattern                 SVC types (%)    SVC tokens (%)
SVCH (?) ASP (?) SVCT   91 (55%)         244 (43%)

• Verbal satellites of SVC constructions

Pattern                 SVC types (%)    SVC tokens (%)
SVCH RVC SVCT           20 (12%)         26 (5%)
SVCH (?) RVC (?) SVCT   20 (12%)         66 (12%)

Nominal/Complement satellites of SVC constructions
• Quantifier: There are 108 instances of quantificational expressions in the insertion structures (19% of all SVCs).
• Classifier: Nominals in Mandarin are typically preceded by a classifier. In our data, 116 SVCs (21% of all SVCs) contain a classifier.

Nominal/Complement satellites of SVC constructions
• Modifier: The modifier category here means the pre-modifying element(s) of the nominal component of an SVC, which typically include adjectival modifiers (63 times, 11% of all SVCs), nominal items (59 times, 10% of all SVCs), possessive personal pronouns (64 times, 11% of all SVCs), question words (i.e. shen2me, ‘what’, 26 times, 5% of all SVCs), and also combinations of these elements.

Lexical and grammatical patterning
[Figure: slots between SVCH and SVCT, with example fillers: aspect marker (±) LE, resultative verb complement (±) HAO, quantifier YI, classifier GE, modifier DA]

• Our quantitative data show that more than half (54%, if RVCs are seen as quasi-aspect markers) of the split uses of SVCs, together with their continuous counterparts, can be analysed as legitimate compounds.
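
The seed-based extraction step described under Procedures can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name find_split_candidates, the representation of a seed as a (head, tail) pair, and the assumption that each sentence arrives as a list of tokens are all introduced here for the example. The defaults follow the 1-10 token gap stated in the procedure.

```python
def find_split_candidates(tokens, seeds, min_gap=1, max_gap=10):
    """Find seed pairs whose two parts are separated by min_gap..max_gap tokens.

    tokens: one tokenised sentence (assumed input format).
    seeds:  iterable of (head, tail) pairs, e.g. ("dan1", "xin1").
    Returns (head, tail, direction, inserted_tokens) tuples, where direction
    is "forward" (head before tail) or "backward" (tail before head).
    """
    hits = []
    for head, tail in seeds:
        for i, tok in enumerate(tokens):
            if tok == head:
                second, direction = tail, "forward"
            elif tok == tail:
                second, direction = head, "backward"
            else:
                continue
            # first and last positions where the other part may occur
            lo = i + 1 + min_gap
            hi = min(i + 1 + max_gap, len(tokens) - 1)
            for j in range(lo, hi + 1):
                if tokens[j] == second:
                    hits.append((head, tail, direction, tokens[i + 1:j]))
                    break  # keep only the nearest match for this position
    return hits


if __name__ == "__main__":
    seeds = [("dan1", "xin1")]
    print(find_split_candidates(["dan1", "le", "yi1shang4wu3", "xin1"], seeds))
    print(find_split_candidates(["xin1", "yi4zhi2", "dan1", "zhe"], seeds))
```

A pure string match of this kind returns crude candidates only. Accidental co-occurrences of the two parts are not excluded, which is why manual filtering of noise remains necessary afterwards.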

Two overarching criteria
• The key grammatical/structural criterion
  – Host dependency: head dependence > tail dependence
• The key phonological criterion
  – PrWd restriction
    • A disyllabic unit is the typical prosodic foot in Mandarin.
    • A trisyllabic construction can also be a prosodic word.

Grammatical/structural criteria
• The host dependency criterion (a > b > c) of the canonical approach treats:
• a) SVCs with a clitic-like aspect marker (e.g. the perfective marker -le) as compounds instead of phrases;
• b) SVCs with resultative verb complements attached to the main verb as quasi-compounds; and
• c) SVCs with other insertions (classifiers, modifiers, etc.) attached to SVCTs, which are typically a noun or complement, as the least likely to be compounds.

Phonological criteria
• We propose that the various manifestations of SVCs define a continuum of phonological conditions as a complement to the grammatical criteria (a > b > c):
• a) The combined uses of SVCHs and SVCTs are disyllabic compounds;
• b) SVCs in which the SVCHs and SVCTs are separated by one single morpheme under the Trisyllabic Foot rule are possible compounds; while
• c) SVCHs and SVCTs separated by other grammatical categories like quantifiers, adjectival modifiers, etc. are phrases.

Conclusion
• In this paper, we used the corpus method to generalise canonical internal structures between the verbal head and nominal element.
• We started off with a list of SVCs based on an SVC dictionary.
• The grammatical contexts were analysed manually.
• Quantitative data were employed as reference for establishing canonical features.
• We hope the method can be used for separable verb prefixes in languages like German, Dutch, etc.
• It is also interesting to look at East Asian and South East Asian languages, which might have similar V-O compounds.
• Phonological aspects, such as prosodic foot and stress, should also be considered.

Thank you!