Transcript Slide 1

Corpus informed approach
to Canonical Typology
Jiajin Xu, Anna Siewierska & Richard Xiao
Dept. of Linguistics & English Language
Lancaster University
1
Corpus informed approach to
Canonical Typology:
With special reference to separable verb compounds in Mandarin
Part of the project
A corpus-based study of split words in Chinese:
Interfacing morphology, syntax and pragmatics
PI:
Anna Siewierska
RA:
Jiajin Xu
Consultant: Richard Xiao
2
Outline
1
The issue: Separable verb compounds (SVCs)
2
Canonical typological strategies
3
How can corpus method inform CT?
4
A case study: Mandarin SVCs
3
The issue: Separable verb compounds
• In some languages, there are verb compounds
which are made up of two parts, a verbal stem
and a movable element standing before or
after the verb root in adjacency or close
proximity.
• Such verbs are called separable (compound)
verbs in German (e.g. abfahren, ‘to drive
off/depart’) and Dutch (e.g. binnenkomen, ‘to
come in’).
4
• Separable prefixes in some German and Dutch
verbs can stand alone as words. They are
moved from the verb to the end of the
sentence in the present and past tense.
• E.g. Kommen Sie herein! ‘Come in.’
5
SVCs in other languages
• Parallels can be found in other languages (e.g.
Afrikaans, Danish, Mandarin, Swedish, etc.),
although they are not exactly the same type of
verbs or constructions.
6
Why are SVCs interesting?
• SVCs are a large class of verbs in, say, German,
Dutch and Mandarin.
• They violate one of the most fundamental
principles of the lexicalist theory of word
formation (Booij 1990: 45): the Principle of
Lexical Integrity.
7
Why are SVCs interesting?
• The Principle of Lexical Integrity predicts that word-internal structures are not accessible to rules of
syntax (ibid: 45).
• SVCs straddle the boundary of morphology and syntax.
• SVCs are listed as words, but they apparently have
some phrasal properties and are sensitive to tense and
clause type, for example.
8
SVCs in Mandarin
• The 2002 edition of the Modern Chinese Dictionary
includes 3,236 SVCs. Among the four subtypes of
SVCs, verb object compounds constitute 97% of the
category, while the other three categories combined
make up the remaining 3% (Zhu 2006: 29).
• No grammar of Mandarin can turn a blind eye to the
“verb-object paradox” (Packard 2003: 108), because
SVCs are by no means a marginal morphological
phenomenon.
9
An example of Mandarin SVC
• dan1xin1, carry heart, ‘to worry’
• dan1-le yi1shang4wu3 xin1, carry ASP one
morning heart, ‘to be worried the whole
morning’
• xin1 yi4zhi2 dan1-zhe, heart all the time carry
ASP, ‘to have been worried ever since’
10
Research objectives
• What are the common insertion and external
patterns that keep the Mandarin SVCs
apart?
• How can the study on Mandarin SVCs
contribute to the study of similar phenomena
in other languages?
11
Canonical typological strategies
• Canonical typology has been recognised as a
strategy of studying fuzzy and cross-border
grammatical categories (Bond 2007; Corbett
2007; Nikolaeva 2008).
• Canonical typology has been applied to suppletive
forms, agreement, negation, syncretism, etc.
12
Standard strategy in typological
research (Croft 2003: 14)
I. Determine the particular structure or situation
type we want to explore;
II. Examine the morpho-syntactic construction or
strategies used to encode that situation type;
III. Search for dependencies between the
structures used for that situation and other
linguistic factors: other structural features and
external functions expressed by the
construction in question, or both.
13
Steps of canonical typology approach
• Start with a linguistic phenomenon
• Establish a general definition for identifying
the linguistic category in question
• Construct a set of features or criteria for the
canonical case of the category
• Investigate the categories in languages
against the criteria
14
How can corpus method inform CT?
• Usually the features are summarised from
literature.
• The collection of features could be selective
and arbitrary.
• Can the selection of features be more objective
and reliable?
• We seek to answer this question from a corpus
linguistics perspective.
16
Corpus-driven vs. corpus-based
• In relation to discontinuous constructions,
we can consider a corpus-driven approach and
a corpus-based approach.
• They differ in whether the search starts with
or without prior assumptions or input
information.
17
Corpus-driven approach
• A recent development in Corpus Linguistics
called “concgram” or “concgramming”
• It automatically identifies all frequently co-occurring
lexical items, in either contiguous or non-contiguous contexts, and in either forward or backward
direction.
18
2-word concgram examples
• family support
• support…family
• family…support
• take over
• over take
• take…over

• the * of
• a * of
• one of *
• the * is
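
The idea behind these 2-word examples can be illustrated with a minimal sketch. It is not the ConcGram software itself; the function name `two_word_concgrams`, the `window` parameter and the toy sentence are assumptions made for illustration. It counts word pairs that co-occur within a small window, whatever their order and whether or not they are adjacent.

```python
from collections import Counter

def two_word_concgrams(tokens, window=3):
    """Count unordered word pairs co-occurring within `window` tokens.

    Hypothetical helper: a pair is stored as a frozenset, so
    "family ... support" and "support ... family" are counted together.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look at the next `window` tokens to the right of position i.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[frozenset((left, right))] += 1
    return counts

# Toy sentence (invented for illustration, not corpus data).
tokens = "family support helps they support the family".split()
counts = two_word_concgrams(tokens, window=3)
print(counts[frozenset(("family", "support"))])
```

Here the pair is found once contiguously ("family support") and once non-contiguously in reverse order ("support the family"), so both uses contribute to the same count.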
19
Limitation
• This approach is so powerful that it searches
continuous and discontinuous sequences of any type.
• Separable compounds are but a small fraction of all
discontinuous constituents.
• Concgram generation based on POS-tagged data can yield
a more accurate search of verbal constructions.
20
Corpus-based approach
• Variational parameters of SVCs can be
summarised from corpus search based on
some formally fixed search terms (very often
relying on grammar books, dictionaries etc.)
• “Corpus-informed” is a more cautious term for
studies which use corpus data or methods but
do not rely strictly on them. Very often such
studies combine them with other, qualitative methods.
21
Mandarin SVCs as a case study
• What are common types of insertions in the
discontinuous use of SVCs?
• How can typical (canonical) features be
identified on a frequency basis?
22
Corpora
• Two corpora will be used in this study: the Lancaster
Corpus of Mandarin Chinese (LCMC) for written
Chinese and the Lancaster Los Angeles Spoken
Chinese Corpus (LLSCC) for the spoken register.
• The LCMC is a balanced corpus of written Chinese
composed of one million words proportionally
sampled from fifteen genres, ranging from news and
fiction to academic prose, published in mainland China
within two years of 1991 (see McEnery, Xiao &
Mo 2003).
23
Corpora
• The LLSCC comprises one million words of
dialogues (55%) and monologues (45%) in
Chinese. These represent both spontaneous
(57%) and scripted (43%) speech.
• The two corpora are also tokenised and
annotated with part-of-speech tags. They form
the empirical basis for our analysis of SVCs.
24
Procedures
• First, we will use the 1,738 commonly used
split words listed in A Dictionary of Split Word
Usage in Modern Chinese (Yang 1995) as seeds
to automatically extract all instances of SVCs
when their two parts are separated, in either
forward or backward direction, by 1-10 tokens.
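
The extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual script: the function `find_split_uses`, the seed format (head, tail) and the pinyin tokenisation of the example are all hypothetical. It searches a tokenised text for seed pairs whose two parts are separated by 1 to 10 tokens, in either forward (head before tail) or backward (tail before head) order.

```python
def find_split_uses(tokens, seeds, min_gap=1, max_gap=10):
    """Return crude hits as (head, tail, direction, inserted_tokens).

    `seeds` is a list of (head, tail) pairs, e.g. taken from a dictionary
    of split words. Hypothetical helper for illustration only.
    """
    hits = []
    for head, tail in seeds:
        for i, tok in enumerate(tokens):
            if tok == head:
                second, direction = tail, "forward"
            elif tok == tail:
                second, direction = head, "backward"
            else:
                continue
            # Look ahead for the other part, allowing min_gap..max_gap
            # intervening tokens; stop at the first match.
            for gap in range(min_gap, max_gap + 1):
                j = i + gap + 1
                if j >= len(tokens):
                    break
                if tokens[j] == second:
                    hits.append((head, tail, direction, tokens[i + 1 : j]))
                    break
    return hits

seeds = [("dan1", "xin1")]  # dan1xin1, 'to worry'
forward = find_split_uses(["dan1", "le", "yi1shang4wu3", "xin1"], seeds)
backward = find_split_uses(["xin1", "yi4zhi2", "dan1", "zhe"], seeds)
print(forward)
print(backward)
```

Hits of this kind are only crude candidates: a purely formal search also returns accidental co-occurrences, which is why manual filtering is still needed afterwards.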
25
• 2,793 concordance lines of SVCs were returned
from the two-million-word corpora,
• among which are 1348 crude SVC
instances (609 forward + 739 backward) in the
written corpus, the LCMC,
• and 1445 crude SVC instances (686
forward + 798 backward) in the spoken corpus,
the LLSCC.
26
SVCs according to a general definition
• Noise cases were deleted manually.
• After manual filtering, only 565 legitimate SVCs
were left for further morpho-syntactic and
pragmatic annotation and analysis.
27
Syntagmatic pattern
SVCH + NEG + ASP/RVC + MC + CL + MOD + SVCT
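
One way to read this pattern is as an ordered template of optional slots between the head (SVCH) and the tail (SVCT). The sketch below checks whether a sequence of slot labels respects that order. It rests on assumptions that are not stated on the slide: that ASP and RVC share a single slot, that every slot is optional, and that each slot is filled at most once. The names `SLOT_INDEX` and `fits_template` are hypothetical.

```python
# Slot order copied from the pattern; SVCH and SVCT are the fixed edges.
# Assumption: ASP and RVC compete for the same slot.
SLOT_INDEX = {"NEG": 0, "ASP": 1, "RVC": 1, "MC": 2, "CL": 3, "MOD": 4}

def fits_template(inserted_labels):
    """Return True if the labels between SVCH and SVCT follow the slot order."""
    previous = -1
    for label in inserted_labels:
        slot = SLOT_INDEX.get(label)
        if slot is None or slot <= previous:
            return False  # unknown label, repeated slot, or out of order
        previous = slot
    return True

print(fits_template(["ASP", "CL", "MOD"]))  # aspect marker, classifier, modifier
print(fits_template(["CL", "ASP"]))         # classifier before aspect marker
```

The first call respects the order and the second does not, so a check of this kind could flag annotated insertions that deviate from the generalised pattern.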
28
Lexical & grammatical patterning of insertions
• Verbal satellites of SVC constructions
Pattern          SVC types (%)   SVC tokens (%)
SVCH-le SVCT     42 (25%)        74 (13%)
SVCH-guo SVCT    15 (9%)         22 (4%)
SVCH-zhe SVCT    12 (7%)         35 (6%)
Total            69 (42%)        131 (23%)
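
The difference between type and token counts in a table like this can be made concrete with a short sketch. The annotated instances below are invented toy data, not the study's data, and `type_token_table` is a hypothetical helper. Tokens are individual split uses; types are the distinct SVCs attested with a given insertion.

```python
from collections import defaultdict

def type_token_table(instances):
    """Summarise (svc, inserted_marker) pairs by marker.

    Returns, per marker, the number of distinct SVCs (types), the number
    of instances (tokens), and both as percentages of the overall totals.
    """
    tokens = defaultdict(int)
    types = defaultdict(set)
    for svc, marker in instances:
        tokens[marker] += 1
        types[marker].add(svc)
    total_tokens = len(instances)
    total_types = len({svc for svc, _ in instances})
    return {
        marker: {
            "types": len(types[marker]),
            "type_pct": 100 * len(types[marker]) / total_types,
            "tokens": tokens[marker],
            "token_pct": 100 * tokens[marker] / total_tokens,
        }
        for marker in tokens
    }

# Toy annotations (invented for illustration).
instances = [
    ("dan1xin1", "le"), ("dan1xin1", "le"), ("shui4jiao4", "le"),
    ("dan1xin1", "zhe"), ("bang1mang2", "guo"),
]
table = type_token_table(instances)
print(table["le"])
```

Because one SVC can occur with several markers, type percentages computed this way need not sum to 100 across markers.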
29
• Verbal satellites of SVC constructions
Pattern                  SVC types (%)   SVC tokens (%)
SVCH (?) ASP (?) SVCT    91 (55%)        244 (43%)
30
• Verbal satellites of SVC constructions
Pattern                  SVC types (%)   SVC tokens (%)
SVCH RVC SVCT            20 (12%)        26 (5%)
SVCH (?) RVC (?) SVCT    20 (12%)        66 (12%)
31
Nominal/Complement satellites of SVC constructions
• Quantifier
• There are 108 instances of quantificational
expressions in the insertion structures (19% of
all SVCs).
• Classifier
• Nominals in Mandarin are typically preceded
by a classifier. In our data, 116 SVCs (21% of all
SVCs) contain a classifier.
32
Nominal/Complement satellites of SVC constructions
• Modifier
• The modifier category here means the pre-modifying
element(s) of the nominal component of an SVC,
which typically includes adjectival modifiers (63 times,
11% of all SVCs), nominal items (59 times, 10% of all
SVCs), possessive personal pronouns (64 times, 11%
of all SVCs), question words (i.e. shen2me, ‘what’, 26
times, 5% of all SVCs), and also combinations of these
elements
33
Lexical and grammatical patterning
[Diagram: insertions between SVCH and SVCT. Labelled elements: aspect marker (±), resultative verb complement (±), quantifier, classifier, modifier.]
34
• Our quantitative data indicate that over half
(54%, if RVCs are seen as quasi-aspect
markers) of the split uses of SVCs, together with
their continuous counterparts, can be analysed as
legitimate compounds.
35
Two overarching criteria
• The key grammatical/structural criteria
– Host dependency:
• head dependence > tail dependence
• The key phonological criteria
– PrWd restriction
• A disyllabic unit is the typical prosodic foot in Mandarin.
• A trisyllabic construction can also be a prosodic word.
36
Grammatical/structural criteria
• The host dependency criterion (a > b > c) of the
canonical approach treats:
• a) SVCs with a clitic-like aspect marker (e.g. the
perfective marker -le) as compounds instead of
phrases;
• b) SVCs with resultative verb complements attached
to the main verb as quasi-compounds; and
• c) SVCs with other insertions (classifiers, modifiers, etc.)
attached to SVCTs, represented typically by a noun or
complement, as the least likely to be compounds.
37
Phonological criteria
• We propose that the various manifestations of SVCs
define a continuum of phonological conditions as a
complement to the grammatical criteria (a > b > c):
• a) The combined uses of SVCHs and SVCTs are
disyllabic compounds;
• b) SVCs in which the SVCHs and SVCTs are separated
by one single morpheme under the Trisyllabic Foot
rule are possible compounds; while
• c) SVCHs and SVCTs separated by other grammatical
categories like quantifiers, adjectival modifiers, etc.
are phrases.
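
The three-way continuum above can be written down as a simple decision rule. The sketch below is a deliberate simplification: it looks only at the number of inserted morphemes and assumes each is monosyllabic, so it does not check the grammatical category of the insertion. The function name `phonological_status` is hypothetical.

```python
def phonological_status(inserted_morphemes):
    """Place a use of an SVC on the a > b > c continuum.

    Simplifying assumption: each inserted morpheme is one syllable, so
    zero insertions give a disyllabic unit and one gives a trisyllabic one.
    """
    n = len(inserted_morphemes)
    if n == 0:
        return "compound"           # (a) combined, disyllabic use
    if n == 1:
        return "possible compound"  # (b) one morpheme, trisyllabic foot
    return "phrase"                 # (c) longer insertions

print(phonological_status([]))                      # dan1xin1
print(phonological_status(["le"]))                  # dan1-le xin1
print(phonological_status(["le", "yi1shang4wu3"]))  # dan1-le yi1shang4wu3 xin1
```

A fuller version would also consult the category of the inserted material, since case (c) is defined by grammatical category as well as by length.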
38
Conclusion
• In this paper, we used the corpus method to generalise
canonical internal structures between the verbal head and
nominal element.
• We started off with a list of SVCs based on an SVC dictionary.
• The grammatical contexts were analysed manually.
• Quantitative data were employed as reference for establishing
canonical features.
• We hope the method can be used for separable verb prefixes
in languages like German, Dutch, etc.
• It would also be interesting to look at East Asian and South-East Asian
languages which might have similar V-O compounds.
• Phonological aspects, such as prosodic foot and stress, should also be
considered.
39
Thank you!
40