More on corpus linguistics


Overview of Corpus Linguistics
Ling 240
Outline
Definition
History
Current status
What is corpus linguistics?
Linguistics: the scientific study of language
using…
Corpus: a large and principled collection of natural t
History of corpus linguistics
As early as 1897, Wilhelm Kaeding compiled a had 5
In the early 1900s, Otto Jespersen, a Danish profess
In 1959, Randolph Quirk started the Survey of Englis
History of corpus linguistics
1961: Brown Corpus
1M words
500 samples of 2,000 words
Various genres; printed, edited, American English
1961: Lancaster-Oslo/Bergen (LOB) Corpus
British version of Brown Corpus
1991: Frown and FLOB Corpora
1988: International Corpus of English (ICE)
World English varieties
20 completed so far
History of corpus linguistics
1991: British National Corpus (BNC)
100M words
Wide range of written (90%) and spoken (10%) texts
2008: BYU Corpora
Corpus of Contemporary American English (COCA)
TIME corpus
Corpus of Historical American English (COHA)
GloWbE Corpus
International Corpus of Learner English (ICLE)
MICASE & MICUSP
Status of corpus linguistics
Is corpus linguistics a branch of linguistics or a meth
Evidence for branch:
Journals such as Corpora and the International Journal of
Some researchers claim corpus linguistics as their area of
Evidence for method:
Most linguistic phenomena can be measured using CL
CL has the potential to inform virtually any theory
Characteristics of corpus-based ana
It is empirical, analyzing the actual patterns of use in
It utilizes a large and principled collection of natural
It makes extensive use of computers for analysis, us
It depends on both quantitative and qualitative ana
Uses of corpora
Changes over time
Changes in register
Changes in situation
Changes in individual
Time
Different Genitives
Of genitive
 The leg of the table
's genitive
 The table's leg
NN genitive
 The table leg
[Figure: 's genitive vs. of-genitive vs. NN sequence. Frequency (0 to 60) by year (1700 to 2000) for three features: NN, Of-Gen, S-Gen]
NN sequence across time in three registers
Situation
Phrasal Compression
Uncompressed
 The dog that was hungry was looking for something to eat.
 Drugs that require a prescription should be monitored.
Compressed
 The hungry dog was looking for something to eat.
 Prescription drugs should be monitored.
Phrasal compression across levels in an EAP reading series
Phrasal compression across levels in another EAP reading se
Individual
[Figure: Dimension 2: 'Abstract Narrative versus Concrete Action'; 'Abstract Exposition versus Concret. Scores from -5 to 5 by author: Alcott, Dickens, Eliot, Hawthorne, James, Kipling, Melville, Stevenson, Twain, Wells]
nd Representativeness
Designing Representative Corpora
• Many people believe that the design of a corpus doesn’t m
• Researchers typically focus on target domain representativ
 Target domain (medical texts, newspapers, academic, ge
• Very few corpora are actually evaluated in terms of their r
Steps—representing the target domain
1.Describe the target domain
2.Design the corpus to represent target domain
3.Complete the sampling
• Simple random
 Randomly choose sections of the data for the corpus
• Stratified
 Determine what genres are included and randomly sample from those data
• Cluster
 Divide data into naturally occurring groups and sample from them
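The three sampling strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of texts, the genre labels, and the source names are hypothetical, and cluster sampling is read here as picking whole naturally occurring groups (sources) at random and keeping every text in them.

```python
import random
from collections import defaultdict

# Hypothetical target domain: (text_id, genre, source) records, 18 texts in total.
texts = [
    (f"t{i:02d}", genre, f"{genre}_src{i % 2}")
    for i, genre in enumerate(["news", "academic", "fiction"] * 6)
]

rng = random.Random(240)  # fixed seed so the draw is repeatable


def simple_random(population, n):
    """Draw n texts from the whole population, each equally likely."""
    return rng.sample(population, n)


def stratified(population, per_genre):
    """Group texts by genre, then draw the same number from every genre."""
    strata = defaultdict(list)
    for text in population:
        strata[text[1]].append(text)
    sample = []
    for genre in sorted(strata):
        sample.extend(rng.sample(strata[genre], per_genre))
    return sample


def cluster(population, n_clusters):
    """Group texts by source, pick whole sources at random, keep all their texts."""
    clusters = defaultdict(list)
    for text in population:
        clusters[text[2]].append(text)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [text for name in chosen for text in clusters[name]]


simple_sample = simple_random(texts, 6)
stratified_sample = stratified(texts, 2)
cluster_sample = cluster(texts, 2)
```

Simple random sampling can leave a genre out entirely by chance; the stratified draw guarantees every genre is represented, which is usually why it is preferred when the target domain has known categories.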
Norming practice
           Text A   Text B
# Nouns    50       100
# Words    200      1000
(raw count/total words) * 1000
Text A: (50 nouns / 200 words) * 1000 = 250 nouns per thousand words
Text B: (100 nouns / 1000 words) * 1000 = 100 nouns per thousand words
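The norming formula above translates directly into a small function. This is a minimal sketch; the name `normed_rate` and the `base` parameter are illustrative choices, not part of any corpus tool.

```python
def normed_rate(raw_count, total_words, base=1000):
    """Return occurrences per `base` words: (raw count / total words) * base."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count / total_words * base


# Text A from the practice table: 50 nouns in 200 words.
text_a_rate = normed_rate(50, 200)
print(text_a_rate)  # 250.0 nouns per thousand words
```

The same function norms per million words when called with `base=1_000_000`, which is the usual choice for very large corpora.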
Norming practice
• BNC has 100 million words
• COCA has 450 million words
           BNC              COCA
           #      Per M     #      Per M
snuck      11               767
sneaked    132              830
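One way to fill in the Per M columns is sketched below. It assumes the four raw counts read row by row (snuck: 11 in BNC, 767 in COCA; sneaked: 132 in BNC, 830 in COCA); the corpus sizes are the ones given on the slide, and the names `CORPUS_SIZES`, `RAW_COUNTS`, and `per_million` are made up for this example.

```python
# Corpus sizes in words, as stated on the slide.
CORPUS_SIZES = {"BNC": 100_000_000, "COCA": 450_000_000}

# Assumed reading of the table: counts listed row by row.
RAW_COUNTS = {
    "snuck": {"BNC": 11, "COCA": 767},
    "sneaked": {"BNC": 132, "COCA": 830},
}


def per_million(raw_count, corpus_words):
    """Norm a raw count to occurrences per million words."""
    return raw_count / corpus_words * 1_000_000


for form, counts in RAW_COUNTS.items():
    for corpus, raw in counts.items():
        rate = per_million(raw, CORPUS_SIZES[corpus])
        print(f"{form:8} {corpus:5} raw={raw:4d} per_million={rate:.2f}")
```

Norming matters here because COCA is 4.5 times the size of the BNC, so the raw counts on their own overstate how much more frequent either form is in COCA.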
Annotation
• Corpora can be annotated for a wide range of external and
• External variables
 Speaker
 L1 background
 Gender
 Extralinguistic information (e.g., laughter, nodding, etc.)
External annotation—example
<Exam ID: 3B>
<Arrangement ID: 54945>
<Center ID: 14>
<Candidate ID: 42285>
<Test Date: 12/6/2013>
<Age: 19>
<Gender: F>
<L1: Arabic>
<Reason for test: B>
<Original MELAB: 2>
<Original Transformed: 3>
<Second MELAB: >
<Second Transformed: >
<End header>
E: Alright, welcome to the MELAB speaking exam, my name is <deleted>. And uh what is your name?
T: Uh my name is uh <deleted>.
E: Now I'll just uh read the MELAB ID number that we have for you. Uh you don't need to know it or anything. The number is <deleted>. Al
T: Uh actually I came to USA to complete my education here, so uh if I want to go to university I need to get score and to to be good in spe
E: Okay uh so uh what are you interested in studying at the university?
T: Actually I think about uh medical science.
Part of speech tagging
• Rule-based
• Probability-based
• 95%+ accuracy rate
• Some features very easy (e.g., the)
• Some features more difficult (e.g., that)
 Pronoun (He doesn’t like that.)
 Determiner (He doesn’t like that dog.)
 Complementizer (They thought that they could do it.)
 Relativizer (The thought that I entertained.)
POS tagging accuracy
• Accuracy
• Precision – What percent of the cases labeled as X are actually X?
• Recall – What percent of all of the true cases of X were labeled as X?
• Example
• He saw that dog that I saw.
• If both ‘that’s are tagged as determiners:
• Calculate the precision and recall for determiners
• Calculate the precision and recall for relativizers
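The exercise can be checked with a short script. This sketch scores only the two tokens of "that" and assumes the gold tags are determiner for the first ("that dog") and relativizer for the second ("that I saw"); the tag labels `DET` and `REL` and the function name are illustrative.

```python
def precision_recall(gold, predicted, label):
    """Return (precision, recall) for one label; None where the ratio is undefined."""
    true_positives = sum(
        1 for g, p in zip(gold, predicted) if g == label and p == label
    )
    labeled_as = sum(1 for p in predicted if p == label)  # cases the tagger called X
    actually = sum(1 for g in gold if g == label)         # true cases of X
    precision = true_positives / labeled_as if labeled_as else None
    recall = true_positives / actually if actually else None
    return precision, recall


# "He saw that dog that I saw." -> the two instances of "that"
gold = ["DET", "REL"]
predicted = ["DET", "DET"]  # the tagger calls both determiners

print(precision_recall(gold, predicted, "DET"))  # (0.5, 1.0)
print(precision_recall(gold, predicted, "REL"))  # (None, 0.0)
```

Determiner recall is perfect while precision is only 50%, and relativizer recall is 0% with precision undefined because nothing was labeled a relativizer. A single accuracy figure would hide both facts.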
POS tagging—two examples
CLAWS Tagger
Everything_PN1
I_PPIS1
've_VH0
read_VVN
says_VV0
they_PPHS2
were_VBDR
warned_VVN
to_TO
leave_VVI
immediately_RR
Biber Tagger
Everything ^pn++++=Everything
I ^pp1a+pp1+++=I've
've ^vb+hv+aux++0=EXTRAWORD
read ^vprf+++xvbnx+=read
says ^vb+vpub+++=say's
they ^pp3a+pp3+++=they
were ^vbd+bed+aux++=were
warned ^vpsv++agls+xvbnx+=warned
to ^to+vcmp+++=to
leave ^vbi++++=leave
immediately ^rb+tm+++=immediately
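Tagged output in the CLAWS-style `word_TAG` format shown above is easy to process once each token is split at its last underscore. The sketch below copies the eleven CLAWS tokens from the example; the helper name `split_tagged` is hypothetical, and the Biber Tagger format would need a different parser.

```python
from collections import Counter

# CLAWS-style tokens copied from the example above.
tagged = (
    "Everything_PN1 I_PPIS1 've_VH0 read_VVN says_VV0 they_PPHS2 "
    "were_VBDR warned_VVN to_TO leave_VVI immediately_RR"
)


def split_tagged(token):
    """Split 'word_TAG' at the last underscore into (word, tag)."""
    word, _, tag = token.rpartition("_")
    return word, tag


pairs = [split_tagged(token) for token in tagged.split()]
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[0])           # ('Everything', 'PN1')
print(tag_counts["VVN"])  # 2 (read, warned)
```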
Lemmatization
• Lemma
 The citation or dictionary entry
 Run is the lemma
 It includes the words run, running, runs, ran
 We often want the frequency of the lemma not of a part
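Counting by lemma rather than by word form can be illustrated as follows. The lookup table is hand-written for this example only and covers just the forms of run listed above; a real lemmatizer relies on a full lexicon and usually on part-of-speech tags. The sample sentence is invented.

```python
from collections import Counter

# Hand-written lookup for illustration only: word form -> lemma.
LEMMA_OF = {"run": "run", "runs": "run", "running": "run", "ran": "run"}


def lemma_frequencies(tokens):
    """Count tokens by lemma; forms missing from the lookup count as themselves."""
    return Counter(LEMMA_OF.get(tok.lower(), tok.lower()) for tok in tokens)


tokens = "She ran yesterday and she runs today ; running helps her run faster".split()
freqs = lemma_frequencies(tokens)

print(freqs["run"])  # 4: ran, runs, running, run
```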
Answer these questions about COCA
• What external annotation does it contain?
 Text source
 Date of publication
• What internal annotation does it contain?
 Lemmatization
 Part of speech
 Genre
‘s-genitive’ versus ‘of-genitive’
• ‘the bird’s owner’ vs. ‘the owner of the bird’
• Finding 1: “by 1991, the s-genitive had overtaken th
• Finding 2: of-genitive is almost 10 times more frequ
• Q: Are these findings contradictory?
esign—variationist
• Two approaches to corpus linguistics: Variationist and Text-linguistic
• Variationist: “has the goal of comparing linguistic variants:
– Statistics: Binomial/logistic regression; Linear discriminant analysis
– Interpretation: When a choice can be made, variant X is preferred o
Variationist Analysis (Type A)
• Unit of analysis is linguistic feature
• Most studies do not take register into account (e
• Comparison of the proportion of use in a particu
• E.g., Szmrecsanyi & Hinrichs, 2008
–preference of s-genitive over of-genitive in speech; B
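A Type A comparison boils down to the share of each variant among the contexts where a choice was possible. The sketch below uses invented counts purely for illustration; they are not figures from Szmrecsanyi & Hinrichs or any other study.

```python
def variant_proportions(counts):
    """Return each variant's share of all tokens where a choice was possible."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens to compare")
    return {variant: n / total for variant, n in counts.items()}


# Hypothetical counts of interchangeable genitive contexts in one sample.
counts = {"s-genitive": 60, "of-genitive": 40}
proportions = variant_proportions(counts)

print(proportions)  # {'s-genitive': 0.6, 'of-genitive': 0.4}
```

The denominator is the number of choice contexts, not the number of words in the corpus, which is what separates this kind of result from a normed rate of occurrence.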
sign—text-linguistic
• Text-Linguistic: “has the goal of providing a linguistic desc
–Statistics: T-test, ANOVA, Multiple regression, Factor anal
–Interpretation: Feature X is more frequent in context A th
Text-linguistic (Type B)
• Comparison of actual frequency of use in a part
• Unit of analysis is text
• Normed rates of occurrence by text
• Much more common for register studies
Text-linguistic (Type C)
• Also compares frequencies of use in a particula
• Unit of analysis is subcorpus
• Normed rates of occurrence for features across
• Cannot use inferential statistics (need to look at
ative analysis
• Coding/tagging features
• Counts in text vs. subcorpus
• Norming (raw count/total words * 1000)
• Use appropriate statistical tests if applicable
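The difference between counting by text and counting by subcorpus can be sketched with a few hypothetical texts. The counts and word totals below are invented for illustration. Norming each text separately (unit of analysis is the text) yields a distribution with a mean and a spread; pooling the subcorpus first (unit of analysis is the subcorpus) yields a single rate.

```python
from statistics import mean, stdev

# Hypothetical (feature_count, word_count) pairs for three texts in one subcorpus.
texts = [(12, 400), (30, 1500), (9, 600)]


def per_thousand(count, words):
    """Norm a raw count to occurrences per 1,000 words."""
    return count / words * 1000


# Counts by text: one normed rate per text, so there is variation to test.
text_rates = [per_thousand(c, w) for c, w in texts]
by_text_mean = mean(text_rates)
by_text_sd = stdev(text_rates)

# Counts by subcorpus: pool counts and words first, giving one number only.
pooled_rate = per_thousand(sum(c for c, _ in texts), sum(w for _, w in texts))

print(f"by text: mean={by_text_mean:.2f}, sd={by_text_sd:.2f}")
print(f"by subcorpus: rate={pooled_rate:.2f}")
```

The two figures differ because pooling weights long texts more heavily, and only the by-text design leaves a standard deviation for a t-test or ANOVA to work with.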
Kinds of Corpora
• Spoken language
• General corpora (mainly written)
• Bitext (two languages side-by-side)
• Specialized
• Children’s speech
• L2 learner speech
• Historical
General Corpora
• Mainly written
• British National Corpus (BNC)
• 100 million words
• 10% spoken
• 25% fiction
• 75% non-fiction
General Corpora
• Corpus of Contemporary American English (COCA)
• 450 million words (more added every year)
• Divided into registers
• Spoken
• Fiction
• Academic
General Corpora
• International Corpus of English
• 1 million words from each country
• 60% spoken, 40% written
Historical Corpora
• Helsinki Corpus
• English texts from 770-1700
• Corpus of Historical American English (COHA)
• 1860-present
Introduction to COHA
Corpus of Historical American English
• End up verbing
• Try and verb versus try to verb
• Adjectives and nouns used in 2000s not before
• Collocates of Muslim, liberal, Mormon
•?
Raw Corpora
• Not easily searchable
• Not tagged
• Project Gutenberg
• Pre-1928 books (copyright expired)
• Online newspapers
• Time Magazine
• The internet
• General Conference
Where can you get corpora?
• Online
• BNC, COCA, COHA
• Distributors (membership required)
• ELRA (based in Europe)
• Linguistic Data Consortium
• US based
• BYU has a membership
• Catalog
• Top 10 corpora