4 - University of Reading
Corpora:
Resources for the
study of language
Paul Thompson
Applied Linguistics
([email protected])
British Academic Spoken English corpus
(BASE)
160 lectures, 39 seminars
Transcripts, video and audio
199 XML files:
Transcripts with detailed annotation
Metadata included in header
160 lecture transcripts are tagged for part-of-speech
www.reading.ac.uk/AcaDepts/ll/base_corpus/
Funded by AHRB, Euralex, BALEAP and university
sources
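As a rough illustration of what "metadata included in header" allows, the sketch below reads header fields and speaker turns from one transcript. The element and attribute names (transcript, header, title, eventType, u, who) and the sample content are hypothetical stand-ins for illustration, not the actual BASE markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature transcript; the real BASE files use their own schema.
SAMPLE = """<transcript>
  <header>
    <title>Example lecture</title>
    <eventType>lecture</eventType>
  </header>
  <body>
    <u who="nm0001">today we will look at corpora</u>
    <u who="sf0002">could you repeat that</u>
  </body>
</transcript>"""


def read_transcript(xml_text):
    """Return (metadata dict, list of (speaker, text)) for one transcript."""
    root = ET.fromstring(xml_text)
    header = root.find("header")
    # Each child of the header becomes one metadata field.
    metadata = {child.tag: (child.text or "").strip() for child in header}
    # Each <u> element is treated as one speaker turn.
    utterances = [(u.get("who"), (u.text or "").strip()) for u in root.iter("u")]
    return metadata, utterances


metadata, utterances = read_transcript(SAMPLE)
print(metadata)
print(utterances)
```

Keeping the metadata in the same file as the transcript means a single pass over the 199 files can both filter by event type and extract the text.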
British Academic Written English corpus
(BAWE)
A corpus of assessed student writing at
university level
Texts collected at Warwick, Reading and
Oxford Brookes University
Funded by the Economic and Social Research
Council (ESRC)
RES-000-23-0800
BAWE figures
6.5 million words
2,896 texts
2,761 assignments
XML files, POS-tagged
30+ disciplines
4 levels of study
Query interface: Sketch Engine
Commercial service: Applied Linguistics pays annual subscription
BAWE: it BE ADJ that
(eg, ‘it is important that’)
Level   Raw   Rel %
3       225   121.7
2       275   107.7
1       255    96.0
PG       66    62.1
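The pattern "it BE ADJ that" can be searched for in any POS-tagged text. The sketch below is a minimal illustration, not the Sketch Engine query used for the figures above. It assumes tokens arrive as (word, tag) pairs and that adjective tags begin with "JJ"; the tag names in the sample sentence are illustrative assumptions only.

```python
# Forms of BE accepted in the second slot of the pattern.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def find_it_be_adj_that(tagged):
    """Return every 4-token sequence matching: it + BE + adjective + that."""
    hits = []
    for i in range(len(tagged) - 3):
        (w1, _), (w2, _), (w3, t3), (w4, _) = tagged[i:i + 4]
        if (w1.lower() == "it"
                and w2.lower() in BE_FORMS
                and t3.startswith("JJ")      # assumed adjective tag prefix
                and w4.lower() == "that"):
            hits.append(" ".join(word for word, _ in tagged[i:i + 4]))
    return hits


# Hypothetical tagged sentence for illustration.
sample = [
    ("It", "PPH1"), ("is", "VBZ"), ("important", "JJ"), ("that", "CST"),
    ("students", "NN2"), ("read", "VV0"), ("widely", "RR"), (".", "."),
]
hits = find_it_be_adj_that(sample)
print(hits)
```

Counting the hits per level of study gives the raw column of a table like the one above; the relative figure additionally needs the size of each sub-corpus.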
Further possibilities
BASE: Linking audio and video to the
transcripts, either online or on hard drives
Insertion of timestamp data into transcripts
Example
Why?
Access to temporal, spatial, paralinguistic,
phonological information
Studies of speech rate, for example
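Once timestamps are in the transcripts, speech rate follows directly. The sketch below assumes each segment is a (start_seconds, end_seconds, text) tuple, which is a hypothetical representation rather than the BASE format, and that words are separated by whitespace.

```python
def words_per_minute(segments):
    """Overall speech rate across timestamped segments, in words per minute."""
    total_words = sum(len(text.split()) for _, _, text in segments)
    total_seconds = sum(end - start for start, end, _ in segments)
    if total_seconds <= 0:
        raise ValueError("segments must cover a positive duration")
    return total_words * 60.0 / total_seconds


# Hypothetical segments for illustration.
segments = [
    (0.0, 3.0, "today we will look at corpora"),
    (3.0, 6.0, "and at how they are built"),
]
rate = words_per_minute(segments)
print(rate)
```

Summing segment durations rather than taking last end minus first start leaves out any untranscribed gaps between segments; which of the two is appropriate depends on the study.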
Uses of corpora
Comparison between languages
Historical linguistics
Stylistics
Studies of language in use
Specialised language use [eg, doctor-patient interactions]
Investigations of multimodality
Projects in mind
PhD thesis corpus
Academic speech events
Electronic submission
Seminars, tutorials, etc
Student use of computers in preparing
assignments [video and text]
Reading and writing of undergraduates
Desiderata
Hosting corpus resources at Reading or other
university – preferably on Linux servers – with
customisable interfaces
BASE, BAWE, and other corpora that Reading
possesses
For use by all departments at Reading and also
elsewhere
Varied levels of user access
Centralised support needed – lack of continuity
with project staff