The 10-million-word Spoken
Dutch Corpus and its potential
use in experimental phonetics
Louis C.W. Pols
Institute of Phonetic Sciences
University of Amsterdam
100 Years of Experimental Phonetics in Russia
St.-Petersburg State Univ., Febr. 1-4, 2001
Amsterdam city center
Herengracht 338
Overview
• Introduction
• Corpus design, recording, digitization
• Orthographic transcription
• Part-of-speech tagging, lemmatization and
syntactic annotation
• Phonetic transcription
• Prosodic transcription
• Exploration
• Potential phonetic benefit
Introduction
• appropriate topic given long Russian tradition
• Dutch-Flemish initiative
• 10 Mƒ, 10 M words (about 1000 hrs of speech)
• start June 1998, 5 yrs, 7 releases (audio + ann.)
• many speaking styles, also over telephone, only
adult speakers, ABN variants but no dialect
• for linguistics and speech/language technology
• rights with NTU (http://www.taalunie.nl)
Corpus design
(number of words x 1000)
Components fall into two groups: dialogues and multilogues, and monologues.
 #  Corpus component                    Total  Phonetic  Syntactic  Prosodic
                                       corpus  transcr.   annotat.  annotat.
 1  conversations (face-to-face)        3,000       150        550       100
 2  interviews                            460        50         50        20
 3  telephone conversations             3,000       300        100        50
 4  business transactions                 175        15         15        10
 5  interviews and discussions            750        75         75        10
 6  discuss., debates, meetings           375        35         35        10
 7  lectures                              350        35         35         0
 8  description of pictures                40        10         10         0
 9  spontaneous commentary                250        25         25        10
10  news rep., current affairs progr.     250        25         25        10
11  news bulletins                        250        25         25        10
12  commentary                            200        25         25        10
13  lectures, speeches                    275        30         30        10
14  read aloud text                     1,000       200          0         0
    Total                              10,375     1,000      1,000       250
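The column totals in the corpus design can be checked with a few lines of Python. This is only a sketch: it assumes that each of the four number columns on the slide lists its values in component order 1 to 14, and the names ROWS, totals and share are invented for this illustration.

```python
# Corpus design figures in thousands of words, read column by column from
# the slide (assumption: every column follows component order 1..14).
# Fields: component, total corpus, phonetic, syntactic, prosodic.
ROWS = [
    ("conversations (face-to-face)", 3000, 150, 550, 100),
    ("interviews", 460, 50, 50, 20),
    ("telephone conversations", 3000, 300, 100, 50),
    ("business transactions", 175, 15, 15, 10),
    ("interviews and discussions", 750, 75, 75, 10),
    ("discuss., debates, meetings", 375, 35, 35, 10),
    ("lectures", 350, 35, 35, 0),
    ("description of pictures", 40, 10, 10, 0),
    ("spontaneous commentary", 250, 25, 25, 10),
    ("news rep., current affairs progr.", 250, 25, 25, 10),
    ("news bulletins", 250, 25, 25, 10),
    ("commentary", 200, 25, 25, 10),
    ("lectures, speeches", 275, 30, 30, 10),
    ("read aloud text", 1000, 200, 0, 0),
]

# Sum each numeric column across the 14 components.
totals = [sum(column) for column in zip(*(row[1:] for row in ROWS))]

# Fraction of the whole corpus covered by each annotation layer.
share = [t / totals[0] for t in totals[1:]]

print(totals)
print([f"{s:.1%}" for s in share])
```

Under this reading the sums reproduce the slide's bottom row (10,375; 1,000; 1,000; 250), so the phonetic and syntactic layers each cover just under 10% of the corpus.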
Recording, digitization
• mono or stereo using portable DAT-recorders
• 16 kHz and 16 bit
(telephone recordings at 8 kHz and 8 bit)
• .WAV format in PRAAT
• meta data about recording and speaker
• 7 audio releases on CD-ROM, or DVD (future?)
• annotations updated with each release
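The sampling formats above translate directly into storage needs. The sketch below assumes uncompressed mono PCM and ignores file headers; the function name pcm_bytes is hypothetical, and the 1000-hour figure is the approximate corpus duration quoted in the introduction, treated here as if it were all wideband audio.

```python
def pcm_bytes(hours, sample_rate_hz, bits_per_sample, channels=1):
    """Size in bytes of uncompressed PCM audio, ignoring file headers."""
    bytes_per_sample = bits_per_sample // 8
    return round(hours * 3600 * sample_rate_hz * bytes_per_sample * channels)

# One hour of mono audio in each of the two recording formats.
wideband_hour = pcm_bytes(1, 16_000, 16)   # 16 kHz, 16 bit
telephone_hour = pcm_bytes(1, 8_000, 8)    # 8 kHz, 8 bit

# Rough size of about 1000 hours if all of it were mono wideband audio.
rough_total_gb = pcm_bytes(1000, 16_000, 16) / 1e9

print(wideband_hour, telephone_hour, rough_total_gb)
```

Stereo recordings double the wideband figure, while the telephone material needs only a quarter of it per hour, so the real total depends on the mix of recording types.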
Orthographic transcription (1)
• by trained students, checked by expert
• according to fixed protocol; no text interpretations
• transcr. aligned in chunks of a few seconds; multiple tiers
• few punctuation marks; capitals for names only
• standard spelling conventions, checked vs. lexicon
• special mark-up symbols:
– *d dialect words; *z regionally accented words
– *t interjection; *a truncated word; *u mispronunciation
– *v foreign words; *n new words; *x hardly intelligible
– ggg speaker sounds; xxx unintelligible word(part)(s)
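The mark-up symbols lend themselves to simple automatic filtering. The sketch below assumes that a mark is attached to the end of a token (for example weekend*v); the slide does not show the exact placement, so that convention, the example tokens and the names MARKS, NON_WORDS and classify are all assumptions made for illustration.

```python
# Mark-up codes as listed on the slide.
MARKS = {
    "d": "dialect word",
    "z": "regionally accented word",
    "t": "interjection",
    "a": "truncated word",
    "u": "mispronunciation",
    "v": "foreign word",
    "n": "new word",
    "x": "hardly intelligible",
}
NON_WORDS = {"ggg": "speaker sound", "xxx": "unintelligible word (part)"}

def classify(token):
    """Return (word, label); label is None for an unmarked token.

    Assumption: a mark is written as a suffix, e.g. 'weekend*v'.
    """
    if token in NON_WORDS:
        return ("", NON_WORDS[token])
    word, star, code = token.rpartition("*")
    if star and code in MARKS:
        return (word, MARKS[code])
    return (token, None)

# Hypothetical tokens, not taken from the corpus.
for tok in ["weekend*v", "ggg", "ja"]:
    print(tok, classify(tok))
```

A lookup of this kind would make it easy to exclude, say, foreign or hardly intelligible words before counting word frequencies.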
Orthographic transcription (2)
[Figure: annotated screenshot of the transcription tool. Labels: text window, selection frame, selected segment, sound window, tiers, time marker]
Part-of-speech tagging
• all words in the text automatically tagged
• discontinuous verbs not recognized at this level
• Dutch tag set with 10 major word classes
(noun, adjective, verb, pronoun, article, numeral,
preposition, adverb, conjunction, and interjection)
• additional morpho-syntactic features per class
(e.g., singular, diminutive and neuter for nouns)
• resulting in some 300 tags
• self-learning automatic tagger (given context)
Lemmatization
• all words autom. paired with base form (lemma)
• verbs → infinitive (gedaan → doen)
other forms → stem (vijfde → vijf)
truncated forms → full forms (z’n → zijn)
• base form must be an independently existing form
(hersenen, not hersen; meisje, not meis)
• discontinuous verbs and split prepositions are not
recognized at this level (op...bellen; van...uit)
• one and only one base form per word
(vliegen → verb vliegen, or noun vlieg, depending on POS)
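The rule of one base form per word, selected by part of speech, can be pictured as a lookup keyed on word form and word class. This is a minimal sketch, not the project's lemmatizer: the table LEMMAS, the function lemmatize, the word-class labels given to vijfde and z'n, and the fallback of returning the word itself are all assumptions.

```python
# Hypothetical lemma table keyed by (word form, major word class).
# The forms and lemmas come from the slide; the class labels are assumed.
LEMMAS = {
    ("gedaan", "verb"): "doen",       # verb form -> infinitive
    ("vijfde", "numeral"): "vijf",    # other forms -> stem
    ("z'n", "pronoun"): "zijn",       # truncated form -> full form
    ("vliegen", "verb"): "vliegen",   # same form, lemma depends on POS
    ("vliegen", "noun"): "vlieg",
}

def lemmatize(word, pos):
    """Return exactly one lemma for a (word, POS) pair.

    Assumption: unknown pairs fall back to the lower-cased word itself.
    """
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(lemmatize("vliegen", "verb"), lemmatize("vliegen", "noun"))
```

Keying on the POS tag is what makes the result unique: the form vliegen alone is ambiguous, but each (form, tag) pair maps to a single lemma.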
Broad phonetic transcription (1)
• on 10% of the data (mainly dialogues)
• hand correction of automatic phonetic transcription
• across-word assimilation, levels of reduction?
• use of extended SAMPA
• within PRAAT
• word level respected
die ik wel vind dat ze kloppen → di k wEl fInt_tAt s@ klOp@
• no hand segmentation at phoneme level
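Because the word level is respected, each orthographic word can be paired with its transcription. The sketch below uses the example from the slide and assumes that the underscore in fInt_tAt links two words affected by cross-word assimilation, so that splitting on spaces and underscores yields one unit per word; that reading and the function name pair_words are assumptions.

```python
def pair_words(orthography, sampa):
    """Pair each orthographic word with its SAMPA transcription.

    Assumption: '_' joins the transcriptions of adjacent words that are
    linked by cross-word assimilation, so it also marks a word boundary.
    """
    words = orthography.split()
    units = sampa.replace("_", " ").split()
    if len(words) != len(units):
        raise ValueError("word level not respected: counts differ")
    return list(zip(words, units))

pairs = pair_words("die ik wel vind dat ze kloppen",
                   "di k wEl fInt_tAt s@ klOp@")
for word, transcription in pairs:
    print(word, transcription)
```

With the slide's example this gives seven pairs, among them ik with k and kloppen with klOp@, which shows how reduction phenomena can be read off word by word even without phoneme-level segmentation.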
Broad phonetic transcription (2)
Signal coupling, word alignment
• the phonetically transcribed part (1 M words) will
be automatically aligned at word level
• using ASR techniques (forced alignment)
• this word alignment will be hand corrected
– pauses and noises will also be aligned
– geminate plosives are aligned separately, others shared
(komt terug → kom t erug; is zeker → isseker)
– inserted phonemes are shared with neighbouring words
(toen belde n ie naar huis → belden nie)
• all the rest may be automatically aligned only
• chunks of a few seconds are always accessible
Syntactic annotation
• 10% will be semi-automatically annotated
• procedure still under development
• interactive annotation software from NEGRA
project (Saarbrücken) will be used
• taking into account idiosyncrasies of speech,
such as hesitations, false starts, clause
extensions
• functional information (dependency labels)
• category information (in form of node labels)
Prosodic annotation
• manually, on 250K-word subset only
• procedure still under development
• prosodic markers in orthography
• 1) prosodic boundaries
– long silences ()
– phrase boundaries ()
– other discontinuities, like (filled) pauses (%)
• 2) prominence (^ before vowel in prominent syllable)
sp. A: n^ee Jan heeft n^egen % medailles
sp. B: z^even
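Since the prosodic markers sit inside the orthography, they can be read back out with plain string handling. The sketch below covers only the two markers that are legible on the slide, ^ for prominence and % for pauses; the function names are hypothetical, and the symbols for long silences and phrase boundaries are left out because they did not survive in this transcript.

```python
PROMINENCE = "^"   # placed before the vowel of a prominent syllable
PAUSE = "%"        # other discontinuities, such as (filled) pauses

def prominent_words(utterance):
    """Return the words carrying a prominence marker, markers removed."""
    return [token.replace(PROMINENCE, "")
            for token in utterance.split()
            if PROMINENCE in token]

def plain_text(utterance):
    """Return the utterance without prominence and pause markers."""
    tokens = [token.replace(PROMINENCE, "")
              for token in utterance.split()
              if token != PAUSE]
    return " ".join(tokens)

speaker_a = "n^ee Jan heeft n^egen % medailles"
print(prominent_words(speaker_a))
print(plain_text(speaker_a))
```

For speaker A this picks out nee and negen as prominent and restores the plain word string, so the same tier can serve both orthographic search and prosodic analysis.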
Exploration software
• COREX tool under development (Max Planck Inst.)
• both locally and internet-based (Java)
• 1) browser
• 2) viewer for orthography and annotations, plus
waveform display and audio player (time synchr.)
• 3) search module, also on meta data
Potential phonetic benefit
• huge database, many speakers/styles, ‘real’ speech
• easily accessible via orthography, plus audio
• partly accessible via phonetic transcription
• no segmentation at phoneme level (automatic?)
• automatic segmentation at word level
• after COREX search: own additions possible
• e.g. spectro-temporal analyses via PRAAT scripts
• e.g. svarabhakti vowel, final n-deletion, assimilation
• e.g. vowel reduction, turn-taking behavior, etc.
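As an illustration of such a study, final n-deletion could be screened for by comparing orthography with the broad phonetic transcription. The heuristic below is deliberately crude and is an assumption of this sketch, not a procedure from the project: it flags words spelled with final -en whose SAMPA transcription ends in schwa (@). The three word pairs are taken from the transcription example earlier in the talk.

```python
def final_n_deleted(word, sampa):
    """Crude screen: spelled with final -en, transcribed with final schwa."""
    return word.lower().endswith("en") and sampa.endswith("@")

# (orthographic word, SAMPA transcription) pairs from the slide's example.
examples = [("wel", "wEl"), ("ze", "s@"), ("kloppen", "klOp@")]

hits = [word for word, sampa in examples if final_n_deleted(word, sampa)]
print(hits)
```

A real study would have to filter out words where the final n is never pronounced or where -en is not a suffix, and would go back to the audio for the flagged tokens.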
More information
• see references in paper
• see websites mentioned in paper
• second release Oct. 2000
• new releases every half year
• feedback from users group (workshops)
• useful for proposed INTAS project
“Spontaneous speech of typologically unrelated
languages (Russian, Finnish and Dutch):
Comparison of phonetic properties” (De Silva, 2000)