How (not) to Select Your Voice Corpus:
Random Selection vs. Phonologically Balanced
Tanya Lambert, Norbert Braunschweiler, Sabine Buchholz
6th ISCA Workshop on Speech Synthesis
Bonn, Germany
22-24 August 2007
Copyright 2007, Toshiba Corporation.
Overview
Text selection for a TTS voice
Random sub-corpus
Phonologically balanced sub-corpus
Phonetic and phonological inventory of full corpus and its sub-corpora
Phonetic and phonological coverage of units in test sentences with
respect to the full corpus and its sub-corpora
Voice building - automatic annotation and training
Objective and subjective evaluations
Conclusions
Selection of Text for a TTS Voice
Voice preparation for a TTS system is affected by:
Text domain from which text is selected
Text annotations (phonetic, phonological, prosodic, syntactic)
The linguistic and signal processing capabilities of the TTS system
Unit selection method and the type of units selected for speech
synthesis
Corpus training
Speech annotation (automatic/manual; phonetic details, post-lexical effects)
Other factors (time and financial resources, voice talent, recording
quality, the target audience of a TTS application, etc.)
Text Selection
Our case study tries to answer the following question:
What is the effect of different script selection methods on a
half-phone unit selection system, automatic corpus
annotation and corpus training?
Full corpus: The ATR American English Speech Corpus for Speech
Synthesis (~ 8 h) used in this year’s Blizzard Challenge.
Random sub-corpus (0.8 h);
Phonologically-rich sub-corpus (0.8 h)
[Diagram: the ~8 h full corpus is split two ways: a phonologically balanced selection yields the Phonbal sub-corpus, and a random selection yields the Random sub-corpus]
Phonologically-Rich Sub-Corpus
[Diagram: construction of the Phonbal sub-corpus]
A set cover algorithm is run over the lexical units of the phonetically and phonologically transcribed full corpus (with stress removed in consonants), yielding Sub-corpus A (1133 sentences); 594 of these sentences covered only 1 unit per sentence and fall below the cut point, leaving 539 sentences.
A second set cover run over sentences from the full corpus, with emphasis on interrogative and exclamatory sentences, multisyllabic phrases, and consonant clusters before and after silence, yields Sub-corpus B.
Sub-corpus B + the 539 Sub-corpus A sentences above the cut point = the Phonbal sub-corpus (728 sentences, ~2906 sec).
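The slides describe the selection only at diagram level. As an illustration, here is a minimal greedy set cover sketch in Python; the unit representation, the function name, and the handling of the cut point are assumptions, not the authors' implementation.

```python
# Minimal greedy set-cover sketch for phonologically rich sentence selection.
# Hypothetical input: sentences mapped to the set of unit types they contain
# (e.g. lexical diphones).

def greedy_set_cover(sentence_units):
    """sentence_units: dict mapping sentence id -> set of unit types."""
    uncovered = set().union(*sentence_units.values())  # all unit types in the corpus
    remaining = dict(sentence_units)
    selected = []
    while uncovered and remaining:
        # Pick the sentence that covers the most still-uncovered unit types.
        best_id = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        gain = remaining.pop(best_id) & uncovered
        if not gain:
            break  # no sentence adds any new unit type
        selected.append((best_id, len(gain)))
        uncovered -= gain
    return selected

# Toy usage; sentences that add only one new unit (like the 594 sentences
# "below the cut point" on the slide) could be dropped afterwards.
if __name__ == "__main__":
    toy = {"s1": {"ax-b", "b-aw", "aw-t"}, "s2": {"ax-b", "t-ax"}, "s3": {"t-ax", "ax-n"}}
    picked = greedy_set_cover(toy)
    kept = [sid for sid, gain in picked if gain > 1]
    print(picked, kept)
```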
Random Sub-Corpus
Randomized sequence of sentences:
[Diagram: sentences containing foreign words are removed from the full corpus; the remaining sentences are put in random order and taken from the top of the list until just below the target duration (686 sentences, < 2914 sec); adding 1 more sentence gives the Random sub-corpus (687 sentences, ~2914 sec)]
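A minimal sketch of random selection under a duration budget; the 2914 s budget mirrors the figures on the slide, while the data layout and function name are assumptions.

```python
import random

# Sketch: random sub-corpus selection under a duration budget.
# Each sentence is (sentence_id, text, duration_seconds); sentences with
# foreign words are assumed to have been filtered out beforehand.

def random_subcorpus(sentences, budget_sec=2914.0, seed=0):
    pool = list(sentences)
    random.Random(seed).shuffle(pool)      # randomized sequence of sentences
    picked, total = [], 0.0
    for sent in pool:
        if total >= budget_sec:
            break                          # budget reached; last sentence may overshoot slightly
        picked.append(sent)
        total += sent[2]
    return picked, total
```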
Textual and Duration Characteristics of Corpora
                            Full      Arctic    Phonbal   Random
seconds                     28,591    2,914     2,906     2,914
sentences                   6,579     1,032     728       687
words                       79,182    9,196     8,156     8,094
words/sent.                 12.0      8.9       11.2      11.8
% sent with 1-9 words       37.7      54.9      41.0      38.6
% sent with 10-15 words     27.6      45.1      18.6      26.9
% sent with >15 words       34.8      -         40.4      34.5
'?'                         868       1         96        94
'!'                         4         -         -         1
','                         3,977     430       452       410
';'                         30        6         4         3
':'                         17        -         -         -
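For completeness, a small sketch of how the textual and duration statistics in the table could be computed from a sentence list; the field names are assumptions.

```python
# Sketch: textual and duration statistics for a (sub-)corpus.
# Each entry is assumed to be a dict with "text" and "duration_sec" keys.

def corpus_stats(sentences):
    n_words = [len(s["text"].split()) for s in sentences]
    total = len(sentences)

    def pct(lo, hi=float("inf")):
        # Percentage of sentences whose word count falls in [lo, hi].
        return 100.0 * sum(lo <= n <= hi for n in n_words) / total

    return {
        "seconds": sum(s["duration_sec"] for s in sentences),
        "sentences": total,
        "words": sum(n_words),
        "words/sent.": sum(n_words) / total,
        "% sent with 1-9 words": pct(1, 9),
        "% sent with 10-15 words": pct(10, 15),
        "% sent with >15 words": pct(16),
        "'?'": sum(s["text"].count("?") for s in sentences),
        "','": sum(s["text"].count(",") for s in sentences),
    }
```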
Corpus Selection - Considerations
Selection of text based on broad phonetic transcription
may be insufficient
Inclusion of phonological, prosodic and syntactic markings
how to make it effective for a half-phone unit selection system?
Distribution of Unit Types in Full Corpus and its Sub-Corpora
Unit Types                     Full     Arctic   Phonbal   Random
diph. (no stress)              1607     1385     1510      1322
lex. diphones                  4332     2716     3306      2735
lex. triphones                 17032    7945     8716      8144
sil_CV clusters (no stress)    104      42       46        43
VC_sil clusters (no stress)    184      84       100       75
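As an illustration of how such inventories can be tallied, a minimal sketch that extracts lexical diphone types from phone sequences and measures a sub-corpus's coverage of the full corpus; the transcription format is an assumption, not the authors' annotation scheme.

```python
# Sketch: count distinct lexical diphone types per corpus and compare coverage.
# Each corpus is assumed to be a list of sentences, each sentence a list of
# phone labels with lexical stress kept on vowels (e.g. "ax", "b", "aw1").

def diphone_types(corpus):
    types = set()
    for phones in corpus:
        seq = ["sil"] + list(phones) + ["sil"]   # pad with silence at both ends
        types.update(zip(seq, seq[1:]))          # adjacent phone pairs = diphones
    return types

def coverage(full, sub):
    """Fraction of the full corpus's diphone types present in the sub-corpus."""
    full_t, sub_t = diphone_types(full), diphone_types(sub)
    return len(full_t & sub_t) / len(full_t)
```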
Percentage Distribution of Units in Full Corpus and its Sub-corpora
[Two bar charts, y-axis "% of full corpus": the share of the full corpus's units (diphones without stress, lexical diphones, lexical triphones, sil_CV clusters, VC_sil clusters) covered by the Arctic, Random, and Phonologically rich corpora; one chart plotted on a 0-20 % scale, the other on a 0-100 % scale]
Distribution of Unit Types in Test Sentences
Testing distribution of unit types in 400 test sentences
100 sentences each from: conv = conversational; mrt = modified rhyme
test; news = news texts; novel = sentences from a novel; sus =
semantically unpredictable sentences
[Bar chart: number of unit type occurrences (up to ~4000) for diphones without stress, lexical diphones, and lexical triphones in the test sentences, broken down by genre: conv, mrt, news, novel, sus]
Distribution of Lexical Diphone Types per Corpus per Text Genre
[Bar chart: occurrence of lexical diphone types per text genre (conv, mrt, news, novel, sus) for the Full corpus, Arctic, Phonologically rich, and Random corpora; y-axis up to ~1800]
Missing Diphone Types from Each Corpus in Relation to Test Sentences
[Two bar charts: lexical diphone types (up to ~160) and diphone types without stress (up to ~30) that occur in the test sentences of each genre (conv, mrt, news, novel, sus) but are missing from the Full corpus, Arctic, Random, and Phonologically rich corpora]
Diphone Types in Each Corpus but not Required in Test Sentences
[Two bar charts: lexical diphone types (up to ~4500) and diphone types without stress (up to ~1600) present in the Full corpus, Arctic, Random, and Phonologically rich corpora but not required by the test sentences of each genre (conv, mrt, news, novel, sus)]
Voice Building – Automatic Annotation and Training
From both corpora, Phonbal and Random, synthesis voices were created
Automatic synthesis voice creation encompasses
Grapheme to phoneme conversion
Automatic phone alignment
Automatic prosody annotation
Automatic prosody training (duration, F0, pause, etc.)
Speech unit database creation
Automatic phone alignment
Depends on the quality of grapheme to phoneme conversion
Depends on the output of text normalisation
Uses HMMs with a flat start, i.e. depends on corpus size
Respects pronunciation variants
Acoustic model topology: three-state, left-to-right with no skips, context-independent, single-Gaussian monophone HMMs
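A minimal sketch of what a flat-start, three-state, left-to-right (no skips), single-Gaussian monophone HMM could look like, using hmmlearn as a stand-in for the actual toolkit; the library choice and the variance floor are assumptions.

```python
import numpy as np
from hmmlearn import hmm

# Sketch: flat-start initialisation of a 3-state, left-to-right (no skips),
# single-Gaussian monophone HMM; hmmlearn stands in for the real toolkit.

def flat_start_monophone(features):
    """features: (n_frames, n_dims) acoustic feature matrix for this phone."""
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                            init_params="", params="tmc", n_iter=5)
    model.startprob_ = np.array([1.0, 0.0, 0.0])       # always enter in state 0
    model.transmat_ = np.array([[0.5, 0.5, 0.0],        # left-to-right, no skips
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    # Flat start: every state gets the global mean and variance of the data.
    model.means_ = np.tile(features.mean(axis=0), (3, 1))
    model.covars_ = np.tile(features.var(axis=0) + 1e-3, (3, 1))
    return model

# Usage: model.fit(features) would then re-estimate the parameters on the
# corpus; the trained monophone models are used for forced phone alignment.
```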
Voice Building – Automatic Annotation and Training
Automatic prosody annotation
Prosodizer creates ToBI markup for each sentence
Rule based
Depends on quality of phone alignments
Depends on quality of text analysis module, i.e. uses PoS, etc.
Automatic prosody training
Depends on phone alignments, ToBI markup, and text analysis
Creates prediction models for:
• Phone duration
• Prosodic chunk boundaries
• Presence or absence of pauses
• The length of previously predicted pauses
• The accent property of each word: de-accented, accented, high
• The F0 contour of each word
The quality of the predicted prosody is an important factor for overall voice quality
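To make the training step concrete, a minimal sketch of one of the listed models, phone duration prediction, as a regression tree over contextual features; the feature set and the use of scikit-learn are assumptions, since the slides do not describe the model type.

```python
from sklearn.tree import DecisionTreeRegressor

# Sketch: train a phone-duration prediction model from aligned, ToBI-annotated data.
# Each training example is a phone with simple contextual features (assumed here):
#   [is_vowel, is_stressed, position_in_syllable, syllables_in_word, word_is_accented]
# and a target duration in milliseconds taken from the automatic phone alignment.

def train_duration_model(feature_rows, durations_ms):
    model = DecisionTreeRegressor(max_depth=8, min_samples_leaf=20)
    model.fit(feature_rows, durations_ms)
    return model

# Usage: durations for unseen phones are predicted from the same features.
# model = train_duration_model(X_train, y_train)
# predicted_ms = model.predict(X_test)
```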
Objective Evaluation – how good are the phone alignments?
Comparison of phone alignments in the Phonbal and Random sub-corpora against those in the Full corpus
Metric                     Phonbal   Random
Overlap Rate               95.26     96.35
RMSE of boundaries         6.3 ms    3.3 ms
boundaries within 5 ms     86.6 %    91.8 %
boundaries within 10 ms    97.1 %    99.1 %
boundaries within 20 ms    99.1 %    99.9 %
The phone alignment of the Random corpus is slightly better than that of Phonbal
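A minimal sketch of how the boundary-level metrics in the table could be computed from two alignments of the same phone sequence; the Overlap Rate computation is omitted.

```python
import math

# Sketch: compare two phone alignments of the same utterance.
# Each alignment is a list of boundary times in seconds; both lists are assumed
# to refer to the same phone sequence, so boundaries correspond one to one.

def boundary_metrics(reference, hypothesis, tolerances_ms=(5, 10, 20)):
    diffs_ms = [abs(r - h) * 1000.0 for r, h in zip(reference, hypothesis)]
    rmse = math.sqrt(sum(d * d for d in diffs_ms) / len(diffs_ms))
    within = {t: 100.0 * sum(d <= t for d in diffs_ms) / len(diffs_ms)
              for t in tolerances_ms}
    return rmse, within

# Usage: rmse_ms, pct_within = boundary_metrics(full_corpus_bounds, subcorpus_bounds)
```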
Objective Evaluation – Accuracy of Prosody Prediction
Comparison of the accuracy of
pause prediction, prosodic chunk prediction, and word accent prediction;
by the modules trained on the Phonbal or on the Random sub-corpus
against the automatic markup of 1000 sentences not in either sub-corpus
                       Phonbal   Random
Chunks    Precision    58.9      56.3
          Recall       34.2      38.7
Pauses    Precision    63.1      63.4
          Recall       34.1      38.0
acc       Precision    69.7      69.5
          Recall       78.4      78.9
high      Precision    54.7      57.1
          Recall       38.6      41.1
Some of the prosody modules trained on the Random corpus are better
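For reference, a minimal sketch of how precision and recall are computed for a binary prediction task such as pause placement, evaluated against the automatic markup; the boolean label representation is an assumption.

```python
# Sketch: precision and recall of a binary predictor (e.g. pause / no pause
# at each word boundary) against the automatically annotated reference.

def precision_recall(predicted, reference):
    """predicted, reference: equal-length sequences of booleans."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(r and not p for p, r in zip(predicted, reference))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 100.0 * precision, 100.0 * recall
```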
Subjective Evaluation – Preference Listening Test
Results of a preference test comparing 53 test sentences synthesized with the Phonbal voice or the Random voice
2 groups of listeners:
Non-American listeners
Native American listeners
Columns 2 and 3 show the number of
times each subject preferred each
voice
Each of the 9 subjects preferred the
Random voice
Subject                    Phonbal   Random
Non-American Listeners
1                          20        33
2                          21        32
3                          24        29
4                          25        28
All                        90        122
American English Listeners
1                          21        32
2                          21        32
3                          16        37
4                          23        30
5                          25        28
All                        106       159
Conclusions
Two synthesis voices were compared in this study:
The two voices are based on two separate selections of sentences
from the same source corpus
The Random corpus was created by a random selection of
sentences from the source corpus
The Phonbal corpus was created by selecting sentences which
optimise its phonetic and phonological coverage
Listeners consistently preferred the TTS voice built with our system
from the Random corpus
Investigation of the differences between the two sub-corpora revealed:
Phonbal has better diphone and lexical diphone coverage
Random has better phone alignments
Random has slightly better prosody prediction performance
Future
Is the better prosody prediction performance only due to better automatic
prosody annotation, which in turn is due to better phone alignment?
Is the random selection inherently better suited to train prosody
models on, e.g. because its distribution of sentence lengths is not as
skewed as the Phonbal one?
What exactly is the relation between phone frequency and alignment
accuracy?
Why does the Random corpus have so much better pause alignment
when it contains fewer pauses?
Is it worth trying to construct some kind of prosodically balanced
corpus to boost the performance of the trained modules, or would that
result in a similar detrimental effect on alignment accuracy?