
How (not) to Select Your Voice Corpus:
Random Selection vs. Phonologically Balanced
Tanya Lambert, Norbert Braunschweiler, Sabine Buchholz
6th ISCA Workshop on Speech Synthesis
Bonn, Germany
22-24 August 2007
Copyright 2007, Toshiba Corporation.
Overview

Text selection for a TTS voice

Random sub-corpus

Phonologically balanced sub-corpus

Phonetic and phonological inventory of full corpus and its sub-corpora

Phonetic and phonological coverage of units in test sentences with
respect to the full corpus and its sub-corpora

Voice building - automatic annotation and training

Objective and subjective evaluations

Conclusions
Selection of Text for a TTS Voice
Voice preparation for a TTS system is affected by:

Text domain from which text is selected

Text annotations (phonetic, phonological, prosodic, syntactic)

The linguistic and signal processing capabilities of the TTS system

Unit selection method and the type of units selected for speech
synthesis

Corpus training

Speech annotation (automatic/manual; phonetic details, post lexical
effects)

Other factors (time and financial resources, voice talent, recording
quality, the target audience of a TTS application, etc.)
Text Selection

 Our case study tries to answer the following question:
  What is the effect of different script selection methods on a half-phone unit selection system, automatic corpus annotation and corpus training?

 Full corpus: the ATR American English Speech Corpus for Speech Synthesis (~8 h), used in this year's Blizzard Challenge
  Random sub-corpus (0.8 h)
  Phonologically-rich sub-corpus (0.8 h)
[Diagram: the full corpus (~8 h) is split by phonologically balanced selection into the Phonbal sub-corpus, and by random selection into the Random sub-corpus]
Phonologically-Rich Sub-Corpus
[Diagram: construction of the phonologically-rich sub-corpus]
 A set cover algorithm is run over the phonetically and phonologically transcribed full corpus (stress removed in consonants), selecting sentences by the lexical units they contain → Sub-corpus A (1133 sentences), of which 594 sentences each covered only 1 unit per sentence.
 Additional sentences are taken from the full corpus, with emphasis on interrogative and exclamatory sentences, multisyllabic phrases, and consonant clusters before and after silence.
 A second set cover pass over the lexical units of the full corpus yields Sub-corpus B; combined with the 539 Sub-corpus A sentences above the cut point, this gives the final sub-corpus (728 sentences, ~2906 sec).
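The set cover step in the diagram can be illustrated with a minimal greedy implementation. This is only a sketch: it assumes each sentence is represented as the set of lexical unit strings it contains, and the function name and data layout are illustrative, not from the paper.

```python
def greedy_set_cover(sentences):
    """Greedy set cover: repeatedly pick the sentence that contributes the
    most not-yet-covered units, until no sentence adds anything new.

    sentences: dict mapping sentence id -> set of unit strings it contains.
    Returns the selected sentence ids in selection order."""
    covered = set()
    remaining = dict(sentences)
    selected = []
    while remaining:
        # Sentence with the largest number of still-uncovered units.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # every remaining sentence is redundant
        covered |= gain
        selected.append(best)
        del remaining[best]
    return selected
```

In the scheme above, one such pass over phonological units produces Sub-corpus A, and a second pass over the lexical units of the full corpus produces Sub-corpus B.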
Random Sub-Corpus
[Diagram: construction of the random sub-corpus]
 Sentences from the full corpus are placed in a randomized sequence; sentences containing foreign words are removed.
 Sentences are taken in that order until the total duration falls just under 2914 sec (686 sentences); adding 1 more sentence gives the final sub-corpus (687 sentences, ~2914 sec).
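The random selection can be sketched in a few lines; a minimal illustration assuming per-sentence durations in seconds are known (the seed and helper name are illustrative, and filtering of foreign-word sentences is assumed to happen beforehand).

```python
import random

def random_selection(durations, budget_sec, seed=0):
    """Take sentences in randomized order until the running total first
    reaches the duration budget; the final sentence may overshoot slightly,
    mirroring the '+ 1 sentence' step in the diagram.

    durations: dict mapping sentence id -> duration in seconds."""
    ids = sorted(durations)
    random.Random(seed).shuffle(ids)  # fixed seed only for reproducibility
    picked, total = [], 0.0
    for sid in ids:
        picked.append(sid)
        total += durations[sid]
        if total >= budget_sec:
            break
    return picked, total
```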
Textual and Duration Characteristics of Corpora
                 Full     Arctic   Phonbal  Random
seconds          28,591   2,914    2,906    2,914
sentences        6,579    1,032    728      687
words            79,182   9,196    8,156    8,094
words/sent.      12.0     8.9      11.2     11.8
% sent with
  1-9 words      37.7     54.9     41.0     38.6
  10-15 words    27.6     45.1     18.6     26.9
  >15 words      34.8     -        40.4     34.5
'?'              868      1        96       94
'!'              4        -        -        1
','              3,977    430      452      410
';'              30       6        4        3
':'              17       -        -        -
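The sentence-length percentages in the table can be computed with a short helper; a sketch assuming plain whitespace tokenization (the bucket boundaries are the ones used in the table).

```python
def length_distribution(sentences):
    """Percentage of sentences with 1-9, 10-15, and more than 15 words,
    the three buckets used in the corpus characteristics table."""
    counts = {"1-9": 0, "10-15": 0, ">15": 0}
    for s in sentences:
        w = len(s.split())
        if w <= 9:
            counts["1-9"] += 1
        elif w <= 15:
            counts["10-15"] += 1
        else:
            counts[">15"] += 1
    n = len(sentences)
    return {k: 100.0 * v / n for k, v in counts.items()}
```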
Corpus Selection - Considerations

 Selection of text based on broad phonetic transcription may be insufficient
 Inclusion of phonological, prosodic and syntactic markings:
  how to make it effective for a half-phone unit selection system?
Distribution of Unit Types in Full Corpus and its Sub-Corpora
Unit Types                   Full    Arctic  Phonbal  Random
diph. (no stress)            1607    1385    1510     1322
lex. diphones                4332    2716    3306     2735
lex. triphones               17032   7945    8716     8144
sil_CV clusters (no stress)  104     42      46       43
VC_sil clusters (no stress)  184     84      100      75
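The unit-type inventories counted above can be collected as n-grams over phone sequences; a sketch in which n=2 gives diphone types and n=3 triphone types (stress marks would be stripped beforehand for the "no stress" rows).

```python
def unit_types(phone_seqs, n=2):
    """Set of n-gram unit types over a list of phone sequences
    (n=2: diphone types, n=3: triphone types)."""
    types = set()
    for seq in phone_seqs:
        for i in range(len(seq) - n + 1):
            types.add(tuple(seq[i:i + n]))
    return types
```

The coverage of a sub-corpus relative to the full corpus is then `len(unit_types(sub)) / len(unit_types(full))`.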
Percentage Distribution of Units in Full Corpus and its Sub-corpora
[Two bar charts: percentage of the full corpus's unit types (diphones (no stress), lexical diphones, lexical triphones, sil_CV clusters, VC_sil clusters) covered by the Arctic, Random, and Phonologically rich sub-corpora; one chart plotted on a 0-20% scale, the other on a 0-100% scale]
Distribution of Unit Types in Test Sentences

 Testing distribution of unit types in 400 test sentences
 100 sentences each from: conv = conversational; mrt = modified rhyme test; news = news texts; novel = sentences from a novel; sus = semantically unpredictable sentences

[Bar chart: number of unit type occurrences (diph. (no stress), lexical diph., lexical triph.) in each test genre (conv, mrt, news, novel, sus); y-axis 0-4000]
Distribution of Lexical Diphone Types per Corpus per Text Genre
[Bar chart: occurrence of lexical diphone types per test genre (conv, mrt, news, novel, sus) for the Full corpus, Arctic, Phon. rich, and Random corpora; y-axis 0-1800]
Missing Diphone Types from Each Corpus in Relation to Test Sentences
[Two bar charts: lexical diphone types (y-axis 0-160) and diphone types without stress (y-axis 0-30) required by the test sentences but missing from the Full corpus, Arctic, Random, and Phonologically rich corpora, per test genre (conv, mrt, news, novel, sus)]
Diphone Types in Each Corpus but not Required in Test Sentences
[Two bar charts: lexical diphone types (y-axis 0-4500) and diphone types without stress (y-axis 0-1600) present in the Full corpus, Arctic, Random, and Phonologically rich corpora but not required by the test sentences, per test genre (conv, mrt, news, novel, sus)]
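The two quantities plotted on these slides, types a test genre needs but the corpus lacks, and types the corpus holds but the genre never needs, are plain set differences; a sketch assuming inventories are sets of unit types (names are illustrative).

```python
def coverage_gap(corpus_types, genre_types):
    """Per test genre, return (missing, unused): unit types the genre's
    sentences require but the corpus lacks, and unit types the corpus
    contains but the genre never requires."""
    return {genre: (len(needed - corpus_types), len(corpus_types - needed))
            for genre, needed in genre_types.items()}
```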
Voice Building – Automatic Annotation and Training

 From both corpora, Phonbal and Random, synthesis voices were created
 Automatic synthesis voice creation encompasses:
  Grapheme-to-phoneme conversion
  Automatic phone alignment
  Automatic prosody annotation
  Automatic prosody training (duration, F0, pause, etc.)
  Speech unit database creation
 Automatic phone alignment:
  Depends on the quality of grapheme-to-phoneme conversion
  Depends on the output of text normalisation
  Uses HMMs with a flat start, i.e. depends on corpus size
  Respects pronunciation variants
  Acoustic model topology: three-state, left-to-right with no skips, context-independent, single-Gaussian monophone HMMs
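The topology listed above (three-state, left-to-right, no skips) corresponds to a banded transition matrix; a minimal sketch in which the 0.6 self-loop probability is an arbitrary illustrative value, not a trained one.

```python
def left_to_right_transitions(n_states=3, self_loop=0.6):
    """Transition matrix of a left-to-right HMM with no skips: each state
    either stays put or advances to the next state; for the last state the
    missing mass (1 - self_loop) is the implicit exit probability."""
    trans = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        trans[i][i] = self_loop
        if i + 1 < n_states:
            trans[i][i + 1] = 1.0 - self_loop
    return trans
```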
Voice Building – Automatic Annotation and Training

 Automatic prosody annotation
  Prosodizer creates ToBI markup for each sentence
  Rule based
  Depends on quality of phone alignments
  Depends on quality of text analysis module, i.e. uses PoS, etc.
 Automatic prosody training
  Depends on phone alignments, ToBI markup, and text analysis
  Creates prediction models for:
   • Phone duration
   • Prosodic chunk boundaries
   • Presence or absence of pauses
   • The length of previously predicted pauses
   • The accent property of each word: de-accented, accented, high
   • The F0 contour of each word
  Quality of predicted prosody is an important factor for overall voice quality
Objective Evaluation – how good are the phone alignments?

 Comparison of phone alignments in the Phonbal and Random sub-corpora against those in the Full corpus

Metric                    Phonbal  Random
Overlap Rate              95.26    96.35
RMSE of boundaries        6.3 ms   3.3 ms
boundaries within 5 ms    86.6 %   91.8 %
boundaries within 10 ms   97.1 %   99.1 %
boundaries within 20 ms   99.1 %   99.9 %

 Phone alignment of the Random corpus is slightly better than that of Phonbal
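The alignment metrics in the table can be computed as below; a sketch assuming both alignments are lists of boundary times in seconds for the same phone sequence (the Overlap Rate metric is not reproduced here).

```python
def boundary_accuracy(ref, hyp, tolerances=(0.005, 0.010, 0.020)):
    """RMSE of boundary placement and percentage of boundaries within each
    tolerance, comparing hypothesis times against reference times."""
    assert len(ref) == len(hyp)
    errs = [h - r for r, h in zip(ref, hyp)]
    rmse = (sum(e * e for e in errs) / len(errs)) ** 0.5
    within = {t: 100.0 * sum(abs(e) <= t for e in errs) / len(errs)
              for t in tolerances}
    return rmse, within
```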
Objective Evaluation – Accuracy of Prosody Prediction

 Comparison of the accuracy of pause prediction, prosodic chunk prediction, and word accent prediction by the modules trained on the Phonbal or on the Random sub-corpus, against the automatic markup of 1000 sentences not in either sub-corpus

              Phonbal             Random
              Precision  Recall   Precision  Recall
Chunks        58.9       34.2     56.3       38.7
Pauses        63.1       34.1     63.4       38.0
Accent: acc   69.7       78.4     69.5       78.9
Accent: high  54.7       38.6     57.1       41.1

 Some prosody modules trained on the Random corpus are better
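Precision and recall as used in the table can be computed over event positions; a sketch assuming predicted and reference events (e.g. pause or chunk boundary locations) are represented as sets of indices.

```python
def precision_recall(reference, predicted):
    """Precision and recall (in percent) of predicted event positions
    against reference positions."""
    tp = len(reference & predicted)  # true positives
    precision = 100.0 * tp / len(predicted) if predicted else 0.0
    recall = 100.0 * tp / len(reference) if reference else 0.0
    return precision, recall
```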
Subjective Evaluation – Preference Listening Test


 Result of preference test comparing 53 test sentences synthesized with voice Phonbal or voice Random
 2 groups of listeners: non-American listeners and native American English listeners
 Columns 2 and 3 show the number of times each subject preferred each voice
 Each of the 9 subjects preferred the Random voice

Subject   Phonbal  Random
Non-American Listeners
1         20       33
2         21       32
3         24       29
4         25       28
All       90       122
American English Listeners
1         21       32
2         21       32
3         16       37
4         23       30
5         25       28
All       106      159
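The table's totals and the unanimity claim can be checked mechanically; a small sketch, fed below with the non-American listener counts from the table.

```python
def preference_summary(counts):
    """counts: list of (phonbal, random) preference counts per subject.
    Returns the column totals and whether every subject preferred Random."""
    total_phonbal = sum(p for p, _ in counts)
    total_random = sum(r for _, r in counts)
    unanimous = all(r > p for p, r in counts)
    return total_phonbal, total_random, unanimous
```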
Conclusions

Two synthesis voices were compared in this study:

The two voices are based on two separate selections of sentences
from the same source corpus

The Random corpus was created by a random selection of
sentences from the source corpus

The Phonbal corpus was created by selecting sentences which
optimise its phonetic and phonological coverage

Listeners consistently preferred the TTS voice built with our system
from the Random corpus

Investigation of the differences between the two sub-corpora revealed:

Phonbal has better diphone and lexical diphone coverage

Random has better phone alignments

Random has slightly better prosody prediction performance
Future

Is the prosody prediction performance only due to better automatic
prosody annotation which is due to better phone alignment?

Is the random selection inherently better suited to train prosody
models on, e.g. because its distribution of sentence lengths is not as
skewed as the Phonbal one?

What exactly is the relation between phone frequency and alignment
accuracy?

Why does the Random corpus have so much better pause alignment
when it contains fewer pauses?

Is it worth trying to construct some kind of prosodically balanced
corpus to boost the performance of the trained modules, or would that
result in a similar detrimental effect on alignment accuracy?