Dialect Simulation through Prosody Transfer: A preliminary


The role of prosody in dialect authentication
Simulating Masan dialect with Seoul speech segments
Kyuchul Yoon
Division of English, Kyungnam University
The Joint Conference of
The Korean Association of Speech Sciences
&
Korean Society of Phonetic Sciences and Speech Technology
Wonkwang University, 2007. 05. 18 ~ 19
Table of Contents
• Background & motivation
• Goals
• Prosody cloning
• Stimuli
• Evaluation
• Future work
Background & motivation
• Differences among dialects
– Segmental differences
• Fricative differences in the time domain (Lee, 2002)
– Busan fricatives have shorter frication/aspiration intervals than Seoul fricatives
• Fricative differences in the frequency domain (Kim et al., 2002)
– The low cutoff frequency of Kyungsang fricatives was higher than that of Cholla fricatives (> 1,000 Hz)
– Non-segmental or prosodic differences
• Intonation or fundamental frequency (F0) contour difference
• Intensity contour difference
• Segment durational difference
• Voice quality difference
Background & motivation
• Concatenative text-to-speech (TTS) synthesizers
– Concatenation-based
– Concatenation units: e.g. diphones
– Concatenation units from pre-recorded utterances of a particular dialect
– No need for modeling segmental properties (cf. formant-based synthesizers)
• Strength/Weakness
– Usually single dialect
Background & motivation
• To build a multi-dialectal TTS synthesizer
– Concatenation units: Multiple dialects(?)
– User-selectable dialects
• Question:
– Scenario A: A multi-dialectal TTS system containing multiple concatenation units from all the dialects involved
– Scenario B: Use the concatenation units from a single dialect and simulate the other dialects
Background & motivation
• The answer has implications for the cost and the complexity of building multi-dialect TTS systems.
• Scenario B
– Simpler & cheaper
– Need for simulating the segmental/non-segmental aspects of the other dialects involved
– Scenario A may be closer to the ultimate solution
• Concatenative TTS systems
– Because modeling the segmental aspects of the concatenation units in the frequency domain can be more difficult, the non-segmental (prosodic) aspects are the ones to manipulate.
Background & motivation
• The imaginary TTS system (Scenario B)
[Diagram: concatenation units from dialect 1 → simulate prosodic aspects → Dialect 2, Dialect 3, Dialect 4]
Background & motivation
• The questions are:
– Would the simulated dialects be good enough?
– In other words, would the segmental effects be negligible in perceiving the simulated dialects as authentic?
Goals
• The goal is to test the viability of Scenario B with an imaginary system:
– Simulate Masan dialect with Seoul speech segments
• The simulated Masan dialect will have
– the speech segments from Seoul dialect
– the prosody of Masan dialect (F0, intensity, duration)
– the voice source of Masan dialect (not tested)
Goals
• The imaginary system would
– have the concatenation units from Seoul dialect
– have a ‘near-perfect’ prosody-generating module
– have to simulate the other dialects, e.g. Masan dialect
• The imaginary TTS system will be implemented with
– the recorded utterances of Seoul dialect
– the Masan prosody (F0, intensity, duration) from recorded Masan utterances
– the voice source of recorded Masan utterances (not tested)
Prosody cloning
• Three aspects of the prosody
– Fundamental frequency (F0) contour
– Intensity contour
– Segmental durations
• Pitch-Synchronous OverLap and Add (PSOLA) algorithm (Moulines & Charpentier, 1990)
– Implemented in Praat (Boersma, 2005)
– Use of a script for semi-automatic segment-by-segment manipulation (Yoon, 2006)
Prosody cloning
• PSOLA algorithm (Moulines & Charpentier, 1990)
– Windowing pitch periods of the original signal
– Rearranging windowed pitch periods to
• Stretch/shrink the signal
(involves adding/deleting windowed pitch periods)
• Change, i.e. increase/decrease the F0 of the signal
(involves adding/deleting windowed pitch periods)
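The windowing and rearranging steps can be illustrated with a minimal Python sketch. It is not the Praat implementation: it assumes a synthetic sine wave with known, evenly spaced pitch marks, and the names `extract_frames` and `overlap_add` are hypothetical. Keeping every third windowed period at the original spacing shortens the signal at the same F0; keeping every second period at double the spacing lowers F0.

```python
import numpy as np

def extract_frames(signal, marks, period):
    """Cut a Hann-windowed frame, two periods long, centered on each pitch mark."""
    window = np.hanning(2 * period + 1)
    padded = np.pad(signal, period)  # zero-pad so edge frames stay in range
    return [padded[m:m + 2 * period + 1] * window for m in marks]

def overlap_add(frames, period, spacing):
    """Place the frames `spacing` samples apart and sum the overlapping parts."""
    out = np.zeros((len(frames) - 1) * spacing + 2 * period + 1)
    for i, frame in enumerate(frames):
        out[i * spacing:i * spacing + len(frame)] += frame
    return out

# Assumption: a synthetic signal of 19 pitch periods, 160 samples each.
period = 160
n_periods = 19
signal = np.sin(2 * np.pi * np.arange(n_periods * period) / period)
marks = np.arange(n_periods) * period  # one pitch mark per period

frames = extract_frames(signal, marks, period)

# Shorten: delete two of every three periods, keep the original spacing (same F0).
shortened = overlap_add(frames[::3], period, spacing=period)

# Lower F0: keep every second period and place them twice as far apart.
lowered = overlap_add(frames[::2], period, spacing=2 * period)
```

Real speech needs pitch-mark detection and handling of unvoiced stretches, which this sketch leaves out.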
Prosody cloning
[Figure: PSOLA illustration. Panels: original waveform; windowed waveform (pitch periods numbered 1 to 19); shortened waveform (periods 1, 4, 7, 10, 13, 16, 19); waveform with lower F0 (periods 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)]
Prosody cloning
• Prosody transfer using the PSOLA algorithm
– Align segments between Masan and Seoul utterances
– Make the segment durations of the two identical
– Make the two F0 contours identical
– Make the two intensity contours identical
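The duration step can be sketched as a per-segment scaling factor computed from the aligned segment boundaries. This is a minimal illustration, not the script used in the study: the function name `duration_factors` is hypothetical, and the boundary times (in ms) are made-up placeholder values, not measurements.

```python
def duration_factors(donor, recipient):
    """Return (label, factor) pairs, where factor = donor duration / recipient duration.

    Each utterance is a list of (label, start_ms, end_ms) tuples, aligned one-to-one.
    """
    if [seg[0] for seg in donor] != [seg[0] for seg in recipient]:
        raise ValueError("segment labels must align one-to-one")
    return [
        (label, (d_end - d_start) / (r_end - r_start))
        for (label, d_start, d_end), (_, r_start, r_end) in zip(donor, recipient)
    ]

# Placeholder boundaries in ms (illustrative only, NOT measured data).
masan = [("ㅂ", 0, 60), ("ㅏ", 60, 180), ("ㄹ", 180, 220), ("ㅏ", 220, 370), ("ㅁ", 370, 450)]
seoul = [("ㅂ", 0, 80), ("ㅏ", 80, 180), ("ㄹ", 180, 230), ("ㅏ", 230, 330), ("ㅁ", 330, 430)]

factors = duration_factors(masan, seoul)

# Stretching or shrinking each Seoul segment by its factor yields the Masan durations.
new_total = sum(f * (end - start) for (_, f), (_, start, end) in zip(factors, seoul))
```

A factor above 1 means the Seoul segment is stretched and a factor below 1 means it is shrunk. Once the durations match, the F0 and intensity contours can be compared point by point on a shared time axis.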
Prosody cloning
[Figure: Align segments between Masan and Seoul utterances and make the segment durations of the two utterances identical. Masan and Seoul waveforms of “…바람…” (‘wind’), segments ㅂ ㅏ ㄹ ㅏ ㅁ]
Prosody cloning
[Figure: Make the two F0 contours identical. Masan F0 and Seoul F0 contours over the aligned Masan and Seoul segments ㅂ ㅏ ㄹ ㅏ ㅁ]
Prosody cloning
[Figure: Make the two intensity contours identical. Masan intensity and Seoul intensity contours over the aligned Masan and Seoul segments ㅂ ㅏ ㄹ ㅏ ㅁ]
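One way to picture the intensity step, once the two utterances have the same duration, is as a time-varying gain equal to the dB difference between the two contours. The sketch below is an assumption for illustration, not the procedure Praat applies. The names `frame_db` and `transfer_intensity` are hypothetical, and the signals are synthetic.

```python
import numpy as np

def frame_db(x, frame):
    """RMS level in dB of consecutive non-overlapping frames."""
    n = len(x) // frame
    rms = np.sqrt(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1))
    return 20 * np.log10(np.maximum(rms, 1e-12))  # floor avoids log of zero

def transfer_intensity(donor, recipient, frame=160):
    """Scale the recipient so that its frame-wise level follows the donor's."""
    n = min(len(donor), len(recipient)) // frame
    gain_db = frame_db(donor, frame)[:n] - frame_db(recipient, frame)[:n]
    centers = (np.arange(n) + 0.5) * frame
    # Interpolate the gain between frame centers to avoid steps at frame edges.
    gain = 10 ** (np.interp(np.arange(n * frame), centers, gain_db) / 20)
    return recipient[:n * frame] * gain

# Assumption: synthetic tones standing in for duration-matched utterances.
n_samples = np.arange(1600)
donor = 0.5 * np.sin(2 * np.pi * n_samples / 160)
recipient = 0.1 * np.sin(2 * np.pi * n_samples / 160)

matched = transfer_intensity(donor, recipient)
```

In this toy case the recipient is simply five times quieter than the donor, so the matched signal coincides with the donor.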
Stimuli for experiment
Stimuli for control
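Simulated Masan utterances
Stimuli used in experiment
• Speakers
– Masan dialect and Seoul dialect
– prosody-donor (A); prosody-recipients (B), (C), (D)
• Sentences
– 바다에 보물섬이 없다 (“There is no treasure island in the sea”)
– 교수님 가시는 길이 구미로… (“The professor's route, to Gumi…”)
– 동대구에 볼 일이 없습니다 (“I have no business in Dongdaegu”)
– 쌀 사고 난 후에 와라 (“Come after you have bought rice”)
– 바람이 불어서 먼지가 많다 (“The wind is blowing, so there is a lot of dust”)
– 싸기는 해 보여도, 비싸기는 … (“It may look cheap, but as for being expensive …”)
– 서울에 사는 삼촌이 왔다 (“My uncle who lives in Seoul came”)
• Stimulus sets
– 7 control stimuli (used)
– 7 test stimuli (used)
– test stimuli (not used)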
Evaluation
• 14 test/control stimuli normalized & randomized
• Presented to 4 Masan listeners for magnitude estimation
– On a scale of 1 (bad) to 10 (best)
– Qualitatively assessed
– Used the Praat ExperimentMFC object
– Repetition of each stimulus: up to 10 times (the listener can press the “replay” button)
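The randomized presentation list can be sketched as follows. This is only an illustration of the bookkeeping, not the ExperimentMFC configuration used in the study: the file names, the `presentation_order` and `is_valid_rating` helpers, and the seed are all hypothetical.

```python
import random

# Hypothetical file names for the 7 control and 7 test stimuli.
control = [f"control_{i}.wav" for i in range(1, 8)]
test = [f"test_{i}.wav" for i in range(1, 8)]

def presentation_order(stimuli, seed):
    """Return a shuffled copy of the stimulus list; the seed makes it reproducible."""
    order = list(stimuli)
    random.Random(seed).shuffle(order)
    return order

def is_valid_rating(score):
    """Ratings are on the 1 (bad) to 10 (best) scale."""
    return isinstance(score, int) and 1 <= score <= 10

order = presentation_order(control + test, seed=0)
```

Fixing the seed per listener keeps each presentation order reproducible when responses are analyzed later.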
Future work
• Carefully control the phonological, morphological, and syntactic aspects of the test sentences
• Try the voice source of Masan utterances
Future work
• Compare narrow-band spectra between Masan and Seoul /i/ (바람이; H1 & H2)
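H1 and H2 conventionally denote the amplitudes of the first and second harmonics, and their difference in dB is a common voice-quality measure. A minimal sketch of reading H1-H2 off a narrow-band spectrum follows. It assumes that F0 is already known and uses a synthetic two-harmonic signal rather than recorded speech; the function name `h1_minus_h2` and the `tol` parameter are hypothetical.

```python
import numpy as np

def h1_minus_h2(x, sr, f0, tol=0.2):
    """H1-H2 in dB from the magnitude spectrum of one Hann-windowed frame.

    A long frame (several pitch periods) gives the narrow-band resolution
    needed to resolve individual harmonics.
    """
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)

    def peak_db(target):
        # Strongest spectral peak within +/- tol of the expected harmonic.
        band = (freqs > target * (1 - tol)) & (freqs < target * (1 + tol))
        return 20 * np.log10(spec[band].max())

    return peak_db(f0) - peak_db(2 * f0)

# Assumption: synthetic signal whose second harmonic has half the amplitude.
sr, f0 = 16000, 200
t = np.arange(1600) / sr  # 100 ms frame, i.e. 20 pitch periods
x = 1.0 * np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)

difference_db = h1_minus_h2(x, sr, f0)
```

For recorded vowels, F0 would first have to be estimated for the analyzed frame.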
Future work
• Original Masan utterance
• Original Seoul utterance
• Simulated Masan utterance: Seoul segments + Masan prosody
• Simulated Masan utterance: Seoul segments + Masan prosody + Masan voice source
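Simulated Seoul utterances
Appendix: Additional stimuli not used in experiment
• Speakers
– Seoul dialect and Masan dialect
– prosody-donor (A); prosody-recipients (B), (C), (D)
• Sentences
– 바다에 보물섬이 없다 (“There is no treasure island in the sea”)
– 교수님 가시는 길이 구미로… (“The professor's route, to Gumi…”)
– 동대구에 볼 일이 없습니다 (“I have no business in Dongdaegu”)
– 쌀 사고 난 후에 와라 (“Come after you have bought rice”)
– 바람이 불어서 먼지가 많다 (“The wind is blowing, so there is a lot of dust”)
– 싸기는 해 보여도, 비싸기는 … (“It may look cheap, but as for being expensive …”)
– 서울에 사는 삼촌이 왔다 (“My uncle who lives in Seoul came”)
• Stimulus sets
– control stimuli
– test stimuli
– test stimuli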
References
[1] Kyung-Hee Lee, “Comparison of acoustic characteristics between Seoul and Busan dialect on fricatives”, Speech Sciences, Vol.9/3, pp.223-235, 2002.
[2] Hyun-Gi Kim, Eun-Young Lee, and Ki-Hwan Hong, “Experimental phonetic study of Kyungsang and Cholla dialect using power spectrum and laryngeal fiberscope”, Speech Sciences, Vol.9/2, pp.25-47, 2002.
[3] Kyuchul Yoon, “Swapping native and non-native speakers' prosody using PSOLA algorithm”, Proceedings of the Korean Society of Phonetic Sciences and Speech Technology, Spring Conference, pp.77-81, 2006.
[4] E. Moulines and F. Charpentier, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones”, Speech Communication, Vol.9/5-6, 1990.
[5] P. Boersma, “Praat, a system for doing phonetics by computer”, Glot International, Vol.5, 9/10, pp.341-345, 2005.