Current Interests 2007~2008
(Unfinished papers & Premature ideas)
1. Identifying frication & aspiration noise in the frequency domain: The case of Korean alveolar lax fricatives
2. The role of prosody in dialect synthesis and authentication
3. Synthesis & evaluation of prosodically exaggerated utterances
4. Determining the weights of prosodic components in prosody evaluation
5. Difference database of prosodic features for automatic prosody evaluation
6. Transforming Korean alveolar lax fricatives into tense
7. Gender transformation of utterances
1
1. Identifying frication & aspiration
noise in the frequency domain:
The case of Korean alveolar lax fricatives
Kyuchul Yoon
School of English Language & Literature
Yeungnam University
Spring 2008 Joint Conference of KSPS & KASS
Korean lax alveolar fricatives
• Two different types of noise
3
Algorithm
4
Algorithm
• Change of energy distribution in the frequency
domain over time
• Energy distribution on a frame-by-frame basis
(e.g. 5 msec)
• Sums of band energy across the reference (e.g.
low cutoff) frequency
• criterionValue variable determines the boundary
• Assumption: the same criterionValue holds for all utterances from the same speaker
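The bullets above can be sketched in Python as follows. This is only an illustrative sketch, not the Praat script used in the study: the function names, the use of the low-band share of energy as the quantity compared against the criterion value, and the direction of the comparison (aspiration carrying more energy below the reference frequency than frication) are all assumptions made for the example.

```python
import cmath
import math


def band_energies(frame, sample_rate, cutoff_hz):
    """Return (low, high) spectral energy of one frame, split at cutoff_hz."""
    n = len(frame)
    low = high = 0.0
    for k in range(1, n // 2 + 1):  # skip DC; stop at the Nyquist bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        energy = abs(coeff) ** 2
        if k * sample_rate / n < cutoff_hz:
            low += energy
        else:
            high += energy
    return low, high


def find_boundary(samples, sample_rate, cutoff_hz, criterion_value,
                  frame_sec=0.005):
    """Return the start time (sec) of the first frame whose low-band share
    of energy exceeds criterion_value, or None if no frame does."""
    frame_len = int(round(frame_sec * sample_rate))
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        low, high = band_energies(samples[start:start + frame_len],
                                  sample_rate, cutoff_hz)
        total = low + high
        if total > 0 and low / total > criterion_value:
            return start / sample_rate
    return None


# Synthetic stand-in for a fricative: 50 msec of high-band energy
# (frication-like) followed by 50 msec of low-band energy (aspiration-like).
sample_rate = 8000
frication = [math.sin(2 * math.pi * 3000 * i / sample_rate) for i in range(400)]
aspiration = [math.sin(2 * math.pi * 600 * i / sample_rate) for i in range(400)]
boundary = find_boundary(frication + aspiration, sample_rate,
                         cutoff_hz=2000, criterion_value=0.5)
print(boundary)
```

With real speech the criterion value would have to be tuned per speaker, which is what the same-speaker assumption above is meant to make practical.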
5
How Praat script works
See Demo
6
How Praat script works
7
Experiment
<Table 1> The list of words used in the experiment. The words marked with * were also used in
the repeated series experiment. The numbers in parentheses represent the number of repetitions
during the recording.
8
Results & Conclusion
Human 1 vs. Script 1
Repeated
<Histogram 1> The histogram of differences between the manually inserted and
automatically inserted boundaries for the repeated series experiment. X-axis in msec.
9
Results & Conclusion
The outlier from <Histogram 1>. The difference was 6.4 msec.
The labels m and a represent manual and automatic, respectively.
10
Results & Conclusion
The same-speaker-same-criterionValue assumption holds!
Human 1 vs. Script 1
Non-repeated
Human 2 vs. Script 2
Non-repeated
<Histogram 2> The histogram of differences between the manually inserted and automatically
inserted boundaries for the non-repeated series experiment with 53 words. X-axis in msec.
11
Results & Conclusion
Human 1 vs. Human 2
Non-repeated
Script 1 vs. Script 2
Non-repeated
<Histogram 3> The histogram of differences between the two phoneticians and the two automated
scripts for the non-repeated series experiment with 53 words. X-axis in msec.
12
Results & Conclusion
<Table 2> The summary of the means and the standard deviations of the differences from the two
experiments. The numbers are given in msec.
13
Results & Conclusion
The automated identification of the boundary (labeled auto) between /s/ and /h/ in the phrase Miss
Henry produced by a female native speaker of English. The f and v represent the beginnings of /s/
and the vowel following /h/.
14
References
[1] Boersma, Paul. 2001. Praat, a system for doing phonetics by computer. Glot International
5(9/10). pp.341-345.
[2] Yoon, Kyuchul. 2002. A production and perception experiment of Korean alveolar
fricatives. Speech Sciences. 9(3). pp.169-184.
[3] Yoon, Kyuchul. 2005. Durational correlates of prosodic categories: The case of two
Korean voiceless coronal fricatives. Speech Sciences. 12(1). pp.89-105.
15
2. The role of prosody
in dialect synthesis and authentication
Kyuchul Yoon
School of English Language & Literature
Yeungnam University
Spring 2008 Joint Conference of KSPS & KASS
Goals
1. Synthesize Masan utterances from
matching Seoul utterances by prosody
cloning
2. Examine the role of prosody in the
authentication of synthetic Masan
utterances (Listening experiment)
17
Background
• Differences among dialects
– Segmental differences
• Fricative differences in the time domain (Lee, 2002)
– Busan fricatives have shorter frication/aspiration intervals than Seoul fricatives
• Fricative differences in the frequency domain (Kim et al., 2002)
– The low cutoff frequency of Kyungsang fricatives was higher than that of Cholla fricatives (> 1,000 Hz)
– Non-segmental or prosodic differences
• Intonation or fundamental frequency (F0) contour difference
• Intensity contour difference
• Segment durational difference
• Voice quality difference
18
Synthesis
• Simulating (by prosody cloning) Masan
dialect from Seoul dialect
• The simulated Masan utterances will have
– the speech segments of Seoul dialect
– the prosody of Masan dialect
• F0 contour
• Intensity contour
• Segmental duration
19
Evaluation
• Through a listening experiment
• Stimuli consist of
– #1. Authentic, but synthetic, Masan utterance
– #2. Seoul utterance with Masan segmental durations (D)
– #3. Seoul utterance with Masan F0 contour (F)
– #4. Seoul utterance with Masan intensity contour (I)
– #5. Seoul utterance with Masan durations and F0 contour (D+F)
– #6. Seoul utterance with Masan durations and intensity contour (D+I)
– #7. Seoul utterance with Masan F0 contour and intensity contour (F+I)
– #8. Seoul utterance with Masan durations, F0 contour and intensity contour (D+F+I)
(1) 동대구에 볼 일이 없습니다. ("I have no business in Dongdaegu.") (2) 바다에 보물섬이 없다. ("There is no treasure island in the sea.")
Listen to Stimuli
20
Prosody transfer (PSOLA algorithm)
• Three aspects of the prosody
– Fundamental frequency (F0) contour
– Intensity contour
– Segmental durations
• Pitch-Synchronous OverLap and Add (PSOLA)
algorithm (Moulines & Charpentier, 1990)
– Implemented in Praat (Boersma, 2001)
– Use of a script for semi-automatic segment-by-segment
manipulation (Yoon, 2007)
21
Prosody transfer (PSOLA algorithm)
• Procedures for full prosody transfer
– Align segments between Masan and Seoul utterances
– Make the segment durations of the two identical
– Make the two F0 contours identical
– Make the two intensity contours identical
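The first two steps of the procedure can be illustrated with a short Python sketch. It assumes, for illustration only, that each utterance has already been labeled as a list of (label, start, end) segments; the function name and data layout are hypothetical and are not taken from the Praat script. The resulting factors are the kind of per-segment time-scaling values that a PSOLA duration manipulation would then apply.

```python
def duration_factors(source_segments, target_segments):
    """Return [(label, factor)], where factor is the amount by which each
    source segment must be stretched to match the target segment.

    Each segment is a (label, start_sec, end_sec) tuple.
    """
    if len(source_segments) != len(target_segments):
        raise ValueError("utterances must have the same number of segments")
    factors = []
    for (s_label, s0, s1), (t_label, t0, t1) in zip(source_segments,
                                                    target_segments):
        if s_label != t_label:
            raise ValueError(f"segment mismatch: {s_label} vs. {t_label}")
        factors.append((s_label, (t1 - t0) / (s1 - s0)))
    return factors


# Hypothetical boundaries (sec) for the first two segments of "바람".
seoul = [("ㅂ", 0.00, 0.05), ("ㅏ", 0.05, 0.15)]
masan = [("ㅂ", 0.00, 0.04), ("ㅏ", 0.04, 0.24)]
for label, factor in duration_factors(seoul, masan):
    print(label, round(factor, 2))
```

A factor below 1 shortens the Seoul segment and a factor above 1 lengthens it; the F0 and intensity contours can only be compared point by point once the durations match.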
22
Prosody transfer (PSOLA algorithm)
Align segments between Masan and Seoul utterances
Make the segment durations of the two utterances identical
[Figure: segment alignment of “…바람…” (ㅂ ㅏ ㄹ ㅏ ㅁ) in the Masan and Seoul utterances]
23
Prosody transfer (PSOLA algorithm)
Make the two F0 contours identical
[Figure: Masan F0 and Seoul F0 contours over the aligned segments ㅂ ㅏ ㄹ ㅏ ㅁ]
24
Prosody transfer (PSOLA algorithm)
Make the two intensity contours identical
[Figure: Masan intensity and Seoul intensity contours over the aligned segments ㅂ ㅏ ㄹ ㅏ ㅁ]
25
Synthetic (simulated) Masan stimulus
26
Synthetic authentic Masan stimulus
27
Listening experiment
• 16 stimuli (8 + 8)
• Presented to 13 Masan/Changwon listeners
– On a scale of 1 (worst) to 10 (best)
– Used Praat ExperimentMFC object
– Allowed repetition of stimulus: up to 10 times
28
Listening experiment
See Demo
29
Results & Conclusion
Histogram of listener responses
30
Results & Conclusion
1 … listener responses … 10
F0 contour transfer
31
Results & Conclusion
[Figure: listener responses by stimulus: Masan, F, D, FI, DF, I, DFI, DI (Seoul utterances with Masan prosody)]
32
Results & Conclusion
• Main effects of
– Segmental durations; F(1,12)=11.53, p=0.005
– F0 contour; F(1,12)=141.12, p=0.00000005
• Regression analysis
33
Results & Conclusion
• Prosody cloning not sufficient for dialect
simulation
– (Sub)Segmental differences may be at work
– Quality of synthetic stimuli
• F0 contour transfer (from Masan to Seoul)
– Most influential on shifting perception from
Seoul to Masan utterances
34
References
[1] Kyung-Hee Lee, “Comparison of acoustic characteristics between Seoul and Busan
dialect on fricatives”, Speech Sciences, Vol.9/3, pp.223-235, 2002.
[2] Hyun-Gi Kim, Eun-Young Lee, and Ki-Hwan Hong, “Experimental phonetic study
of Kyungsang and Cholla dialect using power spectrum and laryngeal
fiberscope”, Speech Sciences, Vol.9/2, pp.25-47, 2002.
[3] Kyuchul Yoon, “Imposing native speakers’ prosody on non-native speakers’
utterances: The technique of cloning prosody”, Journal of the Modern British &
American Language & Literature, Vol.25(4). pp.197-215, 2007.
[4] E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques
for text-to-speech synthesis using diphones”, Speech Communication, Vol.9/5-6, pp.453-467, 1990.
[5] P. Boersma, “Praat, a system for doing phonetics by computer”, Glot International,
Vol.5/9-10, pp.341-345, 2001.
35
3. Synthesis & evaluation of
prosodically exaggerated utterances:
A preliminary study
Kyuchul Yoon
School of English Language & Literature
Yeungnam University
Spring 2008 Joint Conference of KSPS & KASS
Contents
• Synthesis & evaluation of human utterances
with exaggerated prosody
• Synthesis of exaggerated prosody
– Useful for presenting native utterances to students
– The definition of prosody “exaggeration”
– The algorithm
• Evaluation of exaggerated prosody
– Useful for evaluating learner utterances
– The algorithm & an experiment
37
Teaching & evaluating prosody
• Teaching language prosody
– The need for “exaggeration” of native utterances
– How to define “exaggeration”
• Evaluating language prosody
– Given the native version of an utterance,
evaluate the learner’s atypical prosody
– How to measure the differences between the native
and learner utterances
38
Exaggerating native prosody
• Exaggeration of the F0 contour
– One way would be to make the pitch peaks/valleys
higher/lower
• Exaggeration of the intensity contour
– One way would be to manipulate the intensity contour
of the pitch peaks (or valleys)
• Exaggeration of the segmental durations
– One way would be to manipulate the segmental
durations of the pitch peaks (or valleys)
See Demo
39
Exaggerating native prosody
F0
The fundamental frequency (F0) contour of the utterance “Marianna!”.
40
Exaggerating native prosody
Intensity
The intensity contour of the utterance “Marianna!”.
41
Exaggerating native prosody
Duration
The segmental durations of the utterance “Marianna!” before and after the exaggeration.
42
Algorithm: prosody exaggeration
• Definition of prosody exaggeration
– F0 contour
• Make pitch peaks/valleys higher/lower in Hz values
– Intensity contour
• Make pitch peaks higher in dB values
– Segmental durations
• Make the segments at pitch peaks longer in time values
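As a rough illustration of the F0 part of this definition, the Python sketch below pushes every F0 value away from the utterance mean by a constant factor, so that peaks become higher and valleys lower in Hz. The slides do not state the exact formula, so the mean-based scaling, the function name and the `factor` parameter are assumptions made for the example. Unvoiced frames are represented as `None`.

```python
def exaggerate_f0(contour_hz, factor):
    """Scale each voiced F0 value's distance from the mean by `factor`.

    factor > 1 raises peaks and lowers valleys; factor == 1 leaves the
    contour unchanged. Unvoiced frames (None) are passed through.
    """
    voiced = [f for f in contour_hz if f is not None]
    if not voiced:
        return list(contour_hz)
    mean_f0 = sum(voiced) / len(voiced)
    return [None if f is None else mean_f0 + factor * (f - mean_f0)
            for f in contour_hz]


contour = [None, 100.0, 150.0, 200.0, 150.0, 100.0]
print(exaggerate_f0(contour, 1.5))
# [None, 80.0, 155.0, 230.0, 155.0, 80.0]
```

The same idea could be applied to the intensity contour in dB, and a comparable factor could lengthen the segments that carry the pitch peaks.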
43
Algorithm: prosody exaggeration
F0
44
Algorithm: prosody exaggeration
Intensity
45
Algorithm: prosody exaggeration
Durations
46
How Praat script works
47
How Praat script works
F0
Intensity
Durations
48
How Praat script works
Original
F0
Durations
F0
Durations
Intensity
49
Evaluating learner prosody
• Assumes the existence of the native version
• Evaluates the learner versions
• Evaluation of the F0 & intensity contours
– Is preceded by duration manipulation:
• The durations of the matching segments of the two utterances are
made identical [3]
– Is preceded by F0/intensity normalization & F0 smoothing
• The mean difference between the two utterances is added to (or subtracted from) the learner utterance
– Is followed by pitch/intensity point-to-point comparison
• Evaluation of segmental durations
– Done without any duration manipulation; segment-to-segment comparison
• Evaluation measure: Euclidean distance metric
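The normalization and point-to-point comparison can be sketched in Python as below. This is a minimal sketch under stated assumptions: both contours are assumed to be sampled at the same time points after the duration manipulation, smoothing is omitted, and the function names are hypothetical.

```python
import math


def normalize_to_native(native, learner):
    """Shift the learner contour by the mean difference between the two."""
    offset = sum(native) / len(native) - sum(learner) / len(learner)
    return [value + offset for value in learner]


def contour_distance(native, learner):
    """Euclidean distance between the native contour and the
    mean-normalized learner contour (point-to-point comparison)."""
    if len(native) != len(learner):
        raise ValueError("contours must have the same number of points")
    shifted = normalize_to_native(native, learner)
    return math.sqrt(sum((n - s) ** 2 for n, s in zip(native, shifted)))


native_f0 = [200.0, 220.0, 210.0]
flat_learner_f0 = [190.0, 190.0, 190.0]
print(round(contour_distance(native_f0, flat_learner_f0), 3))
```

Because of the normalization step, a learner contour that has the same shape as the native one but sits at a different overall pitch level yields a distance of zero.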
50
Algorithm: prosody evaluation
Before & after duration manipulation
[Figure: native utterance; learner utterance before and after duration manipulation]
51
Algorithm: prosody evaluation
F0 point-to-point comparison between native and learner
[Figure: F0 contours of the native utterance and the learner utterance after duration manipulation]
Normalization & smoothing were performed in prior steps
52
Algorithm: prosody evaluation
Intensity point-to-point comparison between native and learner
[Figure: intensity contours of the native utterance and the learner utterance after duration manipulation]
Normalization was performed in prior steps
53
Algorithm: prosody evaluation
Duration segment-to-segment comparison between native and learner
[Figure: segment durations of the native utterance and the learner utterance before duration manipulation]
Euclidean distance metric for evaluation measure
P = (p1, p2, p3,..., pn) and Q = (q1, q2, q3,..., qn) in Euclidean n-space
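For reference, the standard Euclidean distance between the two points P and Q defined above is given below. Here P can be read as the sequence of native values (F0 points, intensity points, or segment durations) and Q as the matching learner values; that reading follows the comparison described on these slides.

```latex
\[
  d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
\]
```

The smaller the distance, the closer the learner's values are to the native ones, with a distance of 0 meaning the two sequences are identical.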
54
A pilot experiment
[Figure: native utterance and learner utterance after D/F/I cloning]
An ideal case: the three Euclidean distances (Ed) should be minimal
Ed1: F0 contour
Ed2: Intensity contour
Ed3: Segment durations
55
Creation of Stimuli: F0
[Figure: native utterance and learner utterance after D cloning, with F0 offsets applied]
F0: -100 Hz to +100 Hz with a 10 Hz interval → 21 stimuli
Evaluation of the stimuli against the F0 contour of the native utterance
56
Creation of Stimuli
[Figure: native utterance and learner utterance after D cloning, with intensity offsets applied]
Intensity: -25 dB to +25 dB with a 5 dB interval → 11 stimuli
Evaluation of the stimuli against the intensity contour of the native utterance
57
Creation of Stimuli
[Figure: native utterance and learner utterance, with duration factors applied]
Duration: 0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 2.50, 3.00 times the original → 8 stimuli
Evaluation of the stimuli against the segment durations of the native utterance
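The three stimulus series described on these slides can be enumerated with a few lines of Python, which also confirms the stimulus counts. The helper name `stepped_series` is hypothetical; only the ranges, step sizes and duration factors come from the slides.

```python
def stepped_series(low, high, step):
    """Return low, low + step, ..., high (inclusive)."""
    return list(range(low, high + 1, step))


f0_offsets_hz = stepped_series(-100, 100, 10)
intensity_offsets_db = stepped_series(-25, 25, 5)
duration_multipliers = [0.25, 0.50, 0.75, 1.00, 1.50, 2.00, 2.50, 3.00]

print(len(f0_offsets_hz), len(intensity_offsets_db), len(duration_multipliers))
# 21 11 8
```

Each series contains the unmodified case (an offset of 0 or a factor of 1.00), which is the stimulus expected to lie closest to the native utterance.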
58
Results & Conclusion
59
Results & Conclusion
60
Results & Conclusion
61
Results & Conclusion
• Prosody exaggeration
– Can be a tool for teaching language prosody
– Can be used to test measures for evaluating prosody
• Limitation of the current prosody evaluation
– Native utterances should exist to yield measures
• TTS systems with advanced prosody models could be helpful to
process any learner utterances
– “Weights” of the three separate measures
(F0/intensity/duration) need to be determined
• Experiments with human evaluators could provide the weights
62
References
[1] Boersma, Paul. 2001. Praat, a system for doing phonetics by computer. Glot
International 5(9/10). pp.341-345.
[2] Moulines, E. & F. Charpentier. 1990. Pitch synchronous waveform processing
techniques for text-to-speech synthesis using diphones. Speech Communication 9.
pp.453-467.
[3] Yoon, K. 2007. Imposing native speakers' prosody on non-native speakers'
utterances: The technique of cloning prosody. Journal of the Modern British &
American Language & Literature 25(4). pp.197-215.
63
4. Determining the weights
of prosodic components in prosody evaluation
• Problem
– Raw components vs. Abstracted concepts
– F0, intensity, duration vs. Rhythm, tempo, etc.
• Determine the weights of prosodic components in prosody evaluation
– Use raw units: F0, intensity, duration
– Use cloning of prosody (problem of unequal number of segments)
– Create an “other-things-being-equal” environment
– Evaluation of
• Each raw prosodic component
• Overall prosodic fluency
– Compare & assess the weights of each component in prosody evaluation
64
Stimuli (4) Determining the weights of prosodic components in prosody evaluation
• Given (a) model native utterance(s) and its learner version
– Human evaluator evaluates the learner utterance in terms of its prosodic fluency
= Overall Prosody Score (from the unmodified learner utterance)
• Manipulate the learner utterance to create an “other-things-being-equal” environment so
that the learner utterance is the same as its native version except for
– (1) Its F0 contour (learner utterance version 1)
– (2) Its intensity contour (learner utterance version 2)
– (3) Its segmental durations (learner utterance version 3)
• Evaluate the manipulated learner utterances
– (1) F0 score (from learner version 1)
– (2) Intensity score (from learner version 2)
– (3) Duration score (from learner version 3)
• Hypothesis:
Overall prosody score = α × (F0 score) + β × (Intensity score) + γ × (Duration score)
• Repeat the evaluation for other utterances from the same learner to solve the equation
• Verify the coefficients with unevaluated utterances from the same learner
• If the hypothesis holds, make the prosody evaluation process automatic
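Solving the hypothesized equation for its three coefficients needs at least three evaluated utterances from the same learner. The Python sketch below solves exactly three such equations with Cramer's rule. The scores are invented for illustration, the function names are hypothetical, and with more than three utterances a least-squares fit would be the more natural choice.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_weights(component_scores, overall_scores):
    """Solve overall = w_f0 * F0 + w_int * intensity + w_dur * duration.

    component_scores: three rows of (F0, intensity, duration) scores.
    overall_scores:   the three matching overall prosody scores.
    """
    d = det3(component_scores)
    if abs(d) < 1e-12:
        raise ValueError("scores are linearly dependent; add other utterances")
    weights = []
    for col in range(3):
        replaced = [list(row) for row in component_scores]
        for r in range(3):
            replaced[r][col] = overall_scores[r]
        weights.append(det3(replaced) / d)
    return tuple(weights)


# Invented scores for three utterances from one learner.
scores = [[8.0, 6.0, 7.0], [5.0, 9.0, 6.0], [7.0, 7.0, 9.0]]
overall = [7.3, 6.1, 7.6]
print([round(w, 3) for w in solve_weights(scores, overall)])
```

The recovered weights can then be checked against further utterances from the same learner, as the verification step above proposes.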
65
Stimuli “The dancing queen likes only the apple pies”
Native (5061_02)
Evaluate overall prosody with respect to the native version (Overall Prosody Score)
Learner (1047_02)
66
Stimuli “The dancing queen likes only the apple pies”
Native
Learner_DI
Now has the native durations/intensity. Evaluate F0 contour (F0 Score)
Learner_DF
Now has the native durations/F0 contour. Evaluate intensity contour (Intensity Score)
67
Stimuli “The dancing queen likes only the apple pies”
Native
Learner_FI
Now has the native F0/intensity. Evaluate segmental durations (Duration Score)
Overall prosody score = α × (F0 score) + β × (Intensity score) + γ × (Duration score)
68
5. Difference database of prosodic features
for automatic prosody evaluation
• Given (a) model native utterance(s) and its learner version, get
difference values of
– (1) F0 contour
– (2) intensity contour
– (3) segmental durations
between the two utterances
• Use techniques & scripts used in
– (3) Synthesis & evaluation of prosodically exaggerated utterances
• Store difference values of each prosodic feature for each learner
utterance in a database
• Use the database to develop algorithms for automatic prosody scoring
• Pilot study: labeled sentences from KT_K-SEC corpus
69
Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
70
Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
Intensity difference (first 18 of 482 frames shown; time in sec)
native       learner      numFrames  frameNo  time   nativedB  learnerdB  diffdB
5053_02.wav  1044_02.wav  482        1        0.035  31.86     42.42      -10.56
5053_02.wav  1044_02.wav  482        2        0.043  30.73     42.45      -11.72
5053_02.wav  1044_02.wav  482        3        0.051  29.33     41.94      -12.61
5053_02.wav  1044_02.wav  482        4        0.059  29.03     41.00      -11.97
5053_02.wav  1044_02.wav  482        5        0.067  29.11     40.97      -11.86
5053_02.wav  1044_02.wav  482        6        0.075  29.92     41.97      -12.05
5053_02.wav  1044_02.wav  482        7        0.083  30.27     42.67      -12.40
5053_02.wav  1044_02.wav  482        8        0.091  31.14     42.63      -11.49
5053_02.wav  1044_02.wav  482        9        0.099  30.27     44.10      -13.83
5053_02.wav  1044_02.wav  482        10       0.107  30.35     45.12      -14.77
5053_02.wav  1044_02.wav  482        11       0.115  30.73     43.90      -13.18
5053_02.wav  1044_02.wav  482        12       0.123  30.53     43.15      -12.62
5053_02.wav  1044_02.wav  482        13       0.131  32.44     42.67      -10.22
5053_02.wav  1044_02.wav  482        14       0.139  31.12     40.94      -9.82
5053_02.wav  1044_02.wav  482        15       0.147  30.97     38.88      -7.91
5053_02.wav  1044_02.wav  482        16       0.155  33.92     38.15      -4.24
5053_02.wav  1044_02.wav  482        17       0.163  33.78     37.45      -3.67
5053_02.wav  1044_02.wav  482        18       0.171  32.72     35.75      -3.03
Sum of squares of the diffdB values: 42114
Square root of the sum: 205
71
Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
Duration difference (timeStart in sec; durations in msec)
native            learner           numSegs  segNo  nativeSegID  learnerSegID  timeStart  nativeDur  ratio  normNativeDur  learnerDur  normDiffDur
5053_02.TextGrid  1044_02.TextGrid  33       1      SIL          SIL           0          330        1.027  321            328         -7
5053_02.TextGrid  1044_02.TextGrid  33       2      dh           dh            0.330      22         1.027  22             16          5
5053_02.TextGrid  1044_02.TextGrid  33       3      ax           ax            0.353      60         1.027  59             86          -27
5053_02.TextGrid  1044_02.TextGrid  33       4      SIL          SIL           0.413      104        1.027  101            67          34
5053_02.TextGrid  1044_02.TextGrid  33       5      dd           dd            0.517      19         1.027  19             14          5
5053_02.TextGrid  1044_02.TextGrid  33       6      ae           ae            0.536      151        1.027  147            126         21
5053_02.TextGrid  1044_02.TextGrid  33       7      nn           nn            0.686      57         1.027  55             92          -37
5053_02.TextGrid  1044_02.TextGrid  33       8      ss           ss            0.743      92         1.027  89             102         -13
5053_02.TextGrid  1044_02.TextGrid  33       9      ih           ih            0.835      67         1.027  66             111         -45
5053_02.TextGrid  1044_02.TextGrid  33       10     ng           ng            0.902      100        1.027  98             70          28
5053_02.TextGrid  1044_02.TextGrid  33       11     kk           kk            1.002      147
Sum of squares of the normDiffDur values: 59266
Square root of the sum: 243
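The columns of the duration table suggest that each native duration is divided by a single ratio before being compared with the learner duration. The Python sketch below reproduces that pattern under one assumption that the slides do not confirm: that the ratio is the total native duration divided by the total learner duration. The function name and the sample durations are invented for illustration.

```python
import math


def normalized_duration_differences(native_ms, learner_ms):
    """Return (ratio, norm_native, diffs, distance).

    ratio:       total native duration / total learner duration (assumed)
    norm_native: native durations rescaled to the learner's total duration
    diffs:       norm_native minus learner, segment by segment
    distance:    square root of the sum of squared diffs
    """
    if len(native_ms) != len(learner_ms):
        raise ValueError("utterances must have the same number of segments")
    ratio = sum(native_ms) / sum(learner_ms)
    norm_native = [d / ratio for d in native_ms]
    diffs = [n - l for n, l in zip(norm_native, learner_ms)]
    distance = math.sqrt(sum(d * d for d in diffs))
    return ratio, norm_native, diffs, distance


# Invented durations (msec) for a three-segment utterance.
ratio, norm_native, diffs, distance = normalized_duration_differences(
    [330.0, 22.0, 60.0], [300.0, 20.0, 80.0])
print(round(ratio, 3), [round(d, 1) for d in diffs], round(distance, 1))
```

Rescaling this way removes overall speaking-rate differences, so that the remaining differences reflect how the learner distributes time across segments.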
72
Pilot data (5) Difference database of prosodic features for automatic prosody evaluation
F0 difference (selected frames of 388 shown; time in sec)
native       learner      numFrames  frameNo  time   nativeF0       learnerF0      diffF0
5053_02.wav  1044_02.wav  388        1        0.024  --undefined--  --undefined--  --undefined--
5053_02.wav  1044_02.wav  388        2        0.034  --undefined--  --undefined--  --undefined--
5053_02.wav  1044_02.wav  388        3        0.044  --undefined--  --undefined--  --undefined--
5053_02.wav  1044_02.wav  388        4        0.054  --undefined--  --undefined--  --undefined--
5053_02.wav  1044_02.wav  388        35       0.364  220            198            22
5053_02.wav  1044_02.wav  388        36       0.374  213            197            16
5053_02.wav  1044_02.wav  388        37       0.384  207            197            11
5053_02.wav  1044_02.wav  388        38       0.394  203            196            7
5053_02.wav  1044_02.wav  388        39       0.404  200            195            5
5053_02.wav  1044_02.wav  388        40       0.414  198            194            4
5053_02.wav  1044_02.wav  388        41       0.424  197            194            4
…
Sum of squares of the diffF0 values: 236363
Square root of the sum: 486
73
6. Transforming Korean alveolar
lax fricatives into tense
• Goal
– Test factors that distinguish /ㅅ/ from /ㅆ/
• Type of factors
– Consonantal: noise durations, center of gravity
– Vocalic: formant/bandwidth switching
– Prosodic: clone F0/intensity/durations, switch source
signals
74
Pilot data (6) Transforming Korean alveolar lax fricatives into tense
사자 vs. 싸자
75
Pilot data (6) Transforming Korean alveolar lax fricatives into tense
[Figure: 사자, 사자 with cloned prosody (durations, F0, intensity), and 싸자]
76
Pilot data (6) Transforming Korean alveolar lax fricatives into tense
[Figure: 사자, 사자 with cloned prosody + formants and bandwidths, and 싸자]
77
Design (6) Transforming Korean alveolar lax fricatives into tense
• Things to do
– Try the reverse: manipulate /ㅆ/ to simulate /ㅅ/
– Try this with other lax/tense pairs of stops
• 사 → 싸, 다 → 따, 바 → 빠, 가 → 까
– Try switching the source signal
• Listening experiments
– [1] Render /ssa/ from /sha/
• (1) prosody
• (2) formant/bandwidth
• (3) source
• (1)+(2): shift?, (1)+(3): shift?, (1)+(2)+(3): shift?, (1)+(2)+undo(1): see effect of (2) only, (1)+(3)+undo(1): see effect of (3) only, (1)+(2)+(3)+undo(1): see the effects of (2) and (3) only
– [2] Render /sha/ from /ssa/
• (1) prosody
• (2) formant/bandwidth
• (3) source
• (1)+(2): shift?, (1)+(3): shift?, (1)+(2)+(3): shift?, (1)+(2)+undo(1): ?, (1)+(3)+undo(1): ?, (1)+(2)+(3)+undo(1): ?
– [3] Statistical analyses of formants/bandwidths
• Examine post-consonantal vowels in terms of their formants/bandwidths for any possible intra/inter-consonantal differences
• Identify the portion of the vowels that contributes to the distinction of lax/tense consonants, e.g. ½ or ¼ from the vowel onset
78
7. Gender transformation of utterances
• Examine male vs. female utterances in terms of
prosodic & segmental differences
– Identify factors that differ
– Refer to Praat’s “Change gender…” command under the Convert button
– Verify by synthesis
• Prosody manipulation
– F0/intensity/durations/source
• Segment manipulation
– Formant frequencies & bandwidths
79