Prosody modification in speech signals


Project by Edi Fridman & Alex Zalts
Supervision by Yizhar Lavner
Prosody: the "non-textual" aspects of the speech signal
"Segmental" aspects: timing, duration, rhythm, stress, and metrical structure. The duration of each individual "segment" is under the control of the speaker to varying degrees, and varies with stress and rate.
The relative strength of an individual syllable, word, or phrase may be realized in a number of ways, including lengthening (or shortening and cliticization), changes in pitch and amplitude, and changes in spectral character.
Project goals
• Prosody modification with TDPSOLA algorithm
• Prosody modification with HNM model
• Conversion of male voice to female voice & vice versa
Four steps in prosody modification
• Time-scale modification
• Pitch-scale modification
• Energy envelope modification
• Modification of the distribution of utterances
[Figure: four waveform plots, one per modification step; axis values not recoverable]
TDPSOLA Approach
(*) Based on the Overlap-and-Add (OLA) idea
(*) Synchronization with the original pitch by:
1) Setting up pitch marks in the analysis signal
2) Setting up new pitch marks in the synthesis signal according to the time-scale and pitch-scale factors (0.6 for pitch, 1.3 for time)
(*) Building the synthesis signal using OLA
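The OLA step listed above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes pitch marks are given as sample indices and that every synthesis mark has already been paired with an analysis mark. The function name `psola_overlap_add`, the `nearest` mapping, and the choice of a Hann window spanning two local pitch periods are all assumptions made for the example.

```python
import numpy as np

def psola_overlap_add(x, ta, ts, nearest):
    """Overlap-add windowed analysis segments at the synthesis pitch marks.

    x       : input signal
    ta      : analysis pitch marks (sample indices, increasing)
    ts      : synthesis pitch marks (sample indices, increasing)
    nearest : for each synthesis mark, index of the analysis mark to copy
    (All names are hypothetical, chosen for this sketch.)
    """
    x = np.asarray(x, dtype=float)
    ta = np.asarray(ta, dtype=int)
    ts = np.rint(ts).astype(int)
    periods = np.diff(ta)                 # local pitch periods in samples
    pad = int(periods.max())
    xp = np.pad(x, pad)                   # zero-pad so windows never leave the signal
    y = np.zeros(ts[-1] + 2 * pad + 1)
    for u, s in enumerate(nearest):
        h = int(periods[min(s, len(periods) - 1)])   # half window = one period
        w = np.hanning(2 * h + 1)                    # two-period Hann window
        seg = xp[ta[s] + pad - h: ta[s] + pad + h + 1] * w
        y[ts[u] + pad - h: ts[u] + pad + h + 1] += seg
    return y[pad:]                        # drop the leading padding
```

When the synthesis marks equal the analysis marks and the pitch period is constant, the overlapping Hann windows sum to one between the first and last marks, so the input is reproduced there. That gives a simple sanity check before applying real time-scale or pitch-scale factors.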
Setting up new pitch marks
Let us define the time instants $t_a(s)$ in the analysis signal as the original pitch marks, and the pitch contour as $P(t)$.
The stream of synthesis pitch marks $t_s(u)$ is determined from $t_a(s)$ according to the desired time-scale modification $t \to D(t)$ and pitch-scale modification $F_p(P)$ by:

$$ t_s(u+1) - t_s(u) = \frac{1}{t_s'(u+1) - t_s'(u)} \int_{t_s'(u)}^{t_s'(u+1)} P'(t)\,dt $$

with

$$ t_s(u+1) = D\big(t_s'(u+1)\big), \qquad P'(t) = F_p\big(P(t)\big) $$
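A rough sketch of how synthesis pitch marks could be generated from this relation is given here, under several simplifying assumptions that are not in the slides: constant factors, with D(t) = alpha * t and Fp(P) = P / beta; P(t) read as the local pitch period, taken as piecewise constant between analysis marks; and the integral approximated by the period at the start of each virtual interval. The names `synthesis_pitch_marks`, `alpha`, and `beta` are hypothetical.

```python
import numpy as np

def synthesis_pitch_marks(ta, alpha, beta):
    """Place synthesis pitch marks for constant time/pitch scale factors.

    ta    : analysis pitch marks (increasing, in samples)
    alpha : time-scale factor, assumed mapping D(t) = alpha * t
    beta  : pitch-scale factor on F0, assumed P'(t) = P(t) / beta
    Returns the synthesis marks and, for each one, the index of the
    closest analysis mark on the original time axis.
    """
    ta = np.asarray(ta, dtype=float)
    periods = np.diff(ta)                  # local pitch period between marks
    t_end = alpha * ta[-1]                 # end of the time-scaled signal
    ts = [alpha * ta[0]]
    while True:
        t_virtual = ts[-1] / alpha         # position on the original time axis
        # analysis interval that contains the virtual time
        s = int(np.clip(np.searchsorted(ta, t_virtual, side="right") - 1,
                        0, len(periods) - 1))
        nxt = ts[-1] + periods[s] / beta   # step by the modified local period
        if nxt > t_end:
            break
        ts.append(nxt)
    ts = np.array(ts)
    # pair each synthesis mark with the closest analysis mark
    nearest = np.abs(ts[:, None] / alpha - ta[None, :]).argmin(axis=1)
    return ts, nearest
```

With the factors quoted on the TDPSOLA slide, the call would be `synthesis_pitch_marks(ta, alpha=1.3, beta=0.6)`. The returned `nearest` indices tell the OLA stage which analysis segment to place at each synthesis mark.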
Problem of TDPSOLA:
It is impossible to change the pitch contour, because the algorithm is based on the original pitch marks.
[Figure: original pitch marks (top panel) and new pitch marks (bottom panel); horizontal axis 0 to 400, vertical axis 0 to 1]
Problem: too many pitch marks are left out, resulting in bad sound quality.
HNM Approach
• The speech signal is modeled as harmonics of the pitch plus noise
• Harmonics and noise are treated in different ways
• Synthesis and analysis are performed in a pitch-synchronous way
Let $X(n)$ be the speech segment. According to the HNM model, an estimate $\hat{X}(n)$ can be found and written as:

$$ \hat{X}(n) = \sum_{k=1}^{p} h_k z_k^{\,n-1} + w(n) $$

so as to minimize the error $X(n) - \hat{X}(n)$, where the complex constants $h_k$ and $z_k$ are defined as:

$$ h_k = A_k \exp(j\theta_k), \qquad z_k = \exp(j 2\pi f_k T) $$

$h_k$ - complex amplitude of harmonic k
$f_k$ - frequency of harmonic k
$T$ - sampling period
$w(n)$ - noise
The frequency of harmonic k is set to k·F0, where F0 is the pitch found by the PDA (pitch detection algorithm).
The amplitudes and phases of the pitch harmonics are computed with the Prony algorithm by minimizing the least-squares error between the harmonics and the original signal, yielding:

$$ [Z^H Z]\, h = [Z^H x] $$
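The normal equations above can be illustrated with a short NumPy sketch. It assumes the harmonic frequencies are fixed at k·F0 and follows the complex-exponential model as written, so the matrix Z has entries z_k^(n-1). The function name `harmonic_amplitudes` and its parameters are hypothetical. For a real-valued speech frame, one would normally also include the conjugate (negative-frequency) terms, which this sketch omits.

```python
import numpy as np

def harmonic_amplitudes(x, f0, fs, p):
    """Least-squares estimate of the complex harmonic amplitudes h_k.

    x  : analysis frame (complex model as in the text)
    f0 : pitch in Hz (assumed already found by a pitch detector)
    fs : sampling rate in Hz, so T = 1 / fs
    p  : number of harmonics
    """
    x = np.asarray(x)
    n = np.arange(len(x))                  # exponent n - 1 for n = 1..N
    k = np.arange(1, p + 1)
    # Z[n, k] = z_k^n with z_k = exp(j * 2*pi * k*f0 * T)
    Z = np.exp(2j * np.pi * np.outer(n, k) * f0 / fs)
    ZH = Z.conj().T
    # Solve [Z^H Z] h = [Z^H x]
    return np.linalg.solve(ZH @ Z, ZH @ x)
```

The amplitude and phase of each harmonic then follow as `np.abs(h)` and `np.angle(h)`. Solving the normal equations directly mirrors the formula; `np.linalg.lstsq(Z, x)` gives the same least-squares solution and is usually better conditioned numerically.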
In each voiced speech fragment, the maximum voiced frequency Fm is calculated, and the noise part is obtained by filtering the signal with a high-pass filter with cutoff frequency Fm.
In unvoiced fragments, the signal's spectrum is modeled by a pth-order all-pole filter H(z). The noise is synthesized by filtering unit-variance Gaussian noise through H(z).
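The unvoiced synthesis step can be sketched as a direct recursion of the all-pole filter. This sketch assumes H(z) = gain / (1 + a1·z^-1 + ... + ap·z^-p), with coefficients passed as `a = [1, a1, ..., ap]`. How those coefficients are estimated (for example by linear prediction) is not covered here, and the names `all_pole_noise`, `gain`, and `seed` are hypothetical.

```python
import numpy as np

def all_pole_noise(a, gain, n_samples, seed=0):
    """Filter unit-variance Gaussian noise through an all-pole filter.

    a         : denominator coefficients [1, a1, ..., ap] (assumed given)
    gain      : filter gain
    n_samples : number of output samples
    seed      : seed for the random generator, for reproducibility
    """
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(n_samples)     # unit-variance Gaussian excitation
    y = np.zeros(n_samples)
    for n in range(n_samples):
        acc = gain * e[n]
        # y[n] = gain*e[n] - sum_i a[i] * y[n-i], using only available samples
        for i in range(1, min(len(a), n + 1)):
            acc -= a[i] * y[n - i]
        y[n] = acc
    return y
```

The explicit loop keeps the example dependency-free; a library routine such as SciPy's `lfilter` would compute the same recursion more efficiently.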
When pitch scaling is done, the amplitudes and phases of the modified pitch harmonics need to be recomputed.
For this purpose, a frequency-continuous spectral and phase envelope is necessary.
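One simple way to obtain such continuous envelopes is to interpolate the measured harmonic amplitudes and phases over frequency and then sample them at the new harmonic positions. The sketch below does this under strong assumptions that are not taken from the slides: linear interpolation is used, the phases are unwrapped across harmonics before interpolating, and values outside the measured range are held at the end values. The names `resample_harmonics`, `beta`, and `f_max` are hypothetical.

```python
import numpy as np

def resample_harmonics(h, f0, beta, f_max):
    """Sample amplitude and phase envelopes at the pitch-scaled harmonics.

    h     : complex amplitudes of harmonics 1..p at the original pitch f0
    beta  : pitch-scale factor, new pitch = beta * f0
    f_max : highest frequency for which new harmonics are generated
    """
    h = np.asarray(h, dtype=complex)
    f_old = np.arange(1, len(h) + 1) * f0        # original harmonic frequencies
    amp = np.abs(h)
    phase = np.unwrap(np.angle(h))               # heuristic: continuous phase
    f0_new = beta * f0
    k_new = np.arange(1, int(f_max // f0_new) + 1)
    f_new = k_new * f0_new                       # new harmonic frequencies
    amp_new = np.interp(f_new, f_old, amp)       # linear amplitude envelope
    phase_new = np.interp(f_new, f_old, phase)   # linear phase envelope
    return f_new, amp_new * np.exp(1j * phase_new)
```

Because the amplitudes are read from the original envelope rather than moved with the harmonics, the formant positions stay where they were while F0 changes.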
Comparison of TDPSOLA & HNM

                              TDPSOLA                    HNM
Sound quality                 very good                  very good, with possible buzziness
Pitch contour modification    can be done in an easy way can be done with a lot of computational load
Computational load            low                        high
The only target in pitch scaling was to change F0 while preserving the formants.
There was an attempt to change the spectral envelope in order to convert a male voice to a female voice and vice versa.
A new algorithm was proposed.