Transcript Slide 1
SPECTRO-TEMPORAL POST-SMOOTHING IN NMF
BASED SINGLE-CHANNEL SOURCE SEPARATION
Emad M. Grais and Hakan Erdogan
Sabanci University, Istanbul, Turkey
INTRODUCTION
SIGNALS RECONSTRUCTION AND
SMOOTHED MASKS
Single-channel source separation aims to
estimate the source signals when only a
single mixture of them is available.
We compare two ways of enforcing temporal
smoothness: post-smoothing the spectral
masks, and regularized NMF.
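The regularized NMF is defined as
\[ C(B_d, G) = D(X \,\|\, B_d G) + \alpha R(G). \]
The initial estimates are used to build a
spectral mask as
\[ H_z = \frac{(B_z G_z)^p}{\sum_{l} (B_l G_l)^p}, \]
where the power and the division are element-wise, so every mask entry lies in [0, 1].
Changing p leads to different types of masks.
The spectral mask is used to estimate each source by element-wise
multiplication with the spectrogram of the mixed signal:
\[ \hat{S}_z = H_z \otimes X, \]
where $\otimes$ denotes element-wise multiplication.

PROBLEM FORMULATION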
The observed mixed signal x(t) is a
mixture of multiple source signals s_z(t).
In the short-time Fourier transform (STFT) domain this is
\[ X(t, f) = \sum_{z=1}^{Z} S_z(t, f). \]
This can be approximated as a sum of
magnitude spectrograms:
\[ |X(t, f)| \approx \sum_{z=1}^{Z} |S_z(t, f)|. \]
The magnitude spectrograms can be
written as nonnegative matrices:
\[ X \approx \sum_{z=1}^{Z} S_z. \]
NONNEGATIVE MATRIX
FACTORIZATION
NMF decomposes a nonnegative matrix V into the product of a
low-rank nonnegative basis matrix B and a nonnegative weights
matrix G:
\[ V \approx BG. \]
NMF FOR SOURCE SEPARATION
In the training stage:
The magnitude spectrogram of each source's
training data is used to build a dictionary
B_z for that source using NMF.
In the testing stage:
NMF is used to decompose the
magnitude spectrogram of the mixed
signal X into a nonnegative weighted
linear combination of the trained
dictionaries as
\[ X \approx [B_1, \dots, B_z, \dots, B_Z] \begin{bmatrix} G_1 \\ \vdots \\ G_z \\ \vdots \\ G_Z \end{bmatrix}. \]
The initial estimate for each source is
found as
\[ \hat{S}_1 = B_1 G_1, \ \dots, \ \hat{S}_z = B_z G_z, \ \dots, \ \hat{S}_Z = B_Z G_Z. \]
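The testing stage described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the trained dictionaries are held fixed and only the weights G are updated with the standard multiplicative rule for the generalized Kullback-Leibler divergence. The function name `separate_with_dictionaries`, the iteration count, and the random initialization are all hypothetical choices.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def separate_with_dictionaries(X, dictionaries, n_iter=200, seed=0):
    """Decompose X (freq x time) over fixed dictionaries [B_1, ..., B_Z].

    Returns the initial source estimates B_z @ G_z and the weight blocks G_z.
    Assumption: the dictionaries stay fixed and only G is updated.
    """
    B = np.hstack(dictionaries)                      # [B_1, ..., B_Z]
    rng = np.random.default_rng(seed)
    G = rng.random((B.shape[1], X.shape[1])) + EPS   # nonnegative start
    ones = np.ones_like(X)
    for _ in range(n_iter):
        # KL multiplicative update for G with B fixed
        G *= (B.T @ (X / (B @ G + EPS))) / (B.T @ ones + EPS)
    # Split the stacked weights back into one block per source
    sizes = [Bz.shape[1] for Bz in dictionaries]
    G_blocks = np.split(G, np.cumsum(sizes)[:-1], axis=0)
    estimates = [Bz @ Gz for Bz, Gz in zip(dictionaries, G_blocks)]
    return estimates, G_blocks
```

For two sources (speech and music), `dictionaries` would be `[B_speech, B_music]`, and the two returned estimates are the inputs to the spectral mask.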
Just
SMR
Using a = 1
dB
Mask b = 3
-5
7.05 7.26
0
10.37 10.69
5
12.46 12.80
10 15.23 15.83
15 17.05 17.81
20 18.40 19.37
Median Filter
a=1 a=1 a=1
b=5 b=7 b=9
7.44 7.45 7.30
10.86 10.82 10.71
12.95 12.92 12.73
16.03 15.97 15.78
17.98 17.91 17.72
19.56 19.58 19.41
1
T
SMR
dB
-5
0
5
10
15
20
S1 B1G1 ,.., Sz BzGz ,.., SZ BZ GZ .
a=1
b=3
7.56
10.95
13.12
16.19
18.25
19.85
T
g
j 1
2
i, j
.
a=2
b=3
7.04
10.47
12.31
15.40
17.54
19.11
Moving
a=1 a=1
b=2 b=3
7.18 7.34
10.56 10.72
12.60 12.77
15.44 15.65
17.34 17.52
18.74 18.87
Regularized NMF
SMR
-5
-5
α
=
10
α
=
10
s
s
dB
α m= 10-5 α m= 10-3
-5
6.17
6.13
3.53
0
9.15
9.16
7.37
5
10.81
10.81
10.18
10
12.81
12.81
14.58
15
14.02
14.03
17.60
20
14.67
14.66
20.37
Table 1: Signal to Noise Ratio (SNR) in dB for
the speech signal using regularized NMF
6
Just NMF
No mask
No prior
Average Filter
a=1 a=1 a=2
b=5 b=7 b=3
7.38 7.32 6.84
10.74 10.57 10.13
12.72 12.44 11.87
15.53 15.13 14.67
17.32 16.81 16.56
18.63 18.07 17.87
Moving
a=1
b=5
7.79
11.16
13.40
16.49
18.60
20.33
Average Filter
a=1 a=1
b=7 b=9
7.85
7.82
11.18 11.12
13.48 13.44
16.58 16.56
18.73 18.75
20.56 20.67
a=1
b =11
7.74
10.99
13.31
16.48
18.70
20.59
a=1
b=3
7.21
10.56
12.67
15.53
17.44
18.86
a=1
b=3
7.17
10.51
12.59
15.40
17.24
18.60
a=1
b=5
7.60
10.97
13.15
16.20
18.26
19.87
B G
B G
3
Table 3: SNR in dB for the speech signal using the smoothed mask
4
gi , j 1 ,
7
Table 2: Signal to Noise Ratio (SNR) in dB for the speech signal using the smoothed mask
Median Filter
a=1 a=1 a=1
b=3 b=5 b=7
7.16
7.17
7.15
10.46 10.48 10.41
12.57 12.69 12.57
15.57 15.59 15.54
17.57 17.54 17.29
19.00 19.06 18.89
i, j
5
The proposed algorithm is used to separate a speech
signal from a background piano music signal.
For the STFT, a 512-point FFT is used and only the first 257 points
are kept; the sampling rate is 16 kHz.
We train 128 basis vectors for each source dictionary,
so the size of each matrix B is 257 × 128.
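The dimensions quoted above follow from the FFT size: a 512-point FFT of a real signal has 257 non-redundant bins. The sketch below shows this with NumPy. The Hann window and the hop of 256 samples are assumptions made only for illustration, since the poster does not state them, and `magnitude_spectrogram` is a hypothetical helper name.

```python
import numpy as np


def magnitude_spectrogram(x, n_fft=512, hop=256):
    """Magnitude STFT with shape (n_fft // 2 + 1, n_frames).

    Assumptions (not stated in the source): Hann window, hop of 256 samples,
    no padding at the signal edges.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop: i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    # rfft keeps only the first n_fft // 2 + 1 = 257 frequency points
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=0))
```

With one second of audio at 16 kHz this gives a 257-row matrix, so a dictionary with 128 basis vectors trained on it has size 257 × 128.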
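The update solutions of B and G are
\[ B \leftarrow B \otimes \frac{\frac{V}{BG}\, G^{T}}{\mathbf{1}\, G^{T}}, \qquad G \leftarrow G \otimes \frac{B^{T} \frac{V}{BG}}{B^{T}\, \mathbf{1}}, \]
where $\mathbf{1}$ is a matrix of ones with the same size as V, and the multiplications $\otimes$ and the divisions are element-wise.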
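A minimal NumPy sketch of these multiplicative updates is given below. It assumes the standard update rules for the generalized Kullback-Leibler divergence. The function names, the random initialization, the fixed iteration count, and the small constant `EPS` are illustrative assumptions rather than details from the poster.

```python
import numpy as np

EPS = 1e-12  # avoids division by zero and log(0)


def kl_divergence(V, R):
    """Generalized KL divergence D(V || R), summed over all entries."""
    V = np.asarray(V, dtype=float)
    R = np.asarray(R, dtype=float)
    return float(np.sum(V * np.log((V + EPS) / (R + EPS)) - V + R))


def nmf_kl(V, rank, n_iter=200, seed=0):
    """Factor a nonnegative V (F x T) as B (F x rank) @ G (rank x T)."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    B = rng.random((n_freq, rank)) + EPS
    G = rng.random((rank, n_time)) + EPS
    ones = np.ones_like(V, dtype=float)
    for _ in range(n_iter):
        # B <- B * ((V / BG) G^T) / (1 G^T)
        B *= ((V / (B @ G + EPS)) @ G.T) / (ones @ G.T + EPS)
        # G <- G * (B^T (V / BG)) / (B^T 1)
        G *= (B.T @ (V / (B @ G + EPS))) / (B.T @ ones + EPS)
    return B, G
```

Because both updates only multiply by nonnegative ratios, B and G stay nonnegative as long as they start nonnegative, which is why no explicit projection step is needed.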
In this work, we use separate regularization
parameters: αs for speech and αm for music.
Table 1 shows the separation results using
the regularized NMF to enforce smoothness
on the estimated source signals.
Tables 2 and 3 show the separation results
where the smoothness is enforced using
smoothed spectral masks.
The tables show that enforcing
smoothness using smoothed masks gives
better separation results than enforcing
smoothness using regularized NMF.
EXPERIMENTS AND RESULTS
\[ D(V \,\|\, BG) = \sum_{i,j} \left( V_{i,j} \log \frac{V_{i,j}}{(BG)_{i,j}} - V_{i,j} + (BG)_{i,j} \right), \]
subject to all elements of B and G being nonnegative ($B, G \ge 0$).
The smoothing filter can be:
1. The median filter.
2. The moving average low-pass filter.
3. The Hamming-windowed moving average
filter (Hamming filter).
The smoothing direction is the horizontal
(time) direction of the spectrograms.
The final estimate for each source can be
found as
\[ \hat{S}_z = A_z \otimes X, \]
where A_z is the smoothed mask and $\otimes$ denotes element-wise multiplication.
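The mask construction and post-smoothing can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: it covers only the case a = 1 (smoothing along time within each frequency row), requires an odd filter length b, and pads the edges by repeating the border values. The names `spectral_masks` and `smooth_mask_time` are hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids division by zero in the mask


def spectral_masks(estimates, p=1.0):
    """H_z = (B_z G_z)^p / sum_l (B_l G_l)^p, element-wise."""
    powered = [np.asarray(S, dtype=float) ** p for S in estimates]
    total = sum(powered) + EPS
    return [P / total for P in powered]


def smooth_mask_time(H, b, kind="average"):
    """Smooth each row of H along time with a length-b filter (a = 1).

    Assumptions: b is odd and edges are padded by repeating border values.
    """
    if b % 2 == 0:
        raise ValueError("this sketch only handles odd filter lengths")
    pad = b // 2
    Hp = np.pad(H, ((0, 0), (pad, pad)), mode="edge")
    if kind == "median":
        windows = np.lib.stride_tricks.sliding_window_view(Hp, b, axis=1)
        return np.median(windows, axis=-1)
    if kind == "average":
        w = np.ones(b)
    elif kind == "hamming":
        w = np.hamming(b)
    else:
        raise ValueError(f"unknown filter kind: {kind}")
    w = w / w.sum()  # unit gain keeps mask values within [0, 1]
    return np.stack([np.convolve(row, w, mode="valid") for row in Hp])
```

Given a mixture spectrogram `X` and initial estimates `S_list`, the final estimate of source z would then be `smooth_mask_time(spectral_masks(S_list, p)[z], b, kind) * X`.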
B and G can be found by minimizing the
generalized Kullback-Leibler divergence $D(V \,\|\, BG)$ between V and BG.
To add temporal smoothness to the
estimated source signal spectrograms, the
spectral mask H_z is smoothed by a 2-D
smoothing filter with dimensions (a, b),
which gives the smoothed mask A_z.
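In the regularized NMF cost, B_d = [B_speech, B_music], α is the
regularization parameter, and R(G) is the
continuity prior penalty term defined as
\[ R(G) = \sum_{i} \frac{1}{\sigma_i^2} \sum_{j=2}^{T} \left( g_{i,j} - g_{i,j-1} \right)^2, \qquad \text{where } \sigma_i = \sqrt{\frac{1}{T} \sum_{j=1}^{T} g_{i,j}^2}. \]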
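As a rough illustration of the continuity prior, the sketch below computes a penalty of the commonly used temporal-continuity form: the squared differences between neighbouring time frames in each row of G, normalized by that row's mean square. Treat this form as an assumption, since the equation is not fully legible in this transcript. The function name and the `eps` guard are hypothetical.

```python
import numpy as np


def continuity_penalty(G, eps=1e-12):
    """R(G) = sum_i (1 / sigma_i^2) * sum_{j>=2} (g[i, j] - g[i, j-1])^2.

    sigma_i^2 is the mean square of row i. The form of the penalty is an
    assumption (a common temporal-continuity prior), not taken verbatim
    from the source.
    """
    G = np.asarray(G, dtype=float)
    n_time = G.shape[1]
    sigma2 = np.sum(G ** 2, axis=1) / n_time                  # sigma_i^2 per row
    diffs2 = np.sum(np.diff(G, axis=1) ** 2, axis=1)          # frame-to-frame changes
    return float(np.sum(diffs2 / (sigma2 + eps)))
```

The normalization by sigma_i^2 makes the penalty insensitive to the overall scale of each row, so it measures how quickly the weights fluctuate over time rather than how large they are.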
Hamming filter
a=1 a=1 a=1
b=5 b=7 b=9
7.39 7.43 7.42
10.76 10.80 10.75
12.81 12.81 12.70
15.68 15.66 15.50
17.55 17.50 17.28
18.91 18.84 18.60
3
Bz G z
Hz
3
B
G
l l
l
Hamming filter
a=1 a=1
b=7 b=9
7.76
7.85
11.13 11.20
13.35 13.46
16.43 16.55
18.53 18.68
20.24 20.46
a=2
b=3
6.72
10.01
11.78
14.54
16.43
17.75
a=1
b =11
7.88
11.22
13.51
16.60
18.76
20.68
a=1
b =13
7.89
11.20
13.51
16.61
18.79
20.67