Transcript Slide 1

SPECTRO-TEMPORAL POST-SMOOTHING IN NMF
BASED SINGLE-CHANNEL SOURCE SEPARATION
Emad M. Grais and Hakan Erdogan
Sabanci University, Istanbul, Turkey
INTRODUCTION
Single-channel source separation aims to estimate the source signals in a mixture when only a single mixture is available.
We compare two ways of enforcing temporal smoothness: post-smoothing the spectral masks, and using regularized NMF.
The regularized NMF is defined as
$$C(B_d, G) = D(X \,\|\, B_d G) + \alpha R(G).$$
SIGNALS RECONSTRUCTION AND SMOOTHED MASKS
• The initial estimates are used to build a spectral mask as
$$H_z = \frac{(B_z G_z)^p}{\sum_l (B_l G_l)^p}.$$
• Changing p leads to different types of masks.
• The spectral mask can be used to find an estimate for each source by element-wise multiplication with the spectrogram of the mixed signal as
$$\hat{S}_z = H_z \otimes X.$$
PROBLEM FORMULATION
• The observed mixed signal x(t) is a mixture of multiple source signals $s_z(t)$.
• This can be written in the short-time Fourier transform (STFT) domain as
$$X(t, f) = \sum_{z=1}^{Z} S_z(t, f).$$
• This can be approximated as a sum of magnitude spectrograms as
$$|X(t, f)| \approx \sum_{z=1}^{Z} |S_z(t, f)|.$$
The magnitude spectrograms can be written as nonnegative matrices as
$$X \approx \sum_{z=1}^{Z} S_z.$$
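As a minimal sketch of how a time-domain signal becomes the nonnegative matrix X, the magnitude spectrogram can be computed with a plain NumPy STFT. The 512-point FFT with 257 retained bins and the 16 kHz sampling rate are the values quoted in the experiments; the Hann window, the 50% hop, the test tones and the function name `magnitude_spectrogram` are assumptions made only for illustration.

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, hop=256):
    """Return |STFT(x)| with shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)  # assumed window; the poster does not name one
    n_frames = 1 + (len(x) - n_fft) // hop
    # One windowed frame per column.
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    # rfft keeps only the non-redundant half of the spectrum: 257 bins for n_fft = 512.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=0))

fs = 16000                                   # sampling rate in Hz
t = np.arange(fs) / fs                       # one second of samples
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
X = magnitude_spectrogram(x)
print(X.shape)                               # rows are frequency bins, columns are frames
```

Each column of X is one time frame, so temporal smoothness later refers to smoothness along the rows of this matrix.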
NONNEGATIVE MATRIX FACTORIZATION
• NMF is used to decompose a nonnegative matrix V into a low-rank nonnegative matrix B of basis vectors and a nonnegative weights matrix G:
$$V \approx BG.$$
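The factorization can be sketched with the standard multiplicative update rules for the generalized Kullback-Leibler divergence. This is an illustrative implementation, not the authors' code: the random initialization, the iteration count, the small constant `eps` and the names `nmf_kl` and `kl_divergence` are assumptions.

```python
import numpy as np

def kl_divergence(V, BG, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(V || BG)."""
    return float(np.sum(V * np.log((V + eps) / (BG + eps)) - V + BG))

def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor a nonnegative V (n_freq x n_frames) as B @ G with `rank` basis vectors."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    B = rng.random((n_freq, rank)) + eps
    G = rng.random((rank, n_frames)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # B <- B * ((V / BG) G^T) / (1 G^T), all element-wise except the matrix products
        B *= ((V / (B @ G + eps)) @ G.T) / (ones @ G.T + eps)
        # G <- G * (B^T (V / BG)) / (B^T 1)
        G *= (B.T @ (V / (B @ G + eps))) / (B.T @ ones + eps)
    return B, G

# Toy example: a random nonnegative matrix of rank 3.
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))
B, G = nmf_kl(V, rank=3)
print(kl_divergence(V, B @ G))
```

Because the updates only multiply by nonnegative ratios, B and G stay nonnegative without any explicit projection.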
NMF FOR SOURCE SEPARATION
In the training stage:
• The magnitude spectrogram of each source's training data is used to build a dictionary $B_z$ for that source using NMF.
In the testing stage:
• NMF is used to decompose the magnitude spectrogram of the mixed signal X into a nonnegative weighted linear combination of the trained dictionaries as
$$X \approx [B_1, \ldots, B_z, \ldots, B_Z] \begin{bmatrix} G_1 \\ \vdots \\ G_z \\ \vdots \\ G_Z \end{bmatrix}.$$
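The testing stage can be sketched as follows, assuming the trained dictionaries are kept fixed and only the weights are updated with the multiplicative rule for the generalized Kullback-Leibler divergence. The function name `separate`, the toy dictionaries and the iteration count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def separate(X, dictionaries, n_iter=200, seed=0, eps=1e-12):
    """Decompose X over fixed dictionaries and return one estimate B_z @ G_z per source."""
    B = np.hstack(dictionaries)                  # [B_1, ..., B_Z], kept fixed
    rng = np.random.default_rng(seed)
    G = rng.random((B.shape[1], X.shape[1])) + eps
    denom = B.T @ np.ones_like(X) + eps          # B^T 1 does not change across iterations
    for _ in range(n_iter):
        G *= (B.T @ (X / (B @ G + eps))) / denom
    # Split the stacked weights back into one block G_z per dictionary.
    estimates, start = [], 0
    for B_z in dictionaries:
        stop = start + B_z.shape[1]
        estimates.append(B_z @ G[start:stop])
        start = stop
    return estimates

# Toy mixture of two sources, each generated from its own dictionary.
rng = np.random.default_rng(1)
B1, B2 = rng.random((16, 4)), rng.random((16, 4))
X = B1 @ rng.random((4, 40)) + B2 @ rng.random((4, 40))
S1, S2 = separate(X, [B1, B2])
print(S1.shape, S2.shape)
```

The per-source products returned here play the role of the initial estimates that are later turned into spectral masks.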
 The initial estimate for each source is
found as:
$$\tilde{S}_1 = B_1 G_1, \; \ldots, \; \tilde{S}_z = B_z G_z, \; \ldots, \; \tilde{S}_Z = B_Z G_Z.$$

SMR (dB)   Just Using Mask   Median Filter
                             a=1, b=3   a=1, b=5   a=1, b=7   a=1, b=9
-5         7.05              7.26       7.44       7.45       7.30
0          10.37             10.69      10.86      10.82      10.71
5          12.46             12.80      12.95      12.92      12.73
10         15.23             15.83      16.03      15.97      15.78
15         17.05             17.81      17.98      17.91      17.72
20         18.40             19.37      19.56      19.58      19.41

SMR (dB)   a=1, b=3   a=2, b=3
-5         7.56       7.04
0          10.95      10.47
5          13.12      12.31
10         16.19      15.40
15         18.25      17.54
20         19.85      19.11

SMR (dB)   Moving Average Filter
           a=1, b=2   a=1, b=3
-5         7.18       7.34
0          10.56      10.72
5          12.60      12.77
10         15.44      15.65
15         17.34      17.52
20         18.74      18.87
SMR (dB)   Just NMF               Regularized NMF             Regularized NMF
           (no mask, no prior)    α_s = 10^-5, α_m = 10^-5    α_s = 10^-5, α_m = 10^-3
-5         6.17                   6.13                        3.53
0          9.15                   9.16                        7.37
5          10.81                  10.81                       10.18
10         12.81                  12.81                       14.58
15         14.02                  14.03                       17.60
20         14.67                  14.66                       20.37
Table 1: Signal to Noise Ratio (SNR) in dB for the speech signal using regularized NMF
SMR (dB)   Moving Average Filter
           a=1, b=5   a=1, b=7   a=2, b=3
-5         7.38       7.32       6.84
0          10.74      10.57      10.13
5          12.72      12.44      11.87
10         15.53      15.13      14.67
15         17.32      16.81      16.56
20         18.63      18.07      17.87

SMR (dB)   Moving Average Filter
           a=1, b=5   a=1, b=7   a=1, b=9
-5         7.79       7.85       7.82
0          11.16      11.18      11.12
5          13.40      13.48      13.44
10         16.49      16.58      16.56
15         18.60      18.73      18.75
20         20.33      20.56      20.67

SMR (dB)   a=1, b=11   a=1, b=3   a=1, b=3   a=1, b=5
-5         7.74        7.21       7.17       7.60
0          10.99       10.56      10.51      10.97
5          13.31       12.67      12.59      13.15
10         16.48       15.53      15.40      16.20
15         18.70       17.44      17.24      18.26
20         20.59       18.86      18.60      19.87

Table 3: SNR in dB for the speech signal using the smoothed mask
Table 2: Signal to Noise Ratio (SNR) in dB for the speech signal using the smoothed mask
SMR (dB)   Median Filter
           a=1, b=3   a=1, b=5   a=1, b=7
-5         7.16       7.17       7.15
0          10.46      10.48      10.41
5          12.57      12.69      12.57
10         15.57      15.59      15.54
15         17.57      17.54      17.29
20         19.00      19.06      18.89
• The update solutions of B and G are
$$B \leftarrow B \otimes \frac{\frac{V}{BG} G^T}{\mathbf{1} G^T}, \qquad G \leftarrow G \otimes \frac{B^T \frac{V}{BG}}{B^T \mathbf{1}},$$
where $\mathbf{1}$ is a matrix of ones with the same size as V, and the multiplications $\otimes$ and the divisions are element-wise.
EXPERIMENTS AND RESULTS
• The proposed algorithm is used to separate a speech signal from a background piano music signal.
• For the STFT, a 512-point FFT is used and only the first 257 points are kept; the sampling rate is 16 kHz.
• We train 128 basis vectors for each source dictionary, so the size of each matrix B is 257x128.
• In this work, we choose different values of the regularization parameter: $\alpha_s$ for speech and $\alpha_m$ for music.
Table 1 shows the separation results using the regularized NMF to enforce smoothness on the estimated source signals.
Tables 2 and 3 show the separation results where the smoothness is enforced using smoothed spectral masks.
The tables show that enforcing smoothness using smoothed masks gives better separation results than enforcing smoothness using regularized NMF.
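The tables report the SNR in dB, but the transcript does not spell out the formula. A common definition, assumed here, compares the energy of the reference signal with the energy of the estimation error; the function name `snr_db` and the toy signals are illustrative.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference signal and its estimate."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate                 # estimation error
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

s = np.ones(1000)          # toy reference signal
s_hat = s + 0.1            # estimate with a constant error of 0.1
print(snr_db(s, s_hat))    # energy ratio of 100, i.e. about 20 dB
```

With this definition, an estimate equal to the unprocessed mixture would score roughly the input SMR, which gives a useful baseline when reading the tables.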


• B and G can be found by minimizing the generalized Kullback-Leibler divergence
$$D(V \,\|\, BG) = \sum_{i,j} \left( V_{i,j} \log \frac{V_{i,j}}{(BG)_{i,j}} - V_{i,j} + (BG)_{i,j} \right),$$
subject to the elements of $B, G \geq 0$.
To add temporal smoothness to the estimated source signal spectrograms, the spectral mask $H_z$ is smoothed by a 2-D smoothing filter with dimensions (a, b) to give the smoothed mask $A_z$.
The smoothing filter can be
1. The median filter.
2. The moving average low-pass filter.
3. The Hamming-windowed moving average filter (Hamming filter).
• The smoothing direction is the horizontal (time) direction of the spectrograms.
• The final estimate for each source can be found as
$$\hat{S}_z = A_z \otimes X.$$
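The mask construction, the time-direction smoothing and the final reconstruction can be sketched together. This is a simplified illustration that covers only the case a = 1 (a filter of length b along time); the edge padding, the normalization of the Hamming weights to unit sum, and the names `build_masks`, `smooth_mask` and `reconstruct` are assumptions, not the authors' implementation.

```python
import numpy as np

def build_masks(initial_estimates, p=1.0, eps=1e-12):
    """H_z = (B_z G_z)^p / sum_l (B_l G_l)^p, with element-wise power and division."""
    powered = [np.power(S, p) for S in initial_estimates]
    total = sum(powered) + eps
    return [P / total for P in powered]

def smooth_mask(H, b=5, kind="average"):
    """Smooth every frequency row of H along time with a length-b filter (a = 1)."""
    # Pad so that the output has as many frames as the input, for odd or even b.
    padded = np.pad(H, ((0, 0), (b // 2, b - 1 - b // 2)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, b, axis=1)
    if kind == "median":
        return np.median(windows, axis=-1)
    # "average" uses flat weights, "hamming" uses Hamming-window weights.
    weights = np.ones(b) if kind == "average" else np.hamming(b)
    return windows @ (weights / weights.sum())

def reconstruct(X, masks):
    """Final estimates: element-wise product of each mask with the mixture spectrogram."""
    return [A * X for A in masks]

# Toy example with two random "initial estimates" of size 257 x 50.
rng = np.random.default_rng(0)
S1, S2 = rng.random((257, 50)) + 0.1, rng.random((257, 50)) + 0.1
X = S1 + S2
masks = build_masks([S1, S2], p=2)
smoothed = [smooth_mask(H, b=5, kind="hamming") for H in masks]
estimates = reconstruct(X, smoothed)
print(estimates[0].shape)
```

Smoothing is a post-processing step on the mask only, so it can be tried with different filters and lengths without re-running the NMF decomposition.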
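In the regularized NMF cost, $B_d = [B_{speech}, B_{music}]$, α is the regularization parameter, and R(G) is the continuity prior penalty term defined as:
$$R(G) = \sum_{i} \frac{1}{\sigma_i^2} \sum_{j=2}^{T} \left( g_{i,j} - g_{i,j-1} \right)^2, \qquad \sigma_i = \sqrt{\frac{1}{T} \sum_{j=1}^{T} g_{i,j}^2}.$$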
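A temporal continuity penalty of this kind can be sketched as below, assuming the commonly used form in which the squared differences between consecutive weights in each row of G are normalized by that row's mean power. That form, the small constant `eps` and the name `continuity_penalty` are assumptions for illustration.

```python
import numpy as np

def continuity_penalty(G, eps=1e-12):
    """Sum over rows of squared frame-to-frame changes, normalized by the row's mean power."""
    T = G.shape[1]
    sigma_sq = np.sum(G ** 2, axis=1) / T        # mean power of each row
    diffs = np.diff(G, axis=1)                   # g[i, j] - g[i, j-1] for j = 2..T
    return float(np.sum(np.sum(diffs ** 2, axis=1) / (sigma_sq + eps)))

smooth = np.full((2, 6), 0.5)                    # constant weights over time
rough = np.array([[1.0, 2.0, 1.0, 2.0]])         # weights that alternate every frame
print(continuity_penalty(smooth), continuity_penalty(rough))
```

The normalization makes the penalty insensitive to the overall scale of each row, so it measures how much the weights fluctuate rather than how large they are.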
SMR (dB)   Hamming filter
           a=1, b=5   a=1, b=7   a=1, b=9
-5         7.39       7.43       7.42
0          10.76      10.80      10.75
5          12.81      12.81      12.70
10         15.68      15.66      15.50
15         17.55      17.50      17.28
20         18.91      18.84      18.60

$$H_z = \frac{(B_z G_z)^3}{\sum_l (B_l G_l)^3}$$
SMR (dB)   Hamming filter
           a=1, b=7   a=1, b=9
-5         7.76       7.85
0          11.13      11.20
5          13.35      13.46
10         16.43      16.55
15         18.53      18.68
20         20.24      20.46

SMR (dB)   a=2, b=3   a=1, b=11   a=1, b=13
-5         6.72       7.88        7.89
0          10.01      11.22       11.20
5          11.78      13.51       13.51
10         14.54      16.60       16.61
15         16.43      18.76       18.79
20         17.75      20.68       20.67