Transcript Slide 1
SPECTRO-TEMPORAL POST-SMOOTHING IN NMF BASED SINGLE-CHANNEL SOURCE SEPARATION
Emad M. Grais and Hakan Erdogan
Sabanci University, Istanbul, Turkey

INTRODUCTION
Single-channel source separation aims to estimate the source signals when only a single mixture of them is available. We compare two ways of enforcing temporal smoothness on the estimated sources: post-smoothed spectral masks and regularized NMF.

PROBLEM FORMULATION
The observed mixed signal x(t) is a mixture of multiple source signals s_z(t). In the short-time Fourier transform (STFT) domain this is written as
\[ X(t,f) = \sum_{z=1}^{Z} S_z(t,f). \]
This can be approximated as a sum of magnitude spectrograms:
\[ |X(t,f)| \approx \sum_{z=1}^{Z} |S_z(t,f)|. \]
The magnitude spectrograms can be written as nonnegative matrices:
\[ X \approx \sum_{z=1}^{Z} S_z. \]

NONNEGATIVE MATRIX FACTORIZATION
NMF decomposes a nonnegative matrix V into a low-rank nonnegative basis-vector matrix B and a nonnegative weights matrix G:
\[ V \approx BG. \]
B and G can be found by minimizing the generalized Kullback-Leibler divergence
\[ D(V \,\|\, BG) = \sum_{i,j} \left( V_{i,j} \log \frac{V_{i,j}}{(BG)_{i,j}} - V_{i,j} + (BG)_{i,j} \right), \]
subject to the elements of B, G \ge 0.
The update solutions of B and G are
\[ B \leftarrow B \otimes \frac{\frac{V}{BG}\, G^{T}}{\mathbf{1}\, G^{T}}, \qquad G \leftarrow G \otimes \frac{B^{T} \frac{V}{BG}}{B^{T} \mathbf{1}}, \]
where \otimes and the divisions are element-wise and \mathbf{1} is a matrix of ones with the same size as V.

NMF FOR SOURCE SEPARATION
In the training stage, the magnitude spectrogram of each source's training data is used to build a dictionary B_z for that source using NMF.
In the testing stage, NMF is used to decompose the magnitude spectrogram X of the mixed signal into a nonnegative weighted linear combination of the trained dictionaries:
\[ X \approx [B_1, \ldots, B_z, \ldots, B_Z] \begin{bmatrix} G_1 \\ \vdots \\ G_z \\ \vdots \\ G_Z \end{bmatrix}. \]
The initial estimate for each source is found as
\[ \tilde{S}_1 = B_1 G_1, \; \ldots, \; \tilde{S}_z = B_z G_z, \; \ldots, \; \tilde{S}_Z = B_Z G_Z. \]

SIGNALS RECONSTRUCTION AND SMOOTHED MASKS
The initial estimates are used to build a spectral mask as
\[ H_z = \frac{(B_z G_z)^{p}}{\sum_{l} (B_l G_l)^{p}}. \]
Changing p leads to different types of mask. The spectral mask can be used to find an estimate for each source by element-wise multiplication with the spectrogram of the mixed signal:
\[ \hat{S}_z = H_z \otimes X. \]
To add temporal smoothness to the estimated source spectrograms, the spectral mask is smoothed by a 2-D smoothing filter with dimensions (a, b):
\[ A_z = \psi(H_z), \]
where \psi(\cdot) is a smoothing filter, which can be
1. The median filter.
2. The moving average low pass filter.
3. The Hamming windowed moving average filter (Hamming filter).
The smoothing direction is the horizontal (time) direction of the spectrograms. The final estimate for each source can be found as
\[ \hat{S}_z = A_z \otimes X. \]

EXPERIMENTS AND RESULTS
The proposed algorithm is used to separate a speech signal from a background piano music signal. The sampling rate is 16 kHz. The STFT uses a 512-point FFT, of which only the first 257 points are used. We train 128 basis vectors for each source dictionary, so the size of each matrix B is 257x128.

The regularized NMF is defined as
\[ C(B_d, G) = D(X \,\|\, B_d G) + \alpha R(G), \]
where B_d = [B_speech, B_music], \alpha is the regularization parameter, and R(G) is the continuity prior penalty term defined as
\[ R(G) = \sum_{i} \frac{1}{\sigma_i^{2}} \sum_{j=2}^{T} \left( g_{i,j} - g_{i,j-1} \right)^{2}, \qquad \sigma_i = \sqrt{\frac{1}{T} \sum_{j=1}^{T} g_{i,j}^{2}}. \]
In this work, we choose different regularization values: \alpha_s for speech and \alpha_m for music.

Table 1 shows the separation results using the regularized NMF to enforce smoothness on the estimated source signals. Tables 2 and 3 show the separation results where the smoothness is enforced using smoothed spectral masks. The tables show that enforcing smoothness using smoothed masks gives better separation results than enforcing smoothness using regularized NMF.

Table 1: Signal to Noise Ratio (SNR) in dB for the speech signal using regularized NMF

SMR (dB)   Just NMF (no mask, no prior)   alpha_s = 10^-5, alpha_m = 10^-5   alpha_s = 10^-5, alpha_m = 10^-3
  -5                 6.17                            6.13                               3.53
   0                 9.15                            9.16                               7.37
   5                10.81                           10.81                              10.18
  10                12.81                           12.81                              14.58
  15                14.02                           14.03                              17.60
  20                14.67                           14.66                              20.37

Tables 2 and 3: Signal to Noise Ratio (SNR) in dB for the speech signal using the smoothed mask. Each row is one filter setting (a, b); the columns are SMR values in dB.

Spectral mask with p = 3:
\[ H_z = \frac{(B_z G_z)^{3}}{\sum_{l} (B_l G_l)^{3}} \]

Filter              a    b      -5       0       5      10      15      20
Just using mask     -    -     7.05   10.37   12.46   15.23   17.05   18.40
(not labeled)       1    3     7.26   10.69   12.80   15.83   17.81   19.37
Median              1    5     7.44   10.86   12.95   16.03   17.98   19.56
Median              1    7     7.45   10.82   12.92   15.97   17.91   19.58
Median              1    9     7.30   10.71   12.73   15.78   17.72   19.41

(not labeled)       1    3     7.56   10.95   13.12   16.19   18.25   19.85
(not labeled)       2    3     7.04   10.47   12.31   15.40   17.54   19.11

Moving average      1    2     7.18   10.56   12.60   15.44   17.34   18.74
Moving average      1    3     7.34   10.72   12.77   15.65   17.52   18.87
Moving average      1    5     7.38   10.74   12.72   15.53   17.32   18.63
Moving average      1    7     7.32   10.57   12.44   15.13   16.81   18.07
Moving average      2    3     6.84   10.13   11.87   14.67   16.56   17.87

Moving average      1    5     7.79   11.16   13.40   16.49   18.60   20.33
Moving average      1    7     7.85   11.18   13.48   16.58   18.73   20.56
Moving average      1    9     7.82   11.12   13.44   16.56   18.75   20.67
(not labeled)       1   11     7.74   10.99   13.31   16.48   18.70   20.59

(not labeled)       1    3     7.21   10.56   12.67   15.53   17.44   18.86
(not labeled)       1    3     7.17   10.51   12.59   15.40   17.24   18.60
(not labeled)       1    5     7.60   10.97   13.15   16.20   18.26   19.87

Median              1    3     7.16   10.46   12.57   15.57   17.57   19.00
Median              1    5     7.17   10.48   12.69   15.59   17.54   19.06
Median              1    7     7.15   10.41   12.57   15.54   17.29   18.89

Hamming             1    5     7.39   10.76   12.81   15.68   17.55   18.91
Hamming             1    7     7.43   10.80   12.81   15.66   17.50   18.84
Hamming             1    9     7.42   10.75   12.70   15.50   17.28   18.60

Hamming             1    7     7.76   11.13   13.35   16.43   18.53   20.24
Hamming             1    9     7.85   11.20   13.46   16.55   18.68   20.46
(not labeled)       2    3     6.72   10.01   11.78   14.54   16.43   17.75
(not labeled)       1   11     7.88   11.22   13.51   16.60   18.76   20.68
(not labeled)       1   13     7.89   11.20   13.51   16.61   18.79   20.67
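
The processing chain described in this poster (KL-divergence NMF with multiplicative updates, a spectral mask with exponent p, and smoothing of the mask along the time direction) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`nmf_kl`, `spectral_masks`, `smooth_time`, `separate`, `continuity_penalty`) are hypothetical, spectrograms are assumed to be stored as frequency x time arrays, only the moving average filter with a = 1 is shown, and the edge padding and the small constant `EPS` are choices made here for numerical convenience.

```python
import numpy as np

EPS = 1e-12  # assumption: small constant guarding divisions and log(0)


def kl_divergence(V, B, G):
    """Generalized Kullback-Leibler divergence D(V || BG)."""
    R = B @ G + EPS
    return float(np.sum(V * np.log((V + EPS) / R) - V + R))


def nmf_kl(V, rank, n_iter=100, B=None, seed=0):
    """Multiplicative-update NMF for the generalized KL divergence.

    If a dictionary B is passed it is kept fixed (testing stage) and only
    the weights G are updated; `rank` is then ignored.
    """
    rng = np.random.default_rng(seed)
    update_B = B is None
    if update_B:
        B = rng.random((V.shape[0], rank)) + EPS
    G = rng.random((B.shape[1], V.shape[1])) + EPS
    ones = np.ones_like(V)
    for _ in range(n_iter):
        if update_B:
            B = B * ((V / (B @ G + EPS)) @ G.T) / (ones @ G.T + EPS)
        G = G * (B.T @ (V / (B @ G + EPS))) / (B.T @ ones + EPS)
    return B, G


def spectral_masks(estimates, p=1.0):
    """Masks H_z = (B_z G_z)^p / sum_l (B_l G_l)^p from initial estimates."""
    powered = [np.power(S, p) for S in estimates]
    total = sum(powered) + EPS
    return [P / total for P in powered]


def smooth_time(H, b):
    """Moving average of length b along time (axis 1), i.e. a (1, b) filter.

    Edges are handled by repeating the first and last frames (an assumption).
    """
    kernel = np.ones(b) / b
    pad = b // 2
    padded = np.pad(H, ((0, 0), (pad, b - 1 - pad)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])


def separate(X, dictionaries, p=3.0, b=5, n_iter=100):
    """Estimate each source spectrogram from the mixture magnitude X."""
    B_all = np.hstack(dictionaries)  # [B_1, ..., B_Z]
    _, G = nmf_kl(X, B_all.shape[1], n_iter=n_iter, B=B_all)
    # Split the stacked weights back into G_1, ..., G_Z.
    splits = np.cumsum([B.shape[1] for B in dictionaries])[:-1]
    initial = [B @ Gz for B, Gz in zip(dictionaries, np.split(G, splits, axis=0))]
    smoothed = [smooth_time(H, b) for H in spectral_masks(initial, p)]
    return [A * X for A in smoothed]  # element-wise product with the mixture


def continuity_penalty(G):
    """Continuity prior R(G): squared frame-to-frame differences of each
    row of G, normalized by that row's mean squared value."""
    sigma2 = np.sum(G ** 2, axis=1) / G.shape[1] + EPS
    return float(np.sum(np.sum(np.diff(G, axis=1) ** 2, axis=1) / sigma2))


if __name__ == "__main__":
    # Synthetic stand-ins for trained speech and music dictionaries.
    rng = np.random.default_rng(0)
    B_speech, B_music = rng.random((257, 8)), rng.random((257, 8))
    X = B_speech @ rng.random((8, 40)) + B_music @ rng.random((8, 40))
    speech_hat, music_hat = separate(X, [B_speech, B_music], p=3.0, b=5)
    print(speech_hat.shape, music_hat.shape)
```

Because the moving average is linear and the unsmoothed masks sum to one at every time-frequency point, the smoothed masks in this sketch also sum to one, so the source estimates add back up to the mixture spectrogram. A median filter does not preserve that property in general, which is one design difference between the filter types listed in the poster.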