Transcript Slide 1
Sound-Event Partitioning and Feature Normalization for Robust Sound-Event Detection
Baiying LEI (1,2), Man-Wai MAK (2)
(1) Department of Biomedical Engineering, Shenzhen University, Shenzhen, China
(2) Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
Funding Sources:
• Motorola Solutions Foundation
• The Hong Kong Polytechnic University

Slide 2
Contents
1. Motivations of Sound-Event Detection
2. Objectives
3. Methodology
– System Architecture
– Acoustic Features and Fusion
– Sound-Event Partitioning
4. Experiments and Results
5. Conclusions

Slide 3
Motivation
• In some situations (e.g., in a washroom), surveillance via video cameras is inappropriate. Audio is a viable alternative in such situations.
• With the high processing power of today’s smartphones, it becomes possible to turn a smartphone into a personal audio surveillance and monitoring system.
• Audio-based surveillance can make effective use of mobile devices, allowing the surveillance system to be moved easily from one place to another.
• Abnormal sound events such as screaming can be detected, and emergency phone calls can be made automatically.

Slide 4
Objectives of This Work
1. Determine suitable acoustic features for scream sound detection
2. Address the data-imbalance problem (scream vs. non-scream) in training SVM classifiers
3. Implement the detection algorithm on mobile phones

Slide 5
Methodology

Slide 6
System Architecture
[Figure: system architecture. Labels: playback of sound events, playback of background noise, Android app]

Slide 7
Feature Extraction and Fusion
• Characteristics of scream sounds
– Almost impossible to detect in the time domain under noise
– But their spectral characteristics are still visible in the spectrogram under very noisy conditions
[Figure: waveform of a scream sound with babble noise (SNR -5 dB); axes: Time (s), Amplitude]
[Figure: spectrogram of a scream sound with babble noise (SNR -5 dB); axes: Time (s), Frequency (Hz), 0 to 8000 Hz]

Slide 8
Feature Extraction and Fusion
• Time-Frequency Acoustic Features
– MFCC (Mel-frequency cepstral coefficients)
• Commonly used in speech and speaker recognition systems
• Known to be not very noise robust
– GFCC (Gammatone frequency cepstral coefficients)
• Based on auditory filtering and cepstral analysis
• More noise robust than MFCC

Slide 9
Feature Extraction and Fusion
• Correlation between MFCC and GFCC
[Figure: scatter plot of GFCC against MFCC]
• Fusion may help improve performance

Slide 10
Feature Extraction and Fusion
• Feature Fusion: $F = F_{\mathrm{MFCC}} \oplus F_{\mathrm{GFCC}}$
[Diagram: $F_{\mathrm{GFCC}}$ and $F_{\mathrm{MFCC}}$ combined into $F$]
• Score Fusion:
– Fuse the scores produced by the MFCC-based and GFCC-based SVM classifiers:
$s = \alpha \times s_{\mathrm{MFCC}} + (1 - \alpha) \times s_{\mathrm{GFCC}}$
[Diagram: SVM scores]

Slide 11
Feature Extraction and Fusion
• Feature Fusion + Score Fusion:
$s_f = \gamma \times s(F) + (1 - \gamma) \times s$
– $s(F)$: score from the feature-fusion SVM
– $s$: score from the score-fusion SVM

Slide 12
PCA Whitening and Normalization
$\hat{Z} = \dfrac{\mathrm{diag}(\lambda_1^{-1/2}, \ldots, \lambda_{D'}^{-1/2}) \, P^{T} F}{\left\| \mathrm{diag}(\lambda_1^{-1/2}, \ldots, \lambda_{D'}^{-1/2}) \, P^{T} F \right\|_2}$
• $P$: projection matrix comprising eigenvectors
• $\lambda$: eigenvalues

Slide 13
Sound-Event Partitioning
• Based on our previous work on utterance partitioning for speaker verification

Slide 14
Experiments and Results

Slide 15
Sound Data
• 1000 sound events collected from
– Human Sound Effect (www.sound-ideas.com)
– Freesound.org
• 240 screams and 760 non-screams
• Non-screams (22 types):
– Applause, babycry, cheering, cough, crowd, door-slam, groan, grunt, gunshot, kiss, laugh, nose-blow, phone-ring, sniff, sniffle, snore, snort, speech, spit, throat, vocal, whistle

Slide 16
Sound Data
[Figure: distribution p(N) of the number of frames N per sound event, plotted separately for scream and non-scream sounds; axes: No. of frames (N), p(N)]

Slide 17
Effect of Background Noise
• Babble noise from NOISEX’92 was added to the sound events so that the resulting noisy sound events have SNRs of 10 dB, 5 dB, 0 dB, and -5 dB
• Performance is reported as % EER (equal error rate, the operating point where false acceptance = false rejection)
• The system performs better under matched conditions

Slide 18
Effect of Sound-Event Partitioning and Fusion
• Two partitions per sound event are sufficient
• Score fusion + feature fusion performs best

Slide 19
Effect of Sound-Event Partitioning and Feature Preprocessing
• Using sound-event partitioning is always better than not using it
• PCA whitening and L2 normalization are useful

Slide 20
Conclusion
• Sound-event partitioning and feature preprocessing methods are proposed for scream sound detection.
• It was found that
– Using sound-event partitioning is always better than not using it
– PCA whitening and L2 normalization are useful
– Score fusion + feature fusion is effective
• Demo
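
Note on the fusion formulas from the Feature Extraction and Fusion slides: the short Python sketch below shows how the two fusion stages could fit together. It is only an illustration under stated assumptions, not the authors' implementation. The fusion operator in the feature-fusion formula is read here as concatenation, the function names are hypothetical, and the weights (alpha for score fusion, gamma for the combined score) are assumed to lie in [0, 1]; the slides do not say how they were chosen.

```python
from typing import List, Sequence


def fuse_features(f_mfcc: Sequence[float], f_gfcc: Sequence[float]) -> List[float]:
    """Feature fusion: combine MFCC and GFCC vectors into one vector F.

    Assumption: the fusion operator on the slide is read as concatenation.
    """
    return list(f_mfcc) + list(f_gfcc)


def fuse_scores(score_a: float, score_b: float, weight: float) -> float:
    """Weighted sum: weight * score_a + (1 - weight) * score_b."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight is assumed to lie in [0, 1]")
    return weight * score_a + (1.0 - weight) * score_b


def combined_score(s_feature_fusion: float, s_mfcc: float, s_gfcc: float,
                   alpha: float, gamma: float) -> float:
    """Feature fusion + score fusion.

    s   = alpha * s_mfcc + (1 - alpha) * s_gfcc   (score-fusion SVM score)
    s_f = gamma * s(F)   + (1 - gamma) * s        (final score)
    """
    s = fuse_scores(s_mfcc, s_gfcc, alpha)
    return fuse_scores(s_feature_fusion, s, gamma)
```

In this sketch, s_feature_fusion stands for the score of an SVM trained on the output of fuse_features, while s_mfcc and s_gfcc stand for the scores of the two single-feature SVMs. Setting gamma to 1 reduces the result to feature fusion alone, and setting it to 0 reduces it to score fusion alone.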
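
Note on the PCA Whitening and Normalization slide: the NumPy sketch below illustrates one way to whiten feature vectors and then scale them to unit L2 norm. It rests on assumptions the slides do not confirm. The projection matrix P and eigenvalues are estimated from the sample covariance of the training features, the features are mean-centred before projection, and the top D' eigenvectors are kept. All function names are hypothetical.

```python
import numpy as np


def fit_pca_whitening(features: np.ndarray, n_components: int):
    """Estimate the mean, projection matrix P and eigenvalues from training data.

    features: array of shape (n_samples, dim).
    Returns (mean, P, eigvals) with P of shape (dim, n_components).
    """
    mean = features.mean(axis=0)
    cov = np.cov(features - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest D'
    return mean, eigvecs[:, order], eigvals[order]


def whiten(features: np.ndarray, mean: np.ndarray, P: np.ndarray,
           eigvals: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Compute diag(eigvals^-1/2) P^T (F - mean) for each row of features."""
    return ((features - mean) @ P) / np.sqrt(eigvals + eps)


def l2_normalize(z: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit Euclidean length (eps guards against zero rows)."""
    norms = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.maximum(norms, eps)
```

A typical use would be to call fit_pca_whitening once on the training features, then apply whiten followed by l2_normalize to both training and test vectors before they are passed to the SVM.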