Sound-Event Partitioning and Feature
Normalization for Robust Sound-Event Detection
Baiying LEI¹,²
Man-Wai MAK²
¹ Department of Biomedical Engineering, Shenzhen University, Shenzhen, China
² Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
Funding Sources:
• Motorola Solutions Foundation
• The Hong Kong Polytechnic University
Contents
1. Motivations of Sound-Event Detection
2. Objectives
3. Methodology
– System Architecture
– Acoustic Features and Fusion
– Sound Event Partitioning
4. Experiments and Results
5. Conclusions
Motivation
• Under some situations (e.g., in a washroom), surveillance via video
cameras is inappropriate. Audio is a viable alternative under such
situations.
• With the high processing power of today’s smartphones, it becomes
possible to turn a smartphone into a personal audio surveillance and
monitoring system.
• Audio-based surveillance can make effective use of mobile devices,
allowing the surveillance system to be moved from one place to
another easily.
• Abnormal sound events such as screaming can be detected and
emergency phone calls can be automatically made.
Objectives of This Work
1. Determining suitable acoustic features for
scream sound detection
2. Addressing the data-imbalance problem
(scream vs. non-scream) in training SVM
classifiers
3. Implementing the detection algorithm on mobile
phones
Methodology
System Architecture
[Figure: system architecture. Blocks: playback of sound events, playback of background noise, Android app.]
Feature Extraction and Fusion
• Characteristics of scream sounds
– Almost impossible to detect in the time domain
– But their spectral characteristics are still visible in the
spectrogram under very noisy conditions
[Figure: waveform (amplitude vs. time) and spectrogram (frequency in Hz vs. time in s) of a scream sound with babble noise at an SNR of -5 dB.]
Feature Extraction and Fusion
• Time-Frequency Acoustic Features
– MFCC (Mel-frequency cepstral coefficients)
• Commonly used in speech and speaker recognition systems
• Known to be not very noise robust
– GFCC (Gammatone frequency cepstral coefficients)
• Based on auditory filtering and cepstral analysis
• More noise robust than MFCC
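Both feature types end with a cepstral-analysis step applied to per-frame filterbank energies; what differs is the filterbank in front of it (mel-spaced filters for MFCC, gammatone filters for GFCC). The sketch below shows only that shared last step, under assumptions not stated on the slide: the band energies are already computed, the DCT is an unnormalized DCT-II, and the compression function is left as a parameter because it varies between formulations. The function name and the toy energies are hypothetical.

```python
import math

def cepstral_coefficients(band_energies, num_coeffs, compress=math.log):
    """Compress filterbank energies, then apply an unnormalized DCT-II."""
    n = len(band_energies)
    compressed = [compress(e) for e in band_energies]
    return [
        sum(c * math.cos(math.pi * k * (i + 0.5) / n)
            for i, c in enumerate(compressed))
        for k in range(num_coeffs)
    ]

# Hypothetical energies from a 4-band filterbank for one frame.
frame_energies = [math.e, math.e, math.e, math.e]
coeffs = cepstral_coefficients(frame_energies, num_coeffs=3)
```

With a flat spectrum, as in this toy input, only the zeroth coefficient is non-zero; the higher coefficients capture the shape of the spectral envelope.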
Feature Extraction and Fusion
• Correlation between MFCC and GFCC
[Figure: scatter plot of GFCC against MFCC.]
Fusion may help improve performance
Feature Extraction and Fusion
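• Feature Fusion:
$F = F_{\mathrm{MFCC}} \oplus F_{\mathrm{GFCC}}$
[Figure: the vectors $F_{\mathrm{MFCC}}$ and $F_{\mathrm{GFCC}}$ combined into $F$.]
• Score Fusion:
– Fuse the scores produced by MFCC-based and GFCC-based SVM
classifiers
$s = \alpha \times s_{\mathrm{MFCC}} + (1 - \alpha) \times s_{\mathrm{GFCC}}$
where $s_{\mathrm{MFCC}}$ and $s_{\mathrm{GFCC}}$ are the SVM scores.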
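The two fusion rules on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the feature-fusion operator means concatenating the MFCC and GFCC vectors, and it treats the mixing weight (written `alpha` here) as a value chosen by the user. The function names and toy numbers are hypothetical.

```python
def fuse_features(f_mfcc, f_gfcc):
    """Feature fusion, assumed here to be concatenation of the two vectors."""
    return list(f_mfcc) + list(f_gfcc)

def fuse_scores(s_mfcc, s_gfcc, alpha):
    """Score fusion: alpha * s_mfcc + (1 - alpha) * s_gfcc."""
    return alpha * s_mfcc + (1.0 - alpha) * s_gfcc

# Toy feature vectors and SVM scores (made-up values).
fused_vector = fuse_features([0.1, -0.2], [0.3, 0.4, 0.5])
fused_score = fuse_scores(s_mfcc=1.0, s_gfcc=-1.0, alpha=0.75)
```

Feature fusion trains one SVM on the longer vector, while score fusion keeps two separate SVMs and combines only their outputs.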
Feature Extraction and Fusion
• Feature Fusion + Score Fusion:
$s_f = \gamma \times s(F) + (1 - \gamma) \times s$,
where $s(F)$ is the score from the feature-fusion SVM and $s$ is the score from the score-fusion SVM.
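Substituting the score-fusion rule from the previous slide into this formula makes the combined weighting explicit. The expansion below is a worked step added for illustration; it assumes the score-fusion term is the weighted sum of the MFCC-based and GFCC-based SVM scores, with $\alpha$ as its mixing weight.

```latex
\begin{align*}
s_f &= \gamma\, s(F) + (1-\gamma)\bigl[\alpha\, s_{\mathrm{MFCC}} + (1-\alpha)\, s_{\mathrm{GFCC}}\bigr] \\
    &= \gamma\, s(F) + (1-\gamma)\alpha\, s_{\mathrm{MFCC}} + (1-\gamma)(1-\alpha)\, s_{\mathrm{GFCC}},
\end{align*}
\qquad
\gamma + (1-\gamma)\alpha + (1-\gamma)(1-\alpha) = 1 .
```

The final score is therefore a weighted average of three SVM scores whose weights sum to one.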
PCA Whitening and Normalization
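$$\hat{Z} = \frac{\mathrm{diag}\left(\lambda_1^{-1/2}, \ldots, \lambda_{D'}^{-1/2}\right) P^{\mathsf{T}} F}{\left\lVert \mathrm{diag}\left(\lambda_1^{-1/2}, \ldots, \lambda_{D'}^{-1/2}\right) P^{\mathsf{T}} F \right\rVert_2}$$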
• P: projection matrix comprising eigenvectors
• λ: Eigenvalues
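Reading the slide's formula as PCA whitening (projection onto the eigenvectors in P, scaling each component by the inverse square root of its eigenvalue) followed by division by the L2 norm, the transform for a single feature vector might look like the sketch below. It assumes P and the eigenvalues have already been obtained from training data and that the input vector has been mean-centred if required; neither step is shown on the slide. The function name and toy values are hypothetical.

```python
import math

def whiten_and_normalize(f, eigvecs, eigvals):
    """Project f onto each eigenvector, scale by eigenvalue**-0.5, then L2-normalize.

    eigvecs holds the columns of P (one list per eigenvector) and eigvals the
    matching eigenvalues; both are assumed to come from a PCA of training data.
    """
    whitened = [
        sum(p_i * f_i for p_i, f_i in zip(p, f)) / math.sqrt(lam)
        for p, lam in zip(eigvecs, eigvals)
    ]
    norm = math.sqrt(sum(z * z for z in whitened))
    return [z / norm for z in whitened]  # fails on an all-zero vector

# Toy 2-D example with hypothetical PCA outputs (axis-aligned eigenvectors).
P_columns = [[1.0, 0.0], [0.0, 1.0]]
lambdas = [4.0, 1.0]
z_hat = whiten_and_normalize([2.0, 1.0], P_columns, lambdas)
```

Keeping only the first D' eigenvectors in `eigvecs` would also reduce the dimension of the output.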
Sound-Event Partitioning
• Based on our previous work on Utterance Partitioning for
speaker verification
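The slide does not spell out the partitioning procedure, so the following is only a rough sketch of the general idea under stated assumptions: the frame sequence of one sound event is split into contiguous, nearly equal-length partitions, and each partition could then yield its own feature vector. That would produce several training examples per event, which is one way to ease the scream vs. non-scream imbalance named in the objectives. The function name is hypothetical and the authors' method may differ.

```python
def partition_frames(frames, num_partitions):
    """Split a frame sequence into contiguous, nearly equal-length partitions."""
    if num_partitions < 1 or num_partitions > len(frames):
        raise ValueError("num_partitions must be between 1 and len(frames)")
    base, extra = divmod(len(frames), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional frame each.
        end = start + base + (1 if i < extra else 0)
        partitions.append(frames[start:end])
        start = end
    return partitions

# Ten placeholder frames (indices stand in for per-frame feature vectors).
frames = list(range(10))
parts = partition_frames(frames, 3)
```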
Experiments and Results
Sound Data
• 1000 sound events collected from
– Human Sound Effect (www.sound-ideas.com)
– Freesound.org
• 240 Screams and 760 Non-screams
• Non-Screams (22 types):
– Applause, babycry, cheering, cough, crowd, door-slam, groan, grunt,
gunshot, kiss, laugh, nose-blow, phone-ring, sniff, sniffle, snore,
snort, speech, spit, throat, vocal, whistle
Sound Data
[Figure: distribution p(N) of the number of frames N per sound event, shown separately for scream and non-scream sounds.]
Effect of Background Noise
• Babble noise from NOISEX’92 was added to the sound
events so that the resulting noisy sound events have SNR
of 10dB, 5dB, 0dB, and -5dB
• Performance (%EER, False Acceptance = False Rejection)
• Performance is better under matched conditions
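Two computations are implied by this slide: scaling noise so that the mixture reaches a target SNR, and finding the equal error rate (EER), the operating point where false acceptance equals false rejection. The sketch below illustrates both under simple assumptions: SNR is defined from average sample power over the whole clip, and the EER is approximated by sweeping the observed scores as thresholds. Function names and toy data are hypothetical and this is not the authors' evaluation code.

```python
import math

def scale_noise_for_snr(signal, noise, snr_db):
    """Return noise scaled so that 10*log10(P_signal / P_noise) equals snr_db."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [gain * n for n in noise]

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds, pick where FAR and FRR are closest."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

# Toy clip and noise (made-up samples), mixed at -5 dB SNR.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
scaled = scale_noise_for_snr(signal, noise, snr_db=-5)
noisy = [s + n for s, n in zip(signal, scaled)]

# Toy scores: scream (target) vs. non-scream (non-target).
eer = equal_error_rate([0.9, 0.8, 0.3], [0.4, 0.2, 0.1])
```

In this toy case one scream scores below one non-scream, so the error rates meet at one third.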
Effect of Sound-Event Partitioning
and Fusion
• Two partitions per sound event are sufficient
• Score Fusion + Feature Fusion is the best
Effect of Sound-Event Partitioning
and Feature Preprocessing
• Having sound-event partitioning is always better
• PCA-Whitening and L2-norm are useful
Conclusion
• Sound-event partitioning and feature preprocessing methods are proposed for scream
sound detection.
• It was found that
– Having sound-event partitioning is always better
– PCA-Whitening and L2-norm are useful
– Score fusion + feature fusion is effective
• Demo