05.basicAudioFeature
Basic Features of Audio Signals
Jyh-Shing Roger Jang (張智星)
http://www.cs.nthu.edu.tw/~jang
MIR Lab, CS Dept, Tsing Hua Univ.
Hsinchu, Taiwan
Audio Features
Four commonly used audio features
Volume
Pitch
Zero crossing rate
Timbre
Our goal
These features can be perceived subjectively.
But we need to compute them quantitatively for
further processing and recognition.
Audio Features in Time Domain
Audio features presented in the time domain
Fundamental period
Intensity
Timbre: Waveform within a fundamental period (FP)
Audio Features in Frequency Domain
Volume: Magnitude of spectrum
Pitch: Distance between harmonics
Timbre: Smoothed spectrum
[Figure: spectrum annotated with intensity, pitch frequency, first formant (F1), and second formant (F2)]
Demo: Real-time Spectrogram
Try “dspstfft_audio” under MATLAB:
[Figures: spectrum display and spectrogram display]
Steps for Audio Feature Extraction
Frame blocking
Frame duration of 20 ms or so
Feature extraction
Volume, zero-crossing rate, pitch, MFCC, etc.
Endpoint detection
Usually based on volume & zero-crossing rate
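As a rough illustration of the last step, the sketch below marks endpoints using volume alone. The synthetic signal, the non-overlapping 20 ms frames, and the threshold (10% of the maximum frame volume) are assumptions chosen for illustration, not values from the slides.

```matlab
% Volume-based endpoint detection sketch (all parameters are assumptions).
fs = 16000;                         % sample rate (Hz)
frameSize = 320;                    % 20 ms at 16 kHz
tone = 0.8*sin(2*pi*440*(1:8000)'/fs);
y = [zeros(8000,1); tone; zeros(8000,1)];   % silence, tone, silence

% Non-overlapping frames, one frame per column.
frameNum = floor(length(y)/frameSize);
frameMat = reshape(y(1:frameNum*frameSize), frameSize, frameNum);

% Volume of each frame: sum of absolute sample values.
volume = sum(abs(frameMat), 1);

% Frames louder than the (assumed) threshold are treated as sound.
threshold = 0.1*max(volume);
active = volume > threshold;
firstFrame = find(active, 1, 'first');
lastFrame  = find(active, 1, 'last');

startTime = (firstFrame-1)*frameSize/fs;    % seconds
endTime   = lastFrame*frameSize/fs;         % seconds
```

On this synthetic signal the detected segment runs from 0.5 s to 1.0 s. Real recordings usually need a noise-adaptive threshold and a zero-crossing-rate check for unvoiced sounds.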
Frame Blocking
[Figure: waveform (amplitude vs. sample index) with overlapping frames marked, and a zoomed-in view of a single frame]
Sample rate = 11025 Hz
Frame size = 256 samples
Overlap = 84 samples
Hop size = 256 - 84 = 172 samples
Frame rate = 11025/(256-84) ≈ 64 frames/sec
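The frame blocking parameters above can be tried out with a short sketch like the following. It uses only base MATLAB indexing; the test signal and all variable names are illustrative assumptions.

```matlab
% Frame blocking sketch; variable names are illustrative assumptions.
fs = 11025;                 % sample rate (Hz)
frameSize = 256;            % samples per frame
overlap = 84;               % samples shared by adjacent frames
hop = frameSize - overlap;  % 172 samples between frame starts

% One second of a 440 Hz sinusoid stands in for real audio.
y = 0.8*sin(2*pi*440*(1:fs)'/fs);

% Number of complete frames that fit in the signal (the tail is dropped).
frameNum = floor((length(y) - overlap)/hop);

% Each column of frameMat holds one frame.
frameMat = zeros(frameSize, frameNum);
for k = 1:frameNum
    startIndex = (k-1)*hop + 1;
    frameMat(:, k) = y(startIndex : startIndex+frameSize-1);
end

frameRate = fs/hop;         % about 64 frames per second
```

With these settings, one second of audio yields 63 complete frames, and the last 84 samples of each frame reappear as the first 84 samples of the next.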
Intensity (I)
Intensity
Visual cue: Amplitude of vibration
Computation:
Volume: $\mathrm{vol} = \sum_{i=1}^{n} |s_i|$
Log energy (in decibel): $\mathrm{energy} = 10 \log_{10} \sum_{i=1}^{n} s_i^2$
Characteristics
Influenced by microphone type and microphone setup
Perceived volume is influenced by frequency and timbre
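A minimal sketch of the two intensity measures for a single frame is given below. The synthetic frame is an assumption; in practice the frame would come from frame blocking of recorded audio.

```matlab
% Volume and log energy of one frame (synthetic frame assumed).
fs = 16000;
frame = 0.8*sin(2*pi*440*(1:256)'/fs);

vol = sum(abs(frame));               % sum of absolute sample values
energy = 10*log10(sum(frame.^2));    % log energy in decibels

% Halving the amplitude halves the volume and lowers the log energy
% by 20*log10(2), about 6.02 dB.
volHalf = sum(abs(0.5*frame));
energyHalf = 10*log10(sum((0.5*frame).^2));
```

Note that a completely silent frame (all zeros) gives a log energy of -Inf, so a small constant is sometimes added inside the logarithm.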
Intensity (II)
To avoid DC drift
DC drift: the waveform does not oscillate around zero
Computation:
Volume: $\mathrm{vol} = \sum_{i=1}^{n} |s_i - \mathrm{median}(\mathbf{s})|$
Log energy (in decibel): $\mathrm{energy} = 10 \log_{10} \sum_{i=1}^{n} (s_i - \mathrm{mean}(\mathbf{s}))^2$
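The effect of removing the DC component can be checked with a sketch like this one. The synthetic frame and the offset of 0.3 are assumptions for illustration.

```matlab
% Volume and log energy with DC removal (synthetic frame and offset assumed).
fs = 16000;
cleanFrame = 0.8*sin(2*pi*440*(1:256)'/fs);
offsetFrame = cleanFrame + 0.3;      % same waveform with a DC offset

% Subtract the median for volume and the mean for log energy.
volOf = @(s) sum(abs(s - median(s)));
energyOf = @(s) 10*log10(sum((s - mean(s)).^2));

volClean = volOf(cleanFrame);
volOffset = volOf(offsetFrame);          % matches volClean up to rounding
energyClean = energyOf(cleanFrame);
energyOffset = energyOf(offsetFrame);    % matches energyClean up to rounding

% Without DC removal, the offset inflates the volume.
volNaive = sum(abs(offsetFrame));
```

Both measures are unchanged by the offset once the median or mean is subtracted, whereas the uncorrected volume is larger.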
Theoretical background (How to prove?)
For $\mathbf{s} = [s_1, s_2, \ldots, s_n]$:
$\mathrm{median}(\mathbf{s}) = \arg\min_x \sum_{i=1}^{n} |s_i - x|$
$\mathrm{mean}(\mathbf{s}) = \arg\min_x \sum_{i=1}^{n} (s_i - x)^2$
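One possible way to approach the "How to prove?" question is sketched below. It is an outline under the usual calculus and convexity arguments, not necessarily the proof the author has in mind; the helper names g and h are introduced here only for illustration.

```latex
% Mean: minimize a smooth convex function of x.
\begin{align*}
g(x) &= \sum_{i=1}^{n} (s_i - x)^2, \\
g'(x) &= -2\sum_{i=1}^{n} (s_i - x) = 0
  \;\Longrightarrow\; x = \frac{1}{n}\sum_{i=1}^{n} s_i = \mathrm{mean}(\mathbf{s}), \\
g''(x) &= 2n > 0 \quad \text{(so the stationary point is the minimum).}
\end{align*}

% Median: minimize a convex, piecewise-linear function of x.
\begin{align*}
h(x) &= \sum_{i=1}^{n} |s_i - x|, \\
h'(x) &= \#\{i : s_i < x\} - \#\{i : s_i > x\}
  \quad \text{for } x \notin \{s_1, \ldots, s_n\}.
\end{align*}
```

The slope of h is negative while more samples lie above x than below it, and positive in the opposite case, so h is minimized where the two counts balance, which is at the median. When n is even, any x between the two middle samples attains the minimum.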
Intensity (III)
Examples
Please refer to the online tutorial
Pitch
Definition
Pitch is also known as the fundamental frequency, which
is equal to the number of fundamental periods within a
second. The unit used here is Hertz (Hz).
More commonly, pitch is expressed in semitones,
which can be converted from pitch in Hertz:
$\mathrm{semitone} = 69 + 12 \log_2 \left( \frac{\mathrm{Hz}}{440} \right)$
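The conversion can be written as a pair of anonymous functions. The function names are assumptions; the inverse mapping is obtained by solving the formula above for the frequency.

```matlab
% Hertz <-> semitone conversion (function names are illustrative).
freq2semitone = @(f) 69 + 12*log2(f/440);
semitone2freq = @(s) 440*2.^((s - 69)/12);

a4 = freq2semitone(440);        % 69: the 440 Hz reference
a5 = freq2semitone(880);        % 81: one octave (12 semitones) higher
middleC = semitone2freq(60);    % about 261.63 Hz
```

Doubling the frequency always adds 12 semitones, which is why the semitone scale matches perceived musical intervals better than Hertz.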
Pitch Computation (I)
Pitch of tuning forks
$\mathrm{ff} = \frac{16000}{(189-7)/5} = 439.56\ \mathrm{Hz}$
$\mathrm{pitch} = 69 + 12 \log_2 \left( \frac{\mathrm{ff}}{440} \right) = 68.98\ \mathrm{semitones}$
Pitch Computation (II)
Pitch of speech
$\mathrm{ff} = \frac{16000}{(477-75)/3} = 119.403\ \mathrm{Hz}$
$\mathrm{pitch} = 69 + 12 \log_2 \left( \frac{\mathrm{ff}}{440} \right) = 46.42\ \mathrm{semitones}$
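The speech example can be reproduced numerically as follows. The sketch assumes that samples 75 and 477 mark the start and end of 3 fundamental periods in a recording sampled at 16000 Hz; the variable names are illustrative.

```matlab
% Pitch from manually marked fundamental periods (speech example).
fs = 16000;             % sample rate (Hz)
startIndex = 75;        % first marked sample (assumed to start a period)
endIndex = 477;         % last marked sample (assumed to end a period)
periodCount = 3;        % fundamental periods between the two marks

periodInSamples = (endIndex - startIndex)/periodCount;   % 134 samples
ff = fs/periodInSamples;                                 % about 119.403 Hz
pitch = 69 + 12*log2(ff/440);                            % about 46.42 semitones
```

Averaging over several periods reduces the error caused by marking positions only to the nearest sample.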
Statistics of Mandarin Chinese
5401 characters, each associated with at least one
base syllable and a tone
411 base syllables, and most syllables have 4 tones, so we
have 1501 tonal syllables
Tone is characterized by the pitch curves:
Tone 1: high-high
Tone 2: low-high
Tone 3: high-low-high
Tone 4: high-low
Some examples of tones:
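1242: 清華大學 (Tsing Hua University)
1234: 三民主義 (Three Principles of the People), 優柔寡斷 (indecisive), 搭達打大 (dā dá dǎ dà), 依宜以易 (yī yí yǐ yì), 夫福府負 (fū fú fǔ fù)
?????: 美麗大教堂 (beautiful cathedral), 滷蛋有夠鹹 (the braised egg is really salty; Taiwanese)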
Sinusoidal Signals
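How to generate a sinusoidal signal:
fs=16000;                              % sample rate (Hz)
duration=3;                            % duration (sec)
f=440;                                 % frequency of the sinusoid (Hz)
t=(1:fs*duration)/fs;                  % time of each sample (sec)
y=0.8*sin(2*pi*f*t);                   % sinusoid with amplitude 0.8
plot(t,y); axis([0.6, 0.65, -1, 1]);   % show a 50 ms portion of the waveform
sound(y, fs);                          % play the signal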
Zero Crossing Rate
Zero crossing rate (ZCR)
The number of zero crossings in a frame.
Characteristics:
Noise and unvoiced sounds have high ZCR.
ZCR is commonly used in endpoint detection,
especially for detecting the start and end of
unvoiced sounds.
To distinguish noise/silence from unvoiced sounds,
we usually add a bias (shift) before computing ZCR.
ZCR Computations
Two definitions of ZCR
If a sample with zero value is counted as a zero
crossing, the ZCR is higher; otherwise
it is lower.
The choice of definition affects the ZCR, especially when the sample
rate is low.
Other considerations
Zero-justification (removing the DC offset) is required before computing ZCR.
ZCR with a shift can be used to distinguish between
unvoiced sounds and silence. (How to determine
the shift amount?)
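The two definitions and the effect of a shift can be compared on toy frames, as in the sketch below. The frames, the function names, and the shift of 2 are assumptions chosen so the counts are easy to verify by hand; choosing the shift for real audio remains the open question posed above.

```matlab
% Two ZCR definitions, compared on a toy frame (values are assumptions).
zcrStrict = @(x) sum(x(1:end-1).*x(2:end) < 0);    % zero samples not counted
zcrLoose  = @(x) sum(x(1:end-1).*x(2:end) <= 0);   % zero samples counted

frame = [3 0 -2 -1 0 0 4 -4]';
frame = frame - mean(frame);     % zero-justification (the mean is 0 here)
zcr1 = zcrStrict(frame);         % 1
zcr2 = zcrLoose(frame);          % 6

% ZCR with a shift: low-level noise stays on one side of the shifted axis.
shift = 2;                                   % assumed shift amount
silence  = [1 -1 1 -1 1 -1 1 -1]';           % low-amplitude noise
unvoiced = 10*silence;                       % same pattern, higher amplitude
zcrSilence  = zcrStrict(silence - shift);    % 0
zcrUnvoiced = zcrStrict(unvoiced - shift);   % 7
```

Without the shift, both toy frames would have a ZCR of 7, so the shift is what separates the low-amplitude noise from the louder unvoiced-like frame.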
ZCR
Examples
Please refer to the online tutorial.