Automatic Description and Classification of Instrumental Sounds

Geoffroy Peeters
Ircam (Analysis/Synthesis Team)
1. Introduction
Musical Instrument Sound Classification
- Numerous studies exist on sound classification.
- Few of them address the generalization to new sound sources (recognition of the same instrument possibly recorded in different conditions, with various instrument manufacturers and players).
- Usual evaluation of system performance: training on a subset of the database, evaluation on the rest of the database.
  - This does not prove applicability to the classification of sounds that do not belong to the database.
- State of the art:
  - Martin [1999]: 76% (family), 39% for 14 instruments
  - Eronen [2001]: 77% (family), 35% for 16 instruments
- Goal of this study: classification on a large database.
- How? A new classification system:
  - extract a large set of features,
  - a new feature selection algorithm,
  - compare flat and hierarchical gaussian classifiers.
Outline
- Feature extraction
- Feature selection
- Feature transform
- Classification
- Evaluation
  - Confusion matrix
  - Which features
  - Classes organization
2. Feature extraction
[Pipeline: feature extraction -> temporal modeling -> feature transform (gaussianity) -> feature selection (IRMFSP) -> feature transform (LDA) -> class modeling]
- Features for sound recognition come from:
  - the speech recognition community,
  - previous studies on musical instrument sound classification,
  - results of psycho-acoustical studies.
- Each feature set is supposed to perform well for a specific task.
- Principle:
  - 1) extract a large set of features,
  - 2) filter the feature set a posteriori with a feature selection algorithm, which reduces the whole set of features to a smaller set given the classes.
- Instantaneous (frame-based) features:
  - harmonic features
  - spectral shape features
  - perceptual features
  - MFCC, auto-correlation, zero-crossing rate
  - MPEG-7 LLDs (spectral flatness, spectral crest)
- Global features (attack time, temporal increase/decrease)
- Temporal modeling of the instantaneous features: mean, variance, derivative, modulation, polynomial (see the sketch below).
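To make the frame-based extraction and temporal modeling concrete, here is a minimal sketch (not the author's code): it frames a signal, computes one instantaneous descriptor (spectral centroid) per frame, and summarizes the trajectory with a few global statistics; the function names and parameter values are illustrative assumptions.

import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def spectral_centroid(frames, sr):
    """Instantaneous (per-frame) spectral centroid on a linear amplitude scale."""
    spec = np.abs(np.fft.rfft(frames, axis=1))            # magnitude spectra
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-12)

def temporal_modeling(values):
    """Summarize a per-frame feature trajectory into a few global features."""
    return {"mean": values.mean(),
            "var": values.var(),
            "deriv_mean": np.diff(values).mean() if len(values) > 1 else 0.0}

sr = 44100
x = np.random.randn(sr)                                    # stand-in for a real note
print(temporal_modeling(spectral_centroid(frame_signal(x), sr)))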
2. Feature extraction
Audio features taxonomy
[Diagram: signal -> extraction module (fundamental frequency, signal descriptors) -> segmentation -> instantaneous descriptors -> temporal modeling -> global descriptors]
- Global descriptors
- Instantaneous descriptors
- Temporal modeling: mean, variance, modulation (pitch, energy)
2. Feature extraction
Audio features taxonomy
[Diagram: signal frame -> FFT -> sinusoidal harmonic model / perceptual model -> instantaneous harmonic, spectral and perceptual descriptors; energy envelope -> instantaneous temporal and global temporal descriptors]
- DT: temporal descriptors
- DE: energy descriptors
- DS: spectral descriptors
- DH: harmonic descriptors
- DP: perceptual descriptors
2. Feature extraction
DT/DE: Temporal / Energy descriptors
(computed on the sound waveform and its energy envelope)
- DT.zero-crossing rate
- DT.auto-correlation
- DT.log-attack time
- DT.temporal increase
- DT.temporal decrease
- DT.temporal centroid
- DT.effective duration
- DE.total energy
- DE.energy of the harmonic part
- DE.energy of the noise part
2. Feature extraction
DS: Spectral descriptors
(computed on the windowed signal frame after FFT)
- DS.centroid, DS.spread, DS.skewness, DS.kurtosis (see the sketch below)
- DS.slope, DS.decrease, DS.roll-off
- DS.variation
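The DS descriptors above are moment-like statistics of the magnitude spectrum. The sketch below is my own illustration of the usual moment-based definitions (centroid, spread, skewness, kurtosis, roll-off); the exact formulas used in the study may differ, and slope/decrease/variation are omitted.

import numpy as np

def spectral_shape(a, f):
    """Moment-based spectral shape descriptors for one magnitude spectrum a over frequencies f."""
    p = a / (a.sum() + 1e-12)                 # normalize the spectrum to a distribution
    centroid = (p * f).sum()
    spread = np.sqrt((p * (f - centroid) ** 2).sum())
    skewness = (p * (f - centroid) ** 3).sum() / (spread ** 3 + 1e-12)
    kurtosis = (p * (f - centroid) ** 4).sum() / (spread ** 4 + 1e-12)
    # roll-off: frequency below which 95 % of the (normalized) spectral amplitude lies
    rolloff = f[np.searchsorted(np.cumsum(p), 0.95)]
    return dict(centroid=centroid, spread=spread,
                skewness=skewness, kurtosis=kurtosis, rolloff=rolloff)

f = np.linspace(0, 11025, 513)                # frequency bins up to 11025 Hz
a = np.exp(-f / 2000.0)                       # toy decaying spectrum
print(spectral_shape(a, f))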
2. Feature extraction
DH: Harmonic descriptors
(computed from a sinusoidal harmonic model of the windowed signal frame)
- DH.Centroid, DH.Spread, DH.Skewness, DH.Kurtosis
- DH.Slope, DH.Decrease, DH.Roll-off
- DH.Variation
- DH.Fundamental frequency
- DH.Noisiness, DH.OddEvenRatio, DH.Inharmonicity
- DH.Tristimulus
- DH.Deviation
2. Feature extraction
DP: Perceptual descriptors / DV: Various descriptors
(computed after mid-ear filtering and mapping onto perceptual scales: Bark, Mel)
- DP.Centroid, DP.Spread, DP.Skewness, DP.Kurtosis
- DP.Slope, DP.Decrease, DP.Roll-off
- DP.Variation
- DP.Loudness, DP.RelativeSpecificLoudness
- DP.Sharpness, DP.Spread
- DP.Roughness, DP.FluctuationStrength
- DV.MFCC, DV.Delta-MFCC, DV.Delta-Delta-MFCC (see the sketch below)
- DV.SpectralFlatness, DV.SpectralCrest
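For the DV block (MFCC and their derivatives), here is a hedged sketch using librosa, which is my choice of library for illustration and not necessarily the tool used in the study; the audio is a random stand-in signal.

import numpy as np
import librosa

y, sr = np.random.randn(22050).astype(np.float32), 22050    # 1 s stand-in audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # shape (13, n_frames)
d_mfcc = librosa.feature.delta(mfcc)                         # delta-MFCC
dd_mfcc = librosa.feature.delta(mfcc, order=2)               # delta-delta-MFCC
features = np.vstack([mfcc, d_mfcc, dd_mfcc])                # DV feature block
print(features.shape)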
2. Feature extraction
Audio features design
- No consensus exists on the amplitude and frequency scales to use.
- All features are therefore computed on the following scales:
  - frequency scale: linear / log / Bark bands
  - amplitude scale: linear / power / log
    (note: log(0.0) = -infinity -> normalization to 24 bits)
- Features must be independent of the recording level.
  - On the linear and power amplitude scales the spectral centroid is already level-invariant:
    sc = \sum_k a_k f_k / \sum_k a_k = \sum_k (2 a_k) f_k / \sum_k (2 a_k)
    sc = \sum_k a_k^2 f_k / \sum_k a_k^2 = \sum_k (2 a_k)^2 f_k / \sum_k (2 a_k)^2
  - On the logarithmic amplitude scale it is not, since \sum_k \log(a_k) f_k / \sum_k \log(a_k) changes when a_k is replaced by 2 a_k; hence the normalization on the log scale (see the numerical check below).
- Features must be independent of the sampling rate.
  - maximum frequency taken into account: 11025/2 Hz
  - resampling (for zero-crossing rate and auto-correlation)
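A tiny numerical check of the level-invariance argument above (a sketch, not the author's code): the spectral centroid of a toy spectrum is unchanged when all amplitudes are doubled on the linear or power scale, but changes on the log scale.

import numpy as np

f = np.linspace(50, 5000, 200)                # frequency axis
a = np.random.rand(200) + 0.1                 # toy amplitude spectrum

def centroid(w, f):
    return (w * f).sum() / w.sum()

for name, scale in [("linear", lambda v: v),
                    ("power", lambda v: v ** 2),
                    ("log", np.log)]:
    print(name, np.isclose(centroid(scale(a), f), centroid(scale(2 * a), f)))
# -> True, True, False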
3. Feature selection algorithm (FSA)
- Problem: using a high number of features
  - some features can be irrelevant for the given task,
  - the model over-fits the training set (especially with LDA),
  - the classification models become difficult for humans to interpret.
- Goal of the feature selection algorithm (FSA): find the minimal set of
  - criterion 1) features that are informative with respect to the classes,
  - criterion 2) features that provide non-redundant information.
- Forms of feature selection algorithms:
  - embedded: the FSA is part of the classifier,
  - filter: the FSA is distinct from the classifier and used before it,
  - wrapper: the FSA makes use of the classification results.
3. Feature selection algorithm: IRMFSP
- Inertia Ratio Maximization using Feature Space Projection
- Criterion 1: features informative with respect to the classes
  - principle: "feature values for sounds belonging to a specific class should be separated from the values for all the other classes"
  - measure: for a specific feature i, the ratio of the between-class inertia B to the total inertia T,
    r_i = B_i / T_i, with
    B_i = \sum_{k=1}^{K} (N_k / N) (m_{i,k} - m_i)(m_{i,k} - m_i)'
    T_i = (1/N) \sum_{n=1}^{N} (f_{i,n} - m_i)(f_{i,n} - m_i)'
- Criterion 2: features that provide non-redundant information
  - apply an orthogonalization of the feature space after the selection of each new feature (Gram-Schmidt orthogonalization):
    g_i = f_i / ||f_i||
    f_j' = f_j - (f_j . g_i) g_i   for all j in F
- Iteration: compute the inertia ratio for all features, take the feature with the largest ratio, project the whole feature space on the selected feature, and repeat (see the sketch below).
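The sketch below is my reading of the IRMFSP loop described on this slide, not the reference implementation: at each step it computes the per-feature ratio of between-class to total inertia, keeps the feature with the largest ratio, and orthogonalizes the remaining features against it with Gram-Schmidt. The names X, y, n_select are illustrative.

import numpy as np

def irmfsp(X, y, n_select=20):
    """X: (n_samples, n_features) feature matrix, y: class labels."""
    X = X.copy().astype(float)
    classes = np.unique(y)
    selected = []
    for _ in range(n_select):
        m = X.mean(axis=0)
        # between-class inertia B and total inertia T, per feature
        B = sum(np.mean(y == k) * (X[y == k].mean(axis=0) - m) ** 2
                for k in classes)
        T = ((X - m) ** 2).mean(axis=0) + 1e-12
        r = B / T
        r[selected] = -np.inf                     # never reselect a feature
        i = int(np.argmax(r))
        selected.append(i)
        # Gram-Schmidt: remove the selected direction from all feature vectors
        g = X[:, i] / (np.linalg.norm(X[:, i]) + 1e-12)
        X = X - np.outer(g, g @ X)
    return selected

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = rng.integers(0, 4, size=200)
print(irmfsp(X, y, n_select=5))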
3. Feature selection algorithm: IRMFSP
Example: sustained / non-sustained sound separation
- computation of the B/T ratio for each feature
- feature with the weakest ratio (r = 6.9e-6): specific loudness m8, mean
- feature with the highest ratio (r = 0.58): energy temporal decrease
- first three selected dimensions:
  - 1st dim: temporal decrease
  - 2nd dim: spectral centroid
  - 3rd dim: temporal increase
4. Feature transformation: LDA
- Linear Discriminant Analysis: find linear combinations of features that maximize the discrimination between classes: F -> F'
- Total inertia:
  T = (1/n) \sum_{i=1}^{n} (d_i - m)(d_i - m)'
- Between-class inertia:
  B = \sum_{k=1}^{K} (n_k / n) (m_k - m)(m_k - m)'
- Transform the initial feature space F by a transformation matrix U that maximizes the ratio
  r_u = (u' B u) / (u' T u)
- Solution: the eigenvectors of T^{-1} B, associated with the eigenvalues \lambda (discriminative power) (see the sketch below).
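A generic sketch of the LDA transform described above: my own implementation of the standard solution (eigenvectors of T^{-1} B sorted by decreasing eigenvalue), not the author's code.

import numpy as np

def lda_transform(X, y, n_components=None):
    n, d = X.shape
    m = X.mean(axis=0)
    T = (X - m).T @ (X - m) / n                        # total inertia
    B = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        B += (len(Xk) / n) * np.outer(mk - m, mk - m)  # between-class inertia
    # eigenvectors of T^-1 B, sorted by decreasing eigenvalue (discriminative power)
    evals, evecs = np.linalg.eig(np.linalg.solve(T, B))
    order = np.argsort(-evals.real)
    U = evecs[:, order].real
    if n_components is not None:
        U = U[:, :n_components]
    return X @ U, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 6)) for c in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 50)
Z, U = lda_transform(X, y, n_components=2)
print(Z.shape)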
5. Class modeling: flat classifiers
- Flat classifiers: "flat" = all classes are considered at the same level.
- Flat gaussian classifier (F-GC)
  - Training: model each class k by a multi-dimensional gaussian pdf (mean vector, covariance matrix).
  - Evaluation: Bayes formula (see the sketch below).
  - [Per-classifier pipeline: TRAINING = feature selection (best set of features f1, f2, ..., fN?), feature transformation (LDA matrix?), gaussian pdf parameter estimation for each class; EVALUATION = use only f1, f2, ..., fN, apply the LDA matrix, evaluate the Bayes formula for each class.]
- Flat KNN classifier (F-KNN)
  - instance-based algorithm
  - assign to the input sound the majority class among its K nearest neighbors in the feature space
  - Euclidean distance => how should the axes be weighted? Apply the classifier to the output of the LDA (implicit weighting of the axes).
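A minimal flat Gaussian classifier (F-GC) in the spirit of this slide: one multivariate Gaussian per class at training time, Bayes rule at evaluation time. The class priors and the covariance regularization are my own simplifications, not details given in the presentation.

import numpy as np
from scipy.stats import multivariate_normal

class FlatGaussianClassifier:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_, self.priors_ = {}, {}
        for k in self.classes_:
            Xk = X[y == k]
            cov = np.cov(Xk, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            self.models_[k] = multivariate_normal(Xk.mean(axis=0), cov)
            self.priors_[k] = len(Xk) / len(X)
        return self

    def predict(self, X):
        # Bayes formula: argmax_k p(x | k) * P(k)
        scores = np.column_stack(
            [self.models_[k].logpdf(X) + np.log(self.priors_[k])
             for k in self.classes_])
        return self.classes_[np.argmax(scores, axis=1)]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 1.0, size=(40, 3)) for c in (0, 3)])
y = np.repeat([0, 1], 40)
print((FlatGaussianClassifier().fit(X, y).predict(X) == y).mean())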
5. Class modeling: hierarchical classifiers
- Hierarchical gaussian classifier (H-GC)
  - Training: a tree of flat gaussian classifiers; each node has its own feature selection, feature transform and F-GC.
  - Tree construction is supervised (unlike a decision tree).
  - Only the subset of sounds belonging to the classes of the current node is used.
  - Evaluation: the local probability decides which branch of the tree to follow (see the sketch below).
- Advantages of H-GC
  - Learning facilities: it is easier to learn the differences within a small subset of classes.
  - Reduced class confusion: it benefits from the higher recognition rate at the higher levels of the tree.
- Hierarchical KNN classifier (H-KNN)
- Decision trees:
  - Binary Entropy Reduction Tree (BERT)
  - C4.5
  - Partial Decision Tree (PART)
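To illustrate the hierarchical evaluation step (the local decision at each node selects which branch to follow), here is a structural sketch with placeholder node classifiers; it is not the author's implementation, and the Node and _Dummy names are purely illustrative.

import numpy as np

class Node:
    """One node of the supervised tree; leaves carry an instrument label."""
    def __init__(self, classifier=None, children=None, label=None):
        self.classifier = classifier      # e.g. a per-node flat Gaussian classifier
        self.children = children or {}    # maps a local class decision to a child node
        self.label = label                # set only on leaf nodes

def hierarchical_predict(node, x):
    """Descend from the top node: each local decision selects the branch to follow."""
    while node.children:
        local_class = node.classifier.predict(x.reshape(1, -1))[0]
        node = node.children[local_class]
    return node.label

class _Dummy:
    """Stand-in node classifier for the demo (always predicts the same branch)."""
    def __init__(self, decision):
        self.decision = decision
    def predict(self, X):
        return [self.decision]

leaves = {"non-sustained": Node(label="piano"), "sustained": Node(label="violin")}
root = Node(classifier=_Dummy("non-sustained"), children=leaves)
print(hierarchical_predict(root, np.zeros(5)))   # -> piano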
6. Evaluation
Taxonomy used
- Three different levels:
  - T1: sustained / non-sustained sounds
  - T2: instrument families
  - T3: instrument names
- Instrument
  - Non-sustained (strings)
    - Struck strings: Piano
    - Plucked strings: Guitar, Harp
    - Pizz strings: Violin, Viola, Cello, Double bass
  - Sustained
    - Bowed strings: Violin, Viola, Cello, Double bass
    - Brass: Trumpet, Cornet, Trombone, French horn, Tuba
    - Woodwinds
      - Single reeds: Clarinet, Tenor sax, Alto sax, Soprano sax, Accordion
      - Double reeds: Oboe, Bassoon, English horn
      - Air reeds: Flute, Piccolo, Recorder
6. Evaluation
Test set
[Bar chart: number of sounds per instrument for each of the 6 databases (SOL, Iowa, McGill, Microsoft, Pro, Vi)]
- 6 databases:
  - Ircam Studio OnLine (SOL): 1323 sounds, 16 instruments
  - Iowa University database: 816 sounds, 12 instruments
  - McGill University database: 585 sounds, 23 instruments
  - Microsoft "Musical Instruments" CD-ROM: 216 sounds, 20 instruments
  - two commercial databases: Pro (532 sounds, 20 instruments) and Vi (691 sounds, 18 instruments)
  - total = 4163 sounds
- Notes:
  - 27 instruments have been considered
  - a large pitch range has been considered (4 octaves on average)
  - no muted or martele/staccato sounds
6. Evaluation
Evaluation process
- 1) Random 66%/33% partition of the database (50 sets)
- 2) One to One (O2O): each database is used in turn to classify all the other databases
- 3) Leave One Database Out (LODO) [Livshin 2003]: all databases except one are used in turn to classify the remaining one (see the sketch below)
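A sketch of the LODO protocol (my own helper function, with a scikit-learn KNN as a stand-in for the study's classifiers): each database is left out in turn, the classifier is trained on the remaining ones and evaluated on the held-out one.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def leave_one_database_out(X, y, db, make_classifier):
    """Train on all databases but one, test on the held-out one, for each database."""
    accuracies = {}
    for held_out in np.unique(db):
        train, test = db != held_out, db == held_out
        clf = make_classifier().fit(X[train], y[train])
        accuracies[held_out] = float((clf.predict(X[test]) == y[test]).mean())
    return accuracies

# toy demo: 3 "databases", 2 classes
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)
db = np.repeat(["SOL", "Iowa", "McGill"], 100)
print(leave_one_database_out(X, y, db, lambda: KNeighborsClassifier(n_neighbors=10)))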
6. Evaluation
Results

- Feature selection methods (recognition rate in %, std in parentheses):

            LDA      CFS (weka)     IRMFSP (t=0.01, nbdescmax=20)
  T1        96       99.0 (0.5)     99.2 (0.4)
  T2        89       93.2 (0.8)     95.8 (1.2)
  T3        86       60.8 (12.9)    95.1 (1.2)

- Classifiers (recognition rate in %):

                        T1    T2    T3
  F-GC                  98    78    55
  F-GC (BC+LDA)         99    81    54
  F-KNN (K=10, LDA)     99    77    51
  H-GC                  98    80    57
  H-GC (BC+LDA)         99    85    64
  H-KNN (K=10, LDA)     99    84    64
  BERT                  95    65    42
  C4.5                  -     65    48
  PART                  -     71    -

- F-GC vs H-GC (recognition rate in %):

         T1    T2    T3
  F-GC   89    57    30
  H-GC   93    63    38
6. Evaluation
Results (discussion)
- O2O (mean value over the 30 = 6*5 experiments)
  - low recognition rate for O2O compared to the 66%/33% partition -> a generalization problem?
  - the system mainly learns the instrument instance instead of the instrument (each database contains a single instance of each instrument)
- LODO (mean value over the 6 left-out databases)
  - Goal: increase the number of instances of each instrument
  - How: by combining several databases
6. Evaluation
Confusion matrix
[Confusion matrix over the 27 instruments: recognized class vs. original class, with the number of sounds per class]
- Low confusion between sustained and non-sustained sounds
- Largest confusions inside each instrument family
- Lowest recognition rates correspond to the smallest training sets
- Confusion piano / guitar-harp
- Cross-family confusions
6. Evaluation
Confusion matrix
- Cross-family confusions:
  - Cornet -> Bassoon
  - Cornet -> English horn
  - Flute -> Clarinet
  - Oboe -> Flute
  - Trombone -> Flute
(see the sketch below for how such a confusion matrix is computed)
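The observations above are read off a confusion matrix (recognized class vs. original class). Here is a sketch of how such a matrix and the overall recognition rate can be computed with scikit-learn, on toy labels rather than the study's data.

import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["piano", "guitar", "harp", "violin", "flute"]    # subset for the example
rng = np.random.default_rng(4)
y_true = rng.choice(classes, size=200)
y_pred = np.where(rng.random(200) < 0.8, y_true,            # 80 % correct, rest random
                  rng.choice(classes, size=200))
cm = confusion_matrix(y_true, y_pred, labels=classes)        # rows = original class
print(cm)
print(np.diag(cm).sum() / cm.sum())                          # overall recognition rate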
6. Evaluation
Main selected features
- By the FSA (IRMFSP), across the classification tasks (sustained/non-sustained, among non-sustained, among sustained, among bowed strings, among brass, among air reeds, among single/double reeds), the main selected features are:
  - temporal: temporal increase, temporal decrease, temporal centroid, log-attack time
  - spectral: spectral centroid, spread, skewness, kurtosis (and their std), spectral slope, spectral variation (and std), spectral decrease std, sharpness
  - harmonic: harmonic deviation, tristimulus (and std), noisiness
  - various: MFCC 2, 3, 4, 6 (and std), auto-correlation coefficients 3, 6, 8
6. Evaluation
Main selected features
- By decision tree (C4.5):
DTg_decr <= 10.033592
| DPi_specloud_m17-mm <= 0.013381
| | DSi_sc_v4-ss <= 0.164903
| | | DPi_specloud_m5-mm <= 0.0124: htb (18.0/11.0)
| | | DPi_specloud_m5-mm > 0.0124
| | | | DPi_sc_v2-mm <= 443.455871
| | | | | DPi_ss_v1-mm <= 477.501186
| | | | | | DPi_loud_v-mm <= 5.759929: cb-pizz (10.0/6.0)
| | | | | | DPi_loud_v-mm > 5.759929
| | | | | | | DTi_xcorr_m11-mm <= -0.272094: cor (50.0/7.0)
| | | | | | | DTi_xcorr_m11-mm > -0.272094
| | | | | | | | DPg_flustr_v7 <= 0.006614: tubb (10.0/5.0)
| | | | | | | | DPg_flustr_v7 > 0.006614
| | | | | | | | | DPi_Dmfcc_m3-mm <= -0.013356: cor (19.0/7.0)
| | | | | | | | | DPi_Dmfcc_m3-mm > -0.013356: tubb (67.0/3.0)
DTg_decr > 10.033592
| DTg_incr <= -0.744688
| | DPi_tri_v7-mm <= 0.035614
| | | DPg_roughn_v4 <= 0.120563
| | | | DTg_ed <= 0.278571
| | | | | DSi_skew_v6-mm <= -1.673777: vln-pizz (53.0)
| | | | | DSi_skew_v6-mm > -1.673777: alto-pizz (34.0/9.0)
| | | | DTg_ed > 0.278571: harp (10.0/4.0)
| | | DPg_roughn_v4 > 0.120563: picc (11.0/7.0)
| | DPi_tri_v7-mm > 0.035614
6. Evaluation
Main selected features
- By decision tree with grouped decisions (PART):
DTg_incr <= -1.670978 AND
DTg_lat <= -0.982531 AND
DPi_specloud_m1-mm <= 0.012608 AND
DSi_variation_v1-mm > 0.001828 AND
DSi_kurto_v6-mm > 6.786784: vln-pizz (82.0/1.0)
DPi_ss_v4-mm > 0.897333 AND
DHi_devs_v3-mm > 2.790707 AND
DHi_oeratio_v1-mm > 2.250247: clsb (74.0/5.0)
DPi_ss_v4-mm > 0.950127 AND
DPi_DDmfcc_m7-ss > 0.009458 AND
DPg_roughn_v6 > 0.079858 AND
DHg_mod_am > 0.000158 AND
DPi_specloud_m21-mm > 0.026443 AND
DPi_specloud_m5-mm <= 0.114309 AND
DPi_DDmfcc_m3-mm > -0.000202: vln (66.0/8.0)
7. Instrument Class Similarity
- Goal: check that the proposed tree structure (most studies use Martin's hierarchy) corresponds to a natural organization of the classes.
- How?
  - 1) check the grouping among the decision-tree leaves
  - 2) MDS on acoustic features [Herrera, AES 114th]
- Compute the dissimilarity between each pair of classes:
  - How? Compute the between-group F-matrix between the class models.
- Observe the dissimilarity between the classes:
  - How? MDS (multi-dimensional scaling) analysis (see the sketch below).
  - MDS preserves as much as possible the distances between the data and represents them in a lower-dimensional space.
  - MDS is usually used to represent dissimilarity judgements (timbre similarity); here it is applied to acoustic features.
  - MDS with Kruskal's STRESS formula 1 scaling method, 3-dimensional space.
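A sketch of the MDS step on a class-to-class dissimilarity matrix, using scikit-learn's non-metric MDS as a stand-in for the Kruskal STRESS formula 1 scaling mentioned above; the toy matrix D replaces the between-group F-matrix of the study.

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
n_classes = 27
D = rng.random((n_classes, n_classes))
D = (D + D.T) / 2.0
np.fill_diagonal(D, 0.0)                       # a valid symmetric dissimilarity matrix

mds = MDS(n_components=3, dissimilarity="precomputed",
          metric=False, random_state=0)        # non-metric MDS (Kruskal-style stress)
coords = mds.fit_transform(D)                  # (27, 3) configuration of the classes
print(coords.shape, round(mds.stress_, 3))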
7. Instrument Class Similarity
[3-D MDS configuration of the instrument classes]
Abbreviations: PIAN = Piano, GUI = Guitar, HARP = Harp, VLNP / VLAP / CELLP / DBLP = Violin / Viola / Cello / Double bass pizzicato, VLN / VLA / CELL / DBL = Violin / Viola / Cello / Double bass, TRPU = Trumpet, COR = Cornet, TBTB = Trombone, FHOR = French horn, TUBB = Tuba, FLTU = Flute, PICC = Piccolo, RECO = Recorder, CLA = Clarinet, SAXTE / SAXAL / SAXSO = Tenor / Alto / Soprano sax, ACC = Accordion, OBOE = Oboe, BS = Bassoon, EHOR = English horn.
- Clusters?
  - non-sustained sounds
  - bowed-string sounds
  - brass sounds (TRPU?)
  - a mix between single/double-reed and brass instruments
7. Instrument Class Similarity
- Dimension 1: separates sustained / non-sustained sounds
  - negative values: PIAN, GUI, HARP, VLNP, VLAP, CELLP, DBLP
  - -> attack time, decrease time
- Dimension 2: brightness
  - dark sounds: TUBB, BS, TBTB, FHOR
  - bright sounds: PICC, CLA, FLTU
  - problem: DBL?
- Dimension 3: ?
  - separation of the bowed strings (VLN, VLA, CELL, DBL)
  - amount of modulation?
Conclusion
- State of the art:
  - Martin [1999]: 76% (family), 39% for 14 instruments
  - Eronen [2001]: 77% (family), 35% for 16 instruments
- This study: 85% (family), 64% for 23 instruments
  - the increased recognition rates are mainly explained by the use of new features
- Perspectives:
  - derive the tree structure automatically (analysis of decision trees?)
  - test other classification algorithms (GMM, SVM, ...)
  - test the system on other sound classes (non-instrumental sounds, sound FX)
  - extend the system to musical phrases
  - extend the system to polyphonic sounds
  - extend the system to multi-source sounds
- Links:
  - http://www.cuidado.mu
  - http://www.cs.waikato.ac.nz/ml/weka/