G54DMT – Data Mining Techniques and
Dr. Jaume Bacardit
Topic 2: Data Preprocessing
Lecture 2: Dimensionality Reduction and Imbalanced Classification
Outline of the lecture
• Dimensionality Reduction
– Definition and taxonomy
– Linear Methods
– Non-Linear Methods
• Imbalanced Classification
– Definition and taxonomy
– Over-sampling methods
– Under-sampling methods
• Resources
Dimensionality reduction
• Dimensionality reduction methods take an original
dataset and convert every instance from the original
R^d space to an R^d' space, where d' < d
• For each instance x in the dataset X:
– y = f(x), where x = {x1, x2, …, xd} and y = {y1, y2, …, yd'}
• The definition of f is computed from X (the training
set), and it is what determines each of the different
reduction methods
• In general we find two main classes of dimensionality
reduction methods: linear and non-linear
Principal Component Analysis
• Classic linear dimensionality reduction
method (Pearson, 1901)
• Given a set of original variables X1 … Xp (the
attributes of the problem)
• PCA finds a set of vectors Z1 … Zp that are
defined as linear combinations of the original
variables and that are uncorrelated with each
other: the principal components (PCs)
• The PCs are also sorted by decreasing variance, such that
Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp)
Principal Components
Applying the PCs to transform data
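• Using all PCs
\[
\begin{pmatrix}
z_{11} & z_{12} & \cdots & z_{1n} \\
z_{21} & z_{22} & \cdots & z_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
z_{n1} & z_{n2} & \cdots & z_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} x'_1 \\ x'_2 \\ \vdots \\ x'_n \end{pmatrix}
\]
• Using only 2 PCs
\[
\begin{pmatrix}
z_{11} & z_{12} & \cdots & z_{1n} \\
z_{21} & z_{22} & \cdots & z_{2n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} x'_1 \\ x'_2 \end{pmatrix}
\]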
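The matrix-vector product above can be sketched in a few lines of plain Python. This is only an illustration: the `project` helper, the two-attribute example and the PC values are assumptions made for the sketch, and each row of `pcs` is taken to hold one principal component.

```python
def project(components, x):
    """Multiply the PC matrix (one PC per row) by the instance x."""
    return [sum(z * xi for z, xi in zip(pc, x)) for pc in components]

# Hypothetical 2-attribute example: two orthogonal, unit-length PCs.
pcs = [
    [0.7071068, 0.7071068],   # PC1
    [-0.7071068, 0.7071068],  # PC2
]
x = [1.0, 1.0]

x_all = project(pcs, x)       # using all PCs: same dimensionality as x
x_top1 = project(pcs[:1], x)  # using only the first PC: a single value
```

Keeping only the first rows of the matrix is what reduces the dimensionality; the retained coordinates are identical to the ones obtained with the full matrix.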
How many components do we use?
• Using all components is useful if
– The problem is small
– We are interested in using an axis-parallel
knowledge representation (e.g. rules or decision trees)
• But often we are interested in
using just a subset of the PCs
– PCs are ranked by their variance
– We can select the top N
– Or we can select the number of PCs that account for e.g. 95%
of the cumulative variance
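As a rough sketch of the cumulative-variance criterion, the function below counts how many top-ranked PCs are needed to reach a target fraction of the total variance. The function name and the variance values are hypothetical, chosen only for illustration.

```python
def components_for_variance(variances, target=0.95):
    """Number of top-ranked PCs needed to reach `target` of the total variance.

    `variances` must already be sorted in decreasing order, as PCs are.
    """
    total = sum(variances)
    cumulative = 0.0
    for count, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= target:
            return count
    return len(variances)

# Hypothetical variances of four PCs (cumulative: 62.5%, 87.5%, 95%, 100%).
variances = [2.5, 1.0, 0.3, 0.2]
n_keep_90 = components_for_variance(variances, target=0.9)
n_keep_50 = components_for_variance(variances, target=0.5)
```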
So what happens to the data when
we transform it?
The data is rotated, so the PCs become the axes of the new domain
How PCA is computed
• Normalize the data so all dimensions have mean 0
and variance 1
• Using Singular Value Decomposition (described
in the missing values lecture)
• Using the covariance method
– Compute the co-variance matrix of the data
c_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)
– Compute the eigenvectors (PC) and eigenvalues
(Variances) of the covariance matrix
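The covariance method can be sketched with the standard library alone. This is a minimal illustration, not a production implementation: it recovers only the first PC, and it swaps a full eigendecomposition for power iteration, which assumes a distinct largest eigenvalue and a start vector not orthogonal to its eigenvector. All function names and the data are hypothetical.

```python
import math

def standardize(rows):
    """Scale every column to mean 0 and sample variance 1."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(d)]
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]

def covariance_matrix(rows):
    """c_jk = sum_i (x_ij - mean_j)(x_ik - mean_k) / (n - 1)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
             for k in range(d)] for j in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: (largest eigenvalue, unit eigenvector) of cov."""
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-symmetric start
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
    eigenvalue = sum(v[j] * cov_v[j] for j in range(d))
    return eigenvalue, v

# Hypothetical data: two strongly correlated attributes.
data = [[2.0, 1.9], [4.0, 4.2], [6.0, 5.8], [8.0, 8.1]]
cov = covariance_matrix(standardize(data))
variance_pc1, pc1 = first_component(cov)
```

For two standardized attributes the covariance matrix has ones on the diagonal, so the variance of the first PC is one plus the absolute correlation and the PC lies along a diagonal of the plane.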
Implementations of PCA in WEKA
• Simple implementation in the interface, which
can't be used to transform more than one file using
the same set of PCs (e.g. training and test set)
• Command line version (-R 0.5 retains enough PCs to reach a cumulative variance of 50%):
– java weka.filters.supervised.attribute.AttributeSelection
-E "weka.attributeSelection.PrincipalComponents -R 0.5"
-b -i <input training> -o <output training> -r <input test>
-s <output test> -c last
Implementations of PCA in R
> pca <- prcomp(data, scale. = TRUE)
> pca
Standard deviations:
[1] 1.3613699 0.3829777

Rotation:
         PC1        PC2
V1 0.7071068 -0.7071068
V2 0.7071068  0.7071068
> plot(pca)
> # Select only the first PC
> data_filtered <- predict(pca, data)[, 1]
Independent Component Analysis
• PCA tries to identify the components that
characterise the data
• ICA assumes that the data is not a single entity but
a linear combination of statistically
independent sources, and tries to identify them
• How is the independence measured?
– Minimization of Mutual Information
– Maximization of non-Gaussianity
• FastICA is a very popular implementation
(available in R)
Multidimensional Scaling (MDS)
• Family of dimensionality reduction methods originating/used
mainly in the information visualisation field
• It contains both linear and nonlinear variants (some of which
are equivalent to PCA)
• All variants start by computing an N×N distance matrix D that
contains all pair-wise distances between the instances in the
training set
• Then the method finds the mapping from the original space
into an M-dimensional space (e.g. M = 2 or 3) so that the distances
between instances in the new space are as close as possible
to D
• Available in R as well (cmdscale,isoMDS)
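The objective MDS optimises can be sketched as follows: build the distance matrix D, then score a candidate low-dimensional embedding by how far its pair-wise distances are from D. The function names and the raw squared-error criterion are assumptions made for illustration; real MDS variants differ in the exact stress function and in how the embedding is searched.

```python
import math

def distance_matrix(points):
    """N x N matrix of pair-wise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]

def raw_stress(D, embedding):
    """Sum of squared differences between D and the embedding's distances."""
    E = distance_matrix(embedding)
    n = len(D)
    return sum((D[i][j] - E[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Hypothetical data: three 3-D points lying on a straight line.
points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
D = distance_matrix(points)

# A 1-D embedding that preserves every distance, and one that collapses them.
good = [(0.0,), (math.sqrt(3.0),), (2.0 * math.sqrt(3.0),)]
bad = [(0.0,), (0.0,), (0.0,)]
stress_good = raw_stress(D, good)
stress_bad = raw_stress(D, bad)
```

An MDS method would search for the embedding with the lowest stress; here the first candidate already reproduces D, so its stress is essentially zero.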
Self-Organizing Maps (SOM)
• Truly non-linear dimensionality reduction method
• Actually it is a type of unsupervised artificial neural network
• Imagine it as a mesh adapting to a complex surface
SOM algorithm (from Wikipedia)
1. Randomize the map's nodes' weight vectors (or initialize
them using e.g. the two main PC)
2. Grab an input vector
3. Traverse each node in the map
– Use the Euclidean distance to measure the similarity between the input
vector and the map node's weight vector
– Track the node that produces the smallest distance (this node is the
best matching unit, BMU)
4. Update the nodes in the neighbourhood of the BMU by pulling
them closer to the input vector
Wv(t + 1) = Wv(t) + Θ(t)α(t)(D(t) - Wv(t))
where Wv is the weight vector of node v, D(t) the input vector, Θ(t) the
neighbourhood function and α(t) the learning rate
5. Increase t and repeat from step 2 while t < λ (the iteration limit)
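The five steps above can be sketched in plain Python. The Gaussian neighbourhood function, the linear decay of the learning rate and neighbourhood width, and every name in the code are assumptions made for this illustration; the algorithm itself leaves those choices open.

```python
import math
import random

def train_som(data, rows, cols, max_iter, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: random initial weight vector for every node of the grid.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(max_iter):  # step 5: repeat while t < max_iter
        x = rng.choice(data)   # step 2: grab an input vector
        # Step 3: the BMU is the node whose weight vector is closest to x.
        bmu = min(weights, key=lambda node: math.dist(weights[node], x))
        decay = 1.0 - t / max_iter
        alpha = 0.5 * decay                          # assumed learning rate
        sigma = 1.0 + max(rows, cols) / 2.0 * decay  # assumed neighbourhood width
        # Step 4: pull nodes towards x, more strongly the closer they sit
        # to the BMU on the grid (Gaussian neighbourhood, an assumption).
        for node, w in weights.items():
            grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            theta = math.exp(-grid_d2 / (2.0 * sigma ** 2))
            for j in range(dim):
                w[j] += theta * alpha * (x[j] - w[j])
    return weights

# Hypothetical 2-D data with two small groups of points.
data = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
som = train_som(data, rows=3, cols=3, max_iter=200)
```

Each update moves a weight vector part of the way towards the input, so the mesh of nodes gradually adapts to the region occupied by the data.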
Imbalanced Classification
• Tackling classification problems where the
class distribution is extremely uneven
• These kinds of problems are very difficult for
standard data mining methods
[Figure: two example datasets, one with 50% of blue dots and one with 10% of blue dots]
Effect of Class imbalance
• Performance of XCS (an evolutionary learning system) on the
Multiplexer synthetic dataset with different degrees of class imbalance
• IR = ratio between the sizes of the majority and the minority class
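A small sketch of how the imbalance ratio (IR) could be computed from a list of class labels. The function name and the label values are hypothetical.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical datasets: perfectly balanced, and with 10% of blue dots.
balanced = ["blue"] * 50 + ["red"] * 50
skewed = ["blue"] * 10 + ["red"] * 90
ir_balanced = imbalance_ratio(balanced)
ir_skewed = imbalance_ratio(skewed)
```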
Three approaches to Imbalanced Classification
• Cost-sensitive classification
– Adapting the machine learning methods to
penalise more misclassifications of the minority
class (later in the module)
• Over-sampling methods
– Generate more examples from the minority class
• Under-sampling methods
– Remove some of the examples from the majority class
Synthetic Minority Over-Sampling
Technique (SMOTE)
• (Chawla et al., 02)
• Generates synthetic instances from the minority class
to balance the dataset
• Instances are generated using real examples from the
minority class as seed
• For each real example, its k nearest neighbours within the minority class are identified
• Synthetic instances are generated to be at a random
point between the seed and the neighbour
(Orriols-Puig, 08)
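The generation step can be sketched as below. This is a simplified illustration rather than the published procedure: seeds are drawn at random instead of iterating over every minority example, at least two minority examples are assumed, and all names and data are hypothetical.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority instances by seed/neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        seed_example = rng.choice(minority)
        # k nearest neighbours of the seed among the other minority examples.
        others = [p for p in minority if p is not seed_example]
        neighbours = sorted(others, key=lambda p: math.dist(seed_example, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random point on the segment between the seed and the neighbour.
        gap = rng.random()
        synthetic.append([s + gap * (n - s)
                          for s, n in zip(seed_example, neighbour)])
    return synthetic

# Hypothetical minority class with four 2-D examples.
minority = [[1.0, 1.0], [1.5, 1.2], [1.2, 1.8], [2.0, 2.0]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic instance lies on a segment between two real minority examples, the new points stay inside the region already covered by the minority class.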
Under-sampling based on Tomek Links
• (Batista et al., 04)
• A Tomek Link is a pair of examples <Ei,Ej> of different
classes from the dataset for which there is no other
example Ek in the dataset that is closer to either of
them than they are to each other
• The collection of Tomek Links in the dataset defines
the class frontiers
• This undersampling method removes all examples
from the majority class that are not Tomek links
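Detecting Tomek Links can be sketched with a brute-force search over all pairs of examples. The code only finds the links; deciding which majority examples to drop afterwards is left to the caller. The function name, the labels and the one-dimensional data are hypothetical.

```python
import math

def tomek_links(X, y):
    """Index pairs (i, j) of different classes with no third example closer
    to either of them than they are to each other."""
    links = []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue
            d_ij = math.dist(X[i], X[j])
            closer_exists = any(
                math.dist(X[i], X[k]) < d_ij or math.dist(X[j], X[k]) < d_ij
                for k in range(n) if k != i and k != j
            )
            if not closer_exists:
                links.append((i, j))
    return links

# Hypothetical 1-D dataset: examples 1 and 2 sit on the class frontier.
X = [[0.0], [1.0], [1.4], [3.0]]
y = ["majority", "majority", "minority", "minority"]
links = tomek_links(X, y)
```

The search is cubic in the number of examples, which is acceptable for a sketch but would need a neighbour index for large datasets.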
Resources
• Comprehensive list of nonlinear
dimensionality reduction methods
• Good lecture slides about PCA and SVD
• Survey on class imbalance
• Class imbalance methods in KEEL