Transcript slides

G54DMT – Data Mining Techniques and
Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
[email protected]
Topic 2: Data Preprocessing
Lecture 1: Introduction and data quantification
Some slides taken from Jiawei Han, “Data Mining: Concepts and Techniques”, Chapter 2
Outline of the lecture
• Introduction to data preprocessing (J. Han)
• Evaluating preprocessing methods
• Looking at data
– Statistical quantification methods (J. Han)
– Data complexity metrics
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or
names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Why Is Data Dirty?
• Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected
and when it is analyzed.
– Human/hardware/software problems
• Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
• Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
– Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprises the
majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Value added
– Interpretability
– Accessibility
• Broad categories:
– Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
Forms of Data Preprocessing
Evaluating preprocessing methods
• A typical data mining pipeline would be:
– The dataset is split by cross-validation into a training set and a test set
– The classification method builds a model from the training set
– The model is validated on the test set, yielding an accuracy figure
• Where would you put the preprocessing?
Evaluating preprocessing methods
• Right at the beginning?
– The whole dataset is preprocessed first, and only then split by cross-validation
into a training set and a test set; the classification method builds a model from
the training set, which is validated on the test set to obtain the accuracy
• Problem with this: as the whole dataset is used for the preprocessing, there is a
danger that information can leak from the training set to the test set
Evaluating preprocessing methods
• A better way
– The preprocessing is applied to the training set only, producing a filtered
training set and a preprocessing model
– The preprocessing model is then applied to the test set to obtain a filtered test set
– The classification method builds a model from the filtered training set, and this
model is validated on the filtered test set to obtain the accuracy
• This is called external cross-validation (a minimal code sketch follows below)
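A minimal sketch (not from the slides) of external cross-validation using scikit-learn: wrapping the preprocessing and the classifier in a Pipeline ensures the preprocessing is fitted on the training folds only and merely applied to the held-out fold. The choice of StandardScaler and k-NN here is purely illustrative.

```python
# Hypothetical sketch: external cross-validation with scikit-learn.
# The preprocessing (standardisation) is fitted inside each training fold only,
# so no information leaks from the training folds into the test fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("preprocessing", StandardScaler()),    # fitted on the training folds only
    ("classifier", KNeighborsClassifier()),
])

# 10-fold cross-validation; the whole pipeline is re-fitted for every fold
scores = cross_val_score(pipeline, X, y, cv=10)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```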
Mining Data Descriptive Characteristics
• Motivation
– To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities of precision
– Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
– Sample mean: x̄ = (1/n) Σ_{i=1..n} x_i    Population mean: μ = Σx / N
– Weighted arithmetic mean: x̄ = Σ_{i=1..n} w_i x_i / Σ_{i=1..n} w_i
– Trimmed mean: chopping extreme values
• Median: a holistic measure
– Middle value if odd number of values, or average of the middle two
values otherwise
– Estimated by interpolation (for grouped data):
median ≈ L_1 + ((n/2 - (Σf)_l) / f_median) · c
where L_1 is the lower boundary of the median class, (Σf)_l is the sum of the
frequencies of the classes below it, f_median is its frequency and c is its width
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula: mean - mode ≈ 3 × (mean - median)
(A short code sketch of these measures follows below)
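A brief sketch (not part of the original slides) of how these central tendency measures can be computed with NumPy/SciPy; the attribute values and weights are made up for illustration.

```python
# Hypothetical example: central tendency measures with NumPy/SciPy
import numpy as np
from scipy import stats

x = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])   # made-up attribute values
w = np.array([1, 1, 2, 1, 1, 3, 1, 1, 2, 1])   # made-up weights

print("Mean:", np.mean(x))
print("Weighted mean:", np.average(x, weights=w))
print("Trimmed mean (10% cut on each side):", stats.trim_mean(x, 0.1))
print("Median:", np.median(x))
print("Mode:", stats.mode(x, keepdims=False).mode)   # keepdims requires SciPy >= 1.9
```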
Symmetric vs. Skewed Data
• Median, mean and mode of
symmetric, positively and negatively
skewed data
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 - Q1
– Five number summary: min, Q1, M (median), Q3, max
– Boxplot: the ends of the box are the quartiles, the median is marked, whiskers
extend from the box, and outliers are plotted individually
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
– Variance (algebraic, scalable computation):
s² = 1/(n-1) Σ_{i=1..n} (x_i - x̄)² = 1/(n-1) [ Σ x_i² - (1/n)(Σ x_i)² ]
σ² = 1/N Σ_{i=1..N} (x_i - μ)² = (1/N) Σ x_i² - μ²
– Standard deviation s (or σ) is the square root of the variance s² (or σ²)
(A short NumPy sketch of these measures follows below)
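A small sketch (not from the slides) computing these dispersion measures with NumPy on made-up data.

```python
# Hypothetical example: dispersion measures with NumPy
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 42])   # made-up values; 42 is an outlier

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < low_fence) | (x > high_fence)]

print("Five-number summary:", x.min(), q1, median, q3, x.max())
print("IQR:", iqr, "Outliers:", outliers)
print("Sample variance s^2:", np.var(x, ddof=1))       # divides by n-1
print("Population variance sigma^2:", np.var(x))       # divides by N
print("Sample standard deviation s:", np.std(x, ddof=1))
```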
Properties of Normal Distribution
Curve
• The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of it
– From μ–3σ to μ+3σ: contains about 99.7% of it
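These coverage percentages can be verified numerically with SciPy's standard normal distribution (a small sketch, not part of the slides):

```python
# Checking the 68-95-99.7 rule on the standard normal distribution
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)   # P(mu - k*sigma < X < mu + k*sigma)
    print(f"Within {k} standard deviation(s): {coverage:.4f}")
```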
Boxplot Analysis
• Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles,
i.e., the height of the box is the IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to
Minimum and Maximum
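A minimal matplotlib sketch (not from the slides) that draws such a boxplot; the data are made up.

```python
# Hypothetical example: boxplot of made-up data with matplotlib
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)

plt.boxplot(data, whis=1.5)   # whiskers at 1.5 x IQR; points beyond are drawn as outliers
plt.title("Boxplot (five-number summary)")
plt.show()
```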
Visualization of Data Dispersion: Boxplot Analysis
Histogram Analysis
• Graph displays of basic statistical class descriptions
– Frequency histograms
• A univariate graphical method
• Consists of a set of rectangles that reflect the counts or
frequencies of the classes present in the given data
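A tiny illustrative sketch (not from the slides) of a frequency histogram on made-up data.

```python
# Hypothetical example: frequency histogram of made-up data
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=1000)

plt.hist(data, bins=20)   # rectangle heights reflect the counts in each bin
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```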
Quantile Plot
• Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
• Plots quantile information
– For data x_i sorted in increasing order, f_i indicates that
approximately 100·f_i % of the data are below or equal to the
value x_i (see the sketch below)
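A short sketch (not from the slides) of a quantile plot; it uses the common convention f_i = (i - 0.5)/n for the i-th smallest value, and made-up data.

```python
# Hypothetical example: a simple quantile plot
import numpy as np
import matplotlib.pyplot as plt

x = np.sort(np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]))   # made-up data, sorted
n = len(x)
f = (np.arange(1, n + 1) - 0.5) / n   # f_i = (i - 0.5) / n

plt.plot(f, x, marker="o")
plt.xlabel("f-value (fraction of data <= x)")
plt.ylabel("x (sorted data value)")
plt.show()
```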
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
• Allows the user to view whether there is a shift in going from
one distribution to another
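A rough sketch (not from the slides) of a q-q plot built by plotting matched quantiles of two made-up samples against each other.

```python
# Hypothetical example: a quantile-quantile (q-q) plot of two made-up samples
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=50, scale=10, size=300)
sample_b = rng.normal(loc=60, scale=10, size=500)   # shifted distribution

probs = np.linspace(0.01, 0.99, 50)
qa = np.quantile(sample_a, probs)
qb = np.quantile(sample_b, probs)

plt.plot(qa, qb, marker=".")
plt.plot(qa, qa, linestyle="--")   # reference line y = x; points off it reveal the shift
plt.xlabel("quantiles of sample A")
plt.ylabel("quantiles of sample B")
plt.show()
```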
Scatter plot
• Provides a first look at bivariate data to see clusters of points,
outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Loess Curve
• Adds a smooth curve to a scatter plot in order to provide
better perception of the pattern of dependence
• Loess curve is fitted by setting two parameters: a smoothing
parameter, and the degree of the polynomials that are fitted
by the regression
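A small sketch (not from the slides) using statsmodels' lowess implementation, which fixes the local fits to degree 1 and exposes only the smoothing parameter `frac`; the data are made up.

```python
# Hypothetical example: adding a loess/lowess curve to a scatter plot
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)   # made-up noisy relationship

smoothed = lowess(y, x, frac=0.3)   # returns points sorted by x: columns [x, fitted y]

plt.scatter(x, y, s=10)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
plt.show()
```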
Positively and Negatively Correlated Data
Not Correlated Data
Pitfalls of correlation
• http://en.wikipedia.org/wiki/Anscombe%27s_quartet
• These four datasets have the same correlation value
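A quick check of this claim (assuming seaborn's bundled copy of Anscombe's quartet):

```python
# Hypothetical check: the four Anscombe datasets share almost the same correlation
import seaborn as sns

df = sns.load_dataset("anscombe")   # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    print(name, round(group["x"].corr(group["y"]), 3))   # all approximately 0.816
```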
Graphic Displays of Basic Statistical Descriptions
• Histogram: (shown before)
• Boxplot: (covered before)
• Quantile plot: each value x_i is paired with f_i indicating that
approximately 100·f_i % of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
• Loess (local regression) curve: adds a smooth curve to a scatter
plot to provide better perception of the pattern of dependence
Complexity metrics for
classification methods
• Work proposed by (Basu and Ho, 2002).
• Implementation of these metrics
• The previous section focused on extracting
characteristics from the dataset alone.
• This set of complexity metrics has a different
approach. It aims at answering this question:
– What makes this dataset difficult to learn?
A simple example
Imagine that you got a dataset like this.
How would you classify it?
Linear classifier
Rule learning
Certain problems are easier for some
knowledge representations than others
Is it possible to generate metrics that quantify these difficulties?
Some of the aspects being
quantified
• Degree of linear separability
• Length of the class boundary by means of spanning trees
• Shape of class manifolds
(Luengo, 2011)
Groups of metrics:
1. Measures of overlap of individual features
– Quantifies the discrimination power of individual
attributes of a dataset
2. Measures of separability of classes
– Multivariate, considering together all attributes
3. Measures of geometry, topology and density
of manifolds
– These metrics analyse the shape of the class
boundary using different kinds of techniques
Measures of overlap of individual features
• Fisher’s discriminant ratio (F1)
– Assesses the overlap between the distributions
(normal) of values for each class
– F1 = score of the best attribute (maximum score); see the sketch after this list
• Volume of overlap region (F2)
• Feature efficiency (F3)
– Percentage of examples from the dataset that
are outside the overlap region for the best attribute
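A sketch of F1 under one common two-class formulation (my own illustration, with made-up data): per attribute, f = (μ1 - μ2)² / (σ1² + σ2²), and F1 is the maximum over all attributes.

```python
# Hypothetical sketch of Fisher's discriminant ratio (F1) for a two-class dataset
import numpy as np

def fisher_f1(X, y):
    """X: (n_samples, n_features) array, y: binary class labels."""
    classes = np.unique(y)
    assert len(classes) == 2, "this sketch only handles two classes"
    X0, X1 = X[y == classes[0]], X[y == classes[1]]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    var0, var1 = X0.var(axis=0), X1.var(axis=0)
    f = (mu0 - mu1) ** 2 / (var0 + var1 + 1e-12)   # small constant avoids division by zero
    return f.max()   # F1 = score of the best attribute

# made-up data: attribute 0 separates the classes well, attribute 1 does not
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([5, 0], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print("F1 =", fisher_f1(X, y))
```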
Measures of separability of classes
• Linear separability (L1, L2)
– Construct a linear classifier for the problem
– L1 = sum of distances from the line/hyperplane to the misclassified examples
– L2 = rate of misclassified examples
• Mixture identifiability (N1, N2, N3)
– N1: compute a minimum spanning tree over the instances and count the rate of
edges connecting instances of different classes
– N2: for each instance, find the nearest neighbour within its class and the nearest
neighbour outside its class; average both distances over the dataset and divide them
– N3: error rate of a 1-NN classifier using leave-one-out (see the sketch below)
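A short sketch (not from the paper) of N3 computed with scikit-learn: the leave-one-out error rate of a 1-NN classifier, shown here on the iris dataset purely as an example.

```python
# Hypothetical sketch of N3: leave-one-out error rate of a 1-NN classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut())
n3 = 1.0 - scores.mean()   # error rate = 1 - leave-one-out accuracy
print("N3 =", n3)
```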
Measures of geometry, topology and
density of manifolds
• Non-linearity of a linear classifier (L3) or of a
nearest neighbour classifier (N4)
– L3 and N4 create a new test set interpolating
randomly chosen instances from the same class,
and test the accuracy of the L1 and N3 classifiers
• Space covering by ε-neighbourhoods (T1)
• Ratio of instances/features (T2)
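A rough sketch (my own illustration, not from the paper) of the interpolation step behind the non-linearity measures and of the trivial T2 ratio; the dataset and the number of interpolated points are illustrative choices.

```python
# Hypothetical sketch: an interpolated test set (as used by L3/N4)
# and T2 = number of instances / number of features
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Build a synthetic test set: linearly interpolate random pairs of same-class instances
X_test, y_test = [], []
for label in np.unique(y):
    Xc = X[y == label]
    for _ in range(100):
        a, b = Xc[rng.integers(len(Xc), size=2)]
        alpha = rng.random()
        X_test.append(alpha * a + (1 - alpha) * b)
        y_test.append(label)
X_test, y_test = np.array(X_test), np.array(y_test)

# N4-style measure: error of a 1-NN classifier (trained on the original data)
# on the interpolated test set
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("N4 (1-NN error on interpolated set):", 1.0 - knn.score(X_test, y_test))

# T2: ratio of instances to features
print("T2 =", X.shape[0] / X.shape[1])
```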
So how can we use these metrics?
• Application of these metrics to study the performance of
Fuzzy Rule Based Classification Systems (paper)
• A fuzzy classifier called “Fuzzy Hybrid Genetics-Based Machine Learning”
(FH-GBML) was evaluated on a very large set of 450 datasets. The training
and test accuracies on each of these datasets were ranked
Ranking the performance using the
metrics
And some rules can be extracted…
Considerations about complexity
metrics
• Can they predict the performance of
classification methods?
– So far results have not been very conclusive
– Past experience of the meta-learning community
• Scalability issues
– Several of these metrics cannot be computed for
large-scale datasets in reasonable time
Questions?