Tin Kam Ho
Bell Labs, Lucent Technologies
With contributions from Mitra Basu, Ester Bernado, Martin Law
What Is the Story in This Image?
Automatic Pattern Recognition
A fascinating theme.
What will the world be like if machines can do it?
• Robots can see, read, hear, and smell.
• We can get rid of all spam email, and find exactly what we need from the web.
• We can protect everything with our signatures, fingerprints, iris patterns, or just our faces.
• We can track anybody by look, gesture, and gait.
• We can be warned of disease outbreaks and terrorist threats.
• Feed in our vital data and we will have a perfect diagnosis.
• We will know whom we can trust when lending our money.
Automatic Pattern Recognition
And more …
• We can tell the true emotions of everyone we encounter!
• We can predict stock prices!
• We can identify all potential criminals!
• Our weapons will never lose their targets!
• We will be alerted to any extraordinary events going on in the heavens, on or inside the Earth!
• We will have machines discovering all knowledge!
… But how far are we from these?
Automatic Pattern Recognition
[Diagram: samples, features]
• Statistical Classifiers
– Bayesian classifiers
– polynomial discriminators
– nearest-neighbor methods
– decision trees & forests
– neural networks
– genetic algorithms
– support vector machines
– ensembles and classifier combination
• Why are machines still far from perfect?
• What is still missing in our techniques?
Large Variations in Accuracies of Different Classifiers
[Table: test accuracies (%) of eight classifiers (ZeroR, NN1, NNK, NB, C4.5, PART, SMO, XCS) on 30 UCI benchmark data sets (aud, aus, bal, bpa, bps, bre, cmc, gls, h-c, hep, irs, krk, lab, led, lym, mmg, mus, mux, pmi, prt, seg, sick, soyb, tao, thy, veh, vote, vow, wne, zoo), with a final row of per-classifier averages. Accuracies range from roughly 9% to 100% depending on the classifier/data set pair.]
Many classifiers are in close rivalry with each other. Why?
• Do they represent the limit of our technology?
• What do the new classifiers add to the methodology?
• Is there still value in the older methods?
• Have they used up all information contained in a data set?

When I face a new recognition task …
• How much can automatic classifiers do?
• How should I choose a classifier?
• Can I make the problem easier for a specific classifier?
Sources of Difficulty in Classification
• Class ambiguity
• Boundary complexity
• Sample size and dimensionality
Class Ambiguity
• Is the concept intrinsically ambiguous?
• Are the classes well defined?
• What information do the features carry?
• Are the features sufficient for discrimination?
These factors together determine the Bayes error, the lowest error rate any classifier can achieve on the problem.
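For reference, a standard way to write the Bayes error for classes $c_1, \ldots, c_K$ with posteriors $P(c_i \mid x)$ and feature density $p(x)$ (my addition, not on the slide):

$$E_{\text{Bayes}} = \int \Big(1 - \max_{i} P(c_i \mid x)\Big)\, p(x)\, dx$$

No classifier, however sophisticated, can push the error below this floor; the remaining gap is what better techniques can hope to close.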
Boundary Complexity
• Kolmogorov complexity
• Description length can be exponential in dimensionality
• Trivial description: list all points & class labels
• Is there a shorter description?
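To make the exponential worst case concrete (my illustration, not from the slides), a counting argument over a binary grid in $d$ dimensions gives

$$2^{d} \text{ cells} \;\Rightarrow\; 2^{2^{d}} \text{ possible labelings} \;\Rightarrow\; K(\text{boundary}) \approx 2^{d} \text{ bits for most labelings,}$$

since most of the $2^{2^{d}}$ labelings have no description shorter than the raw list of cell labels.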
Classification Boundaries As Decided by Different Classifiers
[Figure: training samples for a 2D classification problem; axes feature 1 and feature 2.]
Classification Boundaries Inferred by Different Classifiers
• XCS: a genetic algorithm
• Nearest-neighbor classifier
• Linear classifier
Match between Classifiers and Problems
Problem A: XCS error = 1.9%, NN error = 0.06% (NN is better!)
Problem B: XCS error = 0.6%, NN error = 0.7% (XCS is better!)
Measures of Geometrical Complexity of Classification Problems
Our approach: develop mathematical language and algorithmic tools for studying
• characteristics of the geometry & topology of high-dimensional data
• how they change with feature transformations, noise conditions, and sampling strategies
• how they interact with classifier geometry
Focus on descriptors that are computable from real data and relevant to classifier geometry.
Geometry of Datasets and Classifiers
• Data sets:
– length of class boundary
– fragmentation of classes / existence of subclasses
– global or local linear separability
– convexity and smoothness of boundaries
– intrinsic / extrinsic dimensionality
– stability of these characteristics as the sampling rate changes
• Classifier models:
– polygons, hyper-spheres, Gaussian kernels, axis-parallel hyper-planes, piece-wise linear surfaces, polynomial surfaces, their unions or intersections, …
Measures of Geometric Complexity

Fisher's Discriminant Ratio
• Classical measure of class separability:
  $f = \dfrac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$
• Maximize over all features to find the most discriminating one.

Degree of Linear Separability
• Find a separating hyper-plane by linear programming.
• Error counts and distances to the plane measure separability.

Length of Class Boundary
• Compute the minimum spanning tree of the data.
• Count the class-crossing edges (a code sketch follows this list).

Shapes of Class Manifolds
• Cover same-class points with maximal balls.
• Ball counts describe the shape of the class manifold.
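To make two of these measures concrete, here is a minimal Python sketch (my code, not the authors'; `fisher_ratio` and `boundary_fraction` are names I chose) for the Fisher ratio maximized over features and the fraction of class-crossing edges in the minimum spanning tree:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def fisher_ratio(X, y):
    """Max over features of (mu1 - mu2)^2 / (s1^2 + s2^2), two-class data."""
    X1, X2 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return float(np.max(num / np.maximum(den, 1e-12)))  # guard zero variance

def boundary_fraction(X, y):
    """Fraction of minimum-spanning-tree edges joining opposite classes."""
    mst = minimum_spanning_tree(squareform(pdist(X)))
    i, j = mst.nonzero()
    return float(np.mean(y[i] != y[j]))
```

A high `fisher_ratio` or a `boundary_fraction` near zero both suggest an easy problem; random labelings drive `boundary_fraction` toward the chance level.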
Measures of Geometrical Complexity
Experiments with Controlled Data Sets
• Real-world data sets:
  benchmarking data from the UC Irvine archive;
  844 two-class problems;
  452 linearly separable, 392 non-separable.
• Synthetic data sets (generation sketched below):
  random labeling of randomly located points;
  100 problems in 1–100 dimensions.
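A minimal sketch of how such a synthetic problem can be generated (my assumed setup: points drawn uniformly from the unit hypercube, labels assigned by a fair coin; the point count of 1000 is also my choice):

```python
import numpy as np

def random_labeling_problem(n_points, dim, rng):
    """Uniformly random points in the unit hypercube with random 0/1 labels."""
    X = rng.random((n_points, dim))
    y = rng.integers(0, 2, size=n_points)
    return X, y

rng = np.random.default_rng(0)
# One problem per dimensionality from 1 to 100, as on the slide.
problems = [random_labeling_problem(1000, d, rng) for d in range(1, 101)]
```

Since the labels carry no information about the point locations, these problems sit at the maximum-complexity end of any reasonable measure.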
Patterns in Complexity Measure Space
[Pairwise scatter plots of the problems in complexity-measure space; legend: linearly separable, linearly non-separable, random labeling.]
Problem Distribution in 1st & 2nd Principal Components of Complexity Space
[Scatter plot of the problems projected onto the first two principal components.]
Loadings of the First 6 Principal Components
[Table: loadings of each complexity measure on the first six principal components.]
Interpretation of the First 4 Principal Components
(1) 50% of variance: linearity of the boundary and proximity of opposite-class neighbors
(2) 12% of variance: balance between within-class scatter and between-class distance
(3) 11% of variance: concentration & orientation of intrusion into the opposite class
(4) 9% of variance: within-class scatter
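The analysis behind these loadings is ordinary PCA on the matrix of problems by complexity measures; a sketch under my assumptions (standardized measures; `complexity_measures.csv` is a hypothetical file name):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# One row per classification problem, one column per complexity measure.
M = np.loadtxt("complexity_measures.csv", delimiter=",")  # hypothetical file

pca = PCA(n_components=6)
scores = pca.fit_transform(StandardScaler().fit_transform(M))

print(pca.explained_variance_ratio_)  # slide reports roughly 0.50, 0.12, 0.11, 0.09, ...
print(pca.components_)                # loadings: each measure's weight per component
```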
Problem Distribution in 1st & 2nd Principal Components of Complexity Space
[Scatter plot: linearly separable problems cluster at one end, random labelings at the other.]
• Continuous distribution
• Known easy & difficult problems occupy opposite ends
• Few outliers
• Empty regions
Questions for the Theoretician
• Is the distribution necessarily continuous?
• What caused the outliers?
• Will the empty regions ever be filled?
• How are the complexity measures related?
• What is the intrinsic dimensionality of the distribution?
Questions for the Practitioner
• Where does a particular problem fit in this continuum?
• Can I use this to guide feature selection & transformation?
• How do I set expectations on recognition accuracy?
• Can I use this to help choose classifiers?
Domains of Competence of Classifiers
• Given a classification problem, determine which classifier is best for it.
[Conceptual sketch: a plane spanned by complexity measures 1 and 2, divided into regions where LC, XCS, Decision Forest, or NN wins, with a new problem marked "?".]
Domain of Competence Experiment
• Use a set of 9 complexity measures:
  Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim
• Characterize 392 two-class problems from UCI data, all shown to be linearly non-separable.
• Evaluate 6 classifiers (a protocol sketch follows this list):
  NN (1-nearest neighbor)
  LP (linear classifier by linear programming)
  Odt (oblique decision tree)
  Pdfc (random subspace decision forest; ensemble method)
  Bdfc (bagging-based decision forest; ensemble method)
  XCS (a genetic-algorithm-based classifier)
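A sketch of the evaluation protocol with scikit-learn stand-ins (my substitutions: LogisticRegression for the LP classifier, an axis-parallel DecisionTreeClassifier for the oblique tree, RandomForestClassifier for the random-subspace forest; XCS has no sklearn counterpart and is omitted):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the talk's classifiers; the originals (LP, Odt, Pdfc, Bdfc) differ.
candidates = {
    "NN":   KNeighborsClassifier(n_neighbors=1),
    "LP":   LogisticRegression(max_iter=1000),        # linear, like the LP classifier
    "Odt":  DecisionTreeClassifier(),                 # axis-parallel, not oblique
    "Pdfc": RandomForestClassifier(n_estimators=100), # random-subspace-style forest
    "Bdfc": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
}

def best_classifier(X, y, cv=10):
    """Name of the candidate with the highest cross-validated accuracy."""
    scores = {name: cross_val_score(clf, X, y, cv=cv).mean()
              for name, clf in candidates.items()}
    return max(scores, key=scores.get)
```

Pairing each problem's winner with its nine measure values yields the domain-of-competence maps on the next slides.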
Classifier Domains of Competence
[Figure: the best classifier for each benchmarking problem, plotted in complexity-measure space.]
Best Classifier Being nn, lp, odt vs. an Ensemble Technique
[Scatter plots in three measure pairs: Boundary vs. NonLinNN, IntraInter vs. Pretop, and MaxEff vs. VolumeOverlap; markers: • = ensemble best, + = nn, lp, or odt best.]
Other Studies on Data Complexity
• Multi-class measures
• Global vs. local properties
• Intrinsic ambiguity & mislabeling
• Task trajectory with changing sampling & noise conditions
[Figure panels labeled k = 99 and k = 1.]
Extension to Multiple Classes
• Fisher's discriminant score → multiple discriminant scores
• Boundary point in an MST: a point is a boundary point as long as it is adjacent to a point from another class in the MST
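The slide does not spell out the multi-class score; one standard generalization (my assumption) compares between-class to within-class scatter over all $K$ classes, with $n_i$, $\mu_i$, $\sigma_i^2$ the size, mean, and variance of class $i$ on a feature and $\mu$ the overall mean:

$$f = \frac{\sum_{i=1}^{K} n_i\,(\mu_i - \mu)^2}{\sum_{i=1}^{K} n_i\,\sigma_i^2}$$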
Global vs. Local Properties
• Boundaries can be simple locally but complex globally
  – Such problems are relatively simple, yet the measures characterize them as complex
• Solution: compute the complexity measure at different scales
  – This can be combined with different error levels
• Let N_{i,k} be the k nearest neighbors of the i-th point under, say, Euclidean distance. The complexity measure for data set D, error level ε, evaluated at scale k, is then computed over these local neighborhoods (one plausible form is sketched below).
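Under my assumption that the scaled measure averages a base complexity measure $f$ over all $k$-point neighborhoods at error tolerance $\varepsilon$, it could read:

$$f(D, \varepsilon, k) = \frac{1}{|D|} \sum_{i=1}^{|D|} f\big(N_{i,k}, \varepsilon\big)$$

so that small $k$ probes local structure and $k \to |D|$ recovers the global measure.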
Intrinsic Ambiguity
• The complexity measures can be severely affected when there is intrinsic class ambiguity (or data mislabeling)
  – Example: FeatureOverlap (in 1D only; a sketch follows this list)
• The measures cannot distinguish between intrinsic ambiguity and a genuinely complex class decision boundary
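For concreteness, a minimal 1D feature-overlap sketch (my formulation, assuming the measure is the length of the intersection of the two classes' value ranges normalized by the length of their union):

```python
import numpy as np

def feature_overlap_1d(x, y):
    """Overlap of the two classes' value ranges on one feature:
    0 = disjoint ranges, 1 = identical ranges."""
    a, b = x[y == 0], x[y == 1]
    inter = min(a.max(), b.max()) - max(a.min(), b.min())
    union = max(a.max(), b.max()) - min(a.min(), b.min())
    return max(0.0, inter) / union
```

A single mislabeled point at an extreme value can push this measure close to 1, which is exactly the sensitivity the slide warns about.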
Tackling Intrinsic Ambiguity
• Compute the complexity measure at different error levels
  – f(D): a complexity measure on the data set D
  – D*: a "perturbed" version of D, in which some points are relabeled
  – h(D, D*): a distance measure between D and D* (the error level)
  – The new complexity measure is defined as a curve over the error level:
    $f^*(\varepsilon) = \min_{D^*:\, h(D, D^*) \le \varepsilon} f(D^*)$
  – The curve can be summarized by, say, the area under it
• Minimization by greedy procedures (a sketch follows this list)
  – Discard the erroneous points whose removal decreases complexity the most
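A toy version of the greedy procedure (my code, under simple assumptions: the base measure is a callable such as the `boundary_fraction` sketched earlier, the perturbation is point deletion, and the error level is the fraction of points deleted):

```python
import numpy as np

def greedy_complexity_curve(X, y, complexity, max_frac=0.1):
    """At each step, delete the single point whose removal lowers the
    complexity measure the most; return the (error level, complexity) curve."""
    keep = np.ones(len(y), dtype=bool)
    curve = [(0.0, complexity(X, y))]
    for _ in range(int(max_frac * len(y))):
        best_i, best_c = None, curve[-1][1]
        for i in np.flatnonzero(keep):
            keep[i] = False
            c = complexity(X[keep], y[keep])
            keep[i] = True
            if c < best_c:
                best_i, best_c = i, c
        if best_i is None:          # no single deletion helps; stop early
            break
        keep[best_i] = False
        curve.append(((~keep).mean(), best_c))
    return curve  # area under this curve summarizes the ambiguity
```

The curve drops quickly when the measured complexity is driven by a few ambiguous or mislabeled points, and stays flat when the boundary is genuinely complex.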
Sampling Density
A problem may appear deceptively simple or complex with small samples.
[Figure: the same problem sampled with 2, 10, 100, 500, and 1000 points.]
Real Problems Have a Mixture of Difficulties
Sparse samples & complex geometry cause ill-posedness.
This requires further hypotheses on the data geometry.
To Conclude:
• We have had some early success in using geometrical measures to characterize classification complexity, pointing to a potentially fruitful research area.
  "Data Complexity in Pattern Recognition", M. Basu, T.K. Ho (eds.), Springer-Verlag, in press.
Future Directions:
• More, better measures;
• Detailed studies of their utilities, interactions, and estimation uncertainty;
• Deeper understanding of constraints on these measures from point-set geometry and topology;
• Apply these to understand practical recognition tasks;
• Apply these to find transformations that simplify boundaries;
• Apply these to make better pattern recognition algorithms.