Classifier Ensembles: Facts, Fiction, Faults and Future
Ludmila I. Kuncheva, School of Computer Science, Bangor University, Wales, UK


Classifier Ensembles:
Facts, Fiction, Faults
and Future
Ludmila I Kuncheva
School of Computer Science
Bangor University, Wales, UK
1. Facts
Classifier ensembles
[Diagram: a single classifier maps feature values (an object description) to a class label.]
Classifier ensembles
[Diagram: several classifiers each receive the feature values (object description); a “combiner” merges their outputs into a single class label.]
Classifier ensembles
[Diagram: the same architecture, but the combiner is a neural network. Is this still an ensemble?]
Classifier ensembles
[Diagram: many classifiers feeding intermediate combiners, topped by a fancy combiner. Is this an ensemble, or just one big classifier?]
Classifier ensembles
[Diagram: is the whole system just one classifier? Seen that way, the ensemble members act as a fancy feature extractor, and the combiner is the classifier proper.]
Why classifier ensembles then?
a. because we like to complicate entities beyond necessity (anti-Occam’s razor);
b. because we are lazy and stupid and can’t be bothered to design and train one single sophisticated classifier;
c. because democracy is so important to our society, it must be important to classification.
Classifier ensembles
Juan: “I just like combining things…”
Classifier ensembles
combination of multiple classifiers [Lam95, Woods97, Xu92, Kittler98]
classifier fusion [Cho95, Gader96, Grabisch92, Keller94, Bloch96]
mixture of experts [Jacobs91, Jacobs95, Jordan95, Nowlan91]
committees of neural networks [Bishop95, Drucker94]
consensus aggregation [Benediktsson92, Ng92, Benediktsson97]
voting pool of classifiers [Battiti94]
dynamic classifier selection [Woods97]
composite classifier systems [Dasarathy78] (oldest)
classifier ensembles [Drucker94, Filippi94, Sharkey99]
bagging, boosting, arcing, wagging [Sharkey99]
modular systems [Sharkey99]
collective recognition [Rastrigin81, Barabash83] (oldest)
stacked generalization [Wolpert92]
divide-and-conquer classifiers [Chiang94]
pandemonium system of reflective agents [Smieja96] (fanciest)
change-glasses approach to classifier selection [KunchevaPRL93]
etc.
[Book covers]
The method of collective recognition, Moscow: Energoizdat, 1981 [Rastrigin81] (cover price ≈ 1c): already covering the classifier ensemble, classifier selection (regions of competence) and the weighted majority vote.
Collective statistical decisions in [pattern] recognition, Moscow: Radio i svyaz’, 1983 [Barabash83]: the weighted majority vote.
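Both books revolve around the weighted majority vote. As a minimal sketch of that combiner, assuming NumPy (the votes and weights below are invented for illustration):

    import numpy as np

    def weighted_majority_vote(labels, weights):
        # labels: class labels predicted by the L ensemble members
        # weights: their nonnegative voting weights
        labels, weights = np.asarray(labels), np.asarray(weights)
        classes = np.unique(labels)
        support = np.array([weights[labels == c].sum() for c in classes])
        return classes[int(np.argmax(support))]

    # A weighted minority can outvote the plain majority:
    print(weighted_majority_vote(["B", "A", "A"], [0.6, 0.3, 0.2]))  # -> B

    # A classical result for independent members with accuracies p_i:
    # the optimal weights are proportional to log(p_i / (1 - p_i)).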
This superb graph was borrowed from “Fuzzy models and digital signal processing (for pattern recognition): Is this a good marriage?”, Digital Signal Processing, 3, 1993, 253-270, by my good friend Jim Bezdek.
[Graph: expectation over time, 1965-1993. Naive euphoria rises to the peak of hype, overreaction to immature technology plunges into the depth of cynicism, and true user benefit finally climbs to the asymptote of reality.]
So where are we?
[The same hype-cycle graph, relabelled: the time axis starts at 1978 (the oldest reference), and “2008” is pencilled in at five candidate positions (1-5) along the curve: naive euphoria, peak of hype, overreaction to immature technology, depth of cynicism, or true user benefit?]
To make the matter worse...
Expert 1: J. Ghosh (half full)
Forum: 3rd International Workshop on Multiple Classifier Systems, 2002 (invited lecture)
Quote: “... our current understanding of ensemble-type multiclassifier systems is now quite mature...”
Expert 2: T. K. Ho (half empty)
Forum: invited book chapter, 2002
Quote: “Many of the above questions are there because we do not yet have a scientific understanding of the classifier combination mechanisms”
Number of publications (13 Nov 2008)
[Chart: publications per year, 2000-2008 (incomplete for 2008), counts up to about 300, for four queries:
1. Classifier ensembles
2. AdaBoost – (1)
3. Random Forest – (1) – (2)
4. Decision Templates – (1) – (2) – (3)]
Number of publications (13 Nov 2008)
[The same four curves on a scale up to about 500 publications, 2000-2008, incomplete for 2008.]
Literature: “One cannot embrace the unembraceable.” (Kozma Prutkov)
ICPR 2008
984 papers, ~2000 words in the titles
[Scatter of title words over the first 2 principal components; the prominent words are image, segment, feature, local, select, video, track, object, and, in there somewhere, classifier ensembles.]
So where are we?
Still here… somewhere…
[The hype-cycle graph once more, 1978 to 2008, with “2008” still hovering between naive euphoria, the peak of hype, overreaction to immature technology, the depth of cynicism and true user benefit.]
2. Fiction
Fiction?
Diversity: Diverse ensembles are better ensembles? Diversity = independence?
AdaBoost: “The best off-the-shelf classifier”?
Minority Report is a science-fiction short story by Philip K. Dick, first published in 1956. It is set in a future society where murders are prevented through the efforts of three mutants (“precogs”) who can see up to two weeks into the future. The story was made into a popular film in 2002.
Each of the three precogs generates its own report or prediction, and the three reports are analysed by a computer. If the reports differ from one another, the computer identifies the two with the greatest overlap and produces a “majority report”, taking this as the accurate prediction of the future.
But the existence of majority reports implies the existence of a “minority report”.
And, of course, the most interesting case is when the classifiers disagree: the minority report.
Diversity is good
Three classifiers, 15 objects; individual accuracy = 10/15 = 0.667 for every member.
[Grids mark each object as Correct or Wrong for each classifier.]
• independent classifiers: ensemble accuracy (majority vote) = 11/15 = 0.733
• identical classifiers: ensemble accuracy (majority vote) = 10/15 = 0.667
• dependent classifiers 1: ensemble accuracy (majority vote) = 7/15 = 0.467
• dependent classifiers 2: ensemble accuracy (majority vote) = 15/15 = 1.000
Myth: Independence is the best scenario.
Myth: Diversity is always good.

    identical      0.667
    independent    0.733
    dependent 1    0.467   worse than individual
    dependent 2    1.000   better than independence
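The same point as a minimal sketch, assuming NumPy; the three dependence patterns below are my own constructions with the slide’s individual accuracy of 10/15, not the exact grids shown:

    import numpy as np

    n = 15  # objects; every classifier is correct on exactly 10 of them

    def mask(idx):
        m = np.zeros(n, dtype=bool)
        m[list(idx)] = True
        return m

    def majority_accuracy(correct):
        # correct: (3, n) boolean matrix of right/wrong outcomes;
        # the majority vote is right when at least 2 of 3 members are right
        return float(np.mean(correct.sum(axis=0) >= 2))

    # identical: all three succeed and fail on the same 10 objects
    identical = np.stack([mask(range(10))] * 3)

    # good dependence: every object is covered by exactly two members
    lucky = np.stack([mask(range(10)),
                      mask(range(5, 15)),
                      mask(list(range(5)) + list(range(10, 15)))])

    # bad dependence: correct votes pile up on the same 8 objects
    unlucky = np.stack([mask(range(10)),
                        mask(list(range(8)) + [10, 11]),
                        mask(list(range(8)) + [12, 13])])

    for name, c in [("identical", identical), ("lucky", lucky),
                    ("unlucky", unlucky)]:
        print(name, round(majority_accuracy(c), 3))
    # identical 0.667, lucky 1.0, unlucky 0.533 (worse than any individual)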
Example
The set-up:
• UCI data repository, “heart” data set
• First 9 features; all 280 different partitions into [3, 3, 3]
• Ensemble of 3 linear classifiers
• Majority vote
• 10-fold cross-validation
What we measured:
• Individual accuracies of the ensemble members
• The ensemble accuracy
• The ensemble diversity (just one of all these measures…)
A sketch of the protocol follows the lists.
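A minimal sketch of this protocol, assuming scikit-learn; load_heart is a hypothetical loader for the UCI data, and linear discriminant analysis stands in for the unspecified “linear classifier”:

    from itertools import combinations

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    def partitions_333():
        # all 280 unordered partitions of features 0..8 into groups of [3, 3, 3]
        for g1_rest in combinations(range(1, 9), 2):   # feature 0 anchors group 1
            g1 = (0,) + g1_rest
            rest = [f for f in range(9) if f not in g1]
            for g2_rest in combinations(rest[1:], 2):  # smallest leftover anchors group 2
                g2 = (rest[0],) + g2_rest
                g3 = tuple(f for f in rest if f not in g2)
                yield (g1, g2, g3)

    class FeaturePartitionVote(BaseEstimator, ClassifierMixin):
        # majority vote over one linear classifier per feature group
        def __init__(self, groups=None):
            self.groups = groups

        def fit(self, X, y):
            self.members_ = [LinearDiscriminantAnalysis().fit(X[:, list(g)], y)
                             for g in self.groups]
            return self

        def predict(self, X):
            preds = np.stack([m.predict(X[:, list(g)])
                              for m, g in zip(self.members_, self.groups)])
            voted = []
            for column in preds.T:          # plurality vote per object
                values, counts = np.unique(column, return_counts=True)
                voted.append(values[np.argmax(counts)])
            return np.array(voted)

    X, y = load_heart()   # hypothetical loader for the UCI "heart" data
    X = X[:, :9]          # first 9 features only

    scores = [cross_val_score(FeaturePartitionVote(g), X, y, cv=10).mean()
              for g in partitions_333()]
    print(len(scores))    # 280 ensembles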
Example
[Scatter plot: ensemble accuracy (0.4-0.9) against individual accuracy (0.4-0.9) for the 280 ensembles, with reference lines at the minimum, average and maximum individual accuracy; above the line is where the ensemble is better.]
Example
[Scatter plot: ensemble accuracy (0.72-0.80) against diversity (0-0.7) for the 280 ensembles, with arrows marking “more diverse” and “less accurate”, and a big “?” over the hoped-for trend.]
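The slides do not say which diversity measure was plotted; as one standard choice, here is a sketch of the pairwise disagreement measure (assuming NumPy, and reusing the oracle right/wrong matrix format from the earlier snippet):

    import numpy as np

    def disagreement_diversity(correct):
        # correct: (L, n) boolean matrix of oracle outputs
        # (member i right/wrong on object j); returns the proportion of
        # objects on which exactly one member of a pair is correct,
        # averaged over all pairs
        L = correct.shape[0]
        pairs = [(i, j) for i in range(L) for j in range(i + 1, L)]
        return float(np.mean([np.mean(correct[i] != correct[j])
                              for i, j in pairs]))

    # e.g. the "lucky" pattern from the earlier snippet: every pair
    # disagrees on 10 of the 15 objects, so the measure is 10/15 = 0.667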
Example
[3-D plot: ensemble accuracy (0.72-0.82) over the plane spanned by diversity (0-0.8) and individual accuracy (0.66-0.74).]
Example
[Scatter plot: individual accuracy (0.66-0.73) against diversity (0-0.7); where diversity is large, individual accuracy is low, and that is where large ensemble accuracy is expected.]
AdaBoost is everything
[Cartoon: a Swiss Army knife, captioned “Surely, there is more to combining classifiers than Bagging and AdaBoost”, next to a “Russian Army knife” whose every blade is AdaBoost, plus one Bagging.]
Example – Rotation Forest
One reviewer: “This altogether gives a very bad impression of ill-conceived experiments and confusing and unreliable conclusions. ... The current spotty conclusions are incomprehensible, and are of no generalization or reference value.”
Another reviewer: “This is a potentially great new method and any experimental analysis would be very useful for understanding its potential. Good study, with very useful information in the Conclusions.”
% of data sets (out of 32) where the respective ensemble method is best
[Line chart against ensemble size (20 to 100): Rotation Forest on top, ahead of Random Forest, Bagging and Boosting.]
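Rotation Forest itself is not shipped with scikit-learn, but the shape of such a benchmark is easy to sketch for the three off-the-shelf ensembles; the data sets and ensemble sizes below are placeholders, not the 32 sets behind the chart:

    from sklearn.datasets import load_breast_cancer, load_iris, load_wine
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score

    datasets = [load_breast_cancer(return_X_y=True),
                load_wine(return_X_y=True),
                load_iris(return_X_y=True)]

    for size in (10, 50, 100):              # ensemble sizes, as on the x-axis
        wins = {"Bagging": 0, "Boosting": 0, "Random Forest": 0}
        for X, y in datasets:
            methods = {
                "Bagging": BaggingClassifier(n_estimators=size, random_state=0),
                "Boosting": AdaBoostClassifier(n_estimators=size, random_state=0),
                "Random Forest": RandomForestClassifier(n_estimators=size,
                                                        random_state=0),
            }
            scores = {name: cross_val_score(clf, X, y, cv=10).mean()
                      for name, clf in methods.items()}
            wins[max(scores, key=scores.get)] += 1
        # percentage of data sets on which each method is best
        print(size, {name: 100 * w // len(datasets) for name, w in wins.items()})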
So, no,
AdaBoost is NOT everything
3. Faults
OUR faults!

Complacent: We don’t care about terminology.
Vain: To get publications, we invent complex models for simple problems or, worse even, for complex non-existent problems.
Untidy: There is little effort to systemise the area.
Ignorant and lazy: By virtue of ignorance we tackle problems well and truly solved by others. Krassi’s motto: “I don’t have time to read papers because I am busy writing them.”
Haughty: Simple things that work do not impress us until they get proper theoretical proofs.
Terminology
• Pattern recognition land
• Data mining kingdom
• Machine learning ocean
• Statistics underworld and…
• Weka…
God, seeing what the people were doing, gave each person a different language to confuse them, and scattered the people throughout the earth…
[Image taken from http://en.wikipedia.org/wiki/Tower_of_Babel]
[Word cloud: the same vocabulary scattered across dialects: object, instance, example, observation, data point; feature, attribute, variable; classifier, learner, hypothesis; decision tree, C4.5, J48; SVM, SMO; naïve Bayes, AODE; nearest neighbour, lazy learner; classifier ensemble…]
[The same terms sorted into overlapping ML, Stats and Weka circles: an object is an instance or example (ML) and an observation or data point (Stats); a feature is an attribute or variable; a classifier is a learner or hypothesis; a decision tree is C4.5 (ML) or J48 (Weka); an SVM is SMO in Weka; naïve Bayes sits next to AODE; the nearest neighbour is a lazy learner; a classifier ensemble is a meta learner.]
Classifier ensembles - names
combination of multiple classifiers [Lam95, Woods97, Xu92, Kittler98]
classifier fusion [Cho95, Gader96, Grabisch92, Keller94, Bloch96]
mixture of experts [Jacobs91, Jacobs95, Jordan95, Nowlan91]
committees of neural networks [Bishop95, Drucker94]
consensus aggregation [Benediktsson92, Ng92, Benediktsson97]
voting pool of classifiers [Battiti94]
dynamic classifier selection [Woods97]
composite classifier systems [Dasarathy78]
classifier ensembles [Drucker94, Filippi94, Sharkey99]
bagging, boosting, arcing, wagging [Sharkey99]
Out of fashion:
modular systems [Sharkey99]
collective recognition [Rastrigin81, Barabash83]
stacked generalization [Wolpert92]
divide-and-conquer classifiers [Chiang94]
Subsumed:
pandemonium system of reflective agents [Smieja96]
change-glasses approach to classifier selection [KunchevaPRL93]
etc.
United terminology! Yey!
Two names survive: combination of multiple classifiers [Lam95, Woods97, Xu92, Kittler98], carried on by the MCS (Multiple Classifier Systems) Workshops, 2000-2009, and classifier ensembles [Drucker94, Filippi94, Sharkey99].
Simple things that work…
We detest simple things that work well for an unknown reason!!!
[Cartoon. Ideal scenario: the flagship of THEORY leads, with empirics and applications in tow. Real: the fleet has been hijacked by heuristics… HEURISTICS is the flagship, and theory trails behind.]
Lessons from the past: fuzzy sets
• Stability of the system? • Reliability? • Optimality? • Why not probability?
Who cares?...
• temperature for washing machine programmes
• automatic focus in digital cameras
• ignition angle of internal combustion engines in cars
Because it is
• computationally simpler (faster)
• easier to build, interpret and maintain
Learn to trust heuristics and empirics…
4. Future
Future
Branch out?
• Multiple instance learning
• Non-i.i.d. examples
• Skewed class distributions
• Noisy class labels
• Sparse data
• Non-stationary data
[The expectation graph one more time, 1978-2008, approaching the asymptote of reality.]
classifier ensembles for changing environments
classifier ensembles for change detection
D. J. Hand, Classifier Technology and the Illusion of Progress, Statistical Science 21(1), 2006, 1-14.
“… I am not suggesting that no major advances in classification methods will ever be made. Such a claim would be absurd in the face of developments such as the bootstrap and other resampling approaches, which have led to significant advances in classification and other statistical models. All I am saying is that much of the purported advance may well be illusory.” …
Empty-y-y…
(not even half empty-y-y-y …)
So have we truly made progress or are we just kidding ourselves?
Bo-o-ori-i-i-ing....