Classifier Ensembles: Facts, Fiction, Faults and Future
Ludmila I. Kuncheva
School of Computer Science, Bangor University, Wales, UK
1. Facts

Classifier ensembles
[Diagram: feature values (object description) → classifier → class label.]
[Diagram: feature values (object description) → classifier, classifier, classifier → "combiner" → class label.]
[Diagram: the same picture with awkward questions attached: is a neural network an ensemble? Is a bank of classifiers with a fancy combiner an ensemble, or just one big classifier? Are the base classifiers merely a fancy feature extractor in front of a single classifier?]

Why classifier ensembles then?
a. because we like to complicate entities beyond necessity (anti-Occam's razor)
b. because we are lazy and stupid and can't be bothered to design and train one single sophisticated classifier
c. because democracy is so important to our society, it must be important to classification

Juan: "I just like combining things…"

Classifier ensembles go by many names:
• combination of multiple classifiers [Lam95, Woods97, Xu92, Kittler98]
• classifier fusion [Cho95, Gader96, Grabisch92, Keller94, Bloch96]
• mixture of experts [Jacobs91, Jacobs95, Jordan95, Nowlan91]
• committees of neural networks [Bishop95, Drucker94]
• consensus aggregation [Benediktsson92, Ng92, Benediktsson97]
• voting pool of classifiers [Battiti94]
• dynamic classifier selection [Woods97]
• composite classifier systems [Dasarathy78] (the oldest)
• classifier ensembles [Drucker94, Filippi94, Sharkey99]
• bagging, boosting, arcing, wagging [Sharkey99]
• modular systems [Sharkey99]
• collective recognition [Rastrigin81, Barabash83] (the oldest in the Russian literature)
• stacked generalization [Wolpert92]
• divide-and-conquer classifiers [Chiang94]
• pandemonium system of reflective agents [Smieja96] (the fanciest)
• change-glasses approach to classifier selection [KunchevaPRL93]
• etc.

The two early Russian books (price ≈ 1 c):
• The Method of Collective Recognition, Moscow: Energoizdat, 1981 [Rastrigin81]: classifier ensembles, classifier selection (regions of competence), weighted majority vote
• Collective Statistical Decisions in [Pattern] Recognition, Moscow: Radio i Svyaz', 1983 [Barabash83]: weighted majority vote
A minimal sketch of the weighted majority vote is given below.
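Weighted majority voting is the combiner both books arrive at. Here is a minimal sketch in Python; the log-odds weighting w_i = log(p_i / (1 - p_i)) from estimated individual accuracies p_i is a common textbook choice, assumed here for illustration rather than taken from the slides, and all the numbers are made up:

```python
import numpy as np

def weighted_majority_vote(votes, weights, n_classes):
    """Combine label votes. votes is (n_classifiers, n_samples) integer
    labels; weights holds one nonnegative weight per classifier."""
    scores = np.zeros((n_classes, votes.shape[1]))
    for w, v in zip(weights, votes):
        for c in range(n_classes):
            scores[c] += w * (v == c)      # add w to the class this vote backs
    return scores.argmax(axis=0)           # largest weighted support wins

# Illustrative numbers: three classifiers with estimated accuracies p,
# weighted by the log-odds w = log(p / (1 - p)) (an assumed scheme).
p = np.array([0.70, 0.65, 0.80])
w = np.log(p / (1 - p))
votes = np.array([[0, 1, 1],                # classifier 1's labels
                  [1, 1, 0],                # classifier 2's labels
                  [0, 0, 1]])               # classifier 3's labels
print(weighted_majority_vote(votes, w, n_classes=2))   # prints [0 1 1]
```

With equal weights this reduces to the plain majority vote used in the examples that follow.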
This superb graph was borrowed from "Fuzzy models and digital signal processing (for pattern recognition): Is this a good marriage?", Digital Signal Processing, 3, 1993, 253-270, by my good friend Jim Bezdek.
[Figure: the technology hype curve, 1965-1993: expectation rises through naive euphoria to the peak of hype, overreaction to immature technology drops it into the depth of cynicism, and it recovers through true user benefit towards the asymptote of reality.]
So where are we?
[Figure: the same hype curve redrawn for classifier ensembles, 1978-2008, with five candidate positions for 2008 marked along it.]
To make the matter worse...
Expert 1: J. Ghosh (half full). Forum: 3rd International Workshop on Multiple Classifier Systems, 2002 (invited lecture). Quote: "... our current understanding of ensemble-type multiclassifier systems is now quite mature..."
Expert 2: T.K. Ho (half empty). Forum: invited book chapter, 2002. Quote: "Many of the above questions are there because we do not yet have a scientific understanding of the classifier combination mechanisms."

Number of publications (13 Nov 2008)
[Two figures: publications per year, 2000-2008 (2008 incomplete), for four queries: 1. classifier ensembles; 2. AdaBoost, excluding (1); 3. Random Forest, excluding (1)-(2); 4. Decision Templates, excluding (1)-(3).]
Literature: "One cannot embrace the unembraceable." (Kozma Prutkov)
ICPR 2008: 984 papers, ~2000 words in the titles.
[Figure: the title words plotted on the first two principal components: image, segment, feature, local, select, video, track, object... with classifier ensembles among them.]
So where are we?
[Figure: the hype curve once more, with the answer "still here… somewhere…"]

2. Fiction

Fiction?
• Diversity. Diverse ensembles are better ensembles? Diversity = independence?
• AdaBoost. "The best off-the-shelf classifier"?

Minority Report is a science-fiction short story by Philip K. Dick, first published in 1956. It is about a future society where murders are prevented through the efforts of three mutants ("precogs") who can see two weeks ahead into the future. The story was made into a popular film in 2002. Each of the three precogs generates its own report or prediction. The three reports are analysed by a computer. If these reports differ from one another, the computer identifies the two reports with the greatest overlap and produces a "majority report", taking this as the accurate prediction of the future. But the existence of majority reports implies the existence of a "minority report". And, of course, the most interesting case is when the classifiers disagree: the minority report.

Diversity is good
Three classifiers, each with individual accuracy = 10/15 = 0.667.
[Slide: the 15 outputs of each classifier marked Wrong or Correct, under four scenarios.]
• independent classifiers: ensemble accuracy (majority vote) = 11/15 = 0.733
• identical classifiers: ensemble accuracy (majority vote) = 10/15 = 0.667
• dependent classifiers 1: ensemble accuracy (majority vote) = 7/15 = 0.467
• dependent classifiers 2: ensemble accuracy (majority vote) = 15/15 = 1.000

Myth: independence is the best scenario. Myth: diversity is always good.

  identical      0.667
  independent    0.733
  dependent 1    0.467   (worse than the individual accuracy)
  dependent 2    1.000   (better than independence)

Example
The set-up:
• UCI data repository, "heart" data set
• first 9 features; all 280 different partitions into [3, 3, 3]
• ensemble of 3 linear classifiers
• majority vote
• 10-fold cross-validation
What we measured:
• the individual accuracies of the ensemble members
• the ensemble accuracy
• the ensemble diversity (just one of all these measures…)
A minimal sketch of this set-up is given below.
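The following sketch reproduces the set-up with two loudly flagged assumptions: synthetic data stands in for the UCI heart set (to keep the snippet self-contained), and only one of the 280 partitions is evaluated. LinearDiscriminantAnalysis plays the linear classifier, and average pairwise disagreement plays the unnamed diversity measure:

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the UCI "heart" data (270 samples, 9 features).
X, y = make_classification(n_samples=270, n_features=9, n_informative=6,
                           random_state=1)

def majority_vote(votes):
    """Majority vote over a (n_classifiers, n_samples) array of 0/1 labels."""
    return (votes.mean(axis=0) > 0.5).astype(int)

def disagreement(votes):
    """Average pairwise disagreement: one of the many diversity measures."""
    pairs = combinations(range(len(votes)), 2)
    return np.mean([(votes[i] != votes[j]).mean() for i, j in pairs])

partition = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # one of the 280 partitions
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
ind_acc, ens_acc, div = [], [], []
for tr, te in cv.split(X, y):
    # One linear classifier per feature group, each voting on the test fold.
    votes = np.array([LinearDiscriminantAnalysis()
                      .fit(X[tr][:, feats], y[tr])
                      .predict(X[te][:, feats]) for feats in partition])
    ind_acc.append([(v == y[te]).mean() for v in votes])
    ens_acc.append((majority_vote(votes) == y[te]).mean())
    div.append(disagreement(votes))

print("individual accuracies:", np.round(np.mean(ind_acc, axis=0), 3))
print("ensemble accuracy    :", round(np.mean(ens_acc), 3))
print("diversity            :", round(np.mean(div), 3))
```

Sweeping all 280 partitions is just an outer loop over every way of splitting the nine feature indices into three groups of three.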
Example
[Figure: ensemble accuracy vs individual accuracy for the 280 ensembles, with the minimum, average and maximum individual accuracies marked and the region where the ensemble is better than its members indicated.]
Example
[Figure: ensemble accuracy (0.72-0.80) vs diversity (0-0.7): the more diverse ensembles are, if anything, less accurate, with a "?" over the hoped-for trend.]
Example
[Figure: a 3-D plot of ensemble accuracy against diversity and individual accuracy.]
Example
[Figure: individual accuracy vs diversity; the region of expected large ensemble accuracy lies at large diversity.]

AdaBoost is everything?
[Image: a Swiss Army Knife.]
Surely, there is more to combining classifiers than Bagging and AdaBoost.
[Image: the "Russian Army Knife": every blade is AdaBoost, with a single Bagging among them.]

Example: Rotation Forest
Two reviews of the same work:
"This altogether gives a very bad impression of ill-conceived experiments and confusing and unreliable conclusions. ... The current spotty conclusions are incomprehensible, and are of no generalization or reference value."
"This is a potentially great new method and any experimental analysis would be very useful for understanding its potential. Good study, with very useful information in the Conclusions."
[Figure: percentage of data sets (out of 32) where the respective ensemble method is best, versus ensemble size (20-100): Rotation Forest on top, ahead of Random Forest, Bagging and Boosting.]
So, no, AdaBoost is NOT everything. A sketch of such a head-to-head comparison is given below.
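A minimal sketch of such a comparison with scikit-learn. Rotation Forest has no scikit-learn implementation, so only Bagging, AdaBoost and Random Forest appear here, and the single synthetic data set and all parameters are illustrative stand-ins for the 32 benchmark sets of the study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# One illustrative data set; the study repeats this over 32 UCI sets.
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

ensembles = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "AdaBoost": AdaBoostClassifier(n_estimators=50),
    "Random Forest": RandomForestClassifier(n_estimators=50),
}
for name, clf in ensembles.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold CV accuracy
    print(f"{name:13s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reproducing the plot above would add an outer loop over data sets and ensemble sizes, counting on how many sets each method comes out on top.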
3. Faults

OUR faults!
Complacent: we don't care about terminology.
Vain: to get publications, we invent complex models for simple problems or, worse even, complex non-existent problems.
Untidy: there is little effort to systemise the area.
Ignorant and lazy: by virtue of ignorance, we tackle problems well and truly solved by others. Krassi's motto: "I don't have time to read papers because I am busy writing them."
Haughty: simple things that work do not impress us until they get proper theoretical proofs.

Terminology
• Pattern recognition land
• Data mining kingdom
• Machine learning ocean
• Statistics underworld
and… Weka…
God, seeing what the people were doing, gave each person a different language to confuse them and scattered the people throughout the earth… [image taken from http://en.wikipedia.org/wiki/Tower_of_Babel]
[Slide: the same concepts scattered across the dialects: object = instance = example = observation = data point; feature = attribute = variable; classifier = hypothesis = learner; decision tree = C4.5 = J48 (Weka); SVM = SMO (Weka); nearest neighbour = lazy learner (Weka); classifier ensemble = meta learner (Weka); naïve Bayes; AODE. The terms are then sorted into ML, Stats and Weka vocabularies.]

Classifier ensembles: names
The same list of names as before, now annotated. Out of fashion: modular systems [Sharkey99], collective recognition [Rastrigin81, Barabash83], stacked generalization [Wolpert92], divide-and-conquer classifiers [Chiang94]. Subsumed: pandemonium system of reflective agents [Smieja96], change-glasses approach to classifier selection [KunchevaPRL93], etc.

United terminology! Yey!
• combination of multiple classifiers [Lam95, Woods97, Xu92, Kittler98] → MCS: Multiple Classifier Systems Workshops, 2000-2009
• classifier ensembles [Drucker94, Filippi94, Sharkey99]

Simple things that work…
We detest simple things that work well for an unknown reason!!!
[Slide: the ideal scenario, with THEORY as the flagship followed by empirics and applications, versus the real scenario, hijacked by HEURISTICS…]

Lessons from the past: fuzzy sets
• stability of the system? • reliability? • optimality? • why not probability?
Who cares?...
• temperature for washing machine programmes
• automatic focus in digital cameras
• ignition angle of internal combustion in cars
Because it is:
• computationally simpler (faster)
• easier to build, interpret and maintain
Learn to trust heuristics and empirics…

4. Future

Future: branch out?
• Multiple instance learning
• Non-i.i.d. examples
• Skewed class distributions
• Noisy class labels
• Sparse data
• Non-stationary data
[Figure: the hype curve one last time, 1978-2008, with expectation settling towards the asymptote of reality.]
• classifier ensembles for changing environments
• classifier ensembles for change detection

D.J. Hand, "Classifier Technology and the Illusion of Progress", Statistical Science 21(1), 2006, 1-14:
"… I am not suggesting that no major advances in classification methods will ever be made. Such a claim would be absurd in the face of developments such as the bootstrap and other resampling approaches, which have led to significant advances in classification and other statistical models. All I am saying is that much of the purported advance may well be illusory." ...
Empty-y-y… (not even half empty-y-y-y…)
So have we truly made progress, or are we just kidding ourselves?
Bo-o-ori-i-i-ing....