Transcript Document
Machine Learning
Georg Pölzlbauer, December 11, 2006

Outline
- Exercises
- Data Preparation
- Decision Trees
- Model Selection
- Random Forests
- Support Vector Machines

Exercises
- Groups of 2 or 3 students
- UCI ML Repository: pick 3 data sets with different characteristics (e.g. number of samples, number of dimensions, number of classes)
- Estimate the classification error with 3 classifiers of your choice; compare the results
- Estimate appropriate parameters for these classifiers
- Implement in Matlab, R, WEKA, YALE, or KNIME

Exercises: Software
- Matlab
- YALE http://rapid-i.com/
- WEKA http://www.cs.waikato.ac.nz/ml/weka/
- KNIME http://www.knime.org/
- R http://www.r-project.org/

Exercises: Software
- WEKA: recommended; easy to use, easy to learn, no programming
- KNIME, YALE: also easy to use
- R: the most advanced and powerful software; do not use it unless you know R really well!
- Matlab: not recommended; requires installing packages from the internet etc.

Exercises: Written Report
- The report should be 5-10 pages
- Discuss the characteristics of the data sets (e.g. handling of missing values, scaling etc.)
- Summarize the classifiers used (one paragraph each)
- Discuss the experimental results (tables, figures)
- Do not include code in the report

Exercises: How to proceed
- It is not necessary to implement anything; rely on libraries, modules etc.
- UCI ML Repository: http://www.ics.uci.edu/~mlearn/ML Summary.html
- Import the data file, scale the data, apply model selection, and write down any problems/findings

Grading
- No written/oral exam
- Submission of the report at the end of January
- Ca. 15 minutes of discussion of results and code (individually with each group)
- Grading bonus: use of sophisticated models, detailed comparison of classifiers, thorough discussion of experiments, justification of choices

Questions?
- Questions regarding theory: [email protected], [email protected]
- Questions regarding R, Weka, …: Forum

Machine Learning: Setting

gender   age  smoker  eye color  lung cancer
male     19   yes     green      no
female   44   yes     gray       yes
male     49   yes     blue       yes
male     12   no      brown      no
female   37   no      brown      no
female   60   no      brown      yes
male     44   no      blue       no
female   27   yes     brown      no
female   51   yes     green      yes
female   81   yes     gray       no
male     22   yes     brown      no
male     29   no      blue       no
male     77   yes     gray       ?
male     19   yes     green      ?
female   44   no      gray       ?

- Train an ML model on the labeled rows; the trained model then predicts the unknown labels ("?") of the last three rows (here: yes, no, no)

Data Preparation
- Example: adult census data
- Table format data (data matrix)
- Missing values
- Categorical data
- Quantitative (continuous) data with different scales

Categorical variables
- Non-numeric variables with a finite number of levels, e.g. "red", "blue", "green"
- Some ML algorithms can only handle numeric variables
- Solution: 1-to-N coding (see the code sketches after the decision-tree slides below)

1-to-N Coding

feature  red  blue  green
red      1    0     0
blue     0    1     0
green    0    0     1
red      1    0     0
red      1    0     0
green    0    0     1
blue     0    1     0

Scaling of continuous variables
- Many ML algorithms rely on measuring the distance between two samples
- There should be no difference whether a length variable is measured in cm, inch, or km
- To remove the unit of measure (e.g. kg, mph, …), each variable dimension is normalized: subtract the mean, divide by the standard deviation

Scaling of continuous variables
- The data set now has mean 0 and variance 1
- Chebyshev's inequality: at least 75% of the data lies between -2 and +2, at least 89% between -3 and +3, and at least 94% between -4 and +4

Output variables
- Classification requires a categorical output variable (a continuous output means regression)
- Classification methods can still be applied by binning a continuous output (at a loss of prediction accuracy)
- Example: household income (e.g. $10.000, $200.000) binned into the categories very low, low, average, high, very high

Binary Decision Trees
- Rely on Information Theory (Shannon)
- Recursive algorithm that splits the feature space into 2 areas at each recursion step
- Classification works by going through the tree from the root node until arriving at a leaf node

Decision Trees: Example
(Figure: example tree with the split nodes x < 12.3, y < 4.6, y < 3.9, x < 11.7, x < 13.1, y < 1.7 and the leaf classes red and blue.)

Information Theory, Entropy
- Introduced by Claude Shannon
- Applications in data compression
- Concerned with measuring actual information vs. redundancy
- Measures information in bits

What is "Entropy"?
- In Machine Learning, Entropy is a measure of the impurity of a set
- High Entropy => bad for prediction
- High Entropy => needs to be reduced (Information Gain)
- H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

Calculating H(X)
- Example: 4 red and 8 blue points
- p(x_red) = 4/12 ≈ 0.33, p(x_blue) = 8/12 ≈ 0.67
- H(X) = -0.33 \log_2 0.33 - 0.67 \log_2 0.67 = (0.33)(1.59) + (0.67)(0.59) = 0.53 + 0.39 = 0.92

H(X): Case studies

      p(x_red)  p(x_blue)  H(X)
I     0.5       0.5        1
II    0.3       0.7        0.88
III   0.7       0.3        0.88
IV    0         1          0

(Figure: H(X) plotted as a function of p(x_red).)

H(X): Relative vs. absolute frequencies

      red  blue
I     8    4
II    18   9

=> H(X_I) = H(X_II): only the relative frequencies matter!

Information Gain
- Given a set and a choice between possible subsets, which one is preferable? The split whose subsets reduce the Entropy by the largest amount.
- IG(X_A, X_B) = H(X) - p(x_A) H(X_A) - p(x_B) H(X_B)
- Example: H(X) = 1 and three candidate splits, each into a subset A (green) and a subset B (yellow); split 1 has the largest IG and is therefore preferable (the calculation is illustrated in a code sketch after the decision-tree slides below)

            split 1        split 2        split 3
            A      B       A      B       A      B
Points      5      5       9      1       6      4
p(X.)       0.5    0.5     0.9    0.1     0.6    0.4
p(x_red)    0.2    0.8     0.44   1       0.33   0.75
p(x_blue)   0.8    0.2     0.56   0       0.67   0.25
H(X.)       0.72   0.72    0.99   0       0.92   0.81
IG          0.28           0.11           0.12

Information Gain (Properties)
- IG is at most as large as the Entropy of the original set
- IG is the amount by which the original Entropy can be reduced by splitting into subsets
- IG is at least zero (if the Entropy is not reduced)
- 0 <= IG <= H(X)

Building (binary) Decision Trees
- Data set: categorical or quantitative variables
- Iterate over the variables and calculate the IG for every possible split
- Categorical variables: one level vs. the rest
- Quantitative variables: sort the values and consider a split between each pair of adjacent values
- Recurse until the prediction is good enough
Decision Trees: Quantitative variables
(Figure: candidate split positions along the quantitative variables; original H: 0.99.)
(Figure: the example tree being grown split by split.)

Decision Trees: Classification
(Figures: new points are classified by following the example tree from the root node, e.g. x < 12.3, y < 4.6, y < 3.9, x < 11.7, x < 13.1, y < 1.7, down to a red or blue leaf.)

Decision Trees: More than 2 classes
(Figure: example tree with the leaf classes red, orange, and blue.)
- IG(X_1, ..., X_m) = H(X) - \sum_{j=1}^{m} p(x_j) H(X_j)
- H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

Decision Trees: Non-binary trees
- Example split: drive-wheels? -> fwd / rwd / 4wd
- The same formulas apply: IG(X_1, ..., X_m) = H(X) - \sum_{j=1}^{m} p(x_j) H(X_j), with H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

Decision Trees: Overfitting
- Fully grown trees are usually too complicated

Decision Trees: Stopping Criteria
- Stop when the absolute number of samples is low (below a threshold)
- Stop when the Entropy is already relatively low (below a threshold)
- Stop if the IG is low
- Stop if the decision could be random (Chi-Square test)
- The threshold values are hyperparameters (see the code sketches below)

Decision Trees: Pruning
- "Pruning" means removing nodes from a tree after training has finished
- Stopping criteria are sometimes referred to as "pre-pruning"
- Redundant nodes are removed; sometimes the tree is remodeled

Example: Pruning
(Figure: the example tree before and after pruning.)

Decision Trees: Stability
(Series of figures.)
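The data preparation steps above (1-to-N coding and scaling) can be sketched as follows. This is an illustrative sketch only, in Python with pandas, which is not one of the tools listed for the exercises; the toy data frame and its column names are made up.

```python
# Illustrative sketch: 1-to-N (one-hot) coding of a categorical variable and
# standardization of a continuous variable. The toy data is made up.
import pandas as pd

df = pd.DataFrame({
    "color":  ["red", "blue", "green", "red", "red", "green", "blue"],
    "length": [12.0, 7.5, 9.1, 14.2, 11.8, 8.4, 10.0],   # e.g. measured in cm
})

# 1-to-N coding: each level of the categorical variable becomes its own 0/1 column.
coded = pd.get_dummies(df["color"], prefix="color")
print(coded)

# Scaling: subtract the mean and divide by the standard deviation, so the
# variable has mean 0 and variance 1 regardless of its unit (cm, inch, km, ...).
df["length_scaled"] = (df["length"] - df["length"].mean()) / df["length"].std()
print(round(df["length_scaled"].mean(), 3), df["length_scaled"].std())
```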
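A minimal sketch of the Entropy and Information Gain formulas, assuming Python with numpy (again not one of the listed course tools). It reproduces the H(X) = 0.92 example, the relative-frequency observation, and the IG of candidate split 1 from the table above.

```python
# Illustrative sketch of the Entropy and Information Gain calculations.
import numpy as np

def entropy(counts):
    """H(X) = -sum_i p(x_i) * log2 p(x_i), computed from class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log2(0) as 0
    return float(-(p * np.log2(p)).sum())

def information_gain(parent_counts, subset_counts):
    """IG = H(X) - sum_j p(x_j) * H(X_j), where the subsets partition the parent set."""
    n = float(np.sum(parent_counts))
    weighted = sum(np.sum(c) / n * entropy(c) for c in subset_counts)
    return entropy(parent_counts) - weighted

print(entropy([4, 8]))                             # ~0.92, the slide example
print(entropy([8, 4]), entropy([18, 9]))           # equal: only relative frequencies matter

# Candidate split 1 from the table: subset A = 1 red / 4 blue, subset B = 4 red / 1 blue
print(information_gain([5, 5], [[1, 4], [4, 1]]))  # ~0.28
```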
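As a hedged illustration of how the stopping criteria above appear as hyperparameters of a tree learner, here is a sketch assuming Python with scikit-learn and one of its built-in data sets; the parameter values are arbitrary examples, not recommendations.

```python
# Illustrative sketch: an entropy-based decision tree whose stopping criteria
# (minimum samples per leaf, minimum gain, maximum depth) are hyperparameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",          # splits are chosen by information gain
    min_samples_leaf=5,           # stop when the number of samples is low
    min_impurity_decrease=0.01,   # stop when the gain of a split is low
    max_depth=5,                  # additional limit on tree complexity
    random_state=0,
)
tree.fit(X_train, y_train)
print("test set accuracy:", tree.score(X_test, y_test))
```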
Model Selection
- A general ML framework
- Takes care of estimating hyperparameters
- Takes care of selecting a good model (avoiding local minima)

Why is Generalization an Issue?
(Series of figures: one-dimensional data ranging from about 140 to 200 with the two classes f and m; several possible class boundaries fit the training data.)

Bayes Optimal Classifier
(Figure: the same one-dimensional example.)

Training Set, Test Set
- Solution: split the data into a training set and a test set (~80% training, 20% test)
- Train different models
- Classify the test set
- Pick the one model that has the least test set error
- Trade-off: complexity vs. generalization

Minimum of Test Set Error
(Figure: training set and test set error plotted against model complexity; the test set error has a minimum.)

Estimation of Generalization Error
- The test set is used in model selection and in tuning of parameters; thus, the test set error is not an accurate estimate of the generalization error
- The generalization error is the expected error that the classifier will make on new, unseen data

Training-Test-Validation
- Save part of the data set for validation
- Split, e.g.: 60% training set, 20% test set, 20% validation set
- Train the classifiers on the training set
- Select the classifier based on its test set performance
- Estimate the generalization error on the validation set

Crossvalidation
- Split the data into 10 parts of equal size ("10-fold crossvalidation")
- Repeat 10 times: use 9 parts for training/tuning and calculate the performance on the remaining part (validation set)
- The estimate of the generalization error is the average of the validation set errors

Bootstrapping
- A bootstrap sample is a random subset of the data sample; the validation set is also a random sample
- In the sampling process, data points may be selected repeatedly (sampling with replacement)
- An arbitrary number of bootstrap samples may be used
- Bootstrapping is more reliable than crossvalidation and training-test-validation (see the code sketches after the Random Forests example below)

Example: Bootstrapping
(Figures.)

Random Forests
- Combination of the decision tree and bootstrapping concepts
- A large number of decision trees is trained, each on a different bootstrap sample
- At each split, only a random subset of the original variables is available (i.e. a small selection of columns)
- Data points are classified by majority voting of the individual trees

Example: Random Forests

var1  var2  var3  var4  var5   class
12    A     0.1   501   red    I
8     B     1.2   499   red    II
9     B     1.1   504   blue   II
15    A     1.8   480   green  II
2     C     1.0   511   red    I
-2    C     0.7   512   green  II
7     C     0.4   488   cyan   I
7     A     0.6   491   cyan   I
10    A     1.5   500   cyan   I
0     C     0.3   505   blue   II
9     B     1.9   502   blue   II

(Series of figures: a bootstrap sample of the rows is drawn, e.g. the rows with var1 = 12, 8, 15, 2, 7, 7, 0, and a tree is grown on it with only a random subset of the variables available at each split, e.g. a root split var1 < 8 followed by var3 < 0.4 with leaves II and I, and var3 < 0.7 with leaves I and II.)

Classification with Random Forests
(Figure: a new point is classified by every tree in the forest.)
- Green = Class I, Blue = Class II
- 3 votes for I, 2 votes for II => classify as I
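A minimal sketch of 10-fold crossvalidation and of a bootstrap-based error estimate as described above, assuming Python with scikit-learn and a built-in data set; the number of bootstrap samples (100) is arbitrary.

```python
# Illustrative sketch: estimating the generalization error with 10-fold
# crossvalidation and with bootstrap samples.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5, random_state=0)

# 10-fold crossvalidation: the estimate is the average over the 10 held-out parts.
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Bootstrapping: draw rows with replacement, validate on the points left out.
rng = np.random.default_rng(0)
n = len(X)
accuracies = []
for _ in range(100):                          # an arbitrary number of bootstrap samples
    idx = rng.integers(0, n, size=n)          # sampling with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # points not drawn serve as validation set
    clf.fit(X[idx], y[idx])
    accuracies.append(clf.score(X[oob], y[oob]))
print("bootstrap accuracy: %.3f" % np.mean(accuracies))
```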
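A hedged sketch of a random forest with its two parameters (number of trees, fraction of variables per split), again assuming Python with scikit-learn and a built-in data set; the points left out of each bootstrap sample ("out-of-bag") provide an error estimate.

```python
# Illustrative sketch: a random forest of many trees, each trained on a bootstrap
# sample with a random subset of variables per split; classification by majority vote.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # number of trees (first parameter: choose high)
    max_features=0.2,     # ~20% of the variables considered at each split (second parameter)
    oob_score=True,       # error estimate from the out-of-bag points of each bootstrap sample
    random_state=0,
)
forest.fit(X_train, y_train)

print("out-of-bag accuracy:", forest.oob_score_)
print("test set accuracy:", forest.score(X_test, y_test))
print("variable importances:", forest.feature_importances_)
```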
Properties of Random Forests
- Easy to use ("off-the-shelf"), only 2 parameters (the number of trees and the percentage of variables per split)
- Very high accuracy
- No overfitting when a large number of trees is selected (choose this high)
- Insensitive to the choice of the split percentage (~20%)
- Returns an estimate of variable importance

Support Vector Machines
- Introduced by Vapnik
- A rather sophisticated mathematical model
- Based on 2 concepts: optimization (maximization of the margin) and kernels (non-linear separation)

Linear separation
(Series of figures: several possible separating hyperplanes for linearly separable data.)

Largest Margin
(Figures: the separating hyperplane with the largest margin.)
- Finding the optimal hyperplane can be expressed as an optimization problem, solved by quadratic programming
- Soft margin: not necessarily 100% classification accuracy on the training set

Non-linearly separable data
(Figures.)

Additional coordinate z = x^2
(Figures: the data plotted in (x, y), then with the additional coordinate z = x^2, then in (z, y).)

Kernels
- Projection of the data space into a higher dimensional space
- The data may be separable in this high dimensional space
- Projection = multiplication of the vectors with a kernel matrix
- The kernel matrix determines the shape of the possible separators

Common Kernels
- Quadratic Kernel
- Radial Basis Kernel
- General Polynomial Kernel (arbitrary degree)
- Linear Kernel (= no kernel)
- (A code sketch using these kernels follows at the end of the document)

Kernel Trick
- Other ML algorithms could work with the projected (high dimensional) data, so why bother with SVMs?
- Working with high dimensional data is problematic (complexity)
- Kernel Trick: the optimization problem can be restated so that it only needs distances (inner products) in the high dimensional space, which the kernel supplies without computing the projection explicitly
- This is computationally very inexpensive

Properties of SVM
- High classification accuracy
- Linear kernels: good for sparse, high dimensional data
- Much research has been directed at SVMs, VC-dimension etc. => solid theoretical background

The End

Additional topics
- Confusion Matrix (weights)
- Prototype based methods (LVQ, …)
- k-NN
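To close, a small illustrative sketch of a soft-margin SVM with the kernels listed above, assuming Python with scikit-learn; the concentric-circles data set stands in for the non-linearly separable example, and the parameter values are arbitrary.

```python
# Illustrative sketch: soft-margin SVMs with linear, polynomial, and radial basis kernels.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in (x, y), but separable once a
# coordinate such as z = x^2 + y^2 is (implicitly) added by a kernel.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    # Scale first, then fit the SVM; C controls the softness of the margin.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, degree=2, C=1.0))
    model.fit(X_train, y_train)
    print(kernel, "test set accuracy:", model.score(X_test, y_test))
```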