Transcript Document
Pattern Recognition and Machine Learning, Part 2
Lucy Kuncheva, School of Computer Science, Bangor University
[email protected] 1

Pattern Recognition – DIY using WEKA 2

The weka (also known as Maori hen or woodhen) (Gallirallus australis) is a flightless bird species of the rail family. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit. http://en.wikipedia.org/wiki/Weka 3

WEKA, http://www.cs.waikato.ac.nz/ml/weka/
“WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.” 4

WEKA. And we will be using only the hammer... 5

PROBLEM. Data set: your data sets are of the WIDE type: a small number of objects and a large number of features.
[Figure: the data matrix, with objects 1, 2, 3, ..., N as rows and features (attributes, variables, covariates...) 1, 2, 3, ..., n as columns; object #3 and feature #2 are highlighted.] 6

WEKA. Prepare the .arff file:
1. Open the data in an ASCII editor.
2. Add rows:
   • @RELATION one_word
   • @ATTRIBUTE name NUMERIC   ... one such line for each feature
   • @ATTRIBUTE class {1,2}    ... for the class variable
   • @DATA
3. Paste the data underneath. 7

Feature selection. (b) Feature subsets. Two questions: How do we select the subsets? How do we evaluate the worth of a subset? 8

Feature selection. (b) Feature subsets. Two questions:
How do we select the subsets? Not our problem now.
How do we evaluate the worth of a subset?
• Wrapper: classification accuracy.
• Filter: some easier-to-calculate proxy for the classification accuracy.
• Embedded: decision tree classifier, SVM. 9

Feature selection. (b) Feature subsets. Two questions:
How do we select the subsets?
• Ranker
• Bespoke
• Greedy: Sequential Forward Selection (SFS)
• Heuristic search / random: swarm optimisation, Genetic Algorithms (GA)
How do we evaluate the worth of a subset? Wrapper, Filter, Embedded. 10, 11

Feature selection methods. CfsSubsetEval: FCBF (Fast Correlation-Based Filter), originally proposed for microarray data analysis (Yu and Liu, 2003). The idea of FCBF is that the features worth keeping should be correlated with the class variable but not correlated among themselves.
1. L. Yu and H. Liu (2003), Feature selection for high-dimensional data: A fast correlation-based filter solution. 12

Feature selection methods. ReliefFAttributeEval: Relief-F (Kira and Rendell, 1992; Kononenko et al., 1997). For each object in the data set, find the nearest neighbour from the same class (NearHit) and the nearest neighbour from the opposite class (NearMiss), using all features. The relevance score of a feature increases if the feature value in the current object is closer to that in the NearHit than to that in the NearMiss. Otherwise, the relevance score of the feature decreases.
1. K. Kira and L. Rendell (1992), The Feature Selection Problem: Traditional Methods and a New Algorithm, AAAI-92 Proceedings.
2. I. Kononenko et al. (1997), Overcoming the myopia of inductive learning algorithms with RELIEFF, Applied Intelligence, 7(1), pp. 39-55. 13
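To make the NearHit/NearMiss rule above concrete, here is a minimal MATLAB sketch of a basic, two-class, single-nearest-neighbour Relief-style update (a simplification of Relief-F). It is not WEKA's ReliefFAttributeEval; the function name and all variable names are illustrative only, and numeric class labels are assumed.

% relief_sketch.m - minimal Relief-style relevance scores (illustrative only)
% X: N-by-n data matrix; y: N-by-1 numeric class labels (two classes)
function W = relief_sketch(X, y)
    [N, n] = size(X);
    range = max(X) - min(X);                        % per-feature ranges, for normalisation
    range(range == 0) = 1;                          % guard against constant features
    W = zeros(1, n);                                % one relevance score per feature
    for i = 1:N
        d = sqrt(sum(bsxfun(@minus, X, X(i,:)).^2, 2));  % distances using all features
        d(i) = Inf;                                 % exclude the object itself
        same = (y == y(i));
        dHit  = d; dHit(~same) = Inf;               % nearest neighbour from the same class
        dMiss = d; dMiss(same) = Inf;               % nearest neighbour from the other class
        [~, hit]  = min(dHit);
        [~, miss] = min(dMiss);
        % The score for a feature goes up when the current object is closer to the
        % NearHit than to the NearMiss on that feature, and down otherwise.
        W = W + (abs(X(i,:) - X(miss,:)) - abs(X(i,:) - X(hit,:))) ./ range / N;
    end
end

Called as W = relief_sketch(X, L) on a numeric data matrix and 1/2 class labels (such as the X and L built in the script on slides 28-29), it returns one score per feature, with higher meaning more relevant.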
Feature selection methods: Relief-F.
[Figure: the current object with its NearHit and NearMiss; the relevance score for feature x increases (the NearHit is closer on x), while the relevance score for feature y decreases.] 14

Feature selection methods.
SVM: this classifier builds a linear function that separates the classes. The hyperplane is calculated so as to maximise the distance to the nearest points. The absolute values of the coefficients in front of the features can be interpreted as “importance”.
SVM-RFE: RFE stands for “Recursive Feature Elimination” (Guyon et al., 2006). Starting with an SVM on the entire feature set, a fraction of the features with the lowest weights is dropped. A new SVM is trained with the remaining features, and the set is subsequently reduced in the same way. The procedure stops when a set of the desired cardinality is reached. While SVM-RFE has been found to be extremely useful for wide data such as functional magnetic resonance imaging (fMRI) data (DeMartino et al., 2008), it was discovered that the RFE step is not always needed (Abeel et al., 2010; Geurts et al., 2005).
In WEKA: SVMAttributeEval. 15

Feature selection methods: SVM-RFE vs SVM (SVMAttributeEval options).
[Screenshot: by default one feature is eliminated at each iteration (SVM-RFE); set this value to 0 to rank with a single SVM.] 16

Feature selection methods. For this example, both SVM and SVM-RFE give the same result.
SVM / SVM-RFE, ranked attributes:
  6  2 GRIP_TEST_Right
  5  5 HEIGHT_Standing_cm
  4  1 GRIP_TEST_Left
  3  4 HEIGHT_Seated_cm
  2  3 WEIGHT_Kg
  1  6 ARM_SPAN_cm
Relief-F, ranked attributes:
  0.07863  2 GRIP_TEST_Right
  0.07549  5 HEIGHT_Standing_cm
  0.05528  4 HEIGHT_Seated_cm
  0.05414  1 GRIP_TEST_Left
  0.03172  3 WEIGHT_Kg
  0.00797  6 ARM_SPAN_cm
FCBF, selected attributes: 1,2,5 : 3 (GRIP_TEST_Left, GRIP_TEST_Right, HEIGHT_Standing_cm) 17

PROBLEM (same outputs as on slide 17): While these results are (probably) curious, there is no statistical significance we can attach to them... 18

Time for a coffee-break 19

Feature selection methods: Permutation test.
Feature of interest: X. Class label variable: Y (say, G/N). Example data:
  X    Y
  4.3  G
  2.1  N
  1.8  G
  2.3  G
  3.2  N
Let XG be the sample from class G, and XN the sample from class N. A two-sample t-test can be used to test the hypothesis of equal means when XG and XN come from approximately normal distributions. If we cannot ascertain this condition, use PERMUTATION tests.
Quantity of interest: V = |mXG - mXN| (the difference between the two means). Observed value for our data: V*.
Question: What is the probability of observing V* (or larger) if there were no relationship between X and the class label Y? 20
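To see the mechanics of the test before the real-data results, here is a minimal MATLAB sketch on the five-object toy sample above (the variable names and the choice T = 10000 are illustrative; the script for the actual data set follows on slides 28-29).

% Permutation test on the toy sample (G vs N)
X = [4.3; 2.1; 1.8; 2.3; 3.2];              % feature of interest
Y = {'G'; 'N'; 'G'; 'G'; 'N'};              % class labels
isG = strcmp(Y, 'G');
V_star = abs(mean(X(isG)) - mean(X(~isG))); % observed value V* = |mXG - mXN|

T = 10000;                                  % number of random permutations
V = zeros(T, 1);
for i = 1:T
    la = isG(randperm(numel(isG)));         % randomly permuted labels
    V(i) = abs(mean(X(la)) - mean(X(~la)));
end
p = mean(V >= V_star);                      % proportion of permuted V at least as large as V*
fprintf('V* = %.3f   p = %.3f\n', V_star, p)

For this toy sample V* = |2.8 - 2.65| = 0.15, and with only five objects no small p-value is attainable anyway; the point is only to show the procedure that produces the p-values on the next slides.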
Feature selection methods: Permutation test.
[Figure: histogram of V for permuted labels for the feature 1. ANTHRO - HEIGHT - Standing (cm); x-axis: abs(mean1 - mean2), y-axis: number of occurrences. The observed value lies far in the right tail: very small chance to obtain the observed V* or larger. p-value = 0.0046.] 21

Feature selection methods: Permutation test.
p-value   feature
0.0046    1. ANTHRO - HEIGHT - Standing (cm)
0.0058    1. ANTHRO - HEIGHT - Seated (cm)
0.0077    1. ANTHRO - GRIP TEST Right
0.0100    2.1 DT PACE BOWL - Average MPH
0.0123    1. ANTHRO - WEIGHT (Kg)
0.0159    1. ANTHRO - GRIP TEST - Left
0.0193    1. ANTHRO - ARM SPAN (cm)
0.0266    2.1 DT PACE BOWL - max MPH
0.0319    8.1 FT - SPRINT (40m)
0.0489    8.1 FT - SPRINT (30m) 22

The Dead Salmon. Lo and behold! Brain activity responding to the stimuli!
Neuroscientist Craig Bennett purchased a whole Atlantic salmon, took it to a lab at Dartmouth, and put it into an fMRI machine used to study the brain. The beautiful fish was to be the lab’s test object as they worked out some new methods. So, as the fish sat in the scanner, they showed it “a series of photographs depicting human individuals in social situations.” To maintain the rigor of the protocol (and perhaps because it was hilarious), the salmon, just like a human test subject, “was asked to determine what emotion the individual in the photo must have been experiencing.” 23

Bonferroni correction for multiple comparisons = the simplest and most conservative method to control the familywise error rate.
If we increase the number of hypotheses in a test, we also increase the likelihood of witnessing a rare event, and therefore of declaring a difference when there is none. So, if the desired significance level for the whole family of n tests should be (at most) α, then the Bonferroni correction tests each individual hypothesis at a significance level of α/n. In our case, we have n = 50, so the corrected significance level is 0.05/50 = 0.001. 24

Feature selection methods: Permutation test.
PROBLEM (the p-values from slide 22 again): None of the features survives the Bonferroni correction (p < 0.001 for significance level 0.05). 25

Feature selection methods: Permutation test. More PROBLEMs:
1. If there are permutation tests in WEKA, they are hidden very well...
2. If there is a Bonferroni correction in WEKA, it is hidden very well too...
Solution? DIY... 26

Feature selection methods: Permutation test.
Here is an algorithm for those of you with some programming experience (the null hypothesis is “no difference”, hence V = 0; assume that the greater V is, the larger the difference):
1. Calculate the observed value V*. Choose the number of iterations, e.g., T = 10,000.
2. for i = 1:T
   a) Permute the labels randomly.
   b) Calculate and store V(i) with the permuted labels.
3. end (for)
4. Calculate the p-value as the proportion of V greater than or equal to V*.
5. If you do this experiment for n features, compare p with alpha/n, where alpha is your chosen significance level (typically alpha = 0.05). 27

Feature selection methods: Permutation test. And here is a MATLAB script:

% Permutation test (assume that there are no missing values)
clear, close, clc

X = xlsread('ECB U13 2010 talent testing data.xlsx', ...
    'U13 Talent Test Raw Data', 'G2:L27');
[~, Y] = xlsread('ECB U13 2010 talent testing data.xlsx', ...
    'U13 Talent Test Raw Data', 'F2:F27');    % symbolic label
[~, Names] = xlsread('ECB U13 2010 talent testing data.xlsx', ...
    'U13 Talent Test Raw Data', 'G1:L1');     % feature names

% Convert Y to numbers (1 selected, 2 not selected)
u = unique(Y);
L = ones(size(Y));
L(strcmp(u(1), Y)) = 2;

T = 20000;

(continues on the next slide) 28
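Between the two halves of the script, a short recap of what has been set up so far, written as MATLAB comments so it can be pasted into the script (the dimensions follow from the spreadsheet ranges read above):

% At this point:
%   X     - 26-by-6 numeric data matrix (rows G2:L27, one column per feature)
%   Y     - the 26 symbolic class labels read from column F
%   Names - the six feature names from row G1:L1
%   L     - numeric class labels taking the values 1 and 2
%   T     - the number of random label permutations (20000)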
Feature selection methods: Permutation test. MATLAB script, continued from the previous slide:

for i = 1:T
    la = L(randperm(numel(L)));            % randomly permuted labels
    for j = 1:size(X,2)                    % for each feature
        fe = X(:,j);
        V(i,j) = abs(mean(fe(la == 1)) - mean(fe(la == 2)));
    end
end

% p-values for the features
for j = 1:size(X,2)
    V_star(j) = abs(mean(X(L == 1,j)) - mean(X(L == 2,j)));
    p(j) = mean(V(:,j) >= V_star(j));      % proportion of V >= V* (step 4 of the algorithm)
    fprintf('%35s %.4f\n', Names{j}, p(j))
end
29

Feature selection methods: Permutation test. MATLAB output:
1. ANTHRO - GRIP TEST - Left         0.0176
1. ANTHRO - GRIP TEST Right          0.0087
1. ANTHRO-WEIGHT (Kg)                0.0124
1. ANTHRO- HEIGHT - Seated (cm)      0.0059
1. ANTHRO- HEIGHT - Standing (cm)    0.0055
1. ANTHRO - ARM SPAN (cm)            0.0202
The numbers may vary slightly from one run to the next because of the random generator; the larger the number of iterations (T), the better. The p-values are not corrected (Bonferroni); correction should be applied if necessary. 30

Time for a coffee-break 31

Time for our classifiers!!! The Classification tab:
[Screenshot: choose a classifier (SVM); choose a training/testing protocol; when ready (all chosen), click here.] 32

Where to find the results: the confusion matrix. 33

Where to find the results: classification accuracy (and classification error). (A small worked example appears at the end of this transcript.) 34

And a lot, lot more... 35

Thank you! 36
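As a postscript to slides 33 and 34, here is a minimal MATLAB sketch of how classification accuracy and classification error are read off a confusion matrix. The 2-by-2 matrix below is made up for illustration; it is not WEKA output.

% Rows = true class, columns = predicted class (hypothetical counts)
CM = [12  3;     % class 1: 12 correct, 3 misclassified as class 2
       2  9];    % class 2: 9 correct, 2 misclassified as class 1

accuracy = sum(diag(CM)) / sum(CM(:));   % proportion of correctly classified objects
error    = 1 - accuracy;                 % classification error
fprintf('Accuracy = %.4f   Error = %.4f\n', accuracy, error)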