Semi-Supervised Learning Using Randomized Mincuts Avrim Blum, John Lafferty, Raja Reddy, Mugizi Rwebangira Carnegie Mellon.
Motivation
• Often we have little labeled data but lots of unlabeled data.
• We want to use the relationships between the unlabeled examples to guide our predictions.
• Assumption: "Similar examples should generally be labeled similarly."

Learning Using Graph Mincuts: Blum and Chawla (ICML 2001)

Construct an (unweighted) graph

Add auxiliary "super-nodes": a + node connected to the positively labeled examples and a - node connected to the negatively labeled ones

Obtain the s-t mincut

Classification: examples on the + side of the cut are labeled positive; examples on the - side, negative

Problem
• Plain mincut gives no indication of its confidence on different examples.

Solution
• Add random weights to the edges.
• Run plain mincut and obtain a classification.
• Repeat the above process several times.
• For each unlabeled example, take a majority vote.
• The margin of the vote gives a measure of the confidence.

Before adding random weights
[Figure: graph and mincut before perturbation]

After adding random weights
[Figure: graph and mincut after perturbation]

PAC-Bayes
• PAC-Bayes bounds show that the "average" of several hypotheses that are all consistent with the training data will probably be more accurate than any single hypothesis.
• In our case, each distinct cut corresponds to a different hypothesis.
• Hence the average of these cuts will probably be more accurate than any single cut.

Markov Random Fields
• Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph.
• This corresponds to a Markov Random Field model.
• We don't know how to do this efficiently, but we can view randomized mincuts as an approximation.

Related Work – Gaussian Fields
• Zhu, Ghahramani and Lafferty (ICML 2003).
• Each unlabeled example receives a label that is the average of its neighbors' labels.
• Equivalent to minimizing the squared difference of the labels.

How to construct the graph?
• k-NN
  – Graph may not have small balanced cuts.
  – How to learn k?
• Connect all points within distance δ
  – Can have disconnected components.
  – How to learn δ?
• Minimum spanning tree
  – No parameters to learn.
  – Gives a connected, sparse graph.
  – Seems to work well on most datasets.

Experiments
• ONE vs. TWO: 1128 examples (8 × 8 array of integers, Euclidean distance).
• ODD vs. EVEN: 4000 examples (16 × 16 array of integers, Euclidean distance).
• PC vs. MAC: 1943 examples (20-newsgroup dataset, TFIDF distance).

[Results plots: ONE vs. TWO; ODD vs. EVEN; PC vs. MAC; Accuracy-Coverage: PC vs. MAC (12 labeled)]

Conclusions
• We can get useful estimates of the confidence of our predictions.
• We often get better accuracy than plain mincut.
• The minimum spanning tree gives good results across different datasets.

Future Work
• Sample-complexity lower bounds (i.e., how much unlabeled data do we need to see?).
• More principled ways of sampling cuts?

THE END

Questions?
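The randomized-mincut procedure summarized on the Problem/Solution slide can be sketched in pure Python. This is an illustrative toy, not the authors' implementation: the graph, the perturbation range, and all function names are our own. An Edmonds-Karp max-flow gives the s-t mincut on each run, edge weights are perturbed with random noise, and the per-node vote margin serves as the confidence estimate.

```python
import random
from collections import deque, defaultdict

def edmonds_karp(capacity, s, t):
    """Max flow via BFS augmenting paths; returns the flow table."""
    flow = defaultdict(lambda: defaultdict(float))
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in capacity[u]:
                if v not in parent and capacity[u][v] - flow[u][v] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck

def source_side(capacity, s, t):
    """Nodes reachable from s in the residual graph: the + side of a mincut."""
    flow = edmonds_karp(capacity, s, t)
    side, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in capacity[u]:
            if v not in side and capacity[u][v] - flow[u][v] > 1e-12:
                side.add(v)
                q.append(v)
    return side

def randomized_mincut(edges, pos, neg, unlabeled, runs=10, noise=0.5, seed=0):
    """Perturb weights, cut, vote; returns {node: (label, vote margin)}."""
    rng = random.Random(seed)
    votes = defaultdict(int)
    big = sum(w for _, _, w in edges) + len(edges) * noise + 1.0  # never cut
    for _ in range(runs):
        cap = defaultdict(dict)
        for u, v, w in edges:
            pw = w + rng.random() * noise   # random perturbation of the weight
            cap[u][v] = cap[v][u] = pw
        for u in pos:                       # tie labeled nodes to super-nodes
            cap['+'][u] = cap[u]['+'] = big
        for u in neg:
            cap['-'][u] = cap[u]['-'] = big
        pos_side = source_side(cap, '+', '-')
        for u in unlabeled:
            votes[u] += 1 if u in pos_side else -1
    return {u: ('+' if votes[u] > 0 else '-', abs(votes[u]) / runs)
            for u in unlabeled}

# Two triangles joined by one weak edge; 'a' labeled +, 'f' labeled -.
edges = [('a', 'b', 1.0), ('b', 'c', 1.0), ('a', 'c', 1.0),
         ('d', 'e', 1.0), ('e', 'f', 1.0), ('d', 'f', 1.0),
         ('c', 'd', 0.1)]
preds = randomized_mincut(edges, pos=['a'], neg=['f'],
                          unlabeled=['b', 'c', 'd', 'e'])
# Every run cuts the weak bridge (c, d), so all four margins are 1.0.
```

On harder graphs the margin drops below 1.0 on nodes near the decision boundary, which is exactly the confidence signal the talk is after.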
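The graph-construction slide favors the minimum spanning tree because it needs no parameters and is always connected. A minimal Kruskal's-algorithm sketch over pairwise Euclidean distances (pure Python; the function name and toy points are ours):

```python
import math

def mst_edges(points):
    """Kruskal's algorithm over all pairwise Euclidean distances."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    dists = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    edges = []
    for d, i, j in dists:
        ri, rj = find(i), find(j)
        if ri != rj:              # adding this edge merges two components
            parent[ri] = rj
            edges.append((i, j, d))
            if len(edges) == n - 1:
                break
    return edges

# Two tight pairs far apart: the MST uses the two short edges plus one bridge.
tree = mst_edges([(0, 0), (0, 1), (5, 0), (5, 1)])
```

An n-point dataset always yields exactly n - 1 edges, i.e. a connected, sparse graph, which is the property the slide highlights.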
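The Gaussian-fields approach cited under related work gives each unlabeled example the average of its neighbors' labels. On a small graph that harmonic solution can be computed by simple relaxation sweeps; this is our own toy sketch of the idea, not the code of Zhu, Ghahramani and Lafferty.

```python
def harmonic_labels(adj, labels, sweeps=500):
    """Gauss-Seidel relaxation: repeatedly replace each unlabeled value with
    the mean of its neighbors; labeled values stay clamped at 0 or 1."""
    f = {u: labels.get(u, 0.5) for u in adj}
    for _ in range(sweeps):
        for u in adj:
            if u not in labels:
                f[u] = sum(f[v] for v in adj[u]) / len(adj[u])
    return f

# Path graph a - b - c - d with 'a' labeled 1 and 'd' labeled 0.
adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
f = harmonic_labels(adj, {'a': 1.0, 'd': 0.0})
# The harmonic solution interpolates linearly along the path:
# f(b) = 2/3, f(c) = 1/3.
```

Minimizing the squared label differences over edges, as the slide states, yields exactly this averaging condition at every unlabeled node.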