David G. Underhill, Luke K. McDowell
Computer Science Department, United States Naval Academy

David J. Marchette, Jeffrey L. Solka
Naval Surface Warfare Center, Dahlgren Division

7 November 2015

How to make sense of an overwhelming amount of data? Can "dimensionality reduction" help?

Outline
• Problem Statement
• Background
  – Text Mining Process
  – Dimensionality Reduction
• Experimental Analysis
  – Experimental Question and Method
  – Task 1: Classification
    • Nearest Neighbor Classifier
    • Linear Classifier
    • Quadratic Classifier
• Contributions and Conclusions
• Future Work

Text Mining Overview
• Documents are encoded into a term-document matrix, compared to produce a distance matrix, and then analyzed:
  Encode → Term-Document Matrix → Compare → Distance Matrix → Analyze
• [Diagram: the same pipeline with a Dimensional Reduction step inserted between encoding and comparison]
• A sketch of the encode and compare steps follows.
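The slides do not name a specific term weighting or distance measure, so the following is only a minimal sketch of the Encode and Compare steps, assuming scikit-learn's TF-IDF weighting and cosine distance; the three toy documents are illustrative stand-ins.

```python
# Minimal sketch of the text mining pipeline's Encode and Compare steps.
# TF-IDF weighting and cosine distance are assumptions; the slides do not
# specify the encoding or the distance measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

docs = [
    "electrons and quarks in particle physics",
    "proteins fold inside living cells",
    "acids and bases react in aqueous solution",
]

# Encode: documents -> term-document matrix (one row per document)
term_doc = TfidfVectorizer().fit_transform(docs)

# Compare: term-document matrix -> pairwise distance matrix
dist = pairwise_distances(term_doc, metric="cosine")
print(dist.round(2))  # 3x3 symmetric matrix; the Analyze step works on this
```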
Dimensionality Reduction (DR)
• Goal: simplify a complex data set in a way that preserves the meaning inherent in the original data
  – Usually applied to geometric or numerical data
• How can DR improve text mining?
  – May reveal patterns obscured in the original data
  – Improves analysis time over the original, larger data
  – Greatly decreases storage and transmission costs

Experimental Question
• Can DR improve text mining performance?
  – Many valid DR approaches
  – Relative DR performance unknown for textual data
• Ultimate goal: identify the DR techniques that best facilitate text mining.

Experimental Method
• Evaluate 5 DR methods (will these more complex techniques outperform the baseline? see the sketch after this list)
  – Linear
    1) PCA (Principal Components Analysis)
    2) MDS (Multidimensional Scaling)
  – Non-linear
    3) Isomap
    4) LLE (Locally Linear Embedding)
    5) LDM (Lafon's Diffusion Maps)
  – Baseline: None-Sort – original features sorted by average weight
• Evaluate 3 classifiers
  1) Nearest Neighbor
  2) Linear
  3) Quadratic
• Evaluate 3 data sets
  1) Science News
  2) Google News
  3) Science & Technology
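As a hedged illustration of the methods under comparison: scikit-learn provides PCA, MDS, Isomap, and LLE, while Lafon's Diffusion Maps has no scikit-learn implementation and is omitted here. The matrix `X` is a random stand-in for a real term-document matrix, and the neighbor counts are assumptions, not the talk's settings. The None-Sort baseline is coded as the slides describe it: keep the original features with the largest average weight.

```python
# Sketch of the five-way DR comparison on stand-in data. LDM (Lafon's
# Diffusion Maps) is omitted; it would need pydiffmap or a hand-rolled
# diffusion kernel.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.random((100, 50))   # stand-in: 100 documents x 50 terms
k = 10                      # target number of dimensions

reducers = {
    "PCA":    PCA(n_components=k),
    "MDS":    MDS(n_components=k),
    "Isomap": Isomap(n_components=k, n_neighbors=12),   # neighbor counts assumed
    "LLE":    LocallyLinearEmbedding(n_components=k, n_neighbors=12),
}
embeddings = {name: r.fit_transform(X) for name, r in reducers.items()}

# "None-Sort" baseline as read from the slides: keep the k original
# features with the largest average weight, in sorted order.
top_k = np.argsort(X.mean(axis=0))[::-1][:k]
embeddings["None-Sort"] = X[:, top_k]

for name, emb in embeddings.items():
    print(f"{name}: reduced to {emb.shape[1]} dimensions")
```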
Classification
• Labeling documents with known categories (e.g., Physics, Biology, Chemistry) based on training data
• Pipeline: Input Docs → Encode → Dimension Reduction → Classify (standard classifier)
• Assessment: accuracy of the category assignments

k-Nearest Neighbor Classifier
• Assign a category based on the k nearest neighbors
  – The most frequent category among them is assigned
• k = 9 used for the following graphs
  – Trends are similar for other values

[Charts: kNN classifier accuracy on Science News, Google News, and Science & Technology]

Linear Classifier
• Assign a category based on a linear combination of features
• Assumes features are normally distributed
  – Results for the quadratic classifier, which does not make this assumption, were comparable
• A sketch of the classify-and-score loop for all three classifiers appears after the backup slides below.

[Charts: linear classifier accuracy on Science News, Google News, and Science & Technology]

Classification Results
• Applying DR improves accuracy over not applying DR for a fixed number of dimensions
• The best DR techniques achieve high accuracy in few dimensions
• MDS and Isomap yield the most consistent and reliable results
  – Their advantage is more pronounced on difficult corpora
  – This contradicts van der Maaten et al. 2007, which found PCA best but evaluated only one textual data set
  – PCA is good, but not the best: it suffers on the harder data sets

Future Work
• More precisely characterize the MDS and Isomap advantage
• Investigate other classification methods
• Evaluate data sets with different kinds of information

Acknowledgements
• Trident Scholar Research Program
• Office of Naval Research

Backup Slides

2-Dimensional Visualizations
• Reduction to just 2 dimensions
• Easy visualization: graph on a Cartesian plot
  – Each point is colored according to its category
• Assess quality of separation with the best 2 dimensions
  – Highlight areas of confusion

[Charts: 2D visualizations of Science News (2-category), Science News (8-category), Google News, and Science & Technology]

[Charts: kNN classifier accuracy on Science News (2-category) and Science News (4-OL)]
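As referenced above, here is a minimal sketch of the classification experiment: encode, reduce, then score the three classifiers. Only k = 9 for kNN comes from the slides; the stand-in data, the choice of PCA as the reducer, and the train/test split are assumptions.

```python
# Hedged sketch of the evaluation loop: encode -> reduce -> classify -> accuracy.
# kNN's k = 9 matches the slides; everything else here is a stand-in.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.random((300, 50))            # stand-in term-document matrix
y = rng.integers(0, 3, size=300)     # stand-in category labels

# Any of the 5 DR methods (or the None-Sort baseline) slots in here.
X_red = PCA(n_components=10).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_red, y, random_state=0)

for name, clf in [("kNN (k=9)", KNeighborsClassifier(n_neighbors=9)),
                  ("Linear",    LinearDiscriminantAnalysis()),
                  ("Quadratic", QuadraticDiscriminantAnalysis())]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```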
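For the 2-dimensional visualizations in the backup slides, a sketch of the plotting step, assuming MDS as the reducer and random stand-in data:

```python
# Hedged sketch of the backup-slide visualizations: reduce to 2 dimensions
# and plot each document colored by its category.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.default_rng(2)
X = rng.random((120, 40))            # stand-in term-document matrix
y = rng.integers(0, 4, size=120)     # stand-in categories

xy = MDS(n_components=2).fit_transform(X)
for cat in np.unique(y):
    mask = y == cat
    plt.scatter(xy[mask, 0], xy[mask, 1], label=f"category {cat}")
plt.legend()
plt.title("2-D embedding colored by category")
plt.show()  # overlapping colors highlight the areas of confusion
```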