David G. Underhill
Luke K. McDowell
Computer Science Department, United States Naval Academy
David J. Marchette
Jeffrey L. Solka
Naval Surface Warfare Center, Dahlgren Division
7 November 2015
How to make sense of an overwhelming amount of data?
Can “dimensionality reduction” help?
Outline
• Problem Statement
• Background
– Text Mining Process
– Dimensionality Reduction
• Experimental Analysis
– Classification
• Contributions and Conclusions
• Future Work
Text Mining Overview
[Diagram: input documents are Encoded into a term-document matrix, documents are Compared to produce a distance matrix, and the distance matrix is Analyzed]
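To make the Encode and Compare steps concrete, here is a minimal sketch in Python with scikit-learn (an implementation choice for this transcript; the slides name no tools), using a few hypothetical documents:

    # Sketch of the Encode and Compare steps: documents -> term weights -> distances.
    # scikit-learn is an assumption; the slides do not name an implementation.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import pairwise_distances

    docs = [
        "protons and neutrons form the atomic nucleus",
        "enzymes catalyze reactions inside living cells",
        "a catalyst lowers the activation energy of a reaction",
    ]

    # Encode: scikit-learn builds a document-term matrix; the slide's
    # term-document matrix is simply its transpose.
    X = TfidfVectorizer().fit_transform(docs)

    # Compare: pairwise cosine distances between document vectors.
    D = pairwise_distances(X, metric="cosine")
    print(D.round(2))  # 3x3 symmetric distance matrix, ready for analysis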
Text Mining Overview
[Diagram: the same pipeline with a Dimensionality Reduction step added after encoding]
Dimensionality Reduction (DR)
• Goal: simplify a complex data set in a way that preserves
meanings inherent in the original data
– Usually applied to geometric or numerical data
• How can DR improve text mining?
– May reveal patterns obscured in the original data
– Reduces analysis time compared to the original, larger data
– Greatly decreases storage and transmission costs
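A minimal sketch of the reduction itself, assuming scikit-learn (TruncatedSVD is the usual PCA-style reduction for sparse term matrices):

    # Sketch: reduce a high-dimensional document-term matrix to k dimensions.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "galaxies rotate faster than visible matter predicts",
        "dark matter may explain galactic rotation curves",
        "the ribosome translates messenger RNA into protein",
        "proteins fold into shapes that determine their function",
    ]

    X = TfidfVectorizer().fit_transform(docs)          # one dimension per term
    Y = TruncatedSVD(n_components=2).fit_transform(X)  # k = 2 dimensions per doc
    print(X.shape, "->", Y.shape)                      # e.g. (4, 24) -> (4, 2)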
Outline
• Problem Statement
• Background
• Experimental Analysis
– Experimental Question and Method
– Task 1: Classification
• Nearest Neighbor Classifier
• Linear Classifier
• Quadratic Classifier
• Contributions and Conclusions
• Future Work
Experimental Question
• Can DR improve text mining performance?
– Many valid DR approaches
– Relative DR performance unknown for textual data
Ultimate Goal
Identify DR techniques that best facilitate text mining.
Experimental Method
• Evaluate 5 DR methods
– Linear
1) PCA (Principal Components Analysis)
2) MDS (Multidimensional Scaling)
– Non-Linear
3) Isomap
4) LLE (Locally Linear Embedding)
5) LDM (Lafon’s Diffusion Maps)
– Baseline
• None-Sort – original features sorted by average weight
(Will these more complex techniques perform better?)
• Evaluate 3 classifiers
1) Nearest Neighbor
2) Linear
3) Quadratic
• Evaluate 3 data sets
1) Science News
2) Google News
3) Science & Technology
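For reference, the first four DR methods above have scikit-learn counterparts; this mapping is an assumption of the transcript, not the authors' stated implementation. Lafon's Diffusion Maps has no built-in scikit-learn class (third-party packages such as pydiffmap provide one), and the None-Sort baseline is a few lines of NumPy:

    # The DR methods under comparison, expressed via scikit-learn equivalents
    # (an assumption; LDM is omitted because scikit-learn has no diffusion-map
    # class). X is a dense (n_docs, n_terms) matrix.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding

    k = 10  # target number of dimensions
    reducers = {
        "PCA": PCA(n_components=k),
        "MDS": MDS(n_components=k),
        "Isomap": Isomap(n_components=k),
        "LLE": LocallyLinearEmbedding(n_components=k),
    }

    def none_sort(X, k):
        """Baseline 'None-Sort': keep the k original features with the largest
        average weight, applying no transformation at all."""
        top = np.argsort(X.mean(axis=0))[::-1][:k]
        return X[:, top]

    # Usage: Y = reducers["Isomap"].fit_transform(X)  or  Y = none_sort(X, k)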
Outline
• Problem Statement
• Background
• Experimental Analysis
– Experimental Question
– Classification
• Nearest Neighbor Classifier
• Linear Classifier
• Quadratic Classifier
• Contributions and Conclusions
• Future Work
Classification
Labeling documents with known categories based on training data
[Diagram: input docs plus labeled training data (Physics, Biology, Chemistry) are Encoded, passed through Dimensionality Reduction, and Classified with a standard classifier]
Assessment: accuracy of category assignments
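A runnable sketch of this pipeline, with toy documents and illustrative parameter choices standing in for the actual corpora and settings used in the experiments:

    # Sketch of the classification pipeline: encode -> reduce -> classify,
    # scored by accuracy. All data and parameter choices are illustrative only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    docs = [
        "quarks bind into protons",        "the electron has negative charge",
        "photons carry electromagnetism",  "neutrinos rarely interact",
        "cells divide by mitosis",         "DNA encodes genetic information",
        "enzymes speed up metabolism",     "proteins are chains of amino acids",
    ]
    labels = ["physics"] * 4 + ["biology"] * 4

    pipeline = make_pipeline(
        TfidfVectorizer(),                    # Encode
        TruncatedSVD(n_components=2),         # Dimensionality Reduction
        KNeighborsClassifier(n_neighbors=1),  # Classify
    )

    docs_tr, docs_te, y_tr, y_te = train_test_split(
        docs, labels, test_size=0.25, stratify=labels, random_state=0)
    pipeline.fit(docs_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, pipeline.predict(docs_te)))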
k-Nearest Neighbor Classifier
• Assign the most frequent category among the k nearest neighbors
• k = 9 used for the following graphs
– Trends are similar for other values
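With scikit-learn (again an assumption about tooling), the classifier on this slide is a one-liner and can replace the classifier stage of the pipeline sketched earlier:

    # 9-nearest-neighbor classifier: each test document is assigned the most
    # frequent category among its 9 nearest training documents.
    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=9)  # k = 9, as in the graphs below
    # knn.fit(X_train_reduced, y_train); knn.predict(X_test_reduced)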
kNN Classifier on Science News
kNN Classifier on Google News
kNN Classifier on Science & Technology
Linear Classifier
• Assign category based on a linear combination of features
• Assumes features are normally distributed
– Results for the quadratic classifier, which doesn’t make this assumption, were comparable
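If these are the standard discriminant classifiers (an assumption; the slides do not name an implementation), scikit-learn provides both:

    # Linear and quadratic discriminant classifiers. The linear version models
    # features as Gaussians with one covariance shared across categories; the
    # quadratic version fits a separate covariance per category.
    from sklearn.discriminant_analysis import (
        LinearDiscriminantAnalysis,
        QuadraticDiscriminantAnalysis,
    )

    lda = LinearDiscriminantAnalysis()
    qda = QuadraticDiscriminantAnalysis()
    # Either can stand in for the classifier stage of the earlier pipeline sketch.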
Linear Classifier on Science News
Linear Classifier on Google News
Linear Classifier on Science & Technology
Outline
• Problem Statement
• Background
• Experimental Analysis
• Contributions and Conclusions
• Future Work
Classification Results
• For a fixed number of dimensions, applying DR yields higher accuracy than not applying it
• The best DR techniques achieve high accuracy in few dimensions
• MDS & Isomap yield the most consistent and reliable results
– This advantage is more pronounced on difficult corpora
– Contradicts van der Maaten et al. 2007, which found PCA best but evaluated only one textual data set
– PCA is good but not the best: it suffers on harder data sets
Outline
• Problem Statement
• Background
• Experimental Analysis
• Contributions and Conclusions
• Future Work
Future Work
• More precisely characterize the MDS and Isomap advantage
• Investigate other classification methods
• Evaluate data sets with different kinds of information
Acknowledgements
• Trident Scholar Research Program
• Office of Naval Research
David G. Underhill
Luke K. McDowell
Computer Science Department, United States Naval Academy
David J. Marchette
Jeffrey L. Solka
Naval Surface Warfare Center, Dahlgren Division
2-Dimensional Visualizations
• Reduction to just 2 dimensions
• Easy visualization: plot on Cartesian axes
– Each point is colored according to its category
• Assess quality of separation with best 2 dimensions
– Highlight areas of confusion
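A minimal sketch of such a plot, assuming matplotlib; Y and labels are hypothetical placeholders for a (n_docs, 2) reduced matrix and the per-document category ids:

    # Scatter plot of a 2-D reduction, one color per category, to eyeball
    # how well the best two dimensions separate the categories.
    import matplotlib.pyplot as plt

    def plot_2d(Y, labels):
        """Y: (n_docs, 2) reduced coordinates; labels: integer category ids."""
        sc = plt.scatter(Y[:, 0], Y[:, 1], c=labels, cmap="tab10", s=15)
        plt.legend(*sc.legend_elements(), title="category")
        plt.xlabel("dimension 1")
        plt.ylabel("dimension 2")
        plt.show()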
2D Visualization of Science News (2-cat)
2D Visualization of Science News (8-cat)
2D Visualization of Google News
2D Visualization of Science & Technology
kNN Classifier on Science News (2-category)
kNN Classifier on Science News (4-OL)