
Ph.D. Thesis Proposal
Fast Nonparametric Machine Learning
Algorithms for High-dimensional
Massive Data and Applications
Ting Liu
Carnegie Mellon University
February 2005
Thesis Committee
• Andrew Moore (Chair)
• Martial Hebert
• Jeff Schneider
• Trevor Darrell (MIT)
Thesis Proposal
Goal: to make nonparametric methods tractable for high-dimensional, massive datasets
• Nonparametric methods (the subject of my thesis):
– K-nearest-neighbor (K-NN)
– Kernel density estimation
– SVM evaluation phase
– and more …
Why K-NN?
• It is simple
– goes back to as early as [Fix-Hodges 1951]
– [Cover-Hart 1967] justifies k-NN theoretically
• It is easy to implement
– sanity check for other (more complicated) algorithms
– offers similar insights for other nonparametric algorithms
• It is useful: many applications, often on high-dimensional, massive data, in
– text categorization
– drug activity detection
– multimedia, computer vision
– and more…
Application: Video Segmentation
Task: Shot transition detection
• Cut
• Gradual transition (fades, dissolves …)
Technically [Qi-Hauptmann-Liu 2003]
Pipeline: video frames → color histograms → pair-wise similarity features → classification (normal: 0, cut: 1, gradual: 2)
Data: 4 hours of MPEG-1 video (420,970 frames)
K-NN gives good performance but is very slow → we want a fast k-NN classification method.
Application: Near-duplicate
Detection and Sub-image Retrieval
Copyrighted Image Database
Algorithm Overview [Yan-Rahul 2004]
Train: each copyrighted image → DoG + PCA-SIFT transformation → 1000 patches per image, each patch 36-dim → store (12,100,000 patches from 12,100 copyrighted images)
Query: each query image → the same transformation → 1000 patches → 1000 k-NN searches per query
We want a fast k-NN search method.
K-NN Methods
• K-NN search
– Exact K-NN search
  • Naïve linear scan (slow)
  • Spatial trees: SR-tree, Kd-tree, Metric-tree
– Approximate K-NN search
  • Random sample, PCA, LSH
  • Spill-tree (my work)
• K-NN classification (my work)
– KNS2 (2-class)
– KNS3 (2-class)
– IOC (multi-class)
Problems with Exact K-NN Search: Efficiency
• Slow on huge datasets in high dimensions
• Complexity per query:
– Naïve (linear scan): O(dN)
– Advanced: O(d log N) to O(dN)
(spatial data structures avoid searching all points)
• SR-tree [Katayama-Satoh 1997]
• Kd-tree [Friedman-Bentley-Finkel 1977]
• Metric-tree (ball-tree) [Uhlmann 1991, Omohundro 1991]
Metric-tree: an Example
[Figure: a set of points in R²]
Build a metric-tree
[Uhlmann 1991, Omohundro 1991]
[Figure: the point set is split by pivots p1 and p2 with decision boundary L]
Metric-tree Data Structure
[Uhlmann 1991, Omohundro 1991]
[Figure: the resulting metric-tree and its internal data structure, with pivots p1 and p2]
Metric-tree: the Triangle Inequality
• Let q be any query point
• Let x be a point inside ball B, with center c and radius r
• Then ||q − x|| ≥ ||q − c|| − r and ||q − x|| ≤ ||q − c|| + r
Metric-tree Based K-NN Search
• Depth first search
• Pruning using the triangle inequality
• Significant speed-up when d is small: O(d log N) per query
• Little speed-up when d is large: O(dN) per query
• The "curse of dimensionality"
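To make the depth-first search concrete, here is a minimal Python sketch of 1-NN search on a metric-tree. The Node layout (center, radius, points only at leaves) and the recursion are illustrative assumptions, not the talk's exact implementation:

    import numpy as np

    class Node:
        def __init__(self, center, radius, points=None, left=None, right=None):
            self.center, self.radius = center, radius
            self.points = points              # only leaves carry points
            self.left, self.right = left, right

    def nn_search(node, q, best_d=np.inf, best_x=None):
        # Triangle inequality: every x in this ball satisfies
        # ||q - x|| >= ||q - center|| - radius, so the whole ball can be
        # pruned when that lower bound already exceeds the best so far.
        if np.linalg.norm(q - node.center) - node.radius >= best_d:
            return best_d, best_x
        if node.points is not None:           # leaf: linear scan
            for x in node.points:
                d = np.linalg.norm(q - x)
                if d < best_d:
                    best_d, best_x = d, x
            return best_d, best_x
        # descend into the closer child first, then backtrack into the other
        children = sorted([node.left, node.right],
                          key=lambda c: np.linalg.norm(q - c.center))
        for c in children:
            best_d, best_x = nn_search(c, q, best_d, best_x)
        return best_d, best_x

In low dimensions the pruning test discards most balls; in high dimensions the bound is rarely tight enough, which is the curse of dimensionality above.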
My Work (part 1): Fast K-NN Classification
Based on Metric-tree
Idea: do classification without finding the k-NNs
• KNS2: fast k-NN classification for skewed 2-class problems
• KNS3: fast k-NN classification for 2-class problems
• IOC: fast k-NN classification for multi-class problems
KNS2: Fast K-NN Classification for
Skewed 2-class
Assumptions:
(1) 2 classes: pos. / neg.
(2) pos. class much less frequent than neg. class
Example: video segmentation
(~10,000 shot transitions, ~400,000 normal frames)
Q: How many of the k-NN are from pos. class?
How Many of the K-NN are From pos. Class?
• Step 1 --- Find positive
– find the k closest pos. points to q; di = distance of the i'th closest pos. point to q (example: k = 3, giving d1, d2, d3)
– fewer pos. points → easy to compute
How Many of the K-NN are From pos. Class?
• Step 2 --- Count negative
– ci = number of neg. points within distance di of q
– example (k = 3): c1 = 1, c2 = 5, c3 = 8
How Many of the K-NN are From pos. Class?
• Step 2 --- Lower-bound negative
– ci = number of neg. points within di
– we only need to test thresholds: c1 ≥ 3? c2 ≥ 2? c3 ≥ 1?
– Idea: lower-bound each ci instead of computing it exactly
– why these thresholds: if ci ≥ k − i + 1, there are at least k points no farther than di (i − 1 closer pos. points plus ci neg. points), so the i'th pos. point cannot be among the k-NN
How Many of the K-NN are From pos. Class?
• Step 2 --- Estimate negative (walkthrough, k = 3; targets: c1 ≥ 3, c2 ≥ 2, c3 ≥ 1)
– node A (20 neg. points): the triangle inequality cannot place any of its points within d3 of q → c1 ≥ 0, c2 ≥ 0, c3 ≥ 0
– node B (12 neg. points) lies entirely within d3 of q; node C (8 points) gives nothing → c1 ≥ 0, c2 ≥ 0, c3 ≥ 12
– node D (5 neg. points) lies within d2; node E (7 points) gives nothing → c1 ≥ 0, c2 ≥ 5, c3 ≥ 12
– node F (4 neg. points) lies within d1 → c1 ≥ 4, c2 ≥ 5, c3 ≥ 12
– all thresholds are met (c1 ≥ 3 = k): no pos. point is among the k-NN. We are done! Return 0.
KNS2: the Algorithm
Build two metric-trees (Pos_tree / Neg_tree)
Search Pos_tree to find the k nearest pos. points (distances d1, …, dk)
Search Neg_tree:
repeat
  pick a node from Neg_tree
  refine the lower bounds C = {c1, c2, …, ck}
  if ci ≥ k − i + 1, remove ci from C
until C is empty or Neg_tree is exhausted
Let k' = size(C) after the search
Return k'
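As a sanity check on the counting logic, here is a toy Python sketch that answers the same question with exact brute-force counts standing in for KNS2's metric-tree lower bounds; the function name and layout are mine:

    import numpy as np

    def knn_positives(q, P, N, k):
        """How many of q's k-NN come from the positive set P?"""
        dp = np.sort(np.linalg.norm(P - q, axis=1))[:k]   # Step 1: d1..dk
        dn = np.linalg.norm(N - q, axis=1)
        answer = k
        for i, di in enumerate(dp, start=1):
            ci = int((dn <= di).sum())   # Step 2: KNS2 lower-bounds this
            if ci >= k - i + 1:
                # i-1 closer positives + ci negatives already fill the k-NN,
                # so the i'th positive (and all later ones) are pushed out
                answer = min(answer, i - 1)
        return answer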
Experimental Results (KNS2)

Dataset     Dimension (d)  Data Size (N)
ds1         10             26,733
Letter      16             20,000
Video       45             420,970
J_Lee       100            181,395
Blanc_Mel   100            186,414
ds2         1.1×10^6       88,358
CPU Time Speedup Over Naïve K-NN (k = 9)
[Bar chart: speedup (0–70) of metric-tree and KNS2 over naïve k-NN on Ds1 (d=10), Letter (d=16), video (d=45), J_Lee (d=100), Blanc_Mel (d=100), ds2 (d=1.1M)]
KNS2: 3x – 60x speed-up over naïve
My Work (Part 2): A New Metric-tree Based Approximate NN Search
--- "I'm Feeling Lucky" search
--- spill-tree
Why is Metric-tree Slow?
[Figure: depth-first search with backtracking; pivots p1, p2 and query q]
Empirically…
• it takes 10% of the time finding the NN
• it takes 90% of the time backtracking
“I’m Feeling Lucky” Search
• Algorithm: simple
– descend the metric-tree without backtracking
– return the first point found in a leaf node
• Complexity: super fast
– O(log N) per query
• Accuracy: quite low
– liable to make mistakes when q is near a splitting boundary
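Reusing the Node layout from the earlier metric-tree sketch, the no-backtracking descent might look like this (a sketch; it returns the best point in the reached leaf rather than literally the first one):

    import numpy as np

    def lucky_search(node, q):
        while node.points is None:                 # internal node
            # follow the child whose ball center is closer to q
            if (np.linalg.norm(q - node.left.center)
                    <= np.linalg.norm(q - node.right.center)):
                node = node.left
            else:
                node = node.right
        dists = np.linalg.norm(node.points - q, axis=1)
        return node.points[dists.argmin()]         # best point in that leaf

One root-to-leaf path means O(log N) distance computations per query, but a wrong turn near a splitting boundary is never corrected.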
Spill-tree:
– adding redundancy to help
“I’m-Feeling-Lucky” search
Spill-tree
• A variant of metric-tree
• The children of a node can "spill over" onto each other and contain shared data points
A Spill-tree Data Structure
[Figure: pivots p1, p2; decision boundary L; planes LL and LR bounding the overlapping buffer]
• Metric-tree: each child owns only the points on its own side of L
• Spill-tree: both children own the points between LL and LR (the overlapping buffer)
Advantage of the spill-tree:
• higher accuracy
• makes a mistake only when the true NN is far away (beyond the overlapping buffer)
Problem with the spill-tree: uncontrolled depth
• depth is O(log N) when the overlapping buffer is narrow, but can grow without bound as the buffer widens
• empirically, the buffer size is set relative to the expected distance of a point to its NN
Hybrid Spill-tree Search
• Overlapping node --- "I'm Feeling Lucky" search
• Non-overlapping node --- backtracking search
• Balance threshold ρ = 70% (empirically): if either child of a node v contains more than ρ of the total points, split v in the conventional (non-overlapping) way
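A sketch of the construction under this rule, with axis-aligned splits standing in for the talk's pivot-based ones; LEAF_SIZE, the buffer half-width tau, and the dict-based nodes are illustrative assumptions:

    import numpy as np

    LEAF_SIZE, RHO = 20, 0.70          # rho = 70% balance threshold

    def build(points, tau):
        if len(points) <= LEAF_SIZE:
            return {"points": points}
        axis = points.var(axis=0).argmax()    # direction of largest spread
        proj = points[:, axis]
        mid = np.median(proj)
        left = points[proj < mid + tau]       # children share the buffer
        right = points[proj >= mid - tau]
        overlapping = True
        if max(len(left), len(right)) > RHO * len(points):
            # buffer too crowded: split the conventional, non-overlapping
            # way and mark the node for backtracking search
            left, right = points[proj < mid], points[proj >= mid]
            overlapping = False
        if min(len(left), len(right)) == 0:   # degenerate split: stop here
            return {"points": points}
        return {"overlapping": overlapping,
                "left": build(left, tau), "right": build(right, tau)}

At query time, overlapping nodes are descended without backtracking and non-overlapping ones are searched with backtracking, as above.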
Further Efficiency Improvement by Random Projection
Intuition: random projection approximately preserves distances.
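A minimal sketch of this step: project the data once with a Gaussian matrix before building the tree; pairwise distances are approximately preserved with high probability (the dimensions below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 1024))      # N points in d = 1024
    d_prime = 64                               # reduced dimension d'
    R = rng.standard_normal((1024, d_prime)) / np.sqrt(d_prime)
    X_low = X @ R                              # N points in d' = 64

    # spot-check: the two distances should be close
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_low[0] - X_low[1]))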
Experiments for Spill-tree

Dataset     Num. Data (N)  Num. Dim (d)
Aerial      275,465        60
Corel_hist  20,000         64
Corel_uci   68,040         64
Disk        40,000         1024
Galaxy      40,000         3838
Comparison Methods
• Naïve k-NN
• Metric-tree
• Locality Sensitive Hashing (LSH)
• Spill-tree
Spill-tree vs. Metric-tree
[Table: CPU time (s) speed-up of spill-tree over metric-tree]
Spill-tree enjoys a 3.3x – 706x speed-up over the metric-tree
Spill-tree vs. LSH
[Table: CPU time (s) of spill-tree and its speedup (in parentheses) over LSH]
Spill-tree enjoys a 2.5x – 31x speed-up over LSH
My Contribution
• T. Liu, A. W. Moore, A. Gray. Efficient Exact k-NN and Nonparametric Classification in High Dimensions. NIPS 2003.
• Y. Qi, A. Hauptmann, T. Liu. Supervised Classification for Video Shot Segmentation. ICME 2003.
• T. Liu, K. Yang, A. W. Moore. The IOC Algorithm: Efficient Many-Class Non-parametric Classification for High-Dimensional Data. KDD 2004.
• T. Liu, A. W. Moore, A. Gray, K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. NIPS 2004.
Related Work
• [Uhlmann 1991, Omohundro 1991] propose the metric-tree (ball-tree)
• [Omachi-Aso 1997] a similar idea to KNS2 for NN classification
• [Gionis-Indyk-Motwani 1999] LSH, a practical approximate NN method
• [Arya-Fu 2003] expected-case complexity of approximate NN searching
• [Yan-Rahul 2004] near-duplicate detection and sub-image retrieval
• [Indyk 1998] approximate NN under the L∞ norm
Future Work
• Improve my previous work
– Self-tuning spill-tree
– Theoretical analysis of spill-tree
• Explore a new related area
– Dual-tree search
• Real-world applications
Future Work (1): Self-Tuning Spill-tree
• Two key parameters of the spill-tree
– random-projection dimension d'
– overlapping buffer size
Benefits of Automatic Parameter Tuning
• Avoid tedious hand-tuning
• Gain more insight into approximate NN search
Future Work (2): Theoretical Analysis
• Spill-tree + "I'm Feeling Lucky" search
– good performance in practice
– no theoretical guarantee yet
Idea: when the number of points is large enough, "I'm Feeling Lucky" search finds the true NN w.h.p.
Idea: with an overlapping buffer, the probability of successfully finding the true NN increases.
Future Work (3): Dual-Tree Search
• N-body problems [Gray-Moore 2001]
– NN classification
– kernel density estimation
– outlier detection
– two-point correlation
• These require pair-wise comparisons of all N points
– naïve solution: O(N²)
– advanced solutions based on metric-trees:
  • single-tree: build a tree on the training data only
  • dual-tree: build trees on both the training and the query data
Metric-tree: the Triangle Inequality
• Let q be a point inside query node Q (center OQ, radius rQ)
• Let x be a point inside training node B (center OB, radius rB)
• Then ||q − x|| ≥ ||OQ − OB|| − rQ − rB
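The node-to-node bound chains two triangle inequalities:

    \|q - x\| \ge \|O_Q - O_B\| - \|q - O_Q\| - \|x - O_B\|
              \ge \|O_Q - O_B\| - r_Q - r_B.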
Pruning Opportunity [Gray-Moore 2001]
• A, B: nodes from the training set (centers OA, OB); Q: node from the test set (center OQ)
• Prune A when Dmin(Q, A) > Dmax(Q, B), i.e., when every point of A is provably farther from every query in Q than the farthest point of B
• [Figure: A can't be pruned in this case]
• But this is too pessimistic!
More Pruning Opportunity
• Hyperbola H determined by OA, OB, and rA + rB: the points x with ||x − OA|| − ||x − OB|| = rA + rB
• Prune A when the query node Q lies entirely on B's side of H: then every q in Q is closer to all of B than to any point of A
• [Figure: A can be pruned in this case]
• Challenge: to compute this efficiently
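For a single query point q the test behind this prune is cheap; extending it efficiently to a whole query node Q is the open challenge. A sketch (names are mine):

    import numpy as np

    def prune_A_for_query(q, O_A, r_A, O_B, r_B):
        # If ||q - O_A|| - ||q - O_B|| > r_A + r_B, then
        #   min dist(q, A) >= ||q - O_A|| - r_A
        #                  >  ||q - O_B|| + r_B >= max dist(q, B),
        # so every point of B beats every point of A and A is prunable.
        return (np.linalg.norm(q - O_A)
                - np.linalg.norm(q - O_B)) > r_A + r_B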
Future Work (4): Applications
• Multimedia --- video segmentation
– shot-based segmentation
– story-based segmentation
• Image retrieval --- near-duplicate detection
• Computer vision --- object recognition
Time Line
• Now – Apr., 2005
– Dual-tree (design and implementation)
– Testing on real-world datasets
• May – Aug., 2005
– Improving spill-tree algorithm
– Theoretical analysis
• Sept. – Dec., 2005
– Applications of new k-NN algorithm
• Jan. – Mar., 2006
– Write up final thesis
Thank you!
QUESTIONS?